Predictive Modeling

Gene expression-based prediction modeling

The MAQC-II Project: A Comprehensive Study of Common Practices for the Development and Validation of Microarray-Based Predictive Models

Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, more than 30,000 models were built using many combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar performance. The conclusions and recommendations from MAQC-II are useful for regulatory agencies, study committees and independent investigators that evaluate methods for global gene expression analysis.


Figure: Experimental design and timeline of the MAQC-II project. Numbers (1-10) order the steps of analysis. Step 10 indicates when the original training and validation data sets were swapped to repeat steps 4-9. Every effort was made to ensure the complete independence of the validation data sets from the training sets. Each model is characterized by several modeling factors and seven internal and external validation performance metrics. The modeling factors include: (i) organization; (ii) data set; (iii) endpoint; (iv) summary and normalization method; (v) feature selection method; (vi) number of features used; (vii) classification algorithm; (viii) batch-effect removal method; (ix) type of internal validation; and (x) number of iterations of internal validation. The seven performance metrics for internal validation and external validation are: (i) Matthews correlation coefficient (MCC); (ii) accuracy; (iii) sensitivity; (iv) specificity; (v) area under the receiver operating characteristic curve (AUC); (vi) mean of sensitivity and specificity; and (vii) root mean squared error (r.m.s.e.). Standard deviations (s.d.) of the metrics are also provided for internal validation results.
Shi L and The Microarray Quality Control Consortium (MAQC-II) (Wang MD, Phan JH, Stokes TH, Parry RM, and Moffitt RA are contributing authors). “The MAQC-II Project: A comprehensive study of common practices for the development and validation of microarray-based predictive models.” Nat Biotechnol. 2010 Aug; 28(8): 827-38.
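
The performance metrics named in the figure caption above can be illustrated with a minimal Python sketch. It is not the MAQC-II analysis code. The function names are hypothetical, labels are assumed to be coded 0/1, and the r.m.s.e. is assumed to compare predicted scores in [0, 1] against those labels.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Metrics computed from confusion-matrix counts of a two-class problem."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "mcc": mcc,
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mean_sens_spec": (sensitivity + specificity) / 2,
    }

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rmse(scores, labels):
    """Root mean squared error between predicted scores and 0/1 labels (assumed definition)."""
    return math.sqrt(sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels))
```

MCC is useful alongside accuracy because it uses all four confusion-matrix cells, so it is less easily inflated when one class dominates the data set.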


K-Nearest Neighbor Models for Microarray Gene Expression Analysis and Clinical Outcome Prediction

In the clinical application of genomic data analysis and modeling, a number of factors contribute to the performance of disease classification and clinical outcome prediction. This study focused on the k-nearest neighbor (KNN) modeling strategy and its clinical use. Although KNN is simple and clinically appealing, large performance variations were found among experienced data analysis teams in the MicroArray Quality Control Phase II (MAQC-II) project. For clinical end points and controls from breast cancer, neuroblastoma and multiple myeloma, we systematically generated 463,320 KNN models by varying feature ranking method, number of features, distance metric, number of neighbors, vote weighting and decision threshold. We identified factors that contribute to the MAQC-II project performance variation, and validated a KNN data analysis protocol using a newly generated clinical data set with 478 neuroblastoma patients. We interpreted the biological and practical significance of the derived KNN models, and compared their performance with existing clinical factors.


Figure: Neuroblastoma case study showing clinical applications of the KNN classifier. We designed a method to test whether KNN produces classifiers of good clinical relevance. First, we developed our approach using MAQC-II gene expression data. Then, we applied this approach to additional neuroblastoma data and compared it to existing clinical risk factors.
Parry RM*, Jones W*, Stokes TH* (equal contributing authors), Phan JH, Moffitt RA, Fang H, Shi L, Oberthuer A, Fischer M, Tong W, and Wang MD. “K-nearest neighbors (KNN) models for microarray gene-expression analysis and clinical outcome prediction.” Pharmacogenomics J. 2010 Aug; 10(4): 292-309. Nat Biotechnol Supplement. 2010 Oct; S62-S79.
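
The six factors varied in the abstract above (feature ranking method, number of features, distance metric, number of neighbors, vote weighting and decision threshold) can be made concrete with a small pure-Python sketch. It is not the published protocol. All function and parameter names are hypothetical, and the ranking by absolute difference of class means is a simple stand-in for whichever ranking methods the study used.

```python
import math

def rank_features(X, y, n_features):
    """Rank features by absolute difference of class means (assumed stand-in ranking)."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    def mean(rows, j):
        return sum(r[j] for r in rows) / len(rows)
    scores = [abs(mean(pos, j) - mean(neg, j)) for j in range(len(X[0]))]
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:n_features]

def distance(a, b, metric):
    """Distance between two feature vectors under the chosen metric."""
    if metric == "euclidean":
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    if metric == "manhattan":
        return sum(abs(u - v) for u, v in zip(a, b))
    raise ValueError(f"unknown metric: {metric}")

def knn_predict(X_train, y_train, x, features, k=3, metric="euclidean",
                weighted=False, threshold=0.5):
    """Predict class 1 if the positive share of the neighbor vote reaches the threshold."""
    def project(row):
        return [row[j] for j in features]
    query = project(x)
    neighbors = sorted(
        (distance(project(row), query, metric), label)
        for row, label in zip(X_train, y_train)
    )[:k]
    # Vote weighting: uniform, or inverse distance (small constant avoids division by zero).
    votes = [(1.0 / (d + 1e-9) if weighted else 1.0, label) for d, label in neighbors]
    positive = sum(w for w, label in votes if label == 1)
    total = sum(w for w, _ in votes)
    return 1 if positive / total >= threshold else 0

# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 4.0], [0.9, 6.0], [3.0, 5.5], [3.2, 4.5], [2.9, 5.0]]
y = [0, 0, 0, 1, 1, 1]
selected = rank_features(X, y, n_features=1)
prediction = knn_predict(X, y, [3.1, 5.0], selected, k=3)
```

Raising the decision threshold trades sensitivity for specificity: with `k=5` and `threshold=0.7`, the same query is assigned class 0 because only three of its five neighbors are positive.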


omniClassifier: A Desktop Grid Computing System for Big Data Prediction Modeling

Robust prediction models are important for numerous science, engineering, and biomedical applications. However, best-practice procedures for optimizing prediction models can be computationally complex, especially when choosing models from among hundreds or thousands of parameter choices. Computational complexity has further increased with the growth of data in these fields, concurrent with the era of “Big Data”. Grid computing is a potential solution to the computational challenges of Big Data. Desktop grid computing, which uses idle CPU cycles of commodity desktop machines, coupled with commercial cloud computing resources, can enable research labs to gain easier and more cost-effective access to vast computing resources. We developed omniClassifier, a multi-purpose prediction modeling application that provides researchers with a tool for conducting machine learning research within the guidelines of recommended best practices. omniClassifier is implemented as a desktop grid computing system using the Berkeley Open Infrastructure for Network Computing (BOINC) middleware. We used various gene expression datasets to demonstrate the potential scalability of omniClassifier for efficient and robust Big Data prediction modeling.


Figure: omniClassifier Prediction Modeling System. The system includes a web server with an interface for uploading data and submitting jobs, a MySQL database for storing datasets and prediction results, and the BOINC server. The BOINC server communicates with BOINC compute nodes to asynchronously distribute work units and collect results.


Phan JH, Kothari S, and Wang MD. “omniClassifier: a desktop grid computing system for Big Data prediction modeling.” ACM Conference on Bioinformatics, Computational Biology and Biomedicine, ACM-BCB. Newport Beach, CA, USA. 2014 Sep 20; In Press.
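
The idea of splitting a large parameter search into independent work units that are distributed and collected asynchronously, as described for omniClassifier above, can be sketched in Python. This sketch does not use BOINC or any omniClassifier code: a local thread pool stands in for the compute nodes, and the names `make_work_units`, `evaluate` and `run_grid`, along with the toy scoring function, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product

def make_work_units(grid):
    """Expand a parameter grid into one independent work unit per combination."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

def evaluate(unit):
    """Placeholder for training and validating one model (toy score, an assumption)."""
    return 1.0 / (1 + abs(unit["k"] - 5) + abs(unit["n_features"] - 20) / 10)

def run_grid(grid, n_workers=4):
    """Submit every work unit to the pool and collect results as they finish."""
    results = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = {pool.submit(evaluate, unit): unit for unit in make_work_units(grid)}
        # as_completed yields futures in completion order, not submission order.
        for future in as_completed(futures):
            results.append((futures[future], future.result()))
    return results

results = run_grid({"k": [1, 3, 5], "n_features": [10, 20]})
best_unit, best_score = max(results, key=lambda item: item[1])
```

Because each work unit depends only on its own parameters, results can arrive in any order and the final model selection is unaffected, which is what makes this kind of search a natural fit for loosely coupled desktop grids.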