The Annals of Applied Statistics

Sequential double cross-validation for assessment of added predictive ability in high-dimensional omic applications

Mar Rodríguez-Girondo, Perttu Salo, Tomasz Burzykowski, Markus Perola, Jeanine Houwing-Duistermaat, and Bart Mertens


Enriching existing predictive models with new biomolecular markers is an important task in the new multi-omic era. Clinical studies increasingly include new sets of omic measurements which may prove their added value in terms of predictive performance. We introduce a two-step approach for the assessment of the added predictive ability of omic predictors, based on sequential double cross-validation and regularized regression models. We propose several performance indices to summarize the two-stage prediction procedure and a permutation test to formally assess the added predictive value of a second omic set of predictors over a primary omic source. The performance of the test is investigated through simulations. We illustrate the new method through the systematic assessment and comparison of the performance of transcriptomics and metabolomics sources in the prediction of body mass index (BMI) using longitudinal data from the Dietary, Lifestyle, and Genetic determinants of Obesity and Metabolic syndrome (DILGOM) study, a population-based cohort from Finland.
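The two-step idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the regularized model, the performance indices, or the permutation scheme, so everything below is an assumption made for illustration. The sketch uses ridge regression, a Q²-type index (one minus the ratio of cross-validated squared error to total sum of squares), and a permutation of the rows of the second omic block against the first-stage residuals. All function names (`double_cv_predict`, `sequential_double_cv`, `permutation_test`) are hypothetical.

```python
import numpy as np

def ridge_predict(X_tr, y_tr, X_te, lam):
    """Centered ridge regression in dual form, convenient when p >> n."""
    mu_x, mu_y = X_tr.mean(axis=0), y_tr.mean()
    Xc = X_tr - mu_x
    alpha = np.linalg.solve(Xc @ Xc.T + lam * np.eye(len(y_tr)), y_tr - mu_y)
    return mu_y + (X_te - mu_x) @ (Xc.T @ alpha)

def make_folds(n, k, rng):
    """Random partition of 0..n-1 into k folds."""
    return np.array_split(rng.permutation(n), k)

def double_cv_predict(X, y, lams, rng, k_outer=5, k_inner=5):
    """Outer loop yields held-out predictions; inner loop picks the penalty."""
    n = len(y)
    pred = np.empty(n)
    for test in make_folds(n, k_outer, rng):
        train = np.setdiff1d(np.arange(n), test)
        inner = make_folds(len(train), k_inner, rng)
        sse = []
        for lam in lams:
            err = 0.0
            for val in inner:
                fit = np.setdiff1d(np.arange(len(train)), val)
                p = ridge_predict(X[train[fit]], y[train[fit]],
                                  X[train[val]], lam)
                err += np.sum((y[train[val]] - p) ** 2)
            sse.append(err)
        best = lams[int(np.argmin(sse))]
        # The held-out fold never influences the choice of penalty.
        pred[test] = ridge_predict(X[train], y[train], X[test], best)
    return pred

def q2(y, pred):
    """Q²-type index (assumed here; the paper's indices may differ)."""
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def sequential_double_cv(X1, X2, y, lams, rng, k_outer=5, k_inner=5):
    """Stage 1: predict y from X1. Stage 2: predict stage-1 residuals from X2."""
    p1 = double_cv_predict(X1, y, lams, rng, k_outer, k_inner)
    resid = y - p1
    p2 = double_cv_predict(X2, resid, lams, rng, k_outer, k_inner)
    return q2(y, p1), q2(y, p1 + p2), resid, p2

def permutation_test(X1, X2, y, lams, rng, n_perm=100, k_outer=5, k_inner=5):
    """Assumed scheme: shuffle rows of X2 against the stage-1 residuals."""
    _, _, resid, p2 = sequential_double_cv(X1, X2, y, lams, rng,
                                           k_outer, k_inner)
    observed = q2(resid, p2)
    count = 0
    for _ in range(n_perm):
        X2_perm = X2[rng.permutation(len(y))]
        p2_perm = double_cv_predict(X2_perm, resid, lams, rng,
                                    k_outer, k_inner)
        count += q2(resid, p2_perm) >= observed
    return observed, (1 + count) / (1 + n_perm)
```

In this sketch the first-stage residuals are themselves cross-validated predictions, so the second block is only credited with signal that the primary block failed to capture out of sample. The permutation p-value uses the usual add-one correction so that it can never be exactly zero.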

Article information

Ann. Appl. Stat., Volume 12, Number 3 (2018), 1655-1678.

Received: July 2016
Revised: November 2017
First available in Project Euclid: 11 September 2018

Keywords: added predictive ability, double cross-validation, regularized regression, multiple omics sets


Rodríguez-Girondo, Mar; Salo, Perttu; Burzykowski, Tomasz; Perola, Markus; Houwing-Duistermaat, Jeanine; Mertens, Bart. Sequential double cross-validation for assessment of added predictive ability in high-dimensional omic applications. Ann. Appl. Stat. 12 (2018), no. 3, 1655–1678. doi:10.1214/17-AOAS1125.


Supplemental materials

  • Supplement A: Application to DILGOM data. Supplementary analyses. We provide supplementary analyses in our data application to the DILGOM study.
  • Supplement B: Simulation study. Alternative approaches. We provide supplementary simulation results from alternative approaches.