The Annals of Applied Statistics

Evaluating risk-prediction models using data from electronic health records

Le Wang, Pamela A. Shaw, Hansie M. Mathelier, Stephen E. Kimmel, and Benjamin French

Full-text: Open access


The availability of data from electronic health records facilitates the development and evaluation of risk-prediction models, but estimation of prediction accuracy could be limited by outcome misclassification, which can arise if events are not captured. We evaluate the robustness of prediction accuracy summaries, obtained from receiver operating characteristic curves and risk-reclassification methods, if events are not captured (i.e., “false negatives”). We derive estimators for sensitivity and specificity if misclassification is independent of marker values. In simulation studies, we quantify the potential for bias in prediction accuracy summaries if misclassification depends on marker values. We compare the accuracy of alternative prognostic models for 30-day all-cause hospital readmission among 4548 patients discharged from the University of Pennsylvania Health System with a primary diagnosis of heart failure. Simulation studies indicate that if misclassification depends on marker values, then the estimated accuracy improvement is also biased, but the direction of the bias depends on the direction of the association between markers and the probability of misclassification. In our application, 29% of the 1143 readmitted patients were readmitted to a hospital elsewhere in Pennsylvania, which reduced prediction accuracy. Outcome misclassification can result in erroneous conclusions regarding the accuracy of risk-prediction models.
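The abstract does not state the estimators themselves. As a minimal illustrative sketch (not the authors' derivation), the simulation below shows why marker-independent false negatives leave sensitivity essentially unchanged but lower the apparent specificity: recorded non-events are a mixture of true non-events and missed events. The marker distributions, the cutoff, and the function names are hypothetical. The prevalence (0.25) and miss probability (0.29) only loosely echo the application figures quoted above.

```python
import random


def simulate(n=200_000, prevalence=0.25, miss_prob=0.29, cutoff=0.5, seed=1):
    """Simulate a marker, a true binary outcome, and a recorded outcome in
    which each true event is missed with probability miss_prob,
    independently of the marker (all settings are illustrative)."""
    rng = random.Random(seed)
    # Counts of marker-positive subjects and of all subjects, by outcome status.
    pos = {"true_event": 0, "true_nonevent": 0, "obs_event": 0, "obs_nonevent": 0}
    tot = dict.fromkeys(pos, 0)
    for _ in range(n):
        event = rng.random() < prevalence
        # Hypothetical marker: N(1, 1) among events, N(0, 1) among non-events.
        marker = rng.gauss(1.0 if event else 0.0, 1.0)
        # An event is recorded only if it is captured; non-events are never
        # recorded as events (false negatives only, no false positives).
        recorded = event and rng.random() >= miss_prob
        for key in ("true_event" if event else "true_nonevent",
                    "obs_event" if recorded else "obs_nonevent"):
            tot[key] += 1
            pos[key] += marker > cutoff
    return {
        "true_sens": pos["true_event"] / tot["true_event"],
        "true_spec": 1 - pos["true_nonevent"] / tot["true_nonevent"],
        "obs_sens": pos["obs_event"] / tot["obs_event"],
        "obs_spec": 1 - pos["obs_nonevent"] / tot["obs_nonevent"],
    }


def expected_observed_specificity(sens, spec, prevalence, miss_prob):
    """Specificity computed against the recorded outcome, treating recorded
    non-events as a mixture of true non-events and missed events."""
    missed = prevalence * miss_prob
    nonevents = 1 - prevalence
    return (spec * nonevents + (1 - sens) * missed) / (nonevents + missed)


result = simulate()
print(result)
print(expected_observed_specificity(
    result["true_sens"], result["true_spec"], 0.25, 0.29))
```

Under these assumptions the mixture formula can be inverted to recover the true specificity from the observed one when the miss probability is known, which is one plausible route to a corrected estimator. If the miss probability instead varied with the marker, the recorded events would no longer be a random subset of the true events, and the observed sensitivity would be biased as well.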

Article information

Ann. Appl. Stat., Volume 10, Number 1 (2016), 286-304.

Received: December 2014
Revised: July 2015
First available in Project Euclid: 25 March 2016

Keywords: Outcome misclassification; prediction accuracy; risk reclassification; ROC curves


Wang, Le; Shaw, Pamela A.; Mathelier, Hansie M.; Kimmel, Stephen E.; French, Benjamin. Evaluating risk-prediction models using data from electronic health records. Ann. Appl. Stat. 10 (2016), no. 1, 286–304. doi:10.1214/15-AOAS891.


Supplemental materials

  • Supplement to “Evaluating risk-prediction models using data from electronic health records” (DOI:10.1214/15-AOAS891SUPP). The supplement provides additional simulation results, summarizing the distribution of percent bias across simulated datasets.