Flexible risk prediction models for left or interval-censored data from electronic health records
Noorie Hyun, Li C. Cheung, Qing Pan, Mark Schiffman, Hormuzd A. Katki
Ann. Appl. Stat. 11(2): 1063-1084 (June 2017). DOI: 10.1214/17-AOAS1036
Abstract

Electronic health records are a large and cost-effective data source for developing risk-prediction models. However, for screen-detected diseases, standard risk models (such as Kaplan–Meier or Cox models) do not account for key issues encountered with electronic health record data: left-censoring of pre-existing (prevalent) disease, interval-censoring of incident disease, and ambiguity about whether disease is prevalent or incident when definitive disease ascertainment is not conducted at baseline. Furthermore, researchers might conduct novel screening tests only on a complex two-phase subsample. We propose a family of weighted mixture models that account for left/interval-censoring and complex sampling via inverse-probability weighting in order to estimate current and future absolute risk: a weakly-parametric model for general use and a semiparametric model for checking the goodness of fit of the weakly-parametric model. We demonstrate asymptotic properties analytically and by simulation. We used electronic health records to assemble a cohort of 33,295 human papillomavirus (HPV) positive women undergoing cervical cancer screening at Kaiser Permanente Northern California (KPNC), the cohort that underlies current screening guidelines. The next guidelines would focus on HPV typing tests, but reporting 14 HPV types is too complex for clinical use. The National Cancer Institute, along with KPNC, conducted an HPV typing test on a complex subsample of 9,258 women in the cohort. We used our model to estimate the risk due to each type and grouped the 14 types (3-year risks ranging from 21.9% to 1.5%) into 4 risk-bands to simplify reporting to clinicians and guidelines. These risk-bands could be adopted by future HPV typing tests and future screening guidelines.
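The weighting idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Weibull form for incident disease, the function names, and the record layout `(left, right, sampling_prob)` are all assumptions made for illustration. Cumulative risk is written as a mixture of prevalent disease (probability `prevalence` at baseline) and incident disease, and each subject's log-likelihood contribution is weighted by the inverse of her probability of being sampled.

```python
import math


def cumulative_risk(t, prevalence, shape, scale):
    """Mixture risk G(t) = p + (1 - p) * F(t) for t >= 0.

    p is the probability of prevalent disease at baseline; F is a
    Weibull CDF for incident disease (an assumed parametric choice).
    """
    weibull_cdf = 1.0 - math.exp(-((t / scale) ** shape))
    return prevalence + (1.0 - prevalence) * weibull_cdf


def weighted_log_likelihood(records, prevalence, shape, scale):
    """Inverse-probability-weighted log-likelihood (illustrative only).

    Each record is (left, right, sampling_prob):
      left  = last time the subject was confirmed disease-free, or None
              if disease status was never ascertained before `right`
              (left-censored: disease may be prevalent or incident);
      right = time disease was detected, or math.inf if never detected
              (right-censored);
      sampling_prob = probability the subject entered the subsample.
    """
    total = 0.0
    for left, right, sampling_prob in records:
        lower = 0.0 if left is None else cumulative_risk(
            left, prevalence, shape, scale)
        upper = 1.0 if math.isinf(right) else cumulative_risk(
            right, prevalence, shape, scale)
        # Probability that disease onset falls in the observed interval,
        # weighted by the inverse sampling probability.
        total += math.log(upper - lower) / sampling_prob
    return total


# Hypothetical records: disease found at baseline, disease found between
# years 1 and 3 after a negative baseline workup, and a disease-free
# subject last seen at year 5 who was sampled with probability 0.5.
example = [(None, 0.0, 1.0), (1.0, 3.0, 1.0), (5.0, math.inf, 0.5)]
value = weighted_log_likelihood(example, prevalence=0.1, shape=1.0, scale=20.0)
```

In practice the parameters would be chosen to maximize this weighted log-likelihood with a numerical optimizer, and variance estimation would need to reflect the two-phase sampling design.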

Copyright © 2017 Institute of Mathematical Statistics
Noorie Hyun, Li C. Cheung, Qing Pan, Mark Schiffman, and Hormuzd A. Katki "Flexible risk prediction models for left or interval-censored data from electronic health records," The Annals of Applied Statistics 11(2), 1063-1084, (June 2017). https://doi.org/10.1214/17-AOAS1036
Received: 1 July 2016; Published: June 2017