The Annals of Applied Statistics

Semiparametric regression in testicular germ cell data

Anastasia Voulgaraki, Benjamin Kedem, and Barry I. Graubard

Full-text: Open access


It is possible to approach regression analysis with random covariates from a semiparametric perspective where information is combined from multiple multivariate sources. The approach assumes a semiparametric density ratio model where multivariate distributions are “regressed” on a reference distribution. A kernel density estimator can be constructed from many data sources in conjunction with the semiparametric model. The estimator is shown to be more efficient than the traditional single-sample kernel density estimator, and its optimal bandwidth is discussed in some detail. Each multivariate distribution and the corresponding conditional expectation (regression) of interest are estimated from the combined data using all sources. Graphical and quantitative diagnostic tools are suggested to assess model validity. The method is applied in quantifying the effect of height and age on weight of germ cell testicular cancer patients. Comparisons are made with multiple regression, generalized additive models (GAM) and nonparametric kernel regression.
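As a rough illustration of the ideas in the abstract, the following minimal sketch treats the simplest case: two univariate samples and a density ratio model of the form g1(x) = exp(alpha + beta*x) g0(x), where g0 is the reference density. It is not the authors' implementation. The function names, the linear tilt in x, the Gaussian kernel, and the fixed bandwidth are all assumptions made for the example. The parameters are fitted through the logistic regression equivalence familiar from case-control studies, and the reference density is then estimated by a kernel estimator that weights every point of the combined sample.

```python
import numpy as np

def fit_density_ratio(x0, x1, n_iter=50):
    """Fit g1(x) = exp(alpha + beta*x) * g0(x) from two samples.

    Assumption for this sketch: the fit uses a logistic regression of the
    sample label on x (Newton-Raphson), after which the intercept is
    shifted by log(n1/n0) to recover alpha.
    """
    t = np.concatenate([x0, x1])
    y = np.concatenate([np.zeros(len(x0)), np.ones(len(x1))])
    X = np.column_stack([np.ones_like(t), t])
    theta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (y - mu)
        hess = X.T @ (X * (mu * (1.0 - mu))[:, None])
        theta = theta + np.linalg.solve(hess, grad)
    rho = len(x1) / len(x0)
    return theta[0] - np.log(rho), theta[1]

def reference_weights(x0, x1, alpha, beta):
    """Probability mass assigned to each combined data point under g0."""
    t = np.concatenate([x0, x1])
    rho = len(x1) / len(x0)
    p = 1.0 / (len(x0) * (1.0 + rho * np.exp(alpha + beta * t)))
    return t, p

def weighted_kde(grid, t, p, h):
    """Gaussian kernel density estimate with point weights p, bandwidth h."""
    z = (grid[:, None] - t[None, :]) / h
    kern = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return (kern * p[None, :]).sum(axis=1) / h

# Hypothetical usage with simulated data: N(0,1) as reference, N(1,1) tilted.
rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, 300)
x1 = rng.normal(1.0, 1.0, 300)
alpha, beta = fit_density_ratio(x0, x1)
t, p = reference_weights(x0, x1, alpha, beta)
grid = np.linspace(-8.0, 9.0, 2001)
g0_hat = weighted_kde(grid, t, p, h=0.4)                            # reference density
g1_hat = weighted_kde(grid, t, p * np.exp(alpha + beta * t), h=0.4)  # tilted density
```

Both density estimates draw on all 600 observations rather than on one sample alone, which is the intuition behind the efficiency gain the abstract describes. The bandwidth 0.4 is an arbitrary choice here; the paper discusses the optimal bandwidth separately.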

Article information

Ann. Appl. Stat., Volume 6, Number 3 (2012), 1185-1208.

First available in Project Euclid: 31 August 2012

Digital Object Identifier: doi:10.1214/12-AOAS552

Keywords: Multivariate; density ratio model; kernel; random covariates; diagnostic; Nadaraya–Watson; GAM


Voulgaraki, Anastasia; Kedem, Benjamin; Graubard, Barry I. Semiparametric regression in testicular germ cell data. Ann. Appl. Stat. 6 (2012), no. 3, 1185–1208. doi:10.1214/12-AOAS552.


  • Anderson, T. W. (1971). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
  • Bondell, H. D. (2007). Testing goodness-of-fit in logistic case-control studies. Biometrika 94 487–495.
  • Cheng, K. F. and Chu, C. K. (2004). Semiparametric density estimation under a two-sample density ratio model. Bernoulli 10 583–604.
  • Fokianos, K. (2004). Merging information for semiparametric density estimation. J. R. Stat. Soc. Ser. B Stat. Methodol. 66 941–958.
  • Fokianos, K., Kedem, B., Qin, J. and Short, D. A. (2001). A semiparametric approach to the one-way layout. Technometrics 43 56–65.
  • Gilbert, P. B. (2004). Goodness-of-fit tests for semiparametric biased sampling models. J. Statist. Plann. Inference 118 51–81.
  • Gilbert, P. B., Lele, S. R. and Vardi, Y. (1999). Maximum likelihood estimation in semiparametric selection bias models with application to AIDS vaccine trials. Biometrika 86 27–43.
  • Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Monographs on Statistics and Applied Probability 43. Chapman and Hall, London.
  • Kedem, B., Lu, G., Wei, R. and Williams, P. D. (2008). Forecasting mortality rates via density ratio modeling. Canad. J. Statist. 36 193–206.
  • Kedem, B., Kim, E.-y., Voulgaraki, A. and Graubard, B. I. (2009). Two-dimensional semiparametric density ratio modeling of testicular germ cell data. Stat. Med. 28 2147–2159.
  • Li, T.-H. and Song, K.-S. (2002). Asymptotic analysis of a fast algorithm for efficient multiple frequency estimation. IEEE Trans. Inform. Theory 48 2709–2720.
  • Lu, G. (2007). Asymptotic theory for multiple-sample semiparametric density ratio models and its application to mortality forecasting. Ph.D. dissertation, Univ. Maryland, College Park, MD.
  • McGlynn, K. A. and Cook, M. B. (2010). The epidemiology of testicular cancer. In Male Reproductive Cancers: Epidemiology, Pathology and Genetics (W. D. Foulkes and K. A. Cooney, eds.) 51–83. Springer, New York.
  • McGlynn, K. A., Devesa, S. S., Sigurdson, A. J., Brown, L. M., Tsao, L. and Tarone, R. E. (2003). Trends in the incidence of testicular germ cell tumors in the United States. Cancer 97 63–70.
  • McGlynn, K. A., Sakoda, L. C., Rubertone, M. V., Sesterhenn, I. A., Lyu, C., Graubard, B. I. and Erickson, R. L. (2007). Body size, dairy consumption, puberty, and risk of testicular germ cell tumors. American Journal of Epidemiology 165 355–363.
  • Nadaraya, E. A. (1964). On estimating regression. Theory Probab. Appl. 9 141–142.
  • Ogden, C. L., Fryar, C. D., Carroll, M. D. and Flegal, K. M. (2004). Mean body weight, height, and body mass index, United States 1960–2002. Adv. Data 347 1–17.
  • Parzen, E. (1962). On estimation of a probability density function and mode. Ann. Math. Statist. 33 1065–1076.
  • Phue, J.-N., Kedem, B., Jaluria, P. and Shiloach, J. (2007). Evaluating microarrays using a semiparametric approach: Application to the central carbon metabolism of Escherichia coli BL21 and JM109. Genomics 89 300–305.
  • Prentice, R. L. and Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika 66 403–411.
  • Qin, J. (1998). Inferences for case-control and semiparametric two-sample density ratio models. Biometrika 85 619–630.
  • Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating equations. Ann. Statist. 22 300–325.
  • Qin, J. and Zhang, B. (1997). A goodness-of-fit test for logistic regression models based on case-control data. Biometrika 84 609–618.
  • Qin, J. and Zhang, B. (2005). Density estimation under a two-sample semiparametric model. J. Nonparametr. Stat. 17 665–683.
  • Rencher, A. C. (2000). Linear Models in Statistics. Wiley, New York.
  • Shao, J. (2003). Mathematical Statistics, 2nd ed. Springer, New York.
  • Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
  • Voulgaraki, A., Kedem, B. and Graubard, B. I. (2012). Supplement to “Semiparametric regression in testicular germ cell data.” DOI:10.1214/12-AOAS552SUPP.
  • Watson, G. S. (1964). Smooth regression analysis. Sankhyā Ser. A 26 359–372.
  • Wen, S. and Kedem, B. (2009). A semiparametric cluster detection method—a comprehensive power comparison with Kulldorff’s method. International Journal of Health Geographics 8. Online journal without page numbers.
  • Wood, S. N. (2006). Generalized Additive Models: An Introduction With R. Chapman & Hall/CRC, Boca Raton, FL.
  • Zhang, B. (1999). A chi-squared goodness-of-fit test for logistic regression models based on case-control data. Biometrika 86 531–539.
  • Zhang, B. (2000). A goodness of fit test for multiplicative-intercept risk models based on case-control data. Statist. Sinica 10 839–865.
  • Zhang, B. (2001). An information matrix test for logistic regression models based on case-control data. Biometrika 88 921–932.
  • Zhang, B. (2002). Assessing goodness-of-fit of generalized logit models based on case-control data. J. Multivariate Anal. 82 17–38.

Supplemental materials

  • Supplementary material: Supplement to “Semiparametric regression in testicular germ cell data”. The supplementary material contains the mathematical proofs of the lemmas, corollaries and theorems that support the statements and results, along with some additional references.