The Annals of Applied Statistics

Photo-$z$ estimation: An example of nonparametric conditional density estimation under selection bias

Rafael Izbicki, Ann B. Lee, and Peter E. Freeman


Redshift is a key quantity for inferring cosmological model parameters. In photometric redshift estimation, cosmologists use the coarse data collected from the vast majority of galaxies to predict the redshift of individual galaxies. To properly quantify the uncertainty in the predictions, however, one needs to go beyond standard regression and instead estimate the full conditional density $f(z|\mathbf{x})$ of a galaxy’s redshift $z$ given its photometric covariates $\mathbf{x}$. The problem is further complicated by selection bias: usually only the rarest and brightest galaxies have known redshifts, and these galaxies have characteristics and measured covariates that do not necessarily match those of more numerous and dimmer galaxies of unknown redshift. Unfortunately, there is not much research on how to best estimate complex multivariate densities in such settings.

Here we describe a general framework for properly constructing and assessing nonparametric conditional density estimators under selection bias, and for combining two or more estimators for optimal performance. We propose new, improved photo-$z$ estimators and illustrate our methods on data from the Sloan Digital Sky Survey and an application to galaxy–galaxy lensing. Although our main application is photo-$z$ estimation, our methods are relevant to any high-dimensional regression setting with complicated asymmetric and multimodal distributions in the response variable.
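The abstract does not spell out how an estimator is assessed under selection bias, so the following is only a minimal sketch of one standard idea from the covariate-shift literature: reweight the labeled sample by the density ratio $\beta(\mathbf{x}) = f_U(\mathbf{x})/f_L(\mathbf{x})$ so that an $L^2$ loss computed on labeled galaxies targets the unlabeled population. It assumes that $f(z|\mathbf{x})$ is the same in both samples and that only the covariate distributions differ. The function names (`weighted_cde_loss`, `cde_true`, `cde_marginal`), the one-dimensional toy data, and the use of the true density ratio are all illustrative assumptions, not the authors' code or data; in practice the ratio would have to be estimated.

```python
import numpy as np

def normal_pdf(v, mean, sd):
    """Density of N(mean, sd^2) evaluated at v (broadcasts over arrays)."""
    return np.exp(-0.5 * ((v - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def weighted_cde_loss(cde, x_lab, z_lab, w_lab, x_unlab, z_grid):
    """Importance-weighted L2 loss of a conditional density estimate,
    up to an additive constant that does not depend on `cde`.

    cde(x, z) returns the estimated density of z given x and must broadcast.
    w_lab holds importance weights f_U(x) / f_L(x) at the labeled points.
    """
    dz = z_grid[1] - z_grid[0]
    # Term 1: average over unlabeled x of the integral of cde^2 over z
    # (approximated by a Riemann sum on an evenly spaced grid).
    dens = cde(x_unlab[:, None], z_grid[None, :])
    term1 = np.mean(np.sum(dens ** 2, axis=1) * dz)
    # Term 2: weighted average of cde at the observed labeled (x, z) pairs.
    term2 = np.mean(w_lab * cde(x_lab, z_lab))
    return term1 - 2.0 * term2

# Toy data (hypothetical): labeled and unlabeled covariates differ in mean.
rng = np.random.default_rng(0)
n = 2000
x_lab = rng.normal(0.0, 1.0, n)            # labeled (selected) sample
x_unlab = rng.normal(1.0, 1.0, n)          # unlabeled target population
z_lab = x_lab + rng.normal(0.0, 0.5, n)    # toy truth: z | x ~ N(x, 0.5^2)

# Toy shortcut: the true density ratio is known here by construction.
w_lab = normal_pdf(x_lab, 1.0, 1.0) / normal_pdf(x_lab, 0.0, 1.0)

z_grid = np.linspace(-6.0, 9.0, 601)

def cde_true(x, z):
    # Candidate A: the correct conditional density of the toy model.
    return normal_pdf(z, x, 0.5)

def cde_marginal(x, z):
    # Candidate B: ignores x entirely (the 0.0 * x only fixes the shape).
    return normal_pdf(z, 0.0, 1.1) + 0.0 * x

loss_true = weighted_cde_loss(cde_true, x_lab, z_lab, w_lab, x_unlab, z_grid)
loss_marginal = weighted_cde_loss(cde_marginal, x_lab, z_lab, w_lab, x_unlab, z_grid)
print("weighted loss, candidate A:", loss_true)
print("weighted loss, candidate B:", loss_marginal)
```

A lower value indicates a better fit on the unlabeled population. The same quantity could in principle be used to tune a single estimator or to choose weights when combining several, though the sketch does neither.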

Article information

Ann. Appl. Stat., Volume 11, Number 2 (2017), 698-724.

Received: April 2016
Revised: January 2017
First available in Project Euclid: 20 July 2017

Keywords: density estimation; nonparametric statistics; selection bias; photometric redshift


Izbicki, Rafael; Lee, Ann B.; Freeman, Peter E. Photo-$z$ estimation: An example of nonparametric conditional density estimation under selection bias. Ann. Appl. Stat. 11 (2017), no. 2, 698--724. doi:10.1214/16-AOAS1013.


Supplemental materials

  • Supplement to “Photo-$z$ estimation: An example of nonparametric conditional density estimation under selection bias”. We provide the data and code used in the paper as supplementary material.