The Annals of Applied Statistics

Covariate-adaptive clustering of exposures for air pollution epidemiology cohorts

Joshua P. Keller, Mathias Drton, Timothy Larson, Joel D. Kaufman, Dale P. Sandler, and Adam A. Szpiro



Cohort studies in air pollution epidemiology aim to establish associations between health outcomes and air pollution exposures. Statistical analysis of such associations is complicated by the multivariate nature of the pollutant exposure data as well as the spatial misalignment that arises from the fact that exposure data are collected at regulatory monitoring network locations distinct from cohort locations. We present a novel clustering approach for addressing this challenge. Specifically, we present a method that uses geographic covariate information to cluster multi-pollutant observations and predict cluster membership at cohort locations. Our predictive $k$-means procedure identifies centers using a mixture model and is followed by multiclass spatial prediction. In simulations, we demonstrate that predictive $k$-means can reduce misclassification error by over 50% compared to ordinary $k$-means, with minimal loss in cluster representativeness. The improved prediction accuracy results in large gains of 30% or more in power for detecting effect modification by cluster in a simulated health analysis. In an analysis of the NIEHS Sister Study cohort using predictive $k$-means, we find that the association between systolic blood pressure (SBP) and long-term fine particulate matter (PM$_{2.5}$) exposure varies significantly between different clusters of PM$_{2.5}$ component profiles. Our cluster-based analysis shows that, for subjects assigned to a cluster located in the Midwestern U.S., a 10 $\mu$g/m$^{3}$ difference in exposure is associated with 4.37 mmHg (95% CI, 2.38, 6.35) higher SBP.
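The two-stage idea that the abstract compares against can be illustrated with a small sketch. This is not the authors' predictive $k$-means procedure, whose centers come from a mixture model described in the Supplemental Material. It is only the simpler baseline: ordinary $k$-means (Lloyd's algorithm) on monitor-level pollutant profiles, followed by a multiclass prediction of cluster membership at cohort locations from geographic covariates. The toy data, the function names, and the use of a 1-nearest-neighbor classifier as the prediction step are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=50, seed=0):
    """Ordinary k-means via Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


def predict_label(cov, monitor_covs, monitor_labels):
    """Assumed prediction step: 1-nearest neighbor in covariate space."""
    i = min(range(len(monitor_covs)), key=lambda m: sq_dist(cov, monitor_covs[m]))
    return monitor_labels[i]


# Hypothetical toy data: six monitors, two pollutant components each,
# and one geographic covariate per location.
monitor_pollutants = [(1.0, 0.2), (1.1, 0.1), (0.9, 0.3),
                      (0.2, 1.0), (0.1, 1.2), (0.3, 0.9)]
monitor_covariates = [(0.10,), (0.20,), (0.15,), (0.90,), (0.80,), (0.85,)]
cohort_covariates = [(0.12,), (0.88,)]  # cohort locations have no pollutant data

centers, labels = kmeans(monitor_pollutants, k=2)
cohort_labels = [predict_label(c, monitor_covariates, labels)
                 for c in cohort_covariates]
print(labels, cohort_labels)
```

In this baseline the clusters are formed without reference to the covariates, so nothing guarantees that membership is predictable at cohort locations. The abstract's point is that choosing centers with the covariates in view can reduce that misclassification.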

Article information

Ann. Appl. Stat., Volume 11, Number 1 (2017), 93-113.

Received: December 2015
Revised: August 2016
First available in Project Euclid: 8 April 2017


Keywords: Air pollution; clustering; dimension reduction; particulate matter


Keller, Joshua P.; Drton, Mathias; Larson, Timothy; Kaufman, Joel D.; Sandler, Dale P.; Szpiro, Adam A. Covariate-adaptive clustering of exposures for air pollution epidemiology cohorts. Ann. Appl. Stat. 11 (2017), no. 1, 93--113. doi:10.1214/16-AOAS992.




Supplemental materials

  • Supplemental material for “Covariate-adaptive clustering of exposures for air pollution epidemiology cohorts”. The Supplemental Material document contains details of the algorithm for selecting predictive $k$-means cluster centers, additional results from the simulations, sensitivity results from the PM$_{2.5}$ analysis that use different numbers of clusters, and the results from applying $k$-means clustering to the PM$_{2.5}$ data.