The Annals of Applied Statistics

Dynamic mixtures of factor analyzers to characterize multivariate air pollutant exposures

Antonello Maruotti, Jan Bulla, Francesco Lagona, Marco Picone, and Francesca Martella


The assessment of pollution exposure is based on the analysis of a multivariate time series that includes the concentrations of several pollutants as well as the measurements of multiple atmospheric variables. It typically requires methods of dimensionality reduction that are capable of identifying potentially dangerous combinations of pollutants and simultaneously segmenting exposure periods according to air quality conditions. When the data are high-dimensional, however, efficient dimensionality reduction is challenging because of the formidable structure of cross-correlations that arises from the dynamic interaction between weather conditions and natural/anthropogenic pollution sources. In order to assess pollution exposure in an urban area while taking the above-mentioned difficulties into account, we have developed a class of parsimonious hidden Markov models. In a multivariate time series setting, this approach performs temporal segmentation and dimensionality reduction simultaneously. We specifically approximate the distribution of multiple pollutant concentrations by mixtures of factor analysis models, whose parameters evolve according to a latent Markov chain. Covariates are included as predictors of the chain transition probabilities. Parameter constraints on the factorial component of the model are exploited to tune the flexibility of dimensionality reduction. In order to estimate the model parameters efficiently, we have proposed a novel three-step Alternating Expectation Conditional Maximization (AECM) algorithm, which is also assessed in a simulation study. In the case study, the proposed methods could (1) describe the exposure to pollution in terms of a few latent regimes, (2) associate these regimes with specific combinations of pollutant concentration levels as well as distinct correlation structures between concentrations, and (3) capture the influence of weather conditions on transitions between regimes.
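The model structure described in the abstract can be illustrated with a minimal simulation sketch. It is not the authors' implementation, and the parametrization is an assumption, since the full text is not shown here: each latent state k is given a Gaussian factor-analysis covariance of the form Lambda_k Lambda_k^T + diag(psi_k), and the transition probabilities are assumed to follow a multinomial logit in the covariates. All function and argument names (fa_covariance, transition_matrix, simulate, beta) are hypothetical, and no identifiability constraints are imposed on beta.

```python
import numpy as np


def fa_covariance(loadings, uniquenesses):
    """Covariance implied by a factor-analysis model: Lambda Lambda^T + diag(psi)."""
    return loadings @ loadings.T + np.diag(uniquenesses)


def transition_matrix(x, beta):
    """Covariate-dependent transition probabilities (assumed multinomial logit).

    beta has shape (K, K, q); beta[j, k] holds the coefficients for the
    move from state j to state k. Each row is a softmax, so it sums to one.
    """
    eta = beta @ x                                # (K, K) linear predictors
    eta = eta - eta.max(axis=1, keepdims=True)    # guard against overflow
    expo = np.exp(eta)
    return expo / expo.sum(axis=1, keepdims=True)


def simulate(T, means, loadings, uniquenesses, beta, covariates, rng):
    """Draw a state path and observations from the sketched model."""
    K = len(means)
    states = np.empty(T, dtype=int)
    obs = np.empty((T, means[0].shape[0]))
    state = rng.integers(K)  # uniform initial state (an assumption)
    for t in range(T):
        if t > 0:
            # The covariates observed at time t drive the move out of the previous state.
            P = transition_matrix(covariates[t], beta)
            state = rng.choice(K, p=P[state])
        cov = fa_covariance(loadings[state], uniquenesses[state])
        obs[t] = rng.multivariate_normal(means[state], cov)
        states[t] = state
    return states, obs
```

With p observed pollutants and a small number of factors per state, each state covariance is described by a loading matrix and p uniquenesses rather than a full p-by-p matrix, which is the sense in which such a model is parsimonious.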

Article information

Ann. Appl. Stat. Volume 11, Number 3 (2017), 1617-1648.

Received: November 2016
Revised: March 2017
First available in Project Euclid: 5 October 2017

Digital Object Identifier: doi:10.1214/17-AOAS1049

Keywords: Hidden Markov models; AECM algorithm; dimensionality reduction; three-step algorithm


Maruotti, Antonello; Bulla, Jan; Lagona, Francesco; Picone, Marco; Martella, Francesca. Dynamic mixtures of factor analyzers to characterize multivariate air pollutant exposures. Ann. Appl. Stat. 11 (2017), no. 3, 1617–1648. doi:10.1214/17-AOAS1049.


