Bayesian Analysis

Selection sampling from large data sets for targeted inference in mixture modeling

Cliburn Chan, Ioanna Manolopoulou, and Mike West

Full-text: Open access


One of the challenges in using Markov chain Monte Carlo for model analysis in studies with very large datasets is the need to scan through the whole dataset at each iteration of the sampler, which can be computationally prohibitive. Several approaches have been developed to address this, typically drawing computationally manageable subsamples of the data. Here we consider the specific case where most of the data from a mixture model provides little or no information about the parameters of interest, and we aim to select subsamples such that the information extracted is most relevant. The motivating application arises in flow cytometry, where several measurements from a vast number of cells are available. Interest lies in identifying specific rare cell subtypes and characterizing them according to their corresponding markers. We present a Markov chain Monte Carlo approach where an initial subsample of the full dataset is used to guide selection sampling of a further set of observations targeted at a scientifically interesting, low-probability region. We define a Sequential Monte Carlo strategy in which the targeted subsample is augmented sequentially as estimates improve, and introduce a stopping rule for determining the size of the targeted subsample. An example from flow cytometry illustrates the ability of the approach to increase the resolution of inferences for rare cell subtypes.
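
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the one-dimensional two-component data, the fixed gate at 4.0 used to locate the rare region, the Gaussian-kernel weight function, and all names (`selection_weights`, `targeted_subsample`, `gate`, `scale`) are assumptions made for illustration. The sketch covers only the idea of using an initial random subsample to aim a second, weighted subsample at a low-probability region; it performs no mixture-model inference, no sequential augmentation, and no stopping rule.

```python
import numpy as np


def selection_weights(x, center, scale):
    """Inclusion probabilities in (0, 1], peaked at `center` (assumed Gaussian kernel)."""
    return np.exp(-0.5 * ((x - center) / scale) ** 2)


def targeted_subsample(x, center, scale, rng):
    """Keep each observation independently with probability selection_weights(x)."""
    keep = rng.random(x.shape[0]) < selection_weights(x, center, scale)
    return np.flatnonzero(keep)


rng = np.random.default_rng(0)

# Synthetic data: a dominant component near 0 and a rare one (about 0.1%) near 6.
n = 200_000
is_rare = rng.random(n) < 0.001
x = np.where(is_rare, rng.normal(6.0, 0.5, n), rng.normal(0.0, 1.0, n))

# Stage 1: simple random subsample of the full data set.
initial = rng.choice(n, size=20_000, replace=False)

# Crude stand-in for a fitted mixture model: locate the region of interest
# as the mean of initial-subsample points beyond a hypothetical gate.
gate = 4.0
center = x[initial][x[initial] > gate].mean()

# Stage 2: selection sampling of the remaining observations, targeted at `center`.
remaining = np.setdiff1d(np.arange(n), initial)
picked = remaining[targeted_subsample(x[remaining], center, scale=1.0, rng=rng)]

frac_random = is_rare[initial].mean()   # rare fraction under random sampling
frac_targeted = is_rare[picked].mean()  # rare fraction under targeted selection
print(f"random: {frac_random:.4f}  targeted: {frac_targeted:.4f}  kept: {picked.size}")
```

Because the targeted subsample is deliberately biased toward the region of interest, any inference based on it would need to account for the selection mechanism; the sketch does not attempt this.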

Article information

Bayesian Anal., Volume 5, Number 3 (2010), 429-449.

First available in Project Euclid: 22 June 2012

Keywords: flow cytometry; large data sets; mixture models; rare events; resampling; selection sampling; sequential Monte Carlo


Manolopoulou, Ioanna; Chan, Cliburn; West, Mike. Selection sampling from large data sets for targeted inference in mixture modeling. Bayesian Anal. 5 (2010), no. 3, 429-449. doi:10.1214/10-BA517.


See also

  • Related item: Fabio Rigat. Comment on article by Manolopoulou et al. Bayesian Anal., Vol. 5, Iss. 3 (2010), 451-455.
  • Related item: Nick Whiteley. Comment on article by Manolopoulou et al. Bayesian Anal., Vol. 5, Iss. 3 (2010), 457-460.
  • Related item: Cliburn Chan, Ioanna Manolopoulou, Mike West. Rejoinder. Bayesian Anal., Vol. 5, Iss. 3 (2010), 461-463.