The Annals of Applied Statistics

Sequential Dirichlet process mixtures of multivariate skew $t$-distributions for model-based clustering of flow cytometry data

Boris P. Hejblum, Chariff Alkhassim, Raphael Gottardo, François Caron, and Rodolphe Thiébaut



Flow cytometry is a high-throughput technology used to quantify multiple surface and intracellular markers at the level of a single cell. This enables us to identify cell subtypes and to determine their relative proportions. Improvements in this technology allow us to describe millions of individual cells from a blood sample using multiple markers. This results in high-dimensional datasets, whose manual analysis is highly time-consuming and poorly reproducible. While several methods have been developed to perform automatic recognition of cell populations, most of them treat and analyze each sample independently. However, in practice individual samples are rarely independent, especially in longitudinal studies. Here we analyze new longitudinal flow-cytometry data from the DALIA-1 trial, which evaluates a therapeutic vaccine against HIV, by proposing a new Bayesian nonparametric approach with a Dirichlet process mixture (DPM) of multivariate skew $t$-distributions to perform model-based clustering of flow-cytometry data. DPM models directly estimate the number of cell populations from the data, avoiding model selection issues, and skew $t$-distributions provide robustness to outliers and to nonelliptical shapes of cell populations. To accommodate repeated measurements, we propose a sequential strategy relying on a parametric approximation of the posterior. We illustrate the good performance of our method on simulated data and on an experimental benchmark dataset. This sequential strategy outperforms all other methods evaluated on the benchmark dataset and leads to improved performance on the DALIA-1 data.
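To make concrete the claim that a DPM estimates the number of cell populations from the data, the following is a minimal sketch of the Chinese restaurant process, the prior on partitions induced by a Dirichlet process. It is an illustration only and not the authors' sampler: the function name `sample_crp_partition`, the concentration parameter `alpha` and the seed are assumptions introduced here, and the sketch draws from the prior alone, without the skew $t$ likelihood or any posterior inference.

```python
import random


def sample_crp_partition(n_points, alpha, seed=0):
    """Draw cluster labels for n_points observations from a Chinese
    restaurant process with concentration alpha (alpha > 0).

    Hypothetical helper for illustration; it samples from the prior only.
    """
    rng = random.Random(seed)
    labels = []         # cluster label assigned to each observation
    cluster_sizes = []  # current number of observations in each cluster
    for _ in range(n_points):
        # An existing cluster k is chosen with weight n_k (its size);
        # a brand-new cluster is opened with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)  # open a new cluster
        else:
            cluster_sizes[k] += 1    # join an existing cluster
        labels.append(k)
    return labels, cluster_sizes


if __name__ == "__main__":
    labels, sizes = sample_crp_partition(n_points=1000, alpha=1.0)
    print("number of clusters drawn:", len(sizes))
```

The number of clusters is not fixed in advance: it is a random outcome of the draw and grows slowly with the number of observations, which is why no separate model selection step over the number of components is needed.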

Article information

Ann. Appl. Stat., Volume 13, Number 1 (2019), 638-660.

Received: July 2017
Revised: July 2018
First available in Project Euclid: 10 April 2019


Keywords: Automatic gating; Bayesian nonparametrics; Dirichlet process; flow cytometry; HIV; mixture model; skew $t$-distribution


Hejblum, Boris P.; Alkhassim, Chariff; Gottardo, Raphael; Caron, François; Thiébaut, Rodolphe. Sequential Dirichlet process mixtures of multivariate skew $t$-distributions for model-based clustering of flow cytometry data. Ann. Appl. Stat. 13 (2019), no. 1, 638--660. doi:10.1214/18-AOAS1209.




Supplemental materials

  • Online Supplement to “Sequential Dirichlet process mixtures of multivariate skew $t$-distributions for model-based clustering of flow cytometry data”. We provide additional mathematical details for the proposed Gibbs samplers and the parameter estimations, as well as additional plots showing the good performance of the sequential strategy.