The Annals of Applied Statistics

Loglinear model selection and human mobility

Adrian Dobra and Reza Mohammadi

Full-text: Open access

Abstract

Methods for selecting loglinear models were among Steve Fienberg’s research interests since the start of his long and fruitful career. After we dwell upon the string of papers focusing on loglinear models that can be partly attributed to Steve’s contributions and influential ideas, we develop a new algorithm for selecting graphical loglinear models that is suitable for analyzing hyper-sparse contingency tables. We show how multi-way contingency tables can be used to represent patterns of human mobility. We analyze a dataset of geolocated tweets from South Africa comprising 46 million latitude/longitude locations of 476,601 Twitter users, summarized as a contingency table with 214 variables.

Article information

Source
Ann. Appl. Stat., Volume 12, Number 2 (2018), 815-845.

Dates
Received: November 2017
Revised: March 2018
First available in Project Euclid: 28 July 2018

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1532743478

Digital Object Identifier
doi:10.1214/18-AOAS1164

Mathematical Reviews number (MathSciNet)
MR3834287

Keywords
Contingency tables; model selection; human mobility; graphical models; Bayesian structural learning; birth–death processes; pseudo-likelihood

Citation

Dobra, Adrian; Mohammadi, Reza. Loglinear model selection and human mobility. Ann. Appl. Stat. 12 (2018), no. 2, 815–845. doi:10.1214/18-AOAS1164. https://projecteuclid.org/euclid.aoas/1532743478



Supplemental materials

  • Additional proofs, maps, figures and tables. In this online supplementary material, we provide the proof for Theorem 5.1, together with additional maps, figures, and tables referenced in this article.