Annals of Applied Statistics
- Ann. Appl. Stat.
- Volume 12, Number 2 (2018), 815-845.
Loglinear model selection and human mobility
Adrian Dobra and Reza Mohammadi
Full-text: Open access
Abstract
Methods for selecting loglinear models were among Steve Fienberg’s research interests since the start of his long and fruitful career. After we dwell upon the string of papers focusing on loglinear models that can be partly attributed to Steve’s contributions and influential ideas, we develop a new algorithm for selecting graphical loglinear models that is suitable for analyzing hyper-sparse contingency tables. We show how multi-way contingency tables can be used to represent patterns of human mobility. We analyze a dataset of geolocated tweets from South Africa that comprises 46 million latitude/longitude locations of 476,601 Twitter users that is summarized as a contingency table with 214 variables.
Article information
Dates
Received: November 2017
Revised: March 2018
First available in Project Euclid: 28 July 2018
Permanent link to this document
https://projecteuclid.org/euclid.aoas/1532743478
Digital Object Identifier
doi:10.1214/18-AOAS1164
Mathematical Reviews number (MathSciNet)
MR3834287
Zentralblatt MATH identifier
06980477
Keywords
Contingency tables; model selection; human mobility; graphical models; Bayesian structural learning; birth–death processes; pseudo-likelihood
Citation
Dobra, Adrian; Mohammadi, Reza. Loglinear model selection and human mobility. Ann. Appl. Stat. 12 (2018), no. 2, 815–845. doi:10.1214/18-AOAS1164. https://projecteuclid.org/euclid.aoas/1532743478
Supplemental materials
- Additional proofs, maps, figures and tables. In this online supplementary material, we provide the proof for Theorem 5.1, together with additional maps, figures, and tables referenced in this article.
Digital Object Identifier: doi:10.1214/18-AOAS1164SUPP

