The Annals of Statistics

Sequential importance sampling for multiway tables

Yuguo Chen, Ian H. Dinwoodie, and Seth Sullivant


Abstract

We describe an algorithm for the sequential sampling of entries in multiway contingency tables with given constraints. The algorithm can be used for computations in exact conditional inference. To justify the algorithm, we develop a theory that relates the sampling of values at each step to properties of the associated toric ideal, using computational commutative algebra. In particular, the property that the feasible cell counts at each step form an interval is related to exponents on lead indeterminates of a lexicographic Gröbner basis. Also, the approximation of integer programming by linear programming for sampling is related to initial terms of a toric ideal. We apply the algorithm to examples of contingency tables that appear in the social and medical sciences. The numerical results demonstrate that the theory is applicable and that the algorithm performs well.
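
As a rough illustration of the idea of sampling cells one at a time, the following Python sketch treats only the simplest case: a two-way table with fixed row and column sums. It is not the authors' implementation. The function names `sample_table` and `estimate_table_count`, the uniform proposal on each cell's feasible range, and the choice of counting tables as the target quantity are all assumptions made for this example. In the two-way case the feasible values of each cell form an interval with closed-form endpoints; for multiway tables the abstract indicates that such bounds come from integer or linear programming, which this sketch does not attempt.

```python
import random


def sample_table(row_sums, col_sums, rng=random):
    """Draw one two-way table with the given margins, cell by cell.

    Returns (table, weight), where weight is 1 / (proposal probability).
    Assumption for this sketch: each cell is drawn uniformly from the
    interval of values that can still be completed to a valid table.
    """
    if sum(row_sums) != sum(col_sums):
        raise ValueError("row and column sums must have the same total")
    m, n = len(row_sums), len(col_sums)
    row_rem = list(row_sums)              # row totals not yet allocated
    table = [[0] * n for _ in range(m)]
    weight = 1
    for j in range(n):                    # fill the table column by column
        col_rem = col_sums[j]             # what column j still needs
        for i in range(m):
            rows_below = sum(row_rem[i + 1:])
            lower = max(0, col_rem - rows_below)   # rows below must absorb the rest
            upper = min(row_rem[i], col_rem)
            x = rng.randint(lower, upper)          # inclusive on both ends
            weight *= upper - lower + 1            # 1 / P(x) under the uniform draw
            table[i][j] = x
            row_rem[i] -= x
            col_rem -= x
    return table, weight


def estimate_table_count(row_sums, col_sums, n_samples=1000, seed=0):
    """Average the importance weights to estimate the number of tables."""
    rng = random.Random(seed)
    total = sum(sample_table(row_sums, col_sums, rng)[1] for _ in range(n_samples))
    return total / n_samples
```

Because the weight of a sampled table is the reciprocal of its proposal probability, the average weight is an unbiased estimate of the number of tables with the given margins. A call such as `estimate_table_count([1, 1, 1], [1, 1, 1])` should therefore land near 6, the number of 3 × 3 permutation matrices. Reweighting toward a non-uniform target, such as the hypergeometric distribution used in exact conditional tests, would multiply each weight by the target probability of the sampled table.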

Article information

Source
Ann. Statist., Volume 34, Number 1 (2006), 523-545.

Dates
First available in Project Euclid: 2 May 2006

Permanent link to this document
https://projecteuclid.org/euclid.aos/1146576273

Digital Object Identifier
doi:10.1214/009053605000000822

Mathematical Reviews number (MathSciNet)
MR2275252

Zentralblatt MATH identifier
1091.62051

Subjects
Primary: 62H17: Contingency tables; 62F03: Hypothesis testing
Secondary: 13P10: Gröbner bases; other bases for ideals and modules (e.g., Janet and border bases)

Keywords
Conditional inference; contingency table; exact test; Monte Carlo; sequential importance sampling; toric ideal

Citation

Chen, Yuguo; Dinwoodie, Ian H.; Sullivant, Seth. Sequential importance sampling for multiway tables. Ann. Statist. 34 (2006), no. 1, 523–545. doi:10.1214/009053605000000822. https://projecteuclid.org/euclid.aos/1146576273


