Brazilian Journal of Probability and Statistics

A product-multinomial framework for categorical data analysis with missing responses

Frederico Z. Poleto, Julio M. Singer, and Carlos Daniel Paulino

Full-text: Open access

Abstract

With the objective of analysing categorical data with missing responses, we extend the multinomial modelling scenario described by Paulino (Braz. J. Probab. Stat. 5 (1991) 1–42) to a product-multinomial framework that allows the inclusion of explanatory variables. We consider maximum likelihood (ML) and weighted least squares (WLS) as well as a hybrid ML/WLS approach to fit linear, log-linear and more general functional linear models under ignorable and nonignorable missing data mechanisms. We express the results in a unified matrix notation that lends itself to computational implementation, and we develop a set of R subroutines for this purpose. We illustrate the procedures with the analysis of two data sets, and perform simulations to assess the properties of the estimators.
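
As an illustration of the kind of ML fitting mentioned in the abstract, the following minimal Python sketch applies the EM algorithm (Dempster, Laird and Rubin, 1977) to a single multinomial two-way table in which some units are classified on only one margin, assuming an ignorable missingness mechanism. It is not the authors' R implementation: the function name `em_incomplete_table` and all counts are hypothetical and chosen only for illustration.

```python
import numpy as np


def em_incomplete_table(full, row_only, col_only, tol=1e-10, max_iter=1000):
    """EM estimate of cell probabilities for a two-way table with
    partially classified units (hypothetical helper, ignorable mechanism).

    full     : counts of units classified on both variables (R x C)
    row_only : counts of units with only the row variable observed (R)
    col_only : counts of units with only the column variable observed (C)
    """
    full = np.asarray(full, dtype=float)
    row_only = np.asarray(row_only, dtype=float)
    col_only = np.asarray(col_only, dtype=float)
    n = full.sum() + row_only.sum() + col_only.sum()

    # Start from the uniform distribution over the cells.
    theta = np.full(full.shape, 1.0 / full.size)

    for _ in range(max_iter):
        # E-step: spread each partially classified count over the cells
        # compatible with it, in proportion to the current estimates.
        from_rows = row_only[:, None] * theta / theta.sum(axis=1, keepdims=True)
        from_cols = col_only[None, :] * theta / theta.sum(axis=0, keepdims=True)
        expected = full + from_rows + from_cols

        # M-step: multinomial ML estimate from the completed table.
        new_theta = expected / n

        converged = np.max(np.abs(new_theta - theta)) < tol
        theta = new_theta
        if converged:
            break

    return theta


# Hypothetical counts, for illustration only.
theta_hat = em_incomplete_table(
    full=[[100, 50], [75, 75]],
    row_only=[30, 60],
    col_only=[28, 60],
)
print(theta_hat)
```

Under these assumptions the E-step splits each partially classified count across the compatible cells in proportion to the current estimates, and the M-step renormalises the completed table. Explanatory variables (the product-multinomial case), structural models for the cell probabilities and nonignorable mechanisms are outside the scope of this sketch.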

Article information

Source
Braz. J. Probab. Stat. Volume 28, Number 1 (2014), 109–139.

Dates
First available in Project Euclid: 5 February 2014

Permanent link to this document
https://projecteuclid.org/euclid.bjps/1391611341

Digital Object Identifier
doi:10.1214/12-BJPS198

Mathematical Reviews number (MathSciNet)
MR3165432

Zentralblatt MATH identifier
06291464

Keywords
EM algorithm; incomplete data; missing data; missingness mechanism; selection models

Citation

Poleto, Frederico Z.; Singer, Julio M.; Paulino, Carlos Daniel. A product-multinomial framework for categorical data analysis with missing responses. Braz. J. Probab. Stat. 28 (2014), no. 1, 109–139. doi:10.1214/12-BJPS198. https://projecteuclid.org/euclid.bjps/1391611341


References

  • Agresti, A. (2002). Categorical Data Analysis, 2nd ed. New York: Wiley.
  • Azzalini, A. (1994). Logistic regression for autocorrelated data with application to repeated measures. Biometrika 81, 767–775.
  • Baker, S. G. (1994). Missing data: Composite linear models for incomplete multinomial data. Statistics in Medicine 13, 609–622.
  • Baker, S. G. and Laird, N. M. (1988). Regression analysis for categorical variables with outcome subject to nonignorable nonresponse. Journal of the American Statistical Association 83, 62–69; Corrigenda 1232.
  • Berndt, E. R., Hall, B. H., Hall, R. E. and Hausman, J. A. (1974). Estimation and inference in nonlinear structural models. Annals of Economic and Social Measurement 3, 653–666.
  • Blumenthal, S. (1968). Multinomial sampling with partially categorized data. Journal of the American Statistical Association 63, 542–551.
  • Chen, T. T. and Fienberg, S. E. (1974). Two-dimensional contingency tables with both completely and partially cross-classified data. Biometrics 30, 629–642.
  • Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with comments). Journal of the Royal Statistical Society, Ser. B 39, 1–38.
  • Fuchs, C. (1982). Maximum likelihood estimation and model selection in contingency tables with missing data. Journal of the American Statistical Association 77, 270–278.
  • Grizzle, J. E., Starmer, C. F. and Koch, G. G. (1969). Analysis of categorical data by linear models. Biometrics 25, 489–504.
  • Hocking, R. R. and Oxspring, H. H. (1971). Maximum likelihood estimation with incomplete multinomial data. Journal of the American Statistical Association 66, 65–70.
  • Imrey, P. B., Koch, G. G., Stokes, M. E., Darroch, J. N., Freeman, D. H. Jr. and Tolley, H. D. (1981). Categorical data analysis: Some reflections on the log linear model and logistic regression. Part I: Historical and methodological overview. International Statistical Review 49, 265–283.
  • Imrey, P. B., Koch, G. G., Stokes, M. E., Darroch, J. N., Freeman, D. H. Jr. and Tolley, H. D. (1982). Categorical data analysis: Some reflections on the log linear model and logistic regression. Part II: Data analysis. International Statistical Review 50, 35–63.
  • Kenward, M. G. and Molenberghs, G. (1998). Likelihood based frequentist inference when data are missing at random. Statistical Science 13, 236–247.
  • Kenward, M. G., Goetghebeur, E. and Molenberghs, G. (2001). Sensitivity analysis for incomplete categorical data. Statistical Modelling 1, 31–48.
  • Koch, G. G., Imrey, P. B. and Reinfurt, D. W. (1972). Linear model analysis of categorical data with incomplete response vectors. Biometrics 28, 663–692.
  • Landis, J. R., Stanish, W. M., Freeman, J. L. and Koch, G. G. (1976). A computer program for the generalized chi-square analysis of categorical data using weighted least squares (GENCAT). Computer Methods and Programs in Biomedicine 6, 196–231.
  • Lipsitz, S. R. and Fitzmaurice, G. M. (1996). The score test for independence in $R\times C$ contingency tables with missing data. Biometrics 52, 751–762.
  • Lipsitz, S. R., Laird, N. M. and Harrington, D. P. (1994). Weighted least squares analysis of repeated categorical measurements with outcomes subject to nonresponse. Biometrics 50, 11–24.
  • Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data, 2nd ed. New York: Wiley.
  • Molenberghs, G., Beunckens, C., Sotto, C. and Kenward, M. G. (2008). Every missingness not at random model has a missingness at random counterpart with equal fit. Journal of the Royal Statistical Society, Ser. B 70, 371–388.
  • Molenberghs, G. and Goetghebeur, E. (1997). Simple fitting algorithms for incomplete categorical data. Journal of the Royal Statistical Society, Ser. B 59, 401–414.
  • Molenberghs, G., Goetghebeur, E., Lipsitz, S. R. and Kenward, M. G. (1999). Nonrandom missingness in categorical data: Strengths and limitations. The American Statistician 53, 110–118.
  • Paulino, C. D. (1991). Analysis of incomplete categorical data: A survey of the conditional maximum likelihood and weighted least squares approaches. Brazilian Journal of Probability and Statistics 5, 1–42.
  • Paulino, C. D. and Silva, G. L. (1999). On the maximum likelihood analysis of the general linear model in categorical data. Computational Statistics & Data Analysis 30, 197–204.
  • Paulino, C. D. and Soares, P. (2003). Analysis of rates in incomplete Poisson data. The Statistician 52, 87–99.
  • Poleto, F. Z. (2006). Analysis of categorical data with missingness. M.Sc. thesis, Universidade de São Paulo (in Portuguese).
  • Poleto, F. Z., Singer, J. M. and Paulino, C. D. (2011a). Missing data mechanisms and their implications on the analysis of categorical data. Statistics and Computing 21, 31–43.
  • Poleto, F. Z., Paulino, C. D., Molenberghs, G. and Singer, J. M. (2011b). Inferential implications of over-parameterization: A case study in incomplete categorical data. International Statistical Review 79, 92–113.
  • R Development Core Team (2012). R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.
  • Rubin, D. B. (1976). Inference and missing data. Biometrika 63, 581–592.
  • Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
  • Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. Boca Raton: Chapman & Hall.
  • Vansteelandt, S., Goetghebeur, E., Kenward, M. G. and Molenberghs, G. (2006). Ignorance and uncertainty regions as inferential tools in a sensitivity analysis. Statistica Sinica 16, 953–979.
  • Williamson, G. D. and Haber, M. (1994). Models for three-dimensional contingency tables with completely and partially cross-classified data. Biometrics 50, 194–203.
  • Woolson, R. F. and Clarke, W. R. (1984). Analysis of categorical incomplete longitudinal data. Journal of the Royal Statistical Society, Ser. A 147, 87–99.