• Bernoulli
  • Volume 19, Number 4 (2013), 1176-1211.

The potential and perils of preprocessing: Building new foundations

Alexander W. Blocker and Xiao-Li Meng


Preprocessing forms an oft-neglected foundation for a wide range of statistical and scientific analyses. However, it is rife with subtleties and pitfalls. Decisions made in preprocessing constrain all later analyses and are typically irreversible. Hence, data analysis becomes a collaborative endeavor by all parties involved in data collection, preprocessing and curation, and downstream inference. Even if each party has done its best given the information and resources available to them, the final result may still fall short of the best possible in the traditional single-phase inference framework. This is particularly relevant as we enter the era of “big data”. The technologies driving this data explosion are subject to complex new forms of measurement error. Simultaneously, we are accumulating increasingly massive databases of scientific analyses. As a result, preprocessing has become more vital (and potentially more dangerous) than ever before.

We propose a theoretical framework for the analysis of preprocessing under the banner of multiphase inference. We provide some initial theoretical foundations for this area, including distributed preprocessing, building upon previous work in multiple imputation. We motivate this foundation with two problems from biology and astrophysics, illustrating multiphase pitfalls and potential solutions. These examples also emphasize the motivations behind multiphase analyses—both practical and theoretical. We demonstrate that multiphase inferences can, in some cases, even surpass standard single-phase estimators in efficiency and robustness. Our work suggests several rich paths for further research into the statistical principles underlying preprocessing. To tackle our increasingly complex and massive data, we must ensure that our inferences are built upon solid inputs and sound principles. Principled investigation of preprocessing is thus a vital direction for statistical research.

Article information

First available in Project Euclid: 27 August 2013


Keywords: data compression; data repositories; measurement error; multiphase inference; multiple imputation; statistical principles


Blocker, Alexander W.; Meng, Xiao-Li. The potential and perils of preprocessing: Building new foundations. Bernoulli 19 (2013), no. 4, 1176–1211. doi:10.3150/13-BEJSP16.

