Open Access
The potential and perils of preprocessing: Building new foundations
Alexander W. Blocker, Xiao-Li Meng
Bernoulli 19(4): 1176-1211 (September 2013). DOI: 10.3150/13-BEJSP16

Abstract

Preprocessing forms an oft-neglected foundation for a wide range of statistical and scientific analyses. However, it is rife with subtleties and pitfalls. Decisions made in preprocessing constrain all later analyses and are typically irreversible. Hence, data analysis becomes a collaborative endeavor by all parties involved in data collection, preprocessing and curation, and downstream inference. Even if each party has done its best given the information and resources available to them, the final result may still fall short of the best possible in the traditional single-phase inference framework. This is particularly relevant as we enter the era of “big data”. The technologies driving this data explosion are subject to complex new forms of measurement error. Simultaneously, we are accumulating increasingly massive databases of scientific analyses. As a result, preprocessing has become more vital (and potentially more dangerous) than ever before.

We propose a theoretical framework for the analysis of preprocessing under the banner of multiphase inference. We provide some initial theoretical foundations for this area, including distributed preprocessing, building upon previous work in multiple imputation. We motivate this foundation with two problems from biology and astrophysics, illustrating multiphase pitfalls and potential solutions. These examples also emphasize the motivations behind multiphase analyses—both practical and theoretical. We demonstrate that multiphase inferences can, in some cases, even surpass standard single-phase estimators in efficiency and robustness. Our work suggests several rich paths for further research into the statistical principles underlying preprocessing. To tackle our increasingly complex and massive data, we must ensure that our inferences are built upon solid inputs and sound principles. Principled investigation of preprocessing is thus a vital direction for statistical research.

Citation

Alexander W. Blocker, Xiao-Li Meng. "The potential and perils of preprocessing: Building new foundations." Bernoulli 19(4): 1176-1211, September 2013. https://doi.org/10.3150/13-BEJSP16

Information

Published: September 2013
First available in Project Euclid: 27 August 2013

zbMATH: 06216073
MathSciNet: MR3102548
Digital Object Identifier: 10.3150/13-BEJSP16

Keywords: data compression, data repositories, measurement error, multiphase inference, multiple imputation, statistical principles

Rights: Copyright © 2013 Bernoulli Society for Mathematical Statistics and Probability
