Electronic Journal of Statistics

Linear regression with sparsely permuted data

Martin Slawski and Emanuel Ben-David

Full-text: Open access


In regression analysis of multivariate data, it is tacitly assumed that response and predictor variables in each observed response-predictor pair correspond to the same entity or unit. In this paper, we consider the situation of “permuted data” in which this basic correspondence has been lost. Several recent papers have considered this situation without further assumptions on the underlying permutation. In applications, the latter is often known to have additional structure that can be leveraged. Specifically, we herein consider the common scenario of “sparsely permuted data” in which only a small fraction of the data is affected by a mismatch between response and predictors. However, an adverse effect already observed for sparsely permuted data is that the least squares estimator, as well as other estimators not accounting for such partial mismatch, are inconsistent. One approach studied in detail herein is to treat permuted data as outliers, which motivates the use of robust regression formulations to estimate the regression parameter. The resulting estimate can subsequently be used to recover the permutation. A notable benefit of the proposed approach is its computational simplicity, given the general lack of procedures for the above problem that are both statistically sound and computationally appealing.
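The idea of treating mismatched pairs as outliers can be illustrated with a small sketch. It is not the paper's procedure: it assumes a single predictor without intercept, a Huber M-estimator fitted by iteratively reweighted least squares with a MAD-based scale, a hypothetical flagging threshold of five robust scale units, and a sort-based re-matching that is only valid in this one-dimensional setting. All function names and constants are illustrative assumptions.

```python
import math
from statistics import median

def ols_slope(x, y):
    """Least squares slope for the no-intercept model y = beta * x."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def mad_scale(resid):
    """Robust scale estimate: median absolute residual, rescaled for Gaussian noise."""
    return median([abs(r) for r in resid]) / 0.6745

def huber_slope(x, y, n_iter=50, k=1.345):
    """Huber M-estimate of the slope via iteratively reweighted least squares."""
    beta = ols_slope(x, y)  # start from the (biased) least squares fit
    for _ in range(n_iter):
        resid = [yi - beta * xi for xi, yi in zip(x, y)]
        c = k * mad_scale(resid)
        # Huber weights: 1 inside the threshold, c/|r| outside
        w = [1.0 if abs(r) <= c else c / abs(r) for r in resid]
        num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
        den = sum(wi * xi * xi for wi, xi in zip(w, x))
        beta = num / den
    return beta

# Synthetic data (an assumption for illustration): true slope 2, small noise.
n = 50
x = [float(i) for i in range(n)]
y_true = [2.0 * xi + 0.1 * math.sin(i) for i, xi in enumerate(x)]

# Sparse permutation: only two pairs of responses are swapped.
perm = list(range(n))
perm[2], perm[47] = perm[47], perm[2]
perm[5], perm[40] = perm[40], perm[5]
y_obs = [y_true[perm[i]] for i in range(n)]

beta_ols = ols_slope(x, y_obs)
beta_rob = huber_slope(x, y_obs)

# Flag likely mismatches: residuals far beyond the robust scale.
resid = [yi - beta_rob * xi for xi, yi in zip(x, y_obs)]
scale = mad_scale(resid)
flagged = [i for i in range(n) if abs(resid[i]) > 5.0 * scale]

# Re-match within the flagged set: with one predictor, pairing sorted
# fitted values with sorted observed responses minimizes squared error.
by_fit = sorted(flagged, key=lambda i: beta_rob * x[i])
by_obs = sorted(flagged, key=lambda i: y_obs[i])
match = dict(zip(by_fit, by_obs))  # predictor index -> index of its response

print("OLS slope:", beta_ols)
print("Huber slope:", beta_rob)
print("flagged:", flagged)
print("re-matching:", match)
```

In this toy setting the least squares slope is pulled away from the true value by the four mismatched responses, while the robust fit stays close to it, so the large residuals single out the affected pairs. With several predictors, the re-matching step would instead require solving an assignment problem.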

Article information

Electron. J. Statist., Volume 13, Number 1 (2019), 1-36.

Received: November 2017
First available in Project Euclid: 4 January 2019

Primary: 62J05: Linear regression; 62F35: Robustness and adaptive procedures; 90C10: Integer programming

Keywords: Broken sample; entity resolution; record linkage; quadratic assignment problem; robust regression

Creative Commons Attribution 4.0 International License.


Slawski, Martin; Ben-David, Emanuel. Linear regression with sparsely permuted data. Electron. J. Statist. 13 (2019), no. 1, 1–36. doi:10.1214/18-EJS1498. https://projecteuclid.org/euclid.ejs/1546570940

