Open Access
Linear regression with sparsely permuted data
Martin Slawski, Emanuel Ben-David
Electron. J. Statist. 13(1): 1-36 (2019). DOI: 10.1214/18-EJS1498
Abstract

In regression analysis of multivariate data, it is tacitly assumed that response and predictor variables in each observed response-predictor pair correspond to the same entity or unit. In this paper, we consider the situation of “permuted data” in which this basic correspondence has been lost. Several recent papers have considered this situation without further assumptions on the underlying permutation. In applications, the latter is often known to have additional structure that can be leveraged. Specifically, we herein consider the common scenario of “sparsely permuted data” in which only a small fraction of the data is affected by a mismatch between response and predictors. However, an adverse effect already observed for sparsely permuted data is that the least squares estimator, as well as other estimators not accounting for such partial mismatch, is inconsistent. One approach studied in detail herein is to treat permuted data as outliers, which motivates the use of robust regression formulations to estimate the regression parameter. The resulting estimate can subsequently be used to recover the permutation. A notable benefit of the proposed approach is its computational simplicity, given the general lack of procedures for the above problem that are both statistically sound and computationally appealing.
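The idea of treating mismatched pairs as outliers can be illustrated with a small simulation. The following is a minimal sketch, not the authors' procedure: it assumes Gaussian predictors, a cyclic shift applied to a random 20% of the responses, and a Huber-type M-estimator fitted by iteratively reweighted least squares with the conventional tuning constant 1.345 and a MAD-based scale. All function names, parameter values, and the residual threshold used to flag suspected mismatches are illustrative assumptions.

```python
import numpy as np


def simulate_sparsely_permuted(n=500, d=3, frac=0.2, sigma=0.1, seed=0):
    """Linear model whose responses are shuffled on a random subset (assumed setup)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    beta = np.arange(1, d + 1, dtype=float)
    y = X @ beta + sigma * rng.standard_normal(n)
    idx = rng.choice(n, size=int(frac * n), replace=False)
    y_perm = y.copy()
    # Cyclic shift within the selected subset: every selected response moves.
    y_perm[idx] = y[np.roll(idx, 1)]
    return X, y_perm, beta, idx


def mad_scale(r):
    """Robust scale estimate: median absolute deviation, normal-consistent."""
    return 1.4826 * np.median(np.abs(r - np.median(r)))


def huber_irls(X, y, k=1.345, n_iter=100, tol=1e-8):
    """Huber-type regression via iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from least squares
    for _ in range(n_iter):
        r = y - X @ beta
        c = k * max(mad_scale(r), 1e-12)
        # Huber weights: 1 for small residuals, c/|r| for large ones.
        w = np.minimum(1.0, c / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.linalg.norm(beta_new - beta) < tol
        beta = beta_new
        if converged:
            break
    return beta


X, y_perm, beta_true, idx = simulate_sparsely_permuted()
beta_ls = np.linalg.lstsq(X, y_perm, rcond=None)[0]
beta_huber = huber_irls(X, y_perm)

err_ls = np.linalg.norm(beta_ls - beta_true)
err_huber = np.linalg.norm(beta_huber - beta_true)
print("least squares error:", err_ls)
print("Huber error:        ", err_huber)

# Flag suspected mismatches by large residuals (threshold 3 is an assumption).
resid = y_perm - X @ beta_huber
flagged = np.flatnonzero(np.abs(resid) > 3.0 * mad_scale(resid))
print("flagged:", flagged.size, "truly permuted:", idx.size)
```

In this setup the mismatched responses are independent of their predictors, so least squares is shrunk toward zero, while the reweighting step caps the influence of large residuals. Observations flagged in the last step are candidates for the mismatched subset; a mismatched pair whose residual happens to be small will not be flagged.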

Martin Slawski and Emanuel Ben-David "Linear regression with sparsely permuted data," Electronic Journal of Statistics 13(1), 1-36, (2019). https://doi.org/10.1214/18-EJS1498
Received: 1 November 2017; Published: 2019