## Electronic Journal of Statistics

### Efficient block boundaries estimation in block-wise constant matrices: An application to HiC data

#### Abstract

In this paper, we propose a novel model and a new methodology for estimating the location of block boundaries in a random matrix consisting of a block-wise constant matrix corrupted with white noise. Our method consists in recasting this problem as a variable selection problem, which we address with a penalized least-squares criterion involving an $\ell_{1}$-type penalty. Firstly, we provide theoretical results ensuring the consistency of our block boundary estimators. Secondly, we explain how to implement our approach in a very efficient way. This implementation is available in the R package blockseg, which can be found on the Comprehensive R Archive Network. Thirdly, we present numerical experiments illustrating the statistical and numerical performance of our package, together with a thorough comparison with existing methods. Fourthly, we propose an empirical procedure for estimating the number of blocks. Finally, we apply our approach to HiC data, which are used in molecular biology to better understand the influence of chromosomal conformation on cell functioning.
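The variable-selection reformulation described in the abstract can be illustrated with a toy one-dimensional sketch (in Python with scikit-learn, not the authors' R package blockseg, which handles the full two-dimensional problem). For a block-wise constant matrix, the row means are piecewise constant; with a lower-triangular design matrix of ones, the nonzero coefficients of an $\ell_1$-penalized (Lasso) fit mark candidate block boundaries. The matrix size, block levels, noise level, and penalty `alpha` below are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 60
bounds = [0, 20, 45, 60]  # hypothetical true block boundaries for this toy example

# Build a block-wise constant mean matrix and corrupt it with white noise.
levels = np.array([[1.0, -1.0, 0.0],
                   [-1.0, 2.0, 1.0],
                   [0.0, 1.0, -2.0]])
mu = np.zeros((n, n))
for i in range(3):
    for j in range(3):
        mu[bounds[i]:bounds[i + 1], bounds[j]:bounds[j + 1]] = levels[i, j]
Y = mu + rng.normal(0.0, 0.5, (n, n))

# Recast boundary detection as sparse regression: the row means of Y are
# piecewise constant, and with X lower-triangular (cumulative-step design),
# each coefficient beta_k is the jump at row k. Nonzero Lasso coefficients
# therefore point at change points, i.e. block boundaries.
y = Y.mean(axis=1)
X = np.tril(np.ones((n, n)))
fit = Lasso(alpha=0.05, fit_intercept=False).fit(X, y)
est = np.flatnonzero(np.abs(fit.coef_[1:]) > 1e-6) + 1  # skip the baseline level
print(est)  # indices of the detected row boundaries
```

In this toy setting the detected indices recover the boundaries at rows 20 and 45; the penalty level `alpha` plays the role of the tuning parameter that governs how many boundaries are selected.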

#### Article information

**Source:** Electron. J. Statist., Volume 11, Number 1 (2017), 1570–1599.

**Dates:** First available in Project Euclid: 22 April 2017

**Permanent link:** https://projecteuclid.org/euclid.ejs/1492826487

**Digital Object Identifier:** doi:10.1214/17-EJS1270

**Mathematical Reviews number (MathSciNet):** MR3638971

**Zentralblatt MATH identifier:** 1362.62012

#### Citation

Brault, Vincent; Chiquet, Julien; Lévy-Leduc, Céline. Efficient block boundaries estimation in block-wise constant matrices: An application to HiC data. Electron. J. Statist. 11 (2017), no. 1, 1570–1599. doi:10.1214/17-EJS1270. https://projecteuclid.org/euclid.ejs/1492826487
