August 2023 Statistical Embedding: Beyond Principal Components
Dag Tjøstheim, Martin Jullum, Anders Løland
Statist. Sci. 38(3): 411-439 (August 2023). DOI: 10.1214/22-STS881

Abstract

There has been intense recent activity in the embedding of very high-dimensional and nonlinear data structures, much of it in the data science and machine learning literature. We survey this activity in four parts. In the first part, we cover nonlinear methods such as principal curves, multidimensional scaling, local linear methods, ISOMAP, graph-based methods and diffusion mapping, kernel-based methods and random projections. The second part is concerned with topological embedding methods, in particular mapping topological properties into persistence diagrams and the Mapper algorithm. Another type of data set with tremendous growth is very high-dimensional network data. The task considered in part three is how to embed such data in a vector space of moderate dimension to make the data amenable to traditional techniques such as clustering and classification. Arguably, this is the part where the contrast between algorithmic machine learning methods and statistical modeling, represented by the so-called stochastic block model, is at its greatest. In the paper, we discuss the pros and cons of the two approaches. The final part of the survey deals with embedding in $\mathbb{R}^2$, that is, visualization. Three methods are presented: t-SNE, UMAP and LargeVis, based on methods in parts one, two and three, respectively. The methods are illustrated and compared on two simulated data sets: one consisting of a triplet of noisy Ranunculoid curves, and one consisting of networks of increasing complexity generated with stochastic block models and with two types of nodes.

Funding Statement

This work was supported by the Norwegian Research Council Grant 237718 (BigInsight).

Acknowledgments

The authors would like to thank two anonymous referees, an Associate Editor and in particular the editor for their constructive and very helpful comments that improved the quality of this paper.

Citation

Dag Tjøstheim, Martin Jullum, Anders Løland. "Statistical Embedding: Beyond Principal Components." Statist. Sci. 38 (3) 411-439, August 2023. https://doi.org/10.1214/22-STS881

Information

Published: August 2023
First available in Project Euclid: 20 August 2023

MathSciNet: MR4630376
Digital Object Identifier: 10.1214/22-STS881

Keywords: diffusion mapping, graph spectral theory, ISOMAP, LargeVis, local linear method, multidimensional scaling, neighborhood sampling strategies, network embedding, nonlinear principal component, persistence diagram, persistent homology, principal component, random projection, reproducing kernel Hilbert space, Skip-Gram, spectral embedding, statistical embedding, stochastic block modeling, the Mapper, topological data analysis and embedding, t-SNE, UMAP, visualization

Rights: Copyright © 2023 Institute of Mathematical Statistics
