Large deviation principles induced by the Stiefel manifold, and random multi-dimensional projections

Given an $n$-dimensional random vector $X^{(n)}$, for $k<n$, consider its $k$-dimensional projection $\mathbf{a}_{n,k}^\intercal X^{(n)}$, where $\mathbf{a}_{n,k}$ is an $n \times k$ matrix belonging to the Stiefel manifold $\mathbb{V}_{n,k}$ of orthonormal $k$-frames in $\mathbb{R}^n$. For a class of sequences $\{X^{(n)}\}$ that includes the uniform distributions on scaled $\ell_p^n$ balls, $p \in (1,\infty]$, and product measures with sufficiently light tails, it is shown that the sequence of projected vectors $\{\mathbf{a}_{n,k}^\intercal X^{(n)}\}$ satisfies a large deviation principle whenever the empirical measures of the rows of $\sqrt{n} \mathbf{a}_{n,k}$ converge, as $n \rightarrow \infty$, to a probability measure on $\mathbb{R}^k$. In particular, when $\mathbf{A}_{n,k}$ is a random matrix drawn from the Haar measure on $\mathbb{V}_{n,k}$, this is shown to imply a large deviation principle for the sequence of random projections $\{\mathbf{A}_{n,k}^\intercal X^{(n)}\}$ in the quenched sense (that is, conditioned on almost sure realizations of $\{\mathbf{A}_{n,k}\}$). Moreover, a variational formula is obtained for the rate function of the large deviation principle for the annealed projections $\{\mathbf{A}_{n,k}^\intercal X^{(n)}\}$, which is expressed in terms of a family of quenched rate functions and a modified entropy term. A key step in this analysis is a large deviation principle for the sequence of empirical measures of rows of $\sqrt{n} \mathbf{A}_{n,k}$, which may be of independent interest. The study of multi-dimensional random projections of high-dimensional measures is of interest in asymptotic functional analysis, convex geometry and statistics. Prior results on quenched large deviations for random projections of $\ell_p^n$ balls have been essentially restricted to the one-dimensional setting.


Background
The study of high-dimensional measures and their lower-dimensional projections is a central theme in high-dimensional probability, asymptotic functional analysis and convex geometry, where in the latter case the measures of interest are distributions on convex bodies, which are compact, convex sets with non-empty interior (see, e.g., [Kla07,Mec12]). Multidimensional projections of high-dimensional random vectors are also relevant in statistics, data analysis and computer science [DF84,JL84]. Recent work has shown that large deviation principles (LDPs) that capture the tail behavior of lower-dimensional random projections can provide more information about the original high-dimensional measures than central limit theorem type results that capture universal phenomena of fluctuations. For example, in the case of $\ell_p^n$ balls, $p \in [1, \infty)$, which are fundamental objects in convex geometry, this was first illustrated by LDPs for one-dimensional projections obtained in [GKR17,Kim17], and subsequently by LDPs for norms of samples from $\ell_p^n$ balls and their multi-dimensional projections in [AGPT18,KPT19a,KLR20], as well as corresponding refined large deviation estimates obtained in [LR20,Kau21]. LDPs of random projections of high-dimensional measures are broadly of two types, the terminology arising from statistical physics: so-called "quenched" LDPs, where one conditions on the choice of the (sequence of) sub-spaces, bases or directions onto which one projects; or "annealed" LDPs, which average over the randomness arising in the choice of the projection. While most of the work described above on $\ell_p^n$ balls focused on one-dimensional LDPs (either for one-dimensional projections or norms of higher-dimensional projections), in [KLR20], annealed LDPs were also established for multi-dimensional projections of high-dimensional measures that satisfy a general condition called the asymptotic thin shell condition.
This condition was shown to be satisfied in [KLR20] by several classes of measures, including product measures with polynomial tail decay, $\ell_p^n$ balls, $p \in [1, \infty]$, and classes of Orlicz balls and Gibbs measures.
In this article, we establish quenched LDPs for multidimensional random projections of a class of high-dimensional measures as the dimension $n$ goes to infinity. Quenched LDPs can often provide more geometric information than annealed LDPs, but their analysis is typically more difficult because one can no longer exploit symmetry properties of the random projection measure. To state our results more precisely, for $k \in \mathbb{N}$, let $I_k$ denote the $k \times k$ identity matrix, and for $n > k$, let
$$\mathbb{V}_{n,k} := \{A \in \mathbb{R}^{n \times k} : A^\intercal A = I_k\} \tag{1.1}$$
denote the Stiefel manifold of $k$-frames in $\mathbb{R}^n$. Observe that the set $\mathbb{V}_{n,n}$ can be identified with the set $O(n)$ of $n \times n$ orthogonal matrices. Also, note that for $k, n \in \mathbb{N}$, $k < n$, any $\mathbf{a}_{n,k} \in \mathbb{V}_{n,k}$ defines a linear projection from $n$ to $k$ dimensions. Fixing a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, we consider sequences of random vectors $\{X^{(n)}\}_{n\in\mathbb{N}}$ defined on this space that satisfy a certain set of conditions (see Assumption 2.1), which include, for example, $X^{(n)}$ uniformly distributed on an $\ell_p^n$ ball of radius $n^{1/p}$, $p \geq 2$, or $X^{(n)}$ distributed according to a product measure with sufficiently light tails. For any fixed $k \in \mathbb{N}$, let $\mathbb{N}_k := \{n \in \mathbb{N} : n > k\}$, and consider the sequence of $k$-dimensional projections
$$n^{-1/2}\,\mathbf{a}_{n,k}^\intercal X^{(n)}, \quad n \in \mathbb{N}_k, \tag{1.2}$$
where $\mathbf{a} := \{\mathbf{a}_{n,k}\}_{n\in\mathbb{N}_k}$, with $\mathbf{a}_{n,k} \in \mathbb{V}_{n,k}$ for each $n \in \mathbb{N}_k$. Also, let
$$L^{\mathbf{a}}_{n,k} := \frac{1}{n}\sum_{i=1}^n \delta_{\sqrt{n}\,\mathbf{a}_{n,k}(i,\cdot)}, \quad n \in \mathbb{N}_k, \tag{1.3}$$
be the associated sequence of empirical measures of the rows of $\sqrt{n}\,\mathbf{a}_{n,k}$. Our first result, Theorem 2.4, shows that whenever $\{L^{\mathbf{a}}_{n,k}\}_{n\in\mathbb{N}_k}$ converges in the $q_\star$-Wasserstein topology (see Definition 1.2) to a measure $\nu$, then the sequence of random projections $\{n^{-1/2}\mathbf{a}_{n,k}^\intercal X^{(n)}\}_{n\in\mathbb{N}_k}$ satisfies an LDP with a rate function that we denote by $J^{\mathrm{qu}}_\nu$.
In particular, this implies a quenched LDP for the sequence $\{\mathbf{A}_{n,k}^\intercal X^{(n)}\}_{n\in\mathbb{N}_k}$, where the random matrix $\mathbf{A}_{n,k} = [\mathbf{A}_{n,k}(i,j)]_{i=1,\ldots,n;\, j=1,\ldots,k}$ is sampled, independently of $\{X^{(n)}\}_{n\in\mathbb{N}}$, from $\sigma_{n,k}$, the Haar measure on $\mathbb{V}_{n,k}$ (i.e., the unique probability measure on $\mathbb{V}_{n,k}$ that is invariant under the group $O(n)$ of orthogonal transformations). In [KLR20], it was shown that $\{\mathbf{A}_{n,k}^\intercal X^{(n)}\}_{n\in\mathbb{N}_k}$ also satisfies an annealed LDP. Our second result, Theorem 2.7, establishes a variational formula for the (annealed) rate function $J^{\mathrm{an}}$, in terms of the quenched rate functions $J^{\mathrm{qu}}_\nu$. Along the way, for any $q \in (0, 2)$, in Theorem 2.8, we also establish an LDP for the random empirical measure sequence $\{L^{\mathbf{A}}_{n,k}\}_{n\in\mathbb{N}_k}$ in the $q$-Wasserstein topology, which may be of independent interest. In the next section, we introduce some basic notation and terminology that will be used throughout, and then provide precise statements of our main results in Section 2, with proofs presented in Sections 3-6.

Notation and Terminology
We first recall the definition of an LDP, referring to [DZ09] for further background on large deviations theory.
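Definition 1.1. Let $\mathcal{X}$ be a topological space with Borel sigma-algebra $\mathcal{B}$. A sequence of probability measures $\{P_n\}_{n\in\mathbb{N}} \subset \mathcal{P}(\mathcal{X})$ is said to satisfy a large deviation principle (LDP) in $\mathcal{X}$ with rate function $I : \mathcal{X} \to [0, \infty]$ if $I$ is lower semicontinuous and, for all $B \in \mathcal{B}$,
$$-\inf_{x \in B^\circ} I(x) \le \liminf_{n\to\infty} \frac{1}{n}\log P_n(B) \le \limsup_{n\to\infty} \frac{1}{n}\log P_n(B) \le -\inf_{x \in \bar{B}} I(x),$$
where $B^\circ$ and $\bar{B}$ denote the interior and closure of $B$, respectively. We say $I$ is a good rate function (GRF) if it has compact level sets. Analogously, a sequence of $\mathcal{X}$-valued random variables $\{x_n\}_{n\in\mathbb{N}}$ is said to satisfy an LDP with GRF $I$ if the sequence of their laws $\{\mathbb{P} \circ x_n^{-1}\}_{n\in\mathbb{N}}$ does.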
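As a hedged numerical illustration of Definition 1.1 (not part of the original argument), the sketch below takes the simplest case covered by Cramér's theorem: the sample mean of $n$ i.i.d. standard Gaussians, whose rate function is $I(x) = x^2/2$. The function name `gaussian_mean_tail_rate` and the choice of the closed set $B = [1, \infty)$ are assumptions made only for this example.

```python
import math

def gaussian_mean_tail_rate(x: float, n: int) -> float:
    """Return -(1/n) * log P(mean of n i.i.d. N(0,1) variables >= x)."""
    # The sample mean is N(0, 1/n), so the tail equals 0.5 * erfc(x * sqrt(n / 2)).
    tail = 0.5 * math.erfc(x * math.sqrt(n / 2.0))
    return -math.log(tail) / n

# For B = [1, infinity), the infimum of I(x) = x^2 / 2 over B is 0.5.
rates = [gaussian_mean_tail_rate(1.0, n) for n in (10, 100, 1000)]
```

The normalized log-probabilities in `rates` approach the infimum of the rate function over $B$ from above as $n$ grows, which is the behavior the LDP bounds describe.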
We recall some definitions that will prove useful in our discussion of rate functions. Given a function $f : \mathbb{R}^m \to [-\infty, \infty]$ for some $m \in \mathbb{N}$, we let $f^*$ denote its Legendre transform:
$$f^*(t) := \sup_{s \in \mathbb{R}^m} \{\langle s, t\rangle - f(s)\}, \quad t \in \mathbb{R}^m.$$
Since we will frequently invoke the contraction principle, Cramér's theorem and Sanov's theorem, we refer the reader to Theorem 4.2.1, Corollary 6.1.6, and Section 6.2, respectively, of [DZ09], for precise statements.
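The Legendre transform can be approximated numerically by taking the supremum over a finite grid. The following is a minimal sketch for $m = 1$, assuming the test function $f(s) = s^2/2$, for which $f^*(t) = t^2/2$; the helper name `legendre_transform` and the grid are choices made for illustration only.

```python
def legendre_transform(f, t, grid):
    """Approximate f*(t) = sup_s { s * t - f(s) } by a maximum over a finite grid."""
    return max(s * t - f(s) for s in grid)

def quadratic(s):
    # f(s) = s^2 / 2 is its own Legendre transform.
    return 0.5 * s * s

# Grid on [-5, 5] with spacing 0.001; it contains each maximizer s = t used below.
grid = [i / 1000.0 for i in range(-5000, 5001)]
values = {t: legendre_transform(quadratic, t, grid) for t in (-2.0, 0.0, 1.5)}
```

Because the supremum is restricted to a bounded grid, the approximation is only reliable for arguments $t$ whose maximizer lies inside the grid.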
Next, for $p \in [1, \infty]$, let $\|\cdot\|_p$ denote the $\ell_p^k$ norm on $\mathbb{R}^k$. When $p = 2$, and where the context makes it clear, we will omit the subscript and simply write $\|\cdot\|$ for the Euclidean norm. Let $\mathcal{P}(\mathbb{R}^k)$ denote the space of probability measures on $\mathbb{R}^k$, by default equipped with the topology of weak convergence. We will sometimes consider the following restricted subsets of probability measures: for $q > 0$, let
$$\mathcal{P}_q(\mathbb{R}^k) := \Big\{\nu \in \mathcal{P}(\mathbb{R}^k) : \int_{\mathbb{R}^k} \|x\|^q\, \nu(dx) < \infty\Big\}.$$
Definition 1.2. For $q > 0$, we say a sequence of probability measures $\{\nu_n\}_{n\in\mathbb{N}} \subset \mathcal{P}_q(\mathbb{R}^k)$ converges to a limit $\nu$ with respect to the $q$-Wasserstein topology if we have both weak convergence, denoted $\nu_n \Rightarrow \nu$, as well as convergence of $q$-th moments, $\int_{\mathbb{R}^k} \|x\|^q\, \nu_n(dx) \to \int_{\mathbb{R}^k} \|x\|^q\, \nu(dx)$. As noted in Section 6 of [Vil08], the $q$-Wasserstein topology is metrizable through the $q$-Wasserstein metric, which we denote by $W_q$.
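For intuition about the metric $W_q$, the sketch below computes it in the simplest setting only: two empirical measures on $\mathbb{R}$ with the same number of atoms and $q \ge 1$, where the optimal coupling pairs the sorted samples. This restriction is an assumption of the example (Definition 1.2 allows any $q > 0$ and dimension $k$), and `wasserstein_1d` is a hypothetical helper name.

```python
def wasserstein_1d(xs, ys, q):
    """q-Wasserstein distance between two equal-size empirical measures on R, q >= 1."""
    if len(xs) != len(ys) or q < 1:
        raise ValueError("sketch covers equal sample sizes and q >= 1 only")
    # In one dimension with a convex cost, the monotone (sorted) coupling is optimal.
    pairs = zip(sorted(xs), sorted(ys))
    return (sum(abs(x - y) ** q for x, y in pairs) / len(xs)) ** (1.0 / q)

shifted = wasserstein_1d([0.0, 1.0], [1.0, 2.0], 2.0)    # every atom shifted by 1
permuted = wasserstein_1d([0.0, 2.0], [2.0, 0.0], 1.0)   # same measure, atoms reordered
```

The distance depends only on the two measures, not on the order in which their atoms are listed, which is why `permuted` vanishes.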

Main results
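We now provide a precise statement of our results. We start by defining the class of random vectors that we consider. As in [DZ09, Definition 2.3.5], we define the domain of an (extended real-valued) function $f$ defined on a Euclidean space $S$, denoted $D_f$, as the subset of points in $S$ for which $f$ is finite; furthermore, the function $f$ is said to be essentially smooth if: (a) the interior $D_f^\circ$ of $D_f$ is non-empty; (b) $f$ is differentiable throughout $D_f^\circ$; and (c) $f$ is steep, namely, $\lim_{n\to\infty} |\nabla f(s_n)| = \infty$ whenever $\{s_n\}_{n\in\mathbb{N}}$ is a sequence in $D_f^\circ$ converging to a boundary point of $D_f^\circ$.
Assumption 2.1 (quenched). We impose the following conditions on the sequence of random vectors $\{X^{(n)}\}_{n\in\mathbb{N}}$.
(i) Representation: there exist a sequence of i.i.d. real-valued random variables $\{\xi_j\}_{j\in\mathbb{N}}$, a Borel measurable function $r : \mathbb{R} \to \mathbb{R}_+$, and a continuous function $\rho : \mathbb{R}_+ \to \mathbb{R}_+$ such that
$$X^{(n)} \overset{(d)}{=} \xi^{(n)}\, \rho\Big(\frac{1}{n}\sum_{j=1}^n r(\xi_j)\Big),$$
where $\xi^{(n)} := (\xi_1, \ldots, \xi_n)$. Let $\Lambda$ denote the log moment-generating function (mgf) of $(\xi_1, r(\xi_1))$:
$$\Lambda(t_1, t_2) := \log \mathbb{E}\big[\exp\big(t_1 \xi_1 + t_2\, r(\xi_1)\big)\big], \quad t_1, t_2 \in \mathbb{R}. \tag{2.1}$$
(ii) Log mgf: there exists $q_\star > 0$ such that for every $(s_1, s_2) \in D_\Lambda$, there exists a finite constant $C_{s_2}$ (depending only on $s_2$ and not $s_1$) such that a growth bound on $\Lambda(s_1, s_2)$ in $|s_1|$, governed by $C_{s_2}$ and the exponent $q_\star$, holds.
(iii) Integrated log mgf: for any $\nu \in \mathcal{P}(\mathbb{R}^k)$, the function $\Psi_\nu : \mathbb{R}^{k+1} \to \mathbb{R}$ obtained as an integrated form of the log mgf,
$$\Psi_\nu(t_1, t_2) := \int_{\mathbb{R}^k} \Lambda(\langle t_1, x\rangle, t_2)\, \nu(dx), \quad t_1 \in \mathbb{R}^k,\ t_2 \in \mathbb{R}, \tag{2.3}$$
contains 0 in the interior of its domain, is lower semicontinuous, and is essentially smooth.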
Remark 2.3. A wide class of product measures satisfy Assumption 2.1 with $\rho \equiv r \equiv 1$; namely, those that have sufficiently light tails, in the sense of parts (iv) and (v). Examples of sequences of non-product measures satisfying Assumption 2.1 are $\ell_p^n$ spheres. More precisely, fix $p \in [1, \infty)$, and for $n \in \mathbb{N}$, let $\mathbb{D}_{n,p} := \{x \in \mathbb{R}^n : \sum_{i=1}^n |x_i|^p \le n\}$ be the scaled $\ell_p^n$ ball in $\mathbb{R}^n$, let $\mathbb{S}_{n,p} := \partial \mathbb{D}_{n,p}$ be the scaled $\ell_p^n$ sphere in $\mathbb{R}^n$, let $\eta_{n,p}$ be the cone measure on $\mathbb{S}_{n,p}$: for Borel subsets $S \subset \mathbb{S}_{n,p}$, with $\mathrm{vol}_n$ denoting Lebesgue measure on $\mathbb{R}^n$,
$$\eta_{n,p}(S) := \frac{\mathrm{vol}_n(\{cx : x \in S,\ c \in [0,1]\})}{\mathrm{vol}_n(\mathbb{D}_{n,p})},$$
and let $X^{(n)} = X^{(n,p)}$ be distributed according to $\eta_{n,p}$. Then:
(i) for $p \in [1, \infty)$, this condition follows from results in [SZ90,RR91], with $\{\xi_j\}_{j\in\mathbb{N}}$ being the i.i.d. sequence with common law equal to the generalized $p$-normal distribution (namely, the probability measure on $\mathbb{R}$ with density proportional to $e^{-|y|^p/p}$), $r(x) = |x|^p$, and $\rho(y) = y^{-1/p}$;
(ii) for $p \in (1, \infty)$, the growth conditions on the log mgf $\Lambda$ are established in [GKR17, Lemma 5.7]; further, $\Lambda$ is symmetric in its first argument due to the symmetry of the generalized $p$-normal distribution;
(iii) for $p \in (1, \infty)$, the conditions on the integrated log mgf are established in [GKR17, Lemma 5.9];
(iv) for $p \in [2, \infty)$, the log mgf condition is easily verified;
(v) for $p \in (2, \infty)$, the precise tail bound exponent is established in [GKR17, Lemma 5.5].
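The representation in part (i) of the remark can be turned into a sampler. The sketch below is an illustration under stated assumptions rather than a prescribed method: it draws $\xi_j$ from the generalized $p$-normal law using the fact that $|\xi_j|^p/p$ is Gamma$(1/p, 1)$ distributed (a change of variables in the density $e^{-|y|^p/p}$), sets $X = \xi\, \rho(\frac{1}{n}\sum_j r(\xi_j))$ with $r(x) = |x|^p$ and $\rho(y) = y^{-1/p}$, and then multiplies by $U^{1/n}$ to obtain a point of the ball, as recalled in Remark 2.5 with reference to [SZ90]. All function names are hypothetical.

```python
import random

def sample_generalized_normal(p, rng):
    """Draw from the density proportional to exp(-|y|^p / p) on R."""
    # If G ~ Gamma(1/p, 1), then (p * G)^(1/p) has the law of |xi|.
    g = rng.gammavariate(1.0 / p, 1.0)
    magnitude = (p * g) ** (1.0 / p)
    return magnitude if rng.random() < 0.5 else -magnitude

def sample_lp_sphere(n, p, rng):
    """Sample from the cone measure on the scaled l_p^n sphere {sum |x_i|^p = n}."""
    xi = [sample_generalized_normal(p, rng) for _ in range(n)]
    mean_r = sum(abs(v) ** p for v in xi) / n   # (1/n) * sum_j r(xi_j), r(x) = |x|^p
    scale = mean_r ** (-1.0 / p)                # rho(y) = y^(-1/p)
    return [scale * v for v in xi]

def sample_lp_ball(n, p, rng):
    """Sample from the scaled l_p^n ball via the radial factor U^(1/n)."""
    radial = rng.random() ** (1.0 / n)
    return [radial * v for v in sample_lp_sphere(n, p, rng)]

rng = random.Random(0)
sphere_point = sample_lp_sphere(50, 3.0, rng)
ball_point = sample_lp_ball(50, 3.0, rng)
sphere_norm_p = sum(abs(v) ** 3.0 for v in sphere_point)
ball_norm_p = sum(abs(v) ** 3.0 for v in ball_point)
```

By construction, `sphere_norm_p` equals $n$ up to floating-point error, while `ball_norm_p` is at most $n$.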
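We now state our first result, whose proof is deferred to Section 5. Recall the $q$-Wasserstein metric $W_q$ specified in Definition 1.2. For any $\nu \in \mathcal{P}(\mathbb{R}^k)$, we let $\Psi^*_\nu$ denote the Legendre transform of $\Psi_\nu$,
$$\Psi^*_\nu(\tau_1, \tau_2) := \sup_{t_1 \in \mathbb{R}^k,\, t_2 \in \mathbb{R}} \big\{\langle t_1, \tau_1\rangle + t_2 \tau_2 - \Psi_\nu(t_1, t_2)\big\}, \quad \tau_1 \in \mathbb{R}^k,\ \tau_2 \in \mathbb{R}. \tag{2.5}$$
Also, let $\gamma$ denote the standard Gaussian distribution on $\mathbb{R}$, and $\gamma^{\otimes k}$ its $k$-fold product.
Theorem 2.4 (Quenched). Fix $k \in \mathbb{N}$, and suppose $\{X^{(n)}\}_{n\in\mathbb{N}}$ satisfies Assumption 2.1(i)-(iii) with associated $q_\star > 0$ and $\Psi_\nu$. Choose any sequence $\mathbf{a} = \{\mathbf{a}_{n,k}\}_{n\in\mathbb{N}_k}$, with $\mathbf{a}_{n,k} \in \mathbb{V}_{n,k}$, such that, for some $\nu \in \mathcal{P}(\mathbb{R}^k)$,
$$W_{q_\star}(L^{\mathbf{a}}_{n,k}, \nu) \to 0 \quad \text{as } n \to \infty, \tag{2.6}$$
where $L^{\mathbf{a}}_{n,k} \in \mathcal{P}(\mathbb{R}^k)$ is the empirical measure defined in (1.3). Then, the following claims hold:
(i) The sequence $\{n^{-1/2}\mathbf{a}_{n,k}^\intercal X^{(n)}\}_{n\in\mathbb{N}_k}$ satisfies an LDP in $\mathbb{R}^k$ with GRF
$$J^{\mathrm{qu}}_\nu(x) := \inf_{\tau \in \mathbb{R}_+} \Psi^*_\nu\Big(\frac{x}{\rho(\tau)}, \tau\Big), \quad x \in \mathbb{R}^k. \tag{2.7}$$
(ii) If $\sigma$ is any probability measure on $\mathbb{S} := \bigotimes_{n>k} \mathbb{V}_{n,k}$ whose $n$-th marginal coincides with the Haar measure $\sigma_{n,k}$, then for $\sigma$-a.e. $\mathbf{a} = \{\mathbf{a}_{n,k}\}_{n\in\mathbb{N}_k} \in \mathbb{S}$, the sequence $\{n^{-1/2}\mathbf{a}_{n,k}^\intercal X^{(n)}\}_{n\in\mathbb{N}_k}$ satisfies an LDP in $\mathbb{R}^k$ with GRF $J^{\mathrm{qu}}_{\gamma^{\otimes k}}$.
(iii) Let $U$ be a uniformly distributed random variable on $[0, 1]$, independent of $\{X^{(n)}\}_{n\in\mathbb{N}}$. If the log mgf $\Lambda$ of (2.1) is symmetric in its first argument, then the claims (i) and (ii) also hold for the sequence $\{n^{-1/2}\mathbf{a}_{n,k}^\intercal X^{(n)} U^{1/n}\}_{n\in\mathbb{N}_k}$.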
Remark 2.5. Claim (iii) of Theorem 2.4 is motivated by the observation that if $X^{(n,p)}$ is distributed according to the cone measure on the scaled $\ell_p^n$ sphere $\mathbb{S}_{n,p}$, then the random variable $U^{1/n} X^{(n,p)}$ is uniformly distributed on the scaled $\ell_p^n$ ball $\mathbb{D}_{n,p}$ [SZ90]. Since, as noted in Remark 2.3, $X^{(n,p)}$ satisfies Assumption 2.1 across a wide range of $p$ (with symmetric $\Lambda$), Theorem 2.4(iii) allows an extension of the LDP results in (i) and (ii) of Theorem 2.4 from $\ell_p^n$ spheres to $\ell_p^n$ balls, which are of greater interest in convex geometry. Note that the rate function $J^{\mathrm{qu}}_\nu$ depends only on the limit $\nu$ in (2.6), and is insensitive to further specifics of the projection matrix sequence $\mathbf{a}$. For one-dimensional projections ($k = 1$), Theorem 2.4 recovers both Theorem 2 of [GKR16], which addresses the case where $X^{(n)}$ has a product distribution, and Theorem 2.5 and Proposition 5.3 of [GKR17], which consider the case when $X^{(n)}$ is uniformly distributed on $\mathbb{D}_{n,p}$ or according to the cone measure $\eta_{n,p}$ (as defined in Remark 2.3). One setting of multidimensional projections ($k > 1$) considered prior to the above result is the LDP for the projection of $X^{(n)}$ onto the first $k$ canonical directions, where $\mathbf{a}_{n,k}$ is the matrix of 1s on the diagonal and 0s elsewhere, which does not satisfy (2.6). More recent work [KPT19b] establishes interesting asymptotics (law of large numbers and LDPs) for the shape of multidimensional projections of the uniform distribution on a cube or discrete cube. This paper differs by establishing almost everywhere quenched LDP results, first reported in the PhD thesis [Kim17], for multidimensional projections beyond the particular cases of the canonical projection and product measures.
Our results provide a potential starting point for obtaining asymptotic results for shapes and intrinsic volumes of projections of non-product measures such as $\ell_p^n$ balls, as well as for ongoing work on sharp quenched large deviation estimates for multi-dimensional projections and their norms, which are relevant for understanding volumetric properties of convex bodies and their intersections.
Our second main result concerns a variational representation of the annealed rate function for the sequence of random multi-dimensional projections {A ⊺ n,k X (n) } n∈N k . We start by stating an annealed LDP counterpart to Theorem 2.4, specialized to the setting considered in this article.
Proof. It follows from Theorem 2.7 of [KLR20] that $\{n^{-1/2}\mathbf{A}_{n,k}^\intercal X^{(n)}\}_{n\in\mathbb{N}_k}$ satisfies an LDP in $\mathbb{R}^k$ with GRF $J^{\mathrm{an}}$ as defined above whenever Assumption A* therein is satisfied with speed $s_n = n$, namely, when the sequence of scaled norms $\{\|X^{(n)}\|_2/\sqrt{n}\}_{n\in\mathbb{N}}$ satisfies an LDP with GRF $J_X$. Since the domain of $\Lambda$ contains a neighborhood of the origin due to Assumption 2.1(iv), Cramér's theorem implies that the associated sequence of empirical means satisfies an LDP. Then, with $\rho$ continuous, the contraction principle shows that the sequence of scaled norms $\{\|X^{(n)}\|_2/\sqrt{n}\}_{n\in\mathbb{N}}$ satisfies an LDP with GRF $J_X$. This completes the proof.
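To state the variational representation for the rate function $J^{\mathrm{an}}$, we first introduce some notation. For $\nu, \mu \in \mathcal{P}(\mathbb{R}^k)$, define the relative entropy of $\nu$ with respect to $\mu$ as
$$H(\nu|\mu) := \int_{\mathbb{R}^k} \log\Big(\frac{d\nu}{d\mu}\Big)\, d\nu$$
if $\nu \ll \mu$, and $H(\nu|\mu) := +\infty$ otherwise. Let $\gamma$ denote the standard Gaussian measure on $\mathbb{R}$, and $\gamma^{\otimes k}$ the associated product measure on $\mathbb{R}^k$. Let $C : \mathcal{P}(\mathbb{R}^k) \to \mathbb{R}^{k\times k}$ denote the covariance map,
$$C(\nu) := \int_{\mathbb{R}^k} x^\intercal x\, \nu(dx), \tag{2.9}$$
with $x \in \mathbb{R}^k$ viewed as a row vector, and let $H_k$ denote the functional of (2.10), whose two terms are the relative entropy $H(\nu|\gamma^{\otimes k})$ and a trace term involving $\operatorname{tr}(I_k - C(\nu))$ (see Remarks 2.9 and 2.10).
Theorem 2.7. Fix $k \in \mathbb{N}$, and suppose that the sequence $\{X^{(n)}\}_{n\in\mathbb{N}}$ satisfies Assumption 2.1. Let $J^{\mathrm{qu}}_\nu$ and $J^{\mathrm{an}}$ be defined as in Theorems 2.4 and 2.6, respectively. Then, we have the following variational formula:
$$J^{\mathrm{an}}(x) = \inf_{\nu \in \mathcal{P}(\mathbb{R}^k)} \big\{J^{\mathrm{qu}}_\nu(x) + H_k(\nu)\big\}, \quad x \in \mathbb{R}^k. \tag{2.11}$$
Note that $H_k(\nu) = 0$ when $\nu = \gamma^{\otimes k}$, which implies $J^{\mathrm{an}} \le J^{\mathrm{qu}}_{\gamma^{\otimes k}}$, as would be expected from Jensen's inequality given that $J^{\mathrm{qu}}_{\gamma^{\otimes k}}$ is simply the GRF of the quenched LDP for $\{\mathbf{A}_{n,k}^\intercal X^{(n)}\}_{n\in\mathbb{N}_k}$. More generally, the optimization problem (2.11) can be interpreted as saying that at the large deviation level, the annealed probability of a rare event is the infimum, over all random "environments" (in this case "projections"), of the probability of the rare event conditioned on that environment plus the cost of the choice of the environment. While such a relation is intuitive, rigorous proofs of such informal statements are typically non-trivial. For example, such variational representations have been rigorously established only in a few specific cases, such as LDPs for random walks in random environments on $\mathbb{Z}$ in [CGZ00] and on supercritical Galton-Watson trees in [Aid10]. The one-dimensional case ($k = 1$) of Theorem 2.7 for $\ell_p^n$ balls recovers Theorem 2.7 of [GKR17]. The proof of the multi-dimensional case stated in Theorem 2.7, which is given in Section 6, is more involved and relies on an auxiliary LDP for the following sequence of random empirical measures, analogous to those defined in (1.3):
$$L_{n,k} := L^{\mathbf{A}}_{n,k} = \frac{1}{n}\sum_{i=1}^n \delta_{\sqrt{n}\,\mathbf{A}_{n,k}(i,\cdot)}, \quad n \in \mathbb{N}_k. \tag{2.12}$$
Theorem 2.8. Fix $k \in \mathbb{N}$. Then for all $q \in (0, 2)$, the sequence $\{L_{n,k}\}_{n\in\mathbb{N}_k}$ satisfies an LDP in $\mathcal{P}_q(\mathbb{R}^k)$ with the strictly convex GRF $H_k$.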
This theorem, which is established in Section 4, generalizes Theorem 6.6 of [BADG01], which states an LDP for the empirical measure of coordinates drawn uniformly from the sphere S n−1 , which corresponds to the case k = 1 in our work. In contrast to this case, the k > 1 case necessitates more extensive computations which arise due to the non-commutative matrix setting, where the Bartlett decomposition of Proposition 3.2 replaces the usual polar decomposition for a random vector from the sphere. Given that large deviation perspectives have informed the analysis of asymptotics for spherical integrals [GM05,OMH13], it is possible that a similar approach could inform asymptotics for integrals over the Stiefel manifold, which arise, for instance, as the normalizing constant of the matrix Bingham distribution [Hof09], or in the study of multispiked random covariance matrices [OMH14].
Remark 2.9. The first term in the definition (2.10) of $H_k$ is the relative entropy with respect to the $k$-dimensional standard Gaussian measure, which is (due to Sanov's theorem) the large deviation rate function for the sequence of empirical measures of the rows of an $n \times k$ matrix of i.i.d. standard Gaussian elements. Hence, Theorem 2.8 offers a way of characterizing the distinction between Haar-distributed matrices on the Stiefel manifold and Gaussian matrices. Outside of the large deviations literature, a different comparison between such Stiefel and Gaussian matrices can be found in [Tro12], which analyzes expectations of sublinear convex functions of random matrices.
Remark 2.10. The second term in the definition (2.10) of $H_k$ arises from the orthogonality and normalization constraint defining the Stiefel manifold. Note that because $\mathbb{P}(\mathbf{A}_{n,k}^\intercal \mathbf{A}_{n,k} = I_k) = 1$, we have, $\mathbb{P}$-a.s.,
$$C(L_{n,k}) = \mathbf{A}_{n,k}^\intercal \mathbf{A}_{n,k} = I_k.$$
Nonetheless, the definition of the rate function $H_k$ includes the trace term $\operatorname{tr}(I_k - C(\nu))$, and $H_k(\nu)$ is finite even for some $\nu \in \mathcal{P}(\mathbb{R}^k)$ such that $\operatorname{tr}(I_k - C(\nu)) \neq 0$, due to the fact that the statement of an LDP (Definition 1.1) involves infimization of the rate function $H_k$ not over a set like $\mathcal{V}_k := \{\nu \in \mathcal{P}(\mathbb{R}^k) : I_k = C(\nu)\}$, but rather over its interior and closure (in the space of probability measures). In particular, the example set $\mathcal{V}_k$ is neither open nor closed with respect to the weak topology. In fact, it is possible to show from [WWW10] that $\mathcal{V}_k$ is neither open nor closed with respect to any topology for which the sequence of empirical measures $\{L_{n,k}\}_{n\in\mathbb{N}_k}$ satisfies an LDP.
An immediate consequence of Theorem 2.8 is the following:
Corollary 2.11. Fix $k \in \mathbb{N}$. Then for all $q \in (0, 2)$, the sequence $\{L_{n,k}\}_{n\in\mathbb{N}_k}$ satisfies the strong law of large numbers in $\mathcal{P}_q(\mathbb{R}^k)$. That is, almost surely, as $n \to \infty$, we have $W_q(L_{n,k}, \gamma^{\otimes k}) \to 0$.
Proof. By Theorem 2.8, the rate function $H_k$ in (2.10) is strictly convex. Since $H_k(\gamma^{\otimes k}) = 0$, $H_k$ attains its unique minimum over $\mathcal{P}_q(\mathbb{R}^k)$ at $\gamma^{\otimes k}$. For $\epsilon > 0$, due to the LDP for $\{L_{n,k}\}_{n\in\mathbb{N}_k}$ and the uniqueness of the minimum of $H_k$, there exist $\delta > 0$ and $N \in \mathbb{N}_k$ such that for $n > N$, $\mathbb{P}(W_q(L_{n,k}, \gamma^{\otimes k}) > \epsilon) \le e^{-n\delta}$. Since these probabilities are summable in $n$, the Borel-Cantelli lemma yields the almost sure convergence of $L_{n,k}$.
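The convergence in Corollary 2.11 can be observed in simulation. The following is a rough sketch, assuming NumPy is available and that a Haar-distributed element of the Stiefel manifold is drawn through the QR decomposition of a Gaussian matrix (the sign correction and the helper name `sample_haar_stiefel` are choices of this example). It checks two statistics of the empirical measure of the rows of $\sqrt{n}\,\mathbf{A}_{n,k}$: the covariance, which equals $I_k$ exactly, and a fourth moment, which should be close to the Gaussian value 3 for large $n$.

```python
import numpy as np

def sample_haar_stiefel(n, k, rng):
    """Draw an n x k matrix with orthonormal columns via QR of a Gaussian matrix."""
    z = rng.standard_normal((n, k))
    q, r = np.linalg.qr(z)            # reduced QR: q is n x k, r is k x k
    signs = np.sign(np.diag(r))
    return q * signs                  # flip columns so that diag(R) is positive

rng = np.random.default_rng(0)
n, k = 20000, 2
a = sample_haar_stiefel(n, k, rng)

rows = np.sqrt(n) * a                       # atoms of the empirical measure L_{n,k}
covariance = rows.T @ rows / n              # equals A^T A = I_k up to rounding
fourth_moment = np.mean(rows[:, 0] ** 4)    # close to 3 for a standard Gaussian limit
```

Matching a handful of moments is of course weaker than convergence in $W_q$; the sketch is meant only as a sanity check.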

The Bartlett decomposition and its consequences
We recall the well known result of Bartlett on the QR decomposition of a matrix with independent standard Gaussian entries, and derive certain consequences which will be used in the proofs of the main theorem. Throughout, let U k denote the space of k × k upper triangular matrices, and recall that V n,k denotes the Stiefel manifold of k-frames in R n .
Definition 3.1. Fix k ≤ n ∈ N. Let Z n,k ∈ R n×k be an n × k matrix with i.i.d. standard Gaussian elements. Let Z n,k = Q n,k R n,k be the QR decomposition of Z n,k as the product of the semi-orthogonal matrix Q n,k ∈ V n,k ⊂ R n×k and the upper triangular matrix R n,k ∈ U k ⊂ R k×k .
Proposition 3.2 (Bartlett decomposition [Bar33]). The law of $Q_{n,k}$ is $\sigma_{n,k}$, the Haar measure on $\mathbb{V}_{n,k}$. Moreover, the diagonal entries of $R_{n,k}$ satisfy $R_{n,k}(i, i) \sim \chi_{n-i+1}$, the chi distribution with $n - i + 1$ degrees of freedom, for $i = 1, \ldots, k$.
Remark 3.3. In fact, the matrices $Q_{n,k}$ and $R_{n,k}$ of the Bartlett decomposition are independent, and moreover, the marginal laws of the off-diagonal entries of $R_{n,k}$ are also explicitly known; however, we will not need these facts for our analysis. Also, note that when $k = 1$, the Bartlett decomposition corresponds to the classical polar decomposition of the $n$-dimensional Gaussian measure.
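A quick Monte Carlo check of the diagonal law in Proposition 3.2 is sketched below, assuming NumPy. Since $R_{n,k}(i,i)^2$ is chi-square distributed with $n - i + 1$ degrees of freedom, its mean is $n - i + 1$; the sample sizes and seed are arbitrary choices of this example. Squaring the diagonal makes the check insensitive to the sign convention of the QR routine.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, trials = 6, 3, 4000

diag_sq_sum = np.zeros(k)
for _ in range(trials):
    z = rng.standard_normal((n, k))     # Z_{n,k} with i.i.d. standard Gaussian entries
    q, r = np.linalg.qr(z)              # Z = Q R with Q in V_{n,k}, R upper triangular
    diag_sq_sum += np.diag(r) ** 2

mean_diag_sq = diag_sq_sum / trials
# E[chi^2 with (n - i + 1) degrees of freedom] = n - i + 1 for i = 1, ..., k.
expected = np.array([n - i for i in range(k)], dtype=float)
orthonormality_error = np.max(np.abs(q.T @ q - np.eye(k)))
```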
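Let $L^Z_{n,k}$ denote the empirical measure of the rows of $Z_{n,k}$,
$$L^Z_{n,k} := \frac{1}{n}\sum_{i=1}^n \delta_{Z_{n,k}(i,\cdot)}.$$
Then, due to Proposition 3.2, for $\mathbf{A}_{n,k}$ distributed according to the Haar measure $\sigma_{n,k}$ on $\mathbb{V}_{n,k}$, we have
$$\mathbf{A}_{n,k} \overset{(d)}{=} Q_{n,k} = Z_{n,k} R_{n,k}^{-1}. \tag{3.1}$$
In the second equality, we use the fact that $R_{n,k}$ is (almost surely) invertible, since it is an upper triangular matrix with diagonal entries that are all (almost surely) positive. Recalling the definition of $L_{n,k}$ from (2.12), and using the representation (3.1), we have for any Borel set $B \subset \mathbb{R}^k$,
$$L_{n,k}(B) \overset{(d)}{=} \frac{1}{n}\sum_{i=1}^n \delta_{\sqrt{n}\, Z_{n,k}(i,\cdot) R_{n,k}^{-1}}(B) = L^Z_{n,k}\big(B \times n^{-1/2} R_{n,k}\big), \tag{3.2}$$
where $B \times M := \{xM : x \in B\}$ for a matrix $M \in \mathbb{R}^{k\times k}$. Fortuitously, each element of the matrix $R_{n,k}$ can be computed as a function of the rows of the matrix $Z_{n,k}$, and can be written as the image of a linear functional of the measure $L^Z_{n,k}$. To be more precise, we recall the Gram-Schmidt process. For a matrix $Z \in \mathbb{R}^{n\times k}$ with columns $z_1, \ldots, z_k \in \mathbb{R}^n$, let
$$y_1 := z_1, \quad q_1 := \frac{y_1}{\|y_1\|_2}; \qquad y_j := z_j - \sum_{i=1}^{j-1} \langle q_i, z_j\rangle q_i, \quad q_j := \frac{y_j}{\|y_j\|_2}, \quad j = 2, \ldots, k.$$
Then, we have the decomposition $Z = QR$ where $Q = (q_1, \ldots, q_k)$ and
$$R_{ij} := \langle q_i, z_j\rangle \ \text{ for } i \le j, \qquad R_{ij} := 0 \ \text{ for } i > j.$$
Note that we also have the following relation among the elements of $R$: for $1 \le i < j \le k$,
$$R_{ij} = \frac{1}{R_{ii}}\Big(\langle z_i, z_j\rangle - \sum_{l=1}^{i-1} R_{li} R_{lj}\Big), \qquad R_{jj} = \Big(\langle z_j, z_j\rangle - \sum_{l=1}^{j-1} R_{lj}^2\Big)^{1/2}.$$
This expression allows us to clarify the relationship between $R_{n,k}$ and $L^Z_{n,k}$.
Definition 3.4. Let $\mathrm{Sym}_k$ be the space of real symmetric $k \times k$ matrices. For $L, M \in \mathrm{Sym}_k$, we write $L \succeq M$ (resp., $L \succ M$) if $L - M$ is positive semi-definite (resp., positive definite).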
We equip Sym k ⊂ R k×k with the induced Borel σ-algebra when viewing it as a measurable space, and the Frobenius norm when viewed as a Banach space.
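Definition 3.5. Define the map $\Gamma : \mathbb{R}^{k\times k} \to \mathcal{U}_k$ according to the following iterative procedure: for $M \in \mathbb{R}^{k\times k}$, $j = 1, \ldots, k$, and $i = 1, \ldots, j$,
$$\Gamma(M)_{ij} := \begin{cases} \dfrac{1}{\Gamma(M)_{ii}}\Big(M_{ij} - \sum_{l=1}^{i-1} \Gamma(M)_{li}\,\Gamma(M)_{lj}\Big), & i < j, \\[2ex] \Big(M_{jj} - \sum_{l=1}^{j-1} \Gamma(M)_{lj}^2\Big)^{1/2}, & i = j, \end{cases} \tag{3.5}$$
with $\Gamma(M)_{ij} := 0$ for $i > j$. In the case $i = j = 1$, we abide by the convention that empty summations are set to zero, so $\Gamma(M)_{11} := M_{11}^{1/2}$.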
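The iterative procedure of Definition 3.5 is the column-by-column Cholesky recursion, and it can be checked numerically against a library routine. The sketch below, assuming NumPy, implements one reading of the recursion (the function name `gamma_map` is hypothetical) and compares $\Gamma(Z^\intercal Z)$ with both the transposed Cholesky factor and the sign-normalized $R$ factor of a QR decomposition of $Z$.

```python
import numpy as np

def gamma_map(m):
    """Upper-triangular G with G^T G = m, built column by column (m positive definite)."""
    k = m.shape[0]
    g = np.zeros((k, k))
    for j in range(k):
        for i in range(j):
            # Off-diagonal entry: uses entries l < i of columns i and j.
            g[i, j] = (m[i, j] - g[:i, i] @ g[:i, j]) / g[i, i]
        # Diagonal entry: empty sums are zero, so g[0, 0] = sqrt(m[0, 0]).
        g[j, j] = np.sqrt(m[j, j] - g[:j, j] @ g[:j, j])
    return g

rng = np.random.default_rng(2)
z = rng.standard_normal((10, 3))
gram = z.T @ z

g = gamma_map(gram)
chol_upper = np.linalg.cholesky(gram).T          # NumPy returns the lower factor
_, r = np.linalg.qr(z)
r_positive = r * np.sign(np.diag(r))[:, None]    # flip rows so the diagonal is positive
```

All three upper-triangular matrices agree up to floating-point error, reflecting the uniqueness of the Cholesky factor with positive diagonal.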

Proof of the empirical measure large deviations
The representation (3.2) and Lemma 3.6 suggest our plan of attack for the proof of Theorem 2.8: first prove a joint LDP for {L Z n,k , C(L Z n,k )} n∈N k ; then, establish an LDP for {L n,k } n∈N k . Note that we do not attempt to directly establish an LDP for {L n,k } n∈N k from that for {L Z n,k } n∈N k , because the map L Z n,k → L n,k is not continuous with respect to the weak topology. Nor do we attempt to directly establish an LDP for {L Z n,k , Γ(C(L Z n,k ))} n∈N k , because the map Γ • C is a nonlinear functional on the space of measures on R k . In contrast, our proposed first step is tractable precisely because C is a linear functional. We will make use of the following result, which is stated in [KLR20, Corollary A.2] as a corollary of [BADG01, Proposition 6.4].
Lemma 4.2. For any k ∈ N, the sequence {L Z n,k , C(L Z n,k )} n∈N k satisfies an LDP in P(R k ) × Sym k with GRF J k : P(R k ) × Sym k → [0, ∞], defined, for ν ∈ P(R k ) and M ∈ Sym k , to be Proof. We invoke the approximate contraction principle of Lemma 4.1 with the following parameters: Σ = R k ; X = Sym k and X * = Sym k ; c(z) := [z ⊗ z] for z ∈ R k ; L n := 1 n n j=1 δ sj for s 1 , s 2 , . . . i.i.d. random vectors with common distribution γ ⊗k ; and C n := R k c dL n . Note that (L Z n,k , C(L Z n,k )) With Λ as defined in (4.1), the domain D specified in (4.2) takes the form Therefore, (4.6) and Lemma 4.1 together imply that {(L Z n,k , C(L Z n,k ))} n∈N k satisfies the stated LDP.
We now establish a relation between the rate function J k and the rate function H k of Theorem 2.8.
Lemma 4.3. For any k ∈ N and ν ∈ P(R k ), given H k of (2.10), J k of (4.5), and Γ of (3.5), we have Proof. As noted in Remark 3.8, for M ∈ Sym k and Γ as in (3.5), we have M = Γ(M ) ⊺ Γ(M ). Given this equality, the constraint M R k [x ⊗ x] ν(dx × Γ(M ) −1 ) can be rewritten, using the notation C from (2.9), as If the preceding constraint is satisfied, then using the form of J k in (4.5) in the first equality below, the chain rule for relative entropy in the third equality, and then the form of the Gaussian distribution γ ⊗k , we obtain Also, note that Tr(I k − C(ν)) = k − R k y ⊺ y ν(dy). Hence, invoking the definition of H k in (2.10), we have Taking the infimum of the expression above over M ∈ Sym k , we see that which is clearly equal to zero. Together with (4.7), this shows that H k = H k .
Lemma 4.4. Fix k ∈ N and consider the following set of probability measures, For any q ∈ (0, 2), the set K ⊂ P 2 (R d ) is compact with respect to the q-Wasserstein topology. In addition, K is convex and non-empty.
Proof. The proof is an elementary modification of the proof of the k = 1 case given in [KR18, Lemma 3.14].
Proof of Theorem 2.8. Let Γ be as defined in (3.5). Due to (3.2) and Lemma 3.6, we have The image of C is positive semi-definite matrices, so as noted in Remark 3.8, the map Γ maps a matrix to its Cholesky decomposition, hence M → Γ(M ) is continuous. By Slutsky's theorem, the map is also continuous. An application of the contraction principle to the map above yields an LDP for the sequence {L n,k } n∈N k in P(R k ) (i.e., with respect to the weak topology), with GRF where the last equality is due to Lemma 4.3. Fix q ∈ (0, 2). In order to establish the LDP for {L n,k } n∈N k in P q (R k ) (i.e., with respect to the stronger q-Wasserstein topology), by Corollary 4.2.6 of [DZ09], it suffices to show exponential tightness of {L n,k } n∈N k in the q-Wasserstein topology. Let K be the set defined in (4.8), which is compact (with respect to the q-Wasserstein topology) due to Lemma 4.4. Note that R k |x| 2 L n,k (dx) = k a.s., so P(L n,k ∈ K c 2,k ) = 0, and hence, log P(L n,k ∈ K c 2,k ) = −∞ for all n ∈ N k , trivially implying the exponential tightness of {L n,k } n∈N k . Lastly, the strict convexity of H k follows from the strict convexity of the relative entropy H(·|γ ⊗k ) and the linearity of the covariance map C.

Proof of the quenched LDP
In this section, we present the proof of Theorem 2.4. As a precursor, we state two lemmas that will assist with part (iii) of the theorem.
Lemma 5.1. Fix $m \in \mathbb{N}$, and let $\mathcal{F}$ be a set of functions from $\mathbb{R}^m$ to $\mathbb{R}$ such that every $f \in \mathcal{F}$ is symmetric about 0 and convex. Then, defining $g : \mathbb{R}^m \to \mathbb{R}$ as
$$g(x) := \inf_{f \in \mathcal{F}} f(x), \quad x \in \mathbb{R}^m,$$
the function $g$ is monotone with respect to scaling in the sense that for all $x \in \mathbb{R}^m$, the mapping $\mathbb{R}_+ \ni c \mapsto g(cx)$ is non-decreasing.
Proof. Fix $x \in \mathbb{R}^m$ and $c_1 < c_2 \in \mathbb{R}_+$. For any $f \in \mathcal{F}$, the symmetry about 0 and convexity of $f$ imply that $f$ has a global minimum at 0, hence
$$f(c_1 x) \le \frac{c_1}{c_2} f(c_2 x) + \Big(1 - \frac{c_1}{c_2}\Big) f(0) \le f(c_2 x),$$
where the first inequality follows from convexity, and the second inequality is due to the global minimum at 0. Taking the infimum over all $f \in \mathcal{F}$ on both sides, we find that $g(c_1 x) \le g(c_2 x)$, completing the proof.
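Lemma 5.1 can be sanity-checked on a concrete family. In the sketch below, the family $f_a(x) = a\|x\|^2 + 1/(4a)$, indexed by a finite set of scales $a > 0$, is an assumption chosen purely for illustration: each $f_a$ is convex and symmetric about 0, while their pointwise infimum $g$ is in general not convex, yet $c \mapsto g(cx)$ is still non-decreasing.

```python
def g(x, scales):
    """Pointwise infimum over the family f_a(x) = a * |x|^2 + 1 / (4 * a)."""
    sq_norm = sum(v * v for v in x)
    return min(a * sq_norm + 1.0 / (4.0 * a) for a in scales)

scales = [0.1 * j for j in range(1, 51)]      # a in {0.1, 0.2, ..., 5.0}
x = (1.0, -2.0)
cs = [0.05 * i for i in range(0, 61)]         # scaling factors c in [0, 3]

values = [g(tuple(c * v for v in x), scales) for c in cs]
is_monotone = all(values[i] <= values[i + 1] + 1e-12 for i in range(len(values) - 1))
```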
Lemma 5.2. Fix m ∈ N, and let x = {x n } n∈N denote a sequence of R m -valued random variables that satisfies an LDP with GRF I x . Let U be a uniformly distributed random variable on [0, 1] independent of {x n } n∈N . If for all x ∈ R m , the mapping R + ∋ c → I x (cx) ∈ [0, ∞] is non-decreasing, then the scaled sequence {U 1/n x n } n∈N satisfies an LDP with GRF I x .
Proof. Due to [GKR17, Lemma 3.3], the sequence $\{U^{1/n}\}_{n\in\mathbb{N}}$ satisfies an LDP with the good rate function
$$I_U(u) := \begin{cases} -\log u, & u \in (0, 1], \\ +\infty, & \text{otherwise}. \end{cases}$$
By independence, the sequence $\{(U^{1/n}, x_n)\}_{n\in\mathbb{N}}$ satisfies a joint LDP with the GRF $I_{U,x} : \mathbb{R} \times \mathbb{R}^m \to [0, \infty]$ defined as $I_{U,x}(u, w) := I_U(u) + I_x(w)$. By the contraction principle, the scaled sequence $\{U^{1/n} x_n\}_{n\in\mathbb{N}}$ satisfies an LDP with the rate function $I$, where for $x \in \mathbb{R}^m$,
$$I(x) := \inf_{u \in (0,1]} \Big\{-\log u + I_x\Big(\frac{x}{u}\Big)\Big\}.$$
The mapping $u \mapsto 1/u$ is monotonically decreasing, which when combined with the assumption on $I_x$ implies that $u \mapsto I_x(x/u)$ is non-increasing. Since $u \mapsto -\log u$ is also monotonically decreasing, the infimum above is attained at $u = 1$, hence $I(x) = I_x(x)$ for all $x \in \mathbb{R}^m$.
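The final step of the proof, that the infimum over $u \in (0, 1]$ is attained at $u = 1$, can be illustrated on a grid. The sketch below assumes $m = 1$ and the rate function $I_x(w) = w^2/2$, which satisfies the monotonicity hypothesis of Lemma 5.2; both choices, and the function names, are made only for this example.

```python
import math

def rate_x(w):
    """Illustrative rate function I_x(w) = w^2 / 2; c -> rate_x(c * w) is non-decreasing."""
    return 0.5 * w * w

def contracted_rate(x, grid_size=1000):
    """Grid approximation of inf over u in (0, 1] of { -log(u) + I_x(x / u) }."""
    us = [j / grid_size for j in range(1, grid_size + 1)]   # last grid point is u = 1
    return min(-math.log(u) + rate_x(x / u) for u in us)

checks = {x: (contracted_rate(x), rate_x(x)) for x in (0.0, 0.5, 1.5)}
```

In each entry of `checks`, the contracted rate coincides with the original rate, since both terms inside the infimum are non-increasing in $u$.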
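Proof of Theorem 2.4. Suppose Assumption 2.1 holds for some $\{\xi_j\}_{j\in\mathbb{N}}$, $r$, $\rho$, $q_\star > 0$, and $T \le \infty$, all as defined in the statement of the assumption. We first prove an LDP for the following $\mathbb{R}^{k+1}$-valued sequence:
$$R^{(n)}_{\mathbf{a}} := \Big(\frac{1}{n}\sum_{i=1}^n \sqrt{n}\,\mathbf{a}_{n,k}(i,\cdot)\,\xi_i,\ \frac{1}{n}\sum_{i=1}^n r(\xi_i)\Big), \quad n \in \mathbb{N}_k.$$
In terms of the log mgf $\Lambda$ of $(\xi_1, r(\xi_1))$, the scaled log mgf of $R^{(n)}_{\mathbf{a}}$ takes the form: for $t = (t_1, t_2)$ with $t_1 \in \mathbb{R}^k$ and $t_2 \in \mathbb{R}$,
$$\frac{1}{n}\log \mathbb{E}\big[\exp\big(n\langle t, R^{(n)}_{\mathbf{a}}\rangle\big)\big] = \frac{1}{n}\sum_{i=1}^n \Lambda\big(\langle t_1, \sqrt{n}\,\mathbf{a}_{n,k}(i,\cdot)\rangle, t_2\big) = \Psi_{L^{\mathbf{a}}_{n,k}}(t_1, t_2),$$
with $\Psi_\cdot$ equal to the integrated log mgf functional defined in (2.3).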
Fix t 1 ∈ R k . For t 2 ≥ T , both sides equal +∞ due to Remark 2.2. For t 2 < T , due to the q ⋆ -Wasserstein continuity pointed out in Remark 2.2, together with the q ⋆ -Wasserstein convergence of (2.6), we take the limit as n → ∞ to find lim n→∞ 1 n log E exp(n t, R (n) a ) = Ψ ν (t 1 , t 2 ).
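The identity between the scaled log mgf and the integrated log mgf $\Psi_{L^{\mathbf{a}}_{n,k}}$ can be verified exactly for small $n$. The sketch below, assuming NumPy, takes $\xi_i = \pm 1$ with equal probability and $r \equiv 1$, so that $\Lambda(s_1, s_2) = \log\cosh(s_1) + s_2$; this choice is made for computational convenience only and is not checked against Assumption 2.1. It also assumes that the vector being exponentiated is $(n^{-1/2}\mathbf{a}_{n,k}^\intercal \xi^{(n)}, \frac{1}{n}\sum_i r(\xi_i))$, which is the reading of $R^{(n)}_{\mathbf{a}}$ consistent with the contraction map used later in the proof.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 2
a, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal columns: a lies in V_{n,k}
t1 = np.array([0.7, -0.4])
t2 = 0.3

coeffs = (np.sqrt(n) * a) @ t1        # <t1, sqrt(n) * a(i, .)> for i = 1, ..., n

def log_mgf(s1, s2):
    """Lambda(s1, s2) = log E exp(s1 * xi + s2 * r(xi)) for xi = +/-1 and r = 1."""
    return np.log(np.cosh(s1)) + s2

# Integrated log mgf against the empirical measure of the rows of sqrt(n) * a.
psi = np.mean(log_mgf(coeffs, t2))

# Direct evaluation: average exp(n * <t, R>) over all 2^n equally likely sign patterns.
total = 0.0
for signs in itertools.product((-1.0, 1.0), repeat=n):
    total += np.exp(coeffs @ np.array(signs) + n * t2)
direct = np.log(total / 2 ** n) / n
```

The two quantities `psi` and `direct` agree up to rounding, which reflects the independence of the $\xi_i$ and nothing specific to the sign distribution.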
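Due to the lower semicontinuity and essential smoothness of $\Psi_\nu$, which follow from Assumption 2.1(iii), the Gärtner-Ellis theorem (see, e.g., [DZ09, Theorem 2.3.6]) yields the LDP for the sequence $\{R^{(n)}_{\mathbf{a}}\}_{n\in\mathbb{N}}$ in $\mathbb{R}^{k+1}$ with the GRF $\Psi^*_\nu$ from (2.5). Due to the representation of $X^{(n)}$ given by Assumption 2.1(i), writing $R^{(n)}_{\mathbf{a},1} \in \mathbb{R}^k$ and $R^{(n)}_{\mathbf{a},2} \in \mathbb{R}$ for the first $k$ components and the last component of $R^{(n)}_{\mathbf{a}}$, respectively, we have
$$n^{-1/2}\,\mathbf{a}_{n,k}^\intercal X^{(n)} \overset{(d)}{=} \rho\big(R^{(n)}_{\mathbf{a},2}\big)\, R^{(n)}_{\mathbf{a},1},$$
where $\rho$ is continuous. The LDP for $\{R^{(n)}_{\mathbf{a}}\}_{n\in\mathbb{N}}$ and the contraction principle applied to the continuous mapping $\mathbb{R}^{k+1} \ni (\tau_1, \tau_2) \mapsto \rho(\tau_2)\tau_1 \in \mathbb{R}^k$ yield an LDP for $\{n^{-1/2}\mathbf{a}_{n,k}^\intercal X^{(n)}\}_{n\in\mathbb{N}_k}$ in $\mathbb{R}^k$ with good rate function $\bar{J}^{\mathrm{qu}}_\nu$ defined to be
$$\bar{J}^{\mathrm{qu}}_\nu(x) := \inf\big\{\Psi^*_\nu(\tau_1, \tau_2) : \tau_1 \in \mathbb{R}^k,\ \tau_2 \in \mathbb{R},\ \rho(\tau_2)\tau_1 = x\big\}, \quad x \in \mathbb{R}^k.$$
The equivalence of $\bar{J}^{\mathrm{qu}}_\nu$ to the rate function $J^{\mathrm{qu}}_\nu$ in (2.7) follows from using the constraint $\tau_1 \rho(\tau_2) = x$ to substitute for the first argument in $\Psi^*_\nu$. This proves part (i) of the theorem. In turn, the LDP from part (i) implies part (ii) of the theorem since by Corollary 2.11, almost surely, $W_{q_\star}(L^{\mathbf{A}}_{n,k}, \gamma^{\otimes k}) = W_{q_\star}(L_{n,k}, \gamma^{\otimes k}) \to 0$ as $n \to \infty$. We turn to the final claim (iii). Given the assumption on symmetry of $\Lambda$, it is apparent from the definition (2.3) that $\Psi_\nu$ is symmetric in its first argument, and then from the definition of the Legendre transform (2.5) that $\Psi^*_\nu$ is also symmetric in its first argument. Applying Lemma 5.1 with dimension $m = k$, the set of symmetric convex functions $\mathcal{F} = \{\mathbb{R}^k \ni x \mapsto \Psi^*_\nu(x/\rho(\tau), \tau) \in \mathbb{R}\}_{\tau \in \mathbb{R}_+}$, and $g = J^{\mathrm{qu}}_\nu$, we find that the mapping $\mathbb{R}_+ \ni c \mapsto J^{\mathrm{qu}}_\nu(cx) \in [0, \infty]$ is non-decreasing. An application of Lemma 5.2 with $x_n = n^{-1/2}\mathbf{a}_{n,k}^\intercal X^{(n)}$ for $n \in \mathbb{N}$ and $I_x = J^{\mathrm{qu}}_\nu$ completes the proof.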

Proof of the variational formula
In this section, we prove Theorem 2.7, primarily through an application of Theorem 2.8 and Sion's minimax theorem [Sio58]. We start with preliminary results in Lemmas 6.1, 6.2, and 6.3. Throughout, recall the definition of H k from (2.10).
Lemma 6.1. Suppose Assumption 2.1 holds, with associated $T$ and $\Psi_\nu$, $\nu\in\mathcal{P}(\mathbb{R}^k)$, and recall the empirical measure $L_{n,k}$ from (2.12). For $t_1\in\mathbb{R}^k$, $t_2<T$, and $0<\delta<\infty$, the following condition holds:
$$\limsup_{n\to\infty}\frac{1}{n}\log\mathbb{E}\left[e^{\delta n\Psi_{L_{n,k}}(t_1,t_2)}\right]<\infty. \qquad (6.1)$$
Proof. Let $\Theta^{(n)} = (\Theta^{(n)}_1,\dots,\Theta^{(n)}_n)$ denote a random vector distributed uniformly on the Euclidean sphere in $\mathbb{R}^n$ of radius 1. For $t_1\in\mathbb{R}^k$, the random vector $\mathbf{A}_{n,k}t_1$ lies on the Euclidean sphere in $\mathbb{R}^n$ of radius $\|t_1\|_2$ and has a law invariant under orthogonal transformations (due to the law of $\mathbf{A}_{n,k}$); hence, $\mathbf{A}_{n,k}t_1 \overset{(d)}{=} \|t_1\|_2\,\Theta^{(n)}$. Fix $t_1\in\mathbb{R}^k$ and $t_2<T$, let $C_{t_2}$ and $q_\star$ be as in Assumption 2.1(ii), and for $i\in\mathbb{N}$, define $g_i:\mathbb{R}_+\to\mathbb{R}_+$ as When combined, the relation $\mathbf{A}_{n,k}t_1 \overset{(d)}{=} \|t_1\|_2\,\Theta^{(n)}$, the bound of Assumption 2.1(ii) and the sub-independence of $(|\Theta^{(n)}_1|,\dots,|\Theta^{(n)}_n|)$ yield the bound (6.2). Now, let $Z^{(n)} := (Z_1,\dots,Z_n)$, and note that for each $i=1,\dots,n$, $\sqrt{n}\,\Theta^{(n)}_i \xrightarrow{\text{a.s.}} Z_1$. Therefore, taking the limit superior, as $n\to\infty$, in (6.2) and applying the reverse Fatou lemma, we obtain an upper bound for $\limsup_{n\to\infty}\frac{1}{n}\log\mathbb{E}\left[e^{\delta n\Psi_{L_{n,k}}(t_1,t_2)}\right]$. Since the last term on the right-hand side of this bound is finite for all $t_1\in\mathbb{R}^k$ because $q_\star<2$, (6.1) follows.
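The geometric fact underlying the proof, namely that $\mathbf{A}_{n,k}t_1$ always lies on the sphere of radius $\|t_1\|_2$, can be checked with a short simulation. The sketch below assumes the standard construction of a Haar-distributed element of $\mathbb{V}_{n,k}$ by Gram-Schmidt orthonormalization of i.i.d. standard Gaussian vectors; the function names and the values of `n`, `k` and `t1` are illustrative choices, not taken from the paper.

```python
import math
import random

def sample_stiefel(n, k, rng):
    """Return k orthonormal vectors in R^n (the columns of an n x k frame),
    obtained by Gram-Schmidt applied to i.i.d. standard Gaussian vectors."""
    cols = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for u in cols:
            # Remove the component of v along the already-built column u.
            proj = sum(vi * ui for vi, ui in zip(v, u))
            v = [vi - proj * ui for vi, ui in zip(v, u)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        cols.append([vi / norm for vi in v])
    return cols

def apply_frame(cols, t1):
    """Compute the vector A t1 in R^n, where A has the given columns."""
    n = len(cols[0])
    return [sum(t * col[i] for t, col in zip(t1, cols)) for i in range(n)]

def norm2(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

rng = random.Random(0)          # fixed seed, for reproducibility only
n, k = 200, 3                   # illustrative dimensions
t1 = [1.0, -2.0, 0.5]           # illustrative vector in R^k
A = sample_stiefel(n, k, rng)
At1 = apply_frame(A, t1)
norm_At1 = norm2(At1)
norm_t1 = norm2(t1)
```

Because the columns of the frame are orthonormal, `norm_At1` agrees with `norm_t1` up to floating-point error for every draw. The distributional identity $\mathbf{A}_{n,k}t_1 \overset{(d)}{=} \|t_1\|_2\,\Theta^{(n)}$ additionally uses the rotation invariance of the Haar measure, which this check does not test.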
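Lemma 6.2. Suppose Assumption 2.1 holds, with associated $\{\xi_j\}_{j\in\mathbb{N}}$, $r$, $T$, and $\Psi_\nu$, $\nu\in\mathcal{P}(\mathbb{R}^k)$, and let $\mathbf{A}_{n,k}$ be drawn from the Haar measure $\sigma_{n,k}$ on $\mathbb{V}_{n,k}$, independently of $\{\xi_j\}_{j\in\mathbb{N}}$. For $n\in\mathbb{N}$, define where $\xi^{(n)} := (\xi_1,\xi_2,\dots,\xi_n)$. Then, for $t_1\in\mathbb{R}^k$ and $t_2\in\mathbb{R}$,
$$\lim_{n\to\infty}\Phi_n(t_1,t_2) = \Phi(t_1,t_2),$$
where, with $K$ equal to the set defined in (4.8),
$$\Phi(t_1,t_2) := \sup_{\nu\in\mathcal{P}(\mathbb{R}^k)}\left\{\Psi_\nu(t_1,t_2)-H_k(\nu)\right\} = \sup_{\nu\in K}\left\{\Psi_\nu(t_1,t_2)-H_k(\nu)\right\}. \qquad (6.4)$$
Proof. Due to the independence of $\xi_1,\xi_2,\dots$, and their independence from $\mathbf{A}_{n,k}$, we can write, for $n\in\mathbb{N}$,
$$\Phi_n(t_1,t_2) = \frac{1}{n}\log\mathbb{E}\left[e^{n\Psi_{L_{n,k}}(t_1,t_2)}\right]$$
for $t_1\in\mathbb{R}^k$ and $t_2\in\mathbb{R}$, where $L_{n,k}$ is as in (2.12). Now, let $T\le\infty$ and $q_\star\in(0,2)$ be as specified in Assumption 2.1(ii). For $t_2\ge T$, by Remark 2.2, both $\Phi_n(\cdot,t_2)$ and $\Phi(\cdot,t_2)$ are identically equal to infinity, and so the limit holds trivially. Now, suppose $t_2<T$. Recall from Theorem 2.8 that the sequence $\{L_{n,k}\}_{n\in\mathbb{N}_k}$ satisfies an LDP in $\mathcal{P}_q(\mathbb{R}^k)$ for all $q\in(0,2)$, with the GRF $H_k$. Due to the bound in Assumption 2.1(ii), the map $\mathcal{P}_{q_\star}(\mathbb{R}^k)\ni\nu\mapsto\Psi_\nu(t_1,t_2)\in\mathbb{R}$ is continuous with respect to the $q_\star$-Wasserstein topology, and we know $q_\star\in(0,2)$ due to Assumption 2.1(ii). By Varadhan's lemma [DZ09, Theorem 4.3.1], which is applicable due to Lemma 6.1, it follows that the limit of $\Phi_n(t_1,t_2)$ is given by $\Phi(t_1,t_2)$ defined in (6.4).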
To complete the proof of the lemma, it only remains to establish the last equality in (6.4), but this is an immediate consequence of the fact that $H_k(\nu)=\infty$ for $\nu\notin K$.
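Lemma 6.3. Suppose Assumption 2.1 holds, and for each $\nu\in\mathcal{P}(\mathbb{R}^k)$, let $\Psi_\nu$ be as defined in (2.3), let $\Psi^*_\nu$ denote its Legendre transform, as specified in (2.5), and let $K\subset\mathcal{P}(\mathbb{R}^k)$ be the set defined in (4.8). Then the Legendre transform $\Phi^*$ of the function $\Phi$ defined in (6.4) satisfies, for $\tau_1\in\mathbb{R}^k$ and $\tau_2\in\mathbb{R}$,
$$\Phi^*(\tau_1,\tau_2) = \inf_{\nu\in\mathcal{P}(\mathbb{R}^k)}\left\{\Psi^*_\nu(\tau_1,\tau_2)+H_k(\nu)\right\} = \inf_{\nu\in K}\left\{\Psi^*_\nu(\tau_1,\tau_2)+H_k(\nu)\right\}. \qquad (6.5)$$
Proof. First, note that the second equality in (6.5) holds because $H_k(\nu)=\infty$ for $\nu\notin K$. Next, fix the following:
• let $\Lambda$, $T$ be as in Assumption 2.1(ii), and define $D_T := \mathbb{R}^k\times(-\infty,T)$;
• let $q_\star\in(0,2)$ be as in Assumption 2.1(ii), and let $\mathcal{M}_{q_\star}(\mathbb{R}^k)$ denote the space of finite signed measures (not necessarily probability measures) on $\mathbb{R}^k$, equipped with the $q_\star$-Wasserstein topology.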
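Fix $\tau=(\tau_1,\tau_2)\in\mathbb{R}^k\times\mathbb{R}$. Then by the definition (2.5) of $\Psi^*_\nu$,
$$\Psi^*_\nu(\tau_1,\tau_2) = \sup_{t=(t_1,t_2)\in\mathbb{R}^{k+1}}\left\{\langle\tau_1,t_1\rangle+\tau_2t_2-\Psi_\nu(t_1,t_2)\right\} = \sup_{t=(t_1,t_2)\in D_T}\left\{\langle\tau_1,t_1\rangle+\tau_2t_2-\Psi_\nu(t_1,t_2)\right\},$$
where the second equality holds because, by Remark 2.2, $\Psi_\nu(t_1,t_2)=\infty$ if $t_2\ge T$. Thus, the right-hand side of (6.5) is equal to $\inf_{\nu\in K}\sup_{t=(t_1,t_2)\in D_T}F_\tau(\nu,t)$, where
$$F_\tau(\nu,t) := \langle\tau_1,t_1\rangle+\tau_2t_2-\Psi_\nu(t_1,t_2)+H_k(\nu).$$
On the other hand, by the definition of $\Phi^*$ and the representation (6.4) for $\Phi$,
$$\Phi^*(\tau_1,\tau_2) = \sup_{t\in\mathbb{R}^{k+1}}\left\{\langle\tau_1,t_1\rangle+\tau_2t_2-\Phi(t_1,t_2)\right\} = \sup_{t\in\mathbb{R}^{k+1}}\inf_{\nu\in K}F_\tau(\nu,t) = \sup_{t\in D_T}\inf_{\nu\in K}F_\tau(\nu,t),$$
where the last equality uses the fact that for $t_2\ge T$, $\Psi_\nu(t_1,t_2)=\infty$ and hence, $F_\tau(\nu,t)=-\infty$ (see Remark 2.2). Thus, to prove the first equality in (6.5), it suffices to show that for all $(\tau_1,\tau_2)\in\mathbb{R}^k\times\mathbb{R}$,
$$\inf_{\nu\in K}\sup_{t\in D_T}F_\tau(\nu,t) = \sup_{t\in D_T}\inf_{\nu\in K}F_\tau(\nu,t). \qquad (6.7)$$
To justify the exchange of infimum and supremum in (6.7), we verify the conditions of the minimax theorem [Sio58, Corollary 3.3]. That is, for $(\tau_1,\tau_2)\in\mathbb{R}^k\times\mathbb{R}$, we note that
• the set $D_T=\mathbb{R}^k\times(-\infty,T)$ is a convex subset of the topological vector space $\mathbb{R}^{k+1}$;
• due to Lemma 4.4, $K$ is a convex compact subset of the topological vector space $\mathcal{M}_{q_\star}(\mathbb{R}^k)$;
• for $t=(t_1,t_2)\in D_T$: the lower semicontinuity of $F_\tau(\cdot,t)$ follows from the lower semicontinuity of $\nu\mapsto\Psi_\nu(t)$ due to Assumption 2.1(iii) and of $H_k$ (as it is a GRF); the convexity of $F_\tau(\cdot,t)$ follows from the linearity of $\nu\mapsto\Psi_\nu(t)$ and the convexity of $H_k$, which was established in Theorem 2.8;
• for $\nu\in K$: the lower semicontinuity of $t\mapsto\Psi_\nu(t)$ on $D_T$ follows from Assumption 2.1(iii); the convexity of $\Psi_\nu$ on $D_T$ follows from linearity of expectation, the definition (2.3), and the fact that $\Lambda$ is convex since it is a log mgf;
• since $t\mapsto\langle\tau_1,t_1\rangle+\tau_2t_2$ is continuous and linear, it follows that $F_\tau(\nu,\cdot)$ is upper semicontinuous and concave on $D_T$.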
Due to the conditions verified above, the minimax theorem can be applied to conclude that (6.7), and hence, the desired first equality in (6.5), holds. This completes the proof of the lemma.
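The exchange of infimum and supremum can be seen in a finite-dimensional toy analogue. In the sketch below, a scalar parameter `v` in the compact convex interval $[1,2]$ plays the role of $\nu\in K$, and the function `F` is a hypothetical stand-in for $F_\tau$ built from $\Psi_v(t)=vt^2/2$ (linear in `v`, convex in `t`) and the convex penalty $(v-1.5)^2$ in place of $H_k$; none of these choices come from the paper. Both iterated optima are evaluated on a grid and compared.

```python
def F(v, t, tau=1.0):
    """Toy analogue of F_tau(nu, t): convex in v, concave in t."""
    return tau * t - 0.5 * v * t * t + (v - 1.5) ** 2

# Grid on the compact convex set [1, 2] (stand-in for K).
v_grid = [1.0 + i / 200.0 for i in range(201)]
# Grid on [-3, 3], a truncation of the real line that contains the maximizers.
t_grid = [-3.0 + j / 100.0 for j in range(601)]

# Evaluate F once so that both iterated optima use identical values.
table = [[F(v, t) for t in t_grid] for v in v_grid]

# inf over v of sup over t.
inf_sup = min(max(row) for row in table)
# sup over t of inf over v.
sup_inf = max(
    min(table[i][j] for i in range(len(v_grid)))
    for j in range(len(t_grid))
)
gap = inf_sup - sup_inf
```

On any finite grid the inequality `sup_inf <= inf_sup` holds automatically; the content of the minimax theorem is that, for a function that is convex and lower semicontinuous in one argument and concave and upper semicontinuous in the other, with one of the two sets compact, the gap closes. Here `gap` is of the order of the grid resolution, and both values are close to the saddle value, which is approximately 0.3225 for this toy function.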
The independence of $\mathbf{A}_{n,k}$ from $\{\xi_j\}_{j\in\mathbb{N}}$, along with Theorem 3.4 of [BGLCR10] (applied to the case of $p=2$ therein, with their canonically projected $X^{(k)}$ equivalent to our $\mathbf{A}_{n,k}(1,\cdot)$) then implies that the sequence $\{S^{(n)}\}_{n\in\mathbb{N}}$ satisfies an LDP with the convex GRF $J:\mathbb{R}^{k+2}\to[0,\infty]$ defined by $J(a,b,c) := -\frac{1}{2}\log(1-\|a\|_2^2)+J(b,c)$, for $a\in\mathbb{R}^k$ such that $\|a\|_2^2<1$ and $b,c\in\mathbb{R}$. Then, by the contraction principle, $\{R^{(n)}\}_{n\in\mathbb{N}}$ satisfies an LDP with the GRF $J_R:\mathbb{R}^{k+1}\to\mathbb{R}$ defined as follows: $J_R(x,z) := \inf_{y\in\mathbb{R}:\,y>\|x\|_2}J(xy^{-1/2},y,z)$, $x\in\mathbb{R}^k$, $z\ge 0$.
Note that J R is convex due to [GKR17, Lemma 6.2] and Theorem 5.3 of [Roc70].