Transportation-Based Functional ANOVA and PCA for Covariance Operators

We consider the problem of comparing several samples of stochastic processes with respect to their second-order structure, and of describing the main modes of variation in this second-order structure, if present. These tasks can be seen as an Analysis of Variance (ANOVA) and a Principal Component Analysis (PCA) of covariance operators, respectively. They arise naturally in functional data analysis, where several populations are to be contrasted relative to the nature of their dispersion around their means, rather than relative to their means themselves. We contribute a novel approach based on optimal (multi)transport, where each covariance is identified with a centred Gaussian process of corresponding covariance. By constructing the optimal simultaneous coupling of these Gaussian processes, we contrast the (linear) maps that achieve it with the identity, with respect to a norm-induced distance. The resulting test statistic, calibrated by permutation, is seen to distinctly outperform the state-of-the-art, and to furnish considerable power even under local alternatives. This effect is seen to be genuinely functional, and is related to the potential for perfect discrimination in infinite dimensions. In the event of a rejection of the null hypothesis stipulating equality, a geometric interpretation of the transport maps allows us to construct a (tangent space) PCA revealing the main modes of variation. As a necessary step in developing our methodology, we prove results on the existence and boundedness of optimal multitransport maps. These are of independent interest in the theory of transport of Gaussian processes. The transportation ANOVA and PCA are illustrated on a variety of simulated and real examples.


Introduction
Let {X_{i,1}}_{i=1}^{n_1}, . . ., {X_{i,K}}_{i=1}^{n_K} be K independent samples of i.i.d. random elements in a separable Hilbert space H, possessing well-defined means {μ_j}_{j=1}^K and covariances {Σ_j}_{j=1}^K. We consider the problem of testing the hypothesis

H_0 : Σ_1 = · · · = Σ_K     (1)

on the basis of the observations {X_{i,j}} and, if H_0 is rejected, the subsequent problem of describing the main mode(s) of variation of the K underlying covariances. This problem arises very naturally in functional data analysis, i.e. when H is taken to be a function space (for instance L^2[0, 1] or a reproducing kernel Hilbert subspace thereof), and one is interested in discerning whether K different groups of functions manifest the same type of dispersion relative to their mean. For instance, the functions could be curves representing DNA minicircles (Panaretos, Kraus and Maddocks [28], Kraus and Panaretos [24], and Tavakoli and Panaretos [39]), where different groups correspond to different base-pair sequences, and one is interested in probing for a dependence of the mechanical properties on the base-pair sequence; or they could be surfaces representing the log-spectrograms of short spoken words by different speakers (as in Ferraty and Vieu [14]), where one may wish to see whether several groups of sounds differ; a further example is the analysis of age-dependent wheel-running activity curves in mice, where one may wish to see whether the level of activity across age has evolved under several generations of selection (Cabassi et al. [6]). What is common to all these examples is that it is not the mean structure that is suspected to differ (or at least to capture the most interesting differences); in that case, the problems would fall under the well-studied topic of functional analysis of variance (see, e.g., Benko, Härdle and Kneip [2], Zhang [42], Cuesta-Albertos and Febrero-Bande [7], Górecki and Smaga [17]). Rather, it is the fluctuations around the means μ_j, as encapsulated by the operators Σ_j, that are of interest, in what could be termed a functional covariance ANOVA.
Early contributions in this direction focus on the two-sample case, as in Benko, Härdle and Kneip [3], Panaretos, Kraus and Maddocks [28], and Fremdt et al. [15]. In particular, Benko, Härdle and Kneip [3] propose two-sample bootstrap-based tests for certain aspects of the spectrum of functional data, Panaretos, Kraus and Maddocks [28] consider the problem in a two-sample setting with Gaussian processes, and Fremdt et al. [15] extend this approach to non-Gaussian processes. Kraus and Panaretos [24] provide resistant versions of two-sample tests, focussed on operators related to the covariance. Two-sample testing was first extended to the K-sample case by Boente, Rodriguez and Sued [4], who propose a test based on the Hilbert–Schmidt distance between the estimated covariance operators of each population, with critical values calibrated via a bootstrap procedure. The common theme in these papers is that the covariances are contrasted with respect to the Hilbert–Schmidt metric, which corresponds to embedding covariance operators in a larger linear space, even though they are not closed under linear operations. Instead, covariance operators are trace-class non-negative operators, so rather than being seen as Hilbert–Schmidt operators, they are better represented as "squares" of such operators. For this reason, Pigoli et al. [31] considered the use of nonlinear metrics adapted to non-negative operators in the two-sample setting, generalising some of the work of Dryden, Koloydenko and Zhou [11] in finite dimensions. Cabassi et al. [6] extend their metric-based methodology to the K-sample case. Other contributions to the K-sample comparison are from Paparoditis and Sapatinas [29], who develop an empirical bootstrap methodology and prove its consistency when the test statistic is based on the Hilbert–Schmidt norm, and Kashlak, Aston and Nickl [22], who perform the K-sample comparison via methods based on concentration inequalities. Recently, Hlávka, Hlubinka and Koňasová [20] proposed a method to perform functional ANOVA based on empirical characteristic functionals. Comparing with Anderson [1], Paparoditis and Sapatinas [29] and Kashlak, Aston and Nickl [22], Cabassi et al. [6] report simulation results illustrating state-of-the-art performance of their method. We found that this holds true even when comparing against the recent work of Hlávka, Hlubinka and Koňasová [20], where a specific choice of parameters yields conclusions similar to those of Cabassi et al. [6] when comparing covariances in a real-data example.
Pigoli et al. [31] paid particular attention to the Procrustes distance, which generalises a metric used to compare unlabelled shapes into a metric between covariance operators. Heuristically, the Procrustes metric aims to compare roots of two operators in the Hilbert–Schmidt distance, a natural choice since covariances are characterised as squares of Hilbert–Schmidt operators. However, there is an ambiguity as to which precise root one ought to use, and the Procrustes distance corrects for that by optimising over the square-root orbits. Indeed, later work by Pigoli et al. [32] reports that the Procrustes metric offered the most natural framework to compare trace-class operators, in that it uses a map from the space of covariance operators to the linear space of Hilbert–Schmidt operators. Masarotto, Panaretos and Zemel [25] carried out a deeper study of the Procrustes metric and established a fruitful connection with the Wasserstein metric between Gaussian processes. On the one hand, this allowed them to provide a complete geometrical description of the space of covariances under the Procrustes metric, including basic results about Fréchet means; on the other hand, it established intriguing parallels with the theory of optimal transportation, offering potentially new avenues and tools for the analysis of covariance operators (see also the discussion of Pigoli et al. [32] by Panaretos [27]).
In this paper, we use precisely this novel transportation perspective to introduce a new ANOVA test, and then exploit the corresponding geometry to construct a PCA that respects the nature of covariance operators, by representing them as transport maps. Specifically, we view the testing problem through the lens of optimally multicoupling the Gaussian processes {N(0, Σ_1), . . ., N(0, Σ_K)}, thus translating the task of testing the hypothesis H_0 in Equation (1) into that of testing whether the optimal multicoupling is "trivial". To do this we first prove that the optimal multicoupling can always be deterministically produced by means of bounded linear transport maps {t_j}, regardless of the validity of the null hypothesis. These results are of independent interest, and are stated as Theorem 2. Given these results, we translate the task of testing the hypothesis in Equation (1) into the equivalent task of testing the hypothesis

H_0 : t_1 = · · · = t_K = I,     (2)

for I the identity, and the hypothesis (2) can now be tested by means of a norm-based (e.g., operator, Hilbert–Schmidt, or nuclear) test statistic, establishing a direct analogy with classical ANOVA. The deviations ∆_j = t_j − I can heuristically also be viewed as "roots" of the original covariances, albeit free of any unitary ambiguity. Though the test is motivated by the 1-to-1 correspondence between covariances and Gaussian processes, it relies in no way on a Gaussian assumption, and indeed admits an interpretation purely in terms of Procrustean geometry. Our simulation experiments indicate that the new test dominates state-of-the-art competitors, with dramatic gains in power, particularly against more challenging local alternatives. This is also explained by means of theory, and is seen to be a genuinely functional effect, connected to the Hájek–Feldman dichotomy; see Section 6 for more details. In terms of computation, the quantity (K^{-1} ∑_{j=1}^K t_j − I) in finite dimensions is precisely the negative gradient of the Fréchet (sum-of-squares) functional (Zemel and Panaretos [41, Theorem 1]), and its computation is stable and feasible because it relies on the fast steepest-descent algorithm in the space of covariances endowed with the Procrustes metric.
When the null hypothesis is rejected, it is natural as a second step to wish to describe the variation manifested by the covariances {Σ_j}, or indeed to obtain a parsimonious representation thereof. We show how the maps {t_j} can readily be employed to do just that, via a principal component analysis. In particular, the ∆_j = t_j − I can be interpreted as the logarithms of the {Σ_j} at their Procrustes–Fréchet mean. The corresponding tangent space admits a Hilbertian structure with respect to a modified Hilbert–Schmidt inner product, which we use to produce a tangent space fPCA, and then retract the principal components back onto the covariance space. To the best of our knowledge, this is the first instance of a functional PCA on covariance operators that respects their intrinsic geometric features as trace-class positive operators. We describe the computational steps required to do so in Section 3, and illustrate the usefulness of the procedure on simulated data as well as a linguistic data set. The next paragraph collects the notational conventions employed throughout the paper. Proofs of our theoretical results are given in Section 7.

Basic Setting and Notation
As stated in the introduction, we will be interested in exploring the variation in a finite collection of covariances {Σ_j}_{j=1}^K on a real separable Hilbert space H, equipped with the inner product ⟨·, ·⟩ : H × H → R and corresponding norm ‖·‖ : H → [0, ∞). Since H will in principle be infinite-dimensional, we will need to review some basic definitions and notation, which can be more subtle in this setting.
Given a bounded linear operator A : H → H, we will denote its trace (when defined) by trA or tr(A), its adjoint by A*, and its inverse by A^{-1}. The inverse may not be defined, or may be defined only on a (dense) subspace of H. The range of A will be denoted by range(A) = {Av : v ∈ H}, whereas the kernel of A will be denoted by ker(A) = {v ∈ H : Av = 0}. We will say that a (possibly unbounded) operator A is self-adjoint if ⟨Au, v⟩ = ⟨u, Av⟩ for all u, v in the domain of definition of A; if A happens to also be bounded, then this is equivalent to the condition that A = A*.
A non-negative operator is a self-adjoint, possibly unbounded operator A such that ⟨Au, u⟩ ≥ 0 for all u in the domain of A. If in addition A is compact, then there exists a unique non-negative operator whose square equals A, which will be denoted by either A^{1/2} or √A. The inverse square root (A^{1/2})^{-1} is denoted by A^{-1/2}. For any bounded operator A, A*A is non-negative. The identity operator on H will be denoted by I. The operator, Hilbert–Schmidt and trace (nuclear) norms will respectively be

‖A‖_∞ = sup_{‖v‖ ≤ 1} ‖Av‖,  ‖A‖_2 = √(tr(A*A)),  ‖A‖_1 = tr(√(A*A)),

and can be ordered from coarser to finer as ‖A‖_∞ ≤ ‖A‖_2 ≤ ‖A‖_1. When ‖A‖_2 < ∞ we say that A is Hilbert–Schmidt, and when ‖A‖_1 < ∞ we say that A is nuclear or trace-class. Summarising, in this setting, covariances are linear operators from H into H that are self-adjoint, non-negative, and trace-class. As such, a covariance operator Σ on H can be considered as the "square" of a Hilbert–Schmidt operator: if ‖B‖_2 < ∞ then B is certainly bounded, and B*B defines a valid covariance operator.
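In finite dimensions these norms can all be read off the eigenvalues. The following sketch (our own illustration, not code from the paper) computes the three norms of a symmetric matrix and checks the coarser-to-finer ordering:

```python
import numpy as np

def schatten_norms(A):
    """Operator, Hilbert-Schmidt and trace norms of a symmetric matrix A."""
    s = np.abs(np.linalg.eigvalsh(A))   # singular values of a symmetric matrix
    return s.max(), np.sqrt((s ** 2).sum()), s.sum()

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
Sigma = B @ B.T                         # a valid finite-dimensional covariance
op, hs, nuc = schatten_norms(Sigma)
assert op <= hs <= nuc                  # ||A||_inf <= ||A||_2 <= ||A||_1
```

For a non-negative matrix such as Sigma, the trace norm is simply the trace, which is why covariances being trace-class corresponds to finite total variance.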
It therefore becomes clear that covariance operators are non-linear objects, and though they can be contrasted by means of any of the three norms, it may be preferable to find a means of comparison that respects this non-linear nature. In finite dimensions, this is done by means of some form of linearisation, i.e. the use of a transformation that replaces a covariance pair (Σ_1, Σ_2) by an operator that can be contrasted to zero by means of one of the norms ‖·‖_r, r = 1, 2, ∞. For instance, in classical two-sample covariance tests, two covariances Σ_1 and Σ_2 have been contrasted by means of quantities such as Σ_1 Σ_2^{-1} − I, assuming that the inverse exists (e.g., Roy [37], Kiefer and Schwartz [23], Giri [16]). In non-Euclidean statistics, covariances (Σ_1, Σ_2) have been contrasted by means of log(Σ_1) − log(Σ_2), again assuming that the logarithms are well defined (Dryden, Koloydenko and Zhou [11]).
In infinite dimensions, however, these criteria will generally fail to be well defined. For example, the inverse of Σ_2 will be unbounded, and there is no guarantee that Σ_1 Σ_2^{-1} will be bounded, unless Σ_1 and Σ_2 share some special relation. Similarly, the logarithm of a covariance operator will typically be unbounded, and unless there is a specific relation between Σ_1 and Σ_2, the logarithmic criteria will fail to be well defined. This is one of the main reasons why much of the literature on covariance operators has focussed on bypassing their nonlinear nature, and comparing them directly, e.g. by means of ‖Σ_1 − Σ_2‖_2 (Panaretos, Kraus and Maddocks [28], Fremdt et al. [15], Boente, Rodriguez and Sued [4]).
A first step towards linearisations that yield contrasts respecting the nature of covariances, while being well defined in infinite dimensions, was made by Pigoli et al. [31]. Since covariances are "squares" of Hilbert–Schmidt operators, they considered contrasting the square roots of Σ_1 and Σ_2 in the Hilbert–Schmidt distance, ‖Σ_1^{1/2} − Σ_2^{1/2}‖_2. Observing that one could nevertheless choose roots other than the (unique) non-negative roots, by means of the fact that Σ = (Σ^{1/2}R)(Σ^{1/2}R)* for any unitary R, they introduced the Procrustes distance

Π(Σ_1, Σ_2) = inf_R ‖Σ_1^{1/2} − Σ_2^{1/2} R‖_2,

which lifts the unitary ambiguity by optimising over unitary operators R, and is well defined in both finite and infinite dimensions. Indeed they use this metric to develop a two-sample test for covariance comparison. Masarotto, Panaretos and Zemel [25] further developed several key properties of this metric and its geometry, interpreting it via the optimal transportation of Gaussian processes as the L^2-Wasserstein distance between the two Gaussian measures N(0, Σ_1) and N(0, Σ_2) on H,

Π^2(Σ_1, Σ_2) = tr(Σ_1) + tr(Σ_2) − 2 tr(Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}.
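In finite dimensions the Procrustes (Wasserstein) distance can be evaluated directly from the trace formula. Here is a minimal sketch (our own illustration, with `scipy.linalg.sqrtm` standing in for the operator square root); for commuting covariances the distance reduces to the Hilbert–Schmidt distance between the non-negative square roots, which serves as a sanity check:

```python
import numpy as np
from scipy.linalg import sqrtm

def procrustes_dist(S1, S2):
    """Procrustes / Wasserstein distance between two covariance matrices."""
    r1 = np.real(sqrtm(S1))
    cross = np.real(sqrtm(r1 @ S2 @ r1))
    d2 = np.trace(S1) + np.trace(S2) - 2.0 * np.trace(cross)
    return np.sqrt(max(d2, 0.0))

# Diagonal (hence commuting) covariances: distance = ||sqrt(S1) - sqrt(S2)||_2.
S1, S2 = np.diag([4.0, 1.0]), np.diag([1.0, 9.0])
direct = np.linalg.norm(np.sqrt(np.diag(S1)) - np.sqrt(np.diag(S2)))
assert abs(procrustes_dist(S1, S2) - direct) < 1e-8
```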
The key observation in this paper is that the optimal transport theory developed in Masarotto, Panaretos and Zemel [25] can be directly leveraged in order to provide natural notions of "roots" (or linearisations) that:
• are unequivocally defined without any unitary ambiguity;
• are efficiently computable;
• offer remarkable power when used in a covariance ANOVA;
• can be used to obtain a natural PCA, when the equality of covariances is rejected.
These are defined via the notion of an optimal multicoupling, and are introduced in the next Section.

Optimal Multicoupling and Transport Maps
As already stated, our strategy for testing H_0 : Σ_1 = · · · = Σ_K is to view the covariance operators through the lens of optimal multicoupling of Gaussian processes. Specifically, we observe that the collection of covariances {Σ_1, . . ., Σ_K} can be bijectively identified with a collection of centred Gaussian measures {N(0, Σ_1), . . ., N(0, Σ_K)} on the Hilbert space H. Denote these measures by {γ_1, . . ., γ_K}. Equality of the covariance operators thus holds true if and only if the measures {γ_j} coincide. Viewing the measures {γ_j} as the marginals of a joint measure π on H^K, one can ask what are the possible forms of π. This set of possible joint measures π is always nonempty (it always contains the product measure), and is called the set of multicouplings of {γ_1, . . ., γ_K}. An optimal multicoupling is a multicoupling π* such that the marginals are as tightly coupled as possible in a pairwise mean-square sense, in that it minimises the functional

F(π) = ∑_{i<j} E ‖Z_i − Z_j‖^2,  (Z_1, . . ., Z_K) ∼ π.

Said differently, π* is the joint distribution of a collection of K Gaussian processes on H, say (Z_1, . . ., Z_K), such that Z_j ∼ N(0, Σ_j) marginally for all j ≤ K, while ∑_{i<j} E ‖Z_i − Z_j‖^2 is minimised. Existence of finite second moments of Gaussian measures (a consequence of Fernique's theorem, Fernique [13]) implies that F(π) is finite for any multicoupling π. It can be shown that an optimal multicoupling of Gaussians always exists (Masarotto, Panaretos and Zemel [25]). We say that such an optimal multicoupling π* is manifested by (deterministic) transport maps if the collection (Z_1, . . ., Z_K) ∼ π* can be generated by taking a single process Z and a collection of deterministic maps t_j : H → H, so that (Z_1, . . ., Z_K) = (t_1(Z), . . ., t_K(Z)). In other words, an optimal multicoupling π* is generated by deterministic maps if it is supported on the graph of a vector-valued function from H to H^K. It is a priori unclear whether a deterministic multicoupling exists in general, and if it does, whether the maps t_j are bounded, but it is not hard to see that it will exist under the null H_0 and that it will be "trivial":

Lemma 1. The equality Σ_1 = · · · = Σ_K holds true if and only if the (unique) optimal multicoupling of (γ_1, . . ., γ_K) can be achieved by transport maps satisfying t_1 = · · · = t_K = I.

The maps t_j are called transport maps because they can be thought of as "transporting" the (unspecified) law of Z to that of Z_j. The lemma suggests that we can detect departures from the hypothesis H_0 by contrasting the maps {t_j} with the identity. But to even speak of departures from H_0, we must be assured that such maps exist even under the alternative regime, and this existence is not a priori guaranteed (see Conjecture 17 and the discussion in Section 12 of [25]). Furthermore, to quantify the extent of departures from the null, we need to make sure that the multicoupling maps not only exist, but are bona fide bounded linear operators over all of H, and can thus be contrasted by appropriate norms.
Our main theoretical result, in the form of the following theorem, shows that an optimal multicoupling can always be realised by means of bounded deterministic maps, a result of independent interest in optimal transport in its own right. It provides a partial positive resolution of Conjecture 17 in [25].
Theorem 2. Let {γ_1, . . ., γ_K} be an arbitrary finite collection of Gaussian measures on H with mean zero. Then there exists an optimal multicoupling of {γ_j}_{j=1}^K manifested by deterministic transport maps t_j : H → H that are bounded non-negative linear operators satisfying ‖t_j‖_∞ ≤ K, for all j ≤ K.
Although the optimal coupling π* is typically unique, its representation in terms of the maps is not. For instance, if π* is manifested as the law of (t_1(Z), . . ., t_K(Z)), it may also be represented as (2t_1(Z/2), . . ., 2t_K(Z/2)). It is natural to take Z ∼ N(0, Σ̄), where Σ̄ is a centre, i.e., the Fréchet mean of Σ_1, . . ., Σ_K with respect to the Procrustes metric. This choice forces the maps t_j to have mean identity (see (4) below), and in particular they must all be the identity under the null (1), so that (2) holds. Using this convention, the existence and boundedness result in Theorem 2 opens the way for a testing procedure: the deviations ∆_j = t_j − I are all self-adjoint and bounded, but no longer restricted to be non-negative. When H_0 is valid, Lemma 1 implies that ∆_j = t_j − I = 0 for all j. Under the alternative, at least one ∆_j is non-zero. We can thus replace the null hypothesis (1) with the equivalent hypothesis ∆_1 = · · · = ∆_K = 0, viewing the ∆_j as elements of a linear space, and reducing the original testing problem to a more traditional linear functional ANOVA setting. Since the ∆_j are guaranteed to be bounded (Theorem 2), they can certainly be contrasted to 0 using the operator norm. However, one can devise even more powerful procedures by measuring the size of ∆_j in a stronger norm, such as the Hilbert–Schmidt norm, or even the trace norm. In the finite-dimensional setting this choice of norm will typically not make much difference, since all norms are equivalent. But in the infinite-dimensional case, a finer norm will detect subtler departures from the null. For instance, if K = 2 and Σ_2 = δ^2 Σ_1 for some δ ≥ 0 with δ ≠ 1, then one can show that for j = 1, 2 the deviation ∆_j is a non-zero scalar multiple of the identity on the closure of the range of Σ_1, so that its operator norm is finite while its Hilbert–Schmidt norm is infinite whenever that range is infinite-dimensional. Similarly, if the ∆_j are Hilbert–Schmidt but not trace-class, their trace norm will be infinite, promising to furnish high power even against very local alternatives. This genuinely functional phenomenon is not unlike the possibility of perfect discrimination of Gaussian processes (Feldman [12], Hájek [18], Rao and Varadarajan [36]; see Section 6 for a more detailed discussion). It is demonstrated empirically in our later simulations. If there are only K = 2 populations, then in view of (4), ∆_2 = −∆_1 and the test statistic is ‖∆_1‖_r. When comparing more than two covariance operators, the criteria ‖∆_j‖_r will need to be combined into a single criterion (e.g., by taking their supremum over j or by summation).
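The scaling alternative Σ_2 = δ²Σ_1 can be worked out in closed form; the following is a sketch of that computation (our own derivation, using the Fréchet mean and map formula of the next paragraph, which are explicit for commuting covariances):

```latex
% Since \Sigma_1 and \Sigma_2 = \delta^2 \Sigma_1 commute, the Frechet mean
% and the optimal maps are scalar on the closure of the range of \Sigma_1:
\bar\Sigma = \Bigl(\tfrac{1+\delta}{2}\Bigr)^{2} \Sigma_1,
\qquad
t_1 = \tfrac{2}{1+\delta}\, I,
\qquad
t_2 = \tfrac{2\delta}{1+\delta}\, I,
% so that the deviations are non-zero multiples of the identity,
\Delta_1 = \tfrac{1-\delta}{1+\delta}\, I = -\Delta_2 .
```

A non-zero multiple of the identity has finite operator norm but infinite Hilbert–Schmidt norm on an infinite-dimensional range, which is the source of the power gain just described; note also that (t_1 + t_2)/2 = I, consistent with the centring convention.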
How does one concretely construct a deterministic multicoupling {t_j}, and hence the {∆_j}, in practice? In proving Theorem 2 we establish the existence and boundedness of the maps

t_j = Σ̄^{-1/2} (Σ̄^{1/2} Σ_j Σ̄^{1/2})^{1/2} Σ̄^{-1/2},     (3)

where Σ̄ is a Fréchet mean of {Σ_j}_{j=1}^K with respect to the Procrustes metric Π, i.e., a minimiser of the sum-of-squares functional Γ ↦ ∑_{j=1}^K Π^2(Γ, Σ_j) over the space of trace-class covariances. Moreover, the t_j are centred around the identity in that

K^{-1} ∑_{j=1}^K t_j = I.     (4)

The Fréchet mean Σ̄ is unique when at least one of the Σ_j is injective (or more generally, if the kernel of at least one Σ_j is contained in the kernels of all other Σ_j); see Masarotto, Panaretos and Zemel [25, Proposition 10]. Its algorithmic construction is discussed in detail in Section 3.
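In finite dimensions, the map in (3) can be formed directly from matrix square roots. The sketch below (our own illustration, with `scipy.linalg.sqrtm` standing in for the operator square root) verifies the defining push-forward property t_j Σ̄ t_j = Σ_j:

```python
import numpy as np
from scipy.linalg import sqrtm

def transport_map(S, Sj):
    """Optimal map from N(0, S) to N(0, Sj): S^{-1/2}(S^{1/2} Sj S^{1/2})^{1/2} S^{-1/2}."""
    rS = np.real(sqrtm(S))
    irS = np.linalg.inv(rS)
    return irS @ np.real(sqrtm(rS @ Sj @ rS)) @ irS

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
S = A @ A.T + np.eye(4)                  # injective base covariance
B = rng.standard_normal((4, 4))
Sj = B @ B.T + np.eye(4)
t = transport_map(S, Sj)
assert np.allclose(t, t.T, atol=1e-6)          # the map is self-adjoint
assert np.allclose(t @ S @ t, Sj, atol=1e-6)   # it pushes N(0,S) onto N(0,Sj)
```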
Once the multicoupling (3) has been constructed, and a norm ‖·‖_r (r = 1, 2, ∞) has been chosen, the null hypothesis can be tested by measuring a combined deviation of the ∆_j from zero in that norm, and calibrating against the typical values of such deviations under the null. This is discussed in the next subsection. As discussed in Masarotto, Panaretos and Zemel [25], the optimal multicoupling has an elegant geometrical interpretation in terms of the manifold geometry of the Procrustes distance; this will later be exploited in Section 2.4 in order to construct a functional PCA of the covariance operators.

Transportation-Based Functional ANOVA of Covariances
Assume now that we have K independent groups of functional data {X_{ij}, i = 1, . . ., K, j = 1, . . ., n}, the ith group having covariance Σ_i (the procedure is easily adapted to different group sizes). Without loss of generality, the data are assumed to have zero mean. Based on the discussion in the previous paragraph, we can test the equality of covariance operators by testing the hypothesis H_0 : ∆_1 = · · · = ∆_K = 0, where ∆_j = t_j − I and Σ̄ is a Fréchet mean of {Σ_j}_{j=1}^K with respect to the Procrustes metric Π. At the level of our sample, we have access to empirical versions {Σ̂_j}, constructed on the basis of the samples of size n from each group. These could simply be the empirical covariances within each group (under a complete-observation assumption), or some smoothed estimator (for instance, the empirical covariance of smoothed versions of the {X_{ij}} (Ramsay and Silverman [33]), or PACE-type estimators (Yao, Müller and Wang [40])). Whichever the case may be, the Σ̂_j are finite-dimensional, of rank q ≤ n. In case a smoothing technique is used, we assume that it is such that the Σ̂_j share a common range, and can thus be represented as q × q non-negative matrices via a common (tensor product) basis. For tidiness, we use the same notation for Σ̂_j and its q × q matrix representation in the common basis. It is clear that this basis can be chosen so that at least one of these matrices is of full rank q.
In this case, there exists a unique empirical Fréchet mean Σ̂,

Σ̂ = arg min_Γ ∑_{j=1}^K Π^2(Γ, Σ̂_j),

and this can be computed from {Σ̂_1, . . ., Σ̂_K} using steepest descent (see Section 3). This gives rise to empirical versions t̂_j of the t_j, and corresponding empirical deviations from the identity ∆̂_j = t̂_j − I_{q×q}.
The testing procedure is now based on the test statistic

T_r = ∑_{j=1}^K ‖∆̂_j‖_r (or, alternatively, max_{j≤K} ‖∆̂_j‖_r),

where r ∈ {1, 2, ∞} (the performance under the different choices of r is investigated in the simulation section). We avoid making any concrete parametric assumptions, and instead calibrate the test statistic by means of permutations. A typical permuted value is calculated according to the following steps:
- Reassign the n × K curves {X_{i,j}} at random into K groups of equal size; call these new groups {X*_{i,j}}.
- Construct the empirical covariance Σ̂*_j of the jth permuted group.
- Compute the Fréchet mean of {Σ̂*_1, . . ., Σ̂*_K}, the corresponding deviations ∆̂*_j, and the resulting statistic T*_r.
Repeating these steps for all possible reassignments yields the distribution of the permuted statistics T*_r, which can be used to generate a p-value for T_r under the null hypothesis. As usual, computing an exact such p-value can become prohibitive for large K, and in practice we resort to Monte Carlo sampling of permutations. Note that similar steps allow for the implementation of a bootstrap-type procedure, simply by randomly resampling indices with replacement. However, we opt for the permutation approach, since the exchangeability of the labels under H_0 guarantees the (near) exactness of the K-sample permutation test (Pesarin and Salmaso [30]), and we do not pursue the bootstrap approach further.
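The permutation calibration can be sketched as follows (our own illustration under simplifying assumptions: completely observed curves stored as rows, equal group sizes, Monte Carlo rather than exhaustive permutations, and a user-supplied `statistic_fn` — in the paper's procedure this would compute T_r from the deviations ∆̂_j of Section 3; the placeholder below uses plain Hilbert–Schmidt distances between group covariances instead):

```python
import numpy as np

def permutation_pvalue(X, K, statistic_fn, n_perm=200, seed=0):
    """X: (n*K, q) array of curves; statistic_fn maps a list of K (n, q)
    groups to a scalar. Returns a Monte Carlo (add-one) p-value."""
    rng = np.random.default_rng(seed)
    t_obs = statistic_fn(np.split(X, K))
    exceed = 0
    for _ in range(n_perm):
        Xp = X[rng.permutation(X.shape[0])]   # reassign curves to K groups
        if statistic_fn(np.split(Xp, K)) >= t_obs:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)

def hs_statistic(groups):                     # placeholder, not the paper's T_r
    covs = [g.T @ g / len(g) for g in groups]
    return sum(np.linalg.norm(c - covs[0]) for c in covs[1:])

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 5))              # 3 groups of 20, identical law
p = permutation_pvalue(X, 3, hs_statistic)
assert 0.0 < p <= 1.0
```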

Transportation-Based Functional PCA of Covariances
When the null hypothesis of equality between covariances is rejected, the analyst may wish to explore whether the detected differences are carried by some interpretable main modes of variation. The transport maps t_j (or their empirical versions) can be used to this aim. When these exist, the differences ∆_j = t_j − I admit an elegant geometric interpretation as the logarithms of the operators Σ_j at the Fréchet mean Σ̄, under the manifold-like geometry induced by the Procrustes metric Π(·, ·) on the space of (trace-class) covariance operators. Specifically, Masarotto, Panaretos and Zemel [25] show that this space admits a tangent space at Σ̄, with respect to the geometry induced by Π, characterised as the closure of a space of bounded self-adjoint operators, where the closure is with respect to the inner product

⟨A, B⟩_Σ̄ = tr(A Σ̄ B).

When Σ̄ is injective, this is a bona fide inner product, that is, ⟨Q, Q⟩ = 0 ⟺ Q = 0. Unfortunately, whether the Fréchet mean is injective or not remains an open question, even if all Σ_j are so (see [25, Conjecture 17]), and it is not clear that the Σ_j can be lifted to the tangent space. Nevertheless, our new result in the form of Theorem 2 guarantees that the maps ∆_j do exist as bounded self-adjoint operators, and indeed the inner products ⟨∆_i, ∆_j⟩_Σ̄ are well defined, regardless of the injectivity of Σ̄, by means of the formulae for the t_i and Π. Consequently, the finite-dimensional span of {∆_1, . . ., ∆_K} admits a Hilbertian structure when equipped with the inner product ⟨·, ·⟩_Σ̄, and for all practical purposes can be used to carry out a PCA based on the spectral decomposition of the non-negative operator

𝒦 = K^{-1} ∑_{j=1}^K ∆_j ⊗_Σ̄ ∆_j, where (A ⊗_Σ̄ B)C = ⟨B, C⟩_Σ̄ A.

(Such a PCA can be interpreted as a tangent space PCA with respect to a Procrustean metric tensor.) Notice that 𝒦 constitutes precisely the empirical covariance of the collection {∆_j}_{j=1}^K, because K^{-1} ∑_{j=1}^K ∆_j = 0 by Equation (4). Once the principal components are constructed, the main modes of covariance variation can be visualised by retracting appropriate subspaces of the tangent space back to the space of covariance operators. Specifically, if E_1 is the eigenoperator associated with the largest eigenvalue of 𝒦, this retraction takes the form

ε ↦ (I + εE_1) Σ̄ (I + εE_1),

which is a geodesic for sufficiently small ε > 0. This principal geodesic is the visualisation of the main mode of variation of {Σ_j}_{j=1}^K near their Fréchet mean Σ̄.
A subtlety here is that the PCA is to be carried out on a Hilbert space endowed with an inner product other than the standard Hilbert–Schmidt inner product. This different choice of inner product affects both the formal definition and the computational evaluation of the principal components. The defining maximisation problem yielding the first principal component is now

arg max {⟨A, 𝒦A⟩_Σ̄ : ⟨A, A⟩_Σ̄ = 1}.

Nevertheless, this change of inner product poses no essential difficulty, and has indeed been considered before by Silverman [38] in the case of Sobolev inner products, and generalised by Ocaña, Aguilera and Valderrama [26]. Further details are given in Section 3.

Computational Implementation
In the next Section we will work with K independent groups of functional data {X_{ij}, j = 1, . . ., K, i = 1, . . ., n_j}, the jth group of sample size n_j and with covariance operator Σ_j, j = 1, . . ., K. Unless otherwise stated, all curves are simulated from a multivariate Gaussian process and sampled on an equispaced grid on Ω = [0, 1]. The sample size and the grid points vary across applications; therefore, in practice, we only have access to estimated empirical covariances Σ̂_1, . . ., Σ̂_K. In our case, Σ̂_1, . . ., Σ̂_K are obtained from the smoothed versions of the {X_{ij}} as traditional sample covariance functions, through the command var.fd in the R package fda (Ramsay and Silverman [34], Ramsay et al. [35]). If a smoothed version of the {X_{ij}} is not available, a PACE-type estimator can be used (Yao, Müller and Wang [40]).

Transport ANOVA
Once estimators Σ̂_1, . . ., Σ̂_K are at our disposal, our transport ANOVA requires their Fréchet mean

Σ̂ = arg min_Γ ∑_{j=1}^K Π^2(Γ, Σ̂_j)

and the transport maps contrasted with the identity, ∆̂_j = t̂_j − I. When the Σ̂_j commute (Σ̂_j Σ̂_i = Σ̂_i Σ̂_j for all i, j), Σ̂ has the explicit form

Σ̂ = (K^{-1} ∑_{j=1}^K Σ̂_j^{1/2})^2.

However, there is no reason that this commutativity should hold. For general covariances, Σ̂ has no closed-form formula, but it can be approximated by the iterative procedure described in [25, Section 8]. This procedure can be interpreted as steepest descent in the Procrustes space of covariances, and in finite dimensions it provably approximates Σ̂ and the ∆̂_j to arbitrary precision. It is carried out as follows:
• Let Σ^0 : H → H be an injective covariance, serving as the initial point.
• Denote the current iterate at step k by Σ^k.
• For each j, compute the optimal map from Σ^k to each of the prescribed operators Σ̂_j : H → H, namely

t_{Σ^k}^{Σ̂_j} = (Σ^k)^{-1/2} ((Σ^k)^{1/2} Σ̂_j (Σ^k)^{1/2})^{1/2} (Σ^k)^{-1/2}.

• Define their average T^k = K^{-1} ∑_{j=1}^K t_{Σ^k}^{Σ̂_j}, and set Σ^{k+1} = T^k Σ^k T^k.
In practice the algorithm is stopped after, say, k iterations; Σ^k is then our numerical approximation of Σ̂, and t_{Σ^k}^{Σ̂_j} − I approximates t̂_j − I. In terms of the manifold-like geometry of covariances under the Procrustes metric (see Section 2.4), the algorithm starts with an initial guess of the Fréchet mean; it then lifts all observations to the tangent space at that guess via the log map, and averages linearly on the tangent space; this linear average is then retracted onto the manifold via the exponential map, providing the next guess, and so on. The quantity T^k − I is precisely the negative gradient of the Fréchet (sum-of-squares) functional, which is why this is steepest descent. The test statistic, a linear combination (or the maximum) of powers of the ‖∆̂_j‖_r, can then be computed directly from these approximations.
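The iteration above can be sketched in a few lines (our own finite-dimensional illustration; `transport_map` implements the displayed optimal map, and the commuting closed form serves as a sanity check):

```python
import numpy as np
from scipy.linalg import sqrtm

def transport_map(S, Sj):
    """Optimal map from N(0, S) to N(0, Sj)."""
    rS = np.real(sqrtm(S))
    irS = np.linalg.inv(rS)
    return irS @ np.real(sqrtm(rS @ Sj @ rS)) @ irS

def frechet_mean(covs, n_iter=30):
    """Steepest descent: average the lifted maps, then retract S <- T S T."""
    S = covs[0].copy()                          # injective starting point
    for _ in range(n_iter):
        T = sum(transport_map(S, C) for C in covs) / len(covs)
        S = T @ S @ T
    return S

# Commuting sanity check: for diagonal covariances the Frechet mean is
# (K^{-1} sum_j Sigma_j^{1/2})^2.
covs = [np.diag([1.0, 4.0]), np.diag([9.0, 1.0])]
expected = (sum(np.real(sqrtm(C)) for C in covs) / len(covs)) ** 2
assert np.allclose(frechet_mean(covs), expected, atol=1e-6)
```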

Transport PCA
Once the empirical Fréchet mean Σ of Σ_1, ..., Σ_K and the ∆_j are computed (see the previous section), we can perform functional PCA on the collection {Σ_1, ..., Σ_K} through their tangent space representation ∆_1, ..., ∆_K (see Section 2.4). As explained there, there are some subtleties involved in this PCA, since we are working with a different inner product than the standard Hilbert-Schmidt one. However, Ocaña, Aguilera and Valderrama [26] proved that PCA with respect to the tangent space inner product is equivalent to PCA performed with the Hilbert-Schmidt inner product on suitably transformed data, thus providing a framework to interpret standard Euclidean PCA in the Procrustes geometry. More precisely, let ⟨·, ·⟩_HS be the Hilbert-Schmidt inner product and ⟨·, ·⟩_Σ be the Wasserstein one on the tangent space at Σ, that is, ⟨A, B⟩_Σ = tr(AΣB). We follow the steps of Ocaña, Aguilera and Valderrama [26], using the unique operator T characterised by ⟨A, B⟩_Σ = ⟨T(A), B⟩_HS, which in our case is multiplication from the right by Σ (so T(A) = (AΣ^{1/2})Σ^{1/2} is trace class and has an adjoint). T is nonnegative, and the PCA of data (X_1, ..., X_n) with respect to ⟨·, ·⟩_Σ is computed as the PCA of [T^{1/2}(X_i)]_{i=1}^n with the Hilbert-Schmidt norm, in the sense that the eigenvalues (i.e. the variances) remain the same, and the eigenfunctions with respect to ⟨·, ·⟩_Σ are obtained by applying T^{-1/2} to the eigenfunctions with respect to ⟨·, ·⟩_HS. In our specific case T^{1/2}(X) = XΣ^{1/2}, and the PCA on the tangent space is carried out as follows:
- Multiply ∆_j = t_j − I from the right by Σ^{1/2}.
- Find the spectral decomposition of the empirical covariance operator K̃ of the transformed data ∆_1Σ^{1/2}, ..., ∆_KΣ^{1/2}, defined on the space of Hilbert-Schmidt operators with respect to the Hilbert-Schmidt norm.
- Multiply (from the right) the eigenfunctions of K̃ by Σ^{-1/2} to obtain the eigenfunctions of the tangent space PCA.
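The three steps above admit a short finite-dimensional sketch (illustrative only; the function name and the SVD-based route are our choices, assuming matrices stand in for the operators):

```python
import numpy as np
from scipy.linalg import sqrtm

def tangent_pca(deltas, sigma):
    """PCA of tangent vectors Delta_j = t_j - I under <A, B>_Sigma = tr(A Sigma B),
    computed as Hilbert-Schmidt PCA of the transformed data Delta_j Sigma^{1/2}."""
    root = sqrtm(sigma).real
    root_inv = np.linalg.pinv(root)
    d = sigma.shape[0]
    v = np.stack([(delta @ root).ravel() for delta in deltas])  # HS coordinates
    v = v - v.mean(axis=0)                                      # centre on the tangent space
    u, s, vt = np.linalg.svd(v, full_matrices=False)
    scores = u * s                        # PC scores with respect to <.,.>_Sigma
    variances = s ** 2 / len(deltas)      # eigenvalues of the empirical operator
    # map HS eigenfunctions back to the tangent space: right-multiply by Sigma^{-1/2}
    components = [vt[m].reshape(d, d) @ root_inv for m in range(len(s))]
    return scores, variances, components
```

With sigma equal to the identity this reduces to ordinary Hilbert-Schmidt PCA, which provides a simple sanity check.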

Numerical Experiments
We now demonstrate the efficacy of the proposed methods through a variety of simulated examples. Simulations are broadly categorized into two subsets: functional ANOVA and tangent space PCA. We initially explore the behavior in a scenario where the Fréchet mean is known and the transport-based functional ANOVA test is applied to covariances {Σ_j = T_jΣT_j}_{j=1}^K obtained via perturbation according to the generative model described in Masarotto, Panaretos and Zemel [25, Section 10].
To allow comparability of our method with the existing literature, we subsequently consider another simulation scenario, directly taken from Cabassi et al. [6], where the simulated covariance operators are perturbations of those of the male and female subjects in the Berkeley growth data set (Jones and Bayley [21]). In both scenarios, generative model and Berkeley data, we compare the functional ANOVA based on transport maps with the permutation test of Cabassi et al. [6], the concentration inequality method of Kashlak, Aston and Nickl [22], and the functional ANOVA method based on empirical characteristic functionals of Hlávka, Hlubinka and Koňasová [20]. Cabassi et al. [6] provide results showing that their method is state-of-the-art, and to the best of our knowledge, no other alternative procedures besides the ones listed exist. It is evident from Figures 1 and 2 that our testing method outperforms the other methods, reaching nearly perfect power even in the presence of small differences.
After validating the performance of transport-based functional ANOVA, in Section 4.2 we make use of the generative model scenario to perform tangent space Principal Component Analysis. Finally, in Section 5 we test the performance of our method on the classic phoneme dataset (Ferraty and Vieu [14]), which consists of 4509 log-periodograms of 5 different phonemes. The data is available at https://hastie.su.domains/ElemStatLearn/. The real data analysis consolidates the strength of our method with respect to the competitors, as can be seen in Section 5.1. If the null hypothesis of equality among covariance operators is rejected, Section 5.2 shows how tangent space PCA can be a successful tool in understanding dataset variability.

Simulation Experiments
In order to avoid propagation of error, it is convenient to formulate a simulation setup in which the Fréchet mean is known exactly and does not need to be approximated (Section 3). It is easy to construct such examples in the commutative case, but we shall not do so, as this case is overly restrictive and unrealistic. We thus appeal to the generative model in Masarotto, Panaretos and Zemel [25, Section 10], which states that if a collection of nonnegative maps T_1, ..., T_K has mean identity, then any covariance operator Σ is the Fréchet mean of {Σ_j = T_jΣT_j}_{j=1}^K, and the maps t_j in (3) must equal T_j (on the closed range of Σ).
To construct the collection {T_1, ..., T_K} we proceed as follows. Let H = L²[0, 1], and for f, g ∈ H let their tensor product f ⊗ g be the operator (f ⊗ g)h = ⟨g, h⟩f. If t is the parameter of the functions, we write f(t) ⊗ g(t) to mean f ⊗ g. With this notation, define T_j as in (5), where the δ_n^{(j)} are independent of θ^{(j)}, and k > 0. This construction guarantees that E[T_j] = E[E[T_j | θ^{(j)}]] = I regardless of the distribution of θ^{(j)}. The parameter k controls the concentration of T_j around the identity; when k is large, the law of large numbers entails that δ_n^{(j)} is close to 1. Of course, a given realisation of T_1, ..., T_K will not average precisely to the identity, but will average approximately to the identity if K and k are not too small. The parameter θ^{(j)} ≥ 0, on the other hand, indicates how far we are from commutativity. On this note, a parametric model can be assumed for the θ^{(j)}. We chose the θ^{(j)} to be sampled from a von Mises distribution with mean 0 and concentration 1/σ, with the degenerate case σ → 0 (concentration tending to infinity) yielding commutativity.
We then generate n_j Gaussian curves X_{i,j}, i ∈ {1, ..., n_j}, with mean zero and covariance Σ_j = T_jΣT_j, j ∈ {1, ..., K}. Inspired by Kashlak, Aston and Nickl [22], the "population" Fréchet mean was chosen as in (6), a matrix with eigenvalue decay rate O(n^{-4}): Σ = UΛU*, where Λ is diagonal with eigenvalues decaying as n^{-4} and U is a randomly generated orthogonal operator. To simulate a functional case and have enough information to display the decay of the spectrum, we chose for the matrices the relatively high size of 50×50. The power is estimated from 500 replications. The number of permutations is 200. At each replication, we generate two optimal maps T_1 and T_2 via the generative model, and two corresponding covariances Σ_1 = T_1ΣT_1* and Σ_2 = T_2ΣT_2*, with Σ given by equation (6). For each Σ_i, i = 1, 2, we sample 40 observations of a Gaussian process with mean zero and covariance Σ_i. The empirical covariance computed from these observations yields a replica of Σ_i. We repeat this so as to obtain k_1 replicas of Σ_1 and k_2 replicas of Σ_2, for a total of k_1 + k_2 covariances divided into two groups of sizes k_1 and k_2 respectively. The values of the pair (k_1, k_2) are (1,2), (1,3), (1,7) and (4,4). This procedure is repeated for several values of the von Mises parameter, namely σ^{-1} ∈ {0.1, 1, 5, 10}.
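A simulation in this spirit can be sketched as follows. The maps produced by `random_map` are a simplified stand-in for the tensor-product construction in (5) (same mean-identity and χ²-eigenvalue features, but without the von Mises rotation), and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 50, 20, 40

# Base covariance as in (6): Sigma = U Lambda U^T, eigenvalues decaying as n^{-4}
u, _ = np.linalg.qr(rng.standard_normal((d, d)))
sigma = u @ np.diag(np.arange(1, d + 1, dtype=float) ** -4) @ u.T

def random_map(rng, d, k):
    """Nonnegative operator with eigenvalues delta ~ chi^2_k / k, so that E[T] = I;
    a simplified stand-in for the maps T_j of (5)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    delta = rng.chisquare(k, size=d) / k
    return q @ np.diag(delta) @ q.T

t1 = random_map(rng, d, k)
sigma_1 = t1 @ sigma @ t1                    # perturbed covariance Sigma_1 = T_1 Sigma T_1

# 40 Gaussian observations with covariance sigma_1, then a "replica" of Sigma_1
x = rng.multivariate_normal(np.zeros(d), sigma_1, size=n)
sigma_hat = x.T @ x / n
```

Larger k concentrates the eigenvalues of `t1` around 1, shrinking the perturbation towards the identity, which is the role the dispersion parameter plays in the power curves of Figure 1.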
We compare the power of our procedure with that of the K-sample permutation test of Cabassi et al. [6], and the methods of Kashlak, Aston and Nickl [22] and Hlávka, Hlubinka and Koňasová [20]. The idea in Cabassi et al. [6] is to perform a series of partial 2-sample tests for each pair of groups, and to combine the pairwise test statistics through the non-parametric combination algorithm of Pesarin and Salmaso [30]. The pairwise test statistics are T_ij = d(Σ_i, Σ_j), where Σ_i and Σ_j are the sample covariance operators of the corresponding groups, and d is a metric on the space of operators. The method of Cabassi et al. [6] is general, as any distance d can be used as test statistic. It is shown to be more powerful than competing methods such as Kashlak, Aston and Nickl [22], and is implemented in the R package fdcov (Cabassi and Kashlak [5]). In our comparisons we have used the square root distance d(Σ_i, Σ_j) = ||Σ_i^{1/2} − Σ_j^{1/2}||_HS, which led to the best performance according to Cabassi et al. [6]. Using the Procrustes distance instead of the square root distance had similar performance but increased computational cost. Hlávka, Hlubinka and Koňasová [20] propose a test statistic of Cramér-von Mises type based on the distance of the empirical characteristic functionals (ECFs), ∫|φ_1(ω) − φ_2(ω)|² dQ(ω), where φ_i is the ECF of the i-th sample and Q is some probability measure on the dual space of H. The performance of the test proposed by Hlávka, Hlubinka and Koňasová [20] relies heavily on the choice of the measure Q. We consider a Gaussian measure characterised by the choice of two different covariance operators: the sample covariance matrix on the one hand, and an approximation of the inverse of the pooled sample covariance matrix computed from the first 9 eigenvectors on the other. These two choices appear to perform best in Hlávka, Hlubinka and Koňasová [20]. We include both in our simulation study because the first yields overall better performance, but the latter gains power in difficult cases, such as when most operators are equal. After performing the global test, in case the null hypothesis H_0 is rejected, one can investigate pairwise differences with post-hoc comparisons, as in Cabassi et al. [6], Pesarin and Salmaso [30].
Figure 1 shows the empirical power of all procedures. The x-axis represents the value of the dispersion parameter k in (5), while the y-axis displays the empirical power. It is evident from the figure that our procedure outperforms those of Cabassi et al. [6] and Kashlak, Aston and Nickl [22], as well as that of Hlávka, Hlubinka and Koňasová [20] in both variations considered.
The success of our test is a genuinely functional phenomenon: indeed, when the data are truncated, the differences are not so overwhelming. To understand this better, notice that since χ²_k is an unbounded random variable, we have ||T_j||_∞ = ∞ almost surely, and our test, based on ||T_j − I||_2 = ∞, consequently rejects the null hypothesis. If the series is truncated at a finite level n_0, then ||T_j||_1 ∼ k^{-1}χ²_{kn_0}, and so P(||T_j||_1 > R) decays to zero exponentially as R and/or k increase; a fortiori the same holds for ||T_j||_2 and ||T_j||_∞. The procedure of Cabassi et al. [6] effectively puts very small weight on what happens in the tails (n large), whereas our procedure is able to detect departures from the null even when they occur only at very high frequencies. We refer to Section 6 for more insight on the power of the functional ANOVA test.
One may argue that the generative model setup in (5) artificially favours our testing procedure, as it guarantees ||∆_j|| = ∞. We therefore consider another simulation scenario, directly taken from Cabassi et al. [6], in order to compare the methods on the same ground. Here n = 20 curves generated from a mean-zero Gaussian process with suitably chosen covariances are evaluated on an equispaced grid of 31 points on [0, 1]. The curves are assumed to come from K populations, and the covariance operator of each population is obtained via perturbations of some given, known covariances Σ_f and Σ_m. These are computed with the fda R package as the covariance operators of the smoothed growth curves of the male and female subjects in the Berkeley growth data set (Jones and Bayley [21]). The perturbations include geodesic perturbations: K_1 < K of the groups have covariance operators perturbed along the geodesic determined by R, the operator minimising the Procrustes distance, by an amount γ ∈ [0, 5], while the other groups are left unperturbed. The number of permutations is again 200. The power is estimated from a total of 500 replications. The test statistics employ the Hilbert-Schmidt norm (r = 2). The probabilities of false positives (type I error) are estimated using all the available replications with γ = 0. Figure 2 compares the power of our transport test with that of Cabassi et al. [6], Kashlak, Aston and Nickl [22], and Hlávka, Hlubinka and Koňasová [20] on these synthetic data. The x-axis gives the value of the parameter γ, while the y-axis displays the empirical power. It is seen that our method is more powerful in all scenarios considered. Moreover, in the case of geodesic perturbations, we achieve near perfect power for small values of γ, i.e. against local alternatives, as opposed to the other tests, which have little to no power there. Furthermore, notice that without knowing the null distribution it is not possible to use the calibration procedure of Kashlak, Aston and Nickl [22], which is too conservative and does not respect the nominal level of 0.05 under H_0.

Tangent space PCA
We validate the PCA framework on a synthetic dataset inspired by the theoretical generative model (5), consisting of covariance operators separated into K groups. The aim is to see whether the PCA is able to differentiate between the groups. The operators Σ_1, ..., Σ_K are obtained as conjugation perturbations of a known Fréchet mean by generated "optimal" maps T_1, ..., T_K as in (5), where the δ_n are drawn from a χ² distribution and the θ^{(j)} are sampled from a von Mises distribution with mean 0 and concentration 1/σ. The Fréchet mean is chosen to be Σ = UΛU* as in Kashlak, Aston and Nickl [22], with U a randomly generated unitary operator and Λ a d × d diagonal matrix with eigenvalue decay of O(d^{-4}), d being the dimension of the matrices. As the generative model yields optimal maps which are small perturbations of the identity, the dimension of the matrices used to approximate the operators needs to be large, otherwise the estimation errors would overwhelm the intrinsic variability of the sample. The dimension is chosen to be 200, the concentration 1, and the number of groups K = 3. For each Σ_j, j = 1, 2, 3, we generate 100 samples of 50 Gaussian curves each. We then estimate the empirical covariance of these curves, obtaining a sample of N = 300 covariances. Results of the PCA are shown in Table 1 and Figure 3.

Data Analysis: Phoneme Periodograms
In this section, we illustrate our method on the phoneme data set considered in Hastie, Buja and Tibshirani [19]. The dataset consists of 4509 log-periodograms of length 256 each, computed from continuous speech frames of 50 male speakers. Each speech frame is 32 msec long, sampled at a rate of 16 kHz, and represents one of the following five phonemes: "aa" (as in "dark", nasal a), "ao" (as in "water"), "iy" (as in "she"), "sh" (as in "she"), "dcl" (as in "dark", "British" d). Each phoneme j gives rise to a covariance operator Σ_j. We use this sample of K = 5 covariances to generate K populations of n Gaussian processes, on which inference will be performed.

ANOVA
In order to perform ANOVA on the phoneme dataset, we extract the log-periodograms corresponding to the phonemes "aa", "ao" and "iy". We limit the test to these three phonemes because of their similarity, which makes them harder to distinguish and allows for better discrimination among the different procedures. If we include all 5 phonemes, the difference between vowel and consonant sounds is so stark that all tests have very high power.
To sample under H_0, we sample 3n log-periodograms for the "iy" phoneme, which are then randomly assigned to three groups, each of size n. To sample under the alternative H_1, we sample n log-periodograms for each phoneme. We repeat the test for n = 25 and n = 50, with 500 replications and 200 permutations. Again we compare with Cabassi et al. [6], Kashlak, Aston and Nickl [22] and Hlávka, Hlubinka and Koňasová [20]. We limit the comparison with Hlávka, Hlubinka and Koňasová [20] to the test statistic using the sample covariance matrix, as this is the scenario that gives consistently better results. We extend their two-sample test to three samples by considering all pairwise comparisons and using the maximum test statistic. When interpreting the results, it is important to treat the outputs carefully: the procedure of Kashlak, Aston and Nickl [22] was unable to produce a result in a small number of cases, as the computation of the distance using the SVD failed, while in some cases of Hlávka, Hlubinka and Koňasová [20], both the test statistics under the null and the p-values are identically 0.
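The permutation calibration used throughout can be sketched generically (a hypothetical helper, not the paper's code; `statistic` stands in for the transport-based test statistic):

```python
import numpy as np

def permutation_pvalue(groups, statistic, n_perm=200, rng=None):
    """Generic permutation calibration: pool the groups, reshuffle the labels,
    recompute the statistic, and report the proportion of permuted statistics
    at least as large as the observed one."""
    if rng is None:
        rng = np.random.default_rng(0)
    sizes = [len(g) for g in groups]
    pooled = np.concatenate(groups)
    observed = statistic(groups)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        parts, start = [], 0
        for size in sizes:
            parts.append(pooled[perm[start:start + size]])
            start += size
        if statistic(parts) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one correction keeps p-values in (0, 1]
```

With a trivial mean-difference statistic, identical groups give p = 1, while well-separated groups give p close to 1/(n_perm + 1).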
Table 2 shows the comparison between the tests on smoothed phoneme log-periodograms. Since the different phonemes have different mean functions, observations must be centered around the sample mean of each group, implying that the type I error probability under H_0 might not be exactly respected. Regardless, the transport test delivers a level very close to the nominal 0.05, especially when n = 50 (which is still relatively low compared to the 256 points where the curves are sampled). The tests of Cabassi et al. [6] and Hlávka, Hlubinka and Koňasová [20] also reach an acceptable nominal level under H_0. This is not the case for Kashlak, Aston and Nickl [22], making the comparison impossible; however, Cabassi et al. [6] show that their test outperforms that of Kashlak, Aston and Nickl [22]. It is worth mentioning that the test of Hlávka, Hlubinka and Koňasová [20] responds very well to variation in the mean, because it takes into account the full distribution through the characteristic functionals. In the phoneme dataset, the mean log-periodogram captures most of the variability. Thus, if one were to consider variation with respect to both the first and second moments simultaneously, the method of Hlávka, Hlubinka and Koňasová [20] would show a significant increase in power. However, for the scope of this work, we are interested specifically in second-order variation. When centering with respect to the mean, the test based on the transport maps greatly outperforms the competing methods under the alternative hypothesis.

Table 2. Comparison of the empirical power of the different testing methods on the phoneme dataset, when applied to the phonemes "aa", "ao" and "iy".

PCA
In this section, we illustrate the use of tangent space PCA of covariance operators by applying it to the phoneme dataset described in Hastie, Buja and Tibshirani [19]. The collection of curves corresponding to each phoneme gives rise to a sample covariance operator, for a total of five covariances. The five empirical covariances are lifted to the tangent space via the log map centered at their Fréchet mean Σ. Subsequently they are scaled by Σ^{1/2}, as explained in Section 3.2. Standard PCA can now be run on these quantities. Figure 4 (left) shows the results of applying PCA to the phoneme data. We see clearly that tangent space PCA captures very well the differences among the phonemes, as each phoneme is isolated in at least one plot. The colours are as follows: "sh" black, "iy" red, "dcl" green, "aa" blue, "ao" cyan. More precisely:
1. The first PC captures (part of) the difference between "aa", "ao" and "iy" (vowels) and "dcl" and "sh" (consonants).
2. The second PC captures (part of) the difference between "dcl" and "sh" (two consonants).
3. The third PC captures (part of) the difference between "aa" and "ao" on one side and "iy" on the other (separating the two similar-sounding vowels from the third, more different one).
4. The fourth PC captures (part of) the difference between "aa" and "ao" (separating the last two remaining, and very similar, sounds).
5. Since the PCs are ordered by the magnitude of their eigenvalues, the analysis also suggests the relative importance of the differences between the operators. As intuition dictates, the difference between vowels and consonants is nearly four times more pronounced than the difference between the sounds "aa" and "ao".
The scree plot (Figure 4, right) shows that four PCs explain the full variance of the data, which is to be expected as we have only five data points. The fourth PC is quite important, explaining 13% of the variance.
In order to test our methodology in a more realistic situation, we artificially enlarge our sample of covariances by subsampling the original data. Specifically, from each of the five phonemes we subsample B = 50 of the corresponding log-periodograms to obtain a new estimator of the covariance operator of that phoneme. We do this G = 12 times, so that in total we have a sample of 5G = 60 covariances, divided into five groups, the covariances within a group being close to each other. We then carry out the PCA on these 60 covariances. The results are shown in Figure 4 (right). Again, we can see that each phoneme is isolated in at least one plot. Figure 5 shows the comparison of the PCA scores in both the Euclidean and the Wasserstein distance. It is seen that the PCA based on the Procrustes tangent space distance is much more successful in distinguishing the covariances of the different phonemes.

Concluding Remarks
This paper introduces a framework that allows the comparison of several populations of stochastic processes with respect to their covariance structure. We contributed a new methodology that exploits the theory of optimal (multi)transport, and demonstrated how taking such a standpoint allows the development of: (a) a testing procedure which outperforms the state of the art, and (b) the first instance of tangent space principal component analysis of covariance operators. A fundamental ingredient of our approach, and the main theoretical contribution of this paper, is the proof that Gaussian measures can always be multicoupled through bounded non-negative linear operators. The existence and boundedness result is elemental to both the testing procedure and the PCA: the test statistic compares these coupling maps to the identity in norms of various strengths, whereas the PCA uses the deviations of these couplings from the identity as a basis for eigenanalysis. Specifically concerning the testing procedure, an appealing aspect of the presented methodology is that it harnesses a genuinely functional effect in order to manifest exceptionally powerful performance under wide classes of alternatives, alternatives for which previous tests would not perform nearly as well.
The genuinely functional effect arises under alternatives for which the optimal coupling maps are not Hilbert-Schmidt (they are merely guaranteed to be bounded). At the population level, this corresponds to ||T_j − I||_2 = ∞. Thus, in the commutative case, the population-level test statistic is finite if and only if the Gaussian measures are equivalent. Whether or not this is the case depends on how differences persist across the whole spectrum, rather than just in the bulk.
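To make this concrete in the commutative case, suppose all covariances diagonalise in a common basis, with eigenvalues λ_n^{(j)} for Σ_j and λ_n for the Fréchet mean (this worked example and its notation are ours):

```latex
% The optimal maps are diagonal with eigenvalues \sqrt{\lambda_n^{(j)}/\lambda_n}, hence
\|t_j - I\|_2^2 \;=\; \sum_{n=1}^{\infty}\Bigl(\sqrt{\lambda_n^{(j)}/\lambda_n} - 1\Bigr)^{2}.
% By the Feldman--Hajek criterion, specialised to this simultaneously
% diagonalisable setting, N(0,\Sigma_j) and N(0,\Sigma) are equivalent precisely when
\sum_{n=1}^{\infty}\Bigl(\lambda_n^{(j)}/\lambda_n - 1\Bigr)^{2} < \infty,
% and, the maps being bounded, the two series are finite or infinite together.
```

Thus the test statistic is finite exactly under equivalence, and diverges under mutual singularity, echoing the potential for perfect discrimination in infinite dimensions.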

Proofs of Formal Statements
Proof of Lemma 1. If Σ_1 = ... = Σ_K, then the unique optimal multicoupling is given by the maps t_j(z) = z and the process Z ∼ N(0, Σ_1). Conversely, if a multicoupling of (γ_1, ..., γ_K) is achieved as the law of (t_1(Z), ..., t_K(Z)) for some process Z and some maps satisfying t_1 = ... = t_K, then γ_i, the law of t_i(Z), is the same for all i, i.e., γ_1 = ... = γ_K, and so Σ_1 = ... = Σ_K. Uniqueness of the optimal multicoupling follows from the first sentence of the proof.
For the proof of Theorem 2, we need the following result from Douglas [10].
By Lemma 3 there exists an operator G with range included in the closed range of Σ such that Q_i^{1/2} = Σ^{1/2}G and ||G||_∞ ≤ √K, and we may write t_i = (Q_i^{1/2}Σ^{-1/2})*(Q_i^{1/2}Σ^{-1/2}). If we can identify G* with Q_i^{1/2}Σ^{-1/2}, then we can conclude that t_i = GG* is well-defined and bounded, with operator norm bounded by K. Let y = Σ^{1/2}z and notice that for all x, ⟨G*y, x⟩ = ⟨Σ^{1/2}z, Gx⟩ = ⟨z, Σ^{1/2}Gx⟩ = ⟨z, Q_i^{1/2}x⟩ = ⟨Q_i^{1/2}z, x⟩. Thus G* = Q_i^{1/2}Σ^{-1/2} on range(Σ^{1/2}). Since G* is bounded, the equality extends to the closure of the range, which is (ker Σ)^⊥. On ker Σ both operators are identically zero.

Fig 1. Empirical power of different tests as a function of the dispersion parameter k in (5). The dotted horizontal line represents the nominal level α = 0.05.

Fig 3. Left: PCA scores, first experiment with the generative model. Colours correspond to the three maps generated from the model. Right: eigenvalue scree plot, first experiment with the generative model.

Fig 4. Left: PCA scores, as computed from the phoneme dataset. The colours are as follows: "sh" black, "iy" red, "dcl" green, "aa" blue, "ao" cyan. Right: scree plot of eigenvalues, phoneme dataset.

Lemma 3. Let 0 ≤ A ≤ B be bounded operators, where A ≤ B means that B − A is non-negative. Then there exists a bounded operator G with ||G||_∞ ≤ 1 such that A^{1/2} = B^{1/2}G and ker G* ⊇ ker B.

Proof of Theorem 2. Let Σ be any Fréchet mean of Σ_1, ..., Σ_K and define Q_i = (Σ^{1/2}Σ_iΣ^{1/2})^{1/2} ≥ 0. The fixed point equation for Fréchet means ([25, Proposition 16]) yields the inequality Q_i ≤ KΣ.

Table 1. Importance of each PC, first experiment with the generative model.