Analysis of proteomics data : Phase amplitude separation using an extended Fisher-Rao metric ∗

Abstract: We consider the problem of alignment and classification of proteomics data, that is described in Koch et al. [4], using the Extended FisherRao (EFR) framework introduced in [6]. We demonstrate this framework by separating amplitude and phase components of functional data from patients having therapeutic treatments for Acute Myeloid Leukemia (AML). Then, using individual functional principal component analysis, for both the phase and amplitude components [8], we obtain bases for principal subspaces and model the data by imposing probability models on principal coefficients. Lastly, using the distances calculated from individual components, we demonstrate a successful discrimination between responders and non-responders to treatment for AML.


Introduction
As described by Koch et al. [4], protein profiling can be used to study changes in protein expression in reference to therapeutic treatments for diseases.In this paper we analyze the protein profiles of five patients with Acute Myeloid Leukemia (AML) referred to in Koch et al. [4].Specifically, we will develop tools for: (1) phase-amplitude separation from the given data, and (2) demonstrating metrics that can potentially assist in decision making and classification to study what different proteins are related to the disease process.The first step is performed by aligning the original functional data using nonlinear warping functions under an extended Fisher-Rao framework [6].This results in the aligned functions (describing amplitude variability) and the warping functions (describing phase variability).Following the alignment, we utilize two metrics for data classification -one of them measures the amplitude variation and is independent of the phase components, and the other measures the phase difference while being independent of the amplitude components.With these metrics, one can also estimate the sample means and covariance on the phase and ampli-tude components, respectively, in their appropriate spaces.While the amplitude space is a vector space and allows standard functional data analysis, the phase components need to be transformed using a square-root transformation to enable use of L 2 norm (and cross-sectional computations) for generating summary statistics.These estimated statistics can be further used to perform functional principal component analysis (fPCA) and imposing probability models on the phase and amplitude components, respectively [8].
The rest of this paper is as follows.We briefly describe the separation, modeling, and comparison of the phase and amplitude components of any functional data in Section 2. This is followed by presentation of results on the proteomic data in Section 3, and the paper ends with the short conclusion in Section 4.

Approach: Extended Fisher-Rao framework
Our general framework for phase-amplitude separation and analysis of curves is adapted from ideas in shape analysis of curves [3,7] and is described more comprehensively in [5,6,8].For a broader introduction to this theory, including asymptotic results and identifiability results, we refer the reader to these papers.
Here we present a very brief review of the method used in our analysis.Let f be a real-valued function with the domain [0, 1]; any other domain can easily be transformed to this interval.For concreteness, only functions that are absolutely continuous on [0, 1] will be considered; let F denote the set of all such functions.Also, let H be the set of boundary-preserving diffeomorphisms of the unit interval [0, 1]: The elements of H play the role of warping functions.For any f ∈ F and h ∈ H, the composition f • h denotes the time-warping of f by h.With the composition operation, the set H is a group with the identity element h id (t) = t.
In our framework we represent a function using the square-root slope function (SRSF) and is defined as: q(t) = sign( ḟ (t)) | ḟ (t)|.For pairwise registration of functions we solve the optimization problem: (q 2 •h) ḣ) using the dynamic programming algorithm.In the process, we evaluate d a which measures their amplitude differences and is independent of their phases or time warpings.For aligning multiple functions, and for separating their phase-amplitude components, we first compute a Karcher mean of the given functions (denoted by µ f in F and µ q is in SRSF space), under the metric d a .
(In F space) : (This Karcher mean has also been called by other names such as the Frechet mean, intrinsic mean or the centroid and is generalization of a Euclidean mean to metric spaces.)As described in [6], the algorithm for computing the Karcher mean also results in: (1) aligned functions fi = {f i • h i }, representing the ampli-tude variability, and (2) the warping functions {h i } used in aligning the original data, representing the phase variability.For more details on this method the reader is referred to [5,6,8] or the companion paper [9].
To quantify phase differences between given functions, we apply the same square-root transform to h and recognize that the set of all ψ(t) = ḣ(t) is an orthant of the unit Hilbert sphere S ∞ ⊂ L 2 .Then, the distance between any two warping functions is exactly the arc-length between their representatives on the sphere S ∞ .We can define a distance between a warping function and the identity function If h is the warping needed to align any two functions, then d p measures the amount of warping needed to align them, and serves as a distance between their phases.One can then use these distancesd a and d p -for classification and further analysis.We also describe how to perform fPCA on the aligned functions (amplitude) and on the warping functions (phase) to study their variability, we will call this horizontal fPCA and veritcal fPCA, respectively.While fPCA on the aligned functions is straightforward (in the L 2 space with its natural metric), the case of warping functions is not straightforward.
Horizontal PCA: As described in [8], we use the SRSF ψ to represent a warping function h, and since the unit Hilbert sphere is a non-linear manifold we choose a vector space tangent to the sphere for analysis; we call this the horizontal fPCA.The tangent space at any point ψ ∈ S ∞ is given by: T In this tangent space we can define a sample covariance function: . In practice, this covariance is computed using a finite number of points, say T , on these functions and one obtains a T × T sample covariance matrix instead, denoted by K ψ .The singular value decomposition (SVD) of K ψ = U ψ Σ ψ V T ψ provides the estimated principal components of observed {ψ i }: the principal directions U ψ,j and the observed principal coefficients v i , U ψ,j .These components can be mapped back to H using the mapping ψ → h(t) = t 0 ψ(s) 2 ds.Vertical PCA: To perform vertical fPCA on the aligned SRSFs we first add the initial value to form a larger vector: g i = [q i f i (0)].This way, the mapping from the function space F to L 2 × R is a bijection.We can define a sample covariance operator for the aligned combined vector g = [q g we can calculate the directions of principal variability in the given SRSFs using the first p ≤ n columns of U g and can be converted back to the function space F , via integration.This processes is called vertical fPCA and for more information on the two methods the reader is referred to [8].

Results on proteomics data
First, we will apply the extended Fisher-Rao framework to separate amplitude and phase components of the proteomics data.This requires the SRSF of the data.It should be noted that if the data is noisy some smoothing [1] can be applied before computing the derivative and SRSF.Second, we will perform fPCA on the separated amplitude and phase components, respectively.We will then construct models on the corresponding components and the models will be validated using random sampling.Lastly, we will perform classification between responders and non-responders to chemotherapy using the amplitude and phase distances, d a and d p , calculated during the alignment process.

Alignment
The original data with markers corresponding to the key peaks in the data is presented in Fig. 1(a).The peaks in the data are not well aligned as the corresponding markers demonstrate.The results of applying our alignment method are presented in Fig. 1(b).The aligned functions exhibit a high level of registration with almost all of the peaks are aligned.There are a few exceptions with peaks 1 and 2 which have a very low amplitude and that makes their registration difficult.
We can also quantify the alignment performance using the decrease in the cumulative cross-sectional variance of the aligned functions.For any functional dataset {g i (t  where {f i } is the set of original functions, { fi } is the set of aligned functions, and {µ f • h i } is the set of applying the warping functions {h i } to µ f , which is the mean function after alignment.From the decrease in the amplitude variance and increase in the phase variance we can quantify the level of alignment.Furthermore, the corresponding warping functions are presented in the right panel of Fig. 2. Fig. 3 presents a zoom in on a region of the data from 81.4 to 111 time samples.The top panel is the original data where we see very poor alignment of the peaks.The bottom panel is the corresponding aligned data using the extended Fisher-Rao framework, where a nice alignment of the peaks and valleys have occurred.

Modeling
In this section, we present the results of applying horizontal and vertical fPCA on the warping and aligned functions, respectively.First, we perform horizontal fPCA on the warping functions and the first two principal directions are presented in Fig. 4(a) and (b), respectively.It can be see that most of the variation is contained in the first principal direction with minor perturbations contained in the next principal direction.
Next, we analyzed the aligned SRSFs by performing vertical fPCA.The first two vertical principal-geodesic paths are shown in Fig. 5 (a) and (b), respectively.The first 5 singular values for the data are: 3.89, 1.94, 1.49, 1.10, and 0.95 with the rest being negligibly small.Visually most of the variation lies in the first principal direction, which can also be attributed to the largeness of the first singular value relative to the other singular values Once we have obtained the fPCA coefficients for the horizontal and vertical components we can then impose a probability model on the coefficients and induce a distribution on the function space F .Using the joint Gaussian model described in [8] we randomly generate 35 domain-warping functions and 35  corresponding amplitude functions (Fig. 6(a)), and the set of random samples (Fig. 6(b)).Comparing the random samples with the original data (Fig. 1(a)) we conclude that the samples are very similar to the original data and, at least under a visual inspection, the proposed models are successful in capturing the variability in the given data.

Classification
In this section, we present the results of using the amplitude distance, d a , and phase distance, d p , for classification between responders and non-responders to chemotherapy.
We first compute the standard L 2 distance between each pair, i.e., d ij L 2 = f i − f j , i, j = 1, . . ., n.The matrix of pairwise L 2 distances is shown as a gray scale image in left panel in Fig. 7.This image of the pairwise distances looks unstructured, highlighting the difficulty of classification under this metric.Based on this distance matrix, we perform classification by using nearest neighbor classifier.We then measure the error rate of this classifier using leaveone-out (LOO) cross-validation and found that the accuracy is 0.27 (4/15).Then, we computed distances d a and d p between all pairs of functions and these distance matrices are shown as gray scale images in the middle and right panels in Fig. 7, respectively.In the image of d a (middle panel Fig. 7), we find that the pairwise distances are more structured than the L 2 distances.We also perform classification using the LOO cross-validated nearest-neighbor based on the d a distances.The accuracy turns out to be 0.87 (13/15), a significant improvement over the standard L 2 result (0.27).We find that the d p distances do not have strong classification performance, which can be noted by the lack of structure in the right panel of Fig. 7.The classification accuracy using d p turns out to be 0.33 (5/15), which is only slightly higher than the standard L 2 norm in the function space.The clear best performance of d a is very consistent with the expectation that relative amounts of peptides should drive these differences, as stated in [4].Since d a and d p each only partially describe the variability in the data, which corresponds to the phase and amplitude differences between the functions, there is a possibility of improvement if d a and d p are used jointly.One simple idea is to linearly combine these two distances and use the weighted distance to perform classification on the data.Define d τ as the weighted average d τ = τ d p +(1−τ )d a , of d a and d p .We found an optimal of τ = 0.1 provides a LOO accuracy of 0.93 (14/15), which is higher than the accuracy of the L 2 , d a , and d p distances.This indicates that there is some information carried in the phase than previously thought, however a larger number of samples would be required to justify this claim.Moreover, if there is more contribution from the phase it would suggest a batch effect or other nonrandom bias in the data that would have to be studied further.
For comparison, we computed a "naive" distance, d N aive , which corresponds to the quantity min h f 1 − f 2 • h that has often been used in the literature for function alignment.We also perform the cross-validated nearest-neighbor using

Fig 1 .
Fig 1.Alignment of proteomics data using the square-root slope framework with original data in Panel a) and aligned functions in Panel b).The aligned functions exhibit high level of registration of marked peaks.

Fig 2 .
Fig 2. The aligned proteomics data (left panel) with corresponding observed warping functions (right panel).

Fig 3 .Fig 4 . 6 .Fig 5 .
Fig 3. Zoom in on original proteomics data (top panel) and aligned proteomics data (bottom panel).The original data peaks are varying in time locations while the aligned data has tight time alignment of all the peaks.

Fig 6 .
Fig 6.Random samples from from the amplitude model (a) and combination of the random samples with random samples from the phase model (Fig. 4(c)).

Fig 7 .
Fig 7. The pairwise distances using the L 2 , da, and dp distances, in the left, middle, and right panels, respectively.