Selecting the length of a principal curve within a Gaussian model

Principal curves are parameterized curves passing "through the middle" of a data cloud. They provide a nonlinear generalization of the first principal component in Principal Component Analysis. Several definitions of a principal curve have been proposed, one of which can be expressed as a least-square minimization problem. In the present paper, adopting this definition, we study a Gaussian model selection method for choosing the length of the principal curve, in order to avoid interpolation, and we obtain a related oracle-type inequality. The proposed method is implemented in practice and illustrated on cartography problems.


Introduction
Principal curves can be thought of as a nonlinear generalization of Principal Component Analysis. Instead of searching for the first principal component of a data cloud, the purpose is to design a curve passing "through the middle" of the observations, as illustrated in Figure 1. Principal curves have many applications in various areas, such as physics (Hastie and Stuetzle [21], Friedsam and Oren [19]), character and speech recognition (Kégl and Krzyżak [22], Reinhard and Niranjan [30]), but also mapping and geology (Brunsdon [10], Stanford and Raftery [34], Banfield and Raftery [3], Einbeck, Tutz and Evers [17, 18]), natural sciences (De'ath [14], Corkeron, Anthony and Martin [13], Einbeck, Tutz and Evers [17]) and medicine (Wong and Chung [38], Caffo, Crainiceanu, Deng and Hendrix [11]).

Let X denote a random vector of R^d. The original definition of a principal curve, due to Hastie and Stuetzle [21], relies on the self-consistency property of principal components. A smooth (infinitely differentiable) parameterized curve f(t) = (f_1(t), ..., f_d(t)), defined on a closed interval I = [a, b] of the real line, is a principal curve for X if f does not intersect itself, has finite length inside any bounded subset of R^d, and is self-consistent, which means that

	E[X | t_f(X) = t] = f(t)   for almost every t.   (1)

Here, the so-called projection index t_f(x) is defined by

	t_f(x) = sup{ t ∈ I : ‖x − f(t)‖ = inf_{t'∈I} ‖x − f(t')‖ },

where ‖·‖ denotes the standard Euclidean norm of R^d. So, t_f(x) is the largest real number t minimizing the Euclidean distance between x and f(t), as shown in Figure 2.
The self-consistency property may be interpreted by saying that each point of the curve f is the mean of the observations projecting on f around this point.
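To make the definition concrete, the projection index can be approximated numerically by discretizing the curve and taking the largest parameter achieving the minimal distance, matching the supremum in the definition. The following Python sketch is our own illustration, not code from the paper:

```python
import numpy as np

def projection_index(x, curve_t, curve_pts):
    """Approximate t_f(x): the largest parameter t minimizing the
    distance between x and f(t), over a discretization of the curve."""
    d2 = np.sum((curve_pts - x) ** 2, axis=1)
    # indices achieving the minimal distance (up to numerical tolerance)
    argmins = np.where(np.isclose(d2, d2.min()))[0]
    return curve_t[argmins[-1]]

# Example: f(t) = (cos t, sin t) on [0, pi]; the point (0, 2) projects
# onto the top of the half-circle, so t_f is close to pi/2
t = np.linspace(0.0, np.pi, 2001)
f = np.column_stack((np.cos(t), np.sin(t)))
print(projection_index(np.array([0.0, 2.0]), t, f))
```

On a finer grid the approximation improves; ties are resolved towards the largest t, as required by the definition of t_f.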
A number of other points of view, more or less related to this original definition, have been proposed thereafter. Tibshirani [35], keeping the self-consistency property, adopts a semiparametric approach and defines principal curves in terms of a mixture model, whereas Delicado [15] generalizes another property of principal components, leading to the notion of "principal curves of oriented points". The definitions of Verbeek, Vlassis and Kröse [36] and Einbeck, Tutz and Evers [18] are based on local principal components put together, and the "locally defined principal curves" of Ozertem and Erdogmus [28] correspond to the ridge of a density function. Recently, Genovese, Perone-Pacifico, Verdinelli and Wasserman [20] discussed a closely related problem, called nonparametric filament estimation.
Figure 2: The projection index t_f. For all i, t_i stands for t_f(x_i).
In this paper, we will adopt the principal curve definition of Kégl, Krzyżak, Linder and Zeger [23], which is closely related to the original one, but presents the advantage of avoiding the implicit formulation. Instead, this definition takes the form of an empirical risk minimization problem, which is easier to handle than (1). In the definition of Kégl, Krzyżak, Linder and Zeger [23], a principal curve of length L for X is a parameterized curve minimizing the least-square type criterion

	Δ(f) = E[ inf_{t∈I} ‖X − f(t)‖² ]

over all curves of length not larger than L > 0. Such a principal curve always exists provided E‖X‖² < ∞, but it may not necessarily be unique. Note that Sandilya and Kulkarni [31] have proposed a similar definition, using a constraint on the turn instead of the length of the curve.
In practice, the distribution of X is unknown, and we have at hand a sample X_1, ..., X_n of independent random variables distributed as X. In this situation, Δ(f) is replaced by its empirical counterpart

	Δ_n(f) = (1/n) Σ_{i=1}^n inf_{t∈I} ‖X_i − f(t)‖².

In order to construct a satisfactory principal curve, a good choice of the length is crucial. Indeed, a principal curve constrained to have too small a length will not be able to capture the shape of the data, whereas a too long curve may lead to interpolation problems, as illustrated in Figure 3. In the present contribution, we propose to study the length selection problem, using the approach of non-asymptotic model selection by penalization introduced by Birgé and Massart [7] and Barron, Birgé and Massart [4]. To this end, we will consider a Gaussian framework. A related point of view in the context of almost surely bounded random variables is discussed in Biau and Fischer [6]. The rest of the paper is organized as follows. In Section 2, we consider the problem of choosing the length of a principal curve using a Gaussian model selection approach, and show that the curve obtained by minimizing an appropriate penalized criterion satisfies an oracle-type inequality. Section 3 presents some experimental results in the context of cartography. Proofs are collected in Section 4 for the sake of clarity.
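For a polygonal line, the empirical criterion Δ_n can be computed exactly, since projecting onto the curve reduces to projecting onto finitely many segments. A minimal Python sketch (the function names are ours):

```python
import numpy as np

def dist2_to_segment(x, a, b):
    """Squared Euclidean distance from point x to the segment [a, b]."""
    ab = b - a
    t = np.clip(np.dot(x - a, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.sum((x - (a + t * ab)) ** 2)

def empirical_criterion(X, vertices):
    """Delta_n(f) for a polygonal line f given by its vertices: the
    average squared distance of each observation to the curve."""
    segs = list(zip(vertices[:-1], vertices[1:]))
    return np.mean([min(dist2_to_segment(x, a, b) for a, b in segs)
                    for x in X])

# Example: two points at vertical distance 1 from the segment (0,0)-(1,0)
X = np.array([[0.25, 1.0], [0.75, -1.0]])
V = np.array([[0.0, 0.0], [1.0, 0.0]])
print(empirical_criterion(X, V))  # 1.0
```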

Length selection
We investigate a Gaussian model selection method in order to choose the length of a principal curve. Our context is similar to that of Caillerie and Michel [12], who tackle model selection questions for graphs called simplicial complexes. In the sequel, the Euclidean space R^d is equipped with the inner product defined by

	⟨u, v⟩ = (1/d) Σ_{j=1}^d u_j v_j,   (2)

and ‖·‖ denotes the associated Euclidean norm (the same normalized norm will be used on R^{nd}).
We assume that we observe random vectors X_1, ..., X_n with values in R^d following the model

	X_i = x*_i + σ ξ_i,   i = 1, ..., n,   (3)

where the x*_i are unknown, the ξ_i are independent standard Gaussian vectors of R^d and σ > 0 stands for the noise level, which is supposed known. Let us denote by →X = ᵗ(ᵗX_1, ..., ᵗX_n) the (column) vector made of all coordinates of the random vectors X_i, i = 1, ..., n. Defining →x* and →ξ in the same way, the model (3) can be rewritten under the form

	→X = →x* + σ→ξ.

Let F and G be two fixed points of R^d and L a countable subset of ]0, +∞[. We introduce a countable collection {F_ℓ}_{ℓ∈L}, where each set F_ℓ is a class of parameterized curves f : I → R^d with length ℓ and endpoints F and G. Our aim is to select the length ℓ. To do this, we consider the criterion Δ'_n defined by

	Δ'_n(f) = (1/n) Σ_{i=1}^n inf_{y∈Γ_f} ‖X_i − y‖²,

where Γ_f denotes the range of the curve f. Due to the definition of the norm ‖·‖ chosen above (2), this is the empirical criterion Δ_n(f) normalized by the dimension d. Suppose that, for all ℓ ∈ L, there exists a minimizer →x̂_ℓ of ‖→X − →z‖² over the model C_ℓ = {→z = ᵗ(ᵗz_1, ..., ᵗz_n) : z_i ∈ Γ_f for all i, f ∈ F_ℓ}. In order to determine the length ℓ, our purpose is to minimize in ℓ a criterion of the type

	crit(ℓ) = ‖→X − →x̂_ℓ‖² + pen(ℓ),

where pen : L → R_+ is a penalty function, which should avoid the selection of a too large ℓ. Our goal is to design an appropriate penalty. Observe that the classical asymptotic model selection criteria AIC (Akaike [1]), BIC (Schwarz [33]) or Mallows' C_p (Mallows [26]), which involve the "number of parameters" to be estimated, are not suitable in this framework. Therefore, our approach will rely on the non-asymptotic model selection theory developed by Birgé and Massart [8] and Barron, Birgé and Massart [4].
When the considered models are linear subspaces, the penalty can be chosen proportional to the dimension of the model, according to Birgé and Massart [8]. Here, the models C_ℓ are not linear subspaces of R^{nd}, and the dimension must be replaced by another quantity.
In order to measure the complexity of these nonlinear models, we will use metric entropy.
The metric entropy of a set S is given by H(S, ‖·‖, ε) = ln N(S, ‖·‖, ε), where the covering number N(S, ‖·‖, ε) is the minimal number of balls with radius ε for the norm ‖·‖ needed to cover S.
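In the simplest case, S = [0, ℓ] in R with the usual norm, covering requires ⌈ℓ/(2ε)⌉ intervals of radius ε, so the metric entropy grows like ln(ℓ/ε). A tiny sketch, with names of our own choosing:

```python
import math

def interval_covering_number(length, eps):
    """Minimal number of balls (intervals) of radius eps covering [0, length]."""
    return math.ceil(length / (2 * eps))

def metric_entropy(length, eps):
    """H([0, length], eps) = ln N([0, length], eps)."""
    return math.log(interval_covering_number(length, eps))

print(interval_covering_number(10.0, 0.5))  # 10
```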
Our approach is based on a general model selection theorem for nonlinear Gaussian models (Massart [27]). For every ℓ ∈ L, let ϕ_ℓ be a function such that ϕ_ℓ ≥ φ_ℓ, where φ_ℓ is given by the Dudley-type entropy integral

	φ_ℓ(r) = κ ∫_0^r √(H(C_ℓ, ‖·‖, ε)) dε,   (4)

with κ an absolute constant. We then define d_ℓ by the equation ϕ_ℓ(r_ℓ) = (nd/(4σ)) r_ℓ², where r_ℓ = 2σ√(d_ℓ/(nd)). Assume that there exists a family of weights {w_ℓ}_{ℓ∈L} satisfying Σ_{ℓ∈L} e^{−w_ℓ} = Σ < ∞.
Under these assumptions and with this notation, Theorem 4.18 in Massart [27] can be written in the following manner: provided the penalty pen(ℓ) is large enough, there exists almost surely a minimizer ℓ̂ of the penalized criterion, and, writing x̂_i := x̂_{i,ℓ̂} for all i = 1, ..., n, the x̂_i satisfy an oracle-type inequality in expectation. Provided a control of the Dudley integral (4) (Dudley [16]), this theorem will apply in our context and allow us to select the length ℓ of the curve. To assess this integral, we will need some technical lemmas, which are proved in Section 4.
The first step consists in controlling the metric entropy of the classes C_ℓ, ℓ ∈ L. Note that, for all ℓ ∈ L, the union ∪_{f∈F_ℓ} Γ_f is contained in an ellipsoid of R^d, as stated in the next lemma. In the sequel, this ellipsoid will be denoted by E_ℓ.
Lemma 2.1. Every parameterized curve of R^d with endpoints F and G and length ℓ (ℓ > FG) is included in an ellipsoid E_ℓ with first principal axis of length ℓ, the other axes having length λ = √(ℓ² − FG²). In particular, in R², E_ℓ is an ellipse with foci F and G (see Figure 4), and in R³, it is an ellipsoid of revolution around the axis passing through these two points.
Figure 4: In the plane R², the ellipse E_ℓ with foci F and G and axes of length ℓ and λ.
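Lemma 2.1 can be checked numerically: for any polygonal path from F to G of total length ℓ, every vertex M satisfies MF + MG ≤ ℓ, i.e., lies inside E_ℓ, because the two arcs joining M to the endpoints are at least as long as the corresponding segments. A quick sketch (the random path is a toy example of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
F, G = np.array([-1.0, 0.0]), np.array([1.0, 0.0])

# A random polygonal path from F to G (a stand-in for a rectifiable curve)
pts = np.linspace(F, G, 50) + 0.2 * rng.standard_normal((50, 2))
pts[0], pts[-1] = F, G

# Total length of the polygonal line
ell = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))

# Every point M of the curve satisfies MF + MG <= ell, i.e. M lies in the
# ellipsoid E_ell with foci F and G
inside = all(np.linalg.norm(M - F) + np.linalg.norm(M - G) <= ell + 1e-12
             for M in pts)
print(inside)  # True
```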
We then obtain an upper bound for the covering number of C_ℓ (Lemma 2.2). Bounding the corresponding Dudley integral for all ℓ ∈ L, we can then define an adequate function ϕ_ℓ.
Lemma 2.3. There exists a concave function ϕ_ℓ, constructed explicitly in Section 4, satisfying ϕ_ℓ ≥ φ_ℓ.

Finally, in order to apply Theorem 2.1, we have to assess d_ℓ, defined by the equation above, which is the purpose of the next lemma.
Lemma 2.4. Let ϕ_ℓ be given by Lemma 2.3, and suppose that the noise level σ is not too large with respect to λ; this requirement will be referred to as condition (5). Then, the equation defining d_ℓ admits a unique solution.

We are now in a position to state the main result of this section.
Theorem 2.2. Assume that there exists a family of weights {w_ℓ}_{ℓ∈L} such that Σ_{ℓ∈L} e^{−w_ℓ} = Σ < ∞, and that condition (5) holds for every ℓ ∈ L. Then, there exist constants c_1 and c_2 such that, for all η > 1, if the penalty pen(ℓ) is at least as large as the quantity (6) involving c_1, c_2, η and the weights w_ℓ, then, almost surely, there exists a minimizer ℓ̂ of the penalized criterion. Moreover, if x̂_i = x̂_{i,ℓ̂} for all i = 1, ..., n, the x̂_i satisfy an oracle-type inequality in expectation.

Let us now comment on the theorem.
The first remark is about the fact that Theorem 2.2 involves unknown constants. Lemma 2.4 indicates that c_1 = 16κ² and c_2 = π − ln(2κ√π) could be chosen. However, these values are upper bounds, and likely too large. Furthermore, the noise level σ has been supposed known and is involved in the penalty. Nevertheless, the noise level is generally unknown in practice. In fact, the expression (6) does not provide a penalty function directly, but gives its shape instead. Note that it is possible to estimate σ separately and then proceed by plug-in. However, there is another way to assess c_1, c_2 and σ, relying on the slope heuristics. This penalty calibration method, introduced by Birgé and Massart [9] (see also Arlot and Massart [2], Lerasle [25] and Saumard [32]), precisely makes it possible to tune a penalty known up to a multiplicative constant.
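To fix ideas, here is a minimal sketch of the slope-estimation variant of the slope heuristics, assuming the values of the empirical contrast and of the penalty shape are available for a range of models. The function names and the factor-2 rule below reflect the standard heuristic, not code from the paper:

```python
import numpy as np

def slope_heuristic(shape, contrast, n_large=None):
    """Slope-heuristic calibration sketch: for complex models the
    empirical contrast behaves like -c * shape + const; the calibrated
    penalty is then taken as 2 * c * shape."""
    n_large = n_large or len(shape) // 2
    s = np.asarray(shape[-n_large:])      # most complex models only
    y = np.asarray(contrast[-n_large:])
    a, b = np.polyfit(s, y, 1)            # least-squares line y = a*s + b
    return lambda sh: 2.0 * (-a) * np.asarray(sh)

# Toy check: a contrast exactly linear with slope -3 gives penalty 6 * shape
shape = np.linspace(1.0, 10.0, 20)
contrast = -3.0 * shape + 7.0
pen = slope_heuristic(shape, contrast)
print(pen(np.array([1.0])))  # approximately [6.]
```

The dimension-jump variant would instead locate the abrupt change in the selected complexity as the multiplicative constant increases.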
According to the formula binding ℓ and λ, the quantity ln(ℓ^{1/d} λ^{1−1/d}) in the penalty characterizes each model of curves with length ℓ. The other elements varying over the collection of models are the weights {w_ℓ}_{ℓ∈L}. For linear models with dimension D_ℓ, a possible choice for w_ℓ is w_ℓ = w(D_ℓ), where w(D) = cD + ln |{k ∈ L, D_k = D}| and c > 0 (see Massart [27]). If there is no redundancy in the model dimensions, this strategy amounts to choosing w_ℓ proportional to D_ℓ. By analogy, w_ℓ may here be chosen proportional to ln(ℓ^{1/d} λ^{1−1/d}). More formally, we set w_ℓ = c ln(ℓ^{1/d} λ^{1−1/d}), where the constant c > 0 is such that Σ_{ℓ∈L} e^{−w_ℓ} < ∞. This choice finally yields a penalty proportional to ln(ℓ^{1/d} λ^{1−1/d}), which may be calibrated in practice thanks to the slope heuristics.
In addition, observe that condition (5) says that the noise level σ should not be too large with respect to λ. In other words, if λ = √(ℓ² − FG²) is of the same order as σ, it is not possible to obtain a suitable principal curve with length ℓ.
Finally, let us point out that, due to the exponent n in the covering number in Lemma 2.2 (a comment is given in Section 4 after the proof of the lemma, see Remark 4.1), the penalty shape obtained does not tend to 0 as n tends to infinity. This point is intrinsically related to the geometry of the problem. Indeed, its resolution is not made easier by increasing the size of the sample, since nothing has been specified about the spatial distribution of the x*_i's. A possible direction for future research could consist in dealing with the framework in which these points are distributed along the curve following a uniform distribution.

Experimental results
In this section, we propose to illustrate the length selection method in practice. The experiments presented here are carried out with the software MATLAB. As announced in Section 2, the penalty given in Theorem 2.2 will be calibrated thanks to the slope heuristics. Two strategies may be used: the dimension jump method consists in identifying an abrupt jump in the model complexity, whereas the other solution is to observe that the empirical contrast is proportional to the penalty shape for complex models and to use the slope of this line to assess the constant.
In this practical implementation, we considered polygonal lines, which present the advantage that projecting on the curve reduces to projecting on segments. However, the method described below could probably be replaced by a more sophisticated technique dealing with smooth curves. In the sequel, the maximal number k of segments is taken large enough, to ensure that the only parameter reflecting the complexity of the curve is the length. Then, the length ℓ of the principal curve is chosen as follows:

1. For a range of values of the length ℓ, compute f̂_ℓ by minimizing the empirical criterion Δ_n(f) and record the value Δ_n(f̂_ℓ).
2. Let w_ℓ be proportional to ln(ℓ^{1/d} λ^{1−1/d}) and consider a penalty of the form pen(ℓ) = c ln(ℓ^{1/d} λ^{1−1/d}).
3. Select the constant ĉ using the slope heuristics.
4. Retain the curve f̂_ℓ̂ obtained by minimizing the penalized criterion Δ_n(f̂_ℓ) + pen(ℓ).

In step 1 of the algorithm, the criterion Δ_n(f) is minimized by using a MATLAB optimization routine.
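The steps above can be sketched as follows, assuming a routine fit_curve(X, ℓ) returning a fitted curve together with its empirical criterion, and an already calibrated constant ĉ (all names here are hypothetical):

```python
import numpy as np

def select_length(X, lengths, fit_curve, penalty_shape, c_hat):
    """Sketch of the selection procedure: fit a principal curve for each
    candidate length, then minimize the penalized criterion
    Delta_n(f_hat_l) + c_hat * penalty_shape(l)."""
    crits = []
    for l in lengths:
        _, delta_n = fit_curve(X, l)
        crits.append(delta_n + c_hat * penalty_shape(l))
    return lengths[int(np.argmin(crits))]

# Toy run: contrast 4/l decreasing in the length, logarithmic penalty shape
best = select_length(None, [1, 2, 3, 4, 5, 6],
                     lambda X, l: (None, 4.0 / l), np.log, c_hat=1.0)
print(best)  # 4
```

The toy contrast decreases with ℓ while the penalty grows, so the penalized criterion has an interior minimum, mimicking the trade-off between under- and over-fitting of Figure 3.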
The weights w ℓ were chosen as suggested in the discussion after Theorem 2.2.This is a convenient choice, which does not modify the penalty shape.
To apply the slope heuristics in step 3, we employ the MATLAB package CAPUSHE, implemented by Baudry, Maugis and Michel [5]. We tried both the dimension jump and the slope estimation methods, whose results were found to be very similar.
Finally, recall that the endpoints F and G of the principal curve have been assumed to be fixed. From a practical point of view, several methods can be employed to choose these two points from the observations. A possible solution is to define F and G as the two points that are farthest from each other in the minimum spanning tree of the data (or of some subset of the data). Figure 5 gives some examples of such trees, which can be constructed using Kruskal's algorithm [24] or Prim's algorithm [29].
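A possible implementation of this endpoint-selection step, using Prim's algorithm and path distances in the resulting tree (the helper names are ours):

```python
import numpy as np
from collections import deque

def prim_mst(points):
    """Prim's algorithm on the complete Euclidean graph of the data
    points; returns the tree edges and the pairwise distance matrix."""
    n = len(points)
    D = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    visited = {0}
    best, parent = D[0].copy(), np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        j = min((k for k in range(n) if k not in visited),
                key=lambda k: best[k])
        edges.append((int(parent[j]), j))
        visited.add(j)
        for k in range(n):
            if k not in visited and D[j, k] < best[k]:
                best[k], parent[k] = D[j, k], j
    return edges, D

def farthest_pair(edges, n, D):
    """Pair of vertices at maximal path distance in the tree (breadth-
    first traversal from every vertex, accumulating edge lengths)."""
    adj = {i: [] for i in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = np.zeros((n, n))
    for s in range(n):
        seen, q = {s}, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    dist[s, v] = dist[s, u] + D[u, v]
                    q.append(v)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    return int(i), int(j)

# Points roughly along a path: the farthest pair gives candidates for F, G
pts = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.0], [3.0, 0.1]])
edges, D = prim_mst(pts)
F_idx, G_idx = farthest_pair(edges, len(pts), D)
print(F_idx, G_idx)  # 0 3
```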
Figure 5: Some examples of minimum spanning trees.
We will now present two examples of applications to mapping. Indeed, Brunsdon [10] has shown that principal curves may be useful in that area, in order to estimate paths from GPS tracks. More specifically, principal curves can serve as a means to compute an average path from GPS data recorded by several people moving along a given street.
The results obtained for a hairpin street located in France and for the Labyrinth of the Jardin des Plantes in Paris are shown in Figures 6 and 7 respectively. Each figure first gives an aerial photograph of the place and the corresponding GPS track data points. Then, the resulting principal curve is shown both on the data cloud and as an overlay on the photograph, which allows one to assess the performance of the method. Moreover, the principal curves fitted using our model selection approach (denoted by LS in the sequel) can be compared, for both data sets, to those obtained with a benchmark algorithm. Indeed, Figure 8 gives the outputs of the Polygonal Line Algorithm (PL hereafter), which is based on a local control of the curvature and was proposed by Kégl, Krzyżak, Linder and Zeger [23]. On the whole, the considered streets are correctly recovered. For the hairpin road, we observe that the PL output is smoother, but the right-hand end of the curve looks better on the LS result. Note that the direction taken by this part of the PL curve is wrong.
Regarding the Labyrinth, LS performs quite well and yields this time the smoothest curve. The PL principal curve is not as smooth as before: this can be explained by the fact that we used the default parameters of this algorithm for both experiments, and these may be more or less suitable depending on the characteristics of the data set, such as curvature or sampling density.
Noting that, in the first example, a part of the street is somewhat hidden by trees, a particularly interesting application could be to use principal curves to draw maps of paths in forests.

Proof of Lemma 2.1
Let c = FG/2. Note that ℓ > 2c. In a well-chosen orthonormal coordinate system of R^d, F has coordinates (−c, 0, ..., 0) and G has coordinates (c, 0, ..., 0). A curve with length ℓ and endpoints F and G is included in the set delimited by the points M(x_1, ..., x_d) such that

	MF + MG ≤ ℓ,

since each of the two arcs joining M to the endpoints is at least as long as the corresponding line segment. Let us show that the equation MF + MG = ℓ defines an ellipsoid with first principal axis of length ℓ, the other axes having length λ. Let M(x_1, ..., x_d) be such that MF + MG = ℓ. Then,

	MF² − MG² = (x_1 + c)² − (x_1 − c)² = 4cx_1.

Therefore,

	MF − MG = 4cx_1/ℓ.

As a result, on the one hand, MF = ℓ/2 + 2cx_1/ℓ, and, on the other hand, MG = ℓ/2 − 2cx_1/ℓ. Hence,

	(ℓ/2 + 2cx_1/ℓ)² = MF² = (x_1 + c)² + Σ_{j=2}^d x_j²,

which may be rewritten

	ℓ²/4 + 2cx_1 + 4c²x_1²/ℓ² = x_1² + 2cx_1 + c² + Σ_{j=2}^d x_j²,

or, equivalently,

	x_1²/(ℓ²/4) + (Σ_{j=2}^d x_j²)/(ℓ²/4 − c²) = 1,   (7)

where ℓ²/4 − c² > 0 since ℓ > 2c. In other words, the point M belongs to an ellipsoid with one axis of length ℓ and d − 1 axes of length 2√(ℓ²/4 − c²) = λ. Reciprocally, if M(x_1, ..., x_d) satisfies equation (7), with ℓ²/4 − c² > 0, then MF + MG = ℓ, since otherwise, setting ℓ' = MF + MG ≠ ℓ, the same computation would yield equation (7) with ℓ' instead of ℓ, which is impossible.

Proof of Lemma 2.2
First, we shall compute the covering number of an ellipsoid E_ℓ of R^d. The next lemma is a particular case of Proposition 5 in von Luxburg, Bousquet and Schölkopf [37].
Lemma 4.1. Assume that ℓ ≥ λ ≥ ε. The number of balls of radius ε needed to cover E_ℓ, an ellipsoid in dimension d with principal axes ℓ, λ, ..., λ, satisfies

	N(E_ℓ, ‖·‖, ε) ≤ 2^d ℓλ^{d−1}/ε^d.

Proof of Lemma 4.1. The number of balls of radius ε needed to cover E_ℓ satisfies

	N(E_ℓ, ‖·‖, ε) ≤ (⌊ℓ/ε⌋ + 1)(⌊λ/ε⌋ + 1)^{d−1},

where ⌊y⌋ denotes the floor of y, i.e., the largest integer less than or equal to y. Indeed, the ellipsoid E_ℓ is inscribed in a parallelepiped with sides of lengths ℓ, λ, ..., λ, and the number of balls of radius ε needed to cover a parallelepiped with sides of lengths c_1, ..., c_d is at most Π_{j=1}^d (⌊c_j/ε⌋ + 1). By assumption, ℓ ≥ λ ≥ ε, so that ℓ/ε + 1 ≤ 2ℓ/ε and λ/ε + 1 ≤ 2λ/ε. Hence,

	N(E_ℓ, ‖·‖, ε) ≤ 2^d ℓλ^{d−1}/ε^d.

Thus, according to Lemma 4.1, the ellipsoid E_ℓ can be covered by a collection U of at most 2^d ℓλ^{d−1}/ε^d balls of radius ε. For →u = ᵗ(ᵗu_1, ..., ᵗu_n), where the u_i's are centers of balls of U, we have Π_{i=1}^n B(u_i, ε) ⊂ B(→u, ε) (balls for the normalized norm of R^d and of R^{nd} respectively). Consequently,

	N(C_ℓ, ‖·‖, ε) ≤ (2^d ℓλ^{d−1}/ε^d)^n.

Remark 4.1. It is not possible to get rid of the exponent n in this covering number computation. Indeed, if we consider only one curve f of length ℓ, covered by N balls of R^d, N^n balls of R^{nd} are needed to cover (Γ_f)^n, since every point among the n considered points along the curve can be in any one of the N balls. However, our upper bound for the number of balls of R^{nd} needed to cover C_ℓ is probably too large. Indeed, since the n points are constrained to be located on the same curve of length ℓ, we would not need all the balls of R^{nd} obtained by combining the centers of the balls of R^d covering E_ℓ. We could undoubtedly get a better upper bound by solving the combinatorial problem consisting in counting all acceptable combinations of balls of R^d for a class C_ℓ.
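The grid-covering argument behind Lemma 4.1 can be checked numerically in dimension d = 2 for a rectangle, using the normalized norm of Section 2: a grid of step ε covers the rectangle with balls of radius ε, and the number of centers stays below 2^d ℓλ^{d−1}/ε^d. The test values below are our own:

```python
import numpy as np

# Rectangle [0, l] x [0, lam], normalized norm ||u|| = sqrt((u1^2 + u2^2)/2)
l, lam, eps = 5.0, 2.0, 0.25
xs = np.arange(eps / 2, l, eps)
ys = np.arange(eps / 2, lam, eps)
centers = np.array([(x, y) for x in xs for y in ys])

# Every point of the rectangle is within eps of some center: each
# coordinate differs by at most eps/2, so the normalized norm is <= eps/2
rng = np.random.default_rng(1)
pts = rng.uniform([0, 0], [l, lam], size=(1000, 2))
d2 = ((pts[:, None, :] - centers[None, :, :]) ** 2).mean(axis=2)
covered = np.sqrt(d2.min(axis=1)).max() <= eps

# Count of centers versus the 2^d * l * lam / eps^d bound of Lemma 4.1
bound_ok = len(centers) <= 2**2 * l * lam / eps**2
print(covered, bound_ok)  # True True
```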

Proof of Lemma 2.3
We begin with a technical lemma, which will be useful for bounding integrals.
Back to the proof of Lemma 2.3, note that, according to Lemma 2.2, the metric entropy satisfies H(C_ℓ, ε) ≤ n ln(2^d ℓλ^{d−1}/ε^d). For r ≤ λ, this yields a concave upper bound ϕ_ℓ on the entropy integral, and looking for a solution d_ℓ amounts to solving, for r > 0, the equation ϕ_ℓ(r) = (nd/(4σ)) r². This equation admits a unique solution r_ℓ = 2σ√(d_ℓ/(nd)), since ϕ_ℓ is concave and r ↦ r² is convex. Moreover, the solution r_ℓ satisfies r_ℓ ≤ λ if, and only if, the noise level σ is small enough with respect to λ, which corresponds to condition (5). If this condition is satisfied, the equation can be solved explicitly, which yields 4σκ√π ≤ r_ℓ. Since r_ℓ = 2σ√(d_ℓ/(nd)), we obtain the expression of d_ℓ stated in Lemma 2.4.
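The uniqueness argument (a concave increasing function vanishing at 0 crosses a convex parabola exactly once on (0, ∞)) is easy to illustrate numerically; the sketch below, with a toy concave function of our own choosing, finds the crossing by bisection:

```python
import math

def crossing(phi, c, lo=1e-12, hi=10.0, tol=1e-12):
    """Unique positive solution of phi(r) = c * r^2 for phi concave,
    increasing and phi(0) = 0: g(r) = phi(r) - c*r^2 is positive for
    small r and eventually negative, with a single sign change."""
    g = lambda r: phi(r) - c * r * r
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Toy concave phi(r) = sqrt(r): the equation sqrt(r) = r^2 has solution r = 1
print(round(crossing(math.sqrt, 1.0), 6))  # 1.0
```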

Figure 1: An example of a principal curve.

Figure 3: Principal curves fitted with [A] a too small length, [B] a too large length and [C] an appropriate one.

Figure 6: Principal curve fitted with the LS algorithm for the hairpin data.

Figure 7: Principal curve fitted with the LS algorithm for the Labyrinth data.

Figure 8: Principal curves fitted with the PL algorithm. [A] Hairpin street data. [B] Labyrinth data.