Asymptotics of a Clustering Criterion for Smooth Distributions

We develop a clustering framework for observations from a population with a smooth probability distribution function and derive its asymptotic properties. A clustering criterion based on a linear combination of order statistics is proposed. The asymptotic behavior of the point at which the observations are split into two clusters is examined. The results obtained can then be utilized to construct an interval estimate of the point which splits the data and to develop tests for bimodality and the presence of clusters.


Introduction
In this article, we develop a general framework for univariate clustering based on the ideas in Hartigan (1978) for the case of observations from a population with a smooth and invertible distribution function. Contrary to Hartigan's approach, which was based on a quadratic function of the observed data, our clustering criterion function possesses the advantage of being a linear combination of order statistics; in fact, it is a combination of trimmed sums and sample quantiles.
It is common in certain applications to assume that the data are taken from a population with a smooth distribution function. One important example is modeling in continuous-time mathematical finance, wherein observations are typically increments from a continuous-time stochastic process and therefore have smooth distributions owing to the presence of Itô integral components. Keeping this in mind, we deviate from Hartigan's framework and concentrate our attention on a function of the derivative of his split function. This approach permits us to dispense with the finite fourth-moment assumption imposed by Hartigan in the asymptotic investigation of his criterion function; a second-moment assumption suffices, at the cost of an additional smoothness condition on our criterion function. As an added benefit, this modification of Hartigan's approach provides us with the genuine possibility of extending our existing theory to more interesting scenarios involving dependent observations.
The notion of a "cluster" has several reasonable mathematical definitions. As in Hartigan (1978), we adopt a definition based on determining a point which splits the data into clusters via maximizing the between-cluster sums of squares. The main results in this article involve the asymptotic behavior of this particular point. The theoretical properties of the k-means clustering procedure for the univariate and the multivariate cases have been extensively investigated. Pollard (1981, 1982) proved strong consistency and asymptotic normality results in the univariate case. Serinko and Babu (1992) proved some weak limit theorems under non-regular conditions for the univariate case. With the intention of having a more robust procedure for clustering, García-Escudero, Gordaliza and Matrán (1999) and Cuesta-Albertos, Gordaliza and Matrán (1997) proposed trimmed k-means clustering and provided a central limit theorem for the multivariate case. Throughout the article we are primarily concerned with the case k = 2 on the real line. For extension of the split point approach to the case k > 2, we refer readers to the discussion in Hartigan (1978).
On a more practical note, our results enable us to construct an interval estimate of the point at which the data split; this naturally allows us to develop simple tests for bimodality and the presence (or absence) of clusters. Hypothesis tests for the presence (or absence) of clusters in a dataset have attracted considerable interest over the years. One of the earliest works in this area was by Engleman and Hartigan (1969); they developed a univariate method to test the null hypothesis of one normally distributed cluster against the alternative of a two-component mixture of normals. Wolfe (1970) extended the work of Engleman and Hartigan (1969) to the multivariate normal setup using MLE techniques and applied his method to Fisher's Iris data. Motivated by applications in market segmentation, Arnold (1979) proposed a test for clusters based on examining the within-groups scatter matrix. A dataset generated from a large-scale survey of lifestyle statements was considered, and the objective was to capture heterogeneity in the distribution of the responses to an appropriate questionnaire. Based on statistics concerning mean distances, minimum within-cluster sums and the resulting F-statistics, Bock (1985) presented several significance tests for clusters. In fact, he generalized some of the results in Hartigan (1978) to the multivariate setup. In a similar spirit, we note that in our method, the point which splits the data is invariant to scaling and translation of the data. This permits us to examine the behavior of the point under the null hypothesis of "no cluster" and thereby construct a suitable test.
In Hartigan and Hartigan (1985), a popular test for unimodality, referred to as the Dip test, was proposed and applied to a dataset pertaining to the quality of 63 statistics departments. However, their test did not possess good power against the specific bimodal alternative. More recently, Holzmann and Vollmer (2008) proposed a parametric test for bimodality based on the likelihood principle using two-component mixtures. Their method was applied to investigate the modal structure of the cross-sectional distribution of per-capita log GDP across EU regions. Using the Kolmogorov-Smirnov and the Anderson-Darling statistics, Schwab, Podsiadlowski and Rappaport (2012) performed a test for bimodality in the distribution of neutron-star masses. They compared the empirical cumulative distribution function to the distribution functions of a unimodal normal and a bimodal two-component normal mixture. Our results enable us to construct, along identical lines as the test for clusters, a test for bimodality. The test statistic, again, is based on the point at which the data are split; the split point is the same for all unimodal distributions with finite second moment and can hence be used as the test statistic.
In section 2 we introduce the relevant constructs of our clustering framework: a theoretical criterion function and its zero, followed by the empirical criterion function and its "zero". These quantities are of chief interest in this article. In section 3, we prove limit theorems for the empirical zero by examining the asymptotic behavior of the empirical criterion function and offer numerical verification of the limit results via simulation.
imsart-ejs ver. 2011/11/15 file: clusters0320.tex date: May 1, 2014

Furthermore, we demonstrate the utility of our results on the popular faithful dataset pertaining to eruption times for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. Finally, in section 4 we highlight the salient features of our approach, note its shortcomings and comment on possible remedies and extensions.

Clustering Criterion
In this section, based on Hartigan's approach, we propose an alternative clustering criterion and examine its properties. We first state our assumptions for the rest of the paper.

Assumptions
Let W_1, W_2, ..., W_n be i.i.d. random variables with cumulative distribution function F. We denote by Q the quantile function associated with F. We make the following assumptions:

A1. F is invertible for 0 < p < 1 and absolutely continuous with respect to Lebesgue measure with density f.
A2. E(W_1) = 0 and E(W_1^2) = 1.
A3. Q is twice continuously differentiable at any 0 < p < 1.
Note that owing to assumption A1, the quantile function Q is the regular inverse of F and not the generalized inverse.

Empirical Cross-over Function and Empirical Split Point
Let us first consider the split function that was introduced in Hartigan (1978) for partitioning a sample into two groups. The split function of Q at p ∈ (0, 1) is defined as

B(Q, p) = p Q_l(p)^2 + (1 − p) Q_u(p)^2 − (EW_1)^2,    (2.1)

where

Q_l(p) = (1/p) E(W_1 I_{(−∞, Q(p)]}(W_1))  and  Q_u(p) = (1/(1 − p)) E(W_1 I_{(Q(p), ∞)}(W_1))

represent the conditional expectations of the random variables W_i up to and from Q(p). Here I_A denotes the indicator function of a set A. In our case, since EW_1 = 0, the last term in the definition of the split function is 0. A value p_0 which maximizes the split function is called the split point. It is seen that if Q is the regular inverse, as in our case, p_0 satisfies the equation

(Q_u(p) − Q_l(p)) (Q_l(p) + Q_u(p) − 2Q(p)) = 0,    (2.2)

where the LHS is the derivative of B(Q, p). Evidently, (Q_u(p) − Q_l(p)) > 0 for all 0 < p < 1 and we hence, for our purposes, consider the cross-over function

G(p) = Q_l(p) + Q_u(p) − 2Q(p),    (2.3)

for examining clustering properties. From a statistical perspective, we would like to work with the empirical version of (2.3). We deviate here from Hartigan's framework and consider the empirical cross-over function (ECF), defined in Bharath, Pozdnyakov and Dey (2012), for (k − 1)/n ≤ p < k/n, as

G_n(p) = (1/k) Σ_{i=1}^{k} W_(i) + (1/(n − k)) Σ_{i=k+1}^{n} W_(i) − 2W_(k).    (2.4)

The random quantity G_n represents the empirical version of (2.3) and determines the split point for the given data. G_n is an L-statistic with irregular weights and is hence not amenable to direct application of existing asymptotic results for L-statistics. Observe that

G_n(1/n) = (1/(n − 1)) Σ_{i=2}^{n} W_(i) − W_(1) > 0.

This simple observation captures the typical behavior of the empirical cross-over function.
It starts positive and then at some point crosses the zero line. The index k at which this change occurs determines the datum W_(k) at which the split occurs. In Bharath, Pozdnyakov and Dey (2012), it is shown that G_n(p) is a consistent estimator of G(p) for each 0 < p < 1 and also that √n(G_n(p) − G(p)) is asymptotically normal. We now introduce the empirical split point in the range [a, b], 0 < a < b < 1, the empirical counterpart of p_0, as

p_n = inf{p ∈ [a, b] : G_n(p) ≤ 0}.

The quantity p_n is our estimator of p_0, the true split point (when it is in the range). If G_n does not cross the zero line within [a, b], we declare that the split point is outside the range. The asymptotic behavior of p_n can be used for the construction of a test for the presence of clusters in the observations, or for the estimation of the true split point.
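To make these constructs concrete, the following computational sketch (ours, not from the paper) evaluates the ECF and locates the empirical split point, assuming the L-statistic form G_n(k/n) = (1/k) Σ_{i≤k} W_(i) + (1/(n−k)) Σ_{i>k} W_(i) − 2W_(k); the helper names ecf and empirical_split_point are hypothetical.

```python
import numpy as np

def ecf(w):
    """Empirical cross-over function at p = k/n, k = 1, ..., n-1.

    Assumed form: mean of the k smallest order statistics, plus the
    mean of the n-k largest ones, minus 2 * W_(k)."""
    w = np.sort(np.asarray(w, dtype=float))
    n = len(w)
    k = np.arange(1, n)
    lower = np.cumsum(w)[:-1]          # partial sums: sum_{i<=k} W_(i)
    return lower / k + (w.sum() - lower) / (n - k) - 2 * w[:-1]

def empirical_split_point(w, a=0.1, b=0.9):
    """First p = k/n in [a, b] at which the ECF drops to or below zero;
    returns None when the ECF does not change sign in the range."""
    n = len(w)
    g = ecf(w)
    k = np.arange(1, n)
    in_range = (k / n >= a) & (k / n <= b)
    hits = k[in_range][g[in_range] <= 0]
    return hits[0] / n if hits.size else None
```

On a standardized, well-separated two-component mixture, the ECF starts positive, ends negative and crosses zero near the boundary between the two clusters.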
Remark 1. Let us provide some intuition behind Hartigan's split function. The k-means clustering method for the case k = 2 requires us to minimize (with respect to k*) the following within-group sum of squares:

Σ_{i=1}^{k*} (W_(i) − M_1)^2 + Σ_{i=k*+1}^{n} (W_(i) − M_2)^2,

where M_1 and M_2 are the means of the two groups. This is basically an empirical version of Hartigan's split function (2.1), and k*/n is another version of the empirical split point.
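Remark 1 can be illustrated numerically. The sketch below (our own, not the paper's code) minimizes the within-group sum of squares by brute force over all splits of the sorted sample; for univariate data the optimal 2-means partition is always such an interval split, so scanning k* suffices. The helper name wgss_split is hypothetical.

```python
import numpy as np

def wgss_split(w):
    """Brute-force k-means with k = 2 on the real line: scan all splits of
    the sorted sample into {W_(1),...,W_(k*)} and {W_(k*+1),...,W_(n)},
    minimize the within-group sum of squares, and return k*/n."""
    w = np.sort(np.asarray(w, dtype=float))
    n = len(w)
    csum, csq = np.cumsum(w), np.cumsum(w ** 2)
    best_k, best = 1, np.inf
    for k in range(1, n):
        # within-group SS of each group: sum of squares - (sum)^2 / size
        within_lo = csq[k - 1] - csum[k - 1] ** 2 / k
        within_hi = (csq[-1] - csq[k - 1]) - (csum[-1] - csum[k - 1]) ** 2 / (n - k)
        if within_lo + within_hi < best:
            best, best_k = within_lo + within_hi, k
    return best_k / n
```

For a balanced, well-separated two-component mixture, the minimizing split lands near the middle of the sorted sample, i.e., k*/n near 1/2.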
In this paper we proceed in parallel with Hartigan (1978) (his Theorem 1 and Theorem 2) and prove consistency and asymptotic normality of p_n under a uniqueness assumption. The theoretical conditions that guarantee the uniqueness of the split point remain an open question. It is easy to see that for a unimodal symmetric distribution with a finite second moment, the split point is 1/2, and for all unimodal symmetric light-tailed distributions that we checked the split point was unique. However, Hartigan (1978) gives an example of a unimodal symmetric heavy-tailed distribution for which every point in (0, 1) is a split point. For a bimodal distribution the split point p_0 is typically unique, and Q(p_0) lies between the cluster means.
The presented results can be employed for testing the "no-clusters" hypothesis, testing bimodality, and estimation of the split point. The extension of this technique to the case of k clusters is discussed in Hartigan (1978). In our case, instead of one cross-over function one needs to introduce k − 1 functions; the split point in this case will be a (k − 1)-dimensional vector. To find this split point we then need to solve a system of k − 1 equations. For instance, for a partition of the data into three groups we need to introduce two cross-over functions and solve (in an appropriate sense) the corresponding system of two equations. We do not address the general k > 2 cluster situation in this paper.
Finally, as stated in the Introduction, let us remark on the main technical difference between Hartigan's assumptions and ours. Since we deal here with the derivative of the split function, we need a stronger smoothness condition (the second derivative of G instead of the first one). In return, we work with trimmed means, and a weaker moment condition suffices (a finite second moment instead of the fourth one as in Hartigan (1978)).

Remark 2. Notice that if Z_i = αW_i + β for constants α > 0 and β and i = 1, ..., n, and we define G_n^z to be the ECF based on the Z_i, then

G_n^z(p) = α G_n(p),

and therefore G_n^z and G_n cross over 0 at the same point. Thus assumption A2 is not restrictive, since p_n is invariant to scaling and translation of the data; as it should be, since in a clustering problem scale and location changes of the data should not affect the clustering mechanism. We can hence quite safely assume that we are dealing with random variables with mean 0 and variance 1.
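The invariance in Remark 2 can be checked numerically. The sketch below (ours) re-implements the ECF in the assumed L-statistic form and verifies that the ECF of Z_i = αW_i + β is exactly α times the ECF of the W_i, so the zero crossing is unchanged: both trimmed means pick up α·(mean) + β, while −2Z_(k) contributes −2αW_(k) − 2β, and the shifts cancel.

```python
import numpy as np

def ecf(w):
    # Assumed ECF form: (mean of k smallest) + (mean of n-k largest) - 2*W_(k)
    w = np.sort(np.asarray(w, dtype=float))
    n = len(w)
    k = np.arange(1, n)
    lower = np.cumsum(w)[:-1]
    return lower / k + (w.sum() - lower) / (n - k) - 2 * w[:-1]

rng = np.random.default_rng(1)
w = rng.normal(size=400)
alpha, beta = 2.5, -7.0              # arbitrary scale (> 0) and shift
# G_n^z = alpha * G_n exactly, so the sign pattern (and crossing) is preserved
assert np.allclose(ecf(alpha * w + beta), alpha * ecf(w))
```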

Main Results
Let us start with the functional limit theorem from Bharath, Pozdnyakov and Dey (2012).
Theorem 1. Under assumptions A1−A3, the process √n(G_n(p) − G(p)) converges weakly to U(p) in the Skorohod space D[a, b], 0 < a < b < 1, equipped with the J_1 topology, where U(p) is a Gaussian process with mean 0 whose covariance function is given in Bharath, Pozdnyakov and Dey (2012); we write θ_p for U(p).

The next lemma states that the Gaussian process U(p) allows a continuous modification. This fact will be employed, for example, to justify the usage of the mapping theorem (see, for instance, Billingsley (1968) and Pollard (1984)).

Lemma 1. Under assumptions A1−A3, the Gaussian process U(p), p ∈ [a, b], admits a continuous modification.
Proof. First note that on the interval [a, b] the relevant functions (of p) entering the covariance are continuously differentiable. Second, the functions max(p, q) and min(p, q) are (globally) Lipschitz. Hence there is a constant K such that for all p, q, p′ and q′ from [a, b],

|Cov(U(p), U(q)) − Cov(U(p′), U(q′))| ≤ K(|p − p′| + |q − q′|).

Therefore,

E(U(p) − U(q))^2 ≤ 2K|p − q|.

By Theorem 1.4 from Adler (1990) we get that U(p) is continuous.
This immediately leads us to the following important consequence.

Lemma 2. Under assumptions A1−A3, for 0 < p < 1 and (k − 1)/n ≤ p < k/n, as n → ∞,

Proof.
Re-arranging terms, the RHS can be written as

Observe that, by the law of large numbers for trimmed sums (see Stigler (1973)),

where η is a constant, and hence, as n → ∞,

Moreover, since W_(k) converges to Q(p), then from Devroye (1981) we have that

and therefore

It is hence the case that the RHS is o_p(1/√n). This concludes the proof.
Lemma 3. Assume A1−A3 hold. Suppose that G(p) = 0 has a unique solution, p_0, and G′(p_0) < 0. If a, b are such that 0 < a < p_0 < b < 1, then for any δ > 0 there exist N and C > 0 such that for all n ≥ N,

Proof. Fix an arbitrary δ > 0. Using Theorem 1 and the mapping theorem, we have

Therefore, for any δ > 0 there exist N and C > 0 such that for all n > N we have

By the same argument as in Theorem 2, p_0 is a unique split point, 0 < p_0 < 1, G(p) > 0 for p < p_0 and G(p) < 0 for p > p_0. The assumption G′(p_0) < 0 tells us that in a neighborhood of p_0 the function G(p) behaves like a line. Taking this into account, we get that there exist N′ > N and C′ > 0 such that for all n > N′,

Then by (3.2) we find that for all n > N′,

Therefore,

Lemma 4. Assume A1−A3 hold. Suppose that G(p) = 0 has a unique solution, p_0, and

where

Proof. Since the second derivative of G(p) is uniformly continuous, for any ε > 0 and δ > 0 there exists N such that for all n > N,

on the space equipped with the uniform metric. Therefore, Theorem 1 and the mapping theorem inform us that

Since for all sufficiently large n we have

it is indeed the case that

As a consequence,

Because the Gaussian process U(p) is continuous, sup_{p_0−δ ≤ p ≤ p_0+δ} |U(p) − U(p_0)| → 0 with probability 1 as δ → 0, and therefore it converges to 0 in probability. Choosing δ small enough, we can make the lim sup above arbitrarily small.

Lemma 5. Assume A1−A3 hold. Suppose that G(p) = 0 has a unique solution, p_0, and G′(p_0) < 0. Then

where G′(p) is as defined in (3.3).
Proof. Consider the line G_n(p_0) + G′(p_0)(p − p_0). Let the random variable p* be the solution of

G_n(p_0) + G′(p_0)(p − p_0) = 0,  that is,  p* = p_0 − G_n(p_0)/G′(p_0).    (3.4)

From Theorem 1, we know that

and we hence have that

by which we can claim that p_n = p* + o_p(n^{−1/2}). The result follows by substituting for p* in (3.4).
This immediately gives us the final result.

Theorem 3. Assume A1−A3 hold. Suppose that G(p) = 0 has a unique solution p_0, a < p_0 < b, and G′(p_0) < 0. Then √n(p_n − p_0) converges in distribution to a normal random variable with mean 0 and variance Var(θ_{p_0})/(G′(p_0))^2, where θ_{p_0} is as defined in Theorem 1.

Numerical Verification
In this section we provide verification of our results regarding the asymptotic normality of p_n along the lines of Table 1 in Hartigan (1978). Since our split point p_0 coincides with Hartigan's split point (the maximum of B(Q, p)), it is to be expected that our empirical split point p_n behaves asymptotically similarly to his. Hartigan verifies his results when observations are obtained from a N(0, 1) population, a population with a smooth distribution function; we do the same and note that the asymptotic mean and the variance of p_n agree with his.
It is a simple exercise to ascertain that, for the normal case, the split point p_0 is 0.5, (G′(0.5))^2 ≈ 3.34 and Var(θ_{0.5}) = 2π − 4 ≈ 2.283. Consequently, we observe that the asymptotic variance of √n(p_n − 0.5) is approximately 0.69. The table below corroborates our theoretical results.

We now demonstrate how Theorem 3 can be employed to construct approximate confidence intervals (CI) for a theoretical split point. We consider a classical example of a bimodal distribution: the variable "eruption" in the data set faithful, available in the R package MASS. The data set contains 272 measurements of the duration of eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. First, we plot the ECF for the variable "eruption"; the plot is given in Figure 1. We can see that G_n(·) is generally a decreasing function that crosses the zero line once, far away from 0 and 1, the end-points of its domain (0, 1). Thus our point estimate of the theoretical split point is p_n = 97/272 ≈ 0.357. Now, to construct an approximate CI for p_0 we need to estimate Var(θ_{p_0})/(G′(p_0))^2. A straightforward (but rather tedious) calculation shows that this quantity explicitly depends on the terms p_0, Q(p_0), f(Q(p_0)), Q_l(p_0) and Q_u(p_0). Finally, f(Q(p_0)) is estimated by evaluating a kernel density estimate at W_(98), where the density estimate comes from the standard R function density. As a result, for instance, the 95% confidence interval for the theoretical split point p_0 is given by 0.357 ± 0.057.
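A Monte Carlo experiment of the kind summarized in the table can be sketched as follows (our own re-implementation, assuming the L-statistic ECF form and the first-zero-crossing definition of p_n); for N(0, 1) samples, the simulated variance of √n(p_n − 0.5) should land near the theoretical value 0.69.

```python
import numpy as np

rng = np.random.default_rng(2014)

def split_point(w, a=0.1, b=0.9):
    # ECF (assumed form) and its first zero crossing in [a, b]
    w = np.sort(w)
    n = len(w)
    k = np.arange(1, n)
    lower = np.cumsum(w)[:-1]
    g = lower / k + (w.sum() - lower) / (n - k) - 2 * w[:-1]
    in_range = (k / n >= a) & (k / n <= b)
    hits = k[in_range][g[in_range] <= 0]
    return hits[0] / n if hits.size else np.nan

n, reps = 1000, 500
pn = np.array([split_point(rng.normal(size=n)) for _ in range(reps)])
z = np.sqrt(n) * (pn - 0.5)
print("mean:", round(z.mean(), 3), "variance:", round(z.var(), 3))
```

With n = 1000 and 500 replications, the simulated mean is close to 0 and the simulated variance is in the vicinity of 0.69, up to Monte Carlo error.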

Discussion
Admittedly, the definition of the empirical split point "in the range [a, b]" might appear a bit artificial. But we still believe the results can be useful in practical applications. For instance, as with the faithful data, if we know that the distribution at hand is bimodal, and we want to estimate a split point between two clusters, it is safe to assume that the split point is in the range between the two modes. It turns out that the behavior of the cross-over function when it is close to 0 or 1 can be rather complicated; under some natural assumptions, it can be shown that lim sup_{p↑1} G(p) < 0. For example, this is true if W_1 is bounded from above. When Q(1−) = +∞, the following condition is sufficient for lim sup_{p↑1} G(p) < 0:

It is easy to see that, for instance, distributions with regularly varying tails and E W_1^{2+ε} < ∞, for some ε > 0, satisfy (4.1). However, it is possible to construct a distribution with a "bumpy" tail for which lim sup_{p↑1} G(p) ≥ 0. Consequently, any extension of the definition of p_n to the entire interval (0, 1) will require some additional assumptions.
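As a concrete illustration of the bounded-support case (our own example, assuming the cross-over form G(p) = Q_l(p) + Q_u(p) − 2Q(p)): for W uniform on (−√3, √3), so that EW = 0 and EW^2 = 1, one gets Q(p) = √3(2p − 1), Q_l(p) = √3(p − 1) and Q_u(p) = √3 p, whence G(p) = √3(1 − 2p), which stays bounded away from 0 from below as p ↑ 1.

```python
import numpy as np

# Uniform(-s, s) with s = sqrt(3): mean 0, variance 1 (assumption A2).
# Closed forms: Q(p) = s(2p-1), Q_l(p) = s(p-1), Q_u(p) = s*p.
s = np.sqrt(3.0)
p = np.linspace(0.90, 0.999, 50)
Q, Ql, Qu = s * (2 * p - 1), s * (p - 1), s * p
G = Ql + Qu - 2 * Q                     # cross-over function
assert np.allclose(G, s * (1 - 2 * p))  # simplifies to sqrt(3)(1 - 2p)
assert np.all(G < -0.5 * s)             # negative, bounded away from 0 near 1
```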

Acknowledgements
We are thankful to the Associate Editor and referee for careful reading of the manuscript, thoughtful criticisms and very helpful suggestions that allowed us to significantly improve the presentation of our results.We also want to thank Zhiyi Chi for discussions on the behavior of the cross-over function around 0 and 1.

Fig 1. Empirical Cross-over Function G_n(p) for the data set faithful.

Table 1
Simulated mean and variance of √n(p_n − 0.5) for different sample sizes for the normal case. 1000 simulations were performed for each sample size.