Calibration of the Empirical Likelihood Method for a Vector Mean

The empirical likelihood method is a versatile approach for testing hypotheses and constructing confidence regions in a non-parametric setting. For testing the value of a vector mean, the empirical likelihood method offers the benefit of making no distributional assumptions beyond some mild moment conditions. However, in small samples or high dimensions the method is very poorly calibrated, producing tests that generally have a much higher type I error than the nominal level, and it suffers from a limiting convex hull constraint. Methods to address the performance of the empirical likelihood in the vector mean setting have been proposed in a number of papers, including a contribution by Chen et al. (2008) that suggests supplementing the observed dataset with an artificial data point. We examine the consequences of this approach and describe a limitation of their method that we have discovered in settings where the sample size is relatively small compared with the dimension. We propose a new modification to the extra data approach that involves adding two points and changing the location of the extra points. We explore the benefits that this modification offers, and show that it results in better calibration, particularly in difficult cases. This new approach also results in a small-sample connection between the modified empirical likelihood method and Hotelling’s T-square test. We show that varying the location of the added data points creates a continuum of tests that range from the unmodified empirical likelihood statistic to Hotelling’s T-square statistic.


Introduction
Empirical likelihood methods, introduced by Owen (1988), provide nonparametric analogs of parametric likelihood-based tests, and have been shown to perform remarkably well in a wide variety of settings. Empirical likelihood tests have been proposed for many functionals of interest, including the mean of a distribution, quantiles of a distribution, regression parameters, and linear contrasts in multisample problems.
In this paper, we focus on the use of the empirical likelihood method for inference about a vector mean, and investigate some of the small sample properties of the method. It has been widely noted (see, for example, Owen (2001), Tsao (2004a), or Chen et al. (2008)) that in small samples or high dimensional problems, the asymptotic chi-square calibration of the empirical likelihood ratio statistic produces a test that generally does not achieve the nominal error rate, and can in fact be quite anti-conservative. Many authors have proposed adjustments to the empirical likelihood statistic or to the reference distribution in an attempt to remedy some of the small sample coverage errors. We briefly examine the ability of some of these adjustments to correct the behavior of the empirical likelihood ratio test, and focus in particular on the method of Chen et al. (2008), which involves adding an artificial data point to the observed sample. This approach offers several key benefits in both ease of computation and accuracy. We explore the consequences of the recommended placement of the extra point, and we demonstrate a limitation of the method that results in confidence regions equal to $\mathbb{R}^d$ in some settings. We propose a modification of the data augmentation that involves adding two balanced points rather than just one and changing the location of the added points. The balanced points preserve the sample mean of the augmented data set, which maintains the comparison between the sample mean and the hypothesized value. This modification addresses both the under-coverage issue of the original empirical likelihood method and the limitation of the Chen et al. (2008) method. The locations of the new extra points are determined according to a parameter $s > 0$ which tunes the calibration of the resulting statistic. With an appropriate choice of $s$, these adjustments result in greatly improved calibration for small samples in high dimensional problems. Further, as $s \to \infty$, we find a small sample connection to Hotelling's T-square test. Simulation results demonstrate the effectiveness of the modified augmented empirical likelihood calibration.
We begin in Section 2 with a description of the basic setting and introduce some notation. We then outline the empirical likelihood method, and discuss the small sample issues of the method. In Section 3 we present previous proposals for calibrating the empirical likelihood method and compare the abilities of these proposals to address the various challenges for empirical likelihood in small samples. Section 4 introduces a modification of the data-augmentation strategy, and presents a result regarding the change in sample space ordering as the location of the extra points varies, connecting the empirical likelihood method to Hotelling's T-square test. We illustrate the improvement in calibration for several examples in Section 5, and conclude in Section 6 with a discussion of the results, and some ideas for future work and extensions of the methods presented here.

Background and Notation
Let $X_1, \ldots, X_n \in \mathbb{R}^d$ be a sample of $n$ independent, identically distributed $d$-vectors, distributed according to $F_0$. We want to test a hypothesis regarding the value of $\mu_0 = E_{F_0}(X_i)$, i.e., to test
$$H_0 : \mu_0 = \mu. \tag{1}$$
Let $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ denote the sample mean, and let $S$ denote the sample covariance matrix, which we assume to be full rank:
$$S = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T.$$
Finally, let $A \in \mathbb{R}^{d \times d}$ be an invertible matrix satisfying $AA^T = S$. Define the following standardized quantities: $Z_i = A^{-1}(X_i - \bar{X})$, $\bar{Z} = 0$, and $\eta = A^{-1}(\mu - \bar{X})$. We will use the standardized quantities to simplify notation in later sections.

Hotelling's T-square Statistic
For the setting and hypothesis test described by (1), Hotelling's T-square statistic (Hotelling 1931) is given by
$$T^2(\mu) = n(\bar{X} - \mu)^T S^{-1} (\bar{X} - \mu).$$
Hotelling's T-square statistic is invariant under the group of transformations defined by $X \mapsto \tilde{X} = CX$, where $C$ is a full-rank matrix of dimension $d \times d$. The hypothesis being tested is then $H_0 : E(\tilde{X}_i) = \tilde{\mu} = C\mu$. In terms of the standardized variables defined in the previous section, Hotelling's T-square statistic simplifies to $T^2(\mu) = n\eta^T\eta$.
For testing the mean of a multivariate normal distribution, Hotelling's T-square test is uniformly most powerful invariant, and has been shown to be admissible against broad classes of alternatives (Stein 1956, Kiefer & Schwartz 1965). In the Gaussian case, the resulting statistic has a scaled $F_{d,n-d}$ distribution under the null hypothesis, given by
$$\frac{n-d}{(n-1)d}\, T^2(\mu_0) \sim F_{d,n-d},$$
and therefore a hypothesis test of level $\alpha$ is obtained by rejecting the null hypothesis when
$$T^2(\mu) > \frac{(n-1)d}{n-d}\, F_{d,n-d}^{(1-\alpha)},$$
where $F_{d,n-d}^{(1-\alpha)}$ denotes the $1-\alpha$ quantile of the $F_{d,n-d}$ distribution. The multivariate central limit theorem, along with Slutsky's theorem, justifies the use of this test for non-Gaussian data in large samples, and even in relatively small samples it is reasonably robust. Highly skewed distributions will of course require larger sample sizes to produce accurate inference using Hotelling's T-square test.

Empirical Likelihood Statistic
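Owen (1988) and Owen (1990) proposed the ordinary empirical likelihood method, which we will denote by EL, for testing the hypothesis (1). It proceeds as follows: let
$$R(\mu) = \max\left\{ \prod_{i=1}^{n} n w_i \;:\; \sum_{i=1}^{n} w_i X_i = \mu,\; w_i \ge 0,\; \sum_{i=1}^{n} w_i = 1 \right\}.$$
The log empirical likelihood ratio statistic is then given by
$$W(\mu) = -2\log R(\mu).$$
When positive weights $w_i$ satisfying the condition $\sum_{i=1}^{n} w_i X_i = \mu$ do not exist, the statistic is undefined, or by convention taken to be $\infty$. Under some mild moment assumptions, $W(\mu_0) \xrightarrow{d} \chi^2_{(d)}$ as $n \to \infty$ (Owen 1990), where $\mu_0$ is the true mean of the underlying distribution. The proof of this asymptotic behavior proceeds by showing that the difference between $W(\mu_0)$ and Hotelling's T-square statistic $T^2(\mu_0)$ converges in probability to zero as $n \to \infty$.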
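To make the computation concrete, the following is a minimal sketch of the EL statistic in the univariate case ($d = 1$) only, assuming the standard Lagrange-multiplier form of the optimal weights, $w_i = 1/\{n(1 + \lambda(X_i - \mu))\}$. The function name `el_statistic` and the use of bisection are illustrative choices, not part of the method as published; the vector case requires a multivariate root-finder.

```python
import math


def el_statistic(x, mu, iters=200):
    """Return W(mu) = -2 log R(mu) for a univariate sample x (d = 1 only)."""
    z = [xi - mu for xi in x]
    z_max, z_min = max(z), min(z)
    # mu must lie strictly inside the convex hull (here: the range) of the data;
    # otherwise the statistic is taken to be infinite by convention.
    if z_max <= 0.0 or z_min >= 0.0:
        return math.inf
    # Positive weights require 1 + lam * z_i > 0 for every i.
    lo, hi = -1.0 / z_max, -1.0 / z_min
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # g(lam) = sum z_i / (1 + lam z_i) is strictly decreasing in lam,
        # so bisection locates its unique root on (lo, hi).
        g = sum(zi / (1.0 + mid * zi) for zi in z)
        if g > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    # With w_i = 1 / (n (1 + lam z_i)), -2 log prod(n w_i) = 2 sum log(1 + lam z_i).
    return 2.0 * sum(math.log1p(lam * zi) for zi in z)
```

For the two-point sample $\{-1, 1\}$ and $\mu = 0.5$ the optimal weights are $(0.25, 0.75)$, so the function should return $-2\log(0.75)$; for any $\mu$ outside the range of the data it returns infinity.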
The motivation for the empirical likelihood ratio statistic is, as the name implies, an empirical likelihood ratio. The denominator is the likelihood of the observed mean under the empirical distribution: $\prod_{i=1}^{n} \frac{1}{n}$. The numerator is the maximized likelihood for a distribution $F$ that is supported on the sample and satisfies $E_F[X] = \mu$. It is easy to show that the empirical likelihood ratio statistic is invariant under the same group of transformations as Hotelling's T-square test, and this is a property that we will seek to maintain as we address the calibration issues of the test.
The asymptotic result above allows us to test hypotheses regarding the mean and to construct confidence intervals using the appropriate critical values arising from the chi-square distribution. However, the small sample behavior of this statistic is somewhat problematic for several reasons. First, if $\mu$ is not inside the convex hull of the sample, the statistic is undefined, or by convention taken to be $\infty$. A paper by Wendel (1962) calculates the probability $p(d, n)$ that the mean of a $d$-dimensional distribution is not contained in the convex hull of a sample of size $n$. The result is for distributions that are symmetric under reflections through the origin, and is found to be
$$p(d, n) = 2^{-n+1} \sum_{k=0}^{d-1} \binom{n-1}{k}.$$
That is, the probability that the convex hull of the points does not contain the mean is equal to the probability that $W \le d - 1$ for a random variable $W \sim \mathrm{Bin}(n - 1, \tfrac{1}{2})$. (Note: an isomorphism between the binomial coin-flipping problem and this convex hull problem has still not been identified.) In small samples this convex hull constraint can be a significant problem, and even when the sample does contain the mean, the null distribution will be distorted somewhat by the convex hull effect.
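Wendel's formula is simple to evaluate directly. The sketch below (the function name `wendel_probability` is ours, chosen for illustration) computes $p(d, n)$ as the binomial tail probability described above.

```python
from math import comb


def wendel_probability(d, n):
    """P(mean lies outside the convex hull) for a sample of size n from a
    d-dimensional distribution symmetric under reflection through its mean."""
    # Equals P(W <= d - 1) for W ~ Bin(n - 1, 1/2).
    return sum(comb(n - 1, k) for k in range(d)) / 2 ** (n - 1)


if __name__ == "__main__":
    # The setting of 10 points in four dimensions: 130/512, roughly 0.254.
    print(wendel_probability(4, 10))
```

For $n = 10$ and $d = 4$ the sum is $1 + 9 + 36 + 84 = 130$, so the mean falls outside the convex hull with probability $130/512 \approx 0.254$ even before any calibration issue is considered.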
A second issue that affects the small sample calibration of the empirical likelihood statistic is the fact that the first order term of the asymptotic expansion for the statistic is clearly not chi-square for small $n$, and is in fact bounded, as we now demonstrate. Analogous to the definition of $S$, define
$$\tilde{S} = \frac{1}{n}\sum_{i=1}^{n} (X_i - \mu_0)(X_i - \mu_0)^T.$$
In the asymptotic expansion of the statistic $W(\mu_0)$, the first order term is
$$\tilde{T}^2 = n(\bar{X} - \mu_0)^T \tilde{S}^{-1} (\bar{X} - \mu_0),$$
which is related to Hotelling's T-square statistic by
$$\tilde{T}^2 = \frac{nT^2}{T^2 + n - 1},$$
so that $\tilde{T}^2 < n$ for every sample. It is difficult to quantify the effect of the deviation of this term from its chi-square limit because the higher order terms clearly have a non-ignorable contribution in this setting since the EL statistic is unbounded. This does, however, indicate that the asymptotic approximation may be very far from accurate for small samples.
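The boundedness of the first order term can be checked numerically. The following sketch works in the univariate case only and assumes that the analog of $S$ is the second moment about $\mu_0$ with divisor $n$, so that the first order term is $n(\bar{X} - \mu_0)^2 / \tilde{S}$; under that assumption it should equal $nT^2/(T^2 + n - 1)$ and hence stay below $n$. The function name is hypothetical.

```python
def hotelling_and_first_order(x, mu0):
    """Return (T^2, first-order EL term) for a univariate sample x."""
    n = len(x)
    xbar = sum(x) / n
    # Sample variance S (divisor n - 1), centered at the sample mean.
    s = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    # Assumed analog of S: second moment about mu0 with divisor n.
    s_tilde = sum((xi - mu0) ** 2 for xi in x) / n
    t2 = n * (xbar - mu0) ** 2 / s
    t2_tilde = n * (xbar - mu0) ** 2 / s_tilde
    return t2, t2_tilde


if __name__ == "__main__":
    x = [0.3, -1.2, 0.8, 1.5, -0.4, 2.1, -0.9, 0.6, 1.1, -0.2]
    t2, t2_tilde = hotelling_and_first_order(x, 0.0)
    n = len(x)
    # The two printed values should agree up to rounding error.
    print(t2_tilde, n * t2 / (t2 + n - 1))
```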
Together, these issues result in a generally very anti-conservative test in small samples. This is illustrated in the quantile-quantile and probability-probability plots shown in Figure 1, which are generated by simulating 5000 datasets consisting of 10 points from the multivariate Gaussian distribution in four dimensions, and then calculating the value of the EL statistic for the true mean $\mu_0 = 0$ for each dataset. We use this extreme setting of 10 points in four dimensions to make the calibration flaws readily apparent; these flaws persist, to a lesser degree, even in more reasonable settings. From these plots we can see the extremely anti-conservative behavior of this test: a test with nominal level $\alpha = 0.05$ would in fact result in a type I error rate of about 0.47. The example shown here is a difficult one, but even in more reasonable problems there can be a sizeable discrepancy between the nominal and actual type I error rates.

Calibration of Empirical Likelihood for a Vector Mean
There have been a number of suggestions for improving the behavior of the empirical likelihood ratio statistic in small samples. We give a brief description of several such calibration methods here; more in-depth discussion may be found in the references listed with each method. The simplest of these methods is to use an appropriately scaled F distribution (Owen 2001) in place of the usual $\chi^2$ reference distribution. This approach is motivated by the first order term of the empirical likelihood ratio statistic, which closely resembles Hotelling's T-square statistic. However, in many examples there is no improvement in the resulting calibration, and the convex hull issue is clearly not addressed.
A second approach is bootstrap calibration, in which bootstrap samples are drawn by resampling from the original data. For each bootstrap sample, the empirical likelihood ratio statistic $W^{(b)}(\bar{X})$ is computed for the sample mean of the original data set using the resampled data. This resampling process is performed $B$ times, and the statistic $W(\mu)$ is then compared to the distribution of values $W^{(b)}(\bar{X})$, $b = 1, \ldots, B$, to give a bootstrap p-value. The bootstrap calibration does not directly address the convex hull problem, but if the empirical likelihood function is extended beyond the hull of the data in some way, the bootstrap calibration can produce usable results even when $\mu$ is not in the convex hull of the data. The calibration resulting from this bootstrap process is generally reasonably good, but it is quite computationally intensive. As with most bootstrap processes, the performance is improved with a higher number of bootstrap repetitions.
DiCiccio et al. (1991) show that the empirical likelihood method is Bartlett-correctable, and therefore the asymptotic coverage errors can be reduced from $O(n^{-1})$ to $O(n^{-2})$. They further demonstrate that even in small samples an estimated Bartlett correction offers a noticeable improvement. The Bartlett correction involves scaling the reference $\chi^2$ distribution by a factor that can be estimated from the data or computed from a parametric model, and therefore offers no escape from the convex hull. Since the Bartlett correction corresponds to shifting the slope of the reference line in the quantile-quantile plot, it is also clear that in the examples we consider here it will offer only a marginal benefit in improving calibration.
The empirical likelihood-t method, discussed in Owen (2001) and originally proposed by K. A. Baggerly in a 1999 technical report (source unavailable), is an attempt to address the convex hull constraint by allowing the weighted mean to differ from the hypothesized mean in a constrained manner. This method does not retain the transformation invariance of the empirical likelihood method (Owen 2001), and requires significantly more computation time as it introduces another parameter to be profiled out in the search for optimal weights.
Tsao (2001) and Tsao (2004b) discuss a calibration for the empirical likelihood method for a vector mean that involves simulating the exact distribution of the empirical likelihood ratio statistic when the underlying distribution of the data is Gaussian, and using this simulated distribution as the reference. There is no attempt to address the convex hull issue, but the resulting coverage levels do tend to be closer to the nominal levels when the convex hull constraint allows it.
Bartolucci (2007) suggests a penalized empirical likelihood that allows hypotheses outside the convex hull of the data by penalizing the distance between the mean $\nu$ of the reweighted sample distribution and the hypothesized mean $\mu$. While this approach does escape the convex hull issue, the choice of the penalty parameter is difficult to determine, and the method is very computationally intensive as it requires an extra search to minimize the penalty and it also relies on bootstrap calibration. In fact, the author recommends double bootstrap calibration, which becomes prohibitively expensive as the dimension of the problem increases. Clearly the benefit of this approach will depend on the choice of the penalty parameter, and it is unclear how much this modification improves the calibration of the test in the best case.
Finally, Chen et al. (2008) suggest a calibration, which we will refer to henceforth as the adjusted empirical likelihood method (AEL), that proceeds by adding an artificial point to the data set and then computing the empirical likelihood ratio statistic on the augmented sample. The point is added in such a way as to guarantee that the hypothesized mean will be in the convex hull of the augmented data, thereby addressing the convex hull constraint. Chen et al. discuss the asymptotic behavior of this modification, showing that as long as the additional point is placed in a reasonable way, the resulting statistic has the same limiting properties as the ordinary empirical likelihood ratio statistic. This approach is attractive from a computational standpoint, and appears to have good potential to influence the appropriateness of the calibration of the empirical likelihood method.
In summary, with the exception of the last two methods, these approaches do not address the convex hull constraint, and have varying degrees of success at correcting the small sample behavior of the empirical likelihood statistic. The AEL method has most convincingly overcome the convex hull issue and has further resulted in marked improvement in the calibration of the resulting statistic, so we explore their approach in greater depth.

Adjusted Empirical Likelihood
Chen et al. (2008) propose adding an additional point to the sample and then calculating the empirical likelihood statistic based on the augmented data set. Define the following quantities:
$$v^* = \bar{X} - \mu, \qquad r^* = \|v^*\|, \qquad u^* = \frac{v^*}{r^*},$$
so $v^*$ is the vector separating the sample mean from the hypothesized mean of the underlying distribution, $r^*$ is the distance between the sample mean and the hypothesized mean, and $u^*$ is a unit vector in the direction of $v^*$. In terms of these quantities, for the setting described in Section 2, the extra point $X_{n+1}$ that Chen et al. suggest is
$$X_{n+1} = \mu - a_n(\bar{X} - \mu) = \mu - a_n r^* u^*,$$
where $a_n$ is a positive constant that may depend on the sample size $n$. Then the resulting adjusted log empirical likelihood ratio statistic is
$$W^*(\mu) = -2\log R^*(\mu),$$
where
$$R^*(\mu) = \max\left\{ \prod_{i=1}^{n+1} (n+1) w_i \;:\; \sum_{i=1}^{n+1} w_i X_i = \mu,\; w_i \ge 0,\; \sum_{i=1}^{n+1} w_i = 1 \right\}.$$
They recommend the choice $a_n = \frac{1}{2}\log(n)$, but discuss other options as well and state that as long as $a_n = o_p(n^{2/3})$ the first order asymptotic properties of the original log empirical likelihood ratio statistic are preserved for this adjusted statistic. It is easy to see that this modification also preserves the invariance of the ordinary empirical likelihood method. However, in the case of small samples or high dimensions, we have discovered that the AEL adjustment has a limitation that can make the chi-square calibration very inappropriate. The following Proposition describes this phenomenon.
Proposition 3.1. With an extra point placed as proposed in Chen et al. (2008) at $X_{n+1} = \mu - a_n(\bar{X} - \mu)$, the adjusted log empirical likelihood ratio statistic $W^*(\mu) = -2\log R^*(\mu)$ is bounded above:
$$W^*(\mu) \le B(n, a_n) \equiv -2\left[ n\log\left(\frac{(n+1)a_n}{n(a_n+1)}\right) + \log\left(\frac{n+1}{a_n+1}\right) \right].$$
Proof. We show that weights $w_i$ given by
$$w_i = \frac{a_n}{n(a_n+1)}, \quad i = 1, \ldots, n, \qquad w_{n+1} = \frac{1}{a_n+1}$$
satisfy $\sum_{i=1}^{n+1} w_i X_i = \mu$:
$$\sum_{i=1}^{n+1} w_i X_i = \frac{a_n}{a_n+1}\bar{X} + \frac{1}{a_n+1}\left(\mu - a_n(\bar{X} - \mu)\right) = \mu.$$
Then since clearly $\sum_{i=1}^{n+1} w_i = 1$, we therefore have
$$R^*(\mu) \ge \prod_{i=1}^{n+1} (n+1) w_i = \left(\frac{(n+1)a_n}{n(a_n+1)}\right)^{n} \frac{n+1}{a_n+1}.$$
So taking logarithms and multiplying by $-2$ we find that $W^*(\mu) \le B(n, a_n)$.
Table 2: Maximum possible confidence level for a non-trivial chi-square calibrated confidence interval using the AEL method of Chen et al. (2008). Confidence intervals with nominal level greater than the given values will include the entire parameter space. These numbers are for the case when $n = 10$ and $a_n = \frac{\log(n)}{2}$, for dimension ranging from 1 to 9. The upper bound for the adjusted log empirical likelihood ratio statistic for this $n$ and $a_n$ is $B(n, a_n) = 7.334$.
This result clearly indicates the poor performance of the chi-square calibration for this statistic with small $n$ or large $d$, as this bound will in some cases be well below the $1 - \alpha$ critical value of the $\chi^2_{(d)}$ reference distribution, which will make the chi-square calibrated $1 - \alpha$ confidence intervals equal $\mathbb{R}^d$. Table 2 displays the largest possible coverage level that does not result in the trivial parameter space confidence region using the AEL method, for the situation where there are 10 observations in $d$ dimensions. For small values of $d$ or large values of $n$, the bound will not cause much of a problem. For larger values of $d$ relative to $n$, the bound can be rather restrictive: from Table 2, we see that for $d \ge 3$, a 95% confidence region based on the $\chi^2_{(d)}$ reference distribution will include the entire space. Predictably, as $d$ increases for a fixed $n$, this issue becomes more pronounced. Figure 2 illustrates the bound phenomenon for 10 points in 4 dimensions, and also demonstrates suboptimal calibration even for values of $\alpha$ for which the boundedness of the statistic is not an issue.
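The bound and the resulting maximum non-trivial confidence levels can be computed directly. The sketch below assumes the bound takes the form $B(n, a_n) = -2[\,n\log\{(n+1)a_n/(n(a_n+1))\} + \log\{(n+1)/(a_n+1)\}]$, obtained from the feasible weights $a_n/\{n(a_n+1)\}$ on the observed points and $1/(a_n+1)$ on the extra point; this form reproduces the stated value 7.334 for $n = 10$ and $a_n = \log(n)/2$. The function names are illustrative, and the chi-square CDF is evaluated with a plain series for the regularized incomplete gamma function rather than a statistics library.

```python
import math


def ael_bound(n, a_n):
    """Upper bound on the AEL statistic implied by the feasible weights
    a_n / (n (a_n + 1)) on each observed point and 1 / (a_n + 1) on the extra point."""
    per_point = (n + 1) * a_n / (n * (a_n + 1.0))
    extra = (n + 1) / (a_n + 1.0)
    return -2.0 * (n * math.log(per_point) + math.log(extra))


def chi2_cdf(x, df, terms=500):
    """Chi-square CDF for x > 0 via P(a, h) = exp(-h) h^a sum_k h^k / Gamma(a + k + 1),
    with a = df / 2 and h = x / 2."""
    a, h = 0.5 * df, 0.5 * x
    total = sum(math.exp(k * math.log(h) - math.lgamma(a + k + 1.0))
                for k in range(terms))
    return math.exp(-h + a * math.log(h)) * total


if __name__ == "__main__":
    n = 10
    bound = ael_bound(n, 0.5 * math.log(n))
    print("B(n, a_n) =", round(bound, 3))
    for d in range(1, 10):
        # Largest nominal level for which the chi-square region is not all of R^d.
        print(d, round(chi2_cdf(bound, d), 3))
```

Any nominal confidence level above `chi2_cdf(bound, d)` has a critical value exceeding the bound, so the corresponding chi-square calibrated region is the whole space.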

Modified Sample Augmentation
Inspired by the approach of the AEL method, we propose augmenting the sample with artificial data to address the challenges mentioned above. However, there are several key differences between their approach and ours. In contrast to the one point added by the AEL method, we add two points, balanced so that the sample mean of the augmented data remains at $\bar{X}$. We also modify the placement of the points to
$$X_{n+1} = \mu - s\, c_{u^*} u^*, \qquad X_{n+2} = 2\bar{X} - \mu + s\, c_{u^*} u^*, \tag{4}$$
where $u^*$ is the unit vector in the direction of $\bar{X} - \mu$ and
$$c_{u^*} = \left(u^{*T} S^{-1} u^*\right)^{-1/2}.$$
This choice of $c_{u^*}$ may be recognized as the inverse Mahalanobis distance of a unit vector from $\bar{X}$ in the direction of $u^*$, and will result in the points being placed closer to $\mu$ when the covariance in the direction of $\bar{X} - \mu$ is smaller, and farther when the covariance in that direction is larger. We will assume that $P(\bar{X} = \mu) = 0$ and therefore we do not have to worry about the case when $u^*$ is undefined because $v^*$ is zero.
With the points placed as described, the sample mean of the augmented dataset is maintained at $\bar{X}$. The scale factor $s$ can be chosen based on considerations that will be investigated in the next section. Having determined the placement of the extra points, we then proceed as if our additional points $X_{n+1}$ and $X_{n+2}$ were part of the original dataset, and compute $\widetilde{W}(\mu) = -2\log(\widetilde{R}(\mu))$ where
$$\widetilde{R}(\mu) = \max\left\{ \prod_{i=1}^{n+2} (n+2) w_i \;:\; \sum_{i=1}^{n+2} w_i X_i = \mu,\; w_i \ge 0,\; \sum_{i=1}^{n+2} w_i = 1 \right\}.$$
We will refer to this statistic and method as the balanced augmented empirical likelihood method (BAEL) throughout the paper, to distinguish it from the unadjusted empirical likelihood statistic (EL) and the adjusted empirical likelihood statistic (AEL) of Chen et al. By the arguments of Chen et al. (2008), it is easy to show that with a fixed value of $s$ this approach to augmenting the dataset has the same asymptotic properties as the ordinary empirical likelihood statistic. Other desirable properties of the EL statistic are retained as well, as addressed in the following Proposition.
Proposition 4.1. Placing the points according to (4) preserves the invariance property of the empirical likelihood method under transformations of the form $X \mapsto \tilde{X} = CX$, where $C$ is an arbitrary full-rank matrix of dimension $d \times d$.
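Proof. The transformed $\tilde{u}^*$ is given by
$$\tilde{u}^* = \frac{C(\bar{X} - \mu)}{\|C(\bar{X} - \mu)\|} = \frac{Cu^*}{\|Cu^*\|},$$
and the transformed $c_{\tilde{u}^*}$ is given by
$$c_{\tilde{u}^*} = \left(\tilde{u}^{*T} (CSC^T)^{-1} \tilde{u}^*\right)^{-1/2} = \|Cu^*\| \left(u^{*T} S^{-1} u^*\right)^{-1/2} = \|Cu^*\|\, c_{u^*}.$$
Finally, when we place $\tilde{X}_{n+1}$ based on the transformed data, we get
$$\tilde{X}_{n+1} = C\mu - s\, c_{\tilde{u}^*} \tilde{u}^* = C\mu - s\, c_{u^*} C u^* = C X_{n+1},$$
and similarly $\tilde{X}_{n+2} = CX_{n+2}$. Using the fact that the original empirical likelihood method is invariant, we may conclude that this augmentation leaves the statistic invariant under the same group of transformations.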
One of the key differences between this approach and that of the AEL method is that as $\|\bar{X} - \mu\|$ increases the distance $\|\mu - X_{n+1}\|$ remains constant in our approach. This avoids the upper bound on $W^*(\mu)$ that occurs using the AEL method. The other key idea in this placement of the extra points is to utilize distributional information estimated from the sample in the placement of the extra points.
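The placement of the balanced points can be illustrated in two dimensions. This is a minimal sketch, assuming the placement $X_{n+1} = \mu - s\,c_{u^*}u^*$ and $X_{n+2} = 2\bar{X} - \mu + s\,c_{u^*}u^*$ with $u^*$ the unit vector along $\bar{X} - \mu$ and $c_{u^*} = (u^{*T}S^{-1}u^*)^{-1/2}$, which is how we read the description above. All function names are hypothetical, and the code is restricted to $d = 2$ so that the matrix algebra can be written out by hand.

```python
import math


def mean2(X):
    n = len(X)
    return [sum(p[0] for p in X) / n, sum(p[1] for p in X) / n]


def cov2(X):
    """Sample covariance matrix (divisor n - 1) of 2-d points."""
    n, m = len(X), mean2(X)
    sxx = sum((p[0] - m[0]) ** 2 for p in X) / (n - 1)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in X) / (n - 1)
    syy = sum((p[1] - m[1]) ** 2 for p in X) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def quad_form_inv(S, v):
    """Return v^T S^{-1} v for a symmetric 2x2 matrix S."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return (S[1][1] * v[0] ** 2 - 2.0 * S[0][1] * v[0] * v[1]
            + S[0][0] * v[1] ** 2) / det


def transform(C, p):
    """Apply the 2x2 matrix C to the point p."""
    return [C[0][0] * p[0] + C[0][1] * p[1], C[1][0] * p[0] + C[1][1] * p[1]]


def balanced_points(X, mu, s):
    """Return the two extra points under the assumed placement rule."""
    xbar, S = mean2(X), cov2(X)
    v = [xbar[0] - mu[0], xbar[1] - mu[1]]
    r = math.hypot(v[0], v[1])          # assumes xbar != mu
    u = [v[0] / r, v[1] / r]
    c = quad_form_inv(S, u) ** -0.5     # inverse Mahalanobis length of u
    p1 = [mu[0] - s * c * u[0], mu[1] - s * c * u[1]]
    p2 = [2.0 * xbar[0] - mu[0] + s * c * u[0],
          2.0 * xbar[1] - mu[1] + s * c * u[1]]
    return p1, p2


if __name__ == "__main__":
    X = [[0.0, 1.0], [1.0, 0.5], [2.0, 2.5], [1.5, -0.5], [-0.5, 0.3], [0.7, 1.8]]
    mu, s = [0.2, 0.4], 1.9
    p1, p2 = balanced_points(X, mu, s)
    print("augmented mean:", mean2(X + [p1, p2]), "original mean:", mean2(X))
    d1 = [p1[0] - mu[0], p1[1] - mu[1]]
    print("Mahalanobis distance of X_{n+1} from mu:",
          math.sqrt(quad_form_inv(cov2(X), d1)))
```

Under this placement the augmented sample mean equals $\bar{X}$, and the Mahalanobis distance from $\mu$ to $X_{n+1}$ equals $s$ regardless of how far $\bar{X}$ is from $\mu$; applying `balanced_points` to data transformed by a full-rank matrix $C$ should return the transformed extra points.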
The use of two points rather than just one is motivated by the original context of the empirical likelihood ratio statistic as a ratio of two maximized likelihoods: the numerator is the maximized empirical likelihood with the constraint that the weighted mean be $\mu$, and the denominator is the unconstrained maximized empirical likelihood which occurs at the sample mean $\bar{X}$. Adding just one point would necessarily change the sample mean, and therefore as different values of $\mu$ are tested, the resulting likelihood ratios are comparing the constrained maximum likelihoods to different sample means. Though the resulting weights in the denominator are the same no matter the value of the sample mean, the addition of two balanced points retains the spirit of the method and results in an interesting connection between the empirical likelihood ratio statistic and Hotelling's T-square statistic, as discussed further in Section 4.1.
In the next section we will address the effect of the scale factor $s$ on the resulting statistic, and in particular we will describe and prove a result connecting the empirical likelihood method and Hotelling's T-square test in small samples.

Limiting Behavior of $\widetilde{W}(\mu)$ as $s \to \infty$
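To reduce notation, we will work with the standardized versions of the data and the hypothesized mean as described in Section 2, so
$$\widetilde{W}(\eta) = -2\log\widetilde{R}(\eta), \qquad \widetilde{R}(\eta) = \max\left\{ \prod_{i=1}^{n+2} (n+2) w_i \;:\; \sum_{i=1}^{n+2} w_i Z_i = \eta,\; w_i \ge 0,\; \sum_{i=1}^{n+2} w_i = 1 \right\},$$
where $Z_{n+1}$ and $Z_{n+2}$ are defined as follows. Using the transformed variables, we let
$$r = \|\eta\|, \qquad u = -\frac{\eta}{r}.$$
As these standardized observations have sample mean equal to zero and sample covariance matrix equal to $I_d$, the extra points $Z_{n+1}$ and $Z_{n+2}$ are then given by
$$Z_{n+1} = \eta - su, \qquad Z_{n+2} = -\eta + su.$$
Then as the distance of these extra points from $\bar{Z} = 0$ increases, we are interested in the limiting behavior of the resulting adjusted empirical likelihood statistic, which is given by the following theorem:
Theorem 4.2. For a fixed sample of size $n$,
$$\frac{2ns^2}{(n+2)^2}\, \widetilde{W}(\mu) \to T^2(\mu) \quad \text{as } s \to \infty.$$
Here we present a brief outline of the proof; a complete and detailed proof is given in the Appendix. We will use the following notation throughout the proof of the theorem. As in Owen (2001), let $\lambda$ be the Lagrange multiplier satisfying
$$\sum_{i=1}^{n+2} \frac{Z_i - \eta}{1 + \lambda^T (Z_i - \eta)} = 0, \tag{6}$$
so then the weights that maximize $\widetilde{R}(\eta)$ are given by
$$w_i = \frac{1}{n+2} \cdot \frac{1}{1 + \lambda^T (Z_i - \eta)}.$$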
The proof of the theorem proceeds in the following steps:
1. First we establish that $\lambda^T u = o(s^{-1})$ using a simple argument based on the boundedness of the weights $w_i$.
2. We bound the norm of $\lambda$ by $\|\lambda\| = o(s^{-1/2})$ using the result from step 1 together with the fact that $\lambda^T (Z_i - \eta) > -1$ for all $i$, and an algebraic identity given in the Appendix.
3. Using the result from step 2, the unit vector in the direction of $\lambda$, given by $\theta$, is shown to satisfy $\theta^T u \to 1$. Then since from step 1 we have $\lambda^T u = o(s^{-1})$, we get $\|\lambda\| = o(s^{-1})$.
4. The limiting behavior of $\lambda$ is found to be $s^2 \lambda^T u \to \frac{(n+2)r}{2}$, using the bound from step 3 together with the constraint given by equation (6), and an algebraic identity given in the Appendix. This gives $\|\lambda\| = O(s^{-2})$.
5. Finally we use the limiting behavior of $\lambda$ from step 4 to get $\frac{2ns^2}{(n+2)^2} \widetilde{W}(\mu) \to T^2$. This is done by substituting the expression for $\lambda$ from step 4 into the expression for $\widetilde{W}(\eta)$:
$$\widetilde{W}(\eta) = 2\sum_{i=1}^{n+2} \log\left(1 + \lambda^T (Z_i - \eta)\right),$$
and using the Taylor series expansion for $\log(x)$ as $x \to 1$.
This proof differs in several key ways from the usual empirical likelihood proofs, and these five steps are presented in full detail in sections A.1-A.5 of the Appendix.
We mentioned in Section 2.2 that asymptotically the empirical likelihood test becomes equivalent to Hotelling's T-square test under the null hypothesis as $n \to \infty$, but this theorem extends that relationship. This result provides a continuum of tests ranging from the ordinary empirical likelihood method to Hotelling's T-square test for any sample size. The magnitude of $s$ that is required to achieve reasonable convergence to Hotelling's test depends on the dimension and sample size.
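The limiting relationship can be examined numerically in the simplest case. The sketch below is univariate ($d = 1$) only and assumes the balanced placement $X_{n+1} = \mu - s\,c\,u$, $X_{n+2} = 2\bar{X} - \mu + s\,c\,u$ with $u = \mathrm{sign}(\bar{X} - \mu)$ and $c$ equal to the sample standard deviation (the $d = 1$ form of the inverse Mahalanobis scaling). It then compares $\frac{2ns^2}{(n+2)^2}\widetilde{W}(\mu)$ with $T^2(\mu)$ for increasing $s$. The function names, the bisection solver, and the example data are all illustrative assumptions.

```python
import math


def el_stat(x, mu, iters=200):
    """Univariate EL statistic; assumes mu lies strictly inside the range of x."""
    z = [xi - mu for xi in x]
    lo, hi = -1.0 / max(z), -1.0 / min(z)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # sum z_i / (1 + lam z_i) is decreasing in lam; bisect for its root.
        if sum(zi / (1.0 + mid * zi) for zi in z) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log1p(lam * zi) for zi in z)


def hotelling_t2(x, mu):
    n = len(x)
    xbar = sum(x) / n
    var = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    return n * (xbar - mu) ** 2 / var


def bael_stat(x, mu, s):
    """EL statistic on the sample augmented with two balanced points (d = 1)."""
    n = len(x)
    xbar = sum(x) / n
    sd = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    u = 1.0 if xbar > mu else -1.0      # assumes xbar != mu
    extra = [mu - s * sd * u, 2.0 * xbar - mu + s * sd * u]
    return el_stat(list(x) + extra, mu)


def scaled_ratio(x, mu, s):
    """Ratio of the rescaled BAEL statistic to Hotelling's T-square."""
    n = len(x)
    return 2.0 * n * s ** 2 / (n + 2) ** 2 * bael_stat(x, mu, s) / hotelling_t2(x, mu)


if __name__ == "__main__":
    x = [0.3, -1.2, 0.8, 1.5, -0.4, 2.1, -0.9, 0.6, 1.1, -0.2]
    for s in (1.9, 10.0, 100.0, 1000.0, 10000.0):
        print(s, scaled_ratio(x, 0.0, s))
```

If the theorem holds as stated, the printed ratio should approach 1 as $s$ grows, while for moderate values such as $s = 1.9$ the two statistics remain distinct.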

Results
First we present the results of simulations to compare the accuracy of the chi-square calibration for the original empirical likelihood method (EL), the Chen et al. (2008) adjusted empirical likelihood method (AEL), and our balanced augmented empirical likelihood method (BAEL) in Section 5.1. Then we illustrate the effect of the $s$ parameter on the relationship of the BAEL method to the original empirical likelihood method and to Hotelling's T-square test in Section 5.2.

Calibration Results
To compare the calibration of EL, AEL, and BAEL, we performed numerical comparisons based on simulated datasets for a variety of settings. We considered four combinations of sample size and dimension: $(d, n) = (4, 10)$, $(4, 20)$, $(8, 20)$, and $(8, 40)$. For each combination, we simulated datasets from nine different distributions with independent margins. The distributions were chosen to represent a range of skewness and kurtosis so that we could evaluate the effects of higher moments on the calibration of the method. The skewness and kurtosis of the chosen distributions are listed in Table 3. We compared the chi-square calibrations of EL, AEL, and BAEL by creating quantile-quantile plots of the log empirical likelihood ratio statistics versus the appropriate chi-square distribution. Figures 3-10 show the resulting improvement in chi-square calibration using our BAEL method. We also plotted the p-values resulting from the chi-square calibration versus uniform quantiles in the corresponding probability-probability plots, to give a better indication of the coverage errors of the different methods. In each figure, the black lines or points represent the ordinary EL method; the red lines or points represent the AEL method of Chen et al.; and the green lines or points are the results of our BAEL statistic. In the probability-probability plots, we have also included a blue line for the p-values resulting from Hotelling's T-square test. All of these figures were produced using $s = 1.9$; more discussion of the choice of $s$ will be given in Section 6.
These plots demonstrate the marked improvement in calibration achieved by our method: for symmetric distributions, the actual type I error is almost exactly the nominal level, particularly in the upper right regions of the plots where most hypothesis testing is focused. For the skewed distributions, the accuracy of the calibration depends on the degree of skewness and also on the kurtosis of the distributions. We find that it is harder to correct the behavior of empirical likelihood in skewed and highly kurtotic distributions, but even in the case of the Gamma(1/4, 1/10) distribution we have achieved distinct improvement over the other two versions of empirical likelihood. We have also essentially matched the calibration performance of Hotelling's T-square test even though the value of the scale factor $s$ is not large enough to have forced convergence to Hotelling's test, as will be addressed in Section 5.2. Thus we are still in the empirical likelihood setting, but with significantly improved accuracy for our test.
[Table 3: Skewness and kurtosis of example distributions. Columns: Marginal Distribution, Skewness, Kurtosis.]
Note also that though the behavior in skewed distributions is not completely corrected by our calibration, it appears from the quantile-quantile plots that a Bartlett correction might result in a marked improvement by shifting the slope of the reference distribution line. A Bartlett correction is clearly not as likely to result in improvement for the EL and AEL statistics, as the quantile-quantile plots for those methods versus the reference chi-square distribution are quite non-linear.

Sample Space Ordering Results
Next we explored the degree to which our new calibration deviates from the ordinary empirical likelihood method to agree with Hotelling's, as a function of the scale factor $s$. Two tests are functionally equivalent if they order the possible samples in the same way, and therefore will always come to the same conclusion. Otherwise, if the tests produce different orderings of possible samples, they may make different decisions on the same dataset. For instance, the two-tailed t-test for a univariate mean is equivalent to the F-test that results from squaring the t statistic: though these two tests have different reference distributions, they will always make the same decision for any given sample. In contrast, Pearson's chi-square test for independence in $2 \times 2$ tables orders the sample space differently than Fisher's exact test does, and thus these two tests may come to different conclusions. The important idea here is that the ordering that different tests impose on the sample space determines the properties of the tests, such as their power against various alternatives.
[Figure caption (quantile-quantile plots, $d = 4$): the x-axis is quantiles of the $\chi^2_{(4)}$ distribution, and the y-axis is quantiles of the ordinary EL statistic (black), the AEL statistic (red), and our BAEL statistic (green). Reading across the rows, the distributions are arranged in order of increasing skewness and then increasing kurtosis. The first five distributions are symmetric. Black tick marks on the $y = x$ line indicate the 90%, 95%, and 99% quantiles of the reference distribution.]
We have shown that as $s$ increases, our BAEL statistic will become equivalent to Hotelling's T-square statistic, but we would like to explore the extent to which this is true for small values of $s$. To do this, we generated 100 datasets, each consisting of 40 observations from a standard multivariate Gaussian distribution in 8 dimensions. For each dataset, we computed Hotelling's T-square statistic $T^2(\mu_0)$, the EL statistic $W(\mu_0)$, and the BAEL statistic $\widetilde{W}(\mu_0)$. We considered how the three statistics ordered different samples when testing the true null hypothesis by ranking the datasets according to each of the statistics. Figure 11 plots the ranking of the samples according to the BAEL statistic on the y-axis versus the ranking according to Hotelling's T-square statistic on the x-axis. The value of $s$ increases as powers of 2 from the top left plot to the bottom right. These same samples and choices of $s$ are shown again in Figure 12, except now the x-axis is the rank according to the EL statistic.
These figures demonstrate the convergence of the sample space ordering to that of Hotelling's T-square statistic as s increases. From these figures we can see, for example, that for the value s = 1.9 used in the calibration simulations, the ordering imposed by the BAEL statistic has not yet converged to the ordering produced by Hotelling's T-square statistic. It is important to note that although the sample space ordering of the new augmented empirical likelihood statistic appears identical to that of Hotelling's statistic when s = 16, this does not mean that the relationship between the two statistics is linear yet. We also note that for different combinations of the underlying distribution, sample size, and dimension, the same value of s will produce different ordering discrepancies between the augmented empirical likelihood method and Hotelling's T-square statistic, but the qualitative behavior as s increases will be preserved.

Discussion
We have introduced and explored many of the properties of a new augmented data empirical likelihood calibration. It has performed remarkably well in difficult problems with quite small sample sizes, and produces a versatile family of tests that allow an investigator to take advantage of both the data-driven confidence regions of the empirical likelihood method and the accurate calibration of Hotelling's T-square test.
In additional simulations we have explored the effect of the scale factor s on the resulting chi-square calibration of the BAEL statistic. We found that there is some variability in the value of $s^*(d)$ that produces the best $\chi^2(d)$ calibration for a given dimension, but the range is fairly tight, from approximately 1.6 for d = 2 to 2.5 for d = 30. The optimal value $s^*(d)$ was chosen to be the value that gave the best overall fit to the $\chi^2(d)$ distribution, as judged by the Kolmogorov-Smirnov statistic. The default value $s^*(d)$ warrants more detailed investigation, and will be explored further in later work.
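Selection by the Kolmogorov-Smirnov criterion can be sketched as below. This is an illustration only: computing the BAEL statistic over a grid of s values is not reproduced here, so rescaled chi-square draws serve as hypothetical stand-ins for the simulated statistics at three candidate settings. The closed-form CDF is valid for even d only, and all function names are our own.

```python
import math
import random

def chi2_cdf_even(x, d):
    """CDF of the chi-square(d) distribution for even d (Erlang closed form)."""
    if d % 2 != 0:
        raise ValueError("this closed form only covers even degrees of freedom")
    if x <= 0:
        return 0.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, d // 2):
        term *= half / k  # (x/2)^k / k!
        total += term
    return 1.0 - math.exp(-half) * total

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov distance between a sample and a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))

rng = random.Random(1)
d = 4
# chi-square(d) draws built as sums of d squared standard normals
reference = [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)) for _ in range(2000)]

# Hypothetical stand-ins: each scale mimics a candidate setting whose statistic is
# deflated (0.8), well calibrated (1.0), or inflated (1.5) relative to chi-square(d).
candidates = {scale: [scale * v for v in reference] for scale in (0.8, 1.0, 1.5)}

ks = {scale: ks_statistic(vals, lambda x: chi2_cdf_even(x, d))
      for scale, vals in candidates.items()}
best = min(ks, key=ks.get)  # setting with the best overall chi-square fit
print(best, {scale: round(value, 3) for scale, value in ks.items()})
```

In an actual calibration study the dictionary `candidates` would instead hold simulated null values of the statistic for each candidate s, and `best` would play the role of $s^*(d)$.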
We would like to investigate the potential of a Bartlett correction to improve the calibration in skewed samples. Since estimating the correction factor for a Bartlett correction involves estimating fourth moments, it will be a challenge in small samples and high dimensions, but it does appear that there may be significant gains possible. The linearity of the quantile-quantile plots in the skewed distributions indicates that perhaps the skewness just scales the chi-square distribution of the augmented empirical likelihood statistic, but does not otherwise significantly alter it. This certainly warrants further exploration and theoretical justification.
Concurrent work by Liu & Chen (2009) has explored the use of two additional data points in another context. They kindly shared a preprint of their article with us as we were finishing work on our approach. Liu & Chen (2009) use different criteria for determining the placement of the extra points, and they investigate a connection between their resulting method and the Bartlett correction for empirical likelihood.
We have not addressed the power of the resulting test in this work, but we have made preliminary investigations into the effect of our modification on the power of the competing tests. As might be expected, we have found that the power of BAEL lies between that of the ordinary empirical likelihood test and that of Hotelling's T-square test. The relationship described in Theorem 4.2 explains this behavior on a heuristic level, and also indicates that as s increases, the power curve of the augmented empirical likelihood test will more closely resemble the power curve of Hotelling's T-square test. For most of the examples and alternatives that we explored, the power of the ordinary empirical likelihood test and the power of Hotelling's test were very close.
The connection to Hotelling's T-square test may prove to be especially interesting in the multi-sample setting. This result has potential implications beyond the one-sample mean setting, where it is largely of theoretical interest. In multi-sample settings, this relationship, combined with the generality of the empirical likelihood method, might be useful in extending Hotelling's test to scenarios where it currently does not apply, such as unequal group sizes with different variances. The use of the empirical likelihood-Hotelling's T-square continuum could enable us to produce tests with the accuracy of Hotelling's T-square, but with the flexibility and relaxed assumptions of the empirical likelihood framework. Similar extensions may also be made to regression problems.

A Proof of Theorem 4.2
Recall that $Z_i$ are the standardized variables $Z_i = A^{-1}(X_i - \bar{X})$, leading to the standardized versions of the sample mean, $\bar{Z} = 0$, and of the hypothesized mean, $\eta = A^{-1}(\mu - \bar{X})$. We have defined the following quantities: [...]
But since $r_i^T u = 0$ for all $i$, the only way this equality can hold is if both sides are 0, so [...]
In the last line, the term $s^3 u (\lambda^T u)^2 [(n+2)(w_{n+2} - w_{n+1})]$ is of order $o(s^3)\,o(s^{-2})\,O(s^{-1}) = o(1)$, using (7) and (8). Thus we get
$$0 = (n+2)\,r u - 2 s^2 (\lambda^T u)\,u + o(1),$$
giving [...] and since $\lambda^T u = \lambda \theta^T u$, by (15) we conclude [...]

A.5 Step 5
Finally, we use the Taylor series expansion of $\log(1 + x)$ about $x = 0$ to write
$$-\log\big((n+2) w_i\big) = \log\big(1 + \lambda^T (Z_i - \eta)\big) = [...]$$
where $d_i = O(s^{-4})$ from (19) and the boundedness of the other terms in the expansion. Using the representation (20) in the expression [...] we have [...] Substituting in the limiting expression (18) for $s^2 \lambda^T u$, we have [...] which simplifies to
$$\frac{2 n s^2}{(n+2)^2}\, W(\eta) \to n r^2.$$
Then, since in this standardized setting Hotelling's T-square statistic is given by $T^2 = n \eta^T \eta = n(-ru)^T(-ru) = n r^2$, this completes the proof.
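The identity used in the final step, $T^2 = n\eta^T\eta = nr^2$ in the standardized setting, can be checked numerically. The sketch below assumes that A is any matrix satisfying $A A^T = S$, with S the sample covariance (a Cholesky factor is used here as one such choice; the definition of A earlier in the paper governs). The data, the mixing matrix, and the hypothesized mean are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 8

# Hypothetical correlated data: Gaussian noise mixed by a fixed invertible matrix.
mix = np.eye(d) + 0.5 * np.triu(np.ones((d, d)), 1)
x = rng.standard_normal((n, d)) @ mix
mu = np.full(d, 0.3)  # hypothesized mean (arbitrary choice)

xbar = x.mean(axis=0)
s_cov = np.cov(x, rowvar=False)   # sample covariance S
a = np.linalg.cholesky(s_cov)     # assumed choice of A with A A^T = S

z = np.linalg.solve(a, (x - xbar).T).T  # Z_i = A^{-1}(X_i - xbar)
eta = np.linalg.solve(a, mu - xbar)     # eta = A^{-1}(mu - xbar)

r = np.linalg.norm(eta)  # eta = -r u for a unit vector u
u = -eta / r

t2_direct = n * (xbar - mu) @ np.linalg.solve(s_cov, xbar - mu)
t2_standardized = n * eta @ eta

print(t2_direct, t2_standardized, n * r ** 2)
```

All three printed values agree up to rounding, and the standardized data have mean zero and identity sample covariance.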

Figure 1: Quantile-quantile and probability-probability plots for the null distribution of the empirical likelihood (EL) statistic versus the reference $\chi^2$ distribution when the data consist of 10 points sampled from a 4-dimensional multivariate Gaussian distribution. The x-axis corresponds to quantiles (left) or p-values (right) for the $\chi^2$ distribution and the y-axis to quantiles (left) or p-values (right) of the EL statistic.

Figure 2: Quantile-quantile and probability-probability plots for the null distribution of the adjusted empirical likelihood (AEL) statistic versus the reference $\chi^2$ distribution when the data consist of 10 points sampled from a 4-dimensional multivariate Gaussian distribution. The x-axis corresponds to quantiles (left) or p-values (right) for the $\chi^2$ distribution and the y-axis to quantiles (left) or p-values (right) of the AEL statistic.

Figure 3: Quantile-quantile plots for d = 4, n = 10. The x-axis has quantiles of the $\chi^2(4)$ distribution, and the y-axis is quantiles of the ordinary EL statistic (black), the AEL statistic (red), and our BAEL statistic (green). Reading across the rows, the distributions are arranged in order of increasing skewness and then increasing kurtosis. The first five distributions are symmetric. Black tick marks on the y = x line indicate the 90%, 95%, and 99% quantiles of the reference distribution.

Table 1: Comparisons of the small-sample properties of the calibration methods discussed in Section 3. The first column of comparisons indicates the abilities of the methods to address the constraint that the hypothesized mean must be contained in the convex hull of the data. The second comparison column describes the degree to which the method improves the agreement between the achieved and nominal level of a hypothesis test, when a test of that level is possible given the convex hull constraint.