Comparing two samples by penalized logistic regression

Inference based on the penalized density ratio model is proposed and studied. The model under consideration is specified by assuming that the log--likelihood function of two unknown densities is of some parametric form. The model has been extended to cover multiple samples problems while its theoretical properties have been investigated using large sample theory. A main application of the density ratio model is testing whether two, or more, distributions are equal. We extend these results by arguing that the penalized maximum empirical likelihood estimator has less mean square error than that of the ordinary maximum likelihood estimator, especially for small samples. In fact, penalization resolves any existence problems of estimators and a modified Wald type test statistic can be employed for testing equality of the two distributions. A limited simulation study supports further the theory.


Introduction
The density ratio model is specified by assuming that the log-ratio of two unknown probability density functions is linear in some parameters, see Qin and Zhang (1997) and Qin (1998).The model is motivated by considering a logistic regression model for a binary random variable Y , which assumes the values 1 and 2-where "1" denotes success-and X a d-dimensional vector of covariates, see Anderson (1972Anderson ( , 1979)), Breslow and Day (1980), Cox and Snell (1989), for example.Then, the logistic regression model expresses the probability of the event {Y = 1} as a function of X by , where α ⋆ is a scalar parameter and β is a d × 1 vector of regression coefficients.The logistic model leads to the density ratio model when considering case-control or retrospective sampling, see Prentice and Pyke (1979), Farewell (1979).Suppose that X 11 , . . ., X 1n1 is a random sample from G(x | Y = 1) and X 21 , . . ., X 2n2 is another independent sample from G(x | Y = 2).Set π = P [Y = 1] and g(x | Y = i) = dG(x | Y = i)/dx, for the conditional probability density function of X given Y = i, i = 1, 2. Bayes' theorem shows that with α = α ⋆ + log{(1 − π)/π}.The last equation justifies the term density ratio model: the densities of the observations are related by a parametric exponential tilt, but otherwise are unknown.In general, consider the following two independent samples semiparametric problem: X 11 , . . ., X 1n1 is a random sample from g 1 (x) = exp (α + β ′ h(x)) g 2 (x), X 21 , . . ., X 2n2 is a random sample from g 2 (x), (1) where g i (x), i = 1, 2 are unknown probability density functions.The quantity α is an unknown scalar, β is a ddimensional vector of parameters while h(x) is a d-dimensional vector which consists of known functions of X.Several examples of distribution fall within the above framework, in particular the exponential family of distributions satisfies (1) trivially.An important observation is that when the model holds, and if β = 0, then the two samples are identically distributed.We conclude that model (1) is useful to the semiparametric comparison of two samples in the sense that the densities g i (.), i = 1, 2 are left completely unspecified but the weight function exp (α + β ′ h(x)) depends on some finite dimensional parameter.The last remark connects the density ratio model and biased sampling theory, see Vardi (1982Vardi ( , 1985)), Gill et al. (1988), Gilbert et al. (1999) and Gilbert (2000).
Inference regarding both finite and infinite dimensional parameters of model (1) has been studied by various authors assuming that the sample sizes tend to infinity in a suitable way.The methodology is based on empirical likelihood inference, see Owen (2001).Accordingly, a parametric likelihood function for the finite dimensional parameters is obtained after profiling out the infinite dimensional parameter of the model.However, there are applications where the sample sizes are small and therefore direct application of the empirical likelihood methodology might not be suitable.To overcome these problems we put forward a penalized empirical likelihood function for inference, see (4), which depends on two forms, a regularization parameter and a penalty function.The regularization parameter controls the amount of penalization whereas the penalty term is a function of the finite dimensional parameter of the model.The choice of the penalty function is independent of the infinite dimensional parameter as it will be explained in Section 2 where the methodology is developed in detail.Using penalized empirical likelihood leads to solution of several problems including elimination of non existence and non convergence issues.In addition approximations even for relatively small sample sizes are adequate enough to allow for testing.Section 3 reviews several facts about the resulting estimators and shows that a judicious choice of the regularization parameter leads to consistent and asymptotically normal estimators.The last section examines the finite sample performance of the proposed estimators by some limited simulations.The paper closes with several remarks and an appendix.

Penalized likelihood inference
Suppose that X ij , j = 1, . . ., n i , i = 1, 2 are two independent random samples from cumulative distribution functions G i (x), i = 1, 2, respectively.Let g i (x) be the corresponding density functions, that is g i (x) = dG i (x)/dx, and denote by X = (X 11 , X 12 , . . ., X 1n1 , X 21 , . . ., X 2n2 ) ′ the vector of all observations.Suppose further that the density ratio model (1) holds.Inference for the finite dimensional parameters θ = (α, β ′ ) ′ is based on the so-called empirical likelihood-see Owen (2001).Accordingly, let p ij denote the size of the jump at the observed datum x ij , that is , 2 and consider the following nonparametric likelihood given the data, (2) Following Qin and Zhang (1997) and Fokianos et al. (2001), elimination of the infinite dimensional parameter G 2 (.) is accomplished by maximizing the first product of (2) subject to the constraints 2 i=1 ni j=1 p ij = 1, and Avoiding unnecessary repetition, the empirical log-likelihood is given by l where ρ 1 = n 1 /n 2 .Furthermore, if θ = (α, β) ′ denotes the maximum likelihood estimator of θ, assuming that it exists, then it can be shown that Hence a consistent estimator for both of G 1 (.) and G 2 (.) can be constructed provided that the total sample size n = n 1 + n 2 tends to infinity such that n 1 /n 2 → ρ 1 -see Qin and Zhang (1997) and Fokianos et al. (2001) for further details.
Expression (3), after reparametrization, is equivalent to standard logistic regression likelihood-a direct consequence of the equivalence between retrospective and prospective sampling, see Prentice and Pyke (1979).Accordingly, inference for the parameter θ is carried out by numerous statistical programs which include logistic regression modeling.However, it is well known that if the sample size is large, then the logistic likelihood equations yield the maximum likelihood estimator provided that it exists.In this case, the score equations are solved by the Fisher scoring method and under some reasonable regularity assumptions a sequence of approximations is derived that converges to the maximum likelihood estimators of α and β.In contrast, convergence and existence issues arise when the sample size is relatively small.When the sample size is small, maximum likelihood estimates fail to exist in general if the data are completely, quasi-completely or partially separated.More specifically, there does not exist a maximum likelihood estimator of β if there exists a hyperplane in the covariate space such that when Y = 1, the covariates belong to the left side of the hyperplane whereas for Y = 2, the covariates belong to the other side of the hyperplane (see Albert and Anderson (1984), McCullagh and Nelder (1989, Sec. 4.4) and Santner and Duffy (1989, p. 234)).The discussion shows that application of the density ratio model is questionable when only a few observations belong to each group.
However, regularization of the log -likelihood function (3), in the sense of adding a concave penalty term, avoids existence and convergence issues, see also Hastie and Tibshirani (2004).Consider the so-called penalized empirical log-likelihood function where l(θ) denotes the unrestricted log-likelihood function given by (3), λ n is a sequence of regularization parameter controlling the amount of shrinkage and J(.) is a penalty function on the parameter β.The parameter α-the interceptis not penalized explicitly because it is a function of β when (1) holds.It is well known that when λ n → 0 then (4) yields to the unrestricted maximum likelihood estimator whereas if λ n → ∞ then θ shrinks towards 0. We argue that it is reasonable to employ the profile empirical likelihood for inference.Expression (4) is not a proper log-likelihood function in the sense that the first summand, namely the quantity l(θ), is the outcome of a profiling procedure derived by means of empirical likelihood methodology.However, recent work in the area of semiparametric statistical inference shows that such functions share the same properties of common likelihood functions, see Murphy and van der Vaart (2000).Thus, it is sensible to penalize (3) by introducing an extra concave term which does not depend on the distribution function G 2 .The method resolves successfully the existence problem of the maximum likelihood estimator of θ and yields estimates of the unknown distribution functions even for relatively small samples.
Motivation of penalized empirical likelihood (4) stems also from a Bayesian point of view.Suppose that the density ratio model (1) holds conditionally on β and suppose that β is a random variable with a density proportional to The above display points out to a crucial point, namely the independence between the prior of β and the cdf G 2 .It is precisely this fact that enables to base inference on the posterior likelihood function By setting l p (θ) ≡ log π(β|G 2 , x) we obtain the penalized log-likelihood expression (4).Indeed, maximization of the last display with respect to p ij 's is equivalent to maximization of (2) since the second factor does not depend on the infinite dimensional parameter.That is the penalized log-likelihood function still produces valid cdf estimators that satisfy the constraints imposed by the density ratio model.We conclude that maximization of π(β|G 2 , x) has no effect on the final form of the profile log-likelihood function as long as the designated function of β that multiplies (2) does not depend on G 2 .Note that the methodology is quite general and it is rather interesting to study its properties for a wider class of examples associated with empirical likelihood theory.
The penalized log-likelihood equations (4) depends on the choice of regularization parameter and the penalty term J(β).The selection of λ n can be based on existing cross-validation methods but it is unclear what are the properties of such an approach especially when few observations are available.There are various approaches to choose the penalty function-perhaps the most popular being J(β) = d j=1 β 2 j leading to the so called ridge regression type estimators (see Le Cessie and Van Houwelingen (1992) for the case of the logistic regression when the number of covariates is large).A more general family of penalty functions is given by J(β) = d j=1 γ k ψ(β j ), with γ k > 0 and leads to several well known examples.For instance, if γ k = 1 and ψ(β j ) =| β j |, then J(β) reduces to the L 1 penalty (see Tibshirani (1996)) while the L q penalty is obtained when ψ(β j ) =| β j | q , for q > 0, see Frank and Friedman (1993).Insight on the choice of penalty function is given in the recent articles of Antoniadis and Fan (2001) and Fan and Li (2001).However, in what follows consider the penalty function (5) In particular, the ridge penalty will be employed in applications since the linear combination β ′ h(x) is expected to vary smoothly in accordance with (1).

Main results
To fix notation, suppose that θ = (α, β′ ) ′ denotes the maximum likelihood estimator derived by maximization of the unrestricted log-likelihood (3) and let θλ denote the constrained maximum likelihood estimator obtained by maximizing (4).In what follows, θ 0 denotes the true value of the parameter θ when (1) holds.Define and put The following lemma summarizes some useful facts regarding the large sample behavior of both the score function and the Hessian matrix.It is used for the proofs of Theorems 3.1, 3.2 and 3.3 and it is stated for completeness of the presentation.
Lemma 3.1 (Qin and Zhang (1997)) Suppose that the density ratio model (1) holds and assume that A is nonsingular.Denote by , a (d + 1) -dimensional vector.Then • The score function of the unrestricted likelihood (3) is asymptotically normal, as n → ∞ such that n 1 /n 2 → ρ 1 .• The Hessian matrix of the unrestricted likelihood (3) converges in probability, as n → ∞ such that n 1 /n 2 → ρ 1 , to The large sample behavior of the score and the Hessian matrix requires a number of standard regularity conditions to be satisfied.However, in the context of logistic regression, these conditions can be easily verified, see for instance Santner and Duffy (1989).Hence, the density ratio model can be applied to a variety of settings provided that the log likelihood ratio of two probability densities whose support is identical, is a sufficiently smooth function.Several well known examples of distributions fall within this class and therefore model ( 1) is applicable to a large collection of problems.
The first task is to establish existence of θλ , that is whether the problem of maximizing ( 4) is well defined and if the resulting estimator is unique.When considering penalty functions of the form (5)-and more generally sums of smooth convex functions-the answer is given by the following result: Lemma 3.2 (Fu (1998)) For the penalty function (5) and for given q > 1 and λ n > 0 there exists a unique solution of the penalized score equations The unique solution is equal to the unique estimator of the maximization problem max θ l p (θ), provided that ∇l(θ) is continuously differentiable with respect to θ and ∇ 2 l(θ) is positive semi definite.
The equivalence of the density ratio model to logistic regression implies that both conditions of Lemma 3.2 are satisfied, or, in other words, a unique restricted estimator θλ exists.Summarizing, the issue of non existence of maximum likelihood estimators in the context of logistic regression is successfully resolved by penalization.Thus, the density ratio model is applicable to situations where the available sample size is small.Furthermore, it will be shown that the choice of λ n = O( √ n) yields to √ n-consistent estimators.In the case of λ n = o( √ n) then θλ is consistent.These conclusions are supported further by some limited simulation results which are reported in the next section.
In addition, estimates of the unknown distribution functions are computed by setting and Ĝλ where I(.) is the indicator function.The jump sizes pλ ij sum up to 1 by construction since the penalized log likelihood function (4) is employed after profiling out the parameters p ij .
Remark 3.1 Identifiability of α, β and G 2 (.) is guaranteed by the work of Gilbert et al. (1999, Th. 2) which shows that if there exist a value x 0 such that h(x 0 ) = 0, then the density ratio model ( 1) is identifiable.
The above theorem, when compared to the results of Lemma 3.1, does not point out to any advantages of the penalization for the density ratio model.Indeed, under the aforementioned conditions the maximum penalized empirical likelihood estimator of θ is asymptotically biased and its covariance matrix is equal to the covariance matrix of the unrestricted maximum empirical likelihood estimator.However the results is asymptotic and a careful examination of the proof of the theorem 3.2 shows that the asymptotic covariance matrix Σ is approximated by where Â is the empirical estimator of A and D is equal to the diagonal matrix D evaluated at βλ , provided that the penalty function is twice differentiable.Hence, for a judicial chosen value of λ n , the mean square error of θλ will be smaller than that of θ in small samples.This point is further illustrated in the next section by considering some finite sample properties of the restricted maximum likelihood estimator, see equation ( 13) and subsequent discussion.
The estimate (11) performs satisfactorily even for small sample sizes given a known value of the regularization parameter-see next section.For large n, and 11) reduces to that used by Qin and Zhang (1997) and Fokianos et al. (2001) for the asymptotic variance estimator of the unrestricted maximum likelihood θ.Theorem 3.2 suggests that a standard Wald test can be used to test the hypothesis H 0 : = 0-that is the two samples are identically distributed.In particular, under the hypothesis and provided that λ n = o( √ n), we obtain that the test statistic where Σ−1 22 denotes the estimated asymptotic variance of βλ which is obtained by means of ( 11).Under the null hypothesis, the asymptotic distribution of W is the chi-square with d degrees of freedom provided that the regularization parameter satisfies the aforementioned conditions.It is clear that (12) depends on the choice of the regularization parameter but the notation is suppressed for ease of presentation.A limited simulation study-see Section 4-shows that the chi-square approximation is valid even though the sample size is relatively small.Remark 3.2 Application of the density ratio model relies on the assumed functional form of h(.).Misspecification of this form results in biased estimators and loss of efficiency-the point is made by Fokianos and Kaimi (2006) who employ the Box-Cox family of transformations to estimate h(.).However, we have tacitly assumed throughout the presentation that the function h(.) is the true function associated with the density ratio model (1).
Remark 3.3 An alternative penalization scheme is based on (5) but with 0 < q ≤ 1, see Knight and Fu (2000) for example, who studied the problem in the linear regression setup.These penalty functions are rather appealing in the sense that they combine model selection and estimation.Theory regarding the density ratio model in connection to penalty functions of the form J(β) when 0 < q ≤ 1 might be particular useful especially for multivariate observations.In this case, the main task is to identify a parsimonious functional form of the log likelihood ratio of two, or more, multivariate probability density functions and estimate some of the corresponding coefficients while setting the remaining to zero.In other words, the problem of modeling multivariate data by the density ratio model in large dimensions reduces to that of a model selection and estimation problem.
The last part of this section is devoted to the study of the large sample properties of the estimated cdf G 2 given by ( 8).The following theorem studies its asymptotic distributions and generalizes the corresponding theorem of Qin and Zhang (1997).
Theorem 3.3 Suppose that the conditions of Theorem 3.1 are satisfied and let λ n be such that weakly, where W is a Gaussian process possessing continuous sample paths and has mean 0 and covariance function specified by equation ( 15).
A similar large sample result holds for Ĝλ 1 .Its proof is along the lines of the previous theorem and is omitted.In the next section we provide some empirical evidence for the finite sample performance of the estimate θλ .

Simulation study
The above results show that the penalized density ratio model depends on both the choice of the regularization parameter and the selection of the penalty function J(β).In the sequel we assume that J(β) = d i=1 β 2 i relying on the fact that the linear combination β ′ h(x) is a smooth function β.This choice leads to the so-called ridge regression type estimators whose mean square error is less than that of the corresponding unrestricted maximum likelihood estimators, for a range of values of λ n .For the ridge penalty, (see, for instance, Le Cessie and Van Houwelingen (1992)) consider the following approximate expansion of the penalized likelihood function for After rearranging terms and taking into account equations ( 9), ( 10) and Lemma 3.1, an approximate expression for θλ is given by θλ ≈ ∇ 2 l(θ 0 ) + 2λ n D r −1 ∇ 2 l(θ 0 )θ 0 − ∇l(θ 0 ) , where where I d denoted the d-dimensional identity matrix.On the other hand, a similar argument shows that θ 0 = θ + {∇ 2 l(θ 0 )} −1 ∇l(θ 0 ).These two displays when combined show that Therefore θλ shrinks towards zero as the value of the regularization parameter increases while for λ n → 0, θλ is approaching the unrestricted maximum likelihood estimator.In particular, the above representation forms the basis for developing an asymptotic expression of the mean square error of θλ .The methodology is quite analogous to ordinary ridge regression and therefore is omitted (for more see Hoerl and Kennard (1970b,a).Consequently, it can be shown that the mean square error of βλ attains its minimum value at some value of the regularization parameter, see the lower left hand side plot of Figure 1 for an example.In other words, the maximum penalized empirical likelihood estimator attains smaller mean square error than that of the estimator proposed by Qin and Zhang (1997), especially for small samples.These results are empirically verified by considering the following simple example.Consider the case of two lognormal random samples with corresponding density functions The density ratio model ( 1) is satisfied with α = (µ 2 2 − µ 2 1 )/2σ 2 , β = (µ 1 − µ 2 )/σ 2 and h(x) = log x and consider penalized empirical likelihood inference based on (4) by choosing appropriately both the penalty parameter and the penalty function.In this limited simulation study, the parameter λ n was chosen as a constant satisfying therefore the conditions of Theorems 3.1 and 3.2.Furthermore, a simple quadratic penalty function, that is J(β) = β 2 , was used for maximizing the penalized likelihood (4).Table 1 summarizes the results of 1000 runs for different sample sizes and for values of λ n .The first six lines of Table 1 report results when both groups of data are generated by the same log-normal distribution with µ 1 = µ 2 = 0 and σ 1 = σ 2 = 1.The mean square error of the penalized estimator of β, say βλ is less than the mean square error of the unrestricted estimator even for small sample sizes.In addition, the nominal significance level of 5% for testing β = 0 by using the test statistic ( 12) is achieved satisfactorily in all cases except that of λ n = 0.The variance estimate is obtained by means of (11).The last six lines of Table 1 report results under the alternative hypothesis.In this case the two log-normal samples were generated with µ 1 = 1, µ 2 = 0 and σ 1 = σ 2 = 1.The results are along the lines of the previous findings.It is interesting to observe that for n 1 = n 2 = 10, the mean square error of the unrestricted maximum likelihood estimator is large when compared with the mean square error of the penalized maximum likelihood estimators.
The top plot of Figure 1 shows a QQ-plot of the test static (12) under the hypothesis H 0 : β = 0. Notice that the approximation is quite satisfactory even though the sample sizes are relatively small.The bottom left hand side plot of Figure 1 shows the mean square error of βλ as a function of λ.As the plot illustrates, there exists a value of λ such that the mean square error is minimized-recall the above discussion.The right hand side of the plot shows the asymptotic efficiency of β with respect to βλ defined by For this specific example, the graph illustrates empirically that e( β; βλ ) ≤ 1, implying that the restricted estimator is more efficient than the unrestricted estimator.

Conclusions
The density ratio model is a semiparametric alternative to the problem of comparing two, or more distributions functions.However, there are several drawbacks for its application especially when small samples are under consideration.
The first part of this work dealt explicitly with the penalization approach to the empirical likelihood methodology.It was shown that the method is also motivated by Bayesian arguments and it is worth considering its performance to other settings.As a result, a new two sample methodology emerges where estimation of the parametric part can be carried out by a number of statistical programs while estimation of the non-parametric part is achieved by simple calculations.
In particular the problem of existence of maximum empirical likelihood estimators is resolved successfully.In addition, the final estimators have smaller mean square error than the estimators without penalty.The test statistic-see (12)-for testing the equality of two distributions is obtained and the results were applied to simulated case control data.Application of the density ratio model depends on a number of assumptions, in particular the assumed functional form of model (1), the regularization parameter and the form of the penalty function (5).It is suggested to vary the regularization parameter within a predetermined range of values and use different functions h(x) to examine the sensitivity of the results.Penalization by the quadratic penalty seems to be a sensible idea for several applications.As a final remark, we note that the problem can be generalized further by considering multiple biased sampling models-or semiparametric ANOVA-as in Fokianos et al. (2001).This will allow for semiparametric comparisons among several treatments and will also yield estimates of the unknown cdf.Furthermore, it is worth studying estimation of functionals associated with the baseline distribution, see Zhang (2000), in the presence of a regularization parameter.Different penalty functions can be taken into account and this methodology is quite promising for semiparametric comparison of multivariate distribution where the true functional form of (1) is complicated.

Table 1
Estimated coefficient, its mean square error (MSE) and power of the test statistic (12) for testing β = 0 (nominal significance level 0.05) for different choices of sample sizes and regularization parameter values.The data have been generated by a log-normal distribution and a quadratic penalty has been used.Results are based on 1000 simulations