Computationally efficient likelihood inference in exponential families when the maximum likelihood estimator does not exist

In a regular full exponential family, the maximum likelihood estimator (MLE) need not exist, in the traditional sense, but the MLE may exist in the Barndorff-Nielsen completion of the family. Existing algorithms for finding the MLE in the Barndorff-Nielsen completion solve many linear programs; they are slow in small problems and too slow for large problems. We provide new, fast, and scalable methodology for finding the MLE in the Barndorff-Nielsen completion based on approximate null eigenvectors of the Fisher information matrix. Convergence of Fisher information follows from cumulant generating function convergence, conditions for which are given.


Introduction
We develop an inferential framework for regular full discrete exponential families when the observed value of the canonical statistic lies on the boundary of its convex support. Then the maximum likelihood estimator (MLE) for the canonical parameter cannot exist [3,Theorem 9.13]. But the MLE may exist in a completion of the exponential family. Completions for exponential families have been described (in order of increasing generality) by Barndorff-Nielsen [3, pp. 154-156], Brown [4, pp. 191-201], Csiszár and Matúš [5], and Geyer [12, unpublished PhD thesis, Chapter 4]. The latter two are equivalent for full exponential families, but the last, most general, has much stronger algebraic properties that help with theory, so we use it. It also is the only completion that is a completion under no regularity conditions whatsoever (other than exponential family). It works for curved exponential families and other non-full exponential families ( [5] works for non-full but closed convex families). Following Geyer [8] we will call all of these completions the Barndorff-Nielsen completion without fuss about the technical details differentiating them.
Geyer [8] developed ways to do hypothesis tests and confidence intervals when the MLE in an exponential family does not exist in the conventional sense. The hypothesis test scheme was credited to Fienberg (personal communication -an answer he gave to a question at the end of a talk). The confidence interval scheme generates one-sided non-asymptotic confidence intervals, because the MLE fails to exist in the conventional sense when the canonical statistic is on the boundary of its convex support, and this is an inherently one-sided situation, and conventional asymptotics do not work near the boundary.
To simplify both explanation and computation, Geyer [8] assumed the regularity conditions that Brown [4] assumed for his completion. These conditions of Brown hold for nearly all applications known to us (applications for which a more general completion are required include aster models [10] and Markov spatial point processes [7]). We will also need to use Brown's conditions to guarantee our methods work.
The issue of when the MLE exists in the conventional sense and what to do when it doesn't is very important because of the wide use of generalized linear models for discrete data and loglinear models for categorical data. In every application of these, existing statistical software gives completely invalid results when the MLE does not exist in the conventional sense, but most such software either does not check for this problem or does very weak checks that have high probability of both false positives and false negatives. Moreover, even if these checks correctly detect nonexistence of the MLE in the conventional sense, conventional software implements no valid procedures for statistical inference when this happens. When this issue is detected most users will go to smaller statistical models for which the MLE seems to exist, even though such models may neither fit the data nor address the questions of scientific interest. Geyer [8] gives examples with valid inference. Authoritative textbooks, such as Agresti [1,Section 6.5], discuss the issue but provide no solutions.
Thus a solution to this issue that is efficiently computable would be very important. The algorithms of Geyer [12,8] and Albert and Anderson [2] are based on doing many linear programs. The algorithm of Geyer [8] is the most efficient; it is mostly due to Fukuda, who provided the underlying C code for the computational geometry functions of R package rcdd [11], which this algorithm uses. This algorithm does at most n linear programs, where n is the number of cases of a generalized linear model (GLM) or the number of cells in a contingency table, in order to determine the existence of the MLE in the conventional sense. Each of these linear programs has p variables, where p is the number of parameters of the model, and up to n inequality constraints. Since linear programming can take time exponential in n when pivoting algorithms are used, and since such algorithms are necessary in computational geometry to get correct answers despite inaccuracy of computer arithmetic (see the warnings about the need to use rational arithmetic in the documentation for R package rcdd [11]), these algorithms can be very slow. Typically, they take several minutes of computer time for toy problems and can take longer than users are willing to wait for real applications. These algorithms do have the virtue that if they use infinite-precision rational arithmetic, then their calculations are exact, as good as a mathematical proof.
Previous theoretical discussions of these issues that do not provide algorithms [3,4,5] use the notions of faces of convex sets or tangent cones or normal cones and all of these are much harder to compute than the algorithm of Geyer [8]. So they provide no direction toward efficient computing.
Because computational geometry is so slow and does not scale to large problems, we abandon it and return to calculations using the inexact computer arithmetic provided by computer hardware. Conventional maximum likelihood computations come close, in a sense, to finding the MLE in the Barndorff-Nielsen completion. They go uphill on the likelihood function until they meet their convergence criteria and stop. At this point, the canonical parameter estimates are still infinitely far away from their analogs for the MLE in the completion, but the corresponding probability distributions are close in total variation norm. Here we show that they are also close in the sense of moment generating function convergence (Theorem 7 below) and consequently moments of all orders are also close. The MLE in the completion is not only a limit of distributions in the original family but also a distribution in the original family conditioned on the affine hull of a face of the effective domain of the log likelihood supremum function [12,Theorem 4.3, special cases of which were known to other authors]. To do valid statistical inference when the MLE does not exist in the conventional sense, we need to know this affine hull.
This affine hull is the support of the canonical statistic under the MLE distribution (in the completion). Hence it is a translate of the null space of the Fisher information matrix, which (for an exponential family) is the variance-covariance matrix of the canonical statistic. This affine hull must contain the mean vector of the canonical statistic under the MLE distribution. Hence knowing the mean vector and variance-covariance matrix of the canonical statistic under the MLE distribution allows us to do valid statistical inference, and our conventional maximum likelihood calculation (go uphill until things don't change much in an iteration) will give us good approximations of them (relative to the inexactness of computer arithmetic).
We will get nearly the correct affine hull if we can guess the correct null space of the Fisher information matrix from its eigenvalues and eigenvectors computed using inexact computer arithmetic. We will not be able to do this when the statistical model has an ill-conditioned model matrix (the model matrix for categorical data analysis being the model matrix when it is recast as a Poisson regression). Ill-conditioning will add spurious nearly zero eigenvalues that arise from the ill-conditioning rather than the concentration of the MLE distribution on the correct affine hull. We will suppose that the model matrix is not ill-conditioned. If a sequence of parameter estimates maximizes the likelihood, then the corresponding sequence of probability density functions (PDFs) has subsequences converging to PDFs of MLE distributions in the Barndorff-Nielsen completion [12,Theorem 4.1]. If the MLE distribution is unique, as it always is for a full exponential family [8,Section 3.8], then all of these MLE PDFs will correspond to the same probability distribution. For a curved exponential family, the MLE need not be unique, even when it exists in the conventional sense.

Motivating example
Consider the case of complete separation in the logistic regression model as an example of a discrete exponential family with data on the boundary of the convex support of the canonical statistic. Suppose that we have one predictor vector x having values 10, 20, 30, 40, 60, 70, 80, 90, and suppose the components of the response vector y are 0, 0, 0, 0, 1, 1, 1, 1. Then the simple logistic regression model that has linear predictor η = β 0 + β 1 x exhibits failure of the MLE to exist in the conventional sense. This example is the same as that of Agresti [1, Section 6.5.1].
For an exponential family, the submodel canonical statistic is M T y, where M is the model matrix [8,Section 3.9]. Figure 1 shows the observed value of the canonical statistic vector and the support (all possible values) of this vector. As is obvious from the figure, the observed value of the canonical statistic is on the boundary of the convex support, in which case the MLE does not exist in the conventional sense [8,Theorem 4]. In general, this figure is too computationally intensive and too high-dimensional to draw. So our methods do not use such figures. It is here to develop intuition. In our methodology, this degeneracy follows from the Fisher information matrix at the apparent MLE being nearly the zero matrix.
In this example, like in Example 1 of Geyer [8], the MLE in the Barndorff-Nielsen completion corresponds to a completely degenerate distribution. This MLE distribution says no other data than what was observed could be observed. But the sample is not the population and estimates are not parameters. So this degeneracy is not a problem. To illustrate the uncertainty of estimation we follow Figure 2 of Geyer [8], which shows confidence intervals (necessarily one-sided) for the saturated model mean value parameters. Our Figure 2 shows that, as would be expected from so little data, the confidence intervals are very wide. The MLE in the completion says the probability of observing a response equal to one jumps from zero to one somewhere between 40 and 60. The confidence intervals show that we are fairly sure that this probability goes from near zero at x = 10 to near one at x = 90 but we are very unsure where jumps are if there are any. These intervals were constructed using the theory of Geyer [8,Section 3.16]. The actual computations follow some later course notes [9].
Our theory allows for inference in not only the complete separation example but also in any   [8]. We redo Example 2.3 of [8] in our Section 7.2, and we find that our methodology produces the same inferences as theirs in a fraction of the time.

Laplace transforms and standard exponential families
Let λ be a positive Borel measure on a finite-dimensional vector space E. The log Laplace transform of λ is the function c : E * → R defined by where E * is the dual space of E, where · , · is the canonical bilinear form placing E and E * in duality, and where R is the extended real number system, which adds the values −∞ and +∞ to the real numbers with the obvious extensions to the arithmetic and topology [15, Section 1.E].
If one prefers, one can take E = E * = R p for some p, and define but the coordinate-free view of vector spaces offers more generality and more elegance. Also, as we are about to see, if E is the sample space of a standard exponential family, then a subset of E * is the canonical parameter space, and the distinction between E and E * helps remind us that we should not consider these two spaces to be the same space. A log Laplace transform is a lower semicontinuous convex function that nowhere takes the value −∞ (the value +∞ is allowed and occurs where the integral in (1) does not exist) [ For every θ ∈ dom c, the function f θ : E → R defined by is a probability density with respect to λ. The set F = { f θ : θ ∈ Θ }, where Θ is any nonempty subset of dom c, is called a standard exponential family of densities with respect to λ. This family is full if Θ = dom c. We also say F is the standard exponential family generated by λ having canonical parameter space Θ, and λ is the generating measure of F. The log likelihood of this family is (2) have log likelihood A general exponential family [12, Chapter 1] is a family of probability distributions having a sufficient statistic X taking values in a finite-dimensional vector space E that induces a family of distributions on E that have a standard exponential family of densities with respect to some generating measure. Reduction by sufficiency loses no statistical information, so the theory of standard exponential families tells us everything about general exponential families [12,Section 1.2].
In the context of general exponential families X is called the canonical statistic and θ the canonical parameter (the terms natural statistic and natural parameter are also used). The set Θ is the canonical parameter space of the family, the set dom c is the canonical parameter space of the full family having the same generating measure. A full exponential family is said to be regular if its canonical parameter space dom c is an open subset of E * .
The cumulant generating function (CGF) of the distribution of the canonical statistic for parameter value θ is the function k θ defined by provided this distribution has a CGF, which it does if and only if k θ is finite on a neighborhood of zero, that is, if and only if θ ∈ int(dom c). Thus every distribution in a full family has a CGF if and only if the family is regular. Derivatives of k θ evaluated at zero are the cumulants of the distribution for θ. These are the same as derivatives of c evaluated at θ.

Generalized affine functions 4.1 Characterization on affine spaces
Exponential families defined on affine spaces instead of vector spaces are in many ways more elegant [12, Sections 1.4 and 1.5 and Chapter 4]. To start, a family of densities with respect to a positive Borel measure on an affine space is a standard exponential family if the log densities are affine functions. Following Geyer [12,Chapter 4], we complete the exponential family by taking pointwise limits of densities, allowing +∞ and −∞ as limits.
We call these limits generalized affine functions. Real-valued affine functions on an affine space are functions that are are both convex and concave. Generalized affine functions on an affine space are extended-real-valued functions that are are both concave and convex [12,Chapter 4]. (For a definition of extended-real-valued convex functions see Rockafellar [14,Chapter 4]. ) We thus have two characterizations of generalized affine functions: functions that are both convex and concave and functions that are limits of sequences of affine functions. Further characterizations will be given below.
Let h n denote a sequence of affine functions that are log densities in a standard exponential family with respect to λ, that is, e hn dλ = 1 for all n. Since e hn → e h pointwise if and only if h n → h pointwise, the idea of completing an exponential family naturally leads to the the study of generalized affine functions.
If h : E → R is a generalized affine function, we use the notation An extended-real-valued function h on a finite-dimensional affine space E is generalized affine if and only if one of the following cases holds All theorems for which a proof does not follow the theorem statement are proved in either the appendix or the supplementary material.
The intention is that this theorem is applied recursively. If we are in case (d), then the restriction of h to H is another generalized affine function to which the theorem applies. Since a nested sequence of hyperplanes can have length at most the dimension of E, the recursion always terminates.

Topology
Let G(E) denote the space of generalized affine functions on a finite-dimensional affine space E with the topology of pointwise convergence.
Sequentially compact means every sequence has a (pointwise) convergent subsequence. That this follows from the two preceding theorems is well known [16, p. 22, gives a proof].
The space G(E) is not metrizable, unless E is zero-dimensional [12, penultimate paragraph of Section 3.3]. So we cannot use δ-ε arguments, but we can use arguments involving sequences, using sequential compactness.
Let λ be a positive Borel measure on E, and let H be a nonempty subset of G(E) such that Then, following Geyer [12, Chapter 4], we call H a standard generalized exponential family of log densities with respect to λ. Let H denote the closure of H in G(E). Proof. Suppose x is the observed value of the canonical statistic. Then there exists a sequence h n in H such that This sequence has a convergent subsequence h n k → h in G(E). This limit h is in H and maximizes the likelihood.
We claim this is the right way to think about completion of exponential families. For full exponential families or even closed convex exponential families the closure only contains proper log probability densities (h that satisfy the equation in (5)). This is shown by Geyer [12,Chapter 2] and also by Csiszár and Matúš [5].
For curved exponential families and for general non-full exponential families, applying Fatou's lemma to pointwise convergence in G(E) gives only When the integral in (6) is strictly less than one we say h is an improper log probability density. The examples in Geyer [12,Chapter 4] show that improper probability densities cannot be avoided in curved exponential families. Geyer [12,Theorem 4.3] shows that this closure of an exponential family can be thought of as a union of exponential families, so this generalizes the conception of Brown [4] of the closure as an aggregate exponential family. Thus our method generalizes all previous methods of completing exponential families.
Admittedly, this characterization of the completion of an exponential family is very different from any other in its ignoring of parameters. Only log densities appear. Unless one wants to call them parameters -and that conflicts with the usual definition of parameters as real-valuedparameters just do not appear.
So in the next section, we bring parameters back.

Characterization on vector spaces
In this section we take sample space E to be vector space (which, of course, is also an affine space, so the results of the preceding section continue to hold). Recall from Section 3 above, that E * denotes the dual space of E, which contains the canonical parameter space of the exponential family.
Theorem 5. An extended-real-valued function h on a finite-dimensional vector space E is generalized affine if and only if there exist finite sequences (perhaps of length zero) of vectors η 1 , . . . , η j in in E * and scalars δ 1 , . . . , δ j such that η 1 , . . . , η j are linearly independent and h has the following form. Define H 0 = E and, inductively, for integers i such that 0 < i ≤ j i for any i, and h is either affine or constant on H j , where +∞ and −∞ are allowed for constant values.
The "if any" refers to the case where the sequences have length zero, in which case the theorem asserts that h that h is affine on E or constant on E.
As we saw in the preceding section, we are interested in likelihood maximizing sequences. Here we represent the likelihood maximizing sequence in the coordinates of the linearly independent η vectors that characterize the generalized affine function h according to its Theorem 5 representation. Let θ n be a likelihood maximizing sequence of canonical parameter vectors, that is, where the log likelihood l is given by (3) and where Θ is the canonical parameter space of the family. To make connection with the preceding section, define Then h θn is a sequence of affine functions, which has a subsequence that converges (in G(E)) to some generalized affine function h ∈ H, which maximizes the likelihood: The following lemma gives us a better understanding of the convergence h θn → h.
Suppose that a generalized affine function h on a finite dimensional vector space E is finite at at least one point. Represent h as in Theorem 5, and extend η 1 , . . . , η j to be a basis η 1 , . . . , η p for E * . Suppose h n is a sequence of affine functions converging to h in G(E). Then there are sequences of scalars a n and b i,n such that and, as n → ∞, we have (d) a n converges.
In (9) the first sum is empty when j = 0 and the second sum is empty when j = p. Such empty sums are zero by convention.
The results given in Lemma 1 are applicable to generalized affine functions in full generality. The case of interest to us, however, is when h n = h θn is the likelihood maximizing sequence constructed above.

Corollary 2.
For data x from a regular full exponential family defined on a vector space E, suppose θ n is a likelihood maximizing sequence satisfying (7) with log densities h n = h θn defined by (8) converging pointwise to a generalized affine function h. Characterize h and h n as in Theorem 5 and Lemma 1. Define ψ n = p i=j+1 b i,n x, η i . Then conclusions (a) and (b) of Lemma 1 hold in this setting and ψ n → θ * , as n → ∞, where θ * is the MLE of the exponential family conditioned to H j .
In case j = p the conclusion ψ n → θ * is the trivial zero converges to zero. The original exponential family conditioned on the event H j is what Geyer [8] calls the limiting conditional model (LCM).
Proof. The conditions of Lemma 1 are satisfied by our assumptions so all conclusions of Lemma 1 are satisfied. As a consequence, ψ n → θ * as n → ∞. The fact that θ * is the MLE of the LCM restricted to H j follows from our assumption that θ n is a likelihood maximizing sequence.
Taken together, Theorem 5, Lemma 1, and Corollary 2 provide a theory of maximum likelihood estimation in the completions of exponential families that is the theory of the preceding section with canonical parameters brought back.

Cumulant generating function convergence
We now show CGF convergence along likelihood maximizing sequences (7). This implies convergence in distribution and convergence of moments of all orders. Theorems 6 and 7 in this section say when CGF convergence occurs. Their conditions are somewhat unnatural (especially those of Theorem 6). However, the example in Section 4 of the supplementary material shows not only that some conditions are necessary to obtain CGF convergence (it does not occur for all full discrete exponential families) but also that the conditions of Theorem 6 are sharp, being just what is needed to rule out that example.
The CGF of the distribution having log density that is the generalized affine function h is defined by κ(t) = log e y,t e h(y) λ(dy), and similarly κ n (t) = log e y,t e hn(y) λ(dy) where we assume h n are the log densities for a likelihood maximizing sequence such that h n → h pointwise. The next theorem characterizes when κ n → κ pointwise.
Let c A denote the log Laplace transform of the restriction of λ to the set A, that is, where, as usual, the value of the integral is taken to be +∞ when the integral does not exist (a convention that will hold for the rest of this section).
Theorem 6. Let E be a finite-dimensional vector space of dimension p. For data x ∈ E from a regular full exponential family with natural parameter space Θ ⊆ E * and generating measure λ, assume that all LCMs are regular exponential families. Suppose that θ n is a likelihood maximizing sequence satisfying (7) with log densities h n converging pointwise to a generalized affine function h. Characterize h as in Theorem 5. When j ≥ 2, and for i = 1, ..., j − 1, define and assume that Then κ n (t) converges to κ(t) pointwise for all t in a neighborhood of 0.
Discrete exponential families automatically satisfy (11) when In this setting, e corresponds to the probability mass function for the random variable conditional on the occurrence of Therefore, Theorem 6 is applicable for the non-existence of the maximum likelihood estimator that may arise in logistic and multinomial regression. We show in the next theorem that discrete families with convex polyhedral support K also satisfy (11) under additional regularity conditions that hold in practical applications. When K is convex polyhedron, we can write K = {y : y, α i ≤ a i , for i = 1, ..., m}, as in [15,Theorem 6.46]. When the MLE does not exist, the data x ∈ K is on the boundary of K. Denote the active set of indices corresponding to the boundary K containing x by I(x) = {i : x, α i = a i }. In preparation for Theorem 7 we define the normal cone N K (x), the tangent cone T K (x), and faces of convex sets and then state conditions required on K.
where cl denotes the set closure operation.
When K is a convex polyhedron, N K (x) and T K (x) are both convex polyhedron with formulas given in [15,Theorem 6.46]. These formulas are Definition 3. A face of a convex set K is a convex subset F of K such that every (closed) line segment in K with a relative interior point in F has both endpoints in F . An exposed face of K is a face where a certain linear function achieves its maximum over K [14, p. 162].
The conditions required on K for our theory to hold are from Brown [4, pp. 193-197]. These conditions are: (i) The support of the exponential family is a countable set X.
(ii) The exponential family is regular.
(iii) Every x ∈ X is contained in the relative interior of an exposed face F of the convex support K.
(iv) The convex support of the measure λ|F equals F , where λ is the generating measure for the exponential family.
Conditions (i) and (ii) are already assumed in Theorem 6. It is now shown that discrete exponential families satisfy (11) under the above conditions. Theorem 7. Assume the conditions of Theorem 6 with the omission of (11) when j ≥ 2. Let K denote the convex support of the exponential family. Assume that the exponential family satisfies the conditions of Brown. Then (11) holds.

Consequences of CGF convergence
Theorems 6 and 7 both verify CGF convergence along likelihood maximizing sequences (7) on neighborhoods of 0. The next theorems show that CGF convergence on neighborhoods of 0 is enough to imply convergence in distribution and of moments of all orders. Therefore moments of distributions with log densities that are affine functions converge along likelihood maximizing sequences (7) to those of a limiting distributions whose log density is a generalized affine function.
Suppose that X is a random vector in a finite-dimensional vector space E having a moment generating function (MGF) ϕ X , then regardless of whether the MGF exist or not. It follows that the MGF of X, t for all t determine the MGF of X and vice versa, when these MGF exist. More generally, This observation applied to characteristic functions rather than MGF is called the Cramér-Wold theorem. In that context it is more trivial because characteristic functions always exist.
If v 1 , . . . , v d is a basis for a vector space E, then there exists a unique dual basis w 1 , . . . , w d for E * that satisfies [13, Theorem 2 of Section 15].
Theorem 8. If X is a random vector in E having an MGF, then the random scalar X, t has an MGF for all t ∈ E * . Conversely, if X, t has an MGF for all t ∈ E * , then X has an MGF.
Theorem 9. Suppose X n , n = 1, 2, . . . is a sequence of random vectors, and suppose their moment generating functions converge pointwise on a neighborhood W of zero. Then and X has an MGF ϕ X , and ϕ Xn (t) → ϕ X (t), t ∈ E * .
Theorem 10. Under the assumptions of Theorem 9, suppose t 1 , t 2 , . . . , t k are vectors defined on E * , the dual space of E. Then k i=1 X n , t i is uniformly integrable so The combination of Theorems 6-10 provide a methodology for statistical inference along likelihood maximizing sequences when the MLE is in the Barndorff-Nielsen completion. In particular, we have convergence in distribution and convergence of moments of all orders along likelihood maximizing sequence. The limiting distribution in this context is a generalized exponential family with density e h where h is a generalized affine function.

Convergence of null spaces of Fisher information
Our method for finding the MLE in the Barndorff-Nielsen completion relies on finding the null space of the Fisher information matrix. We need to show that we have convergence for that. In order to prove this we need an appropriate notion of convergence of vector subspaces.
Definition 4. Painlevé-Kuratowski set convergence [15,Section 4.A] can be defined as follows (Rockafellar and Wets [15] give many equivalent characterizations). If C n is a sequence of sets in R p and C is another set in R p , then we say C n → C if (i) For every x ∈ C there exists a subsequence n k of the natural numbers and there exist x n k ∈ C n k such that x n k → x.
(ii) For every sequence x n → x in R p such that there exists a natural number N such that x n ∈ C n whenever n ≥ N , we have x ∈ C.
Theorem 11. Suppose that A n ∈ R p×p is a sequence of positive semidefinite matrices and A n → A componentwise. Fix ε > 0 less than half of the least nonzero eigenvalue of A unless A is the zero matrix in which case ε > 0 may be chosen arbitrarily. Let V n denote the subspace spanned by the eigenvectors of A n corresponding to eigenvalues that are less than ε. Let V denote the null space of A. Then V n → V (Painlevé-Kuratowski).
6 Calculating the MLE in the completion 6

.1 Assumptions
So far everything has been for general exponential families except for Theorems 6 and 7, the later of which assumes the conditions of Brown [4], and those conditions hold for GLM and log-linear models for categorical data analysis. Now, following Geyer [8] we restrict our attention to discrete GLM. This, in effect, includes log-linear models for contingency tables because we can always assume Poisson sampling, which makes them equivalent to GLM [1, Section 8.6.7; 8, Section 3.17].

First characterization
Suppose we know the affine support of the MLE distribution in the completion. This is the smallest affine set that contains the canonical statistic with probability one. Denote the affine support by A. An affine set is a translate of a vector subspace. Since the observed value of the canonical statistic is contained in A with probability one, and the canonical statistic for a GLM is M T Y , where M is the model matrix, Y is the response vector, and y its observed value [8, Section 3.9], we have A = M T y + V for some vector space V .
Then the LCM in which the MLE in the completion is found is the OM conditioned on the event From this we see that the vectors η 1 , . . . , η j span the null space of the Fisher information matrix for the LCM, which our Theorems 7 and 10 say is well approximated by the Fisher information matrix for the OM at parameter values that are close to maximizing the likelihood. The vector subspace spanned by the vectors η 1 , . . . , η j is called the constancy space of the LCM in [8].

Second characterization
Any vector δ in the canonical parameter space of an exponential family is called a direction of recession (DOR) if the likelihood function is nondecreasing in that direction or, equivalently, if where Y denotes the response vector considered as a random vector, y denotes its observed value, and M is the model matrix [12, Section 2.2; 8, Theorem 3 and the following discussion and Section 3.9].
If we can find a DOR, then the MLE in the completion is a distribution in the LCM, which is the family of distributions in the original model (OM) conditioned on the event Y, M δ = y, M δ , almost surely, (16) or a distribution in the completion of the LCM [12, Chapter 2; 8, Section 3]. Define ζ = M δ. In light of (15) the only way (16) can hold is if ζ i = 0 implies Y i = y i almost surely.
Thus the distributions in the LCM are the distributions in the OM conditioned on this event. Moreover the MLE in the LCM is easily found using standard software. We simply change the model by removing the components of the response that are fixed in the LCM. Using R function GLM this is done using the optional argument subset = zeta == 0, where zeta is the R object corresponding to the vector ζ.
If the MLE for the LCM exists in the conventional sense, then we have solved the problem of finding the MLE in the completion of the OM. If not we have to solve the problem of finding the MLE in the completion of the LCM we found, and that is done as we did before. And so forth. The iteration must terminate because each LCM has smaller dimension than the one before. Geyer [12,Chapter 2] gives details.

Third characterization
Geyer [8,Section 3] shows that the recursion in the preceding section can be avoided by use of a generic direction of recession (GDOR), which is a DOR in the relative interior of the set of all DOR.

Based on the first characterization
We do not need a DOR because we only use that to determine the affine support of the LCM, and we can estimate that by other methods. Suppose η 1 , . . . , η j and other notation are as in Section 6.2.1 above. The LCM is the OM conditioned on the event Y, M η i = y, M η i , almost surely for i ∈ 1, . . . , j.
We have no readily available way to fit a GLM subject to (17). We know, however, (Section 6.2.2 above) that the event (17) fixes some components of the response vector at their observed values and leaves the rest entirely unconstrained. Those components, that are entirely unconstrained are those for which the corresponding component of M η i is zero (or, taking account of the inexactness of computer arithmetic, nearly zero) for all i = 1, . . . , j.

Based on the second characterizations
We can find a DOR by minimizing the function f defined by where v 1 , . . . , v m are vectors that generate the tangent cone for the GLM, which are defined in Geyer [8, Sections 3.10 and 3.11] and for which R code to calculating them is given in the technical reports accompanying [8], and where η 1 , . . . , η j are as in Section 6.2.3 above.
We minimize f over all unit vectors a, to avoid unbounded domain (if we minimized over all vectors, the optimal value might be −∞, and no solution would exist) and also to avoid the zero vector being a solution.
Ifā 1 , . . . ,ā j are the components of the solution, the DOR is and the LCM is the OM conditioned on the event that every component of the response vector for which the corresponding component of ζ = M δ is nonzero is constrained to be equal to its observed value. We fit the model using the subset argument of R function glm as explained in Section 6.2.2 above.
We can use ideas from Section 6.2.1 to tell us whether we need to iterate. We already know the dimension j of the constancy space of the MLE in the completion. If the LCM determined by this GDOR has a constancy space of this dimension, then we have the correct LCM and do not need to iterate. R function glm will figure out the dimension of the constancy space (how many coefficients it needs to drop to get an identifiable model) on its own.

Based on the third characterization
As far as we know, there is no way to calculate a GDOR except by using the very time consuming computational geometry calculations explained by [8].

Complete separation example
We return to the motivating example of Section 2. Here we see that the Fisher information matrix has only null eigenvectors. Thus the LCM is completely degenerate at the one point set containing Table 1: The estimated null eigenvector of the Fisher information matrix (column 2) and the gdor computed by [8] (column 3). Only nonzero components are shown.
only the observed value of the canonical statistic of this exponential family. We adopt the techniques of Section 3.16.2 of [8] to make inferences about mean-value parameters (success probability considered as a function of the predictor x). This is outlined in Section 2. Onesided confidence intervals are seen in Figure 2. As stated in Section 2, the actual computations follow some later course notes [9].

Example in Section 2.3 of [8]
This example consists of a 2 × 2 × · · · × 2 contingency table with seven dimensions hence 2 7 = 128 cells. These data now have a permanent location [6]. There is one response variable y that gives the cell counts and seven categorical predictors v 1 , . . ., v 7 that specify the cells of the contingency table. We fit a generalized linear regression model where y is taken to be Poisson distributed. We consider a model with all three-way interactions included but no higher-order terms. Geyer [8] shows the MLE in this example does not exist in the traditional sense, and then computes a generic direction of recession using the repeated linear programming with R package rcdd (Section 6.3.3). In this example there is only single null eigenvector of the Fisher information matrix, which consequently must be a generic direction of recession. Therefore all of our methods of determining the support of the LCM in Sections 6.3.1, 6.3.2, and 6.3.3 must do the same thing. Table 1 displays the comparison between the characterizations in Sections 6.3.2 and 6.3.3. The vectorη is the estimated null eigenvector of the Fisher information matrix using our implementation. The vectorη gdor is the estimated gdor in [8]. Theη vector is identical toη gdor up to six decimal places (the results in Table 1 are rounded). Therefore, the inferences resulting from these two distinct approaches is identical up to rounding. The only material difference between our implementation and the linear programming in [8] is computational time. Our implementation estimates η in 0.017 seconds of computer time, while the functions in the rcdd package estimatesη gdor in 4.481 seconds of computer time. This is a big difference for a relatively small amount of data.
Inference for the MLE in the LCM are included in the supplementary materials.

Big data example
This example uses the other dataset at [6]. It shows our methods are much faster than the linear programming method of [8] for recovering directions of recession (Sections 6.2.2 and 6.2.3). The characterization in Section 6.2.1 is even faster since no direction of recession is computed. This dataset consists of five categorical variables with four levels each and a response variable y that is Poisson distributed. A model with all four-way interaction terms is fit to this data. It may seem that the four way interaction model is too large (1024 data points vs 781 parameters) but χ 2 tests select this model over simpler models, see Table 2. We estimate that the dimension of the null space of the estimated Fisher information matrix is 23. In the Section 6.3.2 characterization we minimize f over a ∈ R 23 in (18), a = 1 to find a DOR. The resulting vectorη gdor = 23 k=1 a kηk is a GDOR since it satisfies conditions (20a) and (20b) of [8]. Fitting the model, estimating the dimension of the null space of estimated Fisher information, finding a, and estimating the support of the LCM took less than 2 seconds of computer time. In the Section 6.3.3 characterization, the functions in the rcdd package perform the same tasks in 334701 seconds (roughly 3.8 days) of computer time. The two different methods estimated different GDORs but they estimate the same support for the LCM.
Inferences for the MLE in the LCM are included in the supplementary materials. One-sided 95% confidence intervals for mean-value parameters that correspond to components of the canonical statistic which are on the boundary of their support (MLE equal to 0) are also included in the supplementary materials. We provide a new method for calculating these intervals that has not been previously published, but whose concept is found in Geyer [8] in the penultimate paragraph of Section 3.16.2.
Let I denote the index set of the components of the response vector on which we condition the OM to get the LCM, and let Y I and y I denote these components considered as a random vector and as an observed value, respectively. Let θ = M β denote the saturated model canonical parameter (usually called "linear predictor" in GLM theory) with β being the submodel canonical parameter vector. Then endpoints for a 100(1 − α)% confidence interval for a scalar parameter g(β) are min γ∈Γ lim prβ +γ (Y I =y I )≥α g(β + γ) and max γ∈Γ lim prβ +γ (Y I =y I )≥α where Γ lim is a basis for the null space of Fisher information. At least one of (19) is at the end of the range of this parameter (otherwise we can use conventional two-sided intervals). For Poisson sampling, let µ = exp(θ) denote the mean value parameter (here exp operates componentwise like the R function of the same name does), then pr β (Y I = y I ) = exp − i∈I µ i . We take the where µ is taken to be the function of γ described in (19). The optimization in (20) can be done for any k ∈ I. Implementation details are included in Sections 10.7.1 and 10.7.2 in the supplementary materials.
One-sided 95% confidence intervals for mean-valued parameters whose MLE is equal to 0 are displayed in Table 3. The full table is included in the supplementary materials. Some of the intervals in Table 3 are relatively wide which represents non-trivial uncertainty about the observed MLE being 0.

Discussion
The chance of observing a canonical statistic on the boundary of its support increases when the dimension of the model increases. Researchers naturally want to include all possibly relevant covariates in an analysis, and this will often result in the MLE not existing in the conventional sense. Our methods provide a computationally inexpensive solution to this problem.
The theory of generalized affine functions and the geometry of exponential families allows GLM software to provide a MLE when the observed value of the canonical statistic is on the boundary of its support. In such settings, the MLE does not exist in the traditional sense and is said to belong to the Barndorff-Nielsen completion of the exponential family [3,4,8,5] when the supremum of the log likelihood is finite. [3,4,5] all provided a MLE when it exists in the Barndorff-Nielsen completion of the family and [8] provided estimates of variability under the conditions of [4]. We do the same here using the theory of generalized affine functions.
The limiting distribution evaluated along the iterates of such an optimization is a generalized exponential family taking the form of a generalized affine function with structure given by Theorem 5. Cumulant generating functions converge along this sequence of iterates (Theorems 6 and 7) as well as estimates of moments of all orders (Theorem 10) for distributions taking estimated parameter values along this sequence of iterates. We can then use the null eigenvectors of estimated Fisher information to find a DOR and the support of a LCM. Parameter estimation in the LCM is conducted in the traditional manner using GLM software. One-sided confidence intervals for mean-value and canonical parameters that are observed to be on the boundary can also be provided.
The costs of computing a DOR and the support of a LCM are minimal compared to the repeated linear programming in the rcdd package, especially when the dimension of the data is large. This is where the desirability of our approach stems from. It is much faster to let optimization software, such as glm in R, simply go uphill on the log likelihood of the exponential family until a convergence tolerance is reached. Our examples show what kind of time saving is possible using our methods on small and large datasets.

A Technical appendix
Proof of Theorem 6. First consider the case when j = 0, the sequences of η vectors and scalars δ are both of length zero. There are no sets C + and C − in this setting and h is affine on E. From Lemma 1 we have ψ n = θ n . From Corollary 2, θ n → θ * as n → ∞. We observe that c(θ n ) → c(θ * ) from continuity of the cumulant function. The existence of the MLE in this setting implies that there is a neighborhood about 0 denoted by W such that θ * + W ⊂ int(dom c). Pick t ∈ W and observe that c(θ n + t) → c(θ * + t). Therefore κ n (t) → κ(t) when j = 0. Now consider the case when j = 1. Define c 1 (θ) = log H 1 e y,θ λ(dy) for all θ ∈ int(dom c 1 ). In this scenario we have From [12, Theorem 2.2], we know that as s → ∞ since δ 1 ≥ y, η 1 for all y ∈ H 1 . The left hand side of (21) is a convex function of θ and the right hand side is a proper convex function. If int(dom c 1 ) is nonempty, which holds whenever int(dom c) is nonempty, then the convergence in (21) is uniform on compact subsets of int(dom c 1 ) [15,Theorem 7.17]. Also [15,Theorem 7.14], uniform convergence on compact sets is the same as continuous convergence. Using continuous convergence, we have that both where b 1,n → ∞ as n → ∞ by Lemma 1. Thus = log e y,t +h(y) λ(dy) = κ(t).
This concludes the proof when j = 1.
For the rest of the proof we will assume that 1 < j ≤ p where dim(E) = p. Represent the sequence θ n in coordinate form as θ n = b 1,n η 1 + b 2,n η 2 + · · · + b p,n η p , with scalars b i,n , i = 1, ..., p. For 0 < j < p, we know that ψ n → θ * as n → ∞ from Corollary 2. The existence of the MLE in this setting implies that there is a neighborhood about 0, denoted by W , such that θ * + W ⊂ int(dom c). Pick t ∈ W , fix ε > 0, and construct ε-boxes about θ * and θ * + t, denoted by N 0,ε (θ * ) and N t,ε (θ * ) respectively, such that both N 0,ε (θ * ), N t,ε (θ * ) ⊂ int (dom c). Let V t,ε be the set of vertices of N t,ε (θ * ). For all y ∈ E define From the conclusions of Lemma 1 and Corollary 2, we can pick an integer N such that y, ψ n + t ≤ M t,ε (y) and b (i+1),n /b i,n < 1 for all n > N and i = 1, ..., j − 1. For all y ∈ F , we have Therefore, which implies that by dominated convergence. To complete the proof, we need to verify that We know that (26) holds when λ(∪ j−1 i=1 D i ) = 0 in (11) because of (25). Now suppose that and for all n > N by the assumption given by (11). The assumption that the exponential family is discrete and full implies that e h (y)λ(dy) = 1 [12,Theorem 2.7]. This in turn implies that λ(C + i ) = 0 for all i = 1, ..., j which then implies that . Putting (24), (27), and (28) together we can conclude that (26) holds as n → ∞ by dominated convergence and = log e y,t +h(y) λ(dy) = κ(t).
(29) for all t ∈ W . This verifies CGF convergence on neighborhoods of 0 which completes the proof.
Proof of Theorem 7. Represent h as in Theorem 5. Denote the normal cone of the convex polyhedron support K at the data x by N K (x). We show that a sequence of scalars δ * i and a linearly independent set of vectors η * i ∈ E * can be chosen so that η * i ∈ N K (x), and for i = 1, ..., j where H 0 = E so that (11) holds. We will prove this by induction with the hypothesis H(m), m = 1, ..., j, that (30) holds for i ≤ m where the vectors η * i ∈ N K (x) i = 1, ..., m. We first verify the basis of the induction. The assumption that the exponential family is discrete and full implies that e h (y)λ(dy) = 1 [12,Theorem 2.7]. This in turn implies that λ(C + k ) = 0 for all k = 1, ..., j. This then implies that K ⊆ {y ∈ E : y, η 1 ≤ δ 1 } = H 1 ∪ C − 1 . Thus η 1 ∈ N K (x) and the base of the induction holds with η 1 = η * 1 and δ 1 = δ * 1 . We now show that H(m + 1) follows from H(m) for m = 1, ..., j − 1. We first establish that K ∩ H m is an exposed face of K. This is needed to show that (30) holds for i = 1, ..., m + 1. Let L K be the collection of closed line segments with endpoints in K. Arbitrarily choose l ∈ L K such that an interior point y ∈ l is such that y ∈ K ∩ H m . We can write y = γa + (1 − γ)b, 0 < γ < 1, where a and b are the endpoints of l. Since a, b ∈ K by construction, we have that a − x, η * m ≤ 0 and b − x, η * m ≤ 0 because η * m ∈ N K (x) by H(m). Now, 0 ≥ a − x, η * m = a − y + y − x, η * m = a − y, η * m = a − (γa + (1 − γ) Therefore a, b ∈ K ∩ H m and this verifies that K ∩ H m is a face of K since l was chosen arbitrarily. The function y → y−x, η * m −δ * m , defined on K, is maximized over K ∩H m . Therefore K ∩H m is an exposed face of K by definition. The exposed face K ∩H m = K ∩(H m+1 ∪C − m+1 ) since λ(C + m+1 ) = 0 and the convex support of the measure λ|H m is H m by assumption. Thus, η m+1 ∈ N K∩Hm (x).
The sets K and H m are both convex and are therefore regular at every point [15,Theorem 6.20]. We can write N K∩Hm (x) = N K (x) + N Hm (x) since K and H m are convex sets that cannot be separated where + denotes Minkowski addition in this case [15,Theorem 6.42]. The normal cone N Hm (x) has the form N Hm (x) = {η ∈ E * : y − x, η ≤ 0 for all y ∈ H m } = {η ∈ E * : y − x, η ≤ 0 for all y ∈ E such that y − x, η i = 0, i = 1, ..., m} = m i=1 a i η i : a i ∈ R, i = 1, ..., m .
Proof of Theorem 11. We first consider the case that A is positive definite and V = {0}. We can write A n = A + (A n − A) where (A n − A) is a perturbation of A for large n. From Weyl's inequality [17], we have that all eigenvalues of A n are bounded above zero for large n and V n = {0} as a result. Therefore, V n → V as n → ∞ when A is positive definite. Now consider the case that A is not strictly positive definite. Without loss of generality, let x ∈ V be a unit vector. For all 0 < γ ≤ ε, let V n (γ) denote the subspace spanned by the eigenvectors of A n corresponding to eigenvalues that are less than γ. By construction, V n (γ) ⊆ V n .
From [15,Example 10.28], if A has k zero eigenvalues, then for sufficiently large N 1 there are exactly k eigenvalues of A n are less than ε and p − k eigenvalues of A n greater than ε for all n > N 1 . The same is true with respect to γ for all n greater than N 2 . Thus j n (γ) = j n (ε) which implies that V n (γ) = V n for all n > max{N 1 , N 2 }.
We now verify part (i) of Painlevé-Kuratowski set convergence with respect to V n (γ). Let N 3 be such that x T A n x < γ 2 for all n ≥ N 3 . Let λ k,n and e k,n be the eigenvalues and eigenvectors of A n , with the eigenvalues listed in decreasing orders. Without loss of generality, we assume that the eigenvectors are orthonormal. Then, There have to be eigenvectors e k,n such that x T e k,n ≥ 1/ √ p with corresponding eigenvalues λ k,n that are very small since λ k,n (x T e k,n ) 2 < γ. But conversely, any eigenvalues λ k,n such that λ k,n ≥ γ must have λ k,n (x T e k,n ) 2 < γ 2 =⇒ (x T e k,n ) 2 < γ 2 /λ k,n ≤ γ.