Pseudo-maximization and self-normalized processes

Self-normalized processes are basic to many probabilistic and statistical studies. They arise naturally in the study of stochastic integrals, martingale inequalities and limit theorems, likelihood-based methods in hypothesis testing and parameter estimation, and Studentized pivots and bootstrap-$t$ methods for confidence intervals. In contrast to standard normalization, large values of the observations play a lesser role as they appear both in the numerator and its self-normalized denominator, thereby making the process scale invariant and contributing to its robustness. Herein we survey a number of results for self-normalized processes in the case of dependent variables and describe a key method called ``pseudo-maximization'' that has been used to derive these results. In the multivariate case, self-normalization consists of multiplying by the inverse of a positive definite matrix (instead of dividing by a positive random variable as in the scalar case) and is ubiquitous in statistical applications, examples of which are given.


Introduction
This paper presents an introduction to the theory and applications of self-normalized processes in dependent variables, which was relatively unexplored until recently due to difficulties caused by the highly non-linear nature of self-normalization. We overcome these difficulties by using the method of mixtures, which provides a tool for "pseudo-maximization".
We dedicate this paper to the memory of J. L. Doob, a great probabilist who generously pointed one of us in this general direction, though we each had our independent initial path. While the first author was visiting the University of Illinois at Urbana-Champaign in the early 1990s, he took frequent hiking trips with Doob. Upon reaching a mountain-top on one of these trips, he asked Doob what the most important open problems in probability were. Doob replied that there were results in harmonic analysis involving harmonic functions divided by subharmonic or superharmonic functions that did not yet have analogues in the probabilistic setting of martingales. Guided by Doob's answer, de la Peña (1999) developed a new technique for obtaining exponential bounds for martingales. Subsequently, de la Peña, Klass and Lai (2000, 2004) introduced another method, which we call pseudo-maximization, to derive exponential and $L_p$-bounds for self-normalized processes. A universal LIL was also obtained via separate, newly developed methods. In this survey we review these results as well as those by others that fall (broadly) under the rubric of self-normalization. Our choice of topics has been guided by the usefulness and definitiveness of the results, and the light they shed on various aspects of probabilistic/statistical theory.
Self-normalized processes arise naturally in statistical applications. In standard form (as when connected to CLTs) they are unit-free and often permit the weakening or even the elimination of moment assumptions. The prototypical example of a self-normalized random process is Student's $t$-statistic. It replaces the population standard deviation $\sigma$ in $\sqrt{n}(\bar X_n - \mu)/\sigma$ by the sample standard deviation $s_n$. Let $\{X_i\}$ be i.i.d. normal $N(\mu, \sigma^2)$. Then $T_n = \dfrac{\bar X_n - \mu}{s_n/\sqrt{n}}$ has the $t_{n-1}$-distribution.
We can re-express $T_n$ in terms of $A_n/B_n$, where $A_n = \sum_{i=1}^n (X_i - \mu)$ and $B_n^2 = \sum_{i=1}^n (X_i - \mu)^2$, as $T_n = \dfrac{A_n/B_n}{\sqrt{(n - (A_n/B_n)^2)/(n-1)}}$.
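The re-expression of $T_n$ in terms of $A_n/B_n$ can be checked numerically. The following is a minimal sketch, assuming the centered definitions $A_n = \sum_i (X_i - \mu)$ and $B_n^2 = \sum_i (X_i - \mu)^2$; the function names and the simulated sample are illustrative and are not taken from the paper.

```python
import math
import random


def t_statistic(xs, mu):
    """Classical Student t-statistic sqrt(n) * (mean - mu) / s_n."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance
    return (mean - mu) / math.sqrt(s2 / n)


def t_from_self_normalized(xs, mu):
    """Same statistic, computed only from the ratio A_n / B_n."""
    n = len(xs)
    a_n = sum(x - mu for x in xs)                     # A_n (assumed centered at mu)
    b_n = math.sqrt(sum((x - mu) ** 2 for x in xs))   # B_n
    r = a_n / b_n
    return r / math.sqrt((n - r * r) / (n - 1))


rng = random.Random(0)
mu, sigma, n = 1.5, 2.0, 25            # hypothetical parameters
xs = [rng.gauss(mu, sigma) for _ in range(n)]

t_direct = t_statistic(xs, mu)
t_ratio = t_from_self_normalized(xs, mu)
print(t_direct, t_ratio)               # the two values agree up to rounding
```

Because $T_n$ is a monotone function of $A_n/B_n$ on $(-\sqrt{n}, \sqrt{n})$, tail bounds for the self-normalized ratio translate directly into tail bounds for the $t$-statistic.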
Thus, information on the properties of $T_n$ can be derived from the self-normalized process $A_n/B_n$. (1.1) Hence, in a more general context, a self-normalized process assumes the form $A_n/B_n$, or $A_t/B_t$ in continuous time, where $B_t$ is a random variable that is used to estimate a dispersion measure of the process $A_t$. Although the methodology of self-normalization dates back to 1908 when William Gosset (aka Student) introduced Student's $t$-statistic, active development of the probability theory of self-normalized processes began in the 1990s after the seminal work of Kuelbs (1989, 1991) on laws of the iterated logarithm (LIL) for self-normalized sums of i.i.d. variables belonging to the domain of attraction of a normal or stable law. In particular, a Berry-Esseen bound for Student's $t$-statistic was derived, and Giné, Götze and Mason (1997) proved that the $t$-statistic has a limiting standard normal distribution if and only if $X_1$ is in the domain of attraction of a normal law, by making use of exponential and $L_p$ bounds for $A_n/B_n$, where $A_n = \sum_{i=1}^n X_i$ and $B_n^2 = \sum_{i=1}^n X_i^2$. The limiting distribution of the $t$-statistic when $X_1$ belongs to the domain of attraction of a stable law had been studied earlier by Efron (1969) and Logan, Mallows, Rice and Shepp (1973). Whereas Giné, Götze and Mason's result settles one of the conjectures of Logan, Mallows, Rice and Shepp that the self-normalized sum "is asymptotically normal if (and perhaps only if) $X_1$ is in the domain of attraction of the normal law," Chistyakov and Götze (2004) have settled the other conjecture that the "only possible nontrivial limiting distributions" are those when $X_1$ follows a stable law. Shao (1997) proved large deviation results for $\sum_{i=1}^n X_i / \sqrt{\sum_{i=1}^n X_i^2}$ without moment conditions and moderate deviation results when $X_1$ is in the domain of attraction of a normal or stable law. Subsequently Shao (1997) obtained Cramér-type large deviation results when $E|X_1|^3 < \infty$.
Jing, Shao and Zhou (2003) derived saddlepoint approximations for Student's $t$-statistic with no moment assumptions. Bercu, Gassiat and Rio (2002) obtained large and moderate deviation results for self-normalized empirical processes. Self-normalized sums of independent but non-identically distributed $X_i$ have been considered by Bentkus, Bloznelis and Götze (1996), Wang and Jing (1999) and Jing, Shao and Wang (2002). Chen (1999), Worms (2000) and Bercu (2001) have provided extensions to self-normalized sums of functions of ergodic Markov chains and autoregressive models. Giné and Mason (1998) relate the LIL of self-normalized sums of i.i.d. random variables $X_i$ to a stochastic boundedness condition. Egorov (1998) gives exponential inequalities for a centered variant of (1.1). Along a related line of work, Caballero, Fernández and Nualart (1998) provide moment inequalities for a continuous martingale divided by its quadratic variation, and use these results to show that if $\{M_t, t \ge 0\}$ is a continuous martingale, null at zero, then for every $1 \le p < q$ there exists a universal constant $C = C(p, q)$ for which the moment inequality (1.2) holds. Related work in Revuz and Yor (1999, page 167) for continuous local martingales yields, for all $p > q > 0$, the existence of a constant $C_{pq}$ in a moment bound of the same type.
In Section 2 we describe the approaches of de la Peña (1999) to exponential inequalities for strong-law norming and of de la Peña, Klass and Lai (2000, 2004) to exponential inequalities and $L_p$-bounds of self-normalized processes.
Section 3 considers laws of the iterated logarithm for self-normalized martingales. Section 4 concludes with a discussion and review of self-normalization and pseudo-maximization in statistical applications.

Motivation
"BEHIND EVERY LIMIT THEOREM THERE IS AN INEQUALITY" This folklore has been attributed to Kolmogorov.
Example 2.1. Let $S_n/a_n$ be a sequence of random variables. Then, to show that $S_n/a_n \to \mu$ in probability, Markov's inequality is often used: for $\epsilon > 0$ and $p > 0$, $P(|S_n/a_n - \mu| > \epsilon) \le E|S_n/a_n - \mu|^p / \epsilon^p$. The weak law of large numbers for sums of i.i.d. random variables with finite variance uses the case $p = 2$ and the fact that the variance of the sum is the sum of the variances. What happens when the variance is infinite, and when $a_n$ depends on $(X_1, X_2, \ldots, X_n)$?
Example 2.2 (Almost Sure Growth Rate of a Sum). For non-decreasing $a_n$, if we can show for some $K > 0$ and for all $\epsilon > 0$ that $P\{S_n/a_n > (1 + 3\epsilon)K \text{ i.o.}\} = 0$, then we can conclude that $\limsup S_n/a_n \le K$ a.s. By the Borel-Cantelli lemma, it suffices to show that, for some $1 < n_1 < n_2 < \ldots$ with $a_j < (1 + \epsilon)a_{n_k}$ for $n_k \le j < n_{k+1}$ whenever $k$ is sufficiently large, the series of probabilities of the corresponding exceedance events over these blocks converges. Problems of this type frequently reduce to the use of Markov's inequality and finding appropriate bounds on $E\exp(\lambda S_n/a_n)$ to show that the above series converges. We are particularly interested in situations in which $a_n$ depends on the available data and hence is random. This further motivates the study of self-normalized sums.
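Example 2.3. Consider the autoregressive model $Y_n = \alpha Y_{n-1} + \epsilon_n$, where $\alpha$ is a fixed (unknown) parameter and the $\epsilon_n$ are independent standard normal random variables. The maximum likelihood estimate (MLE) $\hat\alpha_n$ of $\alpha$ is the maximizer of the log likelihood function, which up to an additive constant equals $-\frac{1}{2}\sum_{j=1}^n (Y_j - \alpha Y_{j-1})^2$. Differentiating with respect to $\alpha$ and equating to zero, we obtain $\hat\alpha_n = \sum_{j=1}^n Y_{j-1}Y_j \big/ \sum_{j=1}^n Y_{j-1}^2$. Therefore, $\hat\alpha_n - \alpha$ can be expressed as a self-normalized random variable: $\hat\alpha_n - \alpha = \sum_{j=1}^n Y_{j-1}\epsilon_j \big/ \sum_{j=1}^n Y_{j-1}^2$. (2.1) In (2.1), the numerator $A_n := \sum_{j=1}^n Y_{j-1}\epsilon_j$ is a martingale with respect to the filtration $\mathcal{F}_n := \sigma(Y_1, \ldots, Y_n; \epsilon_1, \ldots, \epsilon_n)$. The denominator of (2.1) is $B_n^2 := \sum_{j=1}^n Y_{j-1}^2 = \sum_{j=1}^n E[(Y_{j-1}\epsilon_j)^2 \mid \mathcal{F}_{j-1}]$, which is the conditional variance of $A_n$. Therefore, $\hat\alpha_n - \alpha = A_n/B_n^2$ is a process self-normalized by the conditional variance. Since the $\epsilon_j$ are $N(0, 1)$, for all $n \ge 1$ and $-\infty < \lambda < \infty$, $\exp\{\lambda A_n - \lambda^2 B_n^2/2\}$ is an exponential martingale and therefore satisfies the canonical assumption below, which we use to develop a comprehensive theory for self-normalized processes.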
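Example 2.3 can be illustrated by simulation. The sketch below assumes the first-order model $Y_j = \alpha Y_{j-1} + \epsilon_j$ with the initial value $Y_0 = 0$ (an assumption made here for concreteness); the function name `ar1_statistics` and all numerical settings are hypothetical. It checks the identity $\hat\alpha_n - \alpha = A_n/B_n^2$ on one path and estimates $E\exp\{\lambda A_n - \lambda^2 B_n^2/2\}$ by Monte Carlo, which should be close to 1 for an exponential martingale started at 1.

```python
import math
import random


def ar1_statistics(alpha, n, rng):
    """Simulate Y_j = alpha*Y_{j-1} + eps_j and return (A_n, B_n^2, sum Y_{j-1} Y_j)."""
    y_prev = 0.0                      # assumed initial value Y_0 = 0
    a_n = b2_n = cross = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)     # standard normal innovation
        y = alpha * y_prev + eps
        a_n += y_prev * eps           # martingale numerator A_n
        b2_n += y_prev * y_prev       # conditional variance B_n^2
        cross += y_prev * y           # numerator of the MLE
        y_prev = y
    return a_n, b2_n, cross


rng = random.Random(12345)
alpha = 0.3

# One long path: the MLE error equals the self-normalized ratio A_n / B_n^2.
a_n, b2_n, cross = ar1_statistics(alpha, 200, rng)
alpha_hat = cross / b2_n
identity_gap = abs((alpha_hat - alpha) - a_n / b2_n)

# Many short paths: Monte Carlo estimate of E exp(lam*A_n - lam^2*B_n^2/2).
lam, n_small, reps = 0.2, 5, 20000
total = 0.0
for _ in range(reps):
    a, b2, _cross = ar1_statistics(alpha, n_small, rng)
    total += math.exp(lam * a - 0.5 * lam * lam * b2)
mc_mean = total / reps

print(identity_gap, mc_mean)
```

A small $\lambda$ and a short horizon are used only to keep the Monte Carlo variance low; the martingale property itself holds for every real $\lambda$ and every $n$.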

Canonical assumption and exponential bounds for strong law
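We consider a pair of random variables $A$, $B$, with $B > 0$, such that $E\exp\{\lambda A - \lambda^2 B^2/2\} \le 1$. (2.2) There are three regimes of interest: when (2.2) holds (i) for all real $\lambda$, (ii) for all $\lambda \ge 0$, (iii) for all $0 \le \lambda < \lambda_0$, where $0 < \lambda_0 < \infty$. Results are presented in all three cases. Such canonical assumptions imply various moment and exponential bounds, including the following bound connected to the law of large numbers in de la Peña (1999) for case (i).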
Theorem 2.1. Under the canonical assumption for all real λ, for all x, y > 0.
Proof. The key here is to "keep" the indicator when using Markov's inequality.
In fact, for all measurable sets S, by the Cauchy-Schwarz inequality. The first term in the last inequality is bounded by 1, by the canonical assumption. The value minimizing the second term is λ = x, and therefore Letting S = {1/B 2 ≤ y} gives the desired result.
Applying this bound with $y = 1/z$ to both $A_n$ and $-A_n$ in Example 2.3 yields for all $x, z > 0$. The following variant of Theorem 2.1 generalizes a result of Liptser and Spokoiny (1999).
Theorem 2.2. Under the canonical assumption (2.2) for all real $\lambda$,

Example 2.4 (Martingales and Supermartingales). The Appendix provides several classes of random variables that satisfy the canonical assumption (2.2) in a series of lemmas (A.1-A.8). These lemmas are closely related to martingale theory. Moreover, Lemmas A.2-A.8 are about a supermartingale condition (2.7) that is stronger than (2.2) for the regime $0 \le \lambda < \lambda_0$.
Theorem 2.1 gives an inequality related to the law of large numbers. There is a class of results of this type in Khoshnevisan (1996), who points out that it essentially dates back to McKean (1962). A related reference is Freedman (1975).
Khoshnevisan (1996) has also shown that if one assumes that the local modulus of continuity of $M_t$ is in some sense deterministic, then the inequality (2.4) can be reversed (up to a multiplicative constant). As applications of this result, he presents some large deviation results for such martingales. A related paper of Dembo (1996) provides moderate deviations for martingales with bounded jumps. Concerning moderate deviations, a general context for extending results like Theorem 2.3 to the case of discrete-time martingales can be found in de la Peña (1999), who provides a decoupling method for obtaining sharp extensions of exponential inequalities for martingales to the quotient of a martingale divided by its quadratic variation. In what follows we present three related results from de la Peña (1999). The first result is for martingales in continuous time and the last two involve discrete-time processes. The third result is a sharp extension of Bernstein's and Bennett's inequalities. (See also de la Peña and Giné (1999) for details.)

Pseudo-maximization (Method of Mixtures)
Note that if the integrand $\exp\{\lambda A - \lambda^2 B^2/2\}$ in (2.2) could be maximized over $\lambda$ inside the expectation (as can be done if $A/B^2$ is non-random), taking $\lambda = A/B^2$ would yield $E\exp\{A^2/(2B^2)\} \le 1$. This in turn would give the optimal Chebyshev-type bound $P(A/B > x) \le \exp(-x^2/2)$. Since $A/B^2$ cannot (in general) be taken to be non-random, we need to find an alternative method for dealing with this maximization. One approach for attaining a similar effect involves integrating over a probability measure $F$, and using Fubini's theorem to interchange the order of integration with respect to $P$ and $F$. To be effective for all possible pairs $(A, B)$, the $F$ chosen would need to be as uniform as possible so as to include the maximum value of $\exp\{\lambda A - \lambda^2 B^2/2\}$ regardless of where it might occur. Thereby some mass is certain to be assigned to and near the random value $\lambda = A/B^2$ which maximizes $\lambda A - \lambda^2 B^2/2$. Since all uniform measures are multiples of Lebesgue measure (which is infinite), we construct a finite measure (or a sequence of finite measures) which tapers off to zero at $\lambda = \infty$ as slowly as we can manage. This approach will be used in what follows to provide exponential and moment inequalities for $A/B$, $A/\sqrt{B^2 + (EB)^2}$, and $A/\{B\sqrt{\log\log(B \vee e^2)}\}$. We begin with the second case where the proof is more transparent. The approach, pioneered by Robbins and Siegmund (1970) and commonly known as the method of mixtures, was used by de la Peña, Klass and Lai (2004) to prove the following.
Theorem 2.7. Let $A$, $B$ with $B > 0$ be random variables satisfying the canonical assumption (2.2) for all $\lambda \in \mathbb{R}$. Then $P\left(|A|/\sqrt{B^2 + (EB)^2} > x\right) \le \sqrt{2}\exp(-x^2/4)$ for all $x > 0$.
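Proof. Multiplying both sides of (2.2) by $(2\pi)^{-1/2} y \exp(-\lambda^2 y^2/2)$, with $y > 0$, and integrating over $\lambda$, we obtain by Fubini's theorem that $E\left[\dfrac{y}{\sqrt{B^2 + y^2}}\exp\left\{\dfrac{A^2}{2(B^2 + y^2)}\right\}\right] \le 1$. (2.6) By the Cauchy-Schwarz inequality and (2.6), $E\exp\left\{\dfrac{A^2}{4(B^2 + y^2)}\right\} \le \sqrt{E\sqrt{B^2/y^2 + 1}}$. Since $E\sqrt{B^2/y^2 + 1} \le E(B/y + 1)$, the special case $y = EB$ gives $E\exp\{A^2/[4(B^2 + (EB)^2)]\} \le \sqrt{2}$. Combining Markov's inequality with this yields $P\left(|A|/\sqrt{B^2 + (EB)^2} > x\right) \le \sqrt{2}\exp(-x^2/4)$.

In what follows we discuss the analysis of certain boundary crossing probabilities by using the method of mixtures, first introduced in Robbins and Siegmund (1970), under the following refinement of the canonical assumption: $\{\exp(\lambda A_t - \lambda^2 B_t^2/2), t \ge 0\}$ is a supermartingale with mean $\le 1$ for $0 \le \lambda < \lambda_0$, with $A_0 = 0$. We begin by introducing the Robbins-Siegmund (R-S) boundaries; see Lai (1976). Let $F$ be a finite positive measure on $(0, \lambda_0)$ and assume that $F(0, \lambda_0) > 0$. Let $\Psi(u, v) = \int \exp(\lambda u - \lambda^2 v/2)\,dF(\lambda)$. Given $c > 0$ and $v > 0$, the equation $\Psi(u, v) = c$ is used to define the boundary $\beta_F(v, c)$, with sup over the empty set equal to zero. The R-S boundaries $\beta_F(v, c)$ can be used to analyze the boundary crossing probability $P\{A_t \ge g(B_t) \text{ for some } t \ge 0\}$ when $g(B_t) = \beta_F(B_t^2, c)$ for some $F$ and $c > 0$. This probability is bounded by applying Doob's inequality to the supermartingale $\Psi(A_t, B_t^2)$, $t \ge 0$.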
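The pseudo-maximization step behind Theorem 2.7 can be made concrete with a small numerical check. The sketch below assumes a Gaussian mixing density in $\lambda$ with mean 0 and standard deviation $1/y$, and verifies by quadrature that integrating $\exp\{\lambda A - \lambda^2 B^2/2\}$ against it gives $\frac{y}{\sqrt{B^2+y^2}}\exp\{A^2/(2(B^2+y^2))\}$ for fixed numbers $A$, $B$, $y$. The function names, the trapezoidal rule and the test values are illustrative choices, not part of the source.

```python
import math


def mixture_integral(a, b, y, steps=4000, width=12.0):
    """Trapezoidal approximation of
    int exp(lam*a - lam^2*b^2/2) * N(0, 1/y^2)-density(lam) d lam."""
    s2 = b * b + y * y
    center = a / s2                   # location of the integrand's peak
    sd = 1.0 / math.sqrt(s2)          # its width
    lo, hi = center - width * sd, center + width * sd
    h = (hi - lo) / steps

    def integrand(lam):
        density = y / math.sqrt(2.0 * math.pi) * math.exp(-0.5 * (lam * y) ** 2)
        return math.exp(lam * a - 0.5 * lam * lam * b * b) * density

    total = 0.5 * (integrand(lo) + integrand(hi))
    total += sum(integrand(lo + k * h) for k in range(1, steps))
    return total * h


def mixture_closed_form(a, b, y):
    """Closed form of the same integral (completing the square in lam)."""
    s2 = b * b + y * y
    return y / math.sqrt(s2) * math.exp(a * a / (2.0 * s2))


def pointwise_maximum(a, b):
    """Value of exp(lam*a - lam^2*b^2/2) at the maximizer lam = a / b^2."""
    return math.exp(a * a / (2.0 * b * b))


a, b, y = 3.0, 2.0, 1.0               # hypothetical fixed values of A, B, y
numeric = mixture_integral(a, b, y)
exact = mixture_closed_form(a, b, y)
print(numeric, exact, pointwise_maximum(a, b))
```

The mixture never exceeds the pointwise maximum $\exp\{A^2/(2B^2)\}$, but it keeps the same quadratic-exponential shape in $A$, with $B^2$ replaced by $B^2 + y^2$ and a polynomial prefactor. That loss is what the shifted normalization $\sqrt{B^2 + (EB)^2}$ and the constants $\sqrt{2}$ and $1/4$ account for.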

Moment inequalities for self-normalized processes
The inequality (1.2) obtained by Caballero, Fernández and Nualart (1998) is used by them to establish the continuity and uniqueness of the solutions of a nonlinear stochastic partial differential equation. A natural question that arises in connection with (1.2) is what happens when the normalization is done by $\sqrt{\langle M\rangle_t}$ rather than by the quadratic variation itself. The following result from de la Peña, Klass and Lai (2000, 2004) provides an answer to this question.

An expectation version of the LIL for self-normalized processes
We next study the case of self-normalized inequalities when there is no possibility of explosion at the origin by shifting the denominator away from zero. An important result in this direction comes from Graversen and Peskir (2000).
Theorem 2.9. Let $\{M_t, \mathcal{F}_t, t \ge 0\}$ be a continuous local martingale with quadratic variation process $\langle M\rangle_t$, $t \ge 0$. Let $l(x) = \log(1 + \log(1 + x))$. Then there exist universal constants $D_1, D_2 > 0$ such that for all stopping times $\tau$ of $M$.
The proof of this result was obtained by making use of Lenglart's inequality, the optional sampling theorem and Itô's formula. Shortly after this result appeared, de la Peña, Klass and Lai (2000) introduced the moment bounds in the last section, in which the denominator is not shifted away from 0. They then realized that shifted moment bounds can also be obtained for the case in which (2.2) or (2.7) holds only for $0 < \lambda < \lambda_0$. Subsequently, de la Peña, Klass and Lai (2004) proved part (i) of the following theorem for more general processes than continuous local martingales. Part (ii) of the theorem can be proved by arguments similar to those of Theorem 2 of de la Peña, Klass and Lai (2000).
for $0 < \lambda < \lambda_0$ and such that $A_0 = 0$ and $B_t$ is nondecreasing in $t > 0$. In the case $T = [0, \infty)$, assume furthermore that $A_t$ and $B_t$ are right-continuous.
(i) There exists a constant $\kappa$ depending only on $\lambda_0$, $\eta$, $\delta$, $q$, $h$ and $L$ such that (2.14)
(ii) There exists $\kappa$ such that for any stopping time $\tau$, (2.15)

Self-normalized LIL for stochastic sequences

Stout's LIL: Self-normalizing via conditional variance
A well-known result in martingale theory is Stout's (1970, 1973) LIL that uses the square root of the conditional variance for normalization. The key to the proof of Stout's result is an exponential supermartingale in Lemma A.5 of the Appendix.
(ii) $\sigma_n^2 < \infty$ a.s. for all $n$, (iii) $\lim_{n\to\infty} \sigma_n^2 = \infty$ a.s., (iv) $\limsup_{n\to\infty} m_n \log\log(\sigma_n^2)/\sigma_n = 0$ a.s. Then $\limsup M_n \big/ \sqrt{2\sigma_n^2 \log\log \sigma_n} \le 1$ a.s.

Discrete-time martingales: self-normalizing via sums of squares
By making use of Lemma A.8 (see Appendix), de la Peña, Klass and Lai (2004) have proved the following upper LIL for self-normalized and suitably centered sums of random variables, under no assumptions on their joint distributions.
Theorem 3.2. Let $X_n$ be random variables adapted to a filtration $\{\mathcal{F}_n\}$. Let $S_n = X_1 + \ldots + X_n$ and $V_n^2 = X_1^2 + \ldots + X_n^2$. Then, given any $\lambda > 0$, there exist positive constants $c_\lambda$ and $b_\lambda$ such that $\lim_{\lambda\to 0} b_\lambda = \sqrt{2}$ and the upper LIL (3.1) holds.

The constant $b_\lambda$ in Theorem 3.2 can be determined as follows. For $\lambda > 0$, let $h(\lambda)$ be the positive solution of $h - \log(1 + h) = \lambda^2$. Let $b_\lambda = h(\lambda)/\lambda$, $\gamma = h(\lambda)/\{1 + h(\lambda)\}$, and $c_\lambda$ is determined via $\lambda/\gamma$. Then $\lim_{\lambda\to 0} b_\lambda = \sqrt{2}$, since $h - \log(1 + h)$ behaves like $h^2/2$ for small $h$. Let $e_k = \exp(k/\log k)$. The basic idea underlying the proof of Theorem 3.2 pertains to upper-bounding the probability of an event of the form $E_k = \{t_{k-1} \le \tau_k < t_k\}$, in which $t_j$ and $\tau_j$ are stopping times defined by (3.2), letting $\inf\emptyset = \infty$. Note that for $i < n$, the centering constants $\mu_i[-\lambda v_n, c_\lambda v_n)$ involve $v_n$ that is not determined until time $n$, so the centered sums that result do not usually form a martingale. However, sandwiching $\tau_k$ between $t_{k-1}$ and $t_k$ enables us to replace both the random exceedance and truncation levels in (3.2) by constants. Then the event $E_k$ can be re-expressed in terms of two simultaneous inequalities, one involving centered sums and the other involving a sum of squares. These inequalities combine to permit application of Lemma A.8 (see Appendix). Thereby we conclude that $P(E_k)$ can be upper-bounded by the probability of an event involving the supremum of a nonnegative supermartingale with mean $\le 1$, to which Doob's maximal inequality can be applied; see de la Peña, Klass and Lai (2004, pages 1924-1925) for details.

Although Theorem 3.2 gives an upper LIL for any adapted sequence of random variables $X_n$, the upper bound in (3.1) may not be attained. Example 6.4 of de la Peña, Klass and Lai (2004) suggests that one way to sharpen the bound is to center $X_n$ at its conditional median before applying Theorem 3.2 to $\tilde X_n = X_n - \mathrm{med}(X_n \mid \mathcal{F}_{n-1})$. On the other hand, if $X_n$ is a martingale difference sequence such that $|X_n| \le m_n$ a.s. for some $\mathcal{F}_{n-1}$-measurable $m_n$ with $m_n = o(v_n)$ and $v_n \to \infty$ a.s., then Theorem 6.1 of de la Peña, Klass and Lai (2004) shows that (3.1) is sharp in the sense that $\limsup S_n \big/ \{V_n (\log\log V_n)^{1/2}\} = \sqrt{2}$ a.s. (3.3) The following example of de la Peña, Klass and Lai (2004) illustrates the difference between self-normalization by $V_n$ and by $\sigma_n$ in Theorems 3.2 and 3.1.

Statistical applications
Most of the probability theory of self-normalized processes developed in the last two decades is concerned with $A_t$ self-normalized by $B_t$ in the case where $A_t = \sum_{i\le t} X_i$ is a sum of i.i.d. random vectors $X_i$ and $B_t = (\sum_{i\le t} X_i X_i')^{1/2}$, using a key property, the finiteness of the moment generating function in (4.1), as observed by Chan and Lai (2000, pages 1646-1648). In the i.i.d. case, the finiteness of the moment generating function (4.1) enables one to embed the underlying distribution in an exponential family, and one can then use change of measures (exponential tilting) to derive saddlepoint approximations for large or moderate deviation probabilities of self-normalized sums or for more general boundary crossing probabilities. Specifically, letting $C_n = \sum_{i=1}^n X_i X_i'$ and letting $\psi(\theta, \rho)$ denote the left-hand side of (4.1), we have (4.2). Let $P_{\theta,\rho}$ be the probability measure under which $(X_i, X_i X_i')$ are i.i.d. with density function $\exp(\theta' X_i - \rho\,\theta' X_i X_i' \theta)$ with respect to $P = P_{0,0}$. The random variable inside the square brackets of (4.2) is simply the likelihood ratio statistic based on $(X_1, \ldots, X_n)$, or the Radon-Nikodym derivative $dP^{(n)}_{\theta,\rho}/dP^{(n)}$, where the superscript $(n)$ denotes restriction of the measure to the $\sigma$-field generated by $\{X_1, \ldots, X_n\}$. For the case of dependent random vectors, although we no longer have the simple cumulant generating function $n\psi(\theta, \rho)$ in (4.2) to develop precise large or moderate deviation approximations, we can still derive exponential bounds by applying the pseudo-maximization technique to (2.2) or (2.13), which is weaker than (4.2), as shown in the following two examples from de la Peña, Klass and Lai (2004, pages 1921-1922), who use a multivariate normal distribution with mean 0 and covariance matrix $V^{-1}$ for the mixing distribution $F$ to generalize (2.8) to the multivariate setting.
Example 4.1. Let $\{d_n\}$ be a sequence of random vectors adapted to a filtration $\{\mathcal{F}_n\}$ such that $d_i$ is conditionally symmetric. Then for any $a > 1$ and any positive definite $k \times k$ matrix $V$,

Example 4.2. Let $M_t$ be a continuous local martingale taking values in $\mathbb{R}^k$ such that $M_0 = 0$, $\lim_{t\to\infty} \lambda_{\min}(\langle M\rangle_t) = \infty$ a.s. and $E\exp(\theta' \langle M\rangle_t \theta) < \infty$ for all $\theta \in \mathbb{R}^k$ and $t > 0$. Then for any $a > 1$ and any positive definite $k \times k$ matrix $V$,

The reason why equality holds in (4.3) is that the process in question is a nonnegative continuous martingale to which an equality due to Robbins and Siegmund (1970) can be applied; see Corollary 4.3 of de la Peña, Klass and Lai (2004). Self-normalization is ubiquitous in statistical applications, although $A_n$ and $C_n$ need no longer be linear functions of the observations $X_i$ and $X_i X_i'$ as in the $t$-statistic or Hotelling's $T^2$-statistic in the multivariate case. Section 4.1 gives an overview of self-normalization in statistical applications, and Section 4.2 discusses the connections of the pseudo-maximization approach with likelihood and Bayesian inference.

Self-normalization in statistical applications
The $t$-statistic $\sqrt{n}(\bar X_n - \mu)/s_n$ is a special case of more general Studentized statistics $(\hat\theta_n - \theta)/\widehat{se}_n$ that are of fundamental importance in statistical inference on an unknown parameter $\theta$ of an underlying distribution from which the sample observations $X_1, \ldots, X_n$ are drawn. In nonparametric inference, $\theta$ is a functional $g(F)$ of the underlying distribution function $F$ and $\hat\theta_n$ is usually chosen to be $g(\hat F_n)$, where $\hat F_n$ is the empirical distribution. The standard deviation of $\hat\theta_n$ is often called its standard error, which is typically unknown, and $\widehat{se}_n$ denotes a consistent estimate of the standard error of $\hat\theta_n$. For the $t$-statistic, $\mu$ is the mean of $F$ and $\bar X_n$ is the mean of $\hat F_n$. Since $\mathrm{Var}(\bar X_n) = \mathrm{Var}(X_1)/n$, we estimate the standard error of $\bar X_n$ by $s_n/\sqrt{n}$, where $s_n^2$ is the sample variance. An important property of a Studentized statistic is that it is an approximate pivot, which means that its distribution is approximately the same for all $\theta$; see Efron and Tibshirani (1993, Section 12.5), who make use of this pivotal property to construct bootstrap-$t$ confidence intervals and tests. For parametric problems, $\theta$ is usually a multidimensional vector and $\hat\theta_n$ is an asymptotically normal estimate (e.g., by maximum likelihood). Moreover, the asymptotic covariance matrix $\Sigma_n(\theta)$ of $\hat\theta_n$ depends on the unknown parameter $\theta$, so $\Sigma_n^{-1/2}(\hat\theta_n)(\hat\theta_n - \theta)$ is the self-normalized (Studentized) statistic that can be used as an approximate pivot for tests and confidence regions.
The theoretical basis for the approximate pivotal property of Studentized statistics lies in the limiting standard normal distribution, or in some other limiting distribution that does not involve $\theta$ (or $F$ in the nonparametric case). To derive the asymptotic normality of $\hat\theta_n$, one often uses a martingale $M_n$ associated with the data, and approximates $\Sigma_n^{-1/2}(\hat\theta_n)(\hat\theta_n - \theta)$ by $\langle M\rangle_n^{-1/2} M_n$. For example, in the asymptotic theory of the maximum likelihood estimator $\hat\theta_n$, $\Sigma_n(\theta)$ is the inverse of the observed Fisher information matrix $I_n(\theta)$, and the asymptotic normality of $\hat\theta_n$ follows by using Taylor's theorem to derive (4.4). The right-hand side of (4.4) is a martingale whose predictable variation is $-I_n(\theta)$. Therefore the Studentized statistic associated with the maximum likelihood estimator can be approximated by a self-normalized martingale.
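The unit-free character of Studentized statistics discussed in this section is easy to see in a small computation. The sketch below uses only the univariate $t$-statistic; the sample, the rescaling constants and the function names are hypothetical. Changing the units of the data (with $\mu$ transformed accordingly) leaves the Studentized statistic unchanged, whereas the unnormalized quantity $\sqrt{n}(\bar X_n - \mu)$ scales with the units.

```python
import math
import random
import statistics


def studentized_mean(xs, mu):
    """t-statistic sqrt(n) * (sample mean - mu) / s_n, with s_n the sample sd."""
    n = len(xs)
    return math.sqrt(n) * (statistics.mean(xs) - mu) / statistics.stdev(xs)


def unnormalized_mean(xs, mu):
    """sqrt(n) * (sample mean - mu), which still carries the units of the data."""
    return math.sqrt(len(xs)) * (statistics.mean(xs) - mu)


rng = random.Random(7)
mu = 2.0
xs = [rng.gauss(mu, 3.0) for _ in range(40)]

# Change of units: x -> c*x + d with c > 0 (e.g. metres to millimetres plus an offset).
c, d = 1000.0, -7.0
xs_rescaled = [c * x + d for x in xs]
mu_rescaled = c * mu + d

t_original = studentized_mean(xs, mu)
t_rescaled = studentized_mean(xs_rescaled, mu_rescaled)
raw_original = unnormalized_mean(xs, mu)
raw_rescaled = unnormalized_mean(xs_rescaled, mu_rescaled)

print(t_original, t_rescaled)      # equal up to rounding
print(raw_original, raw_rescaled)  # differ by the factor c
```

This invariance holds exactly for affine changes of units; the approximate pivotal property in the text is the stronger, distributional statement that the law of the statistic barely depends on $\theta$.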

Pseudo-maximization in likelihood and Bayesian inference
Let $X_1, \ldots, X_n$ be observations from a distribution with joint density function $f_\theta(x_1, \ldots, x_n)$. Likelihood inference is based on the likelihood function $L_n(\theta) = f_\theta(X_1, \ldots, X_n)$, whose maximization leads to the maximum likelihood estimator $\hat\theta_n$. Bayesian inference is based on the posterior distribution of $\theta$, which is the conditional distribution of $\theta$ given $X_1, \ldots, X_n$ when $\theta$ is assumed to have a prior distribution with density function $\pi$. Under squared error loss, the Bayes estimator $\tilde\theta_n$ is the mean of the posterior distribution whose density function is proportional to $L_n(\theta)\pi(\theta)$, i.e., $\tilde\theta_n = \int \theta\pi(\theta)L_n(\theta)\,d\theta \Big/ \int \pi(\theta)L_n(\theta)\,d\theta$. (4.5) Recall that $L_n(\theta)$ is maximized at the maximum likelihood estimator $\hat\theta_n$. Applying Laplace's asymptotic formula to the integrals in (4.5) shows that $\tilde\theta_n$ is asymptotically equivalent to $\hat\theta_n$, so integrating $\theta$ over the posterior distribution in (4.5) amounts to pseudo-maximization. Let $\theta_0$ denote the true parameter value. A fundamental quantity in likelihood inference is the likelihood ratio martingale $L_n(\theta)/L_n(\theta_0) = e^{\ell_n(\theta)}$, where $\ell_n(\theta) = \sum_{i=1}^n \log\dfrac{f_\theta(X_i \mid X_1, \ldots, X_{i-1})}{f_{\theta_0}(X_i \mid X_1, \ldots, X_{i-1})}$. (4.6) Note that $\nabla\ell_n(\theta_0)$ is also a martingale; it is the martingale in the right-hand side of (4.4). Clearly $\int e^{\ell_n(\theta)}\pi(\theta)\,d\theta$ is also a martingale for any probability density function $\pi$ of $\theta$. Lai (2004) shows how the pseudo-maximization approach can be applied to $e^{\ell_n(\theta)}$ to derive boundary crossing probabilities for the generalized likelihood ratio statistics $\ell_n(\hat\theta_n)$ that lead to efficient procedures in sequential analysis.
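The statement that integrating over the posterior in (4.5) amounts to pseudo-maximization can be illustrated in a toy model. The sketch below assumes Bernoulli observations with success count $s$ out of $n$ and a prior density $\pi(\theta) = 2\theta$ on $(0, 1)$; both are choices made here for illustration only. It evaluates the ratio of integrals in (4.5) on a grid and compares it with the maximum likelihood estimate $s/n$.

```python
import math


def posterior_mean(s, n, log_prior, grid=20000):
    """Midpoint-rule evaluation of int theta*pi*L_n / int pi*L_n for a
    Bernoulli likelihood L_n(theta) = theta^s * (1 - theta)^(n - s)."""
    thetas = [(k + 0.5) / grid for k in range(grid)]
    log_w = [s * math.log(t) + (n - s) * math.log(1.0 - t) + log_prior(t)
             for t in thetas]
    shift = max(log_w)                       # avoid underflow for large n
    w = [math.exp(v - shift) for v in log_w]
    return sum(t * wi for t, wi in zip(thetas, w)) / sum(w)


def log_prior(theta):
    """Log of the assumed prior density pi(theta) = 2*theta on (0, 1)."""
    return math.log(2.0 * theta)


gaps = []
for n in (10, 100, 1000, 10000):
    s = 3 * n // 10                          # observed successes, so the MLE is 0.3
    mle = s / n
    bayes = posterior_mean(s, n, log_prior)
    gaps.append(abs(bayes - mle))
    print(n, mle, bayes)
```

In this conjugate example the posterior mean is also available in closed form as $(s + 2)/(n + 3)$, which gives an independent check on the quadrature; the gap to the maximum likelihood estimate shrinks at rate $1/n$, in line with Laplace's formula.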
and setting $d_n = X_n - X_n'$; see Section 6.1 of de la Peña and Giné (1999). For the next four lemmas, Lemma A.5 is the tool used by Stout (1970, 1973) to obtain the LIL for martingales, Lemma A.6 is de la Peña's (1999) extension of Bernstein's inequality for sums of independent variables to martingales, and Lemmas A.7 and A.8 correspond to Lemma 3.9(ii) and Corollary 5.3 of de la Peña, Klass and Lai (2004).
Lemma A.5. Let $\{d_n\}$ be a sequence of random variables adapted to an increasing sequence of $\sigma$-fields $\{\mathcal{F}_n\}$ such that $E(d_n \mid \mathcal{F}_{n-1}) \le 0$ and $d_n \le M$ a.s. for all $n$ and some nonrandom positive constant $M$. Let $0 < \lambda_0 \le M^{-1}$. Then $\{\exp(\lambda A_n - \frac{1}{2}\lambda^2 B_n^2), \mathcal{F}_n, n \ge 0\}$ is a supermartingale for every $0 \le \lambda \le \lambda_0$.