Szegö's Theorem and its Probabilistic Descendants

The theory of orthogonal polynomials on the unit circle (OPUC) dates back to Szegö's work of 1915-21, and has been given a great impetus by the recent work of Simon, in particular his two-volume book [Si4], [Si5], the survey paper (or summary of the book) [Si3], and the book [Si9], whose title we allude to in ours. Simon's motivation comes from spectral theory and analysis. Another major area of application of OPUC comes from probability, statistics, time series and prediction theory; see for instance the book by Grenander and Szegö [GrSz]. Coming to the subject from this background, our aim here is to complement [Si3] by giving some probabilistically motivated results. We also advocate a new definition of long-range dependence.

The subject of orthogonal polynomials on the real line (OPRL), at least some of which forms part of the standard undergraduate curriculum, has its roots in the mathematics of the 19th century. The name of Gábor Szegö (1895-1985) is probably best remembered nowadays for two things: co-authorship of 'Pólya and Szegö' [PoSz] and authorship of 'Szegö' [Sz4], his book of 1938, still the standard work on OPRL. Perhaps the key result in OPRL concerns the central role of the three-term recurrence relation ([Sz4], III.3.2: 'Favard's theorem').
Much less well known is the subject of orthogonal polynomials on the unit circle (OPUC), which dates from two papers of Szegö in 1920-21 ([Sz2], [Sz3]), and to which the last chapter of [Sz4] is devoted. Again, the key is the appropriate three-term recurrence relation, the Szegö recursion or Durbin-Levinson algorithm (§2). This involves one sequence of coefficients (not two sequences, as with OPRL), the Verblunsky coefficients α = (α_n) (§2), so named (there are several other names in use) and systematically exploited in the magisterial two-volume book on OPUC ([Si4], [Si5]) by Barry Simon. See also his survey paper [Si3], written from the point of view of analysis and spectral theory, the survey [GoTo], and his recent book [Si9].
Complementary to this is our own viewpoint, which comes from probability and statistics, specifically time series (as does the excellent survey of 1986 by Bloomfield [Bl3]). Here we have a stochastic process (a random phenomenon unfolding with time) X = (X_n) with n integer (discrete time, as here, corresponds to compactness of the unit circle by Fourier duality, whence the relevance of OPUC; continuous time is also important, and corresponds to OPRL).
We make a simplifying assumption, and restrict attention to the stationary case. The situation is then invariant under the shift n → n + 1, which makes available the powerful mathematical machinery of Beurling's work on invariant subspaces ([Beu]; [Nik1]). While this is very convenient mathematically, it is important to realize that this is both a strong restriction and one unlikely to be satisfied exactly in practice. One of the great contributions of the statistician and econometrician Sir Clive Granger was to demonstrate that statistical/econometric methods appropriate for stationary situations can, when applied indiscriminately to non-stationary situations, lead to misleading conclusions (via the well-known statistical problem of spurious regression). This has profound implications for macroeconomic policy. Governments depend on statisticians and econometricians for advice on interpretation of macroeconomic data. When this advice is misleading and mistaken policy decisions are implemented, avoidable economic losses (in terms of GDP) may result which are large-scale and permanent (cf. Japan's 'lost decade' in the 1990s, or lost two decades, and the global problems of 2007-8 on).
The mathematical machinery needed for OPUC is function theory on the (unit) disc, specifically the theory of Hardy spaces and Beurling's theorem (factorization into inner and outer functions and Blaschke products). We shall make free use of this, referring for what we need to standard works (we recommend [Du], [Ho], [Gar], [Koo1], [Nik1], [Nik2]), but giving detailed references. The theory on the disc (whose boundary, the circle, is compact) corresponds analytically to the theory on the upper half-plane, whose boundary, the real line, is non-compact (for which see e.g. [DymMcK]). Probabilistically, we work on the disc in discrete time and the half-plane in continuous time. In each case, what dominates is an integrability condition. In discrete time, this is Szegö's condition (Sz), or non-determinism (ND): integrability of the logarithm log w of the spectral density w (of µ) (§3). In continuous time, this is the logarithmic integral, which gives its name to Koosis' book [Koo2].
In view of the above, the natural context in which to work is that of complex-valued stochastic processes, rather than real-valued ones, in discrete time.We remind the reader that here the Cauchy-Schwarz inequality tells us that correlation coefficients lie in the unit disc, rather than the interval [−1, 1].
The time-series aspects here go back at least as far as the work of Wiener [Wi1] in 1932 on generalized harmonic analysis, GHA (which, incidentally, contains a good historical account of the origins of spectral methods, e.g. in the work of Sir Arthur Schuster in the 1890s on heliophysics). During World War II, the linear filter (linearity is intimately linked with Gaussianity) was developed independently by Wiener in the USA [Wi2], motivated by problems of automatic fire control for anti-aircraft artillery, and by Kolmogorov in Russia (then the USSR) [Kol]. This work was developed by the Ukrainian mathematician M. G. Krein over the period 1945-1985 (see e.g. [Dym]), by Wiener in the 1950s ([Wi3], including commentaries) and by I. A. Ibragimov (1968 on).
The subject of time series is of great practical importance (e.g. in econometrics), but suffered within statistics from being regarded as 'for experts only'. This changed with the 1970 book by Box and Jenkins (see [BoxJeRe]), which popularized the subject by presenting a simplified account (including an easy-to-follow model-fitting and model-checking recipe), based on ARMA models (AR for autoregressive, MA for moving average). The ARMA approach is still important; see e.g. Brockwell and Davis [BroDav] for a modern textbook account. Simon's work ([Si3], [Si4], [Si5]) focusses largely on four conditions, two weak (and comparable) and two strong (and non-comparable). Our aim here is to complement the expository account in [Si3] by adding the time-series viewpoint. This necessitates adding (at least) five new conditions. Four of these (comparable) we regard as intermediate, the fifth as strong.
In our view, one needs three levels of strength here, not two.One is reminded of the Goldilocks principle (from the English children's story: not too hot/hard/high/..., not too cold/soft/low/..., but just right).
What follows is a survey of this area, which contains (at least) eight different layers, of increasing (or decreasing) generality. This is an increase on Simon's (basic minimum of) four. We hope that no one will be deterred by this increase in dimensionality, and so in apparent complexity. Our aim is the precise opposite: to open up this fascinating area to a broader mathematical public, including the time-series, probabilistic and statistical communities. For this, one needs to open up the 'grey zone' between the strong and weak conditions, and examine the third category, of intermediate conditions. We focus on these three levels of generality. This largely reduces the effective dimensionality to three, which we feel simplifies matters. Mathematics should be made as simple as possible, but not simpler (to adapt Einstein's immortal dictum about physics).
We close by quoting Barry Simon ([Si8], 85): "It's true that until Euclidean Quantum Field Theory changed my tune, I tended to think of probabilists as a priesthood who translated perfectly simple functional analytic ideas into a strange language that merely confused the uninitiated." He continues: in his 1974 book on Euclidean Quantum Field Theory, "the dedication says: 'To Ed Nelson, who taught me how unnatural it is to view probability theory as unnatural'".

§2. Verblunsky's theorem and partial autocorrelation.
Let X = (X_n : n ∈ Z) be a discrete-time, zero-mean, (wide-sense) stationary stochastic process, with autocovariance function γ = (γ_n) (the variance is constant by stationarity, so we may take it as 1, and then the autocovariance reduces to the autocorrelation).
Let H be the Hilbert space spanned by X = (X_n) in the L²-space of the underlying probability space, with inner product (X, Y) := E[X Ȳ] and norm ||X|| := [E(|X|²)]^{1/2}. Write T for the unit circle, the boundary of the unit disc D, parametrised by z = e^{iθ}; unspecified integrals are over T.
Theorem 1 (Kolmogorov Isomorphism Theorem). There is a process Y on T with orthogonal increments and a probability measure µ on T with
(i) X_t = ∫ e^{itθ} dY(θ);
(ii) E[|dY(θ)|²] = dµ(θ).
(iii) The autocorrelation function γ then has the spectral representation
γ_t = ∫ e^{itθ} dµ(θ).
(iv) One has the Kolmogorov isomorphism between H (the time domain) and L²(µ) (the frequency domain) given by
X_t ↔ e^{itθ}
for integer t (as time is discrete).
Proof. Parts (i), (ii) are the Cramér representation of 1942 ([Cra], [Do] X.4; Cramér and Leadbetter [CraLea] §7.5). Part (iii), due originally to Herglotz in 1911, follows from (i) and (ii) ([Do] X.4, [BroDav] §4.3). Part (iv) is due to Kolmogorov in 1941 [Kol]. All this rests on Stone's theorem of 1932, giving the spectral representation of groups of unitary transformations of linear operators on Hilbert space; see [Do] 636-7 for a historical account and references (including work of Khintchine in 1934 in continuous time), [DunSch] X.5 for background on spectral theory. //
The reader will observe the link between the Kolmogorov Isomorphism Theorem and (ii), and its later counterpart from 1944, the Itô Isomorphism Theorem and (dB_t)² = dt in stochastic calculus.
To avoid trivialities, we suppose in what follows that µ is non-trivial, that is, has infinite support.
Since for integer t the e^{itθ} span the polynomials in e^{iθ}, prediction theory for stationary processes reduces to approximation by polynomials. This is the classical approach to the main result of the subject, Szegö's theorem (§3 below); see e.g. [GrSz], Ch. 3, [Ach], Addenda, B. We return to this in §7.7 below.
The Toeplitz matrix for X, or µ, or γ, is
Γ := (γ_{j−i})_{i,j}.
It is positive definite.
For n ∈ N, write H_{[−n,−1]} for the subspace of H spanned by {X_{−n}, ..., X_{−1}} (the finite past at time 0 of length n), and P_{[−n,−1]} for projection onto H_{[−n,−1]} (thus P_{[−n,−1]}X_0 is the best linear predictor of X_0 based on the finite past, and X_0 − P_{[−n,−1]}X_0 is the prediction error). We use a similar notation for prediction based on the infinite past. Thus H_{(−∞,−1]} is the closed linear span (cls) of the X_k, k ≤ −1, P_{(−∞,−1]} is the corresponding projection, and similarly for other time-intervals. Write H_n := H_{(−∞,n]} for the (subspace generated by) the past up to time n, and H_{−∞} := ∩_n H_n for their intersection, the (subspace generated by) the remote past. With corr(Y, Z) := (Y, Z)/[||Y|| ||Z||] for Y, Z zero-mean and not a.s. 0, write also
α_n := corr(X_n − P_{[1,n−1]}X_n, X_0 − P_{[1,n−1]}X_0)
for the correlation between the residuals at times 0, n resulting from (linear) regression on the intermediate values X_1, ..., X_{n−1}. The sequence α = (α_n) is called the partial autocorrelation function (PACF). It is also called the sequence of Verblunsky coefficients, for reasons which will emerge below.
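The defining property of the PACF (correlate the residuals of the endpoints after linear regression on the intermediate values) can be checked by direct simulation. The sketch below is ours and purely illustrative; the function name and the sample-based estimator are assumptions, not taken from the sources cited.

```python
import numpy as np

def pacf_by_regression(x, n):
    """Sample partial autocorrelation at lag n: the correlation between the
    residuals of x_t and x_{t+n} after least-squares regression on the
    intermediate values x_{t+1}, ..., x_{t+n-1} (direct, if inefficient)."""
    x = np.asarray(x, dtype=float)
    T = len(x) - n
    a, b = x[:T], x[n:n + T]
    if n > 1:
        mid = np.column_stack([np.ones(T)] + [x[j:j + T] for j in range(1, n)])
        a = a - mid @ np.linalg.lstsq(mid, a, rcond=None)[0]
        b = b - mid @ np.linalg.lstsq(mid, b, rcond=None)[0]
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# AR(1) example: X_t = 0.6 X_{t-1} + xi_t has PACF 0.6 at lag 1, 0 beyond.
rng = np.random.default_rng(0)
T = 20000
x = np.empty(T)
x[0] = rng.standard_normal()
for t in range(1, T):
    x[t] = 0.6 * x[t - 1] + rng.standard_normal()
```

For an AR(1) process the estimate reproduces, to within sampling error, the theoretical PACF: 0.6 at lag 1 and 0 at higher lags.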
Theorem 2 (Verblunsky's Theorem). There is a bijection between the sequences α = (α_n) with each α_n ∈ D and the probability measures µ on T.
This result dates from Verblunsky in 1936 [V2], in connection with OPUC. It was re-discovered long afterwards by Barndorff-Nielsen and Schou [BarN-S] in 1973 and Ramsey [Ram] in 1974, both in connection with parametrization of time-series models in statistics. The Verblunsky bijection has the great advantage to statisticians of giving an unrestricted parametrization: the only restrictions on the α_n are the obvious ones resulting from their being correlations, |α_n| ≤ 1, or, as µ is non-trivial, |α_n| < 1. By contrast, γ = (γ_n) gives a restricted parametrization, in that the possible values of γ_n are restricted by the inequalities of positive-definiteness (principal minors of the Toeplitz matrix Γ are positive). This partly motivates the detailed study of the PACF in, e.g., [In1], [In2], [In3], [InKa1], [InKa2]. For general statistical background on partial autocorrelation, see e.g. [KenSt], Ch. 27 (Vol. 2).
As we mentioned in §1, the basic result for OPUC corresponding to Favard's theorem for OPRL is the Szegö recurrence (or recursion): given a probability measure µ on T, let Φ_n be the monic orthogonal polynomials it generates (by Gram-Schmidt orthogonalization). For a polynomial P_n of degree n, write P_n^*(z) := z^n \overline{P_n(1/\bar{z})} for the reversed polynomial; then
Φ_{n+1}(z) = zΦ_n(z) − ᾱ_{n+1} Φ_n^*(z),
where the parameters α_n lie in D and are the Verblunsky coefficients (also known variously as the Szegö, Schur, Geronimus and reflection coefficients; see [Si4], §1.1). The double use of the name Verblunsky coefficients and the notation α = (α_n) for the PACF and these coefficients is justified: the two coincide. Indeed, the Szegö recursion is known in the time-series literature as the Durbin-Levinson algorithm; see e.g. [BroDav], §§3.4, 5.2. The term Verblunsky coefficient is from Simon [Si4], to which we refer repeatedly. We stress that Simon writes α_n for our α_{n+1}, and so has n = 0, 1, ... where we have n = 1, 2, .... Our notational convention is already established in the time-series literature (see e.g. [BroDav], §§3.4, 5.2), and is more convenient in our context of the PACF, where n = 1, 2, ... has the direct interpretation as a time-lag between past and future (cf. [Si4], (1.5.15), p. 56-57). See [Si4], §1.5 and (for two proofs of Verblunsky's theorem) §§1.7, 3.1, and [McLZ] for a recent application of the unrestricted PACF parametrization.
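As an illustration (a sketch of ours, using the convention of this paper with α_n for n ≥ 1, and the standard recursion Φ_{n+1}(z) = zΦ_n(z) − ᾱ_{n+1}Φ_n^*(z); the function name is illustrative), the monic polynomials can be generated from a given Verblunsky sequence by direct coefficient manipulation:

```python
import numpy as np

def monic_opuc(alpha):
    """Monic OPUC Phi_0, ..., Phi_N from Verblunsky coefficients
    alpha_1, ..., alpha_N via the Szego recursion
        Phi_{n+1}(z) = z Phi_n(z) - conj(alpha_{n+1}) Phi_n^*(z),
    where Phi_n^*(z) = z^n conj(Phi_n(1/conj(z))) reverses and conjugates
    the coefficients.  Arrays hold coefficients in increasing powers of z."""
    polys = [np.array([1.0 + 0j])]                    # Phi_0 = 1
    for a in alpha:
        phi = polys[-1]
        star = np.conj(phi)[::-1]                     # coefficients of Phi_n^*
        nxt = (np.concatenate(([0.0], phi))           # z Phi_n(z)
               - np.conj(a) * np.concatenate((star, [0.0])))
        polys.append(nxt)
    return polys
```

With all α_n = 0 (normalized Lebesgue measure on the circle) the recursion returns Φ_n(z) = z^n, as it should.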
One may partially summarize the distributional aspects of Theorems 1 and 2 by the one-one correspondences
µ ↔ γ ↔ α.
Write
P_{[1,n]}X_{n+1} = Σ_{j=1}^n φ_{nj} X_{n+1−j}
for the best linear predictor of X_{n+1} given X_n, ..., X_1,
v_n := ||X_{n+1} − P_{[1,n]}X_{n+1}||²
for the mean-square error in the prediction of X_{n+1} based on X_1, ..., X_n, and φ_n := (φ_{n1}, ..., φ_{nn}) for the vector of finite-predictor coefficients. The Durbin-Levinson algorithm ([Lev], [Dur]; [BroDav] §5.2, [Pou] §7.2) gives the φ_{n+1}, v_{n+1} recursively, in terms of quantities known at time n, as follows:
(i) The new component of φ_{n+1} is given first, by
φ_{n+1,n+1} = [γ_{n+1} − Σ_{j=1}^n φ_{nj} γ_{n+1−j}] / v_n.
The φ_{nn} are the Verblunsky coefficients α_n:
φ_{nn} = α_n.
(ii) The remaining components are given by
φ_{n+1,j} = φ_{nj} − φ_{n+1,n+1} φ̄_{n,n+1−j}   (j = 1, ..., n).
(iii) The prediction errors are given recursively by
v_{n+1} = v_n [1 − |φ_{n+1,n+1}|²].
In particular, v_n > 0, and we have from (iii) that
v_n = Π_{k=1}^n (1 − |α_k|²).
So the n-step prediction error variance v_n → σ² > 0 iff the infinite product converges, that is, α ∈ ℓ², an important condition that we will meet in §3 below in connection with Szegö's condition.
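The recursion is short enough to state as code. The following sketch is ours, following the Brockwell-Davis form of the Durbin-Levinson recursion for real processes; the names are illustrative.

```python
import numpy as np

def durbin_levinson(gamma, N):
    """Durbin-Levinson recursion for a real stationary autocovariance
    sequence gamma[0..N].  Returns the finite-predictor coefficients
    (phi_N1, ..., phi_NN) of X_N, ..., X_1 in the best linear predictor
    of X_{N+1}, the prediction-error variances v_0, ..., v_N, and the
    partial autocorrelations alpha_n = phi_nn."""
    gamma = np.asarray(gamma, dtype=float)
    v, phi = gamma[0], np.zeros(0)
    vs, alphas = [v], []
    for n in range(1, N + 1):
        # (i) new reflection coefficient phi_nn = alpha_n
        a = (gamma[n] - sum(phi[j] * gamma[n - 1 - j]
                            for j in range(n - 1))) / v
        # (ii) update the earlier components, then append the new one
        phi = np.append(phi - a * phi[::-1], a)
        # (iii) prediction-error recursion v_n = v_{n-1} (1 - alpha_n^2)
        v *= 1.0 - a * a
        alphas.append(a)
        vs.append(v)
    return phi, np.array(vs), np.array(alphas)
```

For the AR(1) autocovariance γ_k = φ^k γ_0 this returns α_1 = φ and α_n = 0 for n ≥ 2, with v_n = γ_0(1 − φ²) for n ≥ 1, as it should; each pass costs O(n) operations, whence the quadratic total cost.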
Note. 1. The Durbin-Levinson algorithm is related to the Yule-Walker equations of time-series analysis (see e.g. [BroDav], §8.1), but avoids the need there for matrix inversion.
2. The computational complexity of the Durbin-Levinson algorithm grows quadratically, rather than cubically as one might expect; see e.g. Golub and Van Loan [GolvL], §4.7. Its good numerical properties result from efficient use of the Toeplitz character of the matrix Γ (or equivalently, of the Szegö recursion).
3. See [KatSeTe] for a recent approach to the Durbin-Levinson algorithm, and [Deg] for the multivariate case.

Stochastic versus non-stochastic
This paper studies prediction theory for stationary stochastic processes. As an extreme example (in which no prediction is possible), take the 'free' case, in which the X_n are independent (and identically distributed). Then γ_n = 0 (n ≠ 0), P_{(−∞,−1]}X_0 = 0 and σ² = 1: the past carries no information about the present. In contrast to this is the situation where X = (X_n) is non-stochastic: deterministic, but (typically) chaotic. This case often arises in non-linear time-series analysis and dynamical systems; for a monograph treatment, see Kantz and Schreiber [KanSch].
One natural way to classify results on OPUC is by the strength of the conditions that they impose. Simon's book discusses a range of conditions, starting with a fairly weak one, Szegö's condition ([Si4] Ch. 2 and §3 below), and proceeding to two principal stronger ones, Baxter's condition ([Si4] Ch. 5 and §4 below) and the strong Szegö condition ([Si4] Ch. 6 and §5 below). From a probabilistic viewpoint, equally important are a range of intermediate conditions not discussed in Simon's book. These we discuss in §6. We close with some remarks in §7.

§3. Weak conditions: Szegö's theorem.

Rakhmanov's Theorem
One naturally expects that the influence of the distant past decays with increasing lapse of time. So one wants to know when
α_n → 0   (n → ∞).
By Rakhmanov's theorem ([Rak]; [Si5] Ch. 9, and Notes to §9.1, [MatNeTo]), this happens if the density w of the absolutely continuous component µ_ac is positive on a set of full measure:
|{θ : w(θ) > 0}| = 1
(using normalized Lebesgue measure, or 2π using Lebesgue measure).
Non-determinism and the Wold decomposition. Write σ² for the one-step mean-square prediction error: by stationarity, this is the σ² = lim_{n→∞} v_n above. Call X non-deterministic (ND) if σ > 0, deterministic if σ = 0. (This usage is suggested by the usual one of non-randomness being zero variance, though here a deterministic process may be random but independent of time, so that the stochastic process reduces to a random variable.) The Wold decomposition (von Neumann [vN] in 1929, Wold [Wo] in 1938; see e.g. Doob [Do], XII.4, Hannan [Ha1], Ch. III) expresses a process X as the sum of a non-deterministic process U and a deterministic process V:
X_n = U_n + V_n,
where the process U is a moving average,
U_n = Σ_{j=0}^∞ m_j ξ_{n−j},
with the ξ_j zero-mean and uncorrelated, with each other and with V. Thus when σ = 0 the ξ_n are 0, U is missing and the process is deterministic. When σ > 0, the spectral measures of U_n, V_n are µ_ac and µ_s, the absolutely continuous and singular components of µ. Think of ξ_n as the 'innovation' at time n: the new random input, a measure of the unpredictability of the present from the past. This is only present when σ > 0; when σ = 0, the present is determined by the past, even by the remote past. The Wold decomposition arises in operator theory ([vN]; Sz.-Nagy and Foias in 1970 [SzNF], Rosenblum and Rovnyak in 1985 [RoRo], §1.3, [Nik2]), as a decomposition into the unitary and completely non-unitary (cnu) parts.
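The moving-average part of the Wold decomposition is easy to simulate. A quick check (ours, with an illustrative coefficient sequence) that a one-sided moving average U_n = Σ_j m_j ξ_{n−j} with unit-variance uncorrelated innovations has autocovariance γ_k = Σ_j m_j m_{j+k}:

```python
import numpy as np

# One-sided moving average, truncated at 30 terms; the coefficients
# m_j = 0.5**j are illustrative only.
rng = np.random.default_rng(1)
m = 0.5 ** np.arange(30)
xi = rng.standard_normal(200000)
u = np.convolve(xi, m, mode="valid")          # a realization of U

def sample_autocov(x, k):
    """Sample autocovariance at lag k (mean removed)."""
    x = x - x.mean()
    return float(x @ x / len(x)) if k == 0 else float(x[:-k] @ x[k:] / len(x))

theory = [float(m[:len(m) - k] @ m[k:]) for k in range(3)]   # sum m_j m_{j+k}
sample = [sample_autocov(u, k) for k in range(3)]
```

Here the theoretical values are γ_0 = 4/3, γ_1 = 2/3, γ_2 = 1/3 (up to truncation), and the sample values agree to within simulation error.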
The original motivation of Szegö, and later Verblunsky, was approximation theory, specifically approximation by polynomials. The Kolmogorov Isomorphism Theorem allows us to pass between finite sections of the past and polynomials; denseness of the polynomials allows prediction with zero error (a 'bad' situation: determinism), which happens iff (Sz) does not hold. There is a detailed account of the (rather involved) history here in [Si4] §2.3. Other classic contributions include work of Krein in 1945, Levinson in 1947 [Lev] and Wiener in 1949 [Wi2]. See [BroDav] §5.8 (where un-normalized Lebesgue measure is used, so there is an extra factor of 2π on the right of (K)), [Roz] §II.5 from the point of view of time series, and [Si4] for OPUC.

Pure non-determinism, (PND)
When the remote past is trivial, there is no deterministic component in the Wold decomposition, and no singular component in the spectral measure. The process is then called purely non-deterministic. Thus (PND) holds iff (ND) holds and µ_s = 0. (Usage differs here: the term 'regular' is used for (PND) in [IbRo], IV.1, but for (ND) in [Do], XII.2.)

The Szegö function and Hardy spaces
Szegö's theorem is the key result in the whole area, and to explore it further we need the Szegö function (h, below). For this, we need the language and viewpoint of the theory of Hardy spaces, and some of its standard results; several good textbook accounts are cited in §1. For 0 < p < ∞, the Hardy space H^p is the class of analytic functions f on D for which
sup_{0<r<1} ∫ |f(re^{iθ})|^p dθ < ∞.
As well as in time series and prediction, as here, Hardy spaces are crucial for martingale theory (see e.g. [Bin1] and the references there). For an entertaining insight into Hardy spaces in probability, see Diaconis [Dia].
For non-deterministic processes, define the Szegö function h by
h(z) := exp( (1/4π) ∫ [(e^{iθ} + z)/(e^{iθ} − z)] log w(θ) dθ )   (z ∈ D)
(note that in [InKa1,2], [Roz] II.5 an extra factor √2π is used on the right). Because log w ∈ L¹ by (Sz), h is an outer function for H² (whence the name (OF) above); see Duren [Du]. Its boundary values satisfy |h(e^{iθ})|² = w(θ) a.e., so h may be regarded as an 'analytic square root' of w. See also Hoffman [Ho], Ch. 3-5, Rudin [Ru], Ch. 17, Helson [He], Ch. 4. Kolmogorov's formula now reads
σ² = G(µ) := exp( ∫ log w(θ) dθ/2π ).   (K)
When σ > 0, the Maclaurin coefficients m = (m_n) of the Szegö function h(z) = Σ_{n=0}^∞ m_n z^n are the moving-average coefficients of the Wold decomposition (recall that the moving-average component does not appear when σ = 0); see Inoue [In3] and below. When σ > 0, m ∈ ℓ² is equivalent to convergence in mean square of the moving-average sum Σ_{j=0}^∞ m_j ξ_{n−j} in the Wold decomposition. This is standard theory for orthogonal expansions; see e.g. [Do], IV.4. Note that a function being in H² and its Maclaurin coefficients being in ℓ² are equivalent by general Hardy-space theory; see e.g. [Ru], 17.10 (see also Th. 17.17 for factorization), [Du] §§1.4, 2.4, [Z2], VII.7.
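Kolmogorov's formula σ² = exp(∫ log w(θ) dθ/2π) lends itself to a direct numerical check (ours, with an illustrative density). For the MA(1) density w(θ) = |1 + b e^{iθ}|², |b| < 1, the mean value property of the harmonic function log |1 + bz|² gives integral 0, so the geometric mean is 1, matching the one-step error of unit-variance innovations:

```python
import numpy as np

# Grid average of log w over the circle approximates (1/2pi) int log w dtheta.
b = 0.5
N = 1 << 16
theta = 2 * np.pi * np.arange(N) / N
w = np.abs(1 + b * np.exp(1j * theta)) ** 2
sigma2 = np.exp(np.mean(np.log(w)))     # Kolmogorov's formula, numerically
print(sigma2)                           # approximately 1.0
```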
Simon [Si4], §2.8 ('Lots of equivalences') gives Szegö's theorem in two parts. One ([Si4] Th. 2.7.14) gives twelve equivalences, the other ([Si4], Th. 2.7.15) gives fifteen; the selection of material is motivated by spectral theory [Si5]. Theorem 3 above extends these lists of equivalences, and treats the material from the point of view of probability theory. (It does not, however, give a condition on the autocorrelation γ = (γ_n) equivalent to (Sz); this is one of the outstanding problems of the area.) The contrast here with Verblunsky's theorem is striking. In general, one has unrestricted parametrization: all values α_n ∈ D are possible, for all n. But under Szegö's condition, one has α ∈ ℓ², and in particular α_n → 0, as in Rakhmanov's theorem. Thus non-deterministic processes fill out only a tiny part of the α-parameter space D^∞. One may regard this as showing that the remote past, trivial under (Sz), has a rich structure in general, as follows: Szegö's alternative (or dichotomy).
One either has
∫ log w(θ) dθ > −∞, that is, (Sz) holds,
or
∫ log w(θ) dθ = −∞.
In the former case, α occupies a tiny part, ℓ², of D^∞, and the remote past H_{−∞} is the subspace corresponding to the singular part µ_s; this is trivial iff µ_s = 0; cf. (PND). In the second case, α occupies all of D^∞, and the remote past is the whole space. Szegö's dichotomy may be interpreted by analogy with physical systems. Some systems (typically, liquids and gases) are 'loose': left alone, they will thermalize, and tend to an equilibrium in which the details of the past history are forgotten. By contrast, some systems (typically, solids) are 'tight': for example, in tempered steel, the thermal history is locked in permanently by the tempering process. Long memory is also important in economics and econometrics; for background here, see e.g. [Rob], [TeKi].
Note. 1. Our h is the Szegö function D of Simon [Si4], (2.4.2), and −1/h (see below) its negative reciprocal −∆, [Si4], (2.2.92) (we follow the time-series references cited above, which use h, to within the factor √2π mentioned above; [Si4], our reference on OPUC, uses D).
2. Both h and −1/h are analytic and non-vanishing in D. See [Si4], Th. 2.2.14 (for −1/h, or ∆), Th. 2.4.1 (for h, or D).
3. That (Sz) implies h = D is in the unit ball of H² is in [Si4], Th. 2.4.1.
4. See de Branges and Rovnyak [dBR] for general properties of such square-summable power series.
5. Our autocorrelation γ is Simon's c (he calls our γ_n, or his c_n, the moments of µ: [Si4], (1.1.20)). Our moving-average coefficients m = (m_n) have no counterpart in [Si4], and nor do the autoregressive coefficients r = (r_n) or minimality (see below for these). We will also need the Fourier coefficients of log w (known, for reasons explained below, as the cepstrum), which we write as L = (L_n) ('L for logarithm'; Simon's L̂_n, [Si4], (6.1.13)), and a sequence b = (b_n), the phase coefficients (Fourier coefficients of h/h̄).
6. Lund et al. [LuZhKi] give several properties (monotonicity, convexity etc.) which one of m, γ has iff the other has.

MA(∞) and AR(∞)
The power series expansion
−1/h(z) = Σ_{n=0}^∞ r_n z^n
generates the AR(∞) coefficients r = (r_n) in the (infinite-order) autoregression. See [InKa2] §2, [In3] for background. One may thus extend the above list of one-one correspondences, as follows:
µ ↔ γ ↔ α ↔ m ↔ r.

Finite and infinite predictor coefficients.
The Szegö limit theorem.
With G(µ) as above, write T_n (or T_n(γ), or T_n(µ)) for the n × n Toeplitz matrix Γ^{(n)} with elements Γ^{(n)}_{ij} := γ_{j−i}, obtained by truncation of the Toeplitz matrix Γ (cf. [BotSi2]). Szegö's limit theorem states that, under (Sz), its determinant satisfies
det T_n / det T_{n−1} → G(µ)   (n → ∞),
equivalently (det T_n)^{1/n} → G(µ) (note that (Sz) is needed for the right-hand side to be defined). A stronger statement, Szegö's strong limit theorem, holds; we defer this till §5.
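The limit theorem can be illustrated numerically (a sketch of ours). For the MA(1) density w = |1 + b e^{iθ}|² the autocovariances are γ_0 = 1 + b², γ_{±1} = b, γ_n = 0 otherwise, and G(µ) = 1, so the determinant ratios should tend to 1:

```python
import numpy as np

b = 0.5
gamma = np.zeros(12)
gamma[0], gamma[1] = 1 + b * b, b

def toeplitz_det(n):
    """Determinant of the n x n Toeplitz truncation T_n(gamma)."""
    idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return float(np.linalg.det(gamma[idx]))

ratios = [toeplitz_det(n) / toeplitz_det(n - 1) for n in range(2, 12)]
print(ratios[-1])                       # close to G(mu) = 1
```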
The Szegö limit theorem is used in the Whittle estimator of time-series analysis; see e.g. Whittle [Wh], Hannan [Ha2].

Phase coefficients.
When the Szegö condition (Sz) holds, the Szegö function h(z) = Σ_{n=0}^∞ m_n z^n is defined. We can then define the phase function h/h̄, so called because it has unit modulus and depends only on the phase or argument of h (Peller [Pel], §8.5). Its Fourier coefficients b_n are called the phase coefficients. They are given in terms of m = (m_n) and r = (r_n); the role of the phase coefficients is developed in [BiInKa]. They are important in connection with rigidity (§6 below), and Hankel operators [Pel].
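Numerically, the phase coefficients can be obtained by Fourier methods. The sketch below is ours and illustrative: rebuild boundary values of the Szegö function from the cepstrum as h = exp(L_0/2 + Σ_{n≥1} L_n e^{inθ}) (the outer function with |h|² = w), then expand h/h̄ by FFT. For the MA(1) density w = |1 + b e^{iθ}|² one has h = 1 + b e^{iθ}, so b_0 = 1 − b² and b_1 = b give a closed-form check.

```python
import numpy as np

N, b = 4096, 0.5
theta = 2 * np.pi * np.arange(N) / N
w = np.abs(1 + b * np.exp(1j * theta)) ** 2
L = np.fft.fft(np.log(w)) / N                 # cepstral coefficients of log w
# Boundary values of the outer (Szego) function; 60 terms suffice here.
h = np.exp(L[0] / 2 + sum(L[n] * np.exp(1j * n * theta) for n in range(1, 60)))
bcoef = np.fft.fft(h / np.conj(h)) / N        # phase coefficients b_n
print(bcoef[0].real, bcoef[1].real)           # approximately 0.75 and 0.5
```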

Rajchman measures.
In the Gaussian case, mixing in the sense of ergodic theory holds iff γ_n → 0 as n → ∞, that is, iff µ is a Rajchman measure (see [Ly3] and the appendix to [KahSa]).
ARMA(p, q). The Box-Jenkins ARMA(p, q) methodology ([BoxJeRe], [BroDav]: autoregressive of order p, moving average of order q; see §6.3 for MA(q)) applies to stationary time series, where the roots of the relevant polynomials lie off the unit circle (see e.g. [BroDav] §3.1). The limiting case, of unit roots, involves non-stationarity, and so the statistical dangers of spurious regression (§1); cf. Robinson [Rob], p. 2. We shall meet other instances of unit-root phenomena later (§6.3).
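In the Brockwell-Davis convention, the check on the autoregressive side is a root condition on the polynomial 1 − φ_1 z − ... − φ_p z^p: for a causal stationary model all roots lie outside the closed unit disc, with unit roots as the borderline case. A minimal sketch (ours; the function name is illustrative):

```python
import numpy as np

def ar_is_causal(phi):
    """Brockwell-Davis causality check for an AR(p) model
    X_t = phi_1 X_{t-1} + ... + phi_p X_{t-p} + xi_t: the polynomial
    1 - phi_1 z - ... - phi_p z^p must have all its roots outside the
    closed unit disc; unit roots are the borderline non-stationary case."""
    coeffs = np.concatenate(([1.0], -np.asarray(phi, dtype=float)))
    roots = np.roots(coeffs[::-1])       # np.roots wants highest degree first
    # small tolerance so a numerically computed unit root is rejected
    return bool(np.all(np.abs(roots) > 1.0 + 1e-8))

print(ar_is_causal([0.5]))               # stationary AR(1): True
print(ar_is_causal([1.0]))               # random walk, unit root: False
```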
(v) r ∈ ℓ¹, µ_s = 0 and the spectral density w is continuous and positive.
Proof.
(iv) ⇒ (ii). From the MA(∞) representation,
The relevant Banach algebras contain the Wiener algebra used above as the special case ν = 1.
5. The approach of [Si4], §5.1 is via truncated Toeplitz matrices and their inverses. The method derives, through Baxter's work, from the Wiener-Hopf technique. This point of view is developed at length in [BotSi1], [BotSi2].
Baxter's motivation was approximation to infinite-past predictors by finite-past predictors.

Long-range dependence
In various physical models, the property of long-range dependence (LRD) is important, particularly in connection with phase transitions (see e.g. [Si1], Ch. II, [Gri1], Ch. 9, [Gri2], Ch. 5), to which we return below. This is a spatial property, but it applies also in time rather than space, when the term used is long memory. A good survey of long-memory processes was given by Cox [Cox] in 1984, and a monograph treatment by Beran [Ber] in 1994. For more recent work, see [DouOpTa], [Rob], [Gao] Ch. 6, [TeKi], [GiKoSu].
Li ([Li], §3.4) has recently given a related but different definition of long memory; we return to this in §5 below.

Strong conditions: the strong Szegö theorem
The work of this section may be motivated by work from two areas of physics.

The cepstrum.
During the Cold War, the problem of determining the signature of the underground explosion in a nuclear weapons test, and distinguishing it from that of an earthquake, was very important, and was studied by the American statistician J. W. Tukey and collaborators. Write L = (L_n), where the L_n are the Fourier coefficients of log w, the log spectral density:
L_n := ∫ log w(θ) e^{−inθ} dθ/2π.
Thus exp(L_0) is the geometric mean G(µ). The sequence L is called the cepstrum, the L_n the cepstral coefficients (Simon's notation here is L̂_n; [Si4], (2.1.14), (6.1.11)); see e.g. [OpSc], Ch. 12. The terminology dates from work of Bogert, Healy and Tukey of 1963 on echo detection [BogHeTu]; see McCullagh [McC], Brillinger [Bri] (the term is chosen to suggest both echo and spectrum, by reversing the first half of the word spectrum; it is accordingly pronounced with the c hard, like a k).
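In practice the cepstrum is computed by the FFT. A minimal sketch (ours): sample log w on a uniform grid and take L_n ≈ (1/N) Σ_k log w(θ_k) e^{−inθ_k}. For the MA(1) density w = |1 + b e^{iθ}|², |b| < 1, there is a closed form to check against: L_0 = 0 and L_n = (−1)^{n+1} b^n / n for n ≥ 1.

```python
import numpy as np

N, b = 4096, 0.5
theta = 2 * np.pi * np.arange(N) / N
w = np.abs(1 + b * np.exp(1j * theta)) ** 2
L = np.fft.fft(np.log(w)).real / N     # cepstral coefficients (real here)
print(L[1], L[2])                      # approximately 0.5 and -0.125
```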

The strong Szegö limit theorem.
This (which gives the weak form on taking logarithms) states (in its present form, due to Ibragimov) that
det T_n / G(µ)^n → exp( Σ_{k=1}^∞ k|L_k|² )   (n → ∞)
(of course the sum here must converge; it turns out that this form is best-possible: the result is valid whenever it makes sense ([Si4], 337)).
The motivation was Onsager's work in the two-dimensional Ising model, and in particular Onsager's formula, giving the existence of a critical temperature T_c and the decay of the magnetization as the temperature T ↑ T_c; see [BotSi2] §5.1, [Si1] II.6, [McCW]. The mechanism was a question by Onsager (c. 1950) to his Yale colleague Kakutani, who asked Szegö ([Si4], 331).
Write H^{1/2} for the subspace of ℓ² of sequences a = (a_n) with
||a||² := Σ_n (1 + |n|)|a_n|² < ∞
(the function of the '1' on the right is to give a norm; without it, ||·|| vanishes on the constant functions). This is a Sobolev space ([Si4], 329, 337); it is also a Besov space, whence the alternative notation B_2^{1/2}; see e.g. Peller [Pel], Appendix 2.6 and §7.13. This is the space that plays the role here of ℓ² in §2 and ℓ¹ in §3. Note first that, although ℓ¹ and H^{1/2} are close, in that a sequence of powers (n^{−c}) belongs to both or neither, neither contains the other.
Theorem 6 (Strong Szegö Theorem).
(i) If (PND) holds (i.e. (Sz) = (ND) holds and µ_s = 0), then
lim_{n→∞} det T_n / G(µ)^n = exp( Σ_{k=1}^∞ k|L_k|² ) = Π_{j=1}^∞ (1 − |α_j|²)^{−j}
(all three may be infinite), with the infinite product converging iff the strong Szegö condition holds.
(iii) Under (Sz), finiteness of any (equivalently, all three) of the expressions in (i) forces µ_s = 0.
Proof. Part (i) is due to Ibragimov ([Si4], Th. 6.1.1), and (ii) is immediate from this. Part (iii) is due to Golinskii and Ibragimov ([Si4], Th. 6.1.2; cf. [Si2]). //
Part of Ibragimov's theorem was recently obtained independently by Li [Li], under the term reflectrum identity (so called because it links the Verblunsky or reflection coefficients with the cepstrum), based on information theory: mutual information between past and future. Earlier, Li and Xie [LiXi] had shown the following: (i) a process with given autocorrelations γ_0, ..., γ_p with minimal information between past and future must be an autoregressive model AR(p) of order p; (ii) a process with given cepstral coefficients L_0, ..., L_p with minimal information between past and future must be a Bloomfield model BL(p) of order p ([Bl1], [Bl2]), that is, one with spectral density
w(θ) = exp{ L_0 + 2 Σ_{k=1}^p L_k cos kθ }.
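A Bloomfield model is thus specified directly by finitely many cepstral coefficients. A small sketch (ours, with illustrative values) builds w from given L_0, ..., L_p and recovers the cepstrum by FFT as a round trip:

```python
import numpy as np

# Bloomfield BL(2): w(theta) = exp{L_0 + 2(L_1 cos theta + L_2 cos 2 theta)}.
L_target = [0.2, 0.5, -0.3]                      # L_0, L_1, L_2 (illustrative)
N = 1024
theta = 2 * np.pi * np.arange(N) / N
w = np.exp(L_target[0]
           + 2 * sum(L_target[k] * np.cos(k * theta) for k in range(1, 3)))
L_back = np.fft.fft(np.log(w)).real / N          # cepstrum recovered by FFT
print([round(v, 6) for v in L_back[:3]])         # recovers [0.2, 0.5, -0.3]
```

All higher cepstral coefficients vanish, since log w is a trigonometric polynomial of degree p.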
Another approach to the strong Szegö limit theorem, due to Kac [Kac], uses conditions of ℓ¹ type (recall that ℓ¹ and H^{1/2} are not comparable). This proof, from 1954, is linked to probability theory via Spitzer's identity of 1956, and hence to fluctuation theory for random walks, for which see e.g. [Ch], Ch. 8.
The Borodin-Okounkov formula. This turns the strong Szegö limit theorem above from analysis to algebra, by identifying the quotient on the left there as a determinant which visibly tends to 1 as n → ∞ [BorOk]; see [Si4] §6.2. (It was published in 2000, having been previously obtained by Geronimo and Case [GerCa] in 1979; see [Si4] 337, 344, [Bot] for background here.) In terms of operator theory and in Widom's notation [Bot], the result holds for a a sufficiently smooth function without zeros on the unit circle and with winding number 0. Then a has a Wiener-Hopf factorization (see e.g. [Si4], Th. 6.2.13), and Q_n H(b)H(c)Q_n → 0 in the trace norm, whence det T_n(a)/G(a)^n → E(a), the strong Szegö limit theorem. See [Si4], Ch. 6, [Si6], [BasW], [BotW] (in [Si4] §6.2 the result is given in OPUC terms; here b, c are the phase function h/h̄ and its inverse).
We may have both of the strong conditions (B) and (sSz) (as happens in Kac's method [Kac], for instance). Matters then simplify, since the spectral density w is now continuous and positive. So w is bounded away from 0 and ∞, and log w is bounded. Write ω₂ for the L² modulus of continuity. Applying [IbRo], IV.4, Lemma 7 to log w, and then applying it to w, one finds that under (B), L ∈ H^{1/2} and γ ∈ H^{1/2} become equivalent. This last condition is Li's proposed definition of long-range dependence ([Li], §3.4; compare the Debowski-Inoue definition (DI) above, that LRD holds iff α ∉ ℓ¹).
We are now in W ∩ H^{1/2}, the intersection of H^{1/2} with the Wiener algebra W (of absolutely convergent Fourier series) relevant to Baxter's theorem as in §3. As there, we can take inverses, since the Szegö function is non-zero on the circle (cf. [BotSi2], §5.1). One can thus extend Theorem 2 to this situation, including the cepstral condition L ∈ H^{1/2} (Li [Li], Th. 1, part 3, showed that L ∈ H^{1/2} and γ ∈ H^{1/2} are equivalent if w is continuous and positive).
(The reader is warned that some authors use other letters here; e.g. [IbRo] uses β for our φ. We follow Bradley.) We quote from [Bra1] that φ-mixing implies ρ-mixing. We regard the first as a strong condition, so include it here, but the second and its several weaker relatives as intermediate conditions, which we deal with in §6 below.
The spectral characterization for φ-mixing is w = |P|^2 w*, where P is a polynomial with its roots on the unit circle and the cepstrum L* = (L*_n) of w* satisfies the strong Szegö condition (sSz) ([IbRo], IV.4, p. 129). This is weaker than (sSz). In the Gaussian case, φ-mixing (also known as absolute regularity) can also be characterized in operator-theoretic terms: φ(n) can be identified as tr(B_n), where the B_n are compact operators with finite trace, so φ-mixing holds iff tr(B_n) → 0 ([IbRo], IV.2, Th. 4, IV.3, Th. 6).

Intermediate conditions
We turn now to four intermediate conditions, in decreasing order of strength.

ρ-mixing
The spectral characterization of ρ-mixing (also known as complete regularity) is w = |P|^2 w*, where P is a polynomial with its roots on the unit circle and log w* = u + ṽ, with u, v real and continuous (Sarason [Sa2]; Helson and Sarason [HeSa]).
An alternative spectral characterization is w = |P|^2 w*, where P is a polynomial with its roots on the unit circle and, for all ε > 0, log w* = r_ε + u_ε + ṽ_ε, where r_ε is continuous, u_ε, v_ε are real and bounded, and ‖u_ε‖_∞ + ‖v_ε‖_∞ < ε ([IbRo], V.2, Th. 3; we note here that inserting such a polynomial factor preserves complete regularity, merely changing ρ; [IbRo], V.1, Th. 1).
We turn now to a weaker condition. For subspaces A, B of H, the angle between A and B is defined via ρ(A, B) := sup{|(a, b)| : a ∈ A, b ∈ B, ‖a‖ = ‖b‖ = 1}. Then A, B are at a positive angle iff this supremum is < 1. One says that the process X satisfies the positive angle condition, (PA), if for some time lapse k the past cls(X_m : m < 0) and the future cls(X_{k+m} : m ≥ 0) are at a positive angle, i.e. ρ(0) = ... = ρ(k − 1) = 1, ρ(k) < 1, which we write as PA(k) (Helson and Szegö [HeSz], k = 1; Helson and Sarason [HeSa], k > 1). The spectral characterization of this is w = |P|^2 w*, where P is a polynomial of degree k − 1 with its roots on the unit circle and log w* = u + ṽ, where u, v are real and bounded and ‖v‖_∞ < π/2 ([IbRo], V.2, Th. 3, Th. 4).
The case PA(k) for k > 1 is a unit-root phenomenon (cf. the note at the end of §3). We may (with some loss of information) reduce to the case PA(1) by sampling only at every kth time point (cf. [Pel], §§8.5, 12.8). We shall do this for convenience in what follows.
It turns out that the Helson-Szegö condition (PA(1)) coincides with Muckenhoupt's condition A_2 in analysis: sup_I (1/|I| ∫_I w dθ)(1/|I| ∫_I (1/w) dθ) < ∞, where |·| is Lebesgue measure and the supremum is taken over all subintervals I of the unit circle T. See e.g. Hunt, Muckenhoupt and Wheeden [HuMuWh]. With the above reduction of PA to PA(1), we then have: ρ-mixing implies PA(1) (= A_2).
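For a rough numerical illustration of the A_2 condition (a sketch under our own choices of weight and quadrature, not from the text), one can compute the average product above for the power weight |θ|^a on intervals [0, h]. For |a| < 1 it equals 1/(1 − a²) independently of h, consistent with |θ|^a being an A_2 weight:

```python
import numpy as np

def a2_product(w, lo, hi, n=400001):
    """Average of w times average of 1/w over [lo, hi], by the midpoint rule:
    the quantity whose supremum over intervals is the A_2 characteristic."""
    x = lo + (np.arange(n) + 0.5) * (hi - lo) / n
    vals = w(x)
    return np.mean(vals) * np.mean(1.0 / vals)

a = 0.5
w = lambda x: np.abs(x) ** a

# For w = |x|^a with |a| < 1, the product over [0, h] is 1/(1 - a^2) for every h.
print(a2_product(w, 0.0, 1.0))    # ~ 1/(1 - 0.25) = 4/3
print(a2_product(w, 0.0, 0.01))   # ~ 4/3 again: scale invariance
print(a2_product(lambda x: np.ones_like(x), 0.0, 1.0))  # = 1 for a constant weight
```

For |a| ≥ 1 the average of 1/w over [0, h] diverges, so the product blows up as the quadrature is refined: the A_2 condition fails.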

Pure minimality
Consider now the interpolation problem, of finding the best linear interpolation of a missing value, X_0 say, from the others. Write cls(X_m : m ≠ n) for the closed linear span of the values at times other than n.
Under minimality, the relationship between the moving-average coefficients m = (m_n) and the autoregressive coefficients r = (r_n) becomes symmetrical, and one has the following complement to Theorem 4: Theorem 7 (Inoue). For a stationary process X, the following are equivalent: (i) the process is minimal.
Since ±log w are in L^1 together with w and 1/w, when 1/w ∈ L^1 (i.e. the process is minimal) one can handle 1/w, 1/h and m = (m_n) as we handled w, h and r = (r_n). Conversely, each of these conditions is equivalent to (min) ([In1]). This terminology is due to Sarason [Sa1], [Sa2]; the alternative terminology, due to Nakazi, is strongly outer [Na1], [Na2]. One could instead say that such a function is determined by its phase. The idea originates with de Leeuw and Rudin [dLR] and Levinson and McKean [LevMcK]. In view of this, we call the condition that µ be absolutely continuous with spectral density w = |h|^2 with h^2 rigid, or determined by its phase, the Levinson-McKean condition, (LM).

Complete non-determinism; intersection of past and future.
These are weaker than pure minimality ([Bl3], §7, [KaBi]). But since (CND) was already known to be equivalent to (PND) + (IPF), they are stronger than (PND). This takes us from the weakest of the four intermediate conditions of this section to the stronger of the weak conditions of §3.
The spectral characterizations given above were mainly obtained before the work of Fefferman [Fe] in 1971 and Fefferman and Stein [FeSt] in 1972 (see Garnett [Gar], Ch. VI for a textbook account): in particular, they predate the Fefferman-Stein decomposition of a function of bounded mean oscillation, f ∈ BMO, as f = u + ṽ with u, v ∈ L^∞. This has a complement due to Sarason [Sa3], in which f is in VMO iff u, v may be taken continuous. Sarason also gives ([Sa3], Th. 2) a characterization of his class of functions of vanishing mean oscillation VMO within BMO related to Muckenhoupt's condition (A_2). While both components u, ṽ are needed here, and may be large in norm, it is important to note that the burden of being large in norm may be borne by a continuous function, leaving u and ṽ together to be small in L^∞ norm (in particular, less than π/2). This is the Ibragimov-Rozanov result ([IbRo], V.2, Th. 3), used in §6.1 to show that absolute regularity (§5) implies complete regularity.

5. Winding number and index.
The class H^{1/2} occurs in recent work on topological degree and winding number; see Brezis [Bre], Bourgain and Kozma [BouKo]. The winding number also occurs in operator theory, as an index in applications of Banach-algebra methods and the Gelfand transform; see e.g. [Si4], Ch. 5 (cf. Tsirelson [Ts]).

6. Rapid decay and continuability.
Even stronger than the strong conditions considered here in §§4, 5 is the assumption that the Verblunsky coefficients are rapidly decreasing. This is connected to analytic continuability of the Szegö function beyond the unit disk; see [Si7].
7. Wavelets.

Traditionally, the subject of time series seemed to consist of two non-intercommunicating parts, 'time domain' and 'frequency domain' (known to be equivalent to each other via the Kolmogorov Isomorphism Theorem of §2). The subject seemed to suffer from schizophrenia (see e.g. [BriKri] and [HaKR]), though the constant relevance of the spectral or frequency side to questions involving time directly is well illustrated in the apt title 'Past and future' of the paper by Helson and Sarason [HeSa] (cf. [Pel], §8.6). This unfortunate schism has been healed by the introduction of wavelet methods (see e.g. the standard works Meyer [Me] and Meyer and Coifman [MeCo], and in OPUC, Treil and Volberg [TrVo1]). The practical importance of this may be seen in the digitization of the FBI's fingerprint data-bank (without which the US criminal justice system would long ago have collapsed). Dealing with time and frequency together is also crucial in other areas, e.g. in the high-quality reproduction of classical music.

8. Higher dimensions: matrix OPUC (MOPUC).
We present the theory here in one dimension for simplicity, reserving the case of higher dimensions for a sequel [Bin2]. We note here that in higher dimensions the measure µ and the Verblunsky coefficients α_n become matrix-valued (matrix OPUC, or MOPUC), so one loses commutativity. The multidimensional case is needed for portfolio theory in mathematical finance, where one holds a (preferably balanced) portfolio of risky assets rather than one; see e.g. [BinFrKi].

9. Non-commutativity.

Much of the theory presented here has a non-commutative analogue in operator theory; see Blecher and Labuschagne [BlLa], Bekjan and Xu [BeXu] and the references cited there.

10. Non-stationarity.
As mentioned in §1, the question of whether or not the process is stationary is vitally important, and stationarity is a strong assumption. The basic Kolmogorov Isomorphism Theorem can be extended beyond the stationary case in various ways, e.g. to harmonisable processes (see e.g. [Rao]). For background, and applications to filtering theory, see e.g. [Kak]; for filtering theory itself, we refer to e.g. [BaiCr].
The Szegö condition (Sz) for the unit circle (regarded as the boundary of the unit disc) corresponds to the condition ∫ log w(x) dx/(1 + x²) > −∞ for the real line (regarded as the boundary of the upper half-plane). This follows from the Möbius function w = (z − i)/(z + i) mapping the half-plane conformally onto the disc; see e.g. [Du], 189-190. The consequences of this condition are explored at length in Koosis' monograph on the 'logarithmic integral' [Koo2]. Passing from the disc to the half-plane corresponds probabilistically to passing from discrete to continuous time (and analytically to passing from Fourier series to Fourier integrals). The probabilistic theory is considered at length in Dym and McKean [DymMcK].
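As a small, purely illustrative sanity check of this conformal correspondence, one can verify numerically that the Möbius map w = (z − i)/(z + i) sends the real axis to the unit circle and the upper half-plane into the unit disc:

```python
def cayley(z):
    """Mobius (Cayley) map w = (z - i)/(z + i): upper half-plane -> unit disc."""
    return (z - 1j) / (z + 1j)

# Boundary to boundary: real points map onto the unit circle (|x - i| = |x + i|).
for x in (-3.0, 0.0, 2.0, 17.5):
    assert abs(abs(cayley(complex(x, 0.0))) - 1.0) < 1e-12

# Interior to interior: points with Im z > 0 map strictly inside the disc.
for z in (1 + 2j, -0.5 + 0.1j, 3j):
    assert abs(cayley(z)) < 1.0

print("Cayley map checks passed")
```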
We have mentioned the close links between Gaussianity and linearity in §1. For background on Gaussian Hilbert spaces and Fock space, see Janson [Jan] and Peller [Pel]; for extensions of §§5.1, 6 in the Gaussian case, see [IbRo], [Pel], [Bra1], §5. To return to the undergraduate level of our opening paragraph: for an account of Gaussianity, linearity and regression, see e.g. Williams [Wil], Ch. 8, or [BinFr].
We have φ_j = c_0 r_j = σ r_j, with φ_j the infinite-predictor coefficients ([InKa2], (3.1)). Then r ∈ ℓ^1 follows by the Wiener-Lévy theorem, as in Baxter [Ba3], 139-140. //
Notes. 1. Under Baxter's condition, both |h| and |1/h| (or |D| and |∆| = |1/D|) are continuous and positive on the unit circle, as h and 1/h are analytic in the disk and so attain their maximum modulus on the circle by the maximum principle (similarly for D(·) and ∆); [Si4], [InKa2].
2. The realization that the Verblunsky coefficients α of OPUC are actually the partial autocorrelation function (PACF) of time series opened the way for the systematic exploitation of OPUC within time series by a number of authors. These include Inoue, in a series of papers from 2000 on (see especially [In3] of 2008), and Inoue and Kasahara from 2004 on (see especially [InKa2] of 2006).
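The Szegö recursion / Durbin-Levinson algorithm of §2 can be sketched in a few lines: given autocovariances γ_0, ..., γ_p it returns the PACF, i.e. (in the correspondence just described) the Verblunsky coefficients. This is a minimal illustrative implementation (the function name and the AR(1) example are ours):

```python
def pacf_durbin_levinson(gamma):
    """Durbin-Levinson recursion: compute the partial autocorrelations
    alpha_1, ..., alpha_p from autocovariances gamma[0..p]. In the OPUC
    correspondence these are the Verblunsky coefficients of the process."""
    p = len(gamma) - 1
    phi = [gamma[1] / gamma[0]]           # phi_{1,1}
    v = gamma[0] * (1 - phi[0] ** 2)      # one-step innovation variance
    alphas = [phi[0]]
    for n in range(2, p + 1):
        # phi_{n,n} = (gamma_n - sum_j phi_{n-1,j} gamma_{n-j}) / v_{n-1}
        a = (gamma[n] - sum(phi[j] * gamma[n - 1 - j] for j in range(n - 1))) / v
        # phi_{n,j} = phi_{n-1,j} - a * phi_{n-1,n-j}
        phi = [phi[j] - a * phi[n - 2 - j] for j in range(n - 1)] + [a]
        v *= 1 - a ** 2
        alphas.append(a)
    return alphas

# AR(1) sanity check: gamma_k = rho^k gives alpha_1 = rho and alpha_k = 0 for k >= 2.
rho = 0.6
print(pacf_durbin_levinson([rho ** k for k in range(6)]))
```

For the AR(1) autocovariances the recursion returns 0.6 followed by zeros, matching the Li-Xie minimal-information characterization of AR(p) models quoted in §4.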