Generalized Rescaled Polya urn and its statistical applications

We introduce the Generalized Rescaled Pólya (GRP) urn, which provides a generative model for a chi-squared test of goodness of fit for the long-term probabilities of clustered data, with independence between clusters and correlation, due to a reinforcement mechanism, inside each cluster. We apply the proposed test to a data set of Twitter posts about the COVID-19 pandemic: in a few words, for a classical chi-squared test the data are strongly significant for the rejection of the null hypothesis (that the daily long-run sentiment rate remains constant), but, taking into account the correlation among data, the introduced test leads to a different conclusion. Besides the statistical application, we point out that the GRP urn is a simple variant of the standard Eggenberger-Pólya urn that, with suitable choices of the parameters, shows "local" reinforcement, almost sure convergence of the empirical mean to a deterministic limit and different asymptotic behaviours of the predictive mean. Moreover, the study of this model provides the opportunity to analyze stochastic approximation dynamics that are unusual in the related literature.


Introduction
The standard Eggenberger-Pólya urn (see EggPol23, mah) has been widely studied and generalized (for instance, some recent variants can be found in [7,8,11,15,18,33,34,35]). In its simplest form, this model with k colors works as follows. An urn contains N^0_i balls of color i, for i = 1, ..., k, and, at each discrete time, a ball is extracted from the urn and then returned to the urn together with α > 0 additional balls of the same color. Therefore, if we denote by N^n_i the number of balls of color i in the urn at time n, we have N^n_i = N^{n-1}_i + α ξ^n_i for n ≥ 1, where ξ^n_i = 1 if the ball extracted at time n is of color i, and ξ^n_i = 0 otherwise. The parameter α regulates the reinforcement mechanism: the greater α, the greater the dependence of N^n_i on Σ_{h=1}^n ξ^h_i. The Generalized Rescaled Pólya (GRP) urn model is characterized by the introduction of a sequence (β_n)_n of parameters, together with the replacement of the parameter α of the original model by a sequence (α_n)_n, so that N^n_i = b^0_i + B^n_i with B^{n+1}_i = β_n B^n_i + α_{n+1} ξ^{n+1}_i. Therefore, the urn initially contains b^0_i + B^0_i balls of color i and the parameters β_n ≥ 0, together with α_n > 0, regulate the reinforcement mechanism. More precisely, the term β_n B^n_i links N^{n+1}_i to the "configuration" at time n through the "scaling" parameter β_n, and the term α_{n+1} ξ^{n+1}_i links N^{n+1}_i to the outcome of the extraction at time n + 1 through the parameter α_{n+1}.
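The update rule above is straightforward to simulate. The following sketch (ours, not from the paper; function names and the sampling scheme are illustrative) draws from a GRP urn with user-supplied sequences (α_n)_n and (β_n)_n:

```python
import random

def simulate_grp_urn(b0, B0, alphas, betas, steps, seed=0):
    """Simulate `steps` extractions from a GRP urn.

    b0, B0 : initial counts b^0_i and B^0_i per color
    alphas : alphas[n] plays the role of alpha_{n+1} at step n
    betas  : betas[n] plays the role of beta_n at step n
    Returns the list of extracted colors.
    """
    rng = random.Random(seed)
    B = [float(x) for x in B0]
    draws = []
    for n in range(steps):
        weights = [b + Bi for b, Bi in zip(b0, B)]  # N^n = b^0 + B^n
        total = sum(weights)
        # draw a color with probability proportional to its ball count
        u = rng.random() * total
        acc, color = 0.0, len(weights) - 1
        for i, w in enumerate(weights):
            acc += w
            if u <= acc:
                color = i
                break
        # B^{n+1}_i = beta_n * B^n_i + alpha_{n+1} * xi^{n+1}_i
        B = [betas[n] * Bi for Bi in B]
        B[color] += alphas[n]
        draws.append(color)
    return draws
```

With betas identically 1 and alphas constant, this reduces to the standard Eggenberger-Pólya urn; betas identically 0 gives the extreme case in which only the last extraction matters.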
We are going to show that, with a suitable choice of the model parameters, we have long-term almost sure convergence of the empirical mean Σ_{n=1}^N ξ^n_i/N to the deterministic limit p^0_i = b^0_i/Σ_{j=1}^k b^0_j, and a chi-squared goodness of fit result for the long-term probabilities {p^0_1, ..., p^0_k}. In particular, regarding the last point, we have that the chi-squared statistic

Q = N Σ_{i=1}^k (p̂_i − p^0_i)^2 / p^0_i,    (1.1)

where N is the size of the sample and p̂_i = O_i/N, with O_i = Σ_{n=1}^N ξ^n_i the number of observations equal to i in the sample, is asymptotically distributed as χ^2(k−1)λ, with λ > 1, or as χ^2(k−1)N^{1−2e}λ, where λ > 0 may be smaller than 1, but e is always strictly smaller than 1/2. In both cases, the presence of correlation among units mitigates the effect in (1.1) of the sample size N, which multiplies the chi-squared distance between the observed frequencies and the expected probabilities. This aspect is important for statistical applications in the context of a "big sample", when a small value of the chi-squared distance might be significant, and hence a correction related to the correlation between observations is desirable (see, for instance, [9,12,14,26,27,31,38,42,43,45]). More precisely, in the first case, the observed value of the chi-squared distance has to be compared with the "critical" value χ^2_{1−θ}(k−1)λ/N, where χ^2_{1−θ}(k−1) denotes the quantile of order 1 − θ of the chi-squared distribution χ^2(k−1). In the second case, the critical value for the chi-squared distance becomes χ^2_{1−θ}(k−1)λ/N^{2e}, where, although the constant λ may be smaller than 1, the effect of the sample size N is mitigated by the exponent 2e < 1. In other words, in this second case, the Fisher information given by the sample does not scale with the sample size N, but with rate N^{2e}. Hence, because of the long-term correlation, collecting more and more data does not provide a linear increment of the information.
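To make the effect of the correction concrete, here is a small sketch (ours, not from the paper; the Wilson-Hilferty approximation to the chi-squared quantile is used only to keep the snippet dependency-free) comparing the classical critical value for the chi-squared distance, χ^2_{1−θ}(k−1)/N, with the corrected one χ^2_{1−θ}(k−1)λ/N^{2e}:

```python
from statistics import NormalDist

def chi2_quantile(df, p):
    """Wilson-Hilferty approximation to the chi-squared quantile of order p."""
    z = NormalDist().inv_cdf(p)
    return df * (1.0 - 2.0 / (9.0 * df) + z * (2.0 / (9.0 * df)) ** 0.5) ** 3

def critical_distance(k, N, theta, lam=1.0, e=0.5):
    """Critical value for the chi-squared distance sum_i (p_hat_i - p0_i)^2 / p0_i.

    lam = 1, e = 1/2 gives the classical i.i.d. scaling chi2_{1-theta}(k-1)/N;
    under correlated (GRP-type) data the critical value is
    chi2_{1-theta}(k-1) * lam / N**(2*e) with 2*e <= 1.
    """
    return chi2_quantile(k - 1, 1.0 - theta) * lam / N ** (2.0 * e)
```

For a sample of size 10^6 the corrected critical distance (with 2e < 1) is orders of magnitude larger than the classical one, so small chi-squared distances that a classical test flags as significant may be compatible with the null.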
Summing up, the GRP urn provides a theoretical framework for a chi-squared test of goodness of fit for the long-term probabilities of correlated data, generated according to a reinforcement mechanism. Specifically, we describe a possible application in the context of clustered data, with independence between clusters and correlation, due to a reinforcement mechanism, inside each cluster. In particular, we develop a suitable estimation technique for the fundamental model parameters. We then apply the proposed test to a data set of Twitter posts about the COVID-19 pandemic. Given the null hypothesis that the daily long-run sentiment rate of the posts is the same for all the considered days (suitably spaced days in the period February 20th - April 20th, 2020), a classical χ^2 test finds the data strongly significant for the rejection of the null hypothesis, while, taking into account the correlation among posts sent in the same day, the proposed test leads to a different conclusion.
The rest of the paper is structured as follows. In Section 2 we set up the notation and define the GRP urn. In Section 3 we illustrate its relationships with previous models and discuss the connections with the related literature. In particular, the object of the present work gives us the opportunity to study Stochastic Approximation (SA) dynamics that are infrequent in the SA literature, thus filling in some theoretical gaps. In Section 4 we provide the main result of this work, that is, the almost sure convergence of the empirical means to the deterministic limits p^0_i and the goodness of fit result for the long-term probabilities p^0_i, together with comments and examples. In Section 5 we describe a possible statistical application of the GRP urn and the related results: a chi-squared test of goodness of fit for the long-term probabilities of clustered data, with independence between clusters and correlation, due to a reinforcement mechanism, inside each cluster. We apply the proposed test to a data set of Twitter posts about the COVID-19 pandemic. In Section 7 we state two convergence results for the empirical means, which are the basis for the proof of the main theorem. All the theoretical results shown are analytically proven. The proofs are deferred to Section S1 in the Supplementary Material [2], except for the proof of Theorem 7.2, which is methodologically new and emphasizes new techniques of martingale limit theory, and so is illustrated in Section 8. Finally, in the Supplementary Material we also provide some complements, some technical lemmas and some background on stochastic approximation theory and on stable convergence. When necessary, the references to the Supplementary Material are preceded by an "S", so that (S1.2) refers to equation (S1.2) in [2].
The urn initially contains b^0_i + B^0_i > 0 balls of color i, with i = 1, ..., k. We set b^0 = (b^0_1, ..., b^0_k)^T and B^0 = (B^0_1, ..., B^0_k)^T. We assume |b^0| > 0 and we set p^0 = b^0/|b^0|. At each discrete time n + 1 ≥ 1, a ball is drawn at random from the urn, obtaining the random vector ξ^{n+1} = (ξ^{n+1}_1, ..., ξ^{n+1}_k)^T defined as ξ^{n+1}_i = 1 when the ball extracted at time n + 1 is of color i, and ξ^{n+1}_i = 0 otherwise, and the number of balls in the urn is updated as B^{n+1}_i = β_n B^n_i + α_{n+1} ξ^{n+1}_i, which gives (since |ξ^{n+1}| = 1) |B^{n+1}| = β_n |B^n| + α_{n+1}. Therefore, setting r*_n = |N^n| = |b^0| + |B^n|, we get r*_{n+1} = |b^0| + β_n |B^n| + α_{n+1}. Moreover, setting F_0 equal to the trivial σ-field and F_n = σ(ξ^1, ..., ξ^n) for n ≥ 1, the conditional probabilities ψ_n = (ψ_{n,1}, ..., ψ_{n,k})^T of the extraction process, also called predictive means, are ψ_n = E[ξ^{n+1} | F_n] = N^n/r*_n = (b^0 + B^n)/r*_n. It is obvious that |ψ_n| = 1. Moreover, when β_n > 0 for all n, the probability ψ_{n,i} is increasing with the number of times we observed the value i, that is, the random variables ξ^n_i are generated according to a reinforcement mechanism: the probability that the extraction of color i occurs has an increasing dependence on the number of extractions of color i occurred in the past (see, e.g., [41]). More precisely, we have B^n_i = (Π_{j=0}^{n-1} β_j) B^0_i + Σ_{h=1}^n α_h (Π_{j=h}^{n-1} β_j) ξ^h_i. The dependence of ψ_n on ξ^h is thus governed by the factor f(h, n) = α_h Π_{j=h}^{n-1} β_j, with 1 ≤ h ≤ n, n ≥ 0. In the case of the standard Eggenberger-Pólya urn, which corresponds to α_n = α > 0 and β_n = 1 for all n, each observation ξ^h has the same "weight" f(h, n) = α. Instead, if the factor f(h, n) increases with h, then the main contribution is given by the most recent extractions. We refer to this phenomenon as "local" reinforcement. For instance, this is the case when (α_n) is increasing and β_n = 1 for all n. Another case is when α_n = α > 0 and β_n < 1 for all n. The case β_n = 0 for all n is an extreme case, in which ψ_n depends only on the last extraction ξ^n (recall that conventionally Π_{j=n}^{n-1} β_j = 1).
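The weight f(h, n) = α_h Π_{j=h}^{n−1} β_j can be computed directly. The following snippet (ours, for illustration only) shows that constant α with β = 1 gives flat weights, while β_n < 1 makes the most recent observations dominate, i.e. "local" reinforcement:

```python
def weight(h, n, alpha, beta):
    """f(h, n) = alpha(h) * prod_{j=h}^{n-1} beta(j): the weight of the
    h-th extraction in the predictive mean at time n (1 <= h <= n)."""
    prod = 1.0
    for j in range(h, n):
        prod *= beta(j)
    return alpha(h) * prod
```

For example, weight(h, 10, lambda h: 1.0, lambda j: 0.9) equals 0.9 ** (10 - h), which is increasing in h.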
For the examples considered next, we will show that they exhibit local reinforcement in a broader sense: the "weight" of the observations is eventually increasing with time.
Therefore, the asymptotic behaviour of (θ_n) depends on the two sequences (ε_n)_n and (δ_n)_n. Finally, we observe that, setting ξ̄_N = Σ_{n=1}^N ξ^n/N and μ_n = ξ̄_n − p^0, we have an equality that links the asymptotic behaviour of (μ_n) to that of (θ_n). Different kinds of sequences (ε_n)_n and (δ_n)_n provide different kinds of asymptotic behaviour of θ_n, i.e. of the empirical mean ξ̄_N. In Section 4, we provide two cases in which we have long-term almost sure convergence of the empirical mean O_i/N = Σ_{n=1}^N ξ^n_i/N toward the constant p^0_i = b^0_i/|b^0|, together with a chi-squared goodness of fit result. In particular, the quantities p^0_1, ..., p^0_k can be seen as a long-run probability distribution on the possible values (colors) {1, ..., k}.

Related literature
The particular case of the GRP urn model with β_n = β = 0 for all n corresponds to a version of the so-called "memory-1 senile reinforced random walk" on a star-shaped graph introduced in [29]. The case α_n = α > 0 and β_n = β = 1 for all n corresponds to the standard Eggenberger-Pólya urn with an initial number N^0_i = b^0_i + B^0_i of balls of color i. When (α_n) is a non-constant sequence, while β_n = β = 1 for all n, the GRP urn coincides with the variant of the Eggenberger-Pólya urn introduced in [40] (see also [41, Sec. 3.2]). Instead, when β ≠ 1, the GRP urn does not fall within any of the variants of the Eggenberger-Pólya urn discussed in [41, Sec. 3.2].
The case when α_n = α > 0 and β_n = β ≥ 0 for all n corresponds to the Rescaled Pólya (RP) urn introduced and studied in [1] and applied in [6]. It is worthwhile to point out that the two cases studied in the present work do not include (and are not included in) the case studied in [1]. Moreover, the techniques employed here and in [1] are completely different: when β_n = β ∈ [0, 1) as in [1], the jumps Δψ_n do not vanish and the process ψ = (ψ_n)_n converges to a stationary Markov chain, and so the appropriate Markov ergodic theory is employed; in this work, we have |Δψ_n| = o(1), so that martingale limit theory is here exploited to achieve the asymptotic results. Obviously, the two techniques are not exchangeable or adaptable from one context to the other.
When (β_n) is not identically equal to 1, because of the first term on the right-hand side of the above relation, the GRP urn does not belong to the class of Reinforced Stochastic Processes (RSPs) studied in [3,5,4,20,21,23]. Indeed, the RSPs are characterized by a "strict" reinforcement mechanism such that ξ^n_i = 1 implies ψ_{n,i} > ψ_{n−1,i} and so, as a consequence, ψ_{n,i} has an increasing dependence on the number of times we have ξ^h_i = 1 for h = 1, ..., n. When (β_n) is not identically equal to 1, the GRP urn does not satisfy the "strict" reinforcement mechanism, because the first term is positive or negative according to the sign of (1 − β_n) and of (ψ_n − p^0). Furthermore, we observe that equation (2.6) recalls the dynamics of a RSP with a "forcing input" (see [3,20,44]), but the main difference lies in the fact that such a process is driven by a classical stochastic approximation dynamics, that is, a dynamics of the kind (2.7) with ε_n = δ_n (up to a constant), with Σ_n ε_n = +∞ and Σ_n ε_n^2 < +∞, while the GRP urn model also allows ε_n and δ_n with different rates and also
• Σ_n ε_n = +∞ and Σ_n δ_n^2 = +∞ or
• Σ_n ε_n < +∞.
Since (2.7) is the fundamental equation of Stochastic Approximation (SA) theory, we deem it appropriate to say a few more words on the relationship of the present work with the SA literature. The case when δ_n = c ε_n in (2.7) is essentially covered by SA theory (see Section S5, where we refer to [25,32,37,39,48]). The best known case is when Σ_n ε_n = +∞ and Σ_n ε_n^2 < +∞. The case ε_n → 0, Σ_n ε_n = +∞ and Σ_n ε_n^2 = +∞ is less usual in the literature, but it is well characterized in [32]. The case when (ε_n)_n and (δ_n)_n in (2.7) go to zero with different rates is typically neglected in the SA literature. To the best of our knowledge, it is taken into consideration only in [39], where the weak convergence rate of the sequence (ψ_N) toward a certain point ψ* is established under suitable assumptions, given the event {ψ_N → ψ*}.
No result is given there for the empirical mean ξ̄_N, which instead is the focus of the present paper (see Theorem 4.1 below, whose proof is based on Theorem 7.2). More precisely, the assumptions on ε_n and δ_n in the following Theorem 7.2 imply assumption (A1.3) in [39], and so Theorem 1 in that paper provides the weak convergence rate of the sequence (ψ_N − ψ*) given the event {ψ_N → ψ*}. However, this result is not useful for our scope for two reasons: first, we need convergence results for the empirical mean ξ̄_N, not for the predictive mean ψ_N; second, in one case included in Theorem 7.2 (see Section 7 for more details), it does not seem immediate to check the convergence of the predictive means, and so we develop another technique that does not require this convergence (see Section 8). Hence, the contribution of Theorem 7.2 to the SA literature is that, for a dynamics of the type (2.7) with (ε_n)_n and (δ_n)_n going to zero with different rates, it provides the asymptotic behaviour of the empirical mean ξ̄_N, covering a case when Σ_n ε_n = +∞ and Σ_n δ_n^2 = +∞ and without requiring the convergence of the predictive means ψ_N. Finally, it is worthwhile to point out that we also analyze the case when Σ_n ε_n < +∞, which is also excluded in the SA literature and so could be relevant in that field. Specifically, we prove almost sure convergence of the predictive means and of the empirical means toward a random variable and we give a central limit theorem in the sense of stable convergence. However, although interesting from a theoretical point of view, we collect these results in Section S2, because they are not related to the chi-squared test of goodness of fit.
The statistical application of the GRP urn presented in the following was inspired by [1,12,36]. However, those papers only deal with the case when the statistic (1.1) is asymptotically distributed as χ^2(k−1)λ, with λ > 1, while we also face the case when the statistic (1.1) is asymptotically distributed as χ^2(k−1)N^{1−2e}λ, illustrating a suitable estimation procedure for the fundamental parameters η = 1 − 2e and λ. To the best of our knowledge, this is the first work presenting a model that provides a theoretical framework for such a chi-squared test of goodness of fit.
In the next two examples we show that it is possible to construct suitable sequences (α_n)_n and (β_n)_n of the model such that the corresponding sequences (ε_n)_n and (δ_n)_n converge to zero with the same rate or with different rates and satisfy assumption a) or b) of the above theorem, respectively.

Statistical applications
In a big sample the units typically cannot be assumed independent and identically distributed; rather, they exhibit a structure in clusters, with independence between clusters and correlation inside each cluster [12,17,30,36,46,47]. The model and the related results presented in [1] and in the present paper may be useful in the situation when, inside each cluster, the probability that a certain unit chooses the value i is affected by the number of units in the same cluster that have already chosen the value i, hence according to a reinforcement rule. Formally, given a "big" sample {ξ^n : n = 1, ..., N}, we suppose that the N units are ordered so that we have L clusters of units C_1, ..., C_L, each formed by consecutive units. Therefore, the cardinality of each cluster C_ℓ is N_ℓ. We assume that the units in different clusters are independent, that is, the L blocks [ξ^n : n ∈ C_ℓ], with ℓ = 1, ..., L, are L independent multidimensional random variables. Moreover, we assume that the observations inside each cluster can be modelled as a GRP satisfying case a) or case b) of Theorem 4.1. Given certain (strictly positive) intrinsic probabilities p*^0_1(ℓ), ..., p*^0_k(ℓ) for each cluster C_ℓ, we firstly want to estimate the model parameters and then perform a test with null hypothesis based on the statistic Q_ℓ and its corresponding asymptotic distribution Γ((k−1)/2, 1/(2λ)), where λ is given in (4.1). Note that we can perform the above test for a certain cluster ℓ, or we can consider all the clusters together using the aggregate statistic Σ_{ℓ=1}^L Q_ℓ and its corresponding distribution Γ(L(k−1)/2, 1/(2λ)). Regarding the probabilities p*^0_i(ℓ), some possibilities are:
• we can take p*^0_i(ℓ) = 1/k for all i = 1, ..., k if we want to test possible differences in the probabilities for the k different values;
• we can suppose to have two different periods of time, and so two samples, say {ξ^{(1)}_n : n = 1, ..., N^{(1)}} and {ξ^{(2)}_n : n = 1, ..., N^{(2)}}, set p*^0_i(ℓ) = Σ_n ξ^{(1)}_{n,i}/N^{(1)} for all i = 1, ..., k, and perform the test on the second sample in order to check possible changes in the intrinsic long-run probabilities;
• we can take one of the clusters as a benchmark, say ℓ*, set p*^0_i(ℓ) = Σ_{n∈C_{ℓ*}} ξ^n_i/N_{ℓ*} for all i = 1, ..., k and ℓ ≠ ℓ*, and perform the test for the other L − 1 clusters in order to check differences with the benchmark cluster ℓ*.
Finally, if we want to test possible differences among the clusters, then we can take p*^0_i(ℓ) = p*^0_i = Σ_{n=1}^N ξ^n_i/N for all ℓ = 1, ..., L and perform the test using the aggregate statistic Σ_{ℓ=1}^L Q_ℓ with asymptotic distribution Γ((L−1)(k−1)/2, 1/(2λ)).
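Operationally, the aggregate test only needs the Gamma(L(k−1)/2, 1/(2λ)) distribution, i.e. λ times a chi-squared with L(k−1) degrees of freedom. A minimal sketch (ours; the regularized incomplete gamma function is implemented via its power series only to avoid external dependencies):

```python
import math

def reg_gamma_p(a, x, iters=500):
    """Regularized lower incomplete gamma function P(a, x), via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, iters):
        term *= x / (a + n)
        total += term
        if term < total * 1e-15:
            break
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def aggregate_pvalue(q_values, k, lam):
    """p-value of sum_l Q_l under Gamma(L*(k-1)/2, rate 1/(2*lam))."""
    L = len(q_values)
    shape = L * (k - 1) / 2.0
    x = sum(q_values) / (2.0 * lam)  # CDF of the Gamma evaluated at the statistic
    return 1.0 - reg_gamma_p(shape, x)
```

With lam = 1 and L = 1 this reduces to the classical chi-squared p-value with k − 1 degrees of freedom; lam > 1 inflates the critical region, as discussed in the Introduction.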

Estimation of the parameters
The model parameters are ε, δ and c. However, as we have seen, the fundamental quantities are η = 2(ε − δ) and λ given in (4.1). Moreover, recall that in case a) we have η = 0 and λ > 1 and, in case b), we have η ∈ (0, 1) and λ > 0. Therefore, according to the considered model, the pair (η, λ) belongs to S = ({0} × (1, +∞)) ∪ ((0, 1) × (0, +∞)). In order to estimate the pair (η, λ) ∈ S, we consider the statistics T_1, ..., T_L. Given the observed values t_1, ..., t_L, the log-likelihood function of the sample reads ln(L(η, λ)) up to a remainder term R_1 that does not depend on (η, λ). Now, we look for the maximum likelihood estimator of the two parameters (η, λ). We immediately observe that, when all the clusters have the same cardinality, that is, when all the N_ℓ are equal to a certain N_0, we cannot hope to estimate η and λ separately. Indeed, in that case the log-likelihood function depends on (η, λ) only through the product λN_0^η, and so it is possible to estimate only this product, as λN_0^η = Σ_{ℓ=1}^L t_ℓ/((k−1)L). From now on, we assume that at least two clusters have different cardinalities, that is, at least two of the N_ℓ differ. We have to find (if they exist!) the maximum points of the function (η, λ) → ln(L(η, λ)) on the set S, which is neither closed nor bounded. First of all, we note that ln(L(η, λ)) → −∞ for λ → +∞ and for λ → 0. Thus, the log-likelihood function attains its maximum value on the closure S̄ of S, and its maximum points are either stationary points belonging to (0, 1) × (0, +∞) or points of {0, 1} × (0, +∞). For detecting the points of the first type, we compute the gradient of the log-likelihood function and solve the corresponding system. In particular, we get that the stationary points are of the form (η, λ(η)), with λ(η) given in (5.2). In order to find the maximum points on the border, that is, belonging to {0, 1} × (0, +∞), we observe that, for any fixed η, the function λ → ln(L(η, λ)), up to a remainder term R_2 not depending on λ, attains its maximum value at the point λ(η) defined in (5.2).
Summing up, the problem of detecting the maximum points of the log-likelihood function on S̄ reduces to the study of the maximum points on [0, 1] of the function η → ln(L(η, λ(η))), up to a remainder term R_3 not depending on η. To this purpose, denoting by E_x[·] and E_u[·] the mean values with respect to the discrete probability distribution {p(x, ℓ) : ℓ = 1, ..., L} on {N_1, ..., N_L} and with respect to the uniform discrete distribution on {N_1, ..., N_L}, respectively, the derivative of the above function can be expressed through a function g. Moreover, the derivative of g involves Var_x[ln(N_ℓ)], where Var_x[·] denotes the variance with respect to the discrete probability distribution {p(x, ℓ) : ℓ = 1, ..., L} on {N_1, ..., N_L}. Since we are assuming that at least two N_ℓ are different, we have Var_x[ln(N_ℓ)] > 0 and so the function g is strictly decreasing. Finally, we observe that the sign of g at the endpoints can be expressed through Cov_u(·, ·), the covariance with respect to the discrete joint distribution concentrated on the diagonal and such that P{N = N_ℓ, T = t_ℓ} = 1/L with ℓ = 1, ..., L. Hence, we distinguish the following cases.
First case: Cov_u(ln(N), T) ≤ 0. We are in the case when g(0) ≤ 0 and so the function (5.3) is strictly decreasing for η > 0. Thus, its maximum value on [0, 1] is attained at η̂ = 0. Consequently, we have λ̂ = λ(0) = Σ_{ℓ=1}^L t_ℓ/(L(k−1)). Recall that we need (0, λ) ∈ S and so λ > 1. If the model fits the data well, this is a consequence. Indeed, λ̂ is an unbiased estimator, since Σ_{ℓ=1}^L T_ℓ is distributed as Γ(L(k−1)/2, 1/(2λ)), and so E[λ̂] = λ > 1. A value λ̂ ≤ 1 indicates a bad fit of the considered model to the data (the smaller the value of λ̂, the worse the fit). Note that in the threshold case (η̂ = 0, λ̂ = 1), the corresponding test statistic (5.1) and its distribution coincide with the classical ones used for independent observations.

Second case: Cov_u(ln(N), T) > 0 and Cov_u(ln(N), T N) < 0. We are in the case when g(0) > 0 and g(1) < 0. Hence, the function (5.3) has a unique stationary point η̂ ∈ (0, 1), which is its maximum point, and, accordingly, λ̂ = λ(η̂).

Third case: Cov_u(ln(N), T N) ≥ 0. We are in the case when g(1) ≥ 0 and so the function (5.3) is strictly increasing on [0, 1]. Hence, its maximum point is at η̂ = 1 and, accordingly, λ̂ = λ(1). However, the point (1, λ̂) does not belong to S, and so, in this case, we conclude that we have a bad fit of the model to the data. Note that, if the considered model fits the data well, then this last case should not occur.
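The profile structure above suggests a simple numerical implementation. The sketch below is ours, not the paper's procedure: it assumes, consistently with the expression of λ(0) above, that the profile maximizer for fixed η is λ(η) = Σ_ℓ t_ℓ N_ℓ^{−η}/(L(k−1)), and that Q_ℓ is modelled as Gamma((k−1)/2, rate 1/(2λN_ℓ^η)); it then maximizes the profile log-likelihood over a grid on [0, 1]:

```python
import math

def lambda_hat(eta, t, N, k):
    """Profile MLE of lambda for fixed eta (assumed form, matching lambda(0))."""
    L = len(t)
    return sum(ti * Ni ** (-eta) for ti, Ni in zip(t, N)) / (L * (k - 1))

def profile_loglik(eta, t, N, k):
    """Log-likelihood (up to constants) with Q_l ~ Gamma((k-1)/2, rate 1/(2 lam N_l^eta))."""
    lam = lambda_hat(eta, t, N, k)
    ll = 0.0
    for ti, Ni in zip(t, N):
        scale = 2.0 * lam * Ni ** eta
        ll += -(k - 1) / 2.0 * math.log(scale) - ti / scale
    return ll

def estimate_eta_lambda(t, N, k, grid=1000):
    """Grid search for the maximizer of the profile log-likelihood on [0, 1]."""
    etas = [i / grid for i in range(grid + 1)]
    eta = max(etas, key=lambda e: profile_loglik(e, t, N, k))
    return eta, lambda_hat(eta, t, N, k)
```

An estimate η̂ hitting the boundary η = 1 (third case) or λ̂ ≤ 1 with η̂ = 0 (first case) signals a bad fit, as discussed above.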

COVID-19 epidemic Twitter analysis
We illustrate the application of the above statistical methodology to a data set containing posts from the on-line social network Twitter about the COVID-19 epidemic. More precisely, the data set covers the period from February 20th (h. 11pm) to April 20th (h. 10pm), 2020, including tweets in the Italian language. More details on the keywords used for the query can be found in [13]. For every message, the associated sentiment has been calculated using the polyglot python module developed in [16]. This module provides a numerical value v for the sentiment; we have fixed a threshold T = 0.35, classified as positive the tweets with v > T and as negative those with v < −T, and discarded the tweets with a value v ∈ [−T, T].
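The thresholding step can be sketched as follows (ours; the function name and input format are illustrative, and the real pipeline works on the polyglot polarity scores):

```python
def classify_sentiment(scores, T=0.35):
    """Map polarity scores to binary sentiments: v > T -> 1 (positive),
    v < -T -> 0 (negative); scores with |v| <= T are discarded."""
    out = []
    for v in scores:
        if v > T:
            out.append(1)
        elif v < -T:
            out.append(0)
    return out
```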
We are in the case k = 2, and the random variables ξ^n = ξ^n_1 take the value 1 when the sentiment of post n is positive and the value 0 when it is negative. We have partitioned the data so that each set P_d collects the messages of the single day d, for d = 1 (February 20th), ..., 61 (April 20th), and then, in order to obtain independent clusters, we have set C_ℓ = P_{1+3(ℓ−1)}, for ℓ = 1, ..., 21 = L. (We have tested the independence of the timed sequence {Q_ℓ : ℓ = 1, ..., 21} with a Ljung-Box test and we give the results in Table 2.) Therefore N_ℓ is the total number of tweets posted during day 1 + 3(ℓ−1) and N = Σ_{ℓ=1}^L N_ℓ = 699450 is the sample size. It is plausible that inside each cluster the sentiment associated to each message is driven by a reinforcement mechanism, which can be modelled by means of a GRP: the probability of a tweet with positive sentiment is increasing with the number of past tweets with positive sentiment, and the reinforcement is mostly driven by the most recent tweets, in the sense explained in Section 2. Note that the main effect of the GRP urn model is the presence of "local fashions", resulting in unexpected excursions of ψ_n around the long-run probabilities p_0. In order to point out that the considered data set exhibits this characteristic, for each ℓ, we have computed the daily sentiment rate p̂_0(ℓ); then, according to this probability, we have generated an independent sequence of Bernoulli variables; finally, we have used the same smoothing procedure (i.e., the classical cubic spline given in the R package) to get an estimate of ψ_n = ψ_{n,1} for both the real and the simulated independent data. In Fig. 1 the daily curves clearly show different behaviours in the two cases, highlighting a local reinforcement among tweets. Our purpose is to test the null hypothesis H_0: p_0(ℓ) = p_0 for any ℓ.
Therefore, taking p*^0_1(ℓ) = p*^0 = Σ_{n=1}^N ξ^n/N for each ℓ, we have firstly estimated the model parameters and then performed the chi-squared test based on the aggregate statistic Σ_{ℓ=1}^L Q_ℓ and its corresponding asymptotic distribution Γ((L−1)(k−1)/2, 1/(2λ̂)). The estimated values are η̂ = 0.4363572 and λ̂ = 2.728098 (in Fig. 2 we plot the function (5.3)).
The contingency table and the associated statistics for testing H_0 are given in Table 1. The χ^2-statistic obtained with a usual χ^2-test is 5507.803, which is significant at any level of confidence. Under the proposed GRP model and the null hypothesis, the aggregate statistic Σ_{ℓ=1}^L Q_ℓ has (asymptotic) distribution Γ((L−1)/2, 1/(2λ̂)) and the corresponding p-value associated to the data is equal to 0.4579297. The null hypothesis that the daily long-run sentiment rate of the posts is the same for all the considered days is therefore strongly rejected by a classical χ^2 test, while the same hypothesis is accepted if we take into account the reinforcement mechanism of correlation given by the GRP model. In Fig. 3 we show the values of the single statistics Q_ℓ compared to the quantile of order 0.95 of the distribution Γ(1/2, 1/(2λ̂)).

Asymptotic results for the empirical means

In the sequel, we will use the symbol s−→ in order to denote stable convergence (for a brief review on stable convergence, see Section S6).
For the case when (ε_n)_n and (δ_n)_n in (2.7) go to zero with different rates, we prove the following theorem (the proof is illustrated in Section 8).

Theorem 7.2. Take ε_n = (n+1)^{−ε} and δ_n ∼ c(n+1)^{−δ}, with ε ∈ (0, 1), δ ∈ (ε/2, ε) and c > 0. Then ξ̄_N −→ p^0 almost surely, together with the corresponding central limit result.

In the framework of the above theorem, we can distinguish the following two cases: 1) ε ∈ (1/2, 1) and δ ∈ (1/2, ε), or 2) ε ∈ (0, 1) and δ ∈ (ε/2, min{ε, 1/2}] \ {ε}. In case 1), we have Σ_n ε_n = +∞, Σ_n ε_n^2 < +∞ and Σ_n δ_n^2 < +∞, and so the typical asymptotic behaviour of the predictive mean of an urn process, that is, its almost sure convergence. In case 2), we have Σ_n ε_n = +∞ and Σ_n δ_n^2 = +∞ (while the series Σ_n ε_n^2 may be convergent or divergent) and it does not seem immediate to check the convergence of the predictive means. Therefore, for the proof of Theorem 7.2 in this last case, we employ a different technique, based on the L^2-estimate of Lemma 8.1 for the predictive mean ψ_N and on the almost sure convergence of the corresponding empirical mean ψ̄_{N−1}.

Proof of Theorem 7.2
Throughout the sequel, we set ψ̄_{N−1} = Σ_{n=1}^N ψ_{n−1}/N and θ̄_{N−1} = Σ_{n=1}^N θ_{n−1}/N. Before the proof of Theorem 7.2, we premise some intermediate results.

Proof. We observe that, starting from (2.7), we obtain an expression for θ_n, and therefore we can rewrite θ̄_{N−1}, where the second equality is due to the Abel transformation for a series. The decomposition (8.2) follows. Since (ε_n) is decreasing, the last term in the above expression is O(N^{−1} ε_{N−1}^{−1}). Therefore, since ε < 1 by assumption, we have |R_N| = O(N^{−(1−ε)}) → 0. Regarding the last statement of the lemma, we observe that, from what we have proven before, we obtain N^e E[|R_N|] = O(N^{e−(1−ε)}) = O(N^{δ−1/2}) → 0 when δ < 1/2. However, in the considered cases 1) and 2), we might have δ ≥ 1/2. Therefore, we need other arguments in order to prove the last statement. To this purpose, we observe that, by Lemma 8.1, we have E[|θ_n|] = O(n^{ε/2−δ}) and so, by (8.3), we obtain the stated convergence to p^0. In particular, when ε ∈ (1/2, 1) and δ ∈ (1/2, ε), we have θ_N → 0 a.s.
In this document we collect some proofs, complements, technical results and background material useful for [S2]. Therefore, the notation and the assumptions used here are the same as those used in that paper.

S1 Proofs and intermediate results
We here collect some proofs omitted in the main text of the paper [S2].

S1.1 Proof of Theorem 4.1
The proof is based on Proposition 7.1 (for case a)) and Theorem 7.2 (for case b)). The almost sure convergence of O_i/N immediately follows, since O_i/N = ξ̄_{N,i}. In order to prove the stated convergence in distribution, we mimic the classical proof of the Pearson chi-squared test based on the Sherman-Morrison formula (see [S18]; see also [S16, Corollary 2]).
We start by recalling the Sherman-Morrison formula: if A is an invertible square matrix and 1 − v^T A^{−1} u ≠ 0, then

(A − u v^T)^{−1} = A^{−1} + (A^{−1} u v^T A^{−1}) / (1 − v^T A^{−1} u).

Given the observation ξ_n = (ξ_{n,1}, ..., ξ_{n,k})^T, we define the "truncated" vector ξ*_n = (ξ*_{n,1}, ..., ξ*_{n,k−1})^T, given by the first k − 1 components of ξ_n. Proposition 7.1 (for case a)) and Theorem 7.2 (for case b)) give the second-order asymptotic behaviour of (ξ̄_n), which immediately implies the corresponding result for the truncated vectors, where p* is given by the first k − 1 components of p^0 and Γ* = λ(diag(p*) − p* p*^T). By assumption, p^0_i > 0 for all i = 1, ..., k, and so diag(p*) is invertible. Therefore we can use the Sherman-Morrison formula with A = diag(p*) and u = v = p*, obtaining an explicit expression whose entries involve I_{i1 i2}, equal to 1 if i1 = i2 and to zero otherwise. Finally, from the above equalities, recalling (S1.1) and (S1.2), we obtain the stated convergence.
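With A = diag(p*) and u = v = p*, the formula yields an inverse with entries I_{ij}/p*_i + 1/p_k, where p_k = 1 − Σ_{i<k} p*_i; this is the usual quadratic form behind the Pearson statistic. A quick numerical check (ours, pure Python for self-containment):

```python
def grp_inverse(p):
    """Inverse of diag(p) - p p^T via Sherman-Morrison, where p collects the
    first k-1 components of a probability vector: entries I_ij / p_i + 1 / p_k."""
    p_k = 1.0 - sum(p)
    m = len(p)
    return [[(1.0 / p[i] if i == j else 0.0) + 1.0 / p_k for j in range(m)]
            for i in range(m)]

def matmul(A, B):
    """Plain-Python matrix product, used only to verify the inverse."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]
```

Multiplying diag(p) − p p^T by grp_inverse(p) returns the identity matrix up to rounding.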

S1.2 A preliminary central limit theorem
The following preliminary central limit theorem is useful for the proofs of the other central limit theorems stated in [S2] and in Section S2.

S2 Case Σ_n ε_n < +∞
In this section we provide some results regarding the case Σ_n ε_n < +∞, even if, as we will see, this case is not interesting for the chi-squared test of goodness of fit. Indeed, as shown in the following result, the empirical mean almost surely converges to a random variable which does not coincide almost surely with a deterministic vector.
The sequence (ψ_n) converges almost surely to ψ∞, where ψ∞ is a random variable which is not almost surely equal to a deterministic vector, that is, P(ψ∞ ≠ q_0) > 0 for all q_0 ∈ R^k.
Proof. When Σ_{n=0}^{+∞} ε_n < +∞, the sequence (ψ_n) is a (bounded) non-negative almost supermartingale (see [S17]) because of (2.7). As a consequence, it converges almost surely (and in L^p with p ≥ 1) to a certain random variable ψ∞. An alternative proof of this fact follows from quasi-martingale theory [S12]: indeed, since Σ_n E[|E[ψ_{n+1}|F_n] − ψ_n|] = O(Σ_n ε_n) < +∞, the stochastic process (ψ_n) is a non-negative quasi-martingale and so it converges almost surely (and in L^p with p ≥ 1) to a certain random variable ψ∞.
In order to show that ψ∞ is not almost surely equal to a deterministic vector, we consider the quantities y_n and observe that, starting from (2.7), we get a recursion of the form y_{n+1} = (1 − 2ε_n) y_n + ζ_n with ζ_n = ε_n^2 y_n + δ_n^2 E[|ΔM_{n+1}|^2] ≥ 0. It follows that, given ñ such that ε_n < 1/2 for n ≥ ñ, we have y_N ≥ y_ñ Π_{n=ñ}^{N−1} (1 − 2ε_n) for each N ≥ ñ and so y_∞ ≥ y_ñ exp(Σ_{n=ñ}^{+∞} ln(1 − 2ε_n)). The above exponential is strictly greater than 0 because Σ_{n=ñ}^{+∞} ln(1 − 2ε_n) ∼ −2 Σ_{n=ñ}^{+∞} ε_n > −∞. Therefore, if y_ñ > 0, then we have y_∞ > 0. This means that ψ∞ − p^0, and consequently ψ∞, is not almost surely equal to a deterministic vector, that is, P(ψ∞ ≠ q_0) > 0 for all q_0 ∈ R^k. If y_ñ = 0, that is, if ψ_ñ is almost surely equal to a deterministic vector ψ, then, by (S2.1), we get y_{ñ+1} > 0, because δ_n > 0 for each n and ψ is different from a vector of the canonical base of R^k by the assumption b^0_i + B^0_i > 0 and equality (2.4). It follows that we can repeat the above argument replacing ñ by ñ + 1 and conclude that ψ∞ is not almost surely equal to a deterministic vector.
As a consequence of the above theorem, if we aim at the almost sure convergence of $\overline{\xi}_N$ to a deterministic vector, we have to avoid the case $\sum_{n=0}^{+\infty} \epsilon_n < +\infty$. However, for the sake of completeness, we provide a second-order convergence result also in this case. First, we note that Theorem S1.1 still holds true with $V = \mathrm{diag}(\psi_\infty) - \psi_\infty \psi_\infty^{\top}$: indeed, the assumptions of Theorem S1.1 are still satisfied in this case. Note that case a) covers the case $\epsilon_n = (n+1)^{-\epsilon}$ and $\delta_n \sim c(n+1)^{-\delta}$ with $c > 0$ and $\min\{\epsilon, \delta\} > 3/2$. The case $\epsilon_n = 0$ (that is, $\beta_n = 1$) for all $n$ corresponds to the case considered in [S15], but in that paper the author studies only the limit $\psi_\infty$ and does not provide second-order convergence results. In both cases a) and b), we have $\sum_{n=1}^{N} \epsilon_n \delta_{n-1} = o(N^{1-e})$ and so $Q_N$ converges almost surely to $0$. Moreover, by Theorem S1.1, $\sum_{n=1}^{N} Y_{N,n}$ stably converges to $\mathcal{N}(0, V)$ with $V = \Gamma = \mathrm{diag}(\psi_\infty) - \psi_\infty \psi_\infty^{\top}$. Therefore it is enough to study the convergence of $\sum_{n=1}^{N} Z_{N,n}$. To this purpose, we observe that, if we are in case a), then $\sum_{n=1}^{N} Z_{N,n}$ converges almost surely to $0$.

S3 Computations regarding the local reinforcement
Suppose $\alpha_n \sim a\,n^{-\alpha}$ for $n \geq 1$ and $(1-\beta_n) \sim b\,(n+1)^{-\beta}$ for $n \geq 0$. In the following subsections we study the behaviour of the factor $f(h, n) = \alpha_h \prod_{j=h}^{n-1} \beta_j$ in some particular cases that cover the cases of the two examples in Section 4. Specifically, for all the considered cases, we set $\ell(h, n) = \ln\big(\alpha_h \prod_{j=h}^{n-1} \beta_j\big) = \ln(\alpha_h) + \sum_{j=h}^{n-1} \ln(\beta_j)$ for $n \geq h$ and we prove that there exists $h^*$ such that $\max_{h \leq h^*} \ell(h, n) \leq \ell(h^*, n)$ and $h \mapsto \ell(h, n)$ is increasing for $h \geq h^*$. This means that the weights $f(h, n)$ of the observations up to time $h^*$ are smaller than those with $h \geq h^*$, and the contribution of the observation at time $h$ is increasing in $h$ for $h \geq h^*$.
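As a numerical illustration (with hypothetical values of $a$, $\alpha$, $b$, $\beta$, chosen for clarity rather than taken from Section 4), the following sketch computes $\ell(h, n)$, locates a candidate $h^*$, and checks the claimed monotonicity for $h \geq h^*$:

```python
import math

# Illustrative parameters (an assumption, not the paper's fitted values):
# alpha_n = a * n**(-alpha) and 1 - beta_n = b * (n+1)**(-beta).
a, alpha, b, beta, n = 1.0, 2.0, 1.0, 0.5, 1000

def ell(h):
    """ell(h, n) = ln f(h, n) = ln(alpha_h) + sum_{j=h}^{n-1} ln(beta_j)."""
    return (math.log(a) - alpha * math.log(h)
            + sum(math.log(1.0 - b * (j + 1) ** (-beta)) for j in range(h, n)))

vals = [ell(h) for h in range(1, n)]          # vals[i] = ell(i + 1)

# For these parameters ell first decreases, then increases: take the last
# decrease, then the first h beyond it dominating all earlier values (= h*).
m = max(i for i in range(len(vals) - 1) if vals[i + 1] < vals[i]) + 1
h_star = next(h for h in range(m, len(vals)) if vals[h] >= max(vals[: h + 1]))

print(h_star + 1)  # +1 because vals[0] corresponds to h = 1
# early weights are dominated by ell(h*, n), and ell is increasing afterwards
assert all(vals[h] <= vals[h_star] for h in range(h_star + 1))
assert all(vals[h + 1] >= vals[h] for h in range(h_star, len(vals) - 1))
```

Note that the increments $\ell(h+1, n) - \ell(h, n) = -\alpha \ln\frac{h+1}{h} - \ln \beta_h$ do not depend on $n$, so the location of $h^*$ found this way is stable as $n$ grows.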
Lemma. Let $(x_n)_n$ and $(\zeta_n)_n$ be real sequences and let $(\gamma_n)_n$ be a sequence of non-negative numbers with $\sum_n \gamma_n = +\infty$ such that
$$x_{n+1} \leq (1 - \gamma_n)\, x_n + \gamma_n\, \zeta_n \qquad \text{for each } n.$$
Then $\limsup_n x_n \leq \limsup_n \zeta_n$.
Proof. Take $L > \limsup_n \zeta_n$ and $n^*$ large enough so that $\zeta_n < L$ and $\gamma_n \leq 1$ when $n \geq n^*$. Then, using that $(x+y)^+ \leq x^+ + y^+$, we have for $n \geq n^*$
$$(x_{n+1} - L)^+ \leq \big( (1-\gamma_n)(x_n - L) + \gamma_n(\zeta_n - L) \big)^+ \leq (1-\gamma_n)(x_n - L)^+ .$$
Since $\sum_n \gamma_n = +\infty$, the above inequality implies that $\lim_n (x_n - L)^+ = 0$. This is enough to conclude, because we can choose $L$ arbitrarily close to $\limsup_n \zeta_n$.
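A quick numerical check of the lemma, with arbitrarily chosen sequences: taking $\gamma_n = (n+1)^{-0.6}$, so that $\sum_n \gamma_n = +\infty$, and $\zeta_n = \cos^2(n)$, whose superior limit is $1$, the recursion drags $x_n$ below $\limsup_n \zeta_n$ even from a large initial value.

```python
import math

# Illustrative sequences (not from the paper):
# x_{n+1} <= (1 - gamma_n) x_n + gamma_n zeta_n, here taken with equality.
N = 200_000
x = 50.0                                 # deliberately large starting point
for n in range(1, N + 1):
    gamma = (n + 1) ** -0.6              # non-summable: sum gamma_n diverges
    zeta = math.cos(n) ** 2              # oscillates in [0, 1]; lim sup = 1
    x = (1 - gamma) * x + gamma * zeta   # the recursion of the lemma
print(x)
assert x <= 1 + 1e-9                     # consistent with lim sup x_n <= 1
```

The initial transient is killed by the divergent product $\prod_n (1-\gamma_n) \to 0$, after which $x_n$ tracks a weighted average of the $\zeta_n$, hence stays below their superior limit.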

S6 Stable convergence
This brief section contains some basic definitions and results concerning stable convergence. For more details, we refer the reader to [S5, S7, S10] and the references therein.
Let $(\Omega, \mathcal{A}, P)$ be a probability space, and let $S$ be a Polish space, endowed with its Borel $\sigma$-field. A kernel on $S$, or a random probability measure on $S$, is a collection $K = \{K(\omega) : \omega \in \Omega\}$ of probability measures on the Borel $\sigma$-field of $S$ such that, for each bounded Borel real function $f$ on $S$, the map
$$\omega \mapsto Kf(\omega) = \int f(x)\, K(\omega)(dx)$$
is $\mathcal{A}$-measurable. Given a sub-$\sigma$-field $\mathcal{H}$ of $\mathcal{A}$, a kernel $K$ is said to be $\mathcal{H}$-measurable if all the above random variables $Kf$ are $\mathcal{H}$-measurable. A probability measure $\nu$ can be identified with the constant kernel $K(\omega) = \nu$ for each $\omega$.
On $(\Omega, \mathcal{A}, P)$, let $(Y_n)_n$ be a sequence of $S$-valued random variables, let $\mathcal{H}$ be a sub-$\sigma$-field of $\mathcal{A}$, and let $K$ be an $\mathcal{H}$-measurable kernel on $S$. Then we say that $Y_n$ converges $\mathcal{H}$-stably to $K$, and we write $Y_n \stackrel{\mathcal{H}\text{-stably}}{\longrightarrow} K$, if
$$E[f(Y_n)\,|\,\mathcal{H}] \longrightarrow Kf \quad \text{in } \sigma(L^1, L^\infty) \qquad (S6.1)$$
for each bounded continuous real function $f$ on $S$. In [S7] the notion of $\mathcal{H}$-stable convergence is first generalized in a natural way, replacing in (S6.1) the single sub-$\sigma$-field $\mathcal{H}$ by a collection $\mathcal{G} = (\mathcal{G}_n)_n$ (called a conditioning system) of sub-$\sigma$-fields of $\mathcal{A}$, and then strengthened by substituting the convergence in $\sigma(L^1, L^\infty)$ with convergence in probability (i.e. in $L^1$, since $f$ is bounded). Hence, according to [S7], we say that $Y_n$ converges to $K$ stably in the strong sense, with respect to $\mathcal{G} = (\mathcal{G}_n)_n$, if
$$E[f(Y_n)\,|\,\mathcal{G}_n] \stackrel{P}{\longrightarrow} Kf$$
for each bounded continuous real function $f$ on $S$.
We conclude this section by recalling some convergence results that we apply in our proofs. This last result contains, as a special case, the fact that stable convergence and convergence in probability combine well: that is, if $C_n$ stably converges to a kernel $M$ and $D_n$ converges in probability to a random variable $D$, then $(C_n, D_n)$ stably converges to $M \otimes \delta_D$, where $\delta_D$ denotes the Dirac kernel concentrated at $D$.
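For instance (an illustrative simulation, not part of the proofs), a Slutsky-type consequence of this combination rule: if $C_n$ stably converges to a Gaussian kernel and $D_n$ converges in probability to a constant, then $C_n / D_n$ converges in distribution to the normalized Gaussian. The sketch below takes $C_n = \sqrt{n}\,\overline{X}_n$ for centered uniform variables (so the limit kernel is $\mathcal{N}(0, 1/12)$) and $D_n$ the sample standard deviation (which converges in probability to $\sqrt{1/12}$):

```python
import math
import random

random.seed(0)

def one_ratio(n):
    """Return C_n / D_n for one sample of n centered Uniform(0, 1) draws."""
    xs = [random.random() - 0.5 for _ in range(n)]
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / n
    c = math.sqrt(n) * m        # C_n: stably converges to N(0, 1/12)
    d = math.sqrt(v)            # D_n: converges in probability to sqrt(1/12)
    return c / d

reps = [one_ratio(500) for _ in range(4000)]
mean = sum(reps) / len(reps)
std = math.sqrt(sum((r - mean) ** 2 for r in reps) / len(reps))
print(mean, std)  # approximately 0 and 1, as for a standard Gaussian
```

Here the joint stable convergence of $(C_n, D_n)$ to $\mathcal{N}(0, 1/12) \otimes \delta_{\sqrt{1/12}}$, combined with the continuous mapping $(c, d) \mapsto c/d$, yields the standard normal limit.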