Universality of Approximate Message Passing Algorithms

We consider a broad class of Approximate Message Passing (AMP) algorithms defined as Lipschitzian functional iterations in terms of an $n\times n$ random symmetric matrix $A$. We establish universality in the noise for this AMP in the large-$n$ limit and illustrate this behavior for a number of AMPs widely used in compressed sensing, statistical inference, and optimization in spin glasses.


Introduction
Motivated by ideas from belief propagation, Approximate Message Passing (AMP) algorithms were initially introduced in the context of compressed sensing, see [13,14,15,16]. They have since gained great popularity in a number of emerging applications in data science, statistical physics, and related fields concerning the development of efficient algorithms for high-dimensional randomized estimation and optimization problems.
One major application lies in matrix estimation, where one aims to extract the structure of a signal matrix in a randomized environment. A popular setting is the so-called spiked model, where the data arrives as the sum of a noise term, an $n\times n$ symmetric random matrix $A_n$, and the signal, an $n\times n$ symmetric low-rank matrix $Z_n$:
$$\hat A_n := A_n + Z_n, \qquad Z_n = \frac{1}{n}\sum_{l=1}^r \gamma_l\, z^l \otimes z^l,$$
where $z^1,\ldots,z^r$ are non-random column vectors with $\|z^l\|_2 = \sqrt n$ and the parameters $\gamma_1,\ldots,\gamma_r \ge 0$ are the signal-to-noise ratios (SNRs). In probability and statistics, this spiked model has been intensively studied by means of the spectral method, see [2,6,7,9,18,20,22,32,33]. In the Bayesian-optimal approach, the setting often considered in the literature assumes that the vectors $z^1,\ldots,z^r$ are random and that their $i$-th marginal vectors, $(z_i^1,\ldots,z_i^r)$ for $1\le i\le n$, are sampled independently from a given prior distribution. It turns out that the corresponding Minimum Mean Square Error Estimator (MMSEE), $E[(z^1,\ldots,z^r)\,|\,\hat A_n]$, can be connected to the Gibbs expectation of the famous Sherrington-Kirkpatrick (SK) mean-field spin glass model from statistical physics [35]. One peculiar feature of this connection is that the model satisfies the so-called Nishimori identity: the conditional distribution of the vectors $z^1,\ldots,z^r$ given the data $\hat A_n$ is equal to the distribution of the vector-valued spin configuration of the corresponding SK model. This allows one to fully understand the behavior of the MMSEE and its phase transition in terms of the SNRs, see [3,10,11,23,24,25,26] for recent progress.
The study of the above Bayesian estimation raises a challenging computational problem: searching for polynomial-time algorithms that simulate the MMSEE. To this end, AMP algorithms have been widely adopted [21,28,29,30,31,34,37] and are known to achieve a good level of success; in some cases, they attain the Bayes-optimal error estimates, see [11,12,29]. In addition to being useful in matrix estimation, AMP has also been applied to a number of randomized optimization problems in mean-field spin glass models in recent years. In particular, it was shown in [17,27] that AMP yields polynomial-time algorithms for the optimization of the SK Hamiltonian and its variants.
In these applications, the AMP algorithm is formulated as a sequence of $n$-dimensional vectors $(v^{[k]})_{k\ge 0}$ of the form
$$v^{[k+1]} = \hat A_n f_k\bigl(v^{[k]},\ldots,v^{[0]}\bigr) - \sum_{j=1}^{k} b_{k,j}\, f_{j-1}\bigl(v^{[j-1]},\ldots,v^{[0]}\bigr), \qquad (1.1)$$
where $f_k \in C^1(\mathbb R^{k+1})$ is applied coordinatewise to form the two vectors $f_k(v^{[k]},\ldots,v^{[0]})$ and $f_{j-1}(v^{[j-1]},\ldots,v^{[0]})$, and the Onsager coefficient $b_{k,j}$ averages the partial derivative $\partial_j f_k$ over the coordinates; this correction term influences the convergence of the AMP in the large-$n$ limit. When $Z_n$ is the zero matrix and the initialization $v^{[0]}$ is independent of $A_n$, it was known [4,5,8,19] that under mild assumptions on the $f_j$'s, this iterative algorithm converges in the sense that for any Lipschitz function $\varphi \in C(\mathbb R^{k+1})$, almost surely,
$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n \varphi\bigl(v_i^{[k]},\ldots,v_i^{[0]}\bigr) = E\varphi(V_k,\ldots,V_0), \qquad (1.2)$$
where $(V_0,\ldots,V_k)$ is a centered Gaussian vector whose covariance is determined recursively by the state evolution.
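To fix ideas, an iteration of this form can be sketched numerically. The following is a minimal sketch, not the algorithm of any particular reference: the nonlinearity $f = \tanh$ (depending on the last iterate only) and the single-memory Onsager correction $b_k = n^{-1}\sum_i f'(v_i^{[k]})$ are illustrative assumptions.

```python
import numpy as np

def amp(A_hat, v0, steps=10):
    """Minimal AMP sketch: v^{k+1} = A_hat f(v^k) - b_k f(v^{k-1}),
    with f = tanh and Onsager coefficient b_k = mean of f'(v^k)."""
    f = np.tanh
    f_prime = lambda x: 1.0 - np.tanh(x) ** 2
    v_prev, v = np.zeros_like(v0), v0.astype(float)
    for _ in range(steps):
        b = f_prime(v).mean()                 # Onsager coefficient
        v, v_prev = A_hat @ f(v) - b * f(v_prev), v
    return v

rng = np.random.default_rng(0)
n = 500
W = rng.standard_normal((n, n))
A = (W + W.T) / np.sqrt(2)                    # symmetric, unit-variance entries
np.fill_diagonal(A, 0.0)
v = amp(A / np.sqrt(n), np.ones(n))           # normalized noise, simple init
print(v.shape, bool(np.isfinite(v).all()))
```

Because $f$ is bounded and $A/\sqrt n$ has operator norm of order one, the orbit stays bounded along the iterations.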
For $x, y \in \mathbb R^n$, set $\|x\|_2 = \bigl(\sum_{i=1}^n x_i^2\bigr)^{1/2}$ and $\langle x, y\rangle = \sum_{i=1}^n x_i y_i$. Let $M_n(\mathbb R)$ be the collection of all $n\times n$ real-valued symmetric matrices. For $X, Z \in M_n(\mathbb R)$, denote $\hat X_n := n^{-1/2} X + n^{-1} Z$. The generalized approximate message passing is formulated as follows.

Universality of approximate message passing algorithms
We now specify the randomness of $X$, $Z$, and $u^0$. Let $\sigma > 0$ be fixed. For any $n \ge 1$, let $u^0 = (u_i^0)_{i\in[n]}$ be an $n$-dimensional random vector and $Z = (z_{ii'})_{i,i'\in[n]}$ be an $n\times n$ random symmetric matrix. Assume that there exists a constant $C(\sigma) > 0$ such that
$$E\exp\bigl(|z_{ii'}|/\sigma\bigr) \le C(\sigma). \qquad (2.1)$$
Suppose that $A = (a_{ii'})_{i,i'\in[n]}$ is an $n\times n$ random symmetric matrix, whose upper triangular entries are independent with zero mean and unit variance and are $\sigma$-subgaussian, i.e., $Ee^{\lambda a_{ii'}} \le e^{\lambda^2\sigma^2/2}$ for all $\lambda \in \mathbb R$. We further assume that $A$ is independent of $u^0$ and $Z$, but allow $u^0$ and $Z$ to be dependent on each other. An important example of $A$ is when the entries $a_{ii'}$ are standard normal. In this case, we denote $A$ by $G$ and we call Definition 2.1 associated to $X = G$ a Gaussian AMP. Our main result shows that if we initialize $u^{[0]}(X) = u^0$, then the AMP corresponding to any $A$ is essentially the same as the Gaussian AMP.

Theorem 2.2. Let $k \ge 0$ and let $\varphi \in C(\mathbb R^{k+1})$ be Lipschitz. Set
$$\Phi_{k,n}(X) := \frac{1}{n}\sum_{i=1}^n \varphi\bigl(u_i^{[k]}(X),\ldots,u_i^{[0]}(X)\bigr), \qquad X \in M_n(\mathbb R). \qquad (2.2)$$
Then $\Phi_{k,n}(A) - \Phi_{k,n}(G) \to 0$ in probability as $n \to \infty$.
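Both the Gaussian matrix $G$ and, for instance, a symmetric Rademacher matrix satisfy the assumptions above (independent, zero-mean, unit-variance, subgaussian upper-triangular entries). A minimal sketch of drawing both kinds of noise:

```python
import numpy as np

def symmetric_noise(n, kind, rng):
    """Draw a symmetric matrix with independent, zero-mean, unit-variance,
    subgaussian upper-triangular entries: 'gauss' gives G, 'rademacher'
    gives a +/-1 example of an admissible matrix A."""
    if kind == "gauss":
        U = rng.standard_normal((n, n))
    else:
        U = rng.choice([-1.0, 1.0], size=(n, n))
    A = np.triu(U) + np.triu(U, 1).T          # symmetrize the upper triangle
    return A

rng = np.random.default_rng(1)
G = symmetric_noise(400, "gauss", rng)
A = symmetric_noise(400, "rademacher", rng)
print(bool(np.allclose(G, G.T)), bool(np.all(np.abs(A) == 1.0)))
```

Universality asserts that running the same AMP on either matrix produces asymptotically the same value of $\Phi_{k,n}$.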
Next we introduce the AMP used in matrix estimation and in some optimization problems in mean-field spin glasses.
Definition 2.3. Let $f_{-1} \equiv 0$. For $k \ge 0$, assume that $f_k \in C^1(\mathbb R^{k+1})$ is Lipschitz and its first-order partial derivatives are also Lipschitz. Let $X \in M_n(\mathbb R)$. Starting from an initialization $v^{[0]}(X)$, define the AMP orbit for $k \ge 0$ iteratively by
$$v^{[k+1]}(X) = \hat X_n f_k\bigl(v^{[k]}(X),\ldots,v^{[0]}(X)\bigr) - \sum_{j=1}^{k} b_{k,j}(X)\, f_{j-1}\bigl(v^{[j-1]}(X),\ldots,v^{[0]}(X)\bigr), \qquad (2.3)$$
where $b_{k,j}(X) := n^{-1}\sum_{i=1}^n \partial_j f_k\bigl(v_i^{[k]}(X),\ldots,v_i^{[0]}(X)\bigr)$.
Note that Definition 2.3 is not a direct example of the generalized AMP in Definition 2.1 due to the term b k,j (X). Nevertheless, since b k,j (X) is an average of the partial derivatives, this quantity is essentially indistinguishable between different randomness and this allows us to establish the following universality.
Theorem 2.4. Let $k \ge 0$ and let $\varphi \in C(\mathbb R^{k+1})$ be Lipschitz. For the AMP orbit in Definition 2.3, set $\Psi_{k,n}(X) := n^{-1}\sum_{i=1}^n \varphi\bigl(v_i^{[k]}(X),\ldots,v_i^{[0]}(X)\bigr)$, $X \in M_n(\mathbb R)$. Then $\Psi_{k,n}(A) - \Psi_{k,n}(G) \to 0$ in probability as $n \to \infty$.
Remark 2.5. As pointed out in the introduction, it was known [4,5,8,19] that if $Z = 0$, the Gaussian AMP in Definition 2.3 converges, see (1.2). If $f_0,\ldots,f_k$ are polynomials and $Z = 0$, it was further understood in [4] that this convergence is independent of the choice of the randomness of $A$. Theorem 2.4 extends this universality to Lipschitz functions and to the presence of $Z$. We refer the reader to [17,27] for the universality of this AMP in the optimization of the Hamiltonian of the SK and related models. We mention that while our extension achieves the same result as [27], the work [27] adapted [4] directly with a truncation argument.
Example 2.6. For $\gamma \ge 0$, set $Z = \gamma u^0 \otimes u^0$. In this case, $\hat A_n$ is a rank-one spiked matrix. In matrix estimation, one would like to recover the vector $u^0$ from the realization of $\hat A_n$. When $A = G$, the MMSEE, $E[u^0\,|\,\hat A_n]$, is commonly used for this purpose and it can be simulated via the AMP in Definition 2.3 with suitably chosen functions $f_k$, see, e.g., [11]. Theorem 2.4 indicates that in a non-Gaussian noise environment, the AMP in Definition 2.3 still implements the same simulation of $u^0$ as the Gaussian AMP.
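A minimal numerical sketch of this rank-one spiked setup; the normalization $\hat A_n = A/\sqrt n + \gamma\, u^0 \otimes u^0/n$ is consistent with $\|u^0\|_2 = \sqrt n$, and the $\pm 1$ prior is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
n, gamma = 800, 2.0
u0 = rng.choice([-1.0, 1.0], size=n)          # prior vector, ||u0||_2 = sqrt(n)
W = rng.standard_normal((n, n))
A = (W + W.T) / np.sqrt(2)
np.fill_diagonal(A, 0.0)
spike = gamma * np.outer(u0, u0) / n          # rank one, operator norm gamma
A_hat = A / np.sqrt(n) + spike                # the spiked matrix
print(np.linalg.matrix_rank(spike), float(np.linalg.norm(spike, 2)))
```

The signal contributes a rank-one perturbation of operator norm exactly $\gamma$, which is the quantity the SNR threshold below is measured against.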
Recall that the initialization $u^0$ and the signal matrix $Z$ are assumed to be independent of the noise. In Example 2.6, since the MMSEE is a measurable function of the spiked matrix $\hat A_n$, it is often more desirable to let the initialization depend on $\hat A_n$, as it should provide a better estimate of the MMSEE. When $A = G$, an attempt along this line has been successfully carried out in [29]. Our last main result addresses universality in this direction. For any $X \in M_n(\mathbb R)$, denote by $\lambda_1(\hat X_n) \ge \lambda_2(\hat X_n) \ge \cdots \ge \lambda_n(\hat X_n)$ the eigenvalues of $\hat X_n$ and by $\psi_1(\hat X_n)$ the top eigenvector of $\hat X_n$ with $\|\psi_1(\hat X_n)\|_2 = \sqrt n$. Set
$$\psi(X) = \mathrm{sign}\bigl(\langle \psi_1(\hat X_n), u^0\rangle\bigr)\, \psi_1(\hat X_n) \qquad (2.4)$$
whenever $\langle \psi_1(\hat X_n), u^0\rangle \ne 0$. Note that although there are two possible choices of $\psi_1(\hat X_n)$ up to a sign, the definition of $\psi(X)$ is not influenced by this difference.

Our result assumes that, with probability tending to one, the top eigenvalue of $\hat A_n$ stays a gap away from the rest of the spectrum, (2.5), and that the principal eigenvector retains a nonvanishing correlation with the prior vector $u^0$, in the sense that $\liminf_{n\to\infty} |\langle \psi_1(\hat A_n), u^0\rangle|/n > 0$, (2.6). These behaviors are not only required in our proofs for technical purposes, but also appear to be quite typical in the BBP phase transition, see the following example. It is known from [6] that the BBP transition point is equal to 1: if $\gamma < 1$, then $\lim_{n\to\infty}\lambda_1(\hat G_n) = 2$ a.s. and the top eigenvector is asymptotically uncorrelated with $u^0$, while if $\gamma > 1$, then $\lambda_1(\hat G_n)$ converges a.s. to a limit strictly larger than 2 and the squared correlation between $\psi_1(\hat G_n)$ and $u^0$ has a positive limit. These imply that the spectral method can gain useful information about $u^0$ only if the SNR exceeds the critical threshold, i.e., $\gamma > 1$, as in this case the principal eigenvector is positively correlated with $u^0$. In [29], the convergence of the AMP in Definition 2.1 initialized by the top eigenvector was investigated; there, starting from $\mu_0 = \sqrt{1-\gamma^{-2}}$ and $\sigma_0 = 1/\gamma$, the sequences $(\mu_k)_{k\ge 1}$ and $(\sigma_k)_{k\ge 1}$ are defined through a state-evolution recursion in which $g \sim N(0,1)$ is independent of $w$. Note that $\hat G_n$ is a perturbation of $G_n$ by a rank-one matrix. The eigenvalue interlacing property implies that for any small $\delta > 0$, the eigenvalues $\lambda_i(\hat G_n)$ for $2 \le i \le n$ are squeezed between $\lambda_1(G_n)$ and $\lambda_n(G_n)$ up to $\delta$, where $\lambda_1(G_n)$ and $\lambda_n(G_n)$ are the largest and smallest eigenvalues of $G_n$, respectively. Note that this inequality, (2.7), and (2.8) are also valid for $A$. Hence, the assumptions (2.5) and (2.6) hold and, as a result, the convergence of (2.9) is universal in probability.
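The BBP behavior described above is easy to observe numerically. In this sketch (the choice $\gamma = 2$ and the $\pm 1$ prior are illustrative assumptions), $\lambda_1(\hat G_n)$ is compared with the classical BBP limit $\gamma + 1/\gamma$ and the squared overlap with $1 - \gamma^{-2}$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, gamma = 1200, 2.0
u0 = rng.choice([-1.0, 1.0], size=n)
W = rng.standard_normal((n, n))
G = (W + W.T) / np.sqrt(2)
np.fill_diagonal(G, 0.0)
G_hat = G / np.sqrt(n) + gamma * np.outer(u0, u0) / n

vals, vecs = np.linalg.eigh(G_hat)            # eigh returns ascending order
lam1 = vals[-1]                               # top eigenvalue
psi1 = vecs[:, -1] * np.sqrt(n)               # normalized so ||psi1||_2 = sqrt(n)
overlap_sq = (psi1 @ u0 / n) ** 2             # squared correlation with u0
print(round(lam1, 2), round(overlap_sq, 2))   # near gamma + 1/gamma and 1 - 1/gamma**2
```

With $\gamma = 2$ the top eigenvalue detaches from the bulk edge at 2, and the spectral gap and correlation required in (2.5) and (2.6) are visible at moderate $n$.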
Remark 2.9. In Theorems 2.2, 2.4 and 2.7, although we assume that $\varphi$ is Lipschitz, these theorems in fact also hold for polynomial $\varphi$ (this is particularly relevant for applications in statistics when considering square losses). This essentially follows from a truncation argument together with the moment bound of $\|u^{[k]}(A)\|_2$ in Proposition 5.4.
Our approach to proving Theorem 2.2 is to match the first and second moments of $\Phi_{k,n}$ between $A$ and $G$. To this end, we define a Gaussian interpolation $X = A(t) := \sqrt t\, A + \sqrt{1-t}\, G$ for $0 \le t \le 1$ and control the $t$-derivatives of $E\Phi_{k,n}(A(t))$ and $E\Phi_{k,n}(A(t))^2$. The hope is that if the total number of terms, as well as their orders, appearing in these derivatives is small enough, then an application of the approximate Gaussian integration by parts will make the derivatives small. However, due to the iteration of the AMP, these derivatives involve highly complicated Hadamard products of a large number of column vectors built from the higher-order partial derivatives of $u^{[\ell]}(X)$ and $\hat X_n u^{[\ell]}(X)$. As a result, the control of their $p$-th moments is extremely delicate, especially for those of $\hat X_n u^{[\ell]}(X)$. The novelty of our analysis lies in a Taylor expansion of the derivatives up to the $p$-th order, which allows us to extract the dependence on the $i$-th row of $X$ out of the derivatives. Combined with a subtle moment computation in this expansion, this cancels out the majority of the lower-order terms and yields the following moment controls (see Proposition 5.4 and Lemma 5.5): for any $p \ge 1$, there exists a universal constant $C > 0$ such that for any collection $P$ of variables $x_{ii'}$ for $i, i' \in [n]$, counting multiplicities, with $|P| = m$, the $p$-th moments of the partial derivatives $\partial_P$ of the orbit are of order $n^{-pm/2}$, where $\partial_P$ denotes the partial derivative with respect to the variables in $P$. Using the Markov inequality and the union bound, these yield a uniform control on the derivatives: for any $P$ with $|P| = m$ and $\delta > 0$, with probability at least $1 - Cn^{-\delta}$, the maximum over $i \in [n]$ of these derivatives obeys a bound of the same order. Once Theorem 2.2 is established, the proof of Theorem 2.4 follows essentially by a special choice of the functions $F_0, \ldots, F_k, \ldots$.
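The interpolation $A(t) = \sqrt t\, A + \sqrt{1-t}\, G$ can be sketched directly; it recovers $G$ at $t = 0$ and $A$ at $t = 1$, and each off-diagonal entry keeps zero mean and unit variance for every $t$, which is what the moment-matching argument exploits (the Rademacher choice for $A$ is an illustrative assumption):

```python
import numpy as np

def interpolate(A, G, t):
    """Gaussian interpolation A(t) = sqrt(t) * A + sqrt(1 - t) * G."""
    return np.sqrt(t) * A + np.sqrt(1.0 - t) * G

rng = np.random.default_rng(4)
n = 300
U = rng.choice([-1.0, 1.0], size=(n, n))
A = np.triu(U, 1) + np.triu(U, 1).T           # Rademacher noise, zero diagonal
V = rng.standard_normal((n, n))
G = np.triu(V, 1) + np.triu(V, 1).T           # Gaussian noise, zero diagonal

At = interpolate(A, G, 0.3)
off = At[np.triu_indices(n, 1)]               # upper-triangular entries of A(t)
print(bool(np.allclose(interpolate(A, G, 1.0), A)),
      bool(np.allclose(interpolate(A, G, 0.0), G)),
      round(float(off.var()), 2))             # entrywise variance stays near 1
```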
Although the term $b_{k,j}(X)$ in (2.3) depends on all coordinates, its form averages out the partial derivatives; consequently, $b_{k,j}(A)$ and $b_{k,j}(G)$ are asymptotically equal in probability, which is already enough to establish Theorem 2.4 by an induction argument. Lastly, to show Theorem 2.7, recall that while the AMPs in Theorems 2.4 and 2.7 share the same iteration procedure, their initializations are of different kinds: the former is initialized independently of $\hat A_n$, while the latter uses the principal eigenvector of $\hat A_n$. We show that this eigenvector can be approximated very well by the power method (see Lemma 8.1). In view of this method, it is essentially a special case of our generalized AMP with the choice $F_k(x_k, \ldots, x_0) = \hat X_n x_k$, and an argument analogous to that for Theorem 2.4 establishes Theorem 2.7. One technicality here is that in order to guarantee the convergence of the power method, one has to choose the initialization carefully and ensure that the principal eigenvalue of $\hat A_n$ is well-separated from the other eigenvalues; this explains why the assumptions (2.5) and (2.6) are needed. We mention that our approach can plausibly be extended to prove universality in a number of different settings of the AMP, such as the high-dimensional version of the AMP in [19] and the signal recovery in square/rectangular spiked matrices via the AMP initialized by top eigenvectors/singular vectors [29], whose corresponding eigenvalues/singular values exhibit the BBP phase transition. We do not pursue these directions here.
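The power-method approximation of the top eigenvector can be sketched as follows; the spiked setup and the number of iterations are illustrative assumptions, and a spectral gap as in assumption (2.5) is what makes the iteration converge quickly:

```python
import numpy as np

def power_method(M, v0, steps=200):
    """Approximate the top eigenvector of a symmetric matrix M by repeated
    multiplication with renormalization at every step."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(steps):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(5)
n, gamma = 600, 2.0
u0 = rng.choice([-1.0, 1.0], size=n)
W = rng.standard_normal((n, n))
A_hat = (W + W.T) / np.sqrt(2 * n) + gamma * np.outer(u0, u0) / n

v = power_method(A_hat, rng.standard_normal(n))
top = np.linalg.eigh(A_hat)[1][:, -1]         # reference top eigenvector
alignment = float(abs(v @ top))               # close to 1 once converged
print(round(alignment, 3))
```

Since the top eigenvalue is separated from the bulk when $\gamma > 1$, the error contracts geometrically at rate roughly the ratio of the second-largest to the largest eigenvalue magnitude.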
The rest of the paper is organized as follows. Sections 3-6 prepare for the proof of Theorem 2.2. Section 3 establishes a Gaussian concentration inequality for the function $\Phi_{k,n}(X)$ as well as a number of a priori controls on the AMP orbit. In Section 4, we show that in proving Theorem 2.2, it suffices to assume that $\varphi$ and the $F_k$'s are smooth with uniformly bounded derivatives. Section 5 provides the main estimates on the moments of the derivatives of the AMP orbits. In Section 6, we carry out our interpolation argument and present the proof of Theorem 2.2. The proofs of Theorems 2.4 and 2.7 are given in Sections 7 and 8, respectively. Finally, the Appendix gathers error estimates for some approximate Gaussian integrations by parts.

Lipschitz property and concentration inequality
Consider the AMP in Definition 2.1 with initialization $u^{[0]}(X) = u^0$. In this section, we establish a Lipschitz property for this AMP and a concentration inequality for $\Phi_{k,n}(G)$.
These will be used later in the proof of Theorem 2.2. Recall that the functions $F_k$ in Definition 2.1 are Lipschitz; let $\eta_k$ be the Lipschitz constant of $F_k$. For any $X \in M_n(\mathbb R)$, denote by $\|X\|_2$ the $\ell_2$-to-$\ell_2$ operator norm of $X$.
This proposition says that the AMP orbits behave stably under small perturbations of the matrix $X$ and of the initialization. From this, we show that the Gaussian AMP concentrates.
where $\tilde E$ is the expectation conditionally on $u^0$ and $Z$.
For the rest of this section, we establish these results.

Proof of Proposition 3.1
The proof of this proposition relies on two lemmas, established by an iterative argument, on the boundedness and the Lipschitz property of the vector $u^{[k]}(X)$.

Lemma 3.3.
For every $k \ge 1$, the following bound holds.
Proof. Write out the iteration defining $u^{[\ell+1]}(X)$. Using the Lipschitz property of $F_{\ell-1}$ and the trivial bound $(a+b)^2 \le 4(a^2+b^2)$, and then the Minkowski inequality, we obtain a recursive estimate. If we let $t_\ell := \|u^{[\ell]}(X)\|_2 + n^{1/2}$ and $C' := 1 + C$, then the above inequality yields a closed recursion in $t_\ell$, and induction completes our proof.

Lemma 3.4.
For any $X, Y \in M_n(\mathbb R)$, the orbits satisfy the following Lipschitz bound.
Proof. From the Lipschitz property of $F_{\ell-1}$, we obtain a recursive estimate on $\|u^{[\ell]}(X) - u^{[\ell]}(Y)\|_2$ for any $1 \le \ell \le k$. Iterating this recursion, the assertion follows immediately and this completes our proof.

Some prior bounds
Next, note that the operator norm is no larger than the Frobenius norm; this and the Jensen inequality bound the contribution of $Z$. Finally, since the entries of $A$ are independent, $\sigma$-subgaussian, and of zero mean, it is well known (see, for instance, Corollary 4.4.8 in [36]) that $\sup_{n\ge 1} E\|A_n\|_2^p < \infty$. Putting these bounds together yields the uniform integrability of $\|\hat A_n\|_2^p$ and completes our proof.
Proof. The proof follows directly from Lemmas 3.3 and 3.5.

Proof of Theorem 3.2
First of all, we establish a Gaussian concentration inequality for the functional $\Phi_{k,n}$.
where $\tilde P$ and $\tilde E$ are the probability and expectation conditionally on $u^0$ and $Z$, where $\Omega_n$ is defined in (3.6), and $c_0$ is a constant independent of $n$. Next, note that for any $Y_n \in M_n(\mathbb R)$ with $\|Y_n\|_2 \le M$ and $X, X' \in M_n(\mathbb R)$, the corresponding orbits are comparable. From these observations, and noting that the $\ell_2$-operator norm of a matrix is at most its Frobenius norm, we see that $T(X)$ is Lipschitz with respect to the Frobenius norm with Lipschitz constant $c_1\Omega_n/n^{1/2}$ for some constant $c_1$ independent of $n$. Hence, the usual Gaussian concentration inequality for Lipschitz functions applies. Now note that, as long as we fix $M$ large enough at the beginning, the exceptional event has exponentially small probability; here we used the well-known fact, which follows from the Borell-TIS inequality, that the largest eigenvalue of $G$ concentrates around its mean with an exponential tail bound, where $c_2, c_3$ are constants independent of $n$ and $M$. On the other hand, the remaining term is bounded by a constant $c_4$ independent of $n$ and $M$. Combining these estimates completes our proof.
Proof of Theorem 3.2. From Lemmas 3.5 and 3.7 and the Markov inequality, $\Phi_{k,n}(G) - \tilde E\Phi_{k,n}(G) \to 0$ in probability $P$. Combining this with the a priori bounds above, the assertion follows.

Smooth approximation
Recall that the functions F k in Definition 2.1 and φ in Theorem 2.2 are Lipschitz.
In this section, we show that to prove Theorem 2.2, it suffices to assume that these functions are smooth and their derivatives of any nonzero orders are uniformly bounded.

Proposition 4.1.
For any $k \ge 0$, there exists a constant $C$ independent of $n$ such that for any $\varepsilon > 0$, there exist functions $\tilde\varphi \in C^\infty(\mathbb R^{k+1})$ and $\tilde F_\ell \in C^\infty(\mathbb R^{\ell+1})$ for $0 \le \ell \le k-1$, whose partial derivatives of any nonzero order are uniformly bounded, such that the corresponding orbits are $C\varepsilon$-close; here $\bar u^{[k]}$ is the $k$-th AMP orbit in Definition 2.1 associated to the functions $\tilde F_0, \ldots, \tilde F_{k-1}$ and the initial condition $\bar u^{[0]} = u^0$.
Proof. Denote by $\eta_\ell$ the Lipschitz constant of $F_\ell$. Let $\varepsilon > 0$ be fixed. Assume that $\zeta \in C^\infty(\mathbb R^{\ell+1})$ is a mollifier with $\zeta \ge 0$ and $\int \zeta\, dx = 1$, supported on the unit ball $\{x \in \mathbb R^{\ell+1} : \|x\|_2 \le 1\}$. Note that the mollification $\tilde F_\ell$ at scale $\varepsilon$ satisfies $\|\tilde F_\ell - F_\ell\|_\infty \le \eta_\ell\,\varepsilon$ for any $\varepsilon > 0$ and $x \in \mathbb R^{\ell+1}$. Since $F_\ell$ is Lipschitz and $\zeta$ is supported on the unit ball, it follows that the partial derivatives of all nonzero orders of $\tilde F_\ell$ are uniformly bounded.
From this, an induction argument implies that the orbits driven by $(F_\ell)$ and $(\tilde F_\ell)$ stay within distance $C\varepsilon$, where $C$ is a constant depending only on $\ell$ and $\eta_\ell$. Finally, by the same argument, for any $\varepsilon > 0$ there exists $\tilde\varphi \in C^\infty(\mathbb R^{k+1})$ with uniformly bounded partial derivatives of any nonzero order such that $\|\varphi - \tilde\varphi\|_\infty < \varepsilon$. From (4.2) and the Lipschitz property of $\varphi$, our proof is complete.
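The smoothing step can be illustrated in one dimension; this sketch replaces the smooth mollifier $\zeta$ by a uniform average over a ball of radius $\varepsilon$ (an illustrative simplification), showing that mollifying a Lipschitz function at scale $\varepsilon$ perturbs it by at most the Lipschitz constant times $\varepsilon$ in the sup norm:

```python
import numpy as np

# Mollify the Lipschitz function F(x) = |x| (Lipschitz constant 1) at scale
# eps by averaging over the ball [x - eps, x + eps]; a uniform kernel stands
# in here for the smooth compactly supported mollifier zeta.
eps = 0.05
xs = np.linspace(-2.0, 2.0, 2001)
offsets = np.linspace(-eps, eps, 201)
F = np.abs
F_eps = np.array([F(x + offsets).mean() for x in xs])
err = float(np.max(np.abs(F_eps - F(xs))))
print(err <= eps)                             # sup-error at most Lip * eps
```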
Moment estimates on the derivatives of the AMP orbits
We establish uniform moment controls on the partial derivatives of the generalized AMP orbit in Definition 2.1. For $\sigma > 0$, recall the random vector $u^0$ and the random matrix $Z$ from (2.1). Let $\mathcal G_n(\sigma)$ be the collection of $n\times n$ symmetric random matrices $A = (a_{ii'})_{i,i'\in[n]}$ whose entries are centered, independent, and $\sigma'$-subgaussian for some $0 \le \sigma' \le \sigma$. We also assume that every $A \in \mathcal G_n(\sigma)$ is independent of $u^0$ and $Z$. For any $m \in \mathbb N$, denote $[m] = \{1, \ldots, m\}$. For any $n \ge 2$, let $T_n$ be the collection of all sequences $(i_r, i'_r)_{r\ge 1} \subset [n]^2$ with $i_r < i'_r$ for all $r \ge 1$. Let $P$ be an arbitrary finite subset of $\mathbb N$ and let $m = |P|$. For $h \in C^m(M_n(\mathbb R))$ and $(i_r, i'_r)_{r\ge 1} \in T_n$, denote by $\partial_P h(X) \in \mathbb R$ the partial derivative of $h$ with respect to the variables $x_{i_r i'_r}$ for all $r \in P$, counting multiplicities. For a vector-valued function $H = (h_1, \ldots, h_n)$ with $h_1, \ldots, h_n \in C^m(M_n(\mathbb R))$, we set the partial derivative of $H$ coordinatewise. For any $n \ge 2$ and $m \ge 0$, denote by $B_n(m)$ the collection of all tuples $(P, (i_r, i'_r)_{r\ge 1}, A, i)$ with $P \subset \mathbb N$, $|P| = m$, $(i_r, i'_r)_{r\ge 1} \in T_n$, $A \in \mathcal G_n(\sigma)$, and $i \in [n]$. The following is our main estimate. Assume that $F_k \in C^\infty(\mathbb R^{k+1})$ and that its partial derivatives of all nonzero orders are uniformly bounded.
Assume that the partial derivatives of nonzero orders of $U$ are uniformly bounded, and define the associated vector-valued random function on $M_n(\mathbb R)$. As we shall see, the resulting bound will be used to control the Gaussian interpolation between the first two moments of $\Phi_{k,n}(A)$ and $\Phi_{k,n}(G)$. For the rest of this section, we establish this proposition in three subsections. First, we derive explicit formulas for the derivatives of the AMP orbit. Next, we show that Proposition 5.1 is valid when $U$ depends only on the marginal variables. The general case is treated in the last subsection.

Some auxiliary lemmas
For $r \ge 1$, let $J_r(k) = \{1, \ldots, k\}^r$ and let $\mathcal P_r(P)$ denote the collection of all partitions $\mathcal P = \{P_1, \ldots, P_r\}$ of $P$ into $r$ nonempty subsets. For $J = (j_1, \ldots, j_r) \in J_r(k)$, the corresponding derivative formula is stated in the lemma.
Proof. We argue by induction on the size of the set $P$. The case $|P| = 1$ is obvious. Suppose that the conclusion holds for some $m \ge 1$ and all $P \subset \mathbb N$ with $|P| = m$.
To handle the first summation, note that the induction hypothesis applies. For the second summation, we rewrite it, where in the first equality we changed the variable $r+1 \to r$, while in the second equality we split $2 \le r \le m+1$ into $2 \le r \le m$ and $r = m+1$ and used the observation that $\mathcal P_1([m+1])$ contains no element $\mathcal P$ with $\{m+1\} \in \mathcal P$. Combining this summation with (5.2) yields the desired formula.
where $E_r \in M_n(\mathbb R)$ is the matrix whose entries are equal to 1 at $(i_r, i'_r)$ and $(i'_r, i_r)$ and are zero otherwise.
Proof. It suffices to assume that $P = [m]$. The case $m = 1$ is direct. Assume that the assertion is valid for $m \ge 1$; the inductive step then follows by differentiating $H(X)$ once more.

Moment control
The most crucial ingredient of this paper lies in the following proposition, which establishes two special cases of Proposition 5.1.
where $E_r \in M_n(\mathbb R)$ is equal to 1 at the entries $(i_r, i'_r)$ and $(i'_r, i_r)$ and is zero elsewhere.
To control the first case, note that $u^0$ is independent of $A$ and the entries of $A$ are independent. From the subgaussianity of $a_{ii'}$ and (2.1), there exist positive constants $\lambda(\sigma)$ and $D(\sigma)$ such that the exponential moment bound holds for any $n \ge 1$ and $\lambda \in [-\lambda(\sigma), \lambda(\sigma)]$. Consequently, the first bound follows from $x^{2p} \le (2p)!\cosh x$ and the Jensen inequality. On the other hand, the Cauchy-Schwarz and Jensen inequalities imply the second bound, where $D'(\sigma)$ is a constant independent of $n$ and is guaranteed by the moment assumption (2.1). Combining these two inequalities validates (5.4) for $(k,p,m) = (0,p,0)$. In the second case, since $E_r u^0$ has only two nonzero entries, namely $u^0_{i_r}$ and $u^0_{i'_r}$, the bound (2.1) implies (5.4) for $(k,p,m) = (0,p,1)$. The third case is evident. In conclusion, (5.4) holds for $k = 0$, $p \ge 1$, and $m \ge 0$. Next, we assume that there exists some $k_0 \ge 0$ such that (5.3) and (5.4) are valid for all $0 \le k \le k_0$, $p \ge 1$, and $m \ge 0$. Our goal is to show that they are also valid for $k = k_0 + 1$, $p \ge 1$, and $m \ge 0$; here $\eta_{k_0}$ denotes the Lipschitz constant of $F_{k_0}$. By the induction hypothesis, (5.3) follows for $(k,p,m) = (k_0+1, p, 0)$ and all $p \ge 1$. Now suppose that $m \ge 1$. For $n \ge 2$, consider an arbitrary $P$ with $|P| = m$, $(i_r, i'_r)_{r\ge 1} \in T_n$, $A \in \mathcal G_n(\sigma)$, and $i \in [n]$, and apply Lemma 5.2.
Note that from Lemma 5.3, the derivative splits into two terms. The first term can be controlled directly, where $\delta_{i,i'} = 1$ if $i = i'$ and $\delta_{i,i'} = 0$ otherwise. As for the second term, note that the bound holds for any $u \in \mathbb R^n$. Here, from (2.1), $n^{-p} E\bigl(\sum_{j=1}^n z_{ij}^2\bigr)^p$ is bounded above by a constant independent of $n$.
From the validity of (5.3) for $k = k_0+1$, $p \ge 1$, and $m \ge 0$ established above, and from Lemma 5.5 below, where $\Upsilon_{k_0+1,p,m}$ is a universal constant independent of $n$, we arrive at a bound with a constant $C$ independent of $n$. Plugging this inequality and (5.6) into (5.5) yields (5.4) for $k = k_0+1$, $p \ge 1$, and $m \ge 0$. This completes our proof.
At the end of this subsection, we establish the following lemma used in the above proof.
Then for any $p \ge 1$ and $m \ge 0$, there exists a constant $\Upsilon_{k,p,m}$ such that the supremum over $B_n(m)$ obeys the stated bound.
Proof. Our idea is to use the Taylor expansion to track the dependence of $|(\hat A_n \partial_P u^{[k]}(A))_i|^p$ on each variable $a_{ii'}$ in each iteration. By Jensen's inequality, it suffices to assume that $p$ is even. Let $m \ge 0$ be fixed. Let $P \subset \mathbb N$ with $|P| = m$, $(i_r, i'_r)_{r\ge 1} \in T_n$, $A \in \mathcal G_n(\sigma)$, and $i \in [n]$. Denote $V(X) = \partial_P u^{[k]}(X)$. For $I \in \mathcal I_D$ and $X \in M_n(\mathbb R)$, let $X^I \in M_n(\mathbb R)$ agree entrywise with $X$ except that it vanishes on the sites $(i, \iota_s)$ and $(\iota_s, i)$ for all $s \in D$. Here, $V_\iota$ is the $\iota$-th entry of $V$. By Taylor's theorem, we may write the resulting decomposition as $M^I_\iota(X) + N^I_\iota(X)$.
First we handle $\Delta_{n,1}$. From the given assumption, $\bigl(E|a_{i\iota_{s_1}}\cdots a_{i\iota_{s_a}}|^{4p}\bigr)^{1/4p} \le \Gamma_{k,4p,a+m}\,\xi^a$. Also, the last inequality below uses the fact that $|S^c| \ge 1$ since $S \subsetneq [p]$.
Note that when $b = 0$, every entry in $I \in \mathcal I_{D,0}$ must appear at least twice in $I$ and hence $S_{\bar a, I, b} = S_{\bar a}$, or equivalently, $S^c_{\bar a, I, b} = \emptyset$. Also, note that $S_{\bar a, I, b}$ is nonempty only if $|\bar a| \ge b$.
From these, we can split the expectation into two summations and control them separately.
To control the first summation, note that for any $I \in \mathcal I_{D,b}$, there are exactly $b$ entries in $I$ that appear once, and the other entries are repeated. This implies the bound (5.9), involving $\Gamma_{k+\ell,p,m+a_r}$ and $\Gamma_{k,p,m+a_r}$, where the third inequality used $|\bar a| \ge b$. As for the second summation, observe that for $1 \le b \le |D|$, if $I \in \mathcal I_{D,b}$ and $(\bar s^1_{a_1}, \ldots, \bar s^p_{a_p}) \in S^c_{\bar a, I, b}$, then there exists one entry, say $\iota'$, in $I$ that appears only once in $I$ and does not appear among the entries of $\bar s^1_{a_1}, \ldots, \bar s^p_{a_p}$. Hence, $a_{i\iota'}$ is independent of the remaining factors, and the bound involving $\Gamma_{k,p,m+a_r}$ follows.
This completes our proof.

Proof of Proposition 5.1
Let $m = |P|$. From Lemma 5.2, the Minkowski inequality, and the Cauchy-Schwarz inequality, we can compute the partial derivatives of $U(X)_i$ to see that $\bigl(E|\partial_P U(A)_i|^p\bigr)^{1/p}$ is bounded above by a sum in which each summand is of the stated form for some $1 \le r \le m$ and $(j_1, \ldots, j_r) \in J_r(2(k+1))$. More importantly, $\mathcal P = \{P_1, \ldots, P_r\} \in \mathcal P_r(P)$ and each term $v_{j_s}(A)$ is equal to either $(\hat A_n u^{[\ell]}(A))_i$ or $u^{[\ell]}(A)_i$ for some $0 \le \ell \le k$.

Now, using the Hölder inequality gives the product bound. Here, from Proposition 5.4, each factor on the right-hand side is bounded above by a term of order $1/n^{|P_s|/2}$, and together they yield a bound of order $1/n^{|P|/2}$ since $|P_1| + \cdots + |P_r| = |P|$. This completes our proof.

Proof of Theorem 2.2
We establish universality for the generalized AMP in Definition 2.1. Recall from Theorem 3.2 that $\tilde P$ and $\tilde E$ denote the probability and expectation conditionally on $u^0, Z$. The following proposition shows that conditionally on $u^0, Z$, the first two moments of the AMP orbits of $A$ and $G$ asymptotically match each other. Proposition 6.1. Consider the AMP orbit in Definition 2.1 with the initialization $u^{[0]}(X) = u^0$ and assume that $(F_k)_{k\ge 0}$ satisfies (5.1). Let $k \ge 0$. Assume that $\varphi \in C^\infty(\mathbb R^{k+1})$ has uniformly bounded partial derivatives of any nonzero order. Then there exists a universal constant $C$ independent of $n$ such that the stated bounds hold for any $n \ge 2$. Note that from Theorem 3.2, $\lim_{n\to\infty} E|\Phi_{k,n}(G) - \tilde E\Phi_{k,n}(G)|^2 = 0$.
For the rest of this section, we establish Proposition 6.1 in five subsections using the Gaussian interpolation and approximate Gaussian integration by parts. In doing so, Proposition 5.1 will be of great use in tracking the error terms. Subsection 6.1 shows that to prove Proposition 6.1, it suffices to assume that the main diagonals of $A, G, Z$ are all equal to zero. The Gaussian interpolation between $\Phi_{k,n}(A)$ and $\Phi_{k,n}(G)$ is introduced in Subsection 6.2 and the control of its derivative is handled in Subsection 6.3. Finally, the proofs of (6.1) and (6.2) are given in Subsections 6.4 and 6.5, respectively.

Deletion of the main diagonal
By virtue of Proposition 3.1, it suffices to assume that the main diagonals of $A$ and $Z$ are zero. To see this, recall $\Phi_{k,n}(X)$, $\Delta_k(X)$, and $\Theta_k(X)$ from Proposition 3.1. Let $A'$ be equal to $A$ except that its main diagonal vanishes. Note that from Lemma 3.5, $\|A_n\|_2$ and $\|A'_n\|_2$ are of order $O(1)$. Next, an almost identical argument shows that the AMP orbits corresponding to $(A', Z)$ and $(A', Z')$ are also asymptotically the same in the $L^1(P)$-distance, where $Z'$ is the same as $Z$ except that its main diagonal is zero. In what follows, we therefore assume that the main diagonals of $A$, $G$, and $Z$ are all equal to zero.

Interpolation
Define the Gaussian interpolation between $A$ and $G$ by $A(t) := \sqrt t\, A + \sqrt{1-t}\, G$ for $0 \le t \le 1$. For $\varphi \in C^\infty(\mathbb R^{k+1})$ with uniformly bounded partial derivatives of all nonzero orders, define $\Phi_{k,n}(t) = \tilde E\Phi_{k,n}(A(t))$.
To show (6.1), our goal is to control the $t$-derivative of $\Phi_{k,n}(t)$. Here and thereafter, we write $v \odot v'$ for the Hadamard product between $v$ and $v'$. Note that this operation is commutative.
To simplify our notation, we denote $t$-derivatives by a dot and observe that $\frac{d}{dt} u^{[1]}(A(t))$ can be computed directly from the iteration.
From these equations, one readily sees that the vector appearing in (6.3) can be written as a sum of column vectors, in which each summand is of the form $w^r(t) = (w^r_i(t))_{i\in[n]}$ for some $0 \le r \le \ell - 1$, defined by an iterative procedure through functions $L_0, L_1, \ldots, L_{r+1} \in C^\infty(\mathbb R^2)$ whose partial derivatives of any nonzero order are uniformly bounded. More precisely, starting from an initialization, the iteration (6.4) defines $w^{r,[0]}(t), \ldots, w^{r,[r]}(t)$ successively. Finally, set $w^r(t) = w^{r,[r]}(t)$.

Bounding the derivative of the interpolation
For $r \ge 0$, from the iteration (6.4) and expanding the Hadamard product, we obtain (6.8), where $\mathcal I_r$ is the collection of all $I = (i_0, i_1, \ldots, i_r, i_{r+1}) \in [n]^{r+2}$ with $i_0 \ne i_1, i_1 \ne i_2, \ldots, i_r \ne i_{r+1}$. Here we view each $I$ as a directed graph of length $r+1$ with vertices $(i_s)_{0\le s\le r+1}$ and edges $e_I(l) = (i_l, i_{l+1})$ for $0 \le l \le r$. For any $I \in \mathcal I_r$, disregarding the direction, let $\Lambda^0_I$ be the collection of all $0 \le l \le r$ with $e_I(l) = e_I(0)$. Let $\mathcal I_r(s)$ be the collection of all graphs in $\mathcal I_r$ such that, disregarding the direction, there are exactly $s$ edges that appear once. Proposition 6.2. For any $I \in \mathcal I_r(s)$ and any $0 < t < 1$, the stated bound holds, where $\tilde E$ is the expectation conditionally on $u^0, Z$ and $C_s$ is a universal constant independent of $n$.
To prove this proposition, we first establish a key lemma. For any $0 \le b \le r+1$, let $\mathcal I_r(s,b)$ be the collection of all $I \in \mathcal I_r(s)$ with $|\Lambda^0_I| = b$. Note that when $b = 1$, the set $\mathcal I_r(s,b)$ is nonempty for all $1 \le s \le r+1$, and when $2 \le b \le r+1$, the set $\mathcal I_r(s,b)$ is nonempty only if $0 \le s \le r+1-b$. (6.10) Here, the $C_{r,b,s}$'s are universal constants independent of $n$.
Proof. For any graph $I \in \mathcal I_r(s,1)$, let $I'$ be the graph obtained by disregarding both the multiplicities and the directions of the edges; see Figure 1 for examples. Observe that there are at most $\lfloor (r+1-s)/2\rfloor$ edges of $I'$ that appear at least twice in $I$, so the total number of edges of $I'$ is at most $\lfloor (r+1-s)/2\rfloor + s$. This implies that the total number of vertices of $I'$ is at most $\lfloor (r+1-s)/2\rfloor + s + 1$, so there are at most $n^{\lfloor (r+1-s)/2\rfloor + s + 1}$ such $I'$. Since different $I$ can correspond to the same $I'$ (again we refer the reader to Figure 1 for examples), it remains to show that each $I'$ corresponds to at most a constant multiple (independent of $n$) of graphs $I$. First fix such a possible $I'$ and write $E(I')$ for the edge set of $I'$. Each edge of $I'$ may correspond to an edge of multiplicity 1 in $I$ or to an edge of multiplicity at least 2 in $I$ (ignoring directions). We choose $s$ edges in $I'$ corresponding to the multiplicity-1 edges of $I$; there are $\binom{|E(I')|}{s}$ ways to choose such $s$ edges. Now, for the remaining $|E(I')| - s$ edges, the multiplicities in $I$ are at least 2 and they add up to $r+1-s$. Counting the possibilities is equivalent to counting the ways to write $r+1-s$ as a sum of $|E(I')| - s$ integers that are at least 2, which in turn is bounded above by the number of ways to partition the integer $r+1-s$ into $|E(I')| - s$ positive parts, and it is well known that this number is $\binom{r-s}{|E(I')|-s-1}$. Moreover, each edge has two possible directions in $I$. Finally, each vertex in $I'$ can correspond to several $i_t$ ($0 \le t \le r+1$) in $I$ (see Figure 1). As there are at most $\lfloor (r+1-s)/2\rfloor + s + 1$ vertices in $I'$ and each vertex can correspond to at most $r+1$ of the $i_t$'s, there are at most $(r+1)^{\lfloor (r+1-s)/2\rfloor + s + 1}$ such correspondences. Therefore, the total number of $I$ that a given $I'$ can correspond to is bounded above by the product of these factors. This proves (6.9).
To prove (6.10), note that in this case the edge $e_I(0)$ has multiplicity $b$, and hence there are at most $\lfloor (r+1-b-s)/2\rfloor + 1$ edges of $I'$ that appear at least twice in $I$; the last $+1$ comes from $e_I(0)$. The remainder of the proof is similar to that of (6.9), and we omit the details. (Figure 1: after disregarding the multiplicities and directions, the displayed graphs correspond to the same $I'$, shown in (c), where the solid edges correspond to the edges in $I$ that appear only once and the dashed edges correspond to those in $I$ with multiplicity $\ge 2$.)
Proof of Proposition 6.2. Let $I \in \mathcal{I}_r(s)$ be fixed. Disregarding directions, let $\Lambda^1_I$ be the collection of all $l \notin \Lambda^0_I$ such that $e_I(l)$ appears exactly once in $I$, and let $\Lambda^2_I$ be the collection of all $l \notin \Lambda^0_I$ such that $e_I(l)$ appears more than once in $I$. In addition, for any $R \subseteq [r]$, define the restricted collections $\Lambda^0_I(R)$, $\Lambda^1_I(R)$, $\Lambda^2_I(R)$, and $\Lambda_I(R)$ accordingly. Note that inside the expectation, the two parentheses are independent of $A_{I,R}(t)$, and each term in the second parentheses appears only once in $I$. From these observations, we can apply Proposition 5.1 and Lemmas A.1, A.2, and A.3 in the Appendix to control $V_I(t)$. To see this, note that for any $0 < t < 1$ and $p \ge 1$, the relevant moments are bounded by some universal constant $C_p$ independent of $t$. Let $E_{I,R}$ be the expectation only with respect to $a_{e_I(l)}(t)$ for all $l \in \Lambda_I(R)$. Using these bounds and Lemmas A.1, A.2, and A.3, we get that this expectation is bounded above, up to an absolute constant independent of $I$ and $n$, by the quantity in (6.12). The summation in that bound is over all $\alpha := (\alpha_l)_{l \in \Lambda_I(R)} \in (\{0\} \cup \mathbb{N})^{s_0}$, where $|\alpha| := \sum_{l \in \Lambda_I(R)} \alpha_l$ and $\partial^{\alpha}_{I,R}$ denotes the partial derivative with respect to $x_{e_I(l)}$ of order $\alpha_l$ for all $l \in \Lambda_I(R)$. From this inequality, the bound (6.13) follows for some constants $C$ and $C'$ independent of $I$ and $n$, where the second inequality uses the Cauchy-Schwarz inequality, while the third uses the Hölder inequality and the bounds (2.1) and (6.11). The last inequality can be controlled further as follows. From the product rule, the derivative expands into a sum over all $\beta_l \in (\{0\} \cup \mathbb{N})^{r+2}$ satisfying $\sum_{\ell=0}^{r+1} \beta_{l,\ell} = \alpha_l$ for $l \in \Lambda_I(R)$, where $\partial^{\beta(\ell)}_I$ denotes the partial derivative $\partial^{\beta_{l,\ell}}_{x_{e_I(l)}}$ for all $l \in \Lambda_I(R)$. Using the Minkowski and Hölder inequalities then bounds $E\,\partial^{\alpha}_{I,R} U_I(A(t,\xi))$. From (6.5) and (6.6), note that any partial derivative of positive order of the $U^{[\ell]}$'s is uniformly bounded.
From Proposition 5.1, each term on the right-hand side is bounded accordingly. Plugging this inequality into (6.13), and noting that $Z$ is independent of $A$ and $G$, together with the bound (2.1), yields the desired estimate. Here, note also that $s$ is the number of edges of $I$ that are crossed exactly once, disregarding directions. This implies that $|\Lambda^1_I([r])| \ge s - 1$. From these, if $|\Lambda^0_I| \le 2$, then $|\Lambda^0_I(R)|$ can only be $0$ or $1$ for any $R \subseteq [r]$, and this implies the first assertion; if $|\Lambda^0_I| \ge 3$, then $|\Lambda^0_I(R)|$ could be larger than $2$ for some $R \subseteq [r]$, and hence the second assertion holds for some constant $C_s > 0$. This completes our proof.

Proof of Proposition 6.1: first moment
Recall (6.7). Our proof will be complete once we establish (6.14). In what follows, we let $C_{r,b,s}$ and $C'_{r,b,s}$ be absolute constants independent of $n$. From the first case of Proposition 6.2 and Lemma 6.3, the first two summations can be controlled by a bound of order $n^{-1-(r+1)/2}\,|\mathcal{I}_r(s,1)|$ for $0 \le s \le r+1-b$. Plugging these into (6.15) yields (6.14), and this completes our proof.

Proof of Proposition 6.1: second moment
Our approach is the same as that for the first moment. Set $\Psi_{k,n}(t) := \tilde{E}\,\Phi_{k,n}(A(t))^2$ for $0 < t < 1$.
To control this expectation, we again consider the derivative $\Psi'_{k,n}(t)$. As before, our goal is to show that it is bounded above, up to an absolute constant, by $n^{-1/2}$. To see this, recall from Section 6.2 that $\Phi_{k,n}(t)$ can be written as a summation in which each summand takes the form described there for some $0 \le r \le k-1$. In a similar manner, from (6.7) and (6.8), $E\Psi_{k,n}(t)$ can also be written as a sum of the analogous terms. Here, $V_{i,I}(t)$ is essentially the same as $V_I(t)$ (see (6.8)), except that it contains one extra term $U^{[r+2]}_i(A(t))$. In view of the proof of Proposition 6.2, an identical argument yields the corresponding bound for any $i \in [n]$ and $I \in \mathcal{I}_r(s)$, where $C_s$ is a universal constant independent of $n$ and $i$. Consequently, as in the proof of (6.14), the analogous bound with prefactor $n^{-1-(r+1)/2}$ holds for some constant $C$ independent of $n$ and $i$, and the same inequality remains valid in the present setting, which completes our proof.

Proof of Theorem 2.4
We establish the proof of Theorem 2.4. For notational convenience, if $a_n$ and $b_n$ are two random variables, we write $a_n \simeq b_n$ if $|a_n - b_n| \to 0$ in probability; if they are $n$-dimensional random vectors, this notation means that $\lim_{n\to\infty} \frac{1}{n}\|a_n - b_n\|_2^2 = 0$ in probability.
To begin with, recall from Lemma 3.6 and Theorem 3.2 that, due to the Lipschitz property of the functions $(F_k)_{k \ge 0}$, the AMP orbit defined in Definition 2.1 is uniformly square-integrable, and the average along the Gaussian AMP orbit concentrates with respect to $\tilde{E}$. In the setting of Definition 2.3, since both $f$ and its first-order derivatives are Lipschitz, an identical argument shows that $(v^{[k]})_{k \ge 0}$ is uniformly square-integrable and that, when $X = G$, its average along the orbit is self-averaging with respect to $\tilde{E}$. More precisely, the analogous concentration holds for any Lipschitz $\varphi \in C(\mathbb{R}^{k+1})$.

Proof. Consider the initialization $u^{[2]}(X) = f_0(u^{[0]}(X))$ and $u^{[3]}(X) = \hat{X}_n u^{[2]}(X) = \hat{X}_n f_0(u^{[0]}(X))$, and for $\ell \ge 1$, set $u^{[3\ell+2]}(X) := f_\ell\bigl(u^{[3\ell]}(X), u^{[3(\ell-1)]}(X), \ldots, u^{[3]}(X), u^{[0]}(X)\bigr)$. The main feature of this construction is that $u^{[3\ell]}(X) = v^{G,[\ell]}(X)$ for all $\ell \ge 0$. Note that $b^G_{\ell,j}$ depends only on $u_0$ and $Z$, and it is uniformly bounded. Although Definition 2.1 assumes that the $F_k$'s are nonrandom, with no essential changes to the proof, Theorem 2.2 indeed extends to randomized $F_k$'s that depend only on $u_0$ and $Z$, provided the Lipschitz constants of the $F_k$'s are bounded by constants independent of $u_0$ and $Z$. Hence, the assertion follows by applying Theorem 2.2.
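For illustration, the grouping of one AMP iteration into sub-steps (apply the nonlinearity, multiply by the matrix, subtract the Onsager correction) can be mimicked numerically. The following sketch is entirely our own illustrative choice of ingredients, not the paper's construction: Gaussian noise, $f = \tanh$ applied to the latest iterate only, and the usual tanh Onsager coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
A = rng.normal(size=(n, n)) / np.sqrt(n)
A = (A + A.T) / np.sqrt(2)          # symmetric Gaussian noise matrix

f = np.tanh                          # a Lipschitz nonlinearity (illustrative)

def amp_step(A, v_curr, v_prev):
    """One AMP iteration with its Onsager correction, written as the
    sub-steps of the embedding described in the proof sketch above."""
    u = f(v_curr)                        # nonlinearity step
    w = A @ u                            # matrix-multiplication step
    b = np.mean(1.0 - np.tanh(v_curr)**2)  # Onsager coefficient for f = tanh
    return w - b * f(v_prev)             # correction step

v_prev = np.zeros(n)
v_curr = rng.normal(size=n)              # illustrative initialization
for _ in range(5):
    v_curr, v_prev = amp_step(A, v_curr, v_prev), v_curr
```

The point of the sub-step view is only structural: each AMP step decomposes into Lipschitz maps of the orbit history, which is what allows the reduction to the generalized iteration of Definition 2.1.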
Proof. We argue by induction. Obviously the assertion is valid for $k = 0$. Assume that there exists some $k' \ge 0$ such that it is also valid for all $0 \le k \le k'$. From the triangle inequality, the difference splits into three terms. By the induction hypothesis, Lemma 3.5, and the fact that the first-order partial derivatives of the $f_j$'s are uniformly bounded, the first two terms, after dividing by $\sqrt{n}$, converge to zero in probability. As for the last term, note that (7.1) implies that $b^G_{k',j} \simeq b_{k',j}(G)$ for all $1 \le j \le k'$. Also, from the relation $u^{[3\ell]} = v^{G,[\ell]}$ in the proof of Lemma 7.1, the Lipschitz property of $f_{j-1}$ and Lemmas 3.5 and 3.6 (again, Lemma 3.6 remains valid despite the fact that the $F_k$'s depend on $Z$ and $u_0$) yield the desired estimate. Hence, the third term also vanishes in probability, and this validates the announced result.
We are ready to prove Theorem 2.4, namely, that for any $k \ge 0$ and Lipschitz $\varphi \in C(\mathbb{R}^{k+1})$, the convergence (7.3) holds. We argue by induction. Evidently this is valid for $k = 0$. Assume that there exists some $k' \ge 0$ such that it is also valid for all $0 \le k \le k'$. From Lemmas 7.1 and 7.2, we obtain the corresponding approximation at step $k'+1$; if this is valid, then (7.3) is also true for $k'+1$, and this would complete our proof. It is easy to see that this is valid if $k = 0$. Assume that there exists some $0 \le k'' \le k'$ such that this equation holds for all $0 \le k \le k''$. Here, from the induction hypothesis, after dividing by $\sqrt{n}$, the first two lines vanish in probability by using the fact that $b_{k',j}$ is uniformly bounded and Lemma 3.5. As for the

Proof of Theorem 2.7
We establish the proof of Theorem 2.7. Our strategy is to approximate the principal eigenvector by the power method; in view of this, the iteration is essentially a special case of the generalized AMP in Definition 2.1. Once this is done, universality follows by an argument analogous to that for Theorem 2.4. Again, we adopt the notation $a_n \simeq b_n$ from Section 7.
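As a concrete reminder of the approximation scheme, here is a minimal power-method sketch. It is illustrative only (the paper's argument runs the iteration inside the AMP framework, not as a standalone routine):

```python
import numpy as np

def power_method(M, num_iters=200, seed=0):
    """Approximate the principal (top) eigenvector of a symmetric matrix
    by repeated multiplication and normalization."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

# sanity check on a small symmetric matrix:
# eigenvalues of M are 3 and 1, with top eigenvector proportional to (1, 1)
M = np.array([[2.0, 1.0], [1.0, 2.0]])
v = power_method(M)
assert abs(abs(v @ np.array([1.0, 1.0]) / np.sqrt(2)) - 1.0) < 1e-8
```

Each power-method step is a Lipschitz map of the current iterate on the relevant domain, which is what lets the iteration be cast as an instance of the generalized AMP orbit.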
If $\alpha_j = 0$ for some $2 \le j \le s$, then this expectation is equal to zero. If $\alpha_j > 0$ for all $2 \le j \le s$, then $\alpha_1 = 0$ or $\alpha_1 = 1$. In the former case, the expectation vanishes; in the latter case, the expectation is again equal to zero since $E\,a_1(t)\dot{a}(t) = 0$ due to (A.2).