Eigenvalue distribution of nonlinear models of random matrices

This paper is concerned with the asymptotic empirical eigenvalue distribution of a nonlinear random matrix ensemble. More precisely, we consider $M= \frac{1}{m} YY^*$ with $Y=f(WX)$ where $W$ and $X$ are random rectangular matrices with i.i.d. centered entries. The function $f$ is applied pointwise and can be seen as an activation function in (random) neural networks. We compute the asymptotic empirical distribution of this ensemble in the case where $W$ and $X$ have sub-Gaussian tails and $f$ is real analytic. This extends a previous result in which the case of Gaussian matrices $W$ and $X$ is considered. We also investigate the same questions in the multi-layer case, with neural network applications in mind.


Introduction
Machine learning has shone through a long list of successful applications over the past five years or so (see for instance applications in image or speech recognition [KSH+12, HDY+12] or translation [WSC+16]), and it is now also used in video, style transfer, dialogue, games and countless other areas. The interested reader may consult [Sch15] for an overview of the subject. However, a complete theoretical and mathematical understanding of learning is still missing. The main difficulty comes from the complexity of studying highly non-convex functions of a very large number of parameters [CHM+15, PB17]. We also refer to [CCD+19] for a comprehensive exposition of the problem we are interested in.
An artificial neural network can be modeled as follows: an input column vector $x \in \mathbb{R}^{n_0}$ goes through a multistage architecture of alternated layers combining linear and nonlinear maps. Let $g_i : \mathbb{R} \to \mathbb{R}$, $i = 1, \ldots, L$, be some given activation functions and let $W_i$, $i = 1, \ldots, L$, be $n_i \times n_{i-1}$ matrices. The output vector after layer $L$ is
$$x^{(L)} = g_L\big(W_L\, g_{L-1}\big(W_{L-1} \cdots g_1(W_1 x)\big)\big). \qquad (1.1)$$
The functions $g_i$ are here applied componentwise. The matrices $W_i$ are the (synaptic) weights in layer $i$ and the activation function $g_i$ models the impact of the neurons in the architecture. There are different possible choices for the activation functions: some notable examples are $g(x) = \max(0, x)$ (known as the ReLU activation function, for Rectified Linear Unit) or the sigmoid function $g(x) = (1 + e^{-x})^{-1}$. The parameter $L$ is called the depth of the neural network. This depth is important with respect to the question of machine learning in artificial neural networks: we hereafter introduce the problem of learning in this context. We also refer the interested reader to [LBH15] for more information on the development of deep learning, i.e. the case $L > 1$.
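The layered map (1.1) can be sketched numerically. The code below is a minimal illustration only: the layer sizes, random weights and the particular mix of activations are arbitrary choices for the example, not taken from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, activations):
    """Alternate the linear maps W_i and the entrywise activations g_i, as in (1.1)."""
    for W, g in zip(weights, activations):
        x = g(W @ x)
    return x

rng = np.random.default_rng(0)
sizes = [6, 5, 4, 3]  # illustrative n_0, n_1, n_2, n_3
Ws = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]
x0 = rng.standard_normal(sizes[0])
out = forward(x0, Ws, [relu, relu, sigmoid])
print(out.shape)  # (3,)
```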
Generally in supervised machine learning, one is given an $n_0 \times m$ data matrix $X$ together with a target dataset $Z$ of size $d \times m$. The parameter $m$ is here the sample size. For instance, the $n_0$-dimensional column vectors of $X$ encode the (pixels of) photographs of cats and $Z$ is the $m$-sample of $d$ possible breeds of cats. The aim of supervised learning is to determine a function $h$ so that, given a new photo $x$, the output $h(x)$ yields an acceptable approximation of the target (true) object (that is, the breed in our example). The parameters to be learned are here the weight matrices. The error made when performing such an approximation is measured through a loss function. In the context of Feed Forward Neural Networks as in (1.1), when the input vector is high dimensional and the sample size is comparably large, one of the commonly used learning methods is ridge regression (and $h$ is linear in $Y$). More precisely, in the one layer case ($L = 1$) the loss function is the ridge-regularized quadratic loss, where $\gamma$ is the learning rate or penalizing parameter. The optimal matrix $B$ can then be proved to be proportional to $Y Q Z^*$ where $Y = g_1(W_1 X)$ and
$$Q = \Big(\frac{1}{m} Y^* Y + \gamma I_m\Big)^{-1}. \qquad (1.2)$$
As a consequence, the performance of this learning procedure can be measured thanks to the asymptotic spectral properties of the matrix $\frac{1}{m} Y^* Y$. Indeed, for the one layer case, the expected training loss can be proved to be related to the asymptotic e.e.d. (and its Stieltjes transform): it can be expressed in terms of the matrix $Q$ given by (1.2), where $\mathrm{Tr}$ denotes the unnormalized trace.
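Since the exact display of the loss is not reproduced here, the following sketch assumes the standard ridge loss $\frac{1}{m}\|Z - BY\|_F^2 + \gamma\|B\|_F^2$ and shows why a resolvent of $\frac{1}{m}Y^*Y$ appears in the minimizer: the two closed forms below coincide by the push-through identity.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, m, d, gamma = 40, 60, 3, 0.1  # illustrative sizes and penalty
Y = rng.standard_normal((n1, m))  # features after the layer
Z = rng.standard_normal((d, m))   # targets

# "Sample" form of the ridge minimizer, involving Q = ((1/m) Y^T Y + gamma I_m)^{-1}
Q = np.linalg.inv(Y.T @ Y / m + gamma * np.eye(m))
B_sample = (Z @ Q @ Y.T) / m

# Equivalent "feature" form, via the push-through identity
# ((1/m) Y^T Y + gamma I)^{-1} Y^T = Y^T ((1/m) Y Y^T + gamma I)^{-1}
B_feature = (Z @ Y.T / m) @ np.linalg.inv(Y @ Y.T / m + gamma * np.eye(n1))

print(np.allclose(B_sample, B_feature))  # True
```

The sample form makes clear that the training error only involves the data through the spectrum of $\frac{1}{m}Y^*Y$.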
A possible idea to better understand such large complex systems is to approximate the elements of the system by random variables, as is done in statistical physics and thermodynamics. This is where random matrix theory can, in principle, bring in its techniques. Random matrix theory has already proved useful in machine learning. In [GSB16] for instance, neural networks with random Gaussian weights have been studied with practical applications in mind, while eigenvalues of non-Hermitian matrices were used to understand neural networks in [RA06]. See also [ZMW12], which studies echo state networks used to model nonlinear dynamical systems. In [BGC16, CBG16], a random matrix approach has been used for a theoretical study of spectral clustering by looking at the Gram matrix $WW^*$ where the columns of $W$ are given by random vectors. They compute the asymptotic deterministic empirical distribution of this matrix, which allows the analysis of the spectral clustering algorithm in large dimensions. Nonlinear random matrix models have also been studied, e.g., in [EK10].
We are here interested in random neural networks where both the number of samples $m$ and the number of parameters $n_0$ are large. We consider rectangular matrices of size $n_0 \times m$ in the regime where $n_0/m$ goes to some constant $\phi$ as the dimension grows to infinity. The study of such matrix models for random neural networks was first accomplished in [LLC18, PW17], where one considers the matrix $Y$ with entries
$$Y_{ij} = f\Big(\frac{(WX)_{ij}}{\sqrt{n_0}}\Big) \quad \text{for } 1 \le i \le n_1,\ 1 \le j \le m.$$
In the above equation $f$ is a nonlinear activation function, $W$ is the $n_1 \times n_0$ matrix corresponding to the weights and $X$ the $n_0 \times m$ matrix of the data. There are several possibilities to incorporate randomness in this model. In [LLC18], the authors consider random weights with deterministic data $X$. The weights are given by functions of Gaussian random variables and the asymptotic eigenvalue distribution of $M$ is studied thanks to concentration inequalities in the case where the function $f$ is Lipschitz continuous. They prove that the eigenvalue distribution corresponds to that of a (usual) sample covariance matrix $\frac{1}{m} T^* X^* X T$ with population covariance $T^* T$, as studied in [SB95]. Thus the nonlinearity coming from applying the function $f$ entrywise is rather mysteriously hidden in the asymptotic empirical eigenvalue distribution. However, there is a major difference from a usual sample covariance matrix ensemble, namely the non-universality of the eigenvalue distribution (as $M$ depends on the distribution of $W$ beyond its first two moments). The authors of [LLC18] use this equation to study the effect of the fourth moments of the distribution on the efficiency of the neural networks. The general approach based on concentration arguments that they develop is detailed in the recent preprint [LC18].
On the other side, and this is the scope of this article, [PW17] considers the case where both matrices $W$ and $X$ are random: both are chosen to be independent random matrices with normalized Gaussian entries. Interestingly, they derive (using Gaussian integration and a saddle point argument) a fixed point equation for the Stieltjes transform of the asymptotic e.e.d., which is a quartic equation. This equation will be recalled in our main Theorem 2.3 below. Before discussing our result, one may note that for special values of the parameters the quartic equation specializes to the Marčenko–Pastur equation for the Stieltjes transform. Thus there exists a class of functions such that the nonlinear matrix model has the same limiting e.e.d. as that of Wishart matrices. It was then conjectured in [PW17] that choosing such an activation function could speed up training through the network. The equation also becomes cubic when the function $f$ is linear, and then corresponds to the product Wishart matrix. The limiting e.e.d. of such matrices, known as the Fuss–Catalan or Raney distribution, has been computed in [PZ11, DC14, FL15]. We refer the reader to Sections 4 and 5 of [PW17] for a more detailed discussion of machine learning applications of such a result. In particular, [PW17] uses this equation to facilitate the choice of the activation function, a problem which has a crucial impact on the training procedure. In [HDR19], the choice of function was studied for random neural networks after going through a large number of layers. It is thus of particular interest to consider the multi-layer case, due to the potential application to Feed Forward Neural Networks. This is done in [PW17], where they conjecture that the Marčenko–Pastur distribution is invariant through multiple layers for some appropriate activation functions $f$.
Interestingly for practical applications, one can then measure the performance of the network (and of its depth) using the shape of the Marčenko–Pastur distribution. For linear models, one may note that the multilayer case corresponds to the perhaps simpler setting of products of random matrices. We refer the reader to [KZ14, CKW15, AB12, AIK13, PZ11] for products of complex Ginibre matrices and to [HN18], where a large product of large random matrices is considered.
The scope of this paper is both theoretical and practical: we aim to study the asymptotic e.e.d. of such nonlinear models of random matrices $f(WX)$, where $f$ is applied entrywise, and to extend the result established in [PW17] to non-Gaussian matrices. In particular, the question of universality of the limiting e.e.d. is of interest here, as initial weights can be chosen to be non-Gaussian (a typical example is the uniform distribution, as in [GB10]). In this setting, our result has to be compared to that of [EK10], where some kernel matrices are investigated (with another universal limiting empirical eigenvalue distribution). There is also some practical interest, as this provides an easy way to compare different possible activation functions for a certain class of distributions of both weights and data. For practical purposes, we also investigate the multilayer case $Y^{(\ell)} = f\big(W^{(\ell-1)} Y^{(\ell-1)}\big)$ for $\ell = 1, \ldots, L$ with $L$ fixed, and again study the asymptotic empirical eigenvalue distribution for a class of activation functions. This gives one of the few theoretical results on multilayer nonlinear random neural networks and confirms the prediction made in [PW17]. From a theoretical point of view, such a study is also of interest in random matrix theory itself, as it introduces a new class of ensembles of random matrices as well as a new class for universality.
Acknowledgments. The authors would like to thank D. Schröder and Z. Fan for pointing out errors in a previous version of the article as well as anonymous referees for helpful suggestions on how to improve the present paper.

Model and results
Consider a random matrix $X \in \mathbb{R}^{n_0 \times m}$ with i.i.d. entries with distribution $\nu_1$. Let also $W \in \mathbb{R}^{n_1 \times n_0}$ be a random matrix with i.i.d. entries with distribution $\nu_2$; $W$ is called the weight matrix. Both distributions are centered, and we denote their variances by
$$\sigma_x^2 = \int x^2\, d\nu_1(x), \qquad \sigma_w^2 = \int x^2\, d\nu_2(x). \qquad (2.1)$$
We also need the following assumption on the tails of $W$ and $X$: there exist constants $\vartheta_w, \vartheta_x > 0$ and $\alpha > 1$ such that for any $t > 0$ we have
$$\nu_1\big(\{x : |x| > t\}\big) \le e^{-\vartheta_x t^{\alpha}}, \qquad \nu_2\big(\{x : |x| > t\}\big) \le e^{-\vartheta_w t^{\alpha}}. \qquad (2.2)$$
Note that the above implies the existence of a constant $C > 0$ controlling the growth of all moments of the entries. We now consider a smooth function $f : \mathbb{R} \to \mathbb{R}$ with zero Gaussian mean, in the sense that
$$\int f(\sigma_w \sigma_x x)\, \frac{e^{-x^2/2}}{\sqrt{2\pi}}\, dx = 0. \qquad (2.4)$$
As an additional assumption, we also suppose that there exist positive constants $C_f$ and $c_f$ and $A_0 > 0$ such that for any $A \ge A_0$ and any $n \in \mathbb{N}$ the derivatives of $f$ satisfy the growth bound (2.5).
Remark 2.1. Condition (2.5) guarantees that the function is real analytic, which may be seen as a strong restriction. However, commonly used activation functions fall within the scope of this paper, such as the sigmoid $f(x) = (1 + e^{-x})^{-1}$, $f(x) = \tanh x$, or the softplus $f(x) = \beta^{-1} \log(1 + e^{\beta x})$, i.e. a smooth variant of the ReLU. The extension to more general (non-analytic) functions $f$ is the object of current research.
We consider the random matrix
$$M = \frac{1}{m} YY^*, \qquad Y = f\Big(\frac{WX}{\sqrt{n_0}}\Big), \qquad (2.6)$$
where $f$ is applied entrywise. We suppose that the dimensions of both the columns and the rows of each matrix grow together, in the following sense: there exist positive constants $\phi$ and $\psi$ such that
$$\frac{n_0}{m} \longrightarrow \phi, \qquad \frac{n_0}{n_1} \longrightarrow \psi.$$
Denote by $(\lambda_1, \ldots, \lambda_{n_1})$ the eigenvalues of $M$ given by (2.6) and define its e.e.d. by
$$\mu_{n_1} = \frac{1}{n_1} \sum_{i=1}^{n_1} \delta_{\lambda_i}. \qquad (2.7)$$
Theorem 2.2. There exists a deterministic compactly supported measure $\mu$ such that $\mu_{n_1}$ converges weakly to $\mu$ almost surely. Similarly, we denote by $(\lambda_1, \ldots, \lambda_{n_1}, 0, \ldots, 0)$ the eigenvalues of $\frac{1}{m} Y^* Y$ (note that $m - n_1$ of these eigenvalues are necessarily null). We denote by $\tilde{\mu}_m$ its e.e.d. and by $\tilde{\mu}$ its limit.
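The convergence in Theorem 2.2 is easy to probe by Monte Carlo. The sketch below uses illustrative sizes and $f = \tanh$ (odd, so the Gaussian centering condition holds automatically) with unit-variance entries; the comment on $\theta_1$ is a heuristic, assuming the definition $\theta_1(f) = \mathbb{E}[f(Z)^2]$ discussed below.

```python
import numpy as np

rng = np.random.default_rng(2)
n0, n1, m = 600, 400, 900  # illustrative; phi ~ n0/m, psi ~ n0/n1
W = rng.standard_normal((n1, n0))
X = rng.standard_normal((n0, m))

Y = np.tanh(W @ X / np.sqrt(n0))  # tanh is odd, so the Gaussian mean (2.4) vanishes
M = Y @ Y.T / m
eigs = np.linalg.eigvalsh(M)

# (1/n1) Tr M = (1/(n1 m)) sum f(.)^2 heuristically concentrates near E[tanh(Z)^2]
print(round(eigs.mean(), 2))
```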
The moments of the asymptotic empirical eigenvalue distribution depend on the following two parameters of the function $f$, defined through Gaussian integrals and denoted $\theta_1(f)$ and $\theta_2(f)$ (2.8). We also define the following Stieltjes transforms: for $z \in \mathbb{C} \setminus \mathbb{R}$, we set
$$G(z) = \int \frac{d\mu(x)}{z - x}, \qquad \tilde{G}(z) = \int \frac{d\tilde{\mu}(x)}{z - x}.$$
Theorem 2.3. The Stieltjes transform $G$ of the measure $\mu$ satisfies a fixed point equation, quartic in $G$, whose coefficients involve $\phi$, $\psi$ and the parameters $\theta_1(f)$ and $\theta_2(f)$ defined in (2.8).
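As the explicit display (2.8) is not reproduced above, the following sketch assumes the normalization $\sigma_w = \sigma_x = 1$ and the Pennington–Worah-type definitions $\theta_1(f) = \mathbb{E}[f(Z)^2]$ and $\theta_2(f) = (\mathbb{E}[Z f(Z)])^2$ with $Z \sim \mathcal{N}(0,1)$; both Gaussian integrals can be evaluated by Gauss–Hermite quadrature.

```python
import numpy as np

# Gauss-Hermite nodes/weights for the weight exp(-x^2); substitute z = sqrt(2) x
# to compute E[g(Z)] for Z ~ N(0, 1).
nodes, weights = np.polynomial.hermite.hermgauss(80)

def gauss_mean(g):
    return (weights * g(np.sqrt(2.0) * nodes)).sum() / np.sqrt(np.pi)

# Assumed definitions (sigma_w = sigma_x = 1):
def theta1(f):
    return gauss_mean(lambda z: f(z) ** 2)

def theta2(f):
    return gauss_mean(lambda z: z * f(z)) ** 2

print(theta1(np.tanh), theta2(np.tanh))           # both strictly positive
print(theta2(lambda z: (z**2 - 1) / np.sqrt(2)))  # an even function has theta_2 = 0
```

Functions with $\theta_2(f) = 0$ are exactly the ones for which the Marčenko–Pastur behavior discussed below is expected.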
Remark 2.4. Assumption (2.4) is not really needed for this result to hold: the asymptotic e.e.d. is unchanged if $Y$ is shifted by a rank one matrix, but this centering is the correct one to avoid a very large eigenvalue.
In the case where $\theta_2(f) = 0$ and $\theta_1(f) = \phi = \psi = 1$, the limiting measure $\mu$ is the Marčenko–Pastur distribution (with shape parameter 1). In the general case, the above fourth-order equation admits two pairs of complex conjugate solutions. In the companion article [Péc19] (see Theorem 1.4), it is shown that $\mu$ is actually the limiting e.e.d. of an information-plus-noise sample covariance matrix.
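For reference, the Marčenko–Pastur density appearing in this special case can be evaluated in closed form. The sketch below (shape parameter chosen arbitrarily) checks its total mass and its mean numerically.

```python
import numpy as np

def mp_density(x, lam):
    """Marchenko-Pastur density with shape parameter 0 < lam <= 1 (unit variance)."""
    lo, hi = (1 - np.sqrt(lam)) ** 2, (1 + np.sqrt(lam)) ** 2
    rho = np.zeros_like(x)
    inside = (x > lo) & (x < hi)
    rho[inside] = np.sqrt((hi - x[inside]) * (x[inside] - lo)) / (2 * np.pi * lam * x[inside])
    return rho

def trapezoid(y, x):
    # simple trapezoidal rule, to stay compatible across NumPy versions
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

lam = 0.5  # illustrative shape parameter
lo, hi = (1 - np.sqrt(lam)) ** 2, (1 + np.sqrt(lam)) ** 2
xs = np.linspace(lo, hi, 200001)
rho = mp_density(xs, lam)
mass = trapezoid(rho, xs)
mean = trapezoid(xs * rho, xs)
print(round(mass, 3), round(mean, 3))  # both close to 1.0
```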
Remark 2.5. Our proof is based on a method of moments, as in [ZM19], to recover the self-consistent equation for the Stieltjes transform. Our analysis is actually strong enough to obtain convergence of the largest eigenvalue to the edge of the support when the function $f$ is odd, by considering moments of order larger than $\log n_1$. The behavior of the largest eigenvalue in the general case is the object of a forthcoming article.
Figure 1: Eigenvalues of $M$ for different activation functions. Note that every function displayed here is scaled so that $\theta_1(f) = 1$ and centered so that there is no very large eigenvalue. For the two bottom figures, we have $\theta_2(f) = 0$ and the Marčenko–Pastur density of shape parameter $\phi/\psi$ is plotted in red.
The model given by (2.6) consists in passing the input data through one layer of a neural network, as we apply the function $f$ a single time. However, we could reinsert the output data through the network again, thus stacking layers. It was conjectured in [PW17] that for activation functions such that $\theta_2(f) = 0$ the limiting e.e.d. is invariant and given by the Marčenko–Pastur distribution at each layer. We prove this statement in Theorem 2.6 below. We denote by $L$ the number of layers and consider, for $p \in [[0, L-1]]$, a family of independent matrices $W^{(p)} \in \mathbb{R}^{n_{p+1} \times n_p}$, where $(n_p)_p$ is a family of growing sequences of integers such that there exist $(\phi_p)_p$ and $(\psi_p)_p$ with
$$\frac{n_p}{m} \longrightarrow \phi_p, \qquad \frac{n_p}{n_{p+1}} \longrightarrow \psi_p.$$
We suppose that all the matrix entries $(W^{(p)}_{ij})$ are i.i.d. centered random variables satisfying the assumptions above. Consider also $X \in \mathbb{R}^{n_0 \times m}$ with i.i.d. entries of variance $\sigma_x^2$ and define the sequence of random matrices
$$Y^{(0)} = X, \qquad Y^{(p)} = f\Big(\frac{W^{(p-1)} Y^{(p-1)}}{\sqrt{n_{p-1}}}\Big), \quad p = 1, \ldots, L.$$
The scaling is here chosen to normalize the variance of the entries of $Y^{(p)}$ at every layer. This normalization (together with a centering) is known as batch normalization and has been shown to improve the training speed [IS15]. Only the centering (2.4) is important here. Now one can define $M^{(L)} = \frac{1}{m} Y^{(L)} Y^{(L)*}$ and its e.e.d. $\mu^{(L)}_{n_L}$, where $(\lambda^{(L)}_k)$ are the eigenvalues of $M^{(L)}$. We then prove the following theorem, under the additional assumption that the function $f$ is bounded.
Theorem 2.6. Let $L$ be a given integer. Suppose that $f$ is a bounded analytic function such that (2.4) and (2.5) hold. In the case where $\theta_2(f) = 0$, the asymptotic e.e.d. of $\mu^{(L)}_{n_L}$ is given almost surely by the Marčenko–Pastur distribution of shape parameter $\frac{\phi}{\psi_0 \psi_1 \cdots \psi_{L-1}}$.
In particular, the above result implies that the spectrum of the matrix $\tilde{M}^{(L)} = \frac{1}{n_L} Y^{(L)*} Y^{(L)}$ is less and less spread out as $L$ grows. This may be of importance when one has to determine the parameter $\gamma$ so as to minimize the loss in the testing phase. Unfortunately, our result does not cover the case where the number of layers also grows to infinity.
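The invariance claimed in Theorem 2.6 can be illustrated numerically. The sketch below uses $f(x) = (\cos x - e^{-1/2})/\sqrt{\mathrm{Var}(\cos Z)}$, a bounded, centered, even function, so that $\mathbb{E}[Z f(Z)] = 0$ and hence $\theta_2(f) = 0$ (assuming the Pennington–Worah-type definition $\theta_2(f) = (\mathbb{E}[Z f(Z)])^2$ with $Z$ standard Gaussian); sizes are illustrative. Note that the limiting shape parameter $\phi/(\psi_0 \cdots \psi_{L-1})$ simplifies to $n_L/m$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Bounded, centered, even activation, rescaled so that E[f(Z)^2] = 1:
c = np.exp(-0.5)                              # E[cos Z], Z ~ N(0, 1)
v = (1 + np.exp(-2.0)) / 2 - np.exp(-1.0)     # Var(cos Z)
f = lambda x: (np.cos(x) - c) / np.sqrt(v)

m, widths = 1200, [600, 500, 400]  # n_0, n_1, n_2 for L = 2 layers
Y = rng.standard_normal((widths[0], m))       # plays the role of X
for p in range(1, len(widths)):
    W = rng.standard_normal((widths[p], widths[p - 1]))
    Y = f(W @ Y / np.sqrt(widths[p - 1]))     # entry variance stays ~1

M = Y @ Y.T / m
eigs = np.linalg.eigvalsh(M)
shape = widths[-1] / m                        # = n_L / m, the MP shape parameter
edge_hi = (1 + np.sqrt(shape)) ** 2
print(round(eigs.mean(), 2), round(eigs.max(), 2), round(edge_hi, 2))
```

At these sizes the top of the spectrum should already sit near the Marčenko–Pastur edge, up to finite-size fluctuations.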
Remark 2.7. The model we present here for several layers can be thought of as a theoretical toy model since, in practice, weights would be updated along the neural network (using gradient descent for instance) and the independence assumption on the weights would not hold. However, the correlation induced by updating the weights does not yet seem tractable from a random matrix theory point of view.
The next section is dedicated to proving Theorem 2.2 for polynomial activation functions using the moment method. Our choice is motivated by the fact that [PW17] introduces a family of graphs to describe the asymptotic e.e.d. of Gaussian nonlinear random matrices, which we want to understand in greater generality. Thus the main part of the article has some combinatorial aspects, and we believe that this point of view can give some insight into the study of these matrix models, since it relates analytic objects such as $\theta_1$ or $\theta_2$ to combinatorial ones. Our approach is very similar to that of [ZM19], and so are the results. In Section 4, we generalize the result to other functions by using a polynomial approximation. Finally, in Section 5 we first give a combinatorial description of the multilayer case for polynomials and then prove Theorem 2.6.

Limiting e.e.d. when f is a polynomial
The point of this section is to compute the moments of the empirical eigenvalue distribution of the matrix $M$ when the activation function is a polynomial. The following statement gives the expected moments of the distribution in this case using a graph enumeration. Before stating the result, we need the following definition.
Definition 3.1. Let $q \ge 1$ be a given integer. A coincidence graph is a connected graph built up from the simple (bipartite) cycle on the vertices labeled $i_1, j_1, i_2, \ldots, i_q, j_q$ (in this order) by identifying some of the $i$-indices with each other and some of the $j$-indices with each other. Such a graph is admissible if the cycles thus formed are joined to one another at most at a common vertex and each edge belongs to a unique cycle.
Remark 3.2. In the following, the edges and vertices of such an admissible graph are colored red.
Remark 3.3. An admissible graph has $2q$ edges. It can also be seen as a tree of cycles (simply replacing cycles by edges), also called a cactus graph. These graphs also appear in random matrix theory in the so-called theory of traffics, when expanding injective traces (see [CDM16] e.g.).
The basic admissible graph is given by the simple cycle (left figure in Figure 2), whose associated tree is a simple edge. The two right figures show a tree and one admissible graph associated to that tree: note that the points $i_1$ and $j_1$ where cycles are glued to each other are not determined by the tree, nor are the lengths of the cycles. We can now state the following theorem, in which $\theta_1$ and $\theta_2$ are defined in (2.8).
The degree $K$ of $f$ can grow with $n_1$, but we suppose that it satisfies the growth condition (3.1). Let $\mu^{(f)}_{n_1}$ be defined as in (2.7) and consider its expected moments
$$\mathbb{E}\Big[\int x^q\, d\mu^{(f)}_{n_1}(x)\Big] = \frac{1}{n_1}\, \mathbb{E}\big[\mathrm{Tr}\, M^q\big].$$
We then have the asymptotics (3.2). Note that in this theorem we allow the degree $K$ of the polynomial to grow with $n_1$ as in (3.1), but the theorem holds true for any fixed integer $q$ (independent of $n$). It is possible to improve assumption (3.1), in the sense that $K$ could grow faster with $n_1$. However, this bound is enough for the polynomial approximation we need later (using a Taylor approximation of the function $f$). The proof of the above theorem relies on combinatorial arguments which we now develop.
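The universality underlying Theorem 2.2, namely that the expected moments do not depend on the entry distributions beyond their variance (under the tail assumptions), can be probed numerically. The sketch below compares Monte Carlo estimates of the first moments for Gaussian entries and for centered uniform entries of the same variance, with illustrative sizes and $f = \tanh$; it is of course no substitute for the proof.

```python
import numpy as np

def moment_estimate(q, sampler, reps=4, n0=300, n1=200, m=450, seed=0):
    """Monte Carlo estimate of (1/n1) E[Tr M^q] for f = tanh."""
    rng = np.random.default_rng(seed)
    acc = 0.0
    for _ in range(reps):
        W = sampler(rng, (n1, n0))
        X = sampler(rng, (n0, m))
        Y = np.tanh(W @ X / np.sqrt(n0))
        M = Y @ Y.T / m
        acc += np.trace(np.linalg.matrix_power(M, q)) / n1
    return acc / reps

gauss = lambda rng, s: rng.standard_normal(s)
# Uniform on [-sqrt(3), sqrt(3)]: centered, unit variance, compactly supported
unif = lambda rng, s: rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), s)

for q in (1, 2, 3):
    print(q, round(moment_estimate(q, gauss), 3), round(moment_estimate(q, unif), 3))
```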

Proof of Theorem 3.5 when f is a monomial of odd degree:
We first consider the case where $f(x) = \frac{x^k}{k!}$ for an odd integer $k$. We first assume that the entries of $W$ and $X$ are bounded, in the following sense: there exists $A > 0$ such that almost surely $|W_{ij}| \le A$ and $|X_{ij}| \le A$.

Basic definitions
For this activation function, the entries of $Y = f(WX/\sqrt{n_0})$ are of the form
$$Y_{ij} = \frac{1}{k!} \Big(\frac{1}{\sqrt{n_0}} \sum_{l=1}^{n_0} W_{il} X_{lj}\Big)^k. \qquad (3.3)$$
We want to study the normalized tracial moments of the matrix $M$. Thus we want to consider, for a positive integer $q$,
$$\frac{1}{n_1}\, \mathbb{E}\big[\mathrm{Tr}\, M^q\big] = \frac{1}{n_1 m^q} \sum_{\substack{i_1, \ldots, i_q \\ j_1, \ldots, j_q}} \mathbb{E}\big[Y_{i_1 j_1} Y_{i_2 j_1} Y_{i_2 j_2} \cdots Y_{i_1 j_q}\big]. \qquad (3.4)$$
We first encode each of the summands in (3.4) as a coincidence graph (not necessarily admissible) by simply marking the coinciding indices in the summand. Then, injecting (3.3) into the previous equation, we obtain the development (3.5). To take the $l$-indices into account, we now add to the red graph $2kq$ blue vertices. We can represent the vertices in a graph such as in Figure 3a. We call a red edge a niche. Each niche is decorated by $k$ blue vertices, from which leave blue edges corresponding to the factors $W_{il}$ and $X_{lj}$ in (3.5). Since the entries of $W$ and $X$ are centered and independent, each such entry has to arise at least twice in the summand in equation (3.5). Thus, to compute the spectral moment, one needs to match the blue edges so that each entry arises with multiplicity at least 2. The matching of indices in (3.5) corresponds to a matching of the blue vertices. Then the main contribution shall come from those summands maximizing the number of pairwise distinct indices.

The simplest admissible graph: a cycle of length 2q
In this subsection, we assume that the $i$- and $j$-indices are pairwise distinct and consider the associated contribution to the spectral moment, which we denote by $E_q(k)$. We show the following lemma.
Lemma 3.6. One has the asymptotics (3.6) below for $E_q(k)$.
Proof of Lemma 3.6: Because the $i$- and $j$-indices are pairwise distinct, the associated red graph is the simple cycle of length $2q$. Thus we can encode the products in the summand as in the left case of Figure 3. Since each matrix entry has to arise twice, consider for instance $W_{i_1 l^1_1}$: it needs to occur at least one other time in the product. There are two different ways this can happen: (i) there exists $p \in \{2, \ldots, k\}$ such that $l^1_p = l^1_1$; (ii) there exists $p \in \{1, \ldots, k\}$ such that $l^{2q}_p = l^1_1$. Applying the same reasoning to $X_{l^1_1 j_1}$, there exists $p \in \{1, \ldots, k\}$ such that $l^2_p = l^1_1$. The same reasoning applies to each niche. Now, in order to maximize the number of pairwise distinct indices, one has to perform as many perfect matchings as possible inside each niche. Note that, as $k$ is odd, case (ii) necessarily occurs.
We first consider the contribution of those decorated graphs maximizing the number of pairwise distinct indices.
The case where $q > 1$: in this case, there is a blue cycle of size $2q$ as in Figure 3a. Thus we can construct the decorated graphs maximizing the number of pairwise distinct indices in the following way: one chooses an index $p$ in each niche which lies in the only blue cycle of the graph, and then the remaining blue edges are perfectly matched inside niches. Since every entry occurs exactly twice in the products, the corresponding contribution from the basic cycle to the moment follows from (2.1). To obtain this formula, note that we choose the $i$-labels among $n_1$ possible indices and the $j$-labels among $m$ indices. We also choose the $l$-labels among $n_0$: the one for the blue cycle and those corresponding to matched edges. Finally, we have to fix the blue vertices belonging to the blue cycle: there are $k^{2q}$ possible choices. The number of perfect matchings on the rest of the vertices in each niche is then equal to $((k-1)!!)^{2q}$. We then obtain the formula (3.6) for $E_q(k)$ which, by (2.8), can be rewritten in terms of $\theta_1$ and $\theta_2$.
The case where $q = 1$: the behavior in this case is slightly different. Indeed, we can then perform any perfect matching between the $2k$ blue vertices, since there is no difference between any factors $W$ or $X$ in the summand in (3.5). The graph can be seen in Figure 4a. The contribution to the moments then takes a similar form, where the error comes from performing a matching which is not a perfect one.
We now consider the contribution of the other matchings, that is, those not maximizing the number of pairwise distinct indices. We will show that $E_q$ is indeed the typical contribution from the basic cycle, that is, all other matchings lead to a negligible contribution with respect to $E_q$. There are four different phenomena that can give a (lower order) contribution. First, there may be more than one cycle linking the niches, as in Figure 3b.
Also, in at least one niche there could be more identifications between $l$-indices, which raises moments of entries of $W$ and $X$ beyond the variance. There could be an identification between the index of the cycle and an index from a perfect matching inside a niche. Finally, there could also exist identifications between two distinct niches; note that we can only get higher moments in the case where the two niches are adjacent. While these four behaviors can happen simultaneously, we treat the contributions separately, since combining them only induces an even smaller order.
a) There is more than one cycle between niches. We call $E^{(1)}_q$ the contribution to the moments of such decorated graphs. Suppose there are $c$ cycles; note that necessarily $c$ is odd, since $k$ is odd and the entries are centered. If we suppose that the indices not in cycles are perfectly matched, we can bound $E^{(1)}_q$ accordingly. To understand the first combinatorial factor, note that one has to select in each niche $c$ blue vertices to create the cycles and then perform a perfect matching on the rest of the vertices. One then obtains that this contribution is of smaller order than (3.6) when the number of cycles is strictly greater than 1, as in Figure 3b for instance.
b) The matching in each niche is not a perfect matching, apart from the vertex in the cycle. If the matching is more complicated than a perfect matching, the associated moments can be of higher order than the variance. Consider a matching inside a niche which is not a perfect one: there is then an identification pattern between $a_1, \ldots, a_b$ entries such that $a_1 + \cdots + a_b = k - 1$ and at least one of the $a_i$'s is strictly greater than 2. For ease we suppose that $a_1 = \cdots = a_{b_1} = 2$ and $a_i \ge 3$ for $i > b_1$. See e.g. Figure 5. We call $E^{(2)}_q$ the contribution of all such matchings where a single niche breaks the perfect matching condition. The first factor in the resulting bound corresponds to assigning the $k - 1$ remaining blue vertices (after the choice of the cycle) into $b$ classes of sizes $a_1, \ldots, a_b$. It can be bounded by using the fact that $a_1 = \cdots = a_{b_1} = 2$ and the definition of the double factorial, then expanding the factorial and, in the last step, using the fact that $a_i \ge 3$ for $i > b_1$.
Now, for the second term, we compare the number of possible choices for the indices. The last factor corresponds to the different possible moments: only variances appear in the leading contribution, while higher moments can appear inside the niche $\{i_1, j_1\}$. Bounding in addition the combinatorial factor coming from the sums and putting all these contributions together, we obtain a comparison between $E^{(2)}_q$ and $E_q$ showing that $E^{(2)}_q$ is negligible; note that $k^3 = o(n_0)$. Here we have supposed that in all other niches a perfect matching and a single cycle are used to match the blue vertices; the other cases are negligible.
c) There are identifications between matchings from different niches. If these niches are not adjacent, then such matchings do not increase the moments of the entries of $W$ or $X$. On the contrary, matchings between adjacent niches may result in moments of higher order than the variance. We can then perform the same analysis as before, where we replace $k - 1$ (the number of remaining indices after the choice of the cycle in one niche) by $2k - 2$, corresponding to the number of vertices of two adjacent niches. This yields a contribution of order (3.9) with respect to $E_q$.
d) There are identifications between the cycle and perfect matchings inside niches. Suppose that these identifications happen in $d$ niches and that, for $p \in \{1, \ldots, d\}$, we identify the index from the cycle with $2b_p$ blue vertices from the niche. Indeed, if the number of identifications were odd then, in order to obtain a non-vanishing term, we would need either to create another cycle or to perform more identifications inside the niches. We thus obtain an upper bound coming from the choice of the niches, the identifications we make in each niche, and the perfect matchings we perform in the other niches. Finally, we suppose that we perform perfect matchings on the rest of the vertices in the $d$ niches. Then, using the bounds (3.10), we obtain that this contribution is also negligible. This finishes the proof of Lemma 3.6.

Contribution of general admissible graphs
Lemma 3.7. The total contribution from admissible graphs to the spectral moment is given by (3.11) below.
Remark 3.8. Lemma 3.7 is almost the statement of Theorem 3.5.
Proof of Lemma 3.7: We now suppose that there are $I_i$ identifications between the vertices indexed by $i$-labels and $I_j$ identifications between the vertices indexed by $j$-labels. Note that, by our definition, such a graph is admissible if and only if it consists of $I_i + I_j + 1$ cycles. See for example Figures 6a and 6b. As seen earlier in the case of a simple cycle, the case of a cycle of size 2 has to be considered separately; we thus denote by $b$ the number of cycles of size 2. We can perform a similar analysis in the case of general admissible graphs because we can realize the blue identifications inside each red cycle, as these are well defined. Thus, recalling $A(q, I_i, I_j, b)$ from Definition 3.4, we can write the contribution from all admissible graphs accordingly, and we obtain (3.11) provided we show that the error terms are negligible. Note that the same error terms arise as in cases a), b), c) or d) for each red cycle: their contribution is then negligible, as before, as long as matchings are still performed inside each cycle. Another possible contribution may come from cross-cycle blue identifications: we now show that this contribution is subleading. Consider first the case where such a cross-cycle identification arises around an $i$-identification or a $j$-identification: see e.g. Figure 7. These blue edges match entries of $W$ so as to get non-vanishing moments. However, in order to match the corresponding $X$ entries, some new identifications are needed. Either this implies that, inside a niche, the matching is not a perfect matching, in which case the total final contribution is of order (3.9); or this implies that two blue cycles going through two red cycles bear the same vertex, and the total contribution of such cases is of order $n_0^{-1} E_q$, due to the fact that one loses a possible choice for the index of a blue cycle. There may also exist cross-cycle blue identifications which do not arise around an $i$- or $j$-identification.
It is not difficult to check that in this case the total contribution is again at most of order $n_0^{-1} E_q$. This finishes the proof of Lemma 3.7.

Contribution from non-admissible graphs
Now we estimate the contribution of non-admissible graphs, which we denote by $E^{(\mathrm{NA})}_q$. Our aim is to show the following lemma.
Lemma 3.9. One has that $E^{(\mathrm{NA})}_q$ is negligible with respect to $E_q$, in the sense of (3.12). Once Lemma 3.9 is proved, this finishes the proof of Theorem 3.5 in the case where $f$ is an odd monomial.
Proof of Lemma 3.9: Let us first come back to admissible graphs. Starting from the origin $i_1$ of an admissible graph $G$, there is a single way to run through the different cycles and return to the origin (note that all the cycles are oriented, e.g. counterclockwise). These cycles are called the fundamental cycles: they correspond to the cycles where we perform a matching on the blue vertices. An admissible graph $G$ can then be (partially) encoded into a rooted tree $T = (V, E)$ as follows. The number of edges of $T$ is the number of fundamental cycles of $G$. Given the tree $T$, one replaces each edge $e \in E$ with a cycle of length $2L(e)$ with $L(e) \ge 1$, though one may still have to choose the vertices where cycles are glued to each other.
A non-admissible graph is a multigraph $G = (V, E_1, E_2)$, where $E_1$ denotes the set of single edges and $E_2$ the set of multiple edges. First, there are multiple ways to determine the fundamental cycles; thus we have to count the number of non-admissible graphs labeled by their fundamental cycles (see Figure 8 for an illustration). There may also be multiple ways to go through the whole graph: we explain later how this can be counted thanks to the associated admissible graph.
Our aim is to obtain all the non-admissible graphs from the set of admissible ones by adding i- and j-identifications. This determines the fundamental cycles, and it then only remains to count the number of ways to run through the graph. Consider the tree T encoding an admissible graph; we can then choose two edges and glue them together, in the sense of identifying one vertex of one edge with one vertex of the other. This adds an identification in the initial graph and encodes a non-admissible graph. While these two cycles (corresponding to these two edges) are identified at an additional vertex, there could be more identifications for the same two cycles, obtained by choosing additional vertices in each cycle to be identified. So, performing this step a single time, the number of possible ways to choose two cycles which are then identified at r pairs of vertices is at most the above quantity, since we need to choose two edges and then two vertices. The number of possible ways to go through the whole graph is then multiplied by a factor of at most $3^r$. Indeed, one needs to visit the fundamental cycles in the order in which they are initially numbered on the admissible graph. The moments at which one can make a choice when running through the graph correspond to the vertices of degree greater than 2. Suppose one lies on cycle i (for some integer i) and meets a vertex of degree at least 3. Then there are multiple possible ways to continue the walk on the graph if and only if this vertex belongs to cycle i + 1 or i − 1: because the fundamental cycles are oriented, the first moment at which one jumps from a cycle to discover a new one is determined. However, one loses a power of $n_1$ (or m) for each additional identification, as we lose a choice of index without gaining a cycle. Assume that $q^3 \leq n_0$, and denote by $E_q^{(\mathrm{NA},r)}$ the contribution of the non-admissible graphs with no cross matchings (apart from the cycle inside each red cycle).
Then one has that (3.14) holds, where r denotes the total number of additional identifications (that is, the surplus s(G) of the non-admissible graph). Now, once the fundamental cycles are identified, cross identifications between blue edges from distinct niches (or fundamental cycles) are subleading except in the following case: there are multiple cycles of length 2. Consider a cycle of length 2 with multiplicity p. Then pk blue vertices have to be matched.

Figure 8: The left picture is an admissible graph with its encoding tree. The two dashed lines correspond to the two edges we glue together (gluing with r = 3). The second graph corresponds to the glued tree, where there is now a cycle. The last step consists in choosing the number of identifications: here there are three identifications in total between the two cycles.

While the leading order is given by performing a perfect matching between these vertices, as in Figure 9, we can perform any kind of matching and use an analysis similar to the one for (3.9). Suppose that we have an identification between $a_1, \ldots, a_b$ entries such that $a_1 + \cdots + a_b = pk$. For ease we suppose that $a_1 = \cdots = a_{b_1} = 2$ and $a_{b_1+1}, \ldots, a_b > 2$ for some $b_1 \in [[1, b-1]]$; we can then compare their contribution to that of the admissible graph (used to encode it). The factor of $n_0^{1-p}$ comes from the additional identifications between i's and j's needed to obtain a multiple edge. For instance, in Figure 9 there are fewer identifications in the admissible graph than in the corresponding non-admissible graph. The first term in the summand compares the number of possible matchings of the pk edges to that of a perfect matching in every single cycle, and is bounded by a constant C > 0. The second term comes from the number of indices chosen and the ratio of moments, and we bound it in the same way as in (3.9). Also, in the same way as in (3.9), we can bound the combinatorial factor coming from the sums. Finally, putting all the contributions together, and using the fact that the leading order comes from the case where $b = pk/2$, we obtain the stated bound. Actually, (3.16) can be improved to $O(q^{kj}\, n_0^{-(j-1)})$ for any integer $1 < j < q$.

Proof of Theorem 3.5 when f is a monomial of even degree:
In the case of an even monomial we center the function f: to do so, we subtract the constant given by the corresponding expectation. We then consider the centered monomial thus obtained, which satisfies $\theta_2(f) = 0$.
Here, the fact that $\theta_2(f)$ vanishes means that all admissible graphs which have at least one cycle of size greater than 2 are subleading, so that we only see admissible graphs consisting of cycles of size 2, such as in Figure 6b. Note that we have seen earlier that we can write the moments as follows. Thus, by developing the tracial moments of M, we obtain the corresponding formula. Now it is not difficult to check the identity where the ** means that the indices are matched according to a perfect matching inside each niche. Thus the centering by $c_0$ corresponds to the contribution of the admissible graphs where the blue vertices form a perfect matching inside each niche. After centering, the typical graphs are those which have additional identifications between niches sharing a common red vertex, as in Figure 10. We first consider the contribution of one red cycle to the moments and then deduce the contribution of all admissible graphs. One can see that, to maximize the number of possible choices of blue indices, we can first perform a perfect matching inside each niche as before the centering, and then choose either the i-vertices or the j-vertices and add identifications around the corresponding niches. This prevents having a perfect matching inside any niche (which is forbidden by the centering) but still gives the maximal number of blue indices. With such a matching, moments of order 4 arise in the contribution. Note that the contribution is of order $n_1^{-1}$ and thus is negligible compared with the contribution from odd polynomials. For the number of distinct indices we obtain $n_1^{(k-1)q}$. We could instead try to create cycles between niches as in the odd polynomial case, but one can see that we would need to create two cycles instead of one and would obtain $(k-2)q + 2$ distinct indices, which is of lower order.
Now, if we create only one cycle, we need to perform identifications between at least three vertices in each niche, since we would have an odd number of blue vertices left, and the number of distinct indices becomes at most $(k-4)q + 2q + 1$, which is also of lower order than in Figure 10. If, instead of identifying between different niches, we identified blue vertices inside the same niche, we could obtain at most $(k-4)q + 2q$ distinct indices, which is again of lower order than in Figure 10. In the same way, the case of a simple cycle (i.e. of length 2) is slightly different due to the centering: at least one (thus two) vertices have to be connected to some other niche. Note also that any perfect matching where the two niches are connected is of the same order; thus we obtain the stated leading order. The above formula is self-explanatory.
For the general case of admissible graphs with possible identifications, we use the fact that the contribution is just a product over the different cycles. For simplicity, we suppose that $\mathbb{E}W_{11}^4\, \sigma_x^4 = \mathbb{E}X_{11}^4\, \sigma_w^4$. Since the contributions of the cycles of length greater than 2 are $O(n_1^{-1})$, the terms involving the 4-th moments of the entries of X and W (which do not depend on $n_1$) are subleading. Thus, this condition does not impact the overall order of the contribution but gives a simpler formula.

Figure 11: Contribution in the case q = 1 for an even monomial.

The leading order of a q-moment, corresponding to the total contribution of admissible graphs with 2q edges, can be written as above, where we used in the last equality the fact that $\theta_2(f) = 0$ in order to prove the expression (3.2). Note again that we did not spell out all the error terms, since we computed them in the previous subsection; the case of even monomials can be handled similarly. Thus we can see that only the graphs which correspond to a tree of simple cycles contribute to the moments.
We can carry out the analysis of the contribution from non-admissible graphs as in the previous section, since the non-admissible structure only concerns the red graph while the (centered) polynomial only involves the matching on the blue vertices. We leave the details to the reader.

Proof of Theorem 3.5 when f is a polynomial:
We now suppose that we can write $|a_k| \leq C^k$ for some constant C.
In particular, the parameters in this case are as above. Note that, for any polynomial, expanding the moment as in (3.5), we have to compute the corresponding quantity for any integers $k_1, \ldots, k_{2q}$. To compute the leading term of this moment, first note that the centering creates a disparity between even and odd monomials. Indeed, let q > 1; if we consider one red cycle of length 2q, there are now 2q niches of different sizes, namely $k_1, \ldots, k_{2q}$. We first bound these moments in order to see that, in each cycle, the niches with an even number of vertices are subleading, so that the dominant term in the asymptotic expansion of the moment corresponds to admissible graphs with only odd niches when expanding the polynomial.

Figure 12: Admissible graph in the case of a polynomial with $(k_1, k_2, k_3, k_4) = (4, 3, 2, 5)$.

The behavior in a fundamental cycle can be understood as follows: there has to be at least one cycle connecting each niche, for the odd niches as well as the centered even niches. Now, in each odd niche of length $k_i$, the leading term corresponds to a perfect matching of the $k_i - 1$ remaining vertices, from (3.9). The number of pairwise distinct $\ell$-indices in the niche is then $(k_i - 1)/2$, apart from the cycle. However, in the even niches, since there is already a cycle, there remains an odd number of vertices to be matched. The leading order is obtained by discarding 2 vertices and then performing a perfect matching of the $k_i - 2$ remaining vertices; the discarded vertices are matched to a blue cycle or to an existing matching. Then the number of distinct $\ell$-indices inside one niche is at most $(k_i - 2)/2$ (apart from cycles). Denote by $C(k_1, \ldots, k_{2q})$ the number of choices of indices for red and blue vertices for a configuration of niches $k_1, \ldots, k_{2q}$. Then we obtain the stated bound. This contribution can be understood in the following way: apart from the normalization, we have to choose the q i-indices, the q j-indices, and the $\ell$-indices. Thus, if we consider the contribution of cycles of size q > 1 for the polynomial $P(x) = \sum_k \frac{a_k}{k!}\big(x^k - k!!\,\mathbf{1}_{k\ \mathrm{even}}\big)$, we get the corresponding asymptotic expansion for the moments. As we now explain, in the case of a cycle consisting of two edges decorated by $k_1$ and $k_2$ blue vertices, there are three different possibilities: i) if $k_1$ and $k_2$ are odd, the contribution to the moment is $(\sigma_x \sigma_w)^{k_1+k_2}(k_1+k_2)!!$; ii) if $k_1$ and $k_2$ are even, the contribution is $(\sigma_w \sigma_x)^{k_1+k_2}\big((k_1+k_2)!! - k_1!!\,k_2!!\big)$; iii) if $k_1$ is even and $k_2$ is odd, the leading term in the asymptotic expansion is of order $n_0^{-1/2}$ due to the parity discrepancy. Thus the 1-moment for a polynomial f follows, where we used the fact that, for any $k_1$ and $k_2$, $(k_1+k_2)!!/(k_1!\,k_2!)$ is bounded. While this analysis works in the case of a single cycle, the same generalization to arbitrary (non-)admissible graphs applies as before.
Thus we get the following q-moment in the case of a polynomial, which finishes the proof of Theorem 3.5 when f is a polynomial.
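The double-factorial factors appearing throughout these moment computations count perfect matchings, via Wick's formula for Gaussian moments: 2k points admit exactly $1 \cdot 3 \cdots (2k-1)$ perfect matchings, which is also the 2k-th moment of a standard Gaussian. A minimal brute-force sanity check (the helper names below are ours, not the paper's):

```python
from itertools import count  # stdlib only; no external dependencies


def perfect_matchings(points):
    """Enumerate all perfect matchings of an even-sized list of points."""
    if not points:
        yield []
        return
    first, rest = points[0], points[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for matching in perfect_matchings(remaining):
            yield [(first, partner)] + matching


def double_factorial(n):
    """Product of n, n-2, n-4, ...; (2k-1)!! counts matchings of 2k points."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


# The number of perfect matchings of 2k vertices is (2k-1)!!.
for k in range(1, 6):
    n_matchings = sum(1 for _ in perfect_matchings(list(range(2 * k))))
    assert n_matchings == double_factorial(2 * k - 1)
```

Note that the paper's convention (see the proof of Theorem 2.2) writes $j!!$ for the Gaussian moment $\mathbb{E}[g^j]$, with $j!! = 0$ for odd j.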

Convergence of moments in probability
In the previous subsection, we have proved convergence of the expected moments of the empirical eigenvalue distribution. We turn to the proof of the convergence in probability of these moments.
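A numerical illustration of this convergence is straightforward. The sketch below (our own illustration, not the paper's code) assumes the usual $1/\sqrt{n_0}$ normalization inside the activation and takes $f = \tanh$ as an example of a real analytic activation; it estimates the fluctuations of a tracial moment of M at two sizes and checks that they shrink as the dimensions grow:

```python
import numpy as np


def moment_of_M(n0, n1, m, q, f, rng):
    # Y = f(W X / sqrt(n0)): with this (assumed) scaling, the entries of
    # W X / sqrt(n0) are approximately standard Gaussian, so f acts on O(1)
    # inputs. M = Y Y^T / m is the nonlinear sample covariance model.
    W = rng.standard_normal((n1, n0))
    X = rng.standard_normal((n0, m))
    Y = f(W @ X / np.sqrt(n0))
    M = Y @ Y.T / m
    # q-th moment of the empirical eigenvalue distribution of M.
    return np.trace(np.linalg.matrix_power(M, q)) / n1


rng = np.random.default_rng(0)
trials = 40
variances = {}
for n in (30, 120):
    vals = [moment_of_M(n, n, n, 2, np.tanh, rng) for _ in range(trials)]
    variances[n] = np.var(vals)

# Fluctuations of the moments shrink as the dimensions grow,
# consistent with convergence in probability.
assert variances[120] < variances[30]
```

The variance of such linear eigenvalue statistics typically decays like $n^{-2}$, which is what the Chebyshev argument below exploits.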
In addition, there exists a constant C such that the bound below holds. Proof. We can write the variance of the moments as a double sum, where $G_p = (G_p, i_p, j_p)$ are labeled graphs with the i-labels and j-labels given respectively by $i_p$, $j_p$. For a given labeled graph $G = (G, i, j)$ and a matching, the notation $M_G(\cdot)$ corresponds to the product obtained after expansion. Now, note that the shape of the graph and the possible expansion of the polynomial f do not depend on $n_0$, $n_1$ or m. By independence, the two graphs $G_1$ and $G_2$ have to share an edge, otherwise the contribution to the variance is null; in particular, the concatenated graph G cannot be admissible. Thus we only need to consider graphs $G_1$, $G_2$ which share a common edge, either a red one or some entry $X_{\ell j}$ or $W_{i\ell}$ for some i, j and $\ell$. We assume here, for ease, that $G_1$ and $G_2$ both have 2q edges; the case where the numbers of edges differ can be handled similarly.
To simplify the exposition further, we suppose that $G_1$ and $G_2$ are both cycles and that f is an odd monomial $x^k$. The generalization comes from the fact that admissible graphs are trees of cycles and non-admissible graphs yield a lower-order contribution, by (3.12). If we suppose that the coincidence between the two graphs comes from an i-label and an $\ell$-label, in other words an entry $W_{i\ell}$, we have different possibilities that we now develop. The first case consists in taking the two red cycles and attaching them at a fixed vertex $i_0$. We then perform a cross-cycle identification, as in Figure 7, in order to match two entries $W_{i_0 \ell_0}$ from $G_1$ and $G_2$. Once these W entries are matched, note that the corresponding X entries have not been matched yet. We then need to identify this $\ell$-vertex either with another vertex from an adjacent niche (thereby creating a blue cycle going over the whole red cycle) or with another vertex in the same niche. Altogether, this amounts to performing the dominant matching in each graph, identifying two i-indices, and then identifying two blue edges from niches adjacent to $i_0$. Finally, we can compute the contribution of these graphs to the covariance. Indeed, in each graph we perform the typical matching corresponding to a blue cycle going over every niche and perfect matchings between the remaining indices in each niche. The fact that we identify two $W_{i_0 \ell_0}$ entries creates a moment of order 4 when we compute $\mathbb{E}[M_{G_1} M_{G_2}]$. We then have to count the number of possible choices of indices: we have $n_1^{2q-1}$ choices for the i-indices, as we identify one from $G_1$ with one from $G_2$, $m^{2q}$ choices for the j-indices, and $n_0^{2 + 4q(k-1)/2 - 1}$ choices for the $\ell$-indices (2 cycles, 4q niches and one identification between the two graphs). Taking into account the normalization $m^{-2q} n_0^{-2kq}$, this yields a factor $\psi^{1-2q}$ asymptotically.
In the same way, for a general polynomial and general admissible graphs, such an identification would yield the stated bound for some C > 0. Indeed, the factor $q^2 k^2$ comes from the choices of the edge to identify between the two graphs, and the constant factor in $\phi$ and $\psi$ corresponds to the choice of an $\{i, \ell\}$ edge or a $\{j, \ell\}$ edge. The previous computation in the case of a cycle can then be generalized to all graphs, as the construction only involves one cycle in each graph. For the second equality we use the fact that $m_q \leq C^q$, as proved in the next subsection.
The second case consists in identifying a pair of red vertices in each graph. Such a pair is chosen in one fundamental cycle in each of $G_1$ and $G_2$, and we then identify the pair from one graph with the pair from the other. This allows the existence of edges belonging to both graphs $G_1$ and $G_2$. The whole graph G created by this construction is non-admissible, as we have two identifications and two fundamental cycles, so we need to choose the fundamental cycles of G. The fundamental cycles we choose for this red graph are given by the cycles between the two vertices with edges belonging to both graphs in each cycle. Since we need to choose a pair of vertices in each graph, we have $q^4$ choices. In each fundamental cycle we perform the typical blue matching, and we have an edge between a niche from $G_1$ and a niche from $G_2$ (corresponding, for instance, to the cycle going over every niche). Thus we have a common W or X entry between the two graphs and the contribution to the covariance does not vanish. Considering the $q^4$ choices for the red vertices, we obtain the stated bound. Regarding the number of possible choices for the vertices, the number of $\ell$-indices is unchanged while that of i- or j-indices decreases by 2, compared with the computation of the expected moment. Using the Bienaymé–Chebyshev inequality, one easily deduces (3.19).
(a) In this figure, the two highlighted cycles correspond to the graphs $G_1$ and $G_2$, which are attached at $i_0$. We perform a typical blue matching in each graph and then add an identification between the two graphs. The highlighted orange edges correspond to the edges common to the two graphs, which yield a moment of order 4.
(b) In this figure, the two highlighted cycles correspond to the graphs $G_1$ and $G_2$, which are attached at two vertices $i_0$ and $i_0'$. The graph is non-admissible and we choose the fundamental cycles so that neither $G_1$ nor $G_2$ is a fundamental cycle. The typical matching in the chosen cycles creates common edges between the two graphs, highlighted in orange in the figure.

From bounded to sub-Gaussian random variables
We have computed the limiting expected moments in the case of bounded random variables. Note however that, while high moments of W or X can appear in the error terms, as in (3.8), (3.10) and (3.15), one may use for sub-Gaussian random variables the following bound, for some constant C. Thus one may simply replace A by $k^{1/\alpha}$ in all the error terms. Since k is of order $\frac{\log n_1}{\log \log n_1}$, all the errors are still o(1).
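For concreteness, the Gaussian case of this moment bound ($\alpha = 2$) can be checked directly: the absolute moments of a standard Gaussian are $\mathbb{E}|g|^k = 2^{k/2}\,\Gamma\!\big(\frac{k+1}{2}\big)/\sqrt{\pi}$, so $(\mathbb{E}|g|^k)^{1/k} \leq C\, k^{1/2}$. A quick numerical verification (the helper name is ours):

```python
import math


def gaussian_abs_moment(k):
    # E|g|^k for a standard Gaussian g (closed form via the Gamma function).
    return 2 ** (k / 2) * math.gamma((k + 1) / 2) / math.sqrt(math.pi)


# Sub-Gaussian moment growth with alpha = 2: the k-th root of the k-th
# absolute moment grows like sqrt(k), so the ratio below stays bounded.
ratios = [gaussian_abs_moment(k) ** (1 / k) / math.sqrt(k) for k in range(1, 60)]
assert max(ratios) < 1.0
```

Asymptotically $(\mathbb{E}|g|^k)^{1/k} \sim \sqrt{k/e}$, so the ratio even tends to $e^{-1/2} \approx 0.61$.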

Weak convergence of the empirical spectral measure
In this section we briefly finish the proof of Theorems 2.2 and 2.3 for a polynomial activation function. The fact that the sequence of moments uniquely defines a probability measure $\mu$ such that $\int x^q \, d\mu(x) = m_q$ follows from Carleman's condition. Indeed, denote by $\Theta(q)$ the number of unlabeled cactus graphs with q vertices. It has been shown in [FU56] that, regardless of the number of identifications or simple cycles, there exist numerical constants $\delta > 0$ and $\xi > 1$ such that $\Theta(q) \sim \frac{3\delta}{4\sqrt{\pi}}\, \xi^{q+3/2}\, q^{-5/2}$. Thus there exists a constant C such that $m_q \leq C^q$. This can also be used to show that the measure has compact support.
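To see how a bound $m_q \leq C^q$ feeds into Carleman's condition, one can test the mechanism on the Marčenko–Pastur law with ratio 1, whose q-th moment is the q-th Catalan number (so C = 4 works): each term of the Carleman series $\sum_q m_{2q}^{-1/(2q)}$ is bounded below by 1/C, so the series diverges. A short check (helper names are ours):

```python
import math


def catalan(q):
    # q-th Catalan number: the q-th moment of Marchenko-Pastur with ratio 1.
    return math.comb(2 * q, q) // (q + 1)


# Exponential moment bound m_q <= C^q with C = 4.
assert all(catalan(q) <= 4 ** q for q in range(1, 40))

# Carleman's condition: each term m_{2q}^{-1/(2q)} is at least 1/4,
# so the partial sums grow at least linearly and the series diverges.
terms = [catalan(2 * q) ** (-1 / (2 * q)) for q in range(1, 200)]
assert all(t >= 0.25 for t in terms)
assert sum(terms) > 49
```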

Derivation of the self-consistent equation for the Stieltjes transform
Consider the Stieltjes transform of the limiting empirical eigenvalue distribution of M. One can also write it as a generating function of the moments, since this expansion makes sense at least in a neighborhood of infinity, and thus express the Stieltjes transform accordingly. Fix a vertex v and denote by $q_0$ the length of one of the fundamental cycles containing v. Suppose first that $q_0 > 1$; this cycle contains $2q_0$ edges, with $q_0$ vertices labeled with i and $q_0$ vertices labeled with j. On each vertex labeled with i, either a graph is attached, in which case we have an i-identification at this vertex, or nothing is attached. Thus, considering the formula above, the contribution of the identifications at each vertex is $H_\psi(z) := 1 - \psi + \psi H(z)$ for i-labels and $H_\phi(z) := 1 - \phi + \phi H(z)$ for j-labels.
Also, one can see from the leading order of the moment that a cycle of length $q_0$ gives a contribution of $\frac{\theta_2(f)}{\psi z^{q_0}}$. If the cycle is of length 1, in the same way, there is a single i-labeled vertex and a single j-labeled vertex, which can give a contribution of $H_\psi$ and $H_\phi$, but the contribution of a simple cycle is not given in terms of $\theta_2(f)$ but by $\frac{\theta_1(f)}{\psi z}$. This is illustrated in Figure 14. Thus we have the following recursion relation for H. We obtain the final equation of Theorem 2.3 by noting that $H_\psi(z) = -zG(z)$ and
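In the special case $\theta_2(f) = 0$, the limiting law is Marčenko–Pastur, whose Stieltjes transform solves the quadratic self-consistent equation $s(z) = \big(1 - \lambda - z - \lambda z\, s(z)\big)^{-1}$; the general equation of Theorem 2.3 is a fixed-point equation of the same type. A minimal damped fixed-point sketch (our own illustration, not the paper's general equation) recovering the density from $\operatorname{Im} s$:

```python
import numpy as np


def mp_stieltjes(z, lam, iters=500, damping=0.5):
    """Solve s = 1/(1 - lam - z - lam*z*s) by damped fixed-point iteration.

    Damping is needed: the undamped map has modulus-one derivative in the
    bulk, so plain iteration does not contract there.
    """
    s = -1.0 / z
    for _ in range(iters):
        s = (1 - damping) * s + damping / (1 - lam - z - lam * z * s)
    return s


# Density of the Marchenko-Pastur law with ratio lam = 1 at x = 1,
# recovered from Im s(x + i*eps) / pi just above the real axis.
eps = 1e-3
s = mp_stieltjes(1.0 + 1j * eps, lam=1.0)
density = s.imag / np.pi
exact = np.sqrt(3) / (2 * np.pi)  # closed form sqrt((4-x)/x)/(2*pi) at x = 1
assert abs(density - exact) < 1e-2
```

The iteration preserves the upper half-plane, so it converges to the Stieltjes (rather than the spurious) root of the quadratic.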

Proof of Theorem 2.2 for general activation function
In this section, we now allow the activation function to belong to a wider class, thus proving Theorem 2.2. For ease, we assume that σ w = σ x = 1, which can be achieved by scaling.
Proof of Theorem 2.2. We begin by defining the following polynomial, which approximates f up to a constant, for $x \in \mathbb{R}$, with the convention that $j!! = 0$ for j odd and $0!! = 1$. This choice ensures that the polynomial is centered with respect to the Gaussian distribution. Using Taylor's theorem, we then obtain an approximation (4.2) valid for any A > 0. Next, we compare the Hermitized versions of the matrices (up to finite-rank modification) and define the error matrix E; we want to control the spectral radius of this $(m + n_1) \times (m + n_1)$ symmetric matrix. Now consider the event $A_{n_1}(\delta_1)$, for $\delta_1 \in (0, \frac{1}{2})$ (4.4). On this event we have, considering the approximation (4.2), the corresponding bound on E. We then choose $k \asymp c_0 \frac{\log n_1}{\log \log n_1}$. (4.5) Using Stirling's formula, we obtain that there exists $\delta_2 > 0$ such that the bound holds for any $\varepsilon > 0$. Taking $\varepsilon$ small enough, we then see that, on the event $A_{n_1}(\delta_1)$ and with k as in (4.5), $\rho(E) \to 0$ as $n_1 \to \infty$. It remains to see that the event $A_{n_1}(\delta_1)$ occurs with high probability, which follows from the assumption on the entries $W_{ij}$ and $X_{ij}$: indeed, the probability of the complement goes to zero faster than any polynomial in $n_1$. Now, we know the limiting e.e.d. of the matrix $M_{P_k}$ constructed with the centered polynomial $P_k$ as activation function, and the above argument yields that it is the same for $M_{f - a_k}$, constructed with $f - a_k$ instead. Since $Y(a_k)$ is just a rank-one deformation of Y, the rank inequalities (see [BS10] for instance) show that M and $M_{f - a_k}$ have the same limiting e.e.d. This finishes the proof of Theorem 2.2.

Propagation of eigenvalue distribution through multiple layers
In this section, we study the eigenvalue distribution of a nonlinear matrix model when the data passes through several layers of the neural network. The case of a single layer has been considered in Theorems 2.2 and 2.3, where we describe the asymptotic e.e.d. It has been conjectured in [PW17] that the limiting e.e.d. is stable through the layers in the case where $\theta_2(f) = 0$. We give here a positive answer to this conjecture (with the appropriate normalization). We first develop the combinatorial arguments for an odd monomial, for several layers. It can be shown, as in Subsection 3.2, that the even monomials are subleading; thus the leading order for the moments is given by the contribution of odd monomials only. From now on, we assume that (5.1) holds. We can write the entries of the two-layer data matrix $Y^{(2)}$ accordingly. Then, developing the expected moment of the e.e.d. and using (5.2), we obtain the following. We call typical the terms contributing in a non-negligible way. Now, we can give a graphical representation of these terms as in the previous sections. We will see that the contributing graphs are actually the same admissible graphs from Definition 3.1; however, there are fewer constraints on the choices of the blue edges. Indeed, the entries of the matrix $Y^{(1)}$ are not independent: we do not need each entry to be matched with at least another one. This constraint however still holds for the entries of the matrix $W^{(1)}$.

The simpler case of the simple cycle
In this subsubsection, we explain the combinatorics in the case where the i-labels and j-labels are pairwise distinct. We first perform a matching on the entries of $W^{(1)}$. This matching on the $W^{(1)}$ entries induces one on the entries of $Y^{(1)}$, which in turn induces another graph between j-labeled and $\ell$-labeled vertices. The i-labeled vertices do not appear in this graph (as they correspond to entries of $W^{(1)}$). This graph can be constructed from the initial graph by seeing which niches are connected by a blue edge. Figure 15 explains this construction: $\ell_2$ links the same niche adjacent to $j_2$, while $\ell_1$ links the niches adjacent to $j_1$ and $j_2$.

Figure 15: Graph obtained after a blue matching in the initial graph. The green edges, corresponding to bridges between niches, induce a cycle in the final graph. The remaining edges coming from matched pairs inside a niche create simple cycles attached to j-labeled indices.
We start with general observations. The largest possible number of distinct $\ell$-indices is kq, obtained as follows: one matches at least two indices from different niches adjacent to an i-labeled index and performs a perfect matching between the 2k − 2 remaining indices. Such a matching gives kq distinct indices and matches every $W^{(1)}$ entry with another one. This is illustrated in the leftmost graph in Figure 15. Note that this type of matching gives kq distinct indices but is not necessarily typical (see Figure 16 for an illustration), and it is not the only typical configuration. As in Figure 15, we see that the matching on the initial graph induces another admissible graph. Note that it does not consist of one cycle, but of a cycle (in green in the figure) with k − 1 cycles of length 2 attached to each j-labeled vertex. Note also that it is possible to perform identifications between the blue edges and still obtain a graph contributing in a non-negligible way to the asymptotic expansion (see Figure 17 for an illustration); this behavior is explained in the second step, when we develop the entries of $Y^{(1)}$. Let us briefly indicate that, as in Figure 16, a blue matching on the initial cycle which maximizes the number of distinct indices may give rise to a non-admissible induced graph. This comes from the fact that too many edges link two distinct niches.

Figure 16: Non-admissible graph obtained after a blue matching which induces a maximal number of distinct indices in the initial cycle. Several green bridges between the same niches create a non-admissible graph, which is thus subleading by the analysis of the previous section.
The main tool for the combinatorial arguments in the multilayer case is the following lemma; it states that the leading order is actually given by the matchings of Figure 15. Proof. The proof is based on the construction of the second graph and the fact that the typical graphs are admissible. We first show that any other matching gives a non-admissible second-layer graph. Firstly, more than one bridge between two distinct niches breaks the tree structure and thus yields a non-admissible graph. The same reasoning holds for possible identifications between bridges and a matched pair inside a niche. If we identify two matched pairs inside a niche, we can see via the construction of the graph that this creates double edges, and we would obtain an entry of $Y^{(1)}$ to the power 4. Note however that, in the initial cycle of size q, we can add identifications between the q bridges and still keep the second graph admissible. This behavior is illustrated in Figure 17, where we perform identifications between bridges and still obtain an admissible graph.
We now need to show that the contribution of the matchings leading to a non-admissible graph is subleading. As in Subsection 3.1.4, we have additional identifications between the vertices and we need to choose the fundamental cycles as well as the way one runs through the graph. Suppose we have I identifications between the vertices. If the graph were admissible, we would have $I + q(k-1) + 1$ fundamental cycles in the induced graph on $(j, \ell)$ vertices. Thus, if the graph is non-admissible, we have at most $I + q(k-1)$ fundamental cycles.
Let $m \leq I + q(k-1)$ be the number of fundamental cycles of the induced graph. We denote by $C_1, \ldots, C_b, C_{b+1}, \ldots, C_m$ its cycles, where, if $\ell(C_i)$ denotes the length of the cycle $C_i$, we have $\ell(C_1) = \cdots = \ell(C_b) = 2$ and $\ell(C_{b+1}), \ldots, \ell(C_m) > 2$. One then has $\sum_{i=1}^m \ell(C_i) = 2kq$. Now, the contribution of such graphs (initial and induced), taking into account the normalization and the number of ways to run through the graph, is bounded as above for some constant C. Indeed, one has to choose the i- and j-labels of the vertices in the initial cycle, and the vertices in the initial graph with the constraint that there are I identifications. Then, in the induced graph, there are at most k indices in each cycle of length 2 and $1 + (k-1)\ell(C_i)/2$ indices in the cycle $C_i$ for i > b. Thus the contribution is negligible due to the constraint that $m \leq I + q(k-1)$. Some (negligible) contribution depending on k comes from possible multiple cycles of length 2 attached together, as in Figure 9. Here the error is slightly bigger, since the induced graph has 2kq edges instead of 2q. Fix a vertex $j_0$; if we match 2p $\ell$-indices together in the niche adjacent to $j_0$, using (3.16), the corresponding error is $O\big(n_0 (k(2p)^k/n_0)^p\big)$. However, up to 2k indices can be matched together, so the resulting contribution of non-admissible graphs decays faster than any polynomial for such k. This finishes the proof of the Lemma.

Figure 17: Admissible graph after a matching with an identification between two bridges. While two vertices are identified, the matching is of leading order, as there is one more cycle in the induced graph.
Lemma 5.1 has been proved in the two-layer case. It readily extends to the case of L ≥ 2 layers: the number of possible distinct $\ell$-indices is multiplied by k at each layer, and the final graph obtained after performing the matchings has to be admissible in order to contribute in the limit. The proof is similar to that of Lemma 5.1; the details are left to the reader.

Invariance of the distribution in the case when θ 2 (f ) vanishes.
In light of the previous combinatorial arguments, it is interesting to consider the special case where $\theta_2(f) = 0$. Indeed, for the one-layer case, by Theorem 2.2, the limiting e.e.d. is the Marčenko–Pastur distribution with shape $\phi/\psi$, denoted by $\mu_{\phi/\psi}$, as also shown by the following lemma. Lemma 5.2. Let q be a positive integer; we have the following equality. Proof. Firstly, we can slightly rewrite the left-hand side, and it then only remains to prove the remaining identity. This fact comes from another representation of admissible graphs. Consider admissible graphs with 2q edges, q cycles of length 2, k j-identifications and q − k − 1 i-identifications. We can count these as double trees, in the sense that every other vertex is i-labeled and the others are j-labeled, with the appropriate number of each type of vertex (q − k j-labeled vertices and k + 1 i-labeled vertices). This number is a Narayana number [CYY08], given by (5.4).
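The Narayana numbers appearing here can be checked directly: they refine the Catalan numbers and are the coefficients of the moment polynomials of the Marčenko–Pastur law. A short verification (helper names are ours):

```python
import math


def narayana(q, k):
    # Narayana number N(q, k) = (1/q) * C(q, k) * C(q, k-1), 1 <= k <= q.
    return math.comb(q, k) * math.comb(q, k - 1) // q


def catalan(q):
    return math.comb(2 * q, q) // (q + 1)


# Narayana numbers refine the Catalan numbers: summing a row gives C_q.
for q in range(1, 12):
    assert sum(narayana(q, k) for k in range(1, q + 1)) == catalan(q)

# They generate Marchenko-Pastur moments: m_q = sum_k N(q, k) lam^(k-1),
# e.g. m_1 = 1 and m_2 = 1 + lam.
lam = 0.5
m2 = sum(narayana(2, k) * lam ** (k - 1) for k in range(1, 3))
assert m2 == 1 + lam
```

This is the combinatorial content behind the stability statement: when only cycles of length 2 survive, the moment count reduces to Narayana-weighted trees, i.e. Marčenko–Pastur moments.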
This means that, if we consider a function f such that $\theta_2(f) = 0$, the e.e.d. (up to a change of variance and shape) is "stable" after going through one layer of the network. Indeed, if one considers the matrix $\frac{1}{m\sigma_x^2} XX^*$, the asymptotic e.e.d. is given by $\mu_\phi$, the Marčenko–Pastur distribution with shape parameter $\phi$. After a layer of the network, we see that for 1 We now consider the case of an arbitrary but fixed number of layers, mostly interested in the case where $\theta_2(f) = 0$. Let $Y^{(L+1)}$ be as in (2.9), and consider the corresponding matrices. Proof. We again first develop the arguments in the case of a monomial of odd degree $f(x) = x^k$, since the case of an even monomial is completely similar (we only consider graphs with simple cycles). We study and count the admissible graphs along each layer. It is enough to identify, in the asymptotic expansion of the moment, those terms where no $\theta_2$ arises. Thus one can consider only admissible graphs made of cycles of length 2. For the error terms, one has to consider also admissible graphs with longer cycles but where the matching in each niche does not yield an occurrence of $\theta_2$.
We begin with the case q = 1 for two layers. The cycle then has length 2, as in Figure 4a. The dominant term in the asymptotic expansion consists in performing a perfect matching between all edges, by Lemma 5.1. The contribution coming from this first construction (in Lemma 5.1) is given by $\sigma_x^{2k}\, \sigma_w^{2k}\, \frac{n_2\, m\, n_1^{k}}{n_2\, m\, n_1^{k}}\, (2k)!! = \theta_1(f)\, (1 + o(1))$.
This follows from the choices for the $i$ index, the $j$ index and the remaining indices. Now, this construction on the initial graph induces a second graph as in Figure 18. This induced graph is an admissible graph where all $j$'s are identified to a single vertex and $k$ cycles of length 2 are attached to it (corresponding to the $k$ blue edges in the initial cycle). We use the same reasoning as before and develop the entries of $Y^{(1)}$ as a product of entries of $W^{(0)}$ and $X$. Since the graph is admissible, the dominant term in the asymptotic expansion corresponds to performing a perfect matching in all cycles of length 2 as in Section 3 (illustrated in Figure 18). Thus this adds a further contribution. Here, the normalization in $n_0^{-k}$ comes from the fact that there are $2k$ entries, each with a normalization of $n_0^{-1/2}$; we then have to choose $n_0^{k}$ indices in the second graph. Finally, we obtain for the final contribution of a cycle of length 2 that $E_1(f) = \theta_1(f)\,(1 + o(1))$.

Figure 18: Construction/matching on the second layer graphs from a matching on the initial graph. The first graph gives a combinatorial factor of $(2k)!!$ while the second graph gives a factor of $((2k)!!)^k$.
For the general case, we saw that the first step of the procedure (by the construction explained before) yields a forest of star admissible graphs, where each graph is given by a certain number of cycles of length 2 attached to a unique $j$-labeled vertex. Consider now a connected component (of the induced forest), which corresponds to a unique $j$ vertex. The number of cycles of length 2 attached to $j$ is then $k$ times the total number of cycles adjacent to $j$ in the previous steps (since we have $k$ blue edges in each cycle of length 2). From this first process we then get the following contribution for these first two steps. Let us explain this formula: there are $n_L^{q-I_i}$ choices needed to label the $i$-labeled vertices and $m^{q-I_j}$ for the $j$-labeled vertices. For the powers of $n_{L-1}$ we take into account the normalization and the corresponding number of indices to choose. Finally, in each cycle of length 2 we perform a perfect matching between the two niches: there are $q$ cycles of length 2 in the initial graph and $kq$ such cycles in the forest obtained. See Figure 19 for an illustration.

Now, we can perform one more step of the procedure: we again have a forest of these star admissible graphs, where each graph has only one $j$ vertex, to which are now attached $k$ times more cycles than in the previous step. Thus, for the three-step procedure, the total number of cycles of length 2 in the forest is given by $k^3 q$. We can perform this for each layer the data goes through, as the only parameter to be changed is the number of cycles of length 2 attached to each $j$ vertex.

Figure 19: Effect of going through several layers for admissible graphs with only cycles of length 2. The first step consists of separating each $j$-labeled vertex into its own graph, where it is attached to cycles of length 2. At each layer after the first one, the number of attached cycles is multiplied by $k$.
All in all, adding the layer $L_0$ multiplies the contribution with no $\theta_2$ by a constant factor, so that the whole contribution can be written, up to a factor $(1 + o(1))$, in the stated form. We obtain the final result by using that $\frac{n_L}{m} \to \frac{\phi}{\psi_0 \psi_1 \cdots \psi_{L-1}}$ and $A(q, q-k-1, k, q) = \frac{1}{k+1}\binom{r}{k}\binom{r-1}{k}$. Now, in the statement of the theorem we do not make explicit the leading contribution of admissible graphs with at least one cycle of length greater than 2. It only remains to estimate the other possible errors and show that they are negligible. Using Subsection 3.6, the total number of cactus trees with $q$ edges does not exceed $C\xi^q$ for some constant $C$. As we are interested in the case where $\theta_2$ vanishes, it is enough to show that the error terms cannot grow faster than the Marčenko-Pastur moments. Actually, using the arguments of Section 3, the whole analysis of errors remains true: the errors can only come from subleading matchings on the graph at each possible step. However, the main difference comes from the number of vertices at each step, which is $k^{L_0} q$ instead of just $kq$. Note that this is still only a power of $k$, which grows slower than any power of $n_1$. Again, the leading contribution to the errors comes from possible multiple edges arising in the graph. Say that a given $j$ vertex is first connected to $r$ cycles of length 2 in the initial graph. At step $L_0$, it is connected to $k^{L_0-1} r$ cycles of length 2. Thus, if at this stage we connect blue indices together, say $p$ of them, we obtain at the next step a multiple edge of multiplicity $2p$. We have a total of $2k^{L_0} r$ blue indices to match at this stage, since we have $2k$ vertices per cycle of length 2. Thus, by comparing the contribution of such matchings with the typical matching, we obtain a bound similar to (3.16). We now finish the proof of Theorem 2.6.
Proof of Theorem 2.6. We have shown that for a polynomial of degree up to $\frac{1}{L-1}\frac{\log n_1}{\log\log(n_1)}$, the expected moments of the e.e.d. are those of the Marčenko-Pastur distribution with the appropriate shape parameter. We now show that the variance of the moments is of order $k^L/n_1^2$, in order to deduce convergence of the actual moments. The principle is similar to that of Lemma 3.10, as we count the corresponding pairs of graphs whose covariance is non-zero. We can perform the same expansion as in Lemma 3.10 for the first layer. Now, in order to have a non-vanishing contribution to the variance (5.7), we need additional identifications between the two graphs. Indeed, either at a given layer $L_0$ an entry of $W^{(L_0)}$ is matched between $G_1$ and $G_2$, or at the last layer there are identifications between the $X$ entries. In the case where entries of $Y^{(L_0)}$ are identified, we see, by expanding with respect to the entries of $W^{(L_0-1)}$, that this implies further identifications in the layers beyond $L_0$. Since at each step we would lose an order $O(q^2 k^{2L_0}/n_0)$ (from the choice of which vertices to identify and the fact that we have one less choice of possible indices), the leading order comes from identifying $X$ entries in the two last layers.
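The $O(1/n_1^2)$ decay of the variance of the normalized moments can be illustrated in the simplest linear case $f(x) = x$ (a heuristic check only; the sample sizes are arbitrary).

```python
import numpy as np

def moment_stat(n: int, m: int, rng) -> float:
    """(1/n) tr M^2 for one sample of M = (1/m) X X^T."""
    X = rng.standard_normal((n, m))
    M = X @ X.T / m
    return float(np.sum(M * M)) / n   # tr M^2 = Frobenius norm squared

rng = np.random.default_rng(2)
var_small = np.var([moment_stat(50, 100, rng) for _ in range(300)])
var_large = np.var([moment_stat(100, 200, rng) for _ in range(300)])

# Var[(1/n) tr M^2] ~ C/n^2: doubling n should shrink it by roughly 4.
assert var_large < 0.5 * var_small
```

This is the standard self-averaging of linear eigenvalue statistics; the point of the proof above is that the same order survives, up to the factor $k^L$, through the nonlinear layers.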
Thus, since the main contribution to the moments is still given by admissible graphs, a similar analysis can be done as in Lemma 3.10: we can, right at the first layer, identify $i$ and $j$ vertices to obtain an identification of the $W^{(L)}$ entries, or one can choose two $W^{(L_0)}$ entries to be identified at a given layer $L_0$ (or $X$ entries at the last layer $L_0 = 1$); we thus obtain the desired bound, since $q$ is fixed here. Let us now extend the result to a bounded function $f$. As in Section 4, we consider a polynomial $P_k$ such that, for some $A > 0$, $\sup_{x\in[-A,A]} \big|(f(x) - a_k) - P_k(x)\big| \le \frac{C_f A^{(1+c_f)k}}{(n+1)!}$. Now, we can consider $Y^{(L,a_k)}$, the matrix constructed as in (2.9) with $f - a_k$ as activation function, and $Y^{(L,P_k)}$, the same matrix constructed with $P_k$. Note that we consider the same sampling of $W$ and $X$ for the construction of these models. We describe the case $L = 2$, as the same reasoning applies recursively for a higher number of layers; for simplicity we also omit the change of variance $\sigma_x/\theta_1(f)$ at each layer. As we saw in Section 4, we simply need to bound the corresponding difference of traces. We split the right-hand side into two parts and write (5.8)
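The polynomial approximation step can be illustrated with an explicit analytic function. The sketch below uses $\sin$ as a stand-in for the activation and the classical Lagrange remainder bound $A^{k+1}/(k+1)!$ in place of the paper's bound with constants $C_f$, $c_f$.

```python
import numpy as np
from math import factorial

A = 2.0
xs = np.linspace(-A, A, 2001)

def taylor_sin(x, k):
    """Degree-k Taylor polynomial of sin at 0 (hypothetical P_k)."""
    return sum((-1)**j * x**(2 * j + 1) / factorial(2 * j + 1)
               for j in range(k // 2 + 1) if 2 * j + 1 <= k)

for k in (5, 9, 13):
    sup_err = np.max(np.abs(np.sin(xs) - taylor_sin(xs, k)))
    # classical Lagrange bound: sup |sin - P_k| <= A^(k+1)/(k+1)!
    assert sup_err <= A**(k + 1) / factorial(k + 1)
```

The factorial decay of the sup-norm error on $[-A, A]$ is exactly what makes the truncation at degree $\sim \log n_1 / \log\log n_1$ harmless in the moment computations above.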