Scaling positive random matrices: concentration and asymptotic convergence

It is well known that any positive matrix can be scaled to have prescribed row and column sums by multiplying its rows and columns by certain positive scaling factors (which are unique up to a positive scalar). This procedure is known as matrix scaling, and has found numerous applications in operations research, economics, image processing, and machine learning. In this work, we investigate the behavior of the scaling factors and the resulting scaled matrix when the matrix to be scaled is random. Specifically, letting $\widetilde{A}\in\mathbb{R}^{M\times N}$ be a positive and bounded random matrix whose entries exhibit a certain type of independence, we provide a concentration inequality for the scaling factors of $\widetilde{A}$ around those of $A = \mathbb{E}[\widetilde{A}]$. This result is employed to bound the convergence rate of the scaling factors of $\widetilde{A}$ to those of $A$, as well as the concentration of the scaled version of $\widetilde{A}$ around the scaled version of $A$ in operator norm, as $M,N\rightarrow\infty$. When the entries of $\widetilde{A}$ are independent, $M=N$, and all prescribed row and column sums are $1$ (i.e., doubly-stochastic matrix scaling), both of the previously-mentioned bounds are $\mathcal{O}(\sqrt{\log N / N})$ with high probability. We demonstrate our results in several simulations.


Introduction
Let $A \in \mathbb{R}^{M \times N}$ be a nonnegative matrix. It was established in a series of classical papers [22,23,24,2,1,16] that, under suitable conditions on $A$, there exist positive vectors $x \in \mathbb{R}^M$ and $y \in \mathbb{R}^N$ such that the matrix $P = D(x) A D(y)$ has prescribed row and column sums, where $D(v)$ is a diagonal matrix with $v$ on its main diagonal. The problem of finding the appropriate $x$ and $y$ that produce $P$ with the prescribed row and column sums is known as matrix scaling or matrix balancing; see [10] for a comprehensive review of the topic and its various extensions. Formally, we use the following definition.
Definition 1 (Matrix scaling). We say that a pair of vectors $(x, y)$ scales $A$ to row sums $r$ and column sums $c$ if
$$ r_i = \sum_{j=1}^{N} x_i A_{i,j} y_j, \qquad c_j = \sum_{i=1}^{M} x_i A_{i,j} y_j, \qquad (1) $$
for all $i \in [M]$ and $j \in [N]$. We refer to $x$ and $y$ from (1) (or their entries) as scaling factors of $A$.
In the special case that $M = N$ and $r_i = c_j = 1$ for all $i \in [M]$ and $j \in [N]$, the problem of matrix scaling becomes that of finding a doubly-stochastic normalization of $A$, originally studied by Sinkhorn [22] with the motivation of estimating doubly-stochastic transition probability matrices.
It is important to mention that (1) is a system of nonlinear equations in x and y with no closed-form solution. Nevertheless, if the scaling factors x and y exist, they can be found by the Sinkhorn-Knopp algorithm [24] (also known as the RAS algorithm), which is a simple iterative procedure that alternates between computing x via (1) using y from the previous iteration, and vice versa (a procedure equivalent to alternating between normalizing the rows of A and normalizing the columns of A to have the prescribed row and column sums, respectively).
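To make the iteration concrete, the following minimal Python/NumPy sketch implements the Sinkhorn-Knopp procedure just described. The code is our own illustration (function and variable names are ours, not from the cited references), and it assumes a strictly positive input matrix; later sketches in this paper reuse the `sinkhorn_knopp` helper defined here.

```python
import numpy as np

def sinkhorn_knopp(A, r, c, n_iter=10_000, tol=1e-12):
    """Sinkhorn-Knopp (RAS) iteration: find positive x, y such that
    D(x) A D(y) has row sums r and column sums c.

    A minimal sketch for a strictly positive matrix A; assumes
    r.sum() == c.sum(), as required by Theorem 1.
    """
    y = np.ones(A.shape[1])
    for _ in range(n_iter):
        x = r / (A @ y)        # enforce row sums:    sum_j x_i A_ij y_j = r_i
        y = c / (A.T @ x)      # enforce column sums: sum_i x_i A_ij y_j = c_j
        row_sums = x * (A @ y)
        if np.abs(row_sums - r).max() < tol:   # column sums are exact here
            return x, y
    return x, y

# Doubly-stochastic example (M = N and all prescribed sums equal to 1):
rng = np.random.default_rng(0)
A = rng.uniform(0.5, 1.5, size=(50, 50))
x, y = sinkhorn_knopp(A, np.ones(50), np.ones(50))
P = x[:, None] * A * y[None, :]
print(P.sum(axis=1).max(), P.sum(axis=0).max())   # both close to 1
```

After each column update the column sums are exact, so the stopping rule only needs to monitor the row sums.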
From a theoretical perspective, given a nonnegative matrix A, existence and uniqueness of the scaling factors and of the scaled matrix P depend primarily on the particular zero-pattern of A; see [1] and references therein for more details. In this work, we focus on the simpler case that A is strictly positive, in which case existence and uniqueness of the scaling factors and of the scaled matrix P are guaranteed by the following theorem (see [23]).
Theorem 1 (Existence and uniqueness [23]). Suppose that $A$, $r$, and $c$ are positive, and $\|r\|_1 = \|c\|_1$. Then, there exists a pair of positive vectors $(x, y)$ that scales $A$ to row sums $r$ and column sums $c$. Furthermore, the resulting scaled matrix $P = D(x) A D(y)$ is unique, and the pair $(x, y)$ can be replaced only with $(\alpha x, \alpha^{-1} y)$, for any $\alpha > 0$.
Over the years, matrix scaling and the Sinkhorn-Knopp algorithm have found a wide array of applications in science and engineering. In economics and operations research, classical applications of matrix scaling include transportation planning [12], analyzing migration fields [25], and estimating social accounting matrices [21]. In image processing and computer vision, matrix scaling was employed for image denoising [18] and graph matching [5]. Recently, matrix scaling has been attracting growing interest from the machine learning community, with applications in manifold learning [17,27], clustering [28,14], and classification [7]. See also [20,6] for applications of matrix scaling in data science through the machinery of optimal transport.
In many practical situations, matrix scaling is actually applied to a random matrix that represents a perturbation, or a random observation, of an underlying deterministic population matrix; see for example [13,27,18,4]. Arguably, this is the case in all of the previously-mentioned applications of matrix scaling whenever real data is involved. In particular, applications of matrix scaling in machine learning and data science often involve large data matrices that suffer from corruptions and measurement errors, and hence are more accurately described by random models. Due to such challenges, it is important to understand the influence of random perturbations in A on the required scaling factors and on the resulting scaled matrix, particularly in the setting where A is large and the entrywise perturbations are not necessarily small. It is noteworthy that existing literature related to scaling random matrices is mostly concerned with special cases such as the scaling of symmetric kernel matrices [27,13] and the spectral properties of random doubly-stochastic matrices [19,3].
Let $\widetilde{A} \in \mathbb{R}^{M \times N}$ be a positive random matrix, and define $A = \mathbb{E}[\widetilde{A}]$. Theorem 1 establishes the existence and uniqueness of a set of scaling factors $\{(\alpha x, \alpha^{-1} y)\}_{\alpha > 0}$ of $A$, together with the existence and uniqueness of the corresponding scaled matrix $P$. Theorem 1 can also be applied analogously to $\widetilde{A}$, establishing the existence and uniqueness of a set of random scaling factors $\{(\alpha \widetilde{x}, \alpha^{-1} \widetilde{y})\}_{\alpha > 0}$ of $\widetilde{A}$, as well as the existence and uniqueness of the corresponding scaled random matrix $\widetilde{P} = D(\widetilde{x}) \widetilde{A} D(\widetilde{y})$.
The main purpose of this work is to establish that under suitable conditions on $\widetilde{A}$, $r$, and $c$, there is a pair of scaling factors $(\widetilde{x}, \widetilde{y})$ of $\widetilde{A}$ that concentrates around a pair of scaling factors $(x, y)$ of $A$ (in an appropriate sense), and furthermore, that the resulting scaled random matrix $\widetilde{P}$ concentrates around $P$ in operator norm. Notably, the main technical challenge in deriving such results is the implicit nonlinear characterization of $\widetilde{x}$ and $\widetilde{y}$ through (1), which prohibits the direct application of standard concentration inequalities. Therefore, an important aspect of this work is providing a mechanism for applying standard vector and matrix concentration inequalities in the analysis of random matrix scaling.
The main contributions of this work are as follows. We begin by providing a concentration inequality for the scaling factors of $\widetilde{A}$ around those of $A$, assuming the entries of $\widetilde{A}$ are bounded from above and from below away from zero, and in addition that they are independent within each row and each column of $\widetilde{A}$ separately; see Theorem 3 in Section 2.1. To that end, we derive a result concerning the stability of the scaling factors of a matrix under perturbations of the prescribed row and column sums, which may be of independent interest; see Lemma 9 in Section 4.2. We then turn to consider an asymptotic setting where $M, N \to \infty$, and employ Theorem 3 to bound the pointwise convergence rate of the scaling factors of $\widetilde{A}$ to those of $A$; see Theorem 4 and Corollary 5 in Section 2.2. In addition, under the same asymptotic setting but further assuming that all entries of $\widetilde{A}$ are independent, we make use of Theorem 4 to bound the asymptotic concentration of $\widetilde{P}$ around $P$ in operator norm; see Theorem 6 and Corollary 7 in Section 2.3. We conclude by conducting several numerical experiments that corroborate our theoretical findings and demonstrate that our convergence rates are tight in certain situations; see Section 3.

Concentration of matrix scaling factors
Let us define the normalized prescribed row and column sums
$$ \bar{r}_i = \frac{r_i}{\sqrt{s}}, \qquad \bar{c}_j = \frac{c_j}{\sqrt{s}}, \qquad (2) $$
for all $i \in [M]$ and $j \in [N]$, where $s = \|r\|_1 = \|c\|_1$ denotes the sum of all entries in $P$. The following lemma describes a useful normalization of the scaling factors of $A$ and the resulting bounds on their entries.

Lemma 2 (Boundedness of scaling factors). Suppose that $A$, $r$, and $c$ are positive, and $\|r\|_1 = \|c\|_1$. Then, there exists a unique pair of positive vectors $(x, y)$ that satisfies $\|x\|_1 = \|y\|_1$ and scales $A$ to row sums $r$ and column sums $c$. Furthermore, denoting $a = \min_{i,j} A_{i,j}$ and $b = \max_{i,j} A_{i,j}$, we have for all $i \in [M]$ and $j \in [N]$:
$$ \frac{\sqrt{a}}{b}\,\bar{r}_i \le x_i \le \frac{\sqrt{b}}{a}\,\bar{r}_i, \qquad \frac{\sqrt{a}}{b}\,\bar{c}_j \le y_j \le \frac{\sqrt{b}}{a}\,\bar{c}_j. $$

The proof can be found in Section 4.1 and is based on a straightforward manipulation of the system (1). From this point onward we will always assume that $r$ and $c$ are positive and $\|r\|_1 = \|c\|_1$. We now have the following theorem, which provides a concentration inequality for a certain pair of scaling factors of $\widetilde{A}$ around the pair $(x, y)$ from Lemma 2 (taking $A = \mathbb{E}[\widetilde{A}]$).

Theorem 3 (Concentration of scaling factors). Suppose that the entries of $\widetilde{A}$ are independent within each row and within each column, and that $a_{i,j} \le \widetilde{A}_{i,j} \le b_{i,j}$ for all $i \in [M]$ and $j \in [N]$, where $a = \min_{i,j} a_{i,j} > 0$ and $b = \max_{i,j} b_{i,j} < \infty$. Then, there exists a pair of positive random vectors $(\widetilde{x}, \widetilde{y})$ that scales $\widetilde{A}$ to row sums $r$ and column sums $c$, such that for any $\delta \in (0, 1]$, with probability at least the quantity in (5), the entrywise relative-error bounds in (6) hold for all $i \in [M]$ and $j \in [N]$, where the constants involved depend only on $a$, $b$, and $d = \max_{i,j}\{b_{i,j} - a_{i,j}\}$.

Note that Theorem 3 requires only that the entries of $\widetilde{A}$ are independent within each of its rows and each of its columns separately. This condition is clearly less restrictive than the requirement that all of the entries of $\widetilde{A}$ are independent. For instance, consider the matrix $\widetilde{A}_{i,j} = g_{i,j}(u_i v_j)$, where $\{u_i\}_{i=1}^M$ and $\{v_j\}_{j=1}^N$ are independent Rademacher variables, and $g_{i,j} : \{-1, 1\} \to [a_{i,j}, b_{i,j}]$ are fixed positive functions. Evidently, each row and column of $\widetilde{A}$ contains independent entries, yet the entries of $\widetilde{A}$ are strongly dependent, since knowing any single row (column) of $\widetilde{A}$ substantially restricts the distribution of any other row (column).
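To illustrate the flavor of Theorem 3 numerically (without reproducing its exact constants, which appear in (5)-(6)), the following sketch scales a bounded random matrix with independent entries, and its expectation, both under the normalization $\|x\|_1 = \|y\|_1$ of Lemma 2, and reports the largest entrywise relative error of the scaling factors. It reuses `sinkhorn_knopp` from the sketch in the introduction; all other names are ours, for illustration only.

```python
import numpy as np

def balanced_factors(A, r, c):
    """Scaling factors of A normalized so that ||x||_1 = ||y||_1 (Lemma 2)."""
    x, y = sinkhorn_knopp(A, r, c)   # helper from the sketch in the introduction
    alpha = np.sqrt(y.sum() / x.sum())
    return alpha * x, y / alpha      # (alpha x, y / alpha) also scales A

rng = np.random.default_rng(1)
M, N = 300, 400
A = rng.uniform(1.0, 2.0, size=(M, N))              # A = E[A_tilde], entries in [1, 2]
A_tilde = A + rng.uniform(-0.5, 0.5, size=(M, N))   # independent, bounded, mean-zero noise

r = np.full(M, N / M)                                # prescribed sums with ||r||_1 = ||c||_1
c = np.ones(N)
x, y = balanced_factors(A, r, c)
xt, yt = balanced_factors(A_tilde, r, c)
err = max(np.abs(xt / x - 1).max(), np.abs(yt / y - 1).max())
print(f"largest relative error of the scaling factors: {err:.3e}")
```

The `balanced_factors` helper is reused in the remaining sketches of this paper.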
It is worthwhile to point out that Theorem 3 also implies the following statement, which is perhaps more intuitive than the formulation in Theorem 3. For any pair of scaling factors $(\widetilde{x}', \widetilde{y}')$ of $\widetilde{A}$, Theorem 3 implies that there exists a pair of scaling factors $(x', y')$ of $A$ such that, with probability at least (5), the bounds in (6) hold if we replace $(\widetilde{x}, \widetilde{y})$ and $(x, y)$ with $(\widetilde{x}', \widetilde{y}')$ and $(x', y')$, respectively (under the conditions in Theorem 3). This claim stems simply from the fact that any pair $(\widetilde{x}', \widetilde{y}')$ of scaling factors of $\widetilde{A}$ can be written as $(\alpha \widetilde{x}, \alpha^{-1} \widetilde{y})$ for some $\alpha > 0$, where $(\widetilde{x}, \widetilde{y})$ is the specific pair whose existence is guaranteed by Theorem 3; subsequently, taking $(x', y') = (\alpha x, \alpha^{-1} y)$ establishes the claim, since the relative errors in (6) are invariant to this common rescaling. The proof of Theorem 3 can be found in Section 4.3 and is based on a simple two-step procedure. First, we use Lemma 2 in conjunction with Hoeffding's inequality [8] to prove that the row and column sums of $D(x) \widetilde{A} D(y)$ concentrate around $r$ and $c$, respectively; see Lemma 8 in Section 4.2. Second, we prove that if the matrix $\widetilde{A}$ can be approximately scaled by the pair $(x, y)$, then $(x, y)$ must be sufficiently close to a pair of scaling factors of $\widetilde{A}$.
This result is based on Sinkhorn's technique in [23] for proving the uniqueness of the scaling factors, which we substantially extend to describe the stability of the scaling factors under approximate scaling (or equivalently, under perturbations of the prescribed row and column sums); see Lemma 9 in Section 4.2. It is worthwhile to point out that Hoeffding's inequality in the proof of Theorem 3 can be replaced with any other concentration inequality for sums of random variables, allowing one to relax the assumptions on boundedness and independence.

Asymptotic convergence of scaling factors
We now place ourselves in an asymptotic setting where the dimensions of $\widetilde{A}$ tend to infinity, and apply Theorem 3 to study the asymptotic convergence of the scaling factors of $\widetilde{A}$ to those of $A$ in this regime. Specifically, let $\{\widetilde{A}^{(N)}\}_{N \ge N_0}$, with $\widetilde{A}^{(N)} \in \mathbb{R}^{M_N \times N}$, be a sequence of positive random matrices and define $A^{(N)} = \mathbb{E}[\widetilde{A}^{(N)}]$, where $N_0$ is some positive integer, and $\lim_{N \to \infty} M_N = \infty$. Suppose that for any positive integer $N \ge N_0$, we are given positive row sums $r^{(N)}$ and column sums $c^{(N)}$ that satisfy $\|r^{(N)}\|_1 = \|c^{(N)}\|_1$. According to Lemma 2, $A^{(N)}$ can be scaled to row sums $r^{(N)}$ and column sums $c^{(N)}$ by a unique pair of positive vectors $(x^{(N)}, y^{(N)})$ that satisfies $\|x^{(N)}\|_1 = \|y^{(N)}\|_1$. As a measure of discrepancy between $(x^{(N)}, y^{(N)})$ and another pair of vectors $(\widetilde{x}, \widetilde{y})$, where $\widetilde{x} \in \mathbb{R}^{M_N}$ and $\widetilde{y} \in \mathbb{R}^N$, we use the largest entrywise relative error between the pairs; see (8). It is important to mention that the scaling factors $x^{(N)}$ and $y^{(N)}$, as well as their dimensions, vary with $N$. In what follows we use the notation $X_N = \mathcal{O}_{\text{w.h.p.}}(\gamma_N)$, where $\{\gamma_N\}$ is a deterministic sequence, to mean order with high probability, namely that $X_N / \gamma_N$ is bounded by a constant with probability tending to one as $N \to \infty$; see [15]. We now have the following theorem, which bounds the convergence rate of a certain pair of scaling factors of $\widetilde{A}^{(N)}$ to $(x^{(N)}, y^{(N)})$ in terms of a quantity $\rho^{(N)}$ defined in (9), which depends on the prescribed row and column sums and on the dimensions. The proof can be found in Section 4.4 and is largely based on a direct application of Theorem 3 using an appropriate $\delta$.
To exemplify Theorem 4, let us consider the setting of doubly-stochastic matrix scaling, namely $M_N = N$ and $r_i^{(N)} = c_j^{(N)} = 1$ for all $i, j \in [N]$. According to (9), in this case $\rho^{(N)}$ remains bounded, and Theorem 4 yields a convergence rate of $\mathcal{O}_{\text{w.h.p.}}(\sqrt{\log N / N})$. Similarly, it is easy to verify that the same convergence rate of $\mathcal{O}_{\text{w.h.p.}}(\sqrt{\log N / N})$ holds whenever $M_N$ grows proportionally to $N$ and the prescribed row and column sums are balanced so that $\rho^{(N)}$ remains bounded, as described in the next corollary of Theorem 4.
Corollary 5. Suppose that the conditions in Theorem 4 hold, and in addition that $M_N$ grows proportionally to $N$ while $\max_i r_i^{(N)}$ and $\max_j c_j^{(N)}$ remain comparable to the corresponding average row and column sums. Then, $(\widetilde{x}^{(N)}, \widetilde{y}^{(N)})$ converges to $(x^{(N)}, y^{(N)})$ at rate $\mathcal{O}_{\text{w.h.p.}}(\sqrt{\log N / N})$.

The proof follows immediately from the fact that $\rho^{(N)}$ remains bounded under the conditions of the corollary. Outside of such settings, however, $\rho^{(N)}$ can grow as $N$ grows, and consequently the convergence rate of $(\widetilde{x}^{(N)}, \widetilde{y}^{(N)})$ to $(x^{(N)}, y^{(N)})$ would deteriorate accordingly.
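The following sketch illustrates the kind of experiment behind these rates in the doubly-stochastic setting (reusing the `sinkhorn_knopp` and `balanced_factors` helpers from the earlier sketches): as $N$ grows, the largest relative error of the scaling factors is compared against $\sqrt{\log N / N}$; an approximately constant ratio is consistent with the rate being tight.

```python
import numpy as np

rng = np.random.default_rng(2)
for N in [100, 200, 400, 800, 1600]:
    A = rng.uniform(1.0, 2.0, size=(N, N))              # population matrix A
    A_tilde = A + rng.uniform(-0.5, 0.5, size=(N, N))   # independent bounded noise
    ones = np.ones(N)
    x, y = balanced_factors(A, ones, ones)              # factors of A
    xt, yt = balanced_factors(A_tilde, ones, ones)      # factors of A_tilde
    err = max(np.abs(xt / x - 1).max(), np.abs(yt / y - 1).max())
    rate = np.sqrt(np.log(N) / N)
    print(f"N={N:5d}  error={err:.3e}  sqrt(log N / N)={rate:.3e}  ratio={err / rate:.2f}")
```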

Concentration of $\widetilde{P}$ around $P$ in operator norm
Let $\widetilde{P}^{(N)}$ and $P^{(N)}$ be the matrices obtained from $\widetilde{A}^{(N)}$ and $A^{(N)}$, respectively, after scaling them to row sums $r^{(N)}$ and column sums $c^{(N)}$, i.e., $\widetilde{P}^{(N)} = D(\widetilde{x}^{(N)}) \widetilde{A}^{(N)} D(\widetilde{y}^{(N)})$ and $P^{(N)} = D(x^{(N)}) A^{(N)} D(y^{(N)})$, where $(\widetilde{x}^{(N)}, \widetilde{y}^{(N)})$ is any pair of scaling factors of $\widetilde{A}^{(N)}$. Note that by Theorem 1 the matrices $\widetilde{P}^{(N)}$ and $P^{(N)}$ are uniquely determined ($\widetilde{P}^{(N)}$ being a random matrix). In addition, we define the quantity appearing in the bound of the next result. We now have the following result, which provides an upper bound on the concentration of $\widetilde{P}^{(N)}$ around $P^{(N)}$ in operator norm.
The proof can be found in Section 4.5, and is based on Theorem 4 together with the concentration of $\widetilde{A}^{(N)}$ around $A^{(N)}$ in operator norm. Note that in the doubly-stochastic setting, since $P^{(N)}$ is doubly stochastic, it follows that $\|P^{(N)}\|_2 = 1$ (see [9]). In the more general case where the prescribed row and column sums are not all $1$, the operator norm of $P^{(N)}$ can converge to zero or grow unboundedly with $N$, depending on the asymptotic behavior of the prescribed row and column sums. Consequently, we also consider the normalized error $\|\widetilde{P}^{(N)} - P^{(N)}\|_2 / \|P^{(N)}\|_2$, which is the subject of the following corollary of Theorem 6.

Proof. Observe that $P^{(N)} \mathbf{1}_N = r^{(N)}$, where $\mathbf{1}_N$ is the column vector of ones in $\mathbb{R}^N$, and hence $\|P^{(N)}\|_2 \ge \|r^{(N)}\|_2 / \sqrt{N}$. In addition, it is easy to verify the required bound on $\rho^{(N)}$, where we used the fact that $s^{(N)} = \sum_j c_j^{(N)}$. Applying Theorem 6 and using all of the above concludes the proof.
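A sketch of the corresponding operator-norm experiment in the doubly-stochastic setting, where $\|P^{(N)}\|_2 = 1$ and hence the raw and normalized errors coincide (again reusing the helpers from the earlier sketches; the spectral norm is computed exactly via the largest singular value):

```python
import numpy as np

rng = np.random.default_rng(3)
for N in [100, 200, 400, 800]:
    A = rng.uniform(1.0, 2.0, size=(N, N))
    A_tilde = A + rng.uniform(-0.5, 0.5, size=(N, N))
    ones = np.ones(N)
    x, y = balanced_factors(A, ones, ones)
    xt, yt = balanced_factors(A_tilde, ones, ones)
    P = x[:, None] * A * y[None, :]                 # scaled population matrix
    Pt = xt[:, None] * A_tilde * yt[None, :]        # scaled random matrix
    err = np.linalg.norm(Pt - P, ord=2)             # spectral (operator) norm
    print(f"N={N:4d}  ||Pt - P||_2={err:.3e}  sqrt(log N / N)={np.sqrt(np.log(N) / N):.3e}")
```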
Numerical experiments

It is evident from Figures 1a, 1b, and 1c that the asymptotic bound in Theorem 4 agrees very well with the experimental results, suggesting that this bound is tight for the considered scenarios, and in particular that the factor $\log(N)$ in the corresponding bounds is necessary.

Proof of Lemma 2

According to Theorem 1, there exists a pair of positive vectors that scales $A$ to row sums $r$ and column sums $c$, and this pair is unique up to the transformation $(x, y) \mapsto (\alpha x, \alpha^{-1} y)$ for $\alpha > 0$; hence there is a unique such pair satisfying $\|x\|_1 = \|y\|_1$. According to (1) we have
$$ x_i = \frac{r_i}{\sum_{j=1}^{N} A_{i,j} y_j}, \qquad y_j = \frac{c_j}{\sum_{i=1}^{M} x_i A_{i,j}}, $$
and since $a \le A_{i,j} \le b$ for all $i, j$, it follows that
$$ \frac{r_i}{b \|y\|_1} \le x_i \le \frac{r_i}{a \|y\|_1}, \qquad \frac{c_j}{b \|x\|_1} \le y_j \le \frac{c_j}{a \|x\|_1}, \qquad (19) $$
for all $i \in [M]$ and $j \in [N]$. Summing the inequalities for $x_i$ in (19) over $i = 1, \ldots, M$, and using $\|x\|_1 = \|y\|_1$ together with $\sum_i r_i = s$, gives $\sqrt{s/b} \le \|y\|_1 \le \sqrt{s/a}$. Lastly, plugging the above back into (19) gives the required result.
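A quick numerical check of the bounds in Lemma 2, as written above (a sketch reusing the `sinkhorn_knopp` and `balanced_factors` helpers from the earlier sketches):

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 80, 120
A = rng.uniform(0.7, 1.9, size=(M, N))
a, b = A.min(), A.max()
r = rng.uniform(0.5, 1.5, size=M)
c = rng.uniform(0.5, 1.5, size=N)
c *= r.sum() / c.sum()                    # enforce ||r||_1 = ||c||_1
s = r.sum()                               # sum of all entries of P

x, y = balanced_factors(A, r, c)          # normalization ||x||_1 = ||y||_1
lo, hi = np.sqrt(a) / b, np.sqrt(b) / a
assert np.all(lo * r / np.sqrt(s) <= x) and np.all(x <= hi * r / np.sqrt(s))
assert np.all(lo * c / np.sqrt(s) <= y) and np.all(y <= hi * c / np.sqrt(s))
print("Lemma 2 bounds hold on this instance")
```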

Lemmas supporting the proof of Theorem 3
The first step in proving Theorem 3 is to make use of Lemma 2 together with Hoeffding's inequality [8] to provide a concentration inequality for the sums of the rows and of the columns of $D(x) \widetilde{A} D(y)$ around $r$ and $c$, respectively, where $(x, y)$ is any pair of scaling factors of $A$. This is the subject of the following lemma.

Lemma 8 (Concentration of row and column sums). Suppose that the entries of $\widetilde{A}$ are independent within each row and within each column, and that $a_{i,j} \le \widetilde{A}_{i,j} \le b_{i,j}$ for all $i \in [M]$ and $j \in [N]$; denote $a = \min_{i,j} a_{i,j}$ and $b = \max_{i,j} b_{i,j}$. Then, for any pair of vectors $(x, y)$ that scales $A = \mathbb{E}[\widetilde{A}]$ to row sums $r$ and column sums $c$, and for any $\varepsilon > 0$, the row and column sums of $D(x) \widetilde{A} D(y)$ satisfy the tail bounds (21) and (22) for all $i \in [M]$ and $j \in [N]$.

Proof. Fix $i \in [M]$. The random variables $\{x_i \widetilde{A}_{i,j} y_j\}_{j=1}^{N}$ are independent and bounded, with $x_i a_{i,j} y_j \le x_i \widetilde{A}_{i,j} y_j \le x_i b_{i,j} y_j$. Therefore, using the fact that $\mathbb{E}\big[\sum_{j=1}^{N} x_i \widetilde{A}_{i,j} y_j\big] = \sum_{j=1}^{N} x_i A_{i,j} y_j = r_i$, Hoeffding's inequality [8] gives
$$ \Pr\left( \Big| \sum_{j=1}^{N} x_i \widetilde{A}_{i,j} y_j - r_i \Big| \ge \varepsilon \right) \le 2 \exp\left( - \frac{2 \varepsilon^2}{\sum_{j=1}^{N} x_i^2 y_j^2 (b_{i,j} - a_{i,j})^2} \right), \qquad (23) $$
for all $i \in [M]$. Since we can always find a constant $\alpha > 0$ such that $\|\alpha x\|_1 = \|\alpha^{-1} y\|_1$, and the products $x_i y_j$ are invariant to this rescaling, Lemma 2 implies that for all $i \in [M]$ and $j \in [N]$,
$$ x_i y_j \le \frac{b}{a^2} \cdot \frac{r_i c_j}{s}. \qquad (24) $$
Applying the above inequality to (23) yields the tail bound (25) for all $i \in [M]$. Analogously to the derivation of (25), by using $\sum_{i=1}^{M} x_i A_{i,j} y_j = c_j$ together with Hoeffding's inequality, one can verify the corresponding tail bound for all $j \in [N]$.

We next prove that if a matrix $\widetilde{A}$ can be approximately scaled to row sums $r$ and column sums $c$ by a pair of vectors $(x, y)$, then there exists a pair of vectors $(\widetilde{x}, \widetilde{y})$ that scales $\widetilde{A}$ exactly and is also close to $(x, y)$. The proof relies on extending Sinkhorn's original proof of uniqueness of the scaling factors in [23] to describe the stability of the scaling factors under approximate scaling. We note that $\widetilde{A}$ is not considered random in the following lemma.
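Before stating that result (Lemma 9 below), here is a quick empirical illustration of the concentration step of Lemma 8: the sketch fixes the scaling factors $(x, y)$ of the population matrix $A$ and examines how far the row sums of $D(x) \widetilde{A} D(y)$ deviate from $r$ over repeated draws of the noise (reusing the helpers from the earlier sketches).

```python
import numpy as np

rng = np.random.default_rng(5)
M, N = 200, 200
A = rng.uniform(1.0, 2.0, size=(M, N))       # A = E[A_tilde]
r = c = np.ones(N)
x, y = balanced_factors(A, r, c)             # scaling factors of A, not of A_tilde

max_devs = []
for _ in range(200):
    A_tilde = A + rng.uniform(-0.5, 0.5, size=(M, N))
    row_sums = x * (A_tilde @ y)             # i-th entry: sum_j x_i A_tilde_ij y_j
    max_devs.append(np.abs(row_sums - r).max())
print(f"median of max row-sum deviation: {np.median(max_devs):.3e}")
```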
Lemma 9 (Stability under approximate scaling). Suppose that $\widetilde{A}$ is positive, that $x_i \ge C_1 \bar{r}_i$ and $y_j \ge C_2 \bar{c}_j$ for positive constants $C_1, C_2$, and that the pair $(x, y)$ approximately scales $\widetilde{A}$, in the sense of the two-sided bounds (31) on the row and column sums of $D(x) \widetilde{A} D(y)$, for all $i \in [M]$ and $j \in [N]$. Then, $\widetilde{A}$ can be scaled to row sums $r$ and column sums $c$ by a pair $(\widetilde{x}, \widetilde{y})$ whose entries are correspondingly close to those of $(x, y)$.

Proof. Let $(\widetilde{x}, \widetilde{y})$ be the unique pair of scaling factors of $\widetilde{A}$ with $\|\widetilde{x}\|_1 = \|\widetilde{y}\|_1$ (see Lemma 2), and define the vectors $u$ and $v$ in (30), which compare $(x, y)$ with $(\widetilde{x}, \widetilde{y})$ entrywise, for all $i \in [M]$ and $j \in [N]$. Observe that $\sum_i \widetilde{P}_{i,j} = c_j$ and $\sum_j \widetilde{P}_{i,j} = r_i$, where $\widetilde{P} = D(\widetilde{x}) \widetilde{A} D(\widetilde{y})$. The proof of Lemma 9 is based on a manipulation of (31) using $\widetilde{P}$, $u$, $v$, and their properties. Since this manipulation is somewhat technical, in what follows we break it down into several steps.

Deriving bounds on $\{u_i\}$ and $\{v_j\}$
Using the first inequality in (31) for $j = \operatorname{argmin}_k v_k$, we obtain (32). Similarly, using the second inequality in (31) for $i = \operatorname{argmax}_\ell u_\ell$ gives (33), and by combining (32) and (33) we obtain (34). Analogously to (32) and (33), it is easy to verify that by using the first inequality in (31) for $j = \operatorname{argmax}_k v_k$ and the second inequality in (31) for $i = \operatorname{argmin}_\ell u_\ell$, one gets (35). Note that (35) can also be obtained directly from (34) by a symmetry argument, that is, by considering (34) in the setting where $\widetilde{A}$ is replaced with its transpose, thereby interchanging the roles of $u$ and $v$.
In addition, according to Lemma 2 and using the fact that $x_i \ge C_1 \bar{r}_i$ and $y_j \ge C_2 \bar{c}_j$ (from the conditions in Lemma 9), it follows that (36) holds for all $i \in [M]$ and $j \in [N]$.

Bounding $\max_j v_j - \min_j v_j$ and $\max_i u_i - \min_i u_i$

Let us denote $\ell = \operatorname{argmax}_i u_i$. By the second inequality in (31) together with (34), we can write (37), implying (38). Multiplying (38) by $\min_j v_j / \min_j \widetilde{P}_{\ell,j}$, it follows that (39), where we used the definition of $\widetilde{P}$ together with the conditions in Lemma 9. Multiplying (39) by $r_\ell / N$ and employing the definitions of $\bar{r}_i$ and $\bar{c}_j$ (see (2)) gives (40). We next provide a derivation analogous to (37)-(40) to obtain a bound for $\frac{1}{N} \sum_{j=1}^{N} (\max_k v_k - v_j)$. Let us denote $t = \operatorname{argmin}_i u_i$. Using the second inequality in (31) together with (35), we have (41), and therefore (42). Multiplying the above by $\max_j v_j / \min_j \widetilde{P}_{t,j}$, it follows that (43). Furthermore, multiplying the above by $r_t / N$ and using the definitions of $\bar{r}_i$ and $\bar{c}_j$ (see (2)), we get (44). Lastly, summing (40) and (44) gives (45). It is easy to verify that by repeating the derivation of (37)-(45) analogously for $u_i$ instead of $v_j$, we get (46); we omit the full derivation for the sake of brevity. Note that (46) can also be obtained directly from (45) by a symmetry argument, namely by considering (45) in the setting where $\widetilde{A}$ is replaced with its transpose, so that $N$ is replaced with $M$, $c$ is replaced with $r$, and $v$ is replaced with $u$.

Bounding the entries of $u$ and $v$

Taking $\tau$ as the geometric mean of $\max_j v_j$ and $\min_j v_j$, together with (45), gives (47) for all $j \in [N]$. Multiplying both sides of (47) by an appropriate factor yields (48), where we also used (36) in the last inequality. According to (34), we have for all $\varepsilon \in (0, 1)$ that (49) holds, which together with (47) implies (50). Analogously to (47), from (46) we obtain (51) for all $i \in [M]$. Multiplying both sides of (51) by $\alpha = \max_j v_j / \max_i u_i$ we get (52). Consequently, using (35), and analogously to the derivation of (50), it follows that (53) holds, which together with the definition of $u$ and $v$ in (30) concludes the proof (since $(\alpha \widetilde{x}, \alpha^{-1} \widetilde{y})$ is a pair of scaling factors of $\widetilde{A}$).

Proof of Theorem 3
According to Lemma 2, there exists a unique pair of positive vectors $(x, y)$ satisfying $\|x\|_1 = \|y\|_1$ that scales $A$ to row sums $r$ and column sums $c$ and obeys the bounds (54). Lemma 8, applied together with a union bound over all $i \in [M]$ and $j \in [N]$, asserts that with probability at least (55) we have (56), where $C = b/a^2$. If (56) holds, we apply Lemma 9 with $C_1 = \min_i \{x_i / \bar{r}_i\} \ge \sqrt{a}/b$ and $C_2 = \min_j \{y_j / \bar{c}_j\} \ge \sqrt{a}/b$ (using (54)), which guarantees that $\widetilde{A}$ can be scaled to row sums $r$ and column sums $c$ by a pair $(\widetilde{x}, \widetilde{y})$ that satisfies (57) for all $i \in [M]$, $j \in [N]$, and any $\varepsilon \in (0, 1/2]$. Overall, taking $\varepsilon = \delta/2$, and using the fact that $s = \|r\|_1 \ge M \min_i r_i$ and $s = \|c\|_1 \ge N \min_j c_j$, asserts that there exists a pair $(\widetilde{x}, \widetilde{y})$ that scales $\widetilde{A}$ to row sums $r$ and column sums $c$, such that for any $\delta \in (0, 1]$ the bounds (58) hold for all $i \in [M]$ and $j \in [N]$, with probability at least (59). Therefore, using $b_{i,j} - a_{i,j} \le \max_{i,j} \{b_{i,j} - a_{i,j}\} = d$ proves Theorem 3 with the constants stated there.

Proof of Theorem 4
We begin by considering indices $N$ for which $\rho^{(N)} \sqrt{\log(\max\{M_N, N\})} \le 1/\sqrt{2 C_p}$, where $C_p$ is the constant from Theorem 3. In this case, we apply Theorem 3 using $\delta = \rho^{(N)} \sqrt{2 C_p \log(\max\{M_N, N\})}$, noting that $\delta \le 1$ as required since $\rho^{(N)} \sqrt{\log(\max\{M_N, N\})} \le 1/\sqrt{2 C_p}$; consequently, we have the bound (60) for all $i \in [M_N]$ and $j \in [N]$. Next, we consider indices $N$ for which $\rho^{(N)} \sqrt{\log(\max\{M_N, N\})} > 1/\sqrt{2 C_p}$. In this case, we first apply Lemma 2 to $\widetilde{A}^{(N)}$, which asserts that there exists a pair of positive vectors $(\widetilde{x}^{(N)}, \widetilde{y}^{(N)})$ that scales $\widetilde{A}^{(N)}$ to row sums $r^{(N)}$ and column sums $c^{(N)}$ and satisfies the bounds (62) for all $i \in [M_N]$ and $j \in [N]$. Combining the above inequalities with the analogous inequalities for $x^{(N)}$ and $y^{(N)}$ (which scale $A^{(N)}$ and satisfy $\|x^{(N)}\|_1 = \|y^{(N)}\|_1$) gives (63) for all $i \in [M_N]$ and $j \in [N]$. Therefore, we have (64). Since the lower bound in the right-hand side of (63) converges to $1$ as $N \to \infty$, (64) yields the required rate in this regime as well.

Proof of Theorem 6

We now bound the summands in the right-hand side of (68) one by one. Note that the variables $\{\widetilde{A}_{i,j} - A_{i,j}\}_{i,j}$ are independent, have mean zero, and are bounded (and therefore sub-Gaussian). Applying Theorem 4.4.5 in [26] with $t = \sqrt{\log N}$ gives (70). Combining (70) with the bounds above, and using the definition of $\rho^{(N)}$, concludes the proof.

Acknowledgements
The author would like to thank Thomas Zhang, Yuval Kluger, and Dan Kluger for their useful comments and suggestions. This research was supported by the National Institutes of Health, grant numbers R01GM131642 and UM1DA051410.