Least squares estimation of spatial autoregressive models for large-scale social networks ∗

Abstract: Due to the rapid development of various social networks, the spatial autoregressive (SAR) model is becoming an important tool in social network analysis. However, major bottlenecks remain in analyzing large-scale networks (e.g., Facebook has over 700 million active users), including computational scalability, estimation consistency, and proper network sampling. To address these challenges, we propose a novel least squares estimator (LSE) for analyzing large sparse networks based on the SAR model. Computationally, the LSE is linear in the network size, making it scalable to the analysis of huge networks. In theory, the LSE is √n-consistent and asymptotically normal under certain regularity conditions. A new LSE-based network sampling technique is further developed, which can automatically adjust for autocorrelation between sampled and unsampled units and hence guarantee valid statistical inference. Moreover, we generalize the LSE approach from the classical SAR model to more complex networks associated with multiple sources of social interaction effects. Numerical results for simulated and real data are presented to illustrate the performance of the LSE.

* Danyang Huang's research is supported by the National Natural Science Foundation of China (Grant No. 11701560), the Beijing Municipal Social Science Foundation (Grant No. 17GLC051), and the Center for Applied Statistics, School of Statistics, Renmin University of China. Wei Lan's research is partially supported by the National Natural Science Foundation of China (No. 11401482, No. 71532001) and the Center of Statistical Research, Southwestern University of Finance and Economics. Hansheng Wang's research is partially supported by the National Natural Science Foundation of China (No. 71532001, No. 11525101, No. 71332006) and in part by China's National Key Research Special Program (No. 2016YFC0207700). Zhang's research was supported in part by NSF DMS-1309507, DMS-1418172, and NSFC 11571009. We deeply appreciate Professor James P. LeSage for generously sharing with us the well-written MLE code; see http://www.spatial-statistics.com/software_index.htm.

Introduction
We consider a network with n nodes. An adjacency matrix A = (a_ij) ∈ R^{n×n} can be defined to describe the network structure. Let Y = (Y_1, · · · , Y_n)^⊤ ∈ R^n be the continuous responses collected from the n nodes. In social network analysis, the spatial autoregressive (SAR) model has been popularly employed for modeling the social interaction within a network (Bronnenberg and Mahajan, 2001; Anselin, 2013):

Y = ρWY + E, (1.1)

where W = (w_ij) ∈ R^{n×n} with w_ij = a_ij / Σ_{k=1}^n a_ik is the row-normalized adjacency matrix, and ρ is the autocorrelation parameter representing the social interaction effect. The random noises are collected in the vector E = (ε_1, · · · , ε_n)^⊤ ∈ R^n, which is assumed to have mean 0_n = (0, · · · , 0)^⊤ ∈ R^n and covariance matrix σ²I_n ∈ R^{n×n}, where I_n is the identity matrix of dimension n. Due to the rapid development of online social network websites (e.g., Facebook, Twitter, Sina Weibo, WeChat), the usefulness of the SAR model has been increasingly recognized in recent years (Sampson et al., 1999; Leenders, 2002; Fujimoto et al., 2011; Robins, 2013).

Remark 1. The estimation of the social interaction coefficient ρ is important in large networks. For example, in marketing research, a node i could be a user on a social network platform, and Y_i could represent that user's attitude toward a brand, which may be influenced by his or her friends on the platform. Statistically, studying this influence amounts to estimating the social interaction parameter ρ: a positive and significant estimate indicates potential profit for marketing strategies on the platform. See Chen et al. (2013) for more detailed discussions.
We assume |ρ| < 1 throughout the article. According to the proof in Banerjee et al. (2004), I_n − ρW is invertible in this case. Thus we have Y = (I_n − ρW)^{-1}E. This implies that E(Y) = 0_n and cov(Y) = Σ = σ²(I_n − ρW)^{-1}(I_n − ρW^⊤)^{-1}. If E is further assumed to follow a normal distribution, one can obtain the maximum likelihood estimator (MLE) of ρ and σ² (Barry and Pace, 1999). The estimator's asymptotic distribution has been studied by Lee (2004) and Hillier and Martellosio (2014), and some higher order asymptotic results have been investigated recently by Robinson and Rossi (2014).
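To make the data-generating mechanism concrete, the reduced form Y = (I_n − ρW)^{-1}E can be simulated directly. The sketch below is our own illustration (the helper name `simulate_sar` is not from the paper); it uses a dense linear solve, which is fine only for toy network sizes.

```python
import numpy as np

def simulate_sar(A, rho, sigma2=1.0, rng=None):
    """Draw Y from model (1.1): Y = rho*W*Y + E, i.e. Y = (I - rho*W)^{-1} E,
    where W row-normalizes the adjacency matrix A.  Illustrative helper only;
    a dense solve costs O(n^3) and sparse solvers are needed for large n."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    d = A.sum(axis=1)
    d[d == 0] = 1.0                      # guard isolated nodes (zero out-degree)
    W = A / d[:, None]                   # w_ij = a_ij / row sum
    E = rng.normal(0.0, np.sqrt(sigma2), size=n)
    return np.linalg.solve(np.eye(n) - rho * W, E)

# toy 3-cycle network: node i follows node i+1 (mod 3)
A = np.array([[0., 1., 0.], [0., 0., 1.], [1., 0., 0.]])
Y = simulate_sar(A, rho=0.2, rng=0)
```

With ρ = 0 the model collapses to Y = E, which gives a quick sanity check on the implementation.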
Despite many recent advances and successes in SAR model applications, existing estimation methods experience bottlenecks when analyzing large networks. The size of many popular social websites can be enormous; for example, the Sina Weibo (www.weibo.com) network analyzed in this paper has n = 557,818 nodes, so the traditional methods are computationally infeasible. Network sampling is a natural remedy, but because sampled data may omit relationships between sampled and unsampled nodes, it may lead to biased estimation, and the resulting statistical inference could be misleading (Frank, 1979; Costenbader and Valente, 2003; Handcock and Gile, 2010; Shalizi and Rinaldo, 2013). Chen et al. (2013) pointed out that a sampled network may lead to seriously biased estimation for the SAR model when the MLE is applied. Zhou et al. (2017) proposed the paired maximum likelihood estimator; however, that method is based on a Taylor expansion, which works well only when ρ is sufficiently small (see equation (2.5) and the related discussion in that paper for more details). Thus better techniques for network sampling are needed to ensure consistent estimation of the social interaction effect.
Motivated by these challenges, we propose a novel, fast, and scalable estimation method for the SAR model. The new method is particularly designed for large social networks with tens of thousands (or millions) of nodes. One main advantage of the new method is that, under appropriate sparsity assumptions, its computational complexity is linear in the network size. Our basic idea is as follows. For any node i, define Y_(−i) = (Y_j : 1 ≤ j ≤ n, j ≠ i)^⊤. Under the normality assumption, we can show that E{Y_i | Y_(−i)} = Y_(−i)^⊤ γ*_i(ρ), where the coefficient vector γ*_i(ρ) ∈ R^(n−1) depends on both i and ρ. Based on this, we propose to construct a least squares type objective function Q(ρ) = Σ_{i=1}^n {Y_i − Y_(−i)^⊤ γ*_i(ρ)}² and obtain the least squares estimator (LSE) of ρ by minimizing Q(ρ). We can show that, under appropriate assumptions, the computational complexity of minimizing Q(ρ) is O(n), i.e., linear in the network size.
In practice, the social interaction effect on an individual may come from multiple sources (Leenders, 2002). First, the existence of an edge is the result of a combination of different properties of the network (Krivitsky et al., 2009), such as homophily (nodes with similar characteristics are more likely to relate) or degree heterogeneity (super stars receive more edges than others). Therefore, different relationships between nodes, such as friends or fans, may have disparate social interaction effects. Second, researchers often collect data from multiple social network platforms, such as instant messaging, mobile phone communication, and online social networks. Networks obtained from different platforms may have separate impacts on the response. This leads to multiple weighting matrices. In this paper, we also generalize the LSE approach to estimate the SAR model with multiple weighting matrices, which allows us to capture different types of social interaction from multiple sources. The resulting estimator is referred to as the mLSE.
The rest of the article is organized as follows. Section 2 introduces the proposed LSE approach and its generalization, and establishes the asymptotic properties of both estimators. Simulation studies and a real data example are given in Section 3. Section 4 concludes the article with discussions. All theoretical proofs are relegated to the Appendix.

Motivation
Consider a network with n nodes. Let Y_i be the response for node i, 1 ≤ i ≤ n, and Y = (Y_1, Y_2, · · · , Y_n)^⊤ ∈ R^n. Let A = (a_ij) ∈ R^{n×n} be the network adjacency matrix, where a_ij = 1 if node i follows node j and a_ij = 0 otherwise. By convention, we set a_ii = 0 for i = 1, · · · , n. To assess the social interaction effect of the network structure, we assume the model in (1.1).
To ensure that (I_n − ρW) is invertible, all eigenvalues of ρW must differ from 1. Banerjee et al. (2004) showed that the largest eigenvalue of W is 1. These two facts imply that |ρ| < 1 is a sufficient condition for the invertibility of (I_n − ρW) for a general W. In fact, it is also necessary: otherwise, one can always find an appropriately defined matrix W so that (I_n − ρW) is not invertible; see Banerjee et al. (2004) for more detailed discussions. Consequently, we assume |ρ| < 1.
The autocorrelation parameter ρ measures the effect of social interaction. Omitting constants, the quasi log-likelihood function can be written as

ℓ(ρ, σ²) = −(n/2) log σ² + log |I_n − ρW| − (2σ²)^{-1} Y^⊤(I_n − ρW)^⊤(I_n − ρW)Y. (2.1)

The MLE is obtained by maximizing (2.1). However, this classical MLE approach becomes computationally expensive when the network size n is large, mainly due to the high cost of computing |I_n − ρW| (Trefethen and Bau, 1997; Barry and Pace, 1999; Smirnov and Anselin, 2001).
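To see where the cost arises, a direct implementation of one standard form of the SAR quasi log-likelihood (constants omitted) makes the log-determinant term explicit. This is our own sketch (`sar_quasi_loglik` is a hypothetical helper name); the dense factorization behind the determinant costs O(n³) time, which is exactly the bottleneck described above.

```python
import numpy as np

def sar_quasi_loglik(rho, sigma2, W, Y):
    """Quasi log-likelihood of model (1.1), constants omitted.  The
    log|I_n - rho*W| term is the computational bottleneck: a dense
    factorization costs O(n^3), infeasible for very large n."""
    n = len(Y)
    M = np.eye(n) - rho * W
    _, logdet = np.linalg.slogdet(M)     # log|I_n - rho W|
    r = M @ Y                            # (I_n - rho W) Y
    return -0.5 * n * np.log(sigma2) + logdet - 0.5 * (r @ r) / sigma2
```

At ρ = 0 the log-determinant vanishes and the function reduces to the ordinary Gaussian quasi-likelihood, a convenient check.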

Least squares estimation
We propose to estimate the SAR model (1.1) based on the following intriguing observation. Recall that Y (−i) = (Y 1 , · · · , Y i−1 , Y i+1 , · · · , Y n ) ∈ R (n−1) for each i. Then, we have the following result, with the proof given in Appendix A.1.

Proposition 1. Assume that E is normally distributed. Then E{Y_i | Y_(−i)} = Y_(−i)^⊤ γ*_i(ρ), where γ*_i(ρ) = −ω_ii^{-1} Ω_{(−i),i}, with Ω = (ω_jk) = (I_n − ρW)^⊤(I_n − ρW) and Ω_{(−i),i} denoting the ith column of Ω with its ith entry removed.

Correspondingly, we propose to construct the following least squares type objective function,

Q(ρ) = Σ_{i=1}^n {Y_i − Y_(−i)^⊤ γ*_i(ρ)}². (2.2)

As a result, we have

Q(ρ) = Σ_{i=1}^n {Y_i − Σ_{j≠i} γ*_ij(ρ) Y_j}², with γ*_ij(ρ) = {ρ(w_ij + w_ji) − ρ² W_i^⊤W_j} / (1 + ρ² W_i^⊤W_i), (2.3)

where W_i = (w_1i, · · · , w_ni)^⊤ ∈ R^n represents the ith column of W. The detailed verification of (2.3) is given in Appendix A.2. This leads to the proposed LSE ρ̂ = argmin_ρ Q(ρ).
Next, we point out that minimizing Q(ρ) in (2.2) is computationally feasible even when n is large. This is based on the fact that, even though γ*_i(ρ) ∈ R^(n−1) is high dimensional, most of its entries are exactly 0. Specifically, by (2.3), the kth entry of γ*_i(ρ) can be nonzero only if w_ik + w_ki > 0 or W_i^⊤W_k ≠ 0, which is equivalent to either a_ik + a_ki > 0 or Σ_j a_ji a_jk > 0. For each node i, there are two types of nodes (denoted by k) satisfying these conditions: its "direct" friends (those k satisfying a_ik + a_ki > 0) and one particular type of "indirect" friends (those k satisfying Σ_j a_ji a_jk > 0).
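A compact way to organize the computation is through Ω(ρ) = (I_n − ρW)^⊤(I_n − ρW): under the Gaussian conditional-mean form, the ith residual Y_i − Y_(−i)^⊤γ*_i(ρ) equals (Ω(ρ)Y)_i / Ω_ii(ρ), so Q(ρ) needs only sparse matrix-vector products. The sketch below is our own (helper name `lse_objective` is not from the paper) and assumes W is stored as a SciPy sparse matrix; the one-dimensional minimization stands in for the Newton-Raphson step.

```python
import numpy as np
from scipy import sparse
from scipy.optimize import minimize_scalar
from scipy.sparse.linalg import spsolve

def lse_objective(rho, W, Y):
    """Q(rho) = sum_i {Y_i - Y_(-i)' gamma_i*(rho)}^2, computed through
    Omega = (I - rho W)'(I - rho W); the ith residual is (Omega Y)_i / Omega_ii.
    For bounded-degree networks the sparse products cost O(n)."""
    n = W.shape[0]
    M = sparse.identity(n, format="csr") - rho * W
    Omega = (M.T @ M).tocsr()
    resid = (Omega @ Y) / Omega.diagonal()
    return float(resid @ resid)

# toy example: a directed ring, which is already row-stochastic
n = 300
rows = np.arange(n)
W = sparse.csr_matrix((np.ones(n), (rows, (rows + 1) % n)), shape=(n, n))
rng = np.random.default_rng(0)
Y = spsolve((sparse.identity(n) - 0.2 * W).tocsc(), rng.normal(size=n))
rho_hat = minimize_scalar(lse_objective, bounds=(-0.99, 0.99),
                          args=(W, Y), method="bounded").x
```

On this toy network the minimizer lands close to the true ρ = 0.2, up to Monte Carlo noise.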
In an ideal situation, if the number of friends connected with each node is finite, then one can verify that the computational complexity of ρ̂ is linear in n. The following proposition presents a formal result.
Proposition 2. Further make the following assumptions: (1) the objective function (2.3) is optimized by the Newton-Raphson algorithm, which converges in a finite number of steps; (2) there exists a finite constant d_max such that max_i d_i = max_i Σ_j a_ji ≤ d_max as n → ∞. Then the computational complexity demanded for optimizing (2.3) is O(n).

The proof of Proposition 2 is given in Appendix A.3. We next explain the assumptions in this proposition.

Remark 2.
Assumption (2) in Proposition 2 is a constraint on network sparsity: the degree d_i of each node is bounded by a finite constant, so the density of the network is O(n^{-1}), which goes to 0 as n goes to infinity. In other words, the network structure is very sparse. Meanwhile, if d_max diverges to infinity slowly (e.g., at rate log n), then the computational complexity becomes O(n log n), only slightly higher than O(n).

Asymptotic properties
The following conditions are needed to establish asymptotic results for the LSE. Define λ_j(B) to be the jth eigenvalue of an arbitrary matrix B ∈ R^{p×p}, ordered so that |λ_1(B)| ≥ |λ_2(B)| ≥ · · · ≥ |λ_p(B)|.
(C1) (Network Connectivity) The set {1, · · · , n} of all nodes is taken as the state space of a Markov chain, whose transition probability matrix is the weighting matrix W. We assume the Markov chain is irreducible and aperiodic. Additionally, we define π = (π_i) ∈ R^n to be the stationary distribution vector of the Markov chain, which satisfies π_i ≥ 0, Σ_i π_i = 1, and W^⊤π = π.

(C4) (Law of Large Numbers) Assume the limits of the following network features exist: σ²_1 = lim_{n→∞} σ²_1n and σ²_2 = lim_{n→∞} σ²_2n.

We comment on these conditions. First, the first two conditions impose assumptions on the network structure. Condition (C1) assumes certain connectivity of the network: it can be verified that if the network is fully connected after a finite number of steps, then both irreducibility and aperiodicity are satisfied; in particular, if the famous six degrees of separation (Newman et al., 2006) holds, the condition is satisfied. According to Meyn and Tweedie (2012), if condition (C1) holds, then lim_{m→∞} W^m = 1_n π^⊤, where 1_n is an n-dimensional vector with all elements equal to 1. Condition (C2) assumes certain uniformity of the network structure: classical SAR models (Lee, 2004; Yang and Lee, 2017) require the row and column sums of W to be bounded, while (C2) allows the corresponding eigenvalue bound to diverge at rate O(log n). It is remarkable that conditions (C1) and (C2) are not related to assumption (2) in Proposition 2; hence the asymptotic properties of the LSE in the following theorem do not rely on sparsity of the network. Second, condition (C3) is a regularity assumption on the noise terms; for instance, the normal distribution with mean 0 satisfies it. Third, condition (C4) is a law of large numbers type condition. Two special cases help in understanding this condition; in these cases, the existence of the limits can be verified theoretically.
These conditions facilitate the asymptotic analysis based on the central limit theorem. Next, we establish the asymptotic properties of the LSE ρ̂ in the following theorem, which provides theoretical justification for the new estimator.

Theorem 1. Assume that conditions (C1)-(C4) hold. Then we have √n(ρ̂ − ρ) →_d N(0, σ_2^{−4} σ_1²) as n → ∞.
The proof of Theorem 1 is given in Appendix C.1. By Theorem 1, the LSE ρ̂ is asymptotically normal and √n-consistent, which is the same convergence rate as that of classical SAR estimators. The asymptotic variance of the LSE is σ_2^{−4}σ_1²; thus consistent estimators of σ_1² and σ_2² are needed to make valid inferences. See Appendix C.2 for a discussion of computationally feasible estimators of σ_1² and σ_2². If E is assumed normally distributed, the MLE of ρ can be obtained, and it is then of interest to compare the relative estimation efficiency of the MLE and the LSE. It is worth noting that in practice the network effect is typically small (Chen et al., 2013). We are thus motivated to conduct a Taylor expansion in ρ and approximate the asymptotic variances of the MLE and the LSE by their leading terms. For simplicity, we assume σ² = 1. Define σ²_L = σ_2^{−4}σ_1² to be the asymptotic variance of the LSE and σ²_M that of the MLE. Then, under appropriate assumptions for the MLE, the following theorem can be established.
The proof of Theorem 2 is given in Appendix C.3, together with more detailed discussions. By Theorem 2, the MLE and the LSE achieve similar estimation efficiency when ρ is small.

New LSE-based scheme for sampling networks
In this section, we develop a novel sampling scheme tailored to the LSE approach, and further show that the sampled data lead to consistent estimation for the SAR model. For each node i, define a sampling indicator s_i, which equals 1 if node i is sampled as a response node and 0 otherwise. Define S_y = {i : s_i = 1}, the collection of all sampled response nodes. To evaluate γ*_i(ρ) correctly for each sampled node i, we also need to sample its "direct" and appropriately defined "indirect" friends. Without involving all the nodes in the network, the new sampling scheme can be implemented in the following three steps:

• Step 1: Obtain the response set S_y via some convenient sampling method.
• Step 2: Collect all nodes j directly connected with some node i ∈ S_y in either direction ("direct" friends). Denote the two types of nodes by S_m1 = {j : a_ji = 1 and i ∈ S_y} and S_m2 = {j : a_ij = 1 and i ∈ S_y}, respectively.

• Step 3: Obtain the "indirectly" connected nodes by searching for nodes k connected with some node j ∈ S_m1 in the direction a_jk = 1. Denote S_m3 = {k : a_jk = 1 and j ∈ S_m1}.

Putting all together, we get S_x = S_y ∪ S_m1 ∪ S_m2 ∪ S_m3. Using S_x and S_y, one can construct the least squares objective function Q_s(ρ) = Σ_{i∈S_y} {Y_i − Y_(−i)^⊤ γ*_i(ρ)}². Define the resulting sampling-based LSE as ρ̂_s = argmin_ρ Q_s(ρ), which is referred to as the Sample-LSE. Define S = diag(s_1, · · · , s_n) ∈ R^{n×n}, where s_i = 1 if i ∈ S_y and 0 otherwise, and define n_s to be the number of nodes collected in S_y. The derivatives Q̇_s(ρ) and Q̈_s(ρ) are defined analogously to Q̇(ρ) and Q̈(ρ). The following condition is needed to establish the asymptotic property of the Sample-LSE.
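The three sampling steps above can be sketched on a toy adjacency structure. The dict-of-sets layout and the helper name `sampling_sets` are our own illustration, not code from the paper.

```python
def sampling_sets(adj, S_y):
    """LSE-based sampling scheme, Steps 2-3: given the response set S_y,
    collect the 'direct' friends S_m1, S_m2 and the 'indirect' friends S_m3
    needed to evaluate gamma_i*(rho).  Here adj[i] = {j : a_ij = 1}."""
    S_y = set(S_y)
    S_m1 = {j for j, outs in adj.items() if outs & S_y}      # a_ji = 1, i in S_y
    S_m2 = set().union(*(adj.get(i, set()) for i in S_y))    # a_ij = 1, i in S_y
    S_m3 = set().union(*(adj.get(j, set()) for j in S_m1))   # a_jk = 1, j in S_m1
    return S_y | S_m1 | S_m2 | S_m3                          # S_x

# toy network: a_12 = a_23 = a_31 = 1, node 4 isolated
adj = {1: {2}, 2: {3}, 3: {1}, 4: set()}
S_x = sampling_sets(adj, S_y={1})
```

For S_y = {1}, the scheme collects node 2 (followee), node 3 (follower), and node 3's followee (node 1 again), so S_x = {1, 2, 3}; the isolated node 4 is never touched.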

Theorem 3. Assume that conditions (C1)-(C3) and (C5) hold. Then we have √n_s(ρ̂_s − ρ) →_d N(0, σ_2s^{−4} σ_1s²).
The proof of Theorem 3 is similar to that of Theorem 1 and is hence omitted. By Theorem 3, the Sample-LSE ρ̂_s is √n_s-consistent and asymptotically normal. The discussion of computationally feasible estimators for σ²_1s and σ²_2s parallels that of the previous section and is thus omitted. It is worth mentioning that although S is an n×n diagonal matrix, it has only n_s nonzero diagonal elements; therefore, only the sampled nodes in S_x and S_y are involved in the estimation. The performance of both simple random sampling and snowball sampling is demonstrated in the simulation studies.

mLSE: generalization to multiple weighting matrices
Now we generalize the LSE approach to models with multiple weighting matrices, which allows us to consider separate social interaction effects from multiple sources. Multiple weighting matrices are sometimes used in spatial statistics; see for example LeSage and Pace (2009). Consider the SAR model with L weighting matrices. For simplicity, we assume L = 2 in this paper; it is straightforward to extend the proposed estimation procedure and theoretical results to L > 2. The model is defined as

Y = ρ_1 W_1 Y + ρ_2 W_2 Y + E, (2.4)

where W_l = (w_ij^(l)) ∈ R^{n×n} with w_ij^(l) = a_ij^(l) / Σ_k a_ik^(l) is the row-normalized weighting matrix of the adjacency matrix A_l = (a_ij^(l)) for l ∈ {1, 2}, and E ∈ R^n has mean 0_n and covariance matrix σ²I_n ∈ R^{n×n}. To ensure the invertibility of I_n − ρ_1W_1 − ρ_2W_2, one can verify that |ρ_1| + |ρ_2| < 1 is a sufficient condition; thus we assume |ρ_1| + |ρ_2| < 1 throughout the rest of this article. Then we have Y = (I_n − ρ_1W_1 − ρ_2W_2)^{-1}E. Here w_ij^(l) represents the weight derived from the adjacency matrix A_l, and the autocorrelation parameters ρ_l represent the influence effects of the different types of social interaction.
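The reduced form Y = (I_n − ρ_1W_1 − ρ_2W_2)^{-1}E can be simulated in direct analogy with the single-matrix case. The helper name `simulate_sar2` is our own illustration, not code from the paper; a dense solve is used for toy sizes only.

```python
import numpy as np

def simulate_sar2(W1, W2, rho1, rho2, sigma2=1.0, rng=None):
    """Draw Y from model (2.4): Y = rho1*W1*Y + rho2*W2*Y + E.
    |rho1| + |rho2| < 1 is a sufficient condition for invertibility."""
    assert abs(rho1) + abs(rho2) < 1
    n = W1.shape[0]
    rng = np.random.default_rng(rng)
    E = rng.normal(0.0, np.sqrt(sigma2), size=n)
    return np.linalg.solve(np.eye(n) - rho1 * W1 - rho2 * W2, E)

# toy example: two 3-node relations (e.g. 'friends' vs 'fans'), both row-stochastic
W1 = np.array([[0., 1., 0.], [0., 0., 1.], [1., 0., 0.]])
W2 = W1.T.copy()
Y = simulate_sar2(W1, W2, rho1=0.1, rho2=0.2, rng=0)
```

With ρ_1 = ρ_2 = 0 the model again collapses to Y = E, which provides a sanity check.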
Remark 3. For the model defined in (1.1) (Anselin, 2013), parameter estimation becomes increasingly inaccurate as ρ approaches 1. Similarly, for the model defined in (2.4), estimation becomes inaccurate when |ρ_1| + |ρ_2| ≈ 1. In practice, however, the autocorrelation is typically small (Chen et al., 2013); we also demonstrate this fact through the Weibo example in the real data analysis.
To employ the LSE approach for (2.4), let θ = (ρ_1, ρ_2)^⊤ and define the coefficient vector γ*_i(θ) so that E{Y_i | Y_(−i)} = Y_(−i)^⊤ γ*_i(θ). The result is given in the following proposition.
The proof is similar to that of Proposition 1 and is given in Appendix A.1. Next, we propose the objective function

Q(θ) = Σ_{i=1}^n {Y_i − Y_(−i)^⊤ γ*_i(θ)}², (2.5)

and define θ̂ = argmin_θ Q(θ). We refer to the LSE for multiple weighting matrices as the mLSE. Computationally, it is worth pointing out that minimizing Q(θ) in (2.5) remains feasible for very large n: a necessary condition for the kth entry of γ*_i(θ) to be nonzero is that node k is a "direct" or "indirect" friend of node i in A_1 or A_2. Typically, the matrices A_l are extremely sparse, which implies that the total number of nodes involved in γ*_i(θ) is finite for any i. This dramatically reduces the computational cost of θ̂.
Before establishing the asymptotic properties of θ̂, the following conditions are needed.
These conditions are parallel to (C1), (C2), and (C4): conditions (C6)-(C7) are assumptions on the network structure, and condition (C8) is a law of large numbers type condition. We can then establish the asymptotic property of θ̂.

Theorem 4. Assume that conditions (C3) and (C6)-(C8) hold. Then the mLSE θ̂ is √n-consistent and asymptotically normal.
The proof of Theorem 4 is similar to that of Theorem 1 and is thus omitted. By Theorem 4, the mLSE θ̂ is asymptotically normally distributed and √n-consistent. Computationally feasible estimators for Π_1 and Π_2 can be obtained similarly to those for σ²_2 and σ²_1. Similar to the previous result, when the number of nodes directly connected with any arbitrary node is finite in each W_l, only a finite number of nonzero coefficients are involved in γ*_i(θ).

Numerical studies
To evaluate the finite sample performance of the proposed LSE methods, we carry out a number of simulation studies. Specifically, we report the performance of the LSE and compare it with the classical MLE, in terms of both estimation efficiency and computational complexity. We also demonstrate the performance of the Sample-LSE and the mLSE. Finally, the LSE methods are applied to the Sina Weibo network.

Performance of the LSE
In the following, we evaluate the finite sample performance of the proposed methodology for three typically encountered network models, plus a special case of the stochastic block model. Given n, we first generate the adjacency matrix A = (a_ij) and then set a_ii = 0 for each 1 ≤ i ≤ n. Note that A is not necessarily symmetric. Subsequently, W is computed by normalizing each row of A.

Example 1. (Dyad Independence Model) Following Holland and Leinhardt (1981), define a dyad as D_ij = (a_ij, a_ji) for any 1 ≤ i < j ≤ n, and assume the D_ij are mutually independent. To allow for sparsity of the network, we define P{D_ij = (1,1)} = 0.5n^{-1} and P{D_ij = (1,0)} = P{D_ij = (0,1)} = 5n^{-1}. Then P{D_ij = (0,0)} = 1 − 10.5n^{-1}, which is very close to 1 for large n.

Example 2. (Stochastic Block Model) Consider the network structure generated from the stochastic block model (Wang and Wong, 1987; Nowicki and Snijders, 2001). Let K = 20 be the total number of blocks. First, following Nowicki and Snijders (2001), we randomly assign each node a block label (k = 1, · · · , K) with equal probability 1/K. Next, let P(a_ij = 1) = 20n^{-1} if i and j are in the same block, and P(a_ij = 1) = 2n^{-1} otherwise. In this way, nodes within the same block are more likely to be connected with each other than nodes from different blocks.

Example 3. (Power-Law Distribution) According to Barabási and Albert (1999), the majority of nodes have few edges while a small number of nodes have a huge number of edges. Following Clauset et al. (2009), we simulate A as follows. First, for node i, generate its in-degree d_i = Σ_j a_ji according to the discrete power-law distribution P(d_i = k) = ck^{-α}, where c is a normalizing constant and the exponent is set to α = 2. Next, for each i, randomly select d_i nodes as its potential followers.

Example 4. (Densely Connected Communities) Although social networks are often sparse, they are expected to contain densely connected communities. To further examine the robustness of the LSE method to the sparsity assumption, we consider a fully connected block defined in Case 2 of Section 2.3 with k = 10, which means d_i = 10 for 1 ≤ i ≤ n. To simulate A, for each node i, we randomly select d_i nodes as its potential followers.
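Example 1's dyad mechanism can be sketched as follows. The generator is our own illustration with the probabilities stated in the text (an O(n²) loop kept for clarity; large n would call for a vectorized draw over dyads).

```python
import numpy as np

def dyad_independence(n, rng=None):
    """Example 1: independent dyads D_ij = (a_ij, a_ji), 1 <= i < j <= n, with
    P{(1,1)} = 0.5/n and P{(1,0)} = P{(0,1)} = 5/n."""
    rng = np.random.default_rng(rng)
    A = np.zeros((n, n), dtype=int)
    p11, p10 = 0.5 / n, 5.0 / n
    for i in range(n):
        for j in range(i + 1, n):
            u = rng.random()
            if u < p11:                      # mutual edge
                A[i, j] = A[j, i] = 1
            elif u < p11 + p10:              # i -> j only
                A[i, j] = 1
            elif u < p11 + 2 * p10:          # j -> i only
                A[j, i] = 1
    return A

A = dyad_independence(200, rng=0)
```

Each ordered pair carries edge probability 5.5/n, so the expected number of edges grows linearly in n, matching the sparsity target of the example.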
In each experiment, ρ is set to 0 or 0.2 and σ² is fixed at 1. We consider various network sizes (n = 2,000, 5,000, 10,000, 20,000). For reliable evaluation, each experiment is randomly replicated M = 1,000 times. For the mth replication, write ρ̂^(m) for the estimate of ρ and SE*^(m) for its estimated standard error. The bias is then evaluated as ρ̄ − ρ, where ρ̄ = M^{-1} Σ_m ρ̂^(m). We compare the average estimated standard error SE* = M^{-1} Σ_m SE*^(m) with the Monte Carlo standard deviation of ρ̂, SE = {(M − 1)^{-1} Σ_m (ρ̂^(m) − ρ̄)²}^{1/2}. The null hypothesis H_0 : ρ = 0 is rejected if |Z^(m)| > z_{α/2}, where Z^(m) = ρ̂^(m)/SE*^(m); α = 0.05 is used throughout the rest of this article. The empirical rejection probability (ERP) is computed as M^{-1} Σ_m I(|Z^(m)| > z_{α/2}). According to whether the true ρ is 0 or not, the ERP corresponds to either the empirical size or the power.
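The replication summaries (bias, SE, SE*, ERP) can be computed as below. This is our own helper sketch (`summarize` is a hypothetical name); the two-sided critical value z_{α/2} comes from SciPy's normal quantile.

```python
import numpy as np
from scipy.stats import norm

def summarize(est, se_est, rho_true, alpha=0.05):
    """Monte Carlo summary: bias, empirical SD (SE), average estimated
    standard error (SE*), and empirical rejection probability (ERP) of
    H_0: rho = 0 based on Z^(m) = rho_hat^(m) / SE*^(m)."""
    est, se_est = np.asarray(est), np.asarray(se_est)
    z = norm.ppf(1 - alpha / 2)                  # z_{alpha/2}, about 1.96
    bias = est.mean() - rho_true
    SE = est.std(ddof=1)                         # Monte Carlo standard deviation
    SE_star = se_est.mean()                      # mean estimated standard error
    ERP = float(np.mean(np.abs(est / se_est) > z))
    return bias, SE, SE_star, ERP

bias, SE, SE_star, ERP = summarize([0.1, 0.3], [0.01, 0.01], rho_true=0.2)
```

When the true ρ is 0 the ERP estimates the empirical size; otherwise it estimates the power, as in the tables that follow.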
We summarize the numerical results of the LSE in Table 1. For all the examples, the ERP results for nonzero ρ (i.e., ρ = 0.2) are always larger than 95% when n is at least 2,000, suggesting that the proposed Z-test is consistent in power. On the other hand, the ERP results for zero ρ are close to the nominal level of 5%, indicating that the Z-test controls the Type I error well.

Comparison of the LSE vs. the MLE
We compare the LSE and the MLE in terms of estimation efficiency and computational efficiency. In this study, we fix ρ = 0.2 and σ² = 1. To implement the MLE, two algorithms are considered. The first is the traditional Newton-Raphson algorithm, which is expected to be accurate but slow. For fast computation, we consider the toolbox at http://www.spatial-statistics.com/software_index.htm: the function far is used, and the log-determinant algorithm of Barry and Pace (1999) is implemented. This algorithm provides an approximate solution to the MLE of (1.1) and is expected to be faster than the Newton-Raphson algorithm. Following Lee and Liu (2010), we generate W as follows. Let W_A be the weighting matrix for the study of crimes across 49 districts in Columbus, Ohio (Anselin, 2013); this is a contiguity matrix constructed from the latitude and longitude coordinates of the districts. Fix the sample size to be n = 49m, where m = 5, 10, 50, or 100, so the network size varies from 245 to 4,900. Accordingly, the spatial weighting matrix W is generated as I_m ⊗ W_A, where ⊗ denotes the Kronecker product. The response variable Y is generated according to (1.1). We compare three different estimators of ρ: the MLE computed by the Newton-Raphson algorithm (MLE-NR), the MLE computed by the log-determinant algorithm of Barry and Pace (1999) with the Matlab function far (MLE-BP), and the new LSE. For each parameter setup, the experiment is randomly replicated M = 500 times. To compare the methods' computational efficiency, their average CPU times (Time) for each experiment are also reported. The detailed results are summarized in Table 2, from which we make the following observations. First, the biases of all the estimation approaches are close to 0. Second, all three estimators are fairly comparable in terms of SE.
However, the SE * (i.e., the SE estimate) results of the MLE-BP are seriously biased. This makes the corresponding statistical inference problematic. Lastly, in terms of computational efficiency, the MLE-NR is the worst; the MLE-BP is substantially better in terms of the average computation time (i.e., Time); and the LSE is the best. This is particularly true for a large sample size (i.e., n = 4, 900). See Figure 1 for further numerical evidence. Based on the above, the LSE method is the only method (among the three in Table 2) which can deliver highly accurate estimation results and statistically sound inference, yet with a minimal computational cost.

Performance of the sample-LSE
Here we demonstrate finite sample performance of the Sample-LSE method. The data are simulated in the same manner as in Section 3.1 based on Examples 1-3. We fix the whole network size to be n = 20, 000 and the autocorrelation parameter ρ ∈ {0, 0.2}. Various sizes of the network sample are considered: n s = 2,000, 5,000, and 10,000. The experiments are replicated in a similar manner as before with M =1,000.
Two different sampling mechanisms are considered: (a) simple random sampling (SRS); (b) snowball sampling (SNOW). The SNOW scheme proceeds as follows: (1) We start with s_s = 10 selected nodes, forming the set S_y0 with |S_y0| = s_s = 10. (2) In the kth step (1 ≤ k ≤ K), all n_k connected friends of nodes in S_{y,k−1} are selected and added to the sample, which yields S_{y,k}. (3) If, in the kth step, there is no more connected node that could be selected, then a random node i ∈ S \ S_{y,k−1} is added to the sample set, which yields S_{y,k}. (4) Repeat (2) and (3) until at least n_s nodes are selected at the Kth step, i.e., |S_{y,K}| ≥ n_s. (5) Randomly retain n*_K of the n_K nodes sampled at the Kth step, so that |S_{y,K−1}| + n*_K = n_s; thus |S_y| = |S_{y,K}| = n_s.
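The SNOW steps above can be sketched as follows. The dict-of-sets layout and the name `snowball_sample` are our own; one simplification is flagged in the comments (the paper trims the last wave at random, while the sketch keeps a sorted prefix for determinism).

```python
import numpy as np

def snowball_sample(adj, n_total, n_s, seeds, rng=None):
    """Snowball sampling (SNOW) sketch: expand wave by wave from the seed set;
    if a wave adds no new node, add one random outside node (step 3); stop
    once n_s nodes are reached, trimming the last wave (step 5; the paper
    trims at random, we keep a deterministic sorted prefix).
    Here adj[i] = set of out-neighbors of node i; nodes are 0..n_total-1."""
    rng = np.random.default_rng(rng)
    in_sample = set(seeds)
    while len(in_sample) < n_s:
        wave = sorted({j for i in in_sample for j in adj.get(i, ())
                       if j not in in_sample})
        if not wave:                                   # step (3): dead end
            outside = [v for v in range(n_total) if v not in in_sample]
            wave = [int(rng.choice(outside))]
        in_sample.update(wave[: n_s - len(in_sample)])  # step (5): trim
    return in_sample

# usage: snowball-sample 5 nodes from a 10-node directed ring starting at node 0
ring = {i: {(i + 1) % 10} for i in range(10)}
sampled = snowball_sample(ring, n_total=10, n_s=5, seeds={0})
```

On the ring, the waves are {1}, {2}, {3}, {4}, so the final sample is {0, 1, 2, 3, 4}.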
The detailed results are presented in Tables 3 and 4. We observe that: (1) the Sample-LSE is consistent, with negligible bias and decreasing SE as n_s increases; (2) the SE estimator developed for the Sample-LSE in Section 2.4 works quite well, because the difference between the SE and SE* values reported in Tables 3 and 4 is very small.

Comparison of the sample-LSE vs. the MLE
In this subsection, we compare three methods based on the sampled data: the MLE-NR, the MLE-BP, and the Sample-LSE. Data are simulated in the same manner as in Section 3.2. For illustration, we fix n = 9,800, ρ = 0.2, and n_s ∈ {1000, 2000, 5000}. To implement the two MLE methods (MLE-NR and MLE-BP), we treat the sampled network structure as if it were the whole network. This leads to another row-normalized W matrix based on the sampled data only (Chen et al., 2013); isolated nodes are then eliminated from the data accordingly. We replicate the experiment M = 500 times and report the summary in Table 5. One immediate observation is that both the MLE-NR and the MLE-BP are inconsistent on the sampled data, because their estimation biases are clearly above 0. Such a finding is not surprising and is essentially consistent with that of Chen et al. (2013). In contrast, the Sample-LSE remains consistent and statistically valid.

Performance of the mLSE
In this subsection, we demonstrate the finite sample performance of the mLSE. To allow for different features of the adjacency matrices, we generate W_1 and W_2 in the same manner as in Examples 3 and 2, respectively. The parameters are set as ρ_1 ∈ {0, 0.1} and ρ_2 = 0.2. Various network sizes are considered: n = 2,000, 5,000, 10,000, and 20,000. The experiments are replicated in the same way as before with M = 1,000.
Numerical results of the mLSE are summarized in Table 6. For all the three examples, the ERP results of nonzero ρ l s (l ∈ {1, 2}) are always larger than 95%. This means the proposed Z-test is consistent in power. By contrast, the ERP results of zero ρ 1 are always close to the nominal level 5%, which indicates that the proposed Z-test controls Type I error very well. This is because the difference between SE and SE * is very small, suggesting that the true standard error can be well approximated by the estimators derived in Section 2.5, in accordance with Theorem 4.

Sina Weibo network analysis
We apply the LSE methods to a real social network collected from Sina Weibo (www.weibo.com), the largest Twitter-type social media platform in China. The goal of this study is to understand how Sina Weibo users interact with each other in terms of their posting activity. To collect the data, we start with the Sina Weibo accounts of four major online travel agencies in China. We randomly select 5,000 nodes from each travel agency's followers and collect the followers' friends. To better mimic a sparse network, only active users with out-degree (d_i = Σ_j a_ij) no more than 20 are kept. The final network has n = 557,818 nodes, whose follower-followee relationships are recorded by the adjacency matrix A. In total, the network includes Σ_ij a_ij = 1,496,399 edges and Σ_{i<j} a_ij a_ji = 535,408 mutually connected pairs. The density of the network is 4.8 × 10^{-6}, which is extremely sparse.
For each node, the response is defined as the total number of its posted messages on the log scale. The responses are standardized to mean 0 and variance 1. For such a large network, both the MLE-NR and the MLE-BP are computationally too expensive to implement. The LSE, however, can be computed on a personal computer within 58 seconds. It gives the estimate ρ̂ = 0.125 with SE = 1.4 × 10^{-3}, implying that the social interaction effect is statistically significant at the 5% level. This suggests that, on average, a Weibo user's posting activity positively correlates with that of his/her connected friends.
To further demonstrate the usefulness of the Sample-LSE method, we conduct the following simulation study. Specifically, we treat the above sampled nodes and edges as if they were the whole social network. We then treat the "whole network" LSE estimate ρ̂ = 0.125 as if it were the true parameter. This allows us to conduct the simulation experiments in a manner similar to Sections 3.3 and 3.4. The results are presented in Table 7. By Table 7, n_s = 2,000 is large enough to detect a positive and statistically significant ρ. The corresponding ERP (i.e., power) is as large as 99.9%. However, n_s = 2,000 accounts for only about n_s/n = 0.36% of the entire network size. Consequently, the saving in both sampling and computational costs is significant. Last, we demonstrate the usefulness of the proposed mLSE method. According to Holland and Leinhardt (1981) and Huang et al. (2016), for different nodes i and j (i ≠ j), the mutual relationship (a_ij = a_ji = 1) and the asymmetric relationship (a_ij + a_ji = 1) are different types of relationship, i.e., friends and fans. To test which type of relationship has a stronger impact on one's posting activity in the Weibo network, we construct A_1 and A_2 based on the above-mentioned n = 557,818 nodes. Define A_1 = (a_ij) ∈ R^{n×n} to record all relationships with a_ij = a_ji = 1, and A_2 = (a_ij) ∈ R^{n×n} to record those with a_ij + a_ji = 1. The weighting matrices W_1 and W_2 can then be obtained. Correspondingly, ρ_1 measures the social interaction effect from mutually connected friends, and ρ_2 measures that from asymmetrically connected friends. The mLSE for ρ_1 is 0.115 with SE_{ρ_1} = 1.3 × 10^{-3}, and that for ρ_2 is 0.061 with SE_{ρ_2} = 1.2 × 10^{-3}. Both coefficients are statistically significant at the 5% level.
This suggests that, on average, (1) a Weibo user's posting activity positively correlates with that of his/her mutually connected and asymmetrically connected friends; and (2) the mutual relationship has a stronger impact on one's posting activity than the asymmetric relationship.
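The split of a directed adjacency matrix into mutual and asymmetric parts can be done with elementwise sparse operations. The following is a minimal sketch (illustrative code, not from the paper): A_1 = A ∘ A⊤ keeps mutual pairs, while A + A⊤ equals 2 on mutual pairs and 1 on one-way pairs, so subtracting 2A_1 leaves the asymmetric relationships.

```python
import numpy as np
from scipy import sparse

def split_relationships(A):
    """Split a directed 0/1 adjacency matrix into mutual and asymmetric parts.

    A1 keeps pairs with a_ij = a_ji = 1 (friends); A2 keeps pairs with
    a_ij + a_ji = 1 (fans, i.e., one-way follows). Both outputs are
    symmetric matrices.
    """
    A1 = A.multiply(A.T)   # 1 exactly where a_ij = a_ji = 1
    S = A + A.T            # 2 on mutual pairs, 1 on one-way pairs
    A2 = S - 2 * A1        # removes mutual pairs, leaves one-way pairs
    return A1.tocsr(), A2.tocsr()

# Toy network: node 0 <-> node 1 (mutual), node 0 -> node 2 (one-way).
A = sparse.csr_matrix(np.array([[0, 1, 1],
                                [1, 0, 0],
                                [0, 0, 0]]))
A1, A2 = split_relationships(A)
print(A1.toarray())  # mutual part: ones at (0,1) and (1,0)
print(A2.toarray())  # asymmetric part: ones at (0,2) and (2,0)
```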

Conclusion
To conclude this article, we discuss four interesting topics for future study. First, the proposed LSE method requires the adjacency matrix to be sparse. It would be intriguing to study the problem without the network sparsity assumption. Second, the proposed approach is developed for the SAR model with no covariates. A natural question is how to extend the approach to take covariate information into consideration. Third, we have assumed that the adjacency matrix A is pre-determined and the response Y is generated based on the weighting matrix W. In practice, however, the network structure could be influenced by individual features, implying a possible two-way relationship (Robins et al., 2001). How to model this bilateral relationship is another interesting topic for future study. Lastly, we have only empirically verified the performance of simple random sampling and snowball sampling. We find that both methods work fairly well and are comparable. However, the effect of a general sampling scheme remains unclear, and further study along this direction is needed.

A.1. Proof of Proposition 1 and Proposition 3
For both model (1.1) and model (2.4), without loss of generality, define Σ_{i·} and Σ_{·i} to be the i-th row and column of Σ, respectively. Further define Σ_{i(-i)} ∈ R^{1×(n-1)} to be the i-th row of Σ without its i-th element; Σ_{(-i)i} ∈ R^{(n-1)×1} is defined in a similar manner. Let Σ_{(-i)(-i)} ∈ R^{(n-1)×(n-1)} be Σ without its i-th row and i-th column. Further write Σ = (σ_ij).
According to the distribution of Y, the joint distribution of (Y_i, Y_{(-i)}) is multivariate normal. Therefore, by the normality assumption on E in both models, the conditional distribution of Y_i given Y_{(-i)} is also normal, with conditional mean Σ_{i(-i)}Σ_{(-i)(-i)}^{-1}Y_{(-i)}. In accordance with Σ_i, the matrix Ω_i is defined and partitioned in the same manner. To prove Proposition 1, recall that W_{·i} ∈ R^{n×1} and W_{i·} ∈ R^{n×1} are the i-th column and row of W, respectively. For convenience, further define W_{i(-i)} ∈ R^{1×(n-1)} to be the i-th row of W without the i-th element, W_{(-i)i} ∈ R^{(n-1)×1} to be the i-th column of W without the i-th element, and W_{(-i)(-i)} ∈ R^{(n-1)×(n-1)} to be the weighting matrix W without its i-th row and i-th column. Combining the results above completes the calculation for Proposition 1. To prove Proposition 3, the same argument is applied to model (2.4). If we further define γ*(θ) = -Ω_{i,12}(Ω_{i,11}^{-1}) ∈ R^{n-1}, the conclusion of Proposition 3 follows.
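The calculation above relies on the standard conditional-distribution formula for a multivariate normal vector; in the notation of this appendix it can be written as:

```latex
% Conditional distribution of Y_i given Y_{(-i)} when Y \sim N(0, \Sigma):
Y_i \mid Y_{(-i)} \sim N\!\left(
  \Sigma_{i(-i)} \Sigma_{(-i)(-i)}^{-1} Y_{(-i)},\;
  \sigma_{ii} - \Sigma_{i(-i)} \Sigma_{(-i)(-i)}^{-1} \Sigma_{(-i)i}
\right)
```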

A.2. Notations and the verification of (2.3) and (2.5)
Before proving the theorems, we first verify (2.3) and (2.5) and define some useful notation for the first- and second-order derivatives of the objective functions.
Verification of (2.3). First, we consider model (1.1). For notational convenience, for 1 ≤ i ≤ n, let W_{ii}^0 ∈ R^{n×n} have the same elements as W except that its i-th column and i-th row are set to 0. Recall that W_{·i} ∈ R^n is the i-th column of W and W_{i·} ∈ R^n is the i-th row of W. Then, by Proposition 1, the required identity can be verified for 1 ≤ i ≤ n. Therefore, the objective function can also be written in the form of (2.3).

A.3. Proof of Proposition 2
The conclusion of Proposition 2 can be obtained as follows. Under assumption (2) of Proposition 2, Σ_j ω_ji² and Σ_j ω_ji ω_jk are both summations of finitely many terms for any node i (1 ≤ i ≤ n). Thus, the computational complexity of Q(ρ) and its derivatives is linear in the network size n; see Appendix A.2 for the detailed expressions of the derivatives. If the Newton-Raphson iteration converges in K steps, where K is finite, then the complexity of estimation is also O(n). This completes the proof.
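The complexity argument can be made concrete with a small sketch (illustrative only; the function below computes one representative ingredient, not the full objective): quantities of the form Σ_j ω_ji² are column-wise sums over the nonzero entries of a sparse W, so each nonzero is touched a constant number of times and the total cost is O(nnz(W)) = O(n) when every node has a bounded number of connections.

```python
import numpy as np
from scipy import sparse

def column_sq_sums(W):
    """Compute sum_j w_ji^2 for every column i of a sparse matrix W.

    Each nonzero of W is touched a constant number of times, so the
    cost is O(nnz(W)). With a bounded number of nonzeros per column,
    nnz(W) = O(n), matching the linear complexity in Proposition 2.
    (Sketch of one ingredient of Q(rho), not the full objective.)
    """
    return np.asarray(W.multiply(W).sum(axis=0)).ravel()

# Small example: a 4-node row-normalized weighting matrix.
W = sparse.csr_matrix(np.array([[0. , 0.5, 0.5, 0. ],
                                [1. , 0. , 0. , 0. ],
                                [0. , 0. , 0. , 1. ],
                                [0. , 1. , 0. , 0. ]]))
print(column_sq_sums(W))  # [1.   1.25 0.25 1.  ]
```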

Appendix B
To investigate the theoretical properties of the LSE, several useful lemmas are proved before establishing the theorems.
Lemma 1. Suppose B_1 = (b_{1,ij}) ∈ R^{n_1×n_2} and B_2 = (b_{2,ij}) ∈ R^{n_1×n_2} are two arbitrary matrices. Define |B|_e = (|b_ij|) for any arbitrary matrix B. Then, we have the following results.
(1) Assume condition (C1) is satisfied. Then, we can find a sufficiently large integer K such that for k ≥ K, |W^k|_e ≤ c_w 1_n π⊤, where c_w is a positive constant. Define W_0 = Σ_{m=0}^{K} W^m + 1_n π⊤ and W_q = W^q W_0. For a positive integer q, further define Δ_n = (log n)^{2(K+q)} if δ = 1/2 and Δ_n = n^{1/2-δ} if 0 < δ < 1/2, where δ is the positive constant defined in condition (C1). Then, for 0 ≤ q_1, ..., q_4 ≤ 1 and finite positive integers r_1, r_2, r_3 and q, the corresponding bounds hold as n → ∞. (2) Recall that S = I_n - ρW; then |S^{-1}|_e ≤ c_s W_0 and |W^q S^{-1}|_e ≤ c_s W_q. (3) For the matrices defined in Section 2 and Appendix A.2, the conclusions (B.4)-(B.9) hold. Proof. We prove (1)-(3) in Lemma 1 subsequently.
Proof of (1). We prove each of the results in the following parts.
Proof of (2). By condition (C1), there exists an integer K > 0 such that for n > K, |W^n|_e ≤ c_w 1_n π⊤. Thus, |S^{-1}|_e = |Σ_{m=0}^{∞} ρ^m W^m|_e ≤ Σ_{m=0}^{∞} |ρ|^m |W^m|_e. As a result, |S^{-1}|_e ≤ c_s W_0 can be obtained. Additionally, |W^q S^{-1}|_e ≤ c_s W_q can be subsequently verified.
Proof of (3). The proofs of (B.4)-(B.8) are similar, so we only calculate the upper bound for |M_2|_e as an example. We have |M_2|_e ≤ σ{c_{1m} W⊤W(I_n + c_{1s}W) + c_{0m}W + c_{0m}(I_n + c_{1s}W)W}. Next, we prove (B.9). One can verify that the order is O((log n)^6 n^{1/2-δ}).
Let ε = (ε_1, ε_2, ..., ε_n)⊤ ∈ R^n, Q_1 = ε⊤M_1ε, and Q_2 = ε⊤M_2ε, where M_1 and M_2 are n × n matrices. Proof. Write M_1 = (m_{1,ij}) ∈ R^{n×n} and M_2 = (m_{2,ij}) ∈ R^{n×n}. Then we have E(ε⊤M_1ε) = tr(M_1), and the variance terms can be calculated accordingly. Proof. We apply the martingale difference central limit theorem to prove the asymptotic normality of Q. First, we construct a martingale difference array Q_i satisfying Q = Σ_{i=1}^{n} Q_i. Second, we apply the martingale difference theorem to prove n^{-1/2}Q →_d N(0, σ_{1q}²) as n → ∞. Define M = (m_ij) ∈ R^{n×n}, and define F_i to be the σ-field generated by {ε_j : 1 ≤ j ≤ i}, where e_i ∈ R^n denotes the zero vector with only the i-th element being 1. Then we have Q = Σ_i Q_i and E(Q_i | F_{i-1}) = 0. Thus, we only need to prove the two results (B.12) and (B.13), which we do separately below.
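The moment identity E(ε⊤Mε) = tr(M) used above holds for i.i.d. errors with mean 0 and variance 1, and is easy to verify numerically. The snippet below (a sanity check, not part of the proof) compares a Monte Carlo average of the quadratic form against the exact trace.

```python
import numpy as np

# Numerical sanity check of E(eps' M eps) = tr(M) for i.i.d. errors
# with mean 0 and variance 1.
rng = np.random.default_rng(42)
n = 50
M = rng.standard_normal((n, n))  # an arbitrary fixed n x n matrix

reps = 100_000
eps = rng.standard_normal((reps, n))
quad_forms = ((eps @ M) * eps).sum(axis=1)  # eps' M eps for each draw

print(np.trace(M))        # exact value
print(quad_forms.mean())  # Monte Carlo estimate, close to trace(M)
```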
We can calculate the required bound, where M_{i·} ∈ R^n is the i-th row vector of M, I_i = Σ_{j=1}^{i} e_j e_j⊤, and e_j is the zero vector with only the j-th element being 1. Thus, we only need to prove n^{-2} var(ε⊤M*ε) → 0, where M* = Σ_{i=1}^{n} I_{i-1}M_{i·}M_{i·}⊤I_{i-1}. Owing to |I_{i-1}M_{i·}M_{i·}⊤I_{i-1}|_e ≤ |M_{i·}|_e|M_{i·}|_e⊤, we only need to prove n^{-2} tr(M*M*⊤) → 0. It can be calculated that tr(M*M*⊤) ≤ Σ_{i_1,i_2}(|M_{i_1·}|_e⊤|M_{i_2·}|_e)(|M_{i_2·}|_e⊤|M_{i_1·}|_e) ≤ tr(MM⊤MM⊤).
By (B.11), the result can be obtained.