Score-Matching Representative Approach for Big Data Analysis with Generalized Linear Models

We propose a fast and efficient strategy, called the representative approach, for big data analysis with generalized linear models, especially for distributed data with localization requirements or limited network bandwidth. Given a partition of a massive dataset, this approach constructs a representative data point for each data block and fits the target model using the representative dataset. In terms of time complexity, it is as fast as the subsampling approaches in the literature. As for efficiency, its accuracy in estimating parameters given a homogeneous partition is comparable with the divide-and-conquer method. Supported by comprehensive simulation studies and theoretical justifications, we conclude that mean representatives (MR) work well for linear models or generalized linear models with a flat inverse link function and moderate coefficients of continuous predictors. For general cases, we recommend the proposed score-matching representatives (SMR), which may improve the accuracy of estimators significantly by matching the score function values. As an illustrative application to the Airline on-time performance data, we show that the MR and SMR estimates are as good as the full data estimate when available.


Introduction
In the past decade, big data or massive data has drawn dramatically increasing attention all over the world. The 2009 ASA Data Expo competition made it plain that no standard statistical software could analyze the massive Airline on-time performance data. At that time, the airline data file, about 12GB in size, consisted of 123,534,969 records of domestic flights in the United States from October 1987 to April 2008 (Kane, Emerson and Weston, 2013). Up to December 2019, the Airline on-time performance data collected from the Bureau of Transportation Statistics consists of 387 files and N = 188,690,624 valid records in total.
The response in the Airline on-time performance data was treated as a binary variable, Late Arrival, with 1 standing for late by 15 minutes or more. Generalized linear models (GLMs) have been widely used for modeling binary responses, as well as Poisson, Gamma, and Inverse Gaussian responses (McCullagh and Nelder, 1989; Dobson and Barnett, 2018). In order to fit a GLM with p predictors, a typical algorithm searching for the maximum likelihood estimate (MLE) based on the full data of size N requires O(ζ_N Np²) time to run, where ζ_N is the number of iterations required for the convergence of the full data MLE algorithm (Wang, Zhu and Ma, 2018).
Starting in 2009, substantial efforts have been made on developing both methodologies and algorithms towards big data analysis (see, for example, Wang et al. (2016), for a good survey on relevant statistical methods and computing).
Divide-and-conquer, also known as divide-and-recombine, split-and-conquer, or split-and-merge, first partitions a big dataset into K blocks, fits the target model block by block, and then aggregates the K fitted models to form a final one. A divide-and-conquer algorithm proposed by Lin and Xi (2011) reaches the time complexity of O(ζ_{N/K} Np²), where ζ_{N/K} is the number of iterations required by a GLM MLE algorithm with N/K data points. The accuracy of the estimated parameters based on the divide-and-conquer algorithm relies on the block size N/K, which typically depends on the computer memory. Therefore, as N increases, K has to increase accordingly. Typically, its accuracy is not as good as the full data estimate. The divide-and-conquer idea has been applied to numerous statistical problems (Chen and Xie, 2014; Schifano et al., 2016; Zhao, Cheng and Liu, 2016; Lee et al., 2017; Battey et al., 2018; Shi, Lu and Song, 2018; Chen, Liu and Zhang, 2019). For example, Chen and Xie (2014) made an innovative attempt of majority voting on variable selection based on a divide-and-conquer framework that is similar in spirit to combining confidence distributions in meta-analysis (Singh, Xie and Strawderman, 2005; Xie, Singh and Strawderman, 2011). Schifano et al. (2016) extended the divide-and-conquer idea for online updating problems motivated from a Bayesian inference perspective.
Another popular strategy for big data analysis is subsampling. For example, leveraging technique has been used to sample a more informative subset of the full data for linear regression problems (Ma and Sun, 2014). Inspired by D-optimality in optimal design theory, Wang, Yang and Stufken (2019) proposed an information-based subsampling technique, called IBOSS, for big data linear regression problems. Its time complexity is O(Np), while the ordinary least squares (OLS) estimate for linear models takes O(Np²) time. Motivated by A-optimality, Wang, Zhu and Ma (2018) developed an efficient two-step subsampling algorithm for large sample logistic regression, which is a special case of generalized linear models. The time complexity of the A-optimal subsampler is also O(Np). Compared with the divide-and-conquer strategy, the subsampling approach requires much less computational cost. Nevertheless, its accuracy relies on the subsample size and is typically not as good as the divide-and-conquer estimate.
Other developments include stochastic gradient descent and the polynomial approximate sufficient statistics approaches. Stochastic gradient descent algorithms (Tran, Toulis and Airoldi, 2015; Lin and Rosasco, 2017) update in a sequential manner based on a noisy gradient. Polynomial approximate sufficient statistics methods (Huggins, Adams and Broderick, 2017; Keeley et al., 2020; Zoltowski and Pillow, 2018) construct polynomial approximate sufficient statistics for GLMs for any combination of batch, parallel, or streaming. In terms of accuracy, their performances are comparable with subsampling techniques.
The success of divide-and-conquer methods relies on the similarity between blocks, which can be achieved by randomly partitioning the original dataset. In practice, however, a massive dataset is often stored in multiple files, within which data points share similar values in some fields. We call such a partition a homogeneous partition. For example, the Airline on-time performance data (see Section 4) consists of 387 files labeled by month. In each file, all data points share the same values in fields YEAR, MONTH, and QUARTER, while between files these values are different. A divide-and-conquer strategy would not work directly with the original homogeneous partition due to distinct blocks. Even worse, a massive dataset is sometimes distributed in multiple hard disks, or even a network of interconnected computers, known as nodes (see, for example, a distributed database (Özsu and Valduriez, 2011)). If one needs to re-partition the data randomly, it may involve intensive communication between nodes and require extensive network bandwidth.
Another critical issue in massive data analysis is data localization requirements. Due to security or privacy concerns, data localization or data residency laws prohibit health records, personal data, payment system data, etc., from being transferred freely (Bowman, 2017; Fefer, 2020). Technology companies, such as Microsoft, also use data storage locale controls in their cloud services (Vogel, 2014). The Irish Data Protection Commission sent Facebook a preliminary order to suspend moving data from its European users to the United States (Schechner and Glazer, 2020). Under such circumstances, the partition of data is predetermined and raw data exchange between nodes is prohibited. We call such a data partition a natural partition.
In the computer science literature, algorithms of the CoCoA family have been developed for optimization problems with a distributed dataset and a central server (see Smith et al. (2018) for a good review), which combine local solvers for subproblems in an efficient manner. He, Bian and Jaggi (2018) further extended CoCoA to the COLA algorithm for a decentralized environment, that is, distributed computing without a central coordinator.
To avoid intensive data communication between nodes and even avoid any raw data transfer, we propose a different data analysis strategy for distributed massive datasets with data localization requirements, named the representative approach. When a massive dataset is provided in data blocks, either naturally or partitioned by a data binning procedure, we construct a representative data point for each data block using only the data within the block and initial parameter values, and then run regression analysis on the representative data points. The constructed representatives are typically artificial points which may not belong to the original dataset. The collection of representative data points may be used for further analysis and inference. Compared with the original data, its data volume is significantly reduced.
The representative approach provides an ideal solution for the analysis of naturally partitioned massive data. By exchanging only the estimated parameters and the representative data points among the computing nodes, the representative approach can work well even with slow or restricted network connections. It fulfills user privacy or security requirements since analysts perform regression analysis on the representatives without direct access to the raw data.
The representative approach is inspired by the data binning technique in the computer science literature. By discretizing continuous variables into categorical variables (see, for example, Kotsiantis and Kanellopoulos (2006)), a data binning procedure partitions a continuous feature space into blocks labeled by so-called smoothing values. It focuses on how to partition data into blocks or bins, while the smoothing values are typically chosen from class labels, boundary points, centers, means, or medians of the data blocks. A data binning procedure is often used for data pre-processing, whose performance is not guaranteed, especially for nonlinear models.
Different from the data binning technique, the representative approach proposed in this paper assumes a given data partition and concentrates on constructing the best smoothing values, which we call representatives, more efficiently for a pre-specified regression model. With a given partition, the goal of the representative approach is to run as fast as subsampling approaches, while estimating model parameters as accurately as the divide-and-conquer method.
Actually, using the proposed representative approach, a GLM is fitted on K representatives constructed from the original N data points (K ≪ N). Its time complexity is also O(Np), the same as the subsampling approaches. On the other hand, the K representatives are not a subset of the N data points, but summarize the information from every single one of the original N data points. Based on our comprehensive simulation studies, by matching the score function values of GLMs, the proposed score-matching representative (SMR) approach is often comparable with the full data estimate.
The data binning technique that we consider here partitions the data according to the predictors. It is different from the sufficient dimension reduction techniques which start with a data partition based on the responses or the conditional class probabilities of binary responses (Shin et al., 2014, 2017). The representatives described in this paper are different from the representative points (rep-points) developed in the statistical literature, which aim to represent a distribution the best. Research work on rep-points includes the principal points in Flury (1990), the mse-rep-points in Fang and Wang (1994), and the recent support points in Mak and Joseph (2017, 2018). Although the rep-points can capture the covariate distribution better, they may not be more informative in estimating model parameters, and thus are not as efficient as subsampling methods or the proposed representative approaches (see Section A.4 in the Appendix for comparisons).
The remainder of this paper proceeds as follows. In Section 2 we describe the general framework of the representative approach and data partitions. After comparing mean, median, and mid-point representatives, we recommend the mean representative (MR) for big data linear regression analysis. In Section 3 we develop SMR along with its theoretical justifications. We recommend SMR for big data generalized linear regression analysis. In Section 4, we use the Airline on-time performance data as an illustrative example for real big data analysis. We show that the MR and SMR estimates are as accurate as the full data estimate when available. We conclude in Section 5. The proofs of theorems and corollaries, more corollaries, more simulations, as well as more details about the real case study, are relegated to the Appendix.

Table 1: Examples of ν(η) and G(η) for commonly used GLMs

Generalized linear model and score function
Given the original data set {(x_i, y_i), i = 1, 2, . . . , N} with covariates x_i = (x_i1, . . . , x_id)^T ∈ R^d and response y_i ∈ R, we consider a generalized linear model assuming independent response random variables Y_i and the corresponding predictors X_i = (h_1(x_i), . . . , h_p(x_i))^T. For model-based data analysis with fairly general known functions h_1(·), . . . , h_p(·), we would rather regard the data set as D = {(X_i, y_i), i = 1, . . . , N}. Following McCullagh and Nelder (1989), there exists a link function g and regression parameters β = (β_1, . . . , β_p)^T, such that E(Y_i) = μ_i and η_i = g(μ_i) = X_i^T β, where Y_i has a distribution in the exponential family. For typical applications, the link function g is one-to-one and differentiable. Writing G = g^{-1} and letting ν(η) denote the model-specific weight function (see Table 1), the maximum likelihood estimate β̂ solves the score equation s(β) = Σ_{i=1}^N (y_i − G(η_i)) ν(η_i) X_i = 0.
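As a concrete sketch of these definitions, consider logistic regression (canonical logit link), for which G(η) = 1/(1 + e^{−η}) and ν(η) ≡ 1, so the score reduces to s(β) = Σ_i (y_i − G(η_i)) X_i. The function and simulated data below are illustrative, not from the paper:

```python
import numpy as np

def score_logistic(beta, X, y):
    """Score function s(beta) = sum_i (y_i - G(eta_i)) * X_i for the
    canonical logit link, where G(eta) = 1 / (1 + exp(-eta))."""
    eta = X @ beta                      # linear predictors eta_i = X_i^T beta
    mu = 1.0 / (1.0 + np.exp(-eta))    # G(eta_i) = E(Y_i)
    return X.T @ (y - mu)              # p-vector score

# At the true parameter the score is small relative to N (and it is
# numerically zero at the MLE).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
beta_true = np.array([0.5, 1.0, -1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))
print(score_logistic(beta_true, X, y))  # small relative to N
```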

Representative approaches and mean representatives
In this paper, we assume that the data is given along with a data partition. Let I = {1, 2, . . . , N} be the data index set. The data partition can be denoted by a partition {I_1, I_2, . . . , I_K} of I, with the kth data block D_k = {(X_i, y_i), i ∈ I_k} of size n_k = |I_k|. A representative approach for model-based regression analysis constructs a representative data point (X̃_k, ỹ_k) for data block D_k, k = 1, . . . , K, and then fits the regression model based on the weighted representative dataset D̃ = {(n_k, X̃_k, ỹ_k), k = 1, . . . , K}. This procedure could be repeated to achieve a desired accuracy level of estimated model parameters.
Unlike subsampling approaches, a representative data point may not belong to the original data set. With a given data partition, the goal of the representative approach is to make the model parameter estimate β̃ based on the weighted representative dataset close enough to the full data estimate β̂. Given an initial value of the parameter estimate, the construction of representative data points in one block shall not be affected by another, to facilitate parallel computing.
In practice, a massive dataset is often provided in multiple data files, which forms a natural partition with each block representing a data file. In order to improve the efficiency of the proposed representative approach, one may construct a sub-partition within each natural data block or data file such that the predictors have similar values in each finer block. Many partitioning methods have been proposed in the literature (see, for example, Fahad et al. (2014) for a good survey). From our point of view, there are two types of partitioning methods, grid or clustered. Grid partition is based on the feature space in R^p, with cut points obtained from the summary information of data, such as quantiles (called equal-depth) or equal-width points, which is usually feasible with a moderate number of predictors (Kotsiantis and Kanellopoulos, 2006). Its time complexity is O(Np). Clustered partition is based on clustering algorithms. For example, Pakhira (2014) proposed a linear k-means algorithm with time complexity O(Np), which is especially useful with large p. When the massive data consists of multiple natural blocks or data files, these partitioning methods could be applied on the natural blocks one-by-one to obtain a finer partition of the whole dataset. In practice the sizes of natural blocks or data files are typically fixed. Only the number of data files increases when the massive dataset is updated in a time order. Therefore, the overall time complexity for obtaining all sub-partitions is still O(Np). Improving the efficiency of sub-partitioning is important but out of the scope of this paper.
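A minimal sketch of the equal-depth grid partition described above, cutting each predictor at its interior quantiles; the function name and the default m = 4 are our illustrative choices:

```python
import numpy as np

def equal_depth_partition(X, m=4):
    """Assign each row of X a block label by cutting every column at its
    m-quantiles (equal-depth binning); up to m^p non-empty blocks result."""
    N, p = X.shape
    labels = np.zeros(N, dtype=np.int64)
    for j in range(p):
        # m - 1 interior quantile cut points for column j
        qs = np.quantile(X[:, j], np.linspace(0, 1, m + 1)[1:-1])
        labels = labels * m + np.searchsorted(qs, X[:, j], side="right")
    return labels

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 3))
labels = equal_depth_partition(X, m=4)
print(len(np.unique(labels)))  # at most 4**3 = 64 blocks
```

Each label encodes the p per-column bin indices in base m, so empty blocks simply never appear among the labels.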
There are many choices of representatives that could possibly work for the representative approach. A naive choice is the block center, which is popular in the data binning literature. More specifically, given the kth data block D_k = {(X_i, y_i), i ∈ I_k} with block size n_k, the options for its representative X̃_k include (1) the mid-point of the rectangular block, when a grid partition is given; (2) the component-wise median; (3) the vector mean, that is, X̄_k = n_k^{-1} Σ_{i∈I_k} X_i. Then the weighted representative data for the kth block is defined as (n_k, X̃_k, ỹ_k) with ỹ_k = n_k^{-1} Σ_{i∈I_k} y_i, which will be justified in Theorem 3.2.
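The mean representative construction can be sketched as follows; the function name and the random example data are illustrative:

```python
import numpy as np

def mean_representatives(X, y, labels):
    """Return the weighted mean representative (n_k, Xbar_k, ybar_k) for
    every block: block size, vector mean of predictors, mean response."""
    reps = []
    for k in np.unique(labels):
        idx = labels == k
        reps.append((int(idx.sum()), X[idx].mean(axis=0), float(y[idx].mean())))
    return reps

rng = np.random.default_rng(2)
X = rng.normal(size=(1_000, 3))
y = rng.normal(size=1_000)
labels = rng.integers(0, 10, size=1_000)
reps = mean_representatives(X, y, labels)
```

One sanity check: the size-weighted average of the block means recovers the overall sample means exactly, since no observation is dropped.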
A comprehensive simulation study with linear models shows that using block means, the mean representative approach (MR), is more efficient than the midpoint and median options, as well as the IBOSS subsampling approach (Wang, Yang and Stufken, 2019). For details please see Section A.1 in the Appendix.
As for time complexity, the full data ordinary least squares (OLS) estimate takes O(Np²). The IBOSS (Wang, Yang and Stufken, 2019) costs O(Np). For typical applications, K ≪ N and p ≪ N, so the overall time complexity, including constructing representatives and fitting models, is O(N) for mid-point or O(Np) for median and mean representative approaches, for both linear regression models and GLMs.
The simulation studies in Section A.1 of the Appendix also imply that the maximum Euclidean distance within data blocks, denoted by Δ = max_k max_{i,j∈I_k} ‖X_i − X_j‖, plays an important role in the efficiency of mean representatives. For general representatives, the key quantity is the maximum deviation from the corresponding representatives, Δ̃ = max_k max_{i∈I_k} ‖X_i − X̃_k‖.

Score-matching representative approach for GLMs
The MR approach works very well for linear models and can be validated for GLMs when Δ is sufficiently small. Nevertheless, for moderate Δ with general GLMs, the MR approach is not so satisfactory (see Section 3.2). In this section, we propose a much more efficient representative approach for GLMs, called the score-matching representative (SMR) approach.
Recall that in Section 2.1 the maximum likelihood estimate β̂ solves the score equation s(β) = 0. It is typically obtained numerically by the Fisher scoring method (McCullagh and Nelder, 1989), which iteratively updates the score function with the current estimate of β. Inspired by the Fisher scoring method, given some initial values of the estimated parameters, our score-matching representative approach constructs data representatives by matching the values of the score function block by block, and then applies the Fisher scoring method on the representative dataset to get estimated parameter values for the next iteration. We may repeat this procedure a few times until a certain accuracy level is achieved. According to our comprehensive simulation studies (see Section A.2 in the Appendix), three iterations are satisfactory for typical applications.
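To make the Fisher scoring step concrete, here is a hedged sketch of Fisher scoring (IRLS) on a weighted representative dataset for a logistic GLM: with the canonical logit link, ν ≡ 1, each row (w_k, X̃_k, ỹ_k) contributes w_k (ỹ_k − μ_k) X̃_k to the score and w_k μ_k(1 − μ_k) X̃_k X̃_k^T to the Fisher information. Note that ỹ_k may be fractional (a block mean), which this quasi-likelihood form handles; the function name and data are illustrative:

```python
import numpy as np

def fisher_scoring_logistic(Xr, yr, w, n_iter=25):
    """Fisher scoring (IRLS) for weighted logistic regression.
    Xr: (K, p) representatives; yr: (K,) mean responses in [0, 1];
    w: (K,) block sizes used as weights."""
    beta = np.zeros(Xr.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Xr @ beta)))
        score = Xr.T @ (w * (yr - mu))                      # weighted score
        info = (Xr * (w * mu * (1 - mu))[:, None]).T @ Xr   # Fisher information
        beta = beta + np.linalg.solve(info, score)
    return beta

# Sanity check on full data (all weights 1): recovers the usual MLE.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(5_000), rng.normal(size=(5_000, 2))])
beta_true = np.array([0.3, 0.7, -0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))
beta_hat = fisher_scoring_logistic(X, y.astype(float), np.ones(5_000))
```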

Score-matching representative approach
Let s_k(β) = Σ_{i∈I_k} (y_i − G(η_i)) ν(η_i) X_i denote the value of the score function contributed by the kth data block D_k = {(X_i, y_i), i ∈ I_k}, and s̃_k(β) = n_k (ỹ_k − G(η̃_k)) ν(η̃_k) X̃_k denote the value of the score function based on the weighted representative data (n_k, X̃_k, ỹ_k), where η_i = X_i^T β and η̃_k = X̃_k^T β are functions of β.
Suppose the estimated parameter value is β̃^(t) at the tth iteration. For the (t+1)th iteration, our strategy is to find the representative (X̃_k, ỹ_k) carrying the same score as the kth data block at β̃^(t), that is, s_k(β̃^(t)) = s̃_k(β̃^(t)), or

n_k (ỹ_k − G(η̃_k)) ν(η̃_k) X̃_k = Σ_{i∈I_k} (y_i − G(η_i)) ν(η_i) X_i.    (3.1)

Multiplying both sides of (3.1) by β̃^(t), we get

n_k (ỹ_k − G(η̃_k)) ν(η̃_k) η̃_k = Σ_{i∈I_k} (y_i − G(η_i)) ν(η_i) η_i.    (3.2)

Equation (3.2) suggests that we take ỹ_k as a weighted average of the y_i's for the SMR approach, that is,

ỹ_k = (Σ_{i∈I_k} ν(η_i) η_i y_i) / (Σ_{i∈I_k} ν(η_i) η_i).    (3.3)

Remark 3.1. The ỹ_k defined by (3.3) is a natural generalization of the mean representative. Actually, let ȳ_k = n_k^{-1} Σ_{i∈I_k} y_i; then it can be verified that ỹ_k = ȳ_k + O(Δ) as Δ goes to 0, given that η̃_k is bounded away from 0. In order to avoid 0 ∈ (min_{i∈I_k} η_i, max_{i∈I_k} η_i), which may lead to an unbounded ỹ_k when Σ_{i∈I_k} ν(η_i) η_i is close to 0, we split such a block into two pieces by the signs of the η_i's and generate two representatives, one for the positive η_i's and the other for the negative η_i's.
Since ỹ_k in (3.3) does not rely on η̃_k, we can further obtain η̃_k by solving (3.2).

Theorem 3.1. There exists an η̃_k ∈ R solving equation (3.2).
The existence of the score-matching representative is guaranteed by Theorem 3.1, whose proof is relegated to the Appendix, Section A.8. Since the solution of (3.2) may not be unique, we choose the η̃_k closest to that of the mean representative to keep Δ̃ smaller. Different from mean representatives, the coordinate of a score-matching representative corresponding to the intercept term may not be exactly one.
Plugging the η̃_k solving (3.2) into (3.1), we get the predictor representative X̃_k for SMR:

X̃_k = [n_k (ỹ_k − G(η̃_k)) ν(η̃_k)]^{-1} Σ_{i∈I_k} (y_i − G(η_i)) ν(η_i) X_i.    (3.4)

The kth weighted representative data point D̃_k = (n_k, X̃_k, ỹ_k), which carries the same score value at β̃^(t) as the kth original data block, will be used for fitting β̃^(t+1) via the Fisher scoring method. For typical applications, we may repeat this procedure T = 3 times to achieve the desired accuracy level (see Figure 5 in the Appendix for a trend of SMR iterations). The complete procedure of the T-iteration SMR approach is described by Algorithm 1.
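Putting (3.1)–(3.4) together for one block, the sketch below constructs a single score-matching representative for a logistic GLM (canonical logit link, so ν ≡ 1). The grid-plus-bisection root search for (3.2) and the mean-representative fallback when no sign change is found are implementation choices of ours, not prescribed by the paper:

```python
import numpy as np

G = lambda eta: 1.0 / (1.0 + np.exp(-eta))   # inverse logit; nu(eta) = 1

def smr_block(Xk, yk, beta):
    """One score-matching representative (n_k, Xtil_k, ytil_k) at beta."""
    nk = len(yk)
    eta = Xk @ beta
    wgt = eta                                  # nu(eta_i) * eta_i with nu = 1
    ytil = (wgt * yk).sum() / wgt.sum()        # (3.3): weighted mean response
    S = ((yk - G(eta)) * eta).sum()            # scalar block score, RHS of (3.2)
    f = lambda e: nk * (ytil - G(e)) * e - S   # (3.2): solve f(eta_til) = 0
    grid = np.linspace(eta.min() - 1, eta.max() + 1, 400)
    v = f(grid)
    sc = np.nonzero(np.sign(v[:-1]) * np.sign(v[1:]) <= 0)[0]
    if len(sc) == 0:                           # no root bracketed: MR fallback
        return nk, Xk.mean(axis=0), float(yk.mean())
    e0 = Xk.mean(axis=0) @ beta                # pick the root nearest MR's eta
    j = sc[np.argmin(np.abs(grid[sc] - e0))]
    lo, hi = grid[j], grid[j + 1]
    for _ in range(80):                        # plain bisection
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if f(lo) * f(mid) <= 0 else (mid, hi)
    etil = (lo + hi) / 2
    # (3.4): Xtil matches the vector score s_k(beta) of the whole block
    Xtil = ((yk - G(eta))[:, None] * Xk).sum(axis=0) / (nk * (ytil - G(etil)))
    return nk, Xtil, ytil
```

By construction X̃_k^T β = η̃_k, so the representative reproduces the block's score s_k(β) exactly at the current estimate.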
For a GLM, the time cost is O(Np) for calculating all η_i's, O(N) for calculating all ỹ_k's using (3.3), O(N) plus ζ_r iterations for solving (3.2), and O(Np) for calculating all X̃_k's by (3.4). Along with the time cost O(ζ_K Kp²) for finding the MLE based on K representative points, a 3-iteration SMR requires O(Np + Nζ_r + ζ_K Kp²). Since ζ_r, ζ_K, K, p ≪ N, the time complexity of SMR for a GLM is essentially O(Np).
Data: dataset D = {(X_i, y_i), i = 1, . . . , N} with a given partition {I_1, . . . , I_K}; a pre-specified number T of iterations
Result: SMR estimate β̃ for a generalized linear model
Use the mean representatives as the initial weighted dataset D̃^(0);
Apply the Fisher scoring method on D̃^(0) and obtain the initial estimate β̃^(0);
for t = 0, 1, . . . , T − 1 do
  Construct the score-matching representatives D̃^(t+1) = {(n_k, X̃_k, ỹ_k), k = 1, . . . , K} at β̃^(t) via (3.3), (3.2), and (3.4);
  Apply the Fisher scoring method on the weighted data set D̃^(t+1) and obtain β̃^(t+1);
end
Algorithm 1: Score-Matching Representative Approach
For illustration purposes, we choose a moderate population size N = 10^6 in this simulation study. In the absence of a natural partition, we use two data-driven partitions: (1) an equal-depth partition with m = 4 splits for each predictor, that is, using the three sample quartiles (25%, 50%, and 75%) as the cut points for each predictor and partitioning the whole data into up to 4^7 = 16,384 blocks (after removing empty blocks, the number of blocks K is actually between 11,488 and 16,384); (2) a k-means partition with the number of blocks K = 1000. The proposed SMR approach starts with MR estimates as its initial values and repeats the iterations 3 times.
According to Table 2, MR, A-opt, and SGD do not match either DC or SMR in terms of accuracy. SMR performs either the best or comparably with DC. For more than half of the scenarios, SMR is even comparable with the full data estimates. With the k-means partition, SMR achieves roughly the same accuracy level with only 1000 representatives. A justification under linear models, which is relegated to the Appendix (Section A.1), shows that a best partition keeps the cluster size δ_k = max_{i,j∈I_k} ‖X_i − X_j‖ the same for k = 1, . . . , K, which partially explains the better performance of k-means partitions.
In conclusion, when the predictors are bounded, MR is a fast and computationally cheap solution for big data analysis with generalized linear models. For more general cases, especially when a higher accuracy level is desired, MR can be used as a pre-analysis for SMR, and the latter provides a significant improvement across different scenarios and different partitions.

Other GLM examples
Commonly used GLMs include binary responses with logit, probit, cloglog, loglog, and cauchit links; Poisson responses with log link; Gamma responses with reciprocal link; and Inverse Gaussian responses with inverse squared link (see detailed formulae in Table 1).
In Table 3, we show the RMSEs of the estimated parameters. The representative approaches, MR and SMR, are based on a k-means partition with K = 1000 for the following three models: (1) Binary response with complementary log-log (or cloglog) link g(μ) = log(−log(1 − μ)). Since G(η) = 1 − exp{−exp(η)} is relatively flat, the SMR estimate is comparable with the full data estimate even with a not-so-good MR estimate.
(2) Poisson response with the canonical link g(μ) = log μ. Since G(η) = exp(η) increases exponentially, the improvement of SMR is slowed down by a not-so-good MR estimate. The variances of the MR and SMR estimates are both high. Thus a good initial value for Poisson regression is crucially important (see Section 3.5 for a detailed discussion on how the slope of G(η) affects the performance of SMR).
(3) Logistic model with interactions. In this case, we simulate x = (x_1, x_2, x_3)^T from mzNormal and include interaction terms among the non-intercept predictors. Both MR and SMR estimates work well.

Theoretical justification of SMR
First of all, for the proposed SMR approach in Section 3.1, the full data estimate β̂ is a stationary point of the SMR iteration. That is, if the current estimate β̃^(t) = β̂, then the representative dataset achieves score 0 at β̃^(t) and thus β̃^(t+1) = β̂ as well.
Recall that β̂ is the maximum likelihood estimate (MLE) based on the full data D = {(X_i, y_i), i = 1, . . . , N}. Let β̃ be the MLE based on a weighted representative dataset D̃ = {(n_k, X̃_k, ỹ_k), k = 1, . . . , K}, which could be obtained by MR, SMR, or other representative approaches. Theorem 3.2 below provides asymptotic results for fairly general representative approaches; its proof is relegated to the Appendix (Section A.8).

Theorem 3.2. For a generalized linear model with a given dataset, suppose its log-likelihood function l(β) is strictly concave and twice differentiable on a compact set B ⊂ R^p and its maximum can be attained in the interior of B. Suppose the representatives satisfy ỹ_k = ȳ_k + O(Δ̃) as Δ̃ → 0. Then ‖β̃ − β̂‖ = O(Δ̃^{1/2}).
Theorem 3.2 covers mean, median, and mid-point representatives. For mean and mid-point representatives, we actually have Δ̃ ≤ Δ. For median representatives, we also have Δ̃ ≤ p^{1/2} Δ. Thus as direct corollaries, ‖β̃ − β̂‖ = O(Δ^{1/2}) for all three center representatives (see Corollaries A.1 & A.2 in Section A.8 of the Appendix).
Technically speaking, any representative approach satisfying (3.1) could be called a score-matching approach. The proposed SMR approach, which satisfies (3.3) and (3.4), is one of the possible solutions of (3.1) and a natural extension of the MR approach. Both Theorem 3.3 and the following theorem provide consistency results for general score-matching approaches, whose proofs are relegated to Section A.8 of the Appendix.
Theorem 3.4. Consider a more general iterative representative approach with estimated parameter β̃^(t) at its tth iteration. Suppose for the (t+1)th iteration, for each k = 1, . . . , K, the obtained weighted representative data (n_k, X̃_k, ỹ_k) satisfies the following two conditions: (a) the representative matches the score function at β̃^(t), that is, (3.1) is true; (b) ỹ_k = ȳ_k + O(Δ̃) as Δ̃ → 0. Then the estimated parameter β̃^(t+1) based on the weighted representative data satisfies

‖β̃^(t+1) − β̂‖ ≤ ρ(Δ̃) · ‖β̃^(t) − β̂‖ + O(Δ̃).    (3.6)

Remark 3.2. We call ρ(Δ̃) in (3.6) the global rate of convergence, which depends on the size of Δ̃. Its specific form can be found in the proof of Theorem 3.4. Based on our experience, even for a moderate size of Δ̃, ρ(Δ̃) can be significantly smaller than 1, and the first few iterations can improve the accuracy level significantly.
For score-matching representatives, condition (a) of Theorem 3.4 holds instantly. As for condition (b) of Theorem 3.4, if |η̃_k| > δ for some δ > 0 as Δ goes to 0, then ỹ_k = ȳ_k + O(Δ) (see Remark 3.1), and thus ỹ_k = ȳ_k + O(Δ̃) since Δ ≤ 2Δ̃. After splitting blocks according to the signs of the η_i's, data blocks with η̃_k close to 0 are rare. For those blocks, we may simply define ỹ_k = ȳ_k. Thus condition (b) can be guaranteed in practice.
The conditions and conclusions of Theorem 3.4 are expressed in terms of Δ̃. When applying the SMR approach, a slight modification may guarantee Δ̃ ≤ Δ ≤ 2Δ̃. Actually, in our simulation studies this is almost always the case for the proposed SMR approach. Occasionally, X̃_k could fall outside the convex hull of the kth data block. For such cases, we replace X̃_k with the MR representative X̄_k. The difference caused in the value of the score function is negligible. In this way, the conclusions of Theorem 3.4 hold for Δ → 0 as well.
Overall, Theorem 3.4 justifies why the proposed SMR approach works well. A special case of Corollary 3.1, whose proof is relegated to Section A.8 of the Appendix, is when all covariates are categorical and the dataset is naturally partitioned by distinct covariate values. When most covariates are categorical except for a few continuous variables, for example, the Airline on-time performance data analysis in Section 4, the partition could be chosen such that Δ is fairly small and thus both MR and SMR estimates work very well.

Asymptotic properties of MR and SMR for big data
In order to study the asymptotic properties of the MR and SMR estimates as N goes to ∞, we assume that the predictors X_1, . . . , X_N ∈ R^p are iid ∼ F with a finite expectation, and that the partition {B_1, . . . , B_K} of the predictor space R^p is fixed. To avoid trivial cases, we assume p_k = F(B_k) > 0 for each k = 1, . . . , K. Then the index block is I_k = {i ∈ {1, . . . , N} | X_i ∈ B_k} with size n_k. By the strong law of large numbers (see, for example, Resnick (1999, Corollary 7.5.1)), as N → ∞, n_k/N → p_k > 0 almost surely. In order to investigate asymptotic properties, we consider the discrepancy from the true parameter value β instead of from the estimate β̂ based on the full data.
For the MR approach, as N → ∞,

X̄_k → E(X | X ∈ B_k),  ȳ_k → E(Y | X ∈ B_k)    (3.7)

almost surely. If the link function g, or G = g^{-1}, is linear, then g(ỹ_k) − X̃_k^T β → 0 and thus the MR estimate β̃ → β. Nevertheless, in general g is nonlinear, and the accuracy of the MR estimate mainly depends on the size Δ of the blocks, not the sample size N. In other words, for a general GLM and a fixed partition of the predictor space, the accuracy of the MR estimate is restricted by (3.7) and thus will not benefit from an increased sample size. Different from MR, by matching the score function of the full data, the proposed SMR approach can still improve its accuracy as the sample size increases, even with a fixed partition of the predictor space. Actually, for a general GLM, ν(η_i)η_i is also bounded on each block. By the strong law of large numbers for independent sequences of random variables (see, for example, Resnick (1999, Corollary 7.4.1)) and a first-order Taylor expansion, as N → ∞ and thus n_k → ∞, the left hand side (LHS) of (3.2), after being divided by n_k, is

n_k^{-1} Σ_{i∈I_k} (y_i − G(η_i)) ν(η_i) η_i = n_k^{-1} Σ_{i∈I_k} (G(X_i^T β) − G(X_i^T β̃^(t))) ν(η_i) η_i + o(1)    (3.8)

almost surely.

K. Li and J. Yang
From (3.8) we see that as N increases, the leading discrepancy of the LHS caused by the responses y_i vanishes. Even if the maximum block size Δ is fixed, when β̃^(t) is close to β, the LHS of (3.2) is small, and so is its right hand side. For blocks with η̃_k away from 0, this indicates that ỹ_k − G(η̃_k), and thus ỹ_k − G(X̃_k^T β), is small. That is, when N → ∞, the SMR representatives {(X̃_k, ỹ_k), k = 1, . . . , K} stay close to the true curve, μ = E(Y) = G(X^T β), which leads to a faster convergence rate of the SMR estimate towards β than MR's.
From (3.9) we conclude that a relatively large G′(X̃_k^T β) may slow down the convergence of the SMR estimate. For example, under a Poisson regression model with log link (see Model (2) in Section 3.3), G(η) = e^η. If the initial estimate of the regression parameter is not so accurate, SMR may converge slowly. For such cases, we suggest a finer partition, or smaller Δ, to obtain a good initial estimate. For models with fairly flat G functions, such as models with the logit link, G′ is small for most blocks. For such cases, even if the initial estimate for SMR is not so accurate, we can still improve the accuracy of the estimate significantly after a few iterations.
To visualize the performance of SMR over sample size N, we use the first simulation setup, mzNormal, for illustration purposes, varying N = 10^5, 10^6, 10^7, 10^8. For MR and SMR, the number of blocks is fixed at K = 1000 with a k-means partition. For the divide-and-conquer method, the block size is typically restricted by the computer memory and thus will not change as N increases. For illustration purposes, we fix the same block size of 1000 as in Section 3.2. As N increases, the number of blocks for the divide-and-conquer method increases proportionally. Figure 1 (see also Table 4) shows how much those methods benefit from increased sample size N over 100 simulations for each N. Figure 1(a) shows that in terms of RMSE from the true parameter value, SMR's estimate is comparable with the full data estimate and converges to β much faster than the other methods. Figure 1(b) shows that SMR's estimate quickly gets closer to the full data estimate β̂ as N increases, while the other methods' estimates do not show a clear pattern of getting closer. This simulation study confirms our conclusion in this section.

CPU time
We use R (version 3.6.1) for all simulation studies listed in this paper. For the IBOSS (Wang, Yang and Stufken, 2019) and A-optimal (Wang, Zhu and Ma, 2018) subsampling methods, we use the R functions provided by the authors. The CPU time costs for MR and 1-iteration SMR with a given k-means (K = 1000) partition, A-optimal subsampling with subsample size 20,000, and divide-and-conquer with 1000 random blocks are shown in Table 12 and Figure 8 in the Appendix. With the given partition, MR is comparable to the A-optimal subsampling method in terms of computational time; SMR is a little slower than MR but faster than the divide-and-conquer method.
One drawback of the k-means clustering algorithm is that its time cost grows fast as N increases. In practice, we recommend a subset clustering strategy. That is, the K clustering centers are determined from a subset of M data points (M ≪ N), and the partition of the full data is determined by measuring the distances between the data points and the K centers. Since the predetermined M mainly depends on the number p of predictors, not the sample size N, the time cost of subset clustering is O(Np). According to our simulation study (see Section A.7 of the Appendix), the subset clustering strategy can significantly reduce the clustering time cost while keeping the efficiency of representative approaches.

A case study: airline on-time performance data
The Airline on-time performance data for US domestic flights from October 1987 to December 2019 were collected from the Bureau of Transportation Statistics (https://www.transtats.bts.gov/) as a real example for big data analysis. The original dataset consists of 387 CSV files with 192,555,789 records in total (see Table 16 in the Appendix for more details). After cleaning, the total number of valid records is N = 188,690,624.
For illustration purposes, we consider a main-effects logistic regression model for the binary response ArrDel15 (arrival delayed by 15 minutes or more, 1 = YES) with three categorical covariates and one continuous covariate: QUARTER (season, 1 ∼ 4), used instead of MONTH for simplification; DayOfWeek (day of week, 1 ∼ 7); DepTimeBlk (departure time block, 1 ∼ 4), following the convention of the O'Hare International Airport; and DISTANCE (distance of flight, 8 ∼ 4983 miles). More details about the data and the fitted models can be found in Section A.9 of the Appendix.
In order to evaluate the performance of MR and SMR given that the full data estimate is not available, we choose the MR estimate of the main-effects logistic model on the last 5 years' data (from January 2015 to December 2019) as the "oracle" regression coefficients (denoted by β, see Table 14 in the Appendix), and simulate 10 independent replicates of responses using the logistic model with the oracle parameter values β.
To show how MR and SMR work on a natural partition, we treat each data file (labeled by MONTH) as a node and prohibit the exchange of raw data between nodes. Note that the natural partition in this example is pre-determined. The data from different nodes follow different distributions, since the predictor QUARTER is constant within each node but varies across nodes. We further split each data file into 7 × 4 × 8 = 224 sub-partitions by cutting the only continuous covariate, DISTANCE, into 8 equal-depth intervals and combining the distinct values of DayOfWeek and DepTimeBlk.
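The 224 sub-partitions per file can be built with quantile cuts on DISTANCE crossed with the categorical levels. A minimal sketch with simulated stand-in columns (the real files are not read here; only the column names come from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-ins for one monthly file's columns (illustrative only)
n = 50_000
day = rng.integers(1, 8, size=n)     # DayOfWeek: 1..7
blk = rng.integers(1, 5, size=n)     # DepTimeBlk: 1..4
dist = rng.uniform(8, 4983, size=n)  # DISTANCE in miles

# Equal-depth (quantile) cut of DISTANCE into 8 intervals
cuts = np.quantile(dist, np.linspace(0, 1, 9)[1:-1])  # 7 internal cut points
dist_bin = np.searchsorted(cuts, dist)                # bin index 0..7

# Combine with the categorical levels: 7 * 4 * 8 = 224 sub-partitions
sub = ((day - 1) * 4 + (blk - 1)) * 8 + dist_bin
print(len(np.unique(sub)))
```

Because the cut points are quantiles of the node's own DISTANCE column, each node can compute its sub-partition locally without exchanging raw data.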
To show how the accuracy of the estimates changes with increasing data size, we run a sequence of four experiments using the first 60 months, 120 months, 240 months, and all 387 months of data, respectively. In each experiment, we obtain the full data estimate (not available for 240 months and 387 months due to the data size), as well as the MR and SMR estimates, which are listed in Table 5. The average and standard deviation (std) of the RMSEs are obtained from 10 independent simulations.
In terms of RMSE from the oracle β (see Table 5), MR and SMR perform as well as the full data estimate for 60 months and 120 months. The main reason is that the oracle coefficient of the only continuous predictor, DISTANCE, is as small as 7.955 × 10^{-5}. Even multiplied by the largest value of DISTANCE, 4983, the contribution of DISTANCE is still less than 0.4, which is small compared with the oracle intercept −2.322. In other words, this scenario is fairly close to a case where all predictors are categorical. According to Corollary 3.1, both the MR and SMR estimates match the full data estimate very well. Table 5 also shows that as the data size gets bigger, including 240 months and 387 months, both the MR and SMR estimates are better than the last available full data estimate, obtained at 120 months. Interestingly, if we amplify the effect of the continuous predictor DISTANCE, say by enlarging its oracle coefficient from 7.955 × 10^{-5} to 7.955 × 10^{-4}, the SMR estimate shows clearly higher accuracy than the MR estimate (see Section A.9 in the Appendix for more details).

Discussion and conclusion
In practice, a natural partition may be provided with only partially homogeneous blocks. To investigate the dependence of the proposed SMR approach on homogeneous partitions, we run another simulation study in which the given partition gradually changes from a random partition to a homogeneous partition. More specifically, under the logistic regression model (3.5) with 7 covariates, for k = 0, 1, 2, . . . , 7, we first partition the data according to the first k covariates into 4^k blocks using the equal-depth criterion. Then, for each of the 4^k blocks, we randomly divide the data in that block into 4^{7−k} sub-blocks. In this way, k = 0 leads to a random partition of 4^7 blocks, and k = 7 corresponds to an equal-depth homogeneous partition. As k increases from 0 to 7, the partition becomes more and more homogeneous. The performance of MR and SMR on these partitions is displayed in Figure 2. In this simulation study, both MR and SMR are affected by the homogeneity level of the natural partition: the higher k is, the better they perform. Overall, MR seems to be the worst, not matching the A-optimal sampler until k = 7. SMR is not as good as the A-optimal sampler or DC when k is small, but it surpasses the A-optimal sampler for k ≥ 4 and does better than DC at k = 7. An implication is that sub-partitions are necessary for representative approaches when the natural partition is not homogeneous.
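The interpolation between a random and a homogeneous partition can be sketched as follows. The function name `mixed_partition` and the sizes are illustrative, and the equal-depth blocks are implemented here as marginal quantile bins, which is an assumption about the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, m = 100_000, 7, 4  # m = 4 splits per covariate
X = rng.normal(size=(N, d))

def mixed_partition(X, k, m=4):
    """Equal-depth bins on the first k covariates (m^k blocks), then a
    random split of each block into m^(d-k) sub-blocks."""
    N, d = X.shape
    lab = np.zeros(N, dtype=np.int64)
    for j in range(k):  # marginal quantile bins on covariate j
        cuts = np.quantile(X[:, j], np.linspace(0, 1, m + 1)[1:-1])
        lab = lab * m + np.searchsorted(cuts, X[:, j])
    # random sub-blocks complete the partition to (at most) m^d blocks
    lab = lab * m ** (d - k) + rng.integers(0, m ** (d - k), size=N)
    return lab

for k in (0, 4, 7):
    lab = mixed_partition(X, k)
    print(k, len(np.unique(lab)))
```

Setting k = 0 yields a purely random partition of 4^7 blocks, while k = 7 yields the fully equal-depth homogeneous partition, matching the two extremes described above.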
When all predictors of the GLM are categorical or discrete, the best solution, when applicable, is to partition the data according to the distinct predictor values. In this case, Δ = 0, and both the MR and SMR estimates exactly match the full data estimate.
For GLMs with a flat G(η) (that is, G'(η) is bounded by some moderate number), such as the logit, probit, cloglog, loglog, and cauchit links for binomial models, one may check the coefficients of the continuous variables fitted by MR. If the parts of the linear predictor contributed by continuous variables are relatively small compared to the intercept or the contributions of categorical ones, then the MR estimate might be good enough. Otherwise, we recommend the SMR solution. For GLMs with unbounded or large G'(η), such as the Poisson and Gamma models, we recommend SMR over MR, with a finer partition (see Section 3.5 for a more detailed discussion).
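The "flat G" condition can be made concrete numerically: for the logit link, G'(η) = G(η)(1 − G(η)) ≤ 1/4 everywhere, while for the log link G'(η) = e^η is unbounded. A quick check:

```python
import numpy as np

eta = np.linspace(-5, 5, 1001)

# Logit link: G(eta) = 1/(1 + e^(-eta)), so G'(eta) = G(1 - G) <= 1/4
G_logit = 1.0 / (1.0 + np.exp(-eta))
dG_logit = G_logit * (1.0 - G_logit)

# Log link (Poisson): G(eta) = e^eta, so G'(eta) = e^eta, unbounded
dG_log = np.exp(eta)

print(dG_logit.max(), dG_log.max())
```

The bounded derivative is what keeps the MR error (3.7) moderate for binomial links, whereas the exponential growth of G' under the log link motivates the finer partition recommended above.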
Data partition, or more specifically the maximum block size Δ, is critical for both MR and SMR. Asymptotically, the accuracy of the MR estimate depends on Δ and does not benefit from an increased sample size except for linear models, while the accuracy of the SMR estimate can still improve with increased sample size under a fixed partition of the predictor space (see Section 3.5 and Section A.5 in the Appendix for a more detailed discussion).
To illustrate how SMR scales with dimension, we also run simulations for various covariate dimensions d using MR, SMR, A-opt, and DC. To avoid the linear predictor growing with the dimension d, we randomly generated a (d + 1)-dimensional β with ‖β‖ = 3 for each simulation as the true regression coefficients. Figure 3 shows that, as the covariate dimension d in the main-effects logistic model increases, the performance of the SMR estimate, using the MR estimate as the initial value, drifts away from the full data estimate. As expected, Δ increases with d, which poses a challenge for both MR and SMR. The A-optimal sampler appears fairly robust across different dimensions. One strategy for large-d problems is to use the A-optimal estimate as the initial parameter value for the SMR iterations.
When d is large, it is also difficult to reduce the maximum block size Δ efficiently due to the curse of dimensionality. How to obtain an efficient partition or perform variable selection for large d is critical for representative approaches, but is out of the scope of this paper.
The framework of representative approaches allows data analysts to work with the representative data instead of the raw data. In scenarios where different sources of data are owned by different individuals, organizations, companies, or countries (also regarded as nodes) with competing interests, the exchange of raw data between nodes might be prohibited. In such a scenario, neither divide-and-conquer nor subsampling approaches are feasible, while representative approaches may provide an ideal solution, since the representative data points typically cannot be used to track the raw data.
Compared with the CoCoA algorithms (Smith et al., 2018), the representative approaches introduced in this paper also require a central server to collect representative data points and carry out the regression analysis. Utilizing an idea similar to the COLA algorithm (He, Bian and Jaggi, 2018), the representative approaches may be applied in a decentralized environment as well.

A.1. SMR and MR for linear model
The simulation study in this section is based on the linear regression model y_i = X_i^T β + ε_i, i = 1, . . . , N, where the ε_i's are iid N(0, σ²). Note that linear regression models are special cases of generalized linear models with normally distributed responses and the identity link (see Table 1). Analogous to the simulation setup in Section 3.2, we take N = 10^6, d = 7, β_0 = 0, β_1 = · · · = β_d = 0.5, and σ² = 1, as well as the same distributions for simulating the covariates. The main-effects predictors in (A.1) are for illustration purposes; the representative approach can actually work with general predictors including, for example, interactions of covariates.
In the absence of a natural partition, we again use an equal-depth partition with m = 4 splits for each predictor, and a k-means partition with the number of blocks K = 1000. In Table 6, Mid denotes the mid-point representative, Med the median representative, and IBOSS the information-based optimal subdata selection. Table 6 shows both the average and standard deviation of the root mean squared errors (RMSEs, ((1/7) Σ_{i=1}^7 (β̂_i − β_i)²)^{1/2}) between the estimated parameter values β̂_i and the true values β_i across different simulation settings, each with 100 independent simulations.
In terms of RMSE, Table 6 clearly shows that MR outperforms both the mid-point (Mid) and median (Med) representative approaches, as well as the information-based optimal subdata selection (IBOSS) proposed by Wang, Yang and Stufken (2019) with 20,000 subsamples, which is larger than the largest possible number of non-empty blocks or representatives. In terms of distance to the true parameter value, the MR estimates are comparable even with the estimates based on the full data.
From Table 6, we also see that the RMSE of MR based on the equal-depth partition, obtained from 11,488 ∼ 16,384 non-empty blocks or representatives on average, is comparable with the RMSE of MR from the k-means partition with only 1000 representatives. This implies that representative approaches based on clustered partitions are more efficient. Actually, the maximum distance within data blocks, Δ = max_k max_{i,j∈I_k} ‖X_i − X_j‖, may play an important role in extracting data information efficiently. The following theorem shows that for linear models, the MR estimate is unbiased. It is also asymptotically efficient as Δ → 0.
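The MR computation for the linear model is just weighted least squares on the block means with weights n_k. A minimal sketch (the sizes and the equal-depth grid partition are illustrative, not the paper's exact simulation settings):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated linear model with intercept 0 and slopes 0.5 (illustrative sizes)
N, d = 100_000, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])
beta = np.array([0.0, 0.5, 0.5, 0.5])
y = X @ beta + rng.normal(size=N)

# Equal-depth grid partition: m = 4 bins per covariate (up to 64 blocks)
m, lab = 4, np.zeros(N, dtype=np.int64)
for j in range(1, d + 1):
    cuts = np.quantile(X[:, j], np.linspace(0, 1, m + 1)[1:-1])
    lab = lab * m + np.searchsorted(cuts, X[:, j])

# Mean representatives (X~_k, y~_k) with weights n_k
blocks = np.unique(lab)
Xt = np.stack([X[lab == k].mean(axis=0) for k in blocks])
yt = np.array([y[lab == k].mean() for k in blocks])
n = np.array([(lab == k).sum() for k in blocks])

# MR estimate: WLS solving (sum_k n_k X~_k X~_k^T) b = sum_k n_k X~_k y~_k
WX = Xt * n[:, None]
b_mr = np.linalg.solve(Xt.T @ WX, WX.T @ yt)

# Full-data OLS for comparison
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.round(b_mr, 3))
print(np.round(b_ols, 3))
```

Consistent with Theorem A.1, the unbiased MR estimate lands close to both the true β and the full-data OLS estimate here.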
Proof of Theorem A.1. Recall that X_i ∈ R^p is the ith predictor vector and y_i ∈ R is the ith response, i = 1, 2, . . . , N; y = (y_1, . . . , y_N)^T is the response vector of the full data, and X = (X_1, . . . , X_N)^T is the predictor matrix of the full data.
Given a data partition {I_1, . . . , I_K} of I = {1, . . . , N}, we denote by X_k, y_k, ε_k the predictor matrix, the response vector, and the error vector of the kth block, k = 1, . . . , K, respectively. Then X^T X = Σ_{k=1}^K X_k^T X_k, which is positive definite according to our assumption.
Denote by ‖·‖_2 the induced matrix norm defined by ‖A‖_2 = max_{‖x‖=1} ‖Ax‖, which is the square root of the largest eigenvalue of A^T A. If A is positive semi-definite, then ‖A‖_2 is simply its largest eigenvalue.
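Both characterizations of ‖·‖_2 are easy to check numerically; a small sketch (the matrix is random and illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 6))

# ||A||_2 = sqrt of the largest eigenvalue of A^T A (the largest singular value)
norm_via_eig = np.sqrt(np.linalg.eigvalsh(A.T @ A).max())
norm_direct = np.linalg.norm(A, 2)

# For a positive semi-definite matrix, ||B||_2 is its largest eigenvalue
B = A.T @ A
psd_norm = np.linalg.eigvalsh(B).max()

print(norm_via_eig, norm_direct, psd_norm)
```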
Denote by λ_1 and λ*_1 the smallest eigenvalues of the matrices Σ_{k=1}^K X_k^T X_k and Σ_{k=1}^K n_k X̃_k X̃_k^T, respectively. According to our assumption, λ_1 > 0 regardless of the partition. By (A.2), we have λ*_1 > λ_1 − Δ²N > 0 if Δ² < λ_1/N. That is, Σ_{k=1}^K n_k X̃_k X̃_k^T is invertible when Δ² is sufficiently small. Therefore, we have the weighted least squares (WLS) estimate from the mean representative dataset, β̃ = (Σ_{k=1}^K n_k X̃_k X̃_k^T)^{-1} Σ_{k=1}^K n_k X̃_k ỹ_k, where ỹ_k = n_k^{-1} 1_{n_k}^T y_k, 1_{n_k} is an n_k × 1 vector of all 1's, and J_{n_k} is the n_k × n_k matrix of all 1's. The MR estimator is unbiased, E(β̃) = β, since ỹ_k = X̃_k^T β + n_k^{-1} 1_{n_k}^T ε_k, and its covariance matrix is Cov(β̃) = σ² (Σ_{k=1}^K n_k X̃_k X̃_k^T)^{-1}. The matrix norm of the difference between the Fisher information matrices of OLS and MR satisfies ‖Σ_k X_k^T X_k − Σ_k n_k X̃_k X̃_k^T‖_2 ≤ Δ²N. Consider the induced matrix norm of the difference between the covariance matrices of the OLS and MR estimators: ‖Cov(β̃) − Cov(β̂)‖_2 ≤ σ² λ_1^{-1} (λ_1 − Δ²N)^{-1} Δ²N ≤ 2σ² Δ²N/λ_1². The last "≤" holds if Δ² < λ_1/(2N). Therefore, as Δ goes to zero, Cov(β̃) converges to Cov(β̂) in terms of the largest eigenvalue.
Best Partition: From the proof of Theorem A.1, we know that the difference between the Fisher information matrices of β̂ and β̃ is bounded by Σ_{k=1}^K n_k δ_k² up to a constant, where δ_k = max_{i,j∈I_k} ‖X_i − X_j‖ is the kth block size. Fixing the number K of blocks, a natural question is to find a partition that minimizes the upper bound Σ_{k=1}^K n_k δ_k². Assume that the average density of the kth block is f_k, so that n_k ≈ c·δ_k^r f_k for some constants c > 0 and r > 0. Typically r = d, while in general r depends on the mapping from the covariate vector x_i to the predictor vector X_i. The goal is then to minimize Σ_{k=1}^K c f_k δ_k^{r+2} subject to a fixed total sample size. Differentiating the corresponding Lagrangian with respect to the block widths shows that the minimizing partition makes the blocks' contributions equal for k = 1, . . . , K; that is, the best partition keeps all the blocks about the same size. This explains why a k-means partition usually works better than an equal-depth partition for MR in linear regressions, because k-means minimizes the within-cluster sum of squared distances (Raykov et al., 2016). From Theorem A.1, we also see that if the grid of the partition is coarse, the variance of the MR estimate could be large and away from the variance of OLS.
As for the comparison between MR and SMR in terms of the RMSEs between β̃ and β (as in Table 6), the performances of MR and SMR are similar and both comparable with the full data estimate β̂ (shown in Table 6). Since for practical data a "true" model or "true" parameter value may not exist, a more realistic goal is to match the full data estimate β̂. In Table 7, we show the RMSEs between β̃ and β̂. They confirm the conclusions in Theorem 3.4 and Section 3.5: for linear models, both MR and SMR perform very well, while SMR is slightly better. Moreover, when Δ is small (for example, with the partition obtained by k-means), the global convergence rate ρ(Δ) (see Remark 3.2) is close to 0 and the improvement from MR to SMR is significant.
It is also interesting to compare the different estimates of the intercept β_0. From Table 8, MR and SMR are among the best and are comparable with the full data estimate.

A.2. Practical number of iterations for SMR
To determine a practical number T of iterations for the SMR algorithm, we simulate 100 datasets of size N = 10^6 for each of the seven distributions of covariates in the linear model (A.1) and the logistic regression model (3.5). We apply 20 iteration steps of SMR. Figures 4 and 5 show that the iterative SMR based on a k-means partition with K = 1000 improves the accuracy significantly in the first 1 or 2 iterations. Starting from the 4th iteration, the relative improvements in RMSE are less than 5% for ueNormal and less than 2% for the other distribution settings.
For GLMs with a flat G(η), such as the linear and logistic models, typically the first iteration based on the initial MR estimate already improves the accuracy significantly. For GLMs with a steeper G(η), such as the Poisson model with log link, we recommend T = 3 (see also the asymptotic properties discussed in Section 3.5).

A.3. SMR vs divide-and-conquer for logistic models
In this section, we use simulation studies to show that, in terms of accuracy, the 3-iteration SMR is comparable with the divide-and-conquer approach (Lin and Xi, 2011), which is also known as divide and recombine, split and conquer, or split and merge in the literature. When there is no ambiguity, we call the 3-iteration SMR simply SMR.
We simulate 100 datasets of size N = 10^6 for each of the seven distributions of covariates in the logistic regression model (3.5). We apply both SMR and the divide-and-conquer (DC) algorithm proposed by Lin and Xi (2011) to estimate the parameter values. Table 2 shows that SMR based on a k-means partition with K = 1000 outperforms the divide-and-conquer method with 1000 blocks in most of the simulation settings. Actually, the SMR estimates are comparable with the full data estimates in terms of RMSE from the true parameter value. In Figures 6 and 7, we plot the corresponding boxplots. For comparison, we also list the corresponding MR estimates, which are overall not as good as DC's. SMR is ideal for massive data stored in multiple nodes, since it exchanges only the representative data points and estimated parameter values between nodes. It can perform well even with limited network bandwidth.
In contrast, divide-and-conquer methods typically operate on random partitions. Each random data block for divide-and-conquer might consist of data points from many different nodes, which requires heavy communication between nodes and may not be feasible when raw data transfer is prohibited.

A.4. SMR vs support points for logistic models
To compare the performance of the proposed methods with the support points technique (Mak and Joseph, 2018), we run a simulation study under the logistic regression model with 7 covariates. We use the R function sp in the package support to search for support points of a given dataset. Since it takes sp more than one hour to generate 20,000 support points from 1,000,000 data points, we reduce the simulation setup to choosing 1,000 support points from 10,000 original data points, which is the simulation setup used by Wang, Zhu and Ma (2018). In Table 9, we list the average RMSE ((1/7) Σ_{i=1}^7 (β̃_i − β̂_i)²)^{1/2} over 100 simulations, where β̂_i is estimated from the full dataset of 10,000 data points, and β̃_i is estimated from 1000 simple random samples (SRS), support points (Support), A-optimal sample points (Aopt), mean representatives (MR), or the proposed score-matching representatives (SMR).
In terms of RMSE, SMR still performs the best. The difference between MR and SMR is not as large as in Table 2, since the sample size N = 10^4 is smaller in this simulation study.

A.5. MR and SMR with finer partition
According to Theorems 3.2 and 3.3, the estimate β̃ obtained by MR or SMR converges to the full data estimate β̂ as Δ → 0. That is, with a finer partition, β̃ gets closer to β̂ (but not necessarily closer to the true parameter β once the dataset is given). Tables 10 and 11 show that with finer and finer partitions, both the MR and SMR estimates get closer to β̂, while SMR is more robust to the block size than MR.

A.6. CPU time of MR and SMR
In this section, we provide Table 12 and Figure 8 mentioned in Section 3.6. All computations are carried out on a single thread of a Mac Pro running macOS 10.15.6 with a 3.5 GHz 6-core Intel Xeon E5 and 32GB of 1866 MHz DDR3 memory.

A.7. Subset clustering strategy
When no natural partition is available or when sub-partitions are needed, a partitioning procedure is required by representative approaches. Based on our simulation study and justification (see Section 3.2 and Section A.1), a clustering procedure such as k-means is more efficient than grid partitions. However, the computational cost of k-means clustering is quite heavy, especially for large sample sizes N. We recommend the subset clustering strategy (see the last paragraph of Section 3.6), which can significantly reduce the clustering cost for representative approaches. More specifically, the K cluster centers are determined by a clustering algorithm on a much smaller subset of size M, which does not change as the full sample size N increases. Once the K centers are determined, each data point with predictor vector X ∈ R^p is assigned to the cluster whose center is closest to X. Since both K and M do not change with N and are much smaller than N, the time complexity of the overall clustering procedure is O(Np).
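The two-step procedure can be sketched as follows; the sizes are illustrative, and a few Lloyd iterations on the subset stand in for whatever k-means implementation is actually used:

```python
import numpy as np

rng = np.random.default_rng(5)

N, p, K, M = 200_000, 5, 100, 10_000  # illustrative sizes
X = rng.normal(size=(N, p))

def nearest(P, C):
    """Index of the nearest center in C for each row of P."""
    d2 = (P ** 2).sum(1)[:, None] - 2 * P @ C.T + (C ** 2).sum(1)[None, :]
    return d2.argmin(axis=1)

# Step 1: locate K centers by Lloyd's k-means on a random subset of size M;
# this cost depends on M and K only, not on N
sub = X[rng.choice(N, size=M, replace=False)]
centers = sub[rng.choice(M, size=K, replace=False)].copy()
for _ in range(10):
    lab = nearest(sub, centers)
    for k in range(K):
        pts = sub[lab == k]
        if len(pts):
            centers[k] = pts.mean(axis=0)

# Step 2: one pass over the full data assigning each point to its nearest
# fixed center; O(NKp), i.e. O(Np) for fixed K
labels = np.empty(N, dtype=np.int64)
chunk = 50_000
for s in range(0, N, chunk):
    labels[s:s + chunk] = nearest(X[s:s + chunk], centers)

print(len(np.unique(labels)))
```

Only the assignment pass in Step 2 scales with N, which is why the overall cost stays linear in the sample size.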
To compare the performance of the subset clustering strategy with full data clustering for MR and SMR, we carry out two simulation studies. In the first study, we generate datasets with sample sizes N = 10^5, 10^6, 10^7, 10^8 and d = 7 covariates following mzNormal for the logistic model described in Section 3.2. For each N, we generate 100 independent replicates. The size of the randomly selected subset for locating the K = 1000 clustering centers is fixed at M = 10^5. Figure 9 shows that, in terms of the average SMR RMSE over 100 simulations, the difference between subset clustering and full data clustering is negligible (the result of full data clustering is not available at N = 10^8 due to excessive computational time). Table 13 shows the computational time for clustering. For full data clustering, the time cost is proportional to the sample size N. For subset clustering, we break the overall time cost into two parts. The first part, locating the K cluster centers based on the subset of size M, is fairly short and roughly constant. The second part, assigning the full data to the given K cluster centers, is also proportional to N but much faster than the original full data clustering. Overall, the subset clustering strategy reduces the clustering cost significantly.
To further explore the interaction between the dimension of the predictors and the subset clustering strategy, we simulate datasets with d = 4, 7, 15, 20 covariates and sample size N = 10^6 for logistic main-effects models. Since the dimension of the covariates changes, for each simulation we randomly generate the regression coefficients β such that ‖β‖ = 3. We also adopt different subset sizes M = 5 × 10^4, 10^5, 2 × 10^5, 10^6. Figure 10 shows that, in terms of the MR and SMR RMSEs, the subset clustering strategy is not sensitive to the initial subset size M across different dimensions of predictors.
Lemma A.1 (Kanniappan and Sastry, 1983, Theorem 2.2). Suppose X is a finite-dimensional space and f_n : X → R, n = 0, 1, . . ., are strictly convex. Suppose f_n → f_0 uniformly and x*_n = arg min_x f_n(x) exists uniquely. Then x*_n → x*_0 as n goes to ∞.

Proof of Theorem 3.2. According to McCullagh and Nelder (1989, Section 2.5), the log-likelihood of a GLM with data D = {(X_i, y_i), i = 1, . . . , N}, which is fixed throughout the proof, is given by l(β) = Σ_{i=1}^N { [y_i θ(X_i^T β) − b(θ(X_i^T β))]/a(φ) + c(y_i, φ) }, where θ(·) = (b')^{-1}(g^{-1}(·)), a(·), b(·), and c(·, ·) are known functions, and φ is the dispersion parameter. Given a data partition {I_1, . . . , I_K} of I = {1, . . . , N}, which will become finer and finer later, the log-likelihood contributed by the kth block D_k = {(X_i, y_i), i ∈ I_k} with block size n_k = |I_k| is essentially l_k(β) = Σ_{i∈I_k} { [y_i θ(X_i^T β) − b(θ(X_i^T β))]/a(φ) + c(y_i, φ) }, while the log-likelihood contributed by the weighted kth representative (n_k, X̃_k, ỹ_k) is l̃_k(β) = n_k { [ỹ_k θ(X̃_k^T β) − b(θ(X̃_k^T β))]/a(φ) + c(ỹ_k, φ) }. Then the log-likelihood based on the full data is l(β) = Σ_k l_k(β), and the log-likelihood based on the weighted representative data is l̃(β) = Σ_k l̃_k(β).
Recall that the derivative ∂l/∂β is simply the score function (2.1). By plugging in the first-order Taylor expansion of l̃_k about X̃_k at X_i and applying the Cauchy-Schwarz inequality, we can bound the difference between l_k(β) and l̃_k(β).

Denote F_k = (n_k^{-1} Σ_{i∈I_k} (y_i − G(X_i^T β))² ν(X_i^T β)²)^{1/2}. Then, for sufficiently small Δ and all β ∈ B, the bound holds with some M > 0 that depends on the data, which is given and fixed, but not on the representatives or Δ. That is, l̃(β) converges to l(β) uniformly for all β ∈ B as Δ goes to 0. The strict concavity of l(β) implies the existence and uniqueness of β̂ ∈ B such that β̂ = arg max_β l(β). Let β̃ maximize l̃(β). For sufficiently small Δ, l̃(β) is also strictly concave, which guarantees the existence and uniqueness of β̃. By Lemma A.1, β̃ converges to β̂ as Δ → 0.
When all the covariates are categorical or take finitely many discrete values, one may partition the data according to the distinct X_i's. In this case, Δ = 0.
The rest of this section provides a detailed description of the airline raw data (Table 16) and the working data (Table 17), the MR and SMR estimates based on all available 387 months (Table 18), and the regression coefficients fitted yearly based on the original GLM (full) or the SMR algorithm (Figure 11). Note that the full data estimates for 240 months and 387 months are not available due to memory limitations.