Uncertainty quantification for principal component regression

Abstract: Principal component regression is an effective dimension reduction method for regression problems. To apply it in practice, one typically starts by selecting the number of principal components k, then estimates the corresponding regression parameters using, say, maximum likelihood, and finally obtains predictions from the fitted model. The success of this approach depends heavily on the choice of k, and, due to the noisy nature of the data, it can be risky to rely on a single value of k. Using the generalized fiducial inference framework, this paper develops a method for constructing a probability function on k, which provides a measure of the uncertainty in its value. In addition, this paper constructs novel confidence intervals for the regression parameters and prediction intervals for future observations. The proposed methodology is supported by theoretical results, tested by simulation experiments, and compared with other methods on real data. To the best of our knowledge, this is the first time that a full treatment of uncertainty quantification has been formally considered for principal component regression.


Introduction
The high-dimensional regression problem has attracted enormous attention in recent years. A typical linear model can be expressed as y = Xβ + σu, where y = (y_1, ..., y_n)^T is a vector of responses and X = (x_1, ..., x_n)^T is a design matrix of size n × p. Also, β = (β_1, ..., β_p)^T is a vector of p unknown parameters, σ² is the unknown noise variance, and u = (u_1, ..., u_n)^T is a vector of n i.i.d. normal random variables with mean zero and variance one. It is assumed that u and x_1, ..., x_n are independent.
A high-dimensional model is one with p ≫ n. In many situations the large number of predictors are highly correlated. For such cases, principal component regression, which uses principal components (PCs) instead of the original predictors, is a powerful method that reduces the dimension and orthogonalizes the regression problem (Jolliffe, 1982). The idea of applying principal component analysis in regression was first discussed by Hotelling (1957). Some recent engineering applications of PC regression can be found in Adusumilli et al. (2015); Huang and Yang (2012); Lipponen et al. (2010); Xu et al. (2013); Zhu et al. (2018). The additional structure, imposed by the assumption that the directions of largest predictor variability are also important for explaining the response, allows for more specialized and efficient statistical methods than those that do not make this assumption, such as Lai et al. (2015).
Here is a brief description of PC regression. Let X = ULV^T be the singular value decomposition of X, where L = diag{l_i} contains the non-negative singular values of X, and the columns of U (n × p) and V (p × p) are, respectively, the orthonormal left and right singular vectors of X. For any k ∈ {1, ..., p}, let V_k denote the p × k matrix formed by the first k columns of V. Then X_k = XV_k is an n × k matrix having the first k principal components as its columns. In PC regression it is customary to assume that y is generated by the first k_0 < n principal components of X (k_0 unknown). That is, y is generated from the model

y = X_{k_0} γ_{k_0} + σu,    (1)

where γ_{k_0} = (γ_1, ..., γ_{k_0})^T is a vector of k_0 unknown parameters, σ is the unknown standard deviation of the noise, and u = (u_1, ..., u_n)^T is a vector of i.i.d. standard normal random variables.

The unknown k_0 can be estimated with a model selection criterion such as AIC or BIC; denote the resulting estimate by k̂_0. One can then estimate γ_{k̂_0}, for example, by regressing y on X_{k̂_0}. The final PC regression estimator of β based on the first k̂_0 principal components is then given by β̂_{k̂_0} = V_{k̂_0} γ̂_{k̂_0} ∈ R^p. This is the top-down selection rule for deciding which PCs should be kept in the model (Xie and Kalivas, 1997): one always selects the k̂_0 PCs that correspond to the k̂_0 largest singular values. Some alternative selection rules have also been proposed. For example, Sun (1995) suggested correlation principal component regression, in which the PCs are ordered by their correlations with the response, and Sutter et al. (1992) treated PC selection as an optimization problem. However, Xie and Kalivas (1997) showed that the top-down approach generates the most stable global model. In the sequel, this paper follows this top-down approach (a code sketch of the classical procedure is given below), although the methodology can be straightforwardly extended to other selection rules.
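For concreteness, here is a minimal sketch of the classical PC regression pipeline just described: PCs via the SVD, k̂_0 chosen by BIC among the nested candidate models, and coefficients mapped back through V_{k̂_0}. The function name, the candidate range, and the particular BIC formula are our own illustrative choices, not prescribed by this paper.

```python
import numpy as np

def pc_regression(X, y, k_max=None):
    """Top-down PC regression with BIC selection (illustrative sketch)."""
    n, p = X.shape
    if k_max is None:
        k_max = min(n - 2, p)            # candidate model sizes 1..k_max
    U, l, Vt = np.linalg.svd(X, full_matrices=False)
    best = None
    for k in range(1, k_max + 1):
        Xk = U[:, :k] * l[:k]            # first k PCs: X V_k = U_k L_k
        gamma = np.linalg.lstsq(Xk, y, rcond=None)[0]
        rss = float(np.sum((y - Xk @ gamma) ** 2))
        bic = n * np.log(rss / n) + k * np.log(n)
        if best is None or bic < best[0]:
            best = (bic, k, gamma)
    _, k_hat, gamma_hat = best
    beta_hat = Vt[:k_hat].T @ gamma_hat  # map back: beta = V_k gamma_k
    return k_hat, beta_hat
```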
Although many researchers have worked on the problem of choosing the number of PCs (i.e., k_0, the model size) in PC regression, the issue of uncertainty quantification seems to have received very little treatment. To fill this important gap, this paper develops a fully automatic method that quantifies the uncertainties in the estimates of the model parameters (β and σ²), the model size (k_0), and the prediction error. The proposed method is based on generalized fiducial inference (Hannig et al., 2016). To the best of our knowledge, this is the first time that a full treatment of uncertainty quantification is formally considered in PC regression.

An introduction to generalized fiducial inference
Bayesian inference is an important methodology in statistics. It provides a "distribution estimate" for the unknown parameter, which carries richer information than a typical frequentist point estimate (Xie and Singh, 2013). However, the over-enthusiastic application of this methodology when prior information is absent has also caused concerns (Efron, 2013). To avoid such potential issues, Fisher (1930) introduced fiducial inference. Instead of using a non-informative prior as in Bayesian inference, Fisher considered a switching principle, similar in spirit to the idea of maximum likelihood, to assign a distribution to the parameter using the information in the observed data. However, for many years Fisher's idea did not attract much attention from the majority of statisticians.
Recently, there has been a resurgence of interest in variants of fiducial inference. Modern contributions include the Dempster–Shafer theory (Dempster, 2008), the related inferential models (Martin and Liu, 2015), confidence distributions (Xie and Singh, 2013), and, more generally, fusion learning (Cheng et al., 2014).
The particular variant of Fisher's fiducial idea that this paper considers is the so-called generalized fiducial inference (GFI). The success of GFI for conducting statistical inference has been demonstrated in many areas, covering both traditional and modern problems; see Hannig et al. (2016) and references therein.
Generalized fiducial inference begins with expressing the relationship between the data y and the parameter θ as

y = G(u, θ),    (2)

where G(·, ·) denotes the so-called data generating algorithm, and u is the random component whose distribution is completely known; for example, an i.i.d. N(0, 1) random vector. Similar to maximum likelihood estimation, in GFI the roles of y and θ are switched: the random y is treated as deterministic in the likelihood function, while the deterministic θ is treated as random. With this, we can define the set {θ : y = G(u*, θ)} as the inverse mapping of G, where u* is an independent copy of u. Note that such an inverse does not always exist: there may be either no θ or more than one θ such that y = G(u*, θ).
For the first case, we remove the values of u for which there is no solution from the sample space and re-normalize the probability. For the second case, Hannig (2009) suggested randomly picking one element from {θ : y = G(u * , θ)}.
A probability distribution on θ can be defined through (2) in the following manner. Suppose that for any observed y and all u there exists a unique θ such that y = G(u, θ); that is, the inverse always exists, and we can write it as

θ = Q_y(u).    (3)

Since the distribution of u is completely known, one can always generate a random sample ũ_1, ũ_2, ..., and, via (3), transform it into a random sample of θ: θ̃_1 = Q_y(ũ_1), θ̃_2 = Q_y(ũ_2), .... Below we shall call {θ̃_1, θ̃_2, ...} a fiducial sample of θ. As with a posterior sample in the Bayesian context, we can use it to calculate point estimates and to construct confidence intervals for θ. Meanwhile, we can also obtain a density r(θ) for θ. Here r(θ) is called the generalized fiducial density for θ, and plays a role similar to that of the posterior density in the Bayesian context. Observe that for the PC regression problem the data generating algorithm is (1) and θ contains three components: θ = {k, σ, γ_k}, where k denotes the number of principal components, σ² is the noise variance, and β_k = V_k γ_k is the vector of regression coefficients when k principal components are used.
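As a toy illustration of this inversion (our own example, not from the paper), consider a single observation y from N(μ, 1), so that the data generating equation (2) is y = μ + u with u ~ N(0, 1). Here Q_y(u) = y − u, and a fiducial sample for μ is obtained by plugging in fresh copies of u:

```python
import numpy as np

rng = np.random.default_rng(0)

y = 1.3                                 # one observed data point
u_star = rng.standard_normal(10000)     # independent copies of u ~ N(0, 1)
mu_fid = y - u_star                     # Q_y(u*) = y - u*: fiducial sample for mu

# Point estimate and 95% confidence interval from the fiducial sample.
print(mu_fid.mean(), np.quantile(mu_fid, [0.025, 0.975]))
```

In this simple location model the fiducial distribution of μ is just N(y, 1); the PC regression setting requires the more general machinery of the next section.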
Next we will use the ideas of GFI to obtain a generalized fiducial density for k, and to construct confidence intervals for β_k and σ² as well as prediction intervals for y (Wang et al., 2012; Shen et al., 2018).

GFI for principal component regression
While the above description of GFI is conceptually simple and general, it may not be directly applicable in some situations. When the model dimension is known, Hannig (2013) derived a workable formula for r(θ) applicable in most situations where the data follow a continuous distribution. In what follows, we assume that for the observed y and all θ, equation (2) has a unique solution u = G^{−1}(y, θ).
The next theorem derives the fiducial density for a single model.

Theorem 3.1 (Theorem 1 of Hannig et al. (2016)). Under some differentiability assumptions, the generalized fiducial distribution is absolutely continuous and has density

r(θ) = f(y, θ) J(y, θ) / ∫_Θ f(y, θ′) J(y, θ′) dθ′,    (4)

where f(y, θ) is the likelihood and the function

J(y, θ) = D( ∇_θ G(u, θ) |_{u = G^{−1}(y, θ)} ),  with D(A) = (det A^T A)^{1/2}.    (5)

However, for the current problem the dimension k_0 is unknown, so it has to be included as a parameter in the data generating equation. Since k is a discrete parameter, Theorem 3.1 is not directly applicable. The following theorem gives the fiducial probability in the context of model selection.

Theorem 3.2 (Theorem 4 of Hannig et al. (2016)). Under identifiability and regularity assumptions (in particular, the number of parameters |M| ≤ n), the marginal generalized fiducial probability of model M is

r(M) = q^{|M|} ∫_{Θ_M} f_M(y, θ_M) J_M(y, θ_M) dθ_M / Σ_{M′} q^{|M′|} ∫_{Θ_{M′}} f_{M′}(y, θ_{M′}) J_{M′}(y, θ_{M′}) dθ_{M′},    (6)

where the sum ranges over all candidate models, f_M(y, θ_M) is the likelihood under model M, and J_M(y, θ_M) is the Jacobian (5). If |M| > n then r(M) = 0.
Notice that (6) bears some similarity to the Bayesian posterior probability of a model, with the integral ∫_{Θ_M} f_M(y, θ_M) J_M(y, θ_M) dθ_M playing the role of the marginal likelihood of the data, and q^{|M|} playing the role of a prior model probability. However, (6) is not a result of Bayes' theorem; rather, it is derived by inverting the data generating algorithm.
In our problem θ_M = {σ, γ_k} and |M| = k + 1, so in what follows we will simplify notation by replacing the subscript M with the subscript k. We will also follow the minimum description length principle (e.g., Barron et al., 1998; Rissanen, 1996) and use q = e^{−(1/2) log n} = n^{−1/2}.

Let us first calculate J_k(y, θ_k) for a fixed k using the data generating algorithm (1) and formula (5). Let RSS_k denote the residual sum of squares when the corresponding γ_k is estimated by maximum likelihood using the first k principal components. Direct use of (5) and the Cauchy–Binet formula gives

J_k(y, θ_k) = σ^{−1} RSS_k^{1/2} ∏_{i=1}^{k} l_i.

Next we calculate the integral in (6); integrating out γ_k and then σ yields

∫_{Θ_k} f_k(y, θ_k) J_k(y, θ_k) dθ_k = (1/2) Γ((n − k)/2) π^{−(n−k)/2} RSS_k^{−(n−k−1)/2}.

Consequently, the marginal generalized fiducial probability (6) is proportional to

R(k) = n^{−(k+1)/2} Γ((n − k)/2) π^{−(n−k)/2} RSS_k^{−(n−k−1)/2}.    (7)
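For completeness, here is a sketch of the two integration steps behind (7) under the Gaussian model (1); this derivation is our own reconstruction and uses only the standard normal and inverse-gamma integrals.

```latex
% Step 1: integrate out gamma_k (a Gaussian integral around the MLE):
\int_{\mathbb{R}^k} f_k(y,\theta_k)\, d\gamma_k
  = (2\pi\sigma^2)^{-n/2} e^{-\mathrm{RSS}_k/(2\sigma^2)}
    (2\pi\sigma^2)^{k/2} \det(X_k^T X_k)^{-1/2},
\qquad \det(X_k^T X_k)^{1/2} = \prod_{i=1}^k l_i .

% Step 2: the product of singular values cancels against the Jacobian
% J_k = \sigma^{-1}\,\mathrm{RSS}_k^{1/2}\prod_{i=1}^k l_i, leaving
\mathrm{RSS}_k^{1/2} \int_0^\infty (2\pi\sigma^2)^{-(n-k)/2}
    e^{-\mathrm{RSS}_k/(2\sigma^2)}\, \sigma^{-1}\, d\sigma
  = \tfrac{1}{2}\,\Gamma\!\left(\tfrac{n-k}{2}\right)
    \pi^{-(n-k)/2}\, \mathrm{RSS}_k^{-(n-k-1)/2}.
```

Multiplying by q^{k+1} = n^{−(k+1)/2} and dropping the constant 1/2 gives R(k) in (7).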

Practical generation of fiducial sample
This subsection presents a practical procedure for generating a fiducial sample {k̃, σ̃², β̃}. First, for any fixed k, it is straightforward to show that the generalized fiducial distribution of σ² conditional on k is

σ² | k ~ RSS_k / χ²_{n−k},    (8)

and the generalized fiducial distribution of γ_k conditional on (k, σ²) is

γ_k | (k, σ²) ~ N( γ̂_k, σ² (X_k^T X_k)^{−1} ),    (9)

where γ̂_k is the least squares estimate of γ_k and X_k^T X_k = diag{l_1², ..., l_k²}.

Lastly, we will approximate the generalized fiducial density r(k) for k in (7) as follows: for k ∈ {0, ..., n − 1}, calculate

r̂(k) = R(k) / Σ_{j=0}^{n−1} R(j).    (10)

With the above, we can generate a fiducial sample {k̃, σ̃², β̃} with the following steps:
1. Generate a k̃ from (10).
2. With k̃, generate a σ̃² from (8).
3. With (k̃, σ̃²), generate a γ̃_{k̃} from (9) and set β̃ = V_{k̃} γ̃_{k̃}.
Repeating the above steps, one obtains multiple copies of {k̃, σ̃², β̃}. With these one can form point estimates and confidence intervals for the unknown parameters, in the same manner as with a Bayesian posterior sample. For example, the average of all σ̃² can be used as an estimate for σ², while the 2.5% smallest and 2.5% largest σ̃² values can be used as, respectively, the lower and upper limits of a 95% confidence interval for σ². Estimates and confidence intervals for β, as well as prediction intervals for y, are obtained similarly. A sketch of the whole procedure is given below.
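The following sketch implements the three steps above (our own illustrative code, not from the paper): it computes log R(k) from (7) via gammaln for numerical stability, normalizes as in (10), and then draws (k̃, σ̃², β̃) from (8) and (9). For PCs, γ̂_k and RSS_k have closed forms through the SVD, which the code exploits.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import chi2

def fiducial_samples(X, y, n_samples=1000, seed=0):
    """Fiducial samples {k, sigma^2, beta} for PC regression (sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    U, l, Vt = np.linalg.svd(X, full_matrices=False)
    # (10) allows k up to n-1; we stop at n-2 so (8) has >= 2 degrees of freedom.
    k_max = min(n - 2, p)

    # Since X_k = U_k L_k, gamma_hat_i = (u_i^T y) / l_i and
    # RSS_k = ||y||^2 - sum_{i<=k} (u_i^T y)^2.
    z = U.T @ y
    rss = np.sum(y**2) - np.concatenate(([0.0], np.cumsum(z[:k_max]**2)))

    # log R(k) from (7), computed on the log scale for stability.
    ks = np.arange(k_max + 1)
    logR = (-(ks + 1) / 2 * np.log(n) + gammaln((n - ks) / 2)
            - (n - ks) / 2 * np.log(np.pi) - (n - ks - 1) / 2 * np.log(rss))
    r = np.exp(logR - logR.max())
    r /= r.sum()                                   # (10): fiducial probabilities

    out = []
    for _ in range(n_samples):
        k = rng.choice(ks, p=r)                            # step 1: k ~ (10)
        s2 = rss[k] / chi2.rvs(n - k, random_state=rng)    # step 2: (8)
        gamma = z[:k] / l[:k] + np.sqrt(s2) / l[:k] * rng.standard_normal(k)  # (9)
        beta = Vt[:k].T @ gamma                            # beta = V_k gamma_k
        out.append((k, s2, beta))
    return out, r
```

The returned vector r is the approximated fiducial probability function on k; a 95% confidence set for k_0, as used in the simulations below, is the smallest collection of k values whose r-values sum to at least 0.95.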

Theoretical properties
We have the following theorem; its proof is deferred to the Appendix. Informally, Theorem 4.1 states that under an identifiability condition, Assumption (11), together with regularity conditions, the generalized fiducial probability concentrates on the true model; that is, r(k_0) → 1 in probability as n → ∞.
We remark that Assumption (11) ensures that the true model is identifiable. We also remark that Theorem 4.1 implies that the confidence intervals constructed using the generalized fiducial density (7) will have correct asymptotic coverage, and that the generalized fiducial distribution and the derived point estimators are consistent.

Simulation results
Simulation experiments were conducted to evaluate the practical performance of the proposed GFI method. We set n = 100 and tested different combinations of p, k_0 and σ². For any fixed combination of (p, k_0, σ²), we first generated X (of size n × p) with i.i.d. N(0, 1) elements. Let X_{k_0} be the n × k_0 matrix having the first k_0 principal components of X as its columns. The noisy training data were then generated as y = X_{k_0} γ_{k_0} + ε, where ε is a vector of n i.i.d. N(0, σ²) noise variables. We then applied the proposed generalized fiducial procedure to obtain 1,000 fiducial samples {k̃, σ̃², β̃}, from which the generalized fiducial confidence intervals for σ² and the regression coefficients β_k can be computed.
We used two values of p ∈ {200, 1000}, three values of k_0 ∈ {2, 5, 10}, two values of σ ∈ {0.5, 1} and two values of γ_{k_0} ∈ {(1, ..., 1)^T, (5, ..., 5)^T}; thus a total of 2 × 3 × 2 × 2 = 24 experimental configurations was considered. For each experimental configuration, we simulated 1,000 data sets of size n = 100, and for each data set generalized fiducial confidence intervals were obtained for σ² and β_1. For comparison, we also report results from two other methods, Oracle and BIC. For Oracle the true k_0 is used and the confidence intervals are calculated using classical linear model theory, while for BIC, k_0 is estimated by BIC and the confidence intervals are calculated using the same classical linear model theory.
The results are summarized in Tables 1, 2, 5 and 6. One can see that the performance of the proposed method is very close to that of Oracle, and is superior to BIC.
Additional testing data (X*, y*) were generated to assess the quality of the confidence intervals for E(y*|X*) and the prediction intervals for y* obtained by the proposed method. These testing data were generated as follows, to ensure that X and X* share the same PC structure. First we set n = 100 and generated an n × p matrix X̃ with i.i.d. N(0, 1) elements. Then we applied the singular value decomposition to both X and X̃ to obtain X = ULV^T and X̃ = U_* L_* V_*^T, respectively. Lastly, the test design matrix was calculated as X* = U_* L_* V^T, and the response vector as y* = X*_{k_0} γ_{k_0} + ε, where similarly X*_{k_0} contains the first k_0 PCs of X*. We used the fiducial samples {k̃, σ̃², β̃} obtained from the training data to construct confidence intervals for E(y*_1 | X*) and prediction intervals for y*_1, the first testing observation. The empirical coverage rates are reported in Tables 3, 4, 7 and 8. As before, the proposed method performs very well. A sketch of this test-data construction is given below.
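The matched-PC-structure construction can be sketched as follows (our own code; γ_{k_0} and k_0 are taken as given, and the noise standard deviation is set to 1 purely for illustration). Replacing the right singular vectors of a fresh Gaussian matrix with the training V forces X* to share the training PC directions:

```python
import numpy as np

def make_test_data(X, gamma, k0, seed=1):
    """Generate (X_star, y_star) sharing the PC structure of X (sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    _, _, Vt = np.linalg.svd(X, full_matrices=False)      # training V
    Xt = rng.standard_normal((n, p))                      # fresh Gaussian matrix
    U_s, l_s, _ = np.linalg.svd(Xt, full_matrices=False)  # its own U_*, L_*
    X_star = (U_s * l_s) @ Vt                             # X* = U_* L_* V^T
    pcs = X_star @ Vt[:k0].T                              # first k0 PCs of X*
    y_star = pcs @ gamma + rng.standard_normal(n)         # noise sd 1 (illustration)
    return X_star, y_star
```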
Lastly, Tables 9 and 10 summarize how well the proposed method selects the correct model. In particular, we show the percentage of times the correct model was selected, together with the coverage and average size of the 95% highest-probability confidence interval for k_0. These intervals were obtained by sorting the candidate models according to their fiducial probabilities and taking the smallest number of models whose probabilities add up to at least 95%. The tables show that the correct model usually has the highest fiducial probability. Even when the true model does not have the highest fiducial probability, it is almost always contained in the 95% confidence interval. Moreover, the average size of the CIs for n = 1000 is very close to 1. This agrees with Theorem 4.1, which states that the fiducial distribution concentrates on the true model as n increases.

Real data analysis
Lan et al. (2006) conducted an experiment to examine the genetics of two inbred mouse populations (B6 and BTBR). Expression levels of 22,575 genes of 31 female and 29 male mice were recorded. Some physiological phenotypes, including levels of phosphoenolpyruvate carboxykinase (PEPCK) and glycerol-3-phosphate acyltransferase (GPAT), were also measured by quantitative real-time polymerase chain reaction. Using the credible set approach, Bondell and Reich (2012) derived two methods to predict each of these two phenotypes from the gene expression data. They also compared their results with those from the LASSO estimator (Tibshirani, 1996).

To reduce the number of candidate predictors, we first applied a screening procedure that removes insignificant predictors, keeping the 1,999 genes with the largest marginal correlations with the response. Then the proposed method, the Joint Sets and Marginal Sets methods of Bondell and Reich (2012), and the LASSO estimator were applied to the resulting data set with p = 2000 predictors (1,999 genes plus gender) and n = 60 observations. To compare the performance of these methods, we randomly split the sample into a training set of size 55 and a test set of size 5, and applied the four methods to predict the 5 observations in the test set. We repeated this process 100 times to compare the mean squared prediction errors (MSPEs) and the model sizes of the final models obtained by the methods. The results are summarized in Table 11. Notice that there are two responses: PEPCK and GPAT. The results show that the proposed method performs well. Although the MSPEs of the proposed method are slightly larger than those of the methods of Bondell and Reich (2012), the standard errors of the MSPEs and the sizes of the models selected by the proposed method are smaller. This suggests that the proposed method gives a more stable performance and is less prone to overfitting.

Next, the following experiment was carried out to evaluate the empirical coverage rates of the proposed method on this data set. For both responses (GPAT and PEPCK), we left out the first observation and used the remaining 59 observations to construct a 95% and a 99% prediction interval for the left-out observation. We repeated this leave-one-out process for the remaining 59 observations. The resulting prediction intervals are summarized in Figure 1. For both PEPCK and GPAT, the coverage rates of the 99% prediction intervals are 95%, while the coverage rates of the 95% prediction intervals are 90%.
Since the fiducial intervals are well calibrated when the PC regression assumption is appropriate, these under-coverage rates suggest that the assumption might not be entirely suitable for these data sets.

Conclusion
This paper developed a new approach for variable selection and uncertainty quantification in sparse high-dimensional principal component regression based on generalized fiducial inference. The consistency properties of the proposed method were established theoretically and verified by simulation experiments. The proposed method was also compared with other existing methods on a mouse phenotype data set, where it was shown to be competitive, yielding smaller model sizes and standard errors at the cost of a slightly larger MSPE.
In the above, the following restriction is imposed: if one wants to include the kth principal component in the final model, the first k − 1 principal components must also be included. To relax this restriction, one needs to derive a new generalized fiducial density, adopt a new penalty, as the number of possible models of size k is then increased to (n choose k), and develop a new algorithm for generating fiducial samples from the enlarged model space. We plan to explore this possibility in the future.

Appendix: Proof of Theorem 4.1

Proof. First we prove that

Σ_{k ≠ k_0} R(k)/R(k_0) → 0 in probability.

Without loss of generality, we assume that σ² = 1. Rewrite R(k)/R(k_0) = exp{−T_1 − T_2}, where

T_1 = ((n − k − 1)/2) log(RSS_k / RSS_{k_0})

and

T_2 = ((k − k_0)/2)(log n − log(π RSS_{k_0})) + log( Γ((n − k_0)/2) / Γ((n − k)/2) ).

Case 1: k < k_0. Write μ = X_{k_0} γ_{k_0}, let H_k denote the projection onto the first k principal components, and set Δ_k = μ^T(I − H_k)μ. Now calculate

RSS_k − RSS_{k_0} = Δ_k + 2μ^T(I − H_k)ε + ε^T(H_{k_0} − H_k)ε.    (12)

For the last term in (12), first notice that ε^T(H_{k_0} − H_k)ε ~ χ²_{k_0−k}. Then let c_k = k log log p; a chi-square tail bound gives P(ε^T(H_{k_0} − H_k)ε > c_k) → 0. For the second term in (12), denote Z_k = μ^T(I − H_k)ε / √Δ_k, so that μ^T(I − H_k)ε = √Δ_k Z_k and Z_k ~ N(0, 1), since var(Z_k) = μ^T(I − H_k)(I − H_k)^T μ / Δ_k = 1. Furthermore, the Gaussian tail bound gives P(|μ^T(I − H_k)ε| > √Δ_k c_k) → 0 as p → ∞. Thus RSS_k − RSS_{k_0} is bounded below by Δ_k(1 + o_p(1)). In addition, P(χ²_{n−K} < n/4) ≤ P(χ²_{n−K} < (n − K)/2) ≤ (√e/2)^{(n−K)/2} → 0 as n → ∞, which means P(min_{0<k≤K} χ²_{n−k} < n/4) → 0 as n → ∞. Combining these bounds shows that T_1 diverges fast enough to dominate T_2. For the first term of T_2, notice that log(RSS_{k_0}) = Ω_p(log(n − k_0)), while for the second term, log(Γ((n − k_0)/2)/Γ((n − k)/2)) = Ω_p(((k − k_0)/2) log n) for n large enough.

Case 2: k > k_0. In this case RSS_{k_0} − RSS_k ~ χ²_{k−k_0}. By the Chernoff bound, P(χ²_{k−k_0} > (k − k_0) log n) → 0, so T_1 = −O_p((k − k_0)) while T_2 grows like ((k − k_0)/2) log n; hence T_1 + T_2 → ∞ and R(k)/R(k_0) → 0 in this case as well, completing the proof.