Estimating High-dimensional Covariance and Precision Matrices under General Missing Dependence

A sample covariance matrix $\boldsymbol{S}$ of completely observed data is the key statistic in a large variety of multivariate statistical procedures, such as structured covariance/precision matrix estimation, principal component analysis, and testing of equality of mean vectors. However, when the data are partially observed, the sample covariance matrix from the available data is biased and does not yield valid multivariate procedures. To correct the bias, a simple adjustment method called inverse probability weighting (IPW) has been used in previous research, yielding the IPW estimator. The estimator plays the role of $\boldsymbol{S}$ in the missing data context so that it can be plugged into off-the-shelf multivariate procedures. However, theoretical properties (e.g. concentration) of the IPW estimator have only been established under very simple missing structures: every variable of each sample is independently subject to missing with equal probability. We investigate the deviation of the IPW estimator when observations are partially observed under general missing dependency. We prove the optimal convergence rate $O_p(\sqrt{\log p / n})$ of the IPW estimator based on the element-wise maximum norm. We also derive similar deviation results even when implicit assumptions (known mean and/or missing probability) are relaxed. The optimal rate is especially crucial in estimating a precision matrix, because of the "meta-theorem", which claims that the rate of the IPW estimator governs that of the resulting precision matrix estimator. In the simulation study, we discuss the non-positive semi-definiteness of the IPW estimator and compare the estimator with imputation methods; both issues are practically important.


Introduction
One of the overarching themes in the statistical and machine learning communities is to discover complex relationships among high-dimensional variables. Among many tools, the estimated covariance matrix and its inverse (i.e., the precision matrix) are arguably among the most important statistics in this line of research. Hence, methodological and theoretical analyses of these statistics, such as scalability, consistency, and convergence rate, have been established by many researchers (see the Introduction of Fan et al. (2016) for a comprehensive literature review, and references therein), because of their utility in a broad range of disciplines such as biology, geophysics, economics, public health, and social sciences. Despite decades of advances in the estimation of a covariance/precision matrix under the high-dimensional setting, most approaches to date have been oblivious to handling missing observations. However, widespread applications have emerged in modern sciences where the primary interest is placed on estimating the correlation structure of observations subject to missingness, for example, climate data (Schneider 2001), genomic studies (Cui et al. 2017; Liang et al. 2018), and remote sensing data (Glanz and Carvalho 2018), to name a few. Even so, there has been relatively little development in both methodology and theory that deal with the (inverse) covariance estimation problem in the presence of missing data.

Past works on (inverse) covariance matrix estimation with missing values
Previous research in the field of estimation of an (inverse) covariance matrix with incomplete data, though not extensive to the best of our knowledge, can be classified into two branches: the likelihood-based approach and the plug-in approach.
The first line of work is likelihood-based inference, mostly obtaining the maximum likelihood estimator via an expectation-maximization (EM) algorithm or its variants (Allen and Tibshirani 2010; Huang et al. 2007; Liang et al. 2018; Städler and Bühlmann 2012; Thai et al. 2014). In spite of individual successes in covariance/precision matrix estimation when missing observations are present, the major drawback of this approach is the separate development of estimation algorithms and supporting theories. That is, anyone considering a new proposal under this framework must put substantial effort into implementing the method for practical use and into proving its theoretical properties (e.g. consistency). Furthermore, the Gaussian assumption on observations commonly used in likelihood inference could be restrictive in the high-dimensional setting.
The other scheme of research, studied mostly in recent years, utilizes the idea of a plug-in estimator, based on the fact that many procedures for estimating a covariance/precision matrix rely solely on the sample covariance matrix $S$, not the data itself. Preceding works (Cai and Zhang 2016; Kolar and Xing 2012; Lounici 2014; Pavez and Ortega 2019; Rao et al. 2017; Wang et al. 2014) have considered adjusting for the missing proportion, i.e., the bias that appears in the sample covariance matrix $S_Y$ computed from partial observations (see the definition in (1)). The modified estimator is often referred to as an inverse probability weighting (IPW) estimator and is put into a module (procedure) for (inverse) covariance estimation. For example, Kolar and Xing (2012) plug the IPW estimator into the graphical lasso procedure (Friedman et al. 2008) to estimate a sparse precision matrix, while Cai and Zhang (2016) use banding, tapering, or thresholding techniques to recover a structured covariance matrix in the missing data context. Wang et al. (2014) apply the CLIME method (Cai et al. 2011) to the bias-corrected rank-based correlation matrix to estimate a sparse precision matrix of a non-paranormal distribution. In the low-rank approximation problem, the IPW estimator is plugged into the matrix lasso (Rohde and Tsybakov 2011) by Lounici (2014), which is extended by Rao et al. (2017) to vector autoregressive processes. All of these works are based on one common assumption about missingness: for each sample, each variable is independently subject to missing with equal (uniform) probability. Their theoretical analyses, though recovering the aimed rate $\sqrt{\log p/n}$ ($n$: the sample size, $p$: dimension), are established under such a restrictive independence assumption. In contrast, dependent (and nonuniform) missingness received no attention until very recently, when Park and Lim (2019) made an initial attempt and Pavez and Ortega (2019) made a further investigation.
While these two results are based on the spectral norm and the effective rank of a matrix (see Table 1), we derive the optimal convergence rate of the IPW estimator in terms of the element-wise maximum norm under general missing dependency.

Our contributions
Our main contributions are outlined below.
Derivation of the optimal convergence rate under a dependent missing structure. We develop a non-asymptotic deviation inequality for the IPW estimator in the element-wise maximum norm by extending missing independence to missing dependency (Theorem 1). The theoretical results maintain the conventional convergence rate $\sqrt{\log p/n}$ achieved by earlier works (Bickel and Levina (2008a) and the references in Table 1). Theorem 1 can further be used to estimate the structured precision matrix of a Gaussian graphical model, since it imposes no assumptions on the covariance/precision matrix, the sample size, or the dimension.
Relaxation of implicit assumptions to derive the rates.
In analyzing the concentration of the IPW estimator, estimation of the population mean and of the missing probabilities has been largely unexplored (Lounici 2014; Wang et al. 2014; Park and Lim 2019; Pavez and Ortega 2019), which is undesirable in practice. Filling these gaps, this paper establishes concentration inequalities for the IPW estimator under unknown mean (Theorem 2) and unknown missing probability (Theorem 3).

Outline
The remainder of this paper is organized as follows. At the beginning of Section 2, we formally state the problem setup and introduce the IPW estimator under general missing dependency. Based on it, we present our theoretical results in Section 2 and their variants under relaxed assumptions in Section 3. Section 4 deals with the non-positive semi-definiteness of the IPW estimator and its potential remedies. In Sections 5 and 6, we show our numerical studies on simulated and real data. We conclude this paper with a brief discussion and summary in Section 7.
2 The IPW estimator under general missing dependency and its rate

Let $X = (X_1, \ldots, X_p)^T$ be a $p$-dimensional vector of random variables with mean zero and covariance matrix $\Sigma = E(XX^T)$. We denote missing observations by 0, which has a simple mathematical representation using a missing indicator $\delta_j$ that takes its value either 0 (missing) or 1 (observed): $Y = (Y_1, \ldots, Y_p)^T$, $Y_j = \delta_j X_j$, $j = 1, \ldots, p$.
The multivariate binary vector $\delta = (\delta_1, \ldots, \delta_p)^T$ is assumed to follow some distribution whose marginal distribution of $\delta_j$ is the Bernoulli distribution with success probability $0 \le \pi_j \le 1$.
This formulation is an extension of the independent missing structure used in previous works (Cai et al. 2016; Kolar and Xing 2012; Lounici 2014; Wang et al. 2014), which assume $\delta_k$ is independent of $\delta_\ell$ ($k \neq \ell$). In contrast, this paper allows the $p$ random variables $\{\delta_j, j = 1, \ldots, p\}$ to be dependent and not identically distributed. The probability of observing at multiple positions is henceforth denoted by $P(\delta_i = \delta_j = \delta_k = \cdots = 1) = \pi_{ijk\cdots}$. Let us consider $n$ samples from the population above, where the covariance matrix $\Sigma = (\sigma_{k\ell}, 1 \le k, \ell \le p)$ is to be estimated. Denote the $i$-th sample versions of $X$, $Y$, $\delta_j$ by $X_i$, $Y_i$, $\delta_{ij}$, respectively. Then, the sample covariance matrix from partially observed data is obtained by

$$S_Y = \frac{1}{n} \sum_{i=1}^n Y_i Y_i^T. \qquad (1)$$

It can be easily checked that $S_Y$ is biased for $\Sigma$, since its expectation is $\Sigma_\pi = (\pi_{jk}\sigma_{jk}, 1 \le j, k \le p)$, by assuming independence between $\{X_i\}_{i=1}^n$ and $\{\delta_{ij}\}_{i,j}$. This motivates one to adjust each component of $S_Y$ by a weight and define the IPW estimator

$$\hat{\Sigma}^{IPW} = \left( \frac{1}{n\,\pi_{jk}} \sum_{i=1}^n Y_{ij} Y_{ik}, \ 1 \le j, k \le p \right), \qquad (2)$$

provided that $\pi_{jk} > 0$, $\forall j, k$. Then, $\hat{\Sigma}^{IPW}$ is unbiased for $\Sigma$ under the missing completely at random (MCAR) mechanism (Little and Rubin 1986), that is, $\{\delta_{ij}\}_{j=1}^p$ is independent of $\{X_{ij}\}_{j=1}^p$ for $i = 1, \ldots, n$. For example, when data acquisition is carried out through sensors (e.g. remote sensing data), missing values arise due to faults in the sensors and thus are independent of the values to be measured.
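To fix ideas, the following is a minimal R sketch of the oracle IPW estimator (2). The AR(1)-type covariance, the independent missing design, and all variable names are illustrative assumptions, not the paper's simulation setting.

```r
## Oracle IPW estimator (2) with known observing probabilities (illustrative).
set.seed(1)
n <- 200; p <- 10
Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))          # assumed AR(1)-type covariance
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma) # mean-zero Gaussian samples
pi_marg <- rep(0.8, p)                          # known marginal observing probabilities
delta <- matrix(rbinom(n * p, 1, rep(pi_marg, each = n)), n, p)
Y <- delta * X                                  # missing entries represented by 0

S_Y <- crossprod(Y) / n                         # biased: E[S_Y] = Pi * Sigma (Hadamard)
Pi <- tcrossprod(pi_marg)                       # pi_jk = pi_j * pi_k under independent missing
diag(Pi) <- pi_marg                             # pi_jj = pi_j
Sigma_ipw <- S_Y / Pi                           # element-wise correction: unbiased under MCAR
max(abs(Sigma_ipw - Sigma))                     # element-wise maximum deviation
```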
We note that this adjustment technique is frequently used in the general missing data context and is also known as the propensity score method. Its underlying idea is to construct an unbiased estimating equation by reweighting the contribution of each sample to the equation. The corresponding equation for the covariance estimation problem under the Gaussian setting without missingness is the score equation $\frac{1}{n}\sum_{i=1}^n (X_i X_i^T - \Sigma) = 0$, whose weighted counterpart is

$$\frac{1}{n}\sum_{i=1}^n R_i * \left( Y_i Y_i^T - \Sigma \right) = 0,$$

where $R_i = (\delta_{ij}\delta_{ik}/\pi_{jk}, 1 \le j, k \le p)$ and $*$ is the element-wise product. Solving the equation above with respect to $\Sigma$ yields an empirical version of the IPW estimator that replaces $\pi_{jk}$ in (2) with $n^{-1}\sum_{i=1}^n \delta_{ij}\delta_{ik}$. This estimator has been used and analyzed before in Kolar and Xing (2012) and Cai and Zhang (2016), and will be studied in Section 3.2 of this paper under general missing dependency. Remark that the inverse probability $\pi_{jk}$ in $R_i$ is ignorable and does not play any role in defining the empirical estimator. However, when the probability depends on sample-specific variables ($X_i$ or extrinsic covariates $W_i$), we should use weights given by the conditional probability $P(\delta_{ij} = \delta_{ik} = 1 \mid X_i, W_i)$, which adjusts the selection bias from the partial observations $\{i : \delta_{ij} = \delta_{ik} = 1\}$. For the sake of simplicity, analyses in this paper only concern the identical setting on missing indicators, that is, $\pi_{jk\ell\cdots} = P(\delta_{ij} = \delta_{ik} = \delta_{i\ell} = \cdots = 1)$ for all $i$.

Main results
We state the assumptions used in the following theoretical analyses: (i) sub-Gaussianity for each component of $X_i$, (ii) a general dependency structure for $\delta_i$, and (iii) MCAR for the missing mechanism. We begin with one of the equivalent definitions of a sub-Gaussian variable (Vershynin 2018): uniformly bounded moments.
Assumption 1 (Sub-Gaussianity). $X$ is a sub-Gaussian random variable in $\mathbb{R}$ satisfying $\sup_{r \ge 1} \left( E|X|^r \right)^{1/r} / \sqrt{r} \le K$ for some $K > 0$.
Missingness is assumed to occur with general dependency in the sense of the following.

Assumption 2 (General missing dependency). A missing indicator vector $\delta = (\delta_1, \ldots, \delta_p)^T \in \{0,1\}^p$ follows some multivariate distribution where each marginal distribution is a Bernoulli distribution with probability $\pi_j \in (0, 1]$ of being observed (referred to as a "missing" probability for brevity), i.e., $\delta_j \sim \mathrm{Ber}(\pi_j)$. Further, assume that $\pi_{jk} = P(\delta_j = \delta_k = 1) > 0$ for all $1 \le j, k \le p$.

This distribution is examined by Dai et al. (2013), where they call it the multivariate Bernoulli distribution. If interaction terms are considered up to the second order, this multivariate model coincides with the well-known Ising model. The non-degeneracy condition for the missing probabilities (i.e., $\pi_j > 0$, $\pi_{jk} > 0$) is required since, for example, $\pi_{jk} = 0$ implies no data could be observed for estimating the second moment $\sigma_{jk}$, which is unrealistic for our discussion. Next, we formally state our missing mechanism again.

Assumption 3 (Missing completely at random). An event that an observation is missing is independent of both observed and unobserved random variables.
Under the data structure in this paper, the above mechanism essentially says that the two random vectors $\delta_i$ and $X_i$ are independent. We note that Assumptions 1 and 3 are commonly used in the context of covariance estimation with incomplete data, while Assumption 2 is more general than the independent structure that previous research depends on. Based on these assumptions, Lemma 1 describes the element-wise deviation of the IPW estimator from the true covariance matrix.
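For intuition, the sketch below generates missing indicators that are dependent and not identically distributed, as allowed by Assumption 2, through a Gaussian-copula thresholding construction. This construction is our illustrative choice, not the paper's generation scheme (which is detailed in Section D of Supplementary Material).

```r
## Dependent, non-identically distributed missing indicators (illustrative).
set.seed(2)
n <- 500; p <- 10
alpha <- 0.15                                  # target missing proportion
R <- 0.5^abs(outer(1:p, 1:p, "-"))             # correlation of the latent Gaussian vector
Z <- matrix(rnorm(n * p), n, p) %*% chol(R)    # latent vectors inducing dependence
delta <- (Z > qnorm(alpha)) * 1L               # delta_j = 1 (observed) iff Z_j > threshold
colMeans(delta)                                # marginal P(delta_j = 1) is about 1 - alpha
crossprod(delta) / n                           # empirical pi_jk; off-diagonals != pi_j * pi_k
```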
A proof of Lemma 1 can be found in Section B.1 of Supplementary Material. We provide some remarks regarding this lemma. The concentration inequality covers the existing results as special cases. First, if the data are assumed to be fully observed (i.e., $\pi_{k\ell} = 1$, $\forall k, \ell$), then (6) reduces to a bound of the form

$$P\left( \left| \hat{\sigma}^{IPW}_{k\ell} - \sigma_{k\ell} \right| \ge t \right) \le C_1 \exp\left( -C_2 n t^2 \right), \quad 0 < t \le C_3,$$

where $C_1, C_2, C_3$ are scalar constants. It can be seen that this form is equivalent to Lemma A.3 in Bickel and Levina (2008b) (Gaussian) or Lemma 1 in Ravikumar et al. (2011) (sub-Gaussian), up to constant factors. When an independent and identical structure of missing indicators is assumed in Lemma 1 (i.e., $\delta_k \sim \mathrm{Ber}(\pi)$ independently for all $k$), the reduced probabilistic bound is similar to that of Kolar and Xing (2012) (plugging $t \leftarrow \sqrt{\log(4/\delta)/n}$ into (6)) for the sample size $n$ chosen according to Lemma 1. Rigorously speaking, the proposed IPW estimator in Lemma 1 and that of Kolar and Xing (2012) (see (11)) differ by the inverse weighting factor used to correct for missing observations. However, replacing the missing probabilities with unbiased empirical estimates does not cause a considerable change in our result (see Section 3.2).
Using the lemma above, the rate of convergence of the IPW estimator can be derived in terms of the element-wise maximum norm. Let $\sigma_{\max}$, $\pi_{\min}$, and $v_{\min}$ denote, respectively, the maximum and minimum values of the variance and probability parameters that appear in Lemma 1.

Theorem 1. Assume the conditions of Lemma 1 hold, and further assume the sample size and dimension satisfy the condition (7). Then it holds that

$$\left\| \hat{\Sigma}^{IPW} - \Sigma \right\|_{\max} \le C \sqrt{\frac{\log p}{n}}$$

with probability at least $1 - c/p$, where $c, C > 0$ are scalar constants.
A proof of the theorem can be found in Section B.2 of Supplementary Material. The above result provides a few intuitions. First of all, the convergence rate $\sqrt{\log p/n}$ is attained by the IPW estimator when missing values are present. Also, a small portion of missingness in the data yields a faster convergence rate, since then $\pi_{\min} \approx 1$. Furthermore, if we reparametrize the missing probabilities by $p^{(1,1)}_{k\ell} = P(\delta_k = \delta_\ell = 1)$ and the correlation $\rho_{k\ell}$ between $\delta_k$ and $\delta_\ell$, then the entries in $v_{\min}$ can be rewritten as $p^{(1,1)}_{k\ell}(1 - |\rho_{k\ell}|)$, $k, \ell = 1, \ldots, p$. Thus, if fewer observations are missing (i.e., larger values of $p^{(1,1)}_{k\ell}$), fewer samples are needed to achieve the same convergence rate.
If we assume an independent structure on the missing indicators, we get the following result, which is comparable to those of Kolar and Xing (2012) and Lounici (2014). Let $\hat{\Sigma}^{IPW}_{ind}$ be the IPW estimator (2) with $\pi_{jk} = \pi^2$, $j \neq k$, and $\pi_{jj} = \pi$ for all $j, k$.
Corollary 1 (Identical and independent missing structure). Under the conditions of Lemma 1, further assume $\delta_{ik} \sim \mathrm{Ber}(\pi)$, independently, $k = 1, \ldots, p$. Then, the deviation bound of Theorem 1 holds for $\hat{\Sigma}^{IPW}_{ind}$ when the sample size and dimension satisfy the corresponding condition, where $c, C > 0$ are scalar constants and $\rho_{\max}$ denotes the maximum correlation parameter.

Note that a Taylor expansion of the exponential function shows that the resulting sample-size condition depends on $\pi$ through a factor of order $1/\pi$, up to constants $c_1, c_2 > 0$. Therefore, the sample size (relative to the dimension) required for accurate estimation is less sensitive to the missing probability $\pi$ than in the previous works (Kolar and Xing 2012; Lounici 2014; Wang et al. 2014), whose magnitude is of order $1/\pi^2$ (see Table 1). However, the bound on the IPW estimator in the element-wise maximum norm increases in the order of magnitude $1/\pi^2$, which is larger than the rate $1/\pi$ claimed in the other literature (see Table 1).
Table 1: Summary of literature using the idea of the IPW estimator. "Rate" is the convergence rate (up to a constant factor depending only on distributional parameters) of an estimator ("Est.") measured by a matrix norm ("Norm"). "Size" is a condition on $n$ and $p$ that guarantees the rate holds with high probability. In the first column, labels abbreviate the references (e.g., L2014 = Lounici 2014). The row for this paper reads: estimator (2), norm $\|\cdot\|_{\max}$, rate $\sigma_{\max}\sqrt{\log p/(\pi^2_{\min} n)}$, size condition (7).

Table 1 shows that the rate of convergence $\sqrt{\log p/n}$ has appeared in the previous literature. When dependency among the missing indicators is allowed, the rate achieved by Park and Lim (2019) under the spectral norm is not optimal, though they were the first to tackle the problem. Very recently, Pavez and Ortega (2019) showed an improved rate for the expectation of the estimation error in the spectral norm. In terms of the element-wise maximum norm, to the best of our knowledge, this paper is among the first to obtain the optimal rate.

The meta-theorem in estimation of a precision matrix
The derived concentration inequality is crucial because of its application to precision matrix estimation. The related theory, known as the meta-theorem, first appeared in Liu et al. (2012); it implies that the rate of a precision matrix estimator $\hat{\Omega}$ is determined by the rate, in $\|\cdot\|_{\max}$, of the input matrix (e.g. the IPW estimator) used to estimate $\Omega$. Therefore, when there is no missing observation, the success of the graphical lasso (Ravikumar et al. 2011), the CLIME (Cai et al. 2011), and the graphical Dantzig selector (Yuan 2010) in accurate estimation and graph recovery relies on the fact that the sample covariance matrix $S$ satisfies

$$P\left( \|S - \Sigma\|_{\max} \ge C\sqrt{\frac{\log p}{n}} \right) \le p^{-d} \qquad (8)$$

for some $C, d > 0$. To grasp the underlying mechanism of the meta-theorem, we refer readers to the proof of Corollary 2. Since the claimed rate of convergence in Theorem 1 is the same as that of $S$ in (8), the meta-theorem also guarantees the same optimal rates for the precision matrix estimators with missing observations.
It should be remarked that the rate in Theorem 1 is not derived for a certain class of covariance/precision matrices (e.g. sparse or low-rank) or with a specific restriction on $n$ and $p$ such as an asymptotic ratio between them, i.e., $p/n \to \alpha \in [0, \infty)$. Such flexibility makes it possible to adopt the different conditions (on $\Sigma$, $\Omega$, $n$, or $p$) required by different precision matrix estimation methods (e.g. the graphical lasso). We describe the meta-theorem under the dependent missing structure below, which is an extension of Theorem 4.3 in Liu et al. (2012). A proof of the corollary can be found in Section B.3 of Supplementary Material.
Corollary 2. Let the true covariance matrix Σ satisfy the same assumptions that a precision matrix estimation procedure such as the graphical lasso, the graphical Dantzig selector, and the CLIME requires to guarantee the consistency and the support recovery of a graph.
If we plug the IPW estimator $\hat{\Sigma}^{IPW}$ into one of the aforementioned methods, the end product retains the optimal rate of convergence, and thus has the consistency and support recovery properties even under general missing dependency.

Relaxation of implicit assumptions
Estimation using the IPW estimator with missing data depends on two implicit assumptions other than Assumptions 1, 2, and 3; known mean (or equivalently zero mean) and missing probabilities. In this section, we will relax such conditions and show corresponding concentration results.

The case of unknown mean
When the first moment of the underlying distribution is unknown, the IPW estimator should be modified accordingly, but the same rate $O_p(\sqrt{\log p/n})$ holds. We do not directly estimate the mean parameter $\mu_k$, but rather the product $\mu_k\mu_\ell$, because of the dependent missing structure.
Assume that we observe $\tilde{Y}_{ik} = \delta_{ik}\tilde{X}_{ik}$, where $\tilde{X}_{ik}$ has an unknown mean $\mu_k$. Adopting the previous notations, we define $X_{ik}$ to satisfy $\tilde{X}_{ik} = X_{ik} + \mu_k$. Then, it is easy to show that $E[\tilde{Y}_{ik}\tilde{Y}_{i\ell}] = \pi_{k\ell}(\sigma_{k\ell} + \mu_k\mu_\ell)$ and $E[\tilde{Y}_{ik}\tilde{Y}_{i'\ell}] = \pi_k\pi_\ell\,\mu_k\mu_\ell$ for $i \neq i'$. With a simple calculation, we can define the unbiased covariance matrix estimator $\hat{\Sigma}^{IPW}_\mu = (\hat{\sigma}^{IPW\mu}_{k\ell})$ by

$$\hat{\sigma}^{IPW\mu}_{k\ell} = \frac{1}{n\,\pi_{k\ell}} \sum_{i=1}^n \tilde{Y}_{ik}\tilde{Y}_{i\ell} - \frac{1}{n(n-1)\,\pi_k\pi_\ell} \sum_{i \neq i'} \tilde{Y}_{ik}\tilde{Y}_{i'\ell}. \qquad (20)$$

It is not difficult to find a resemblance of (20) to the sample covariance matrix $S$ when the data are completely observed. The $(k,\ell)$-th component of $S$ is defined by $S_{k\ell} = \frac{1}{n-1}\sum_{i=1}^n (\tilde{X}_{ik} - \bar{X}_k)(\tilde{X}_{i\ell} - \bar{X}_\ell)$ with $\bar{X}_k = \frac{1}{n}\sum_{i=1}^n \tilde{X}_{ik}$, and it can be rearranged as

$$S_{k\ell} = \frac{1}{n} \sum_{i=1}^n \tilde{X}_{ik}\tilde{X}_{i\ell} - \frac{1}{n(n-1)} \sum_{i \neq i'} \tilde{X}_{ik}\tilde{X}_{i'\ell},$$

which is equal to (20) when $\pi_{k\ell} = \pi_k = \pi_\ell = 1$ for all $k, \ell$. The following theorem shows the deviation of this estimator.

Theorem 2. Assume the conditions of Lemma 1 hold except the mean zero condition, and further assume the sample size and dimension satisfy $n \gtrsim \log p$. Then

$$\left\| \hat{\Sigma}^{IPW}_\mu - \Sigma \right\|_{\max} \le C \sqrt{\frac{\log p}{n}}$$

holds with probability at least $1 - c p^{-d}$, where $c > 0$, $d > 0$ are scalar constants and $C > 0$ is a scalar constant depending only on $K$, $\sigma_{\max}$, $\max_k |\mu_k|$, $\pi_{\min}$, and $\min_k \pi_k$.
A proof of Theorem 2 can be found in Section B.4 of Supplementary Material, where we introduce a new version of the Hanson-Wright inequality to handle a sum of cross-products of sub-Gaussian variables.
Remark. In the theorem above, the dependency of the constant $C$ on the parameters can be specified (up to a constant factor) by an expression involving $K$, $\sigma_{\max}$, $\mu_{\max} = \max_k |\mu_k|$, $\pi_{\min}$, and $\pi_{\min,d} = \min_k \pi_k$. Supposedly, the dependency on the mean parameter $\mu_{\max}$ could be removed from $C$ if a missing value were filled by the empirical mean of the available data. However, we leave this as future work.
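As a computational companion, here is an R sketch of the unknown-mean estimator in the form (20); the function name and interface are ours, and the exact weighting (pairwise probabilities in the first term, marginal probabilities in the U-statistic term) follows the display above.

```r
## Unknown-mean IPW estimator in the spirit of (20) (sketch; interface assumed).
ipw_mu <- function(Y, Pi2, pi_marg) {
  # Y:       n x p matrix with missing entries set to 0 (Y_ik = delta_ik * X_ik)
  # Pi2:     p x p matrix of pairwise observing probabilities pi_kl
  # pi_marg: length-p vector of marginal observing probabilities pi_k
  n <- nrow(Y)
  first <- crossprod(Y) / (n * Pi2)          # (1 / (n pi_kl)) sum_i Y_ik Y_il
  s <- colSums(Y)
  cross <- tcrossprod(s) - crossprod(Y)      # sum over i != i' of Y_ik Y_i'l
  second <- cross / (n * (n - 1) * tcrossprod(pi_marg))
  first - second                             # unbiased for Sigma under MCAR
}
```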

The case of unknown missing probability
In real applications, the missing probability π jk is rarely known and needs to be estimated.
Let $\hat{\pi}_{jk}$ be any estimate satisfying $\hat{\pi}_{jk} > 0$, $\forall j, k$, with high probability. Then, the resulting IPW estimator is given by

$$\hat{\Sigma} = \left( \frac{1}{n\,\hat{\pi}_{jk}} \sum_{i=1}^n Y_{ij} Y_{ik}, \ 1 \le j, k \le p \right), \qquad (10)$$

provided that the population mean is known, for the sake of simplicity. When additional information on missingness is not available for estimating $\pi_{jk}$, it is natural to use the empirical proportion $\hat{\pi}^{emp}_{jk} = n^{-1}\sum_{i=1}^n \delta_{ij}\delta_{ik}$ of observed samples, since it is asymptotically unbiased for $\pi_{jk}$ (by the law of large numbers). We denote the empirical version $\hat{\Sigma}^{emp}$ of the IPW estimator by

$$\hat{\Sigma}^{emp} = \left( \frac{1}{\sum_{i=1}^n \delta_{ij}\delta_{ik}} \sum_{i=1}^n Y_{ij} Y_{ik}, \ 1 \le j, k \le p \right), \qquad (11)$$

which corresponds to (10) with $\hat{\pi}^{emp}_{jk}$ in place of $\hat{\pi}_{jk}$. One may notice the equivalence of the empirical estimator (11) to a pairwise-complete analysis. Based on Lemma 7, which describes the concentration of the inverse probability $1/\hat{\pi}^{emp}_{jk}$, we can derive the concentration inequality for $\hat{\Sigma}^{emp}$.

Theorem 3. Assume the conditions of Lemma 1 hold without the missing probabilities being known, and further assume the sample size and dimension satisfy $n/\log p > C'/\pi_{\min}$ for a constant $C' > 0$. Then

$$\left\| \hat{\Sigma}^{emp} - \Sigma \right\|_{\max} \le C \sqrt{\frac{\log p}{n}}$$

holds with probability at least $1 - c p^{-d}$, where $c > 0$, $d > 0$ are scalar constants and $C > 0$ is a scalar constant depending only on $K$, $\sigma_{\max}$, and $\pi_{\min}$.
A proof of the theorem can be found in Section B.6 of Supplementary Material. This result implies that the convergence rate $\sqrt{\log p/n}$ in Theorem 1 is preserved, and that the statements of Theorem 2 continue to hold with $\hat{\Sigma}^{emp}$. It should be pointed out that Kolar and Xing (2012) use the estimator $\hat{\Sigma}^{emp}$, while their theory is limited to the independent missing structure. Thus, Theorem 3 generalizes the theory for the empirical IPW estimator to the dependent structure.
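A minimal sketch of the empirical estimator (11) follows; note that dividing the observed cross-products by the observed co-occurrence counts is exactly a pairwise-complete analysis. The function name and interface are illustrative.

```r
## Empirical IPW estimator (11): pi_jk replaced by observed proportions.
ipw_emp <- function(X_obs, delta) {
  # X_obs: n x p data matrix with missing entries set to 0
  # delta: n x p matrix of missing indicators (1 = observed)
  n <- nrow(X_obs)
  Y <- delta * X_obs
  pi_emp <- crossprod(delta) / n      # hat-pi_jk; assumed positive for all (j, k)
  (crossprod(Y) / n) / pi_emp         # equals crossprod(Y) / crossprod(delta)
}
```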
Remark. In the theorem above, the dependency of the constant $C$ on the parameters can be specified (up to a constant factor) by an expression involving $K$, $\sigma_{\max}$, and $\pi_{\min}$.

4 Non-positive semi-definiteness of the plug-in estimator

Despite its straightforward derivation and applicability to multivariate procedures in the presence of missing data, the IPW estimator has one critical issue from a practical point of view: non-positive semi-definiteness (non-PSDness). Note that this does not affect the derived convergence rates, since the norm is defined element-wise. It is well known that the element-wise product of two matrices may not preserve a nice property of its factors. As addressed in high-dimensional covariance estimation (thresholding, banding, and tapering) (Bickel and Levina 2008a; Rothman et al. 2009), positive semi-definiteness is a typical example of a property broken by the Hadamard product of a positive semi-definite (PSD) matrix and a general matrix. This is also the case for the IPW estimator, which makes it practically difficult to combine with existing algorithms for estimating a precision matrix. For instance, we can plug the IPW estimator into the graphical lasso or the CLIME to estimate a sparse precision matrix $\Omega = (\omega_{k\ell}, 1 \le k, \ell \le p)$ when missing data occur. However, the popular implementations (the glasso or clime packages in R) require the plugged-in estimator to be positive semi-definite. In this section, we examine the graphical lasso algorithm from this point of view and also suggest possible solutions. A similar discussion about the CLIME can be found in Section C.1 of Supplementary Material.

Graphical lasso
In what follows, we distinguish between a plug-in matrix (estimator) Σ plug and an initial matrix (estimator) Σ (0) (or Ω (0) ) that is used to initialize iterative steps.
The graphical lasso proposed by Friedman et al. (2008) aims to maximize the penalized log-likelihood function

$$\hat{\Omega} = \arg\max_{\Omega \succ 0} \ \log\det\Omega - \mathrm{tr}\left( \Sigma^{plug}\Omega \right) - \lambda \|\Omega\|_1 \qquad (12)$$

for a penalty parameter $\lambda > 0$. To solve (12), a coordinate descent algorithm described in Algorithm 1 was proposed by Friedman et al. (2008) and implemented in the R package glasso.

Algorithm 1 The coordinate descent algorithm for the graphical lasso
Input: An initial matrix $\Sigma^{(0)}$ of $\Sigma$, the plug-in matrix $\Sigma^{plug}$
1: for $i = 1, 2, \ldots$ do
2:   for $j = 1, \ldots, p$ do
3:     Solve the least squares regression with the $\ell_1$-penalty

$$\hat{\beta} = \arg\min_{\beta} \ \frac{1}{2}\, \beta^T \Sigma^{(i-1)}_{\backslash j \backslash j} \beta - \beta^T \Sigma^{plug}_j + \lambda \|\beta\|_1, \qquad (13)$$

where $\Sigma^{(i-1)}_{\backslash j \backslash j}$ is obtained by removing the $j$-th row and column of $\Sigma^{(i-1)}$ and $\Sigma^{plug}_j$ is the $j$-th column of $\Sigma^{plug}$ without the $j$-th entry.
One can easily see that the optimization problem (12) is convex regardless of $\Sigma^{plug}$ (since the trace term is linear in $\Omega$), but the PSDness of $\Sigma^{plug}$ matters when the algorithm is initialized.
First, positive definiteness of $\Sigma^{(i-1)}_{\backslash j \backslash j}$ is required in (13); otherwise, the coordinate descent algorithm may fail to converge. For this reason, we propose to use the inputs

$$\Sigma^{(0)} = \mathrm{diag}\left( \Sigma^{plug} \right) + \lambda I, \qquad \Sigma^{plug}. \qquad (14)$$

The above proposal for the initial matrix is made because the diagonals of the solution $\Sigma^{(\infty)}$ should satisfy $\Sigma^{(\infty)}_{ii} = \Sigma^{plug}_{ii} + \lambda$, $\forall i$, by the subgradient condition of (12), as noted in Friedman et al. (2008), and because the diagonals of $\Sigma^{(i)}$ do not change as the iterations proceed. To use these proposed inputs, one should modify the off-the-shelf code (e.g. the glasso function in the glasso package), since it does not currently allow users to control $\Sigma^{(0)}$ and $\Sigma^{plug}$ individually.
Last but not least, it should be remarked that there is an algorithm that solves (12) by approximating the Hessian (the R package QUIC from Hsieh et al. (2014)). This method does not suffer from the PSDness issue discussed here, which is verified through a numerical experiment given in Section C.2 of Supplementary Material. However, resolving the similar issue in other multivariate procedures remains open.

More general solution: matrix approximation
Previously, we presented solutions specific to the precision matrix estimation problem, but we can also circumvent the non-PSD issue for general statistical procedures. The idea is to approximate $\Sigma^{plug}$ by the nearest PSD matrix, which can be achieved by solving

$$\hat{\Sigma}^{psd} = \arg\min_{M \succeq 0} \ d\left( \Sigma^{plug}, M \right), \qquad (15)$$

where $d$ measures the distance between two matrices. For instance, the Frobenius norm (Katayama et al. 2018; Wang et al. 2014) and the element-wise maximum norm (Loh and Tan 2018) have been used previously. Then, the nearest matrix $\hat{\Sigma}^{psd}$ can be put into subsequent multivariate analyses (e.g. the graphical lasso) without modifying the current implementations. However, solving the problem (15) comes at the price of such convenience.
When the Frobenius norm is used, (15) amounts to the well-known projection onto the convex cone of PSD matrices. The solution, denoted by $\hat{\Sigma}^{psd}_F$, can be explicitly expressed as

$$\hat{\Sigma}^{psd}_F = V \max(W, 0)\, V^T,$$

where $\Sigma^{plug}$ has the spectral decomposition $VWV^T$ and the maximum between the two matrices operates element-wise. The computational cost in this case mostly comes from the eigenvalue decomposition, but the convergence rates derived for the IPW estimator in terms of the element-wise maximum norm (e.g. Theorem 1) are not guaranteed for $\hat{\Sigma}^{psd}_F$.
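The Frobenius projection is a short eigenvalue-clipping computation; a minimal R sketch:

```r
## Frobenius-norm projection onto the PSD cone: clip negative eigenvalues.
psd_project <- function(S) {
  eig <- eigen((S + t(S)) / 2, symmetric = TRUE)  # symmetrize for numerical safety
  V <- eig$vectors
  w <- pmax(eig$values, 0)                        # max(W, 0), element-wise
  V %*% (w * t(V))                                # V diag(w) V^T
}
```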
In contrast, when $d$ is the element-wise maximum norm (Loh and Tan 2018), the convergence rate is preserved:

$$\left\| \hat{\Sigma}^{psd}_M - \Sigma \right\|_{\max} \le \left\| \hat{\Sigma}^{psd}_M - \Sigma^{plug} \right\|_{\max} + \left\| \Sigma^{plug} - \Sigma \right\|_{\max} \le 2 \left\| \Sigma^{plug} - \Sigma \right\|_{\max},$$

where the first inequality uses the triangle inequality and the second follows from the definition of $\hat{\Sigma}^{psd}_M$. The algorithm to solve (15) with the element-wise maximum norm was first proposed by Xu and Shao (2012) and has been used in the robust covariance estimation context (Han et al. 2014; Loh and Tan 2018). We note, however, from experience that the approximation based on $\|\cdot\|_{\max}$ is computationally heavy, so that it often dominates the computation time of the multivariate procedures (e.g. the graphical lasso and the CLIME).

Numerical study
In this section, we perform a number of simulations for estimating a covariance/precision matrix with partially observed data. First, in Section 5.2, we experimentally check the convergence rate of the IPW estimator given in our theorems. In Section 5.3, we conduct a comparison study between several imputation methods and the IPW method. Performance of the estimates is also measured and compared according to simulation parameters, and the related results can be found in Section D.2 of Supplementary Material.

Setting

Data generation

We generate Gaussian random vectors $X_i$, $i = 1, \ldots, n$, in $\mathbb{R}^p$ with mean vector 0 and precision matrix $\Omega$ under different pairs of $n$ and $p$; details of the graph structures and the missing-value generation are given in Section D.1 of Supplementary Material.

Estimators
We compare two types of plug-in estimators: $\hat{\Sigma}^{IPW}$, an oracle-type estimator labeled "orc", and $\hat{\Sigma}^{emp}$, an empirical-type estimator labeled "emp". A closed form of the weight $\pi_{k\ell}$ is available under each missing structure, so the oracle IPW estimator is explicitly computable. It is noteworthy that the estimator $\hat{\Sigma}^{emp}$ is used in Kolar and Xing (2012), but their theoretical analysis is limited to the independent missing structure.
We exploit the QUIC algorithm proposed by Hsieh et al. (2014) to solve the graphical lasso (12). The grid of tuning parameters $\lambda \in \Lambda$ is defined adaptively to the plug-in matrix $\Sigma^{plug}$.
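A sketch of this plug-in pipeline is given below; we call the glasso package for brevity (the paper's experiments use QUIC), and the adaptive grid construction and the toy plug-in matrix are illustrative assumptions.

```r
## Plugging a (PSD-corrected) estimator into the graphical lasso over a grid.
library(glasso)
p <- 10
Sig_plug <- 0.4^abs(outer(1:p, 1:p, "-"))             # stand-in for a PSD plug-in matrix
lam_max <- max(abs(Sig_plug - diag(diag(Sig_plug))))  # smallest lambda giving an empty graph
Lambda <- exp(seq(log(lam_max), log(0.01 * lam_max), length.out = 20))
fits <- lapply(Lambda, function(l) glasso(Sig_plug, rho = l))
Omega_hat <- fits[[10]]$wi                            # estimated precision at one lambda
```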

The rate of convergence
We verify our theoretical results (Theorems 1 and 3) by computing the element-wise maximum deviation $\|\hat{\Sigma}^{plug} - \Sigma\|_{\max}$. We fix $p = 100$ and vary the sample size over $20 \le n \le 10000$.
We repeat each scenario 20 times and plot the log-transformed empirical distance against $\log(n/p)$. Different plug-in estimators ("orc", "emp") and precision matrices (chain, star, random) are under consideration. Figure 1 shows that each graph connecting the averaged distances nearly forms a straight line. The results in the column "orc" confirm the rate of convergence in Theorem 1, while those in the column "emp" confirm that in Theorem 3.

Figure 1: Log-transformed losses of the plug-in estimators ("orc" $= \hat{\Sigma}^{IPW}$, "emp" $= \hat{\Sigma}^{emp}$) against $\log(n/p)$. Loss is computed by the element-wise maximum norm between the plug-in matrix and the true covariance matrix. The dependent missing structure and $p = 100$ are assumed. Each dot (or mark) is an average loss from 20 repetitions.
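This experiment can be mimicked in a few lines. The sketch below uses the identity covariance and an independent missing design purely for illustration; the plotted points should fall roughly on a line of slope $-1/2$.

```r
## Empirical check of the sqrt(log p / n) rate in the maximum norm.
set.seed(3)
p <- 100; pi0 <- 0.8
ns <- c(100, 400, 1600, 6400)
loss <- sapply(ns, function(n) {
  X <- matrix(rnorm(n * p), n, p)               # true Sigma = I for simplicity
  delta <- matrix(rbinom(n * p, 1, pi0), n, p)
  Y <- delta * X
  Sig_emp <- crossprod(Y) / crossprod(delta)    # empirical IPW estimator
  max(abs(Sig_emp - diag(p)))
})
plot(log(ns / p), log(loss), type = "b",
     xlab = "log(n/p)", ylab = "log max-norm loss")
```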

Comparison with imputation methods
In the missing data context, unobserved data are often substituted by some function of the observed values; the most intuitive approach is imputation. Once the pseudo-complete data are produced, a usual statistical analysis is performed. In this experiment, we compare several (single) imputation approaches with the IPW estimator for precision matrix estimation. The imputation methods we use are "median" (a median of the available data for each variable), "pmm" (predictive mean matching from the R package Hmisc (Harrell Jr et al. 2019)), "knn" (an average of the k-nearest neighbors from the R package impute (Hastie et al. 2018)), and "cart", "rf", and "norm" (regression-based methods from the R package mice (van Buuren and Groothuis-Oudshoorn 2011)). We use the default parameter setting for each R function. More details of each method can be found in the respective references.
By fixing $n = 100$ and $r = 1, 2$, we generate 10 random data sets based on different precision matrices. Missing observations are produced under the independent structure. Once the missing observations are filled in by a single imputation method, we compute the sample covariance matrix from the imputed complete data and carry out precision matrix estimation using the QUIC algorithm; a skeleton of this pipeline is sketched below. We compare the competing methods based on support recovery of the estimated precision matrix. Figure 2 shows the pAUC values, where the IPW method using the empirical estimator ("emp") achieves the largest pAUC among the competitors. This is more distinct when the dimension is larger than the sample size (i.e., $r = 2$). The results demonstrate that the IPW method is not only theoretically solid but also practically useful. Admittedly, we have not examined more diverse and complex imputation methods that might produce better performance, which calls for extensive numerical studies in the future.
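For concreteness, here is a skeleton of the comparison pipeline with median imputation (the simplest competitor) written out in base R; the package-based methods ("pmm", "knn", "cart", "rf", "norm") follow the same pattern with the imputation step swapped out.

```r
## Single imputation baseline: fill NAs with column medians, then reuse the
## complete-data machinery (sample covariance + graphical lasso).
median_impute <- function(X_na) {
  # X_na: n x p matrix with NA for missing entries
  for (j in seq_len(ncol(X_na))) {
    mj <- median(X_na[, j], na.rm = TRUE)
    X_na[is.na(X_na[, j]), j] <- mj
  }
  X_na
}
## S_imp <- cov(median_impute(X_na))   # plug into the graphical lasso as before
```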

Application to real data
We examine the estimation performance of the IPW estimator through a real data application. We use the riboflavin data available from the R package hdi, where 4088 gene expressions are observed across 71 samples. Since the ground-truth precision matrix is not known, we construct it by solving the graphical lasso (12) at a fixed $\lambda$ with the complete data. We impose missing values in a manner similar to that described in Section 5. Throughout this analysis, it is confirmed again that having more missing values yields worse estimation. Also, we observe that a denser model, whose precision matrix has more non-zero elements, makes it more difficult to achieve satisfactory accuracy in estimation and graph recovery.
More details and results can be found in Section E of Supplementary Material.

Discussion
This paper considers a theoretical treatment of the IPW estimator with missing observations. Contrary to the previous literature, this is achieved under dependency among the missingness, meaning that the missing indicators are not necessarily independent across variables. The rate of convergence of the IPW estimator is derived in the element-wise maximum norm and is (asymptotically) of the same order as the rates claimed in past works. Our analysis can be applied to the estimation of a sparse precision matrix. Due to the meta-theorem, the favorable properties (consistency, support recovery) of the final estimator are preserved in the missing data context.
The plug-in estimators (e.g. the sample covariance matrix and the IPW estimator) and their concentration are often not of primary interest; rather, the ultimate goal lies in applying them to downstream procedures (e.g. Hotelling's $T^2$, portfolio optimization, etc.). In portfolio optimization, Fan et al. (2012) show that the risk is bounded via the error of the plug-in estimator:

$$\left| w^T \hat{\Sigma}^{plug} w - w^T \Sigma w \right| \le \left\| \hat{\Sigma}^{plug} - \Sigma \right\|_{\max} \|w\|_1^2.$$

Here, $w$ and $\Sigma$ are the true (or optimal) parameters. However, it is still elusive how the rate $\|\hat{w} - w\|_V$ of the optimal solution $\hat{w}$ that minimizes the risk $t \mapsto t^T \hat{\Sigma}^{plug} t$ is linked to the rate $\|\hat{\Sigma}^{plug} - \Sigma\|_M$ of the plug-in estimator, where $\|\cdot\|_V$ and $\|\cdot\|_M$ are some norms for a vector and a matrix, respectively. This line of research could be interesting and is much needed for future work, not to mention its extension to the missing data context.
The underlying assumptions on the missing mechanism (i.e., MCAR) and the missing structure (i.e., identical dependency across samples) are essentially not verifiable, but it is natural to consider extending our results beyond such patterns. Recall from the text below equation (4) that the IPW estimator under a general missing mechanism is defined by

$$\left( \frac{1}{n} \sum_{i=1}^n \frac{\delta_{ij}\delta_{ik}}{P(\delta_{ij} = \delta_{ik} = 1 \mid X_i, W_i)}\, X_{ij} X_{ik}, \ 1 \le j, k \le p \right).$$

It is easy to show that the above estimator is still unbiased for $\Sigma$. Now, let us consider the missing at random (MAR) assumption (additionally assuming $W_i = \emptyset$), i.e.,

Assumption 4 (Missing at random). An event that an observation is missing is independent of unobserved random variables given the observed variables.
This assumption essentially requires independence between $\delta_i$ and $X_i$ given the observed data $X_{i,obs}$. It is not straightforward to extend the analysis in this paper to this case. For example, one should work on the calculation

$$E\left[ \frac{\delta_{ik}\delta_{i\ell}}{\pi_{i,k\ell}}\, X_{ik} X_{i\ell} \right], \qquad \pi_{i,k\ell} = P(\delta_{ik} = \delta_{i\ell} = 1 \mid X_{i,obs}). \qquad (16)$$

Unfortunately, the factor $1/\pi_{i,k\ell}$ cannot be moved out of the expectation, since $\pi_{i,k\ell}$ is a function of $X_{i,obs}$, which makes it difficult to express (16) explicitly in terms of $\sigma_{k\ell}$ and other parameters. This was possible under MCAR because of the independence of $\pi_{i,k\ell}$ from $X_i$. It would be interesting to identify suitable assumptions that are weaker than MCAR but still guarantee the missing probability to be free of $X_{i,obs}$.

Supplementary Material for "Estimating High-dimensional Covariance and Precision Matrices under General Missing Dependence"
A Auxiliary lemmas

The first supporting lemma gives a tail bound for a variable whose cumulant generating function is dominated by a quadratic function.

Lemma 2 ((1991)). Let a random variable $\xi_j$ with $E\xi_j = 0$, $\mathrm{Var}(\xi_j) = \sigma_j^2$ satisfy the following: there exist positive constants $A, C, c_1, c_2, \ldots$ such that the moment condition (17) holds. Then, a Bernstein-type tail bound holds for $\bar{\xi} = \sum_{j=1}^n \xi_j / \sum_{j=1}^n \sigma_j^2$. Furthermore, if the $\xi_j$'s are identically distributed and satisfy the conditions above, then the variance term $\sigma_j^2$ does not appear in the concentration inequality.

The following auxiliary results for a sub-Gaussian variable $X$ facilitate checking the condition (17).
Lemma 3. Assume that $X$ is a random variable satisfying Assumption 1 for some $K > 0$.
Lemma 4. Assume that $X$ is a random variable satisfying Assumption 1 for some $K > 0$.
Proof. The proofs of (a) and (b) follow directly from applications of Lemmas 2 and 3.

B.1 Proof of Lemma 1
Proof of Lemma 1. Assume $k$ and $\ell$ are distinct. We start by decoupling the product of the two sub-Gaussian variables $Y_{ik}Y_{i\ell}/\pi_{k\ell}$ using the identity $xy = \{(x+y)^2 - (x-y)^2\}/4$. To apply Lemma 4, we first show that $Y^*_{ik} + Y^*_{i\ell}$ is a sub-Gaussian variable satisfying the conditions of the lemma.
Fact. For $i = 1, \ldots, n$ and $1 \le k \neq \ell \le p$, the moments of $Y^*_{ik} + Y^*_{i\ell}$ are uniformly bounded in the sense of (17).

Proof. To obtain a uniform bound on the higher moments, we observe a chain of inequalities in which the first inequality holds due to the convexity of $x \mapsto |x|^r$ ($r \ge 1$) and the third inequality uses the moment condition of the sub-Gaussian variable $X_{ik}/\sqrt{\sigma_{kk}}$. We note that $\left( (\pi_k + \pi_\ell)/2 \right)^{1/r} \le 1$ for all $r \ge 1$, since $0 \le (\pi_k + \pi_\ell)/2 \le 1$, which concludes the proof.
By applying Lemma 4 (b), we have a deviation bound for the decoupled sums for some numerical constants $c, C > 0$; substituting it into the above inequality yields, for some numerical constant $\tilde{c} > 0$, bounds on the two quadratic terms. Combining these results with (18) gives a bound whose denominator involves $\pi_k + \pi_\ell - 2\pi_{k\ell}|\rho_{k\ell}|$, which completes the proof for the case $k \neq \ell$.
The concentration inequality for the diagonal entries (i.e., $k = \ell$) of the IPW estimator is derived similarly: one can easily check the corresponding moment conditions and then, due to Lemma 4 (b), obtain the analogous bound. This concludes the whole proof.

B.2 Proof of Theorem 1
Proof. From Lemma 1, the element-wise bound holds for $1 \le k, \ell \le p$, where $\pi_{\min,d} = \min_k \pi_k$. Then, by a union bound argument, we obtain a uniform deviation bound for $t^2/c$ below a threshold determined by $K$ and $(v_{\min}/4) \wedge \pi_{\min,d}$. Note that $v_{\min}/4 \le \pi_{\min,d}$.
Then, by plugging in $t \leftarrow \alpha\sqrt{\log p/n}$ ($\alpha > 0$), we obtain the convergence rate for the maximum norm of the IPW estimator, and we can choose $\alpha^2 = 3$. This concludes the proof.

B.3 Proof of Corollary 2
Proof. We summarize the theorems/lemmas from the original works that bridge the rate of the plug-in estimator with those of the final precision matrix estimator. If $\delta_{n,p} = \sqrt{\log p/n}$ in each theorem, then the rates of the precision matrix are optimal and guarantee both estimation consistency in different norms and support recovery (via the $\|\cdot\|_{\max}$ bound). As usual, $\Sigma^{plug}$ denotes the plug-in estimator.
We note that $\delta_{n,p}$ corresponds to $\bar{\delta}_f(n, p^\tau)$ in the original reference.
where Ω is the CLIME estimator that solves (23) and C > 0 is a numerical constant.

Graphical Dantzig selector
The graphical Dantzig selector solves $p$ optimization problems (Yuan 2010), one for each $j = 1, \ldots, p$. Let $d$ be the maximum degree of the graph, or equivalently $d = \max_i \sum_j I(|\omega_{ij}| \neq 0)$.
Note that the $\ell_1$-norm of a matrix bounds the spectral norm, so we also have $\|\hat{\Omega} - \Omega\|_2 \le C d \delta_{n,p}$.

B.4 Proof of Theorem 2
Proof. Recall that the proposed estimator for an unknown mean has the form (20); we decompose it into terms $A_1, \ldots, A_4$ and $B_1, \ldots, B_5$. A deviation inequality for $A_1$ comes from Lemma 1. On the other hand, since $A_2$, $A_3$, and $A_4$ are independent sums of sub-Gaussian variables, the related concentration inequalities follow from Lemma 4 (a) and Lemma 8. The second term in (20) can be decomposed similarly, and the concentration of each term except $B_1$ is easily derived using Lemma 4 (a) and Lemma 8. To analyze the concentration of $B_1$, which is a dependent sum of cross-products of sub-Gaussian variables, we need a new version of the Hanson-Wright inequality. Lemma 5 is more general than the version given in Rudelson and Vershynin (2013) in the sense that the two random variables $X_i$, $Y_i$ are not necessarily equal. The generalization is possible because of the decoupling technique, by which we can separately handle $\{X_i : i \in \Lambda\}$ and $\{Y_i : i \notin \Lambda\}$ for some $\Lambda \subset \{1, \ldots, n\}$.
Details of the proof of Lemma 5 can be found in Section B.5.
Lemma 5. Let $(X, Y)$ be a pair of (possibly correlated) random variables satisfying $EX = EY = 0$ and the moment conditions of Assumption 1; then a Hanson-Wright-type bound holds for some numerical constant $c > 0$. Now, we obtain the concentration bound for $B_1$ using the lemma above for $t \ge \sigma_{kk}^{1/2}\sigma_{\ell\ell}^{1/2} K^2 \pi_k \pi_\ell n$, since the matrix in $\mathbb{R}^{n\times n}$ with off-diagonal entries 1 and diagonal entries 0 has both Frobenius and spectral norms of order $n$. Replacing $t$ with $t\sigma_{kk}^{1/2}\sigma_{\ell\ell}^{1/2}$ and assuming $n > \log p$, the probability above is bounded by $2p^{-t}$.
Combining all the results for $A_1, \ldots, A_4$ and $B_1, \ldots, B_5$, we can derive the concentration inequality for each component of $\hat{\Sigma}^{IPW}_\mu$, which completes the proof.

B.5 Proof of Lemma 5
Proof. Without loss of generality, we assume $K_X = K_Y = 1$. Let $\{\eta_i\}_{i=1}^n$ be i.i.d. Bernoulli variables with success probability 1/2. Then, by observing $E\eta_i(1-\eta_j) = I(i \neq j)/4$, it can be seen that the quadratic form can be rewritten as an expectation taken over $\{\eta_i\}$. Let $\Lambda_\eta = \{i : \eta_i = 1\}$ be the index set of successes. Conditionally on $\{\eta_i\}$, the decoupled sum $S_\eta$ follows a sub-Gaussian distribution.
All of the following statements are conditional on $\{\eta_i\}$ unless specified otherwise.
Then, the previous results yield a bound in which the equality holds since $\exp(4\lambda S_\eta)$ does not depend on $\{X_j\}_{j \in \Lambda^c_\eta}$ and the inequality follows from the sub-Gaussianity of $S_\eta$. Taking expectations with respect to $\{X_i : i \in \Lambda_\eta\}$ on both sides, we obtain the next result, where the equality holds by independence among the $n$ samples. Bounding the left-hand side, we begin to display the conditional dependency on $\{\eta_i\}$. Following steps 3 and 4 in Rudelson and Vershynin (2013), we can achieve a uniform bound on $T_\eta$ independent of $\{\eta_i\}$ for some positive constants $c$ and $C$. Following step 5 in Rudelson and Vershynin (2013), we obtain the concentration of $S$ given in the lemma below. Let $\|X\|_{\psi_2}$ be the $\psi_2$-norm of $X$.

Lemma. Let $(X, Y)$ be a pair of (possibly correlated) random variables satisfying $EX = EY = 0$ and $\|X\|_{\psi_2}, \|Y\|_{\psi_2} < \infty$ (21). Assume $n$ samples $\{(X_i, Y_i)\}_{i=1}^n$ are identically and independently observed. For a matrix $A = (a_{ij}, 1 \le i, j \le n)$ with zero diagonals, a Hanson-Wright-type concentration inequality holds for some numerical constant $c > 0$.
Note that the finite $\psi_2$-norm in (21) characterizes a sub-Gaussian random variable and can be replaced by the uniformly bounded moments in (17), since $\sup_{r \ge 1} (E|X|^r)^{1/r}/\sqrt{r} \le K$ implies $\|X\|_{\psi_2} \le 2eK$. In other words, provided $X_i$ and $Y_j$ satisfy the moment condition with constants $K_X$ and $K_Y$, respectively, the conclusion of the lemma above still holds (with a different $c$). This completes the proof.

B.6 Proof of Theorem 3
Theorem 3 is not difficult to show once Lemmas 1, 6, and 7 are combined. We now state and prove the two additional lemmas. First, the following lemma shows how the concentration of (10) is related to that of $\hat{\pi}_{jk}$.
Lemma 6. Assume deviation bounds with positive constants $B_1$, $B_2$, and $B_3$ hold. Then, a corresponding deviation bound holds for the IPW estimator.

Proof. By the triangle inequality, we observe the decomposition. Finally, we note that the last equality holds for a symmetric positive definite matrix.
Lemma 7. Assume the sample size and dimension satisfy $n/\log p > C/\pi_{\min}$ for some numerical constant $C > 0$. Then, the stated deviation of $\max_{k,\ell} |1/\hat{\pi}^{emp}_{k\ell} - 1/\pi_{k\ell}|$ occurs with probability at most $2/p$.

Proof. First, we observe that on the event $G = G_{n,p} = \{\hat{\pi}^{emp}_{k\ell} > 0, \forall k, \ell\}$, a deviation bound holds for every $t > 0$. We introduce the deviation inequality for a sum of Bernoulli variables.
Lemma 8 (Boucheron et al. (2016), p. 48). Let $\{\delta_i\}_{i=1}^n$ be independent Bernoulli variables with probability $\pi$ of being 1. Then, there exists a numerical constant $C > 0$ such that the stated tail bound holds. If $t < \pi^{-1}_{k\ell}$, Lemma 8 yields the first bound, and similarly the second. Defining $\pi_{\min} = \min_{k,\ell} \pi_{k\ell}$, the union bound argument gives the overall bound, where the last inequality relies on the monotonicity of $x \in (0,1) \mapsto x^3(1+tx)^2$ for $t > 0$.
Combining these results, we can conclude: setting $t \leftarrow 4\log p/(C\pi^2_{\min} n)$ and assuming $n/\log p > 12/(C\pi_{\min})$, we derive the claimed bound, which completes the proof.

C Additional analyses and details in Section 4

C.1 Non-PSD input for CLIME
In what follows, we distinguish between a plug-in matrix (estimator) Σ plug and an initial matrix (estimator) Σ (0) (or Ω (0) ) that is used to initialize iterative steps.
We analyze the CLIME method proposed by Cai et al. (2011), which solves

$$\hat{\Omega} = \arg\min_{\Omega} \|\Omega\|_1 \quad \text{subject to} \quad \left\| \Sigma^{plug}\Omega - I \right\|_{\max} \le \lambda. \qquad (23)$$

Cai et al. (2011) divide (23) into $p$ column-wise problems and relax each problem into a linear program, which leads to Algorithm 2.
It is easily seen that the optimization problem (23) is well-defined regardless of the PSDness of $\Sigma^{plug}$; nevertheless, the existing implementation can behave irregularly when the plug-in matrix is not PSD, and the resulting estimator is not guaranteed to be PSD. We conjecture this irregularity is due to the initialization. Thus, our proposal for the inputs is to control the initial matrix $\Omega^{(0)}$ separately from the plug-in matrix $\Sigma^{plug}$. Similarly to the graphical lasso, one should modify the implemented R functions (e.g. clime in the clime package) to handle the two inputs separately, since it is not currently possible to control the two input matrices $\Omega^{(0)}$ and $\Sigma^{plug}$ independently.

C.2 Failure of Algorithm 1 under missing data
It has been mentioned that the undesirable property of the IPW estimator, non-PSDness, may hamper downstream multivariate procedures. We give one example where it causes a problem: the graphical lasso. Recall that the existing algorithms available in the glasso and huge packages are not suitable, especially when the tuning parameter is fixed at a small $\lambda$, since they use the non-PSD initial matrix $\Sigma^{(0)} = \hat{\Sigma}^{IPW} + \lambda I$. As a consequence, in Figure 3, where the data are generated similarly to the simulation study (see Section D), the ROC curve from the coordinate descent algorithm degrades relative to that of the QUIC algorithm.

Figure 3: Comparison of ROC curves between two different algorithms for solving the graphical lasso using incomplete data. $n = 100$, $r = 1$, and the dependent missing structure are assumed. The oracle IPW estimator is plugged in. 10 random data sets are used.

D Additional analyses and details in Section 5

D.1 Details of the simulation setting

Recall that we generate Gaussian random vectors $X_i$, $i = 1, \ldots, n$, in $\mathbb{R}^p$ with mean vector 0 and precision matrix $\Omega = (\omega_{ij}, 1 \le i, j \le p)$ under different pairs of $n = 50, 100, 200$ and $p$ satisfying $r (= p/n) = 0.2, 1, 2$. The graph structure induced by a precision matrix and the missing structure are described in detail below.
Graph structure (precision matrix)

1. Chain-structured graph: The edge set $E$ of a graph is defined by the structure of a chain graph; $\omega_{ij} = 0.1$ if $(i,j) \in E$, and 0 otherwise; $\omega_{ii} = 1$.

2. Star-structured graph: The edge set $E$ of a graph is defined by the structure of a star-shaped graph; $\omega_{ij} = 0.9/\sqrt{p-1}$ if $(i,j) \in E$, and 0 otherwise; $\omega_{ii} = 1$.

Every $\Omega$ is rescaled so that the largest eigenvalue of $\Omega$ is set to 1; a code sketch of these constructions follows below.
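For reproducibility, a sketch of the two structured precision matrices is given below. The star edge weight is taken as $0.9/\sqrt{p-1}$, which keeps $\Omega$ positive definite before rescaling; treat the function name and interface as illustrative.

```r
## Chain- and star-structured precision matrices, rescaled to top eigenvalue 1.
make_omega <- function(p, type = c("chain", "star")) {
  type <- match.arg(type)
  Omega <- diag(p)
  if (type == "chain") {
    for (i in 1:(p - 1)) Omega[i, i + 1] <- Omega[i + 1, i] <- 0.1
  } else {
    w <- 0.9 / sqrt(p - 1)               # assumed star edge weight (see lead-in)
    Omega[1, 2:p] <- Omega[2:p, 1] <- w  # node 1 is the hub of the star
  }
  Omega / max(eigen(Omega, symmetric = TRUE, only.values = TRUE)$values)
}
```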
For the missing proportion, we choose different values $\alpha = 0, 0.15, 0.3$. The case $\alpha = 0$, where all samples are completely observed, is included as a reference.

Estimators
Based on our experience, the graphical lasso is preferred over the CLIME for estimating sparse precision matrices, since the available R implementations of the latter are either too conservative in finding true edges (R package fastclime) or too slow (R package clime). We exploit the QUIC algorithm proposed by Hsieh et al. (2014) to solve the graphical lasso (12). The grid of tuning parameters $\lambda \in \Lambda$ is defined adaptively to the plug-in matrix $\Sigma^{plug}$.

D.2 Additional results of the simulation study
We investigate the finite-sample performance by changing various parameters (e.g. $r = p/n$, missing proportion) in the simulation study. To evaluate estimation accuracy and support recovery of the Gaussian graphical model, different matrix norms and the area under the receiver operating characteristic (ROC) curve are used.

Estimation accuracy
We numerically examine the behavior of the inverse covariance matrix estimated from the IPW estimator as the simulation parameters vary. To this end, the Frobenius and spectral norms are used to measure the accuracy of an estimator. We fix the $0.7T$-th tuning parameter in $\Lambda$ (in increasing order, with $T = |\Lambda|$) to get a single sparse precision matrix, because selection of the tuning parameter is not of primary interest here and our findings stated below do not change much with the tuning parameter. In the corresponding figures, the errors $\|\hat{\Sigma}^{plug} - \Omega^{-1}\|$ (left) and $\|\hat{\Omega} - \Omega\|$ (right) are measured. It is uniformly observed that a larger precision matrix is more difficult to estimate, but the degree of difficulty depends on the shape of the true graph (or precision matrix). Figures 6 and 7 compare the performance of the two plug-in matrices. When complete data are available, no adjustment for missingness is needed, so there is no difference in errors (see the leftmost red boxplots in each sub-figure). If missing values occur in the data, the precision matrix estimator based on the oracle IPW estimator is closer to the true matrix (either $\Sigma$ or $\Omega$), and the gap becomes more evident as the missing proportion $\alpha$ increases. Figures 8 and 9 imply that dependency in the missingness degrades estimation accuracy when the missing proportion is set at the same level in both missing structures. We do not show the results for complete data (i.e., $\alpha = 0$), since the two missing structures coincide by definition.

Support recovery
We investigate the support recovery of the Gaussian graphical model using the ROC curve. It is observed that the ROC curves end at different false positive rate (FPR) values, especially when different missing proportions are assumed (see Figure 10).

Figure 10: The ROC curves according to different missing proportions, with 10 repetitions. $n = 100$, $r = 1$, a random graph structure, and the dependent missing structure are assumed. The oracle IPW estimator is plugged in.

Thus, it is not fair to directly compare the area under the curve (AUC), because the maximum attainable AUC depends on the endpoint (largest value) of the FPR and cannot reach 1 if the endpoint is less than 1. Instead, we use the rescaled partial AUC (pAUC) proposed by Walter (2005), which rescales the AUC by the largest FPR in the ROC curve (see Walter (2005) for details).

Figure 11: Boxplots of the pAUC for support recovery with different plug-in estimators ($n = 100$, $r = 2$, the dependent missing structure assumed), and (bottom right) with different missing structures ("depen" and "indep"; $n = 100$, $r = 1$, the oracle IPW estimator plugged in).

Figure 11 shows the results of the pAUC as the simulation parameters vary. Considering that a large value of the pAUC implies better performance in support recovery, the given results admit interpretations similar to those above.

E Details in Section 6
We use the riboflavin data available from the R package hdi, where 4088 gene expressions are observed across 71 samples. Each variable is log-transformed and then centered. We select the 1000 genes with the largest empirical variances for the sake of simplicity. As in the previous analyses, the QUIC algorithm is used to solve the graphical lasso.
With the complete data set, we solve the graphical lasso (12) at a fixed $\lambda$ and set the obtained estimate $\hat{\Omega}_\lambda$ as the ground-truth precision matrix. We generate three different models with $\lambda_1 < \lambda_2 < \lambda_3$. Note that an estimated precision matrix with a smaller tuning parameter (e.g. $\lambda_1$) gives a denser true model, i.e., a precision matrix with more non-zero elements. We also consider another ground-truth precision matrix with an optimal tuning parameter chosen by cross-validation, following Kolar and Xing (2012).
Let an index set of the $n$ samples be split into $K$ folds $\{G_k\}_{k=1}^K$ of equal size. Without the samples in the $k$-th fold, we estimate the precision matrix at a fixed $\lambda$, denoted by $\hat{\Omega}^{(k)}_\lambda$. We finally choose $\lambda_{CV}$ from a grid of $\lambda$'s that minimizes the cross-validated (negative) log-likelihood

$$CV(\lambda) = \sum_{k=1}^K \sum_{i \in G_k} \left[ -\log\det \hat{\Omega}^{(k)}_\lambda + X_i^T \hat{\Omega}^{(k)}_\lambda X_i \right].$$

We let $\hat{\Omega}_{CV} = \hat{\Omega}_{\lambda_{CV}}$ be the precision matrix at this level of optimal sparsity $\lambda_{CV}$. It turns out that $\lambda_{CV}$ is close to, but slightly smaller than, the smallest tuning parameter $\lambda_1$. The four precision matrix models have 36,170 ($\lambda_1$), 5,860 ($\lambda_2$), 14 ($\lambda_3$), and 35,630 ($\lambda_{CV}$) non-zero elements (excluding diagonals), respectively.
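A sketch of this K-fold selection on complete data is given below; glasso stands in for QUIC, and the held-out criterion implements the Gaussian negative log-likelihood $-\log\det\Omega + \mathrm{tr}(S_k\Omega)$ from the display above (note that $\sum_{i\in G_k} X_i^T\Omega X_i \propto \mathrm{tr}(S_k\Omega)$ for the fold covariance $S_k$).

```r
## K-fold cross-validation of lambda for the graphical lasso (complete data).
library(glasso)
cv_glasso <- function(X, Lambda, K = 5) {
  n <- nrow(X)
  folds <- sample(rep(1:K, length.out = n))
  cv <- sapply(Lambda, function(l) {
    sum(sapply(1:K, function(k) {
      S_train <- cov(X[folds != k, , drop = FALSE])
      Omega <- glasso(S_train, rho = l)$wi
      S_test <- cov(X[folds == k, , drop = FALSE])
      as.numeric(-determinant(Omega, logarithm = TRUE)$modulus + sum(S_test * Omega))
    }))
  })
  Lambda[which.min(cv)]                # lambda_CV minimizing held-out loss
}
```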
We impose missing values on the complete data matrix in a manner similar to that described in Section D. For this analysis, we assume the independent missing structure; the results do not change significantly under the dependent structure. To estimate $\hat{\Omega}_{\lambda_i}$, we solve the graphical lasso (12) on the incomplete data with the tuning parameter fixed at $\lambda_i$.
Since the optimal tuning parameter can vary with the available data due to missingness, the cross-validation procedure is performed separately, instead of reusing the $\lambda_{CV}$ employed to define $\hat{\Omega}_{CV}$. Let $\hat{\Omega}^{(k)}_\lambda$ be the solution at tuning parameter $\lambda$ computed without the $k$-th fold of the incomplete data. Then, the (cross-validated) log-likelihood $CV_{mis}$ is computed over the observed data only, using the covariance submatrix $((\hat{\Omega}^{(k)}_\lambda)^{-1})_{k\ell}$, $k, \ell \in \{k : \delta_{ik} = 1\}$, and $X_{i,obs} = (X_{ik}, k \in \{k : \delta_{ik} = 1\})^T$. Let $\hat{\lambda}_{CV}$ be the optimal parameter that minimizes $CV_{mis}$ and $\hat{\Omega}_{CV}$ the graphical lasso solution using all observed data at $\hat{\lambda}_{CV}$.

Figure 12 presents three different measures to assess precision matrix estimation. The error distance between the truth and an estimate is evaluated by the spectral norm. For readability, the boxplots of the distance for the dense models ("D" and "CV") under the missing proportion 30% are not shown, but their summary statistics are provided in Table 2. It is confirmed again that having more missing values yields worse estimation. Also, the denser models make it more difficult to achieve satisfactory accuracy in estimation and graph recovery.

Figure 12: Three measures (spectral-norm distance, TPR, and FPR) using the riboflavin data. "D", "M", "S", and "CV" on the x-axis stand for the dense ($\lambda_1$), moderate ($\lambda_2$), sparse ($\lambda_3$), and cross-validated ($\lambda_{CV}$) models, respectively. For readability, two boxplots for the distance from "D" and "CV" are not shown when the missing proportion is 30%.

Table 2: Quantiles of the spectral norms for the dense ("D") and cross-validated ("CV") models with the missing proportion 30%.