Using the accelerated failure time model to analyze current status data with misclassified covariates

Current status data arise commonly in applications when there is only one feasible observation time to check if the failure time has occurred, but the exact failure time remains unknown. To accommodate the covariate effect on failure time, the accelerated failure time (AFT) model has been widely used to analyze current status data with the distribution of the failure time assumed to be specified or unspecified. In this paper, we consider a logistic regression with a misclassfied covariate from the current status observation scheme. A semiparametric AFT model was built to model current status data to eliminate the bias caused by this misclassification. This model is also robust to the misspecification of the failure time compared to the parametric AFT model, as we assume an unknown distribution of the failure time in the proposed model. Furthermore, incorporating the covariate effect on the failure time increases the flexibility of the model. Finally, we adapt the Expectation-Maximization algorithm for estimation, which guarantees the convergence of the estimate. Both theory and empirical studies show the consistency of the estimator. MSC2020 subject classifications: Primary 62N99; secondary 62N02.


Introduction
In many applications, it is challenging to obtain an accurate measurement of a covariate. In general, the naive use of mismeasured covariates may lead to inconsistent estimators of covariate effects in a regression model (e.g., [6], [41], and [40]). There exists an extensive literature on methods to address the mismeasurement covariate problem, including likelihood-based methods (e.g., [31]), pseudo-likelihood methods (e.g., [23]), Bayesian methods (e.g., [15]), or the estimating equation method (e.g., [27]). To estimate the parameters of the mismeasurement distribution, all of these methods require either internal or external validation samples from replication studies ( [6]).
There has been considerable work on methods for estimations under the framework of current status observation. Ayer et al. [1] discussed the nonparametric estimation of the survivor function using the pool adjacent violator algorithm (PAVA). Groeneboom, Maathuis and Wellner ( [13,14]) considered the current status data with competing risks, where they studied consistency, rates of convergence and limiting distribution of the maximum likelihood estimate (MLE). Rabinowitz, Tsiatis and Aragon [29] studied an accelerated failure time (AFT) model with interval-censored data. Jewell and Shiboski [19], Rossini and Tsiatis [30], Huang and Wellner [18], Lin, Oakes and Ying [24], and Shiboski [32], among others, considered the maximum likelihood inference on semiparametric transformation models for current status data. Maathuis and Hudgens [25] proposed nonparametric inference for competing risks current status data with continuous, discrete or grouped observation times. By inverting a Waldtype test for testing a null embedded proportional hazards model, Tian and Cai [33] introduced a novel method for an AFT model based on current status data or interval-censored data. Lam and Xue [21] and Cook et al. [9] developed semiparametric cure rate models to allow for the fact that not all individuals in the population are susceptible to such an event.
Zeng, Cook and Warkentin [43] considered a logistic regression analysis with a misclassified covariate from a current status observation scheme, from which they developed a likelihood-based approach for fitting regression models. Parametric and nonparametric estimates for the distribution of failure (seroconversion) time were developed. In particular, they only considered modeling the distribution of failure time without covariate effects. Because it is known that failure time may depend on the exposure of an antithrombotic drug, ignoring the dependence of the failure time on these covariates can lead to invalid inference (See Section 6). Thus, in practice, it is desirable to accommodate some covariates when modeling the distribution of the failure time.
To overcome this problem, in this paper we consider a logistic regression analysis with a misclassified covariate from a current status observation scheme, in which a semiparametric AFT model for the failure time was built. In the semiparametric AFT model the distribution of the failure time is assumed to be unspecified. Few studies exist in the current status data literature regarding misclassified covariates that use a semiparametric AFT model. An adapted Expectation-Maximization (EM) algorithm ( [10]) is developed using the PAVA to estimate the parameters in the AFT model. We have shown that, both in theory and in simulation studies, all estimators are consistent under some regularity conditions. This method is appealing in that it corrects the bias from covariate misclassification in the current status scheme. Furthermore, it accommodates the dependence of the failure time on covariates, leading to a more accurate and flexible model. Moreover, as a semiparametric method, it is robust to the misspecification of the distribution assumption of the failure time. Finally, this method is easily implemented in practice.
The remainder of this article is organized as follows. In Section 2, we introduce a motivating example. In Section 3, we introduce the notations and models. In Section 4, we describe an EM algorithm for the estimation of parameters. In Section 5, we study the asymptotic property of the estimator. Simulation studies are performed in Section 6 to assess the performance of the proposed method. Concluding remarks are made in Section 7.

Motivating example: A thrombosis study
Prophylaxis with antithrombotic heparin-based therapies is known to be highly effective in reducing the risk of thrombosis. In orthopedic surgery, administration of these therapies is now a standard practice ( [38]). A study from four multi-center randomized trials in which patients underwent orthopedic surgery was conducted ( [2,11,22,34]). In this study, 3132 patients were randomized to receive enoxaparin in order to evaluate the treatment effect of enoxaparin vs. fondaparinux on reducing the risk of deep vein thrombosis (DVT) following hip or knee replacement. Some patients experienced serological responses when they underwent orthopedic surgery and were exposed to antithrombotic. The current research extends these results by investigating the association between this serological response and the subsequent risk of DVT.
Before surgery, patients are seronegative, and approximately 10 days following the surgery, antibodies develop among seroconverters. If all patients were tested approximately ten days after surgery, the seroconversion status would be known for all patients. However, patients were only tested at the time of hospital discharge, which in general is less than 10 days. According to Zeng, Cook and Warkentin [43], the median time from surgery to blood test is 5.9 days, and most seroconversions occur between 5 and 8 days. Therefore, negative test results may represent false negative classifications of the true seroconversion status since these patients may have been tested before they developed the antibody response. Thus, to obtain a valid analysis, it is necessary to accommodate the timing of the test. In this situation, the seroconversion time is subject to a current status observation scheme. Because the seroconversion time may depend on exposure to an antithrombotic drug, it is desirable to accommodate some covariates when modeling the distribution of the seroconversion time. Here, we have developed a novel semiparametric method to analyze current status data with misclassified covariates.

Notation and models
Let X i = 1 if individual i develops serological response and 0 otherwise, where i = 1, . . . , n. Denote T i as the serological response (failure) time if X i = 1 and C i is the random inspection time.
it means that the serological response occurs before or at the inspection time, hence X i = 1; if X i = 0 then D i = 0; if D i = 0, it is possible that the serological response time occurs after the blood test, hence D i is a misclassified version of X i . Let Z i = (Z i1 , . . . , Z ip ) T be a vector of p × 1 baseline covariates which are observed without measurement error.
For the serological response time T i , we build a semiparametric AFT model for individuals who develop serological responses to accommodate for covariate effects. Specifically, we consider the model when X i = 1, where γ 0 is an intercept, γ is an unknown p × 1 vector of coefficients, i has mean zero with an unspecified cumulative distribution function G(s). It is noted that where F (x) = G(x + γ 0 ). Since G(·) or F (·) is a nonparametric function, the model (3.1) is not identifiable unless we fix F (·) or G(·) at some point. As is commonly done in the literature (e.g., [37,20]), we posit γ 0 = 0 so that the model (3.1) is identifiable. Let Y i denote the fully observed binary response of interest for individual i which takes the value 1 if individual i develops a DVT and 0 otherwise, i = 1, . . . , n. Define π i = P (Y i = 1 | X i , Z i ), then we may consider building a logistic regression model: where β = (β 0 , β 1 , β T 2 ) T is a vector of unknown regression parameters. Our primary interest is to estimate β.
If X i is fully observed, we can then directly fit the model (3.3) to obtain an estimate β. However, the X i value is not available when the inspection time is before the serological response time. A naive method is to use the current status indicator D i instead of the true serological status X i in the model (3.3); i.e., we fit the model: Simulation studies conducted in Section 6 demonstrate that the naive method using model (3.4) leads to substantive biases for β. Methods on handling measurement errors in the literature generally require internal or external validation samples or replication studies, which are not available in the thrombosis study.
In this paper, we consider a simple and efficient estimation method for β under a current status observation scheme without needing the internal or external validation samples or replication studies by using a semiparametric AFT model. To describe this method, we must specify a model for the probability of the serocoversion process X i . Let p i = P (X i = 1 | Z i ) denote the probability for the serocoversion process X i . Similarly, we can build a logistic regression model We assume that the inspection time C i is independent of X i and T i given Z i . Let θ = (β T , γ T , α T ) T be the vector of all the Euclidean parameters, where β is the parameter of interest, and γ and α are nuisance parameters.

Estimation procedure
With misclassified covariates, a popular technique to obtain the parameter estimation is to adapt the EM algorithm ( [10]) by treating the true covariates as "missing data". The complete data density-mass function is where in the last step we used the fact that Y is independent of C, and that given (X, Z), D has no predictive value on Y ( [43]), so that P (Y | X, D, Z, C) = P (Y | X, Z), and that X is independent of C, so P (X | Z, C) = P (X | Z). Note that P (Z, C) contains no parameter of interest, so for inference of parameters of interest, we only need to consider the conditional density-mass function Using the evaluations in the complete data conditional likelihood below, the conditional density of the observed incomplete data (Y, D | Z, C) is be the joint density of the observed data, and (θ n ,F n ) be the maximum likelihood estimate (MLE) of the true (θ 0 , F 0 ). Then are equivalent in terms of maximization over (θ, F ). i.e., (θ n ,F n ) is the MLE of (θ 0 , F 0 ) under the full likelihood of the observed data. However, parameter estimation based on the above incomplete data mixture model is not easy. A common practice is to compute (θ n ,F n ) via the iterative EM algorithm based on the complete data likelihood, which is a non-mixture model but with missing data.
The complete data likelihood for Chen et al. and , which involves the unknown parameter γ and unknown nonparametric function (3.5). In practice, the X is subject to missingness, so the observed data are (Y, Z, D, C). In the EM algorithm, we iterate the E and M steps until convergence. Specifically, the estimation of the parameter θ and F is updated at the (t + 1)st through the following two steps:
The E-step requires the calculation of the conditional expectation given the observed data, which can be written to three parts (4.2) where each component represents the conditional expectation of the logarithm of each of the three parts in the complete data likelihood. Since these three parts do not share the same parameters, we can update estimatesβ (t+1) ,γ (t+1) ,F (t+1) andα (t+1) by maximizing the Q-functions separately. For updatingβ (t+1) , it is noted that . It is easy to specify thatŵ ) with respect to β can be easily achieved using standard statistical software by creating a pseudo-data set similar to [43].
Finally, for updatingα (t+1) , the conditional expectation Q 3 (α;θ (t) ,F (t) ) takes the form Maximizing Q 3 (α;θ (t) ,F (t) ) with respect to α can be easily achieved as it has the form of a binary log-likelihood with dataŵ In fact, our model is equivalent to a mixture model. It is well known that there is no general convergence theorem on the EM algorithm, and generally, it converges to a local maxima ( [39], Theorem 3). If the underlying model has the concave property, the EM algorithm converges to the global MLE. On the other hand, if the underlying model is not concave, only local convergence is possible. Thus in an application, one needs to apply the EM algorithm with different starting values to get possible different local stationary points and compare the log-likelihood at these stationary points to find the global maxima.
Our algorithm is a semiparametric version of the EM algorithm. The semiparametric and nonparametric EM algorithms are used widely in the literature, such as in [26,5,16,12], and see the argument there for the convergence of such algorithm ( [12], p.67-68). Chen, Zhang and Davidian [8] applied the EM algorithm to a semiparametric random effects model, while Bordes, Chauveau and Vandekerknove [4] applied the EM algorithm to a semiparametric mixture model, using simulation studies to justify the convergence of the algorithm. Finally, Balan and Putter [3] developed an R-package of EM algorithm for semiparametric shared frailty models.

Consistency of the estimators
be the true joint mass-density of (Y, Z, D, C), Θ be the domain of θ and F be the collection of all the distribution functions F of T , and P be the probability measure of F 0 . To show the consistency of the estimators, we need the following regularity conditions: We show the convergence of the estimators based on the following theorems: Condition (C0) is commonly assumed for this type of problem, such as in [43].
With interval censoring data (type I or current status data) for the Cox proportional hazards model, the regression parameters can be efficiently estimated with rate n 1/2 (see, for example, [17]). However, using the same data for the standard semiparametric accelerated failure time model, the asymptotic distribution for the estimated regression parameters is an open problem: it is not clear whether or notθ n is n 1/2 convergence, or whether it has a normal limiting distribution (see [18], Section 3.3.2). The reason may be because for the current status model, the nonparametric maximum likelihood estimateF n of F has a convergence rate only of n 1/3 , and the process D[0, τ], and does not converge weakly as a process ([18], Section 2.2). Hence the profile log-likelihood l n (θ,F n (· | θ)) is not smooth enough as a function of θ. Chen et al. [7] encountered another similar challenge in density ratio models. Our model is more complicated than the standard AFT model, hence the asymptotic distribution ofθ n in our model is currently not clear. In application, we suggest using a bootstrap method to obtain the variability of the proposed estimators.
Next we study the possibility of rate √ n estimability of our model. Let the original data be V 0 ∼ Q ∈ Q = {Q θ : θ ∈ Θ}. In practice, sometimes the original data V 0 is not completely observed; instead only V = T (V 0 ) is observed, for some known map T . The resulting model for be the score for θ in model Q, and i(v | θ, P ) be that in model P , by Proposition 1.1 in [12], For missing data, the above observation is extended to the following notion of score operator i : L 0 2 (Q) → L 0 2 (P ) (see [12]), where It is a bounded linear operator and has an adjoint i * : L 0 2 (P ) → L 0 2 (Q) determined by The information operator is defined as . The information operator can provide a guideline for the rate √ n estimability of the model parameters. If it is boundedly invertible, the parameter is likely to be √ n estimable; otherwise, it may not. Yuan, Xu and Zheng [42] used this method to evaluate the rate √ n estimability of parameters in some incomplete data models. Below we investigate model (4.1) for the rate √ n estimability for functionals of F , by evaluating its information operator.

Theorem 5.2. For model (4.1), the information operator i * i is not boundedly invertible.
Thus many functionals of the model (4.1), including θ, may not be √ n estimable. For the MLEθ n , although its asymptotic distribution and exact convergence rate are unknown, we can get an upper bound on its convergence rate, as shown in the next Theorem.

Theorem 5.3. Assume (C0)-(C2), then
For presentational continuity, we have relegated all the long proofs of Theorems 5.1 to 5.3 to the Appendix.

Simulation studies
In this section, we assess the performance of the proposed method by comparing the following methods: (i) the naïve method by using a logistic regression model and including the current status indicator D i as a covariate (Naive), i.e., we fit the model (3.4); (ii) the proposed method by fitting a semiparametric AFT model for the seroconversion time (Proposed); (iii) the method by fitting a correctly specified parametric AFT model for the seroconversion time, which serves as a benchmark for comparisons (Parametric); and (iv) the method developed by [43] using a nonparametric estimate of the seroconversion time distribution (ZCW).
We posit that Z i ∼ Bin(1, 0.5), We consider two distributions: (a) T i follows a log-normal distribution with E[log T i ] = γ 0 + γ 1 Z i , and Var(log T i ) = σ 2 ; and (b) T i follows a Weibull distribution with a shape a, and a scale b i = exp(−γ 0 − γ 1 Z i ). For the inspection process, we posit that C i follows the Gamma distribution with a shape 1, and a rate ξ. In the simulation, we posit that α 0 = 0, α 1 = 1, β 0 = 0, β 1 = 1, β 2 = 1, γ 0 = 1, γ 1 = 1 or 0.1, σ 2 = 1, a = 1, and we posit different values of ξ to induce different proportions of patients who develop serological response before inspection time, i.e., P (T < C | X = 1). We consider a sample size n = 2000 and 200, and for each model configuration, we conduct 2000 simulations.
Tables 1 and 2 report the empirical biases and standard deviations where T follows log-normal distribution for γ 1 = 1 and γ 1 = 0.1, respectively. To evaluate the efficiency of the proposed method compared to the parametric method, we also report the relative efficiency (RE) for the proposed method, which is defined as the ratio of the empirical standard deviations between the proposed estimate and the parametric estimate. The probability P (T < C | X = 1) = 0.2 and 0.4 by assuming ξ = 0.5 and 0.2. It can be seen that the naive analysis yields substantial biases for all three parameters. The ZCW's method also yields estimators with substantial biases for a larger γ 1 value (e.g., γ 1 = 1), especially for β 1 , β 2 and α 1 ; as is expected, the ZCW's method yields estimators with smaller biases for a smaller γ 1 value (e.g., γ 1 = 0.1). The biases of the naive and ZCW's estimators also depend on the probability of P (T < C | X = 1) (or ξ), where as P (T < C | X = 1) decreases (or ξ increases), the bias increases. When P (T < C | X = 1) is large or ξ is small, the ZCW method yields comparable estimates as to the proposed method. The proposed method and the correctly specified parametric method yield negligible biases for all kinds of parameter estimates. Although the proposed estimate of β is less efficient than the parametric estimate, the efficiency loss is limited.
Tables 3 and 4 report the empirical biases and standard deviations where T follows the Weibull distribution for γ 1 = 1 and γ 1 = 0.1, respectively. In this case, we consider a smaller probability of P (T < C | X = 1). We posit ξ = 6 and 3 so that P (T < C | X = 1) = 0.04 and 0.07. The conclusions were similar as those found in the log-normal model scenario.
In summary, the proposed method is robust to the distribution assumption for the seroconversion time. The ZCW's method, by ignoring the covariate effect in the distribution of seroconversion time, yields substantial biases for the parameters of interest when the survival time is covariate dependent. When the dependency is small, the ZCW's method may yield less biased or comparable estimate to the proposed method. The naive method using the current status indicator D i as the covariate also yields substantial biases for the parameters of interest. The R code for the simulation is available upon the request from the authors.

Concluding remarks
We have proposed a semiparametric AFT model to address a misclassified covariate in the current status data scheme. This method overcomes the limitation  The question being studied can also fit into a general framework of missing data while missing of a covariate is at random, and the distribution of the observed values depends on the study time and other covariates. Though in this paper, we considered a binary outcome using a logistic regression model, this method can be applied to different type of outcomes using generalized linear models. The estimation procedure introduced in Section 4 can be easily adapted to estimate parameters in the generalized linear model.

Proof of Theorem 5.2
Denote V 0 = (Y, X, D, Z, C) for the original data, and V = (Y, D, Z, C) for the observed incomplete data. The likelihood for V 0 is where π(x, z | θ) and p(z | θ) are given in (3) and (5). Denote the density-mass for the observed V given in (6) by We identify the tangent space asQ = L 0 2 (Q). Let p(Z | θ) as given in (5). The score operator i for model (6) is given by, ∀a ∈ L 0 2 (Q), .
, from the above we get, x, d, z, c) . Thus solving i * ia = a 1 in a ∈ L 0 2 (Q) for any given a 1 ∈ L 0 2 (Q) amounts to solve the above equation (set it to a 1 ) in a. It is easy to see that the array {g(s, y, x, d, z, c) : s, y, x, d = 0, 1} is not of full rank, so the above equation cannot be solved, i.e., i * i is not boundedly invertible. Yuan, Xu and Zheng [42] used this method for several incomplete data models, more details can be found there.

Proof of Theorem 5.3
Let (θ, F | s) be the log-likelihood based on data s = (y, z, d, c), and D n be the set of all the observed data, and M n (θ, F ) and M (θ, F ) as defined in the proof of Theorem 5.1.