A Dirichlet Process Mixture Model for Non-Ignorable Dropout

. Longitudinal cohorts are a valuable resource for studying HIV disease progression; however, dropout is common in these studies. Subjects often fail to re-turn for visits due to disease progression, loss to follow-up, or death. When dropout depends on unobserved outcomes, data are missing not at random, and results from standard longitudinal data analyses can be biased. Several methods have been proposed to adjust for non-ignorable dropout; however, many of these approaches rely on parametric assumptions about the distribution of dropout times and the functional form of the relationship between the outcome and dropout time. More ﬂexible approaches may be needed when the distribution of dropout times does not follow a known distribution or violates proportional hazards assumptions, or when the relationship between the outcome and dropout times does not have a simple polynomial form. We propose a Bayesian semi-parametric Dirichlet process mixture model to ﬂexibly model the relationship between dropout time and the outcome and show that more accurate inference can be obtained by non-parametrically modeling the distribution of subject-speciﬁc eﬀects as well as the distribution of dropout times. Results from simulation studies as well as an application to a longitudinal HIV cohort study database illustrate the strengths of our Bayesian semi-parametric approach.


Introduction
Longitudinal studies are critical to understanding disease progression over time; however, obtaining complete data on all subjects can be challenging. This is particularly true in longitudinal HIV cohorts. It is well documented that many subjects in these studies have missing observations due to death or disease progression, leading to concerns of non-ignorable dropout (Lanoya et al., 2006). In this scenario, standard longitudinal data analyses can produce biased results. For example, this work was motivated by the challenges associated with comparing laboratory markers of HIV disease progression between hard drug users and other subjects in the Women's Interagency HIV Study (WIHS).
Since subjects tend to drop out due to declining health, results from traditional longitudinal analysis methods may be overly optimistic, as they are biased towards study completers, who tend to have more favorable outcomes. In addition, since dropout and missing data are more common among drug users, differences between drug users and other subjects may also be underestimated. As similar dropout related problems have been identified in quality of life data from clinical trials of cancer therapies (Fairclough et al., 1998), anti-depressant clinical trials (Molenberghs et al., 2004), and studies of smoking cessation programs (Hogan et al., 2004b), addressing the analysis challenges in the WIHS will be broadly applicable to other longitudinal clinical trials and cohort studies.
It is well known that in the presence of dropout the marginal distribution of the outcome is not identified and untestable assumptions are necessary to make inferences (Little, 1993;Daniels and Hogan, 2008). While all methods for dealing with missing data must make some unavoidable assumptions about the behavior of unobserved outcomes, most methods rely on additional parametric assumptions. For example, in frailty model frameworks, it is usual to make a parametric or proportional hazards assumption about the distribution of dropout times; in mixture model frameworks, an assumption about the functional form of the relationship between the outcome and the distribution or hazard of dropout is often made (Daniels and Hogan, 2008). Inferences can be sensitive to these assumptions (Molenberghs et al., 1997); therefore, having flexible approaches with minimal assumptions about the patterns or distribution of dropout is useful when considering how missing data may influence the results of longitudinal data analyses. In this paper we present how to use a Dirichlet process mixture of simple frailty models to relax assumptions about the distribution of dropout times and the functional form of the relationship between dropout time and the outcome. We use this model to analyze data from the WIHS, a longitudinal HIV cohort study in which the parametric assumptions of standard models for non-ignorable dropout are violated.

A Preliminary Investigation of the Women's Interagency HIV Study
The WIHS is an ongoing prospective study of the natural and treated histories of HIV infection in women, with behavioral data and specimens collected at semiannual visits by multiple sites since 1994 (Barkan et al., 1998). In contrast to male populations, HIV and AIDS are more prevalent among women of color exposed through heterosexual partners or intravenous drug use (Bacon et al., 2005; Centers for Disease Control and Prevention, 2002). Illicit drug use has been hypothesized to accelerate HIV disease progression by directly enhancing virus replication and by impairing immune responses. While laboratory in vitro and animal studies suggest that drug and alcohol use impairs immune function and increases HIV replication, results from epidemiological studies have been mixed (Moore et al., 2017). These conflicting results may be in part linked to differential dropout between drug users and other subjects. Considering the potential impact of non-ignorable dropout on the results of statistical analyses is particularly important in this context.
The goal of this analysis is to compare declines in CD4 + T cell count, a measure of immunologic health and HIV disease progression, between untreated hard drug users (N=130) and other untreated subjects (N=480) over the first five years of the WIHS study. While WIHS follow-up visits were intended to occur every 6 months, exact timing of visits varies greatly between subjects, so that observation times are not aligned. For example, visit 6 occurred anywhere between 1 and 3 years, with a median of 2.5 years. Therefore we planned to model ln(CD4 + ) as a linear function of time while controlling for baseline ln(CD4 + ) and its interaction with time. The primary comparison of interest is the difference in the time effect or slope between hard drug users and other subjects.
In our initial investigation into the data, we found several causes for concern. First, we noted that a large proportion of subjects dropped out of the study early, with half of subjects lost by 2.4 years (median of 4 observations, Figure 1a). In addition, drug users tended to drop out of the study earlier than other subjects. Due to the prevalence and differential distribution of dropout, missing data could have a large impact on the results of our analysis. Second, subjects that dropped out of the study had lower mean CD4 + at their last visit compared to subjects that remained on study (Figure 2a), and both drug users and others that dropped out early tended to have steeper (more negative) subject-specific ordinary least squares (OLS) slopes than subjects that remained on study. This suggests that subjects that dropped out may have done so due to more rapidly deteriorating health, raising concerns of non-ignorable dropout. We also noted that the distribution of dropout times was irregular (Figure 1b), not taking an obvious parametric form, and that the Kaplan Meier curves of dropout for drug users and others cross, potentially violating parametric and proportional hazards assumptions of common time to event models. Due to the potentially large number of visits for each subject as well as the mis-alignment of visit times, we chose to model correlation using a random effects framework; however, subject-specific OLS slopes do not follow the normal distribution typically assumed for random effects, having long tails off to the left ( Figure 2b). These deviations from normality and typical model assumptions led us to search for more robust alternatives.

Accounting for Missing Data in Longitudinal Studies
Dropout is not ignorable and data are missing not at random when missingness depends on the values of the unobserved outcomes, even after conditioning on the available data (Little and Rubin, 2002). There is a large body of literature describing selection (Heckman, 1979;Diggle et al., 1994;Heckman, 1998), frailty (also called "shared parameter") (Follmann and Wu, 1995;Albert and Follmann, 2000;Wu and Bailey, 1989;Schluchter, 1992;Lancaster and Intrator, 1998) and mixture models (Rubin, 1977;Bailey, 1989, 1988) to account for non-ignorable dropout in longitudinal studies with a Gaussian response. These methods are based on different factorizations of the joint distribution of the outcome and dropout time (Daniels and Hogan, 2008). Methods that can accommodate non-normal data are less developed (Ibrahim and Molenberghs, 2009). In a frequentist setting, parametric selection (Ibrahim et al., 2001;Wu and Wu, 2007), frailty (Ten Have et al., 1998), and mixture models (Ekholm and Skinner, 1998;Follmann and Wu, 1995;Fitzmaurice et al., 2001) have been proposed to account for dropout in studies with binary and other non-normal outcomes; however, semi-parametric methods are lacking. In a Bayesian framework, methods for binary outcomes have focused largely on marginalized transition models for population level rather than subject-specific in-  ference Hogan, 2008, 2010). For subject-specific inference, Kaciroti et al. have described Bayesian pattern mixture models for binary and count data (Kaciroti et al., 2008(Kaciroti et al., , 2009(Kaciroti et al., , 2012, however, these methods may not be feasible for large numbers of dropout patterns or continuous dropout times. Related to our model, for studies with a continuous response, aligned observation times and limited dropout patterns, methods proposed by Linero and Daniels (2015) model the joint distribution of dropout time and the longitudinal response as a Dirichlet process mixture of missing at random models in order to relax parametric assumptions about the distribution of the longitudinal response. Our proposed model differs from that of Linero and Daniels (2015) by accounting for correlation in the longitudinal response using a random effects framework and by accommodating continuous dropout times, as well as non-normal outcomes in the exponential family. For a continuous longitudinal response, our model can be represented as a Dirichlet process mixture of the frailty models proposed by Schluchter (1992). By taking a Dirichlet process mixture of frailty models, parametric assumptions about the distributions of the random coefficients and the dropout times are relaxed, by modeling them as mixtures of normal distributions.

General Approach
Dirichlet process mixture models have been used in semi-and non-parametric Bayesian approaches to model distributions as a mixtures of unknown and potentially infinite numbers of normal distributions (Ghosal, 2010;Christensen et al., 2011). We propose a semi-parametric Dirichlet-process mixture model for dropout in longitudinal studies with exponential family outcomes (DP-Drop) to relax common parametric assumptions made in models that account for non-ignorable dropout.
The DP-Drop can be seen as an extension of the frequentist parametric frailty model proposed by Schluchter et al. (2001). Schluchter considers a two stage model for normally distributed outcomes. The first stage assumes that each subject's responses follow a linear regression with random intercept b i0 and slope b i1 , which can be written y i = X i b i + e i , where e i is a vector of independent, normally distributed error terms for subject i. In the second stage, the subject-specific random coefficients and the natural log of dropout time, u i , are modeled with a joint multivariate normal distribution: where μ b is the mean of the random coefficients, μ u is the mean of the natural log of dropout time, Σ b is the covariance matrix of the random coefficients, σ bu is a row vector containing the covariances of u and each random coefficient, and σ 2 u is the variance of the natural log of the dropout times. This model allows the underlying slope and intercept to be associated with dropout time, via the covariance parameters σ bu . If these covariances are zero, then the random coefficients and dropout time are independent and dropout is assumed to be non-informative.
In addition to assuming a log-normal distribution of dropout times, the parametric frailty model assumes that there is a linear relationship between the log of dropout time and the dropout time specific intercepts and slopes, since the random coefficients are related to dropout time through the covariance parameters. We illustrate how to relax these assumptions using a Dirichlet process mixture of frailty models, rather than a single model, to allow for non-parametric modeling of the distribution of dropout times and random effects. There is no need to specify a parametric form for the relationship between dropout time and the subject-specific effects, as in the conditional linear model (CLM) (Wu and Bailey, 1989), which models the effect of time on the longitudinal response as a low-order polynomial function of dropout time. Our DP mixture of frailty models also does not require any a priori grouping of subjects with subjectively similar patterns of missing data, as in a pattern mixture model (PMM) approach (Pauler et al., 2003). We see the main benefits of our approach as the robust modeling of the subjectspecific coefficients and dropout time distributions, as well as the relationship between dropout time and the subject-specific coefficients, since it is known that inferences can be sensitive to parametric assumptions, particularly in the presence of dropout (Moore et al., 2017;Forster et al., 2012;Hogan et al., 2004a;Molenberghs et al., 1997;Linero and Daniels, 2015). In addition, our model allows the number and timing of responses to vary from subject to subject, as is common in long-term cohort studies, and accommodates longitudinal binary and count responses.

Outline
In Section 2, we describe the DP-Drop method, how to estimate the model using Markov chain Monte Carlo, and how to calculate marginal effects, averaged over the distribution of dropout times. In Section 3, we present results of a simulation study which show that the DP-Drop method reduces bias and mean squared error for the marginal slope compared to 1) standard generalized linear mixed models (GLMM), 2) simple frailty models for non-ignorable dropout and 3) the conditional linear model (CLM) for nonignorable dropout when data are missing not at random. In Section 4, we demonstrate the benefits of the DP-Drop method by examining the impact of hard drug use on CD4 + T cell decline in untreated HIV infected subjects enrolled in the WIHS.

The Dirichlet Process Mixture Model for Dropout
For m subjects with n i longitudinal observations, let y i = (y i1 , . . . , y ini ) be the vector of outcomes for subject i observed at times t i = (t i1 , . . . , t ini ), for i = 1, . . . , m. Let u i be the dropout time for subject i, which may occur at any continuous point in time. u i is observed for subjects that drop out and censored for those that complete the study, so that we observe u o i = min(u i , C i ), where C i is the amount of time subject i is followed on the study in the absence of early withdraw or dropout. Let δ i = 1 if u i is censored and 0 else. Similar to Linero and Daniels (2015), we model the joint distribution of the outcomes, y, and dropout times, u, as a mixture of parametric models, such that: F is modeled with a Dirichlet process with base measure H and mass α > 0 (Escobar and West, 1995). Sethuraman (Sethuraman, 1994) showed that the Dirichlet process mixture is a prior on latent class models, with an unknown, potentially infinite number of classes, so that (1) is equivalent to: The specification in (2) is known as the "stick breaking" construction of the Dirichlet process. {p k } ∞ k=1 are the probabilities of belonging to latent class or cluster k. {V k } ∞ k=1 , known as "stick breaking weights," are the conditional probabilities of belonging to latent class k given that the subject was not in classes 1 to k − 1. θ (k) are the parameters associated with joint distribution of y and u in latent class or cluster k, while parameters common across clusters are included in ω. We take f (y, u|ω, θ (k) ) to follow a frailty or shared parameter model similar to the model proposed by Schluchter (1992), in which the distributions of y and u are linked through subject-specific random coefficients, b 1 . . . b m . The main advantage of using the stick breaking formulation is that it allows for straight forward estimation of the posterior with Gibbs sampling.

Model for Latent Class k
Using the stick breaking construction, we can think of each subject as belonging to a latent class, with data from subjects clustered together in class k following a common distribution with parameters θ (k) . To facilitate posterior computation, we use a truncation of the stick breaking model by replacing the infinite sum in (2) with a sum across the first N terms (see Section 2.3 Estimation) (Ishwaran and James, 2001). We introduce latent variables K = {K 1 , . . . K m } in {1 . . . N}, which indicate the latent class or cluster assignment of each subject. Let 1 k be an indicator function for class k. We may then write our model: For y, we assume a hierarchical random coefficient model to account for correlation due to repeated measurements made on the subjects over time, where EF is an exponential family distribution with a scale parameter φ. The linear predictor for subject i is where X i is an n i x p design matrix for the subject-specific coefficients, b i , and Z i is the design matrix for covariate effects β C . For continuous data, a typical choice for EF is the normal distribution with φ = Σ = Iσ 2 , where I is an n i x n i identity matrix. For b i , we typically include a random intercept, b 0i , and time effect, b 1i , for each subject.
Within latent class k, subjects share common random coefficient and dropout time distributions, with class specific parameters θ (k) In class k, we assume that the random coefficients have a multivariate normal distribution, with cluster specific mean μ (k) b and covariance Σ b . Random coefficients account for correlation when the measurement times are misaligned between subjects or there are large number of observation times. This differs from the approach of Linero and Daniels, who modeled correlation by allowing Σ to be an unstructured covariance matrix, which is better suited to studies with a small number of common observation times.
Since dropout times take on positive values, we assume a log normal distribution for the dropout time within latent class k. We note that others working with discrete dropout times or patterns of missingness have chosen to use Bayesian bootstraps or to model the discrete hazard of dropout at time j sequentially with logistic regression (Linero, 2017;Linero and Daniels, 2015). While it has been applied to continuous dropout time distributions (Su and Hogan, 2010), the Bayesian bootstrap assumes that dropout time values that are not observed in the data have zero posterior probability. Rubin (1981) notes that using the Bayesian bootstrap for a variable X effectively makes the assumption that all possible distinct values of X have been observed in the data and that the Bayesian bootstrap is "clearly inappropriate if X can take on many more than [the] n" observed values, as inferences may be sensitive to this assumption. While the Bayesian bootstrap is well suited to modeling dropout time in studies where dropout occurs at discrete visits, the assumptions of the Bayesian bootstrap are likely not met in studies such as the WIHS where the timing of subjects' repeated measurements is mis-aligned and dropout may occur at any continuous point in time. Following Schluchter (1992), we allow a subject's dropout time, u i , to depend on y i through the random coefficients, b i , as we believe that subjects with poorer outcomes are more likely to dropout. In our notation, μ (k) u is the intercept of the mean function for latent class k and σ bu is a row vector containing the covariances of u and each random coefficient.
For simplicity in our model write up, we have not included covariate effects in the distributions of b i or u i ; however, our model does allow μ (k) b and μ (k) u to depend on categorical covariates. For example, in our application, we allow separate μ (k) b and μ (k) u for hard drug users and others.

Relationship to Parametric Frailty Models
Although perhaps an unusual presentation, we can write the joint distribution of dropout times and the subject-specific random effects as a multivariate normal distribution to highlight which parameters are included in the Dirichlet process mixture and to clearly show our model as a natural extension of Schluchter et al.: Within class k the data are modeled using the parametric frailty model proposed by Schluchter (1992). We note that this joint distribution, , is part of the posterior distribution as factored in Section 2.2: Priors and Posterior. Our approach non-parametrically models the distribution of dropout times as a mixture of log-normal distributions and the distribution of subjectspecific coefficients as a mixture of multivariate normal distributions, while also clustering subjects with similar dropout times and coefficient values.
In order to account for non-ignorable dropout, we must allow for dependence between the outcome model and the dropout model. Clustering is a flexible method to achieve this dependence without requiring that the relationship between the random coefficients and dropout time take any particular functional form. This relaxes the assumption of a linear relationship between dropout time and subject-specific slopes common in both the parametric frailty model and the conditional linear model. Pattern mixture models use a similar strategy to account for non-ignorable dropout, by fitting separate models for a series of dropout patterns. However, unlike PMM's, the DP model does not require a-priori grouping of subjects into subjective patterns or clusters. A similar strategy has been implemented by Linero and Daniels (2015) for discrete dropout times. Linero (2017) also notes that an alternative to his proposed method using the Bayesian bootstrap would be to model the joint distribution of the outcomes and the missing data as a mixture, treating the outcomes and missing data as conditionally independent given class membership, with each class having unique parameters for both the dropout distribution and the distribution of the data. We propose a similar mixture; however, since within each mixture component we assume a normal distribution for the subject specific effects as well as log(u), it is natural to model these quantities using a multivariate normal distribution, similar to that proposed by Schluchter (1992), as we believe that subjects with poorer outcomes are more likely to dropout. This allows for potential covariance between the subject specific effects and dropout time within cluster. It is possible for this covariance to be estimated at or near 0, in which case the subject specific effects and dropout time would be conditionally independent given cluster membership, similar to the model for discrete dropout times proposed by Linero and Daniels (2015).

Priors and Posterior
Above we derived the model for p (y, u|ω, θ). However, we are interested in the posterior distribution, p(ω, θ|y, u) ∝ p(y, u|ω, θ)p(ω, θ). In our model specification, p(y, u|ω, θ) p(ω, θ) factorizes as follows: As mentioned in Section 2.1, where α is the concentration parameter and H is the baseline distribution of the Dirichlet process. In practice, we use a normal baseline distribution, so that the parameters θ (k) ∼ N (μ 0 , Σ 0 ). In addition, the following priors are assigned: where IW is the inverse Wishart distribution. For normally distributed outcomes, the within-subject error variance is assigned an inverse gamma prior.

Estimation
The posterior distribution is not analytically tractable and must be approximated. Markov chain Monte Carlo (MCMC) is a common approach (Ishwaran and James, 2001;MacEachern, 1994;Jain and Neal, 2004), although other alternatives, including partial predictive recursion (Newton and Zhang, 1999), variational Bayes approximations (Blei and Jordan, 2006), sequential importance sampling (MacEachern et al., 1999), and weighted Chinese restaurant sampling (Ishwaran and Takahara, 2002) have been proposed. A blocked Gibbs sampler for non-parametric random effects distributions has been described in detail by Ishwaran and James (Ishwaran and James, 2001). The method approximates the posterior using a truncation of the stick breaking model of (2.2). Since the probability weights assigned to each cluster decrease as the number of clusters increases, the infinite sum can be replaced by a sum across the first N terms by letting V N = 1, and a good approximation can be achieved for fairly modest N of 20 to 50 (Dunson, 2010). For non-Gaussian exponential family outcomes, we utilize a hybrid Gibbs Metropolis-Hastings sampler based on the truncated stick breaking approach (Ishwaran and James, 2001). For Gaussian outcomes, the posterior can be estimated using Gibbs sampling alone. We use data augmentation to simulate the censored dropout times for subjects that complete the study or are administratively censored (δ i = 1) (Wei and Tanner, 1990). To initialize the algorithm, we set u i = u o i for the first iteration. Details of our sampler are described in Appendix A of the Supplementary Material (Moore et al., 2019). An R package to implement these models is available at https:// github.com/kreidles/informativeDropout.

Marginal Effects, Averaged Over Clusters
At each iteration of the MCMC algorithm, subjects are assigned to clusters or latent classes, with class specific means for their random coefficients. However, we are typically interested in the expected value of the coefficients, averaged across clusters. The stick breaking construction of the Dirichlet process makes the calculation of these marginal effects straight forward, since the probability of belonging to cluster k, p k , is calculated at each iteration. At iteration s of the MCMC, we calculate the marginal or expected value of the subject-specific coefficients as follows:

Dropout Time Specific Effects
In order to understand the relationship between dropout time and the subject-specific coefficients, we are also interested in the expected values of the random coefficients at specific dropout times, E(b i |u i = u). These can also be estimated at each iteration as k|u is the probability of belonging to cluster k given a particular dropout time and E(b i |θ (k)(s) , ω (s) , u) is the expected value of the random coefficients in cluster k for a subject with dropout time u at iteration s. p k|u can be estimated at iteration s as follows: where f is the normal probability density function with mean μ (k)(s) u and variance σ 2(s) u . Using properties of the multivariate normal distribution, E(b i |k, u) at iteration s can be calculated: Dunson (2010) has discussed implementation issues when using Dirichlet process priors to non-parametrically model the random effects distribution in hierarchical models. The posterior distribution of the number of clusters is influenced by both α and the Dirichlet process baseline distribution. The data inform strongly about α, so robust results can be obtained by choosing a hyperprior for α; a common choice is Gamma(1,1). However, the posterior distribution is sensitive to the choice of baseline distribution and a poor choice can lead to a model with a single cluster or very few clusters. For example, for the N (μ 0 , Σ 0 ) baseline distribution, choosing the diagonal variance components of Σ 0 to be extremely large in order to be "non-informative" may result in all of the subjects being allocated to the same cluster, since the conditional posterior probability of allocating subjects to unoccupied clusters is very low. This would result in the parametric frailty model. Interestingly, allocating all subjects to separate clusters also results in a parametric model since the random effects distribution is then the N (μ 0 , Σ 0 ) baseline distribution. To properly characterize a non-normal distribution of subject-specific effects, subjects should be assigned to a few distinct clusters (Dunson, 2010). This makes it important to choose the baseline distribution with careful thought about realistic values given the problem.

Specification of Hyperparameters and Sensitivity to Priors
One solution that we have employed in this paper is to use a hyperpriors for the baseline distribution parameters, although the hyperparameters, m b , S b , ν b , and T b , must still be chosen. For HIV cohort studies, there is a wealth of information in the literature that can help inform these choices. For example, in our application, we considered the range of typical CD4 + T cell counts observed in untreated HIV + subjects and the mean and range of dropout times in our dataset to help inform our choices for m b and S b , the hyperprior parameters for the mean of the normal baseline distribution. Since we planned to include baseline log(CD4 + T cell count) as a covariate in our models, setting the first element of m b , which relates to the subject specific intercepts, to 0 was a reasonable choice. Since declines in log(CD4 + T cell count) are typically small (less than 1 log per year) and our primary interest was in comparing changes in CD4 + T cell count over time between groups, we set the second element of m b to 0 to be non-informative. For the third element of m b , we used the log(mean dropout time) in our dataset. S b quantifies our prior confidence in our estimates of m b , the mean of the normal hyperprior on the mean of the baseline distribution. We set S b to a 3 x 3 identity matrix, as this covered the range of subject specific intercepts, slopes and dropout times we believed were plausible.
Specifying ν b and T b , the hyperparameters for the inverse Wishart hyperprior on the covariance of the normal baseline distribution, can be more challenging. We chose to use a 3 x 3 identity matrix for T b and then set ν b so that the expected variances of the inverse Wishart distribution would cover the plausible range of intercepts, slopes, and dropout times in our dataset. For example, the expected value of the inverse Wishart distribution is T b /(ν b -4), if a subject specific intercept and slope are included in the cluster specific coefficients. Setting ν b to 5, resulted in prior expected variances of 1 for the baseline distribution, which covered the range of values expected or seen in our dataset well. For example, log(dropout time) ranged from approximately -0.6 to 1.6 log(years), so a single standard deviation above and below the expected mean of the baseline distribution would cover all the of the observed values in our dataset. Since the log of dropout times, intercepts and changes in log(CD4) + over time were on a similar scale, this strategy worked well. If this were not the case, a more complex T b , with unequal diagonal elements, may have been more appropriate. Shi et al. (2019) have proposed methods for developing low information priors for Dirichlet process mixture models, which involve scaling the data before analysis. This offers another potential solution when there is less subject-specific information available to choose hyperparameter values. We also suggest fitting additional models varying the hyperparameter values for the hyperpriors for α and the mean and variance of the baseline distribution to better understand sensitivity of inferences to these choices.
Other hyperparameters that must be specified include R 0 , the prior variance of covariate effects, and ν 0 and T 0 , the prior degrees of freedom and scale matrix for the inverse Wishart prior on Σ, the within cluster the covariance matrix for the subject specific intercept, slope and log of dropout time. For R 0 we typically use a covariance matrix with zeros for the off-diagonal elements and large values on the diagonal, in order to be uninformative. In our simulation and applications, we set the diagonal elements to 100, but in general "large" will depend on the scale of the data. Again, we set ν 0 and T 0 to 5 and a 3 x 3 identity matrix, respectively, as this would cover the range of means for subject specific intercepts, slopes and dropout times we believed were plausible in our dataset.

Identifiability of Model Parameters
The identifiability of parameters is an issue both in statistical models for missing data and in Bayesian non-parametric modelling in general. Likelihood invariance under label switching is a well-known problem in Bayesian mixture modeling, which complicates inference on the parameters of the individual components of mixture models (Stephens, 2000;Celeux et al., 2000;Makov et al., 1985). However, in our proposed approach, we do not attempt to assign meaning to or make inference on individual components; rather, we simply use the mixture model approach to flexibly model the marginal distribution of dropout times and subject-specific effects. As noted in Linero (2017), in the presence of missing data, identification of model parameters and causal effects requires assumptions about the missing data which cannot be checked, making sensitivity analysis important. In our model, we assume that after accounting for dropout time, data are missing at random and that subject specific slopes remain unchanged after dropout (i.e. subjects continue on the same linear trajectory after dropout). In practice, we perform sensitivity analyses to determine if results are robust to this assumption by re-fitting the model assuming that slopes change by a multiplicative factor of Δ after dropout. We consider a range of Δ and examine results to understand the extent to which Δ changes estimates of the expected value of the outcome over time. We perform a sensitivity analysis in the analysis of the WIHS data in Section 4.2.

Simulation
We assess the performance of our model to estimate the marginal slope (expected change in the outcome over time) as well as dropout time specific slopes. Data were generated under four different scenarios, including two normally distributed outcomes and two binary outcomes. In these four scenarios the subject-specific slopes were related to dropout time by two different dropout mechanisms: (i) a continuous and smooth function of dropout time, resulting in a left skewed distribution of subject-specific slopes and (ii) a discontinuous step function, resulting in a bimodal distribution. In addition, simulations for a linear relationship and no dropout effect are presented in Appendix B of the Supplementary Material to show the method can also fit CLMs and GLMMs.

Methods of Evaluation
DP-Drop models, frailty models (Schluchter et al., 2001), and CLM's (Wu and Bailey, 1989) to account for drop out, as well as standard generalized linear mixed models (GLMM's) were fit to each dataset. The DP-Drop was implemented as described in Section 2.1, including subject-specific intercepts and time effects for b i . Similarly, the frailty model was implemented as shown in Section 1.2, also including subject-specific intercepts and time effects for b i . The conditional linear model was implemented as follows: where 1 i is an n i x 1 vector of 1's and b 0i and b 1i are subject-specific random effects centered at 0. Marginal effects for the CLM were calculated by averaging over the empirical distribution of dropout times. The GLMM was similar to the CLM, with the exception that the β 2 u i term was excluded from the linear predictor. The performance of the methods was compared graphically and in terms of bias, variance, and mean square error for the marginal slope, as well as for the sixteen dropout time specific slopes.

Implementation
Non-informative priors, with large variances were used for all models. The priors for Σ and Σ 0 were IW(5, I 3x3 ), where I 3x3 is a 3 x 3 identity matrix, and the prior for the mean of the baseline distribution was N 0, 0, log(0.5) , I 3x3 . The hyperparameters for α were both set to 1, and an inverse gamma prior with both hyperparameters set to 0.001 was used for the within-subject variance for the normal distribution simulations. A truncation level of K = 20 was used and subjects were randomly assigned to one of the clusters to initialize the MCMC chain. The MCMC chains were run for 20,000 iterations, with the first 5,000 iterations used as burn in. Trace plots were used to check for convergence. Similar MCMC algorithms were run for the frailty model, the CLM, and the standard generalized linear mixed model. The DP-Drop, CLM, frailty model and GLMM methods were implemented in R using a custom MCMC algorithm utilizing the MASS (Venables and Ripley, 2002), mvtnorm (Genz et al., 2014), MCMCpack (Martin et al., 2011), and gtools (Warnes et al., 2014) packages.

Results
Model performance was quantified in terms of bias, variance, mean squared error (MSE) and 95% credible interval coverage for the marginal slope (Table 1). The median number of non-empty clusters for the DP-Drop model was 10.5, 12.0, 6.5, and 8.1 for the Normal (i), Normal (ii), Binary (i), and Binary (ii) scenarios, respectively. The GLMM had the lowest variance, likely because the method makes unmet assumptions that simplify the model and also has the fewest parameters. The DP-Drop method had the lowest bias and MSE, as well as the highest coverage probability, of all the methods, except in the Binary (ii) scenario, where the CLM had better performance. However, in this scenario, the CLM poorly modeled the dropout time specific slopes, underestimating slopes at early dropout times and over-estimating at later dropout times. This combination of over and under estimation, rather than appropriate model fit, resulted in the minimized bias values.
While the DP-Drop had the best performance in terms of credible interval coverage, we note that the DP-Drop 95% credible intervals do not always maintain 95% frequentist coverage probabilities. While in certain scenarios, Bayesian credible intervals can achieve nominal frequentist coverage probabilities, in general, there is no guarantee that a 95% Bayesian credible interval will maintain 95% frequentist coverage probability, as coverage probability depends on the prior distribution and sample size, as well as other factors (Wasserman, 2011;Gray et al., 2015). In these simulations with nonignorable dropout, coverage is lower than expected since the estimates of change over time for subjects that dropout early, and therefore have fewer observations, are necessarily shrunk towards subjects with more information. For example, for the Normal (i) scenario, subjects that dropout after their first visit have the smallest change over time, but only have one observation, which is not enough information to estimate a slope. Therefore, it is not surprising that the DP-Drop method estimates the marginal slope to be higher than the true simulation value and that the coverage is less than 95% due to this shrinkage. Simulations. In all cases the DP-Drop method fits the simulated dropout-varying slope more closely than the frailty model.
Graphs of the predicted DP-Drop frailty model, and CLM slopes at each dropout time are presented in Figure 3. The DP-Drop method was able to more accurately describe the relationship between dropout time and the subject-specific slope compared to the frailty model and the CLM, which always fits a linear relationship. While the frailty model and DP-Drop seemed to provide a similar fit for the Normal (i) scenario, where the dropout-varying slope could be relatively well represented by the logarithmic curve of the frailty model, the DP-Drop method provided a much better fit for form (ii) and in the binary simulations. The densities of the subject-specific slopes are presented in Figure 4. The DP-Drop method was able to capture the skewness of the form (i) distributions and the bimodal nature of the form (ii) distributions. The CLM, frailty and GLMM models fit normal distributions to these data, resulting in shrinkage of the estimates of subjects with early dropout times towards the mean for form (i) and large variances in form (ii). For the DP-Drop form (ii), in particular for the binary case, we see that while the density is bimodal, the modes of the estimated density are shifted towards one another compared to the histogram of the simulated data, and are slightly shrunk towards the overall mean. This is not surprising, as there is uncertainty in cluster assignment, particularly for subjects that dropout early and have few observations.
In Appendix B, we present additional simulation study results, including a scenario when dropout is ignorable and there is no relationship between dropout time and the subject specific slope. If there is no relationship between the subject specific slopes and dropout time, the DP-Drop is more complex than needed (as are other models that account for dropout) and the variance and MSE are increased compared to GLMM's.

Application
We use our methodology to analyze data from the WIHS. Recall that we are primarily interested in comparing the expected change in ln(CD4+) over time between hard drug users and other subjects in the cohort, while controlling for baseline ln(CD4 + ) and its interaction with time. At all study visits, subjects in the WIHS were questioned regarding illicit drug use since their previous visit. Subjects that reported the use of any hard drug, such as cocaine, amphetamines, heroin or other injection drugs, at more than half of visits were classified as consistent hard drug users and comprised the hard drug use group for this analysis. We define dropout time as the time of the last observation plus 1 day.

Comparison to Alternatives and Assessing Model Fit
In addition to the DP-Drop, we consider two parametric models to account for nonignorable dropout, the CLM and the parametric frailty model. We also fit a standard linear mixed model (LMM) that did not account for dropout for comparison. Baseline ln(CD4 + ) and its interaction with time were included as common covariates in all models. All models also included random subject-specific intercepts and time effects. The CLM and the LMM included separate intercepts and time effects for hard drug users and others. The CLM also included a time by dropout time by hard drug use interaction to allow a separate effect of dropout time for hard drug users and other subjects. For the frailty model the means of the subject-specific coefficients and dropout times were allowed to differ for drug users and others. For the DP-Drop, distributions of the subject-specific coefficients and dropout times were allowed to differ for drug users and others. For full details of these models, see Appendix C of the Supplementary Material.
For the DP-Drop, when possible, we used uninformative priors with large variances. An inverse gamma prior with both hyperparameters set to 0.001 was used for the withinsubject variance. The prior for the covariates was N(0, 100). The hyperparameters for α were both set to 1. Priors for Σ and Σ 0 were both IW(5, I 3x3 ). The prior for the mean of the baseline distribution was N 0, 0, log(1.9) , I 3x3 . These values were chosen using the approaches outlined in Section 2.5. A truncation level of N = 60 was used. The MCMC chain was run for 200,000 iterations with a burn in of 50,000. We fit additional models varying the priors for α and the mean and variance of the baseline distribution. As our dataset is fairly large, we did not see significant changes in the results. Similar MCMC algorithms were run for the frailty model, the conditional linear model, and the standard generalized linear mixed model.   To compare model fit between the methods, we performed a 10-fold cross validation. We first created 10 independent datasets consisting of all observations on 61 subjects (13 subjects reporting hard drug use, 48 others). In each fold of the cross-validation, models were trained on 90% of the data, and predicted values were generated for the remaining 10% of the data in the test set. We calculated a squared error loss function as follows: we calculated the mean squared error of the predicted values for each subject in the test set and summed these values across subjects. Values of the loss function were then averaged over the 10 folds. Lower values of the loss function indicate better model performance.
The DP-Drop model had significantly lower loss than any of the alternatives (Table 2), likely because the parametric assumptions of the CLM and frailty models were not met. For example, we see in Panel a of Figure 5 that the CLM assumes a linear relationship between dropout time and the slopes, the frailty model assumes a log-linear relationship, and the DP-Drop model is more flexible. However, this increased flexibility comes at the price of wider credible intervals. In Figure 5b, we see that the histograms of the subject specific slopes estimated using the DP-Drop method reasonably approximates the distributions seen in Figure 1b.

Results and Sensitivity Analysis
For subjects that reported hard drug use, CD4 + counts declined by an average of 24.4% per year compared to 15.2% per year for other subjects ( Table 2). The DP-Drop method had a posterior mean of 7 clusters (95% credible interval: 4-11) for subjects reporting hard drug use and 13 clusters (95% credible interval: 10-17) for others. Using a linear mixed model that does not account for dropout, these declines are estimated to be 21.0% and 14.4% respectively. Accounting for dropout, we estimate larger declines in CD4 + count over time, particularly for hard drug users, who were more likely to drop out. In addition, we find larger differences in declines between hard drug users and other subjects (Table 2). Failing to account for dropout could result in underestimating the effect of hard drug use ( Figure 6). In Table 3, we see that subjects that reported hard drug use had lower posterior mean slopes across the range of the dropout times, and significantly steeper declines for subjects that dropped out at years 2 and 3. This indicates that the significant marginal difference in CD4 + decline (averaged over dropout time distribution) between subjects reporting hard drug use and others is not only driven by differences in dropout time distribution, but also by more rapid declines in CD4+ at specific dropout times. Our estimates of CD4 + count over time rely on the untestable assumption that subjects continue to have the same rate of CD4 + decline after they dropout of the study. We evaluate the sensitivity of our results to this assumption by estimating CD4 + count over time assuming that the rate of decline changes by a factor of Δ after dropout. Δ > 1 would indicate more rapid declines after dropout, and 0 < Δ < 1 would indicate an attenuation of the rate of decline after dropout. Δ of 0 represents the assumption that subjects experience no further declines in CD4 + count after dropout, which is unrealistic in a cohort of untreated subjects. For Δ of 0, 0.5, and 0.75, differences in CD4 + count between hard drug users and other subjects are estimated to be smaller, but are still statistically significant at years 2, 3, 4, and 5, with drug users having significantly lower CD4 + than other subjects (Figure 7, Table 4). This sensitivity analysis suggests that our results are relatively robust to violations of the assumption that subjects' CD4 + counts continue to decline at the same rate after dropout.  One potential limitation of this analysis is that subjects in our study may have dropped out due to death. Subjects who die do not have potential outcomes after their dropout. Estimates of CD4 + presented in marginal plots, such as Figure 6, are averaged over all subjects, including those who may have died. Therefore in the presence of death, marginal estimates and plots should not be interpreted as the expected CD4 + across the surviving patients in the cohort. If dropout due to death is known, it may be more relevant to report the proportion of subjects surviving to the end of the study in each group and the expected longitudinal trajectories for those that survive, or to compare estimates of dropout time specific slopes between groups, as in Figure 5, which do not depend on extrapolating beyond the observed dropout times.

Discussion
We propose a flexible Dirichlet process mixture method to account for dropout in a Bayesian framework. The DP-Drop method allows for dropout occurring at any continuous point in time and avoids making parametric assumptions about the distribution of dropout times or the functional form of dropout-varying slope, by modeling both the  distribution of the log of dropout time and the subject specific effects as a Dirichlet process mixture of normals. The model flexibly clusters subjects with similar random effects and dropout times in order to model skewed, multi-modal and other non-normal distributions of subject-specific effects and to more accurately characterize longitudinal outcomes when non-ignorable dropout is present. Results of our simulation studies show that the DP-Drop reduces bias and mean squared error for the marginal slope compared to standard GLMMs, frailty models and conditional linear models when non-ignorable dropout is present. Natural extensions to this model might include adding continuous covariates to the class specific model for the random effects or dropout time or additions to account for differing dropout reasons among subjects (Moore et al., 2017;Pauler et al., 2003). In longitudinal studies, subjects may dropout for a variety of reasons, such as loss to follow up, declines or improvements in health, or death. Since subjects with similar dropout reasons are also likely to have similar subject-specific coefficients, we expect that the DP-Drop will cluster subjects with similar dropout reasons together and that the heterogeneity of effects across different dropout reasons would be taken into account. However, as the model does not explicitly include reason, we cannot compare differences in declines between subjects dropping out for different reasons.
Applying the DP-Drop method to the analysis of untreated CD4 + count in the WIHS, we find important differences in estimates of CD4 + decline for both hard drug users and others compared to a standard LMM and the CLM and frailty models that account for dropout. In addition, since differential dropout is present, we also find a larger difference in CD4 + declines between hard drug users and others. Since we must assume that subjects continue on the same trajectory after dropout in order to identify model parameters, we performed a sensitivity analysis and found that the results of the WIHS analysis are relatively robust to violations of this assumption. Our preliminary investigation of the WIHS data suggested that additional parametric assumptions of common methods to account for non-ignorable dropout, such as log-normally distributed dropout times and/or normally distributed subject specific effects, may be violated. As inferences can be sensitive to these assumptions, having flexible approaches with minimal assumptions, such as the DP-Drop, is important when considering how missing data may influence the results of longitudinal data analyses.