Personalized Dynamic Treatment Regimes in Continuous Time: A Bayesian Approach for Optimizing Clinical Decisions with Timing

Accurate models of clinical actions and their impacts on disease progression are critical for estimating personalized optimal dynamic treatment regimes (DTRs) in medical/health research, especially in managing chronic conditions. Traditional statistical methods for DTRs usually focus on estimating the optimal treatment or dosage at each given medical intervention, but overlook the important question of “when this intervention should happen.” We fill this gap 1 ar X iv :2 00 7. 04 15 5v 3 [ st at .M E ] 1 8 Fe b 20 21 by developing a two-step Bayesian approach to optimize clinical decisions with timing. In the first step, we build a generative model for a sequence of medical interventions—which are discrete events in continuous time—with a marked temporal point process (MTPP) where the mark is the assigned treatment or dosage. Then this clinical action model is embedded into a Bayesian joint framework where the other components model clinical observations including longitudinal medical measurements and time-to-event data conditional on treatment histories. In the second step, we propose a policy gradient method to learn the personalized optimal clinical decision that maximizes the patient survival by interacting the MTPP with the model on clinical observations while accounting for uncertainties in clinical observations learned from the posterior inference of the Bayesian joint model in the first step. A signature application of the proposed approach is to schedule follow-up visitations and assign a dosage at each visitation for patients after kidney transplantation. We evaluate our approach with comparison to alternative methods on both simulated and real-world datasets. In our experiments, the personalized decisions made by the proposed method are clinically useful: they are interpretable and successfully help improve patient survival.


Introduction
In biomedical applications involving long-term personalized care of patients with chronic health conditions (e.g., diabetes, human immunodeficiency virus infections, and chronic kidney diseases), treatments often include a sequence of decision making and must be adaptive to the uniquely evolving disease progression of each patient. Such scenarios are called dynamic treatment regimes (DTRs). Patients with chronic diseases are usually required to follow up with their physicians from time to time and their clinical data are recorded longitudinally. Based on these clinical observations, physicians make clinical decisions such as scheduling follow-up visitations and prescribing the right dosages to optimize patient outcomes given a patient's individual characteristics and treatment history at each clinic visitation. Estimating treatment effects and optimizing sequential treatment assignments from observational data have been extensively studied in both statistics and machine learning, such as the G-computation formula (Robins, 1986), inverse probability of treatment weighting (Orellana et al., 2010), doubly robust methods (Zhao et al., 2015), and reinforcement learning (Zhao et al., 2011;Clifton and Laber, 2020). This paper develops a two-step Bayesian approach to optimize clinical decisions with timing. In the first step, we develop a Bayesian joint model consisting of a generative probabilistic submodel for clinical decisions with timing and a submodel for clinical observations (e.g., longitudinal clinical measurements and time-to-event data): these two submodels share certain structures and parameters in order to capture the mutual influence between the clinical observations and decisions. The posterior inference of the proposed Bayesian joint model can learn the parameters in the clinical decision submodel as the estimates of how physicians treatment patients in the observed data and uncertainties in clinical observations. In the second step, we propose an optimization method that allows the decision model, by interacting with the other parts of the joint model, to learn to make the personalized optimal clinical decision at the right time while accounting for uncertainties in clinical observations. Such a joint model and the proposed optimization method will be useful in many biomedical applications. We elaborate on one signature application in Section 1.1, explain why existing methods won't work well on it in Section 1.2, and then give an overview of our method and its technical novelty in Section 1.3.

A signature application
A signature medical application of the proposed method would be the kidney transplantation, the most common type of organ transplantation and the primary therapy for patients with end-stage kidney diseases (Arshad et al., 2019). Compared to dialysis, kidney transplantation improves patients' long-term survival and quality of life but with a lower healthcare cost (Jarl et al., 2018). Despite significant advances, a number of complications after surgery still represent important causes of morbidity and mortality for kidney transplant recipients, such as infection, stroke, and graft failure (Lamb and Lodhi, 2011;Bicalho et al., 2019). To prevent graft rejection, patients are usually hospitalized for a few days initially to monitor signs of complications, then required to have frequent checkups at an outpatient center after being released. At each visitation, they are administered immunosuppressive drugs, such as tacrolimus, to keep their immune systems from attacking and rejecting the new kidney (Kasiske et al., 2010). One crucial medical decision is to schedule patients' post-transplantation follow-up visitations. While follow-up visitation frequency varies from 0-12 months (Israni et al., 2014), patients with stable kidney function usually have less frequent follow-ups compared to non-stable patients. Another medical decision is to determine the right dosage of tacrolimus at each follow-up visitation since a dosage that is either too high or too low may cause serious adverse events. Higher tacrolimus levels have been reported to associate with adverse effects such as neurotoxicity, nephrotoxicity, and cancers (Naesens et al., 2009); while lower tacrolimus levels are associated with an increased likelihood of graft rejection (Staatz et al., 2001). Therefore, optimizing personalized follow-up schedules and prescribing the right dosage of tacrolimus tailored to each patient at each visitation (i.e., precision medicine) are critical and can have a significant impact on patients' survival.
Large-scale kidney transplantation databases, such as French computerized and validated data in transplantation (DIVAT), provide us both opportunities and challenges to determine personalized optimal follow-up schedules and tacrolimus dosages. DIVAT is a database storing medical records for kidney transplantation in several French hospitals (e.g., Nantes, Paris Necker). Data are collected from the date of transplantation until the graft failure, defined as either returning to dialysis or death with a functioning graft. At each scheduled follow-up visitation, patients' creatinine levels, an important biomarker for measuring kidney function, are collected longitudinally to determine the next follow-up time and assign dosages by physicians. For example, Figure 1 presents one randomly selected patient's longitudinal creatinine levels and tacrolimus dosages versus his/her follow-up visitations from DIVAT. In the first several visitations after kidney transplantation, this patient's creatinine levels were high, indicating the kidney was not functioning well; therefore, the physician scheduled a high frequency of follow ups and prescribed high dosages of tacrolimus. As time went by, this patient's kidney function became stable indicated by slowly decreasing creatinine levels; then the prescribed tacrolimus dosages also slowly decreased accompanied with a decreasing frequency of visitations. For patients with kidney transplantation, a major clinical outcome of interest is the graft survival time, defined as the time between the transplantation and the first graft failure. Follow-up schedules and tacrolimus dosages should be made for the sake of maximizing patients' graft survival time.

Why not use existing methods?
Although many statistical and machine learning DTR methods have been developed to optimize sequential clinical decisions (Murphy et al., 2003;Chakraborty, 2013;Laber et al., 2014;Xu et al., 2016;Luckett et al., 2019), they don't model, and thus can't optimize, the timing of clinical decisions. Most DTR methods regard treatment schedules as known a priori and only learn to assign optimal treatments at pre-defined schedules. For example, Xu et al. (2016) developed a Bayesian nonparametric approach building upon a dependent Dirichlet process and a Gaussian process to determine the optimal treatment regimen containing a front-line chemotherapy and a salvage treatment for acute myelogenous leukemia patients. However, the timing of the salvage treatment was pre-defined as the time when patients became resistant to the front-line chemotherapy or achieved complete remission first then relapsed. Shardell and Ferrucci (2018)  proposed a joint mixed effects model using G-computation in which the longitudinal outcome and treatment assignment models share common random effects to estimate causal effects with pre-defined treatment timing. Clifton and Laber (2020) reviewed the use of Q-learning, a general class of reinforcement learning methods, in estimating optimal treatment regimens taking the timing of treatments as given. Zhao et al. (2011) attempted to optimize the timing to initiate second-line therapy in the context of clinical trials with two-stage treatments using Q-learning, but only considered two options (i.e., immediately or delayed after induction therapy). Other work on optimizing intervention timing with a fixed number of treatment options include initiation of antiretroviral therapy in HIV (Robins et al., 2008), just-in-time adaptive interventions in mobile health (Nahum-Shani et al., 2018;Carpenter et al., 2020), and advantage doubly robust policy learning that optimizes when to treat (Nie et al., 2021). Guan et al. (2019) developed a Bayesian nonparametric method that learns to recommend a regular recall time for patients with periodontal diseases. However, their method only picks the recall time out of a few pre-defined choices (e.g., 3 months, 6 months, and 9 months) and thus is not applicable to complicated scenarios like the one introduced in Section 1.1: at each visitation after kidney transplantation, the next visitation time has to be carefully scheduled given the current clinical measurement in order to maximize the patient's health outcome. For instance, when patients' kidney function is relatively stable, they should be instructed to wait longer until the next visitation compared to those who are less stable.

Why use our method?
To the best of our knowledge, the proposed approach is the first general methodology for estimating personalized optimal clinical decisions with timing. The method is cuttingedge because (1) we build a generative probabilistic model that properly handles clinical decisions with timing; (2) we embed this decision process into a Bayesian joint model that also models clinical observations; (3) we propose a Bayesian optimization procedure to optimize personalized treatment schedules alongside other clinical decisions while accounting for uncertainties in clinical observations based on the posterior inference of the proposed Bayesian joint model.
Our decision model is a marked temporal point process (MTPP) (Aalen et al., 2008), which is a natural tool to model discrete events in continuous time. It has been widely applied and become increasingly popular in various domains, including social science (Butts and Marcum, 2017), medical analytics (Liu et al., 2018), finance (Hawkes, 2018), and stochastic optimal control (Tabibian et al., 2019). In our example application of Section 1.1, each follow-up visitation is an event: the visitation time is assumed to be stochastically scheduled according to the probability distribution characterized by the proposed MTPP; and the assigned tacrolimus dosage, when the visitation happens, is treated as the corresponding "mark." The proposed MTPP for clinical decisions is then embedded into a Bayesian joint model where it shares certain structures and parameters with the other components modeling clinical observations, including longitudinal creatinine measurements and patient survival in the example application of Section 1.1. Such design allows our model to capture the complicated mutual influence between clinical observations (e.g., creatinine levels) and decisions (e.g., treatment schedules and tacrolimus dosages). We fit the proposed Bayesian joint model to the data and obtain the posterior inference, which not only learns the dynamic patterns of clinical observations with uncertainty quantification, but also estimates the parameters in modeling clinical decisions as the estimates of how physicians treat patients in practice. The posterior means of these parameters can be used as the initial values in the optimization procedure so that clinical decisions can be refined to optimize patients' survival.
Next, we let the decision model interact with the observation model in an optimization procedure. This technique is known as "reinforcement learning" (Sutton and Barto, 2018): the decision model (also called the "policy") is reinforced, by the feedback from the observation model (also called the "environment"), to give personalized optimal treatment schedules and dosages that would improve the expected health outcome for each patient. The Bayesian nature of our approach allows the learning to account for parameter uncertainties in the observation model. Figure 2 illustrates the proposed two-step Bayesian approach. The left box displays the observational data including the clinical measurements and dosages at a sequence of visitations, and the right-headed arrow next to it stands for the Bayesian joint model (i.e., the first step), pointing to the right box, which represents the learned inference from the Bayesian joint model, including inference on longitudinal measurements, the hazard function of survival, dosages, and the intensity function of visitation timing. The vertical arrows inside the right box indicates the policy optimization method (i.e., the second step), in which the decision model learns to achieve higher reward by interacting with the learned measurement model. The R package doct (short for "Decisions Optimized in Continuous Time") implementing the proposed model and algorithm is available at https://github.com/ YanxunXu/doct. The rest of the paper is organized as follows. In Section 2, we outline the overall framework of the proposed two-step Bayesian approach. In Section 3, we elaborate on the first step of developing a Bayesian joint model consisting of the decision submodel (for visitation schedules and dosages) and the observation submodel (for clinical longitudinal measurements and patient survival). In Section 4, we elaborate on the second step of the optimization procedure for the decision model. We evaluate the proposed approach through simulation studies in Section 5 and applying it to the DIVAT kidney transplantation dataset in Section 6. Lastly, we conclude the paper with a discussion in Section 7.

Problem formulation
Our goal is to optimize personalized clinical decision including scheduling a patient' follow-up visitations and prescribing dosages to maximize the patient' health outcome, e.g., the graft survival time in the kidney transplantation application. In this section, we first introduce the notations, and then outline the framework of the proposed two-step Bayesian approach.
For each patient i, let T i and C i denote the graft survival and administration censoring times, respectively, i = 1, . . . , I. We observe only T i = min(T i , C i ) and the censoring indicator δ i = 1(T i ≤ C i ). The graft survival would typically be affected by the patient's baseline risk factors, denoted by x i , including the patient's age when receiving the transplantation and the donor type. At each visitation time t i,j , the clinical measurement y i,j of interest would be taken, j = 0, . . . , J i , and t i,Ji ≤ T i . Note here t i,0 = 0 denotes the transplantation date of patient i, and y i,0 denotes the initial creatinine level. Then the physician would prescribe the dosage d i,j , and schedule the next visitation time t i,j+1 . In the kidney transplantation application, y i,j is the logarithm of the creatinine level (μmol/l), and d i,j is the logarithm of the tacrolimus level (ng/ml). We represent the i-th patient's sequence of visitations and assigned dosages by Assume that the physician assigns the dosage d i,j and schedule the next visitation time t i,j+1 based on the patient's baseline covariates x i and the longitudinal creatinine measurements y i,j = {y i,j : j ≤ j}, we write the joint model of the observed clinical observations and decisions as (2.1) We leave the details of the chosen parametrization for this joint model to Section 3. Throughput this paper, we use p(·) to denote the probability density. In order to identify causal effects, we make the standard assumptions of sequential ignorability and consistency throughout this paper (Murphy et al., 2003). The positivity assumption is satisfied in our model due to the use of the marked temporal point process (MTPP) (Aalen et al., 2008) for modelling stochastic clinical decisions in continuous time, which will be introduced in Section 3.1.
Assume that the joint model (2.1) is parameterized with θ and φ, where θ denotes the set of parameters related to clinical decisions that control patients' follow-up schedules and dosages at follow-up visitations, and φ denotes the remaining parameters. We aim to learn the parameters θ of the clinical decision model such that, for any (future/hypothetical) patient i, a certain reward R i can be maximized by prescribing the right amount of dosages and scheduling the visitations at the right times. Formally, we aim to maximize the expected reward for any patient i with baseline covariates x i : Here the expectation is taken over all possible stochastic realizations of the clinical measurements y i , the survival time T i , and the clinical decisions e i,Ti from their joint distribution p(y i , T i , e i,Ti | θ, φ, x i ); D denotes the observed data; and p(φ | D) is the posterior distribution of φ obtained from the posterior inference of the Bayesian joint model (2.1). Note that the reward is stochastic, because the clinical measurements and decisions (i.e., dosages and visitation times), and their influence on the reward are all stochastic. In our kidney transplantation application, we can consider the patient's median survival time as the reward. More details on the reward will be introduced in Section 4.
To find the optimal clinical decision parameter θ i that maximizes the expected reward G i (θ) for patient i after integrating out the uncertainty in the longitudinal process and the survival distribution, we propose a two-step Bayesian approach. In the first step that will be introduced in Section 3, we will fit the joint model (2.1) to the observed data and assign priors for parameters θ and φ, then obtain posterior inference through Markov chain Monte Carlo (MCMC) simulations. In the second step that will be introduced in Section 4, we will propose a policy gradient method using stochastic gradient descent (SGD) (Ruder, 2016) to optimize personalized clinical decision. The uncertainties in the clinical observations will be incorporated by integrating over the posterior distribution of p(φ | D) when calculating the reward (2.2). The estimated posterior mean of θ from the first step can be used as an initial value in the optimization procedure for efficient learning.

First step: a Bayesian joint model
In this section, we describe the proposed Bayesian joint model for both clinical decisions and observations. In Section 3.1, we introduce the clinical decision model for follow-up visitation schedules and dosages; in Section 3.2, we introduce the clinical observation model for longitudinal measurements and time-to-event data, which are linked to the decision model through parameter sharing. To facilitate our presentation and readers' understanding, we will use the kidney transplantation example and the DIVAT data to illustrate the model. However, the proposed method is applicable to general medical settings since the patterns that the method can capture are not tied to this particular application.

Modeling clinical decisions
Modeling event data with marker information is important to learn the latent mechanisms that govern the observed stochastic event patterns over time in many domains, such as social science (Butts and Marcum, 2017) and medical analytics (Liu et al., 2018). Marked temporal point processes (Aalen et al., 2008) are a general framework for modeling such event data. Formally, a marked temporal point process is a random process, the realization of which consists of a sequence of events localized in time, i.e., H = {(t 0 , d 0 ), (t 1 , d 1 ), . . . , (t J , d J )} with the occurrence time of event j being t j ∈ R + and d j is the associated mark. In our application, t j represents the time when a patient visits an outpatient center and d j represents the tacrolimus dosage assigned by the physician. The first event is defined as the day of transplantation at t 0 = 0 with an initial dose d 0 .
Denote the event history up to time t to be H t = {(t j , d j ) ∈ H | t j < t}. Under MTPP, the instantaneous rate of the event is characterized by a conditional inten-sity function λ(t), namely λ(t) = lim dt→0 P r{event happens in [t,t+dt)|Ht} dt . Common forms of the conditional intensity function λ(t) include Poisson process (Zhu and Li, 2018), Gamma process (Shibue and Komaki, 2020), Hawkes process (Hawkes, 1971). However, these common models cannot capture complicated patterns in many medical applications. For instance, as shown in Figure 3(a) that plots the empirical intensity of the amount of time between visitations for different ranges of creatinine levels in the DI-VAT data, the elapsed time between follow-up visitations depends on the creatinine level. Also, the empirical intensities of visitations are observed to quickly rise to a peak and then fall down accompanied by moderate oscillations. Such complication is beyond the capacity of the Poisson process that assumes a constant intensity and the Gamma process whose intensity function is monotonic. The Hawkes process assumes that the past events always elevate the intensities of future events and this "self-exciting" effect is additive-it is also apparently not the dynamics that the visitations in the DIVAT data actually follow. We propose a flexible conditional intensity function that also incorporates human intuition: it takes longitudinal clinical measurements into account and captures patients' heterogeneity. Recall that y i,j denotes the logarithm of the creatinine level for patient i at the j-th follow-up visitation occurring at time t i,j (days). Our conditional intensity function makes use of a Gamma density function as follows: The parameter α i,j is patient-specific so that our intensity function λ i is personalized. We set κ = exp(ν 2 ) + 1 > 1 so that the intensity rises to a "global peak" and then decreases: it would eventually approach to the "baseline level" exp(μ) unless the next visitation happens and sets up a new intensity curve. For easy interpretation, we parameterize γ as γ = exp(ν 2 − ν 1 ) such that the "peak time" (i.e., when the peak of the intensity function occurs) can be easily computed as κ−1 γ = exp(ν 1 ). Moreover, since the intensity level often depends on the clinical measurement (e.g., as in Figure 3(a), a higher creatinine level implies a higher intensity), we condition the parameter α i,j , which controls the peak intensity for patient i between time t i,j and t i,j+1 , on the clinical measurement taken at the j-th visitation: This design reflects the human intuition that the time of "next visit" is usually determined based on the clinical measurement of "this visit." Note that our design allows incorporating other covariates (i.e., measurements) by simply augmenting them to the vector (1, y i,j ). Figure 3(b) shows how the visitation intensity under our model is affected by the most recent creatinine level y i,j (and thus the magnitude parameter α i,j ) given a specific set of parameter values.
Next, we model the dosage d i,j at the j-th visitation of patient i as the "mark" of the visitation event. Generally speaking, the physician would assign a dosage based on the patient's current clinical measurement y i,j and potential risk factors x i . We assume the following dosage model reflecting this knowledge: . This Gaussian error assumption is suitable for our dosage data, which has continuous values, but can be easily modified for other dosage/treatment types (e.g., generalized linear regression models for discrete treatment choices). In the kidney transplantation application, we assume that the clinical decision at time t i,j (i.e., assigning a dosage d i,j and scheduling the next visitation time t i,j+1 ) is independent of the patient's history conditional on the measured creatinine at the current visit y i,j . This is reasonable in our application since the duration between two consecutive visitations for patients after kidney transplantation can be months or even longer. However, this assumption can be relaxed as in other popular DTR methods such as Q-learning (Clifton and Laber, 2020) if desired. For instance, the patient's past creatinine levels can be included in the vector (1, y i,j , x i ) of (3.2) to model the effect of past creatinine levels on the dosage. Thus, the probability density of the i-th patient's sequence of visitations and assigned dosages

Linking clinical observations with clinical decisions
In this section, we introduce the proposed Bayesian joint model that links the submodel of the longitudinal measurements and time-to-event data to the MTPP of the clinical decisions by carefully designing parameter sharing in order to capture the mutual influence between clinical observations and decisions. Shortly in Section 4, we will leverage this joint model to optimize clinical decisions with the goal of maximizing patients' survival.
Our clinical observation model is composed of two submodels-a linear mixed effects model for longitudinal clinical measurements (e.g., creatinine levels) and a time-to-event model for patient survival (e.g., graft survival time after kidney transplantation). The two submodels are then connected by sharing random effects (Rizopoulos et al., 2014).
be the underlying true but unobserved longitudinal process at time t ≥ 0. We assume The covariate vectors z i (t) and r i (t) are associated with fixed and random effects respectively: where d i (t) at time t is the dosage assigned by the physician at the most recent visitation, i.e., d i (t) = d i,j for t ∈ (t i,j , t i,j+1 ]. The temporal dependence of z and r on the dosage d captures the drug effect on the longitudinal measurements of interest: in the kidney transplantation application, it is supposed to capture the suppressive effect of tacrolimus on the creatinine level. Denote d i = (d i,0 , . . . , d i,Ji ), the probability of the observed sequence of creatinine measurements y i is Next, we construct the time-to-event submodel depending on the underlying true longitudinal trajectory y * i (t) and the MTPP that models clinical decisions. We consider a Weibull proportional hazards model as follows: 6) where ω is the shape parameter. If desired, more complex survival models can be explored, such as Cox proportional hazard models (Lin and Wei, 1989) and Bayesian nonparametric survival regression models (Xu et al., 2019). The dependence on y * i (t) reflects the domain knowledge that the survival event is usually associated with the underlying health condition reflected by longitudinal measurements. The dosage effect term in (3.6) measures the overall drug effect on the patient: β s2 d i (t) is the "instantaneous" effect while β s3 Tox i (t) is the "accumulated" effect: where the parameter η tox controls the rate of the exponential weighting for the past dosages. In practice, the instantaneous effect is usually beneficial (e.g., tacrolimus reduces the likelihood of graft rejection or death) while the accumulated effect is often toxic (and that is why we name it Tox): e.g., a prolonged high dosage of tacrolimus might have adverse effects on kidneys, central nervous system, and gastrointestinal tract, thereby worsening a patient's survival (Randhawa et al., 1997). We also link the survival submodel with the visitation model by defining α i (t) = α i,j for t ∈ (t i,j , t i,j+1 ] since a high visitation intensity (i.e., larger α i,j ) typically implies a higher risk, e.g., graft failure and thus shorter expected survival time.
Recall that T i and C i denote the survival and censoring times for patient i, respectively. We assume T i = min(T i , C i ) and δ i = 1(T i ≤ C i ). Denote f i (t) and S i (t) to be the corresponding density and survival functions of the hazard function (3.6): We can write the survival likelihood for patient i as In summary, we propose a joint model consisting of an MTPP for clinical decisions including follow-up visitation schedules and dosages, a linear mixed effects model for longitudinal clinical measurements, and a time-to-event model for the patient survival; they are inter-connected by sharing structures and parameters. The joint probability of the clinical observations and decisions can then factor as . We complete the model by imposing the following priors: for conjugacy. We assume a flat prior for Σ b . When conjugacy is unattainable for the visitation and survival parameters, we assume β s1 , β s2 , β s3 , β s4 , h 0 ∼ Normal(β s0 , σ 2 s0 ), η tox ∼ Gamma(π s1 , π s2 ), ω ∼ Gamma(π s3 , π s4 ), μ, ν 1 , ν 2 ∼ Normal(β v0 , σ 2 v0 ), β α ∼ Normal(β α0 , Σ βα ), and ξ ∼ Gamma(π v1 , π v2 ). We carry out posterior inference using the Markov chain Monte Carlo (MCMC) sampler. The details are included in the Supplementary Material Section A (Hua et al., 2021).

Second step: optimize personalized clinical decision
In this section, we propose a policy gradient method using stochastic gradient descent (SGD) (Ruder, 2016) to optimize personalized clinical decision including scheduling a patient' follow-up visitations and prescribing dosages to maximize the patient' health outcome, e.g., the graft survival time in the kidney transplantation application.
Let θ = (ν 1 , ν 2 , μ, β d , σ 2 d ) denote the set of "policy" parameters related to clinical decisions, i.e., the parameters that only appear in the conditional intensity function (3.1) and the mark distribution (3.2), which control patients' follow-up schedules and dosages at follow-up visitations. Let φ = (β s , b i , β l , σ 2 l ) denote the set of "observation" parameters, i.e., all other parameters in the joint model (3.8). Recall that in Section 2, we define the expected reward for any future/hypothetical patient i with baseline covariates x i to be: The expectation is taken over all possible stochastic realizations of (y i , T i , e i,Ti ). In the kidney transplantation application, we define a personalized reward function R i as the log-scaled median survival time to optimize patients' survival: R i = log( T i ), where S i ( T i ) = 0.5. This reward provides computational and variance-reduction advantages over a naive choice of the survival time itself. If desired, other reward functions can be considered. For example, if a physician or patient would like to take into consideration the healthcare cost per visit, we could penalize the number of visitations in the reward function, e.g., R i = log( T i ) + η 0 C i , where η 0 is a tuning parameter and C i is the number of visitations. Our goal is to find, for any patient i with baseline covariates x i , the optimal clinical decision, represented by θ i , to maximize the expected reward G i (θ) after integrating out the uncertainty in the longitudinal process and the survival distribution: where p(e i,Ti | θ) is the probability density of the MTPP.
To find the optimal clinical decision parameter θ i for patient i, we use stochastic gradient descent (SGD) (Robbins and Monro, 1951), i.e., θ i,m+1 = θ i,m + s i,m ∇ θ G i (θ) | θ=θi,m , which requires computing the gradient of the expected reward: ∇ θ G i (θ). As the expectation is taken over realizations of the joint distribution p(y i , T i , e i,Ti | θ, φ, x i ), it is intractable to directly compute ∇ θ G i (θ). Fortunately, we can indirectly compute this gradient by taking the expectation of the reward-weighted gradient of log-policy. Precisely, Ti | θ, φ, x i ), the gradient of the expected reward G i (θ) with respect to θ is: Ti | y i , x i , φ, θ) is the probability of the patient's sequence of visitations and assigned dosages in (3.3).

Proposition 1. For any patient i with baseline covariates x i , given a joint distribution
We leave the detailed proof of Proposition 1 to Supplementary Section B.
According to Proposition 1, in order to compute ∇ θ G i (θ), we first need to be able to sample y i , T i , e i,Ti from p(y i , T i , e i,Ti | θ, φ) and calculate R i from the generated samples. We sample the j-th follow-up visitation time t i,j and the survival time T i using an inverse transform sampling method: first computing the cumulative distribution function (CDF) of the distribution, sampling a random number U from Uniform(0, 1), and then inverting the CDF function at U to yield the visitation/survival time (Giesecke et al., 2011). If the j-th visitation time occurs before the survival time, i.e., t i,j < T i , we sample y i,j and d i,j from their respective distributions and continue to sample the (j+1)-th visitation time and the survival time. We iteratively sample follow-up visitation times, survival times, longitudinal measurements, and dosages until the sampled survival event occurs before the next visitation time. After obtaining samples of y i , T i , e i,Ti , we can easily compute R i . We describe the sampling process for a general R i in Algorithm 1. The algorithm details of sampling y i , T i , e i,Ti , R i for the reward being the log median survival time are provided in Supplementary Section C.
Next we compute the gradient of the log-likelihood of the MTPP, ∇ θ log p(e i,Ti | y i , x i , φ, θ), using the parametrization defined in (3.3). The details are described in Supplementary Section D. Lastly, we integrate out φ in computing ∇ θ G i (θ) using the Monte Carlo method since it is analytically intractable. Suppose that we have K MCMC draws from the posterior distribution of φ and we denote the k-th draw as φ (k) , then ∇ θ G i (θ) can be approximated as follows: K . (4.2) To compute each term of the summation in the numerator of (4.2), we first sample T i , y i , and e i,Ti from p(y i , T i , e i,Ti | θ, φ (k) , x i ) using Algorithm 1 to compute R i for each φ (k) , then multiply the gradient of the log-probabilities of visitation times and dosages under the MTPP policy. We use an adaptive step size algorithm and choose the step size to be: The entire SGD algorithm for finding the optimal parameter θ i is described in Algorithm 2, where G i (θ i,m ) denotes the expected reward in iteration m. We select the optimal policy θ i to be the one yielding the highest expected reward across all iterations. Note that, in the step 7 of Algorithm 2, we subtract the average reward from each individual reward: this "baseline subtraction" trick significantly reduce the variance while still yielding an unbiased estimate of the gradient (Williams, 1992;Greensmith et al., 2004).  (t i,1 ), . . . , y i (t i,Ji )) denote the simulated follow-up schedules, dosages, and longitudinal data over J i visitations until the survival time, T i for any patient n with covariates x i .

Simulation study
To demonstrate the advantage of the proposed Bayesian joint model, we compared it to an alternative model that breaks the connection between longitudinal and survival processes. Furthermore, to illustrate the benefit of optimizing the personalized clinical decision, we compared the expected reward under the estimated optimal clinical decision to alternative strategies of scheduling follow-up visitations on a regular basis, e.g., every three months (Israni et al., 2014).

Simulation setup
We simulated a dataset mimicking the DIVAT dataset composed of longitudinal creatinine measurements, follow-up schedules, tacrolimus dosages, and survival events for I = 500 patients. We considered three baseline covariates in x i : donor age (AgeD), delayed graft function (DGF), and body mass index (BMI). DGF is a binary variable with 1 indicating that the patient used dialysis within the first week of the trans-Algorithm 2 Stochastic Gradient Descent for optimizing θ for any patient i.
The simulated dataset had a total of 14,395 follow-up visitations for 500 patients with a 10.8% censoring rate. The median survival time was 1,684 days with the shortest being 24 days and the longest being 10,016 days. Supplementary Figure S1 plots the simulated longitudinal creatinine levels and follow-up schedules with dosages for four randomly selected patients.

Results: model fitting
We applied the proposed Bayesian joint model to the simulated dataset. The hyperparameters were set to be β d0 = β l0 = β α0 = 0, Σ β d = Σ β l = Σ βα = 100 2 I, π d1 = π d2 = π l1 = π l2 = π s3 = π s4 = 0.01, π s1 = π s2 = 0.01, β s0 = β v0 = 0, σ 2 s0 = σ 2 v0 = 100 2 , π v1 = 400, π v2 = 200. We ran 20,000 MCMC iterations with an initial burn-in of 5,000 iterations and a thinning factor of 50. The convergence was assessed using R package coda, including traceplots of the post-burn-in MCMC samples for some randomly selected parameters (Supplementary Figure S2), showing no issues of non-convergence. We first report on the performance of the proposed joint model in terms of parameter estimation. Supplementary Figure S3 plots the 95% estimated credible intervals (CIs) for selected parameters, showing that all 95% CIs are centered around the simulated true values. As another metric of performance, we computed the mean squared error (MSE) taken as the averaged squared errors between the post-burnin MCMC posterior samples and the simulated true values. Supplementary Table S1 summarizes the MSE and the standard deviation of squared errors, indicating that the proposed joint model can accurately estimate parameters.
As the proposed model represents the first effort in the literature to jointly model clinical decisions, longitudinal markers, and the survival event, there is no existing method we can compare with. To demonstrate the advantage of jointly modeling longitudinal creatinine levels and the survival event, we compared the proposed model with an alternative "separate longitudinal and survival (SLS)" model that breaks the connection between the longitudinal and survival submodels by replacing the process y * i (t) with the observational data y i (t) in the hazard model (3.6). We first compared the two models by checking their model adequacy using the Watanabe-Akaike information criterion (WAIC) (Watanabe and Opper, 2010): the joint model has a WAIC value of 226,982 while the SLS model has a WAIC of 226,992, indicating that the proposed joint model fits data slightly better. Furthermore, we compared the two models in terms of parameter estimation.

Results: personalized optimal clinical decision estimation
We applied the proposed policy gradient method in Section 4 to the simulated dataset to estimate the personalized optimal clinical decision that maximizes one patient's graft median survival time, i.e., R i = log( T i ), where T i is the median survival time of any patient n. The starting parameter values θ 0 in Algorithm 2 were set to be the estimated posterior means of these parameters from posterior inference, which can be considered as the estimates of how physicians treated patients in the simulated data. Therefore, the goal of the optimization procedure is to improve physicians' current treatment strategy in terms of prolonging patients' survival.
We implemented Algorithm 2 with M = 1000 steps to estimate the personalized optimal parameter θ i for two randomly selected patients, denoted as S1 and S2. Patient S1 had a DGF of 0, donor age of 54.2 years, and BMI of 24, while patient S2 had a DGF of 1, donor age of 37.4 years, and BMI of 24.8. Figure 4(a, b) plots the expected mean reward versus SGD iterations. Note that the SGD procedure in the context of reinforcement learning usually has high variability, which is alleviated by our variance reduction method in Algorithm 2 as shown in Figure 4. For patient S1, the expected mean reward increases from an initial value of 7.65 to its maximum in the SGD, 7.69, which corresponds to a predictive median survival time of 2,209 days, a 4.6% increase from its initial value 2,111. For patient S2, the expected mean reward goes from an initial value of 7.69 to a maximum at 7.76. This corresponds to the predictive median survival time increasing from 2,203 days to 2,383 days, an 8.2% improvement.
To further interpret the estimated optimal "policy" parameters for patients S1 and S2, we compared the initial parameter values of the SGD-posterior means, with the optimized values by the SGD in Table 2. Recall that the dosage model is d i,j = (1, y i,j , x i )β d + ζ i,j . Denote β d = (β d1 , β d2 , . . . , β dL ) T , where L is the dimension of β d . Since x i denotes the baseline covariate and does not change over time, we define the personalized dosage intercept to beβ d = (1, x i )(β d1 , β d3 , . . . , β dL ) T so that optimizing β d is equivalent to optimizing (β d2 ,β d ). As shown in Table 2, the optimized dosage parametersβ d and β d2 for patient S1 were lower than the estimated posterior means, indicating that patient S1 would benefit from a lower dosage for the same creatinine level compared to the observed dosages. In contrast, the optimalβ d and β d2 were higher than the posterior means for patient S2, indicating the preference for higher dosages. The optimal dosage errors, σ 2 d , for both patients were significantly lower than the initial value, indicating that a lower variance in the dosing procedure would benefit patient survival. The optimal baseline visitation intensity μ and the peak time parameter ν 1 were both roughly the same as their posterior means, indicating that the simulated follow-up schedules were close to optimal. However, the visitation intensity shape parameter ν 2 increased from 1.464 to 1.778 and 2.008 for patients S1 and S2 respectively and thus implies a higher intensity around the peak time ν 1 : intuitively, the optimized policy learns to be more certain about the "optimal peak time." In addition, to illustrate the advantage of optimizing both follow-up schedules and dosages, we compared our results to alternative strategies based on regular visits. As studied in Israni et al. (2014), during the first year post-transplant, patients were most frequently seen every 1 month or 3 months, depending on their physicians. After the Figure 4: Panels (a, b) plot the expected mean reward versus SGD iterations for two randomly selected patients S1 and S2. Panels (c, d) plot the density of the predictive median survival times under our method and the three alternative strategies for patients S1 and S2.
first year, stable patients were most frequently referred back between 4-6 months but the follow-up frequency was reported to vary from 0-12 months. We considered three alternative follow-up strategies: recommend patients to follow up every 1 month, 3 months, and 6 months. The dosages at follow-up visitations were still optimized in the same way as the proposed joint model with the policy gradient method. Figure 4( c, d) show the density plots of 100 realizations of the predictive median survival times under our method and the three alternative strategies for patients S1 and S2. Comparing the predictive median survival times under the three regular visitation strategies,  we can see that more frequent visitations yield longer median survival times. The optimized visitation schedule under the proposed method outperforms the three alternative strategies although it yields a similar overall visitation frequency with the strategy of "regular visits every 3 months" (not shown), highlighting the importance of optimizing visitation schedules based on longitudinal clinical measurements to prolong patients' survival.

Application: DIVAT data analysis
We extracted data from Nantes University Hospital Centers in the DIVAT cohort (www.divat.fr), yielding a total of I = 947 patients who received a first or second renal graft transplanted from a living or heart-beating deceased donor between 2000 and 2014. All patients in the dataset received an initial maintenance therapy with tacrolimus and did not experience graft failure or death during hospitalization. Immediately after transplantation, several covariates as risk factors for graft failure were collected: donor age (AgeD), recipient age (AgeR), delayed graft function (DGF) defined as the indicator of the use of dialysis within the first week of transplant (1=used dialysis, 0=didn't use dialysis), diabetes history (Diab) with 1 indicating the patient has a history of diabetes and 0 otherwise, type of donor (Type), and body mass index (BMI). There were two types of donors: donation after brain death but with heart beating (Type=1) and donation by a living donor (Type=0). Supplementary Table S2 summarizes patients' characteristics at baseline immediately after transplantation. For each patient, longitudinal data were collected from the date of transplantation until the graft failure or being censored. At each follow-up visitation, the creatinine level and tacrolimus dosage were recorded. The next follow-up visitation time was determined by the physician.

Experimental results: model fitting
We first applied the proposed Bayesian joint model to the DIVAT data with The hyperparameters were set to the same as in the simulation study. We ran a total of 20,000 MCMC iterations with an initial burn-in of 5,000 iterations, and a thinning factor of 50. The convergence was assessed using R package coda and the trace plots for randomly selected parameters were shown in Supplementary Figure S4, showing no issues of non-convergence.
We plot the estimated posterior means with 95% CIs for some selected parameters in the dosage, longitudinal, and survival submodels in Figure 5. Figure 5(a) plots posterior means of the linear coefficient β d with respect to the creatinine level and baseline covariates in the dosage model. DGF was negatively associated with the dosage, indicating that patients who used dialysis within the first week of transplant were likely to be assigned lower dosage levels. In contrast, BMI was positively associated with the dosage since bodyweight-based dosing of tacrolimus is the standard care for patients after transplantation (Andrews et al., 2017). Diabetes history was positively associated with the dosage. While the effect of diabetes on tacrolimus was not well characterized in the literature, Mendonza et al. (2007) showed that the time to maximum concentration of tacrolimus in the pharmacokinetics study was significantly longer in diabetics versus nondiabetics. Furthermore, donor type also increased the dosage level, indicating that patients who received kidney from a non-living donor were more likely to be assigned higher dosages compared to that from a living donor. Figure 5(b) plots the estimated posterior means with 95% CIs for the fixed-effects regression coefficients with respect to the most recent tacrolimus dosage and baseline covariates in the longitudinal model (3.4). The dosage, donor age, DGF, BMI, and donor type were positively associated with the creatinine level, which agreed with findings in the literature (Katari et al., 1997;Gerchman et al., 2009;Foucher et al., 2016). In contrast, the recipient age was negatively associated with the creatinine level, suggesting that younger patients tend to have lower creatinine levels (Maraghi et al., 2016). Diabetes history also decreased the creatinine level. Hjelmeseth et al. (2010) showed that a low creatinine was associated with type 2 diabetes in a cross-sectional study. The estimated posterior means and 95% CIs for selected survival submodel parameters are plotted in Figure 5(c). The posterior mean of the parameter corresponding to the tacrolimus dosage was positive while that corresponding to the toxicity was negative, suggesting that a higher tacrolimus drug reduces the hazard but the accumulated toxicity increases the hazard. These results were consistent with findings in Randhawa et al. (1997) and Böttiger et al. (1999), who reported nephrotoxicity caused by long-term high dosages of tacrolimus.

Experimental results: personalized optimal clinical decision estimation
Next, we applied the proposed policy gradient method to estimate the personalized optimal clinical decision in terms of maximizing a patient's median survival time. We initialized the parameters in Algorithm 2 by setting θ 0 to be their posterior means. Algorithm 2 was implemented with M = 1000 steps to estimate θ i for two randomly selected patients, denoted as R1 and R2. Patient R1 at transplantation was 60 years old with a BMI of 17, no history of diabetes, no DGF, and received donation from a 61-year-old non-living donor. Patient R2 at transplantation was 28 years old with a BMI of 25.5, no history of diabetes, no DGF, and received a kidney from a living 29year-old donor. Patient R1 had an observed survival time of 1,527 days, while patient R2 had a censored survival time of 4,487 days. Supplementary Figure S5 plots the predictive median survival times across SGD iterations for the two patients. Patient R1's predictive median survival time increased from 1,793 to 1,895 days at the maximum, a 5.7% improvement; while patient R2's predictive median survival time increased from 5,191 to 5,628, an 8.4% gain.
To further interpret the estimated optimal parameters in clinical decisions, we compared their initial values with the optimized values in Table 3. In practice, these optimized parameters can be interpreted to guide clinical decisions. Patient R1's optimal dosage parameters,β d and β d2 , were higher than their posterior means, suggesting that assigning a higher dosage level compared to what the physician actually did for the same creatinine level would improve his/her survival outcome. Specifically, the recommended dosage for patient R1 based on the optimizedβ d and β d2 is the baseline of 2.788 plus an additional 0.076 units of tacrolimus for each unit of creatinine recorded at the most recent followup visitation. On the other hand, patient R2's optimal dosage parameters were both lower than the initial values, so lower dosage levels are recommended. The optimal dosage errors, σ 2 d , for both patients were significantly lower than the initial value, meaning that the optimized policy is more certain about its dosing decisions so the variance is lower than the observed data. Therefore, a decrease in the optimized dosage error term can be implemented in practice by closely adhering to the mean dosage amount suggested by the optimized dosage parameters. The optimal baseline visitation intensities μ for both patients were lower than the initial value, indicating that they should be instructed to visit less often without the knowledge of their creatinine measurements. Their optimized visitation intensity peak times were lower than the posterior mean, indicating that they should be scheduled more frequent follow-ups when their creatinine levels are high. Furthermore, the visitation intensity shapes were significantly higher than the initial value so the optimized policy is more certain about the optimal peak time for visitation schedules.  Table 3: DIVAT data: optimal parameters estimated by the policy-optimizing method.

Ablation study: optimizing time or dosage or both
Moreover, to demonstrate the benefit of optimizing the follow-up visitation schedules and dosages together, we compared the predictive median survival times under the non-optimized initial policy (Non-Opt.) with three versions of optimized policies: 1) only visitation schedules are optimized (Opt. Visits); 2) only dosages are optimized (Opt. Dosage); and 3) both visitation schedules and dosages are optimized (Opt. Both). Specifically, Non-Opt. used the parameters estimated from the proposed Bayesian joint model, mimicking what physicians did as collected in the DIVAT dataset; Opt. Visits used the optimized parameters from the SGD in the visitation model (3.1) and the non-optimized parameters in the dosage model (3.2); Opt. Dosage used the optimized parameters from the SGD in the dosage model and the non-optimized parameters in the visitation model; Opt. Both is the fully optimized model obtained in Section 6.2. Figure 6 plots boxplots for 100 realizations of the predictive median survival times under each of the four policies. Both patients benefit more from optimizing dosages relative to optimizing visitation schedule. The visitation schedule optimization accounts for more improvement in prolonging the survival for patient R1 compared to R2 because, as shown in Table 3, there was a larger difference between the optimal parameter values (μ and ν 1 ) in the visitation model and their initial values for patient R1. The optimized visitation schedule for both patients, as we have discussed in Section 6.2, suggested slightly fewer visits overall, but more frequent visits when their creatinine levels are high. Comparing the four policies, we can see that optimizing both visitation schedules and dosages is clearly beneficial to these patients, thus empirically strengthening the motivation of our work.

Conclusion
In this work, we developed a two-step Bayesian approach to optimize clinical decisions with timing. Firstly, we proposed a Bayesian joint model for clinical observations (e.g., longitudinal measurements and survival time) and clinical decisions (e.g., follow-up visitation schedules and dosage assignments). The model components are connected by sharing certain structures and parameters in order to capture the mutual influence between the clinical observations and decisions. Moreover, we proposed a policy gradient method that optimized the personalized clinical decision for better survival, while parameter uncertainties in the clinical observation model are considered. Through simulation studies, we demonstrated that the optimized clinical decision obtained from the proposed approach yields longer predictive median survival times compared to scheduling follow-up visitations on a regular basis that is commonly used in caring for patients with chronic conditions nowadays. The analysis of the DIVAT data yields meaningful and interpretable results, showing that the proposed method has the potential to assist physicians' decisions on personalized treatment. In addition, we have built an R package doct so that users can apply the proposed method to datasets in a similar setup that involves longitudinal decision making and an objective reward to optimize.
There are several potential extensions. Firstly, we consider one longitudinal measurement in the longitudinal process of the joint model. There could be other time-varying measurements affecting the clinical decision and survival. In our kidney transplantation application, besides creatinine levels, there are other longitudinal measurements recorded such as proteinuria, which represents having protein in the urine and can be an early sign of kidney disease. The proposed method can be extended to incorporate other longitudinal measurements by replacing the model in (3.4) with a multivariate mixed effects model (Chi and Ibrahim, 2006). Secondly, the proposed Bayesian joint framework that models both clinical decisions and observations relies on certain parametric assumptions that are suitable for our kidney transplantation application. However, these assumptions (e.g., the proportional hazard assumption in the survival submodel) may not hold for other medical applications. It will be straightforward to extend the proposed Bayesian model to more flexible models such as replacing the current prior distributions with a Bayesian nonparametric prior (e.g., the Dirichlet process, Ferguson (1973)), or considering other nonparametric models (e.g., neural networks). Thirdly, the proposed MTPP assumes stationarity in clinical decisions since it yields the same distribution over visitation timing and dosages given the same history. Although the stationarity is reasonable in the kidney transplantation application, our approach can be easily extended to allow non-stationary clinical decisions, by explicitly incorporating a timevarying term (e.g., exp(ct) where c ∈ R is a coefficient parameter) in (3.1). In addition, the policy gradient method is used to optimize personalized clinical decisions in this paper. Other optimization methods such as natural evolution strategies (Wierstra et al., 2014) will be explored in the second step of the proposed approach. Lastly, patients with chronic conditions may take multiple medicines, e.g., mycophenolate mofetil (an immunosuppressive drug) and steroids along with tacrolimus in our kidney transplantation application. Modeling the effects of multiple types of drugs (and their interactions with clinical observations) and learning their optimal dosage-assigning policies in the proposed optimization method will be an interesting and challenging research topic.

Supplementary Material
Supplement for "Personalized Dynamic Treatment Regimes in Continuous Time: A Bayesian Approach for Optimizing Clinical Decisions with Timing" (DOI: 10.1214/21-BA1276SUPP; .pdf).