Bayesian Cox Regression for Large-scale Inference with Applications to Electronic Health Records

The Cox model is an indispensable tool for time-to-event analysis, particularly in biomedical research. However, medicine is undergoing a profound transformation, generating data at an unprecedented scale, which opens new frontiers to study and understand diseases. With the wealth of data collected, new challenges for statistical inference arise, as datasets are often high dimensional, exhibit an increasing number of measurements at irregularly spaced time points, and are simply too large to fit in memory. Many current implementations for time-to-event analysis are ill-suited for these problems as inference is computationally demanding and requires access to the full data at once. Here we propose a Bayesian version for the counting process representation of Cox's partial likelihood for efficient inference on large-scale datasets with millions of data points and thousands of time-dependent covariates. Through the combination of stochastic variational inference and a reweighting of the log-likelihood, we obtain an approximation for the posterior distribution that factorizes over subsamples of the data, enabling the analysis in big data settings. Crucially, the method produces viable uncertainty estimates for large-scale and high-dimensional datasets. We show the utility of our method through a simulation study and an application to myocardial infarction in the UK Biobank.


1. Introduction.
A fundamental aspect of epidemiology is to identify and study the factors driving disease. One early example is the Framingham Heart Study, which shaped our understanding of cardiovascular outcomes and established risk factors like blood pressure and cholesterol that have become routine in standard health assessments (Dawber, Meadors and Moore, 1951).
The possibilities and availability for this type of analysis have increased dramatically with the profound transformation of the biomedical sector in recent decades through digitization and the unprecedented scale of data generation. Electronic health records (EHR), containing the medical history of individuals systematically collected for millions of people and sometimes whole populations, provide new opportunities for how we can approach diseases.
Some recent studies utilizing large-scale EHR with around 1-20 million individuals include: Clift et al. (2020) and Williamson et al. (2020) studying the risk factors associated with severe outcomes for COVID-19, Hippisley-Cox, Coupland and Brindle (2017) predicting the risk of cardiovascular disease, and Hippisley-Cox and Coupland (2021) estimating the risk of prostate cancer in asymptomatic men.
The statistical method underlying these studies is the Cox partial likelihood, introduced by Cox (1972, 1975), as the main objective of interest is the time passed until the occurrence of an event.
While estimation with a simple Cox model is possible in large-scale settings, as the aforementioned examples show, it becomes infeasible for more complex analyses often encountered in EHR, e.g. high-dimensional data, time-varying covariates, and multiple events. In many cases, one is interested in the evolving status of patients, particularly when multiple measurements over time are available, highlighting the need for a version of the Cox model that scales to the vast amounts of data while simultaneously being able to handle many of the statistical challenges arising in EHR data analysis.
1.1. Related Work. A comprehensive treatment of Bayesian methods for time-to-event analysis can be found in Ibrahim, Chen and Sinha (2001), with applications in Alvares et al. (2021). Most implementations of the Cox model explicitly estimate the baseline hazard through a non-parametric prior, with choices including the Gamma process (Kalbfleisch, 1978; Sinha, 1993), the Beta process (Hjort, 1990; Laud, Damien and Smith, 1998), or correlated prior processes (Qiou, Ravishanker and Dey, 1999), but also spline functions (Sharef et al., 2010). While specifying the baseline hazard has advantages, like absolute risk prediction, the traditional Cox model circumvents its explicit estimation. Hjort (1990) and Sinha, Ibrahim and Chen (2003) provide justifications for a Bayesian version of the Cox model without the need to estimate the baseline hazard, closely resembling the traditional Cox model. Recent Bayesian methods for variable selection in high-dimensional data include Shin, Bhattacharya and Johnson (2018) and Nikooienejad, Wang and Johnson (2020); however, they have mostly focused on fixed covariates. Generally, none of the Bayesian methods have focused on large-scale applications.
From a frequentist point of view, in the context of high-dimensional data, the Lasso or ℓ1 regularization proposed by Tibshirani (1996, 1997) has been a popular choice. An overview of time-to-event analysis with high-dimensional data more generally can be found in Witten and Tibshirani (2010). Computationally efficient implementations for the Cox model under penalization have been developed by Friedman, Hastie and Tibshirani (2010), Simon et al. (2011), Mittal et al. (2014) and Yang and Zou (2013). Similar to the Bayesian versions, these methods have mainly focused on fixed covariates, while the R software package R-glmnet by Friedman, Hastie and Tibshirani (2010) and Simon et al. (2011) has recently added support for a Cox model based on the counting process representation in combination with the Lasso, enabling the inclusion of time-varying covariates. However, these methods require the full data for optimization, limiting their application to moderately sized datasets.
More recently, approaches that factorize the likelihood over subsamples of the data have been proposed, facilitating time-to-event analysis at large scales. Li et al. (2020) propose a solution based on the batch screening iterative lasso algorithm as an extension of the traditional Lasso. Kvamme, Borgan and Scheel (2019) suggest a modified version of the Cox model based on a hypothetical weighting of the likelihood to justify training on subsamples via stochastic gradient descent (SGD). Tarkhan and Simon (2020) provide a more general treatment of the combination of Cox's partial likelihood with SGD.
While these methods can handle large-scale datasets, they cannot incorporate time-varying covariates measured at different time points, nor do they provide standard errors or other measures of uncertainty. The last method does propose a plug-in estimator or the bootstrap for standard errors; however, it is not clear how these will scale to large-scale and high-dimensional datasets.
A divide-and-conquer approach to the Cox model has been proposed in Wang et al. (2021) that circumvents some of the aforementioned limitations. However, the algorithm requires n ≫ p and is generally limited to a moderate number of covariates. Given the extent of information available in EHR, the need for a method that can not only scale with the number of individuals but also explore and utilize high-dimensional covariate sets is paramount.
1.2. Main Contributions. In this study, we propose a Bayesian Cox model for big data settings based on the flexible counting process representation introduced in Andersen and Gill (1982). The advantages of the counting process representation are described in Therneau and Grambsch (2000); it enables fitting models with time-varying covariates, complex censoring and truncation patterns, multiple time scales, multiple events, and marginal or conditional models for correlated data. By combining our model with a reweighting of the log-likelihood and stochastic variational inference (SVI), we can decouple the sample size from our estimation procedure, enabling inference via subsampling of the full data and hence analyzing data that cannot fit into memory at once, e.g. EHR. By using a sparsity-inducing prior, we can also do inference in high-dimensional data. Our approach enables the inclusion of time-varying covariates measured at irregularly spaced time points, as well as viable uncertainty estimates for large-scale and high-dimensional datasets. The main contributions of this paper are:
• Large-scale implementation of a Bayesian version of Cox's partial likelihood based on a counting process representation.
• Reweighting of Cox's partial log-likelihood for subsampled data to enable stochastic variational inference.
• Viable uncertainty estimates for large-scale and high-dimensional datasets.
• Simulation algorithm for discrete event times with time-varying covariates resembling EHR.
• Application of the proposed method to myocardial infarction in the UK Biobank. This comprises a Cox regression for ≈400,000 individuals with ≈1.5 million observations and ≈1,000 time-varying covariates.
The rest of the paper is structured as follows: In Section 2 we describe the method in detail. A comprehensive simulation study resembling EHR data with a particular focus on time-varying covariates is provided in Section 3. A modified algorithm to efficiently simulate large-scale and high-dimensional event time data is described in Section 3.1. The results of the simulation are given in Sections 3.3 and 3.4, where we compare our method with other Cox model implementations. Then, the method is evaluated on real-world applications in Section 4. Section 4.1 shows the model performance on a standard dataset provided in the R-Survival package. Section 4.2 is a small case study on the risk of myocardial infarction in the UK Biobank. Section 5 discusses some limitations and future directions, while Section 6 concludes the paper.

2. Methods.
A typical analysis with EHR concerns the emergence of a disease in the population, where one wants to understand the effect of other comorbidities and potential biomarkers on it. The main interest is the time passed until the onset of a particular event, in this case a disease, denoted here as the event time T_i for an individual i.
Generally, one wants to understand how covariates may increase or decrease the risk of the event, consequently accelerating or decelerating the event times. The covariates are allowed to vary over time, e.g. the evolving set of diseases an individual acquires or updated laboratory measurements, denoted as a column vector X_i•(t) = (X_i1(t), . . . , X_ip(t))' of the p covariate processes.
Typically, we have incomplete data, like censoring, truncation, or filtering. Here, we assume for simplicity independent right censoring, e.g. a physical time limit of observations in EHR. The proposed method, however, can readily be used in conjunction with more complex missing data patterns. Formally, we have an individual censoring time u_i such that we only observe y_i = min{T_i, u_i}. The observed data for n individuals is then D = {(y_i, δ_i, X_i•(t)) : i = 1, 2, . . . , n}, with δ_i = 1[T_i = y_i] indicating whether the event was observed.
Some important concepts from time-to-event analysis, assuming continuous time, are the survival function S(t) = P(T > t), i.e. the probability that the event has not happened by t, and the hazard function α(t) = lim_{∆t→0} (∆t)^{−1} P(T ≤ t + ∆t | T > t) = f(t)S(t)^{−1}, i.e. the instantaneous risk of observing an event given the event has not happened yet, where f(t) is the density function. The cumulative hazard is given as A(t) = ∫_0^t α(u) du and is linked to the survival function through S(t) = exp(−A(t)).
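As a quick numerical illustration of these identities (a sketch, not part of the paper; the helper names are ours), the relation S(t) = exp(−A(t)) can be checked by integrating a hazard function directly:

```python
import math

def cumulative_hazard(alpha, t, steps=100000):
    """Numerically integrate a hazard function: A(t) = integral of alpha(u) on [0, t]."""
    dt = t / steps
    # midpoint rule over a fine grid
    return sum(alpha((k + 0.5) * dt) for k in range(steps)) * dt

def survival(alpha, t):
    """Survival function via the identity S(t) = exp(-A(t))."""
    return math.exp(-cumulative_hazard(alpha, t))

# constant hazard alpha(u) = 0.1 gives A(t) = 0.1 t and S(t) = exp(-0.1 t)
S = survival(lambda u: 0.1, 5.0)
```

For the constant hazard above, S equals exp(−0.5), matching the closed form.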
2.1. Cox's partial likelihood based on counting processes. For the observational model we define a multivariate counting process for the n individuals in the EHR as N(t) = (N_1(t), N_2(t), . . . , N_n(t))', with individual counting processes N_i(t) counting the occurrences of events for individual i with jump sizes ∆N = 1, assuming that no two processes jump simultaneously at t (no ties). In principle N_i(t) can count multiple events, e.g. recurrent cancer diagnoses; however, we treat it here as if only a single event occurs, without loss of generality. The intensity process, the rate at which we expect events to happen, for N_i(t) is linked to the hazard function under the Cox model by

λ_i(t) = I_i(t) α_0^γ(t) exp(θ' X_i•(t)),

where I_i(t) is a predictable process with values in {0, 1}, describing whether individual i is at risk of observing the event at t. The baseline hazard α_0^γ(t), defined by the possibly infinite-dimensional parameter γ, determines the rate at which events occur without any covariate effects. The p-dimensional regression parameter vector θ describes the effect of the time-varying covariates, which act multiplicatively on the baseline hazard.
The likelihood L for independent right-censored event times is the combined effect of the density function for individuals observing events and the survival function for censored individuals, which can naïvely be written by means of the hazard and survival function as

L = ∏_{i=1}^n f(y_i)^{δ_i} S(y_i)^{1−δ_i} = ∏_{i=1}^n α(y_i)^{δ_i} S(y_i).

The likelihood for γ and θ under the counting process representation of the Cox model follows as

(1) L(γ, θ) = ∏_{t∈T} ∏_{i=1}^n [I_i(t) α_0^γ(t) exp(θ' X_i•(t))]^{∆N_i(t)} exp(−∑_{i=1}^n ∫_0^τ I_i(u) exp(θ' X_i•(u)) dA_0^γ(u)),

where T is the set of event times and A_0^γ(t) is the cumulative baseline hazard. As we are mainly interested in θ, it is possible to derive a profile likelihood of the form

(2) L(θ) = ∏_{t∈T} ∏_{i=1}^n [ exp(θ' X_i•(t)) / ∑_{j=1}^n I_j(t) exp(θ' X_j•(t)) ]^{∆N_i(t)},

which is equivalent to the partial likelihood specified by Cox (1972, 1975). For details on how to derive the profile/partial likelihood see Andersen and Gill (1982). The log-likelihood follows as

(3) log L(θ) = ∑_{i=1}^n ∫_0^τ [ θ' X_i•(t) − log( ∑_{j=1}^n I_j(t) exp(θ' X_j•(t)) ) ] dN_i(t).

For a more detailed derivation of Cox's partial likelihood under a counting process representation we refer to Andersen et al. (1993).
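The partial log-likelihood can be sketched in code. The following minimal Python/NumPy version is our illustration, not the paper's implementation: covariates are held fixed for brevity (the paper allows X_i•(t) to vary over time), and the risk set is recomputed at each event time:

```python
import numpy as np

def cox_partial_loglik(theta, X, time, event, entry=None):
    """Cox partial log-likelihood (counting process view, no ties).
    X: (n, p) covariates (held fixed here for simplicity),
    time: (n,) observed times, event: (n,) 0/1 event indicators,
    entry: (n,) left-truncation/entry times, defaulting to 0."""
    if entry is None:
        entry = np.zeros_like(time)
    eta = X @ theta                       # linear predictors theta' x_i
    ll = 0.0
    for i in np.flatnonzero(event):
        t = time[i]
        # risk set I_j(t): individuals under observation just before t
        at_risk = (entry < t) & (time >= t)
        ll += eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return ll
```

With θ = 0 the contribution at each event time reduces to −log(size of the risk set), which makes the function easy to sanity-check by hand.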
2.2. Reweighting of Cox's partial likelihood. As can be seen in Equation (2), the partial likelihood cannot be factorized over the data, as the denominator contains a sum over the full data. Hence, to use subsamples D* ⊂ D for inference in large-scale datasets we need to reweight the likelihood, or more precisely the log-likelihood from Equation (3), appropriately. We reweight the log-likelihood as it will be the main objective for inference. To achieve a good approximation we propose the following adjustments to the partial log-likelihood (3):

(4) log L(θ) ≈ w_1 ∑_{i∈D*} ∫_0^τ [ θ' X_i•(t) − log( w_2 ∑_{j∈D*} I_j(t) exp(θ' X_j•(t)) ) ] dN_i(t),

where n* ≪ n, w_1 = ∫_0^τ 1' dN(u) ( ∫_0^τ 1' dN*(u) )^{−1}, i.e. the ratio of events, and w_2 = n(n*)^{−1}, i.e. the ratio of observations, with N* representing the multivariate counting process corresponding to the n* subsampled individuals. The motivation for the reweighting is given in Section 1 of the supplementary materials. In particular, the last approximation, i.e. the denominator of the partial likelihood, may introduce a bias. However, we show through simulations in Section 1 of the supplementary materials that the possible bias is very small for a variety of simulation and optimization settings.
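A minimal sketch of the reweighted objective in Equation (4), again with fixed covariates and all individuals at risk from time zero for brevity (function and argument names are ours): the event term is scaled by w_1 and the risk-set sum by w_2, so that when the batch equals the full data the expression reduces to the ordinary partial log-likelihood:

```python
import numpy as np

def reweighted_loglik(theta, Xb, time_b, event_b, n_total, d_total):
    """Subsample approximation of Cox's partial log-likelihood in the
    spirit of Equation (4). Xb, time_b, event_b describe a batch of
    n* individuals; n_total is n and d_total the total event count."""
    n_batch = len(time_b)
    d_batch = int(event_b.sum())
    w1 = d_total / max(d_batch, 1)     # ratio of events
    w2 = n_total / n_batch             # ratio of observations
    eta = Xb @ theta
    ll = 0.0
    for i in np.flatnonzero(event_b):
        at_risk = time_b >= time_b[i]  # batch-level risk set at t_i
        ll += eta[i] - np.log(w2 * np.exp(eta[at_risk]).sum())
    return w1 * ll
```

When the "batch" is the full dataset, w_1 = w_2 = 1 and the function returns the standard partial log-likelihood, which gives a convenient consistency check.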

2.3. Bayesian Cox Model.
To use the partial likelihood defined in Equation (2) for probabilistic inference, further justification is needed, as it is not necessarily a likelihood in the classical sense. Kalbfleisch and Prentice (1973) have shown that the partial likelihood can be interpreted as proportional to the marginal distribution of the rankings of the event times; however, the derivation does not include time-varying covariates. Kalbfleisch (1978) provides a Bayesian justification based on an underlying diffuse Gamma process for A_0^γ, showing that the partial likelihood is the limiting marginal posterior distribution. This approach is further extended by Sinha, Ibrahim and Chen (2003), encompassing a large class of models, including (external) time-varying covariates. We show through simulations in Section 3 that this provides valid estimates even for discretized internal time-varying covariate processes. Hjort (1990) provides additional justifications for the partial likelihood based on Beta and Dirichlet processes.
As prior p(θ) we use either the Normal distribution, for a moderate number of covariates, or, in the case of many covariates and high-dimensional data, the Student t_ν distribution with ν degrees of freedom to induce sparsity. The posterior distribution follows as

(5) p(θ | D) ∝ L(D | θ) p(θ).

2.4. Variational Inference. While it is possible to design an efficient accept/reject algorithm to do MCMC, we recast the inference problem as optimization. Precisely, we define a family G of variational distributions and try to find the member q(θ) of the family that most closely approximates the posterior distribution from Equation (5). Closeness is measured by the Kullback-Leibler (KL) divergence KL(q(θ) ‖ p(θ | D)). Generally, it is not possible to optimize the KL divergence directly; instead we maximize the evidence lower bound (ELBO)

ELBO(q) = E_q[log L(D | θ)] + E_q[log p(θ)] − E_q[log q(θ)],

where log L(D | θ) is our reweighted version from Equation (4), p(θ) the prior density, and q(θ) the variational distribution. Our final objective for inference consequently becomes

q*(θ) = argmax_{q∈G} ELBO(q).

An explicit formulation of the objective function and an outline of the variational inference procedure can be seen in Section 2 of the supplementary materials. Andersen and Gill (1982) have shown that the estimator of θ under the Cox model is asymptotically Normal; hence, we propose a Multivariate Normal family for G when the number of covariates is moderate, which should give a reasonable approximation, especially in large-scale settings.
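The ELBO above can be estimated by Monte Carlo, as in black-box/ADVI-style methods. A minimal sketch with a mean-field Normal family (an illustration under simplifying assumptions, not the paper's implementation, which uses the multivariate and low-rank families described here):

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_mc(loglik, logprior, mu, sigma, n_mc=500):
    """Monte Carlo estimate of ELBO = E_q[log L + log p - log q] for a
    mean-field Normal variational family q = Normal(mu, diag(sigma^2)).
    loglik and logprior are user-supplied log densities of theta."""
    theta = mu + sigma * rng.standard_normal((n_mc, len(mu)))
    # log q(theta) under the factorized Normal family
    log_q = (-0.5 * ((theta - mu) / sigma) ** 2
             - np.log(sigma) - 0.5 * np.log(2 * np.pi)).sum(axis=1)
    vals = np.array([loglik(t) + logprior(t) for t in theta]) - log_q
    return vals.mean()
```

A useful check: with a flat log-likelihood and a standard Normal prior, q = Normal(0, I) makes the ELBO equal to −KL(q ‖ p) = 0.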
For a large number of covariates and high-dimensional data we use a low-rank approximation of the covariance matrix of the form Σ = W'W + I, where I denotes the identity matrix and W ∈ R^{R×P}, with R the rank and P the number of covariates. The low-rank approximation for high-dimensional data is experimental; however, we show through simulations in Section 3.4 that it provides a good approximation. Various methods have been proposed to handle difficult variational objectives by replacing the expectation in the ELBO with Monte Carlo (MC) samples and optimizing via automatic differentiation. For details on some algorithms see Ranganath, Gerrish and Blei (2014) and Kucukelbir et al. (2017). A detailed survey of MC-based gradient estimation can be found in Mohamed et al. (2020). We use automatic differentiation variational inference (ADVI) following Kucukelbir et al. (2017) for the optimization of the ELBO, with a random subsample D* at each optimization step.
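Sampling from the low-rank family never requires forming the P × P matrix Σ = W'W + I: a draw θ = μ + W'z_1 + z_2, with standard Normal z_1 ∈ R^R and z_2 ∈ R^P, has exactly this covariance. A small sketch (names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_low_rank(mu, W, n_samples):
    """Draw theta ~ Normal(mu, W'W + I) without building the P x P
    covariance: theta = mu + z1 @ W + z2, with z1 ~ N(0, I_R) and
    z2 ~ N(0, I_P), costs O(R * P) per draw."""
    R, P = W.shape
    z1 = rng.standard_normal((n_samples, R))
    z2 = rng.standard_normal((n_samples, P))
    return mu + z1 @ W + z2
```

Since Cov(z_1'W row-wise) = W'W and Cov(z_2) = I, the empirical covariance of many such draws converges to W'W + I.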
3. Simulation. The overall setting for the simulations is the occurrence of a disease in the general population and the association of time-varying covariates with it, emulating a typical scenario encountered when analyzing EHR. Event times are generated based on the Cox proportional hazards model and are measured as age in days, ranging from birth to 80 years. Covariates vary randomly over time at discrete time steps and are drawn from either a Normal or Bernoulli distribution, resembling laboratory measurements or disease status. The baseline hazard reflects typical age-sex adjusted distributions of disease, with an initial increase and a decrease at older ages. Individuals are followed from birth until either censoring or an event occurs. Censoring is independent of the data-generating process.
3.1. Simulation of event times via the Binomial model. Our proposed algorithm for the simulation of event times follows along the lines of Sylvestre and Abrahamowicz (2008) and is based on the discretized partial likelihood defined by Cox (1972) as

α(t | X_i•(t)) / (1 − α(t | X_i•(t))) = α_0(t) / (1 − α_0(t)) · exp(θ' X_i•(t)).

Based on this, one can define a Binomial model with the probability of an event at time t as

(6) P(T_i = t | T_i ≥ t) = α_0(t) exp(θ' X_i•(t)) / (1 − α_0(t) + α_0(t) exp(θ' X_i•(t))).

For a given simulation, we draw a sample of the underlying baseline hazard. Per individual, we simulate the p covariate processes and then move successively through time t = 1, 2, . . . , c, where c is the individual censoring time c ∼ Uniform(1000, 30000). For a given t we evaluate the individual probability of an event as in Equation (6), conditional on the state of the baseline hazard and the covariate processes at t, until either an event occurs or the censoring time is reached. The covariate processes up to the final time are then retrieved and kept for the subsequent analysis. A schematic representation of the simulation can be seen in Figure 1.

FIGURE 1. A schematic overview of the discrete failure time simulation. The upper panel shows the rough shape of the baseline hazard α_0. The middle panel represents, for one individual, the corresponding linear effect of the time-varying covariate processes X_i•(t) and the parameters θ. The lower panel depicts the combined effect of the baseline hazard and the covariate effects. The dotted vertical lines represent jump times of the baseline hazard and the dashed vertical lines the jump times of the covariate processes. The probability of an event P(t = T) is the combination of both processes, with jumps at either of the two, and is evaluated for each dt until either censoring or an event happens, as indicated by the solid vertical line in the lower panel.
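The simulation loop can be sketched as follows (our illustration; `base_hazard` and `X_path` stand in for the sampled baseline hazard and covariate processes, and the per-step event probability follows the Binomial scheme of Equation (6)):

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_event_time(base_hazard, X_path, theta, c):
    """Walk through discrete time t = 1..c and draw Bernoulli events
    with the discrete-time Cox probability from Equation (6).
    base_hazard(t) returns alpha_0(t); X_path(t) the covariates at t.
    Returns (observed time, event indicator)."""
    for t in range(1, c + 1):
        a0 = base_hazard(t)
        e = np.exp(theta @ X_path(t))
        p = a0 * e / (1.0 - a0 + a0 * e)   # P(T = t | T >= t)
        if rng.random() < p:
            return t, 1                     # event observed at t
    return c, 0                             # censored at c
```

Two boundary cases make the scheme easy to verify: a baseline hazard of 1 forces an event at t = 1, and a baseline hazard of 0 always yields censoring at c.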
For the simulations presented in this study, the covariate processes and the baseline hazard resemble a sample path from a renewal-reward process {Z(t, ω), t = 0, 1, . . .}, with ω a particular realization,

Z(t, ω) = ∑_i R_i 1[J_i ≤ t],

given the waiting times W_1, W_2, . . . between the jumps of the process, jump times J_i = ∑_{k=1}^i W_k, 1[J_i ≤ t] the indicator function, and R_i the rewards. Here, the times between increments W_i are rounded to the nearest integer.
For the covariate processes, the rewards are R_i ∼ Normal(0, 1) for continuous variables and R_i ∼ Bernoulli(0.2) for binary variables.
The baseline hazard α_0 follows an analogous renewal-reward construction.
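A sketch of sampling such a renewal-reward path; the exponential waiting-time distribution and its mean are illustrative assumptions on our part, as the paper specifies only generic waiting times W_i rounded to the nearest integer:

```python
import numpy as np

rng = np.random.default_rng(3)

def renewal_reward_path(t_max, mean_wait=50.0,
                        reward=lambda: rng.standard_normal()):
    """Sample a piecewise-constant path Z(t) = sum_i R_i 1[J_i <= t]
    on t = 0..t_max. Waiting times between jumps are exponential with
    the given mean (an illustrative choice), rounded to integers."""
    Z = np.zeros(t_max + 1)
    t, level = 0, 0.0
    while True:
        w = max(1, int(round(rng.exponential(mean_wait))))
        # hold the current level until the next jump at t + w
        Z[t:min(t + w, t_max + 1)] = level
        t += w
        if t > t_max:
            break
        level += reward()   # jump of size R_i at J_i
    return Z
```

With a constant positive reward the sampled path is a nondecreasing step function starting at zero, which is a convenient sanity check.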
The sampled baseline hazard can then be scaled up or down to reflect the appropriate rate of censorship. The resulting shape of the baseline hazard is similar to the depiction in the schematic. An example output for two individuals in the resulting long format can be seen in Table 1.

3.2. Log-likelihood approximation through reweighting. First, we wanted to evaluate the subsampled approximation of the full log-likelihood through the reweighting described in Equation (4), and particularly how it compares to a more naïve approximation that would simply scale log L by the relative number of observations between the full data and the batch size. The data has been generated as: Simulation: 2000 individuals, 15883 observations, 0.90 censorship_ind, 5000 covariates (2500 binary, 2500 continuous), θ_l ∼ Normal(0, 0.5²), batch size 1024 observations, prior θ_l ∼ Student t(ν = 1, s = 0.001), rank = 50. The results can be seen in Figure 2. The reweighting forms an almost unbiased, symmetric distribution around the full log-likelihood throughout the training process and is substantially better than a naïve approximation based on the batch size itself. Interesting to note here is the initial drop in the negative log-likelihood followed by an increase, showing the effect of penalization. A similar depiction for a standard case can be seen in Section 1 of the supplementary materials. Additionally, we evaluate the performance of our reweighting for different scenarios based on different numbers of observations, batch sizes, censorship levels, numbers of covariates, and hazards. Overall, the approximation of the log-likelihood is very good, with almost no error on average.

TABLE 2
Simulation results for the standard case setting. The upper panel shows the results for the first simulation and the lower panel represents the second simulation. The first 3 estimates correspond to the binary covariates, the last 3 to the quantitative covariates. We use a Multivariate Normal family for variational inference and θ_l ∼ Normal(0, 1) as prior.

The results for the simulation are shown in Table 2. We provide results for the Cox model fitted with the implementation R-Survival in Therneau (2021) on the full data and our implementation ProbCox on subsamples. The results for different batch sizes can be found in Section 3 of the supplementary materials. The average parameter estimates θ̄ are consistent across the different settings and stable for different batch sizes. The variability of the estimates across the simulations, measured as the standard deviation σ_θ, as well as the root mean squared error RMSE, are comparable to R-Survival. The coverage is close to the nominal rate, with the average highest posterior density interval length HPD_95% similar in size to the average confidence interval length CI_95% from R-Survival, increasing with smaller batches, reflecting the additional approximation error through subsampling. Overall, we see comparable performance between our implementation trained on subsamples and R-Survival fitted on the full data.
3.4. High-dimensional case. With the number of variables measured on individuals continuously increasing, the need for models that can handle large numbers of covariates and high-dimensional data (covariates ≫ observations > individuals) is paramount. Particularly when analyzing EHR, the subsampled data for each evaluation can effectively be very high-dimensional, even though the EHR itself is not. Hence, a method that can scale to EHR will also need to be able to handle high-dimensional data. Therefore, we evaluated our method on a very high-dimensional dataset (covariates ≫ observations) and a moderately high-dimensional dataset (covariates ≫ individuals) with strong correlation between covariates. The simulations have been conducted as follows: Simulation 1: 200 runs, 1000 individuals, 6896-7613 observations, 0.65-0.72 censorship_ind, 0.0-0.02 ties, θ_1:20 ∼ Normal(0, 0.75²), θ_21:10000 = 0, 10000 covariates (5000 binary, 5000 continuous). Parameters are equally split between binary and continuous covariates.
Correlations have been randomly induced following Lewandowski, Kurowicka and Joe (2009) and the underlying correlation matrix as well as the distribution of pairwise correlations can be seen in Figure 3.
The results for the simulations can be seen in Table 3. We provide a comparison of our proposed method to the Cox model with the Adaptive Lasso described in Zou (2006) and Zhang and Lu (2007). The model is fit via the R-glmnet package. The estimation procedure is two-fold. First, we run an initial Cox regression with Lasso penalty, where the λ parameter (strength of penalization) is chosen through 5-fold cross-validation on the concordance index. The coefficients corresponding to either λ_1se or λ_min are extracted. Coefficients estimated to be 0 are replaced with a value of 1e-06 and subsequently used as the weights for the Adaptive Lasso (similar settings as in the first run). Typically, the weights would be estimated with a Ridge penalty; however, this did not provide good results, and therefore we used the Lasso for the weight estimation as well. We show results for the Adaptive Lasso with λ_min and λ_1se (weights from λ_1se in both cases) for simulation 1 and simulation 2, respectively, as they showed the best performance for parameter identification. For our method we use the low-rank approximation to the Multivariate Normal family for variational inference and θ_l ∼ Student t(ν = 1, s = 0.001) as prior. A parameter is naïvely classified as identified if the HPD_95% does not contain zero. It should be noted that this is not a proper variable selection procedure; we use it just as a rough guide to compare identification of parameters between the methods.
For the Adaptive Lasso, a parameter is identified if its estimate is not equal to zero. The rate of identification for the parameters over all simulations is given in the last column of Table 3 for each method, respectively.
Summaries are computed over the set of estimates that have been identified in the respective simulation runs.
Generally, the estimates of our method work well in both simulation settings, being on average close to the true parameter values and exhibiting good coverage overall. The results are robust across the rank and batch size specifications.

TABLE 3
Simulation results for the high-dimensional case (non-zero parameters only). The first panel corresponds to simulation 1 and the second panel to simulation 2. For simulation 1, the first ten rows correspond to the non-zero binary covariates and the last ten rows to the non-zero continuous covariates.

In comparison to the Adaptive Lasso, our estimates are slightly closer to the true parameters. For identification, both methods perform equally well, with both approaches identifying on average 13 covariates correctly with 1-2 misidentified for simulation 1, and 11 covariates correctly with 1 misidentified for simulation 2. A graphical comparison of our method and the Adaptive Lasso can be seen in Figure 3. Most of our estimates lie on the diagonal, enclosed by the average HPD_95% bounds, indicating again a good identification of the true parameters on average with good quantification of the uncertainty. The estimates for the Adaptive Lasso are also close to the diagonal.
An overview and additional results for the Adaptive Lasso, Lasso, and our method with different batch sizes and ranks can be found in the supplementary materials Section 4.
We provide an additional high-dimensional simulation comparing our method again to the Adaptive Lasso, and additionally to the smoothly clipped absolute deviation (SCAD) penalty and the minimax concave penalty (MCP) implemented in R-ncvreg by Breheny and Huang (2011), as well as another Bayesian approach with inverse moment priors implemented in R-BVSNLP by Nikooienejad, Wang and Johnson (2020). However, these implementations only allow for simple Cox models and consequently no time-varying covariates. The simulation and results can be found in Section 5 of the supplementary materials.
Overall, our method performs best in parameter estimation and is comparable in identification.
We conducted another high-dimensional simulation, which can be found in Section 6 of the supplementary materials, to further evaluate and assess the performance of our method under different hyperparameter settings for the Student t distribution, as well as a comparison to the Bayesian Lasso as described in Park and Casella (2008) and Hans (2009).
Also, we ran a large-scale simulation to evaluate the potential bias when the ratio of batch size to observations is very low (≈ 0.0001). The simulation and results can be found in Section 7 of the supplementary materials. Parameter estimates are close to the true values on average, with good coverage.
FIGURE 3. Simulation results for the high-dimensional case (side panels: Pairwise Corr., Corr. Matrix). The first panel corresponds to simulation 1 and the second panel to simulation 2. The light grey dots are parameter estimates from individual simulation runs, while the diamonds and stars represent the average estimate over all simulations for binary and continuous variables, respectively. The black bars are the average HPD_95% bounds. The dotted line corresponds to the diagonal and indicates correct identification of the true parameters.

3.5. Resource comparison. In this section, we provide a small compute-time and memory-utilization comparison between our method Python-ProbCox, R-Survival, R-glmnet, and Python-lifelines. We run two different versions of our method: one that feeds the subsamples from memory, and one that loads each subsample from the hard drive, reflecting the more typical use case of our method. Both versions are run with the low-rank approximation (R = 25) to the Multivariate Normal family for variational inference, θ_l ∼ Student t(ν = 1, s = 0.001) as prior, and a batch size of 256 observations. The main characteristics of the simulation can be taken from Figure 4. Covariates are equally split between binary and continuous covariates, of which ten are non-zero with θ_1:10 ∼ Normal(0, 0.5²). We only run a single run per simulation per method; therefore, the estimates only serve as a rough guide. The timings and memory utilization for a 4-core machine can be seen in Figure 4. Overall, we see good resource utilization for our method. There is an initial overhead, which is offset once the dataset becomes large-scale and high-dimensional. We can also see a clear trade-off between memory usage and compute time. However, the additional compute time incurred through loading the subsamples from the hard drive is negligible and allows for almost constant memory usage.
Additionally, we performed a separate comparison of our method to the implementation in Wang et al. (2021). Both methods perform well with regard to parameter identification and estimation and are close in speed for large-scale cases. The analysis can be found in Section 8 of the supplementary materials.

4. Application.
4.1. Small-scale examples. To further evaluate our proposal and compare it with the standard implementations, we run several small applications on real-world data provided in R-Survival. The focus is on comparing the methods rather than on the data analysis. In total, we compare 5 different datasets; the results for 4 of them can be found in Section 9 of the supplementary materials. Our proposed method compares very well, with similar parameter and standard-error estimates.
The PBCseq data contain 312 individuals with 1,945 observations and 725 events, where we treat repeated events as independent for simplicity. For missing covariates, we carry the information from the previous visit forward. Additionally, we construct an artificial high-dimensional dataset based on the PBCseq data by appending 2,000 Normal(0, 1)-distributed covariates. The continuous covariates from PBCseq have been Z-transformed accordingly. We compare our method to R-Survival for the original dataset and to the Adaptive Lasso, implemented through R-glmnet, for the high-dimensional dataset. We use a Multivariate Normal family for variational inference with θ_l ∼ Normal(0, 1) as prior for the original data, and the low-rank (R = 50) approximation to the Multivariate Normal family with θ_l ∼ Student-t(ν = 1, s = 0.001) in the high-dimensional case. The batch size in both cases is 256 observations. The results can be seen in Table 4. For the original data, our method is again comparable to the implementation in R-Survival, with the same covariates identified as having non-zero effects and similar estimates for parameters and uncertainty. For the high-dimensional case, we identify similar covariates as the Adaptive Lasso, these being the significant covariates from the original analysis. For both approaches, the estimates are close to the parameter estimates from the original evaluations. Overall, our method identified 7 non-zero covariates (based on the HPD 95%) and the Adaptive Lasso 6.

4.2. Myocardial infarction in the UK Biobank. In this section, we showcase a typical example of the kind of analysis where we see the greatest utility of our proposed method. With the number of biobanks and population-based health registers continuously increasing, we expect these types of analyses to become more widespread and relevant. Being able to conduct multivariate association studies at such a large scale provides new possibilities to identify novel relationships, but also to investigate rare disease outcomes. However, the focus of the presented small case study is on the method and the type of analysis it enables rather than on the actual medical associations. Therefore, the presented results should not be overinterpreted from a medical viewpoint.
Cardiovascular disease (CVD) is one of the major causes of death worldwide, with an immense impact on public health. CVD is the general term for conditions affecting the heart and blood vessels, including ischaemic heart disease and stroke. Understanding the risk factors driving the progression of CVD could help practitioners to intervene early or allow individuals to adjust lifestyle choices. To identify potential risk factors, we make use of the UK Biobank (UKB) (Sudlow et al., 2015), a public large-scale biomedical database established in 2006 with a vast array of information on 502,628 participants recruited between 2006 and 2010. All participants were between 40 and 69 years of age at their recruitment date.
In total, we have 393,464 participants with 1,594,586 interval observations and 6,880 MI events. Participants are left-truncated at the date of their entry and possibly right-censored, at the latest at 2020-03-01. Additionally, we have 1,108 time-varying binary indicators, one for each ICD-10 code, representing whether an individual had had the corresponding disease by the corresponding point in time. Furthermore, we have 12 baseline risk factors (sex, alcohol, smoking, LDL, HDL, triglycerides, BMI (3 indicators), blood pressure (3 indicators)) that are generally considered to influence an individual's risk of CVD. Overall, there are 1,120 time-varying covariates and sex in our model, with covariate values taken from either the most recent recruitment visit, follow-up, or entry in the hospital admission records, and assumed to be constant in-between measurements.
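Interval observations with time-varying diagnosis indicators follow the counting-process long format: follow-up is split at each diagnosis date, and the indicator for an ICD-10 code switches on from that date onward. A pure-Python sketch for a single participant (the helper name and its simplifications are ours, not the ProbCox data pipeline):

```python
def long_format(t_entry, t_exit, event, diagnoses):
    """Split one participant's follow-up [t_entry, t_exit) into intervals
    at each diagnosis date; `diagnoses` maps ICD-10 code -> diagnosis time.
    Returns rows of (start, stop, event, indicators) - illustrative only."""
    cuts = sorted({d for d in diagnoses.values() if t_entry < d < t_exit})
    times = [t_entry] + cuts + [t_exit]
    rows = []
    for start, stop in zip(times[:-1], times[1:]):
        # indicator is 1 once the diagnosis date lies at or before the interval start
        indicators = {c: int(d <= start) for c, d in diagnoses.items()}
        is_event = int(bool(event) and stop == t_exit)  # event closes follow-up
        rows.append((start, stop, is_event, indicators))
    return rows
```

Diagnoses made before study entry simply set their indicator to 1 from the first interval onward.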
We transform the timeline from calendar time to age in days, as age is a dominant and powerful predictor. In such cases it can be advisable to use age as the time axis; see Andersen et al. (2021) for general guidelines.
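Concretely, each calendar-time observation interval is shifted onto the age axis; since participants enter the study well after birth, this automatically yields left truncation at the entry age. A small sketch with a hypothetical participant (dates are made up for illustration):

```python
from datetime import date

def to_age_days(birth, start, stop):
    """Map a calendar-time interval to age-in-days; entry after birth
    produces left truncation at the entry age automatically."""
    return (start - birth).days, (stop - birth).days

# hypothetical participant: born 1950-06-01, followed 2006-03-15 to 2020-03-01
entry_age, exit_age = to_age_days(date(1950, 6, 1),
                                  date(2006, 3, 15),
                                  date(2020, 3, 1))
```

The resulting (entry_age, exit_age) pair then plays the role of (start, stop) in the counting-process representation.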
The outcome of the study is myocardial infarction, defined as an ICD-10 code in I21-I24. Other CVD events, such as ischaemic heart disease (I20-I25), cerebrovascular disease (I60-I69), cardiac arrest (I46), heart failure (I50), and transient cerebral ischaemic attack (G45), are additionally used either as exclusion criteria before study entry or as possible competing events, which we treat here as censoring instances for simplicity. Of interest is the MI event one year ahead; the last year of observation prior to the event is removed to avoid identifying factors that might be part of the diagnostic process itself.
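One way to implement the removal of the last year before the event might look as follows. This is our illustrative reading of the construction, for a single case; the function name and the detail of reassigning the event to the last surviving interval are assumptions, not the study's exact code.

```python
def censor_diagnostic_window(intervals, event_time, window=365):
    """Drop covariate intervals that begin within `window` days of the
    event and truncate the remainder, so measurements from the diagnostic
    work-up cannot leak into the model; the event is then assigned to the
    last surviving interval (illustrative, one participant)."""
    cutoff = event_time - window
    kept = []
    for start, stop, _, x in intervals:
        if start >= cutoff:
            continue                          # recorded inside the window
        kept.append([start, min(stop, cutoff), 0, x])
    if kept:
        kept[-1][2] = 1                       # event indicator
    return [tuple(r) for r in kept]
```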
A more detailed description of the study can be found in Section 10 of the Supplementary materials.
For the analysis, we use a similar setup as in the high-dimensional simulation in Section 3.4. The prior distribution is θ_l ∼ Student-t(ν = 1, s = 0.001), and we use a low-rank (R = 50) Multivariate Normal as our approximating family for variational inference. The batch size for inference is 8,192 individuals.
Overall, the results are in line with many of the reported associations in the literature. The most relevant factors according to our estimation can be seen in Figure 5. Yusuf et al. (2020) report a hazard ratio of 1.74 (1.61-1.88) for diabetes without subcategorization on CVD, which agrees with our estimates for unspecified diabetes, 1.65 (1.46-1.86), and insulin-dependent diabetes, 1.6 (1.18-2.16). Millett, Peters and Woodward (2018) estimate sex-specific factors for smoking on MI of 3.46 (3.02-3.98) and 2.23 (2.03-2.44) for females and males, respectively, slightly higher than our estimate of 2.08 (1.9-2.27). Mortensen and Nordestgaard (2020) estimate a hazard ratio of 1.34 (1.27-1.47) for LDL on MI, again close to our estimate of 1.41 (1.36-1.46). Chronic kidney disease has been identified by Hippisley-Cox, Coupland and Brindle (2017) as a risk factor for CVD with hazard ratios of 2.09 (1.87-2.34) and 1.94 (1.72-2.19) for males and females, respectively. These estimates overlap with our cystic kidney disease and hypertensive renal disease estimates of 1.9 (1.01-3.6) and 1.79 (1.06-3.04), respectively. The same study also estimates a hazard ratio for rheumatoid arthritis of 1.24 (1.19-1.28), similar to our estimate of 1.30 (1.03-1.65). The association between rosacea and MI is less well established; however, Egeberg et al. (2016) estimate a relative risk ratio of 0.75 (0.57-1.00), in line with our estimate of 0.61 (0.37-1.00). The concordance index for the model on the training, validation, and test splits is 0.72, 0.72, and 0.69, respectively. This type of analysis could be further extended to include more of the available information in the UKB and to refine associations; more importantly, with the number of participants in these cohorts we can now reveal and study more nuanced outcomes and systematically evaluate multivariate effects.
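The concordance indices quoted above can be computed with a quadratic-time pure-Python version of Harrell's C; this is a sketch for intuition, and in practice a library routine (e.g. the one in Python-lifelines) would be used on data of this scale.

```python
def concordance_index(times, events, scores):
    """Harrell's C: over comparable pairs (an observed event preceding
    another subject's follow-up end), the fraction where the earlier
    failure carries the higher risk score; score ties count one half."""
    num = den = 0.0
    for t_i, e_i, s_i in zip(times, events, scores):
        if not e_i:
            continue                           # i must be an observed event
        for t_j, s_j in zip(times, scores):
            if t_i < t_j:                      # j still at risk at time t_i
                den += 1
                num += 1.0 if s_i > s_j else 0.5 if s_i == s_j else 0.0
    return num / den
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect risk ordering, so the 0.69-0.72 reported above indicates moderate discriminative ability.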
5. Limitations and future directions. The presented analysis mainly focused on the scalability of the Cox model for high-dimensional and large-scale datasets with time-varying covariates. Further research will be needed to extend the proposed method to address some of the challenges that may arise when analyzing EHR. We provide a brief outline of possible future directions for this research.
A limitation of the proposed analysis is the assumption of proportional hazards and the lack of means to assess it. Developing efficient algorithms that scale residual analysis to these vast amounts of data, in order to evaluate model performance and identify possible deficiencies, will be crucial.
Further extending the flexibility of the method to non-linear and time-dependent covariate effects is another factor to consider. However, challenges may arise from the different dimensions on which the effects can act when considering time-varying covariates: the effect may vary by the time point of measurement (early vs. late pregnancy) and by the progression of time (decaying or increasing effects over time).
While EHR provide a rich data source, the assumptions that measurements are constant in-between time points and are measured without error may introduce bias. Modeling the longitudinal data jointly with the event times would be an alternative approach; however, it brings additional inferential and computational challenges.
Competing events naturally arise when analyzing EHR and pose challenges for the interpretation of results. In this study we focused on the cause-specific hazard; however, adopting different approaches such as the Fine-Gray model, i.e. the subdistribution hazard, may be preferable in some cases, although challenging in the presence of internal time-varying covariates.
Using SVI rather than MCMC provides substantial compute-time advantages at the cost of lacking guarantees for the approximation. The adoption of SVI for more complex models, e.g. hierarchical or frailty models, poses additional challenges, as defining a good approximating family will require further research and evaluation. The performance of our approach in the context of high-dimensional data is promising. Nevertheless, using the low-rank approximation to the Normal distribution as the variational family will require further theoretical justification. Extending SVI to very high-dimensional datasets, e.g. genetic association studies, or using flexible variational families, e.g. normalizing flows, are promising research directions.
An open question remains the optimal batch size for a given application. Generally, we have shown that the method is reasonably robust across a variety of batch sizes without incurring too much bias.
6. Concluding remarks. We have proposed a Bayesian version of Cox's partial likelihood that can be used on high-dimensional and large-scale datasets with time-varying covariates. For the simulated datasets and optimization parameters, our method provides (almost) unbiased estimates with good uncertainty quantification while subsampling the data. This approach enables researchers to extend the Cox model, one of the most widely applied methods in biomedical research, to new data sources like biobanks and EHR, the combination of which can provide new insights into our understanding of diseases. Our implementation enables the joint analysis of tens of thousands of time-varying covariates for millions of individuals, providing a framework to utilize population-scale EHR. Based on a counting-process representation, our method can easily be extended to a wider variety of scenarios, such as recurrent events or more elaborate missing-data patterns. As shown in the case of myocardial infarction, we can replicate many of the results previously reported in the literature and characterize their joint effects. Extending the analysis with the additional data typically available in EHR, or focusing on less well-studied diseases, could be an interesting avenue to follow.
Funding. AWJ and MG are supported by grant NNF17OC0027594 from the Novo Nordisk Foundation. The UK Biobank data were accessed under application 45761. No potential competing interest was reported by the authors.

Supplemental Code and Data
Python package probcox. Contains the code to fit the probabilistic Cox regression method described in the article. It further contains all datasets and scripts (except the UKB data; an additional simulated dataset and analysis are provided instead) used as examples in the article. Notebooks are provided to readily replicate all results, tables, and figures from the paper (Jung and Gerstung, 2022). (zip file or https://github.com/alexwjung/ProbCox)

Supplemental Simulation and Application Results
Contains additional tables and results described in the article (Jung and Gerstung, 2022). (pdf file)

FIGURE 2. Log-likelihood approximation along training. On the left-hand side, we see the full log-likelihood evaluated at each training step in black, the corresponding reweighted log-likelihood evaluations in dark grey, and a naïve approximation in light grey. The right-hand side shows the distributions of the approximations around the full log-likelihood evaluated at the last training step for repeated samples.

FIGURE 4. Runtime/memory comparison between Python-ProbCox, Python-ProbCox (hard drive), Python-lifelines, R-Survival, and R-glmnet. ProbCox (hard drive) loads the relevant batch sample on demand from the hard drive instead of loading the full data into memory.

FIGURE 5. Forest plot for the hazard ratios exp[θ]. The black dot represents the median parameter estimate with the corresponding HPD 95% interval around it. The upper panel shows the estimates for the baseline risk factors; we include all baseline risk estimates. The lower panel shows the parameter estimates for diseases with sufficient evidence of a non-zero effect based on the HPD 95% interval.

TABLE 1
Example output for a simulation draw of 2 individuals and 11 observations with 3 binary and 3 quantitative covariates in long format.
Root mean squared error (RMSE); HPD - highest posterior density; p|θ| - probability of identification. R-glmnet estimates for the standard error are not available.

TABLE 4
Results for the PBCseq application.The first two columns correspond to the original and the last two columns to the high-dimensional analysis.