Joint Bayesian Analysis of Multiple Response-Types Using the Hierarchical Generalized Transformation Model

Consider the situation where an analyst has a Bayesian statistical model that performs well for continuous data. However, suppose the observed dataset consists of multiple response-types (e.g., continuous, count-valued, Bernoulli trials, etc.), which are distributed from more than one class of distributions. We refer to these types of data as "multiple response-type" datasets. The goal of this article is to introduce a reasonably easy-to-implement, all-purpose method that "converts" a Bayesian statistical model for continuous responses (call this the preferred model) into a Bayesian model for multiple response-type datasets. To do this, we consider a transformation of the multiple response-type data, such that the transformed data can be reasonably modeled using the preferred model. What is unique about our strategy is that we treat the transformations as unknown and use a Bayesian approach to model this uncertainty. The implementation of our Bayesian approach to unknown transformations is straightforward, and involves two steps. The first step produces posterior replicates of the transformed multiple response-type data from a latent conjugate multivariate (LCM) model. The second step involves generating values from the posterior distribution implied by the preferred model. We demonstrate the flexibility of our model through an application to Bayesian additive regression trees (BART) and a spatio-temporal mixed effects (SME) model. We provide a thorough joint multiple response-type spatio-temporal analysis of coronavirus disease 2019 (COVID-19) cases, the adjusted closing price of the Dow Jones Industrial Average (DJI), and Google Trends data.


Introduction
Suppose you have a Bayesian statistical model for continuous responses that you believe works extremely well in several settings. Refer to this statistical model as the "preferred model." Also suppose you have observed a dataset consisting of multiple response-types (e.g., continuous, count-valued, Bernoulli trials, etc.). These response-types may be "mismatched" with the response-types of the preferred model. For example, the dataset may consist of count-valued observations, but the preferred model may be derived only for Gaussian data. The primary goal of this article is to introduce a reasonably easy-to-implement, all-purpose method that "converts" a Bayesian statistical model for continuous responses into a Bayesian model appropriate for the analysis of multiple response-type datasets.
There are several methods for jointly modeling data consisting of multiple response-types; however, these approaches require one to either abandon the preferred model or modify it in a manner that creates computational difficulties. For example, Markov models (Yang et al., 2014), copulas (Liu et al., 2009; Xue and Zou, 2012; Dobra and Lenkoski, 2011; Liu et al., 2012), multi-task learning models (Argyriou et al., 2007; Kim and Xing, 2009; Yang et al., 2009), regression trees, and random forests (Hastie et al., 2009; Fellinghauer et al., 2013) have been adapted to this multiple response-type setting. However, these methods do not immediately incorporate a data scientist's preferred model. An important goal of this article is to make our model flexible enough that it can be adapted to other data scientists' preferred models. While our proposed model allows for this flexibility, it can be interpreted as a simple combination of two existing methods: generalized linear mixed effects models (GLMMs; e.g., see McCulloch et al., 2008, for a standard reference) and LCMs.
The GLMM is a standard approach to modeling non-Gaussian data. For example, Poisson data are modeled hierarchically, where the log-mean parameter can be analyzed using a data scientist's preferred model. GLMMs can lack conjugacy, which creates noticeable difficulty when implementing a GLMM on a modern high-dimensional dataset. There are several examples where this approach has been used to analyze multiple response-type datasets (e.g., see Christensen and Amemiya, 2002; Schliep and Hoeting, 2013; Wu et al., 2015; Clark et al., 2017; Todd et al., 2018, among several others). A more recent alternative is the LCM. Basic theoretical results and empirical analyses in Bradley et al. (2018), Hu and Bradley (2018), Yang et al. (2019), Bradley et al. (2019b), and Bradley et al. (2020+) suggest that one can outperform a standard GLMM (specifically, latent Gaussian process (LGP) models) in terms of prediction error. However, both the GLMM and the LCM require the preferred model to be a mixed effects model, and the LCM requires one to modify the distribution of the random effects to follow the appropriate distribution based on conjugacy.
A classical approach is to transform the data, so that the transformed data can be reasonably modeled using the distribution assumed by the preferred model. In the non-Bayesian setting this literature is extremely well-developed and includes the Box-Cox transformations (Box and Cox, 1964), the alternating conditional expectations (ACE; Breiman and Friedman, 1985) algorithm, graphical techniques (McCulloch, 1993), and the Yeo-Johnson power transformation (Yeo and Johnson, 2000), among other techniques. More recently, developments in rank-based algorithms (Servin and Stephens, 2007; McCaw et al., 2019; Beasley et al., 2009) and quantile-matching (McCullagh and Tresoldi, 2020) have also been proposed in the non-Bayesian setting. It is important to note that Bayesian models for transformations have been proposed as well, but they focus on the case where continuous non-normal data are observed and the preferred model assumes normality. In particular, these Bayesian models put a prior on the free parameter within the Box-Cox transformation or the Yeo-Johnson power transformation (Kim et al., 2013; Charitidou et al., 2015; Bean et al., 2016; Charitidou et al., 2018). No such Bayesian model has been developed to analyze multiple response-type data using any preferred model for a continuous response.
There are three distributions that define our hierarchical generalized transformation (HGT) model: (a) the distribution of the data given a transformation, (b) the prior distribution of the transformation, and (c) both the distribution of the transformation given the latent process and the distribution of the process of interest (i.e., multiplicative terms in the aforementioned preferred model). Here, cross-correlations are explicitly modeled using the preferred model. In this article, we model the data given a transformation (a) using members from the exponential family. Specifically, given a transformation, continuous data follow the normal distribution, categorical data follow the binomial distribution, and count-valued data follow the Poisson distribution. These distributions are conjugate with the normal, the logit-beta (Gao and Bradley, 2019; Bradley et al., 2019b), and the log-gamma distributions (Hu and Bradley, 2018; Bradley et al., 2020+; Yang et al., 2019), which are special cases of the Diaconis-Ylvisaker (DY) distribution (e.g., see Diaconis and Ylvisaker, 1979; Chen and Ibrahim, 2003, for key references). Consequently, the prior distribution of the transformation (b) is modeled with a DY distribution, which defines an LCM model for the transformations. To combine (a), (b), and (c) we also have to introduce a multiplicative term to the preferred model to ensure propriety and to avoid contradictions when computing the marginal distribution of the transformations. While we include this multiplicative term in the expression of the preferred model, implementation of the preferred model is unaltered when learning the values of the latent process and associated parameters. The implementation of our approach can be done using composite sampling. In particular, the first step is to sample from the posterior distribution of the transformation. Then the second step is to sample from the conditional distribution of the latent process of interest given the transformation.
The first step of the composite sampler is computationally straightforward because the DY distribution is conjugate with the exponential family (and easy to sample from). Additionally, the first step of this algorithm is important for the purpose of analyzing multiple response-types. Specifically, at the end of the first step we obtain a replicate from the posterior distribution of the transformation (which is continuous valued). Thus, the first step of the composite sampling algorithm "transforms" the multiple response-type data into a continuous-valued quantity appropriate for the preferred model.
Implementation of the preferred model is unchanged (from the standard use of the preferred model) in the second step of our composite sampling algorithm. This is particularly noteworthy, as many of the Bayesian statistical models derived for Gaussian data are not immediately computationally efficient in the non-Gaussian data setting (e.g., see Bradley et al., 2019a; Kang and Cressie, 2011; Katzfuss and Cressie, 2012, for examples in the spatial setting). This is because GLMMs in the non-Gaussian setting have full-conditional distributions that are not Gaussian and cannot be sampled from immediately. Bayesian methods that do not have easy-to-sample-from full-conditional distributions require difficult-to-tune Metropolis-Hastings algorithms (e.g., see Bradley et al., 2020+, for an example), inefficient rejection samplers (e.g., see Damien et al., 1999), or significant reparameterization to make approximate Bayesian methods (that are only appropriate for small parameter spaces) practical (Rue et al., 2009; Neal, 2011). The second step of our composite sampling algorithm allows one to circumvent this issue entirely, and simply use the computational strategies that were developed for the preferred model.
The two steps of our composite sampler can be seen as sequential smoothing. By "smoothing" we mean a function of the data that attempts to discover important features in the data (e.g., see Simonoff, 2012, for a standard reference). Multiple layers of smoothing may lead to estimates that are "oversmooth," in the sense that many features of the data are not captured. To avoid oversmoothing, we specify the model so that the posterior distribution of the transformation is "saturated." Recall that a saturated model is one in which there exist at least as many parameters as there are data points, and fitting this model allows one to exactly recover the original dataset. Hence, saturated models are often an extreme example of overfitting. Thus, in the first step of our composite sampler we choose to overfit the multiple response-type data, and in the second step we smooth the overfitted values (again, this is done to avoid oversmoothing).
In the classical log-linear model literature, saturated models are useful for selecting more parsimonious models (e.g., see Agresti, 2007, for a standard reference). Specifically, the most parsimonious reduced model that is not significantly different (in terms of the deviance or chi-square statistic) from the saturated model is used for statistical inference. Consequently, specifying the data model to be saturated allows us to assess the goodness-of-fit of the preferred model in a fully Bayesian manner that is similar to what is done in classical residual analysis.
We use our method to analyze spatio-temporal COVID-19 incidences and social-distancing-related variables. This COVID-19 dataset is observed daily over large sparse regions (i.e., countries and provinces). Social distancing can be described as an effort to maintain a physical distance between individuals, and it has become a necessary public health measure to combat COVID-19 (CDC, 2020). Social distancing is known to reduce incidences of and deaths due to COVID-19; however, there are detrimental economic and psychological effects. This motivates us to analyze incidences (and deaths) of COVID-19 along with a measure of the health of the US economy (i.e., the adjusted closing price of the Dow Jones Industrial Average) and a measure of the public interest in COVID-19 through Google Trends data. The HGT model is developed to be easily adapted to a data scientist's preferred method for continuous data, which aids future analyses of this important dataset. It has recently been shown that forecasts regarding COVID-19 require sophisticated models. Following the results of Donnat and Holmes (2020), we include spatio-temporal random effects through the use of basis function expansions (e.g., see Cressie and Wikle, 2011, for a standard reference). Additionally, to improve forecasting performance, we adopt the training, validation, and testing data framework that has become standard in the machine learning literature (e.g., see Hastie et al., 2009, for a standard reference). In particular, we incorporate a Bayesian version of ResNet (He et al., 2016), where validation data are used to make linear adjustments that improve forecasts. While we are partially motivated by COVID-19 and the detrimental impacts of social distancing, the HGT developed in this manuscript is of independent interest, since it provides a new way in Bayesian statistics to model non-Gaussian processes using models for continuous data.
Furthermore, our methodology also allows one to analyze a single non-Gaussian response-type in a straightforward manner.
The remainder of this article is organized as follows. In Section 2, we describe how standard modeling procedures are not appropriate for this multiple response-type dataset. Then, we introduce the HGT model to analyze multiple response-type data with unknown transformations in Section 3. Additionally, we provide an example model specification. Then in Section 4, we provide details on using training, validation, and testing data for statistical inference. This allows us to perform linear adjustments to our predictions in a similar manner to He et al. (2016), which aids in forecasting. A summary of all the Bayesian models used in our analysis is also provided. In Section 5, we give simulation studies to illustrate that our approach has been developed in a manner that allows one to incorporate their preferred statistical model. In particular, we apply our approach to BART models and a spatio-temporal mixed effects (SME) model. Section 6 contains our joint analysis of COVID-19 mortality, incidences, and recoveries, along with Google Trends data and DJI data. Section 7 contains a discussion, and derivations are provided in the Supplementary Materials (Bradley, 2020).

Motivation
Denote the multiple response-type data with Z_ij, where i indexes replicates and j indexes the response-type, such that i = 1, ..., I_j and j = 1, 2, 3. We consider the setting where, for each i, Z_i1 is continuous-valued, Z_i2 is integer-valued ranging over 0, ..., b_i, and Z_i3 is count-valued. There are many "off-the-shelf" approaches that one might consider to analyze multiple response-type data. For example, one might define the following linear model,

Z_i1 = x_i1'β_1 + β_Z2 Z_i2 + β_Z3 Z_i3 + ξ_i1,

where ξ_i1 is normally distributed with mean zero and variance σ²_ξ, β_Zk ∈ R for k = 2, 3, β_1 is an unknown p-dimensional vector, and x_i1 is a p-dimensional covariate vector. However, this conditionally specified model enforces a strong assumption of linearity between the different response-types. Furthermore, the variability (and dependence) of Z_i2 and Z_i3 is not explicitly modeled. Additionally, it assumes that the responses are paired, which may not be the case; that is, it is assumed that we observe the triplet (Z_i1, Z_i2, Z_i3).
To incorporate the variability across response-types (i.e., across j) and allow for non-linear relationships, one might also consider the following hierarchical model:

Z_i1 | Y_i1, v ~ Normal(Y_i1, v),
Z_i2 | Y_i2 ~ Binomial(b_i, p(Y_i2)),
Z_i3 | Y_i3 ~ Poisson(μ_i3), (2.1)

where we assume conditional independence of Z_ij given Y_ij, Y_ij is an unobserved latent process, Normal(Y_ij, v) is shorthand for the normal distribution with mean Y_ij ∈ R and variance v > 0, Binomial(b_i, p) is shorthand for the binomial distribution with b_i > 1 trials and probability of success p ∈ (0, 1), and Poisson(μ_ij) is shorthand for the Poisson distribution with mean μ_ij. The covariance between observations is determined by the model for Y_ij; in particular, conditional independence given {Y_ij} implies

cov(Z_ij, Z_kℓ) = cov{E(Z_ij | Y_ij), E(Z_kℓ | Y_kℓ)}, (i, j) ≠ (k, ℓ). (2.2)

Thus, cross-dependence and predictions are modeled through the statistical model assumed for the process Y_ij, and a standard choice in this context is the GLMM,

Y_ij = x_ij'β + S_ij'η + ξ_ij, (2.4)

where x_ij is a known p-dimensional vector of covariates, S_ij is a pre-specified r-dimensional vector of basis functions, η is an r-dimensional vector of random effects with variance parameter σ²_η > 0, and ξ_ij is fine-scale error with variance σ²_ξ > 0. Then the cross-response spatio-temporal covariance implied by this model is cov(Y_ij, Y_kℓ | σ²_η) = σ²_η S_ij'S_kℓ, which propagates through and enforces dependence in the multiple response-type data through (2.2). The relationship between the different response-types can be found by estimating the unknown function S_ij'η (e.g., using posterior means and credible intervals).
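The implied covariance above can be verified numerically. In the sketch below, the basis vectors S_ij and the value of σ²_η are arbitrary simulated assumptions (none of these numbers come from the paper); the point is only to check the formula cov(Y_ij, Y_kℓ | σ²_η) = σ²_η S_ij'S_kℓ against a Monte Carlo estimate.

```python
import numpy as np

# Numerical sketch: with eta ~ Normal(0_r, s2_eta I_r), the random-effects
# component Y = S eta has cov(Y_ij, Y_kl | s2_eta) = s2_eta * S_ij' S_kl.
# S and s2_eta are arbitrary simulated assumptions used to check the formula.
rng = np.random.default_rng(0)
n, r = 6, 3
s2_eta = 2.0
S = rng.standard_normal((n, r))          # rows are the S_ij vectors

implied_cov = s2_eta * S @ S.T           # s2_eta * S_ij' S_kl, all pairs

# Monte Carlo check: repeatedly draw eta and compare the empirical covariance
draws = S @ (np.sqrt(s2_eta) * rng.standard_normal((r, 200_000)))
empirical_cov = np.cov(draws)
print(round(float(np.max(np.abs(implied_cov - empirical_cov))), 3))
```

The discrepancy shrinks at the usual Monte Carlo rate as the number of draws of η grows.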
Computationally, the GLMM is difficult to implement in a Bayesian context. For example, a Gibbs sampler requires one to simulate from the full-conditional distributions (Gelfand, 2000), and in this setting these distributions do not have a known form that is straightforward to simulate from. There are several approximate Bayesian computational tools available; however, for moderate sizes of p and r these approaches are not feasible. In particular, Hamiltonian Monte Carlo (HMC; Neal, 2011) and the integrated nested Laplace approximation (INLA; Rue et al., 2009) are only appropriate for small parameter spaces (e.g., Martino and Riebler, 2019, suggest no more than 15 parameters when implementing INLA). Additionally, INLA only allows for marginal inference (Kristensen et al., 2015). The computational issues of the generalized linear mixed model in (2.1) and (2.4) may become even more cumbersome when considering a different model for Y_ij. This is especially pertinent for our multiple response-type dataset in Section 6, since the US government has put out a call to action (Office of Science and Technology Policy, 2020) for data scientists to analyze COVID-19 datasets, and it would be preferable to have an approach that is flexible enough for others to specify their own model for Y_ij without major changes to implementation.
There are also parameters that are fixed and require specification in the GLMM. For example, the dimension of η (i.e., r) is chosen by the data analyst. In practice, measures of out-of-sample variability can be used to specify such parameters. This is because estimates are often biased towards the training data, since estimators are defined to be "close" to the training data (Hastie et al., 2009). Consequently, the use of validation data (held separate from the training data) to estimate r can aid in accounting for out-of-sample error. In particular, define a validation dataset to be z_val = (Z_ij : i = I_j + 1, ..., I_j^val, j = 1, 2, 3)', which is observed over the indices i ∈ {I_j + 1, ..., I_j^val} and j = 1, 2, 3, where I_j < I_j^val < I. Then a posterior predictive step (e.g., see Gelman et al., 2013) can be used to predict z_val. For example, suppose we obtain an estimate of E(z_val | z_trn, r) using MCMC, where n = Σ_{j=1}^{3} I_j and the n-dimensional training data vector z_trn = (Z_ij : i = 1, ..., I_j, j = 1, 2, 3)'. Then one could use E(z_val | z_trn, r) for inference, where, for illustration, one might select

r̂ = arg min_r Σ_{j=1}^{3} Σ_{i=I_j+1}^{I_j^val} {Z_ij − E(Z_ij | z_trn, r)}²,

and the arg min is computed over r. Additionally, to assess the overall predictive performance of the selected r, a testing dataset could be used. For example, let the testing data Z_ij be defined over the indices i = I_j^val + 1, ..., I for I_j^val < I. Then the metric Σ_{j=1}^{3} Σ_{i=I_j^val+1}^{I} {Z_ij − E(Z_ij | z_trn, r)}² can be used to assess the out-of-sample error of predictions based on the selected r.
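The validation-based choice of r can be sketched as follows. This is a simplified stand-in: the polynomial basis, the plug-in least-squares fit, and the simulated data are all illustrative assumptions in place of the Bayesian posterior-predictive estimate E(z_val | z_trn, r).

```python
import numpy as np

# Sketch: fit a model with r basis functions on training indices, then select
# the r minimizing squared error on held-out validation indices. Polynomials
# and least squares stand in for the full Bayesian computation.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 120)
z = np.sin(2.0 * np.pi * t) + 0.3 * rng.standard_normal(t.size)
idx = rng.permutation(t.size)
trn, val = idx[:80], idx[80:]                         # training / validation split

def basis(t_pts, r):
    return np.vander(t_pts, r + 1, increasing=True)   # 1, t, ..., t^r

def val_sse(r):
    coef, *_ = np.linalg.lstsq(basis(t[trn], r), z[trn], rcond=None)
    pred = basis(t[val], r) @ coef
    return float(np.sum((z[val] - pred) ** 2))

r_hat = min(range(1, 11), key=val_sse)
print("selected r:", r_hat)
```

In the full analysis, `pred` would instead be the MCMC-based posterior predictive mean, and a separate testing split would assess the selected r.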
Computationally, the posterior predictive steps to estimate E(z_val | z_trn, r) can be done efficiently once replicates of Y_ij from the posterior distribution have already been generated. That is, denote the b-th MCMC replicate of Y_ij with Y_ij^[b]. Then a posterior replicate of Z_ij for i > I_j can be generated by simulating from the distribution for Z_ij | Y_ij^[b]. Consequently, we will use the HGT to fit training data (developed in Section 3), but continue to use the GLMM to fit validation and testing data (see Sections 4.2-4.4).
Instead of using validation data to estimate existing parameters, in our implementation, we introduce new parameters to model the validation data to adjust for out-ofsample error. That is, we introduce an additional linear model for {Y ij : i > I j }, and then use the validation data to estimate the parameters in the linear model. This is similar to an approach in machine learning called ResNet (He et al., 2016), and will aid in adjusting for biases that occur when forecasting COVID-19 incidences and deaths. More details and justification are given in Section 4.2.

The Hierarchical Generalized Transformation Model
As discussed in the Introduction, there are three main components that define the HGT: (a) the distribution of the data given a transformation, (b) the prior distribution of the transformation, and (c) both the distribution of the transformation given the latent process and the distribution of the process of interest (i.e., terms in the aforementioned preferred model). Before making explicit specifications of Items (a), (b), and (c), in Section 3.1, we provide a general discussion on allowing transformations to be an unknown process to estimate. Then, in Section 3.2, we describe general Bayesian implementation using any proper generic specifications of (a), (b), and (c). Then in Section 3.3, we provide an explicit specification of Items (a) and (b) that are used in this manuscript.
Finally, in Section 3.4, we provide an example specification of densities that define the preferred model in (c). In Supplementary Appendix A, we provide a table of terminology to aid in keeping track of both terminology and notation.

Unknown Transformations of Multiple Response-Types
One classical strategy to model non-Gaussian data is to impose a transformation such that

h_j(Z_ij) | {Y_ij}, θ ~ f(h_j(Z_ij) | {Y_ij}, θ),

where h_j(·) is a transformation of the datum Z_ij, the h_j(Z_ij)'s are conditionally independent given {Y_ij} and θ, and f is shorthand for a probability density function (pdf) or represents a multiplicative term in the aforementioned preferred model. In what remains, inference on {Y_ij} and θ is the primary goal. To aid in our exposition we drop the functional notation and write h_ij ≡ h_j(Z_ij), so that

h_ij = Y_ij + ε_ij,

where the ε_ij are independent Normal(0, σ²), σ² > 0, and the mixed effects model on Y_ij in (2.4) is assumed.
Transformations convert a multiple response-type dataset (e.g., {Z_ij}) to a single response-type dataset (e.g., {h_ij}), since h_ij follows a single distribution with continuous support. Consequently, transformations have become a standard tool in analyzing multiple response-types. Recall, transformations such as these have a long history, including the Box-Cox transformations (Box and Cox, 1964), graphical techniques (McCulloch, 1993), the alternating conditional expectations (ACE; Breiman and Friedman, 1985) algorithm, and the Yeo-Johnson power transformation (Yeo and Johnson, 2000), among others.
In this paper, we introduce a Bayesian solution to the problem of an unknown transformation through the use of the following pdfs and pmfs: (a) the distribution of the data given a transformation, denoted with f(Z_ij | h_ij); (b) the prior distribution of the transformation, f(h | γ), along with a transformation hyperprior f(γ); and (c) both the distribution of the transformation given the latent process, f(h | y, θ), and the distribution of the process of interest, f(y | θ), which together define the preferred model.
In Section 3.2, we describe general Bayesian implementation when Items (a), (b), and (c) have been specified.

General Bayesian Implementation
In this section, we describe Bayesian implementation of the HGT. The preferred model is represented in terms of a hierarchical model:

f(h | y, θ) m(h | γ),
f(y | θ),
f(θ), (3.3)

where the product of each density in (3.3) defines the conditional density f(h, y, θ | γ), and γ is conditionally independent of (y', θ')' given h. Following the terminology used in Cressie and Wikle (2011), we call f(h | y, θ)m(h | γ) the "transformed data model," f(y | θ) the "process model," and f(θ) the "prior" for θ.
To guarantee that our choice of the transformation prior and transformed data model are consistent with each other, we set

m(h | γ) = f(h | γ) / ∫ f(h | y, θ) f(y | θ) f(θ) dy dθ.

To illustrate the need for m(h | γ), consider the incorrect specification m(h | γ) = 1. Then the preferred model in (3.3) would imply a different marginal distribution for h than the transformation prior f(h | γ), since in general

∫ f(h | y, θ) f(y | θ) f(θ) dy dθ ≠ f(h | γ),

where recall that f(γ) is the "transformation hyperprior." However, setting m(h | γ) = f(h | γ) / ∫ f(h | y, θ) f(y | θ) f(θ) dy dθ guarantees that the marginal distribution of h stays the same when computed using either the preferred model or the transformation prior. That is,

∫ f(h | y, θ) m(h | γ) f(y | θ) f(θ) dy dθ = f(h | γ).

Notice that the presence of m(h | γ) in the preferred model does not change the preferred model proportionally as a function of y and θ. Consequently, the presence of m(h | γ) will not have an effect on updating y and θ in an MCMC.
Bayes rule can be used to produce the following conditional distribution (e.g., see Gelman et al., 2013, for a standard reference):

f(y, θ | h) ∝ f(h | y, θ) f(y | θ) f(θ). (3.4)

Similarly, one can use Bayes rule to produce the posterior distribution of the transformed data. That is,

f(h | z_trn) ∝ ∫ f(z_trn | h) f(h | γ) f(γ) dγ. (3.5)

Equations (3.4) and (3.5) can then be used to produce a posterior distribution for y and θ.
Assume that each of the densities defining the HGT is proper. Suppose z_trn is conditionally independent of γ given h, γ is conditionally independent of (y', θ')' given h, and z_trn and (y', θ')' are conditionally independent given h. Then

f(y, θ | z_trn) = ∫ f(y, θ | h) f(h | z_trn) dh. (3.6)

The derivation of the posterior distribution of the latent process and parameters stated in (3.6) can be found in Supplementary Appendix B.
The posterior distribution of the latent process and parameters can easily be simulated from using a composite sampling scheme, provided that it is easy to simulate from f (y, θ|h). Algorithm 1 gives the step-by-step implementation of how to simulate from the posterior distribution in (3.6). Here, we see that the implementation of the HGT model is similar to the bootstrap implementation, where we have replaced a resampling step with sampling from f (h|z trn ) and the full-conditional distributions associated with γ. This similarity emphasizes the flexibility of allowing for unknown transformations in a Bayesian context, since the bootstrap algorithm is an established flexible approach in the literature (e.g., see Efron, 1992, for an early reference). Of course, the bootstrap algorithm produces replicates from a different distribution than that of Algorithm 1. Specifically, the bootstrap method results in an approximate sample from the sampling distribution of a statistic. Whereas, the composite sampling approach in Algorithm 1 can be seen as a means to sample from the posterior distribution in (3.6). This is also different from the Bayesian bootstrap (Rubin, 1981), which does not restrict the samples to be from a posterior distribution of the form in (3.6).
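As a toy illustration of the two-step structure of Algorithm 1 (not the paper's full model), the sketch below handles a single count-valued response: Step 1 draws the transformation from its gamma-based full conditional (a log-gamma draw, as in the LCM step), and Step 2 runs one update of a simple conjugate normal "preferred model" for the mean of the transformed data. The hyperparameter values and the choice of preferred model are assumptions made for illustration.

```python
import numpy as np

# Toy two-step composite sampler for one count-valued response.
rng = np.random.default_rng(2)
z = rng.poisson(lam=5.0, size=50)     # observed counts
alpha3, kappa3 = 0.5, 0.01            # DY hyperparameters (held fixed here)
tau2, s2 = 100.0, 1.0                 # prior variance of mu; noise variance of h

B = 2000
mu_draws = []
for _ in range(B):
    # Step 1: posterior replicate of the transformation, h | z (log-gamma draw)
    h = np.log(rng.gamma(shape=alpha3 + z, scale=1.0 / (kappa3 + 1.0)))
    # Step 2: preferred-model update, mu | h (conjugate normal full conditional)
    prec = z.size / s2 + 1.0 / tau2
    mu_draws.append(rng.normal(np.sum(h) / s2 / prec, np.sqrt(1.0 / prec)))

print(round(float(np.mean(mu_draws)), 2))   # posterior draws of the latent log-mean
```

In the full HGT the transformation hyperparameters γ would also be updated from their full conditionals in Step 1, and Step 2 would be whatever preferred model the analyst supplies.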

Data Models, Transformation Priors, and Transformation Hyperpriors
Consider the following specifications of the data models:

Z_i1 | h_i1, v ~ Normal(h_i1, v),
Z_i2 | h_i2 ~ Binomial(b_i, exp(h_i2)/{1 + exp(h_i2)}),
Z_i3 | h_i3 ~ Poisson(exp(h_i3)), (3.7)

which is different from the GLMM in (2.1). Specifically, instead of conditioning on the latent process of interest Y_ij, we condition on the transformation h_ij. Thus, the data models in (3.7) imply that the Z_ij's are conditionally independent given {h_ij} and v (v will be given a prior and integrated out).
With the data model f(z_trn | h) defined, we are left to specify a transformation prior and transformation hyperprior. We define the transformation prior to be the conjugate distributions associated with the data models in (3.7). It follows from Diaconis and Ylvisaker (1979) that the conjugate distribution for h_ij is given by

f(h_ij | α_j, κ_j) ∝ exp{α_j h_ij − κ_j ψ_j(h_ij)}, (3.8)

where α_q > 0 and κ_q − α_q > 0 for q = 2, 3, and κ_k > 0 for k = 1, 3. Let ψ_1(Z) = Z², ψ_2(Z) = log(1 + e^Z), and ψ_3(Z) = exp(Z). Also, we use the shorthand DY(α_j, κ_j; ψ_j) to represent the pdf of the Diaconis and Ylvisaker (1979) prior in (3.8). Finally, let γ = (α_1, α_2, α_3, κ_1, κ_2, κ_3)' be the transformation hyperparameter. The DY distribution is a special case of the recently introduced conjugate multivariate distribution, where the matrix-valued covariance parameter is set equal to the identity matrix.
The data models and the DY priors in (3.7) and (3.8) can be used to produce a full-conditional distribution for the elements of the transformation h:

h_i1 | Z_i1, α_1, κ_1, v ~ DY(α_1 + Z_i1/v, κ_1 + 1/(2v); ψ_1),
h_i2 | Z_i2, α_2, κ_2 ~ DY(α_2 + Z_i2, κ_2 + b_i; ψ_2),
h_i3 | Z_i3, α_3, κ_3 ~ DY(α_3 + Z_i3, κ_3 + 1; ψ_3). (3.9)

The derivations of the full conditional distributions are fairly straightforward and are given in Supplementary Appendix B. One can simulate directly from the posterior distribution of the transformations in (3.9). Posterior replicates of h_ij from (3.9) can be computed using the following transformations:

h_i1 | Z_i1, α_1, κ_1, v =d (α_1 + Z_i1/v)(2κ_1 + 1/v)^{-1} + w_1,
h_i2 | Z_i2, α_2, κ_2 =d log{w_2/(1 − w_2)},
h_i3 | Z_i3, α_3, κ_3 =d log(w_3), (3.10)

where "=d" stands for equal in distribution, w_1 | Z_i1, α_1, κ_1, v is distributed normally with mean zero and variance (2κ_1 + 1/v)^{-1}, w_2 | Z_i2, α_2, κ_2 is distributed according to a beta distribution with shape parameters (α_2 + Z_i2) and (κ_2 − α_2 + b_i − Z_i2), and w_3 | Z_i3, α_3, κ_3 is distributed according to a gamma distribution with shape parameter (α_3 + Z_i3) and rate parameter (κ_3 + 1).
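The sampling step for the binomial and Poisson cases can be sketched directly from the beta and gamma distributions stated above; the hyperparameter values chosen below are illustrative assumptions.

```python
import numpy as np

# Posterior replicates of h_i2 (logit of a beta draw) and h_i3 (log of a
# gamma draw with rate kappa3 + 1). Hyperparameter values are assumptions.
rng = np.random.default_rng(3)
alpha2, kappa2 = 0.5, 2.0        # requires kappa2 - alpha2 > 0
alpha3, kappa3 = 0.5, 0.5

def replicate_h2(Z, b, size):
    """Posterior replicates of h_i2: logit of a beta draw."""
    w2 = rng.beta(alpha2 + Z, kappa2 - alpha2 + b - Z, size=size)
    return np.log(w2 / (1.0 - w2))

def replicate_h3(Z, size):
    """Posterior replicates of h_i3: log of a gamma draw."""
    w3 = rng.gamma(shape=alpha3 + Z, scale=1.0 / (kappa3 + 1.0), size=size)
    return np.log(w3)

h2 = replicate_h2(Z=7, b=10, size=100_000)   # e.g., 7 successes in 10 trials
h3 = replicate_h3(Z=12, size=100_000)        # e.g., an observed count of 12
# Back-transformed means are (alpha2 + Z)/(kappa2 + b) and
# (alpha3 + Z)/(kappa3 + 1), which approach Z/b and Z as the
# hyperparameters shrink toward zero.
print(round(float(np.mean(1 / (1 + np.exp(-h2)))), 3),
      round(float(np.mean(np.exp(h3))), 3))
```

Each draw is a continuous-valued replicate that can be passed directly to the preferred model in the next step of the composite sampler.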
Step 2 of Algorithm 1 involves simulating according to (3.10), which is straightforward.
The specification of a transformation hyperprior for γ is crucial to guarantee that the posterior distribution of the transformation is proper. Thus, we assume α_1 = κ_1 = 0, α_2 and α_3 are distributed according to gamma distributions, κ_2 | α_2 is distributed according to a shifted (by α_2) gamma distribution, κ_3 follows a gamma distribution, and v is distributed according to an inverse gamma distribution (e.g., see Gelman, 2006, among others). These transformation hyperpriors are explicitly stated, and the full-conditional distributions for γ are derived, in Supplementary Appendix C.1. In this context α_2 and α_3 play the role of a continuity correction for zero-valued responses. The use of continuity corrections for zero-valued responses has a long history. For example, see Cox (1970)'s use in contingency tables and Yates (1934), who suggests adding 1/2 to zero-valued responses so that log-odds ratios are defined. The value 1/2 forces an average approximation error of zero (Cox, 1970). Different (smaller) values have been proposed, including more data-driven techniques (Mosteller and Tukey, 1977; Fienberg, 1969; Sweeting et al., 2004, among others). A major difference between these continuity corrections and the proposed HGT is that we add a random amount for the correction, rather than a pre-determined fixed amount.
The general Bayesian implementation described in Section 3.2 is flexible enough to allow for a transformation prior that implies cross-dependence among the elements of h, but we do not consider this case in this article. The main reason for this choice is that transformations are used in place of the original multiple response-type dataset when implementing the preferred model (Step 4 of Algorithm 1). That is, the transformed values are used as a proxy for (or in place of) the multiple response-type data in the preferred model. Consequently, we would like to specify h to "overfit" the data so that h can reasonably be thought of as a proxy for the data.
Our choice of the DY prior in (3.8) leads to posterior replicates that overfit the data. In particular, it is straightforward to verify that the posterior mean of h, mapped back to the original scale of the multiple response-type data, converges to the observed data {Z_ij} as the hyperparameters go to zero; see (3.11) and its derivation in Supplementary Appendix B. This suggests that estimates from f(h | z_trn) overfit the data; however, it is not necessarily true that f(y, θ | z_trn) overfits the data.
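For the count-valued case, this limit can be checked by hand from the gamma full conditional stated in Section 3.3. The sketch below covers only this one response-type, using the labels alpha3 and kappa3 from the text.

```python
# Numeric illustration of the overfitting limit for count data: exp(h_i3)
# follows a gamma distribution with mean (alpha3 + Z_i3)/(kappa3 + 1),
# which recovers the observed count Z_i3 as the hyperparameters -> 0.
Z_i3 = 9
means = []
for eps in (1.0, 0.1, 0.001):
    alpha3 = kappa3 = eps
    means.append((alpha3 + Z_i3) / (kappa3 + 1.0))
print(means)   # increases toward 9.0 as eps decreases
```

Analogous limits hold for the continuous and binomial cases via the normal and beta full conditionals.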

Example of Bayesian Implementation
Consider the following mixed effects model for the transformed data (e.g., see Cressie and Johannesson, 2008, among others):

h_ij | β, η, ξ_ij, σ² ~ φ(h_ij | x_ij'β + S_ij'η + ξ_ij, σ²),
β | σ²_β ~ Normal(0_p, σ²_β I_p),
η | σ²_η ~ Normal(0_r, σ²_η I_r),
ξ_ij | σ²_ξ ~ φ(ξ_ij | 0, σ²_ξ),
σ² ~ IG(α_v, β_v), σ²_η ~ IG(α_η, β_η), σ²_ξ ~ IG(α_ξ, β_ξ),

where φ(·|μ, v) is the pdf of a normal distribution with mean μ and variance v, IG denotes the inverse gamma distribution, x_ij is a p-dimensional vector of known covariates, I_r is an r × r identity matrix, 0_r is an r-dimensional vector of zeros, α_v = α_η = α_ξ = 1, β_v = β_η = β_ξ = 1, σ²_β = 100, and ξ = (ξ_11, ..., ξ_I33)'. The hyperparameters are chosen so that the prior is relatively "flat," and we find that our results are robust to these specifications. In Algorithm 1, we set Y_ij = x_ij'β + S_ij'η + ξ_ij and θ = (β', σ², σ²_ξ, σ²_η)'. The choice of basis functions and the specification of r are important. In Supplementary Appendix C.2, we give these details.
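A minimal sketch of the resulting conjugate Gibbs updates might look like the following. For brevity the variance parameters are held fixed (in the full sampler they receive inverse-gamma updates, as listed in Supplementary Appendix C.2), and all data below are simulated assumptions in place of an actual posterior replicate h^[b].

```python
import numpy as np

# Conjugate Gibbs updates for beta, eta, and xi given one replicate h of the
# transformation; variances held fixed and data simulated for illustration.
rng = np.random.default_rng(4)
n, p, r = 200, 2, 4
X = rng.standard_normal((n, p))                    # covariates x_ij
S = rng.standard_normal((n, r))                    # basis functions S_ij
beta_true = np.array([1.0, -1.0])
h = X @ beta_true + 0.5 * rng.standard_normal(n)   # stand-in for h^[b]

s2, s2_beta, s2_eta, s2_xi = 0.25, 100.0, 1.0, 0.1

def draw_mvn(prec, lin):
    """Sample from Normal(prec^{-1} lin, prec^{-1}) via a Cholesky factor."""
    L = np.linalg.cholesky(prec)
    return np.linalg.solve(prec, lin) + np.linalg.solve(L.T, rng.standard_normal(len(lin)))

beta, eta, xi = np.zeros(p), np.zeros(r), np.zeros(n)
draws = []
for it in range(600):
    beta = draw_mvn(X.T @ X / s2 + np.eye(p) / s2_beta,
                    X.T @ (h - S @ eta - xi) / s2)
    eta = draw_mvn(S.T @ S / s2 + np.eye(r) / s2_eta,
                   S.T @ (h - X @ beta - xi) / s2)
    v = 1.0 / (1.0 / s2 + 1.0 / s2_xi)
    xi = rng.normal(v * (h - X @ beta - S @ eta) / s2, np.sqrt(v))
    if it >= 100:
        draws.append(beta.copy())

print(np.round(np.mean(draws, axis=0), 2))   # near (1, -1) for this setup
```

In Algorithm 1 these updates would be run once (or a few times) per replicate h^[b], so the preferred model's sampler is reused without modification.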
The full conditional distributions for y and θ are well-known (e.g., see Cressie and Wikle, 2011, for a standard reference) and are listed in Supplementary Appendix C.2. Thus, Step 2 of Algorithm 1 involves simulating according to (3.10), which produces MCMC replicates {h^[b]} and {γ^[b]}. Then Step 4 of Algorithm 1 involves cycling through these full conditional distributions, and note that one can easily allow for block-wise updates. Notice that each full conditional distribution has the conditioning event h^[b], but does not condition on z_trn and γ^[b], due to the conditional independence assumptions described in Section 3.2. Let η^[0] and {ξ^[0]_ij} be pre-defined initializations.
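The two-step structure of Algorithm 1 can be sketched with a toy stand-in for each step; the transformation update below is a crude proxy for the LCM update in (3.10), and the "preferred model" is a one-parameter Gaussian model — both are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.poisson(lam=10.0, size=50).astype(float)   # toy count-valued data
B, tau2 = 1000, 0.01

mu_draws = np.empty(B)
for b in range(B):
    # Step 2 (stand-in for the LCM update): draw the transformed data h[b].
    # Here we simply perturb log(z + 1/2); the paper instead samples h and
    # gamma from the marginalized HGT model.
    h = np.log(z + 0.5) + rng.normal(0.0, np.sqrt(tau2), size=z.size)

    # Step 4: given h[b], update the preferred continuous-data model.
    # Toy preferred model: h_i ~ N(mu, 1) with a flat prior on mu, so the
    # full conditional is N(mean(h), 1/n); we condition on h only, not on z.
    mu_draws[b] = rng.normal(h.mean(), np.sqrt(1.0 / z.size))

posterior_mean = mu_draws.mean()
```

The essential point is the conditioning structure: the preferred-model update in the loop touches only the current h, never the raw data z or γ.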
These standard full-conditional distributions and other details are given in Supplementary Appendix C.1 and C.2.

Statistical Inference
Estimation and prediction over the training set can be done by computing summary statistics using the quantities generated in Step 4 of Algorithm 1. However, to forecast values (e.g., future cases or deaths due to COVID-19) we make use of validation and testing datasets.

Goodness-of-Fit Using Training Data
Assessment of the goodness of fit can be done similarly to residual analyses of transformed data in traditional regression analyses. We compute the residuals δ = (δ_ij : i = 1, . . . , I_j, j = 1, 2, 3)′, with δ_ij = h_ij − Y_ij, and compute a credible region associated with δ (e.g., see Gelman et al., 2013, for a standard reference). For example, for each i and j, find the values q_L,ij and q_U,ij such that P(q_L,ij ≤ δ_ij ≤ q_U,ij | z_trn) = 1 − α, where 1 − α defines the "level" of the credible interval, is prespecified, and is different from the hyperparameters of the DY distribution. A default choice is α = 0.05. In practice, it is rather straightforward to approximate q_L,ij and q_U,ij. Let h^[b]_ij and Y^[b]_ij be the b-th posterior replicates of h_ij and Y_ij, so that δ^[b]_ij = h^[b]_ij − Y^[b]_ij is the b-th posterior replicate of δ_ij. Then q_L,ij and q_U,ij can be approximated with the α/2 and 1 − α/2 percentiles of the set {δ^[b]_ij : b = 1, . . . , B}, respectively. If the value of zero lies within this interval (i.e., q_L,ij < 0 < q_U,ij) for many values of i and j, then this suggests that the model for y provides a reasonable fit to this dataset. Equation (3.11) shows that the posterior mean of the transformation overfits the data, which we motivated as a way to avoid oversmoothing estimates of y and θ in Algorithm 1. However, the fact that the posterior distribution of the transformations overfits is also important from the point-of-view of diagnostics. In particular, in the goodness-of-fit literature, overfitted values are often used as a proxy for the data. For example, in log-linear models the most parsimonious reduced model that is not significantly different (in terms of the deviance or chi-square statistic) from the saturated model (an overfitted model) is used for statistical inference (e.g., see Agresti, 2007, for a standard reference). This is exciting because it provides a new way to conduct classical residual analysis in a Bayesian multiple response-type data context.
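Approximating q_L,ij and q_U,ij from MCMC output reduces to pointwise percentiles of the residual replicates; a sketch with simulated draws follows (the arrays h_draws and Y_draws are hypothetical stand-ins for Algorithm 1 output):

```python
import numpy as np

rng = np.random.default_rng(3)
B, n = 2000, 100
alpha = 0.05   # credible-interval level (distinct from the DY hyperparameters)

# Hypothetical MCMC output: posterior replicates of the transformed data h
# and of the fitted process Y, stacked as (B, n) arrays over all (i, j).
h_draws = rng.normal(0.0, 1.0, size=(B, n))
Y_draws = h_draws + rng.normal(0.0, 0.2, size=(B, n))

# Residual replicates delta[b] = h[b] - Y[b], then pointwise percentiles.
delta = h_draws - Y_draws
qL = np.percentile(delta, 100 * alpha / 2, axis=0)
qU = np.percentile(delta, 100 * (1 - alpha / 2), axis=0)

# Proportion of indices whose interval covers zero: values near one
# suggest the preferred model fits the transformed data well.
coverage = np.mean((qL < 0) & (0 < qU))
```

Here the fitted values track the transformations closely by construction, so most intervals cover zero, mimicking a well-fitting model.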
In particular, in Section 5 we give an example of plotting the (posterior median) residuals versus a useful covariate not included in the analysis to assess whether or not it should be included in a model.
Since h_ij is unknown, it is also of interest to determine the goodness of fit of our model for h_ij. That is, we use δ_ij to assess the goodness of fit of Y_ij, but one should consider the goodness of fit of h_ij as well. One approach is to "back-transform." However, since our model for the transformation h_ij is unknown, the back-transformation is also unknown. The Bayesian perspective is flexible enough to do inference on the back-transformation. In particular, we define the unknown back-transformation of h_ij to be the replicate from the data model f(Z_ij|h_ij), since replicates from f(Z_ij|h_ij) correspond to the transformation h_ij. Thus, posterior predictive data (denoted with Z*_ij) can be used to estimate the unknown back-transformed value, and subsequently, be used to assess the performance of the transformation. In particular, if Z_ij is consistently (according to the posterior distribution) close to Z*_ij, we obtain a transformation model that overfits, which is a goal for our model for h_ij. That is, the traditional posterior predictive p-value (e.g., Meng et al., 1994; Gelman et al., 1996) can be used to assess the goodness-of-fit of the back-transformation, and hence, the transformation itself. In practice, one might compute the proportion of times, over posterior predictive replicates Z*_ij, that the chi-square statistic (or sum of squared Pearson residuals) computed from the replicates is larger than the observed chi-square statistic.
Outside of the HGT setting, a preferable posterior predictive p-value would be 0.5 (e.g., see Gelman, 2013, for a more complete discussion and other considerations when interpreting the posterior predictive p-value), since values close to one and zero indicate overfitting and oversmoothing, respectively. However, since we purposefully overfit the HGT when estimating {h_ij}, we prefer the traditional posterior predictive p-value to be "close" to one, which indicates a good fit for the model for the transformation.
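A sketch of this posterior predictive check for Poisson-type responses, using a Pearson chi-square discrepancy and hypothetical posterior replicates that deliberately hug the data, is:

```python
import numpy as np

rng = np.random.default_rng(4)
B, n = 2000, 60
z = rng.poisson(10.0, size=n).astype(float)          # observed counts
z = np.maximum(z, 1.0)                               # guard toy discrepancy against zeros

# Hypothetical posterior replicates of the Poisson mean; in the HGT these
# come from the transformation model, which is designed to overfit, so the
# replicates here sit very close to the observed data.
lam_draws = z[None, :] * np.exp(rng.normal(0.0, 0.02, size=(B, n)))

ppp_count = 0
for b in range(B):
    lam = lam_draws[b]
    z_rep = rng.poisson(lam)                          # posterior predictive data Z*
    chi2_rep = np.sum((z_rep - lam) ** 2 / lam)       # discrepancy for Z*
    chi2_obs = np.sum((z - lam) ** 2 / lam)           # discrepancy for observed z
    ppp_count += chi2_rep > chi2_obs
ppp = ppp_count / B

# Because lam_draws hug the data (overfitting), chi2_obs is tiny and the
# p-value lands near one, the behavior Section 4.1 targets for the HGT.
```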

Estimating Hyperparameters Using a Validation Dataset
In machine learning, one often adjusts the model for being biased towards the training data by holding aside a dataset to estimate hyperparameters. This hold-out dataset is referred to as a validation dataset (Hastie et al., 2009), where recall, the validation dataset z_val = (Z_ij : i = I_j + 1, . . . , I^val_j, j = 1, 2, 3)′ is observed over the indices i ∈ {I_j + 1, . . . , I^val_j} and j = 1, 2, 3. Additionally, let Y*_ij be the posterior predictive replicate of Y_ij. We cannot replace Y*_ij with Y_ij in our analysis of the validation data; otherwise, the validation data would be included with the training data when updating Y_ij. Then, we assume the validation data are distributed according to the data model evaluated at k_j(Y*_ij; κ), where κ is a generic d-dimensional vector of real-valued parameters and f(κ) is the prior distribution of this parameter. We parameterize the unknown function k_j with κ.
In this article, we define k_j to be a line, so that Y*_ij can be adjusted linearly as k_j(Y*_ij; κ) = κ_j0 + κ_j1 Y*_ij, with κ = (κ_10, κ_20, κ_30, κ_11, κ_21, κ_31)′. Within Algorithm 2, κ^[b] is sampled from its full-conditional distribution; we use the slice sampler (Neal et al., 2003) since the full-conditional distribution does not have a closed form, and this update is repeated for b = 1, . . . , B, for a prespecified value of B. When we consider κ unknown, we choose the improper flat prior f(κ) = 1. We also consider setting κ = (0, 0, 0, 1, 1, 1)′ so that k_j is simply the identity function. Algorithm 2 describes the implementation of our model for validation data.
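Since the full-conditional distribution of κ is not available in closed form, a univariate slice sampler of the stepping-out and shrinkage type can be used; the sketch below applies one to a toy Gaussian validation likelihood for a single slope parameter (the data and variance are illustrative assumptions, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(5)

def slice_sample(logpost, x0, n_draws, w=1.0, m=50):
    # Univariate slice sampler with stepping-out and shrinkage (Neal, 2003).
    draws = np.empty(n_draws)
    x = x0
    for i in range(n_draws):
        logy = logpost(x) + np.log(rng.uniform())   # slice level under the density
        L = x - w * rng.uniform()                   # randomly place initial interval
        R = L + w
        steps = m
        while steps > 0 and logpost(L) > logy:      # step out to the left
            L -= w
            steps -= 1
        steps = m
        while steps > 0 and logpost(R) > logy:      # step out to the right
            R += w
            steps -= 1
        while True:                                 # shrink until a point is accepted
            x_new = rng.uniform(L, R)
            if logpost(x_new) > logy:
                x = x_new
                break
            if x_new < x:
                L = x_new
            else:
                R = x_new
        draws[i] = x
    return draws

# Toy full conditional for a single slope kappa_j1 under a flat prior and a
# Gaussian validation likelihood (stand-ins for the paper's model).
y_star = rng.normal(size=30)                        # posterior predictive replicates
z_val = 1.5 * y_star + rng.normal(0.0, 0.1, size=30)
logpost = lambda k1: -0.5 * np.sum((z_val - k1 * y_star) ** 2) / 0.01
kappa_draws = slice_sample(logpost, x0=0.0, n_draws=2000)
```

With a flat prior and Gaussian likelihood the draws concentrate near the least-squares slope, here the true value 1.5.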
There are two main reasons why we consider introducing k_j. The first, as discussed at the end of Section 2, is that oftentimes estimates are biased towards the training data, since estimators are defined to be "close" to the training data. Consequently, the use of validation data to estimate κ can aid in accounting for out-of-sample error. A second reason for the use of k_j and κ is to model a certain feature in our COVID-19/social distancing dataset, where the validation and testing data are the second-to-last and last day of available data, respectively. Specifically, there is potential for dramatic day-to-day changes in our application, which may suggest that the training data is centered at a completely different value than the validation and testing data. In this setting, κ_j0 is needed to re-center the last two days. Additionally, one might redefine the link function to include k_j directly when implementing the HGT for training data; however, this specification implies that the training data will inform κ, and as a result the out-of-sample error is not controlled. This post-processing step has been done before in the machine learning literature, where successive linear models are applied to residuals and are referred to as a "ResNet" (He et al., 2016). What differs in our implementation is that (1) we enforce only a single linear model k_j(·, κ) (i.e., we do not do multiple composites of a linear model), and (2) we are Bayesian in our implementation.

Forecasting
We produce next-day forecasts for the variables in our study, which we treat as testing observations indexed over i = I^val_j + 1, . . . , I for j = 1, 2, 3. We let κ* and Y**_ij be distributed according to f(κ|z_trn, z_val) and f(Y_ij|z_trn), respectively. Again, we cannot let Y**_ij equal Y_ij, since otherwise the testing data would be included when updating Y_ij based on the training data. Then, we assume that the testing data are distributed according to (4.4). Algorithm 3 repeatedly samples Z^[b]_ij from (4.4) for b = 1, . . . , B (for a prespecified value of B), and then computes the sample mean and variance (across the index b) of Z^[b]_ij.
Predictions of the data process and estimation of cross-covariances can be found in a similar manner as in the GLMM example in (2.3) and (2.2). That is, the posterior mean and covariance of Z_ij and Z_kl are E(Z_ij|z_trn) and cov(Z_ij, Z_kl|z_trn), where recall, if we assume the process model in Section 3.4, cov(Z_ij, Z_kl|z_trn) = S_ij′ cov(η|z_trn) S_kl, which is not necessarily zero. Implementation is summarized in Algorithm 3. When k_j is the identity function, the predictions and covariances are simply these posterior summaries. In the context of observing daily data, once the testing data is observed (i.e., once tomorrow's data is realized), we can assess the performance of our forecasts (e.g., through the root mean squared error, etc.).
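Given posterior replicates of η from Step 4, the cross-covariance can be estimated by Monte Carlo; the basis vectors and posterior distribution below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(6)
B, r = 2000, 8

# Hypothetical posterior replicates of the shared random effect eta (B x r)
# and basis-function vectors S_ij, S_kl for two observations of possibly
# different response-types.
eta_draws = rng.multivariate_normal(np.zeros(r), 0.5 * np.eye(r), size=B)
S_ij = rng.normal(size=r)
S_kl = rng.normal(size=r)

# Monte Carlo estimate of cov(eta | z_trn), then the implied cross-covariance
# cov(Z_ij, Z_kl | z_trn) = S_ij' cov(eta | z_trn) S_kl (identity k_j for clarity).
cov_eta = np.cov(eta_draws, rowvar=False)
cross_cov = float(S_ij @ cov_eta @ S_kl)
```

Because η is shared across response-types, this cross-covariance is generally nonzero even when i and k index different variables.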
The model in (4.6) is the aforementioned HGT model. This is a well-defined, proper model (see Supplementary Appendix B for these details), provided that f(h_ij|θ), f(y|θ), and f(θ) are proper.
Recall that one motivation for the model in (4.6) is that one can incorporate their preferred model for continuous data directly into our framework, since Algorithm 1 does not require one to change the implementation of their preferred model. This flexibility arises in the data scientist's specification of f (h ij |θ), f (y|θ), and f (θ). In Section 3.4 we specify f (h ij |θ), f (y|θ), and f (θ) using a mixed effects model, and in Section 5 we also consider using BART to illustrate this flexibility. Although we only consider Bayesian specifications of the preferred model, Step 4 can easily be substituted with replicates/estimates of y and θ (computed using h [b] ) from empirical Bayesian models, approximate Bayesian models, or frequentist models.
The LCM is explicitly used in the HGT model in (4.6) through the term m(h|y), where recall f_DY denotes the DY prior, and the prior for γ is defined in Supplementary Appendix C.1. Recall that Algorithm 1 is a collapsed Gibbs sampler, where we update the transformation h and γ using the marginal distribution of the HGT model in (4.6) that is found by integrating out the process y and parameters θ. Specifically, when integrating out y and θ in (4.6), we obtain a marginal model for h_ij (i = 1, . . . , I_j, j = 1, 2, 3) together with the transformation hyperprior γ ∼ f(γ), which leads to the computationally simple updates of the transformations h and parameters γ developed in Section 3.3 to be used in Step 2 of Algorithm 1.
The joint distribution of the validation data, processes, and parameters is written as the product of the conditional distributions in (4.7a) through (4.7c), where recall that the goal of this model is to estimate κ from its posterior f(κ|z_val, z_trn), which is a parameter that allows one to avoid overfitting the training data. The distribution f(Y*_ij|z_trn) is the posterior distribution implied by the HGT model in (4.6). Model (4.7) can be implemented through Algorithm 2. When f(y|θ) is specified according to a linear model (i.e., Equation (2.4)), then Equations (4.7a) through (4.7c) can be thought of as a GLMM (McCulloch et al., 2008). GLMMs also arise in our model for testing data. The joint distribution of the testing data, processes, and parameters is written as the product of the conditional distributions in (4.8), where the goal is to predict the testing data Z_ij at i = I^val_j + 1, . . . , I and j = 1, 2, 3. The distribution f(Y**_ij|z_trn) is the posterior distribution implied by the HGT model in (4.6), and f(κ*|z_val, z_trn) is the posterior distribution from (4.7). The model for the testing data in (4.8) can be implemented through Algorithm 3.

Simulations
The goals of this simulation study are to provide a standard demonstration that the HGT model produces reasonable predictions in a computationally efficient manner, and to illustrate the flexibility of the HGT model to incorporate a data scientist's preferred model for continuous data. To do this we apply the HGT in (4.6) to the spatio-temporal mixed effects model in Section 3.4 and to BART (details in Supplementary Appendix C.3). Friedman (1991) introduced a simulation design, which has become a useful benchmark study (e.g., see Chipman et al., 2010, among others). Let

w(x_1,ij, . . . , x_10,ij) = 10 sin(π x_1,ij x_2,ij) + 20 (x_3,ij − 0.5)² + 10 x_4,ij + 5 x_5,ij; i = 1, . . . , I, j = 1, 2, 3, (5.1)

which includes two non-linear terms, two linear terms, and a non-linear interaction. We consider the following specifications of the distributional assumptions associated with the data:

Simulation Setup
Z_k2 ind∼ Binomial(300, exp(w(x_1,k2, . . . , x_10,k2)) / [1 + exp(w(x_1,k2, . . . , x_10,k2))]), and Z_ℓ3 ind∼ Poisson(exp{w(x_1,ℓ3, . . . , x_10,ℓ3)}), (5.2)

for i = 1, . . . , I_1, k = I_1 + 1, . . . , I_1 + I_2, and ℓ = I_1 + I_2 + 1, . . . , I_1 + I_2 + I_3. Methods are compared using the root mean squared error (RMSE) of the estimates of Y_ij, where Y_ij is estimated using Monte Carlo integration with 2,000 iterations and a burn-in of 1,000. For each Bayesian method, we let Y_ij be the pointwise posterior mean of g^{-1}_j(h). We fit the preferred model using covariates x_1,ij, x_3,ij, x_4,ij, . . . , x_10,ij, and hence we consider the case where an important covariate is not observed (i.e., {x_2,ij}) and several unneeded covariates are included (i.e., {x_6,ij, . . . , x_10,ij} are not present in (5.1)). The omission of {x_2,ij} when implementing our method is a slight departure from the original setup in Friedman (1991). However, we feel that it is more realistic to assume that not all covariates are observed in practice, and this will be a helpful choice for illustration. We specify x_k,ij ∼ Uniform(0, 1), where Uniform(0, 1) is shorthand for the uniform distribution over the interval [0, 1] and k = 1, . . . , 10. The preferred models are the spatio-temporal mixed effects model and BART (and an extension), whose implementations are described in Supplementary Appendix C.2 and Supplementary Appendix C.3, respectively. Additionally, the choice of basis functions is described in Supplementary Appendix C.4. In the implementation of each preferred method, we allow each response-type to have different regression coefficients.

Figure 1: A violin plot of the RMSE (y-axis) by HGT method (x-axis) over 20 independent replicates of the data. The data are simulated as described in Section 5.1. Each HGT method is implemented using Algorithm 1, except the method "Saturated."
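The simulation design can be sketched as follows; note that the logit and log arguments are rescaled by 10 here, an assumption made purely to keep the toy rates numerically stable, and not part of the design in (5.2):

```python
import numpy as np

rng = np.random.default_rng(7)
I1, I2, I3 = 350, 350, 300
I = I1 + I2 + I3

# Friedman (1991) test function in (5.1): two nonlinear terms, a nonlinear
# interaction, and two linear terms.
def w(x):
    return (10 * np.sin(np.pi * x[:, 0] * x[:, 1])
            + 20 * (x[:, 2] - 0.5) ** 2 + 10 * x[:, 3] + 5 * x[:, 4])

x = rng.uniform(0.0, 1.0, size=(I, 10))      # x_k,ij ~ Uniform(0, 1), k = 1..10
wx = w(x)

# Three response-types driven by the same latent surface (families as stated;
# the /10 rescaling is an illustrative assumption):
z_gauss = wx[:I1] + rng.normal(0.0, 1.0, I1)                       # continuous
p = 1.0 / (1.0 + np.exp(-wx[I1:I1 + I2] / 10.0))                   # rescaled logit
z_binom = rng.binomial(300, p)                                     # binomial counts
z_pois = rng.poisson(np.exp(wx[I1 + I2:] / 10.0))                  # Poisson counts
```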

Simulations: Joint Analysis of Multiple Response-Types
In this section, we evaluate the predictive performance of our Bayesian model with unknown transformations in the multiple response-type data setting. In particular, we set the preferred model equal to BART (Chipman et al., 2010) and a Bayesian version of the spatio-temporal mixed effects model (Cressie and Johannesson, 2008) using basis functions introduced by Hughes and Haran (2013). The posterior mean of h_ij (referred to as the saturated model) is included as a default poor estimator, since it is known to overfit the data.
The data are simulated according to (5.2), with I = 1000, I_1 = 350, I_2 = 350, and I_3 = 200. We do not include a validation dataset, so that k_j ≡ g_j. We repeat this simulation study 20 times, and we provide violin plots of the RMSE over the 20 replicates by method in Figure 1. In Figure 2 we also plot the true function versus the estimated function for a single replicate dataset. Figures 1 and 2 suggest that the HGT-based spatio-temporal mixed effects model (and HGT-BART) performs well in terms of predictive performance. For the replicate in Figure 2, the HGT-based spatio-temporal mixed effects model (and HGT-BART) had 97% (94%) of the point-wise credible intervals of the elements of δ containing zero. The patterns observed in Figure 1 mimic the goodness-of-fit diagnostics, which is notable because the goodness-of-fit diagnostics are data driven (and hence can be used in practice), while Figure 1 is based on the unknown truth. The posterior predictive p-value is ≈ 1, which suggests that our model for the transformation overfits as desired (see Section 4.1). These results suggest that the Bayesian transformations can be used to obtain predictions in the non-Gaussian setting using two standard models, and also have a useful built-in goodness-of-fit diagnostic. Now, suppose we have observed the values of {x_2,ij}, and recall these covariates are not included in the analysis. In Figure 3, we plot the posterior median of the residuals versus the covariate {x_2,ij} across the indexes i and j for a single replicate of the dataset. The plot clearly indicates a sinusoidal or possibly quadratic pattern, which suggests that this behavior is not captured in our model for y. We know this to be true because {x_2,ij} is not included in our implementation, but the data were generated using {x_2,ij}. This is an illustration of how our approach provides a Bayesian analog to a graphical technique from classical regression analysis (i.e., systematic patterns in residuals from a multiple regression versus a covariate suggest that the covariate should be included in the analysis).

Figure 3: We simulate a single replicate of {Y_ij} according to Section 5.1. Then a spatio-temporal mixed effects model is implemented using the specifications in Section 3.4. This plot displays the posterior median of {δ_ij} (see Section 4.1) versus x_2,ij, which is not included in our implementation of the spatio-temporal mixed effects model. A systematic pattern in this plot suggests that including x_2,ij would improve our analysis of y.
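This residual-versus-omitted-covariate diagnostic can be mimicked on a toy version of the Friedman surface; here x2 is omitted from a simple fit in x1 (a stand-in for the posterior-median residuals of the HGT analysis), and binned residual means against x2 reveal the systematic pattern:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500

# Toy data: x2 contributes through the sinusoidal interaction, but is
# deliberately omitted from the fitted model below.
x1 = rng.uniform(size=n)
x2 = rng.uniform(size=n)
y = 10 * np.sin(np.pi * x1 * x2) + rng.normal(0.0, 0.5, n)

# Stand-in fitted values from a model using x1 only (posterior medians would
# play this role in the HGT analysis).
fitted = np.poly1d(np.polyfit(x1, y, 2))(x1)
resid = y - fitted

# Binned means of the residuals against the omitted covariate x2: a clear
# systematic trend indicates x2 should enter the model.
bins = np.linspace(0, 1, 11)
idx = np.digitize(x2, bins) - 1
binned = np.array([resid[idx == k].mean() for k in range(10)])
spread = binned.max() - binned.min()
```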

Simulations: Robustness to Departures from Model Assumptions
In this simulation study we compare the predictive performance of our HGT approach to predictions from the preferred model itself. A straightforward way to do this is to restrict ourselves to the continuous data-only setting, in which both modeling paradigms can be implemented more readily. The data are simulated according to (5.2), with I 1 = 800, I = 1000, and I 2 = I 3 = 0. We do not include a validation dataset.
We repeat this simulation study 20 times, and we provide violin plots of the RMSE over the 20 replicates by method in Figure 4. In this section, we include an additional predictor: soft BART (SBART; Linero and Yang, 2018; see Supplementary Appendix C.3 for more details). We again see that the HGT versions of BART and the spatio-temporal mixed effects model outperform the saturated model, with the spatio-temporal mixed effects model clearly outperforming BART. Additionally, the HGT versions of BART and the spatio-temporal mixed effects model perform only slightly better than or the same as their non-transformed counterparts. Here we see that SBART performs worse than the saturated model in terms of RMSE. The HGT version of SBART does not perform noticeably differently than SBART in terms of RMSE. This suggests that in the continuous-only setting, if the preferred model performs well (or poorly), one should expect the Bayesian transformation approach to perform well (or poorly). Recall that we can use the goodness-of-fit approach in Section 4.1 to assess when a method performs poorly in practice. For example, for a single replicate dataset, we found that the percent of credible intervals of the elements of δ that contain zero (by method) are as follows: 99.8% (spatio-temporal mixed effects model), 77.4% (BART), and 58.1% (SBART). This produces the same rankings of the methods in terms of RMSE. Additionally, the posterior predictive p-value is ≈ 1, which suggests that our model for the transformation overfits as desired (see Section 4.1).

Figure 4: A violin plot of the RMSE (y-axis) by method (x-axis) over 20 independent replicates of the data. The data are simulated as described in Section 5.1. Each method is implemented using Algorithm 1, except the method "Saturated." The observed dataset is used as the predicted values for the method "Saturated." The results for the HGT methods are highlighted in black, the results for the original methods are highlighted in white, and the saturated model is highlighted in green (these models are also indicated on the x-axis).

Simulations: Computational Considerations
In the continuous-only data setting Algorithm 1 requires more computation, and hence more computation time. This is because Step 2 in Algorithm 1 is not needed to implement the preferred model without transformation, but is required to implement the HGT. Thus, in the continuous-only data setting the main motivation for the HGT is that it allows for the diagnostics in Section 4.1, which the un-transformed preferred model does not. However, for Poisson-only, binomial-only, and multiple response-type data Algorithm 1 can lead to clear computational speed-ups compared to existing GLMMs. To demonstrate this, we compare the HGT and the GLMM implementation of the SME model for a Poisson-only dataset (i.e., I_1 = 0, I_2 = 0, and I_3 = 350). The GLMM is implemented using the same covariates/basis functions as the HGT, and the R package MCMCglmm, based on Metropolis-Hastings updates, is used (Hadfield, 2010). Algorithm 1 is implemented using 2,000 replicates and a burn-in of 1,000. Trace plots were used to informally investigate convergence, with no lack of convergence detected. The Bayesian GLMM is implemented with 13,000 replicates, a burn-in of 3,000, and a thinning rate of 10. The effective sample size (ESS) of Y_ij is computed to better compare the computational performance of the HGT and the GLMM (e.g., see Kass et al., 2016, among others). The ESS is the total number of MCMC replicates times a ratio of the within-chain variance and between-chain variance. We average component-wise ESS instead of implementing a multivariate version of ESS (Vats et al., 2019), since our goal is simply to compare the HGT and the GLMM.

Figure 5: In the left panel we plot a violin plot of the RMSE (y-axis) by method (x-axis) over 20 independent replicates of the data; the middle panel replaces the RMSE with Central Processing Unit (CPU) time in seconds; and the right panel replaces the RMSE with the ESS. The data are simulated as described in Section 5.1. The HGT is implemented using Algorithm 1, and the GLMM is implemented using a Gibbs sampler with Metropolis-Hastings updates. The results for the HGT method are highlighted in black, and the results for the GLMM are highlighted in white (these models are also indicated on the x-axis).
In Figure 5, we see that the GLMM took more time (roughly 275 seconds for the GLMM, and roughly 100 seconds for the HGT) to produce fewer average effective samples (roughly 800 for the GLMM and roughly 1,300 for the HGT) than the HGT. Furthermore, the predictive performance of HGT is better than the GLMM, as measured by RMSE, in this Poisson-only data setting. This is somewhat different from what is seen in the continuous-only data setting in Section 5.3, where the difference in RMSE between the HGT and the preferred model was negligible.
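A component-wise ESS of the truncated-autocorrelation type (one common estimator; the estimator referenced via Kass et al. (2016) may differ in detail) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(9)

def ess(chain, max_lag=200):
    # Effective sample size B / (1 + 2 * sum of positive-lag autocorrelations),
    # truncated at the first non-positive sample autocorrelation.
    x = chain - chain.mean()
    B = x.size
    acf = np.correlate(x, x, mode="full")[B - 1:] / (x @ x)
    rho_sum = 0.0
    for k in range(1, min(max_lag, B - 1)):
        if acf[k] <= 0:
            break
        rho_sum += acf[k]
    return B / (1.0 + 2.0 * rho_sum)

# Toy chain: an AR(1) process with lag-one correlation phi has
# ESS approximately B * (1 - phi) / (1 + phi).
B, phi = 5000, 0.5
chain = np.empty(B)
chain[0] = 0.0
for t in range(1, B):
    chain[t] = phi * chain[t - 1] + rng.normal()
ess_val = ess(chain)   # expected near B / 3 for phi = 0.5
```

Averaging this quantity over the components of Y_ij gives the comparison metric used in Figure 5.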
There are of course several other methods/computational approaches besides the GLMM with Metropolis-Hastings updates that are more computationally efficient (e.g., the LCM or the use of INLA, etc.). Thus, a fair conclusion would be the following: if the continuous-only preferred model (e.g., the preferred model is proportional to the SME) is considerably more computationally advantageous than the non-Gaussian version of the preferred model (e.g., the GLMM), then the HGT is a more computationally practical choice in the non-Gaussian response-type setting (as illustrated in the above example). This is because the HGT can use the same implementation of the continuous-only preferred model in Step 4, and the added computations in Step 2 are "small" as compared to, say, Metropolis-Hastings updates in a GLMM.

Figure 6: We plot the number of reported COVID-19 infections (top left), reported COVID-19 deaths (top middle), reported recoveries from COVID-19 (top right), the DJI adjusted closing price (bottom left), and the Google Trends interest score for searches of "coronavirus" (bottom right). Note that the DJI price data are not available on Saturdays and Sundays. The black circles are the observed data, and blue lines connecting these points are added as a reference. The top row represents only a summary of the available data, since we also observe these counts over 184 countries and 82 provinces.

Joint Analysis of Reported COVID-19 Occurrences, the Adjusted Closing Price of the Dow Jones Industrial, and Google Trends Data
We now present our joint analysis of occurrences of COVID-19, the adjusted closing price of the DJI, and the Google Trends interest score for searches of "coronavirus" (see time series displays of these data in Figure 6). The MCMC is implemented according to Algorithms 1 through 3 with 10,000 replicates and a burn-in of 1,000. Convergence was assessed visually through the use of trace plots and through Gelman-Rubin diagnostics, with no indications of a lack of convergence. All of our analyses were implemented on Windows 10 with the following specifications: Intel(R) CORE(TM) i5-8250U CPU at 1.60GHz.

The Dataset
COVID-19 was first detected in a live animal market in Wuhan City within the Hubei Province of China (Guarner, 2020). This virus spreads easily from person to person, and there are cases of this virus where an individual is unsure of how they became infected (i.e., community transmission; Dowd et al., 2020; Guarner, 2020). As of this writing, no vaccine had been approved for broad distribution in nearly all countries (Corum et al., 2020; Wangping et al., 2020). As such, many governmental organizations, including the Centers for Disease Control and Prevention (CDC), have advised placing distance between yourself and other individuals (i.e., social distancing; CDC, 2020). Social distancing is an important public health measure that reduces close contact with people that may be infected by maintaining physical distance between all individuals (Wilder-Smith and Freedman, 2020; Zhang et al., 2020; CDC, 2020). However, social distancing comes at a cost, and can be detrimental to economies and cause psychological distress (Long, 2020; Park et al., 2020).
The data on reported deaths and cases of COVID-19 were obtained from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) Coronavirus repository (publicly available at https://github.com/CSSEGISandData/COVID-19), a subset of which is made available in the R package coronavirus (Krispin, 2020). Case, recovery, and mortality counts are available over regions (i.e., country or province) and discrete time (daily). In this article, we model these counts using a Poisson distribution, and our main interest lies in estimating the mean number of reported deaths and cases of COVID-19, and estimating its dependence with interest in COVID-19 and the DJI data.
The number of Google searches of "coronavirus" is indicative of the high interest in COVID-19 and can act as a loose proxy for the public interest in COVID-19. This search information is made available through Google Trends data (Google, 2020). Google Trends provides daily time series of an "interest" measure of searches on Google. This interest measure is defined on a scale from zero to one hundred, with 100 indicating high interest and zero indicating low interest. In this article, we model the Google Trends interest score for the search "coronavirus" as binomial with sample size 100, since this response is a non-negative, integer-valued response that is bounded above by 100.
The DJI follows 30 publicly owned blue chip (i.e., nationally recognized and financially secure) companies that trade on the New York Stock Exchange (NYSE) and the National Association of Securities Dealers Automated Quotations (NASDAQ). It is a benchmark for blue-chip stocks and is often treated as a measure of the economic health of the US. This data was obtained through Yahoo Finance (Yahoo, 2020). We model the adjusted daily closing price with a Gaussian distribution, since it is continuous valued. Our main interest in DJI is in determining and summarizing the relationship between the adjusted closing price with both interest in COVID-19 and reported cases and deaths due to COVID-19.
Let Z_i1 represent the negative adjusted closing price of the DJI per $10,000, Z_i2 be the integer-valued interest score for COVID-19 searches as computed by Google Trends (with b_i ≡ 100), and let i index the days ranging from January 22, 2020 to April 8, 2020. We analyze Z_i1 on the negative scale so that we see an increasing trend over time among all three response-types. The data Z_i3 represent the i-th replicate of the number of COVID-19 cases, where for each i there is an associated region (e.g., China) with latitude in [−90, 90], day t_i (between January 22, 2020 and April 8, 2020), and an indicator d_i of whether or not the count consists of reported deaths. Let d_i = 1 if Z_i3 represents reported deaths and d_i = 0 otherwise. Likewise, let u_i represent an indicator of whether or not the count consists of reported recoveries. Also let t_i = 1, . . . , T = 78 represent each day between January 22, 2020 and April 8, 2020. In Figure 6, we plot the number of reported COVID-19 infections, reported COVID-19 deaths, the DJI adjusted closing price, and the Google Trends interest score for searches of "coronavirus." This plot is based on data reported as of this writing, and these data are continually being monitored and updated.
Our specifications of the basis functions are defined in Supplementary Appendix C.4, and covariates include an indicator for the region and response-types, d_i, and u_i. The data from January 22, 2020 to April 6, 2020 are the training data (n = 10,600), the data on April 7, 2020 are held out as a validation dataset (373 observations), and the data on April 8, 2020 are held out as a testing dataset (374 observations).

Goodness of Fit
In Figure 7 we plot the posterior mean deaths, confirmed cases, recovered cases, adjusted closing price, and Google Trends interest score. Here, we see that the predicted values are reasonably close to their observed values, with the observed data contained within pointwise 95% credible intervals. These results suggest that the in-sample error is small, and that the predicted values reflect the general patterns of the data. Goodness of fit can be formally investigated according to the approaches in Section 4.1. Roughly 99.4% of the credible intervals, as defined in (4.1), contain zero. This provides additional evidence that the model provides a reasonable fit to the data. The posterior predictive p-value is ≈ 1, which suggests that our model for the transformation overfits (see Section 4.1). In the bottom right panel of Figure 7 we plot the posterior median residual (i.e., δ) versus the time the observation was recorded. Here we see roughly no pattern over time, which suggests that our specification of the basis functions was reasonable.

Estimation and Prediction
We did not include the data on April 8th, 2020, which was the most current value available at the time of the analysis. We use the HGT model to predict the number of deaths, number of confirmed recoveries, and number of confirmed cases according to Algorithm 3. In Figure 8 we provide the posterior means associated with these values versus the testing data. The use of Algorithm 2 aided in producing more accurate forecasts, as the posterior means of κ_02 and κ_12 were roughly equal to 3.5 and 1.5, respectively. In general, the posterior means from Algorithm 3 track the testing data, except for smaller testing values, where there is a tendency to overestimate the log count. However, the percentage (over the testing data) of pointwise credible intervals that contain the testing data is 98.4%, which suggests that the uncertainty of these estimates is captured in the model. This property of the model is also seen in the plot of the posterior variance versus the posterior mean, also displayed in Figure 8. Here, smaller predicted values tend to be over-dispersed, and larger predicted values appear to be equi-dispersed. Thus, we appear to have accurate predictions for the areas with the largest confirmed cases, recoveries, and deaths. Being able to accurately estimate large values of (log) occurrences is particularly important. That is, if we know where there are large occurrences of confirmed cases, then additional testing of individuals in these regions allows one to isolate all those who test positive, which ultimately reduces the spread of COVID-19 from these regions to others (Ai et al., 2020). Consequently, models such as ours can be useful at stopping the spread of COVID-19. However, finer-scale regional data would be necessary for this model to be helpful in narrowing in on potential "hot-spots" in practice.
In Figure 9, we plot the posterior mean of the random effect that is shared across response-types, along with pointwise 95% credible intervals (see Section 4.1). The time period t_i = 1 to t_i = 33 (corresponding to January 22, 2020 through February 23, 2020) was particularly crucial, since this time range saw the strongest direct effects between COVID-19 cases, the negative adjusted closing price, and the Google Trends interest score for the Google search "coronavirus." Furthermore, the fact that zero does not tend to fall within the credible intervals suggests that our incorporation of dependence across response-types, spatial regions, and days was reasonable. Time point t_i = 33 (corresponding to February 23, 2020) marks the time at which the adjusted closing price initially started to decrease (see Figures 1 and 6) and the Google Trends interest score increased. After t_i = 33 the random effect appears to be negative-valued, which suggests an indirect relationship among these responses.

Discussion
We introduce the HGT model, which is derived from a straightforward combination of the LCM and the GLMM. This combination is motivated as a means to aid other researchers in analyzing multiple response-type datasets such as the one considered in this article. In particular, our approach provides several contributions to Bayesian statistics. First, we have developed a general all-purpose Bayesian model to analyze multiple response-types (e.g., continuous, binomial counts, and Poisson counts). Our approach allows one to directly incorporate their preferred Bayesian model into an analysis of multiple response-type data without completely abandoning their approach to the implementation of that preferred model. Second, we developed a general Bayesian analog to the classical comparison between a saturated model and a reduced model. This allows classical residual analysis to be used for assessing goodness-of-fit in Bayesian models for multiple response-type data. Third, we develop an approach to forecasting by introducing new parameters into a model for validation data. Code and tutorials on how to adapt the HGT to your preferred model can be found at https://github.com/JonathanBradley28/CM.
We illustrate the HGT on a dataset of COVID-19 and social-distancing-related variables. COVID-19 is a global health disaster, and social distancing has become a necessary public health measure to protect the health of individuals. In particular, we investigate the relationship between COVID-19 cases, the US economy (specifically, the negative adjusted closing price of the DJI), and interest on Google (specifically, the Google Trends interest score for the search "coronavirus"). The data and model suggest that these three values had their strongest positive relationship during a majority of February 2020, which suggests that this was an important time period. Additionally, there are clear cross-dependencies among response-types, regions, and days. It is important to note that correlation does not imply causation; to make explicit causal conclusions, one needs to adopt methods from the causal inference literature (Rubin, 2005). Finally, the HGT model produces reasonable forecasts of the log frequency of cases, deaths, and recoveries from COVID-19. This suggests that, with finer-scale regional data, this model could potentially be useful for targeting future hot-spots of COVID-19.
In our simulations, we illustrated a non-linear functional analysis of multiple response-types using BART as the preferred model, as well as a joint spatial analysis of multiple response-types using a spatio-temporal mixed effects model as the preferred model. These results suggest that the prediction error of our approach is small (in terms of RMSE), and that multiple response-type versions of two different preferred models can be developed seamlessly. Computational benefits in the non-Gaussian setting were also illustrated. Additionally, the data-driven goodness-of-fit diagnostics led to the same conclusion as the RMSE criterion (based on the latent process), which is unobserved in practice.