Bayesian Tensor Response Regression with an Application to Brain Activation Studies

This article proposes a novel Bayesian implementation of regression with a multi-dimensional array (tensor) response on scalar covariates. The recent emergence of complex datasets in various disciplines presents a pressing need to devise regression models with a tensor valued response. This article considers one such application of detecting neuronal activation in fMRI experiments in the presence of tensor valued brain images and scalar predictors. The overarching goal in this application is to identify spatial regions (voxels) of a brain activated by an external stimulus. In such and related applications, we propose to regress responses from all cells (or voxels in brain activation studies) together as a tensor response on scalar predictors, accounting for the structural information inherent in the tensor response. To estimate model parameters with proper cell specific shrinkage, we propose a novel multiway stick breaking shrinkage prior distribution on tensor structured regression coefficients, enabling identification of cells which are related to the predictors. The major novelty of this article lies in the theoretical study of the contraction properties of the proposed shrinkage prior in the tensor response regression when the number of cells grows faster than the sample size. Specifically, estimates of tensor regression coefficients are shown to be asymptotically concentrated around the true sparse tensor in the L2-sense under mild assumptions. Various simulation studies and the analysis of a brain activation dataset empirically verify the desirable performance of the proposed model in terms of estimation and inference on cell-level parameters.


Introduction
Of late, neuroscience and related applications routinely encounter regression scenarios involving a multidimensional array or tensor structured response and scalar predictors. An important motivating example occurs in single-subject functional MRI (fMRI) studies to detect localized regions where neuronal activation takes place in the presence of external stimuli (e.g., during a task). During the course of an fMRI experiment, the three dimensional brain image space is divided into a large number of rectangular cells, also referred to as "voxels". A series of brain images of the blood-oxygen-level-dependent (BOLD) measure are acquired over multiple time points across all voxels while a subject performs multiple tasks, yielding three dimensional tensor responses over time points. One could also examine a slice of an fMRI scan, resulting in a two-dimensional tensor for each time point. The BOLD measure tensor response at each time point is presumed to be associated with the task related predictors, and it is of scientific interest to delineate the nature and regions of activation using a regression framework involving the tensor response and task related predictors. Similarly, in electroencephalography (EEG) studies voltage values are measured from numerous electrodes placed on the scalp over time. The resulting data form a two-dimensional matrix whose readings are both spatially and temporally correlated. These matrix responses are often regressed on a set of scalar predictors (e.g., whether or not a subject is alcoholic) to identify their variation with the predictors. All these applications involve a response tensor Y_t ∈ R^{p_1×···×p_D} and a vector of predictors x_t ∈ R^m at time t, with the objective of understanding which cells in Y_t are influenced by changes in x_t.
Although the tensor response regression framework is motivated by the aforementioned neuroimaging studies, the proposed methodology equally applies to a variety of scientific applications, including chemometrics (Bro, 2006), psychometrics (Kiers and Mechelen, 2001) and relational data (Gerard and Hoff, 2015), among others, where tensor valued responses are collected routinely.
Rather than analyzing voxels within a tensor response together, many neuroscientists use what is referred to as the General Linear Model (GLM), which is not to be confused with the generalized linear model. The GLM fits a regression model at each cell in the tensor response independently of the others and calculates the test statistic corresponding to each cell to identify whether the response is significantly associated with a predictor in that cell, accounting for multiple testing corrections (Penny et al., 2011; Friston et al., 1995; Genovese et al., 2002). The GLM is conceptually simple and computationally efficient, though many multiple testing corrections do not take spatial relationships into account. Corrections that do take the spatial relationships between voxels into account require tuning and make strong assumptions about spatial structures (Poline et al., 1997). Additionally, neuroimaging data are usually pre-processed using a kernel convolution based spatial smoothing approach. Using the GLM on presmoothed data may result in inaccurate estimation and testing of the covariate effects (Chumbley and Friston, 2009; Li et al., 2011). More principled approaches vectorize the tensor response to construct a multivariate vector response regression. Some notable structures employed to estimate parameters in the multivariate vector response regression include sparse regressions with various penalties incorporating correlated response variables (Similä and Tikka, 2007; Peng et al., 2010), reduced-rank regressions (Yuan et al., 2007; Chen et al., 2013) and sparse reduced-rank regressions (Chen and Huang, 2012). While these methods view the tensor response as a high dimensional vector without any spatial association among its cells, our goal is to retain the spatial information in the multidimensional tensor within the proposed model.
To this end, sophisticated approaches include adaptive multiscale smoothing methods and spatially varying coefficient (SVC) models. The former estimates parameters by building iteratively increasing neighborhoods around each cell and combining observations within the neighborhoods with weights (Li et al., 2011). The SVC models add spatial components in the cell by cell regression that account for the spatial correlations between cells (Zhang et al., 2014, 2015; Descombes et al., 1998; Zhu et al., 2014). These approaches introduce distinct parameters for different cell specific regressions and propose to model them jointly. For a tensor response of dimensions p_1 × · · · × p_D, where p_1, . . . , p_D are moderately large, such strategies lead to the joint modeling of at least ∏_{i=1}^{D} p_i parameters, which may turn out to be computationally challenging. There is a parallel literature that models spatial dependence among regression coefficients induced by Markov random fields (MRF) (Smith and Fahrmeir, 2007; Zhang et al., 2015). MRF models are generally efficient in computation, though our proposed approach seems to yield somewhat better inference in comparison with a specific Gaussian MRF model (see Section 3). Intuitively, introducing the tensor response in the regression without reshaping it preserves the neighborhood information of the cells in the response. A more sophisticated MRF model may improve the inference, with perhaps added computational cost.
Recently, Li and Zhang (2017) propose a novel approach of regressing the tensor variate response on scalar predictors, where the envelope technique by Cook et al. (2010) is employed to yield point estimates of the parameters. Subsequently, Sun and Li (2017) provide convergence rates of the frequentist penalized regression approaches with a tensor response and vector predictors. This approach proposes low rank decomposition of the tensor coefficient and introduces multiple constraints on the parameter space. While such constraints can be more seamlessly accommodated by frequentist optimization algorithms, they offer a steep challenge for Bayesian implementation. Additionally, frequentist optimization frameworks are dependent on tuning parameters (e.g., the envelope dimensions in Li and Zhang (2017)), with choices for these parameters being sensitive to the tensor dimensions and the signal-to-noise ratio (degree of sparsity).
In the same vein as Li and Zhang (2017), we propose a regression scenario with tensor response Y_t and predictors x_t, referred to as the tensor response regression (TRR). The coefficient corresponding to each predictor in the vector x_t is a tensor, and is assumed to possess a "low rank" canonical decomposition/parallel factorization decomposition (CANDECOMP/PARAFAC, or CP), which is defined in Section 2.1. The model is also designed to be generalizable to any value of D for possible application in other research areas. For the Bayesian implementation, we employ a novel multiway stick breaking shrinkage prior distribution to shrink the cells of the tensor coefficient corresponding to unimportant voxels close to zero while maintaining accurate estimation and uncertainty quantification for cell coefficients related to important voxels. Our framework is, to the best of our knowledge, the first Bayesian framework for regressing a tensor response on scalar predictors. Additionally, TRR retains the tensor structure of the response to implicitly preserve correlations between cells while substantially reducing the number of parameters using the CP decomposition to accrue computational benefits. The TRR framework with the multiway stick breaking prior gives rise to model-based shrinkage towards a "low rank" solution for the tensor coefficient, with a carefully constructed shrinkage prior that naturally induces sparsity within and across ranks for the tensor coefficient and results in identification of important cells in the tensor related to a predictor. In addition, there is a strong need for uncertainty quantification of parameter estimates, especially when the tensor dimension far exceeds the sample size or the signal-to-noise ratio is low, motivating the Bayesian TRR (BTRR) approach.
There is recent literature on regressing a scalar response on a tensor covariate (Zhou et al., 2013; Zhou and Li, 2014) that focuses on identifying voxels in the tensor which are related to the response. In contrast, we flip the roles and regress a tensor response on scalar predictors. Our approach differs from the existing frequentist and Bayesian tensor modeling approaches (Gerard and Hoff, 2015; Dunson and Xing, 2009) in that we offer a supervised tensor regression framework that accommodates scalar predictors.
One important contribution of this article is proving posterior consistency for the proposed BTRR model with the multiway stick breaking shrinkage prior. The theory of posterior contraction for high dimensional regression models has gained traction lately, though the literature is less developed for shrinkage priors than for point-mass priors. For example, Castillo et al. (2012) and Belitser and Nurushev (2015) have established posterior concentration and variable selection properties for certain point-mass priors in the many normal-means model. The latter article also establishes coverage of Bayesian credible sets. Results on posterior concentration and variable selection in high dimensional linear models are also established by Castillo et al. (2015) and Martin et al. (2017) for certain point-mass priors. In contrast, Armagan et al. (2013b) show posterior consistency in the linear regression model with shrinkage priors for low-dimensional settings where the number of covariates does not exceed the number of observations. Using direct calculations, Van Der Pas et al. (2014) show that the posterior based on the horseshoe prior concentrates at the optimal rate for the many normal-means problem. Song and Liang (2017) and Wei and Ghosal (2017) consider a general class of continuous shrinkage priors and obtain posterior contraction rates in ordinary high dimensional linear regression models and logistic regression models, respectively, depending on the concentration and tail properties of the density of the continuous shrinkage prior. In contrast, the study of posterior contraction properties for tensor regression models in the Bayesian paradigm has received far less attention. A recent article by Guhaniyogi (2017) is of interest in this regard. Developing theory for tensor response regression models faces two major challenges.
While high dimensional regression models directly impose a well investigated shrinkage prior on the predictor coefficients, BTRR imposes shrinkage priors on the margins of the CP decomposition of the tensor coefficients. As a result, the prior distribution on voxel level elements of the tensor coefficient is difficult to work with analytically. Additionally, in typical applications, the dimensions of the tensor coefficients are much larger than the sample size. Both of these present obstacles which we overcome in this work. We also emphasize that the posterior contraction of tensor regression in Guhaniyogi (2017) is shown for the Kullback-Leibler neighborhood. In contrast, Bayesian tensor response regression develops a much stronger result with an L2-neighborhood around the true tensor coefficient.
The remainder of the article proceeds as follows. Section 2 introduces the model and describes prior distributions on the parameters. Sections 3 and 4 demonstrate the performance of the proposed model through simulation studies and a brain activation data analysis, respectively. Section 5 concludes the paper. Posterior consistency results and their proofs, along with details of posterior computation including the full MCMC implementation of the model, can be found in the Supplementary Material (Guhaniyogi and Spencer, 2021).
A D-way tensor Γ ∈ ⊗_{j=1}^{D} R^{p_j} admits a rank-R parallel factorization (PARAFAC) decomposition (Kiers, 2000; Kolda and Bader, 2009) if Γ can be expressed as Γ = Σ_{r=1}^{R} γ_1^{(r)} ◦ · · · ◦ γ_D^{(r)}, where ◦ denotes the vector outer product and γ_j^{(r)} is a p_j dimensional column vector, for 1 ≤ j ≤ D and 1 ≤ r ≤ R. The terminology refers to these vectors as 'margins' of a particular rank. The PARAFAC decomposition is generally preferred in most modeling applications involving tensors, both in terms of interpretability (i.e., invariance to the order of summation) and from a computational tractability point of view (Kolda and Bader, 2009). Part of the utility of this decomposition in many sparse tensor scenarios is that it allows contiguous areas of the response to have little or no association with the covariate, which is reasonable in the context of the application in neuroscience (Li and Zhang, 2017). It can also be viewed as a higher dimensional analogue of the singular value decomposition of matrices.
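As a concrete illustration, the rank-R PARAFAC reconstruction of a 3-way tensor from its margins can be sketched in a few lines of numpy; all names and dimensions here are illustrative and not taken from the paper's implementation.

```python
import numpy as np

# Sketch: build a rank-R, 3-way tensor Gamma = sum_r g1^(r) o g2^(r) o g3^(r).
rng = np.random.default_rng(0)
R, p = 2, (4, 5, 6)                     # rank and tensor dimensions p1, p2, p3
margins = [[rng.normal(size=pj) for pj in p] for _ in range(R)]

# Sum of outer products of the rank-specific margins.
Gamma = sum(np.einsum('i,j,k->ijk', *margins[r]) for r in range(R))

# The full tensor has prod(p) = 120 cells, while the CP representation
# only uses R * sum(p) = 30 free margin entries.
print(Gamma.shape, int(np.prod(p)), R * sum(p))
```

Each cell of `Gamma` is a nonlinear (multilinear) function of the margin entries, which is exactly why prior shrinkage on the margins induces nontrivial dependence across cells.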

Model Framework
Assuming that both the response Y_t and the predictors x_t are centered around their respective means, the proposed tensor response regression model of Y_t on x_t is given by Y_t = Σ_{k=1}^{m} Γ_k x_{k,t} + E_t, where Γ_k ∈ R^{p_1×···×p_D}, k = 1, . . . , m, is the tensor coefficient corresponding to the predictor x_{k,t}. To account for the temporal correlation of the response tensor at each cell v, the error is assumed to follow a component-wise first-order autoregressive structure, E_{t,v} = κ_v E_{t−1,v} + ε_{t,v}, with ε_{t,v} independent mean-zero Gaussian noise. This parametrization allows varying temporal correlation in different voxels. However, the κ_v parameters are weakly identifiable, and introducing a large number of voxel specific κ_v parameters is perhaps not worth the extra burden, both in terms of computation and of estimation. In fact, in many fMRI data applications for brain imaging studies, one often transforms the voxel specific response vector and predictor matrices using the wavelet transform, and performs inference on the model parameters based on the transformed data. Wavelet transforms have the advantage of 'whitening' the data, i.e., reducing the temporally correlated errors to i.i.d. errors (Bullmore et al., 2004; Fadili and Bullmore, 2002; Meyer, 2003; Zhang et al., 2015). In light of this, we refrain from modeling E_t using a complex temporal correlation structure and simplify the temporal correlation of E_t by setting κ_v ≡ κ ∈ (−1, 1). This ensures both computational simplicity and stationarity in the AR(1) structure.
Voxel-by-voxel regression of Y_{t,v} on x_t requires introducing m regression parameters per voxel, hence a total of m ∏_{j=1}^{D} p_j parameters, resulting in an ultra-high dimensional modeling pursuit, and it fails to incorporate tensor structural information into the estimation procedure. This necessitates imposing a sufficiently expressive structure on Γ_k which simultaneously achieves a large dimensionality reduction. We propose a flexible rank-R PARAFAC decomposition of each Γ_k, i.e., Γ_k = Σ_{r=1}^{R} γ_{k,1}^{(r)} ◦ · · · ◦ γ_{k,D}^{(r)}. (2.2) A few remarks on (2.2) are in order. First, since we deal with modeling the linear predictor part of the model, our framework can be extended to a GLM setup. Second, the formulation also admits easy extensions to settings with a more complicated spatiotemporal correlation structure in E_t. Additionally, the PARAFAC decomposition reveals that the cell level parameters are nonlinear functions of the tensor margins γ_{k,j}^{(r)}. Careful choice of prior distributions on the tensor margins implicitly imposes correlations among voxels and facilitates identifying significantly nonzero cells in Γ_k.
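For intuition, the generative model above, a tensor response driven by scalar predictors with component-wise AR(1) errors and a common κ, can be sketched as follows; the sizes, κ, and noise scale are placeholder values for illustration only.

```python
import numpy as np

# Minimal simulation of Y_t = sum_k Gamma_k * x_{k,t} + E_t with AR(1) errors.
rng = np.random.default_rng(1)
T, m, p = 50, 2, (4, 4)                  # time points, predictors, 2-D tensor
Gamma = rng.normal(size=(m,) + p)        # one coefficient tensor per predictor
x = rng.normal(size=(T, m))              # scalar predictors over time
kappa, sigma = 0.3, 1.0                  # common AR(1) coefficient, noise sd

E = np.zeros((T,) + p)
E[0] = rng.normal(scale=sigma / np.sqrt(1 - kappa**2), size=p)  # stationary start
for t in range(1, T):
    E[t] = kappa * E[t - 1] + rng.normal(scale=sigma, size=p)

# Tensor response: linear combination of coefficient tensors plus AR(1) noise.
Y = np.einsum('tk,kij->tij', x, Gamma) + E
print(Y.shape)
```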
Imposing this additional rank-R PARAFAC structure on Γ_k remarkably reduces the total number of parameters in the model from m ∏_{j=1}^{D} p_j to R m Σ_{j=1}^{D} p_j. A critical question remains whether such a dimension reduced structure can identify geometric sub-regions in the tensor response which are related to the predictors. Additionally, we also intend to accurately estimate coefficients corresponding to these sub-regions of the tensor coefficient. The next section proposes a careful elicitation of the prior distribution on the tensor parameters to achieve our goal.

Multiway Stick Breaking Shrinkage Prior on Tensor Coefficients
Although the spike-slab prior for selective predictor inclusion (George and McCulloch, 1993; Clyde et al., 1996) possesses attractive theoretical properties, the intractability of exploring an exponentially large space of predictor inclusion, along with the belief that many regression coefficients may be small rather than exactly zero, has led to considerable growth in the appeal of continuous shrinkage priors. An impressive variety of Bayesian shrinkage priors for ordinary high dimensional regression of a scalar/vector response on high dimensional vector predictors has been proposed in recent times; see for example Hans (2009); Carvalho et al. (2010); Armagan et al. (2013a) and references therein. Shrinkage priors are based on the principle of artfully shrinking coefficients of unimportant predictors to zero, while maintaining proper estimation and uncertainty of the important predictor coefficients. Polson and Scott (2010) further show that most of the existing shrinkage priors can be expressed as scale mixtures of normal distributions with a global parameter common to all predictors and predictor specific local parameters. The global parameter imposes shrinkage globally while local parameters carefully balance shrinkage for large and small coefficients.
The literature on vector shrinkage priors provides an excellent starting point for studying multiway shrinkage priors on the tensor coefficient Γ_k, though the latter presents far greater challenges. Assuming that Γ_k admits a rank-R PARAFAC decomposition, proposing a prior on Γ_k is equivalent to specifying priors over the tensor margins γ_{j,k}^{(r)}. Given that every cell coefficient in Γ_k is a nonlinear function of the tensor margins, care should be taken while imposing prior shrinkage on them. To this end, Guhaniyogi et al. (2017) have characterized multiple restrictions on putting prior distributions on tensor valued parameters and have proposed the multiway Dirichlet generalized double Pareto (M-DGDP) shrinkage prior satisfying all the restrictions. However, in the context of BTRR, a straightforward application of the M-DGDP prior on Γ_k leads to inaccurate estimation due to less desirable tail behavior of the distribution of the Γ_{v,k} parameters. The aberrant tail behavior results from the exchangeability of the rank-specific variance parameters across ranks in the M-DGDP prior, which results in convergence issues for the voxel parameters within Γ_k when using Markov chain Monte Carlo methods to draw from their posterior distributions. In contrast, we propose a prior construction that prevents the rank-specific variance components from effectively switching values back and forth, leading to fast convergence of the voxel-specific parameters.
We propose a multiway stick breaking shrinkage prior on Γ_k to ensure desirable tail behavior for the tensor coefficient. More specifically, set τ_{r,k} = φ_{r,k} τ_k as the scaling specific to rank r = 1, . . . , R. To achieve effective shrinkage across ranks, we adopt a stick breaking construction for the rank-specific scale parameters φ_{r,k}, with φ_{r,k} = ξ_{r,k} ∏_{l=1}^{r−1} (1 − ξ_{l,k}), where ξ_{r,k} ~ Beta(1, α_k) for r < R and ξ_{R,k} = 1, and the global scale parameter is modeled as τ_k ~ Inverse Gamma(a_τ, b_τ). Additionally, the local scale parameters W_{jr,k} = diag(w_{jr,k,1}, . . . , w_{jr,k,p_j}) are employed to achieve margin level shrinkage in the following way: γ_{j,k}^{(r)} | W_{jr,k}, φ_{r,k}, τ_k ~ N(0, φ_{r,k} τ_k W_{jr,k}), with w_{jr,k,i} ~ Exp(λ_{jr,k}^2/2). The construction tacitly exploits the finite stick breaking construction for the local parameters φ_{r,k}. As α_k → 0, most of the φ_{r,k} will be close to zero, yielding an effectively sparse allocation across ranks. Therefore, careful learning of α_k leads to a sparse and parsimonious representation of the tensor. The parameter α_k is assigned a discrete uniform prior on a grid and learnt using a greedy Gibbs algorithm. Additionally, flexibility in estimating the tensor margins {γ_{j,k}^{(r)} : 1 ≤ j ≤ D, 1 ≤ r ≤ R} is accommodated by modeling heterogeneity within margins via the element-specific scaling in W_{jr,k}. A common rate parameter λ_{jr,k} encourages sharing of information between the margin elements. In fact, it is easy to see that γ_{j,k,i}^{(r)} | φ_{r,k}, τ_k follows the well known generalized double Pareto (GDP) (Armagan et al., 2013a) shrinkage prior distribution. Exploiting more efficient computational techniques, TRR with the multiway stick breaking shrinkage prior accurately estimates the posterior distribution of Γ_k for a relatively large number of cells compared to the ordinary spike and slab prior on cell coefficients.
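A minimal sketch of the finite stick breaking construction for the rank-specific scales φ_{r,k} is given below (dropping the predictor index k for brevity; the Beta(1, α) stick proportions are our reading of the construction).

```python
import numpy as np

# Finite stick breaking draw of rank-specific scales phi_1, ..., phi_R.
rng = np.random.default_rng(2)
R, alpha = 5, 0.5                          # rank and a stand-in for alpha_k
xi = rng.beta(1.0, alpha, size=R)          # stick proportions xi_{r}
xi[-1] = 1.0                               # last stick takes the remainder

# phi_r = xi_r * prod_{l < r} (1 - xi_l): what is left after r-1 breaks.
phi = np.array([xi[r] * np.prod(1.0 - xi[:r]) for r in range(R)])
print(phi.sum())                           # the weights sum to one
```

Small values of `alpha` push the Beta(1, alpha) draws toward 1, so the first few sticks absorb nearly all of the mass and the remaining φ_r are shrunk toward zero, which is the across-rank sparsity mechanism described above.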
An important aspect of these models is the selection of the model rank R. Since the rank-R PARAFAC decomposition is a low-rank decomposition of the tensor, we can think about selecting R as selecting the truncation level of a large dimensional model. As long as R is "large enough", the results should be robust to our choice (and further increasing R will lead to negligible changes in the results). Natural analogies are the truncation of a Dirichlet process mixture model, which results in an (approximate) finite mixture model (e.g., see Ishwaran and James (2001)), and the truncation of the stick-breaking construction of the Indian Buffet process (Teh et al., 2007). In practice, it is recommended that models of different ranks be compared with some criterion in order to decide which rank will be used to perform inference. Accordingly, in the real data application, we have fitted the model with a range of R values. The specific choice of R that corresponds to the lowest Deviance Information Criterion (DIC) (Gelman et al., 2014) is used.

Simulated Data Results
This section showcases parametric inference from Bayesian tensor response regression (BTRR) through various simulation studies. Since the major motivation for the model development is drawn from fMRI based brain activation studies, the simulations are performed on datasets which closely mimic real world fMRI data. Scalar predictors are simulated under a block experimental design. A single stimulus block is convolved with the canonical double-gamma haemodynamic response function. A more thorough discussion of how the covariate values are generated can be found below.
The block design consists of a single discrete epoch of activity and rest, with "activity" representing a period of stimulus presentation, and "rest" referring to a state of rest or baseline. The stimulus is assumed to take place at time t = 0 for a duration of one time step, with a stimulus value of 1. This ensures that results from simulated datasets with different choices of T (the length of the time series) can be compared, as even datasets with small values of T will have a covariate that exhibits a peak in the stimulus function.
For a specific value of T , the covariate is generated using the canonicalHRF function in the neuRosim package in R (Welvaert et al., 2011) in which the delay of response relative to onset is T ×0.12, the delay of undershoot is T ×0.5, the dispersion of response is set to 2, the dispersion of undershoot is set to 1, and the scale of undershoot is set to 0.5. This setup is used so that simulations can be performed under different values of T without affecting the number of stimulus blocks in the simulated data and without changing the relative pattern of the simulated covariate values. An example of the values taken by the covariate for T = 100 can be seen in Figure 1.
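The covariate generation can be mimicked outside of R; below is a hedged Python sketch of a double-gamma HRF convolved with a single unit stimulus at t = 0. The gamma shape/scale mapping is a generic SPM-style parameterization, not the exact internals of neuRosim's canonicalHRF, and the constants are illustrative.

```python
import numpy as np
from scipy.stats import gamma

# Generic double-gamma HRF evaluated on T time points, using the delays
# from the paper's setup (response delay T*0.12, undershoot delay T*0.5,
# dispersions 2 and 1, undershoot scale 0.5). The shape = delay/dispersion
# mapping is an assumption, not the neuRosim parameterization.
T = 100
t = np.arange(T, dtype=float)
delay_resp, delay_under = T * 0.12, T * 0.5
hrf = (gamma.pdf(t, delay_resp / 2.0, scale=2.0)
       - 0.5 * gamma.pdf(t, delay_under / 1.0, scale=1.0))

stimulus = np.zeros(T)
stimulus[0] = 1.0                          # single unit stimulus at t = 0
covariate = np.convolve(stimulus, hrf)[:T] # convolve and keep T time points
```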

Competitors
The model is fitted in each simulation scenario along with the classical General Linear Model (GLM), in which independent linear regressions with serially correlated errors are fitted using an AR(1) specification for each cell. Comparing the maximum likelihood estimate of each cell of Γ thus obtained with the estimate of Γ from BTRR will highlight the potential advantages of joint Bayesian modeling with tensor coefficients. In fitting the General Linear Model with AR(1) error structure for each cell, the cochrane.orcutt function in the orcutt package in R (Stefano et al., 2018) is used. It performs the iterative process necessary to estimate the values of Γ and κ. The estimates from the GLM are corrected by fixing the false discovery rate at 0.05 using the multiple testing correction proposed by Benjamini and Hochberg (1995). All coefficient estimates that are not deemed significantly different from zero are set equal to zero in any point estimates of the tensor coefficient.
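For reference, the Benjamini-Hochberg step-up correction applied to the cell-wise GLM p-values can be sketched as follows; this is a generic implementation, not the exact code used in the study.

```python
import numpy as np

# Benjamini-Hochberg step-up procedure at level q: reject the nulls for the
# k smallest p-values, where k is the largest index with p_(k) <= q*k/n.
def benjamini_hochberg(pvals, q=0.05):
    pvals = np.asarray(pvals, dtype=float)
    n = len(pvals)
    order = np.argsort(pvals)
    thresh = q * np.arange(1, n + 1) / n
    below = pvals[order] <= thresh
    keep = np.zeros(n, dtype=bool)
    if below.any():
        keep[order[: np.max(np.nonzero(below)[0]) + 1]] = True
    return keep                            # True = cell declared significant

pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.2, 0.7])
print(benjamini_hochberg(pvals))
```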
We also include another competitor that models spatial dependence among tensor cells through a Gaussian Markov Random Field (GMRF) (Zhang et al., 2015; Gössl et al., 2001; Quirós et al., 2010). More specifically, a GMRF prior is assigned to the vectorized elements of Γ as follows: Γ_v | {Γ_{v′} : v′ ≠ v} ~ N( (1/n_v) Σ_{v′∼v} Γ_{v′}, 1/(λ n_v) ), with λ ~ Gamma(a_λ, b_λ), where n_v is the number of neighbors of element v and v ∼ v′ denotes that elements v and v′ are neighbors (Zhang et al., 2015). In fitting the GMRF model, a_λ and b_λ are set to 1 and 0.001 respectively, to match the noninformative prior in the BTRR model.
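To make the GMRF competitor concrete, a single Gibbs-style conditional draw of one cell given its grid neighbors can be sketched as below; the intrinsic CAR form here is an assumption matching the usual Zhang et al. (2015)-style specification, and all names and values are illustrative.

```python
import numpy as np

# One conditional update of a GMRF cell on a small 2-D grid:
# Gamma_v | neighbors ~ N(mean of neighbors, 1/(lambda * n_v)).
rng = np.random.default_rng(3)
G = rng.normal(size=(4, 4))                # current vectorized-cell values on a grid
v = (1, 2)                                 # cell to update
nbrs = [(0, 2), (2, 2), (1, 1), (1, 3)]    # 4-neighborhood of v
lam = 1.0                                  # GMRF precision parameter lambda

cond_mean = np.mean([G[u] for u in nbrs])  # average of neighboring cells
cond_sd = 1.0 / np.sqrt(lam * len(nbrs))   # conditional standard deviation
G[v] = rng.normal(cond_mean, cond_sd)
```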
For the Bayesian models, the log-likelihood is examined in order to verify that the Markov chain converges. The models exhibit rapid convergence, so only 1,100 draws are taken from the joint posterior distribution in each model fitting, of which the first 100 draws are discarded as burn-in. Average effective sample sizes, shown in Figure 2 for the 1,000 post burn-in samples and calculated using the coda package in R, confirm sufficiently uncorrelated post burn-in samples.
Point Estimation of Γ

Due to the continuous nature of the multiway stick breaking prior on Γ, the posterior mean estimates of all cells in Γ from BTRR turn out to be nonzero. In general, shrinkage priors in high dimensional regression are not designed to perform variable selection, and hence a post processing step on posterior samples of parameters is required to identify important and unimportant variables. Following Li and Pati (2017), we employ a two-stage variable selection algorithm that identifies zero and nonzero cells of Γ from its post burn-in MCMC samples. For cells of Γ identified as zero by the algorithm, the final point estimates are taken to be zero. We keep the posterior mean as the point estimate for cells of Γ identified as nonzero. A comparison of the two stage estimates of Γ for different values of R and η when μ = 30 and T = 20 can be seen in Figure 3. We especially show figures for this case since it represents higher tensor dimensions and a smaller sample size. As the signal-to-noise ratio tends to be very low for fMRI data (Welvaert and Rosseel, 2013), the model is attractive in this application if it performs better in such settings. The simulation settings under higher signal-to-noise ratios are less realistic and are kept only to show the relative performance of all models under varying signal-to-noise ratios. Also, since the main goal is the identification of activated cells, it is not primarily the estimated cell coefficients, but the contrast between coefficients corresponding to "active" and "inactive" voxels that is more informative. However, to show the effect of the shrinkage priors in the proposed Bayesian tensor response regression and the Gaussian Markov Random Field models, the scales are shown in Figure 3. The scales are kept separate due to the differing results in terms of both the contrast-to-noise ratio (η) and the effects of the different model treatments on the shrinkage of the activation estimates.
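A drastically simplified stand-in for the post-hoc selection step is a one-dimensional 2-means split of the absolute posterior mean estimates into a "null" and a "signal" cluster; note that this is only a sketch in the spirit of, not a faithful implementation of, the sequential 2-means algorithm of Li and Pati (2017).

```python
import numpy as np

# Split |estimates| into two clusters with 1-D 2-means (Lloyd's algorithm);
# cells in the high-magnitude cluster are kept as nonzero signal.
def two_means_select(est, iters=50):
    a = np.abs(np.ravel(est)).astype(float)
    lo, hi = a.min(), a.max()              # initial cluster centers
    for _ in range(iters):
        assign = np.abs(a - lo) <= np.abs(a - hi)   # True = "null" cluster
        lo, hi = a[assign].mean(), a[~assign].mean()
    return ~assign                          # True = kept as nonzero signal

est = np.array([0.01, -0.02, 0.0, 1.1, 0.9, -0.03, 1.2])
print(two_means_select(est))
```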
The true and the estimated activation maps in Figure 3 demonstrate the desirable performance of BTRR in capturing the true activation pattern under a moderate contrast-to-noise ratio η. When the contrast-to-noise ratio drops below 1, separating signal from noise remains a challenging task, which causes less accurate identification of activated regions. Figure 3 shows the proposed model's strength in shrinking insignificant coefficient estimates toward zero. This is done using a strong shrinkage prior, which also draws down the values of the significant coefficients. Such an observation is not entirely uncommon even with shrinkage priors in ordinary ultra-high dimensional regressions; see Bhattacharya et al. (2016). Moreover, as mentioned before, in the interest of detecting regions of nonzero coefficients, the contrast between the zero and nonzero estimates is more informative. In this regard, we argue in the next few paragraphs that the proposed model outperforms the GMRF both in terms of the overall estimation of cell coefficients and in identifying truly activated cells. The coefficient estimates from our approach can be somewhat improved by more involved choices of prior distribution on α_k. However, the little gain in the estimation of cell coefficients is perhaps not worth the extra computational burden, since the inference on identifying activated cells does not change much with the more involved choices. Figure 4 shows the standardized mean square error for estimates of Γ, computed as ||Γ̂ − Γ^(0)||² / (#voxels in Γ^(0) × var(Γ^(0))), where var(Γ^(0)) = ||Γ^(0) − mean(Γ^(0))||² / #voxels in Γ^(0). In each scenario, the BTRR model with ranks (R) 1, 2, and 3 is tested, and further testing suggests that additional ranks do not improve the performance.
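Under one plausible reading of the standardized mean square error described above, it can be computed as follows; all names are illustrative.

```python
import numpy as np

# Standardized MSE of an estimate Gamma_hat against the truth Gamma0:
# ||Gamma_hat - Gamma0||^2 / (#voxels * var(Gamma0)), with
# var(Gamma0) = ||Gamma0 - mean(Gamma0)||^2 / #voxels.
def smse(Gamma_hat, Gamma0):
    n = Gamma0.size                        # number of voxels in Gamma0
    var0 = np.sum((Gamma0 - Gamma0.mean()) ** 2) / n
    return np.sum((Gamma_hat - Gamma0) ** 2) / (n * var0)

Gamma0 = np.array([[0.0, 2.0], [0.0, 2.0]])
print(smse(Gamma0, Gamma0))                # a perfect estimate gives 0
```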
For the Bayesian models, that is, the proposed Bayesian tensor response regression (BTRR) and the Gaussian Markov random field (GMRF) models, the sequential 2-means post-hoc variable selection method (Li and Pati, 2017) is used to produce the final point estimate. The General Linear Model (GLM) uses the Benjamini-Hochberg multiple testing correction (Benjamini and Hochberg, 1995) to produce the final point estimate. In a real data application, the final rank used to fit a BTRR model can be selected using the Deviance Information Criterion (DIC) (Gelman et al., 2014). The model fitted with ranks 1, 2, and 3 is compared to the General Linear Model and the GMRF model described above. It is evident that the proposed model performs significantly better than all its competitors in terms of point estimation under low signal-to-noise ratios (see Figure 4), which justifies its usefulness for real fMRI data. In fact, the standardized MSE values in Figure 4 show that the proposed method also performs better than its competitors under high signal-to-noise scenarios, except for cases where the true simulated tensor has a small number of cells with nonzero magnitude (first two rows of Figure 4) and T is large. In fact, the only two specific scenarios under a high signal-to-noise ratio where the GLM does significantly better than BTRR can be found in the last two columns of the first row of Figure 4. Both these settings correspond to a large value of T and smaller tensor dimensions, which are favorable to asymptotic settings, and thus the carefully constructed shrinkage mechanism no longer offers an advantage over the naive GLM. In fact, we have included these scenarios to provide an idea of the settings where the model does not necessarily have an edge over its competitors.
Digging a bit deeper, Figure 5 presents the False Positive Rates (FPR) for identifying nonzero cells in Γ from competing methods. Although the FPR values are generally low for all competing methods under all scenarios, BTRR tends to offer lower FPRs than the GMRF under scenarios with low signal-to-noise ratios. The regularization imposed by the shrinkage prior helps in shrinking unimportant cell coefficients, which appears to yield advantages over the GMRF, especially in the presence of highly noisy data.

Parametric Uncertainty of Γ

To assess uncertainty quantification of Γ from BTRR, we focus on the coverage and length of 95% credible intervals (CI) of the cells of Γ, shown in Figures 6 and 7, respectively. Given that the coverage of the 95% credible intervals is close to nominal in almost all the scenarios, attention turns to the length of the 95% credible intervals. Two visible patterns emerge from the figures. First, the 95% credible intervals shrink as T increases, since the posterior variance decreases with increased sample size. Second, the credible intervals are wider for a higher contrast-to-noise ratio, which can be attributed to the fact that estimating a few high signals among many zero coefficients involves more uncertainty. Finally, there is a drop in coverage for the 95% posterior credible intervals as the contrast-to-noise ratio increases, especially for small tensor dimensions. This is because the shrinkage priors, in their attempt to estimate cell coefficients with zero effects, slightly underestimate the nonzero cell coefficients. When nonzero cell coefficients are higher in magnitude, this shrinkage effect becomes more prominent, which explains the drop in coverage. It is also important to note that a rank-1 decomposition of the tensor coefficient is structurally more restrictive than rank-2 or rank-3 decompositions, and hence the drop in coverage is more severe for the rank-1 decomposition.

Simulated Data with Model Mis-specification
To assess the performance of the BTRR model along with the GMRF competitor under model mis-specification, a simulated dataset is created in which the nonzero-valued components within Γ are neither sparse nor spatially contiguous. This presents a problem for the model, as the spatial correlation between elements in the tensor coefficient is then essentially meaningless, and the shrinkage imposed by the model is an incorrect treatment of such data. In this example, the values within Γ are simulated independently from a Bernoulli(0.9) distribution, and the response in each cell is simulated independently from a linear regression model with errors auto-correlated in an AR(1) structure. The hyper-parameters used in the MCMC are the same as those used in the other simulated data analyses. Figure 8 shows the final estimates of Γ from the GMRF model and the BTRR models with ranks 1, 2, and 3. The standardized mean squared errors (sMSE) for the BTRR models with ranks 1, 2, and 3 and the GMRF model are 1.30, 0.93, 0.92, and 1.02, respectively. This illustrates that, while the BTRR models do not estimate the dense, random coefficient tensor well, their estimates are close to those of the mean model, which has an sMSE of 1.

Posterior Estimates of Autoregression Parameter κ

Accurately estimating the posterior distribution of the autoregression parameter κ is essential, since it captures the temporal correlation among tensor responses. We present posterior densities of κ for BTRR with ranks 1, 2, and 3 for a representative simulation scenario with T = 100, μ = 30, and η = 1 in Figure 9. The posterior mean of κ is estimated close to the truth with a narrow 95% credible interval. The conclusion remains true in all other simulation scenarios. The simulation study conclusively establishes the strength of BTRR as a principled Bayesian approach that accurately detects brain activation with proper characterization of uncertainties.
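The mis-specified data-generating mechanism described above (dense Bernoulli(0.9) coefficients with cell-wise AR(1) errors) can be sketched as follows; the tensor dimensions, predictor distribution, and parameter values are illustrative assumptions, not the exact settings of the study:

```python
import numpy as np

def simulate_misspecified(dims=(10, 10, 10), T=100, kappa=0.5, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # dense, non-sparse coefficient tensor: independent Bernoulli(0.9) entries
    gamma = rng.binomial(1, 0.9, size=dims).astype(float)
    x = rng.normal(size=T)  # scalar predictor over time (assumed distribution)
    eps = np.empty(dims + (T,))
    eps[..., 0] = rng.normal(scale=sigma, size=dims)
    for t in range(1, T):
        # AR(1) temporal autocorrelation within each cell
        eps[..., t] = kappa * eps[..., t - 1] + rng.normal(scale=sigma, size=dims)
    y = gamma[..., None] * x + eps  # independent cell-wise linear model
    return y, x, gamma
```

Fitting BTRR or the GMRF to data generated this way probes how each model behaves when its sparsity and spatial-similarity assumptions fail.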
It is particularly appealing to observe BTRR decisively outperforming the GLM estimates at the smaller contrast-to-noise ratios reminiscent of real fMRI data. An application of the model to a brain activation dataset is explored in the following section.

Application to Balloon Analog Risk Taking Data
Neuroscientists at the University of California Los Angeles have conducted an experiment designed to identify the regions of the brain involved in evaluating risk (Schonberg et al., 2012). Sixteen young adults (average age of 23.56 years) are subjects in an experiment with the following design. Each subject enters an fMRI machine with a computer display and a controller with two buttons. On the screen, the image of a balloon is shown, along with a payout amount, starting at a value of $0.25. The buttons on the controller allow the subject to either inflate the balloon or take the payout. If the subject inflates the balloon and the balloon does not explode, the payout amount increases by $0.25. If the subject inflates the balloon and the balloon explodes, no payout is received, the payout value is reset to $0.25, and a new balloon is displayed. Each balloon is assigned a number of pumps at which it will explode, drawn from a discrete uniform distribution with a lower bound of 1 and an upper bound of 8, 12, or 16, depending on whether the balloon is red, green, or blue, respectively. A grey "control" balloon, offering no payout and an upper bound of 12 pumps before exploding, is also part of the trial to record a riskless scenario. Each subject participates in three runs. Each run consists of either 10 minutes or 48 balloons exploding, whichever comes first. Note that in our real data application, we only work with data from a single run for one randomly chosen subject, since the proposed BTRR method is designed for single-subject scenarios.
Preprocessing was done using FSL (Smith et al., 2004), following the procedure of Schonberg et al. (2012) as closely as possible. The fMRI scans have a repetition time (TR) of 2 seconds. In order to allow for T1 equilibrium effects, the first two images in the series were dropped. The echo planar imaging (EPI) scan was motion corrected, then high-pass filtered using a Gaussian least-squares linear fit with sigma = 50.0 seconds. While it is common for researchers to prewhiten their data in order to remove temporal autocorrelation, we followed the preprocessing steps of Schonberg et al. (2012), which did not include prewhitening. Thus, we assume that the temporal autocorrelation does not vary significantly within the regions of interest in the study data. Brain extraction was done using the BET function in FSL. The anatomic (T1-weighted) scan was registered using an affine transformation to standard Montreal Neurological Institute (MNI) space, and the EPI scans were then registered to each subject's corresponding anatomic scan. Finally, the data were spatially smoothed using a Gaussian kernel with a 5 mm full-width half-maximum (FWHM). To fit BTRR in a computationally efficient manner on a full 3D fMRI image response, the EPI scans were downsampled to voxels with volume 8 mm³. This analysis with low-resolution 3D fMRI images is used to find areas of increased activity within the entire brain, in order to choose slices within the brain that can be analyzed at a higher resolution with voxels of volume 2 mm³. In both cases, to facilitate parallelization of computational tasks and a flexible choice of rank R for different brain structures, the data were separated into one of 9 regions of interest based on the MNI structural atlas provided within FSL (Collins et al., 1995; Mazziotta et al., 2001). The MNI structural atlas is a hand-segmented atlas developed by Mazziotta et al. (2001) and Collins et al. (1995) and distributed within the FSL library of neuroimaging tools (Jenkinson et al., 2012).
As regions of interest are not hypercubic in nature, the smallest cuboid containing a region of interest is taken as the response tensor. Parts of the cuboid that are not part of the region of interest are assigned the value 0, and masking was performed on the coefficients of all of the models to avoid finding activations outside the brain. There is an established literature on brain imaging that conceptualizes brain images or brain image cross-sections as 3- or 2-dimensional tensors, respectively, after similar adjustments (please see Zhou et al., 2013 and Zhang, 2017). Table 1 records the Regions of Interest (ROI) with the number of voxels in both the whole-brain low-resolution analysis and one of the single-slice high-resolution analyses.
The analysis of the whole-brain high-resolution scan for an individual is the ideal goal of these methods. However, due to the size of the data and the availability of computer memory, the regions of interest were separated in order to draw from the posterior distribution via Markov chain Monte Carlo simulations. As an added benefit, the parallelization keeps the computation time reasonable. A follow-up project has recently been published that extends the BTRR model to a tensor response mixed effect model; the extended model incorporates region-specific random effects and infers jointly on voxel-level activation and region-level connectivity from multi-subject fMRI data (see Spencer et al., 2020, for details).
To measure the level of risk to a subject at a given time, we slightly modify the procedure used in Schonberg et al. (2012). First, we measure the centered number of pumps that an individual gives a "treatment" balloon before they "cash out" or the balloon explodes. It is assumed that the higher the number of pumps, the higher the risk perceived by the subject. This value is then convolved with the double-gamma haemodynamic response function, which accounts for the physiological lag between stimulus and response and smooths the stepwise function for the centered number of pumps. An illustration of this calculation can be seen in Figure 10. Finally, the centered, convolved number of pumps on the control balloon is subtracted from the treatment series to provide a basis for comparison.
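The convolution of the centered pump counts with a double-gamma haemodynamic response function can be sketched as follows; the HRF parameterization shown is the common canonical default, assumed here for illustration and not necessarily the exact one used in the analysis:

```python
import numpy as np
from math import gamma as gamma_fn

def double_gamma_hrf(t, a1=6.0, a2=16.0, b1=1.0, b2=1.0, c=1/6):
    # canonical double-gamma HRF: a gamma-density peak minus a scaled undershoot
    peak = t ** (a1 - 1) * np.exp(-t / b1) / (b1 ** a1 * gamma_fn(a1))
    under = t ** (a2 - 1) * np.exp(-t / b2) / (b2 ** a2 * gamma_fn(a2))
    return peak - c * under

def convolve_pumps(pumps, tr=2.0, hrf_len=32.0):
    # center the stepwise pump-count series, convolve with the HRF sampled
    # at the repetition time, and truncate to the original series length
    t = np.arange(0.0, hrf_len, tr)
    x = np.asarray(pumps, dtype=float)
    x = x - x.mean()
    return np.convolve(x, double_gamma_hrf(t))[: len(x)]
```

The same transformation would be applied to the control-balloon series before subtracting it from the treatment series.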
It is important to mention that the analysis performed by Schonberg et al. (2012) has some significant differences from the analysis shown here. To begin, their study was done on 16 subjects and across three runs for each subject, with analyses performed separately for each subject and run. The parameters were then averaged across runs, and an analysis of variance (ANOVA) was performed in order to assess variance in activation within and between subjects. As our proposed model is intended for a single subject, the first run from a randomly selected subject was used. The predictors for the treatment and control balloons were kept separate in Schonberg et al. (2012), and the activation found from the control balloons was subtracted from the activation found with the treatment balloons. We effectively combine these steps by subtracting the predictors, which produces similar results, as the treatment and control balloons do not overlap temporally within the experiment. Z-score thresholding was used in Schonberg et al. (2012) in combination with a correction by Poline et al. (1997) that combines peak intensity testing corrections with active cluster size corrections. However, these corrections rely on the implicit assumption that the activation parameters follow a lattice approximation to a Gaussian field, which does not assume sparsity and spatial similarity in the same way as the proposed model. In addition, the parameters used to perform the testing corrections are not shared in Schonberg et al. (2012), so those results cannot be replicated. Schonberg et al. (2012) also used additional regressors in modeling, including a regressor for the average number of pumps in each balloon trial for both cash-out and explosion outcomes.
These additional regressors were not used in this analysis, in order to focus on an application of the proposed model in a more accessible setting without additional variables requiring different treatments, such as slice timing and other nuisance variables. As a result, this analysis does not account for average BOLD responses under different trial conditions, such as different outcomes (cash-out or explosion) and stimuli (treatment or control balloons). However, even though only one subject is analyzed, inference from BTRR generally coincides with the conclusions of Schonberg et al. (2012) and the General Linear Model concerning the bilateral insula activation patterns, as presented in the upcoming paragraphs.
As discussed earlier, in order to perform inference on high-resolution neuroimaging data, low-resolution inference is performed first on the entire three-dimensional volume of the regions of interest. Once general areas of activation are found in the three-dimensional regions of interest, slices of the subject's scan can be chosen for high-resolution inference. To illustrate this process, we first make 11,000 draws from the posterior distribution using the low-resolution data, discarding the first 1,000 draws as burn-in. Point estimates for one of the axial slices of the low-resolution data, after applying the sequential 2-means variable selection method (Li and Pati, 2017) and merging the regions of interest into a single image, can be seen in Figure 11. This slice of the three-dimensional image was chosen for display because it shows information on an axial slice through the middle of the brain. Table 1 shows the tensor dimensions in each region, the number of ROI voxels within each response tensor, and the rank chosen for each region using the Deviance Information Criterion (Gelman et al., 2014) for the low-resolution analysis. For reference, a figure showing the locations of each of these regions of interest within the Montreal Neurological Institute atlas is included in the Supplementary Material. We then run the analysis following the same procedure on the high-resolution data for three slices that fall within the area shown in the downsampled slice in Figure 11. These high-resolution 2D slices correspond to z = (44, 45, 46). The corresponding tensor dimensions, ROI voxel counts, and DIC-selected ranks for the high-resolution analysis of these slices are also shown in Table 1.
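The DIC used for rank selection can be computed from MCMC output along the following lines; this is the standard deviance-based formula, with the log-likelihood evaluations assumed to come from the fitted BTRR model rather than shown here:

```python
import numpy as np

def dic(loglik_draws, loglik_at_posterior_mean):
    """Deviance Information Criterion from MCMC output.

    loglik_draws: log-likelihood of the data at each posterior draw.
    loglik_at_posterior_mean: log-likelihood at the posterior mean parameters.
    """
    mean_dev = -2.0 * np.mean(loglik_draws)        # posterior mean deviance
    dev_at_mean = -2.0 * loglik_at_posterior_mean  # deviance at posterior mean
    p_d = mean_dev - dev_at_mean                   # effective number of parameters
    return dev_at_mean + 2.0 * p_d                 # equivalently mean_dev + p_d
```

For each region of interest, the rank R yielding the smallest DIC would be retained.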
The point estimates for the high-resolution single-slice analyses with tensor ranks 1, 2, and 3 can be seen in Figure 12. The analysis shows no significant difference in activation across these three slices; hence, we provide the final coefficient estimate from the high-resolution analysis, allowing different ranks to be used for different regions of interest based on the DIC, only for slice z = 45. This estimate can be seen in Figure 13. For ease of visual comparison, all of the plots within Figures 11, 12, and 13 were color-thresholded at the approximate 90th percentile value for nonzero coefficients across the four models being compared. The active coefficients identified in the single-slice analyses were between 0 and 0.37 for the Bayesian models and between −0.4 and 0.73 for the GLM. The active coefficients identified in the whole-brain analyses were between 0 and 0.59 for the Bayesian models and between −0.23 and 0 for the GLM.
The same general linear model maximum likelihood estimate described in Section 3 is included for comparison. In the whole-brain analysis, the Bayesian estimates for the tensor coefficient are coincident with each other, with higher rank models producing larger coefficient estimates. The BTRR model estimates suggest that there is activation in the subject's frontal and temporal lobes. The general linear model estimates are somewhat larger in amplitude in the high-resolution slice analysis than the Bayesian sparse tensor response regression estimates, though usually over smaller spatial regions. This is likely due to the shrinkage prior favoring smaller tensor coefficient values in the BTRR models. However, the final point estimates from the general linear model and the BTRR models show generally coincident voxel-level activations, though the areas of activation are larger in the BTRR inference. In addition, the activation patterns across the adjacent slices within the high-resolution analysis are largely coincident with each other and with the low-resolution whole-brain analysis.

Figure 11: A single slice of the tensor estimate for the whole brain overlaid on top of the reference image.
Figure 12: The estimates for the tensor coefficient in high-resolution single-slice analyses for slices z = (44, 45, 46) overlaid on reference images of the brain.
Figure 13: The final estimate of the effect of increased perceived risk on BOLD measure response in different regions of the brain after selecting R for each region using the DIC.

Conclusion
This article proposes a Bayesian framework to regress a tensor-valued response on scalar covariates. Adopting the rank-R PARAFAC decomposition for the tensor coefficient, the proposed model reduces the number of free parameters. We employ a novel multiway stick breaking shrinkage prior distribution on the tensor coefficient in order to identify significantly nonzero cell coefficients. New results on posterior consistency have been developed to show convergence, in the L2 sense, of the fitted tensor coefficient to the true tensor coefficient as the data size increases.
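To illustrate the parsimony of the rank-R PARAFAC representation, the coefficient tensor is determined by R(p1 + ··· + pD) factor entries rather than p1···pD free cells; a minimal reconstruction sketch, with illustrative dimensions and hypothetical variable names:

```python
import numpy as np

def parafac_reconstruct(factors):
    """Rebuild Gamma = sum over r of the outer product of the r-th
    columns of the factor matrices; factors[d] has shape (p_d, R)."""
    R = factors[0].shape[1]
    gamma = np.zeros(tuple(f.shape[0] for f in factors))
    for r in range(R):
        comp = factors[0][:, r]
        for f in factors[1:]:
            comp = np.multiply.outer(comp, f[:, r])  # outer product across modes
        gamma += comp
    return gamma
```

For example, a 64 × 64 × 64 coefficient tensor has 262,144 cells, while a rank-3 PARAFAC representation uses only 3 × (64 + 64 + 64) = 576 parameters.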
As an illustrative example, the present article focuses on the analysis of single-subject fMRI data to detect voxels of the brain which exhibit neuronal activity in response to stimuli. The whole brain is analyzed first using low-resolution data from the three-dimensional brain volume, and then three slices of high-resolution data from the brain volume are analyzed to infer areas of activation more specifically. Analysis of simulated time series and real fMRI data demonstrates excellent performance of BTRR in identifying the regions of activation with the required uncertainties. Additionally, BTRR achieves remarkable parsimony, even as a Bayesian model, which facilitates its use on images with a fine resolution. The use of MCMC in the analysis does pose a computational challenge, which necessitates the separation of the regions of interest in order to make the analysis parallelizable.
The core idea of the article is to recognize the importance of retaining the tensor structure of the image response throughout the entire statistical analysis for studies including brain activation. An immediate extension of the proposed model, investigating both voxel-level activation and ROI-level connectivity from multi-subject fMRI data, has recently been published (Spencer et al., 2020). It also needs to be emphasized that the proposed tensor regression framework is developed along the principles of the high dimensional variable selection/shrinkage prior literature, where a new multiway stick breaking shrinkage prior is developed to shrink unimportant cell coefficients close to zero. Introducing tensor variables in the analysis intuitively makes use of the neighborhood information in the cells, though an explicit spatial modeling of the elements in the tensor cells has not been considered here. Developing shrinkage priors for high dimensional parameters while accounting for their spatial correlations is recognized to be an extremely challenging problem, and hence is much less addressed. We recognize the steep challenge involved in extending our approach to account for spatial correlations of cell parameters and plan to tackle the problem in future work.