Additive multivariate Gaussian processes for joint species distribution modeling with heterogeneous data

Species distribution models (SDM) are a key tool in ecology, conservation and management of natural resources. Two key components of the state-of-the-art SDMs are the description for species distribution response along environmental covariates and the spatial random effect. Joint species distribution models (JSDMs) additionally include interspecific correlations which have been shown to improve their descriptive and predictive performance compared to single species models. Current JSDMs are restricted to hierarchical generalized linear modeling framework. These parametric models have trouble in explaining changes in abundance due, e.g., highly non-linear physical tolerance limits which is particularly important when predicting species distribution in new areas or under scenarios of environmental change. On the other hand, semi-parametric response functions have been shown to improve the predictive performance of SDMs in these tasks in single species models. Here, we propose JSDMs where the responses to environmental covariates are modeled with additive multivariate Gaussian processes coded as linear models of coregionalization. These allow inference for wide range of functional forms and interspecific correlations between the responses. We propose also an efficient approach for inference with Laplace approximation and parameterization of the interspecific covariance matrices on the euclidean space. We demonstrate the benefits of our model with two small scale examples and one real world case study. We use cross-validation to compare the proposed model to analogous semi-parametric single species models and parametric single and joint species models in interpolation and extrapolation tasks. The proposed model outperforms the alternative models in all cases. We also show that the proposed model can be seen as an extension of the current state-of-the-art JSDMs to semi-parametric models.


Introduction
Species distribution models (SDMs) are key tools in the ecologists' toolbox. They have been widely used, among other applications, to study species habitat preferences , to improve identification and management of conservation areas and natural resources (Kallasvuo et al., 2017) and to evaluate species responses to environmental filtering under climate change scenarios (Clark et al., 2014;Kotta et al., 2019). The main goals of statistical inference in these contexts are to use species observations and information on the associated environment to infer the relationship between these two attributes and to predict over regions of unsampled locations to build thematic species distribution maps Elith and Leathwich, 2009).
Species distribution modeling is, thus, directly related to inferring species' responses to environmental factors . This is traditionally done using generalized linear or additive models as well as an array of machine learning approaches such as regression trees or maximum entropy modeling (Elith and Leathwich, 2009). These approaches model each species separately and cannot account for species interactions nor shared responses to the environment. However, species interaction with other species is potentially as important factor as its response to environment. Moreover, in many practical situations, data from species can be patchy or scarce in which case sharing information between species can significantly improve models' predictive performance (Ovaskainen and Soininen, 2011;Clark et al., 2014;Hui et al., 2013;Thorson et al., 2015;Hartmann et al., 2017). For these reasons, joint species distribution models (JSDM) have gained increasing attention in recent years (Warton et al., 2015). Dunstan et al. (2013) and Hui et al. (2013) introduced species archetype modeling where species' responses to the environment are clustered into few archetype models. Pollock et al. (2014) proposed to use the multivariate probit regression model (Chib and Greenberg, 1998) to describe geographical co-occurrence patterns between frogs and eucalyptus trees in Australia and Clark et al. (2014) built a JSDM to infer richness and loss of species under climate change scenarios. Thorson et al. (2015) introduced a spatial latent factor model to predict spatial distribution of breeding birds and rock fish communities and recently, Ovaskainen et al. (2017) introduced the hierarchical modeling of species communities (HMSC) framework which includes detailed description of interspecific correlations in covariate responses and spatial random effects.
Current JSDMs rely on the classical framework of hierarchical generalized linear models (GLMs). Even though this approach allows flexible modeling by describing the randomness of response variables with different probabilistic models (Nelder and Wedderburn, 1972), it is still limited by its parametric assumptions. Hence, it may fail to accurately describe a species' response to environmental conditions (Vanhatalo et al., 2012;Kotta et al., 2019). Here, we propose a semi-parametric JSDM model represented with multivariate Gaussian processes (GPs). GPs are flexible semi-parametric regression models where the regression function is estimated without restrictive parametric assumptions about its form (O'Hagan, 1978;. Our model integrates the main strengths of semi-parametric single species models (Vanhatalo et al., 2012;Golding and Purse, 2016) and generalized linear model based JSDMs. First, we model the species responses to environmental covariates with additive multivariate GPs. This allows us to capture wide range of nonlinear responses and share information about these responses between species. Second, due to the hierarchical model structure, the model can simultaneously accommodate several kinds of outcome variables (observations/measurements) by assigning different types of probabilistic models for them. This is important, since it allows us to exploit different types of measurements which commonly arise in real-case scenarios of multiple species surveys. Our presentation focuses in the mostly used probabilistic models in practice, the Bernoulli (Binomial) and Poisson with over-dispersion (Negative-Binomial). We also propose an efficient computational approach build around Laplace method (Golding and Purse, 2016;Vanhatalo et al., 2010). Lastly, we present a structured cross-validation scheme to validate and compare models' performance in different kinds of prediction tasks. This paper is organized as follows. In Section 2 we introduce a motivating case study. In section 3, we introduce the additive multivariate GPs and in Section 4 we discuss its predictive properties and introduce the computational methods for inference. In Section 5, we illustrate the basic properties of the model through two simple examples and introduce the specific case study model. In Section 6 we present the case study results and we end by discussion and conclusions in Section 7.
2 Motivating case study: coastal species distributions in the Gulf of Bothnia As a motivating example, we study spring distribution of four fish species and three types of algae or macro-vegetation on the coastal region of the Gulf of Bothnia in the northern Baltic Sea. The Gulf of Bothnia is a brackish water basin between Sweden and Finland covering an area of approximately 600 × 120 km. Its coastal areas play a central role in the ecosystem and many Baltic fish stocks are dependent on the coastal regions for their reproduction (Veneranta et al., 2013;Kallasvuo et al., 2017). Coastal zones are also the most sensitive regions of the Baltic sea to both natural variation and anthropogenic pressures (Reusch et al., 2018). Hence, these areas are of central importance for conservation and there is need for detailed knowledge on the distribution of coastal species and predictions concerning their response to environmental changes.

Case study species and data
The studied fish species are the juvenile or adult three-spined stickleback (Gasterosteus aculeatus) and nine-spined stickleback (Pungitius pungitius) as well as larvae of whitefish (Coregonus lavaretus) and vendace (Coregonus albula). Both whitefish and vendace are commercially important species and the sticklebacks are one key fish fauna in the Gulf of Bothnia ecosystem . We treat the studied vegetation and algae in functional group level comprising of diatomous algae, filamentous and macrovegetation. These are the dominating groups of benthic vegetation in shores where larvae of Coregonids (whitefish and vendace) occur at early developmental stages. In the scale Figure 1: The study region, Gulf of Bothnia, and locations for species data. The region is divided into five subregions (I-V) to be used in cross-validation. The environmental conditions (described by the environmental covariates) are heterogeneous between these regions (Veneranta et al., 2013) which allows to test models' extrapolation performance.
of Gulf of Bothnia, their occurrence also reflects the salinity, nutrient balance and wind exposure of studied area (Veneranta et al., 2013).
The case study species reflect the large scale environmental gradients and the changes in the Baltic Sea environment in last decades. Sticklebacks in the Baltic Sea use the shallow coastal zone for reproduction  and high abundances of sticklebacks in the Baltic Sea have been positively correlated with high occurrence of filamentous algae (Eriksson et al., 2009(Eriksson et al., , 2012. Coregonids in coastal areas prefer more oligotrophic waters (Veneranta et al., 2013). Whitefish and vendace reproduce in stony reefs and islets of Baltic Sea in late autumn and the larvae hatch at ice break-up in spring. In sheltered areas the overwintering reeds (Phragmites australis) dominate the macro-vegetation in spring. Diatomous algae consist of several epiphytic species that have an early spring bloom at ice break up and settle rapidly over bottom when light and water temperature increase (Busse and Snoeijs, 2002). Filamentous algae in this study consist mostly of Pilayella sp. It is a fast growing annual species that dominates the exposed shores at Baltic Sea in early spring (Rönnberg and Bonsdorff, 2004).
The whole study area was divided into sampling subareas (Figure 1). Within each sub-area, data were collected at several sampling sites distributed so that they covered the range of most important environmental covariates in that sub-area. Sampling was  Figure 1). Fish samples comprise the number of caught fish together with information on the sampling effort at each site. The effort was measured as the volume of sampled water (m 3 ). A more detailed description of the data collection is provided by Veneranta et al. (2013) For plant data, the bottom was photographed at 5-13 locations at depth of 30 cm within a distance of 2 m and parallel to shore line. The occurrence of a species was recorded at 16 uniformly distributed points in all these photographs. The case study plants can grow over each others so that presence of one plant at a spatial location does not exclude the presence of other plants. The whitefish and vendace data were previously analyzed by Veneranta et al. (2013) and Vanhatalo et al. (2012) but the data on other species are unpublished.

Environmental covariates
Gulf of Bothnia hosts rich variety of environmental conditions. Coastal areas are affected by inflows from land as well as shallow and complex topography. In the scale of Gulf of Bothnia, there is a gradient in river influence, salinity, temperature and length of ice cover period from north to south. We used seven real-valued and one categorical abiotic environmental covariates that were available throughout the study area as raster maps. Each species observation was combined with covariates from the raster cell where the sample location fell in. These are summarized in table 1 and described and motivated in detail by Veneranta et al. (2013). Due its fresh water origin vendace is a priori sensitive to even small changes in salinity levels experienced in the Gulf of Bothnia whereas the other species respond to salinity only in the Baltic Sea scale. Hence, we used all covariates only for vendace but excluded the winter salinity from other species.

Hierarchical multivariate species distribution model
We start our model building following the generic hierarchical structure similar to the one presented by, e.g., Wikle (2003), Cressie and Wikle (2011) and Banerjee et al. (2015) [data|process, parameters] : where the first layer of hierarchy is the probabilistic model which defines the conditional distribution for multivariate data Y (x, s), at spatial location s with associated covariates x, given the model parameters η and the multivariate latent process f (x, s). The second layer defines the prior distribution for the latent process given the process parameters θ, and the third layer defines the prior distribution for all unknown parameters.
Let j ∈ J = {1, · · · , J} be the species index set and J the total number of species in the study. Denote by Y (x, s) T = [Y 1 · · · Y J ] the J-variate random vector with components Y j = Y j (x j , s) related to the j th species at spatial location s T = [s 1 s 2 ] (coordinates) under the influence of environmental covariates x T j = [x j,1 · · · x j,cj ] where c j is the number of environmental covariates for the species j. We denote by x the full vector of covariates and X the covariate space. The species specific covariates x j are sub-vectors of x such that x j ∈ X j where X j ⊂ X and X j is c j -dimensional. Here, we assume that given a multivariate latent process where f j = f j (x j , s) is fixed but an unknown realization of the j'th process with covariates x j at the spatial location s. η j is the vector of parameters of the probabilistic model π Yj for the species j and η = [η 1 · · · η J ] T . In general any probabilistic model for data could be used and the observation models should be chosen according to the assumed sampling process. Here, we consider in detail the Binomial and the Negative-Binomial (over-dispersed Poisson) models used in our case study (Section 2). These models can be seen as observation models resulting from inhomogeneous point process model for species abundance Warton and Shepherd, 2010).
In practice, each sampling site consists of a small area where the observations are made. In our case study, the area covered by a sampling site is so small compared to the whole study region that the latent function is practically constant in each sampling site. The model for the count of species presences at uniformly distributed points in a sampling site (plants in our case study) is then Binomial given by where z B (s) is the total number of observation points in the sampling site at location s and p(f (x, s)) is the success probability, which will be modeled through the logit here.
The natural model for the number of individuals in a closed area or volume in a sampling site (fish in our case study) is Poisson which we extend to over-dispersed Poisson using the Negative-Binomial given by (see Liu and Vanhatalo, 2018, Section 7.1 for details) , r) and r is the over-dispersion parameter. The latent process f (·) corresponds to the logarithm of a species (relative) density and z N (s) is the sampled volume of water in the sampling site at location s.
Increasing r decreases the variance and when r → ∞, the model approaches the Poisson distribution.

Univariate additive latent Gaussian process
First, we assume that marginally for any species j, the process model is given by where β j,0 is the offset weight with distribution β j,0 |σ 2 j,0 ∼ N (0, σ 2 j,0 ) and h j : X j → R is a predictor function of environmental covariates. The predictor function is assumed to be additive over the covariates, h j (x j ) = cj r=1 h j,r (x j,r ), and each additive term is given an independent zero mean GP prior, so that where k hj,r (x j,r , x j,r ; θ j,r ) = Cov(h j,r (x j,r ), h j,r (x j,r )| θ j,r ) is a covariance function with parameter θ j,r and θ j = [θ T j,1 · · · θ T j,cj ] T . For example, with k hj,r (x j,r , x j,r ; θ j,r ) = x j,r x j,r σ 2 j,r and θ j,r = σ 2 j,r , the predictor function corresponds to linear model h j,r (x j,r ) = x j,r β j,r where β j,r ∼ N (0, σ 2 j,r ) . With other choices of covariance functions we can model non-linear responses in which case the model can be seen as an alternative to the traditional generalized additive models (GAMs, Hastie and Tibishirani, 1986). The general form of an additive GP prior would include also joint effects of covariates (Duvenaud et al., 2011) but this is not considered here.
The term ε j (s) is a spatial GP, where k j (s, s ; j , σ 2 j ) is a spatial covariance function with variance σ 2 j and range parameter j . When we consider independent models for each species, that is the traditional single species SDMs (SSDMs), the processes h j,r and j are mutually independent among all species. The model outlined this far is similar to the GP based species distribution models used by, e.g., Vanhatalo et al. (2012), Kallasvuo et al. (2017) and Golding and Purse (2016). Next, we introduce models which consider dependency.

Additive multivariate Gaussian process priors
In order to account for possible species interdependence, we first include spatial species interaction into the model through the linear model of coregionalization (LMC) (Mardia and Goodall, 1993;Gelfand et al., 2004). Write ε(s) T = [ε 1 (s) · · · ε J (s)] and assume that ε(s) has LMC covariance structure with species specific correlation functionsk j (·, · ; j ). The covariance structure of the LMC with interspecific spatial dependence is then, and u ,l (j, j ) the entry (j, j ) of U ,l = L ,l L T ,l where L ,l is the l th column of the Cholesky decomposition L of the coregionalization matrix Σ , that is matrix of interspecific correlations between spatial GP. Hence, the vector-valued latent process f (x, s) T = [f 1 (x 1 , s) · · · f J (x J , s)] unconditional on β 1,0 ,. . .,β J,0 follows a multivariate GP which we denote as and k hr (x r , x r ; θ) = diag k h1,r (x 1,r , x 1,r ; θ 1,r ), · · · , k h J,r (x J,r , x J,r ; θ J,r ) . If the predictor functions, h j,r , were linear, the prior (3.9) would correspond to traditional multivariate spatial model with independent linear effects and spatial LMC (Gelfand et al., 2004).
We extend (3.9) to an additive multivariate GP prior where the dependence between the species specific additive predictor functions is modeled with LMC. We demonstrate this with a model where the set of covariates is equal for all species, that is, dim(X j ) = c and x j = x for all j. Then the model is written as where Λ 2 = (Λ 1 , Σ h1 , . . . , Σ hc ) and Σ hr is the interspecific covariance matrix between the species specific predictor functions of r'th covariate,k hj,r (·, ·; θ j,r ) is a correlation function related to the predictor function h j,r , U hr,j = L hr,j L T hr,j and L hr,j is the j th column of the Cholesky decomposition L hr of Σ hr . When the set of covariates is not the same for all species, covariate specific predictive functions are correlated only between species that share those same covariates.
A JSDM with multivariate GP prior (3.10) can be interpreted as extension of GP based SSDMs (Vanhatalo et al., 2012;Golding and Purse, 2016) to joint species modeling similarly as done with the generalized linear model based JSDMs (Pollock et al., 2014;Ovaskainen et al., 2017). The enhanced inferential ability of the multivariate additive GP compared to univariate GP models lies in its capability to infer similarity (dissimilarity) in species specific responses to covariates and in the spatial random effect. The similarity/dissimilarity of responses of two species j and j along a covariate r is indicated by a positive/negative value for correlation and hence, examining the LMC covariance matrices, Σ hr , can provide new insight to species to species associations.

Marginally uniform priors for correlation parameters
Here, we define the prior for coregionalization covariance matrices but define the covariance functions and rest of the priors in Section 5. A standard choice for prior for correlation matrices is the inverse Wishart distribution. However, if there is no prior information on interspecific correlations, we follow Hartmann et al. (2017) and suggest to use marginally uniform priors for the LMC covariance matrices Σ and Σ hr . These can be achieved by the distribution of Barnard et al. (2000) and Tokuda et al. (2012). Let P be a correlation matrix with elements ρ j,j . A prior distribution that is marginally noninformative for the correlations, that is, the marginal distribution for every ρ j,j is uniform over (−1, 1), is achieved with the distribution when v = J − 1. Here, Γ J (·) is the multivariate gamma function and matrix P jj is obtained by removing the j th column and j th row of P. When v increases, the probability density (3.11) becomes increasingly concentrated around the origin.

Posterior inference and predictive properties
Given a set of species observations at locations s ij , i j = 1, . . . , n j , respectively for each species j = 1, . . . , J, the likelihood can be written as where y j,ij is the i j 'th observation of species j at the i j 'th spatial location s T ij associated with a set of covariates x ij ∈ X j . The observed vector y = [y T 1 · · · y T J ] T with y T j = [y j,1 · · · y j,nj ] collects all the species observations and , collects the corresponding latent variables. Hence, the likelihood factorizes over the latent variables and the inference can be done similarly as with univariate GP models. Markov chain Monte Carlo (MCMC) would provide exact answers in the limit of large sample size but deterministic approximations such as Laplace approximation or expectation propagation algorithm have also been shown to provide accurate approximate inference for univariate GP models with much lower computational time (Nickisch and Rasmussen, 2008;Rue and Marino, 2009;Vanhatalo et al., 2010). In order to study the properties of the model, we will examine the Laplace approximation for the posterior of latent variables conditional on the hyperparameters.

Posterior predictive inference conditional on hyperparameters Posterior of latent variables
The Laplace's method approximates the conditional posterior of the latent function values f * = f (s * , x * ) at the spatial location s * with covariates x * as 1 wheref is the (conditional) maximum a posterior (MAP) estimate of latent variables, and W is the Hessian matrix of the negative log-likelihood function evaluated atf . Here, C is the prior covariance matrix of f , C * ,f is the (prior) covariance matrix between elements of f * and f . C * is the prior covariance of f * . In case of multivariate additive GP (3.10) the prior covariance matrix is given by where J n is an n×n matrix full of ones u ,l (j, j ) and u hr,l (j, j ) are the entry (j, j ) of U ,l and U hr,l (see (3.8)), and {K l } j,j and {K h l,r } j,j are correlation matrices with elements The other correlation matrices are constructed similarly. If data on all species are available at all sampling locations, the covariance matrix reduces to Kronecker product similarly as in LMC models by Mardia and Goodall (1993) and Gelfand et al. (2004), so that C = Σ 0 ⊗ J n + j=1 U ,j ⊗K j where n j = n for all j = 1, . . . , J. Since, in general, the second and third term of C are full matrices, it can be seen from (4.2) and (4.3) that the posterior of f and the posterior predictive distribution of f * are affected by observations of all species at any spatial locations.

The marginal posterior of additive terms
Typically, we are also interested in species' responses along covariates encoded by the additive predictor functions. Their marginal posteriors, conditional on hyperparameters, are analogous to (4.2). The matrices C * and C * ,f are only constructed using the correlation functions corresponding to the latent function of interest. For example, to study the response of species j along covariate x r we evaluate the posterior predictive distribution for h j,r (x r ) using (4.2) so that we replace C * ,f and similarly for C * . However, in case of the traditional LMC (3.9) only the spatial random effects are correlated between species whereas the predictor functions are not. Hence, Σ hr is diagonal for all r so that u hr,l (j, j ) = σ 2 l if l = j = j and zero otherwise. The second terms of (4.4) is then block diagonal and only the j'th block column in (4.5) is non-zero for which reason there is no information exchange between species specific predictor functions. In case of univariate GP prior all the terms in (4.4) are block diagonal and (4.2) reduces to J independent Gaussian distributions with no information exchange among species at all. Hence, the predictive performance of the univariate GP and the two multivariate GP models are very different. This is illustrated in section 5.

The (marginal) posterior expectation and variance for new observations
When predicting species occurrence probability or abundance, we need to marginalize over the posterior of the latent variables. The posterior expectation and variance for a new outcome for species j at a location (x * , s * ) conditional on hyperparameters are When the probabilistic model for species j is assumed to be the Negative-Binomial or Binomial model with logistic link function (3.3) we can find either an exact result or an analytical approximation for these expectations and variances. These are given in the Supplementary material. These solutions speed up the predictive calculations considerably compared to numerically integrating f over R.

Conditional scenario predictions
In some applications, one might be interested in scenario predictions conditional on changes in species composition. For example, one might be interested in how removing from or introducing specific species into an area would affect other species. In our case study setting, we could be interested in, for example, effects of management actions that would clean filamentous algae from the shoreline. Such scenario predictions would be naturally tackled with predictive causal inference (Lindley, 2002;Pearl, 2009). In predictive causal inference the parameters of the model are inferred with the available data so far and predictions made considering alternative scenarios.
To illustrate this lets first introduce a short notation y J1, * = y J1 (x * , s * ) and f J1, * = f J1 (x * , s * ) where J 1 denotes the set of species to be predicted and J 2 the scenario species assumed to be "managed". For brevity, we omit the conditioning on hyperparameters and data. The conditional distribution of where Y J1 (x * , s * )| y J2 , f J1, * only depends on f J1, * . The Laplace approximation for this conditional predictive distribution is shown in Supplementary material.

Parameter inference
We used Laplace approximation (Vanhatalo et al., 2010) to approximate the conditional posterior of latent variables π(f | y, η, Λ) and the marginal likelihood of the hyperparameters π(y | η, Λ) = π(y | f , η)π(f |Λ)d f . We searched for the (approximate) maximum a posterior (MAP) estimate for hyperparameters given by where log q(y | η, Λ) is the Laplace approximation for the log marginal likelihood for parameters. This approach produces also good approximation for the posterior predictive distribution for latent variables (Vanhatalo et al., 2010;Vehtari et al., 2016), which are the main interest in this work. Hence, we fixed hyperparameters to their MAP estimate.
In order to avoid constrained optimization, all the parameters were transferred to unconstrained space. We used log transformation for covariance function parameters, and for the interspecific correlation matrices we used the transformation presented by  and , which is a bijective mapping between the space of correlation matrices and R ( J 2 ) . This is summarized in Supplementary material. We used scaled conjugate gradient optimization for locating the MAP and checked carefully that a (local) mode had really been found by verifying that gradients along all hyperparameters were zero. The required gradients of log q(y | η, Λ) were solved analytically as described by  and Vanhatalo et al. (2010). The Supplementary material summarizes the derivatives w.r.t. to the covariance parameters in Σ , Σ hr , r = 1, . . . , c.

Experiments
Here, we first illustrate the properties of the hierarchical multivariate GP models with two simple examples. These examples highlight particular properties of the proposed model. After this we introduce the model and analysis for our case study (Section 2). We implemented all the models using the GPstuff package . are dropped out. The spatial correlation function used in this demo is the Gaussian k(s, s ) = e −|| s − s || 2 /2l 2 , where || s − s || is the Euclidean distance and l is the lengthscale. The first row of subplots shows the posterior predictive probability of observing species 1 (E[Y 1 ], Y 1 (s) ∼ Binomial) and the second row shows the expected number of (a) and (e) show the predictions when the species observations are from the same region but not from exactly the same locations. In this case, the prediction of multivariate GP is similar to predictions by univariate GPs. Plots (b) and (f) show the predictions when data are available on both species from the lower left corner and additionally data on species 1 is available from upper right corner. There is positive correlation between species, which has been inferred from the data in the lower left region, so the prediction for species 2 in upper right corner is informed by data on species 1 in that region. The last two columns illustrate the conditional scenario predictions (4.7). In the third column, the training data is the same as in the first column and the expected values in plots (c) and (g) show joint scenario prediction in new region of same size and shape. In this scenario, species in the new region were observed so that species 1 is known to be in the top right, and species 2 in the bottom left. The fourth column shows the conditional scenario prediction for both of the species separately so that the grey marks show the locations where the other species would be introduced in these scenarios.

Demonstration with time series of species abundances
Next, we consider the classical predator-prey system containing annual population counts of hare and lynx in the northern Canada from 1845 to 1935 (Odum, 1953). The data is not spatially explicit since the observations are total counts over the region so we model it as time series. This illustrates the behavior of individual additive covariate effect in the multivariate GP model (3.10) with time being the covariate. Let's denote by Y 1,t and Y 2,t the population sizes of hare and lynx at time t, and assume that . We compared two alternative priors for the latent functions, one with independent GP priors and another with joint bivariate GP prior with the LMC structure. In order to compare models' predictive performances when some species have not been observed, we removed parts of the original data from the training set, the period of 1870-1900 for hare and 1850-1870 for lynx. Figure 3 displays the result of the data analysis. The model with bivariate GP prior clearly outperforms the independent model since it predicts better the test data points in periods where training data was removed. Moreover, its predictions have also smaller uncertainty than the predictions of independent GPs in those periods of time.

The case study on coastal species distribution
Case study models The plant data are modeled with the Binomial observation model (3.3) and the fish data are modeled with the Negative-Binomial observation model (3.4). In order to test the effect of different model components and the effect of semi-parametric response functions versus standard quadratic response functions we compare the following models: 1) additive univariate GP predictor functions and univ. spatial random effects (3.5), 2) additive univariate GP predictor functions and LMC spatial random effect (3.9), 3) additive multivariate GP predictor function and LMC spatial random effect (3.10), 4) additive univariate GP predictor functions only (model (3.5) without j (s)), 5) univariate spatial random effects only (model (3.5) without h j (x j )), 6) additive univariate quadratic predictors and univariate spatial random effects (model (3.5) with h j (x j ) = x j β j and x j includes the covariates and their squares), 7) additive univariate quadratic predictors and LMC spatial random effect (model (3.9) with h j (x j ) = x j β j and x j includes the covariates and their squares), 8) additive multivariate quadratic predictors and LMC spatial random effects (model (3.10) with h j (x j ) = x j β j and x j includes the covariates and their squares). 9) additive multivariate quadratic predictors only (model (3.10) without (s) with h j (x j ) = x j β j and x j includes the covariates and their squares).
In each model, the spatial random effects were given the Mátern covariance function The continuous covariate effects in the additive GP models (models 1-4) were given the Gaussian correlation functioñ k hj ,r (x r , x r ) = e −||xr−x r || 2 /2l 2 j,r . The additive linear models (models 6-8) were coded as additive GPs withk hj,r (x r , x r ) = x r x r σ 2 j,r . Even though this is not a proper correlation function, when used in LMC it implies a generalized linear model with interspecific dependencies between weights, β j , that are the key components in state-of-the-art parametric JSDMs. See Discussion for details. For the categorical covariate (Bottom class, table 1) we used a delta function so thatk hj,r (x r , x r ; θ) = σ 2 δj δ xr (x r ), where δ xr (x r ) = 1 if x r = x r and zero otherwise. This corresponds to having an own intercept for each category. We used the marginally uniform priors (section 4.1) for the between species correlations and weakly informative priors for the rest of the parameters. The variance parameters were given Half-Student-t(µ = 0, σ 2 = 4, ν = 4) priors and the length-scale parameters were given Half-Inverse-Student-t(µ = 0, σ 2 = 1, ν = 4). Hence, a priori more weight is given for smooth functions with small variability.

Model validation and comparison
We aim to provide models that give reliable posterior predictions. Hence, it is natural to compare models with the goodness of their posterior predictive distributions. This can be done efficiently with cross-validation (CV) using the log predictive density diagnostics (Vehtari and Ojanen, 2012). Let D denote the full data-set. Fix K disjoint sub-sets of D, say D 1 , . . . , D K , such that their union is D. The K-fold CV log point-wise marginal predictive density is then L K (D) = 1 n n i=1 log π(y i |D ∼k(i) ) where D ∼k(i) = {∪ K r=1 D r : r = k(i)} and k(i) is such that y i ∈ D k(i) . We compare the models with the leave-oneout CV (LOO-CV, K = n) and structured 5-fold CV. We used the MAP estimate for the hyperparameters. In 5-fold CV we found the MAP for each training set separately. The LOO-CV was conducted at the MAP found with full data and we used Laplace approximation to approximate the LOO-CV log predictive densities (Vehtari et al., 2016). Laplace approximation for LOO-CV is shown to work well in GP models and, since our data is rather large, leaving only one data point out of the training set has only negligible effect on the posterior of the hyperparameters (Vehtari et al., 2016).
The rationale for calculating both LOO-CV and 5-fold CV comparison is the following. Since multiple species were sampled at every sampling site and each of the sampling  sites has other sites spatially nearby it, the LOO-CV log predictive densities are affected significantly by the spatial random effects. The LOO-CV, thus, measures models' interpolation performance which can be good even if the models were not able to represent well responses along covariates (predictor functions) (Vanhatalo et al., 2012;Veneranta et al., 2013). For this reason, we structured the 5-fold CV by dividing the data into five spatially distinct groups corresponding to regions I-V in Figure 1. The sampling sites in different groups are spatially so far from each others that the spatial random effects do not affect the posterior predictive distributions for test groups. Hence, the 5-fold CV tests mostly the extrapolation performance of a model, which is governed by the goodness of the predictor functions, whereas the LOO-CV tests the interpolation performance, which is governed also by the spatial random effects.  4 in Supplementary material show the pairwise comparison of the CV point wise log predictive densities between the best GP/parametric model (models 3 and 8) and the other GP/parametric models with both the covariate effects and spatial random effect (models 1-2 and 6-7). The results show that model 3, which includes multivariate GP predictors and multivariate spatial random effects (3.10), is the best in both LOO-CV and 5-fold CV with a difference of 10 −2 or more to the other models. Since the mean log predictive density is an average of n = 850 point-wise predictions the difference of 10 −2 corresponds to a difference of 8.5 in log (point-wise) joint predictive densities. Analogously to Bayes factors, which compare log prior predictive distribution, this can be considered a significant difference between two models (Kass and Raftery, 1995).

Predictive performance of models
Hence, the multivariate GP (model 3) has significantly better overall predictive performance over all species than the other models. According to posterior predictive checks (Gelman et al., 2013) there are no serious discrepancies between its predictions and observed data. In extrapolation (5-fold CV) model 3 performs best also for all species separately. However, in interpolation (LOO-CV) it is not the best for all species separately (tables 1-2 in Supplementary material). Moreover, the semi-parametric GP models (models 1-4) work better than the corresponding parametric models (quadratic environmental response, models 6-9) in both interpolation and extrapolation in general. Dropping either spatial random effect or covariate effect out from the model decreases its performance clearly. All models work better in interpolation than in extrapolation and compared to univariate models including interspecific correlations either in spatial random effects or also in environmental predictors improves models' performance in both of these tasks. Moreover, including interspecific correlations between environmental responses, h j,r (x r ), improves the extrapolation performance relatively more than including interspecific correlations only for spatial random effects. Hence, multivariate spatial random effect improved more interpolation whereas multivariate predictors has larger effect for extrapolation as would intuitively be expected.

Effects of environmental covariates
The interspecific correlations (Figure 4) show that the responses to environment are similar among Coregonids and among sticklebacks whereas there are clear differences among these groups. These fish groups respond differently to sandy bottom and distance to deep. Moreover, there is strong positive spatial correlation among Coregonids and among sticklebacks but not between these groups. All species have negative spatial correlation with filamentous algae whereas there is either weak positive or negligible spatial correlation between fish species and macrovegetation and diatomous algae. show patterns that would be hard to capture with a quadratic function. For example, three and nine spine stickleback, macrovegetation and filamentous algae show logistic style response to either distance to deep or openness. Moreover, the response of vendace on ice break up week is first constant but has very steep increase after week 16. The responses to ice break up or chlorophyll-a show non-linear and non-quadratic responses also for white fish, macrovegetation and filamentous algae. The MAP estimates of other hyper-parameters are summarized in Supplementary material. Figure 6 shows the distribution maps as posterior median of intensity and expected count of individuals in replicate sampling produced by SSDM (model 1) and JSDM (model 3) for vendace. The distribution maps for other species together with maps on posterior predictive variance are shown in Supplementary material. In broad scale the posterior median of the intensity looks similar with SSDM and JSDM whereas SSDM predicts larger species counts than JSDM for all species throughout the study region and the uncertainty related to SSDM predictions is larger than that of JSDM predictions. Both SSDM and JSDM predict that vendace is distributed mostly in the northern Gulf of Bothnia. However, SSDM predicts somewhat higher median intensity than JSDM also in the southern Gulf of Bothnia. In relation to median intensity similar pattern that JSDM predicts more restricted distribution range than the single species model is seen also in the predictions concerning sticklebacks and whitefish. In case of diatomous and filamentous algae SSDM and JSDM predict the distribution pattern similarly whereas for macrovegetation JSDM predicts slightly larger distribution ranges. The posterior distributions of spatial lenght-scale parameters were concentrated near one kilometer or less (see table 5 in Supplementary material). Hence, the differences in distribution predictions cannot be explained by the spatial random effects over large areas but the spatial random effect explains local deviations from the predictions based on covariates.

Spatial predictions
7 Discussion and concluding remarks

Case study results
The GP based SDMs had better predictive performance than the parametric SDMs and JSDMs had better predictive performance than the corresponding SSDMs. The differences between SSDM and JSDM are most apparent in the distribution maps and predictor functions. The JSDM predicted in general smaller distribution ranges than SSDMs (macrovegetation being the only exception) and the posterior uncertainty in their predictions were smaller than in SSDMs.
When interpreting the results, it should be remembered that the sampling was tar-geted to larvae of sea-spawning Coregonids (whitefish and vendace) and planned to cover their plausible distribution range in terms of spatial (coastal areas of Gulf of Bothnia) and temporal extent (spring). Other species were sampled alongside Coregonids and the sampling area covers only limited portion of their full distribution range in the spring. Moreover, the abundance and distribution of all the studied species vary annually. Fish change their distribution areas seasonally and their larval stages last only few weeks. Vegetation and algal cover vary due changes in temperature and ice effects. For these reasons the results are most representative for larvae of whitefish and vendace. For other species the results describe their spring distribution in the shallow coastal regions only.
Our results correspond rather well to the earlier knowledge on the studied species. The responses to the environmental covariates and the interspecific correlations (Figures 5 and 4) indicate that whitefish and vendace larvae are, in general, distributed in different areas than sticklebacks. Most of the literature on Baltic Sea sticklebacks focus on three-spined sticklebacks whereas the nine-spined stickleback is not well studied. In early spring, stickleback abundances have been found to be highest in sheltered archipelago areas, where part of the population overwinter. High abundance of sticklebacks are typically thought to indicate structural complexity on the bottom; such as the presence of stones and boulders as well as reeds and other macrovegetation that function as shelter and provide food (Peltonen et al., 2004;Sieben et al., 2011). On contrary, highest densities of whitefish and vendace larvae are observed in open sandy shores near deep areas and in structurally simple shores, without macrovegetation, boulders and stones (Hudd et al., 1988;Leskelä et al., 1991;Veneranta et al., 2013). The distribution of vendace is highly influenced by ice break up week and salinity so that long ice cover period and low salinity increase their abundance. Ice winter is longer and salinity is lower in the northern than in the southern Gulf of Bothnia and thus it correlates positively to Coregonid presence (Vanhatalo et al., 2012). In general, optimal habitats for Coregonid larvae are mainly located in the northernmost Gulf of Bothnia.
Sticklebacks are known to prey mainly on mesozooplankton, but also on grazers (Peltonen et al., 2004;Sieben et al., 2011). Sticklebacks feed also on fish larvae if available (Byström et al., 2015), but there are no studies on potential predation risk to Coregonids. Based on our results, in the scale of coastal region of Gulf of Bothnia, sticklebacks are not thread to Coregonid larvae since their high density areas do not overlap. Moreover, there was no spatial correlation between sticklebacks and coregonids ( Figure 4) which could indicate interspecific interaction of any kind. The presence of sticklebacks has been connected to higher eutrophication status and stickleback reproduction and growth benefit from increasing temperature and eutrophication (Lefébure et al., 2011;Candolin et al., 2008;Meier et al., 2012) On contrary, these environmental characteristics are assumed to affect negatively the Coregonid reproduction (M. Cingi et al., 2010;Müller, 1992;Veneranta et al., 2013;Vanhatalo et al., 2012). In our results whitefish and vendace larvae have positive response to Chlorophyl-a, which is a strong indicator for increasing eutrophication. However, in the scale of Gulf of Bothnia Chlorophyl-a concentration is typically higher near river mouths where salinity is lower and nutrient inflow high. Hence the result more likely reflects the high river influence through low salinity than preference for eutrophicated water.
The vegetation and algae distribution in our results reflect the nutrient status during winter and early spring as well as effect of ice cover and ice scraping in wind exposed shallow areas. Both the SSDMs and JSDMs indicate that filamentous algae occur in all coastal areas in high densities, except in some sheltered inner coastal areas. Macrovegetation and diatomous algae had highest presence probabilities at areas with lower filament presence probability. This pattern agrees well with the general understanding of these species groups. Filamentous algae are typically distributed in wind and wave exposed shores, that epiphytic diatoms and reeds that require shelter cannot tolerate. Filamentous algae can, however, grow over macrovegetation and it is possible that macrovegetation distribution is underestimated if filamentous algae growth over macrovegetation has hided macrovegetation from the sampling pictures. The higher nutrient levels in sheltered areas have been found to affect positively reed belt growth, and high abundances of macrovegetation species have been found from archipelago areas, lagoons, bays and river inlets (Pitkänen et al., 2013;Altartouri et al., 2014). The JSDM model reflects these smaller scale occurrences better than SSDM. In our results, diatomous algae are more common in estuaries and northern parts of the study area (see Supplementary material). This is likely explained by longer ice winter towards northern Gulf of Bothnia and longer distance to deep water.

Multivariate additive Gaussian process
Next we discuss some of the similarities and differences between our model and the existing JSDMs and highlight the novel methodological contributions in this paper.

Interspecific correlations
From the predictive point of view, the interspecific correlations in predictive functions are attractive for many reasons. In many applications of SDMs the aim is to predict species distribution over regions that include locations spatially far from the data (Record et al., 2013) or to conduct scenario predictions related to, for example, climate change or land use (Guisan et al., 2013). In these applications, predictions are based on the responses to environmental covariates. The inclusion of interspecific dependence allows information flow between species which improves the estimates for predictive functions especially for species with only scarce data (Thorson et al., 2015;Ovaskainen et al., 2017;Hui et al., 2013;Clark et al., 2014).
We used the marginally uniform priors of (Barnard et al., 2000;Tokuda et al., 2012) for the interspecific correlations. This is justified by prior ignorance on correlations. Prior information on interspecific correlations could, however, be added into model with, e.g., informative inverse Wishart prior. With many species inferring the full covariance matrix is hard in which case we could use spatially dependent latent factors which induce for the spatial random effects a covariance structure Cov j (s), j (s ) = M q=1 λ qj λ qj k q (s, s ), where k q is the q'th spatial covariance function and λ qj the corresponding species specific factor loading. Here, M < J so that this covariance structure is a low rank representation of LMC (Lopez, 2000;Lopez et al., 2008). Latent factor representation was first introduced to SDMs by Thorson et al. (2015) and it is used also in HMSC Ovaskainen et al. (2017). Recently Taylor-Rodríguez et al. (2017) proposed a clustering scheme where the species are clustered to less than J factor loading vectors.
Interspecific correlation between response functions has received less interest. Typically the response functions are assumed to be independent among species. Exception are species archetype models  and the HMSC framework of Ovaskainen et al. (2017). In the latter, the response functions are defined as h j (x j ) = x j θ j· where θ = [θ T 1· , . . . , θ T J· ] T is a J × p matrix of regression weights with hierarchical prior. The species specific weights are given Gaussian prior θ j· ∼ N (µ j· , V j ) where µ j· is the expected response of a species that can be common for all species, common within groups of species or modelled through species traits µ jr = τ T j γ r where τ j is a vector of trait values of species j and γ r ∼ N (0, V γ ) are the effects of traits to response along covariate r. With V = σ 2 I J×J and common prior mean µ j· = µ ∼ N (0, σ 2 µ ) for all j ∈ 1, . . . , J, the induced covariance between species specific additive response functions is and with trait dependent prior mean Cov (h j,r (x r ), h j ,r (x r )) = (τ T j V γ τ j +δ j (j )σ 2 )x r x r . Hence, these alternatives have the same covariance structure as an additive multivariate GP (3.10) with k hj ,r = x r x r for all j = 1, . . . , J, and interspecific covariances Σ hr = σ 2 µ + δ j (j )σ 2 and Σ hr = τ T j V γ τ j + δ j (j )σ 2 . Hence, the hierarchical prior structure of HMSC could easily be extended to semiparametric models as well if we had trait information from our species. The benefit would be that these hierarchical model structures contain ecologically relevant prior information and, hence, using the induced interspecific covariance structure in the additive multivariate GP model would make it ecologically more realistic and potentially improve its predictive performance. The HMSC framework has many other features as well, such as phylogenetic relationships Ovaskainen et al. (2017), which could in principle be incorporated with our approach as well. The structure of interspecific correlations is likely to be especially important in scenario based predictive analyses, such as climate change predictions.
Since our model does not make any mechanistic assumptions on species interdependencies the results concerning interspecific correlations have to be interpreted with care. For example, a positive correlation between spatial random effects could arize due mutualism, predator prey association or similar response to unobserved environmental covariates. Hence, interpreting these correlations should be done in light of more general ecological knowledge on the species. This is, however, a common challenge with all current JSDMs (Thorson et al., 2015;Dunstan et al., 2013;Ovaskainen et al., 2017;Taylor-Rodríguez et al., 2017). Moreover, the property of our model, as well as most other JSDM, is that if the training data shows positive or negative spatial correlation between two species this positive correlation is assumed to hold everywhere in the study domain. In many cases this might be an unrealistic assumption for which reason one interesting future development would be to extend our models to allow species-to-species associations to depend on the environmental context (Fox and Dunson, 2015;Tikhonov et al., 2017). However, the colocalization patterns are explained by both environmental covariates and spatial random effects and regional differences in colocalization patterns can be explained by different environmental covariates in different regions.
The interspecific correlation in our approach are modelled on the latent variable level and, hence, are not directly comparable to interspecific correlations in data. Some authors have argued that this is a conceptual weakness of latent variable models since it makes interpretating their results hard and thus less transparent. An alternative would be to model the correlations in the observation error model (see, e.g., Ovaskainen et al., 2010;Pollock et al., 2014). Clark et al. (2016) proposed generalized joint attribute model (GJAM) to fix this interpretation challenge by aiming to model the correlations between species on the data scale. However, for example, in case of presence absence observations GJAM corresponds to multivariate probit model that is the marginal likelihood of our model in case of Bernoulli observation model. It is also unclear how to disentangle the process underlying the species occurrence and abundance from the observation process leading to the actual data in GJAM approach. Hence, despite the challenge of interpretating the correlations, we prefer the hierarchical latent variable modeling since it provides a coherent approach to separate these two processes.

Response functions
The response functions along environmental covariates provide basis to understand how environment affects species distribution and to predict the species distribution in new areas. Linear and other parametric models are efficient in finding overall trends in responses but have trouble in locating abrupt changes and change points due, for example, physical tolerance limits. Such limits, for example, along temperature and salinity are typical for large variety of taxa in aquatic domains (MacKenzie et al., 2007;Kotta et al., 2019). With SSDMs Vanhatalo et al. (2012), Shelton et al. (2014), Golding and Purse (2016) and Kallasvuo et al. (2017) have demonstrated the benefits of semiparametric models in such situations. The GP approach for modeling environmental responses is similar to generalized additive models (Guisan et al., 2002) but the latter have not been implemented as JSDMs. The flexibility of semiparametric models provides also challenges compared to the more restricted parametric models. The prior for covariance function parameters has to be set with care (Golding and Purse, 2016) and we may also need to pose monotonicity constraints to the response functions Kotta et al. (2019) in order to prevent overfitting. Inferring the responses reliably requires typically more data compared to parametric models. The multivariate additive GP helps in these challenges as well since the interspecific correlations increase the effectively amount of data to be used to infer each response function and hence decreases the uncertainty in them.
Additive multivariate GP framework opens also new questions related to the choice of covariance functions and interspecific correlations. A typical choice for general purpose covariance function is a radial basis function. However, predictions using these covariance functions revert to prior predictive distributions when predicting beyond covariate range covered by data. Hence, in extrapolation tasks stationary covariance functions may not be the optimal choice (Vanhatalo et al., 2012;Kallasvuo et al., 2017) and combining the multivariate GP models with functional constraints (e.g. Kotta et al., 2019) could improve their predictive performance further.

Spatial random effects
Our model comparison results show clearly that including spatial random effects into model improves their predictive performance in both inter-and extrapolation. This result is well in line with earlier works demonstrating that environmental conditions alone may not sufficiently explain species distribution and spatial random effects improve their predictive performance Vanhatalo et al., 2012;Clark et al., 2014;Thorson et al., 2015;Kallasvuo et al., 2017). This is reasonable, since the distribution of species is shaped by the interplay of environmental covariates, stochastic processes and species interaction (Ovaskainen et al., 2017). Hence, a justified model should account for random processes and as demonstrated also by our results also to species interactions in these processes. However, adding spatial random effects into model may lead to problems in model identifibiality which we discuss in the next section.

Posterior inference and predictive performance
The JSDM built with multivariate additive GP had clearly the best overall predictive performance. Moreover, the uncertainty in the predictions by JSDMs were smaller than in the SSDMs which, together with the best log predictive density performance, indicates that the predictions were also more accurate. These findings have important implications to practical use of SDMs for management and other purposes. For example Kallasvuo et al. (2017) used SSDMs to classify coastal areas of Finland to unsuitable, suitable and highly productive spawning regions for four commercially important fish species. They based their estimates on the expected numbers of larvae in the coastal region which, as shown by Figure 6 is highly sensitive to the uncertainty of the predictions. Moreover, by using SSDMs we can overestimate the total biomass of all species (Clark et al., 2014).
An inherent challenge with the type of models considered here is the model identifibiality. Spatial random effects are known to affect the fixed effects estimates (Hodges and Reich, 2010) and in some cases the spatial random effect can actually capture variability otherwise explainable with fixed effects (Paciorek, 2010;Bose et al., 2018). Moreover, with uniform (uninformative) priors the length-scale and variance hyperparameters of Matérn class of covariance functions are non-identifiable (Zhang, 2004) in general. In order to alleviate these identifibiality challenges we used weakly informative priors for the variance and length-scale parameters defined through the principles proposed by Gelman (2006), Simpson et al. (2017) and Fuglstad et al. (2018). Our priors for the covariance function parameters favor small variance and large length-scale parameter values. The former penalizes for large magnitudes and the latter penalizes for wiggly responses along covariates or space. In case of covariate responses this is ecologically justified since according to ecological niche theory species' responses to environmental covariates are typically either monotonic or have only one mode (Ovaskainen et al., 2016). Moreover, Fuglstad et al. (2018) show that these type of priors work well for spatial random effects with Matérn covariance functions in general. Favoring longer spatial length-scales is justified also from the model identifibiality point of view. Identifiability between fixed and spatial random effects is of less concern if spatial random effect varies with larger spatial range than the environmental covariates (Paciorek, 2010, Section 3).
Since the environmental covariates in Finnish coastal region vary with relatively small spatial range, our weakly informative prior effectively favors models that explain this variation with covariate responses instead of with spatial random effects. Our model comparison suggests that the models with both covariate effects and spatial random effects lead to most reliable inference on environmental responses and spatial random effects. A model's extrapolation performance is governed mainly by the environmental covariates so better extrapolation performance indicates more reliable response functions. Similarly more accurate interpolation performance indicates more reliable joint effect of environmental covariates and spatial random effect.
An evident challenge with multivariate additive GP is the computation. The core element in the multivariate additive GP is the covariance matrix induced by the GP components. With many species and sampling sites the covariance matrix can become so large that it makes the implementation infeasible. Here we used Laplace method to speed up the computation by decreasing the number of costly posterior density calculations compared to full MCMC. However, in order to scale up the methods for large data sets involving thousands of sampling sites, the model would need to be implemented even more efficiently. One option to reduce the computational time could be to exploit the property that linear LMC can be parameterized through species specific conditional distributions (Gelfand et al., 2004). We could also implement multivariate GPs with sparse GP approximations (Vanhatalo et al., 2010;Alvarez et al., 2010) and replace the full rank LMC model with latent factor model (Thorson et al., 2015;Ovaskainen et al., 2017). However, we leave these considerations for future.

Summary
Our JSDM based on multivariate additive GP combines the key ideas from semiparametric SSDMs and state-of-the-art parametric JSDMs. In our case study, it showed superior predictive performance compared to existing GP based SSDMs and parametric SSDMs and JSDMs in both interpolation and extrapolation. Hence, the multivariate additive GP can be seen as the first step towards integration of the semi-parametric SSDMs and hierarchical JSDMs. We propose also an efficient approach for inference by utilizing Laplace approximation and gradient based optimization for hyperparameters. The multivariate additive GP model is not restricted only to species distribution modeling but can be used in wide variety of other applications as well.

Supplementary Material
The supplementary material contains additional mathematical formulation of the methodology proposed in this paper and additional figures and tables for the case study analysis. Ovaskainen, O., Tikhonov, G., Norberg, A., Guillaume Blanchet, F., Duan, L., Dunson, D., Roslin, T., and Abrego, N. (2017). "How to make more out of community data?

Conditional predictions
Recall that, in the main paper, the distribution of Y J1 (x * , s * )| y J2 conditioned on hyperparameters (they were ommited from the notation for simplicity) is given by π(y J1, * | y J2 ) = π(y J1, * | y J2 , f J1, * )π(y J2 , f J1, * )df J1, * /π(y J2 ) (2.1) The second term in the numerator of the integrand of (2.1) is the same as, where last two terms in (2.3) are Gaussians. The first one is the conditional Gaussian density function w.r.t. the predictor function values f J2 , that is, ,J2 * ) and the second one is the Laplace approximation using the scenario species data related to the set J 2 , that is, where C J1,J2 * is the covariance between species in the set J 1 at the point (x * , s * ) and the species in the set J 2 in their respective covariates and spatial locations. ∇ log π(y J2 |f J2 ) correspond to the gradient of the log-likelihood function for species related to the set J 2 . C J1 * , * is the covariance matrix of f J1, * at (x * , s * ). C J2,J2 is the covariance matrix of f J2 and W J2 is the negative Hessian matrix of the negative logarithm of the likelihood functions related to species in the set J 2 . Now, to calculate the predictive mean vector and predictive covariance matrix we follow the steps used in deriving unconditional expectation and unconditional variances as in the main paper. However we exclude them for brevity.

Unconstrained parametrization of covariance matrices
Let the matrix Σ be a covariance matrix of dimension J. In terms of variances and correlations, we rewrite Σ = diag(σ 2 1 , . . . , σ 2 J ) are variances and these are transformed as σ 2 j = exp(δ j ). The correlation matrix P is written in terms of its Cholesky decomposition, that is P = LL T , where L is the lower triangular Cholesky decomposition. In our case, the upper triangular Cholesk decomposition is given by where each z i,i ∈ (−1, 1). Now, since each z i,i can freely vary in the interval (−1, 1) without violate the positive definiteness property of P (see ), we map each z i,i to the real line as The derivative of the log π(P) (recall the prior for the correlation matrix in the main paper) w.r.t. each δ i,i is given by and the partial derivatives of log π(P |v) w.r.t to ρ j,j are obtained as, where c = j − 1 and c = j if r < j and, j = 1 or r > j . If r < j and r < j then c = j − 1 and c = j − 1. {P −1 } j,j is the entry (j, j ) of the matrix P −1 and {P −1 rr } c,c is the entry (c, c ) of the inverse matrix P −1 rr , where P rr is a matrix obtained by removing the r'th row and column of P. The partial derivatives of each ρ j,j w.r.t to δ i,i are obtained from ∂P /∂δ i,i = (∂L/∂δ i,i )L T + L(∂L/∂δ i,i ) T .
Lastly, we find the derivatives of the logarithm of the absolute value of the Jacobian determinant w.r.t each δ i,i (see equation (11) in ).

Gaussian integrals
Lemma 1. Let's denote by Φ(·) the standard-Gaussian cumulative distribution function and N (·|µ, σ 2 ) the Gaussian density function with parameters (µ, σ 2 ) ∈ R × R + . Then the following holds where F N (·|c, C) is the N -dimensional Gaussian cumulative distribution function with parameters (c, C) ∈ R N × R, with R the space of positive-definite matrices (covariance matrices). Furthermore, m N = [m 1 · · · m N ] T ∈ R N , µ N = µ1 N ∈ R N , v r > 0 ∀r and V N is a covariance matrix given by, Proof. To show (4.1), start writing the left-hand side of the equation in full. Rewrite the integrand as the product of non-standard Gaussian density functions as well as the regions of integration, i.e., Rewrite again using the following transformation [x, y 1 , · · · , y N ] T = [w + µ, z 1 + w + m 1 , · · · , z N + w + m N ] T and note that |∂(x, y 1 , · · · , y N )/∂(w, z 1 , · · · , z N )| = 1. After changing variables, group the different terms in the exponentials together to have where c = σ(2π) (N +1)/2 N r=1 v r . Now, the expression inside the squared bracket is a quadratic form which is written with the following matrix form, The integrand in (4.6) has the full form of the multivariate Gaussian density. To identify this we need to find the closed-form covariance matrix from the precision matrix in (4.6) and show that the determinant of the covariance matrix is given by c 2 /(2π) N +1 . Write the precision matrix as block matrix such that A = N r=1 Use the partitioned matrix inversion lemma (Strang and Borre, 1997, equation 17.44) to get the blocks, where its main diagonal equals to [v 2 1 +σ 2 , · · · , v 2 N +σ 2 ] and all off-diagonal elements are given by σ 2 . Put everything together to have the covariance matrix  whose determinant equals to 1/[det(D) det(A − BD −1 C)] = σ 2 N r=1 v 2 r = c 2 /(2π) N +1 by the partitioned matrix determinant lemma (e.g., . Finally, in (4.6), interchange the order of integration with Fubini-Tonelli theorem (Folland, 2013) and integrate w.r.t. w to get and therefore the equality (4.1) holds.
The above closed-form integrals extend many equalities in the table of Owen (1980).  Table 1: Model comparison with leave-one-out (LOO) and 5-fold cross validation using the mean point wise log marginal predictive density statistics (and its standard error) for models 1-5 (see Section 5.3 in the main text). The bolded numbers show the overall performance over all species and the indented text shows the species specific predictions.  Table 2: Model comparison with leave-one-out (LOO) and 5-fold cross validation using the mean point wise log marginal predictive density statistics (and its standard error) for models 6-9 (see Section 5.3 in the main text). The bolded numbers show the overall performance over all species and the indented text shows the species specific predictions.  Table 3: Pairwise comparison of leave-one-out (LOO) and 5-fold cross validation point wise log predictive densities of multivariate parametric (quadratic environmental responses) model (model 8) and parametric models without any interspecific correlations (model 6) or with interspecific correlations only in spatial random effect (model 7). We report the mean (and its standard error) of the differences in point wise log predictive densities at test locations. The bolded numbers show the overall performance and normal text shows the species specific predictions.  Table 4: Pairwise comparison of leave-one-out (LOO) and 5-fold cross validation point wise log predictive densities of multivariate GP model (model 3) and GP models without any interspecific correlations (model 1) or with interspecific correlations only in spatial random effect (model 2). We report the mean (and its standard error) of the differences in point wise log predictive densities at test locations. The bolded numbers show the overall performance and normal text shows the species specific predictions.