Data Augmentation for Bayesian Deep Learning

Deep Learning (DL) methods have emerged as one of the most powerful tools for functional approximation and prediction. While the representation properties of DL have been well studied, uncertainty quantification remains challenging and largely unexplored. Data augmentation techniques are a natural approach to provide uncertainty quantification and to incorporate stochastic Monte Carlo search into stochastic gradient descent (SGD) methods. The purpose of our paper is to show that training DL architectures with data augmentation leads to efficiency gains. We use the theory of scale mixtures of normals to derive data augmentation strategies for deep learning. This allows variants of the expectation-maximization and MCMC algorithms to be brought to bear on these high dimensional nonlinear deep learning models. To demonstrate our methodology, we develop data augmentation algorithms for a variety of commonly used activation functions: logit, ReLU, leaky ReLU and SVM. Our methodology is compared to traditional stochastic gradient descent with back-propagation. Our optimization procedure leads to a version of iteratively re-weighted least squares and can be implemented at scale with accelerated linear algebra methods providing substantial improvement in speed. We illustrate our methodology on a number of standard datasets. Finally, we conclude with directions for future research.


Introduction
Deep neural networks (DNNs) have become a central tool for Artificial Intelligence (AI) applications such as, image processing (ImageNet, Krizhevsky et al. (2012)), object recognition (ResNet, He et al. (2016)) and game intelligence (AlphaGoZero, Silver et al. (2016)). The approximability Bauer and Kohler, 2019) and rate of convergence of deep learning, either in the frequentist fashion (Schmidt-Hieber, 2020) or from a Bayesian predictive point of view (Polson and Rockova, 2018;Wang and Rockova, 2020), have been well-explored and understood. Fan et al. (2021) provides a selective overview of deep learning. However, training deep learners is challenging due to the high dimensional search space and the non-convex objective function. Deep neural networks have also suffered from issues such as local traps, miscalibration and overfitting. Various efforts have been made to improve the generalization performance and many of their roots lie in Bayesian modeling. For example, Dropout (Wager et al., 2013) is commonly used and can be viewed as a deterministic ridge ℓ 2 regularization. Sparsity structure via spike-and-slab priors (Polson and Rockova, 2018) on weights helps DNNs adapt to smoothness and avoid overfitting. Rezende et al. (2014) propose stochastic back-propagation through the use of latent Gaussian variables.
In this paper, following the spirit of hierarchical Bayesian modeling, we develop data augmentation strategies for deep learning with a complete data likelihood function equivalent to weighted least squares regression. By using the theory of mean-variance mixtures of Gaussians, our latent variable representation brings all of the conditionally linear model theory to deep learning. For example, it allows for the straightforward specification of uncertainty at each layer of deep learning and for a wide range of regularization penalties. Our method applies to commonly used activation functions such as ReLU, leaky ReLU, logit (see also Gan et al. (2015)), and provides a general framework for training and inference in DNNs. It inherits the advantages and disadvantages of data augmentation schemes. For approximation methods like Expectation-Maximization (EM) and Minorize-Maximization (MM), they are stable as they increase the objective but can be slow in the neighborhood of the maximum point even with acceleration methods such as Nesterov acceleration available and the performance is highly dependent on the properties of the objective function. Stochastic exploratory methods like MCMC have the main advantage of addressing uncertainty quantification (UQ) and are stable in the sense they require no tuning. Hyper-parameter estimation is immediately available using traditional Bayesian methods. DA augments the objective function with extra hidden units which allow for efficient step size selection for the gradient descent search. In some of the applications, data augmentation methods can be formulated in terms of complete data sufficient statistics, a considerable advantage when dealing with large datasets where most of the computational expense comes from repeatedly iterating over the data. By combing the MCMC methods with the J-copies trick (Jacquier et al., 2007), we can move faster towards posterior mode and avoid local maxima. Traditional methods for training deep learning models such as stochastic gradient descent (SGD) have none of the above advantages. We also note that we exploit the advantages of SGD and accelerated linear algebra methods when we implement our weighted least squares regression step.
Data augmentation strategies are commonplace in statistical algorithms and accelerated convergence (Nesterov, 1983;Green, 1984) is available. Our goal is to show similar efficiency improvements for deep learning. Our work builds on Deng et al. (2019) who provide adaptive empirical Bayes methods. In particular, we show how to implement standard activation functions, including ReLU (Polson and Rockova, 2018), logistic (Zhou et al., 2012;Hernández-Lobato and Adams, 2015) and SVM (Mallick et al., 2005) activation functions and provide specific data augmentation strategies and algorithms. The core subroutine of the resulting algorithms solves a least squares problem. Scalable linear algebra libraries such as Compute Unified Device Architecture (CUDA) and accelerated linear algebra (XLA) are available for implementation. To illustrate our approach, empirically we experiment with two benchmark datasets using Pólya-Gamma data augmentation for logit activation functions. For the deep architecture embedded in our approach, we adopt deep ReLU networks. Deep networks are able to achieve the same level of approximation accuracy with exponentially fewer parameters for compositional functions . Poggio et al. (2017) further show how deep networks can avoid the curse of dimensionality. The ReLU function is favored due to its ability to avoid vanishing gradients and its expressibility and inherent sparsity. Approximation properties of deep ReLU networks have been developed in Montufar et al. (2014), Telgarsky (2017), and Liang and Srikant (2017). Yarotsky (2017) and Schmidt-Hieber (2020) show that deep ReLU networks can yield a rateoptimal approximation of smooth functions of an arbitrary order. Polson and Rockova (2018) provide posterior rates of convergence for sparse deep learning.
There is another active area of research that revives traditional statistical models with the computational power of DL (Bhadra et al., 2021). Examples include Gaussian Process models (Higdon et al., 2008;Gramacy and Lee, 2008), Generalized Linear Models (GLM) and Generalized Linear Mixed Models (GLMM) (Tran et al., 2020) and Partial Least Squares (PLS) . Our method benefits from the computation efficiency and flexibility of expression of the deep neural network. In addition, our work builds on the sampling optimization literature (Pincus, 1968(Pincus, , 1970 which now uses MCMC methods. Other examples include Ma et al. (2019) who study that sampling can be faster than optimization and Neelakantan et al. (2017) showing that gradient noise can improve learning for very deep networks. Gan et al. (2015) implements data augmentation inside learning deep sigmoid belief networks. Neal (2011) and Chen et al. (2014) provide Hamitonian Monte Carlo (HMC) algorithms for MCMC. Duan et al. (2018) proposes a family of calibrated data-augmentation algorithms to increase the effective sample size.
The rest of our paper is outlined as follows. Section 2 provides the general setting of deep neural networks and shows how DA can be integrated into deep learning using the duality between Bayesian simulation and optimization. Section 3 describes our data augmentation (DA) schemes and two approaches to implement them. Section 4 provides applications to Gaussian regression, support vector machines and logistic regression using Pólya-Gamma augmentation . Section 5 provides the experiments of DA on both regression and classification problems. Section 6 concludes with directions for future research.

Bayesian Deep Learning
In deep learning we wish to recover a multivariate predictive map f θ (·) denoted by where y = (y 1 , . . . , y n ) ′ , y i ∈ R denotes a univariate output and x = (x 1 , . . . , x n ) ′ , x i ∈ R p a high-dimensional set of inputs. Using training data of input-output pairs {y i , x i } n i=1 that generalizes well out-of-sample, the goal is to provide a predictive rule for a new input variable whereθ is estimated from training data typically using SGD. The interest in deep learners lies in their ability to perform better than the additive rule for those interpolation or prediction problems. Other statistical alternatives include Gaussian processes but they often have difficulty in handling higher dimensions.
Deep learners use compositions (Kolmogorov, 1957;Vitushkin, 1964) of ridge functions rather than additive functions that are commonplace in statistical applications. With L ∈ N we denote the number of hidden layers and with p l ∈ N the number of neurons at the l th layer. Setting p L+1 = p, p 0 = p 1 = 1, we denote with p = (p 0 , p 1 , . . . , p L+1 ) ∈ N L+2 the vector of neuron counts for the entire network. Imagine composing L layers, a deep predictor then takes the form where b l ∈ R p l is a shift vector, W l ∈ R p l−1 ×p l is a weight matrix that links neurons between (l −1) th and l th layers and . . , (W L , b L )} as the stacked parameters. We can rewrite the compositions in (2.1) with a set of latent variables Z = (Z 1 , Z 2 , . . . , Z L ) ′ as where Z l ∈ R n×p l is the matrix of hidden nodes in l-th layer. We only consider the case p = 1 and Z 1 ∈ R n in our work. We provide discussion on extensions to cases p > 1 for some of our applications in Section 4.

Bayesian Simulation and Regularization Duality
The problem of deep learning regularization (Polson and Sokolov, 2017) is to find a set of parameters θ which minimizes a combination of a negative log-likelihood ℓ(y, f θ (x)) and a penalty function φ(θ) defined bŷ where λ controls regularization and #θ denotes the number of parameters in θ.
When the function f θ (x) is a deep learner defined as (2.1), we can specify different amount of penalty λ l and form of regularization function φ l (·) for each layer. Then the objective function can be written aŝ (2.4) Commonly used regularization techniques for deep learners include L 2 (weight decay), spike-and-slab regularization (Polson and Rockova, 2018) and dropout (Wager et al., 2013), which can also be viewed as a variant of L 2 -regularization.
As such the optimization problem in (2.4) of training a deep learner f θ (·) involves a highly nonlinear objective function. Stochastic gradient descent (SGD) is a popular tool based on back-propagation (a.k.a. the chain rule), but it often suffers from local traps and overfitting due to the non-convex nature of the problem. We propose data augmentation techniques which can be seamlessly applied in this context and provide efficiency gains. This is achieved via the hierarchical duality between optimization with regularization and finding the maximum a posteriori (MAP) estimate (Polson and Scott, 2011), as described in the following lemma.
Lemma 2.1. The regularization problem is equivalent to finding the the Bayesian MAP estimator defined by which corresponds to the mode of a posterior distribution characterized as Here p(θ) can be interpreted as a prior probability distribution and the log-prior as the regularization penalty.

A Stochastic Top Layer
By exploiting the duality from Lemma 2.1, we wish to use a Bayesian framework to add stochastic layers -so as to fully account for the uncertainty in estimating the predictive rule f θ (·). Thus, we convert the sequence of composite functions in the deep learner specified in (2.2) to a stochastic version given by (2.5) Now the hidden variables Z = (Z 1 , . . . , Z L ) ′ can be viewed as data augmentation variables and hence will also allow the contribution of fast scalable algorithms for inference and prediction.
For the ease of computation, we only replace the top layer of the DNN with a stochastic layer. We denote network structure below the top layer with B = {(W 1 , b 1 ), . . . , (W L , b L )}, and the network structure can be rewritten as where the function f 0 (Z 1 W 0 + b 0 ) is the top layer structure and function f B (x) is the network architecture below the top layer. Considering the objective function in (2.4), we implement the solutions with a two-step iterative search. At iteration t, we have 1. DA-update for the top layer W 0 , b 0 as the MAP estimator of the distribution The main contribution of our work comes from two aspects: (1) we update top layer weights {W 0 , b 0 } conditional on B as in (2.6), which is also equivalent to conditioning on Z 1 , with data augmentation techniques as later shown in Section 3; (2) the latent variables Z 1 is sampled from a normal distribution rather than optimized by gradient descent methods. Z 1 serves as a bridge that connects a weighted L 2 -norm model f 0 and a deep learner f B . Commonly used activation functions {f l } L l=1 are linear affine functions, rectified linear units (ReLU), sigmoid, hyperbolic tangent (tanh), and etc. We illustrate our methods with a deep ReLU network, i.e., {f l } L l=1 are ReLU functions, due to its expressibility and inherent sparsity. In the next section, we introduce our data augmentation strategies and show how the stochastic layers can be achieved via data augmentation.

Data Augmentation for Deep Learning
Data augmentation introduces a vector of auxiliary variables, denoted by ω = (ω 1 , . . . , ω n ) ′ with ω i ∈ R, such that the posterior can be written as where the augmented auxiliary distribution, p(θ, ω | y) factorizes nicely into complete conditionals p(θ | ω, y) and p(ω | θ, y). A crucial ingredient is that p(θ | ω, y) is easily managed typically via conditional Gaussians.
Data augmentation tricks allow us to express the likelihood as an expectation of a weighted L 2 -norm. Specifically, we write where p(ω) is the prior on the auxiliary variables ω = (ω 1 , . . . , ω n ) ′ and the function Q y | f θ (x), ω is designed to be a quadratic form, given the data augmentation vari- Table 1 shows that standard activation functions such as ReLU, logit, lasso and check can be expressed in the form of (3.1). Commonly used activation functions for deep learning, with an appropriate stochastic assumptions for w (for notation of simplicity, we derive the standard form for the single observation case) can be expressed as Here GIG denotes the Generalized Inverse Gaussian distribution, PG represents the Pólya Gamma distribution , and E represents the exponential distribution.
Using the data augmentation strategies, the objectives are represented as mixtures of Gaussians. DA can perform such an optimization with only the use of a sequence of iteratively re-weighted L 2 -norms. This allows us to use XLA techniques to accelerate the training.
Remark 3.1. The log-posterior is optimized given the training data, Deep learning possesses the key property that ∇ θ log p(y|θ, x) is computationally inexpensive to evaluate using tensor methods for very complicated architectures and fast implementation on large datasets. One caveat is that the posterior is highly multi-modal and providing good hyperparameter tuning can be expensive. This is clearly a fruitful area of research for state-of-the-art stochastic MCMC algorithms to provide more efficient algorithms. For shallow architectures, the alternating direction method of multipliers (ADMM) is an efficient solution to the optimization problem.
Similarly we can represent the regularization penalty exp(−λφ(θ)) in data augmentation form. Hence, we can then replace the optimization problem in (2.4) witĥ using the duality in Lemma 2.1.
There are two approaches to Monte Carlo optimization which could handle our data augmentation (Geyer, 1996), missing data methods like Expectation-Maximization (EM) algorithms or stochastic search methods like Markov Chain Monte Carlo (MCMC). The first approach is based on a probabilistic approximation of the objective function (3.1) and is less concerned with exploring Θ. The second type is more exploratory which aims to optimize the objective function by visiting the entire range of Θ and is less tied to the properties of the function.
For EM algorithms, we consider constructing a surrogate optimization problem which has the same solution to (3.1) (Lange et al., 2000). Specifically, we define a new objective function as which is a concave function to be maximized. A natural choice of the surrogate function can be constructed using Jensen's inequality as Maximizing G θ | θ (t) with respect to θ drives H(θ) uphill. The ascent property of the EM algorithm relies on the nonnegativity of the Kullback-Leibler divergence of two conditional probability densities (Hunter and Lange, 2004;Lange, 2013a). The EM algorithm enjoys the numerical stability as it steadily increases the likelihood without wildly overshooting or undershooting. It simplifies the optimization problem by (1) avoiding large matrix inversion; (2) linearizing the objective function; (3) separating the variables of the optimization problem (Lange, 2013b). In Section 4.3 we show how Pólya-Gamma augmentation leads to an EM algorithm for logistic regression.
The exploratory alternative to solve (3.1) is stochastic search methods such as MCMC. The data augmentation strategies enable us to sample from the joint posterior where the prior is related to the regularization penalty, via p(θ) ∝ exp − L l=0 λ l φ l (W l , b l ) . Hence, we can provide an MCMC algorithm in the augmented space (θ, ω) and simulate from the joint posterior distribution, denoted by p(θ, ω | y), namely A sequence can be simulated using MCMC Gibbs conditionals, Then we recover stochastic draws θ (t) ∼ p(θ | y) from the marginal posterior. These draws can be used in prediction to account for predictive uncertainty, namely , ω) is conditionally quadratic, the update step for θ | ω, y can be achieved using SGD or a weighted L 2 -norm -the weights ω are adaptive and provide an automatic choice of the learning rate, thus avoiding backtracking which can be computationally expensive. And the performance of MCMC search is less tied to the statistical properties (i.e. convexity or concavity) of the objective function. We provide examples of how Gaussian regression and SVMs can be implements in Section 4.1 and Section 4.2.

MCMC with J-copies
The MCMC methods offer a full description of the objective function (3.1) over the entire space Θ. Inspired by the simulated annealing algorithm (Metropolis et al., 1953), we introduce a scaling factor J to allow for faster moves on the surface of (3.1) to maximize. It also helps avoiding the trapping attraction of local maxima. In addition, the corresponding posterior is connected to the Boltzmann distribution, whose density is prescribed by the energy potential f (θ) and temperature J as To simulate the posterior mode without evaluating the likelihood directly (Jacquier et al., 2007), we sample J independent copies of hidden variable Z 1 . Denoted the copies with Z 1 1 , . . . , Z J 1 , we sample them simultaneously and independently from the posterior distribution where µ z , σ z are determined by {x, y, θ}. And we stack the J copies as to amplify the information in y, which is especially useful in the finite sample problems. Figure 1 illustrates our network architecture. given data y, x can be written as Hence, the marginal joint posterior concentrates on the density proportional to p(x, y | θ) J p(θ) and provides us with a simulation solution to finding the MAP estimator (Pincus, 1968(Pincus, , 1970. Another alternative to simulate from the posterior mode is Hamiltonian Monte Carlo (Neal, 2011), which is a modification of Metropolis-Hastings (MH) sampler. Adding an additional momentum variable ν to the Boltzmann distribution in (3.3), and generating draws from joint distribution where M is a mass matrix. Chen et al. (2014) adopt this approach in a deep learning setting.

Connection to Diffusion Theory
An alternative to the MCMC algorithm can be derived from diffusion theory (Phillips and Smith, 1996). For example, we can approximate the random walk Metropolis-Hastings algorithm with the Langevin diffusion L t defined by the stochastic differential equation where B t is the standard Brownian motion. More specifically, let d := |θ|, we write the random walk like transition as where ǫ t ∼ N d (0, I d ) and σ 2 corresponds to the discretization size.
This can also be derived by taking a second-order approximation of log(f ), namely where H(θ (t) ) = −∇ 2 log f (θ (t) ) is the Hessian matrix. By taking exponential transformation on both sides, the random walk type approximation to f (θ (t+1) ) is .
where θ (t) = θ (t) +H −1 (θ (t) )∇ log f (θ (t) ). If we simplify this approximation by replacing H(θ (t) ) with σ −2 I p , the Taylor approximation leads to updating step as Roberts and Rosenthal (1998) give further discussion on the choice of σ that would yield an acceptance rate of 0.574 to achieve optimal convergence rate. Mandt et al. (2017) show that SGD can be interpreted as a multivariate Ornstein-Uhlenbeck process here η is the constant learning rate, A is the symmetric Hessian matrix at the optimum and C S is the covariance of the mini-batch (of size S) gradient noise, which is assumed to be approximately constant near the local optimum of the loss. They also provide results on discrete-time dynamics on other Stochastic Gradient MCMC algorithms, such as Stochastic Gradient Langevin dynamics (SGLD) by Welling and Teh (2011) and Stochastic Gradient Fisher Scoring by Ahn et al. (2012).
Combing their results and the Langevin dynamics of MCMC algorithms, we can write the approximation of our DA-DL updating scheme as Similar adaptive dynamics are also observed in other methods. Geman and Hwang (1986) show the convergence of the annealing process using Langevin equations. Slice sampling (Neal, 2003) adaptively chooses the step size based on the local properties of the density function. By constructing local quadratic approximations, it could adapt to the dependencies between variables. Murray et al. (2010) further propose elliptical slice sampling that operates on the ellipse of states.

Applications
To illustrate our methodology, we provide three examples: (1) a standard Gaussian regression model with squared loss; (2) a binary classification model under the support vector machine framework; (3) a logistic regression model paired with a Pólya mixing distribution. For the Gaussian regression and SVM models, we implement with J-copies stacking strategy to provide full posterior modes.
Before diving into the examples, we introduce the notations we use throughout this section. We continue to denote the output with y = (y 1 , . . . , y n ) ′ , y i ∈ R, the input with x = (x 1 , . . . , x n ) ′ , x i ∈ R p , the latent variable of the top layer with Z 1 = (z 1,1 , . . . , z 1,n ) ′ , z 1,i ∈ R and the stacked version as in (3.4). We introduce stochastic noises ǫ 0 = (ǫ 0,1 , . . . , ǫ 0,n ) ′ in the top layer and ǫ z = (ǫ z,1 , . . . , ǫ z,n ) ′ in the second layer, where ǫ 0,i iid ∼ N (0, τ 2 0 ) and ǫ z,i iid ∼ N (0, τ 2 z ). The scale parameters τ 0 and τ z are pre-specified and determine the level of randomness or uncertainty for the DA-update and SGD-update respectively. We use η to denote the learning rate used in the SGD updates and T is number of training epochs. We use · to denote ℓ 2 -norm such that y = n i=1 y 2 i and the matrix-type norm as y Σ = y T Σy. Our models differ from standard deep learning models and some newly proposed Bayesian approaches in the adoption of stochastic noises ǫ 0 and ǫ z . It distinguishes our model from other deterministic neural networks. By letting ǫ z follow a spiky distribution that puts most of its mass around zero, we can control the estimation approximating to posterior mode instead of posterior mean. The randomness allows us to adopt a stacked system and make the best use of data especially when the dataset is small.

Gaussian Regression
We consider the regression model as . The posterior updates are given bŷ whereȳ = 1 n n i=1 y i and C z is a normalizing constant. The latent variable Z 1 is drawn from following normal distribution Z 1 ∼ N (µ Z , σ 2 Z ) with the mean and variance specified as The J copies of Z 1 are simulated and stacked as 1 = (Z 1 1 , . . . , Z J 1 ) ′ . The updating scheme for this Gaussian regression is summarized in Algorithm 1.
The model can also be generalized to multivariate y. Let y i be a q-dimension vector, we denote each dimension as y ik , k = 1, . . . , q, and the model is written as , where W 0 = (W 01 , . . . , W 0q ) ′ is now a q-dimensional vector with W 0k computed similarly to (4.1), b 0 = (b 01 , . . . , b 0q ) ′ is also q-dimensional with b 0k calculated as (4.2). The posterior update for Z 1 becomes Algorithm 1 Data Augmentation with J-copies for Gaussian Regression (DA-GR) jointly from deep learner f B and sampling layer f 0 which is a multivariate normal distribution with the mean and variance as

Support Vector Machines (SVMs)
Support vector machines require data augmentation for rectified linear unit (ReLU) activation functions. Polson and Scott (2011) and Mallick et al. (2005) write the support vector machine model as where p(λ) follows a flat uniform prior. The augmentation variable λ = (λ 1 , . . . , λ n ) ′ can be regarded as slacks admitting fuzzy boundaries between classes.
By incorporating the augmentation variable λ, the ReLU deep learning model can be written as . From a probabilistic perspective, the likelihood function for the output y is given by Derived from this augmented likelihood function, the conditional updates are where Λ = diag(λ 1 , . . . , λ n ) is the diagonal matrix of the augmentation variables.
In order to generate the latent variables, we use conditional Gibbs sampling as with the means and variances given by where IG denotes the Inverse Gaussian distribution and 1 = (1, . . . , 1) ′ is a n-dimensional unit vector.
The J-copies strategy can also be adopted here. Z j 1 and λ j needs to be sampled independently for j = 1, . . . , J. Algorithm 2 summarizes the updating scheme with J-copies for SVMs.
jointly from the deep learner f B and the sampling layer W 0

Logistic Regression
The aim of this example is to show how EM algorithm can be implemented via a weighted L 2 -norm in deep learning. Adopting the logistic regression model from , we focus on the penalization of W 0 , with parameter optimization given bŷ The outcomes y i are coded as ±1, and τ is assumed fixed.
For likelihood function ℓ and regularization penalty φ, we assume (4.8) where µ W , κ W are pre-specified terms controlling the prior of the penalty term and λ is endowed with a Pólya distribution prior P (λ). Let ω −1 i have a Pólya distribution with α = 1, κ = 1/2, the following three updates will generate a sequence of estimates that converges to a stationary point of posterior . . . , ω n ) and Λ = diag(λ 1 , . . . , λ p ) are diagonal matrices. x * can be written as x * = diag(y)Z 1 , φ ′ (·) denotes the derivative of standard normal density function.
In the non-penalized case, with λ i = 0 for every i, the updates can be simplified as weighted least squares We focus on the non-penalized binary classification case and Algorithm 3 summarizes our approach. Further generalizations are available. For example, a ridge-regression penalty, along with the generalized double-pareto prior (Armagan et al., 2013) can be implemented by adding a sample-wise L 2 -regularizer. A multinomial generalization of this model can be found in .

Experiments
We illustrate the performance of our methods on both synthetic and real datasets, compared to the deep ReLU networks without the data augmentation layer. We refer to the latter as DL in our results. We denote the data augmented gaussian regression in Algorithm 1 as DA-GR, the SVM implementation in Algorithm 2 as DA-SVM and the logistic regression in Algorithm 3 as DA-logit. For appropriate comparison, we adopt the same network structures, such as the number of layers, the number of hidden nodes, and regularizations like dropout rates, for DL and our methods. The differences between our Algorithm 3 Data Augmentation for Logistic Regression (DA-logit) Retrieve the input and output of the top layer Z Calculate the sample-wise weights Update the entire deep learner f θ with {y, x} methods and DL are that (1) the top layer weights W 0 , b 0 of DL are updated via SGD optimization, while the weights W 0 , b 0 of our methods are updated via MCMC or EM; (2) for binary classification, DA-logit and DL adopt a sigmoid activation function in the top layer to produce a binary output, while DA-SVM uses a linear function in the top layer and the augmented sampling layer transforms the continuous value into a binary output. For all experiments, the datasets are partitioned into 70% training and 30% testing randomly. For the optimization we use a modification of the SGD algorithm, the Adaptive moment estimation (Adam, Kingma and Ba (2015)) algorithm. The Adam algorithm combines the estimate of the stochastic gradient with the earlier estimate of the gradient, and scales this using an estimate of the second moment of the unit-level gradient. We have also explored RMSprop (Tieleman and Hinton, 2012) optimizer and we observe similar decreases in regression or classification errors.
To illustrate how the choice of J could affect the speed of convergence, we include different implementations of DA-GR and DA-SVM with J = 2, 5, 10. We have explored different sampling noise variance τ 0 , τ Z , but the choices, in general, do not affect the results significantly.

Friedman Data
The benchmark (Friedman, 1991) setup uses a regression of the form . . , x ip ) and only the first 5 covariates are predictive of y i . We run the experiments with n = 100, 1 000 and p = 10, 50, 100, 1 000 to explore the performance in both low dimensional and high dimensional scenarios. We implement both one-layer (L = 1) and two-layer (L = 2) ReLU networks with 64 hidden units in each layer. For DA-GR model, we let τ 0 = 0.1, τ z = 1. The experiments are repeated 50 times with different random seeds. Figure 2 reports the three quartiles of the out-of-sample squared errors (MSEs). The top row is the performance of the one-layer networks and the bottom row is the performance of the two-layer networks. The two-layer networks perform better and converge faster. For DA-GR, when J = 5 or J = 10, it converges significantly faster and the prediction errors are also smaller. When J = 2, the performance of DA-GR is relatively similar to the deep learning model with only SGD updates. This is due to the fact that DA-GR with J-copies learns the posterior mode which is equivalent to the minimization point of the objective function, and it concentrates on the mode faster when J becomes larger.
The computation costs of DA are higher as shown in Figure 3. This is not entirely unexpected since we introduce sampling steps. When J increases, the computation costs also increase slightly. Given the improvement in convergence speed and prediction errors, our data augmentation strategies are still worthwhile even with some extra computation costs. In addition, for each epoch, we can draw the sample-wise posteriors in parallel and the gap between the computation time can be further mitigated.

Boston Housing Data
Another classical regression benchmark dataset is the Boston Housing dataset 1 , see, for example, Hernández-Lobato and Adams (2015). The data contains n = 506 observations with 13 features. To show the robustness of DA, we repeat the experiment 20 times with different training subsets. We adopt the ReLU networks with one hidden layer of 64 units and set the dropout rate to be 0.5. For the DA-GR model, we let τ 0 = 0.1, τ Z = 1. Figure 4 shows the prediction errors of all methods. DA-GR with J = 10 performs significantly better than the others, in terms of both prediction errors and convergence rates. Meanwhile, DA-GR with J = 2 behaves similarly to SGD at the beginning, but it converges significantly faster than SGD after a few epochs. This again, shows that with the J-copies strategy, our method helps the optimization converge at a faster speed, and injecting the noise helps the model generalize well out-of-sample.

Wine Quality Data Set
The Wine Quality Data Set 2 contains 4 898 observations with 11 features. The output wine rating is an integer variable ranging from 0 to 10 (the observed range in the data is from 3 to 9). The frequency of each rating is reported in Table 2.  The most frequent ratings are 5 and 6. Since we focus on binary classification problems, we provide two types of classifications, both of which have relatively balanced categories: (1) wine with a rating of 5 or 6 (Test 1); (2) wine with a rating of ≤ 5 or > 5 (Test 2). We use the same network architectures adopted in Friedman's example with τ 0 = τ z = 0.1. Figure 5 provides results for the two types of binary classifications. In both cases, DA-SVM performs better than DA-logit and DL. The advantage of large J is still significant and helps converge especially in the early phase. DA-logit outperforms DL in Test 1 when the network is shallow (L=1), while in other cases performs similarly to DL.

Airbnb Data Set
The Airbnb Kaggle competition 3 provides a more challenging application with 21 3451 observations in total, and classified by destination into 12 classes: 10 most popular countries, other and no destination found (NDF), where other corresponds to any other country which is not among the top 10 and NDF corresponds to situations that no  Table 3 reports the percentage of each class. We follow the preprocessing steps in Polson and Sokolov (2017). The list of variables contains information from the sessions records (number of sessions, summary statistics of action types, device types and session duration), and user tables such as gender, language, affiliate provider etc. All categorical variables are converted to binary dummies, which leads to 661 features in total. For the neural network architecture, we use a two-layer ReLU network with 64 hidden units on each layer and set the dropout rate to be 0.3. For the SVM model, we let τ 0 = τ z = 0.  Our goal is to test the binary classification models on this dataset. We consider two types of binary responses, both of which have relatively balanced amounts of observations in each category.  We compare the misclassification rates of DA-SVM in Algorithm 2 with J = 2, 5, 10, DA-logit in Algorithm 3 and the ReLU networks without the data augmentation layer, after training for 1 to 20 epochs. Figure 6 demonstrates the binary classifications for Spain versus UK and UK versus Italy. For both cases, the out-of-sample misclassification rates are not small and the fluctuations over epochs are big, suggesting that a better model structure may be needed. However, we still observe that DA-SVM with J = 5 or J = 10 has smaller classification errors over epochs and the out-of-sample errors decrease faster during earlier phase of training.

Summary of Experiment Results
From the above examples, we observe that DA-logit which is implemented under the EM principle does not show an obvious advantage over the vanilla neural network. It shows some improvements on the convergence speed when the network is shallow in the Wine Quality dataset case as in Figure 5. This could be partially due to the fact that we did not apply regularization on the DA layer for our logit implementation. More importantly, the performance of the EM algorithm is contingent on the statistical properties of the objective function. Although the surrogate function is constructed via only the top layer whose quadratic form ensures concavity, the property of the objective function as a whole becomes complicated when the deep network architecture is more complex. Since our method also inherits the negative side of EM and MM algorithms, convergence to the global maximum is not guaranteed in the absence of concavity. However, this observation could open the possibility of future research where we can combine the EM algorithms with shape-constrained neural networks (Gupta et al., 2020).
On the contrary, the MCMC methods with the J-copies strategy significantly improve the prediction errors and convergence speed of the neural networks for both regression and classification problems. And the advantages become more outstanding when J is larger. The phenomenon suggests that the stochastic exploratory methods are preferable when the statistical property of the objective function is unknown or too complex. And the J-copies scheme largely relieves the problem of being trapped into local modes.
One concern of using MCMC methods is the extra computation costs induced by the sampling steps. In our current version where p 1 = 1, the sample-wise sampling steps can be computed in parallel. If one wishes to introduce a higher dimension latent variable Z 1 such that p 1 > 1, the computation costs will increase as it may involve sampling from multivariate distributions. In that case, fast sampling implementation such as Bhattacharya et al. (2016) is recommended to speed up the process.

Discussion
Various regularization methods have been deployed in neural networks to prevent overfitting, such as early stopping, weight decay, dropout , gradient noise (Neelakantan et al., 2017). Bayesian strategies tackle the regularization problem by proposing probability structures on the weights. We show that data augmentation strategies are available for many standard activation functions (ReLU, SVM, logit) used in deep learning.
Using MCMC provides a natural stochastic search mechanism that avoids procedures such as back-tracking and provides full descriptions of the objective function over the entire range Θ. Training deep neural networks thus benefits from additional hidden stochastic augmentation units (a.k.a. data augmentation). Uncertainty can be injected into the network through the probabilistic distributions on only one or two layers, permitting more variability of the network. When more data are observed, the level of uncertainty decreases as more information is learned and the network becomes more deterministic. We also exploit the duality between maximum a posteriori estimation and optimization. We provide a J-copies stacking scheme to speed up the convergence to posterior mode and avoid trapping attraction of the local modes. Concerning efficiency, DA provides a natural framework to convert the objective function into weighted least squares and is straightforward to implement with the current deep learning training process.
Our three motivational examples illustrated the advantages of data augmentation. Our work has the potential to be generalized to many other data augmentation schemes and different regularization priors. Probabilistic structures on more units and layers are also possible to allow for more uncertainty.
Our DA-DL methods enjoy the benefits of both worlds. On one hand, with the data augmentation on top, it is robust to random weight initialization. Although we still need to specify the learning rates for the deep architecture, the top layer can learn adaptively and the entire network becomes less sensitive to the choice of learning rate. On the other hand, the fast SGD updates from the deep architecture largely alleviate the computation concerns compared to a fully Bayesian hierarchical model.
There are many directions to future research, including adding more sampling layers so the model could accommodate more randomness and flexibility, and using weighted Bayesian bootstrap (Newton et al., 2021) to approximate the unweighted posteriors by assigning random weight to each observation and penalty. Uncertainty quantification for prediction is also possible. Although we focus on the training aspect of deep learning, one can collect posterior draws θ (t) from the MCMC procedure when the training process converges. Using (3.2), we can construct predictive intervals and conduct inference.