sgmcmc: An R Package for Stochastic Gradient Markov Chain Monte Carlo

This paper introduces the R package sgmcmc, which can be used for Bayesian inference on problems with large datasets using stochastic gradient Markov chain Monte Carlo (SGMCMC). Traditional Markov chain Monte Carlo (MCMC) methods, such as Metropolis-Hastings, are known to run prohibitively slowly as the dataset size increases. SGMCMC solves this issue by only using a subset of data at each iteration. SGMCMC requires calculating gradients of the log likelihood and log priors, which can be time consuming and error prone to perform by hand. The sgmcmc package calculates these gradients itself using automatic differentiation, making the implementation of these methods much easier. To do this, the package uses the software library TensorFlow, which has a variety of statistical distributions and mathematical operations as standard, meaning a wide class of models can be built using this framework. SGMCMC has become widely adopted in the machine learning literature, but less so in the statistics community. We believe this may be partly due to a lack of software; this package aims to bridge that gap.


Introduction
This article introduces sgmcmc, an R package (R Development Core Team, 2008) for scalable Bayesian inference on a wide variety of models using stochastic gradient Markov chain Monte Carlo (SGMCMC). A disadvantage of most traditional Markov chain Monte Carlo (MCMC) methods is that they require calculations over the full dataset at each iteration, meaning the methods are prohibitively slow for large datasets. SGMCMC methods provide a solution to this issue. The methods use only a subset of the full dataset, known as a minibatch, at each MCMC iteration. While the methods no longer target the true posterior, they produce accurate approximations to the posterior at a reduced computational cost.
The sgmcmc package implements a number of popular SGMCMC samplers including stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011), stochastic gradient Hamiltonian Monte Carlo (SGHMC) (Chen et al., 2014) and stochastic gradient Nosé-Hoover thermostats (SGNHT) (Ding et al., 2014). Recent work has shown how control variates can be used to reduce the computational cost of SGMCMC algorithms (Baker et al., 2017; Nagapetyan et al., 2017). For each of the samplers implemented in the package, there is also a corresponding control variate sampler providing improved sampling efficiency.
Performing statistical inference on a model using SGMCMC requires calculating the gradient of the log likelihood and log priors. Calculating gradients by hand is often time consuming and error prone; the sgmcmc package removes this burden by calculating the gradients automatically.


Introduction to MCMC and available software
Suppose we have a dataset of size $N$, with data $x = \{x_1, \dots, x_N\}$, where $x_i \in \mathcal{X}$ for some space $\mathcal{X}$. We denote the probability density of $x_i$ as $p(x_i \mid \theta)$, where $\theta \in \Theta \subseteq \mathbb{R}^p$ are model parameters to be inferred. We assign a prior density $p(\theta)$ to the parameters and the resulting posterior is then
$$p(\theta \mid x) = \frac{p(\theta) \prod_{i=1}^N p(x_i \mid \theta)}{Z}.$$
Often the posterior can only be calculated up to a constant of proportionality $Z$. In practice $Z$ is rarely analytically tractable, so MCMC provides a way to construct a Markov chain using only the unnormalized posterior density $h(\theta) := p(\theta) \prod_{i=1}^N p(x_i \mid \theta)$. The Markov chain is designed so that its stationary distribution is the posterior $p(\theta \mid x)$. The result (once the chain has converged) is a sample $\{\theta_t\}_{t=1}^T$ from the posterior, though this sample is not independent. A downside of these traditional MCMC algorithms is that they require the evaluation of the unnormalized density $h(\theta)$ at every iteration, resulting in an O(N) cost per iteration. Thus MCMC becomes prohibitively slow on large datasets.
The efficiency of the Metropolis-Hastings algorithm is dependent on the choice of proposal distribution, q.
There are a number of proposals for the Metropolis-Hastings algorithm which can achieve a very high acceptance rate. Some examples are the Metropolis-adjusted Langevin algorithm (MALA; see e.g., Roberts and Rosenthal (1998)) and Hamiltonian Monte Carlo (HMC; see Neal (2010)). The reason these proposals achieve such high acceptance rates is that they approximate a continuous diffusion process whose stationary distribution is $p(\theta \mid x)$. As an example, consider MALA. The MALA algorithm uses a Euler discretisation of the Langevin diffusion as its proposal,
$$q(\theta' \mid \theta) = N\left(\theta' \,\Big|\, \theta + \frac{\epsilon}{2} \nabla_\theta \log p(\theta \mid x),\; \epsilon I \right), \qquad (2)$$
where $N(\theta \mid \mu, \Sigma)$ denotes a multivariate normal density evaluated at $\theta$ with mean $\mu$ and variance $\Sigma$; $I$ is the identity matrix; and $\epsilon$ is a tuning parameter referred to as the stepsize. Discretising the diffusion introduces an approximation error, which is corrected by the Metropolis-Hastings acceptance step (1). This means that as $\epsilon \to 0$ we tend back towards the exact, continuous diffusion and the acceptance rate tends to 1. However, in this limit the Markov chain never moves, so picking $\epsilon$ is a balance between a high acceptance rate and good mixing.
There are a number of general purpose samplers for MCMC that fulfil different purposes to sgmcmc. The most popular samplers are Stan, BUGS and JAGS (Carpenter et al., 2017; Plummer, 2003; Lunn et al., 2000). The samplers BUGS and JAGS implement automated Gibbs sampling. These samplers work with both continuous and discrete parameter spaces and can be highly efficient. However, because the samplers rely on Gibbs sampling, conjugate priors need to be used; the samplers are also inefficient when there is high correlation between the parameters (Carpenter et al., 2017). The package Stan implements state of the art HMC techniques, which means non-conjugate priors can be used, and that the sampler is more robust when there is correlation between parameters. However, the package cannot perform inference on discrete parameters, and requires that these are integrated out of the model.
The sgmcmc package aims to fill a gap when the dataset is large enough that other general purpose MCMC samplers such as Stan, BUGS and JAGS cannot be run or run prohibitively slowly. In the packages Stan, BUGS and JAGS, properly specified models will define a Markov chain whose stationary distribution is the posterior distribution. However, a major problem with these methods is that as the number of observations gets large the algorithms run slowly; this has become a problem as dataset sizes have been increasing. These methods are slow on large datasets because they require a calculation over the full dataset at each iteration. The methods implemented in sgmcmc account for this issue by only using a subset of the dataset at each iteration. The main downside is that the stationary distribution is no longer the true posterior, just a close approximation. However, as the dataset size increases, the main tuning constant in sgmcmc, known as the stepsize, can be set smaller and the approximation to the posterior improves. Compared to Stan, BUGS and JAGS, sgmcmc offers significant computational advantages for Bayesian modelling with large datasets, but like Stan, a downside of sgmcmc is that it requires discrete parameters to be integrated out. The package also requires more tuning than other general purpose samplers, since satisfactory results for tuning these methods are still under development. Figure 1 demonstrates the scenarios in which practitioners may find sgmcmc useful. The standard Stan sampler and the sgldcv algorithm of the sgmcmc package are used to sample from the posterior of data drawn from a standard normal N(0, 1) with a N(0, 10) prior. The Kullback-Leibler (KL) divergence between the MCMC sample and the true posterior is calculated, and the plots show how this and the run time change as the number of observations is increased.
Since both Stan and TensorFlow models need to be compiled, we recompile the models each time they are run to keep the comparison fair; it is worth mentioning, though, that the Stan run time is much quicker for the small-observation models if the model is precompiled. We can see the run time of Stan increasing rapidly as the number of observations is increased, while the run time of the sgmcmc algorithm increases more slowly. This is a very simple model, used so that the KL divergence can be calculated exactly. As the model complexity increases, the run time of Stan can quickly become unmanageable for large datasets. On the other hand, the KL divergence of the sgmcmc algorithm for this example is poor compared with Stan for a small number of observations. However, as the dataset size increases, the KL divergence for the sgmcmc algorithm becomes much more reasonable compared with Stan. Thus sgmcmc is best used when the dataset size is slowing down the run time of the other general purpose algorithms, and practitioners can safely trade off a small amount of accuracy in order to gain significant speed-ups.

Stochastic gradient MCMC
Many popular MCMC proposal distributions, including HMC and MALA, described in (2), require the calculation of the gradient of the log posterior at each iteration, which is an O(N) calculation. Stochastic gradient MCMC methods get around this by replacing the true gradient with the following unbiased estimate, calculated on a subset of all the observations $S_t \subset \{1, \dots, N\}$, known as a minibatch, with $|S_t| = n$:
$$\widehat{\nabla}_\theta \log p(\theta \mid x) = \nabla_\theta \log p(\theta) + \frac{N}{n} \sum_{i \in S_t} \nabla_\theta \log p(x_i \mid \theta). \qquad (3)$$
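As a quick sanity check of this estimate, the following base R snippet (a toy illustration with a hypothetical one-dimensional Gaussian model, not part of the package) compares the full-data gradient of a log likelihood with the average of many minibatch estimates:

```r
# Toy check that the minibatch gradient estimate (3) is unbiased.
# Model: x_i ~ N(theta, 1) with flat prior, so the full log likelihood
# gradient at theta is sum(x - theta).
set.seed(1)
N = 1000; n = 10
x = rnorm(N, mean = 2)
theta = 0
fullGrad = sum(x - theta)
# Average many minibatch estimates (N / n) * sum over the minibatch
est = replicate(10^4, {
    S = sample(N, n)
    (N / n) * sum(x[S] - theta)
})
mean(est)  # close to fullGrad
```

The estimate has a much higher variance than the full-data gradient, which is why the minibatch size trades off accuracy against the O(n) per iteration cost.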
Calculating the Metropolis-Hastings acceptance step (1) is another O(N) calculation. To get around this, SGMCMC methods set the tuning constants such that the acceptance rate would be high, and remove the acceptance step from the algorithm altogether. By ignoring the acceptance step and estimating the log posterior gradient, the per iteration cost of SGMCMC algorithms is O(n), where n is the minibatch size. However, the algorithm no longer targets the true posterior but an approximation. There has been a body of theory exploring how these methods perform in different settings. Similar to MALA, the algorithms rely on a stepsize parameter $\epsilon$. Some of the algorithms have been shown to converge weakly as $\epsilon \to 0$.

Stochastic gradient Langevin dynamics
SGLD, first introduced by Welling and Teh (2011), is an SGMCMC approximation to the MALA algorithm. By substituting the estimate (3) into the MALA proposal (2), and omitting the acceptance step, we arrive at the following update for $\theta$:
$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} \widehat{\nabla}_\theta \log p(\theta_t \mid x) + \zeta_t, \qquad \zeta_t \sim N(0, \epsilon_t). \qquad (4)$$
Welling and Teh (2011) noticed that as $\epsilon_t \to 0$ this update will sample from the true posterior, although the algorithm mixes more slowly as $\epsilon_t$ gets closer to 0; setting the stepsize is therefore a trade-off between convergence speed and accuracy. This motivated Welling and Teh (2011) to suggest setting $\epsilon_t$ to decrease to 0 as $t$ increases. There is a body of theory that investigates the SGLD approximation to the true posterior (see e.g., Teh et al., 2016; Sato and Nakagawa, 2014; Vollmer et al., 2016). In particular, SGLD is found to converge weakly to the true posterior distribution asymptotically as $\epsilon_t \to 0$. The mean squared error of the algorithm is found to decrease at best with rate $O(T^{-1/3})$. In practice, the algorithm tends to mix poorly when $\epsilon_t$ is set to decrease to 0, so in our implementation we use a fixed stepsize $\epsilon$, which has been shown to mix better empirically. Theoretical analysis for this case is provided in Vollmer et al. (2016). The tuning constant $\epsilon$, referred to as the stepsize, is a required argument in the package; it affects the performance of the algorithm considerably.
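To make the update concrete, here is a minimal base R sketch of SGLD for a hypothetical one-dimensional Gaussian model with a flat prior and a fixed stepsize. This is purely illustrative; with the package, the user never writes this loop:

```r
# Minimal SGLD sketch: x_i ~ N(theta, 1), flat prior, fixed stepsize.
set.seed(1)
N = 10^4; n = 100; stepsize = 1e-4
x = rnorm(N, mean = 2)
theta = 0
thetaOut = numeric(5000)
for (t in 1:5000) {
    S = sample(N, n)
    gradEst = (N / n) * sum(x[S] - theta)  # unbiased estimate (3)
    # SGLD update: half a gradient step plus N(0, stepsize) noise
    theta = theta + (stepsize / 2) * gradEst + rnorm(1, sd = sqrt(stepsize))
    thetaOut[t] = theta
}
mean(thetaOut[-(1:1000)])  # approximates the posterior mean, near mean(x)
```

After discarding a burn-in, the chain fluctuates around the posterior mean, with the fixed stepsize controlling both the mixing speed and the size of the approximation error.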

Stochastic gradient Hamiltonian Monte Carlo
The stochastic gradient Hamiltonian Monte Carlo algorithm (SGHMC) (Chen et al., 2014) is similar to SGLD, but instead approximates the HMC algorithm (Neal, 2010). To ensure SGHMC is O(n), the same unbiased estimate (3) of the log posterior gradient is used. SGHMC augments the parameter space with momentum variables $\nu$ and samples approximately from a joint distribution $p(\theta, \nu \mid x)$, whose marginal distribution for $\theta$ is the posterior of interest. The algorithm performs the following updates at each iteration:
$$\theta_{t+1} = \theta_t + \nu_t, \qquad \nu_{t+1} = (1 - \alpha)\nu_t + \epsilon \widehat{\nabla}_\theta \log p(\theta_{t+1} \mid x) + \zeta_t, \qquad (5)$$
where $\zeta_t \sim N(0, 2[\alpha - \hat\beta_t]\epsilon)$; $\epsilon$ and $\alpha$ are tuning constants and $\hat\beta_t$ is proportional to an estimate of the Fisher information matrix. In our current implementation, we simply set $\hat\beta_t := 0$, as in the experiments of the original paper by Chen et al. (2014). In future implementations, we aim to estimate $\hat\beta_t$ using a Fisher scoring estimate similar to Ahn et al. (2012). Often the dynamics are simulated repeatedly $L$ times before the state is stored, at which point $\nu$ is resampled. The parameter $L$ can be included in our implementation. The tuning constant $\epsilon$ is the stepsize and is a required argument in our implementation; as for SGLD, its value affects performance considerably. The constant $\alpha$ tends to be fixed at a small value in practice. As a result, in our implementation it is an optional argument with default value 0.01.

Stochastic gradient Nosé-Hoover thermostat
Ding et al. (2014) suggest that the quantity $\hat\beta_t$ in SGHMC is difficult to estimate in practice. They appeal to analogues between these proposals and statistical physics in order to suggest a set of updates which do not require this estimate. Once again Ding et al. (2014) augment the space with momentum parameters $\nu$. They replace the tuning constant $\alpha$ with a dynamic parameter $\alpha_t$, known as the thermostat parameter. The algorithm performs the following updates at each iteration:
$$\theta_{t+1} = \theta_t + \nu_t, \qquad \nu_{t+1} = (1 - \alpha_t)\nu_t + \epsilon \widehat{\nabla}_\theta \log p(\theta_{t+1} \mid x) + \zeta_t, \qquad (6)$$
$$\alpha_{t+1} = \alpha_t + \left[ \frac{1}{p} \nu_{t+1}^\top \nu_{t+1} - \epsilon \right]. \qquad (7)$$
Here $\zeta_t \sim N(0, 2a\epsilon)$; $\epsilon$ and $a$ are tuning parameters to be chosen and $p$ is the dimension of $\theta$. The update for $\alpha$ in (7) requires a vector dot product, and it is not obvious how to extend this when $\theta$ is of higher order than a vector, such as a matrix or tensor. In our implementation, when $\theta$ is a matrix or tensor we use the standard inner product in those spaces (Abraham et al., 1988). The tuning constant $\epsilon$ is the stepsize and is a required argument in our implementation; again, its value affects performance considerably. The constant $a$, similarly to $\alpha$ in SGHMC, tends to be fixed at a small value in practice (Ding et al., 2014). As a result, in our implementation it is an optional argument with default value 0.01.

Stochastic gradient MCMC with control variates
A key feature of SGMCMC methods is replacing the log posterior gradient calculation with an unbiased estimate. The unbiased gradient estimate, which can be viewed as a noisy version of the true gradient, can have high variance when estimated using a small minibatch of the data. Increasing the minibatch size will reduce the variance of the gradient estimate, but increase the per iteration computational cost of the SGMCMC algorithm. Recently, control variates (Ripley, 2009) have been used to reduce the variance in the gradient estimate of SGMCMC (Dubey et al., 2016; Nagapetyan et al., 2017; Baker et al., 2017). Using these improved gradient estimates has been shown to lead to improvements in the mean squared error (MSE) of the algorithm (Dubey et al., 2016), as well as its computational cost (Nagapetyan et al., 2017; Baker et al., 2017).
We implement the formulation of Baker et al. (2017), who replace the gradient estimate (3) with
$$\widehat{\nabla}_\theta \log p(\theta \mid x) = \nabla_\theta \log p(\hat\theta \mid x) + \nabla_\theta \log p(\theta) - \nabla_\theta \log p(\hat\theta) + \frac{N}{n} \sum_{i \in S_t} \left[ \nabla_\theta \log p(x_i \mid \theta) - \nabla_\theta \log p(x_i \mid \hat\theta) \right], \qquad (8)$$
where $\hat\theta$ is an estimate of the posterior mode. This method requires the burn-in phase of MCMC to be replaced by an optimisation step which finds $\hat\theta := \text{argmax}_\theta \log p(\theta \mid x)$. There is then an O(N) preprocessing step to calculate $\nabla_\theta \log p(\hat\theta \mid x)$, after which the chain can be started from $\hat\theta$, resulting in a negligible mixing time. Baker et al. (2017) and Nagapetyan et al. (2017) have shown that there are considerable improvements to the computational cost of SGLD when (8) is used in place of (3). In particular, they showed that standard SGLD requires setting the minibatch size n to be O(N) for arbitrarily good performance; while using control variates requires an O(N) preprocessing step, but after that a batch size of O(1) can be used to reach the desired performance. Baker et al. (2017) also showed empirically that this particular formulation can lead to a reduction in the burn-in time compared with standard SGLD and the formulation of Dubey et al. (2016), which tended to get stuck in local stationary points. This is because in complex scenarios the optimisation step is often faster than the burn-in time of SGMCMC. The package sgmcmc includes control variate versions of all the SGMCMC methods implemented: SGLD, SGHMC and SGNHT.
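The following base R fragment sketches the control variate estimate (8) for the same hypothetical one-dimensional Gaussian model with a flat prior used earlier; it illustrates the idea, not the package's internal code. For this model the likelihood gradient is linear in theta, so the estimate is in fact exact:

```r
# Control variate gradient estimate (8) for x_i ~ N(theta, 1), flat prior.
set.seed(1)
N = 10^4; n = 100
x = rnorm(N, mean = 2)
thetaHat = mean(x)                  # posterior mode (flat prior)
fullGradAtMode = sum(x - thetaHat)  # O(N) preprocessing step; zero here
theta = 1.5
S = sample(N, n)
# Minibatch estimate of the *difference* between gradients at theta and thetaHat
gradCV = fullGradAtMode +
    (N / n) * sum((x[S] - theta) - (x[S] - thetaHat))
trueGrad = sum(x - theta)
c(gradCV, trueGrad)  # equal here: zero-variance estimate for this model
```

In general the estimate is not exact, but its variance shrinks as the chain stays close to the mode, which is what allows the O(1) minibatch sizes discussed above.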

Brief TensorFlow introduction
TensorFlow (TensorFlow Development Team, 2015) is a software library released by Google. The tool was initially designed to easily build deep learning models, but its efficient design and excellent implementation of automatic differentiation (Griewank and Walther, 2008) has made it useful more generally. This package is built on TensorFlow, and while we have tried to make the package as easy as possible to use, some knowledge of TensorFlow is unavoidable, especially when declaring the log likelihood and log prior functions, or for high dimensional chains which will not fit into memory. With this in mind, we provide a brief introduction to TensorFlow in this section, which should provide enough background for the rest of the article. A more detailed introduction to TensorFlow for R can be found in Allaire et al. (2016). Any procedure written in TensorFlow follows three main steps. The first step is to declare all the variables, constants and equations required to run the algorithm. In the background, these declarations enable TensorFlow to build a graph of the possible operations and how they relate to the other variables, constants and operations. Once everything has been declared, the TensorFlow session is begun and all the variables and operations are initialised. Then the previously declared operations can be run as required; typically these will be sequential and run in a for loop.

Declaring TensorFlow tensors
Everything in TensorFlow, including operations, is represented as a tensor, which is essentially a multidimensional array. There are a number of ways of creating tensors:
• tf$constant(value) - creates a constant tensor with the same shape and values as value. The input value is generally an R array, vector or scalar. The most common use for this in the context of the package is declaring constant parameters in the logLik and logPrior functions.
• tf$Variable(value) - creates a tensor with the same shape and values as value. Unlike tf$constant, this type of tensor can be changed by a declared operation. The input value is generally an R array, vector or scalar.
• tf$placeholder(datatype, shape) - creates an empty tensor of type datatype and dimensions shape, which can be fed all or part of a dataset. This is useful when declaring operations which rely on data that can change. When you have storage constraints (see Section 5.2) you can use a placeholder to declare test functions that rely on a test dataset. The datatype should be a TensorFlow data type, such as tf$float32; the shape should be an R vector or scalar, such as c(100,2).
• operations - declare manipulations of previously declared tensors, using TensorFlow defined functions such as those in its math library. This is essentially what you are declaring when coding the logLik and logPrior functions. The params input consists of a list of tf$Variable objects, representing the model parameters to be inferred. The dataset input consists of a list of tf$placeholder tensors, representing the minibatch of data fed at each iteration. Your job is to declare functions that return an operation on these tensors defining the log likelihood and log prior.

TensorFlow operations
TensorFlow operations take other tensors as input and manipulate them to reach the desired output. Once the TensorFlow session has begun, these operations can be run as needed, and will use the current values of their input tensors. For example, we could declare a normal density tensor which manipulates a tf$Variable tensor representing the parameters and a tf$placeholder tensor representing the current data point. The tensor could then use the TensorFlow tf$contrib$distributions$MultivariateNormalDiag object to return the current value of a normal density given the current parameter value and the data point fed to the placeholder. We can break this example down into three steps. First, we declare the tensors we require: a tf$Variable tensor loc to hold the location parameter; a tf$placeholder tensor dataPoint which will be fed the current data point; and, since the scale parameter is fixed, a tf$constant tensor scaleDiag. Next, we declare the operation which takes these tensors as input and returns the normal density value: a MultivariateNormalDiag object distn is created, linked to the loc and scaleDiag tensors, and then a tensor dens evaluates the density of this distribution at dataPoint. The dens variable is now linked to the tensors loc, dataPoint and scaleDiag, so when it is evaluated it will use the current values of those tensors to calculate the density. Finally, we begin the TensorFlow session and initialise all the tensors we have declared. The TensorFlow graph has now been built and no new tensors can be added; this means that all operations need to be declared before they can be evaluated.
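A sketch of these three steps is given below. The tensor names loc, dataPoint, scaleDiag, distn, dens and sess follow the surrounding text, but the shapes, values and explicit dtype arguments are our own illustrative choices:

```r
library(tensorflow)

# Step 1: declare the tensors
loc = tf$Variable(rep(0, 2), dtype = tf$float32)        # location parameter
dataPoint = tf$placeholder(tf$float32, c(2))            # fed the current data point
scaleDiag = tf$constant(c(1, 1), dtype = tf$float32)    # fixed scale parameter

# Step 2: declare the density operation
distn = tf$contrib$distributions$MultivariateNormalDiag(loc, scaleDiag)
dens = distn$prob(dataPoint)

# Step 3: begin the session and initialise the declared tensors
sess = tf$Session()
sess$run(tf$global_variables_initializer())
```

Note that dens is only a node in the graph at this point; no density has been computed yet.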
Now that the session is started, we can run the operation dens we declared, given current values for dataPoint and loc, as follows:

x = rnorm(2)
out = sess$run(dens, feed_dict = dict( dataPoint = x ))
print(paste0("Density value for x is ", out))

Since dataPoint is a placeholder, we need to feed it values each time. In the block of code above we feed dataPoint a random value simulated from a standard normal. The sess$run expression then evaluates the current normal density value given loc and dataPoint.
As mentioned earlier, this is essentially what is happening when you are writing the logLik and logPrior functions. These functions will be fed a list of tf$Variable objects to the params input, and a list of tf$placeholder objects to the dataset input. The output of the function will then be declared as a  TensorFlow operation. This allows the gradient to be calculated automatically, while maintaining the efficiencies of TensorFlow.
TensorFlow implements a lot of useful functions to make building these operations easier. For example, a number of distributions are implemented in tf$contrib$distributions (e.g., tf$contrib$distributions$Normal and tf$contrib$distributions$Gamma). TensorFlow also has a comprehensive math library which provides a variety of useful tensor operations. For examples of writing TensorFlow operations, see the worked examples in Section 5 or the sgmcmc vignettes.

Package structure and implementation
The package has 6 main functions. The first three, sgld, sghmc and sgnht, implement SGLD, SGHMC and SGNHT, respectively. The other three, sgldcv, sghmccv and sgnhtcv, implement the control variate versions of SGLD, SGHMC and SGNHT, respectively. All of these are built on the TensorFlow library for R, which enables gradients to be automatically calculated and efficient computations to be performed in high dimensions. The usage of these functions is very similar, with the only differences in input being tuning parameters. These main functions are outlined in Table 1. The functions sgld, sghmc and sgnht have the same required inputs: logLik, dataset, params and stepsize. These determine, respectively: the log likelihood function for the model; the data for the model; the parameters of the model; and the stepsize tuning constants for the SGMCMC algorithm. The input logLik is a function taking dataset and params as input, while the rest are defined as lists. Using lists in this way provides a lot of flexibility: it allows multiple parameters to be defined; multiple datasets to be used; and different stepsizes to be set for each parameter, which is vital if the parameter scalings are different. It also allows users to easily reference parameters and datasets in the logLik function by simply referring to the relevant names in the list.
The functions also have a couple of optional parameters that are particularly important, logPrior and minibatchSize. These define, respectively, the log prior for the model and the minibatch size, as defined in Section 3. By default, logPrior is set to an uninformative uniform prior, which is fine to use for quick checks but will need to be set properly for more complex models. The logPrior is a function similar to logLik, but only takes params as input. The minibatchSize is a numeric, and can either be a proportion of the dataset size, if it is set between 0 and 1, or an integer. The default value is 0.01, meaning that 1% of the full dataset is used at each iteration.
The control variate functions have the same inputs as the non-control variate functions, with one additional required input. The optStepsize input is a numeric that specifies the stepsize for the initial optimisation step used to find the $\hat\theta$ maximising the posterior, as defined in Section 3.4. For a full outline of the key inputs, see Table 2.
Often large datasets and high dimensional problems go hand in hand. In these high dimensional settings storing the full MCMC chain in memory can become an issue. For this situation we provide functionality to run the chain one iteration at a time in a user defined loop. This enables the user to deal with the output at each iteration how they see fit. For example, they may wish to calculate a test function on the output to reduce the dimensionality of the chain; or they might calculate the required Monte Carlo estimates on the fly. We aim to extend this functionality to enable the user to define their own Gibbs updates alongside the SGMCMC procedure. This functionality is more involved, and requires more knowledge of TensorFlow, so we leave the details to the example in Section 5.2.
As mentioned earlier, TensorFlow already has a variety of distributions implemented. These are located in two places: tf$distributions contains some common univariate distributions, while tf$contrib$distributions contains common multivariate distributions and more univariate distributions. Specifying the logLik and logPrior functions will require you to regularly specify distributions. Therefore, in the interest of cleanliness, we make the minor change of copying the distributions from tf$contrib$distributions to tf$distributions. The module tf$contrib$distributions will still work as expected. The copying across occurs as soon as the package is loaded into the workspace. This means any distribution distn in the TensorFlow package can be accessed by referencing tf$distributions$distn.
For the rest of this section we go into more detail about the usage of the functions using a worked example. The package is used to infer the bias and coefficient parameters in a logistic regression model. Section 5.1 demonstrates standard usage by performing inference on the model using the sgld and sgldcv functions. Section 5.2 demonstrates usage in problems where the full MCMC chain cannot be fit into memory. The same logistic regression model is used throughout.

Example usage
In this example we use the functions sgld and sgldcv to infer the bias (or intercept) and coefficients of a logistic regression model. The sgldcv case is also available as a vignette. Suppose we have data $x_1, \dots, x_N$, each of dimension $d$ and taking values in $\mathbb{R}^d$; and response variables $y_1, \dots, y_N$ taking values in $\{0, 1\}$. Then a logistic regression model with coefficients $\beta$ and bias $\beta_0$ will have likelihood
$$p(X, y \mid \beta_0, \beta) = \prod_{i=1}^N \left[ \frac{1}{1 + e^{-\beta_0 - x_i^\top \beta}} \right]^{y_i} \left[ 1 - \frac{1}{1 + e^{-\beta_0 - x_i^\top \beta}} \right]^{1 - y_i}.$$
We will use the covertype dataset (Blackard and Dean, 1999), which can be downloaded and loaded using the sgmcmc function getDataset, which downloads example datasets for the package. The covertype dataset uses mapping data to predict the type of forest cover. Our particular dataset is taken from the LIBSVM library (Chang and Lin, 2011), which converts the data to a binary rather than multiclass problem. The dataset consists of a matrix of dimension 581012 × 55. The first column contains the labels $y$, taking values in $\{0, 1\}$. The remaining columns are the explanatory variables $X$, which have been scaled to take values in $[0, 1]$.

library("sgmcmc")
covertype = getDataset("covertype")

Now we split the dataset into predictors and response, and get the data into the required R list format:

X = covertype[,2:ncol(covertype)]
y = covertype[,1]
dataset = list( "X" = X, "y" = y )

In the last line we defined the dataset as a list object which will be input to the relevant sgmcmc function. The list names can be arbitrary, but must be consistent with the variables declared in the logLik function (see below).
When accessing the data, it is assumed that observations are split along the first axis. In other words, dataset$X is a 2-dimensional matrix, and we assume that observation $x_i$ is accessed at dataset$X[i,]. Similarly, if Z were a 3-dimensional array, we would assume that observation $i$ is accessed at Z[i,,]. Parameters are declared in a similar way, except the list entries are the desired parameter starting points. There are two parameters for this model, the bias $\beta_0$ and the coefficients $\beta$, which can be arbitrarily initialised to 0:

d = ncol(dataset$X)
params = list( "bias" = 0, "beta" = matrix( rep( 0, d ), nrow = d ) )

The log likelihood is specified as a function of dataset and params, which are lists with the same names we declared earlier. The only difference is that the objects inside the lists will have automatically been converted to TensorFlow objects: the dataset list will contain TensorFlow placeholders, and the params list will contain TensorFlow variables. The logLik function should take these lists as input and return the log likelihood value given the current parameters and data. This is done using TensorFlow operations, as this allows the gradient to be calculated automatically.
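For this logistic regression model, the logLik, logPrior and stepsize declarations might look as follows. This is a sketch: the choice of a Laplace prior with scale 1 and the stepsize value of 1e-5 are illustrative assumptions, not prescriptions of the package:

```r
# Log likelihood of the logistic regression model, built from
# TensorFlow operations so the gradient can be derived automatically
logLik = function(params, dataset) {
    yEstimated = 1 / (1 + exp(-tf$squeeze(params$bias +
            tf$matmul(dataset$X, params$beta))))
    logLik = tf$reduce_sum(dataset$y * tf$log(yEstimated) +
            (1 - dataset$y) * tf$log(1 - yEstimated))
    return(logLik)
}

# Illustrative Laplace prior with scale 1 on each parameter
logPrior = function(params) {
    logPrior = -(tf$reduce_sum(tf$abs(params$beta)) +
            tf$reduce_sum(tf$abs(params$bias)))
    return(logPrior)
}

# One stepsize per parameter, named as in params (illustrative values)
stepsize = list("beta" = 1e-5, "bias" = 1e-5)
```

Note that both functions only declare operations on the tensors; nothing is evaluated until the sampler runs.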
Everything is now ready to run a standard SGLD algorithm, with minibatchSize set to 500. To keep things reproducible we set the seed to 13:

output = sgld( logLik, dataset, params, stepsize, logPrior = logPrior,
        minibatchSize = 500, nIters = 10000, seed = 13 )

The output of the function is also a list with the same names as the params list. Suppose a given parameter has shape $(d_1, \dots, d_l)$; then the output will be an array of shape (nIters, $d_1, \dots, d_l$). So in our case, output$beta[i,,] is the $i$th MCMC sample of the parameter $\beta$, and dim(output$beta) is c(10000, 54, 1).
In order to run a control variate algorithm such as sgldcv we need one additional argument: the stepsize for the initial optimisation step. The optimisation uses the TensorFlow GradientDescentOptimizer. The optimisation stepsize should be quite similar to the stepsize for SGLD, though it is often slightly larger. First, so that we can measure the performance of the chain, we set a seed and randomly remove some observations from the full dataset to form a test set. We also remove a short burn-in period of 1000 iterations from the final output.
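A sketch of these steps might look as follows; the test set size, the optStepsize value and the iteration counts are illustrative assumptions:

```r
# Hold out a random test set so chain performance can be measured
set.seed(13)
testSample = sample(nrow(dataset$X), 10^4)
testset = list("X" = dataset$X[testSample,], "y" = dataset$y[testSample])
dataset = list("X" = dataset$X[-testSample,], "y" = dataset$y[-testSample])

# Control variate SGLD: optStepsize tunes the initial optimisation step
output = sgldcv(logLik, dataset, params, stepsize, optStepsize = 1e-1,
        logPrior = logPrior, minibatchSize = 500, nIters = 11000, seed = 13)

# Discard a short burn-in period of 1000 iterations
output$beta = output$beta[-(1:1000), , , drop = FALSE]
output$bias = output$bias[-(1:1000)]
```

Since the control variate chain starts from the posterior mode, the remaining burn-in is typically very short.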

Example usage: Storage constraints
Often large datasets and high dimensionality go hand in hand. Sometimes the dimensionality is so high that storage of the full MCMC chain in memory becomes an issue. There are a number of ways around this, including: calculating estimates of the desired posterior quantity on the fly; reducing the dimensionality of the chain using a test function; or just periodically saving the chain to the hard disk and starting from scratch. To deal with high storage costs, sgmcmc provides functionality to run SGMCMC algorithms step by step. This allows users to deal with the output as they see fit at each iteration.
In this section, we detail how to run SGMCMC chains step by step. To do this requires more knowledge of TensorFlow, including using TensorFlow sessions and building custom placeholders and tensors. For more details see the TensorFlow for R documentation (Allaire et al., 2016). The step by step procedure works similarly to a standard TensorFlow procedure: TensorFlow variables, tensors and placeholders are declared; then the TensorFlow session is started and all the required tensors are initialised; finally the SGMCMC algorithm is run step by step in a user defined loop, and the user evaluates tensors as required.
To demonstrate this concept we keep things simple and use the logistic regression example introduced in the previous section. While this example can fit into memory, it allows us to demonstrate the concepts without getting bogged down in a complicated model. For a more realistic example, where the output does not fit into memory, see the Bayesian neural network model in Section 6.3.
We start by assuming the dataset, params, logLik, logPrior and stepsize objects have been created in exactly the same way as in the previous example (Section 5.1). Now suppose we want to perform inference using stochastic gradient Langevin dynamics (SGLD), but we want to run it step by step. The first step is to initialise an sgld object using the function sgldSetup. For every function in Table 1 there is a corresponding *Setup function, such as sghmccvSetup or sgnhtSetup. This function will create all the TensorFlow objects required, as well as declare the dynamics of the SGMCMC algorithm. For our example we can run the following

sgld = sgldSetup(logLik, dataset, params, stepsize, logPrior = logPrior,
    minibatchSize = 500, seed = 13)

This sgld object is a type of sgmcmc object; it is an R S3 object, which is essentially a list with a number of entries. The most important of these entries for building SGMCMC algorithms is called params, which holds a list with the same names as the params that were fed to sgldSetup, but whose entries are tf$Variable objects. These are the tensors which hold the current parameter values in the chain. For more details on the attributes of these objects, see the documentation for sgldSetup, sgldcvSetup, etc.

Now that we have created the sgld object, we want to initialise the TensorFlow variables and the SGMCMC algorithm chosen. For a standard algorithm, this will initialise the TensorFlow graph and all the tensors that were created. For an algorithm with control variates (e.g., sgldcv), this will also find the θ̂ estimates of the parameters and calculate the full log posterior gradient at that point, as detailed in Section 3.4. The function used to do this is initSess,

sess = initSess(sgld)
The sess returned by initSess is the current TensorFlow session, which is needed to run the SGMCMC algorithm of choice, and to access any of the tensors needed, such as sgld$params. Now we have everything we need to run an SGLD algorithm step by step, as follows

for (i in 1:10^4) {
    sgmcmcStep(sgld, sess)
    currentState = getParams(sgld, sess)
}

Here the function sgmcmcStep will update sgld$params using a single update of SGLD, or whichever SGMCMC algorithm is given. The function getParams will return a list of the current parameters as R objects rather than as tensors.
This simple example of running SGLD step by step only stores the most recent value in the chain, which is useless for a Monte Carlo method. Also, for large scale examples, it is often useful to reduce the dimension of the chain by calculating some test function g(·) of θ at each iteration rather than the parameters themselves. This example will demonstrate how to store a test function at each iteration, as well as calculating the estimated posterior mean on the fly. We assume that a new R session has been started and the sgld object has just been created using sgldSetup with the same properties as in the example above. We assume that no TensorFlow session has been created (i.e., initSess has not been run yet).
Before the TensorFlow session has been declared, the user is able to create their own custom tensors. This is useful, as test functions can be declared beforehand using the sgld$params variables, which allows the test functions to be quickly calculated by just evaluating the tensors in the current session. The test function used here is once again the log loss of a test set, as defined in (10).
Suppose we input sgld$params and the testset T to the logLik function. Then the log loss is simply −1/|T| times this value. This means we can easily create a tensor that calculates the log loss by creating a list of placeholders that hold the test set, then using the logLik function with the testset list and sgld$params as input. We can do this as follows

testPlaceholder = list()
testPlaceholder[["X"]] = tf$placeholder(tf$float32, dim(testset[["X"]]))
testPlaceholder[["y"]] = tf$placeholder(tf$float32, dim(testset[["y"]]))

testSize = as.double(nrow(testset[["X"]]))
logLoss = -logLik(sgld$params, testPlaceholder) / testSize

This placeholder is then fed the full testset each time the log loss is calculated. Now we will declare the TensorFlow session and run the chain step by step. At each iteration we will calculate the current Monte Carlo estimate of the parameters. The log loss will be stored every 100 iterations. We omit a plot of the log loss as it is similar to Figure 2. The first chunk of code corresponds to the burn-in; at this point we do not store parameter estimates, we just print the log loss every 100 iterations to check convergence of the chain.
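A sketch of what these two chunks might look like follows. The feedDict construction indexes a dict by the placeholder tensors, following the TensorFlow for R convention; the chain lengths and the running-mean update for the posterior estimate are illustrative choices, not prescribed by the package:

```r
sess = initSess(sgld)

# Feed the full testset to the placeholders whenever logLoss is evaluated
feedDict = dict()
feedDict[[testPlaceholder[["X"]]]] = testset[["X"]]
feedDict[[testPlaceholder[["y"]]]] = testset[["y"]]

# Burn-in: print the log loss every 100 iterations to check convergence,
# without storing any parameter estimates
for (i in 1:10^4) {
    sgmcmcStep(sgld, sess)
    if (i %% 100 == 0) {
        message(sess$run(logLoss, feed_dict = feedDict))
    }
}

# Main run: update the Monte Carlo estimate of the posterior mean of beta
# on the fly, and store the log loss every 100 iterations
postMean = getParams(sgld, sess)$beta * 0
logLossOut = rep(0, 10^4 / 100)
for (i in 1:10^4) {
    sgmcmcStep(sgld, sess)
    # Running mean: after i steps, postMean is the average of the i draws
    postMean = postMean + (getParams(sgld, sess)$beta - postMean) / i
    if (i %% 100 == 0) {
        logLossOut[i / 100] = sess$run(logLoss, feed_dict = feedDict)
    }
}
```

Only postMean and logLossOut are kept in memory, so the storage cost is constant in the number of iterations.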

Simulations
In this section we demonstrate the algorithms and their performance by simulating from a variety of models using all the implemented methods, commenting on the performance of each. These simulations are reproducible and available in the supplementary material and on GitHub. For more usage tutorials similar to Sections 5.1 and 5.2, please see the vignettes on the package website.
The results are plotted in Figure 3. The black contours represent the best guess at the true posterior, which was found using the standard HMC procedure in Stan. The coloured contours that overlay the black contours are the approximations of each of the SGMCMC methods implemented by sgmcmc. This allows us to compare the SGMCMC estimates with the 'truth' by eye.
In the simulation, we obtain two chains, one approximating θ_1 and the other approximating θ_2. In order to examine how well the methods explore both modes, we take just θ_1 and compare this to the HMC run for θ_1. The results are quite variable, and they demonstrate a point nicely: there seems to be a trade-off between predictive accuracy and exploration. Many methods have demonstrated good performance using predictive accuracy, where a test set is removed from the full dataset to assess how well the fitted model performs on it. This is a useful technique for complex models, which are high dimensional and have a large number of data points, since they cannot be plotted, and an MCMC run to act as the 'truth' cannot be fitted.
However, this example shows that it does not give the full picture. A lot of the methods which show improved predictive performance over sgld (e.g., control variate methods and especially sgnht) appear here to perform worse at exploring the full space. In this example, sgld performs the best at exploring both components, though it overestimates posterior uncertainty. The algorithm sghmc also explores both components but somewhat unevenly. We find that sgnht, while shown to have better predictive performance in the original paper (Ding et al., 2014), does not do nearly as well as the other algorithms at exploring the space and appears to collapse to the posterior mode. The control variate methods, shown in the following sections and in Baker et al. (2017), appear to have better predictive performance than sgld, but do not explore both components either. For example, sgldcv explores the space the best but overestimates uncertainty of the first component, since it relies on SGLD updates which also overestimate uncertainty. In contrast, sgnhtcv collapses to a posterior mode since it relies on the SGNHT updates which also collapse.

Bayesian logistic regression
In this section, we apply all the methods to the logistic regression example in Section 5.1. We compare the performance of the methods by calculating the log loss of a test set every 10 iterations, again as detailed in Section 5.1. The standard methods (sgld, sghmc, sgnht) were run for 10^4 iterations with an additional 10^4 iterations of burn-in; the exception is sghmc, which has 5× the computational cost, so it was run for 2,000 iterations with 2,000 iterations of burn-in. The control variate methods were run for 10^4 iterations with an additional 10^4 iterations for the initial optimisation step, and no burn-in; again with the exception of sghmccv, which was run for 2,000 iterations. This means that all the methods should be roughly comparable in terms of computation time.
Results are plotted in Figure 4. All of the algorithms show decent performance. Methods which use control variates have significantly better predictive performance and result in chains with lower variance. sghmc has lower variance than sgld and sgnht, though this could be related to its higher computational cost: one might expect that setting a shorter trajectory L would result in a chain with higher variance. sgldcv takes longer to burn in than the other control variate methods. The algorithm sgld has the highest variance by far; this could be related to our discussion in Section 6.1 on exploration versus accuracy.

Bayesian neural network
In this simulation we demonstrate a very high dimensional chain. This gives a more realistic example of when we would want to run the chain step by step. The model is a two layer Bayesian neural network which is fit to the MNIST dataset (LeCun and Cortes, 2010). More step by step detail of how to fit this model can be found in the sgmcmc vignettes or on the package website. The MNIST dataset consists of 28 × 28 pixel images of handwritten digits from zero to nine. The images are flattened to be a vector of length 784. The dataset is available as a standard dataset from the TensorFlow library, with a matrix of 55,000 training vectors and 10,000 test vectors, each with their corresponding labels. The dataset can be constructed in a similar way to the logistic regression example of Section 5.1, using the standard dataset in the package mnist.
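A sketch of how the dataset and testset lists might be assembled follows. This assumes a loader returning the train/test split in the familiar TensorFlow structure; the exact loader and list element names should be checked against the package documentation:

```r
# Load the bundled MNIST data (assumed interface, for illustration only)
mnist = getDataset("mnist")

# Arrange into the same list format as the logistic regression example:
# X holds the flattened 784-dimensional image vectors, y the labels
dataset = list("X" = mnist$train$images, "y" = mnist$train$labels)
testset = list("X" = mnist$test$images, "y" = mnist$test$labels)
```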
The results are plotted in Figure 5. Again we see improvements in the predictive performance of the control variate methods. Among the standard methods, sghmc and sgnht have the best predictive performance; which is to be expected given the apparent trade-off between accuracy and exploration.

Discussion
We presented the R package sgmcmc, which enables Bayesian inference with large datasets using stochastic gradient Markov chain Monte Carlo. The package only requires the user to specify the log likelihood and log prior functions; and any differentiation required can be performed automatically. The package is based on TensorFlow, an efficient library for numerical computation that can take advantage of many different architectures, including GPUs. The sgmcmc package keeps much of this efficiency. The package provides functionality to deal with cases where the full MCMC chain is too large to fit into memory. As the chain can be run step by step at each iteration, there is flexibility for these cases.
We implemented the methods on a variety of statistical models, many on realistic datasets. One of these statistical models was a neural network, for which the full MCMC chain would not fit into memory. In this case we demonstrated building test functions and calculating the Monte Carlo estimates on the fly. We empirically demonstrated the predictive performance of the algorithms and the trade-off that appears to occur between predictive performance and exploration.
Many complex models for which SGMCMC methods have been found to perform well require Gibbs updates to be performed periodically (Patterson and Teh, 2013; Li et al., 2016). In the future we would like to build functionality for user defined Gibbs steps that can be updated step by step alongside the sgmcmc algorithms. SGHMC has been implemented by setting the value β̂_t = 0, as in the experiments of the original paper (Chen et al., 2014). In the future, we would like to implement a more sophisticated approach to setting this value, such as using an estimate similar to that of Ahn et al. (2012).