Multivariate-from-Univariate MCMC Sampler: R Package MfUSampler

The R package MfUSampler provides Markov chain Monte Carlo (MCMC) machinery for generating samples from multivariate probability distributions using univariate sampling algorithms such as the slice sampler and the adaptive rejection sampler. The sampler function performs a full cycle of univariate sampling steps, one coordinate at a time. In each step, the latest sample values obtained for the other coordinates are used to form the conditional distribution. The concept is an extension of Gibbs sampling where each step involves, not an independent sample from the conditional distribution, but a Markov transition for which the conditional distribution is invariant. The software relies on the proportionality of conditional distributions to the joint distribution to implement a thin wrapper for producing conditionals. Examples illustrate basic usage as well as methods for improving performance. By encapsulating the multivariate-from-univariate logic, MfUSampler provides a reliable library for rapid prototyping of custom Bayesian models, while allowing for incremental performance optimizations such as utilization of conjugacy, conditional independence, and porting of function evaluations to compiled languages.


Introduction
Bayesian inference software such as Stan (Stan Development Team 2014), OpenBUGS (Thomas, O'Hara, Ligges, and Sturtz 2006), and JAGS (Plummer 2004) provides high-level, domain-specific languages (DSLs) to specify and sample from probabilistic directed acyclic graphs (DAGs). In some Bayesian projects, the convenience of using such DSLs comes at the price of reduced flexibility in model specification, and of suboptimality of the underlying sampling algorithms for the particular distribution that must be sampled. Furthermore, for large projects the end goal might be to implement all or part of the sampling algorithm in a high-performance, perhaps parallel, language. In such cases, researchers may choose to start their development work by 'rolling their own' joint probability distribution from the DAG specification, followed by application of their choice of sampling algorithm to the joint distribution.
Many Markov chain Monte Carlo (MCMC) algorithms have been proposed over the years for sampling from complex posterior distributions. Perhaps the most widely known algorithm is Metropolis (Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller 1953) and its generalization, Metropolis-Hastings (MH) (Hastings 1970). These multivariate algorithms are easy to implement, but they can be slow to converge without a carefully selected proposal distribution. A particular flavor of MH is the Stochastic Newton Sampler (Qi and Minka 2002), where the proposal distribution is a multivariate Gaussian based on the second-order Taylor-series expansion of the log-probability. This method has been implemented in the R package sns (Mahani, Hasan, Jiang, and Sharabiani 2015). It can be quite effective for twice-differentiable, log-concave distributions such as those encountered in generalized linear model (GLM) problems. Another flavor of MH is the t-walk algorithm (Christen and Fox 2010), which co-evolves two points in the state space using a set of scale-invariant proposal distributions, requiring little problem-specific tuning. Hamiltonian Monte Carlo (HMC) algorithms (Girolami and Calderhead 2011; Neal 2011) have also gained popularity due to emerging techniques for their automated tuning (Hoffman and Gelman 2014).
Univariate samplers tend to have few tuning parameters and thus are well suited for black-box MCMC software. Two important examples are the adaptive rejection sampler (ARS) (Gilks and Wild 1992) and the slice sampler (Neal 2003). ARS requires the log-density to be concave, and needs its first derivative, while the slice sampler is generic and derivative-free. To apply these univariate samplers to multivariate distributions, they must be applied one coordinate at a time according to the Gibbs sampling algorithm (Geman and Geman 1984), where at the end of each univariate step the sampled value is used to update the conditional distribution for the next coordinate. MfUSampler encapsulates this logic into a library function, providing a fast and reliable path towards Bayesian model estimation for researchers working on novel DAG specifications. In addition to the slice sampler and ARS, the current version of MfUSampler (1.0.0) contains the adaptive rejection Metropolis sampler (Gilks, Best, and Tan 1995) and univariate Metropolis with a Gaussian proposal. Univariate samplers have their limits: when the posterior distribution exhibits strong correlation structure, one-coordinate-at-a-time algorithms can become inefficient as they fail to capture important geometry of the space (Girolami and Calderhead 2011). This has been a key motivation for research on black-box multivariate samplers, such as adaptations of the slice sampler (Thompson 2011) or the no-U-turn sampler (Hoffman and Gelman 2014).
The rest of this article is organized as follows. Section 2 provides a brief overview of the extended Gibbs sampling framework used in MfUSampler. Section 3 illustrates how to use the software with an example. Section 4 discusses several performance optimization techniques that can be used in conjunction with MfUSampler. Finally, Section 5 provides a summary and concluding remarks.

Theory and Implementation of MfUSampler
In this section, we discuss the theoretical underpinnings of the MfUSampler package, including extended Gibbs sampling (Section 2.1) and the proportionality of conditional and joint distributions (Section 2.2). The software components of MfUSampler, described in Section 2.3, are best understood against this theoretical background.

Extended Gibbs sampling
Gibbs sampling (Bishop 2006) involves iterating through state space coordinates, one at a time, and drawing samples from the distribution of each coordinate, conditioned on the latest sampled values for all remaining coordinates. Gibbs sampling reduces a multivariate sampling problem into a series of univariate problems, which can be more tractable.
In what we refer to as 'extended Gibbs sampling', rather than requiring an independent sample from each coordinate's conditional distribution, we only expect a Markov transition for which the conditional distribution is an invariant distribution. Among the univariate samplers currently implemented in MfUSampler, the adaptive rejection sampler produces a standard Gibbs sampler, while the remaining samplers fall into the 'extended Gibbs sampler' category. The following lemma forms the basis for proving the validity of extended Gibbs sampling as an MCMC sampler. (For a discussion of the ergodicity of MCMC samplers, see Roberts and Rosenthal (1999); Jarner and Hansen (2000).)

Lemma 1. If a coordinate-wise Markov transition leaves the conditional distribution invariant, it will also leave the joint distribution invariant.
The proof is given in Appendix B. A full Gibbs cycle is simply a succession of coordinate-wise Markov transitions, and since each one leaves the target distribution invariant according to the above lemma, the same is true of the resulting composite Markov transition density.
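The extended-Gibbs cycle can be sketched in a few lines of R. This is an illustration, not the package's internal code; the toy univariate Markov transition below is a random-walk Metropolis step, standing in for the slice, ARS, or ARMS steps that MfUSampler actually uses, and the target distribution is a stand-in bivariate Gaussian.

```r
## One cycle of 'extended Gibbs': each coordinate is updated by a univariate
## Markov transition that leaves its full conditional invariant.
uni.mh.step <- function(logf, xk, sigma = 1.0) {
  xk.prop <- xk + sigma * rnorm(1)              # Gaussian random-walk proposal
  if (log(runif(1)) < logf(xk.prop) - logf(xk)) xk.prop else xk
}
extended.gibbs.cycle <- function(log.joint, x) {
  for (k in seq_along(x)) {
    ## the conditional of coordinate k is proportional to the joint with the
    ## remaining coordinates held at their latest values (Section 2.2)
    logf.k <- function(xk) log.joint(replace(x, k, xk))
    x[k] <- uni.mh.step(logf.k, x[k])
  }
  x
}
## toy target: bivariate Gaussian with correlation 0.5
Sigma.inv <- solve(matrix(c(1, 0.5, 0.5, 1), 2))
log.joint <- function(x) -0.5 * sum(x * (Sigma.inv %*% x))
set.seed(1)
x <- c(0, 0)
for (iter in 1:100) x <- extended.gibbs.cycle(log.joint, x)
```

Substituting an independent draw from the conditional for `uni.mh.step` recovers standard Gibbs sampling as a special case.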

Proportionality of conditional and joint distributions
Using univariate samplers within the Gibbs sampling framework requires access to the conditional distributions, up to a multiplicative constant (in terms of the coordinate being sampled). Referring to the conditional distribution for the k'th coordinate as p(x_k | x_\k), we apply Bayes' rule: p(x_k | x_\k) = p(x) / p(x_\k). Since the normalizing factor, p(x_\k), is independent of x_k, the joint and conditional distributions are proportional as functions of x_k. Therefore, the joint distribution can be supplied to univariate sampling routines in lieu of the conditional distributions during each step of Gibbs sampling. MfUSampler takes advantage of this property, as described next.
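A quick numerical illustration of this proportionality (not from the package; a correlated bivariate Gaussian serves as a stand-in target): the log-ratio of the joint density to the true conditional density is constant in x_1, so the joint can be handed to a univariate sampler in place of the conditional.

```r
## Joint log-density of a standardized bivariate Gaussian with correlation
## rho, up to an additive constant:
rho <- 0.5
log.joint <- function(x1, x2)
  -(x1^2 - 2 * rho * x1 * x2 + x2^2) / (2 * (1 - rho^2))
## True conditional: x1 | x2 ~ N(rho * x2, 1 - rho^2)
x2 <- 0.7
x1.grid <- c(-1, 0, 1, 2)
log.ratio <- log.joint(x1.grid, x2) -
  dnorm(x1.grid, mean = rho * x2, sd = sqrt(1 - rho^2), log = TRUE)
## The differences are (numerically) zero: the normalizer depends on x2 only.
print(diff(log.ratio))
```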

Implementation
The MfUSampler software consists of 5 components: 1) connectors, 2) univariate samplers, 3) Gibbs wrappers, 4) diagnostic utilities, and 5) full Bayesian prediction. Figure 1 provides an overview of how these components fit in the overall process flow of MfUSampler. Below we describe each component in detail.

Connectors
The internal functions MfU.fEval, MfU.fgEval.f and MfU.fgEval.g return the conditional log-density and its gradient for each coordinate, using the underlying joint log-density and its gradient vector (Section 2.2). These functions act as the bridge between the user-supplied, multivariate log-densities and the univariate samplers.

Univariate samplers
These functions are responsible for producing a single MCMC jump for the univariate distributions that result from applying the connector functions to the user-supplied, multivariate distribution. As of version 1.0.0, MfUSampler supports the following 4 samplers:
1. Univariate slice sampler with stepout and shrinkage (Neal 2003). The code, wrapped in the internal function MfU.Slice, is taken, with small modifications, from Radford Neal's website 1 . The slice sampler is derivative-free and robust, i.e., its performance is rather insensitive to its tuning parameters. It is the default option in MfU.Sample and MfU.Sample.Run.
2. Adaptive rejection sampler (Gilks and Wild 1992), imported from the R package ars (Perez-Rodriguez, Wild, and Gilks 2014). ARS requires the log-density to be concave, and needs its gradient. Our experience shows that it is somewhat more sensitive to the choice of tuning parameters, compared to the slice sampler (Section 3.3).
3. Adaptive rejection Metropolis sampler (Gilks et al. 1995), imported from the R package HI (Petris, Tardella, and Gilks 2013). This algorithm is an adaptation of ARS with an additional Metropolis acceptance test, aimed at accommodating distributions that are not log-concave. The algorithm can also be applied directly to a multivariate distribution; however, for caveats regarding that approach, see Gilks, Neal, Best, and Tan (1997).

4. Univariate Metropolis with Gaussian proposal, implemented in MfUSampler (function MfU.Univariate.Metropolis). This simple sampler can be inefficient unless the standard deviation of the Gaussian proposal is chosen carefully. It has been included primarily as a reference for the other, more practical choices.
For technical details on the sampling algorithms and their tuning parameters, see the help file for MfU.Sample, as well as the aforementioned publications or statistical textbooks (Robert and Casella 1999).
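The sensitivity of the Gaussian-proposal Metropolis sampler (item 4) to its proposal standard deviation can be seen in a small experiment. This is illustrative code, not part of the package; the target is a standard Gaussian.

```r
## Acceptance rate of random-walk Metropolis on N(0,1) for several proposal
## standard deviations: a very small sd accepts nearly everything (but moves
## slowly), a very large sd rejects most proposals -- both mix poorly.
metropolis.accept.rate <- function(sigma, n.iter = 10000) {
  x <- 0
  n.acc <- 0
  for (i in 1:n.iter) {
    x.prop <- x + sigma * rnorm(1)
    ## target N(0,1): log-density difference is (x^2 - x.prop^2) / 2
    if (log(runif(1)) < 0.5 * (x^2 - x.prop^2)) {
      x <- x.prop
      n.acc <- n.acc + 1
    }
  }
  n.acc / n.iter
}
set.seed(1)
rates <- sapply(c(0.1, 1, 10), metropolis.accept.rate)
print(round(rates, 2))  # high for sd = 0.1, moderate for 1, low for 10
```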
Gibbs wrappers
The function MfU.Sample implements the extended Gibbs sampling concept (Section 2.1), using a for loop that applies the underlying univariate sampler to each coordinate of the multivariate distribution. The function MfU.Control allows the user to set the tuning parameters of the univariate sampler. MfU.Sample.Run is a light wrapper around MfU.Sample for drawing multiple samples.

Diagnostic utilities
Implementations of the generic S3 methods summary and plot for the MfU class (the output of MfU.Sample.Run) are light wrappers around the corresponding methods for the mcmc class in the R package coda, with the addition of the sample-based covariance matrix, effective sample size, sampling time, and number of independent samples per second.
Full Bayesian prediction
The S3 method predict.MfU allows for sample-based reconstruction of the predictive posterior distribution for any user-supplied prediction function. The mechanics and advantages of full Bayesian prediction are discussed in the sns vignette. See Section 3.4 of this document for an example.

Using MfUSampler
In this section, we illustrate how MfUSampler can be used for building Bayesian models. We begin by introducing the data set used throughout the examples in this paper. This is followed by an illustration of how univariate samplers can be readily applied to sample from the posterior distribution of our problem. Application of the diagnostic and prediction utility functions is illustrated last.
Before proceeding, we load MfUSampler into an R session, and select the seed value to feed to the random number generator at the beginning of each code segment (for reproducibility), and the number of MCMC samples to collect in each run:
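A minimal setup sketch along these lines (the names my.seed and nsmp, and their values, are illustrative choices consistent with the later examples, not values mandated by the package):

```r
library("MfUSampler")  # assumes the package is installed from CRAN
my.seed <- 0           # passed to set.seed() at the start of each code segment
nsmp <- 1000           # number of MCMC samples to collect in each run
```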

Diabetic retinopathy data set
This data set is a 2×8 contingency table, containing the number of occurrences of diabetic retinopathy for patients with 8 different durations of diabetes. The tabular form of the data set can be found in Knuiman and Speed (1988).
The mid-points of the diabetes duration bands are encoded in the vector z below, while the numbers of patients with and without retinopathy are encoded in the m1 and m2 vectors. The prior suffix corresponds to numbers from a previous study, while current reflects the results of the current study.
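The actual vectors are lost from the extracted text, so the sketch below only shows the shape of the objects used in later examples. All numeric values are hypothetical placeholders, NOT the published Knuiman and Speed (1988) counts (consult the original table for those); the quadratic design matrix X is likewise an assumption about the model's functional form.

```r
z <- c(1, 4, 7, 10, 13, 16, 19, 24)   # placeholder band mid-points (years)
m1.current <- rep(10L, 8)             # placeholder: patients with retinopathy
m2.current <- rep(90L, 8)             # placeholder: patients without retinopathy
m1.prior <- rep(5L, 8)                # placeholder counts from the prior study
m2.prior <- rep(45L, 8)
X <- cbind(1, z, z^2)                 # assumed design matrix (3 coefficients)
```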

Slice sampling from posterior
The following function implements the log-posterior for our problem, assuming a multivariate Gaussian prior on the coefficient vector, beta, with mean beta0 and covariance matrix W.
The default values represent a non-informative, or flat, prior.
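The logprior and loglike helpers referenced by the log-posterior are not shown in the surviving text. One plausible implementation, assuming a multivariate Gaussian prior and a binomial logistic likelihood, is sketched below (not necessarily the authors' exact code):

```r
logprior <- function(beta, beta0, W) {
  ## multivariate Gaussian log-density, up to an additive constant
  return (-0.5 * sum((beta - beta0) * solve(W, beta - beta0)))
}
loglike <- function(beta, X, m1, m2) {
  ## binomial logistic log-likelihood: m1 successes (retinopathy) and
  ## m2 failures per row of the design matrix X
  xbeta <- X %*% beta
  return (sum(m1 * xbeta - (m1 + m2) * log(1 + exp(xbeta))))
}
```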
R> logpost <- function(beta, X, m1, m2, beta0, W) {
+   return (logprior(beta, beta0, W) + loglike(beta, X, m1, m2))
+ }
Incorporating prior information in this problem can be done in two ways: 1) extracting beta0 and W from the prior data (using flat priors during estimation), and feeding these numbers as priors for estimating the model with the current data, or 2) simply adding the prior and current numbers to arrive at the posterior contingency table. While the first approach can be more flexible, as argued in Knuiman and Speed (1988), here we opt for the second approach for brevity of presentation:
R> m1.total <- m1.prior + m1.current
R> m2.total <- m2.prior + m2.current
We begin by drawing 1000 samples using the slice sampler, and printing a summary of the samples. The sample mean is quite close to the values reported in Knuiman and Speed (1988).
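The sampling call itself is lost from the extracted text; the sketch below shows what it likely looks like, using MfU.Sample.Run with its default slice sampler. The initial values and the flat-prior arguments beta0 and W passed through to logpost are illustrative assumptions.

```r
set.seed(my.seed)
beta.ini <- rep(0.0, ncol(X))                   # assumed starting point
beta.smp <- MfU.Sample.Run(beta.ini, logpost, nsmp = nsmp
  , X = X, m1 = m1.total, m2 = m2.total         # data passed through ... to logpost
  , beta0 = rep(0.0, ncol(X)), W = diag(1e+6, ncol(X)))  # flat-prior stand-in
summary(beta.smp)                               # coda-based summary (Section 2.3)
```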

Adaptive rejection sampling of posterior
Next, we illustrate how ARS can be used for this posterior distribution. We need to implement the gradient of the log-density in order to use ARS. Furthermore, we must ensure that the distribution is log-concave, or equivalently that the Hessian of the log-density is negative-definite. It is easy to verify that this distribution satisfies our requirement. For theoretical and software support in assessing log-concavity of distributions and verifying correct implementation of their derivatives, see the vignette for the R package sns. Note that we have provided custom values for the control parameter ars.x. For this problem, the ARS algorithm is sensitive to these initial values, and can fail to identify log-concavity of the distribution in some cases. Generally, we have found the slice sampler to require less tuning to achieve reasonable performance. Interested readers can see examples of using the other samplers included in MfUSampler, namely ARMS and univariate Metropolis, by typing ?MfU.Sample in the R session.
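A sketch of what the ARS run might look like, with two flagged assumptions: 1) the gradient is supplied via a grad flag on the log-density function (check ?MfU.Sample for the exact mechanism in your version of the package), and 2) the ars.x starting abscissae are illustrative values, not recommendations.

```r
## Log-posterior (flat prior) and its gradient, switched by a grad flag
## (ASSUMED interface -- verify against ?MfU.Sample):
logpost.fg <- function(beta, X, m1, m2, grad = FALSE) {
  xbeta <- X %*% beta
  if (grad)  # gradient of the binomial logistic log-likelihood
    return (as.numeric(t(X) %*% (m1 - (m1 + m2) / (1 + exp(-xbeta)))))
  return (sum(m1 * xbeta - (m1 + m2) * log(1 + exp(xbeta))))
}
set.seed(my.seed)
beta.smp.ars <- MfU.Sample.Run(rep(0.0, ncol(X)), logpost.fg, nsmp = nsmp
  , uni.sampler = "ars"
  , control = MfU.Control(ncol(X), ars.x = c(-1, 0, 1))  # illustrative abscissae
  , X = X, m1 = m1.total, m2 = m2.total)
```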

Full Bayesian prediction
The predict function in the MfUSampler package can be used for sample-based reconstruction of arbitrary functions of the model parameters. This includes deterministic as well as stochastic functions. For example, assume we want to know the probability distribution of the probability of retinopathy for each value of z in our training set. The prediction function has the following simple form:

R> predfunc.mean <- function(beta, X) {
+   return (1 / (1 + exp(-X %*% beta)))
+ }
We can now generate samples for this predicted quantity:
R> pred.mean <- predict(beta.smp, predfunc.mean, X)
R> predmean.summ <- summary(pred.mean)
R> print(predmean.summ, n = 8)
[prediction summary table omitted]
We can also ask a different question: what is the distribution of the percentage of the population with retinopathy in each given band of diabetes duration? The prediction function is a slight modification of the previous one:
R> predfunc.binary <- function(beta, X) {
+   return (1 * (runif(nrow(X)) < 1 / (1 + exp(-X %*% beta))))
+ }
R> pred.binary <- predict(beta.smp, predfunc.binary, X)
R> predbinary.summ <- summary(pred.binary)
R> print(predbinary.summ, n = 8)
[prediction summary table omitted]
We see that the mean values from the two predictions are close, and in the limit of infinite samples they will converge towards the same values. However, the SD numbers are much larger for the binary prediction, as it combines the uncertainty of estimating the coefficients with the uncertainty of the process that generates the (binary) outcome. The value of full Bayesian prediction, particularly in business and decision-making settings, is that it combines these two sources of uncertainty to provide the user with a full representation of uncertainty in estimating actual outcomes, and not just mean/expected values.

Performance improvement
Applying MfU.Sample.Run to the full joint PDF of a Bayesian model, implemented in R, is often a good starting point. For small data sets, this may well be sufficient. For example, for the diabetic retinopathy data set described in Section 3, we are able to draw 10,000 samples from the posterior distribution in less than xx seconds. For large data sets, however, we must look for opportunities to improve performance. In this section, we describe two general strategies for speeding up the sampling of posterior distributions within the framework of MfUSampler: 1) utilizing the structure of the underlying model graph, and 2) high-performance evaluation of the posterior function. We describe these two strategies using extensions of the diabetic retinopathy data set. At the end of this section, we provide an overview of other performance optimization approaches, as well as pointers for further reading.

Diabetic retinopathy: Hierarchical Bayesian with continuous z
To illustrate the performance optimization strategies discussed in this section, we extend the diabetic retinopathy data set, by simulation, to define a hierarchical Bayesian (HB) problem with continuous z.

Utilizing graph structure
In directed acyclic graphs, the joint distribution can be factorized into the product of conditional distributions of all nodes, each conditioned on its parent nodes (Bishop 2006). For undirected graphs, factorization can be done over the maximal cliques of the graph. When sampling from the conditional distribution of a variable, conditioned on all remaining variables, as is done during Gibbs sampling, not all such multiplicative factors involve the variable being sampled, and those that do not can be safely ignored during evaluation of the conditional distribution and its derivatives. In some cases, the resulting time savings can be quite significant, with a prime example being HB models (Gelman and Hill 2006). In HB models, the log-posterior contains one additive term per group, but during Gibbs sampling the conditional distribution of the low-level coefficient vector for a given group involves only that group's term. This is reflected in the following log-posterior functions for the coefficients of all groups:
R> hb.loglike <- function(beta.flat, X, y) {
+   beta <- matrix(beta.flat, ncol = ncol(y))
+   return (sum(sapply(1:ncol(y), function(n) {
+     xbeta <- X %*% beta[, n]
+     return (-sum((1-y[, n]) * xbeta + log(1 + exp(-xbeta))))
+   })))
+ }
R> hb.logpost <- function(beta.flat, X, y, beta0, W) {
+   return (hb.logprior(beta.flat, beta0, W) +
+     hb.loglike(beta.flat, X, y))
+ }
A naive implementation of the full PDF thus pointlessly duplicates computations, roughly by a factor of ngrp:
R> nsmp <- 10
R> set.seed(my.seed)
R> beta.flat.ini <- rep(0.0, ncol(X.exp) * ncol(y.mat.exp))
R> t.naive <- attr(MfU.Sample.Run(beta.flat.ini, hb.logpost, nsmp = nsmp
+   , X = X.exp, y = y.mat.exp
+   , beta0 = beta0.prior, W = W.prior), "t")
R> cat("hb sampling time - naive method:", t.naive, "sec\n")
hb sampling time - naive method: 53.545 sec
Note that, in the above, we have made the simplifying assumption that we know the true values of the parameters of the multivariate Gaussian prior, i.e., beta0.prior and W.prior. In reality, of course, the prior parameters must also be estimated from the data, and thus the Gibbs cycle includes not just the lower-level coefficients but also the prior parameters.
However, all the strategies discussed in this section can be conceptually illustrated while focusing only on the beta's. The first optimization strategy is to take advantage of the conditional independence property by evaluating only the relevant term during sampling of the coefficients of each group:
R> hb.loglike.grp <- function(beta, X, y) {
+   beta <- as.numeric(beta)
+   xbeta <- X %*% beta
+   return (-sum((1-y) * xbeta + log(1 + exp(-xbeta))))
+ }
R> hb.logprior.grp <- logprior
R> hb.logpost.grp <- function(beta, X, y
+   , beta0 = rep(0.0, 3), W = diag(1e+6, 3)) {
+   return (hb.logprior.grp(beta, beta0, W) + hb.loglike.grp(beta, X, y))
+ }
In the above, we have continued to use the default R random number generator for code brevity. In practice, in order to generate uncorrelated random numbers across multiple execution threads, one should use parallel RNG streams such as those provided by the R package rstream (Leydold 2015).
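The per-group sampling loop itself is lost from the extracted text; the sketch below shows the intended pattern, a Gibbs cycle over groups using the per-group log-posterior. The objects X.exp and y.mat.exp are carried over from the naive example, and the assumption of 3 coefficients per group is illustrative.

```r
set.seed(my.seed)
ngrp <- ncol(y.mat.exp)
beta.mat <- matrix(0.0, nrow = 3, ncol = ngrp)  # current state, per group
beta.arr <- array(NA, dim = c(nsmp, 3, ngrp))   # collected samples
for (i in 1:nsmp) {
  for (n in 1:ngrp) {
    ## only group n's likelihood factor is evaluated; the other groups'
    ## factors are constant in beta.mat[, n] and are safely dropped
    beta.mat[, n] <- MfU.Sample(beta.mat[, n], hb.logpost.grp
      , X = X.exp, y = y.mat.exp[, n])
  }
  beta.arr[i, , ] <- beta.mat
}
```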

High-performance PDF evaluation
For most MCMC algorithms, the majority of sampling time is spent on evaluating the log-density (and its derivatives, if needed). Efficient implementation of the functions responsible for log-density evaluation is therefore a rewarding optimization strategy, which can be combined with the strategies discussed in Section 4.2. The Rcpp (Eddelbuettel and François 2011) framework offers a convenient way to port R functions to C++. Here we use the RcppArmadillo (Eddelbuettel and Sanderson 2014) package, for its convenient matrix algebra operations, to port the log-likelihood component of the log-posterior (as it takes the majority of the time for large data, compared to the log-prior):
R> library("RcppArmadillo")
R> library("inline")
R> code <- "
While the result is a decent speedup given the relatively small effort put in, the impact is not as significant as that of the previous two strategies. It must be noted that matrix algebra operations in R are handled by the BLAS and LAPACK libraries, written in C and Fortran. Therefore, the major benefit of porting the log-likelihood function to C++ in the above example is likely to be the consolidation of data and control transfer between the interpretation layer and the computational back-end. For large problems, even parallel hardware such as Graphics Processing Units (GPUs) can be utilized by writing log-density functions in languages such as CUDA (Nickolls, Buck, Garland, and Skadron 2008), while continuing to take advantage of MfUSampler for the sampler control logic. Minimizing data movement between the processor and co-processor is a key performance factor in such cases.
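The C++ source string is truncated in the extracted text; the sketch below shows one way the ported log-likelihood might look, mirroring hb.loglike.grp via inline::cxxfunction with the RcppArmadillo plugin. This is an illustrative reconstruction, not the paper's exact kernel.

```r
library("RcppArmadillo")
library("inline")
## C++ body mirroring the pure-R hb.loglike.grp (sketch)
code <- "
  arma::mat X = Rcpp::as<arma::mat>(X_);
  arma::vec y = Rcpp::as<arma::vec>(y_);
  arma::vec beta = Rcpp::as<arma::vec>(beta_);
  arma::vec xbeta = X * beta;
  double ll = -arma::sum((1.0 - y) % xbeta + arma::log(1.0 + arma::exp(-xbeta)));
  return Rcpp::wrap(ll);
"
hb.loglike.grp.rcpp <- cxxfunction(
  signature(beta_ = "numeric", X_ = "numeric", y_ = "numeric")
  , body = code, plugin = "RcppArmadillo")
```

The compiled function can then be called from hb.logpost.grp in place of the pure-R hb.loglike.grp.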
Figure 2 summarizes the impact of the three optimization strategies discussed in this section. We see that, for this particular problem and set of parameters, the cumulative impact of the 3 optimization strategies is a xx times speedup.
In addition to the above-mentioned strategies, there are several other options available for improving performance of MCMC sampling techniques for Bayesian models. Examples include differential update, single-instruction multiple-data (SIMD) parallelization of log-likelihood calculation, and batch random number generation. For a detailed discussion of these topics, see .

Summary
The R package MfUSampler enables MCMC sampling of multivariate distributions using univariate algorithms. It relies on an extension of Gibbs sampling from univariate independent sampling to univariate Markov transitions, and on the proportionality of conditional and joint distributions. By encapsulating these two concepts in a library, MfUSampler reduces the possibility of subtle mistakes by researchers re-implementing the Gibbs sampler, and thus allows them to focus on other, more innovative aspects of their Bayesian modeling. Brute-force application of MfUSampler allows researchers to get their project off the ground, maintain full control over model specification, and utilize robust univariate samplers. This can be followed by an incremental optimization approach, taking advantage of DAG properties such as conditional independence and porting log-density functions to high-performance languages and hardware.

Figure 2: Time needed to draw 1000 samples for the HB logistic regression problem, based on the diabetic retinopathy data set introduced in Section 3.1, at various stages of optimization. Each step represents the cumulative effect of the strategies, starting with the left-most bar corresponding to the naive implementation. Numbers above bars show the speedup due to each optimization.
Note that standard Gibbs sampling is a special case of the above lemma, where T(x'_k, x_k | x_\k) = p(x'_k | x_\k). The reader can easily verify that this special transition density satisfies the premise.