Stochastic Newton Sampler: The R Package sns

.


Introduction
In most real-world applications of Monte Carlo Markov Chain (MCMC) sampling, the probability density function (PDF) being sampled is multidimensional.Univariate samplers such as the slice sampler (Neal 2003) and adaptive rejection sampler (Gilks and Wild 1992) can be embedded in the Gibbs sampling framework (Geman and Geman 1984) to sample from multivariate PDFs (Mahani and Sharabiani 2015a).Univariate samplers generally have few tuning parameters, making them ideal candidates for black-box MCMC software such as JAGS (Plummer 2013) and OpenBUGS (Thomas, O'Hara, Ligges, and Sturtz 2006).However, they become less effective as PDF dimensionality rises and dimensions become more of models, referred to as 'exchangeable regression' models (Section 2.3).

SNS proposal density
SNS proposal density is a multivariate Gaussian fitted locally to the density being sampled, using the second-order Taylor-series expansion of the log-density: where f : R K → R is the log-density, and g and H are the gradient vector and Hessian matrix for f , respectively, with dimensions K and K × K. Assuming that f is globally concave, the above approximation is equivalent to fitting the following multivariate Gaussian (which we refer to as F (x)) to the PDF: By comparing Equations 3 and 4, we see that the precision matrix is the same as negative Hessian: Σ −1 = −H(x 0 ).The mean of the fitted Gaussian maximizes its log, and therefore: We can now formally define the (multivariate Gaussian) proposal density q(.| x) as: Note that Equation 5 is simply the full Newton step (Nocedal and Wright 2006a).We can therefore think of SNS as the stochastic counterpart of Newton-Raphson (NR) optimization.
In NR optimization, we select the mean of the fitted Gaussian as the next step, while in SNS we draw a sample from the fitted Gaussian and apply MH test to accept or reject it.Also, note that in the special case where the sampled PDF is Gaussian, f (x) is quadratic and therefore the proposal function is identical to the sampled PDF.In this case A(z , z) is always equal to 1, implying an acceptance rate of 100%.
Contrary to standard Metropolis variants with multivariate Gaussians centered on current point (symmetric Gaussian proposal), SNS is an aggressive, non-local MCMC algorithm as it seeks to construct a global, Gaussian approximation of the PDF.Therein lies both its strength and its weakness.When the Gaussian approximation is sufficiently close to the true PDF, this can lead to a nearly-uncorrelated chain of samples, with the extreme case of perfectly uncorrelated samples for a multivariate Gaussian distribution.In such cases, SNS can be much more efficient than univariate and multivariate alternatives, including Metropolis sampling with symmetric Gaussian proposal.On the other hand, if the Gaussian approximation is poor, there can be little overlap between the two, resulting in low acceptance rate and inefficient sampling.
Our empirical observations indicate that, in exchangeable regression models SNS performs best when the number of observations is large compared to the number of variables.Appendix C provides theoretical support for this observation in the special case of a univariate log-density.The sns package contains the 'state space partitioning' strategy for improving mixing for SNS in high-dimensional problems (Section 3.2).Performance of SNS vis-a-vis other sampling techniques under various data regimes is further illustrated in Section 5.

Log-density concavity
Constructing a valid SNS proposal function (Eq.6) requires the sampled log-density to be twice-differentiable and concave.Equivalently, the Hessian of log-density must exist and be negative definite.Many common probability distributions enjoy such property, perhaps with suitable transformations of parameters.We refer the reader to Table 2 of Gilks and Wild (1992) for a list of distributions and parameters that enjoy log-concavity.For Bayesian problems where log-density is a sum of contributions from log-prior and log-likelihood terms, it is sufficient for each term to be concave, since this property is additive.
An important application of the Bayesian framework is exchangeable regression models (Good 2002), where 1) log-likelihood is the sum of contributions from individual data points, and 2) one or more parameters of a probability distribution are assumed to be (possibly nonlinear functions of) the inner product of a vector covariates and a vector of coefficients, often referred to as 'linear predictors'.In such cases, log-density can be generically written in the following form: where < a, b > is the inner product of vectors a and b, x 1 n , . . ., x J n are the covariate vectors 1, . . ., J for the n'th observation, and β 1 , . . ., β J are the vectors of coefficients corresponding to J slots in the probability distributions f n (u 1 , . . ., u J ), n = 1, . . ., N .Typically, J equals 1 or 2, with a notable exception being the multinomial distribution, used in multi-class logistic regression (Hasan, Zhiyu, and Mahani 2014), where J equals the number of classes.
A convenient property of the functional form in Eq. 7 is that the definiteness property of its Hessian as a function of the high-dimensional state space of β: can be reduced to a much simpler problem, i.e. definiteness of the J-dimensional functions f n is Eq. 7, for n = 1, . . ., N .This notion is formally captured in the following theorem: Theorem 1.If all f n 's in Equation 7have negative-definite Hessians AND if at least one of J matrices X j ≡ (x j 1 , . . ., x j N ) t is full rank, then L(β 1 , . . ., β J ) also has a negative-definite Hessian.
Proof is provided in Appendix B. Reasoning about definiteness of a matrix in a one-or two-dimensional space is much easier than doing so in a high-dimensional space, with dimensionality being the sum of lengths, K j , of coefficients or j K j .In a typical problem, this can sum be as large as 50 or 100.Theorem 1, on the other hand, means that for single-slot base distributions f n (u), we must simply prove that f n (u) < 0, ∀n.For multi-slot distributions, we can use Silvester's criterion for negative-definiteness, requiring that leading principal minors alternate between positive and negative (Gilbert 1991).In particular, for two-slot distributions, we require the diagonal elements to be negative while the determinant of the Hessian must be positive.It must be noted that Theorem 1 offers a sufficient, but not necessary, set of conditions.It is therefore possible for a log-density of the form shown in Eq. 7 to be negative-definite without satisfying the set of conditions stated in the above theorem.The log-density for heteroskedastic linear regression discussed in Section 4.4 is potentially one such example.
In some cases, while the Hessian cannot be proven to be negative-definite, yet blocks within the full Hessian have this property.In such cases, SNS can be combined with other, more generic sampling algorithms such as slice sampler or HMC, all embedded in a Gibbs cycle.The implementation would be similar to that of state space partitioning approach in sns (Section 3.2), but replacing SNS with alternative samplers for Hessian subsets (or blocks) that do not satisfy negative-definiteness.An example of this strategy, as well as the application of Theorem 1 is provided in in Section 4.4.
In addition to the theoeretical support provided in Theorem 1, sns also contains a utility function for empirical assessment of negative-definiteness for Hessian blocks.This feature is described in Section 3.5, and illustrated in Section 4.4.

Calculation of gradient and Hessian
Deriving the expressions for the gradient vector and the Hessian matrix, and implementing them as computer programs, can be tedious.For exchangeable regression models, we can apply the chain rule of derivatives to log-density of Eq. 7 to express the high-dimensional gradient and Hessian in terms of their low-dimensional counterparts for f n 's.The first derivative of log-likelihood can be written as: where ). (10) For second derivatives we have: where we have defined H(β) in terms of J 2 matrix blocks: Applying the chain rule to the log-likelihood function of Equation 7, we derive expressions for its first and second derivatives as a function of the derivatives of the base functions f 1 , . . ., f N : with and Similarly, for the second derivative we have: where h jj is a diagonal matrix of size N with n'th diagonal element defined as: We refer to the matrix form of the Equations 13 and 16 as 'compact' forms, and the explicitsum forms as 'explicit' forms.The expander functions in the R package RegressionFactory (Mahani and Sharabiani 2015b) use the compact form to implement the high-dimensional gradient and Hessian, while proof of Theorem 1 utilizes the explicit-sum form of Equation 16(see Appendix B).
In addition to the expander framework discussed above, users can also utilize the numerical differentiation facility of sns to circumvent the need for deriving analytical expressions for gradient and Hessian.This is explained in Section 3.5, and illustrated using an example in Section 4.4.

Software implementation and features
This section begins with an overview of functions included in sns.This is followed by a deepdive into facilities for improving convergence and mixing of MCMC chains in SNS, diagnostic tools, and full Bayesian prediction.A discussion of capabilities for calculating and validatig log-density derivatives concludes this section.,

Overview
The workhorse of sns package is the sns function, responsible for implementation of MH algorithm using the multivariate Gaussian proposal density described in Section 2.2.sns implements the following steps: 1. Evaluate the log-density function and its gradient and Hessian at x old : f old , g old , H old .
2. Construct the multivariate Gaussian proposal function at q(.|x old ) using Equation 6 and x = x old .
4. Evaluate the log-density function and its gradient and Hessian at x prop : f prop , g prop , H prop .
5. Construct the multivariate Gaussian proposal function at q(.|x prop ) using Equation 6and x = x prop , and evaluate logq old = log(q(x old |x prop )).
7. If r ≥ 1 accept x prop : x new ← x prop .Else, draw a random deviate s from a uniform distribution over [0, 1).If s < r, then accept x prop : x new ← x prop , else reject x prop : Fitting the multivariate Gaussian in steps 2 and 5 is done via calls to the private function fitGaussian in the sns package.We use the functions dmvnorm and rmvnorm from package mvtnorm to calculate the log-density of, and draw samples from, multivariate Gaussian proposal functions.
There are three important arguments in sns: rnd, part, and numderiv.rnd controls whether the algorithm should run in stochastic or MCMC mode (which is the default choice), or in non-stochastic or Newton-Raphson (NR) mode.part controls the state space partitioning strategy.These two arguments and their roles are described in Section 3.2.numderiv determines whether numerical differentiation must be used for calculating the gradient and/or Hessian of log-density, and is discussed in Section 3.5.
sns.run is a wrapper around sns, and offers the following functionalities: 1. Convenience of generating multiple samples via repeated calls to sns.After the first call, the Gaussian fit object (attached as attribute gfit in the returned value from sns) is fed back to sns via argument gfit, in order to avoid unnecessary fitting of the proposal function at current value.
2. Collecting diagnostic information such as log-probability time series, acceptance rate, relative deviation from quadratic approximation (time series), and components of MH test.These diagnostic measures are discussed in Section 3.3, and their use is illustrated via examples in Section 4.
Note that, when SNS is part of a Gibbs cycle and used in conjunction with other samplers, sns.run cannot be used since the conditional distribution that must be sampled from changes in each iteration.
The generic methods summary.sns,plot.sns and predict.snsprovide diagnostic, visualization, and prediction capabilities.They are discussed in Sections 3.3 and 3.4, and illustrated via examples in Section 4.

Improving convergence and mixing
NR mode: Far from the distribution mode, the local multivariate Gaussian fit can be severely different from the PDF, leading to small overlap between the two, low acceptance rate and hence bad convergence.This can be overcome by spending the first few iterations in nonstochastic or NR mode, where instead of drawing from the proposal function we simply accept its mean as the next step.Rather than taking a full Newton step, we have implemented line search (Nocedal and Wright 2006b) to ensure convergence to the PDF maximum.To use sns in NR mode, users can set the argument rnd to FALSE.In NR mode, each iteration is guaranteed to increase the log-density.Using the NR mode during the initial burn-in phase is illustrated in Section 4. In sns.run, the argument nnr controls how many initial iterations will be performed in NR mode.
State space partitioning: Even when near the PDF maximum, the fitted Gaussian can be severely different from the PDF.This can happen if the PDF has a significant third derivative, a phenomenon that we have observed for high-dimensional problems, especially when the number of observations is small.To improve bad mixing in high dimensions, we use a strategy which we refer to as 'state space partitioning', where state space is partitioned into disjoint subsets and SNS is applied within each subset, wrapped in a Gibbs cycle.This functionality is available via the part argument, which is a list containing the set of state space dimensions belonging to each subset.Convenience functions sns.make.partand sns.check.partallow users to easily create partition lists and check their validity, respectively.

Diagnostics
sns includes a rich set of diagnostics which can be accessed via functions summary.snsand plot.sns.Some of these are generic measures applicable to all MCMC chains, some are specific to MH-based MCMC algorithms, and some are even further specialized for SNS as a particular flavor of MH.Where possible, we have used the library coda and adopted its naming conventions, but opted to create and maintain an independent set of functions due to their specialized and extended nature.
MCMC diagnostics: In summary.sns,we calculate the usual MCMC chain summaries including mean, standard deviation, quantiles, and effective sample size.We also calculate a sample-based p-value for each coordinate.In plot.snswe have log-density trace plot, state vector trace plots, effective sample size by coordinate, state vector histograms, and state vector autocorrelation plots.
MH diagnostics: In summary.sns,we calculate the acceptance rate of MH transition proposals.If mh.diag flag is set to TRUE, all 4 components of the MH test (log.p,log.p.prop, log.q and log.q.prop) for all iterations are returned as well.
SNS diagnostics: In summary.sns,we return reldev.mean(if sns.run was called with mh.diag set to TRUE), defined as the average relative deviation of log-density change (with respect to PDF maximum) from quadratic approximation (also constructed at PDF maximum).The location of PDF maximum is extracted from the Gaussian fit in the last iteration under NR mode.The higher this value, the more likely it is for the SNS to exhibit bad mixing.This is illustrated in Section 4. For reldev.mean to be valid, the user must ensure that the value of the argument nnr supplied to sns.run is sufficiently high to achieve convergence by the end of NR phase.Note that this is generally unfeasible if SNS is part of a Gibbs cycle, and used in conjunction with other samplers, as in this case the conditional distribution that SNS samples from constantly from one iteration to the next.

Full Bayesian prediction
The function predict.snsallows for full Bayesian prediction, using a sample-based representation of predictive posterior distribution.It accepts an arbitrary function of state vector as argument fpred, and applies the function across all samples of state vector, supplied in the first argument, which must be an output of sns.run.The core philosophy in full Bayesian prediction is to postpone summarization of samples until the last step.For example, rather than supplying the expected values of coefficients into a function, we supply the samples and take the expected value after applying the function.Following this proper approach is important because: 1. Mathematically, an arbitrary function is not commutable with the expected value operator.Therefore, applying expected value early produces incorrect results.
2. For a similar reason, confidence intervals cannot be propagated through arbitrary functions.Therefore, correct uncertainty measurement also requires a full Bayesian approach.
Note that fpred can be deterministic or stochastic.An example of the latter case is when we need to generate samples from the predictive posterior distribution, and fpred generates sample(s) from the conditional distribution p(ŷ|θ), where ŷ is a new observation (for which we are seeking prediction) and θ represents the model parameters (can be a vector), for which have used SNS to generate samples.This can be considered a special case of ancestral sampling (Bishop 2006), applied to the following:

Calculation and validation of log-density derivatives
SNS requires log-density to be twice-differentiable and concave.The sns package offers a numerical differentiation capability to circumvent the need to mathematically compute the log-density gradient and Hessian, and to help users determine whether the log-density (likely) satifies the twice-differentiability and concavity requirements.
The functions grad and hessian from numDeriv package (Gilbert and Varadhan 2012) are used for numerical differentiation.Arguments numderiv, numderiv.methodand numderiv.args in the sns and sns.run functions control the usage and parameters of these functions.Users have three choices with regards to log-density derivatives: 1. fghEval returns a list with elements f, g, and h for log-density, its gradient and its Hessian, respectively.In this case, users must leave numderiv at the default of 0 (no numerical differentiation).
2. fghEval returns a list with elements f (log-density) and g (gradient), and Hessian is found through numerical differentiation by setting numderiv to 1.
3. fghEval returns the log-density only (no list), and both gradient and Hessian are calculated numerically by setting numDeriv to 2.
A slightly more efficient approach is to do 'numerical augmentation' of the log-density once, outside sns, via the utility function sns.fghEval.numaug.Section 4.4 contains an example of using numerical derivatives with SNS.

Using sns
In this section, we illustrate the core functionalities of sns as well as some of its advanced features via 4 examples.Below is an overview of objectives in each example: Setup details such as platform, OS, and R and package versions can be found in Appendix A.
To maximize the reproducibility of results for each code segment, we repeatedly fix random seeds throughout the paper (using set.seed()function).

Example 1: Multivariate Gaussian
First, we launch an R session and load the sns package as well mvtnorm (used for evaluating the multivariate Gaussia log-density in logdensity.mvg).We also select a seed for rnadom number generator: R> library("sns") R> library("mvtnorm") R> my.seed <-0 Using sns to sample from a multivariate Gaussian is a contrived, but pedagogical, example.Since log-density for a multivariate Gaussian is quadratic, its second-order Taylor series expansion is not approximate but exact.In other words, the proposal function becomes location-independent, and equal to the sampled distribution.This means that 1) the MH test is always accepted, and 2) consecutive samples are completely independent, and hence the resulting chain is no longer Markovian.Of course, since we know how to sample from multivariate Gaussian proposal functions, we might as well directly sample from the original multivariate Gaussian distribution.(Hence, the pedagogical nature of this example.)To utilize sns, we must first implement the log-density and its gradient and Hessian: R> logdensity.mvg<-function(x, mu, isigsq) { + f <-dmvnorm(x = as.numeric(x),+ mean = mu, sigma = solve(isigsq), log = TRUE) + g <--isigsq %*% (x -mu) We now draw 500 samples from this log-desity, using pre-specified values for mu (mean vector) and isigsq (inverse of the covariance matrix, or precision matrix) in a 3-dimensional state space: R> set.seed(my.seed)R> K <-3 R> mu <-runif(K, min = -0.5, max = +0.5)R> isigsq <-matrix(runif(K*K, min = 0.1, max = 0.2), ncol = K) R> isigsq <-0.5*(isigsq + t(isigsq)) R> diag(isigsq) <-rep(0.5,K) R> x.init <-rep(0.0,K) R> x.smp <-sns.run(x.init,logdensity.mvg,niter = 500, + mh.diag = TRUE, mu = mu, isigsq = isigsq) Next, we use the summary.snsfunction to view some of the diagnostics: R> summary(x.smp)As expected, the acceptance rate is 100%, and there is no deviation from quadratic approximation, for SNS sampling of a multivariate Gaussian.
In real-world applications, the Gaussian proposal function is only an approximation for the sampled distribution (since log-density is not quadratic), creating the Markovian dependency and less-than-perfect acceptance rate.We study one such example next.

Example 2: Bayesian Poisson regression
Generalized Linear Models (GLMs) (Nelder and Baker 1972) are an important family of statistical models with applications such as risk analysis (Sobehart et al. 2000), public health (Sharabiani et al. 2011) and political science (Gelman and Hill 2007).GLMs can be extended to incorporate data sparseness and heterogeneity via the Hierarchical Bayesian framework (Rossi et al. 2005) or to account for repeated measurements and longitudinal data via Generalized Linear Mixed Model (McCulloch 2006).With properly-chosen link functions, many GLMs are known -or can be easily proven -to have globally-concave log-densities with negative-definite Hessian matrices (Gilks and Wild 1992).As such, GLMs are excellent candidates for SNS.Embedded in Bayesian frameworks, they continue to enjoy log-concavity assuming the same property holds for prior terms, according to the Bayes' rule and the invariance of concavity under function addition.
Next, we want to predict the response variable in Poisson regression model, given new values of the explanatory variables.We distinguish between two types of prediction: 1) predicting mean response, 2) generating samples from posterior predictive distribution.We illustrate how to do both using the predict.snsfunction.We begin by implementing the mean prediction function, which simply applies the inverse link function (exponential here) to the linear predictor.(For better comparison between the two prediction modes, we increase number of samples to 1000.) R> beta.smp<-sns.run(beta.init,loglike.poisson,+ niter = 1000, nnr = 20, mh.diag = TRUE, X = X, y = y) R> predmean.poisson<-function(beta, Xnew) exp(Xnew %*% beta) The following single line performs sample-based prediction of mean response (using X in lieu of Xnew for code brevity): R> ymean.new<-predict(beta.smp,predmean.poisson,+ nburnin = 100, Xnew = X) ymean.new is a matrix of N (1000) rows and niter -nburnin (900) columns.Each row corresponds to an observation (one row of Xnew), and each column corresponds to a prediction sample (one row of beta.smp after burn-in).
We can also generate samples from posterior predictive distribution as follows: R> predsmp.In the limit of infinite samples, the mean predictions from the two methods will be equal, and they are quite close based on 900 samples above.However, standard deviation of predictions is much larger for predsmp.poissoncompared to predmean.poisson,as the former combines the uncertainty of coefficient values (represented in the sd values for beta's) with the uncertainty of samples from Poisson distribution around the mean, i.e., the sd of Poisson distribution.Also note that, as expected, quantiles for predsmp.poissonare discrete since predictions are discrete, while the quantiles for predmean.poissonare continuous as predictions are continuous in this case.
As mentioned in Section 2.2, there are two important cases where the Gaussian proposal function used in SNS deviates significantly from the underlying PDF, resulting in poor mixing of the MCMC chain.First, when sample size is small, and second when state space dimensionlity is high.The first scenario is addressed in performance benchmarking experiments of Section 5. Below we focus on SNS for high-dimensional PDFs.

Example 3: High-dimensional Bayesian Poisson regression
As discussed in Section 2.2, the multivariate Gaussian proposal in SNS tends to lose accuracy as state space dimensionality rises, leading to poor mixing of the resulting MCMC chain.Continuing with the Poisson regression example, we increase K from 5 to 100, while holding N = 1000.To illustrate that the problem is not covergence but mixing, we explicitly use the glm estimate (mode of PDF) as the initial value for the MCMC chain: R> set.seed(my.seed)R> K <-100 R> X <-matrix(runif(N * K, -0.5, +0.5), ncol = K) R> beta <-runif(K, -0.5, +0.5) R> y <-rpois(N, exp(X %*% beta)) R> beta.init<-glm(y ~X -1, family = "poisson")$coefficients R> beta.smp<-sns.run(beta.init,loglike.poisson,+ niter = 100, nnr = 10, mh.diag = TRUE, X = X, y = y) R> summary(beta.smp)We see a significant drop in acceptance rate as well as effective sample sizes for the coefficients.Also note that mean relative deviation from quadratic approximation is much larger than the value for K = 5.To improve mixing, we use the 'state space partitioning' strategy of sns, available via the part argument of sns and sns.run.This leads to SNS sampling of subsets of state space wrapped in Gibbs cycles, with each subset being potentially much lower-dimensional than the original, full space.This strategy can significantly improve mixing.Below we use the function sns.make.part to partition the 100-dimensional state space into 10 subsets, each 10-dimensional: R> beta.smp.part<-sns.run(beta.init,loglike.poisson,+ niter = 100, nnr = 10, mh.diag = TRUE, + part = sns.make.part(K,10), X = X, y = y) R> summary(beta.smp.part)Notice the improved acceptance rate as well as effective sample sizes.A comparison of logprobability trace plots confirms better mixing after convergence to PDF mode (see Figure 2).

Example 4: Heteroskedastic linear regression
For our last example, we study MCMC sampling of log-likelihood for heteroskedastic linear regression, where both mean and variance of probability distribution for response are dependent on a set of covariates.We cover several advanced topics in this examples: 1) application of invariance theorem (Section 2.3), 2) rapid construction of analytical functions for gradient and Hessian, 3) validation of log-density, 4) numerical calculation of derivatives, and 5) combining SNS with other samplers in Gibbs framework.The model can be formally stated as: where N (µ, σ 2 ) is the normal distribution with mean µ and variance σ 2 , x i and z i are vectors of covariates (explaining the mean and variance, respectively) for the i'th observation, y i is the response variable for the i'th observation, and β and γ are vectors of coefficients for mean and variance.From Eq. 19, we can construct the following log-likelihood function:  and Hessian are finite for all 100 points evaluated, and 3) log-density Hessian (and its specified blocks) is negative definite for all points evaluated.Since the log-density for this problem falls under the 'exchangeable regression' category, the framework described in Section 2.3 can be applied for assessing concavity (Theorem 1) and constructing the derivatives (Eqn.13 and The log-density defined by loglike.linreg.hetcan be translated into the following expression for contribution of each data point towards log-density (ignoring constant terms): where µ and η are the linear predictors which assume values x t i β and z t i γ, respectively, for observation i.According to Theorem 1, a sufficient condition for concavity of the log-density is that the Hessian of f i 's are negative-definite, for i = 1, . . ., N , where N is the number of observations.Differentiating Eq. 20 readily leads to the following expression for the Hessian: While the diagonal elements of h i are both negative, indicating that β and γ blocks are negative-definite, but the full Hessian cannot be proven to have the same property: According to Silvester's criterion for negative-definiteness, the second leading principal minor must be positive ((−1) 2 = 1).From Eq. 20, however, we see that The above result does not contradict the output of sns.check.logdensity:On the one hand, Theorem 1 provides only a sufficient set of conditions.On the other hand, the emipirical checks performed by the software are limited to a finite set of points in the state space, and thus do not lead to any guarantees of negative definiteness everywhere.
While we can certainly try to sample from the full log-density using SNS, a safer option is to break down the full state space into subspaces corresponding to β and γ, apply SNS to each subspace, and wrap these two steps in a Gibbs cycle.This will work because we have theoretical guarantee that the subspace Hessians are both negative-definite, as indicated by the diagonal elements of h i 's.To construct the derivatives for these two state subspaces, we can use the RegressionFactory expander framework in two ways: First, we can supply the base function fbase2.gaussian.identity.logto regfac.expand.2parand extract the subspace components from its output.This is easy to implement but is computationally inefficient since gradient and Hessian for the full state space are calculated during sampling from each subspace.Secondly, we can implement base functions for each subspace from scratch.For brevity of presentation, we opt for the first approach here: R> loglike.linreg.het.beta<-function(beta, gamma, X, Z, y) { + K1 <-length(beta) + ret <-regfac.expand.2par(c(beta,gamma), X, Z, y + , fbase2 = fbase2.gaussian.identity.log)+ return (list(f = ret$f, g = ret$g[1:K1], h = ret$h[1:K1, 1:K1])) SNS with other samplers that do not require log-concavity.This is illustrated next, where we use slice sampler for γ subspace, using the MfUSampler package (Mahani and Sharabiani 2015a).

Performance benchmarking
For log-densities with negative-definite Hessian, SNS can be an efficient MCMC algorithm.As mentioned in Section 2.3, many members of the exponential family belong in this category.
In this section, we present some benchmarking results that help clarify the strengths and weaknesses of SNS vis-a-vis other sampling techniques, and thus provide empirical guidance for practitioners in choosing the right sampling algorithm for their problem.
In Table 1, we compare SNS against three alternative sampling algorithms: 1) univariate slice sampler with stepout and shrinkage (Neal 2003) or 'slicer', 2) adaptive rejection sampling (Gilks and Wild 1992) or 'ARS', which is another univariate sampler applicable to log-concave distributions, and 3) adaptive Metropolis (Roberts and Rosenthal 2009) (Thompson 2010).All tuning parameters of samplers are at their default values in their respective packages.
As discussed before, we expect SNS performance to deteriorate for small data sizes where deviation of sampled distribution from a Gaussian approximation is more significant.This point is illustrated in Figure 3 (left panels), where the performance of SNS, slicer, and AM are compared over a range of values for N (K = 10 and no correlation).We see that, while for N > 200, SNS has a clear advantage, slicer and AM tend to perform better for smaller data sets.Theoretical support for the univariate case is provided in Appendix C.
We had also pointed out that a weakness of univariate samplers such as slicer is when the sampled distribution exhibits strong correlations.This can be induced indirectly by increasing correlation in the matrix of covariates.As seen in Figure 3 (right panels), SNS performs rather steadily with increased correlation, while the performance of slicer and AM -particularly slicer -degrade as correlation increases.
Based on our benchmarking results, we recommend that SNS be considered as a primary MCMC sampling technique for log-concave distributions, moderate to large data, and particularly when significant correlation among covariates is suspected.

Discussion
In this paper we presented sns, an R package for Stochastic Newton Sampling of twicedifferentiable, log-concave PDFs, where a multivariate Gaussian resulting from second-order Taylor series expansion of log-density is used as proposal function in a Metropolis-Hastings framework.Using an initial non-stochastic mode, equivalent to Newton-Raphson optimization with line search, allows the chain to rapidly converge to high-density areas, while 'state space partitioning', Gibbs sampling of full state space wrapping SNS sampling of lower-dimensional blocks, allows SNS to overcome mixing problems while sampling from high-dimensional PDFs.
There are several opportunities for further research and development, which we discuss below.Beyond twice-differentiability and log-concavity: Current version of SNS requires the log-density to be twice-differentiable and concave.In many real-world application, the posterior PDF does not have a negative-definite Hessian, or it cannot be proven to have such a property.In such cases, SNS would not be applicable to the entire PDF (but perhaps to subspaces, as illustrated in Example 4).A potential solution is discussed in Geweke and Tanizaki (2001) in the context of state-space models, where non-concave cases are handled by utilizing exponential and uniform proposal distributions.Convergence and mixing properties of such extensions to general cases must be carefully studied.
An important situation where twice-differentiability is violated is in the presence of boundary conditions.The current SNS algorithm assumes unconstrained state space.One way to deal with constrained subspaces is, again, to mix and match SNS and other samplers within Gibbs framework.For example, the slice sampler algorithm implemented in MfUSampler is capable of dealing with boxed constraints.It can therefore be assigned to subspaces with constraints, and the unconstrained subspaces can be handled by SNS.Further research is needed in order to relax twice-differentiability and log-concavity requirements for SNS.
Implementation Efficiency: The majority of time during typical applications of MCMC sampling is spent on evaluating the log-density and its derivatives.In particular, for Stochastic Newton Sampler, computing the Hessian matrix is often the most time-consuming step, which is O(N × K 2 ) where N is the number of observations and K is the number of attributes in a statistical model with i.i.d observations.In sns function, this step is performed by user-supplied function fghEval.On the other hand, the sampling algorithm itself is computationally dominated by O(K 3 ) steps such as Cholesky factorization of the Hessian (inside fitGaussian or rmvnorm).As long as K is much smaller than N , therefore, the end-to-end process is dominated by fghEval.A notable exception will be applications with 'wide data' such as those encountered in DNA microarray analysis.Therefore, the best way to improve the efficiency of SNS is for the user to supply a high-performance fghEval function to sns, e.g., through compiled code, parallelization, etc.
Computational implications of subset size during state space partitioning require more attention as well.In particular, the current implementation in sns calls the same underlying, full-space fghEval function.This means within each subset, full Hessian is calculated, which is computationally wasteful.This naive approach neutralizes a potential computational advantage of state space partitioning, i.e., reducing the cost of Hessian evaluation which goes up quadratically with state space dimensionality.A more sophisticated approach would require the user to supply a function and derivative evaluation routine that produces the derivatives over state subspaces.Here, the tradeoff is between computational efficiency vs. a more complex API for function evaluation, whose implementation for each problem would be the responsibility of the user.
Also, more comprehensive analysis -theoretical and empirical -is needed to determine when/how to partition the state space.The correlation structure of the state space might need to be taken into account while assigning coordinates to subsets, as well as data dimensions such as number of observations.Our preliminary, uni-dimensional analysis shows that mixing improves with N 1/2 , where N is the number of independently-sampled observations in a regression problem.
and use h n to denote the J-by-J Hessian of f n : h n ≡ [h jj n ] j,j =1,...,J , we can write: Since all h n 's are assumed to be negative definite, all q t n h n q n terms must be non-positive.Therefore, p t Hp can be non-negative only if all its terms are zero, which is possible only if all q n 's are zero vectors.This, in turn, means we must have p j,t x j n = 0, ∀ n, j.In other words, we must have X j p j = ∅, ∀ j.This means that all X j 's have non-singleton nullspaces and therefore cannot be full-rank, which contradicts our assumption.Therefore, p T Hp must be negative.This proves that H is negative definite.

C. SNS mixing and number of observations
Efficient mixing in SNS depends on absence of large changes to the mean and std of the proposal function (relative to the std calculated at mode) as the chain moves within a few std's of the mode.To quantify this, we begin with the Taylor series expansion of log-density around its mode µ 0 (where f (µ 0 ) = 0), this time keeping the third-order term as well (and restricting our analysis to univariate case): τ 0 ≡ −f (µ 0 ) (31) Applying the Newton step to the above formula, we can arrive at mean µ(x) and precision τ (x) of the fitted Gaussian at x: We now form the following dimensionless ratios, which we need to be much smaller than 1 in order to have good mixing for MH-MGT:

Figure 2 :
Figure 2: Comparison of log-density trace plots for Bayesian Poisson regression with N = 1000 and K = 100, without (left) and with (right) state space partitioning using 10 subsets.

Figure 3 :
Figure3: Impact of data size (left panels) and covariate correlation (right panels) on performance of SNS, slicer, and AM for binomial, Poisson, and exponential regression log-densities.Covariate vectors, forming rows of the covariate matrix (X), are sampled from multivariate Gaussian dsitribution with .
|x − µ 0 | .τ 0 ∼ O(1).Some algebra shows that both these conditions are equivalent to:η 0 ≡ |κ 0 | .τ 1. Example 1: basic use of sns.run and summary.sns,ideal behavior of SNS for multi- Finally, we illustrate the use of numerical differentiation capabilities in sns, using the utility function sns.check.logdensity.We switch back to using SNS with γ subspace, but use loglike.linreg.het.gamma.fonly,and compute gradient and Hessian numerically, rather than analytically.Rather than requesting log-density augmentation with numerical derivatives to be performed once every iteration inside sns, we perform this once outside the loop, using sns.fghEval.numaug:

Table 1 :
or 'AM' using a multivariate Gaussian mixture proposal function.Performance is measured in 'independent samples per second', which is simply effective sample size divided by CPU time.We see that, for all three distributions studied -binomial, Poisson, and exponential -SNS achieves the best performance among all samplers, for this set of parameters: 1000 observations, 10 variables, and negligible correlation structure in the design matrix.Performance comparison of SNS against 3 other sampling techniques, using different probability distributions in a generalized linear model.In all cases, number of observations (N ) is 1000, number of variables (K) is 10, and 10,000 samples were collected.SNS = stochastic Newton sampler; Slicer = univariate slice sampler with stepout and shrinkage; ARS = adaptive rejection sampling; AM = adaptive Metropolis.R package MfUSampler was used for Slicer and ARS, while SamplerCompare was used for AM.Effective sample size (ESS) was calculated using the 'Initial Sequence Estimators' method