BASS: An R Package for Fitting and Performing Sensitivity Analysis of Bayesian Adaptive Spline Surfaces

We present the R package BASS as a tool for nonparametric regression. The primary focus of the package is fitting fully Bayesian adaptive spline surface (BASS) models and performing global sensitivity analyses of these models. The BASS framework is similar to that of Bayesian multivariate adaptive regression splines (BMARS) from Denison, Mallick, and Smith (1998), but with many added features. The software is built to efficiently handle significant amounts of data with many continuous or categorical predictors and with functional response. Under our Bayesian framework, most priors are automatic but can be modified by the user to focus on parsimony and the avoidance of overfitting. If directed to do so, the software uses parallel tempering to improve the reversible jump Markov chain Monte Carlo (RJMCMC) methods used to perform inference. We discuss the implementation of these features and present the performance of BASS in a number of analyses of simulated and real data.


Introduction
The purpose of the R (R Core Team 2020) package BASS (Francom 2020) is to provide an easy-to-use implementation of Bayesian adaptive spline models for nonparametric regression. It provides a combination of flexibility, scalability, interpretability and probabilistic accuracy that can be difficult to find in other nonparametric regression software packages. The model form is flexible enough to capture local features that may be present in the data. It is scalable to moderately large datasets in both the number of predictors and the number of observations. It performs automatic variable selection. It can build nonparametric functional regression models and incorporate categorical predictors. The package can partition the variability of a resultant model using a nonlinear analysis of variance (ANOVA) decomposition, providing valuable interpretation of the predictors. The Bayesian approach allows for model estimates and predictions that can be evaluated probabilistically. The package is protected under the GNU General Public License, version 3 (GPL-3), and is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=BASS.
The BASS framework builds on multivariate adaptive regression splines (MARS) from Friedman (1991b). Well-developed software implementations of the MARS model are available in the R packages earth (Milborrow 2019), polspline (Kooperberg 2019) and mda (Hastie and Tibshirani 2017). The Bayesian version of MARS (BMARS) was first developed in Denison et al. (1998). A MATLAB (The MathWorks Inc. 2019) implementation of BMARS is available from the software website accompanying Denison, Holmes, Mallick, and Smith (2002).
Our implementation is more similar to the BMARS implementation, though with some substantial changes to methodology as described in Francom, Sansó, Kupresanin, and Johannesson (2018) and Francom, Sansó, Bulaevskaya, Lucas, and Simpson (2019). The primary motivation for developing this software was building surrogate models (or emulators) for complex and computationally expensive simulators (or computer models). In particular, we wanted to build a fast and accurate surrogate model to use for uncertainty quantification in the presence of a large number of simulations where each simulation had functional output. Attributing the variance in the response of the surrogate to different combinations of predictors, a practice known as global sensitivity analysis, is a valuable tool for determining which predictors and interactions are important. One of the major benefits of polynomial spline surrogate models is that sensitivity analysis can be done analytically. The BASS package has this functionality for scalar and functional response models with both continuous and categorical inputs.
There are a number of other R packages that use splines, such as crs (Racine and Nie 2018), gss (Gu 2014), mgcv (Wood 2017, 2019) and R2BayesX (Umlauf, Adler, Kneib, Lang, and Zeileis 2015; Belitz, Brezger, Kneib, Lang, and Umlauf 2017), the latter two of which include possible Bayesian inference methods. These packages allow (or require) the user to specify which variables are allowed to interact in what way, as well as which variables are allowed to have nonlinear main effects. The crs package is more similar to the packages that fit MARS models in that it can learn the structure of the model from the data. These packages report a single best model. BASS reports an ensemble of models (posterior draws from the model space) that can be used to make probabilistic predictions. In this way, it is more similar to Bayesian nonparametric regression packages like BART (McCulloch, Sparapani, Gramacy, Spanbauer, and Pratola 2019) and tgp (Gramacy and Taddy 2010).
We introduce the package as follows. In Section 2, we describe the modeling framework, including our methods for posterior sampling, modeling functional responses, and incorporating categorical inputs. In Section 3, we describe the sensitivity analysis methods. Then, in Section 4, we walk through six examples of how to use the package. These are done with knitr (Xie 2015) in order to be reproducible by the reader. Finally, in Section 5, we present a summary of the package capabilities.

Bayesian adaptive spline surfaces
The BASS model, like the MARS and BMARS models, uses data to learn a set of data-dependent basis functions that are tensor products of polynomial splines. The number of basis functions as well as the knots and variables used in each basis are chosen adaptively. The BMARS approach uses reversible jump Markov chain Monte Carlo (RJMCMC; Green 1995) to sample the posterior. The BASS adaptation of BMARS includes the improvements of Nott, Kuk, and Duc (2005) for more efficient posterior sampling as well as parallel tempering for better posterior exploration. BASS also efficiently handles functional responses and allows for categorical variables. We discuss each of these aspects below. First, we introduce the BASS model and priors.
Let $y_i$ denote the dependent variable and $\mathbf{x}_i$ denote a vector of $p$ independent variables, with $i = 1, \ldots, n$. Without loss of generality, let each independent variable be scaled to be between zero and one. We model $y_i$ as

$$y_i = a_0 + \sum_{m=1}^{M} a_m B_m(\mathbf{x}_i) + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \tag{1}$$

$$B_m(\mathbf{x}) = \prod_{k=1}^{K_m} g_{km} \, [s_{km}(x_{v_{km}} - t_{km})]_+^{\alpha}, \tag{2}$$

where $s_{km} \in \{-1, 1\}$ is referred to as a sign, $t_{km} \in [0, 1]$ is a knot, $v_{km}$ selects a variable, $K_m$ is the degree of interaction and $g_{km} = [(s_{km} + 1)/2 - s_{km} t_{km}]^{-\alpha}$ is a constant that makes the basis function have a maximum of one. The function $[\cdot]_+$ is defined as $\max(0, \cdot)$. The power $\alpha$ determines the degree of the polynomial splines. We allow for no repeats in $v_{1m}, \ldots, v_{K_m m}$, meaning that a variable can be used only once in each basis function. $M$ is the number of basis functions, and $\mathbf{a}$ is the $(M + 1)$-vector of basis coefficients (including the intercept). The only difference between this setup and that of MARS and BMARS is the inclusion of the constant $g_{km}$ in each element of the tensor product. This normalizes the basis functions so that the basis coefficients $a_1, \ldots, a_M$ are on the same scale, making computations more stable.
In the course of fitting this model, we seek to estimate $\theta = \{\sigma^2, M, \mathbf{a}, \mathbf{K}, \mathbf{s}, \mathbf{t}, \mathbf{v}\}$, where $\mathbf{K}$ is the $M$-vector of interaction degrees, $\mathbf{s}$ is the vector of signs $\{\{s_{km}\}_{k=1}^{K_m}\}_{m=1}^{M}$, $\mathbf{t}$ the vector of knots and $\mathbf{v}$ the vector of variables used (with $\mathbf{t}$ and $\mathbf{v}$ defined similarly to $\mathbf{s}$). Under the Bayesian formulation, we specify a prior distribution for $\theta$.
First, consider the priors for $\sigma^2$ and $\mathbf{a}$. Let $B$ be the $n \times (M + 1)$ matrix of basis functions (including the intercept). Then we use Zellner's $g$-prior (Liang, Paulo, Molina, Clyde, and Berger 2008) for $\mathbf{a}$,

$$\mathbf{a} \mid \sigma^2, \tau, B \sim N\left(\mathbf{0}, \frac{\sigma^2}{\tau}(B^\top B)^{-1}\right), \quad \tau \sim \mathrm{Gamma}(a_\tau, b_\tau), \quad \sigma^2 \sim \mathrm{InvGamma}(g_1, g_2),$$

with default settings $a_\tau = 1/2$ and $b_\tau = 2/n$ (shape and rate), which is the Zellner-Siow Cauchy prior, and $g_1 = g_2 = 0$, resulting in the non-informative prior $p(\sigma^2) \propto 1/\sigma^2$. In practice, the default settings are sufficient for most cases, though it can be helpful to encode actual prior information into the prior for $\sigma^2$.

Now, consider the prior for the number of basis functions, $M$. We use a Poisson prior for $M$, truncated to be between 0 and $M_{max}$, and give a Gamma hyperprior to the mean of the Poisson, $\lambda \sim \mathrm{Gamma}(h_1, h_2)$. If $c$ is the Poisson mass that has been truncated, i.e., $c = \sum_{k > M_{max}} e^{-\lambda} \lambda^k / k!$, then

$$p(M \mid \lambda) = \frac{e^{-\lambda} \lambda^M}{M! \, (1 - c)} \, 1(0 \le M \le M_{max}),$$

where $1(\cdot)$ is the indicator function. The default settings of $h_1 = h_2 = 10$ (shape and rate) in most cases induce a small number of basis functions. In practice, these hyperparameters can be key in order to prevent overfitting. More specifically, we increase $h_2$ (by orders of magnitude in some cases) to bring the prior for $\lambda$ close to zero in an effort to thin out the tails of the Poisson and have fewer basis functions. We use $M_{max}$ to give an upper bound to the computational cost, rather than to prevent overfitting. This strategy results in better fitting models, since setting $M_{max}$ too small can result in poor RJMCMC mixing. Poor mixing of this sort is due to the fact that the primary way to search the model space with our RJMCMC algorithm is through adding a new basis function, deleting a current basis function, or modifying a current basis function. If we are at the maximum number of basis functions, we can only explore further by modifying the current set of basis functions or deleting basis functions. However, deleting basis functions can be difficult, because they may all be useful enough that deleting one causes a significant drop in the model likelihood. On the other hand, if we are allowed to add a few new basis functions, we may be able to traverse the model space to a different posterior mode, at which point we may be able to delete some of the old basis functions.
The priors for $\mathbf{K}$, $\mathbf{s}$, $\mathbf{t}$ and $\mathbf{v}$ are uniform over a constrained space, as described in Francom et al. (2018). The constraint in this prior makes sure basis functions have more than $b$ non-zero values. Note that a basis function, as shown in Equation 2, is likely to have many zeros in it depending on how close the knot is to the edge of the space. If a knot is too close to the edge of the space, there might only be a few non-zero values in the basis function. A basis function with only a few non-zero values corresponds to very local fitting and usually results in edge effects (i.e., extreme variance at the edges of the space). If we calculate the number of non-zero points in basis function $m$ to be $b_m$, this prior requires that $b_m > b$. This is the BASS equivalent of specifying a minimum number of points in each partition in recursive partitioning. In addition to specifying $b$, we also specify $K_{max}$, the maximum degree of interaction for each basis function.
Table 1 shows the parameters used in the function bass that we have discussed thus far, and their corresponding mathematical symbols.

Efficient posterior sampling
Posterior sampling is complicated by the fact that the model is transdimensional (since $M$ is unknown). Our RJMCMC scheme allows us to add, delete, or change a basis function, consistent with the approach of Nott et al. (2005). That is, instead of proposing to add a completely random new basis function in a reversible jump step, we use a proposal generating distribution that favors the variables and degrees of interaction already included in the model. For example, say there were four basis functions already in the model, each with degree of interaction two, and say the maximum degree of interaction was three. Then, when proposing a new basis function, we would sample the degree of interaction from $\{1, 2, 3\}$ with weights proportional to $\{w_1, w_1 + 4, w_1\}$, thus favoring two-way interactions since we have seen more of them. If the nominal weight $w_1$ is large compared to the number of basis functions, this distribution looks more uniform. The value $w_2$ is the equivalent nominal weight for sampling variables to be included in a candidate basis function. Both $w_1$ and $w_2$ default to five. If there are a large number of unimportant variables in the data, a small value of $w_2$ (relative to $M$) helps to make posterior sampling more efficient by not proposing basis functions that include the unimportant variables.
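As an illustration of this weighted proposal (a minimal sketch, not package code), the draw of the degree of interaction for the example above could be written as:

R> w1 <- 5                  # nominal weight
R> counts <- c(0, 4, 0)     # current basis functions of degree 1, 2, 3
R> probs <- (w1 + counts) / sum(w1 + counts)
R> sample(1:3, size = 1, prob = probs)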
We extend the framework of Nott et al. (2005) to allow for more than two-way interactions. This ends up being non-trivial, since the RJMCMC acceptance ratio requires us to calculate the probability of sampling the proposed basis function. The difficulty comes when we try to calculate the probability of sampling the particular variables, as this requires calculating the probability of a weighted sample without replacement (weighted because we do not sample variables uniformly, without replacement because variables cannot be used more than once in the same basis function). This is equivalent to sampling from the multivariate Wallenius' noncentral hypergeometric distribution. In earlier versions of BASS, we determined the probability of such a sample using a function from the R package BiasedUrn (Fog 2015). Since the CRAN version of BiasedUrn allows for only 32 possible variables, we included a slightly altered version of the function in BASS to quickly evaluate the approximate density function of the multivariate Wallenius' noncentral hypergeometric distribution. In the current version of BASS, we have implemented a simplified multivariate Wallenius' noncentral hypergeometric distribution better suited to this specific purpose.
The computation behind posterior sampling becomes much more efficient when we recognize that each RJMCMC iteration makes only slight changes to our set of basis functions. Thus, quantities like $B^\top B$, $B\mathbf{a}$ and $B^\top \mathbf{y}$ can easily be updated rather than recalculated, as shown in Francom et al. (2018).
We perform $N_{MCMC}$ RJMCMC iterations and discard the first $N_{burn}$, after which every $N_{thin}$th iteration is kept. This results in $(N_{MCMC} - N_{burn})/N_{thin}$ posterior samples. Table 2 shows the parameters used in function bass discussed in this section, as well as their mathematical symbols.

Parallel tempering
Posterior sampling with RJMCMC is prone to mixing problems (i.e., problems exploring all of the parameter space). In our case, this is because only slight changes to the basis functions can be made in each iteration. Thus, once we start sampling from one mode of the posterior, it can be hard to move to another mode if doing so requires changing many of the basis functions (Gramacy, Samworth, and King 2010). We are able to achieve better mixing by using parallel tempering. This requires the specification of a temperature ladder, $1 = t_1 < t_2 < \cdots < t_T < \infty$. For each temperature in the ladder, a RJMCMC chain samples the posterior raised to the inverse temperature (i.e., if $\pi(\theta \mid y)$ is the posterior of interest, we sample from $\pi(\theta \mid y)^{1/t_i}$). The chains at neighboring temperatures are allowed to swap states according to a Metropolis-Hastings acceptance ratio (see Francom et al. 2018 and references therein). Only samples in the lowest temperature chain ($t_1$) are used for inference. The high temperature chains mix over many posterior modes, allowing diverse models to be propagated to the low temperature chain. We allow the chains to run without swapping for $N_{st}$ iterations at the beginning of the run to allow them to get close to their stationary distributions. Specifying a temperature ladder can be difficult. Temperatures need to be close enough to each other to allow for frequent swaps (with acceptance rates between 20 and 60%; Altekar, Dwarkadas, Huelsenbeck, and Ronquist 2004), and the highest temperature ($t_T$) needs to be high enough to be able to explore all the modes. Future versions of this package may make some attempt at automatically specifying and altering a temperature ladder. Further, a message passing interface (MPI) approach to handling the multiple chains could result in substantial speedup, and may be implemented in future versions of the package.
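A minimal sketch of the standard between-chain swap rule, where lp_i and lp_j denote the log unnormalized posteriors at the current states of the chains at temperatures $t_i$ and $t_j$ (see Francom et al. 2018 for the exact ratio used in BASS):

R> swap_prob <- function(lp_i, lp_j, t_i, t_j) {
+   # swap accepted with probability min(1, exp((1/t_i - 1/t_j) * (lp_j - lp_i)))
+   min(1, exp((1 / t_i - 1 / t_j) * (lp_j - lp_i)))
+ }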
Table 3 shows the translation from parameters used for parallel tempering in function bass to symbols we have used in this section.

Functional response
The BASS package has two implementations of methods to use for functional response modeling.

Augmentation approach
The augmentation approach handles functional responses as though the variable indexing the functional response, like time or location, is one of the independent variables. When the functional response is output on the same functional variable grid for all samples, this results in more efficient calculations involving basis functions because of the Khatri-Rao product structure (Francom et al. 2018). For example, this software is well suited to fit a model where the data are such that a combination of independent variables results in a time series and the grid of times (say, $r_1, \ldots, r_q$) is the same for each combination.
If there are multiple functional variables, we must specify a maximum degree of interaction for them, $K^F_{max}$. For instance, if the functional output was a spatiotemporal field (a function of three variables) and we specified a maximum degree of functional interaction of two, we would not allow for interactions between both spatial dimensions and time. We would specify the grid of spatial locations and time points as a matrix with three columns rather than a vector, as we did in the time series example above. We can also specify a value $b^F$, possibly different from $b$, that indicates the number of non-zero values required in the functional part of basis functions. When functional responses are included, the values of $b$ and $b^F$ should be relative to the sample size and the size of the functional grid, respectively.

Table 4: Translation from mathematical symbols to parameters used in function bass when modeling functional data.
Table 4 shows parameters necessary to model functional responses with the augmentation approach in function bass. The response y should be specified as a matrix when the response is functional.

Basis expansion approach
The second approach that the BASS package can use for modeling functional responses requires decomposing the functional response onto a set of basis functions $U = [\mathbf{u}_1, \ldots, \mathbf{u}_{N_u}]$, so that the matrix of functional responses $Y = U\Phi$, where each column of $Y$ is a (flattened) functional response and $Y$ has $n$ columns. Hence $\Phi$ is the $N_u \times n$ matrix of coefficients in the basis decomposition. Often, we will center and scale the functional responses using the pointwise mean ($\mu$) and standard deviation ($\kappa$) of the functional responses before taking the basis decomposition. For dimension reduction purposes, we often use only a subset of $n_u \le N_u$ basis functions, denoted $U^*$ with corresponding subset coefficients $\Phi^*$. Our approach to functional response modeling is then to fit independent models for the coefficients in the basis decomposition (Francom et al. 2019). If $Y_{ij}$ is the response for the $j$th functional variable setting using inputs $\mathbf{x}_i$, this approach models $Y_{ij}$ with

$$Y_{ij} = \mu_j + \kappa_j \sum_{k=1}^{n_u} U^*_{jk} f_k(\mathbf{x}_i) + \epsilon_{ij},$$

where each $f_k$ is given an independent BASS model. These models are fit independently, and thus can be done in parallel. In other words, this approach makes $n_u$ independent calls to the bass function.
This approach can often provide more accurate results than the augmentation approach, and if the number of available threads is greater than or equal to $n_u$, it can be done in the same amount of time it takes to fit a single univariate model.
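For a principal component basis, the centered and scaled decomposition described above could be computed as in the following sketch, where Y, mu, kappa and n.u stand in for $Y$, $\mu$, $\kappa$ and $n_u$ (an illustration only, not the internal bassPCA code):

R> Y.cs <- (Y - mu) / kappa                # pointwise center and scale
R> dec <- svd(Y.cs)
R> U.star <- dec$u[, 1:n.u]                # first n.u basis functions
R> Phi.star <- diag(dec$d[1:n.u]) %*% t(dec$v[, 1:n.u])  # coefficients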
Table 5 shows parameters necessary to model functional responses with the general basis function approach in function bassBasis (these parameters are passed as a list). Note that the shortcut function bassPCA calculates a principal component basis given the matrix y and fits the corresponding models, and thus requires only a subset of these parameters (not passed as a list).

Categorical inputs
We include categorical variables by allowing basis functions to include indicators for categorical variables being in certain categories. Our approach is the Bayesian version of Friedman (1991a) and is described in Francom et al. (2019). If a set of independent variables is separated into continuous variables $\mathbf{x}$ and categorical variables $\mathbf{c}$, then the $m$th basis function equivalent of Equation 2 can be written as

$$B_m(\mathbf{x}, \mathbf{c}) = \prod_{k=1}^{K_m} g_{km} \, [s_{km}(x_{v_{km}} - t_{km})]_+^{\alpha} \prod_{l=1}^{K^c_m} 1(c_{v^c_{lm}} \in C_{lm}),$$

where $K^c_m$ is the degree of interaction for the categorical predictors, $1(\cdot)$ is the indicator function, $v^c_{lm}$ indexes the categorical variables and $C_{lm}$ is a subset of the categories for variable $c_{v^c_{lm}}$. We now allow for $K_m$ or $K^c_m$ to be zero, and specify a $K^c_{max}$ (maxInt.cat in function bass). The priors we use for the degree of interaction, variables used and categories used are, in combination with the priors we used above, the same constrained uniform ones. Thus, basis functions that include categorical variables are still required to have more than $b$ non-zero values.

Sensitivity analysis
Global sensitivity analysis for nonlinear models using the Sobol' decomposition (Sobol' 2001) is well developed, but often requires large numbers of model evaluations for Monte Carlo approximation of integrals (Saltelli et al. 2008). The benefit of polynomial spline models is that Monte Carlo approximation is unnecessary because the integrals can be calculated analytically.
The method decomposes a function $f(\mathbf{x})$ into main effects, two-way interactions, and so on, up to $p$-way interactions, so that

$$f(\mathbf{x}) = f_0 + \sum_{i=1}^{p} f_i(x_i) + \sum_{1 \le i < j \le p} f_{ij}(x_i, x_j) + \cdots + f_{1 \cdots p}(x_1, \ldots, x_p). \tag{3}$$

Each term in the sum above is constructed so that it is centered at zero and orthogonal to all the other terms. This can be done by calculating the centered (conditional) expectations

$$f_0 = E[f(\mathbf{x})], \quad f_i(x_i) = E[f(\mathbf{x}) \mid x_i] - f_0, \quad f_{ij}(x_i, x_j) = E[f(\mathbf{x}) \mid x_i, x_j] - f_i(x_i) - f_j(x_j) - f_0,$$

etc., for all the terms in Equation 3. It is often assumed that the elements of $\mathbf{x}$ are independent and uniformly distributed, though the BASS package allows for more complex distributions, still with the assumption of independence. Since the terms in Equation 3 are orthogonal and zero-centered, the variance of the model decomposes as

$$\mathrm{Var}(f(\mathbf{x})) = \sum_{i=1}^{p} \mathrm{Var}(f_i(x_i)) + \sum_{1 \le i < j \le p} \mathrm{Var}(f_{ij}(x_i, x_j)) + \cdots + \mathrm{Var}(f_{1 \cdots p}(x_1, \ldots, x_p)).$$

This is a decomposition of the variance of the model into variance due to each main effect, each two-way interaction (after accounting for the associated main effects), etc. Under independent uniform $p(\mathbf{x})$ (as well as some others), all of these integrals are analytical, with solutions given in Francom et al. (2018). Sensitivity indices for main effects and interactions are then defined as proportions of the total variance. Total sensitivity for a particular variable can then be gauged by adding the main effect and all interactions associated with that variable and comparing to the total sensitivity indices for other variables.
We can obtain this variance decomposition for each posterior sample to get posterior distributions of sensitivity indices. This can be time consuming, so the sobol function has an argument mcmc.use to specify which RJMCMC iterations should be used. Calculations of the integrals above can be vectorized when basis functions are the same and only basis function coefficients change. This is the case for many of the RJMCMC iterations, and the sobol function automatically determines this and accounts for it. (As a side note, this is also the case for the predict function.)

Functional response
There are a few ways to think about sensitivity analysis for models with functional response. One way is to get the sensitivity indices for the functional variables in the same way we get the sensitivity indices for the rest of the variables. This results in a total variance decomposition. Another approach is to obtain functional sensitivity indices, which would tell us how important a variable or interaction is as we change the functional variable. This can be done by following the procedure just mentioned, but simply not integrating over the functional variable. Hence, all of the expectations above would be conditional on the functional variable. These approaches are explored in Francom et al. (2018) and Francom et al. (2019).
By default, function sobol gets sensitivity indices for the functional variables the same way it does for the other variables. Setting func.var = 1 gets the sensitivity indices as functions of the first (possibly only) functional variable (if there are multiple functional variables, this refers to the first column of the matrix xx.func passed to function bass).
On the other hand, the sobolBasis function (used on an object obtained using bassBasis or bassPCA) gets functional sensitivity indices by default. Because of the complexity of these integrals across the models for the different basis function coefficients, this function can often be sped up significantly by performing the integrals in parallel (using the n.cores parameter).

Categorical inputs
Under our categorical input extension, the necessary expectations to obtain the Sobol' decomposition are still analytical, as described in Francom et al. (2019). For the categorical variables, we replace the integrals with sums over categories.

Examples
We now demonstrate the capabilities of the package on a few examples. For each example, we start by setting the seed (set.seed(0)) so that readers can replicate the results. First we load the package, which we use for all the examples:

R> library("BASS")

Curve fitting
We first demonstrate how the package can be used for curve fitting. We generate $y \sim N(f(x), 1)$ for 1000 samples of $x$, where $x \in [-5, 5]$. The data are shown in Figure 3.
We generate the data with the following code.
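A minimal sketch of such code, with f a placeholder test function standing in for the curve used in the original analysis:

R> set.seed(0)
R> n <- 1000
R> x <- sort(runif(n, -5, 5))
R> f <- function(x) sin(x) + sin(5 * x) / 2  # placeholder, not the original f
R> y <- rnorm(n, f(x), 1)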
We then call function bass to fit a BASS model using the default settings.
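Assuming the data sketch above, the default fit is:

R> mod <- bass(x, y)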

R> plot(mod)
This generates the four plots shown in Figure 1. The top left and right plots show trace plots (after burn-in and excluding thinned samples) of the number of basis functions ($M$) and the error variance ($\sigma^2$). The bottom left plot shows the response values plotted against the posterior mean predictions (with equal-tail posterior probability intervals as specified by the quants parameter). The bottom right plot shows a histogram of the posterior mean residuals along with the assumed Gaussian distribution centered at zero and with variance taken to be the posterior mean of $\sigma^2$. This is for checking the Normality assumption.
Next, we can generate posterior predictions at new inputs, which we generate as x.test.

R> n.test <- 1000
R> x.test <- sort(runif(n.test, -5, 5))
R> pred <- predict(mod, x.test, verbose = TRUE)
Predict Start #--May 07 16:37:28 --#
Models: 40

By default, the predict function generates posterior predictive distributions for all of the inputs. We can use a subset of posterior samples by specifying the parameter mcmc.use. For instance, mcmc.use = 1:5 will use the first five posterior samples (after burn-in and excluding thinned samples), and will thus be faster. Rather than iterating through the MCMC samples to generate predictions, we instead iterate through "models." The model changes when the basis functions change, which means that we can build the basis functions once and perform vectorized operations for predictions for all the MCMC iterations with the same basis functions.
The object resulting from the predict function is a matrix with rows corresponding to MCMC samples and columns corresponding to settings of x.test. Thus, the posterior mean predictions are obtained by taking the column means. We plot the posterior predictive means against the true values of $f(x)$ as shown in Figure 2.
Note that the predictive distributions in the columns of pred are for $f(x)$. To obtain predictive distributions for data, we would need to include Gaussian error with variance $\sigma^2$ (demonstrated in Section 4.5). Posterior samples of $\sigma^2$ are given in mod$s2.
In the curve fitting case, we can plot predicted curves. Below, we plot 10 posterior predictive samples along with the true curve (Figure 3). We also show knot locations (in the rug along the x-axis) for one of the posterior samples.
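A minimal sketch of such a plot, assuming pred and f from the sketches above (the knot rug uses internal model state and is omitted here):

R> matplot(x.test, t(pred[1:10, ]), type = "l", col = "gray")
R> lines(x.test, f(x.test), lwd = 2)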
Two final issues to discuss with this example are why we use linear splines (the default degree = 1) and how to tell if we have achieved convergence before taking MCMC samples as posterior samples. We use linear splines almost exclusively when using this package because of their stability and their ability to capture nonlinear curves and surfaces. Using a higher degree, such as degree = 3, results in smoother models but suffers from stability problems and is more difficult to fit. We suggest that settings of degree other than degree = 1 be used with care, always with scrutiny of prediction performance. Convergence is best assessed by examining the trace plots shown in Figure 1. Especially if the trace plot for $\sigma^2$ shows any sort of non-cyclical pattern, the sampler should be run for longer. As a side note, a new sampler can be started from where the old sampler left off by using the curr.list parameter. For instance, we can run mod2 <- bass(x, y, curr.list = mod$curr.list) to start a new sampler from where mod left off.

Friedman function
For our next example, we test the package on the Friedman function (Friedman 1991b). This function has 10 inputs, five of which contribute nothing. The other five are used to generate

$$f(\mathbf{x}) = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5.$$
We generate 200 input samples uniformly from a unit hypercube, calculate f (x) for each and add standard Normal error to obtain data to model.
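A minimal sketch of this setup, including test data used below; the temperature ladder is an assumed example, consistent only in its use of the nine temperatures discussed later:

R> f <- function(x) 10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
+   10 * x[, 4] + 5 * x[, 5]
R> n <- 200
R> x <- matrix(runif(n * 10), n, 10)
R> y <- rnorm(n, f(x))
R> mod <- bass(x, y, temp.ladder = 1.05^(0:8))  # assumed ladder values
R> n.test <- 1000
R> x.test <- matrix(runif(n.test * 10), n.test, 10)
R> pred <- predict(mod, x.test)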

R> fx.test <- f(x.test)
R> plot(fx.test, colMeans(pred))
R> abline(a = 0, b = 1, col = 2)
Now that we are considering a function of many variables, we may be interested in sensitivity analysis. To get the Sobol' decomposition for each posterior sample, we use the sobol function.

R> sens <- sobol(mod, verbose = FALSE)
Note that when verbose = TRUE, this function prints progress after every 10 models (as with the predict function, vectorizing around models rather than MCMC iterations saves a large amount of time). Depending on the number of basis functions and the number of models, this function can take a significant amount of time. If that is the case, using a smaller set of MCMC iterations by specifying mcmc.use may be useful.
The default plotting for this kind of object (Figure 6) shows boxplots of the variance explained by each main effect and interaction that shows up in the BASS model. It also shows boxplots of the total sensitivity indices.
R> plot(sens, cex.axis = 0.5)

If there are a large number of main effects or interactions that explain very small percentages of the variance, we can show only the effects that are most significant. For instance, we could show only the effects that, on average, explain at least 1% of the variance (Figure 7).

R> boxplot(sens$S[, colMeans(sens$S) > 0.01], las = 2,
+   ylab = "proportion variance", range = 0)
As expected, we see that almost all of the variance is from the first five variables and the only strong interaction is between the first two variables.
As a final note for this example, we discuss tempering diagnostics. We would like neighboring chains to have swap acceptance rates of somewhere around 23%. Running bass with verbose = TRUE prints these acceptance rates every 1000 iterations. At the completion of the sampling, we can investigate acceptance rates by dividing the swap counts by the number of swap proposals, as follows.
R> mod$count.swap / mod$count.swap.prop

[1] 0.4703961 0.3873609 0.4641745 0.4766776 0.2188010 0.4721832 0.3771823
[8] 0.2198053

Since we have specified nine temperatures, there are eight possible swaps, hence the eight numbers. If, for example, we wanted to increase the first acceptance rate, we would move the second temperature closer to the first.
Further analysis of swapping can be done by looking at swap trace plots.
R> matplot(mod$temp.val, type = "l", ylab = "temperature index")

Figure 8 shows the swap trace plot, where the y-axis values are temperature indices (1 is the true posterior and 9 is the posterior raised to the smallest power), the x-axis shows the MCMC iteration and the colored lines represent the different chains. We want to see these chains mixing throughout.
Determining whether the smallest value of the temperature ladder is small enough to allow for good mixing can be difficult. In this example, we could run the model with temp.ladder = 11.03 and look at mixing diagnostics. One could also look at predicted versus observed plots at the different temperatures for the last MCMC iteration by executing the following code, the output of which is shown in Figure 9.
R> par(mfrow = c(3, 3))
R> temp.ind <- sapply(mod$curr.list, function(x) x$temp.ind)
R> for (i in 1:length(mod$temp.ladder)) {
+   ind <- which(temp.ind == i)
+   yhat <- mod$curr.list[[ind]]$des.basis %*% mod$curr.list[[ind]]$beta
+   plot(yhat, y, main = round(mod$temp.ladder[i], 2))
+   abline(a = 0, b = 1, col = 2)
+ }

Note that the curr.list object is a list with number of elements equal to the number of temperatures. This list contains the MCMC state for each chain. Since we swap temperatures rather than entire states, the chains are not in order according to temperature. We note that using the default prior for $\sigma^2$ with a temperature ladder with relatively large values can lead to instabilities when estimating $\sigma^2$. In cases where that is clearly the case, the prior for $\sigma^2$ will be automatically changed and a warning will be generated.
To demonstrate what is different when we use tempering, consider the equivalent BASS model fit without tempering.

R> mod.noTemp <- bass(x, y)

We compare the root mean square prediction error (RMSE) for the two models, as well as the empirical coverage of 95% probability intervals. First, the RMSE for the model fit without tempering,

R> pred.noTemp <- predict(mod.noTemp, x.test)
R> sqrt(mean((colMeans(pred.noTemp) - fx.test)^2))

[1] 0.4994824

and the empirical coverage,

R> quants.noTemp <- apply(pred.noTemp, 2, quantile, probs = c(0.025, 0.975))
R> mean(quants.noTemp[1, ] < fx.test & quants.noTemp[2, ] > fx.test)

[1] 0.915

The corresponding values for the tempered model tend to be moderately better. Under different seeds, we tend to see higher coverage when we use tempering and lower coverage when we do not. We also tend to get better models in terms of RMSE when we use tempering. Other benefits of tempering will be shown in later examples.
Because the computational burden is currently linear in the number of temperatures, using fewer temperatures is better. Thus, for many purposes, the model without tempering may be good enough.
We also point out that this modeling framework can handle correlated inputs. For instance, we use the matrix

R> S <- matrix(0.99, nrow = 10, ncol = 10) + diag(10) * 0.01

as a covariance matrix for simulated inputs, and rescale them to be between zero and one so that the Friedman function evaluations are comparable to the ones we have used above. The inputs are all highly correlated. The model fitting and prediction are done without any changes from what we have done previously.
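A minimal sketch of the simulation and rescaling, using mvrnorm from the MASS package:

R> library(MASS)
R> x <- mvrnorm(n, mu = rep(0, 10), Sigma = S)
R> x <- apply(x, 2, function(z) (z - min(z)) / (max(z) - min(z)))
R> y <- rnorm(n, f(x))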

Friedman function with a categorical variable
In this example, we use data generated from a function similar to the Friedman function in the previous example, but with a categorical variable included. The function, introduced in Gramacy and Taddy (2010), has mean function

$$f(\mathbf{x}) = \begin{cases} 10 \sin(\pi x_1 x_2) & x_{11} = 1 \\ 20(x_3 - 0.5)^2 & x_{11} = 2 \\ 10 x_4 + 5 x_5 & x_{11} = 3 \\ 5 x_1 + 10 x_2 + 20(x_3 - 0.5)^2 + 10 \sin(\pi x_4 x_5) & x_{11} = 4 \end{cases}$$

and standard Normal error. Again, $x_6, \ldots, x_{10}$ are unimportant. We generate 500 random uniform samples of the first 10 variables and randomly sample 500 values of the four categories of the 11th variable. The bass function treats input variables as categorical only if they are coded as factors.

R> x <- data.frame(matrix(runif(n * 10), n, 10),
+   as.factor(sample(1:4, size = n, replace = TRUE)))
R> y <- rnorm(n, f(x), sigma)

We fit a model with tempering and use it for prediction, as in the previous example.
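For reference, a minimal sketch of a mean function consistent with the display above (an illustration; the case structure for the first two categories follows the usual decomposition of the Friedman function):

R> f <- function(x) {
+   ifelse(x[, 11] == 1, 10 * sin(pi * x[, 1] * x[, 2]),
+   ifelse(x[, 11] == 2, 20 * (x[, 3] - 0.5)^2,
+   ifelse(x[, 11] == 3, 10 * x[, 4] + 5 * x[, 5],
+     5 * x[, 1] + 10 * x[, 2] + 20 * (x[, 3] - 0.5)^2 +
+       10 * sin(pi * x[, 4] * x[, 5]))))
+ }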

R> sens <- sobol(mod)
Plotting the posterior distributions of the most important (explaining more than 0.5% of the variance) sensitivity indices in Figure 12, we see how important the categorical variable is as well as which variables it interacts with.

Friedman function with functional response
Next, we consider an extension of the Friedman function that is functional in one variable (Francom et al. 2018). We use

$$f(\mathbf{x}) = 10 \sin(2\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5,$$

where we treat $x_1$ as the functional variable. Note that we insert a two into the sin function in order to increase the variability due to $x_1$ and $x_2$, making the problem more challenging. We generate 500 combinations of $x_2, \ldots, x_{10}$ from a uniform hypercube. We generate a grid of values of $x_1$ of length 50. This ends up being $500 \times 50$ combinations of inputs, for which we evaluate $f$ and add standard Normal error. We keep the responses in a matrix of dimension $500 \times 50$ so that each row represents a curve. The inputs are kept separate in a $500 \times 9$ matrix and a grid of length 50.
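A minimal sketch of this setup, including test inputs sized to match the test-set construction below (n.test is an assumed value):

R> f <- function(x) 10 * sin(2 * pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
+   10 * x[, 4] + 5 * x[, 5]
R> n <- 500
R> n.func <- 50
R> x <- matrix(runif(n * 9), n, 9)
R> x.func <- seq(0, 1, length.out = n.func)
R> y <- matrix(f(cbind(rep(x.func, each = n), kronecker(rep(1, n.func), x))) +
+   rnorm(n * n.func), ncol = n.func)
R> n.test <- 1000
R> x.test <- matrix(runif(n.test * 9), n.test, 9)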

R> matplot(x.func, t(y), type = "l")
In order for the BASS package to handle functional responses, each curve needs to be evaluated on the same grid. Thus, the responses must be able to be stored as a matrix without missing values.

Augmentation approach
We fit the augmentation functional response model by specifying our matrices x and y as well as the grid x.func.
R> mod <- bass(x, y, xx.func = x.func)

Following this, we make a functional predicted versus observed plot, shown in Figure 14.
R> fx.test <- matrix(f(cbind(rep(x.func, each = n.test),
+   kronecker(rep(1, n.func), x.test))), ncol = n.func)
R> pred <- predict(mod, x.test)
R> matplot(fx.test, apply(pred, 2:3, mean))

We will demonstrate the two methods of sensitivity analysis discussed in Section 3. First, we can get the Sobol' indices for the functional variable and its interactions just as we do for the other variables. This is the default. When we plot the variance decomposition, as shown in Figure 15, the functional variable is labeled with the letter "a." If we had multiple functional variables, they would be labeled with different letters.
R> plot(sens, cex.axis = 0.5)

The other approach to sensitivity analysis is to get a functional variance decomposition. This is done by using the func.var parameter. If there is only one functional variable, we set func.var = 1. Otherwise, we set func.var to the column of xx.func we want to use for our functional variance decomposition. The left plot shows the posterior mean (using posterior samples specified with mcmc.use) of the functional sensitivity indices in a functional pie chart. The right plot shows the variance decomposition as a function of the functional variable. Thus, the top line in the right plot is the total variance in y as a function of $x_1$. The bottom (black) line is the total variance explained by the main effect of $x_2$ as a function of $x_1$. The labels in the plot on the left are the variable numbers (columns of x).
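A minimal sketch of calls that would produce a figure like Figure 16:

R> sens.func <- sobol(mod, func.var = 1)
R> plot(sens.func)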

Basis expansion approach
We will utilize parallel computing in the following examples. The BASS functions that can utilize multiple threads can do so in two ways, fork or socket, specified with the parType option. We recommend the fork method, though it is not available on Windows. We would usually specify n.cores = parallel::detectCores() in these functions, but for the purposes of making this document compile on CRAN, we limit the number of threads used with the following specification.

R> if (.Platform$OS.type == "unix") {
+   nc <- 2  # assumed completion of a truncated chunk: two threads on unix
+ } else {
+   nc <- 1  # fork is unavailable on Windows; run serially here
+ }
We should also point out that this kind of explicit parallelism sometimes does not play well with the multithreading that comes from BLAS. Functions in the BASS package benefit from both forms of parallelism, though we prefer the explicit parallelism over the multithreading when possible (which seems to happen by default when using OpenBLAS).
The basis approach to functional response modeling using a principal component basis can be done as follows.

R> mod.pca <- bassPCA(x, y, perc.var = 95, n.cores = nc)
R> sens.func.pca <- sobolBasis(mod.pca, int.order = 2, mcmc.use = 1,
+   n.cores = nc)  # int.order and mcmc.use values shown are illustrative

The optional perc.var argument specifies how many principal components should be used in terms of the percent of variance explained. We could alternately specify n.pc. Then n.pc BASS models are fit independently, using n.cores threads. To limit computation, the sobolBasis function requires a specified highest degree of interaction in the decomposition, int.order, and a single MCMC sample to use, mcmc.use. This function benefits from having many threads (often more so than bassPCA). Figure 17 shows the functional variance decomposition.

R> plot(sens.func.pca)
The bassPCA function is a shortcut to the bassBasis function when we want to use a principal component basis. For other bases, we use the bassBasis function directly. We demonstrate this with a wavelet basis below.
First, we must interpolate our functional response data onto a grid whose length is a power of two.

R> pow2 <- floor(log2(ncol(y)))
R> y.pow2 <- matrix(nrow = n, ncol = 2^pow2)
R> for (i in 1:n) y.pow2[i, ] <- approx(x.func, y[i, ], n = 2^pow2)$y

Then we (optionally) subtract the mean function from each functional response.

R> y.pow2.mean <- colMeans(y.pow2)
R> y.pow2.scale <- sweep(y.pow2, 2, y.pow2.mean)

We use the wmtsa package (Constantine and Percival 2017) to perform the wavelet decomposition for each functional response.

Finally, we specify the matrix of basis functions and our reduced-dimension response as part of the list that will be passed to the bassBasis function.
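A minimal sketch of the structure of such a list, using the parameter names from Table 5 (wav.basis, wav.coef and n.wav are hypothetical names for the wavelet basis matrix, its coefficient matrix and the number of basis functions kept):

R> dat <- list(xx = x, n.pc = n.wav, basis = wav.basis, newy = wav.coef,
+   y.m = y.pow2.mean, y.s = 1)
R> mod.wav <- bassBasis(dat, n.cores = nc)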

Air foil data
In this example, we consider a NASA dataset obtained from a series of aerodynamic and acoustic tests of two- and three-dimensional airfoil blade sections conducted in an anechoic wind tunnel (Lichman 2013). The response is scaled sound pressure level, in decibels. There are five inputs: (1) frequency, in Hertz; (2) angle of attack, in degrees; (3) chord length, in meters; (4) free-stream velocity, in meters per second; and (5) suction side displacement thickness, in meters. The data have 1503 combinations of these inputs, some of which are collinear (variables 2 and 5 have a correlation of 0.75).
R> mod <- bass(x, y, nmcmc = 20000, nburn = 10000, thin = 10,
+   temp.ladder = 1.1^(0:5), verbose = FALSE)

We can predict as we have before. However, this prediction is for the mean function. Incorporating posterior draws of the error variance $\sigma^2$ into the predictions and computing the empirical coverage of 90% prediction intervals yields

[1] 0.9333333

This puts our empirical coverage near where we would expect it to be. We can plot our 90% prediction intervals as follows, shown in Figure 20.
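A minimal sketch of one way to construct such intervals, assuming held-out data x.test and y.test (hypothetical names):

R> pred <- predict(mod, x.test)
R> pred.data <- pred + matrix(rnorm(length(pred), sd = sqrt(mod$s2)),
+   nrow = nrow(pred))  # add posterior draws of the observation error
R> quants <- apply(pred.data, 2, quantile, probs = c(0.05, 0.95))
R> mean(quants[1, ] < y.test & quants[2, ] > y.test)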

R> sens <-sobol(mod, verbose = FALSE) R> plot(sens)
The uncertainty in the sensitivity indices in Figure 21 is significant and helps us to understand that there are many possible models for these data that use different variables and interactions. The proper characterization of this uncertainty would be impossible if our RJMCMC chain were stuck in a single mode. Hence, tempering is important in this problem. By exploring the posterior modes, tempering allows us to find not just a model that predicts well, but all the models that predict well.

Pollutant spill model
The final example we present is an emulation problem. The simulator models a pollutant spill caused by a chemical accident, and was obtained from Surjanovic and Bingham (2017). While fast to evaluate, this simulator provides a good testbed for BASS methods. The simulator has four inputs: (1) mass of pollutant spilled at each of two locations (range 7-13), (2) diffusion rate in the channel (0.02-0.12), (3) location of the second spill (0.01-3), and (4) time of the second spill (30.01-30.295). The simulator outputs a function in space (one dimension) and time that is the concentration of the pollutant.
We generate 1000 combinations of the four simulator inputs uniformly from within their respective ranges.
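A minimal sketch using the stated ranges:

R> n <- 1000
R> x <- cbind(runif(n, 7, 13), runif(n, 0.02, 0.12), runif(n, 0.01, 3),
+   runif(n, 30.01, 30.295))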
R> s <- c(0, 0.5, 1, 1.5, 2, 2.5)
R> t <- seq(0.3, 60, length.out = 20)
R> x.func <- expand.grid(t, s)

We will show results when using the bassPCA function rather than the bass function for this problem. The reader may wish to test different parameter ranges, numbers of simulations, and functional response modeling approaches. Our experience is that the bassPCA function performs quite well for this problem, and that various BASS models can handle tens of thousands of model runs. We use the function environ, available from http://www.sfu.ca/~ssurjano/Code/environr.html, to generate realizations of the simulator. We will model the log of the simulator output, though plume models like this may deserve better thought out transformations, as in Bliznyuk, Ruppert, Shoemaker, Regis, Wild, and Mugunthan (2008).
R> mod <- bassPCA(x, y, n.pc = 20, save.yhat = FALSE,
+   n.cores = nc, verbose = FALSE)

Note that we specify save.yhat = FALSE. By default, the bass function saves in-sample predictions for all MCMC samples (post burn-in and thinned). This can be a significant storage burden when we have large amounts of functional data, as we do in this case. Changing the save.yhat parameter can relieve this. If in-sample predictions are of interest, they can be obtained after model fitting using the predict function.
As with the previous example, prediction here is for the mean function. Whatever error is left over (in $\sigma^2$) reflects the inability of the BASS model to pick up high-frequency signal.

Summary
Our proposed BASS framework provides a powerful general tool for nonparametric regression settings. It can be used for modeling with many continuous and categorical inputs, large sample sizes and functional responses. It provides posterior sensitivity analyses without integration error. The MCMC approach to inference, especially using parallel tempering, yields posterior samples that can be used for probabilistic prediction. The BASS package makes these features accessible to users with minimal exposure to the underlying methodology. These capabilities have been demonstrated with a set of examples involving different dimensions, categorical variables, functional responses, and large datasets.
Symbol           x    n_u   U*     Φ*    µ    κ
bassBasis input  xx   n.pc  basis  newy  y.m  y.s

Table 5: Translation from mathematical symbols to parameters used in function bassBasis when modeling functional data. A subset of these parameters is used in function bassPCA.

Figure 1: Diagnostic plots for BASS model fitting.

Figure 2: BASS prediction on test data.

Figure 5: BASS prediction on test data - Friedman function.

Figure 9: Predicted versus observed for the last MCMC iteration of the nine chains at different temperatures. The temperatures are shown above each plot.

Figure 10: BASS prediction on test data - Friedman function with correlated inputs.

Figure 11: BASS prediction on test data - Friedman function with categorical predictor.

Figure 12: Most important main effects and interactions - Friedman function with categorical predictor.

Figure 14: BASS prediction performance - Friedman function with functional response.

Figure 15: Sensitivity analysis - Friedman function with functional response.

Figure 16: Functional sensitivity analysis - Friedman function with functional response.

Figure 17: Functional sensitivity analysis, PCA space - Friedman function with functional response.

Figure 18: BASS prediction performance, PCA (left) and wavelet (right) space - Friedman function with functional response.

Figure 19: Functional sensitivity analysis, wavelet space - Friedman function with functional response.

Figure 24: Variance decomposition as a function of space and time - pollutant spill model.

Symbol      h_1  h_2  g_1  g_2  α       M_max     a_τ    b_τ
bass input  h1   h2   g1   g2   degree  maxBasis  a.tau  b.tau

Table 1: Translation from mathematical symbols to parameters used in function bass.

Symbol      w_1  w_2  N_MCMC  N_burn  N_thin
bass input  w1   w2   nmcmc   nburn   thin

Table 2: Translation from mathematical symbols to parameters used to specify nominal weights of proposal distributions and number of RJMCMC iterations in function bass.

Table 3: Translation from mathematical symbols to parameters used for parallel tempering in function bass.
Prediction is as simple as before. If we want to predict on a different functional grid, we can specify that in the predict function with newdata.func.
Then we can call the bassBasis, predict, and sobolBasis functions. To compare predictions, we first interpolate our test data onto a grid whose length is a power of two.
Now, if we are interested in predicting actual data rather than the mean function, we can incorporate uncertainty from our estimate of $\sigma^2$.