bartMachine: Machine Learning with Bayesian Additive Regression Trees

We present a new package in R implementing Bayesian additive regression trees (BART). The package introduces many new features for data analysis using BART such as variable selection, interaction detection, model diagnostic plots, incorporation of missing data and the ability to save trees for future prediction. It is significantly faster than the current R implementation, parallelized, and capable of handling both large sample sizes and high-dimensional data.


Introduction
Ensemble-of-trees methods have become popular choices for forecasting in both regression and classification problems. Algorithms such as random forests (Breiman 2001) and stochastic gradient boosting (Friedman 2002) are two well-established and widely employed procedures. Recent advances in ensemble methods include dynamic trees (Taddy, Gramacy, and Polson 2011) and Bayesian additive regression trees (BART, Chipman, George, and McCulloch 2010), which depart from predecessors in that they rely on an underlying Bayesian probability model rather than a pure algorithm. BART has demonstrated substantial promise in a wide variety of simulations and real world applications such as predicting avalanches on mountain roads (Blattenberger and Fowles 2014), predicting how transcription factors interact with DNA (Zhou and Liu 2008) and predicting movie box office revenues (Eliashberg 2010). This paper introduces bartMachine, a new R (R Core Team 2014) package available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=bartMachine that significantly expands the capabilities of using BART for data analysis.
Currently, there exists one other implementation of BART on CRAN: BayesTree, the package developed by the algorithm's original authors. One of the major drawbacks of this implementation is its lack of a predict function. Test data must be provided as an argument during the training phase of the model. Hence it is impossible to generate forecasts on future data without re-fitting the entire model. Since the run time is not trivial, forecasting becomes an arduous exercise. A significantly faster implementation of BART that contains master-slave parallelization was introduced by Pratola, Chipman, Higdon, McCulloch, and Rust (2013), but it is only available as standalone C++ source code and is not integrated with R. Additionally, a recent package, dbarts, allows updating of BART with new predictors and response values to incorporate BART into a larger Bayesian model. dbarts relies on BayesTree as its BART engine.
The goal of bartMachine is to provide a fast, easy-to-use, visualization-rich machine learning package for R users. Our implementation of BART is in Java and is integrated into R via rJava (Urbanek 2013). From a runtime perspective, our algorithm is significantly faster and is parallelized, allowing computation on as many cores as desired. Not only is the model construction itself parallelized, but the additional features such as prediction, variable selection, and many others can be divided across cores as well.
Additionally, we include a variety of expanded and new features. We implement the ability to save trees in memory and provide convenience functions for prediction on test data. We also include plotting functions for both credible and predictive intervals and plots for visually inspecting the convergence of BART's Gibbs sampler. We expand variable importance exploration to include permutation tests and interaction detection. We implement recently developed features for BART including a principled approach to variable selection and the ability to incorporate prior information for covariates (Bleich, Kapelner, Jensen, and George 2014). We also implement the strategy found in Kapelner and Bleich (2014) to incorporate missing data during training and handle missingness during prediction. In Section 2, we provide an overview of BART with a special emphasis on the features that have been extended. In Section 3 we provide a general introduction to the package, highlighting the novel features. Section 4 provides step-by-step examples of the regression capabilities and Section 5 introduces additional step-by-step examples of features unique to classification problems. We conclude in Section 6. Appendix A discusses the details of our algorithm implementation and how it differs from BayesTree. Appendix B offers predictive performance comparisons.

Overview of BART
BART is a Bayesian approach to nonparametric function estimation using regression trees. Regression trees rely on recursive binary partitioning of the predictor space into a set of hyperrectangles in order to approximate some unknown function f. The predictor space has dimension equal to the number of variables, which we denote p. Tree-based regression models flexibly fit interactions and nonlinearities. Models composed of sums of regression trees have an even greater ability than single trees to capture interactions and non-linearities as well as additive effects in f.
BART can be considered a sum-of-trees ensemble, with a novel estimation approach relying on a fully Bayesian probability model. Specifically, the BART model can be expressed as:

$$\mathbf{Y} = \mathcal{T}_1^{\mathcal{M}}(\mathbf{X}) + \mathcal{T}_2^{\mathcal{M}}(\mathbf{X}) + \cdots + \mathcal{T}_m^{\mathcal{M}}(\mathbf{X}) + \boldsymbol{\mathcal{E}}, \qquad \boldsymbol{\mathcal{E}} \sim \mathcal{N}_n\left(\mathbf{0}, \sigma^2 \mathbf{I}_n\right), \tag{1}$$

where $\mathbf{Y}$ is the $n \times 1$ vector of responses, $\mathbf{X}$ is the $n \times p$ design matrix (the predictors column-joined) and $\boldsymbol{\mathcal{E}}$ is the $n \times 1$ vector of noise. Here we have $m$ distinct regression trees, each composed of a tree structure, denoted by $\mathcal{T}$, and the parameters at the terminal nodes (also called leaves), denoted by $\mathcal{M}$. The two together, denoted $\mathcal{T}^{\mathcal{M}}$, represent an entire tree with both its structure and set of leaf parameters.
The structure of a given tree $\mathcal{T}_t$ includes information on how any observation recurses down the tree. For each nonterminal (internal) node of the tree, there is a "splitting rule" taking the form $x_j < c$, consisting of the "splitting variable" $x_j$ and the "splitting value" $c$. An observation moves to the left child node if the condition set by the splitting rule is satisfied and to the right child node otherwise. The process continues until a terminal node is reached. Then, the observation receives the leaf value of the terminal node. The sum of the $m$ leaf values becomes its predicted value. We denote a given tree's set of leaf parameters as $\mathcal{M}_t = \{\mu_{t,1}, \mu_{t,2}, \ldots, \mu_{t,b_t}\}$ where $b_t$ is the number of terminal nodes for that tree.
BART can be distinguished from other ensemble-of-trees models due to its underlying probability model. As a Bayesian model, BART consists of a set of priors for the structure and the leaf parameters and a likelihood for data in the terminal nodes. The aim of the priors is to provide regularization, preventing any single regression tree from dominating the total fit.
We provide an overview of the BART priors and likelihood and then discuss how draws from the posterior distribution are made. A more complete exposition can be found in Chipman et al. (2010).

Priors and likelihood
The prior for the BART model has three components: (1) the tree structure itself, (2) the leaf parameters given the tree structure and (3) the error variance $\sigma^2$, which is independent of the tree structure and leaf parameters:

$$\mathbb{P}\left(\mathcal{T}_1^{\mathcal{M}}, \ldots, \mathcal{T}_m^{\mathcal{M}}, \sigma^2\right) = \left[\prod_t \mathbb{P}\left(\mathcal{T}_t^{\mathcal{M}}\right)\right] \mathbb{P}\left(\sigma^2\right) = \left[\prod_t \mathbb{P}\left(\mathcal{M}_t \mid \mathcal{T}_t\right) \mathbb{P}\left(\mathcal{T}_t\right)\right] \mathbb{P}\left(\sigma^2\right) = \left[\prod_t \prod_{\ell} \mathbb{P}\left(\mu_{t,\ell} \mid \mathcal{T}_t\right) \mathbb{P}\left(\mathcal{T}_t\right)\right] \mathbb{P}\left(\sigma^2\right),$$

where the last equality follows from an additional assumption of conditional independence of the leaf parameters given the tree's structure.
We first describe $\mathbb{P}(\mathcal{T}_t)$, the component of the prior which affects the locations of nodes within the tree. Nodes at depth $d$ are nonterminal with probability $\alpha(1 + d)^{-\beta}$ where $\alpha \in (0, 1)$ and $\beta \in [0, \infty)$. Depth is defined as distance from the root; thus the root itself has depth 0, its first child node has depth 1, etc. This prior form has the ability to enforce shallow tree structures, thereby limiting the complexity of any single tree. Default values of $\alpha = 0.95$ and $\beta = 2$ are recommended by Chipman et al. (2010).
For nonterminal nodes, splitting rules occur in two parts. First, a predictor is randomly selected to serve as the splitting variable. In the original formulation, each available predictor is equally likely to be chosen: the splitting variable is drawn from a discrete uniform distribution, with each variable selected with probability $1/p$. This is relaxed in our implementation to allow for a generalized Bernoulli distribution where the user specifies $p_1, p_2, \ldots, p_p$ (such that $\sum_{j=1}^{p} p_j = 1$), with $p_j$ the probability of selecting variable $j$. See "covariate priors" (Section 4.10) for further details. Note that "structural zeroes," variables that do not have any valid split values, are assigned probability zero (see Appendix A.1 for details). Once the splitting variable is chosen, the splitting value is chosen from the multiset (the non-unique set) of available values at the node via the discrete uniform distribution.
We now describe the prior component $\mathbb{P}(\mathcal{M}_t \mid \mathcal{T}_t)$ which controls the leaf parameters. Given a tree with a set of terminal nodes, each terminal node (or leaf) has a continuous parameter (the leaf parameter) representing the "best guess" of the response in this partition of predictor space. This parameter is the fitted value assigned to any observation that lands in that node. The prior on each of the leaf parameters is given as $\mu_{t,\ell} \stackrel{iid}{\sim} \mathcal{N}\left(\mu_\mu / m, \sigma_\mu^2\right)$. The expectation, $\mu_\mu$, is picked to be the range center, $(y_{\min} + y_{\max})/2$. The range center can be affected by outliers; if this is a concern, the user can log-transform or winsorize the response.
The variance hyperparameter $\sigma_\mu^2$ is empirically chosen so that the range center plus or minus $k = 2$ standard deviations covers 95% of the response values in the training set (where $k = 2$, corresponding to 95% coverage, is only the default and can be customized). Thus, since there are $m$ trees, we automatically employ $\sigma_\mu$ such that $m\mu_\mu - k\sqrt{m}\,\sigma_\mu = y_{\min}$ and $m\mu_\mu + k\sqrt{m}\,\sigma_\mu = y_{\max}$. The aim of this prior is to provide model regularization by shrinking the leaf parameters towards the center of the distribution of the response.
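Solving these two conditions makes the implied value of $\sigma_\mu$ explicit (a one-line derivation, shown here for concreteness):

$$\left(m\mu_\mu + k\sqrt{m}\,\sigma_\mu\right) - \left(m\mu_\mu - k\sqrt{m}\,\sigma_\mu\right) = y_{\max} - y_{\min} \quad\Longrightarrow\quad \sigma_\mu = \frac{y_{\max} - y_{\min}}{2k\sqrt{m}}.$$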
The final prior is on the error variance and is chosen to be $\sigma^2 \sim \text{InvGamma}\left(\nu/2, \nu\lambda/2\right)$. The hyperparameter $\lambda$ is determined from the data so that there is a $q = 90\%$ a priori chance (by default) that the BART model will improve upon the RMSE from an ordinary least squares regression. Therefore, the majority of the prior probability mass lies below the RMSE from least squares regression. Additionally, this prior limits the probability mass placed on small values of $\sigma^2$ to prevent overfitting.
Note that the adjustable hyperparameters are α, β, k, ν and q. Default values generally provide good performance, but optimal tuning can be achieved via cross-validation, an automatic feature implemented and described in Section 4.2.
Along with the set of priors, BART specifies the likelihood of the responses in the terminal nodes. They are assumed a priori normal with mean equal to the current "best guess" of the leaf parameter (at the given moment in the Gibbs sampler) and variance equal to the current best guess of the variance, i.e., $y_\ell \sim \mathcal{N}\left(\mu_\ell, \sigma^2\right)$.

Posterior distribution and prediction
A Gibbs sampler (Geman and Geman 1984) is employed to generate draws from the posterior distribution of $\mathbb{P}(\mathcal{T}_1^{\mathcal{M}}, \ldots, \mathcal{T}_m^{\mathcal{M}}, \sigma^2 \mid \mathbf{y})$. A key feature of the Gibbs sampler for BART is a form of "Bayesian backfitting" (Hastie and Tibshirani 2000) where the $j$th tree is fit iteratively, holding all other $m - 1$ trees constant, by exposing only the residual response that remains unfitted:

$$\mathbf{R}_{-j} := \mathbf{y} - \sum_{t \neq j} \mathcal{T}_t^{\mathcal{M}}(\mathbf{X}). \tag{2}$$

The Gibbs sampler proceeds as follows:

$$\begin{aligned}
1&:\; \mathcal{T}_1 \mid \mathbf{R}_{-1}, \sigma^2 \\
2&:\; \mathcal{M}_1 \mid \mathcal{T}_1, \mathbf{R}_{-1}, \sigma^2 \\
3&:\; \mathcal{T}_2 \mid \mathbf{R}_{-2}, \sigma^2 \\
4&:\; \mathcal{M}_2 \mid \mathcal{T}_2, \mathbf{R}_{-2}, \sigma^2 \\
&\;\;\vdots \\
2m-1&:\; \mathcal{T}_m \mid \mathbf{R}_{-m}, \sigma^2 \\
2m&:\; \mathcal{M}_m \mid \mathcal{T}_m, \mathbf{R}_{-m}, \sigma^2 \\
2m+1&:\; \sigma^2 \mid \mathcal{T}_1, \mathcal{M}_1, \ldots, \mathcal{T}_m, \mathcal{M}_m, \boldsymbol{\mathcal{E}}
\end{aligned} \tag{3}$$

The sampler first proposes a change to the first tree's structure $\mathcal{T}_1$, which is accepted or rejected via a Metropolis-Hastings step (Hastings 1970). Note that sampling from the posterior of the tree structure does not depend on the leaf parameters, as they can be analytically integrated out of the computation (see Appendix A.1). Given the tree structure, samples from the posterior of the $b$ leaf parameters $\mathcal{M}_1 := \{\mu_1, \ldots, \mu_b\}$ are then drawn. This procedure proceeds iteratively for each tree, using the updated set of partial residuals $\mathbf{R}_{-j}$. Finally, conditional on the updated set of tree structures and leaf parameters, a draw from the posterior of $\sigma^2$ is made based on the full model residuals $\boldsymbol{\mathcal{E}} := \mathbf{y} - \sum_{t=1}^{m} \mathcal{T}_t^{\mathcal{M}}(\mathbf{X})$.

Within a given terminal node, since both the prior and likelihood are normally distributed, the posterior of each of the leaf parameters in $\mathcal{M}$ is conjugate normal with its mean being a weighted combination of the likelihood and prior parameters (lines 2, 4, ..., 2m in Equation set 3). Due to normal-inverse-gamma conjugacy, the posterior of $\sigma^2$ is inverse gamma as well (line 2m + 1 in Equation set 3). The complete expressions for these posteriors can be found in Gelman, Carlin, Stern, and Rubin (2004).

Lines 1, 3, ..., 2m − 1 in Equation set 3 rely on Metropolis-Hastings draws from the posterior of the tree distributions. These involve introducing small perturbations to the tree structure: growing a terminal node by adding two child nodes, pruning two child nodes (rendering their parent node terminal), or changing a split rule. We denote the three possible tree alterations as GROW, PRUNE, and CHANGE. The mathematics associated with the Metropolis-Hastings step are tedious; Appendix A contains the explicit calculations. Once again, over many MCMC iterations, trees evolve to capture the fit left currently unexplained.

Pratola et al. (2013) argue that a CHANGE step is unnecessary for sufficient mixing of the Gibbs sampler. While we too observed this to be true for estimates of the posterior means, we found that omitting CHANGE can negatively impact the variable inclusion proportions (the feature introduced in Section 4.5). As a result, we implement a modified CHANGE where we only propose new splits for nodes that are singly internal: nodes whose children are both terminal (details are given in Appendix A.3). After a singly internal node is selected we (1) select a new split attribute from the set of available predictors and (2) select a new split value from the multiset of available values (these two uniform splitting rules were explained in detail previously). We would like to emphasize that the CHANGE step does not alter the tree's topology, only a splitting rule.
All $2m + 1$ steps together constitute a single Gibbs iteration. We have observed that generally no more than 1,000 iterations are needed as "burn-in" (see Section 4.3 for convergence diagnostics). An additional 1,000 iterations are usually sufficient to serve as draws from the posterior of $f(\mathbf{x})$. A single predicted value $\hat{f}(\mathbf{x})$ can be obtained by averaging the posterior values, and a quantile estimate can be obtained by computing the appropriate quantile of the posterior values. Additional features of the posterior distribution will be discussed in Section 4.

BART for classification
BART can easily be modified to handle classification problems for categorical response variables. In Chipman et al. (2010), only binary outcomes were explored, but recent work has extended BART to the multiclass problem (Kindo, Wang, and Peña 2013). Our implementation handles binary classification; we plan to implement multiclass outcomes in a future release.
For the binary classification problem (coded with outcomes "0" and "1"), we assume a probit model,

$$\mathbb{P}\left(Y = 1 \mid \mathbf{X}\right) = \Phi\left(\mathcal{T}_1^{\mathcal{M}}(\mathbf{X}) + \mathcal{T}_2^{\mathcal{M}}(\mathbf{X}) + \cdots + \mathcal{T}_m^{\mathcal{M}}(\mathbf{X})\right),$$

where $\Phi$ denotes the cumulative distribution function of the standard normal distribution. In this formulation, the sum-of-trees model serves as an estimate of the conditional probit at $\mathbf{x}$, which can easily be transformed into a conditional probability estimate of $Y = 1$.
In the classification setting, the prior on σ 2 is not needed as the model assumes σ 2 = 1. The prior on the tree structure remains the same as in the regression setting and a few minor modifications are required for the prior on the leaf parameters.
Sampling from the posterior distribution is again achieved via Gibbs sampling with the Metropolis-Hastings step outlined in Section 2.2. Following the data augmentation approach of Albert and Chib (1993), an additional vector of latent variables $\mathbf{Z}$ is introduced into the Gibbs sampler. A new step is created in which draws of $\mathbf{Z} \mid \mathbf{y}$ are obtained by conditioning on the sum-of-trees model:

$$Z_i \mid y_i = 1 \sim \mathcal{N}\left(\sum_{t=1}^{m} \mathcal{T}_t^{\mathcal{M}}(\mathbf{x}_i),\, 1\right) \text{ truncated below at } 0, \qquad Z_i \mid y_i = 0 \sim \mathcal{N}\left(\sum_{t=1}^{m} \mathcal{T}_t^{\mathcal{M}}(\mathbf{x}_i),\, 1\right) \text{ truncated above at } 0.$$

Next, $\mathbf{Z}$ is used as the response vector instead of $\mathbf{y}$ in all steps of Equation 3.
Upon obtaining a sufficient number of samples from the posterior, inferences can be made using the posterior distribution of conditional probabilities, and classification can be undertaken by applying a threshold to the means (or another summary) of these posterior probabilities. The relevant classification features of bartMachine are discussed in Section 5.

The bartMachine package
The package bartMachine provides a novel implementation of Bayesian additive regression trees in R. The algorithm is substantially faster than the current R package BayesTree and our implementation is parallelized at the MCMC iteration level during prediction. Additionally, the interface with rJava allows the entire posterior distribution of tree ensembles to persist throughout the R session, allowing for prediction and other calls to the trees without having to re-run the Gibbs sampler (a limitation of the current BayesTree implementation). Model objects can also persist across R sessions via serialization (see the model persistence section below). Since our implementation differs from BayesTree, we provide a predictive accuracy bakeoff on different datasets in Appendix B, which illustrates that the two perform about equally.

Speed improvements and parallelization
We make a number of significant speed improvements over the original implementation.
First, bartMachine is fully parallelized (with the number of cores customizable) during model creation, prediction, and many of the other features. During model creation, we chose to parallelize by creating one independent Gibbs chain per core. Thus, if we employ the default 250 burn-in samples and 1,000 post-burn-in samples on four cores, each core would draw 500 samples: 250 burn-in and 250 post-burn-in. The final model aggregates the post-burn-in samples across the four cores, yielding the desired 1,000 total post-burn-in samples. This has the drawback of effectively running the burn-in serially (which suffers from Amdahl's Law), but has the added benefit of reducing autocorrelation among the posterior sum-of-trees samples since the chains are independent, which may improve predictive performance. Parallelization at the level of likelihood calculations is left for a future release, as we were unable to overcome the costs of thread overhead. Parallelization of prediction and other features scales linearly in the number of cores.
Additionally, we take advantage of a number of additional computational shortcuts:

1. Computing the unfitted responses for each tree (Equation 2) can be accomplished by keeping a running vector and making entry-wise updates as the Gibbs sampler (Equation 3) progresses from step 1 to 2m. Additionally, during the $\sigma^2$ sampling (step 2m + 1), the residuals do not have to be computed by dropping the data down all the trees.

2. Each node caches its acceptable variables for split rules and the acceptable unique split values so they do not need to be recalculated at each tree sampling step. Recall from the discussion concerning uniform splitting rules in Section 2.1 that acceptable predictors and values change based on the data available at an arbitrary location in the tree structure. This speed enhancement, which we call memcache, comes at the expense of memory and may cause issues for large data sets. We include a toggle in our implementation defaulted to "on."

3. Careful calculations in Appendix A eliminate many unnecessary computations. For instance, the likelihood ratios are only functions of the squared sum of responses and no longer require computing the sum of the responses squared.

To illustrate runtime, we simulate datasets of varying sample size with predictors drawn iid $\sim U(-1, 1)$ and standard normal noise. Note that we do not vary p, as it was already shown in Chipman et al. (2010) that BART's computation time is largely unaffected by the dimensionality of the problem (relative to the influence of sample size). We include results for BART using BayesTree, bartMachine with one and four cores, the memcache option on and off, as well as four cores, memcache off and computation of in-sample statistics off (all with m = 50 trees). In-sample statistics by default consist of in-sample predictions ($\hat{y}$), residuals ($e := y - \hat{y}$), the L1 error, defined as $\sum_{i=1}^{n_{\text{train}}} |e_i|$, the L2 error, defined as $\sum_{i=1}^{n_{\text{train}}} e_i^2$, and the root mean squared error, defined as $\sqrt{L2 / n_{\text{train}}}$. We also include random forests model creation times via the package randomForest (Liaw and Wiener 2002) with its default settings.
We first note that Figure 1a demonstrates that the bartMachine model creation runtime is approximately linear in n (without in-sample statistics computed). There is about a 30% speed-up when using four cores instead of one. The memcache enhancement should be turned off only with sample sizes larger than n = 20,000. Noteworthy is the 50% reduction in model construction time when in-sample statistics are not computed; they are computed by default because the user generally wishes to see them. Also, for the purposes of this comparison, BayesTree models compute the in-sample statistics by necessity since the trees are not saved. The randomForest implementation becomes slower just after n = 1,000 due to its reliance on a greedy exhaustive search at each node. Figure 1b displays results for smaller sample sizes (n ≤ 2,000) that are often encountered in practice. We observe that the memcache enhancement provides about a 10% speed improvement. Thus, if memory is an issue, it can be turned off with little performance degradation.

Missing data in bartMachine
bartMachine implements a native method for incorporating missing data into both model creation and future prediction with test data. The details are given in Kapelner and Bleich (2014) but we provide a brief summary here.
There are a number of ways to incorporate missingness into tree-based methods (see Ding and Simonoff 2010 for a review). The method implemented here is known as "Missing Incorporated in Attributes" (MIA, Twala, Jones, and Hand 2008, section 2) which natively incorporates missingness by augmenting the nodes' splitting rules to (a) also handle sorting the missing data to the left or right and (b) use missingness itself as a variable to be considered in a splitting rule. Table 2 summarizes these new splitting rules as they are implemented within the package.
Implementing MIA into the BART procedure is straightforward. These new splitting rules are sampled uniformly during the GROW or CHANGE steps. For example, a splitting rule might be "x j < c or x j is missing." To account for splitting on missingness itself, we create dummy vectors of length n for each of the p attributes, denoted M 1 , . . . , M p , which assume the value 1 when the entry is missing and 0 when the entry is present. The original training matrix is then augmented with these dummies, giving the opportunity to select missingness itself when choosing a new splitting rule during the grow or change steps. Note that this can increase the number of predictors by up to a factor of 2. We illustrate building a bartMachine model with missing data in Section 4.8. As described in Chipman et al. (2010, Section 6), BART's runtime increases negligibly in the number of covariates and this has been our experience using the augmented training matrix.
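As a rough illustration of this augmentation step (a minimal sketch in R, not the package's internal Java implementation), the missingness dummies can be constructed from any training data frame X as follows:

R> M <- as.data.frame(lapply(X, function(col) as.integer(is.na(col))))
R> names(M) <- paste0("M_", names(X))
R> # keep only the dummies for predictors that actually exhibit missingness
R> X_aug <- cbind(X, M[, colSums(M) > 0, drop = FALSE])

The "M_" naming convention here matches the augmented predictors reported by the package in Section 4.8.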

Variable selection
Our package also implements the variable selection procedures developed in Bleich et al. (2014), which is best applied to data problems where the number of covariates influencing the response is small relative to the total number of covariates. We give a brief summary of the procedures here.
In order to select variables, we make use of the "variable inclusion proportions": the proportion of times each predictor is chosen as a splitting rule divided by the total number of splitting rules appearing in the model (see Section 4.5 for more details). The variable selection procedure can be outlined as follows:

1. Compute the model's variable inclusion proportions.
2. Permute the response vector, thereby breaking the relationship between the covariates and the response. Rebuild the model and compute the "null" variable inclusion proportions.
Repeat this a number of times to create a null permutation distribution.
3. Three selection rules can be used depending on the desired stringency of selection:

(a) Local Threshold: Include a predictor $x_k$ if its variable inclusion proportion exceeds the $1 - \alpha$ quantile of its own null distribution.
(b) Global Max Threshold: Include a predictor $x_k$ if its variable inclusion proportion exceeds the $1 - \alpha$ quantile of the distribution of the maximum null variable inclusion proportions across the permutations of the response.
(c) Global SE Threshold: Include $x_k$ if its variable inclusion proportion exceeds a threshold based on its own null distribution's mean and standard deviation, with a global multiplier shared by all predictors.
The Local procedure is the least stringent in terms of selection and the Global Max procedure the most. The Global SE procedure is a compromise, but behaves more similarly to the Global Max. Bleich et al. (2014) demonstrate that the best procedure depends on the underlying sparsity of the problem, which is often unknown. Therefore, the authors include an additional procedure that chooses the best of these thresholds via cross-validation and this method is also implemented in bartMachine. As highlighted in Bleich et al. (2014), this method performs favorably compared to variable selection using random forests "importance scores", which rely on the reduction in out-of-bag forecasting accuracy that occurs from shuffling the values for a particular predictor and dropping the out-of-bag observations down each tree. Examples of these procedures for variable selection are provided in Section 4.9.

bartMachine Package Features for Regression
We illustrate the package features by using both real and simulated data, focusing first on regression problems.

Computing parameters
We first set some computing parameters. In this exploration, we allow up to 5GB of RAM for the Java heap and we set the number of computing cores available for use to 4.
R> options(java.parameters = "-Xmx5000m")
R> library("bartMachine")
R> set_bart_machine_num_cores(4)

The following Sections 4.2-4.9 use a dataset obtained from the UCI Machine Learning Repository (Bache and Lichman 2013). The n = 201 observations are automobiles and the goal is to predict each automobile's price from 25 features (15 continuous and 10 nominal), first explored by Kibler, Aha, and Albert (1989). This dataset also contains missing data. The following code loads the data. We omit observations with missing data for now (these are retained in Section 4.8) and we create a variable for the design matrix X and the response y, which has already been log-transformed.
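A sketch of that loading code, assuming the automobile dataset that ships with bartMachine (the response log_price is already log-transformed):

R> data("automobile", package = "bartMachine")
R> automobile <- na.omit(automobile)
R> y <- automobile$log_price
R> X <- automobile; X$log_price <- NULL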

Model building
We are now ready to construct a bartMachine model. The default hyperparameters generally follow the recommendations of Chipman et al. (2010) and provide a ready-to-use algorithm for many data problems. Our default settings are m = 50, α = 0.95, β = 2, k = 2, q = 0.9, ν = 3, and probabilities of the GROW / PRUNE / CHANGE steps of 28% / 28% / 44%. We set the number of burn-in Gibbs samples to 250 and the number of post-burn-in samples to 1,000. We default the missing data feature to be off and default the covariates to be equally important a priori. Other parameters and their defaults can be found in the package's online manual. Below we build a default bartMachine model, where X denotes automobile attributes and y denotes the log price of the automobile.
R> bart_machine <- bartMachine(X, y)

Building bartMachine for regression ... evaluating in sample data...done

If one wishes to see more information about the individual iterations of the Gibbs sampler of Equation 3, the flag verbose can be set to TRUE. One can see more debugging information from the Java program by setting the flag debug_log to TRUE; the program will then print to unnamed.log in the current working directory.

In Figure 2 we inspect the model object to query its in-sample performance and to be reminded of the input data and model hyperparameters. Note that the p value for the Shapiro-Wilk test of normality of the residuals is marginally less than 5%, so we reject the hypothesis that the noise of Equation 1 is normally distributed. Just as when interpreting the results from a linear model, non-normality implies we should be circumspect concerning bartMachine output that relies on this distributional assumption, such as the credible and prediction intervals of Section 4.4.
Since the response is continuous, bartMachine performs regression. The dimensions of the design matrix are given; note that we dropped 41 observations that contained missing data (these are retained in Section 4.8). We are then shown the MSE for an ordinary least squares (OLS) model and our average estimate of $\sigma^2_e$, followed by in-sample error statistics. Pseudo-$R^2$ is calculated via $1 - SSE/SST$. Also provided are outputs from tests of the error distribution being mean-centered and normal. In this case, we cannot conclude normality of the residuals using the Shapiro-Wilk test.
We can also obtain out-of-sample statistics to assess the degree of overfitting by using k-fold cross-validation. Using 10 randomized folds, we find that the Pseudo-$R^2$ is lower out-of-sample than in-sample, suggesting that bartMachine is slightly overfitting (note also that the training sample during cross-validation is 10% smaller). This function also returns the $\hat{y}$ predictions as well as the vector of fold indices (output omitted here).
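The corresponding call is sketched below, assuming k_fold_cv's default interface (the component name for the out-of-sample Pseudo-$R^2$ is our assumption):

R> oos_stats <- k_fold_cv(X, y, k_folds = 10)
R> oos_stats$PseudoRsq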
It may also be of interest how the number of trees m affects performance. One can examine how out-of-sample predictions vary by the number of trees via

R> rmse_by_num_trees(bart_machine, num_replicates = 20)

and the output is shown in Figure 3.

Figure 3: Out-of-sample predictive performance (RMSE) by number of trees.

It seems that increasing m beyond 50 trees does not result in any substantial increase in performance.
We can now try to build a better bartMachine model by grid-searching over a set of hyperparameter combinations, including m (for more details, see "BART-cv" in Chipman et al. 2010). The default grid search is small and can be customized by the user.

R> bart_machine_cv <- bartMachineCV(X, y)

... bartMachine CV win: k: 2 nu, q: 3, 0.9 m: 200

This function returns the "winning" model, which is the one with the lowest out-of-sample RMSE over a 5-fold cross-validation. Here, the cross-validated bartMachine model has slightly better in-sample performance (L1 = 8.18, L2 = 0.68 and Pseudo-$R^2$ = 0.978) as well as slightly better out-of-sample performance (L1 = 21.05, L2 = 4.40 and Pseudo-$R^2$ = 0.858), which were evaluated via:

R> k_fold_cv(X, y, k_folds = 10, k = 2, nu = 3, q = 0.9, num_trees = 200)

Predictions are handled with the predict function. Fits for the first seven rows are:

R> predict(bart_machine_cv, X[1:7, ])

[1] 9.494941 9.780791 9.795532 10.058445 9.670211 9.702682 9.911394

We also include a convenience method, bart_predict_for_test_data, that will predict and return out-of-sample error metrics when the test outcomes are known.

Assumption checking
The package includes features that assess the plausibility of the BART model assumptions.
Checking the mean-centeredness of the noise is addressed in the summary output of Figure 2 and is simply a one-sample t-test of the average residual value against a null hypothesis of true mean zero. We assess both normality and heteroskedasticity via:

R> check_bart_error_assumptions(bart_machine_cv)
This will display a plot similar to Figure 4 which contains a QQ-plot (to assess normality) as well as a residual-by-predicted plot (to assess homoskedasticity). It appears that the errors are most likely normal and homoskedastic.
In addition to the model assumptions, BART requires convergence of its Gibbs sampler, which can be investigated via:

R> plot_convergence_diagnostics(bart_machine_cv)

Figure 5 displays the plot, which features four types of convergence diagnostics (each detailed in the figure caption). It appears that the bartMachine model has been sufficiently burned-in.

Credible intervals and prediction intervals
An advantage of BART is that, if we believe the priors and model assumptions, the Bayesian probability model and corresponding burned-in MCMC iterations provide the approximate posterior distribution of $f(\mathbf{x})$. Thus, one can compute uncertainty estimates via quantiles of the posterior samples. These provide Bayesian "credible intervals": intervals for the conditional expectation function, $\mathbb{E}[y \mid \mathbf{X}]$.
Another useful uncertainty interval can be computed for individual predictions by combining uncertainty from the conditional expectation function with the systematic, homoskedastic normal noise produced by $\boldsymbol{\mathcal{E}}$. This is accomplished by generating 1,000 samples (by default) from the posterior predictive distribution and then reporting the appropriate quantiles.
Below is an example of how both types of intervals are computed in the package (for the 100th observation of the training data):
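A sketch of those calls, using the calc_credible_intervals and calc_prediction_intervals functions:

R> round(calc_credible_intervals(bart_machine_cv, X[100, ], ci_conf = 0.95), 2)
R> round(calc_prediction_intervals(bart_machine_cv, X[100, ], pi_conf = 0.95), 2)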

Figure 4: Assessment of the normality assumption: the Shapiro-Wilk p value for the in-sample residuals (0.203) and a normal Q-Q plot of the in-sample residuals.

Note that the prediction intervals are wider than the credible intervals because they reflect the additional uncertainty from the error term.
We can then plot these intervals in-sample:

R> plot_y_vs_yhat(bart_machine_cv, credible_intervals = TRUE)
R> plot_y_vs_yhat(bart_machine_cv, prediction_intervals = TRUE)

Figure 6a shows how our predictions fared against the original response (in-sample) with 95% credible intervals. Figure 6b shows the same prediction-versus-response plot with 95% prediction intervals. In both, green dots indicate that the true response is within the stated interval and red dots indicate otherwise. Note that the percent coverage in (a) is not expected to be 95% because the response includes a noise term.

Variable importance
After a bartMachine model is built, it is natural to ask: which variables are most important? This is assessed by examining the splitting rules in the m trees across the post-burn-in MCMC iterations, summarized via "inclusion proportions." The inclusion proportion for any given predictor is the proportion of times that variable is chosen as a splitting rule out of all splitting rules among the posterior draws of the sum-of-trees model. Figure 7 illustrates the inclusion proportions for all variables obtained via:

R> investigate_var_importance(bart_machine_cv, num_replicates_for_avg = 20)

Variable effects
It is also natural to ask: does $x_j$ affect the response, controlling for other variables in the model? This is roughly analogous to the t-test in ordinary least squares regression of no linear effect of $x_j$ on $y$ while controlling for $\mathbf{x}_{-j}$. The null hypothesis here is the same but the linearity constraint is relaxed. To test this, we employ a permutation approach where we record the observed Pseudo-$R^2$ from the bartMachine model built with the original data. Then we permute the $j$th column, thereby destroying any relationship between $x_j$ and $y$, construct a new duplicate bartMachine model from this permuted design matrix and record a "null" Pseudo-$R^2$. We then repeat this process to obtain a null distribution of Pseudo-$R^2$'s. Since the alternative hypothesis is that $x_j$ has an effect on $y$ in terms of predictive power, our p value is the proportion of null Pseudo-$R^2$'s greater than the observed Pseudo-$R^2$, making our procedure a natural one-sided test. Note, however, that this test is conditional on the BART model and its selected priors being true, similar to the assumptions of the linear model.
If we wish to test whether a set of covariates $A \subset \{x_1, \ldots, x_p\}$ affects the response after controlling for other variables, we repeat the procedure outlined in the above paragraph, permuting the columns of $A$ in every null sample. We do not permute each column separately; instead we permute them as a unit in order to preserve the collinearity structure. This is roughly analogous to the partial F-test in ordinary least squares regression.
If we wish to test if any of the covariates matter in predicting y, we simply permute y during the null sampling. This procedure breaks the relationship between the response and the predictors but does not alter the existing associations between predictors. This is roughly analogous to the omnibus F -test in ordinary least squares regression.
At α = 0.05, Figure 8a demonstrates an insignificant effect of the variable width of car on price. Even though width is putatively the "most important" variable as measured by proportions of splits in the posterior sum-of-trees model (Figure 7), note that this is largely an easy prediction problem with many collinear predictors. Figure 8b shows the results of a test of the putatively most important categorical variable, body style (which involves permuting the categories, then dummifying the levels to preserve the structure of the variable). We find a marginally significant effect (p = 0.0495). A test of the top ten most important variables is convincingly significant (Figure 8c). For the omnibus test, Figure 8d illustrates an extremely statistically significant result, as would be expected. The code to run these tests is shown below (output suppressed).
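The four calls are sketched below via the cov_importance_test function (leaving covariates unspecified runs the omnibus test); the exact top-ten covariate set shown is illustrative:

R> cov_importance_test(bart_machine_cv, covariates = "width")
R> cov_importance_test(bart_machine_cv, covariates = "body_style")
R> cov_importance_test(bart_machine_cv, covariates = c("width",
+    "curb_weight", "city_mpg", "length", "horsepower", "body_style",
+    "engine_size", "highway_mpg", "wheel_base", "num_cylinders"))
R> cov_importance_test(bart_machine_cv)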

Partial dependence
A data analyst may also be interested in understanding how $x_j$ affects the response on average, after controlling for other predictors. This can be examined using Friedman (2001)'s partial dependence function (PDP),

$$f_j(x_j) = \mathbb{E}_{\mathbf{x}_{-j}}\left[f(x_j, \mathbf{x}_{-j})\right] := \int f(x_j, \mathbf{x}_{-j})\, d\mathbb{P}(\mathbf{x}_{-j}), \tag{4}$$

where $\mathbf{x}_{-j}$ denotes all variables except $x_j$. The PDP of predictor $x_j$ gives the average value of $f$ when $x_j$ is fixed and $\mathbf{x}_{-j}$ varies over its marginal distribution, $d\mathbb{P}(\mathbf{x}_{-j})$. As neither the true model $f$ nor the distribution of the predictors $d\mathbb{P}(\mathbf{x}_{-j})$ are known, we estimate Equation 4 by computing

$$\hat{f}_j(x_j) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}(x_j, \mathbf{x}_{-j,i}), \tag{5}$$

where n is the number of observations in the training data and $\hat{f}$ denotes the bartMachine model. Since BART provides an estimated posterior distribution, we can also plot credible bands for the PDP: in Equation 5, $\hat{f}$ can be replaced with a function that calculates the qth quantile of the post-burn-in MCMC iterations for $\hat{y}$. Figure 9a plots the PDP along with the 2.5%ile and the 97.5%ile for the variable horsepower. Varying over most of the range of horsepower, the price is predicted to increase by about $1000. Figure 9b plots the PDP along with the 2.5%ile and the 97.5%ile for the variable stroke. This predictor seemed relatively unimportant according to Figure 7 and the PDP confirms this, with a very small, yet nonlinear, average partial effect. The code for both plots is below.
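A sketch of those two calls via the pd_plot function:

R> pd_plot(bart_machine_cv, j = "horsepower")
R> pd_plot(bart_machine_cv, j = "stroke")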

Incorporating missing data
The procedure for incorporating missing data was introduced in Section 3.2. We now build a bartMachine model using this procedure:

R> y <- automobile$log_price
R> X <- automobile; X$log_price <- NULL
R> bart_machine <- bartMachine(X, y, use_missing_data = TRUE,
+    use_missing_data_dummies_as_covars = TRUE)

Note that we now use the complete data set, including the 41 observations with missing features. Also note that p has increased from 41 to 50. The nine "new" predictors are:

[1] "engine_location_rear" "engine_type_rotor"    "fuel_system_4bbl"
[4] "fuel_system_spfi"     "M_normalized_losses"  "M_bore"
[7] "M_stroke"             "M_horsepower"         "M_peak_rpm"

The first two are new levels of the variable engine_location which appear in the 41 rows with missingness, and the next two are new levels of the variable fuel_system appearing in those rows as well. The last five are dummy variables indicating missingness, constructed from the predictors which exhibited missingness (due to the use_missing_data_dummies_as_covars parameter being set to TRUE).
The procedure of Section 3.2 also natively incorporates missing data during prediction; missingness yields larger credible intervals. In the example below, we suppose that the curb_weight and symboling values are suddenly unavailable for the 20th automobile and we observe the credible interval widening as a result.
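A sketch of that example, setting the two values to NA and recomputing the credible interval for the single test observation:

R> x_star <- X[20, ]
R> calc_credible_intervals(bart_machine, x_star, ci_conf = 0.95)
R> x_star[c("curb_weight", "symboling")] <- NA
R> calc_credible_intervals(bart_machine, x_star, ci_conf = 0.95)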

Variable selection
In this section we demonstrate the principled variable selection procedure introduced in Section 3.3 and developed in detail in Bleich et al. (2014). The following code selects variables based on the three thresholds and displays the plot in Figure 10:

R> vs <- var_selection_by_permute(bart_machine,
+    bottom_margin = 10, num_permute_samples = 10)
R> vs$important_vars_local_names

"curb_weight" "city_mpg" "engine_size" "horsepower" "length"
"width" "num_cylinders" "body_style_convertible" "wheel_base"
"peak_rpm" "highway_mpg" "wheel_drive_fwd"

R> vs$important_vars_global_max_names

"curb_weight" "city_mpg" "engine_size" "horsepower" "length"

R> vs$important_vars_global_se_names

"curb_weight" "city_mpg" "engine_size" "horsepower" "length"
"width" "num_cylinders" "wheel_base" "wheel_drive_fwd"

Figure 10: Visualization of the three variable selection procedures outlined in Section 3.3 with α = 0.05. The top plot illustrates the "Local" procedure: the green lines are the threshold levels determined from the permutation distributions that must be exceeded for a variable to be selected, and the plotted points are the variable inclusion proportions for the observed data (averaged over five duplicate bartMachine models). If the observed value is higher than the green bar, the variable is included and displayed as a solid dot; otherwise it is displayed as an open dot. The bottom plot illustrates both the "Global SE" and "Global Max" thresholds: the red line is the cutoff for "Global Max" and variables that pass it are displayed as solid dots; the blue lines represent the thresholds for the "Global SE" procedure, and variables that exceed this cutoff but not the "Global Max" threshold are displayed as asterisks. Open dots exceed neither threshold.

Usually, "Global Max" and "Global SE" perform similarly, as they are both more stringent in selection. However, in many situations it will not be clear to the data analyst which threshold is most appropriate. The "best" procedure can be chosen via cross-validation on out-of-sample RMSE, as follows:
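A sketch of that cross-validated selection via var_selection_by_permute_cv (the returned component names are our assumptions):

R> vs_cv <- var_selection_by_permute_cv(bart_machine, k_folds = 5)
R> vs_cv$best_method
R> vs_cv$important_vars_cv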

Informed prior information on covariates
To illustrate, we simulate data from the Friedman (1991) function

$$\mathbf{y} = 10\sin\left(\pi x_1 x_2\right) + 20\left(x_3 - 0.5\right)^2 + 10x_4 + 5x_5 + \boldsymbol{\mathcal{E}}, \qquad \boldsymbol{\mathcal{E}} \sim \mathcal{N}_n\left(\mathbf{0}, \sigma^2 \mathbf{I}_n\right), \tag{6}$$

and construct a design matrix X where the first five columns are predictors which influence the response ($x_1, \ldots, x_5$ in Equation 6) and the next 95 columns are predictors that do not affect the response.
All that is required is a specification of relative weights for each predictor. These are converted internally to probabilities. We assign 5 times the weight to the 5 true covariates of the model relative to the 95 useless covariates.
R> prior <- c(rep(5, times = 5), rep(1, times = 95))

We now sample 500 observations from the Friedman function, construct a default bartMachine model as well as a bartMachine model with the informed prior, and compare their performance on a test set of another 500 observations.
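A sketch of this comparison, assuming cov_prior_vec is the argument that accepts the weights; friedman_sim is our own helper, not a package function:

R> friedman_sim <- function(n, p, sigma = 1) {
+   X <- data.frame(matrix(runif(n * p), ncol = p))
+   # the Friedman function of Equation 6 with p - 5 useless predictors
+   y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
+     10 * X[, 4] + 5 * X[, 5] + rnorm(n, 0, sigma)
+   list(X = X, y = y)
+ }
R> train <- friedman_sim(500, 100); test <- friedman_sim(500, 100)
R> bm <- bartMachine(train$X, train$y)
R> bm_informed <- bartMachine(train$X, train$y, cov_prior_vec = prior)
R> bart_predict_for_test_data(bm, test$X, test$y)$rmse
R> bart_predict_for_test_data(bm_informed, test$X, test$y)$rmse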
Note that by default the prior vector down-weights the indicator variables that result from dummifying factors so that the total set of dummy variables has the same weight as a continuous covariate.

Interaction effect detection
In Section 4.5, we explored using variable inclusion proportions to understand the relative influences of different covariates. A similar procedure can be carried out for examining interaction effects within a BART model. This question was initially explored in Damien, Dellaportas, Polson, and Stephens (2013) where an interaction was considered to exist between two variables if they both appeared in at least one splitting rule in a given tree. We refine the definition of an interaction as follows.
We first begin with a p × p matrix of zeroes. Within a given tree, for each split-rule variable j, we look at all split-rule variables of child nodes, k, and increment the j, k element of the matrix. Hence, variables are considered to interact in a given tree only if they appear together in a contiguous downward path from the root node to a terminal node. Note that a variable may interact with itself (when fitting a linear effect, for instance). Since there is no order between parent and child, we then add the j, k counts together with the k, j counts (if j ≠ k). Summing across trees and MCMC iterations gives the total number of interactions for each pair of variables, from which relative importance can be assessed.
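The counting scheme can be sketched on a toy representation in which each tree is reduced to a data frame of (parent, child) split-variable index pairs (an illustration only, not the package's internal Java code):

R> count_interactions <- function(trees, p) {
+   counts <- matrix(0, p, p)
+   for (tree in trees) {
+     for (r in seq_len(nrow(tree))) {
+       j <- tree$parent_var[r]
+       k <- tree$child_var[r]
+       counts[j, k] <- counts[j, k] + 1  # parent j splits directly above child k
+     }
+   }
+   # no order between parent and child: fold the j,k and k,j counts together,
+   # keeping self-interactions (the diagonal) counted once
+   counts + t(counts) - diag(diag(counts), p, p)
+ }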
We demonstrate interaction detection on the Friedman function using 10 covariates using the code below:

R> interaction_investigator(bart_machine, num_replicates_for_avg = 25,
+    num_var_plot = 10, bottom_margin = 5)

Shown in Figure 11 are the ten most important interactions in the model. The illustration is averaged over many model constructions to obtain stable estimates across many posterior modes in the sum-of-trees distribution. Notice that the interaction between $x_1$ and $x_2$ dominates all other terms, as bartMachine correctly captures the single true interaction effect in Equation 6. Choosing which of these interactions significantly affect the response is not addressed in this paper. The methods suggested in Section 3.3 may be applicable here and we consider this to be fruitful future work.

bartMachine Model Persistence Across R Sessions
A convenient feature of bartMachine is its ability to "serialize." Serialization allows the user to construct models and have them persist across R sessions. The cost is time during model creation and hard drive space; thus the serialize feature is defaulted to "off." We demonstrate with the code below:

R> bart_machine <- bartMachine(X, y, serialize = TRUE)
R> save.image("bart_demo.RData")
R> q("no")

The training data is the same as in the previous section: n = 100 and p = 10. For the default bartMachine settings, m = 50, the number of burn-in MCMC iterations is 250 and the number of posterior samples is 1000. These settings yielded an almost instant serialization, generating a 2.1MB RData file. For a more realistic dataset with n = 5000, p = 1000, m = 100 and 5000 posterior samples, the serialization and saving of the file took a few minutes and required 100MB.

bartMachine Package Features for Classification
In this section we highlight the features that differ from the regression case when the response is dichotomous. The illustrative dataset consists of 332 Pima Indians obtained from the UCI repository. Of the 332 subjects, 109 were diagnosed with diabetes, the binary response variable. There are seven continuous predictors which are body metrics such as blood pressure, glucose concentration, etc. and there is no missing data.
Building a bartMachine model for classification involves the same computing parameters as regression except that q and ν cannot be specified, since there is no longer a prior on $\sigma^2$ (see Section 2.3). We first build a cross-validated model:
R> bart_machine_cv <- bartMachineCV(X, y)

Classification models have an added hyperparameter, prob_rule_class, the threshold for determining whether a probability estimate is great enough to be classified into the positive category. We can see above that bartMachine at times predicts "NO" for true "YES" outcomes, suffering a 37.6% error rate for this outcome. We can try to mitigate this error by lowering the threshold to increase the number of "YES" labels predicted:

R> bartMachine(X, y, prob_rule_class = 0.

When using the covariate tests of Section 4.6, total misclassification error becomes the statistic of interest instead of Pseudo-$R^2$; the p value is now calculated as the proportion of null samples with lower misclassification error. Figure 12 illustrates such a test, showing that the predictor age seems to matter in the prediction of diabetes, controlling for the other predictors.

Figure 12: Test of covariate importance for predictor age on whether or not the subject will contract diabetes (null samples of misclassification errors).

The partial dependence plots of Section 4.7 are now scaled as the probit of the probability estimate. Figure 13 illustrates that as glucose increases, the probability of contracting diabetes increases linearly on the probit scale.
Credible intervals are implemented for classification bartMachine and are displayed on the probit scale. Note that the prediction intervals of Section 4.4 do not exist for classification. Other functions work similarly to regression except those that plot the responses and those that explicitly depend on RMSE as an error metric.

Discussion
This article introduced bartMachine, a new R package which implements Bayesian additive regression trees. The goal of this package is to provide a fast, extensive and user-friendly implementation accessible to a wide range of data analysts, and increase the visibility of BART to a broader statistical audience. We hope we have provided organized, well-documented open-source code and we encourage the community to make innovations on this package.

Replication
The stable version of bartMachine is on CRAN and the development version is located at http://github.com/kapelner/bartMachine. The package code is under the GPL3 and MIT licenses. Results, tables, and figures found in this paper can be replicated via the scripts located in the bart_package_paper folder within the git repository.
Acknowledgments

We thank Simon Urbanek for his very generous help with rJava at many stages of this research. Adam Kapelner acknowledges support from the National Science Foundation's Graduate Research Fellowship Program.

A. Sampling to Modify Tree Structure
This section provides details on the implementation of Equation 3 (steps 1, 3, . . . , 2m − 1), the Metropolis-Hastings step for sampling new trees. Recall from Section 2.2 that trees can be altered via growing new child nodes from an existing terminal node, pruning two terminal nodes such that their parent becomes terminal, or changing the splitting rule in a node.
Below is the Metropolis ratio (Gelman et al. 2004, p. 291) where the parameter sampled is the tree and the data are the responses unexplained by the other trees, denoted by $\mathbf{R}$. We denote the new, proposal tree with an asterisk and the original tree without:

$$r = \frac{\mathbb{P}\left(\mathcal{T}^* \rightarrow \mathcal{T}\right)}{\mathbb{P}\left(\mathcal{T} \rightarrow \mathcal{T}^*\right)} \times \frac{\mathbb{P}\left(\mathcal{T}^* \mid \mathbf{R}, \sigma^2\right)}{\mathbb{P}\left(\mathcal{T} \mid \mathbf{R}, \sigma^2\right)}. \tag{7}$$
We accept a draw from the posterior distribution of trees if a draw from a standard uniform distribution is less than the value of $r$. Immediately we note that it is difficult (if not impossible) to calculate the posterior probabilities of the trees themselves. Instead, we apply Bayes' rule, $\mathbb{P}(\mathcal{T} \mid \mathbf{R}, \sigma^2) \propto \mathbb{P}(\mathbf{R} \mid \mathcal{T}, \sigma^2)\, \mathbb{P}(\mathcal{T})$, and plug the result into Equation 7 to obtain:

$$r = \underbrace{\frac{\mathbb{P}\left(\mathcal{T}^* \rightarrow \mathcal{T}\right)}{\mathbb{P}\left(\mathcal{T} \rightarrow \mathcal{T}^*\right)}}_{\text{transition ratio}} \times \underbrace{\frac{\mathbb{P}\left(\mathbf{R} \mid \mathcal{T}^*, \sigma^2\right)}{\mathbb{P}\left(\mathbf{R} \mid \mathcal{T}, \sigma^2\right)}}_{\text{likelihood ratio}} \times \underbrace{\frac{\mathbb{P}\left(\mathcal{T}^*\right)}{\mathbb{P}\left(\mathcal{T}\right)}}_{\text{tree structure ratio}}.$$

Note that the probability of the tree structure is independent of $\sigma^2$.
The goal of this section is to explicitly calculate $r$ for all possible tree proposals: GROW, PRUNE and CHANGE. For each proposal, the calculations are organized into separate sections detailing each of the three ratios: transition, likelihood and tree structure. Note that our actual implementation uses these expressions in log form for numerical accuracy.

A.1. Grow proposal

Transition ratio
Transitioning from the original tree to a new tree involves growing two child nodes from a current terminal node:

$$\mathbb{P}\left(\mathcal{T} \rightarrow \mathcal{T}^*\right) = \mathbb{P}(\text{GROW})\, \mathbb{P}(\text{selecting } \eta \text{ to grow from})\, \mathbb{P}(\text{selecting the } j\text{th attribute to split on})\, \mathbb{P}(\text{selecting the } i\text{th value to split on}) = \mathbb{P}(\text{GROW}) \times \frac{1}{b} \times \frac{1}{p_{adj}(\eta)} \times \frac{1}{n_{j \cdot adj}(\eta)}. \tag{8}$$

We choose one of the current $b$ terminal nodes, which we denote the $\eta$th node, and then pick an attribute and split point. $p_{adj}(\eta)$ denotes the number of predictors left available to split on; this can be less than $p$ if certain predictors do not have two or more unique values once the data reaches the $\eta$th node (this regularly occurs, for example, if a dummy variable was split on in some node higher up in the lineage). $n_{j \cdot adj}(\eta)$ denotes the number of unique values left in the $j$th attribute after adjusting for parents' splits.
Transitioning from the new tree back to the original tree involves pruning that node:

$$\mathbb{P}\left(\mathcal{T}^* \rightarrow \mathcal{T}\right) = \mathbb{P}(\text{PRUNE}) \times \frac{1}{w_2^*},$$

where $w_2^*$ denotes the number of second-generation internal nodes (nodes with two terminal child nodes) in the new tree. Thus, the full transition ratio is:

$$\frac{\mathbb{P}\left(\mathcal{T}^* \rightarrow \mathcal{T}\right)}{\mathbb{P}\left(\mathcal{T} \rightarrow \mathcal{T}^*\right)} = \frac{\mathbb{P}(\text{PRUNE})}{\mathbb{P}(\text{GROW})} \times \frac{b\, p_{adj}(\eta)\, n_{j \cdot adj}(\eta)}{w_2^*}.$$

Note that when no variables have two or more unique values, the probability of GROW is set to zero and the step will be automatically rejected.

Likelihood ratio
To calculate the likelihood, the tree structure determines which responses fall into which of the $b$ terminal nodes. Thus,

$$\mathbb{P}\left(\mathbf{R} \mid \mathcal{T}, \sigma^2\right) = \prod_{\ell=1}^{b} \mathbb{P}\left(R_{\ell,1}, \ldots, R_{\ell,n_\ell} \mid \sigma^2\right),$$

where each term on the right-hand side is the probability of the responses in one of the $b$ terminal nodes, which are independent by assumption. The $R_\ell$'s denote the data in the $\ell$th terminal node, $n_\ell$ denotes how many observations are in that node, and $n = \sum_{\ell=1}^{b} n_\ell$. We now find an analytic expression for the node likelihood term. If the mean in the $\ell$th terminal node, which we denote $\mu_\ell$, were known, we would have $R_{\ell,1}, \ldots, R_{\ell,n_\ell} \mid \mu_\ell, \sigma^2 \stackrel{iid}{\sim} \mathcal{N}\left(\mu_\ell, \sigma^2\right)$. BART requires $\mu_\ell$ to be integrated out, allowing the Gibbs sampler in Equation 3 to avoid jumping between continuous spaces of varying dimensions (Chipman et al. 2010, p. 275). Recall that one of the BART model assumptions is the prior $\mu_\ell \sim \mathcal{N}\left(0, \sigma_\mu^2\right)$ and thus

$$\mathbb{P}\left(R_{\ell,1}, \ldots, R_{\ell,n_\ell} \mid \sigma^2\right) = \int_{\mathbb{R}} \mathbb{P}\left(R_{\ell,1}, \ldots, R_{\ell,n_\ell} \mid \mu_\ell, \sigma^2\right) \mathbb{P}\left(\mu_\ell; \sigma_\mu^2\right) d\mu_\ell,$$

which can be shown, via completion of the square or convolution, to be

$$\mathbb{P}\left(R_{\ell,1}, \ldots, R_{\ell,n_\ell} \mid \sigma^2\right) = \frac{1}{\left(2\pi\sigma^2\right)^{n_\ell/2}} \sqrt{\frac{\sigma^2}{\sigma^2 + n_\ell \sigma_\mu^2}} \exp\left(-\frac{1}{2\sigma^2}\left(\sum_{i=1}^{n_\ell} R_{\ell,i}^2 - \frac{n_\ell^2 \bar{R}_\ell^2\, \sigma_\mu^2}{\sigma^2 + n_\ell \sigma_\mu^2}\right)\right), \tag{9}$$

where $\bar{R}_\ell$ denotes the mean response in the node and $R_{\ell,i}$ the observations $i = 1, \ldots, n_\ell$ in the node.
Since the likelihoods are solely determined by the terminal nodes, the proposal tree differs from the original tree only in the selected node to be grown, denoted $\ell$, which becomes two children after the GROW step, denoted $\ell_L$ and $\ell_R$. Hence, the likelihood ratio becomes:

$$\frac{\mathbb{P}\left(\mathbf{R} \mid \mathcal{T}^*, \sigma^2\right)}{\mathbb{P}\left(\mathbf{R} \mid \mathcal{T}, \sigma^2\right)} = \frac{\mathbb{P}\left(R_{\ell_L,1}, \ldots, R_{\ell_L,n_{\ell_L}} \mid \sigma^2\right) \mathbb{P}\left(R_{\ell_R,1}, \ldots, R_{\ell_R,n_{\ell_R}} \mid \sigma^2\right)}{\mathbb{P}\left(R_{\ell,1}, \ldots, R_{\ell,n_\ell} \mid \sigma^2\right)}. \tag{10}$$

Plugging Equation 9 into Equation 10 three times yields the ratio for the GROW step:

$$\sqrt{\frac{\sigma^2\left(\sigma^2 + n_\ell \sigma_\mu^2\right)}{\left(\sigma^2 + n_{\ell_L} \sigma_\mu^2\right)\left(\sigma^2 + n_{\ell_R} \sigma_\mu^2\right)}} \exp\left(\frac{\sigma_\mu^2}{2\sigma^2}\left(\frac{\left(\sum_{i=1}^{n_{\ell_L}} R_{\ell_L,i}\right)^2}{\sigma^2 + n_{\ell_L} \sigma_\mu^2} + \frac{\left(\sum_{i=1}^{n_{\ell_R}} R_{\ell_R,i}\right)^2}{\sigma^2 + n_{\ell_R} \sigma_\mu^2} - \frac{\left(\sum_{i=1}^{n_\ell} R_{\ell,i}\right)^2}{\sigma^2 + n_\ell \sigma_\mu^2}\right)\right),$$

where $n_{\ell_L}$ and $n_{\ell_R}$ denote the number of data points in the newly grown left and right child nodes.

Tree structure ratio
In Section 2.1 we discussed the prior on the tree structure (where the splits occur) as well as the splitting rules. For the entire tree,

$$\mathbb{P}(\mathcal{T}) = \prod_{\eta \in H_{\text{terminals}}} \left(1 - \mathbb{P}_{\text{SPLIT}}(\eta)\right) \prod_{\eta \in H_{\text{internals}}} \mathbb{P}_{\text{SPLIT}}(\eta) \prod_{\eta \in H_{\text{internals}}} \mathbb{P}_{\text{RULE}}(\eta),$$

where $H_{\text{terminals}}$ denotes the set of terminal nodes and $H_{\text{internals}}$ the set of internal nodes.
Recall that the probability of splitting on a given node $\eta$ is $\mathbb{P}_{\text{SPLIT}}(\eta) = \alpha / (1 + d_\eta)^\beta$. This probability is controlled by two hyperparameters, $\alpha$ and $\beta$, and $d_\eta$ is the depth (number of parent generations) of node $\eta$. When assigning a rule, recall that BART picks uniformly from all available attributes and then from all available unique split points. Using the notation from the transition ratio section, $\mathbb{P}_{\text{RULE}}(\eta) = 1/p_{adj}(\eta) \times 1/n_{j \cdot adj}(\eta)$.
Once again, the original tree features a node $\eta$ that was selected to be grown; the proposal tree differs by the two child nodes $\eta_L$ and $\eta_R$. We can now form the ratio:

$$\frac{\mathbb{P}\left(\mathcal{T}^*\right)}{\mathbb{P}\left(\mathcal{T}\right)} = \frac{\left(1 - \mathbb{P}_{\text{SPLIT}}(\eta_L)\right)\left(1 - \mathbb{P}_{\text{SPLIT}}(\eta_R)\right) \mathbb{P}_{\text{SPLIT}}(\eta)\, \mathbb{P}_{\text{RULE}}(\eta)}{1 - \mathbb{P}_{\text{SPLIT}}(\eta)} = \alpha\, \frac{\left(1 - \frac{\alpha}{(2 + d_\eta)^\beta}\right)^2}{\left((1 + d_\eta)^\beta - \alpha\right) p_{adj}(\eta)\, n_{j \cdot adj}(\eta)}.$$

The last equality follows from algebra and the fact that the depth of the grown nodes is the depth of the parent node incremented by one ($d_{\eta_L} = d_{\eta_R} = d_\eta + 1$).

A.2. Prune proposal
A prune proposal is the "opposite" of a grow proposal. Prune selects a singly internal node (a node whose children are both terminal) and removes both of its children. Thus, each ratio will be approximately the inverse of the ratios found in the previous section concerning the grow proposal. Note also that prune steps are not considered in trees that consist of a single root node.

Transition ratio
We begin with transitioning from the original tree to the proposal tree:

$$\mathbb{P}\left(\mathcal{T} \rightarrow \mathcal{T}^*\right) = \mathbb{P}(\text{PRUNE})\, \mathbb{P}(\text{selecting } \eta \text{ to prune from}) = \mathbb{P}(\text{PRUNE}) \times \frac{1}{w_2},$$

where $w_2$ denotes the number of singly internal parent nodes (nodes with two terminal children and thus no grandchildren). To transition in the opposite direction, we are obligated to grow from node $\eta^*$. This is similar to Equation 8 except that the proposal tree has one fewer terminal node due to the pruning of the original tree, resulting in a $1/(b-1)$ term:

$$\mathbb{P}\left(\mathcal{T}^* \rightarrow \mathcal{T}\right) = \mathbb{P}(\text{GROW}) \times \frac{1}{b - 1} \times \frac{1}{p_{adj}(\eta^*)} \times \frac{1}{n_{j^* \cdot adj}(\eta^*)}.$$

Thus, the transition ratio is:

$$\frac{\mathbb{P}\left(\mathcal{T}^* \rightarrow \mathcal{T}\right)}{\mathbb{P}\left(\mathcal{T} \rightarrow \mathcal{T}^*\right)} = \frac{\mathbb{P}(\text{GROW})}{\mathbb{P}(\text{PRUNE})} \times \frac{w_2}{(b - 1)\, p_{adj}(\eta^*)\, n_{j^* \cdot adj}(\eta^*)}.$$

Likelihood ratio

This is simply the inverse of the likelihood ratio for the grow proposal (the expression following Equation 10), with $\eta^*$ playing the role of the grown node.

Tree structure ratio

This is also simply the inverse of the tree structure ratio for the grow proposal:

$$\frac{\mathbb{P}\left(\mathcal{T}^*\right)}{\mathbb{P}\left(\mathcal{T}\right)} = \frac{\left((1 + d_{\eta^*})^\beta - \alpha\right) p_{adj}(\eta^*)\, n_{j^* \cdot adj}(\eta^*)}{\alpha \left(1 - \frac{\alpha}{(2 + d_{\eta^*})^\beta}\right)^2}.$$

A.3. Change proposal
A change proposal involves picking an internal node and changing its rule by picking both a new available predictor to split on and a new valid split value among values of the selected predictor. Although this could be implemented for use in any internal node in the tree, for simplicity we limit our implementation to singly internal nodes: those that have two terminal child nodes.

Transition ratio
The transition to a proposal tree is:

$$\mathbb{P}\left(\mathcal{T} \rightarrow \mathcal{T}^*\right) = \mathbb{P}(\text{CHANGE})\, \mathbb{P}(\text{selecting node } \eta \text{ to change})\, \mathbb{P}(\text{selecting the new attribute to split on})\, \mathbb{P}(\text{selecting the new value to split on}).$$

When calculating the ratio, the first three terms are shared in both numerator and denominator. The probability of selecting the new value to split on differs because different split features have different numbers of unique values available. Thus we are left with

$$\frac{\mathbb{P}\left(\mathcal{T}^* \rightarrow \mathcal{T}\right)}{\mathbb{P}\left(\mathcal{T} \rightarrow \mathcal{T}^*\right)} = \frac{n_{j^* \cdot adj}(\eta^*)}{n_{j \cdot adj}(\eta)},$$

where $n_{j^* \cdot adj}(\eta^*)$ is the number of split values available under the proposal tree's splitting rule and $n_{j \cdot adj}(\eta)$ is the number available under the original tree's splitting rule.
Likelihood ratio

Substituting Equation 9 four times (once for each child node under each splitting rule) and simplifying, the likelihood ratio reduces to

$$\exp\left(\frac{\sigma_\mu^2}{2\sigma^2}\left(\frac{\left(\sum_i R_{1,i}^*\right)^2 - \left(\sum_i R_{1,i}\right)^2}{\sigma^2 + n_1 \sigma_\mu^2} + \frac{\left(\sum_i R_{2,i}^*\right)^2 - \left(\sum_i R_{2,i}\right)^2}{\sigma^2 + n_2 \sigma_\mu^2}\right)\right)$$

when the numbers of responses in the children do not change under the proposal ($n_1 = n_1^*$ and $n_2 = n_2^*$), where asterisks denote the data in the proposal tree's children.
Tree structure ratio

The probabilities of splits remain the same under both trees because the child nodes are at the same depths. Thus we only need to consider the ratio of the probabilities of the rules. Once again, the probability of selecting the new value to split on differs as different split features have different numbers of unique values available. We are left with

$$\frac{\mathbb{P}\left(\mathcal{T}^*\right)}{\mathbb{P}\left(\mathcal{T}\right)} = \frac{n_{j \cdot adj}(\eta)}{n_{j^* \cdot adj}(\eta^*)}.$$
Note that this is the inverse of the transition ratio. Hence, for the change step, only the likelihood ratio needs to be computed to determine the Metropolis-Hastings ratio r.

B. Bakeoff
We baked off nine regression data sets and assessed out-of-fold RMSE using 10-fold cross-validation. We average the results across 20 replications of cross-validation. The results are displayed in Table 3.

Table 3: RMSE values for three machine learning algorithms averaged across 20 replicates. Asterisks indicate a significant difference between bartMachine and BayesTree at a significance level of 5% with a Bonferroni correction. Comparisons with randomForest's performance were not conducted.
We conclude that the implementation outlined in this paper performs approximately the same as the previous implementation with regards to predictive accuracy.