ada: An R Package for Stochastic Boosting

Boosting is an iterative algorithm that combines simple classification rules with "mediocre" performance in terms of misclassification error rate to produce a highly accurate classification rule. Stochastic gradient boosting is an enhancement that incorporates a random mechanism at each boosting step, improving both performance and speed in generating the ensemble. ada is an R package that implements three popular variants of boosting, together with a version of stochastic gradient boosting. In addition, useful plots for data analytic purposes are provided, along with an extension to the multi-class case. The algorithms are illustrated on synthetic and real data sets.


Introduction
Boosting has proved to be an effective method for improving the performance of base classifiers, both theoretically and empirically. The underlying idea is to combine simple classification rules (the base classifiers) to form an ensemble whose performance is significantly improved. The origins of boosting lie in PAC learning theory [25], which established that learners exhibiting performance slightly better than random guessing can, when appropriately combined, perform very well.
A provably polynomial-complexity boosting algorithm was derived in [21], whereas the Adaptive Boosting (AdaBoost) algorithm, in its various variants [9; 10], proved to be a practical implementation of the boosting ensemble method. Since its introduction, many authors have sought to explain and improve upon the AdaBoost algorithm. In [11], it was shown that the AdaBoost algorithm can be thought of as a stage-wise gradient descent procedure that minimizes an exponential loss function. In addition, three modifications of the original algorithm were proposed: Gentle, Logit, and Real AdaBoost.
Recently, several authors have explored regularizing the boosting algorithm [12; 20]. For example, if classification trees constitute the base classifiers, regularization can be accomplished through the learning rate parameter, which controls the algorithm's ability to search the collection of all possible trees for the given data set. However, the improved predictive accuracy comes at a heavy computational cost. Stochastic gradient boosting [13; 19] utilizes a random mechanism for regularization purposes and achieves significant computational savings.
The ada package implements the original AdaBoost algorithm, along with the Gentle and Real AdaBoost variants, using both exponential and logistic loss functions for classification problems. In addition, it allows the user to fit regularized versions of these methods by using the learning rate as a tuning parameter, which leads to improved computational performance. The base classifiers employed are classification/regression trees, and therefore the underlying engine is the rpart package; ada uses rpart's functionality for handling missing data and surrogate splits. Some important features incorporated in the ada package are: (i) the use of both regression and classification trees for boosting, (ii) various useful plots that aid in assessing variable importance and relationships between subsets of variables, and (iii) tweaking of the rpart controls to make them applicable for boosting purposes.
The boosting framework typically accomplishes the difficult task of providing both strong predictive performance and useful model diagnostics, which has made it desirable for many classification problems [9; 10; 14]. In this paper, we provide an analysis of boosting on pharmacology data [26], where the goal is to predict whether a compound is soluble and to identify variables that are important for predicting this property. Beyond pharmacology, boosting algorithms have encompassed a wide range of applications, including tumor identification and gene expression data [7], proteomics data [24], financial and marketing data [2; 18], fisheries data [17], and microscope imaging data [15]. For many of these applications, ada will be particularly useful, since it implements well-documented tools for assessing variable importance, evaluating training and testing error rates, and viewing pairwise plots of the data.
Currently, free R packages exist for boosting that efficiently build regression trees, smoothing splines, and additive models (such as gbm and mboost) [16; 19]. The gbm package offers two versions of boosting for classification (gentle boost under logistic and exponential loss). In addition, it includes squared error, absolute error, Poisson, and Cox-type loss functions; however, the latter loss functions are not natural or recommended for classification purposes [14]. The mboost package has largely similar functionality to gbm and, in addition, implements the general gradient boosting framework using regression-based learners. In our experience, these packages are better suited for users who need boosting for models with a continuous or count-type outcome. The ada package, on the other hand, provides a straightforward, well-documented, and broad boosting routine for classification, ideally suited for small to moderate-sized data sets. As an enhancement over current boosting packages, ada incorporates two popular loss functions for three boosting variants, variance reduction components (similar to bagging), regularization, and performance diagnostics. In addition, at each iteration ada can use cross-validation to determine tree depth, which resolves the tree depth selection issue associated with gbm. ada's extensive documentation and ease of use make it a natural package for individuals interested in quickly familiarizing themselves with the boosting methodology and assessing how boosting would perform on their data.
The paper is organized as follows: in section 2, a brief introduction to the various boosting algorithms implemented in ada is provided; section 3 discusses various implementation issues, while section 4 describes the available functions. Finally, section 5 discusses several practical issues in the context of real data examples.

A brief account of boosting algorithms
In this section, a brief synopsis of the history of AdaBoost (now referred to as Discrete AdaBoost) is given, which discusses the origins of the algorithm as well as the relationship between boosting and additive models. The next section focuses on implementation details for each boosting variant implemented in the package; this is accomplished by presenting a general boosting algorithm and then discussing the various variants as special cases.
In a classification problem, a training data set consisting of $n$ objects is available. Each object is characterized by a $p$-dimensional attribute (feature/variable) vector $x$, belonging to a suitable space (e.g., $\mathbb{R}^p$), and a class label (response) $y \in \{+1, -1\}$. The objective is to construct a decision (classification) rule $F(x)$ that accurately predicts the class labels of objects for which only the attribute vector is observed.

Historical perspective
In 1996, Freund and Schapire [9] produced the well-known AdaBoost.M1 (also known as Discrete AdaBoost) algorithm (given below). In short, AdaBoost.M1 generates a sequentially weighted set of weak base classifiers that are combined to form an overall strong classifier. In each step of the sequence, AdaBoost attempts to find an optimal classifier according to the current distribution of weights on the observations. If an observation is incorrectly classified under the current distribution of weights, then it receives more weight in the next iteration; correctly classified observations, on the other hand, receive less weight. In the final overall model, classifiers that are accurate predictors of the training data receive more weight, whereas classifiers that are poor predictors receive less weight. Thus, AdaBoost uses a sequence of simple weighted classifiers, each forced to learn a different aspect of the data, to generate a final, comprehensive classifier, which with high probability outperforms any individual classifier in terms of misclassification error rate. The basic steps of the algorithm are described next (Algorithm 1):

Algorithm 1: Discrete AdaBoost
1: Initialize the weights $w_i = 1/n$, $i = 1, \ldots, n$
2: for $m = 1$ to $M$ do
3:   fit $y = h_m(x)$ as the base weighted classifier using the weights $w_i$
4:   compute $\mathrm{err}_m = \sum_i w_i I\{y_i \neq h_m(x_i)\}$ and $\alpha_m = \log\frac{1-\mathrm{err}_m}{\mathrm{err}_m}$
5:   set $w_i = w_i \exp\{\alpha_m I\{y_i \neq h_m(x_i)\}\}$, scaled to sum to one, $\forall i \in \{1, \ldots, n\}$
6: end for
Output: $F(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m h_m(x)\right)$

In 2000, Friedman et al. [11] established connections of the AdaBoost.M1 algorithm to statistical concepts such as loss functions, additive modeling, and logistic regression. In particular, they showed that AdaBoost.M1 fits a forward stagewise additive logistic regression model that minimizes the expectation of the exponential loss function, $e^{-yF(x)}$, with $F(x)$ denoting the boosted classifier.
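To make Algorithm 1 concrete, the following is a minimal didactic sketch in R; it is not ada's internal implementation, and the function name, the use of stumps, and the absence of safeguards (e.g., for a zero weighted error) are illustrative choices.

> library(rpart)
> adaboost.m1 <- function(x, y, M = 50) {      # y coded -1/+1
+   n <- nrow(x)
+   w <- rep(1/n, n)                           # step 1: initialize weights
+   alpha <- numeric(M)
+   trees <- vector("list", M)
+   dat <- data.frame(y = factor(y), x)
+   for (m in 1:M) {
+     fit <- rpart(y ~ ., data = dat, weights = w,
+                  control = rpart.control(maxdepth = 1, cp = -1, minsplit = 0))
+     h <- as.numeric(as.character(predict(fit, dat, type = "class")))
+     err <- sum(w * (h != y))                 # step 4: weighted error
+     alpha[m] <- log((1 - err) / err)         # step 4: stageweight
+     w <- w * exp(alpha[m] * (h != y))        # step 5: reweight
+     w <- w / sum(w)                          #         and renormalize
+     trees[[m]] <- fit
+   }
+   list(alpha = alpha, trees = trees)         # F(x) = sign(sum alpha_m h_m(x))
+ }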
While the AdaBoost algorithm has been shown empirically to improve classification accuracy, it outputs at each stage only the object's predicted label. This coarse information may hinder the algorithm's efficiency in finding an optimal classification model. To overcome this deficiency, several proposals have been put forth in the literature; for example, the algorithm can output a real-valued prediction rather than a class label at each stage of boosting. The latter variant corresponds to the Real AdaBoost algorithm, where the class probability estimate is converted to a real-valued scale using the half-log ratio. This value is then used to represent an observation's contribution to the final overall model. Furthermore, observation weights for subsequent iterations are updated according to the exponential loss function of AdaBoost. Like Discrete AdaBoost, the Real AdaBoost algorithm attempts to minimize the expectation of $e^{-yF(x)}$. In general, these modifications allow Real AdaBoost to find an optimal classification model more efficiently than AdaBoost.M1. In addition to the Real AdaBoost modification, Friedman et al. [11] proposed a further extension, the Gentle AdaBoost algorithm, which minimizes the exponential loss function of AdaBoost through a sequence of Newton steps. Although Real AdaBoost and Gentle AdaBoost optimize the same loss function and perform similarly on identical data sets, Gentle AdaBoost is numerically superior because it does not rely on the half-log ratio.

Stochastic boosting
Boosting inherently relies on a gradient descent search for optimizing the underlying loss function to determine both the weights and the learner at each iteration [12]. In Stochastic Gradient Boosting (SGB), a random permutation sampling strategy is employed at each iteration to obtain a refined training set. The full SGB algorithm with the gradient boosting modification relies on the regularization parameter $\nu \in [0, 1]$, the so-called learning rate. At each iteration, after the base learner $h_m$ has been fit on the subsample $\pi_m$, the line search step (Step 3, Algorithm 2) computes

$$\alpha_m = \arg\min_{\alpha} \sum_{i \in \pi_m} L\big(y_i, F(x_i) + \alpha\,\eta(h_m(x_i))\big)$$

(in some cases this step may be omitted, or $\alpha_m = 1$), and the ensemble is then updated as $F(x) \leftarrow F(x) + \nu\,\alpha_m\,\eta(h_m(x))$.
In the case of exponential loss, the line search step (Step 3, Algorithm 2) solves

$$\alpha_m = \arg\min_{\alpha} \sum_{i \in \pi_m} e^{-y_i (F(x_i) + \alpha \eta_i)}, \quad (1)$$

where $\eta_i = \eta(h_m(x_i))$. Writing $p_i(\alpha_{k-1}) = w_i e^{-\alpha_{k-1} y_i \eta_i}$ with $w_i = e^{-y_i F(x_i)}$, the solution is obtained by Newton iterations, and the final stageweight $\alpha_m = \alpha_\infty$ is used in Algorithm 2. For the logistic loss (L2-Boost) version, the line search corresponds to

$$\alpha_m = \arg\min_{\alpha} \sum_{i \in \pi_m} \log\left(1 + e^{-y_i (F(x_i) + \alpha \eta_i)}\right), \quad (2)$$

with $p_i(\alpha_{k-1}) = \frac{w_i e^{-\alpha_{k-1} y_i \eta_i}}{1 + w_i e^{-\alpha_{k-1} y_i \eta_i}}$ and $w_i = e^{-y_i F(x_i)}$. Next, we provide the details for adapting Algorithm 2 to each variant of boosting.
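The Newton iteration for (1) can be written out explicitly. The following is a minimal R sketch with illustrative names; ada performs this search internally, with the number of iterations capped by its max.iter argument (see the stageweight convergence discussion below).

> # Newton search for alpha minimizing sum(exp(-y * (F + alpha * eta)))
> stagewt.exp <- function(y, Fx, eta, max.iter = 20, tol = 1e-8) {
+   alpha <- 0
+   w <- exp(-y * Fx)                     # w_i = exp(-y_i F(x_i))
+   for (k in 1:max.iter) {
+     p <- w * exp(-alpha * y * eta)      # p_i(alpha_{k-1})
+     step <- sum(y * eta * p) / sum(eta^2 * p)
+     alpha <- alpha + step
+     if (abs(step) < tol) break
+   }
+   alpha                                 # alpha_m = alpha_infinity
+ }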

Remark:
The ada package provides the flexibility to set all the stageweights to a value of 1. This is known as $\epsilon$-boosting, where one fits the ensemble with an arbitrarily small $\nu$ [20].

Discrete AdaBoost
For Discrete AdaBoost, set $L(y, g) = e^{-yg}$, which gives $w_i = -y_i e^{-y_i F(x_i)}$ (exponential loss), and $\eta(x) = \mathrm{sign}(x)$. Using expression (1) with this value of $\eta$, the optimization problem has a closed form solution given by $\alpha_m = \frac{1}{2}\log\frac{1-\mathrm{err}_m}{\mathrm{err}_m}$. Therefore, the algorithm fits the original Discrete AdaBoost algorithm with a random sampling strategy at each iteration.
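For completeness, the closed form follows from a standard computation (not reproduced in the original text): with the weights $w_i = e^{-y_i F(x_i)}$ normalized to sum to one, $\eta_i \in \{-1, +1\}$, and $\mathrm{err}_m$ the weighted misclassification error,

$$\sum_i w_i e^{-\alpha y_i \eta_i} = e^{-\alpha}\,(1-\mathrm{err}_m) + e^{\alpha}\,\mathrm{err}_m,$$

and setting the derivative with respect to $\alpha$ to zero gives $-e^{-\alpha}(1-\mathrm{err}_m) + e^{\alpha}\,\mathrm{err}_m = 0$, i.e., $\alpha_m = \frac{1}{2}\log\frac{1-\mathrm{err}_m}{\mathrm{err}_m}$.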
For Discrete L2-Boost, one optimizes $L(y, g) = \log(1 + e^{-yg})$ with the same $\eta$. In this case, the stageweight does not have a closed form solution and the software solves (2) directly.

Real AdaBoost
For Real AdaBoost and Real L2-Boost, set $\eta(p) = \log\frac{p}{1-p}$, where $p \in [0, 1]$ is a class probability estimate, and use the same weights as in Discrete AdaBoost. If $\alpha_m = 1$ is set for all $m$ and the exponential loss is used, then Real AdaBoost coincides with the algorithm presented in [11]. However, ada has the flexibility of optimizing (1) and (2) to determine the stageweights.

Gentle AdaBoost
For Gentle AdaBoost and Gentle L2-Boost, set $\eta(x) = x$. This algorithm requires fitting a regressor at each iteration and results in the original GentleBoost algorithm whenever $\alpha_m = 1$. As with Real boosting, the algorithm can solve the line search directly.

Connection to bagging
The SGB algorithm has been noted to have a strong connection to bagging [4] and is often referred to as a hybrid bagging and boosting algorithm [13]. The ada package takes this connection one step further by allowing the stageweights to be adjusted towards bagging.
A close inspection reveals that if one executes Algorithm 2 with $\nu := 0$, then the ensemble of trees is generated from random subsamples of the data with identical case weights (i.e., letting $F_\nu(x)$ denote the ada output, the trees entering $F_0(x)$ are fit to random subsamples with equal case weights). This is the exact tree fitting process used to bag trees [4], with the exception that in bagging the ensemble results as an average of the trees (i.e., $B(x) = \frac{1}{M}\sum_{m=1}^{M} h_m(x)$ is a bagged ensemble, but under the same random settings one would obtain the same trees as with $F_0(x)$). In light of this, ada provides a bag.shift=TRUE argument, which applies a post-processing shift of the ensemble towards bagging after the ensemble has been constructed; in the limiting case, $F_0(x) = B(x)$ equates to bagging. However, unlike bagging (with the exception of $\nu = 0$), the individual $h_m$'s are obtained via a greedy weighting process. In the case of $\epsilon$-boosting (i.e., $\alpha_m := 1$), the resulting ensemble is an average over the $h_m$'s.

Implementation issues
In this section we discuss implementation issues for the ada package.
Figure 1: The functional flow of ada. The top section consists of the functions used to create the ada object, while the bottom section consists of the functions invoked on an initialized ada object. The yellow functions are called only by ada, the red functions are called by the user, and the green functions are called by both.

Functional structure
The functional layout of the ada package is shown in Figure 1. An object of class ada is created by invoking the function ada, which calls either ada.formula or ada.default, depending on the input to ada. The ada.formula function is a generic formula wrapper for ada.default that allows the flexibility of a formula object as input. The ada.default function, in turn, calls the four functions that correspond to the boosting algorithms: discrete.ada, real.ada, logit.ada, and gentle.ada. Once an object of class ada is created, the user can invoke the standard R commands predict, summary, print, pairs, update, or plot. In addition, ada includes the varplot and addtest functions, which we discuss below.

Construction of base learners using rpart
The most popular base (weak) learners employed by both boosting and stochastic boosting algorithms are classification or regression trees. Both types of trees are implemented in the rpart package and can be tweaked for either boosting or stochastic gradient boosting. Because the ada package uses rpart as its engine, ada inherits the flexibility and advantages of rpart: it can handle missing data, implement either classification or regression trees, and use cross-validation to automatically determine individual tree depth. In addition, because ada uses rpart, ada will inherit any improvements made to rpart.
Remark: Notice that the gbm package, for instance, uses its own internal recursive partitioner that computes only regression trees and provides no means for estimating tree depth. We feel that the flexibility and enhancements provided by rpart make it a worthwhile tree engine for ada.
The rpart function selects tree depth by using an internal complexity measure together with cross-validation. In the case of SGB, we have empirically observed strong performance with this automatic choice of tree depth (especially for larger data sets).
However, in deterministic gradient boosting, the tree size is usually selected a priori. For example, stumps (2-split trees) or 4-split trees (i.e., trees that split the data into a maximum of 4 groups) are commonly selected as the weak learners. In rpart, one should set cp = -1, which forces the tree to split until it achieves the maxdepth setting. Thus, by specifying the maxdepth argument, the number of splits can be controlled. It is important to note that the maxdepth argument works on a $\log_2$ scale; hence the number of resulting groups is a power of 2. In small data sets, it is useful to specify the minsplit argument appropriately, in order to ensure that at least one split will be obtained. Finally, it is worth noting that a theoretically rigorous approach for setting the tree depth in any form of boosting is still an open problem [22].
The following code illustrates how to set the control parameters for generating stumps and 4-split trees:
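A sketch of such settings using rpart.control's standard arguments (the specific values shown are illustrative):

> library(rpart)
> # stumps: a single split, forced by cp = -1 (maxdepth is on a log2 scale)
> stump.ctrl <- rpart.control(maxdepth = 1, cp = -1, minsplit = 0)
> # 4-split trees: at most four terminal groups
> four.ctrl <- rpart.control(maxdepth = 2, cp = -1, minsplit = 0)

These control objects can then be passed to ada through its control argument.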

Description of the functions available in the 'ada' package
To illustrate the functions available in ada, we use a ten-dimensional synthetic data set, comprised of two interspersed classes exhibiting a sinusoidal pattern (Figure 2). In this data set, the two variables that determine the class shapes were corrupted with eight dimensions of standard Gaussian noise, creating a difficult-to-trace boundary between the classes. Five hundred observations, equally divided among the two classes, were generated, and 20% were assigned to the training set. The data can be generated with R code along the following lines.
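The original generating code is not reproduced here; the following is a sketch under stated assumptions (the exact sinusoid, noise scale, and seed are illustrative, not the authors' original code):

> set.seed(100)
> n <- 500; p <- 10
> y <- rep(c(-1, 1), each = n/2)                 # two balanced classes
> x1 <- runif(n, 0, 4 * pi)
> x2 <- sin(x1) + 0.5 * y + rnorm(n, sd = 0.3)   # interspersed sinusoids
> noise <- matrix(rnorm(n * (p - 2)), nrow = n)  # eight noise dimensions
> dat <- data.frame(y = factor(y), x1 = x1, x2 = x2, noise)
> ind <- sample(1:n, size = 0.2 * n)             # 20% to the training set
> train <- dat[ind, ]
> test <- dat[-ind, ]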

Gentle AdaBoost
The following call provides a Gentle AdaBoost ensemble with 100 iterations, a tree depth of 8, $\nu = 0.1$ (regularization), and an ensemble shifted towards bagging via the bag.shift=TRUE argument (Section 2.3).
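A sketch of this call under the settings above (the object name ggen, reused in later examples, and the control settings are assumptions consistent with the text):

> ggen <- ada(y ~ ., data = train, iter = 100, nu = 0.1,
+             type = "gentle", bag.shift = TRUE,
+             control = rpart.control(maxdepth = 8, cp = -1))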

L2 Boost
To call L2-Boost (boosting with the logistic loss [12]) with gentle boost, invoke ada in the following way.
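A sketch of the call (the object name glog and the iteration count of 20 match later references to the glog ensemble; the remaining settings are assumptions):

> glog <- ada(y ~ ., data = train, iter = 20, loss = "logistic", type = "gentle")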
In addition to gentle boost, the logistic loss function can be invoked with type="discrete" or type="real".

Stageweight convergence
For many of these methods, a Newton step is necessary for the convergence of the stageweight at each iteration. For this example, the stageweights converged quickly. In some situations the convergence may not be as fast; to increase the number of iterations allowed for convergence, use the max.iter argument.

Using an 'ada' object
Upon building an ada object, one can explore the model's performance and characteristics through several different tools. The following call shows how to create a plot using Discrete AdaBoost on the training data (Figure 3 (right)).
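The Discrete AdaBoost object gdis used below is assumed to have been fit beforehand; a sketch with illustrative settings:

> gdis <- ada(y ~ ., data = train, iter = 100, type = "discrete")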

> plot(gdis)
Notice that the training error steadily decreases across iterations, which shows that boosting can effectively learn the features in a data set. However, a more valuable diagnostic plot would involve the corresponding test set error. To add the testing data set to the model, simply use the addtest function, which allows us to evaluate the testing set without refitting the model:

> gdis <- addtest(gdis, test[,-1], test[,1])

The following call then creates both the training and testing error plots side-by-side, as seen in Figure 3 (left).

> plot(gdis,FALSE,TRUE)
Remark: In many situations the class priors (proportions of true responses) are unbalanced; in these situations, learning algorithms often focus on learning the larger class. Hence, errors tend to appear low, but only because the algorithm classifies objects into the larger class. An alternative measure to absolute classification error is the Kappa statistic [6], which adjusts for class imbalance.
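For reference (a standard definition, not specific to ada), the Kappa statistic is

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

where $p_o$ is the observed classification accuracy and $p_e$ is the accuracy expected by chance given the marginal class frequencies.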

The summary function
Another utility often used with R objects is summary. For an ada object, the summary function returns the kappa value and training accuracy at the final iteration of the ensemble. If testing data sets have been included, then their accuracy and kappa values are also reported. The argument n.iter allows the user to specify which iteration to report (the default is the model's iter specification). Calling summary on the Gentle AdaBoost fit at the OOB error minimum reports:

Accuracy: 0.757 Kappa: 0.517
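A sketch of such a call (the iteration number shown is illustrative; n.iter would be set to the iteration at which the out-of-bag error is minimized):

> summary(ggen, n.iter = 40)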

The predict function
The predict function is written to match the arguments of the predict.rpart generic function. Hence, the arguments consist of object, newdata, and type, which determine the prediction returned. By default, the function returns the predicted vector of class labels for the training data. The options type='prob', type='both' and type='F' give the class probability estimates, both the class probability estimates and the vector of labels, or the weighted sum over the ensemble, respectively. The following code shows how to predict with Real AdaBoost.
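The Real AdaBoost object greal is assumed to have been fit beforehand, e.g. (illustrative settings):

> greal <- ada(y ~ ., data = train, iter = 100, type = "real")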
> pred <- predict(greal, train[,-1])

The first column provides the number of the actual observation in the original data set.

Remark:
The probability class estimate for any boosting algorithm is defined as $\frac{e^{2F(x)}}{1+e^{2F(x)}}$. However, since $e^x$ is treated as infinite by R for large $x$, it is necessary to compute this value on the logarithmic scale and force it to 1 whenever $e^{2F(x)} = \infty$. As a result, one cannot recover the original $F$ from this transformation, which is needed for the multi-class case. This is the rationale for providing the option type="F". The usage of this argument is shown in the multi-class example in the next section.
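As an aside, the transformation can be computed in a numerically stable way in R via plogis (a sketch; gdis and train are the objects from above):

> Fhat <- predict(gdis, train[,-1], type = "F")
> p <- plogis(2 * Fhat)   # equals exp(2F)/(1 + exp(2F)) without overflow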

Remark:
The newdata option requires a data.frame of observations with exactly the same variable names as the training data. If a matrix format is used to represent the training data, then the variable names will most likely be the defaults $V_1, \ldots, V_p$. The following code shows how to change the names of the columns when the training data are in a matrix or a data.frame format, respectively.

> # suppose train is a matrix and the variable names get the default
> test <- as.data.frame(test)
> names(test) <- c("y", paste("V", 1:p, sep=""))
> # train is the training data as a data.frame
> names(test) <- names(train)

The update function
Invoke the following command to add more trees to the glog ensemble, which was constructed above with 20 trees.

> # n.iter additional iterations are added; 50 here is illustrative
> glog <- update(glog, train[,-1], train[,1], n.iter = 50)

The varplot function

Gentle AdaBoost will be used to illustrate the variable importance function. The following code shows how to plot the variables, ordered by the importance score defined in [14].

> varplot(ggen)
Figure 4 shows the variable importance scores for Gentle AdaBoost.
To obtain the variable scores directly (without a plot) use the following code.
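A sketch of such a call, assuming varplot's plot.it and type arguments:

> vip <- varplot(ggen, plot.it = FALSE, type = "scores")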

The pairs function
The pairs tool produces a visualization of the pairwise relationships between a subset of variables in the data set. The upper panel plots represent the true class labels as colors for each pairwise relationship, while the lower panel gives the predicted class for each observation. The observations in the lower panel plots are also scaled by the class probability estimate, with the size of each point representing the probability estimate. Hence this plot can help identify observations that are difficult for boosting to classify.
To generate the pairwise plots for the three top variables determined by the varplot function, issue the command:

> pairs(gdis, train[,-1], maxvar = 3)

Examples
The examples below will illustrate various aspects and uses of the ada package.

Testing Results

Accuracy: 0.889 Kappa: 0.777

Notice that the testing error is 11.1%, which agrees with the results found in [14].
Remark: The variables in this example are all equally important by construction; therefore, the diagnostics for variable selection and pairwise plots are not shown.

Solubility data
The ada package is used to analyze a data set that contains information about compounds used in drug discovery. Specifically, this data set consists of 5631 compounds on which an in-house solubility screen (the ability of a compound to dissolve in a water/solvent mixture) was performed. Based on this screen, compounds were categorized as either insoluble (n=3493) or soluble (n=2138). Then, for each compound, 72 continuous, noisy structural descriptors were computed. One of these descriptors contained missing values for approximately 14% (n=787) of the observations. The objective of the analysis is to model the relationship between the structural descriptors and the solubility class.
For modeling purposes, the original data set was randomly partitioned into training (50%), test (30%), and validation (20%) sets. The data set will be called soldat, and the compound labels and variable names have been blinded for this illustration. Gentle AdaBoost with default settings was used on the training set. This data set contains a descriptor with missing values; recall that the default setting is na.action=na.rpart, which allows rpart to search all descriptors, including those with missing values (using surrogate splits [3]), to find the best descriptor for splitting purposes. Testing accuracy rates are printed in the order they are entered, so the accuracy on the testing set is 0.765 and on the validation set 0.781.
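A sketch of the fit and of attaching the two hold-out sets (the object name gen1 matches later calls; the index vectors trind, teind, and vaind for the 50/30/20 partition, and the response y in the first column of soldat, are assumptions):

> gen1 <- ada(y ~ ., data = soldat[trind, ], type = "gentle")
> gen1 <- addtest(gen1, soldat[teind, -1], soldat[teind, 1])   # test set
> gen1 <- addtest(gen1, soldat[vaind, -1], soldat[vaind, 1])   # validation set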
For this type of early drug discovery data, the Gentle AdaBoost algorithm performs adequately, with a test set accuracy of 76.5% (kappa ≈ 0.5). Figure 8 also illustrates the model's performance on the training and test sets across iterations.

> plot(gen1,TRUE,TRUE)
In order to enhance our understanding of the relationship between the descriptors and the response, the varplot function was employed. It can be seen from Figure 9 that variable 5 is the most important one.

> varplot(gen1)
The features are quite noisy and the variables are correlated, which makes variable importance difficult to ascertain from this single run. Next, we provide a small run to obtain the variable importance averaged over 20 fits.
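A sketch of such a run (trind is the partition index assumed above; repeated fits differ because of the stochastic bag.frac default, and varplot is assumed to return a named score vector with type = "scores"):

> vip <- rep(0, ncol(soldat) - 1)
> names(vip) <- names(soldat)[-1]        # assumes y is in column one
> for (i in 1:20) {
+   gi <- ada(y ~ ., data = soldat[trind, ], type = "gentle")
+   si <- varplot(gi, plot.it = FALSE, type = "scores")
+   vip[names(si)] <- vip[names(si)] + si
+ }
> vip <- sort(vip / 20, decreasing = TRUE)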
According to the average variable importance, eight of the top eleven variables measure various aspects of a compound's hydrophilicity and hydrophobicity, which makes scientific sense in this context. These various hydrophilicity and hydrophobicity measures account for a molecule's tendency to dissolve in water, a key determinant of a molecule's solubility.
Remark: In practice, we have found the averaged variable importance informative and well worth the time it takes to compute. Also, given the size of this data set, it may be convenient to invoke R with --max-mem-size=2Gb.

Stochastic boosting in a multi-class context
Currently, several boosting algorithms cannot directly handle a $K$-class response, and this topic constitutes an active area of research. However, several methods have been proposed in the literature to address this issue. One popular strategy is the one-versus-all technique, where each individual class (typically coded as one) is modeled against all the remaining classes (coded as zero), and $K$ different ensembles are constructed. The values $F_k(x) = \sum_{m=1}^{M} \alpha_m f_m(x)$, $k = 1, \ldots, K$, returned by the ensembles are compared, and the class corresponding to the maximum $F_k$ is assigned as the label [11].
To illustrate the use of ada for multi-class data sets, a ten-dimensional feature vector $(X_1, \ldots, X_{10})$ was simulated from a standard normal distribution, and a three-class response was constructed from it, producing a three-class analog to the example given above. Training and testing data sets comprised of 200 and 1000 observations, respectively, were generated, e.g.:

> test <- data.frame(y = y[indtest], x[indtest,])

A 250-iteration stochastic version of the Discrete AdaBoost ensemble was then constructed for each class using default trees as the base learner, following the one-versus-all strategy. The in-class test error rate and total test error rate are obtained using the sapply and table commands in R; for more information on these commands, refer to [1]. Notice that although the test error rate appears high, it is still substantially lower than random guessing (0.66). Also, the in-class error rate for class two appears rather high.
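A sketch of the one-versus-all implementation under stated assumptions (train and test hold the generated data with the response y, coded 1, ..., K, in the first column):

> K <- 3
> fits <- lapply(1:K, function(k) {
+   yk <- ifelse(train$y == k, 1, 0)               # class k versus the rest
+   ada(yk ~ ., data = data.frame(yk, train[,-1]), type = "discrete", iter = 250)
+ })
> Fk <- sapply(fits, function(f) predict(f, test[,-1], type = "F"))
> pred <- apply(Fk, 1, which.max)                  # label of the largest F_k
> 1 - mean(pred == test$y)                         # total test error rate
> sapply(1:K, function(k) mean(pred[test$y == k] != k))   # in-class error rates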
In order to assess the magnitude of the error rate, a random forest classifier [5] was considered. The choice of a random forest as the benchmark ensemble method is due to its ability to handle multi-class problems directly and its overall competitive performance. The following code, using the R package randomForest, constructs such an ensemble for the data at hand. It can be seen that the overall test error rate (0.415) is about the same as that produced by Discrete AdaBoost, which strongly indicates that the combination of ada and the one-versus-all strategy can readily handle multi-class classification problems.
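A sketch of the random forest fit (assuming the same train and test objects, with the response treated as a factor):

> library(randomForest)
> rf <- randomForest(factor(y) ~ ., data = train)
> pred.rf <- predict(rf, test[,-1])
> 1 - mean(pred.rf == factor(test$y))   # overall test error rate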

Summary and concluding remarks
In this paper, the R package ada, which implements several boosting algorithms, was described. Its key features are its functional modularity, the adjustment of class priors for tree classifiers, and the incorporation of several plots useful in practice, such as the pairs plot and the variable importance plot.
Overall, boosting has had a significant impact on both theoretical and applied research on classification problems, as can be seen from the size of the existing literature on the topic. However, the choice of base classifiers plays a crucial role in the performance of the ensemble. Typically, for 4- or 8-split trees, the training error rate is driven to zero after a large number of iterations, while the test error decreases up to a certain level and then oscillates around it. In that respect, the performance seen in our first data example is somewhat atypical, since the training error rate had not reached zero even after 400 iterations; this is most likely due to the use of stumps as the base classifier.

Figure 2: Plot of the two informative variables, colored by the response.

Figure 3: (right) Training error by iteration number for the example data. (left) The training and testing error by iteration number for the example data.

Figure 4: Variable importance scores produced by the varplot command.

Figure 5: Pairs plot of the first three descriptors.

Figure 6: Pairs plot of the two informative variables.

Figure 7: Training and testing error and kappa accuracy by iteration.

Figure 8: Training, testing, and kappa error values by iteration for the Solubility Data.

Figure 9: Variable importance scores for the Solubility Data.