Classification Trees for Ordinal Responses in R: The rpartScore Package

This paper introduces rpartScore (Galimberti, Soffritti, and Di Maso 2012), a new R package for building classification trees for ordinal responses, which can be employed whenever a set of scores is assigned to the ordered categories of the response. This package has been created to overcome some problems that produced unexpected results from the package rpartOrdinal (Archer 2010). Explanations for the causes of these unexpected results are provided. The main functionalities of rpartScore are described, and its use is illustrated through some examples.


Introduction
rpartOrdinal is a package implemented by Archer (2010) in R (R Development Core Team 2012) that contains some splitting functions for building a classification tree for predicting an ordinal response. These splitting functions are meant to be used in conjunction with the R package rpart (Therneau and Atkinson 2011), which was originally introduced as an implementation of the classification and regression trees (CART) methodology (Breiman, Friedman, Olshen, and Stone 1984). rpartOrdinal provides three splitting functions, based on an ordinal impurity function (Piccarreta 2008), the generalized Gini impurity function (Breiman et al. 1984), and the ordered twoing method (Breiman et al. 1984).
While using this package on some real data, we obtained some unexpected results. A careful inspection of the R functions contained in rpartOrdinal led us to discover the causes of these problems and to implement rpartScore, a new R package that overcomes them. In this paper, we provide theoretical and empirical explanations for the causes of the unexpected results obtained with rpartOrdinal (Section 2). We then describe the main functionalities of rpartScore (Section 3). Finally, some directions for future research are highlighted in Section 4. All the examples shown in this paper were performed using R release 2.14.2.

Unexpected results from rpartOrdinal
rpartOrdinal may produce unexpected results for several reasons. Some of these are related to the theoretical properties of the generalized Gini impurity function, and others to the way this function has been implemented in rpartOrdinal. The first aspect is detailed in Section 2.1, while Sections 2.2 and 2.3 address the second. Another issue concerns the measure of predictive performance used in the pruning procedure (see Section 2.4).

Some properties of the generalized Gini impurity function
Using a notation similar to the one in Archer (2010), consider a data set $\{(y_i, x_{i1}, \ldots, x_{ip});\ i = 1, \ldots, n\}$ that contains the values of a response $Y$ with $J$ ordered categories $\{\omega_1 < \omega_2 < \ldots < \omega_J\}$ and of $p$ predictors $X_1, \ldots, X_p$ observed on $n$ sample units. A classification tree can be derived by recursively partitioning the data set using splitting rules, which are logical rules defined according to the values of the predictors. Most statistical packages that can be used to implement classification trees rely on splitting rules obtained using one predictor at a time. According to the CART methodology, the optimal binary splitting rule for a given node $t$ in a tree $T$ is the one that results in the largest decrease in node impurity, as measured using a given impurity function. Let $p(\omega_j|t)$ be the proportion of units in node $t$ belonging to the $j$-th category of $Y$, for $j = 1, \ldots, J$. The generalized Gini impurity function (Breiman et al. 1984) for node $t$ is given by

$$i_{GG}(t) = \sum_{k=1}^{J} \sum_{l=1}^{J} C(\omega_k|\omega_l)\, p(\omega_k|t)\, p(\omega_l|t), \quad (1)$$

where $C(\omega_k|\omega_l)$ represents the misclassification cost of assigning category $\omega_k$ to a sample unit belonging to category $\omega_l$. Clearly, $C(\omega_j|\omega_j) = 0$ for $j = 1, \ldots, J$. When $C(\omega_k|\omega_l) = C(\omega_l|\omega_k)$ for all $k \neq l$, these misclassification costs are symmetric and can also be interpreted as dissimilarities between pairs of categories of $Y$. The nominal Gini impurity function is obtained as a special case of Equation 1 by choosing $C(\omega_k|\omega_l) = 1$ for all $k \neq l$.
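As a concrete illustration, Equation 1 is straightforward to evaluate in R. The following is a minimal sketch (not part of either package; all object names are illustrative), where costs holds the matrix of misclassification costs $C(\omega_k|\omega_l)$ and p the within-node proportions $p(\omega_j|t)$:

R> i.GG <- function(p, costs) sum(costs * outer(p, p))
R> p <- c(0.1, 0.4, 0.3, 0.2)      # illustrative proportions for J = 4
R> costs.nominal <- 1 - diag(4)    # C = 1 for all k != l: nominal Gini
R> i.GG(p, costs.nominal)          # equals 1 - sum(p^2)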
For any binary split $s$ of node $t$, units assigned to node $t$ are partitioned into two child nodes, $t_R$ and $t_L$. According to Breiman et al. (1984), the decrease in node impurity induced by the binary split $s$ of node $t$ is given by

$$\Delta Imp(t, s) = p(t)\, i_{GG}(t) - p(t_R)\, i_{GG}(t_R) - p(t_L)\, i_{GG}(t_L), \quad (2)$$

where $p(t)$, $p(t_R)$ and $p(t_L)$ are the proportions of units assigned to nodes $t$, $t_R$ and $t_L$, respectively. At any step in the recursive partitioning procedure, an exhaustive search among all the admissible binary splits for a given node is performed, and the split showing the largest value of Equation 2 is selected.
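Continuing the sketch above (and reusing i.GG), Equation 2 can be computed for a toy split; the proportions below are illustrative and chosen so that the parent distribution is the mixture of the two child distributions:

R> costs <- abs(outer(1:3, 1:3, "-"))   # linear costs for J = 3
R> p.t <- c(0.3, 0.4, 0.3)              # parent node, here the root: p(t) = 1
R> p.tl <- c(0.5, 0.5, 0.0)             # left child, p(t_L) = 0.6
R> p.tr <- c(0.0, 0.25, 0.75)           # right child, p(t_R) = 0.4
R> 1 * i.GG(p.t, costs) - 0.6 * i.GG(p.tl, costs) - 0.4 * i.GG(p.tr, costs)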
Suppose that a set of increasing scores $\{s_1 < s_2 < \ldots < s_J\}$ is assigned to the ordered categories of the response $Y$. Variable misclassification costs can be defined by considering suitable transformations of the absolute differences between pairs of scores. For example, if one chooses $C(\omega_k|\omega_l) = |s_k - s_l|$, Equation 1 becomes

$$i_{GG1}(t) = \sum_{k=1}^{J} \sum_{l=1}^{J} |s_k - s_l|\, p(\omega_k|t)\, p(\omega_l|t). \quad (3)$$

Piccarreta (2001) showed that $i_{GG1}(t)$ can also be computed as follows:

$$i_{GG1}(t) = 2 \sum_{j=1}^{J-1} (s_{j+1} - s_j)\, F(\omega_j|t)\, (1 - F(\omega_j|t)), \quad (4)$$

where $F(\omega_j|t) = \sum_{h=1}^{j} p(\omega_h|t)$, for $j = 1, \ldots, J$, are the cumulative proportions of $Y$ in node $t$. By choosing any set of scores such that $(s_{j+1} - s_j) = 1$ for $j = 1, \ldots, J-1$, Equation 4 simplifies to

$$i_{GG1}(t) = 2 \sum_{j=1}^{J-1} F(\omega_j|t)\, (1 - F(\omega_j|t)), \quad (5)$$

which is proportional to the ordinal impurity function $i_{OS}(t) = \sum_{j=1}^{J-1} F(\omega_j|t)(1 - F(\omega_j|t))$ proposed by Piccarreta (2008).
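The equality between $i_{GG1}(t)$ in Equation 5 and twice the ordinal impurity function can be verified numerically; the following self-contained sketch uses illustrative proportions:

R> p <- c(0.1, 0.4, 0.3, 0.2)                        # within-node proportions, J = 4
R> s <- 1:4                                          # unit-spaced scores
R> i.GG1 <- sum(abs(outer(s, s, "-")) * outer(p, p))
R> Fj <- cumsum(p)[1:3]                              # F(omega_j | t), j = 1, ..., J - 1
R> i.OS <- sum(Fj * (1 - Fj))
R> all.equal(i.GG1, 2 * i.OS)                        # TRUE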
Analogously, choosing quadratic costs $C(\omega_k|\omega_l) = (s_k - s_l)^2$ in Equation 1 leads to

$$i_{GG2}(t) = \sum_{k=1}^{J} \sum_{l=1}^{J} (s_k - s_l)^2\, p(\omega_k|t)\, p(\omega_l|t). \quad (6)$$

By adding and subtracting within each term $(s_k - s_l)$ the average score within node $t$, $\bar s(t) = \sum_{j=1}^{J} s_j\, p(\omega_j|t)$, it is easy to prove that

$$i_{GG2}(t) = 2 \sum_{j=1}^{J} (s_j - \bar s(t))^2\, p(\omega_j|t) = \frac{2}{n(t)}\, SS(t), \quad (7)$$

that is, $i_{GG2}(t)$ is proportional to $SS(t)$, the deviance of the scores $s_1, \ldots, s_J$ within node $t$ (here $n(t)$ denotes the number of units in node $t$). The resulting splitting function is then equal to

$$\Delta Imp_{GG2}(t, s) = \frac{2}{n} \left[ SS(t) - SS(t_R) - SS(t_L) \right]. \quad (8)$$

Thus, from a computational point of view, $\Delta Imp_{GG2}(t, s)$ corresponds (up to a multiplicative factor) to the ANOVA splitting function for a regression tree built using the scores $s_1, \ldots, s_J$ as numerical values of the ordered categories of the response $Y$. This equivalence also makes it possible to simplify the search for the best split when the admissible binary splits are defined according to a predictor $X$ with $M$ unordered categories. The split that maximizes Equation 8 can be found by treating $X$ as an ordinal categorical predictor, after ordering its categories according to the $M$ conditional mean scores of $Y$ given $X$ (see Breiman et al. 1984, for a detailed proof of this result). This makes it possible to examine only $M - 1$ of the $2^{M-1} - 1$ splits that can be defined according to all of the partitions of the $M$ categories into two nonempty classes. Note that this computational simplification does not hold when the decrease in impurity is evaluated using the impurity function in Equation 5 (see Section 2.3 for an example). Archer (2010) proposes that the generalized Gini impurity function be implemented by specifying a matrix of variable misclassification costs as a loss parameter in the optional parms argument within the rpart call. The method argument of the loss.matrix function in rpartOrdinal can be set to either "linear" or "quadratic" to create linear or quadratic costs, respectively. According to Archer (2010), classification trees based on $i_{GG1}(t)$ and $i_{GG2}(t)$ should be derived in this way.
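Analogously to the check above, the identity in Equation 7 linking $i_{GG2}(t)$ to the within-node deviance of the scores can be verified numerically with illustrative values:

R> p <- c(0.1, 0.4, 0.3, 0.2)                        # within-node proportions
R> s <- 1:4                                          # scores
R> i.GG2 <- sum(outer(s, s, "-")^2 * outer(p, p))
R> s.bar <- sum(s * p)                               # average score within the node
R> all.equal(i.GG2, 2 * sum((s - s.bar)^2 * p))      # TRUE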

Use of variable misclassification costs in rpartOrdinal
The following example shows some of the unexpected results produced by rpartOrdinal. Consider the lowbwt data set, contained in rpartOrdinal and used in some illustrative applications presented by Archer (2010). The ordinal response Category is obtained by binning the variable bwt according to the following cutoffs: 2500, 3000, and 3500. This data set also contains several covariates that can be used to predict this ordinal response (see Archer 2010, for further details).
The linear costs corresponding to Category are obtained as follows:

R> library("rpartOrdinal")
R> data("lowbwt")
R> lowbwt$Category <- factor(ifelse(lowbwt$bwt <= 2500, 3,
+   ifelse(lowbwt$bwt <= 3000, 2, ifelse(lowbwt$bwt <= 3500, 1, 0))),
+   ordered = TRUE)
R> linear.loss <- loss.matrix(method = "linear", lowbwt$Category)
R> linear.loss

It is evident that these costs can be obtained by choosing any set of scores such that $(s_{j+1} - s_j) = 1$ for $j = 1, \ldots, J-1$ and by setting $C(\omega_k|\omega_l) = |s_k - s_l|$. Thus, according to the properties of the generalized Gini impurity function described in Section 2.1, the use of linear misclassification costs should lead to the same optimal splits selected by the ordinal impurity function $i_{OS}(t)$, because the two corresponding splitting functions are equal (up to a constant multiplicative factor). However, the corresponding classification trees obtained by analyzing the lowbwt data set using rpartOrdinal are different (see Archer 2010, pp. 7-8). This difference is due to the fact that, contrary to what is stated by Archer (2010, p. 6), the specification of a matrix of variable misclassification costs as a loss parameter in the optional parms argument within the rpart call does not lead to a classification tree based on the generalized Gini impurity function. This happens because the rpart function does not implement this impurity function (Therneau and Atkinson 1997, p. 7), but in fact resorts to the so-called altered prior method when supplied with variable misclassification costs (see Breiman et al. 1984; Therneau and Atkinson 1997, for a detailed description of this method and its rationale). Briefly, this method assumes that the misclassification costs have the following structure: $C(\omega_k|\omega_l) = C(\omega_l)$ for all $k \neq l$, that is, variable misclassification costs that vary only according to the true category of $Y$. The altered prior probabilities for the categories of $Y$ are computed as follows:

$$\tilde\pi(\omega_j) = \frac{\pi(\omega_j)\, C(\omega_j)}{\sum_{k=1}^{J} \pi(\omega_k)\, C(\omega_k)}, \quad (9)$$

where $\pi(\omega_j)$ is the prior probability of observing a unit belonging to category $\omega_j$. By default, in the rpart function, these prior probabilities are set equal to the proportions of units in the data set belonging to the $J$ categories of $Y$: $p(\omega_j)$ for $j = 1, \ldots, J$. Furthermore, when supplied with a matrix of variable misclassification costs, before computing the altered prior probabilities, the rpart function sets $C(\omega_j) = \sum_{k=1}^{J} C(\omega_k|\omega_j)$ for $j = 1, \ldots, J$. The altered prior probabilities are then used to adjust the proportions $p(\omega_j|t)$ according to the following formula:

$$\tilde p(\omega_j|t) = \frac{\tilde\pi(\omega_j)\, p(t|\omega_j)}{\sum_{k=1}^{J} \tilde\pi(\omega_k)\, p(t|\omega_k)}, \quad (10)$$

where $p(t|\omega_j)$ is the proportion of units belonging to the $j$-th category of $Y$ which are assigned to node $t$.
The altered proportions $\tilde p(\omega_j|t)$ are then used to compute node impurity according to the nominal Gini impurity function

$$i_G(t) = 1 - \sum_{j=1}^{J} \tilde p(\omega_j|t)^2, \quad (11)$$

which is the impurity measure used by default in the rpart function for building classification trees when $Y$ is an unordered response. Thus, the resulting classification trees do not take into account the ordered nature of the response.
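A minimal sketch of Equations 9 to 11 (all function and argument names are illustrative, not part of rpart):

R> altered.prior <- function(prior, Cj) (prior * Cj) / sum(prior * Cj)   # Equation 9
R> altered.prop <- function(alt.prior, p.t.given.j) {                    # Equation 10
+   num <- alt.prior * p.t.given.j
+   num / sum(num)
+ }
R> i.G <- function(p) 1 - sum(p^2)                                       # Equation 11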
In the lowbwt data set example, $p(\omega_j)$, $C(\omega_j)$ and $\tilde\pi(\omega_j)$, for $j = 1, \ldots, 4$, can be computed as follows (the object alt.pj.lin holds the altered priors of Equation 9):

R> pj <- table(lowbwt$Category) / nrow(lowbwt)
R> Cj.lin <- apply(linear.loss, 1, sum)
R> alt.pj.lin <- (pj * Cj.lin) / sum(pj * Cj.lin)

It is easy to see that using linear misclassification costs leads to increased prior probabilities for the extreme categories of the response and decreased prior probabilities for the intermediate categories. Thus, when supplied with a linear misclassification cost matrix, the rpart function not only employs an impurity function that ignores the ordinal nature of the response but also simplifies the misclassification cost structure by giving more weight to misclassification errors for extreme categories of $Y$.
To provide empirical evidence supporting the conclusion that specifying a matrix of linear misclassification costs as a loss parameter in the optional parms argument within the rpart function call does not lead to a classification tree based on $i_{GG1}(t)$, three trees are built on the lowbwt data set, using the same ordinal response and predictors as in Archer (2010): ordinal.rpart, obtained using the ordinal impurity criterion implemented in rpartOrdinal; linear.loss.rpart, obtained by supplying the linear misclassification cost matrix linear.loss as a loss parameter in the optional parms argument within the rpart call (as suggested by Archer 2010, p. 6); and altered.priors.lin.rpart, obtained by supplying the altered prior probabilities alt.pj.lin (computed according to Equation 9) as a prior parameter in the optional parms argument within the rpart call.
The summary of each of these three R objects contains information about the selection of the optimal split for node 1 (illustrated below) as well as other information on the corresponding tree. For the best five candidate splits of node 1 (each defined by a different predictor), the summaries report the values of improve, which are computed as the decrease in node impurity for ordinal.rpart and as the decrease in node impurity multiplied by n for linear.loss.rpart and altered.priors.lin.rpart. A comparison of these values reveals that linear.loss.rpart does not coincide with ordinal.rpart. The values of improve in the summaries lead to a different ranking of the candidate splits; hence, these values are not proportional (as they should be, according to the theoretical properties described in Section 2.1). Furthermore, as expected, linear.loss.rpart gives the same values for improve as does altered.priors.lin.rpart.
R> ordinal.rpart <- rpart(Category ~ age + lwt + race + smoke + ptl + ht +
+   ui + ftv, data = lowbwt, method = ordinal)
R> summary(ordinal.rpart)

Using a similar comparison, it is also possible to show that the use of a quadratic misclassification cost matrix (as proposed by Archer 2010) leads to decreases in node impurity that are not proportional to those obtained by fitting a regression tree to the scores of $Y$ (as they should be, according to the properties described in Section 2.1), but are instead equal to the decreases obtained by a classification tree built using the nominal Gini impurity function and the following altered priors:

R> quadratic.loss <- loss.matrix(method = "quadratic", lowbwt$Category)
R> Cj.quad <- apply(quadratic.loss, 1, sum)
R> Cj.quad

Note that these latter comparisons are meaningful only with respect to the decreases in node impurity: predicted values from the regression tree fitted to the scores of $Y$ should not be used, because they are equal to the average score within each terminal node and hence are meaningless for predicting a categorical response.

Use of categorical predictors in rpartOrdinal
Other unexpected results may arise when using rpartOrdinal with unordered categorical predictors. As stated in Section 2.1, when a predictor with $M$ unordered categories is considered, $2^{M-1} - 1$ admissible splits can be defined, according to all of the partitions of the $M$ categories into two nonempty classes. However, if the impurity function $i_{GG2}(t)$ is used to compute the decrease in node impurity, only $M - 1$ splits need to be examined to find the best one. These $M - 1$ splits can be identified by treating the predictor as an ordinal one, after ordering its categories according to the $M$ conditional mean scores of $Y$ given the predictor.
The following instructions make it possible to compute the decrease in node impurity using $i_{OS}(t)$ for all the $2^3 - 1 = 7$ admissible splits:

R> no.classes <- length(y.labs)
R> node <- xtabs(wt ~ y) / sum(wt)
R> cum.node <- cumsum(node)
R> root <- 0
R> for (j in 1:(no.classes - 1)) root <- root +
+   cum.node[j] * (1 - cum.node[j])
R> num.cat <- length(x.labs)
R> splits <- expand.grid(rep(list(c(0, 1)), num.cat))
R> splits <- splits[1:(nrow(splits) / 2), ]
R> splits <- splits[-1, ]
R> colnames(splits) <- x.labs
R> rownames(splits) <- 1:nrow(splits)
R> node.l <- function(a) apply(as.

Note that the rows of splits make it possible to identify the categories of X belonging to one of the two sets that define each binary partition of the M categories of X. Without loss of generality, for any admissible split, it is assumed that each row of splits defines the categories of X associated with the left child node. Furthermore, for any admissible split in splits, there are corresponding columns in the matrices left.1 and right.1 that contain the conditional frequency distributions of Y in the left and right child nodes, respectively. According to $i_{OS}(t)$, the maximum decrease in node impurity (0.0800) is achieved by the binary partition {(A, D), (B, C)} of the categories of X. Note that this partition cannot be examined if the predictor is treated as an ordinal one, after ordering its categories according to the conditional mean scores of Y.
To obtain classification trees based on the ordinal impurity function $i_{OS}(t)$, Archer (2010) defines the list ordinal, to be passed to the method argument of the rpart function. This list consists of three functions named eval, split and init. In particular, the split function makes it possible to compute, for any given node $t$, the decreases in node impurity induced by the binary splits defined according to any predictor. When this function examines an unordered categorical predictor X with M categories, it returns a list containing two objects: direction and goodness. The direction object is a vector of length M containing the category labels of X, ordered according to some criterion. In rpart, it is assumed that the best split can be found by treating X as an ordinal predictor, where the ordering of its categories is the one reported in direction. The goodness object contains the decreases in impurity for the corresponding M − 1 splits, as illustrated by the sketch below.
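To make the direction/goodness mechanism concrete, here is a minimal, self-contained sketch (not the rpartOrdinal code; weights wt are ignored for simplicity) of a split function for an unordered categorical predictor, written for the ANOVA criterion, for which ordering the categories by the conditional mean response is actually valid (Breiman et al. 1984):

R> toy.split.cat <- function(y, wt, x, parms, continuous) {
+   # x holds the category codes of an unordered categorical predictor
+   ux <- sort(unique(x))
+   ord <- order(tapply(y, x, mean))     # order categories by mean response
+   M <- length(ux)
+   ss <- function(v) sum((v - mean(v))^2)
+   goodness <- numeric(M - 1)
+   for (m in 1:(M - 1)) {               # the M - 1 splits along the ordering
+     in.left <- x %in% ux[ord][1:m]
+     goodness[m] <- ss(y) - ss(y[in.left]) - ss(y[!in.left])
+   }
+   # rpart treats x as ordinal, with the category ordering given by direction
+   list(goodness = goodness, direction = ux[ord])
+ }

As the example in this section shows, an analogous ordering-based search is not valid for $i_{OS}(t)$, for which all $2^{M-1} - 1$ partitions must be examined.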
The following instructions are taken from the ordinal$split function contained in rpartOrdinal and have been applied to the data summarized in cont.table. These instructions make it possible to obtain the information that ordinal$split passes to the rpart function to select the best split:

R> class.labs <- names(table((y)))
R> ux <- sort(unique(x))
R> ord <- order(means)
R> n <- length(ord)
R> y <- y[order(x)]
R> x <- sort(x)
R> ymat <- matrix(rep(0, length(x) * no.classes), ncol = no.classes)
R> for (

According to these results, the binary partition {(A), (B, C, D)} of the categories of X should be used to define the best split, leading to a decrease in node impurity equal to 0.0431. This example shows that the ordinal$split function proposed by Archer (2010) to implement the ordinal impurity function wrongly assumes that the best split can be found after ordering the categories of X by increasing values of the conditional mean scores of Y. Thus, the split selected using ordinal$split may be different from the optimal one (as in the example provided). These latter errors can be explained by examining the left and right objects, whose rows contain the conditional distributions of Y in 13 different partitions of the sample units into two classes that are examined by ordinal$split. The impure object provides the corresponding decreases in impurity. Furthermore, the conditional distributions of Y and the decreases in node impurity corresponding to splits defined according to the ordering given by direction are contained in rows 11, 9 and 5, while the values contained in goodness correspond to rows 3, 7 and 11.

Optimal tree size selection
As many authors have pointed out (see, for example, Breiman et al. 1984; Murthy 1998; Hastie, Tibshirani, and Friedman 2009), one of the main issues in recursive partitioning methods is the selection of the optimal tree size. This issue is particularly crucial when the aim of the analysis is to obtain a prediction rule for future observations. Several techniques have been suggested for obtaining correctly sized trees (see, for example, Niblett and Bratko 1986; Mingers 1987; Quinlan 1993; Cappelli, Mola, and Siciliano 2002; Zhong, Georgiopoulos, and Anagnostopoulos 2008). A comparison among some of these pruning procedures can be found in Esposito, Malerba, and Semeraro (1997). One of the most popular techniques is the pruning procedure proposed by Breiman et al. (1984). This procedure is based on the cost-complexity measure

$$R_\alpha(T) = R(T) + \alpha \cdot card(\widetilde{T}), \quad (12)$$

where $R(T)$ is usually a measure of the predictive performance of the tree $T$, $card(\widetilde{T})$ is the complexity of $T$ (measured by its number of leaves, $\widetilde{T}$ denoting the set of terminal nodes of $T$), and $\alpha$ is a tuning parameter ($\alpha > 0$) that governs the trade-off between predictive performance and tree complexity. The choice of the functional form for $R(T)$ depends on the nature of $Y$.
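In rpart, this pruning procedure is driven by the complexity parameter cp (a normalized counterpart of $\alpha$) stored in the cptable component of a fitted tree. As a hedged sketch, using the car.test.frame data shipped with rpart and illustrative object names, the tree minimizing the cross-validated error, or the one selected by the 1-se rule used in Section 3, could be extracted as follows:

R> library("rpart")
R> fit <- rpart(Mileage ~ Weight + Type, data = car.test.frame)
R> cp.tab <- fit$cptable                     # columns CP, nsplit, rel error, xerror, xstd
R> min.pos <- which.min(cp.tab[, "xerror"])
R> threshold <- cp.tab[min.pos, "xerror"] + cp.tab[min.pos, "xstd"]
R> best.pos <- min(which(cp.tab[, "xerror"] <= threshold))   # 1-se rule
R> fit.pruned <- prune(fit, cp = cp.tab[best.pos, "CP"])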
The method argument of the rpart function allows users to specify not only their own split functions but also their own predictive performance measures. In particular, the function eval is used to compute $\hat s(t)$ and $R(t)$, which are the predicted value of $Y$ and the predictive performance measure within node $t$, respectively. $R(T)$ is then obtained as $\sum_{t \in \widetilde{T}} R(t)$, where $\widetilde{T}$ is the set of the terminal nodes of the tree $T$. The eval function associated with the ordinal method implemented by Archer (2010) is given by

R> ordinal$eval
function (y, wt, parms)
{
    labels <- names(table(y))
    freq <- tapply(wt, y, sum)
    id <- labels[which(freq == max(freq))]
    newid <- ifelse(length(id) > 1, id[sample(1:length(id), size = 1)], id)
    wmean <- sum(y * wt)/sum(wt)
    rss <- sum(wt * (y - wmean)^2)
    list(label = newid, deviance = rss)
}
<environment: namespace:rpartOrdinal>

Note that this function sets $\hat s(t)$ equal to the modal category of $Y$ within node $t$. However, it computes $R(t)$ as the within-node deviance, which is inconsistent with the categorical nature of the response, as it implicitly exploits the average score as the predicted value for $Y$. The same eval function is also associated with the twoing method implemented by Archer (2010). Two more suitable predictive performance measures to be used when $Y$ is an ordinal response are the total number of misclassifications $R_{mr}(T)$ (Breiman et al. 1984) and the total misclassification cost $R_{mc}(T)$ (Piccarreta 2008). When a set of increasing scores $\{s_1 < s_2 < \ldots < s_J\}$ is assigned to the ordered categories of $Y$, these measures can be computed as follows:

$$R_{mr}(T) = \sum_{i=1}^{n} \left( 1 - I_{\{s_i\}}(\hat s_{i,T}) \right), \quad (13)$$

$$R_{mc}(T) = \sum_{i=1}^{n} | s_i - \hat s_{i,T} |, \quad (14)$$

where $s_i$ is the observed score for unit $i$, $\hat s_{i,T}$ is the predicted score for unit $i$ according to the tree $T$, and $I_{\{s_i\}}(\hat s_{i,T}) = 1$ if $s_i = \hat s_{i,T}$ and 0 otherwise. Note that $\hat s_{i,T} = \hat s(t)$ for all units $i$ that are assigned to the terminal node $t$ of $T$, according to the splitting rules in the tree $T$.
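Both measures are immediate to compute from observed and predicted scores; a minimal sketch with illustrative names:

R> R.mr <- function(s, s.hat) sum(s != s.hat)       # Equation 13
R> R.mc <- function(s, s.hat) sum(abs(s - s.hat))   # Equation 14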
The predictive performance measures in Equations 13 and 14 can also be exploited to select $\hat s(t)$, for all $t \in \widetilde{T}$, by minimizing them within each node. In particular, $\hat s(t)$ is given by the within-node modal score when $R_{mr}(T)$ is used and by the within-node median score when $R_{mc}(T)$ is used.
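For instance, for a node whose units have the illustrative scores below, the two choices of $\hat s(t)$ do not coincide:

R> s.node <- c(1, 1, 2, 3, 4, 4, 4)                # scores of the units in a node
R> as.numeric(names(which.max(table(s.node))))     # modal score (minimizes R_mr): 4
R> median(s.node)                                  # median score (minimizes R_mc): 3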

A new package for classification trees with ordinal response
To overcome the problems with rpartOrdinal highlighted in Section 2, we have implemented rpartScore, a new R package that makes it possible to build classification trees for responses with ordered categories. This package can be employed whenever a set of increasing scores $\{s_1 < s_2 < \ldots < s_J\}$ is assigned to the categories of the response. The simplest choice for these scores is $s_j = j$ for $j = 1, \ldots, J$, but the package can be used with any other choice. General guidelines for choosing scores to be assigned to a categorical variable can be found, for example, in Agresti (2002).
This package uses the general CART framework. In particular, it allows the user to grow classification trees using splitting functions based on the impurity functions $i_{GG1}(t)$ and $i_{GG2}(t)$. Furthermore, it performs the pruning procedure using either $R_{mr}(T)$ or $R_{mc}(T)$.
rpartScore requires the rpart package. The main function in rpartScore is rpartScore, which uses the function rpart internally. In particular, rpartScore creates a model frame, a list containing parameters that control aspects of the rpart fit, and a list of three functions named eval, split and init. These three objects are passed to the function rpart through the arguments model, control and method, respectively. Because the function rpart does not perform cross-validation when the argument method is set equal to a list, the function rpartScore also uses the function xpred.rpart to internally compute the cross-validated values for the predictive performance measure, used to select the optimal tree size, and their estimated standard errors. Note that this can require long computational times, especially when the pruning procedure requires the evaluation of many subtrees. In this situation, it may be advisable to use either a small number of cross-validation groups or an independent test set.
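A hedged sketch of the idea behind this internal computation, using rpart's xpred.rpart on an ordinary regression tree for the car.test.frame data shipped with rpart (the quantity computed here mimics a cross-validated $R_{mc}(T)$; object names are illustrative):

R> library("rpart")
R> fit <- rpart(Mileage ~ Weight + Type, data = car.test.frame)
R> xp <- xpred.rpart(fit, xval = 10)   # cross-validated predictions, one column per cp value
R> cv.cost <- apply(abs(xp - car.test.frame$Mileage), 2, sum)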
The functionalities of rpartScore are almost the same as those of rpart. The main difference is the presence of two arguments (split and prune) instead of the method argument.
The argument split controls the impurity function used to grow the classification tree, which can be set equal to $i_{GG1}(t)$ ("abs", the default option) or to $i_{GG2}(t)$ ("quad"). Note that when split = "abs" and $(s_{j+1} - s_j) = 1$ for $j = 1, \ldots, J-1$, rpartScore builds a tree based on the ordinal impurity function $i_{OS}(t)$ or, equivalently, on the generalized Gini impurity function with linear costs as proposed by Archer (2010), but without producing the unexpected results that are obtained using rpartOrdinal (see Sections 2.2 and 2.3).
In particular, the internal call to rpart in rpartScore uses the function splitAbs, which, in contrast to the function ordinal$split in rpartOrdinal, examines all of the admissible splits for unordered categorical predictors as well. Furthermore, the objects direction and goodness returned by splitAbs are devised such that the internal call to rpart selects the optimal split (for further details, see the internal code of splitAbs).
Analogously, classification trees based on the generalized Gini impurity function with quadratic costs can be obtained by setting split = "quad" and choosing scores such that $(s_{j+1} - s_j) = 1$ for $j = 1, \ldots, J-1$.
The argument prune makes it possible to select the predictive performance measure used to prune the classification tree and can take two values: "mr" ($R_{mr}(T)$) or "mc" ($R_{mc}(T)$, the default option). In this way, rpartScore overcomes the problem highlighted in Section 2.4 by using predictive performance measures that are consistent with the categorical nature of the response.
The function rpartScore returns an object belonging to the class rpart. Thus, all of the functions contained in the rpart package for summarizing, plotting and pruning trees, as well as for making predictions, can also be applied to objects obtained using rpartScore. These objects can also be displayed, summarized and plotted using the functionalities of the partykit package (Hothorn and Zeileis 2012). However, since objects created using rpartScore contain as their response a numeric variable whose values correspond to the scores $\{s_1 < s_2 < \ldots < s_J\}$, the default options in partykit functions interpret the response as a quantitative variable. This may lead to summary measures and graphical representations of the dependent variable that are not consistent with its original nature. To avoid this problem, suitable options have to be chosen when using partykit functions.
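For example, a tree fitted by rpartScore could be coerced and plotted via partykit's as.party; since the stored response is the numeric score variable, the default terminal panels display numeric summaries of the scores, and a factor-style display requires custom panel functions. A minimal sketch, where T.abs.mc.opt denotes the pruned tree obtained in Section 3.1:

R> library("partykit")
R> T.party <- as.party(T.abs.mc.opt)   # coerce the rpart-class object
R> print(T.party)
R> plot(T.party)                       # default panels treat the scores as numeric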
According to the 1-se rule (Breiman et al. 1984), the optimal tree size is equal to 6 (see Figure 1). The function prune makes it possible to extract the corresponding optimal tree, shown in Figure 2. For each node in the tree, units that satisfy the corresponding splitting rule are assigned to the left child node, while units that do not satisfy it are assigned to the right child node. The following instructions make it possible to perform a similar analysis using $R_{mr}(T)$ instead of $R_{mc}(T)$ as the predictive performance measure:

R> T.abs.mr <- rpartScore(Category.s ~ age + lwt + race + smoke +
+   ptl + ht + ui + ftv, data = lowbwt, prune = "mr", xval = xgroups)
R> T.abs.mr.min.pos <- which.

As shown in Figure 3, the change in the predictive performance measure leads to the selection of an optimal tree size equal to 4 (and thus to a simpler classification tree). The optimal splitting rules of T.abs.mr.opt are also present in T.abs.mc.opt, because the same splitting function is used to build the two trees. Note that 3 out of 4 terminal nodes of T.abs.mr.opt are characterized by the same predicted score. The classification trees based on the impurity function $i_{GG2}(t)$ and pruned according to either $R_{mc}(T)$ or $R_{mr}(T)$ are displayed in Figures 4 and 5 and can be obtained as follows:

R> T.quad.mc <- rpartScore(Category.s ~ age + lwt + race + smoke +
+   ptl + ht + ui + ftv, split = "quad", data = lowbwt, xval = xgroups)
R> T.quad.mc.min.pos <- which.
R> T.quad.mr <- rpartScore(Category.s ~ age + lwt + race + smoke + ptl + ht +
+   ui + ftv, split = "quad", prune = "mr", data = lowbwt, xval = xgroups)

In this example, the choice of the splitting function has little effect on the tree topology. The two trees T.abs.mc.opt and T.quad.mc.opt, pruned using $R_{mc}(T)$, differ only in the threshold used in one splitting rule: age < 18.5 and age < 19.5, respectively (see Figures 2 and 4). Similarly, T.quad.mr.opt is identical to T.abs.mr.opt, except for the presence of an additional splitting rule and, thus, an additional terminal node (see Figures 3 and 5).

Choosing splitting function and predictive performance measure
The function rpartScore allows the user to obtain K = 4 different trees, according to the choice of the splitting function and the predictive performance measure. In the illustrative example shown in Section 3.1, the main differences among these trees can be attributed to the choice of the predictive performance measure. The identification of the best combination of splitting function and predictive performance measure is beyond the scope of this paper. However, by exploiting the theoretical framework described in Hothorn, Leisch, Zeileis, and Hornik (2005), a simulation experiment with a dependent K = 4 samples design has been performed on the lowbwt data set, to evaluate whether significant differences exist among the four solutions.
Differences have been measured in terms of ordinal association between predicted and observed scores, using Somers' d measure (Agresti 2002). This is an asymmetric measure of association, which in this experiment has been computed as the difference between the proportions of concordant and discordant pairs of units (with respect to the observed and predicted scores) among those pairs that are untied on the observed scores. The choice of this measure instead of a symmetric one, such as Goodman and Kruskal's γ or Kendall's τ_b, is due to the fact that these symmetric measures are not defined when the predicted score is constant for all of the units (which happens whenever the optimal tree size is 1). The function somersXY, which computes values of Somers' d for a set of observed (argument x) and predicted (argument y) scores, is implemented as follows:

R> somersXY <- function(x, y) {
+   out <- table(x, y)
+   r <- nrow(out)
+   c <- ncol(out)
+   p <- matrix(rep(0, r * c), ncol = c)
+   q <- matrix(rep(0, r * c), ncol = c)
+   somers.num <- 0
+   if (c > 1) {
+     for (i in 2:r) {
+       for (j in 2:c) {
+         p[i, j] <- sum(out[i, j] * sum(out[1:(i - 1), 1:(j - 1)]))
+       }
+     }
+     for (i in 2:r) {
+       for (j in 1:(c - 1)) {
+         q[i, j] <- sum(out[i, j] * sum(out[1:(i - 1), (j + 1):c]))
+       }
+     }
+     somers.num <- (sum(p) - sum(q))
+   }
+   n <- length(x)
+   out.x <- table(x)
+   somers.den <- 0.5 * (n * (n - 1) - sum(out.x * (out.x - 1)))
+   somers.num / somers.den
+ }

A training set (containing 131 of the 189 units in the lowbwt data set, selected using a stratified random scheme) has been employed to grow and prune four classification trees (one for each combination of the splitting function and the predictive performance measure) using 10-fold cross-validation and the 1-se rule. These trees have been used to predict the scores of the remaining 58 units (the test set). The value of Somers' d has been computed on the test set for each of the four trees. To account for sampling variability, B = 100 different pairs of training and test sets have been considered, resulting in a total of 400 values for Somers' d. Stratified sampling has been performed using the R package sampling (Tillé and Matei 2012).
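A minimal usage sketch of somersXY, with illustrative observed and predicted score vectors:

R> obs <- c(1, 1, 2, 2, 3, 4)    # hypothetical observed scores
R> pred <- c(1, 2, 2, 2, 3, 3)   # hypothetical predicted scores
R> somersXY(x = obs, y = pred)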