Multi-Objective Parameter Selection for Classifiers

Setting the free parameters of classifiers to different values can have a profound impact on their performance. For some methods, specialized tuning algorithms have been developed. These approaches mostly tune parameters according to a single criterion, such as the cross-validation error. However, it is sometimes desirable to obtain parameter values that optimize several concurrent, often conflicting, criteria. The TunePareto package provides a general and highly customizable framework to select optimal parameters for classifiers according to multiple objectives. Several strategies for sampling and optimizing parameters are supplied. The algorithm determines a set of Pareto-optimal parameter configurations and leaves the ultimate decision on the weighting of objectives to the researcher. Decision support is provided by novel visualization techniques.


Introduction
Many state-of-the-art classifiers have free parameters that influence their behavior, such as generalization performance and robustness to noise. Choosing proper parameters is an important aspect of adapting classifiers to specific types of data. For example, tuning the cost and kernel parameters of support vector machines is essential for obtaining sensible results (Chang and Lin 2001; Fröhlich and Zell 2005; Pontil and Verri 1998). Specialized parameter tuning approaches for classifiers have been developed. Chapelle et al. (2002) optimize support vector machine (SVM) parameters by a gradient descent of different estimates of the generalization error. Various evolutionary algorithms and swarm algorithms have been designed to optimize SVM parameters (e.g., Chunhong and Licheng 2004; Zhang et al. 2009; de Souza et al. 2006; Kapp et al. 2009). Kalos (2005) suggests a particle swarm optimization algorithm to determine the structure of neural networks. Kohavi and John (1995) introduce a general framework that optimizes a classifier according to its cross-validation error in a best-first search algorithm. Sequential parameter optimization (Bartz-Beielstein et al. 2005; Bartz-Beielstein 2006) is another parameter tuning framework that tries to cope with the stochastically disturbed results of search heuristics by repeated evaluations. For this methodology, the R (R Development Core Team 2011) package SPOT (Bartz-Beielstein et al. 2011) has been developed. Other R packages also include specialized tuning functions for classifiers. The e1071 package (Dimitriadou et al. 2010) provides a generic tuning function that optimizes the classification error for some classifiers, such as k-nearest neighbour, SVMs (Vapnik 1998), decision trees (Breiman et al. 1984), and random forests (Breiman 2001). The tuneRF function in the randomForest package tunes the number of variables per split in a random forest with respect to the out-of-bag error (Liaw and Wiener 2002).
Most frequently, parameters are tuned in such a way that they optimize a single criterion, such as the cross-validation error, which can be a good estimate of the generalization performance. However, it is sometimes desirable to obtain parameter values that optimize several concurrent criteria at the same time. In this context, it is often hard to decide which trade-off of these criteria matches the requirements best. The discipline of multiple criteria decision analysis (MCDA) analyzes formal approaches to support decision makers in the presence of conflicting criteria (see, e.g., Belton and Stewart 2002).
One possibility of optimizing a number of criteria is to combine them in a weighted sum of objective functions. However, this requires the definition of a fixed weighting, which is often highly subjective, and the result is sensitive to small changes in these weights (Deb 2004). A more flexible way of optimization is based on the identification of Pareto-optimal solutions (Laux 2005). Here, all objectives are treated separately, which means that there is no strict ordering of solutions. This selection procedure retrieves all optimal trade-offs of the objectives and leaves the subjective process of selecting the desired one to the researcher. To date, parameters are almost always tuned in a single-objective manner. One exception is a specialized two-objective genetic algorithm that tunes the parameters of SVMs according to training error and model complexity in terms of the number and influence of the support vectors (Igel 2005; Suttorp and Igel 2006). In addition, Zhang (2008) converts a multi-objective optimization of SVM parameters into a single-objective optimization by introducing weight parameters that control the trade-offs.
We developed the TunePareto package for the statistical environment R (R Development Core Team 2011) that allows for a flexible selection of optimal parameters according to multiple objectives. Unlike previously published specialized tuning algorithms, this general approach is able to identify optimal parameters for arbitrary classifiers according to user-defined objective functions. Parallelization support via the snowfall package (Knaus et al. 2009) is also included. The software further provides multiple visualization methods to analyze the results. It is freely available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=TunePareto.

Multi-objective parameter selection
As briefly mentioned, most existing parameter tuning approaches optimize classifier parameters according to a single objective, which is most often the cross-validation error. However, in some contexts, a classifier may have to meet several criteria. These criteria are usually conflicting, such that optimizing one leads to a deterioration of another. Imagine one would like to determine a predictor for a disease. Important properties of such a classifier are the sensitivity (i.e., the fraction of cases predicted correctly by the classifier) and the specificity (i.e., the fraction of controls predicted correctly by the classifier). It is intuitive that these objectives usually cannot be optimized at the same time: e.g., a classifier that classifies all examples as cases has a perfect sensitivity, but a worst-case specificity, and vice versa. A trade-off of sensitivity and specificity is often the desired result. This raises the question of how to optimize a classifier according to more than one objective. One possibility is to join the objectives in a weighted sum, i.e., f(c) = w_1 · Sens_d(c) + w_2 · Spec_d(c), where Sens_d(c) is the sensitivity of a classifier c on data set d, and Spec_d(c) is the specificity of c. However, it remains unclear how to choose the weights w_1 and w_2 appropriately. Usually, it is not known which weight combination is associated with the desired trade-off. Often, several trade-offs may be valid. Furthermore, weighted sums of objectives cannot retrieve all individually optimal solutions if the optimization problem is non-convex (Deb 2004). As it is generally unknown whether an optimization problem is convex, it is often more desirable to determine optimal classifier parameter configurations according to dominance-based methods. These include multiple trade-offs, which may later be analyzed manually by the user to choose the most appropriate one for a specific scenario. For example, a classical method of choosing a good trade-off is to test several classifiers and parameters, to plot them in a ROC curve, and to choose a good classifier on the basis of this curve. This is a special case of a multi-objective optimization.

Pareto optimality
Dominance-based selection techniques provide a means of including all possible trade-offs in the set of solutions (i.e., classifier parameter configurations) by considering the objective functions separately. So-called Pareto-optimal solutions are those solutions that cannot be improved in one objective without getting worse in another objective. Here, we do not consider reflexive partial orders such as the Pareto-order (see, e.g., Pappalardo 2008; Luc 2008). The objective function values of Pareto-optimal solutions with different trade-offs form the so-called Pareto front. An example is depicted in Figure 1. When there are multiple optimal solutions representing different trade-offs, it is often advisable to leave the final decision on the preferred trade-off to a human expert.
We now introduce a formal definition of the optimization problem (see also Deb 2004). Define a set of classifiers C = {c_1, ..., c_n} with different parameter configurations. The classifiers are rated according to M objective functions F(c) = (f_1(c), ..., f_M(c)). A classifier c_i is said to dominate a classifier c_j if

f_m(c_i) ⪯ f_m(c_j) for all m ∈ {1, ..., M}   and   f_m(c_i) ≺ f_m(c_j) for at least one m.

Here, ≺ and ⪯ are the general "better" and "better or equal" relations, depending on whether the objectives are maximization or minimization objectives.
A classifier is called Pareto-optimal if there is no classifier in C that dominates it.
All Pareto-optimal classifiers form the (first) Pareto-optimal set P_1(C). Furthermore, we inductively define the i-th Pareto-optimal set P_i(C) as the Pareto-optimal set among the solutions that are not in one of the preceding Pareto-optimal sets, i.e., the Pareto-optimal solutions of C \ ∪_{j=1}^{i−1} P_j(C).
The i-th Pareto front PF_i(C) = {F(c) | c ∈ P_i(C)} is the set of fitness values of the configurations in the i-th Pareto set. Dominance imposes a strict partial order on the classifier set C: It is transitive, i.e., if a classifier c_i dominates another classifier c_j and c_j dominates c_k, then c_i automatically dominates c_k. It is not reflexive, i.e., a classifier c_i cannot dominate itself. In particular, it is not complete: If c_i does not dominate c_j, c_j does not necessarily dominate c_i, as one classifier may be better in one objective and the other classifier may be better in another objective.
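As an illustration of these definitions, the following minimal sketch extracts the first Pareto set from a matrix of objective values. It is our own illustration, not part of the TunePareto API, and assumes for simplicity that all objectives are to be minimized.

dominates <- function(a, b) {
  # a dominates b (all objectives minimized): a is at least as good
  # in every objective and strictly better in at least one
  all(a <= b) && any(a < b)
}

paretoSet <- function(fitness) {
  # fitness: one row per configuration, one column per objective;
  # a configuration is Pareto-optimal if no other row dominates it
  optimal <- sapply(seq_len(nrow(fitness)), function(i) {
    !any(sapply(seq_len(nrow(fitness)), function(j) {
      j != i && dominates(fitness[j, ], fitness[i, ])
    }))
  })
  which(optimal)
}

# Example with two objectives and four configurations:
f <- rbind(c(0.10, 0.30), c(0.20, 0.20), c(0.30, 0.10), c(0.25, 0.25))
paretoSet(f)  # returns 1, 2, 3; configuration 4 is dominated by 2

Applying paretoSet repeatedly to the remaining rows yields the successive Pareto sets P_2(C), P_3(C), and so on.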
Hasse diagrams visualize the transitive reduction of a partially ordered set in the form of a directed acyclic graph: The elements are the nodes of the graph, and the precedence (dominance) relations are represented by edges. The transitive reduction retains only direct dominance relations, i.e., transitive edges are removed. An example of a Hasse diagram is depicted in Figure 2.
In the case of classifier parameter optimization, several stochastic factors are introduced into the strict partial order: If the ranges of the optimized parameters are non-finite (e.g., continuous or unbounded), not all parameter configurations can be evaluated. Sampling strategies and search heuristics often introduce a high amount of randomness, e.g., random sampling approaches or evolutionary algorithms. Furthermore, the objective functions are always approximations of theoretical measures (e.g., the cross-validation error is an estimate of the true classification risk on the fixed, but unknown distribution of the data set). Some classifiers also introduce inherent stochasticity if they involve random decisions (e.g., random forests build their trees at random). This means that the resulting partially ordered set constitutes an approximation of the true ordering.

(Figure 3. Left: Sensitivity and specificity of a set of Pareto-optimal solutions; in this example, the Pareto front corresponds to a ROC curve. The feasible region (shaded in blue) was restricted to a sensitivity of at least 0.6 and a specificity of at least 0.8. Right: The same Pareto-optimal solutions embedded in a desirability landscape. The desirability was calculated using two one-sided Harrington functions aggregated by the geometric mean. Analogously to the feasible region on the left, the parameters for the functions were chosen as y^(1) = 0.6, d^(1) = 0.01 and y^(2) = 0.99, d^(2) = 0.99 for the sensitivity, and y^(1) = 0.8, d^(1) = 0.01 and y^(2) = 0.99, d^(2) = 0.99 for the specificity. For comparison, the feasible region of the strict clipping at a sensitivity of 0.6 and a specificity of 0.8 is again highlighted in blue.)
In some cases, certain extreme trade-offs may be inappropriate. For example, one usually does not want to obtain a classifier with a perfect reclassification error, but a bad generalization performance in cross-validation experiments.
A simple way of handling this is to "clip" the Pareto front to a desired range of trade-offs (Figure 3, left). This is accomplished by specifying upper bounds for minimization objectives or lower bounds for maximization objectives, i.e., restricting the feasible region to a hypercube. In some cases, none of the solutions may be located in the feasible region.
Desirability functions, originally proposed by Harrington (1965), are another way of imposing restrictions on objectives. Essentially, the approach consists of transforming the objective scores according to their desirability (usually to a range of [0, 1], where a value of 0 means that a score is inappropriate and a value of 1 means that it is a desired value) and combining them into an overall desirability index, often the geometric mean or the minimum. The desirability transformation can add further notions of utility: e.g., instead of solely using specification limits as in "clipping", these transformations can emphasize mid-specification quality. The transformation also ensures that all objectives operate on a comparable scale. The desirability indices for the Pareto-optimal parameter configurations can then be calculated, and the configurations can be ranked according to their desirability. This usually places the more balanced solutions at the top of the ranking and the configurations in which only one objective has an extreme value at the bottom. However, unlike in the clipping approach, these extreme configurations are not discarded. Thus, the desirability ranking approach is a softer way of handling constraints.

Sampling strategies
The input of our method consists of intervals or lists of values for the parameters to optimize. The algorithm trains a classifier for combinations of parameter values and applies user-defined objective functions, such as the classification error, the sensitivity, or the specificity in reclassification or cross-validation experiments. It returns the Pareto set (or an approximation thereof), which comprises optimal parameter configurations with different trade-offs.
The choice of parameter configurations to rate is a crucial step of the optimization. If all parameters have a (small) finite number of possible values, a full search of all possible combinations can be performed. In case of continuous parameters or large sets of possible values, sampling strategies have to be applied to draw a subset of n* parameter configurations.
The most obvious strategy is to simply draw parameter values uniformly at random from the specified ranges or sets. However, this strategy does not ensure that parameter configurations are distributed evenly in the d-dimensional parameter space.
A well-known strategy to ensure a good coverage of the multivariate parameter space is Latin hypercube sampling (McKay et al. 1979). This strategy places each parameter range on one side of a d-dimensional hypercube. Continuous parameters are then divided into n* subintervals, and for each of these intervals, one value is drawn uniformly at random. Discrete parameters are placed on a grid with n* points such that the difference in the frequencies of any two parameter values on the grid is at most 1. Finally, the n* values in the d dimensions are joined to form n* parameter configurations.
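The following sketch illustrates Latin hypercube sampling for continuous parameter ranges; it is our own illustration of the strategy, independent of the package internals.

latinHypercube <- function(nStar, lower, upper) {
  # draw nStar configurations of length(lower) continuous parameters:
  # each range is split into nStar equally sized subintervals, one
  # value is drawn per subinterval, and the subintervals are paired
  # across dimensions in random order
  sapply(seq_along(lower), function(k) {
    points <- (sample(nStar) - runif(nStar)) / nStar
    lower[k] + points * (upper[k] - lower[k])
  })
}

set.seed(42)
latinHypercube(5, lower = c(0.01, 0.01), upper = c(1, 10))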
Another possibility of covering the parameter space is the use of quasi-random low-discrepancy sequences (Niederreiter 1992). These sequences are designed to distribute points evenly in an interval or hypercube (see, e.g., Maucher et al. 2011 for an application). Low discrepancy guarantees that any optimal parameter combination lies close to some configuration in the sample. As there is no such guarantee for random strategies such as uniform selection or Latin hypercube sampling, we generally recommend quasi-random sequences over these strategies.
We use three multi-dimensional quasi-random sequences. The d-dimensional Halton sequence is defined as

X_H(n) = (Θ_{b_1}(n), ..., Θ_{b_d}(n)),

where the bases b_1, ..., b_d are pairwise coprime (typically the first d prime numbers) and Θ_b is a van der Corput sequence (van der Corput 1935) with base b, i.e.,

Θ_b(n) = Σ_{j=0}^{∞} a_j(n) · b^{−(j+1)}.

Here, the a_j(n) are the digits of the base-b representation of n, i.e., n = Σ_{j=0}^{∞} a_j(n) · b^j with a_j(n) ∈ {0, ..., b − 1} for all j.
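A direct transcription of this definition into R may look as follows (a sketch for illustration; an optimized implementation such as the one provided by the gsl package would be used in practice):

vanDerCorput <- function(n, b) {
  # n-th element of the van der Corput sequence with base b:
  # mirror the base-b digits of n at the radix point
  theta <- 0
  factor <- 1 / b
  while (n > 0) {
    theta <- theta + (n %% b) * factor
    n <- n %/% b
    factor <- factor / b
  }
  theta
}

halton <- function(n, bases = c(2, 3)) {
  # n-th element of a d-dimensional Halton sequence: one van der
  # Corput sequence per dimension with pairwise coprime bases
  sapply(bases, function(b) vanDerCorput(n, b))
}

t(sapply(1:5, halton))  # the first five points in [0, 1)^2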
A one-dimensional Sobol sequence X_S(n) is computed from a sequence of direction numbers v_1, v_2, ... defined by the recursion

v_i = c_1 v_{i−1} ⊕ ... ⊕ c_{s−1} v_{i−s+1} ⊕ v_{i−s} ⊕ (v_{i−s} / 2^s),

where p(x) = x^s + c_1 x^{s−1} + ... + c_{s−1} x + 1 is a primitive polynomial of degree s over the field Z_2. The n-th element of the sequence is then obtained as

X_S(n) = a_1(n) · v_1 ⊕ a_2(n) · v_2 ⊕ ...,

where the a_i(n) denote the binary representation of n. Here, x ⊕ y denotes the bitwise XOR of the binary representations of x and y. For a multi-dimensional Sobol sequence, one-dimensional Sobol sequences are combined, i.e.,

X_S(n) = (x_1(n), ..., x_d(n)),

where x_i(n) is the n-th element of a one-dimensional Sobol sequence obtained from a polynomial p_i, and all polynomials p_1, ..., p_d are pairwise different. For an exemplary implementation, see Bratley and Fox (1988).
The Niederreiter sequence of dimension d and base b (Niederreiter 1992) is a generalization of the Sobol sequence to arbitrary bases. Figure 4 compares the described sampling strategies in two dimensions. In this low-dimensional example, the Niederreiter sequence and the Sobol sequence (Panel 4) are identical. The examples show that quasi-random sequences achieve a more regular coverage of the search space than random numbers and Latin hypercube sampling. In the context of quasi-Monte Carlo integration, Morokoff and Caflisch (1995) showed that Halton sequences generally give better results for up to 6 dimensions, whereas Sobol sequences (and Niederreiter sequences, as they are a generalization of Sobol sequences) perform better in higher dimensions. This may give rough guidance for choosing the proper sequence.
For very large parameter spaces, and in particular for continuous parameters, too many configurations may be required to cover the search space at a sufficient resolution. In such cases, it may be preferable to tune parameters using a search heuristic. TunePareto implements a multi-objective evolutionary algorithm for this purpose. The employed algorithm is based on the well-known NSGA-II approach (Deb et al. 2002), which combines a multi-objective selection procedure with a crowding distance to obtain a good coverage of the Pareto front.
Unlike the original NSGA-II, our implementation introduces some features known from Evolution Strategies (see, e.g., Eiben and Smith 2003; Beyer and Schwefel 2002), in particular a self-adaptation of mutation rates. This reduces the number of parameters of the algorithm, as the distribution parameters for mutation and recombination do not have to be specified. It also allows for a more differentiated mutation scheme, as the genes (i.e., the classifier parameters) are mutated using individual mutation rates. The following briefly characterizes the algorithm implemented in TunePareto:

Representation

An individual c consists of genes g_1, ..., g_d corresponding to the d parameters to optimize. For each continuous parameter g_k, an individual has a mutation rate σ_k.
Initialization

The first generation of µ individuals is drawn at random from the parameter space using Latin hypercube sampling.

Fitness measurement
The fitness of an individual c_i is the vector of the M objective function values F(c_i) = (f_1(c_i), ..., f_M(c_i)). In addition, a crowding distance is assigned to each individual. This crowding distance quantifies the uniqueness of a fitness vector compared to other vectors and preserves diversity on the Pareto front. Let rk_m(c_i) be the rank of configuration c_i when sorting the configurations according to objective f_m in increasing order. For each objective f_m and each configuration c_i, the cuboid formed by the nearest neighbours with respect to this configuration is

D_m(c_i) = f_m(c^(rk_m(c_i)+1)) − f_m(c^(rk_m(c_i)−1)),

where c^(r) denotes the configuration at rank r in this ordering, and configurations with the smallest or largest value of f_m are assigned an infinite distance. Here, we assume that the range of f_m is [0, 1], which means that objective functions with different ranges have to be normalized. The total crowding distance of configuration c_i is then

D(c_i) = Σ_{m=1}^{M} D_m(c_i).

This crowding distance is employed both for parent selection and survivor selection.
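A sketch of this computation (our own transcription of the NSGA-II crowding measure; the fitness values are assumed to be normalized to [0, 1]):

crowdingDistance <- function(fitness) {
  # fitness: one row per configuration, one column per objective
  n <- nrow(fitness)
  D <- numeric(n)
  for (m in seq_len(ncol(fitness))) {
    ord <- order(fitness[, m])
    # boundary configurations receive an infinite distance
    D[ord[c(1, n)]] <- Inf
    if (n > 2) {
      inner <- ord[2:(n - 1)]
      # distance between the two nearest neighbours per objective
      D[inner] <- D[inner] +
        fitness[ord[3:n], m] - fitness[ord[1:(n - 2)], m]
    }
  }
  D
}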
Recombination

In each generation, λ offspring are created from the µ parents by randomly mating two parents. For each of the λ offspring, two parents are chosen according to a tournament selection using the crowded-comparison operator ≺_n by Deb et al. (2002): That is, a configuration c_i is better than a configuration c_j if it is in a better Pareto set than c_j, or if it is in the same Pareto set but has a higher crowding distance. A discrete recombination scheme is used to determine the classifier parameter values of the children: For each parameter, the value of one of the parents is chosen at random. For the mutation rates, intermediate recombination is used (i.e., the mean of the two parent mutation rates is taken for the child). This corresponds to a commonly used recombination scheme in Evolution Strategies.
Mutation

Each of the offspring is mutated. For the continuous parameters, we use uncorrelated mutations with individual step sizes: First, each mutation rate σ_k is mutated log-normally, i.e., σ'_k = σ_k · e^(τ' · N(0,1) + τ · N_k(0,1)), and then each gene is perturbed according to g'_k = g_k + N_k(0, σ'_k), with N(m, s) being a value drawn from the normal distribution with mean m and standard deviation s. Discrete parameters are mutated with a probability of 1/d. For integer parameters, mutations are applied by choosing one of {g_k − 1, g_k + 1}. For nominally scaled parameters, a new value is chosen uniformly at random.
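A sketch of the self-adaptive mutation of the continuous parameters, using learning rates that are common in Evolution Strategies (our illustration; the constants used internally by the package may differ):

mutateContinuous <- function(g, sigma) {
  d <- length(g)
  tauGlobal <- 1 / sqrt(2 * d)        # common learning rate
  tauLocal  <- 1 / sqrt(2 * sqrt(d))  # gene-specific learning rate
  # log-normal self-adaptation of the step sizes, then perturbation
  # of each gene with its own new step size
  sigmaNew <- sigma * exp(tauGlobal * rnorm(1) + tauLocal * rnorm(d))
  list(g = g + sigmaNew * rnorm(d), sigma = sigmaNew)
}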
Survivor selection

The next generation is selected in a µ + λ strategy by merging the previous generation and the offspring and then applying the non-dominated sorting procedure also used in NSGA-II (Deb et al. 2002): The Pareto sets P_i of the configurations are determined, and parameter configurations are taken from the successive sets (starting with P_1) until the desired generation size µ is reached. If there are more configurations in the current Pareto set than required to reach the generation size µ, configurations are chosen according to their crowding distances, taking the configurations with the highest crowding distances D.

The TunePareto package
At the core of the package is the general function tunePareto, which can be configured to select parameters for most standard classification methods provided in R. Classifiers are encapsulated in TuneParetoClassifier objects. These objects describe the calls to the training and prediction methods of a classifier in a very generic way. TunePareto includes predefined, convenient interfaces to frequently used classifiers, i.e., k-nearest neighbour (k-NN), support vector machines (SVM, Vapnik 1998), decision trees (Breiman et al. 1984), random forests (Breiman 2001), and naïve Bayes (Duda and Hart 1973; Domingos and Pazzani 1997). For all other classifiers, such wrappers can be obtained using the tuneParetoClassifier function.
Parameters are selected according to one or several objective functions. A set of objective functions is predefined in TunePareto, such as the error, sensitivity and specificity, the confusion of two classes in reclassification and cross-validation experiments, or the error variance across several cross-validation runs. Cross-validation experiments can be performed in a stratified or a non-stratified manner. It is also possible to define custom objective functions, which is supported by various helper functions.
In the following example, we apply a random forest classifier (Breiman 2001) to the Parkinsons data set (Little et al. 2009) available from the University of California at Irvine (UCI) machine learning repository. tunePareto is supplied with the data set, the corresponding class labels, the tuned parameters, and the objective functions. The tuned parameters are supplied in the ... argument (in this case, the single parameter ntree). Possible values of parameters can be specified either as lists of possible values or as continuous parameter ranges using the function as.interval.
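A call of roughly the following form produces the results shown below. This is a sketch: the tested ntree values and the objective settings are illustrative assumptions, not taken from the original experiment.

R> library("TunePareto")
R> result <- tunePareto(data = parkinsons, labels = parkinsons.labs,
+    classifier = tunePareto.randomForest(),
+    ntree = seq(100, 300, 20),
+    objectiveFunctions = list(
+      cvError(nfold = 10, ntimes = 10),
+      cvSensitivity(nfold = 10, ntimes = 10, caseClass = 1)))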
From the specified ranges of all optimized parameters, combinations of values are generated and tested. By default, all possible combinations are tested. If one would like to specify a certain set of combinations to be tested, the parameterCombinations parameter can be set instead of supplying the value ranges in the ... argument.
Printing the resulting object shows the Pareto-optimal solutions among the tested configurations and their objective values:

R> result
Pareto-optimal parameter sets:
               CV.Error CV.Sensitivity
ntree = 220  0.08564103      0.9727891

In this case, there are three Pareto-optimal solutions. The objective scores of all (not only the optimal) solutions can be viewed by printing result$testedObjectiveValues.

Sampling strategies
If continuous parameters are used, a full search is not possible. Here, we have to apply a sampling strategy in order to obtain a good coverage of the parameter space. The following example uses Latin hypercube sampling to optimize the cost and gamma parameters of an RBF support vector machine using 30 samples. The sampling strategy is specified using the sampleType parameter. Parameters are tuned according to sensitivity and the mean class-wise cross-validation error (CV.WeightedError). This error rate accounts for unbalanced classes as in the Parkinsons data set. As we would like to compare the results of the tuning process with other sampling strategies, we generate the partition for the cross-validation in advance using generateCVRuns. This returns a list structure specifying the folds for the different repetitions of the cross-validation. This structure can be supplied to the cross-validation objectives in the foldList parameter, ensuring that all experiments are based on the same cross-validation partition.
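A sketch of the corresponding call follows. The parameter ranges, the sampleType value "latin", and the cvWeightedError objective are assumptions based on the description above and should be checked against the package documentation.

R> foldList <- generateCVRuns(labels = parkinsons.labs,
+    ntimes = 10, nfold = 10, stratified = TRUE)
R> result <- tunePareto(data = parkinsons, labels = parkinsons.labs,
+    classifier = tunePareto.svm(), kernel = "radial",
+    cost = as.interval(0.001, 10), gamma = as.interval(0.001, 10),
+    sampleType = "latin", numCombinations = 30,
+    objectiveFunctions = list(
+      cvWeightedError(nfold = 10, ntimes = 10, foldList = foldList),
+      cvSensitivity(nfold = 10, ntimes = 10, caseClass = 1,
+        foldList = foldList)))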

Visualization
A classical way of visualizing the results of a multi-objective optimization is plotting the (approximated) Pareto fronts. In TunePareto, this is accomplished using the plotParetoFronts2D function. For the above results

R> plotParetoFronts2D(result, drawLabels = FALSE)
plots the 2-dimensional Pareto front. To enhance clarity, the labels of the points (i.e., the parameter values) are suppressed. For results obtained with the evolutionary algorithm, all returned solutions are Pareto-optimal, so that there is only a single front.
If parameters are selected according to more than two objectives, the standard 2-dimensional plot is not applicable. TunePareto includes two further plots that can cope with more than two objectives. In the following example, we optimize an SVM according to the cross-validation error, the sensitivity and the specificity with respect to class 1. We use a sampling strategy according to the Niederreiter sequence. This requires the gsl package (Hankin 2006), a wrapper for the GNU Scientific Library.
R> result <- tunePareto(data = parkinsons, labels = parkinsons.labs,
+    classifier = tunePareto.svm(), gamma = as.interval(0.01, 1),
+    cost = 1, kernel = "radial", sampleType = "niederreiter",
+    numCombinations = 20, objectiveFunctions = list(
+      cvError(nfold = 10, ntimes = 10),
+      cvSensitivity(nfold = 10, ntimes = 10, caseClass = 1),
+      cvSpecificity(nfold = 10, ntimes = 10, caseClass = 1)))
R> result

The results are depicted in a matrix plot in Figure 6: For each pair of objectives, the approximated Pareto fronts are plotted as in the above example, and the plots are arranged in a matrix structure. This makes a visual comparison of more than two objectives possible. Pairs of objectives always occur twice in the matrix, as each objective is plotted once on each axis. The labels correspond to the configurations. By default, labels are drawn in such a way that they do not overlap with each other and do not exceed the margins of the plot, which means that some labels may be omitted. All labels can be drawn by setting fitLabels = FALSE.
Although the Pareto fronts of the single 2-dimensional plots in the matrix do not correspond to the overall Pareto front, this pairwise comparison can reveal relations of objectives that are not visible from the overall results. For example, the top-right and bottom-left plots reveal that, in this case, the cross-validation error and the specificity are not concurrent. We gain a total ordering on both objectives, as each Pareto front consists of only a single configuration. This means that it might suffice to optimize only one objective. Recalculating the Pareto set according to sensitivity and specificity, omitting the cross-validation error, highlights this: The Pareto-optimal solutions are the same as in the above example using all three objectives.

A further possibility to visualize optimizations with more than two objectives is to plot the Pareto front approximations in a graph (see Figure 7). When read from left to right, this corresponds to a Hasse diagram of the dominance relations with an additional color encoding for the best values in an objective (see also Figure 2).

(Figure 7: The nodes represent the parameter configurations and are ordered in columns according to the Pareto fronts. The edges represent dominance relations between two configurations; for example, gamma = 0.12408 is dominated by gamma = 0.49533. The color indicators show in which objective a configuration is optimal with respect to its Pareto front, e.g., gamma = 0.74283 has the best sensitivity in the first Pareto front.)

R> plotDominationGraph(result, legend.x = "topright")
The function plotDominationGraph is based on the igraph package (Csardi and Nepusz 2006). Each node in the graph corresponds to one parameter configuration, and an edge corresponds to a dominance relation. The nodes are ordered such that the columns correspond to Pareto fronts. Small color indicators next to the nodes show in which of the objectives the corresponding configuration is optimal with respect to its Pareto front. In the default setting, transitive dominance relations are not drawn, as they are always caused by multiple direct dominance relations of configurations. Transitive edges can be included by setting the parameter transitiveReduction to FALSE. This is a more abstract representation than the usual Pareto front plot, as the actual scores for the objectives are not depicted. The graph representation allows for capturing dominance relations among the configurations at a glance and is suitable for any number of dimensions.

Selecting configurations
In principle, all solutions on the (first) Pareto front can be viewed as equally good. However, there are often additional requirements for the solutions. Consider an example similar to the one above: We tune the SVM gamma parameter according to specificity and sensitivity, requiring both objectives to reach a value of at least 0.6.
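A sketch of such a call is shown below. The objectiveBoundaries argument (one bound per objective) is our assumption of how the restriction is specified; the remaining settings are carried over from the previous example for illustration.

R> result2 <- tunePareto(data = parkinsons, labels = parkinsons.labs,
+    classifier = tunePareto.svm(), kernel = "radial", cost = 1,
+    gamma = as.interval(0.01, 1), sampleType = "niederreiter",
+    numCombinations = 20, objectiveFunctions = list(
+      cvSensitivity(nfold = 10, ntimes = 10, caseClass = 1),
+      cvSpecificity(nfold = 10, ntimes = 10, caseClass = 1)),
+    objectiveBoundaries = c(0.6, 0.6))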

R> result2
Pareto-optimal parameter sets matching the objective restrictions:

We can visualize this using plotParetoFronts2D, as depicted in Figure 8. The boundaries are drawn as grey dashed lines. In this case, only the boundary of the specificity is visible, as the boundary for the sensitivity lies far below the performance of all solutions and thus outside the drawing region.
Desirability functions constitute another possibility of imposing restrictions on the objective values. The desire package (Trautmann et al. 2009) provides implementations of common desirability functions that can be applied to the tuning results. In this example, the objective values are rated according to Harrington's one-sided desirability function (Harrington 1965). Again, we set a value of 0.6 as a margin for both specificity and sensitivity. A value of 0.99 is considered as nearly optimal. The desirability functions of a parameter configuration c are aggregated according to the geometric mean. The values of the desirability index di are used to rank the Pareto-optimal configurations. The example shows that Pareto-optimal solutions with balanced objective values are ranked higher than those with an extremely good performance in a single objective. This behaviour is influenced by the choice of the geometric mean for the desirability index and may change when using different desirability indices. Here, solutions with a better specificity are preferred, as the sensitivity is always close to the maximum.
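For illustration, the one-sided Harrington function can be computed directly from its two defining points. The following sketch implements the textbook formula rather than the desire API:

harrington1 <- function(y, y1, d1, y2, d2) {
  # one-sided Harrington desirability d(y) = exp(-exp(-(b0 + b1 * y)))
  # fitted through the points (y1, d1) and (y2, d2)
  b1 <- (log(-log(d1)) - log(-log(d2))) / (y2 - y1)
  b0 <- -log(-log(d1)) - b1 * y1
  exp(-exp(-(b0 + b1 * y)))
}

di <- function(sens, spec) {
  # desirability index with the margins used above: geometric mean
  # of the sensitivity and specificity desirabilities
  sqrt(harrington1(sens, 0.6, 0.01, 0.99, 0.99) *
       harrington1(spec, 0.6, 0.01, 0.99, 0.99))
}

di(sens = 0.95, spec = 0.70)  # a balanced solution scores higher ...
di(sens = 0.99, spec = 0.61)  # ... than an extreme one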

Customizing TunePareto
The TunePareto package is flexible and can be extended by custom classifier wrappers and objective functions.
Classifiers are encapsulated in TuneParetoClassifier objects, which describe the calls needed for training and applying a classifier in TunePareto. To utilize these methods directly, TuneParetoClassifier objects can not only be used in tunePareto, but also in special training and prediction functions (trainTuneParetoClassifier and predict.TuneParetoModel) that can be integrated into other custom tuning procedures.
We use the random forest classifier to illustrate the creation of custom classifier objects:

R> forest <- tuneParetoClassifier(name = "randomForest",
+    classifier = randomForest, predictor = predict,
+    classifierParamNames = "ntree", predictorParamNames = NULL,
+    useFormula = FALSE, trainDataName = "x", trainLabelName = "y",
+    testDataName = "newdata", modelName = "object",
+    requiredPackages = "randomForest")

The tuneParetoClassifier function creates a wrapper for the classifier to be called. The name parameter specifies a human-readable name of the classifier. The further parameters specify the type and arguments of the classifier and predictor methods. Here, classifier specifies the classifier training function, and predictor specifies the prediction function. It is also possible to call a function that integrates both training and prediction by leaving the predictor parameter empty. classifierParamNames and predictorParamNames are vectors that define the names of arguments that are accepted as valid parameters for the classifier and the predictor function by tunePareto. In this case, we specify only the ntree parameter, which is the parameter we would like to optimize. Default values for parameters can be set using the two further arguments predefinedClassifierParams and predefinedPredictorParams.
trainDataName, trainLabelName, testDataName and modelName are string parameters that specify the names of the arguments of the training and prediction functions for the training data, the class labels, the test data for prediction, and the trained model in the prediction function respectively. The requiredPackages parameter lists the packages that are required to run the classifier. These packages are loaded automatically by TunePareto. If run in a snowfall cluster, the packages are loaded on all nodes. The forest object resulting from the call can be passed to the classifier parameter of tunePareto.
The randomForest classifier can be called in two ways: by providing the data and the labels using the x and y parameters, or by providing a formula and a data frame. The useFormula argument of the wrapper specifies which of these interfaces is used.
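The next paragraph walks through the definition of a custom objective function measuring the cross-validation false positive rate. A sketch of what such a definition can look like is given below; the createObjective arguments follow the names used in the text, while the exact structure of the precalculation result (the trueLabels and predictedLabels fields) is an assumption to be checked against the package documentation.

cvFalsePositives <- function(nfold = 10, ntimes = 10,
                             leaveOneOut = FALSE, stratified = FALSE,
                             foldList = NULL, caseClass) {
  createObjective(
    # precalculation: repeated cross-validation of the classifier
    precalculationFunction = crossValidation,
    precalculationParams = list(nfold = nfold, ntimes = ntimes,
                                leaveOneOut = leaveOneOut,
                                stratified = stratified,
                                foldList = foldList),
    # scoring: mean fraction of false positive predictions per run
    objectiveFunction = function(result, caseClass) {
      fpRates <- sapply(result, function(run) {
        trueLabels <- unlist(lapply(run,
                                    function(fold) fold$trueLabels))
        predicted <- unlist(lapply(run,
                                   function(fold) fold$predictedLabels))
        sum(predicted == caseClass & trueLabels != caseClass) /
          sum(trueLabels != caseClass)
      })
      mean(fpRates)
    },
    objectiveFunctionParams = list(caseClass = caseClass),
    direction = "minimize",
    name = "CV.FalsePositives")
}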
In the above example, we use the crossValidation function as a precalculation function. Precalculation functions receive the classifier, the training and test data, and the parameters as input. Furthermore, they can take additional parameters defined in the precalculationParams argument of createObjective. In this case, these are the number of runs and folds, the switches for leave-one-out cross-validation and stratification, and the foldList parameter, which can be used to supply a precalculated cross-validation partition instead of generating a random partition. The output of a precalculation function is not predefined: It is a single object which is passed directly to the actual scoring functions and can take any form these functions are able to process. Here, it is a list of runs, each containing a list of folds with a vector of true labels and the predicted labels. This list is the first parameter of the function defined in the objectiveFunction argument. Like the precalculation function, this scoring function can also have additional parameters. These are specified in the objectiveFunctionParams argument. In this case, we have to specify the class which is considered as the positive class for the calculation of the rate (caseClass). The scoring function determines the mean fraction of false positive predictions across the runs of the cross-validation. The direction argument specifies whether the optimal score is the minimum or the maximum. Furthermore, a readable name is supplied for the objective. The function cvFalsePositives can now be called and passed to the objectiveFunctions parameter of tunePareto just like the objective functions known from the above examples.

Discussion
Parameter tuning is an everyday issue for many researchers in the field of machine learning. Parameters are often specified according to rules of thumb and intuition, or by rudimentary trials. Automatic parameter tuning has been studied mainly with a focus on single classifiers and single objectives.
Parameter selection for classifiers should obey certain standards. Many published results are over-optimistic because the same data is used both for parameter selection and validation of the final classifier. To obtain unbiased results, Bishop (1995) suggests splitting the data into a training set, a validation set, and a test set. The training set and the validation set are used to determine the parameters of the classifiers. The performance of the classifier is then assessed independently on the test set. A similar approach was recently proposed by Boulesteix and Strobl (2009). Varma and Simon (2006) suggest a nested cross-validation for the parameter selection. Bischl et al. (2010) describe common pitfalls in the context of tuning and resampling experiments.
TunePareto provides a general framework for the selection of classifier parameters according to multiple objectives. Parameter values can be chosen according to intelligent sampling strategies and search heuristics, such as quasi-random sequences and evolutionary algorithms. The package includes wrappers for many state-of-the-art classifiers and objective functions, but can be extended to almost any classifier using arbitrary objective functions. The multi-objective view on the parameter selection problem can help to discover trade-offs of objectives that remain invisible when optimizing according to a single objective. The basic idea is not to determine a single best parameter configuration, but to offer a range of good parameter configurations with different classifier properties, leaving the ultimate decision to the researcher. Decision support is provided by visualization functions. In particular, the package introduces visualization techniques for more than two objectives.
Although it is often advisable to consider several objectives separately, one should keep the number of objectives small. With four or more objectives, so-called many-objective optimization problems arise, which pose additional challenges for the tuning process (see, e.g., Ishibuchi et al. 2008). In particular, almost all solutions are non-dominated if too many objectives are specified, which means that it is hard to determine the desired solutions. Furthermore, the number of solutions needed to approximate the Pareto front increases exponentially with the number of objective functions.
The stochasticity of the tuning procedure, caused by randomized processes such as the selection of a partition for the cross-validation, random factors in classifiers, and the sampling of parameter combinations, may require additional measures. For example, repetitions in the calculation of the objective functions (such as specifying multiple runs in the ntimes parameter of cross-validation objectives) can reduce the effects of outliers. Another option is to run the complete tuning process repeatedly and to calculate a joint Pareto front. In particular, the evolutionary search process can benefit from restarts, as it may converge to different optima depending on its initialization and random seed. When merging results from repeated subsampling experiments, one should ensure that all these experiments use the same partitions (e.g., by supplying a pregenerated fold list to a cross-validation) to make the results comparable.