Developer-Friendly and Computationally Eﬃcient Predictive Modeling without Information Leakage: The emil Package for R

Data driven machine learning for predictive modeling problems (classiﬁcation, regression, or survival analysis) typically involves a number of steps beginning with data pre-processing and ending with performance evaluation. A large number of packages providing tools for the individual steps are available for R , but there is a lack of tools for facilitating rigorous performance evaluation of the complete procedures assembled from them by means of cross-validation, bootstrap, or similar methods. Such a tool should strictly prevent test set observations from inﬂuencing model training and meta-parameter tuning, so-called information leakage, in order to not produce overly optimistic performance estimates.Herewe present a new package for R denoted emil (evaluation of modeling without information leakage) that oﬀers this form of performance evaluation. It provides a transparent and highly customizable framework for facilitating the assembly, execution, performance evaluation, and interpretation of complete procedures for classiﬁcation, regression, and survival analysis. The components of package emil have been designed to be as modular and general as possible to allow users to combine, replace, and extend them if needed. Package emil was also developed with scalability in mind and has a small computational overhead, which is a key requirement for analyzing the very big data sets now available in ﬁelds like medicine, physics, and ﬁnance. First package emil ’s functionality and usage is explained. Then three speciﬁc application examples are presented to show its potential in terms of parallelization, customization for survival analysis, and development of ensemble models. Finally a brief comparison to similar software is provided.


Introduction
Data driven machine learning for predictive modeling problems (classification, regression, and survival analysis) is employed in virtually all scientific domains and rapidly grows in importance in the context of what is called "big data" (Snijders, Matzat, and Reips 2012;Wu, Zhu, Wu, and Ding 2014). In this context a prediction model is a computable function f that transforms an input feature vector x into a prediction y determined as y = f (x) where y may have one or several dimensions. Usually the main purpose of a predictive modeling exercise is to provide decision support in the form of actual predictions for new inputs of interest as well as in the form of assigning importance to the features in the input vector x. Predictive modeling typically involves the following three steps: (1) data pre-processing including imputation of missing values, raw data transformations and/or feature selection; (2) parameter fitting and model family selection including tuning of meta-parameters or model selection by an information theory based criterion, like AIC (Akaike 1973) or BIC (Schwarz 1978); and (3) prediction using the selected final prediction model. When performing predictive modeling using a particular computational procedure consisting of a pre-processing part (step 1 above) and a modeling procedure part (step 2 above), a problem of fundamental interest is to characterize and quantify the overall prediction performance of the models produced by the procedure. This is because the computational procedure used might be too liberal or too constrained for the underlying prediction problem of interest. If too liberal the final prediction models produced will have been fitted too well to the design/training examples used (overfitting) and will therefore perform poorly on new external test examples. If the computational procedure would be too constrained the models would not be flexible enough to enable sufficiently close fitting to the design examples. Typically the overall prediction performance, especially the expected average performance of the prediction models produced, is estimated using testing on external examples that have not at all been part of the pre-processing or the modeling procedure. In order to obtain a reliable characterization that is not dependent on one particular split of the available data into one set of examples for training and one set of examples for external performance estimation, usually cross-validation (CV), or some other resampling method, is employed to build multiple prediction models, each tested on independent external test examples (Hastie, Tibshirani, and Friedman 2001;Varma and Simon 2006;Lawless and Yuan 2010). If test examples are allowed to influence the modeling because they are not excluded from pre-processing and the modeling procedure there is a risk of positively biasing the resulting performance estimate. We will refer to such influences as information leaks. Examples of such leaks include imputation or feature selection methods applied to an entire data set prior to resampling based performance evaluation.
When performing the kind of performance evaluation of a computational procedure used for predictive modeling described above, it is desirable to use an existing computational framework that takes care of the tedious but straight-forward book-keeping aspects, yet is flexible enough to allow for easy customization and comparison to alternative solutions, while not adding an unnecessarily large computational or memory burden. The package emil (Bäcklin and Gustafsson 2018) for R (R Core Team 2018) is designed to achieve precisely this. It provides a simple, transparent, light-weight, and well documented framework encompassing all of the aspects inherent to data driven machine learning based predictive modeling. Rather than providing a high level analysis platform rich in pre-defined features package emil strives towards being an application programming interface (API) with highly reusable components and consistent syntax and data structures. This results in analysis scripts that are easy to read, low in maintenance, and easy to debug. Package emil does however include several methods for each analysis step that can be used to do analyses out-of-the-box and serve as examples of how to implement customized methods. The usability of package emil has recently been appreciated in a master of science engineering program course focused on multivariate data analysis as well as in a few master thesis projects.
First we present an overview of the functionality of the emil package, followed by three application examples, and lastly we provide a comparison to the caret package (Kuhn 2008;Kuhn et al. 2018) which provides partially overlapping functionality. Functions used in the examples from packages other than emil and caret are summarized in Table 1. For a comprehensive guide to all aspects of the emil package's usage, please refer to its included documentation (accessible from within R by entering ?emil). Package emil is publicly available for download at the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project. org/package=emil, and all code and data required to run the examples and benchmarking is publicly available at http://github.com/Molmed/Backlin-2017.

Notation and terminology
In this work the following notation and terminology is used (as partly outlined by Hastie et al. 2001). The structure and adjustable parameters of a prediction model f yielding predictions y = f (x) is said to be "fitted" to a dataset. A dataset consists of a matrix or data frame x of size n × p, where n denotes individual "observations" and p denotes "features", and a vector of "response values" y of length n. The words "train" and "design" commonly used elsewhere have the same meaning as "fit" and can be used interchangeably. Such a relationship is also the case for the groups of words "observations", "objects", and "examples"; "variables", "attributes", "independent variables", and "predictor variables"; and "response", "response variable" and "dependent variable". Also note that x and y may have a different structure elsewhere. For example, x may be a collection of documents stored as text and y may be a matrix with multiple response values for each observation.
Often the fitting procedure used has one or several meta-parameters that also need to be selected based on the data available. One classical example is ridge regression (Hastie et al. 2001) where the resulting prediction model can be written as the linear model y = w θ x where the coefficient vector w θ is determined using a fitting procedure that involves a penalty (regularization) term having an influence determined by a meta-parameter θ. Another classical example is the number of latent variables when performing linear regression using the partial least squares algorithm (PLS, Hastie et al. 2001). Also in this case a linear prediction model y = w k x is built where the coefficient vector w k is determined using ordinary least squares fitting in a space constrained by k latent variables where k has to be tuned/selected based on the set of training examples available.
In addition to the kind of meta-parameters discussed above, in many cases one also has to choose between different model classes. For example, in regression modeling it is common to choose between prediction models that are linear (i.e., have the form y = w x), prediction models based on standard binary regression trees (RTs), prediction models based on k-nearestneighbor regression (kNNR), and models based on multilayer perceptron artificial neural networks (MLPANNs) (Hastie et al. 2001). For each of these classes there are meta-parameters to tune (for linear models see above, for RTs the depth of the tree, for kNNR one has to choose   (Huber et al. 2015) contains the core functionality of the Bioconductor platform (Gentleman et al. 2004), which is widely used to do bioinformatic analyses.
the value of k, for MLPANN one has to choose the number of hidden layers and hidden nodes) and on top of this one also has to choose the class of the final model, which is to be used for prediction of unknown examples and for interpretation of important variables etc. Therefore one should regard the variable used to encode the model class as another meta-parameter that has to be tuned based on the training examples available. Thus the "fitting" of the structure and the parameters of one prediction model to data can also be viewed and interpreted as the result of a fitting procedure that selects one model (defined by its specific structure and associated parameter values) in the space of all conceivable models spanned by all plausible structures and parameter values.

Functionality
The main objective of the emil package is to build and evaluate different computational procedures for predictive modeling using resampling based testing. It is particularly developed for providing unbiased performance estimation, an important task when the number of samples is small compared to the number of features and/or compared to the number of model parameters including alternative model structures. Under such conditions there is always a serious risk of overfitting models (structure and parameters) as well as making overly optimistic performance estimates. As already mentioned, most fitting algorithms contain meta-parameters for controlling overall model complexity, e.g., the amount of penalty (regularization) in ridge regression or the depth of a regression tree, apart from the ordinary parameters obtained as a consequence of the choice of meta-parameter value, i.e., the actual coefficients in the resulting linear prediction model (ridge regression) and the variables and corresponding split points in the regression tree. These meta-parameters are typically also tuned with resampling based testing and it is important to keep this exercise strictly inside the training set, requiring a two-layered resampling scheme, for combined tuning and unbiased performance estimation (Hastie et al. 2001;Varma and Simon 2006;Lawless and Yuan 2010).
Consider the task where a data set D is to be pre-processed according to a pre-processing algorithm A and then used by an modeling procedure B(θ), defined by a user-defined metaparameter θ, to create a prediction model f θ . Returning to the ridge regression example above as one illustration, θ is known as the ridge/penalty parameter and controls the amount of regularization used when determining the optimal coefficient vector w θ . In order to obtain an unbiased estimate/characterization of the performance of such a modeling procedure one needs to perform a two-layered resampling procedure outlined in Algorithm 1. Such a procedure inevitably requires many data subsets to be generated and processed, and it is of vital importance to make sure that no test set is allowed to influence the contents of any training set, including all data pre-processing steps, in order to not bias the final performance estimate.
Although A generally is an unsupervised method that only serves to prepare the features of a data set for analysis, such as a normalization, imputation or compression technique, it is important to acknowledge the fact that it is the combination of A and B(θ) that constitutes the complete computational procedure for producing a model from a raw data set. There is no theoretical reason for separating A and B(θ) into different entities, but in practice it is convenient to do so to enable different implementations of A and B(θ) to be used and combined independently of one another. This also allows the results from A to be reused together with a different B(θ), which can save a lot of time if A is a time consuming algorithm.
Algorithm 1 Two-layered resampling. loop for performance evaluation. Randomly split D into mutually exclusive subsets D f (fitting) and D t (testing). loop for meta-parameter tuning.
Randomly split D f into mutually exclusive subsets D ff and D ft . Define a normalization transform n ff using the algorithm A applied to D ff alone. for each possible value of the meta-parameter θ do Fit one model f θ using B(θ) applied to the transformed data set n ff (D ff ).
Estimate the performance of f θ using the transformed test set n ff (D ft ). end for return Performance estimates for all parameter values. end loop Determine the most promising value θ * among the tested values for θ. Define a normalization transform n f using the algorithm A applied to D f alone.
Fit one model f θ * using B(θ * ) applied to the transformed data set n f (D f ).
Estimate the performance of f θ * using the transformed test set n f (D t ). end loop return Performance estimates for each random split.
In the emil framework, A is referred to as pre-processing method and B as modeling procedure.

Data
By default a data set is represented in package emil as a matrix or data frame x with observations as rows and features as columns, accompanied by a response vector y. There are no restrictions on the types of data x and y may contain as long as appropriate modeling functions are used to handle them. The most common case is that y is either a numeric implying regression, a factor implying classification, or time-to-event vector of class 'Surv' implying survival analysis, but any type of response could, in principle, be modeled. To improve compatibility with other packages, data can alternatively be supplied as x holding the entire data set and y holding a formula or character scalar marking which feature to be used as response.
The data set of the main example, prostate, was first presented by Stamey et al. (1989) and is available in the R package ElemStatLearn (Halvorsen 2015). It contains the level of prostate specific antigen (PSA) in 97 men prior to prostate cancer treatment by radical prostatectomy, together with 8 clinical features that the model building is to be based upon.

Resampling
Resampling schemes are used to define how the data set is to be split and used for model fitting and testing. Such schemes are created with the function resample that takes a method and the response vector y. This vector y is needed to allow the resampling method to generate folds that are balanced with respect to the different classes (i.e., preserving their relative frequencies).
R> cv <-resample(method = "crossvalidation", y = prostate$lpsa, + nrepeat = 2, nfold = 3) R> head(cv) rep1fold1 rep1fold2 rep1fold3 rep2fold1 rep2fold2 rep2fold3  1  FALSE  TRUE  TRUE  TRUE  TRUE  FALSE  2  TRUE  FALSE  TRUE  TRUE  FALSE  TRUE  3  TRUE  TRUE  FALSE  FALSE  TRUE  TRUE  4  TRUE  TRUE  FALSE  FALSE  TRUE  TRUE  5  TRUE  TRUE  FALSE  TRUE  FALSE  TRUE  6  TRUE  FALSE  TRUE  FALSE  TRUE  TRUE Each column in the output above represents one fold, describing which observations are to be used for training (TRUE) and testing (FALSE) for a particular split of the data set. Internally, resample calls the function resample_crossvalidation to generate the scheme and adds a couple of additional attributes to it (not visible in the code above) to allow inner resampling schemes to be generated from it (see ?subresample). Inner resampling schemes are needed for meta-parameter tuning in each of the folds in the outer loop.
The user may implement and use a custom resampling algorithm by creating a function resample_my_method and then call it with resample(method = "my_method", y) (see ?resample for details).

Splitting and pre-processing
Given a fold defining how the data is to be split the user can now proceed to carry out data pre-processing. This is typically done using a sequence of functions that represent the various actions and transformations the user wants to apply to the data set. The sequence is started with pre_split that sifts out the relevant data subsets and passes them on in a list that can be modified by the subsequent functions. The code below first splits the data, then scales all features to have mean 0 and unit standard deviation to allow for fair comparison between them, and finally converts x to a matrix rather than a data frame since the LASSO method used in the main example requires that form: Pre-processed data set prostate[1:8] of 8 features. 64 observations for model fitting, 33 observations for model evaluation.
Equivalently, the pipe notation %>% introduced in the magrittr package (Milton Bache and Wickham 2014) can be used to express the sequence of pre-processing functions as a single chained command, which is less bulky and arguably easier to read (see Table 1  Package emil contains a number of pre-processing functions ready to be combined and new ones can easily be written by the user (see ?pre_process). The pre_pca function performing principal component analysis (PCA, Hastie et al. 2001, Chapter 14.5) constitutes a concise example of how such a function should be defined: In such well structured code it is easy to spot and correct any source of information leakage between the fitting and testing data. The pre-processing routine can be passed to evaluate using a wrapper function, as in the main example, or as a list of functions as shown here: R> result <-evaluate(procedure = "lasso", x = prostate[1:8], + y = prostate$lpsa, resample = cv, + pre_process = list(pre_split, pre_scale, + function(data) pre_convert(data, x_fun = as.matrix))) It is worth highlighting the fact that the data is only extracted and pre-processed right before model fitting in order to avoid producing any unnecessary memory copies of the data (or subsets of the data). This means that no outer datasets (D f and D t in Algorithm 1) are created until the parameter tuning is completed, producing at most two copies of the data set simultaneously in memory.

Model fitting and testing
When evaluating a procedure using the evaluate function, models are automatically fitted and tested according to the resampling scheme (Section 2.2). This means a collection of models will be produced and tested, one per fold in the scheme. The user can however also create and test individual models in the following way, where rmse denotes the calculation of the root mean squared error: R> model <-fit(procedure = "lasso", x = prostate_split$fit$x, + y = prostate_split$fit$y) R> prediction <-predict(object = model, x = prostate_split$test$x) R> rmse(prostate_split$test$y, prediction) [1] 0.737 The prediction object is a list containing various quantities the user might be interested in depending on the modeling method of choice, such as predicted class labels, estimated class probabilities, or prediction intervals. Functions such as error_rate or roc_curve can then be used to summarize the performance (see ?error_fun). Predictions from evaluate can be tabulated with the function get_prediction.
R> head(get_prediction(result, resample = cv, format = "wide")) id rep1fold1 rep1fold2 rep1fold3 rep2fold1 rep2fold2  The argument procedure to the function fit is the key to choosing or customizing modeling algorithms. It takes an object of class 'modeling_procedure' created with the function modeling_procedure or alternatively any object that can be coerced to that class, such as the character "lasso" used in the main example. Information on which methods are available in package emil and other installed compatible packages can be found in the manual page ?list_method.
R> rf <-modeling_procedure(method = "randomForest", + parameter = list(mtry = c(1, 3, 6))) R> rf randomForest modeling procedure. It also enables the user to easily change plug-in functions (see ?extension) and facilitates debugging with the standard facilities of R. The example below invokes the debugger when the function used for fitting is called for the first time. From the progress log, printed by setting .verbose = TRUE (see Section 2.8 for more details), the user can tell that it happened in the inner loop of the two-layered resampling procedure (Algorithm 1), the parameter values and data to be used to fit this particular model can be inspected, and the code can be stepped through line by line while looking for problems.

Model interpretation
Package emil contains a number of functions for extracting and summarizing results produced during the modeling process. Apart from the functions get_performance, get_prediction, and get_tuning introduced earlier there is also a function get_importance for extracting feature importance estimates.
R> lasso <-modeling_procedure("lasso") R> model <-fit(procedure = lasso, x = prostate_split$fit$x, + y = prostate_split$fit$y) R> get_importance(model) How such estimates are calculated depends on the modeling procedure used. In the case of LASSO above the importance estimates are simply the fitted model coefficients w θ , mentioned in Section 1.1. To see exactly how they were obtained the user can view the code of the procedure specific importance function by entering the following command (output not shown due to its length).

R> lasso$importance_fun
The reason get_importance should be called on a fitted model rather that the procedure itself is that the final model parameters are typically needed to estimate the features' importance. The model also contains a copy of the procedure used to create it, allowing it to find the correct extraction function.

Downstream analysis
The functions presented in Sections 2.2-2.5 constitute the base of the emil framework. Figure 1 provides a schematic overview of them (blue and green) together with the types of input and output they require and produce (yellow). For most applications the objects can be thought of as belonging to one of three groups: those related to the data, those related to the modeling, and those that are end results. Data and result objects are generally represented as data frames that can directly be passed to high level data manipulation and plotting tools, such as dplyr (Wickham and François 2017), tidyr (Wickham 2014), ggplot2 (Wickham 2009), or lattice (Sarkar 2008). The modeling objects are all lists to allow for any type of contents that the user might want to create during the modeling, including elements that have not been foreseen by the authors of package emil. Internally, the modeling functions reuse the components of Figure 1 (see Figure 2) giving all modeling objects a consistent structure. Figure 1: A schematic overview of the most common functions (blue) and object types (yellow) that the user works with in the emil package. The green rounded squares mark functions for which there are several alternatives from which the user must choose. Solid incoming arrows to a node mark required data objects for producing the output marked by the single outgoing arrow and dashed incoming arrows mark optional inputs, e.g., fit requires data and a modeling procedure to create a model. As another example, performance estimates of a procedure can be obtained by calling the function get_performance on an object of class modeling_result created by evaluate. evaluate in turn requires data, a resampling scheme, and a modeling procedure.
This allows partial results to be easily retrieved and condensed, typically using the functions subtree or select, which is an emil specific extension to the dplyr function select.
For example, objects of class 'modeling_result' are nested lists where each top-level-element corresponds to a fold in the resampling scheme. All top-level-elements contain the same types of information (a model, predictions, a performance estimate, etc.) and these are of the same form as returned by fit or predict etc. These are referred to as second-level-elements. The following commands can be used to extract only the performance estimates. The inner structure of the functions fit, tune, and evaluate. The same components that the user has access to are reused internally, making the framework produce results and logs that are consistent in structure and easy to process and follow. The fit function creates predictive models by first tuning any untuned meta-parameters contained in the modeling procedure and then calling its fitting function. The tune function selects meta-parameter values by splitting a modeling procedure into separate procedures with fixed meta-parameter values, evaluating all of them and selecting the one that performs the best. The evaluate function asserts how well the complete computational procedure including pre-processing and modeling does. evaluate performs the work outlined in its box once per fold of the resampling scheme.
The first unnamed argument 1 to subtree above specifies which top-level-elements of result that should be selected (TRUE meaning all folds), and the second unnamed argument specifies 1 The second and third arguments are unnamed since they lack an assignment operator (=) and a left-handside name. Unnamed arguments are assigned to the appropriate variable inside the function based on their position in the call. Due to the recursive structure of subtree and use of the dots argument "..." the indexing arguments to subtree are typically not named.
that what second-level-elements are to be selected (only elements named "error").
The select function works in a similar way but always produces a data frame 2 , which can be directly parsed by the high level data manipulation tools mentioned above. In cases where the conversion from list element to data frame is not obvious a function returning a data frame can be used instead of an indexing vector or value. Let us return to the main example for a demonstration of this functionality. The modeling procedure internally used the glmnet package (Friedman, Hastie, and Tibshirani 2010) for fitting LASSO models, but rather than using the meta-parameter tuning framework of package emil to tune its complexity parameter λ, the fitting function calls a native tuning function provided in the glmnet package (for improved computational efficiency). Because of this, the function get_tuning will not be able to extract the tuning statistics of λ, but this can easily be done using select.
The result functions introduced earlier, such as get_performance, can also be used with select (see the example in the end of Section 2.7 for a demonstration).

Procedure comparison
Comparing different modeling procedures on the same problem is a common task in predictive modeling. Since the resampling scheme and pre-processing steps must be the same for all modeling procedures to allow for a fair comparison a lot of time may be saved by not repeating it. All modeling functions therefore allow the user to supply multiple modeling procedures, in which case the same splitting and pre-processing chain is used for all.
If the user wishes to compare LASSO regression to ridge regression on the task set up in the main example, the only change needed in the code is to supply multiple procedures in a vector or list.

Scalability
The emil package offers several tools that can help the user perform time-consuming and resource intensive computations: parallelization, checkpointing and an advanced progress logging system. Multicore parallelization is controlled using the argument .cores to evaluate, specifying how many cores to use. In the current version of package emil, cluster parallelization must be set up manually as demonstrated below. Although the setup is quite simple, we thought it best to leave it to the user since cluster interfaces may differ between platforms, but more assistance to the user may be added in the future. It is also worth mentioning that cluster and multicore parallelization may be performed together, which is useful when analyzing large datasets with multiple computers (the code below uses 4 computers with 8 cores each).
Finally, logging can be enabled by setting the argument .verbose = TRUE to fit, tune, or evaluate. The logging system uses the emil functions log_message and indent which can also be utilized by the user to create nicely formatted progress reports in time consuming custom written fitting functions.

Distribution of user created methods
Custom methods for generating resampling schemes, pre-processing, fitting models, estimating performance etc. can be written and distributed in separate packages or plain R script files. The internal documentation page ?extension provides details on how the functions should be written to be fully compatible with the framework. Once a package extending package emil is installed and loaded its contained methods will automatically be detected by package emil.

Application examples
In this section three applications are presented, using the emil package with custom preprocessing and fitting functions to show its flexibility. Section 3.1 shows an implementation of random forest with parallelization at a different level than what package emil provides by default. Section 3.2 demonstrates a method to combine PCA and proportional hazards regression (referred to as Cox regression below, Cox 1972) to solve a high dimensional survival analysis task. Section 3.3 demonstrates a method to create an ensemble classifier on the fly by combining already implemented classification methods.

An alternative parallelization routine
How to perform a modeling task in the most resource efficient way is not the same across all modeling algorithms and applications. Ensemble methods such as random forest provide a nice illustration of this dilemma through the fact that parallelization may be performed on different levels, which leads to different memory requirements and computation times.
Consider the following two ways to parallelize the emil framework on a high performance machine with 16 CPU cores. The aim is to evaluate a procedure in which random forest classifiers containing 8000 decision trees are trained and used for classification by 8-fold cross-validation repeated 4 times (32 folds in total). The default implementation executes the 32 folds in parallel, distributed on the 16 cores, where each core fits all 8000 trees of 2 forests (Figure 4 top). The alternative implementation executes the 32 folds sequentially, but distributes the trees of each fold so that each core fits 500 trees of 32 forests (Figure 4 bottom).
user system elapsed 5878.389 0.170 5874.807 Next, the standard parallel implementation is set up and tested: R> system.time(result_par1 <-evaluate(procedure = proc, x = x, y = y, + resample = cv, .cores = 16)) By spawning multiple R processes elapsed time is greatly reduced at the expense of a slight increase in user and system times. Figure 4: Two alternatives for parallelizing the same analysis presented in Section 3.1. Serial steps are orange and parallel steps are yellow. The task involves fitting and testing 32 random forest models containing 8000 decision trees each (one forest per fold of a 4 × 8 CV scheme). Top: The first implementation fits all 32 models in parallel, and fits the 8000 trees of each sequentially. Bottom: The second implementation trains the models sequentially, but splits up the 8000 trees on the CPU cores, producing more but smaller subtasks.
user system elapsed 6618.851 33.675 617.991 On the other hand, the alternative implementation required much less memory since all sub-processes can work on the same pair of training and test sets, whereas the standard parallel implementation needs to keep 16 unique pairs in memory simultaneously. When working with large data sets this can be a trade-off well worth making.

High dimensional survival modeling
The next example is based on a publicly available gene expression data set with breast cancer tumor samples (Miller et al. 2005). The following analysis was not used in the original paper, but was invented ad hoc to demonstrate the emil framework. The data set contained expression levels of 18822 genes measured with 44928 probes across 251 samples. The response modeled was the time to relapse. Cox regression is commonly used for survival analysis problems, but since this data set has far more features than observations, and Cox regression lacks feature selection or regularization, the Cox regression algorithm will not be able to find a unique solution. To remedy this, PCA was used to pre-process the data set to compress it to a manageable number of features. However, since not all patients received the same treatment an additional treatment feature must also be included for stratification, but should of course not be part of the PCA.
First the data set is loaded and prepared for the analysis. The package breastCancerUPP (Schroeder, Haibe-Kains, Culhane, Sotiriou, Bontempi, and Quackenbush 2011) is used to load the data set upp, which is attached to the workspace by calling data("upp", package = "breastCancerUPP"). The same object contains both experimental and clinical/phenotypic data, which are accessed using the Biobase functions exprs and pData.

Implementation of a custom ensemble classifier
This application example serves to illustrate the reusability of emil's components by implementing a custom ensemble classification method. The fitting function is defined taking both a data set and a list of modeling procedures to fit models from. Using bootstrapping (Efron 1979), different random subsets of the training data are defined for each procedure, and each pair of data subset (fold) and procedure is analyzed together using the Map function. The try statement below is needed to safe-guard against failing model training, which might occasionally occur depending on what procedures are given to the ensemble.
The final step is to return the class with the most votes for each observation (the element prediction in the returned list). The voting statistics are also returned in case the user would be interested.
R> ensemble <-modeling_procedure(method = "ensemble", + parameter = list(procedure = list(rep(c("lda", "qda", "rpart"), + each = 100)))) The performances of the ensemble classifiers built are then estimated using cross-validation on a data set containing sonar frequency profiles (60 features) of 208 objects that were either rocks or metal cylinders, first published by Gorman and Sejnowski (1988). The data set was later published in the UCI Machine Learning Repository (Newman, Hettich, Blake, and Merz 1998) and on CRAN in the package mlbench (Leisch and Dimitriadou 2010). The performance of the methods incorporated into the ensemble was finally estimated and compared to that of the ensemble ( Figure 5).

Comparison to similar software
There are a number of tools available for automating predictive modeling, from graphical workbenches such as SAS Enterprise Miner (SAS Institute Inc. 2013), the Konstanz Information Miner (KNIME, Berthold et al. 2008) and RapidMiner (RapidMiner Inc. 2013) to packages for different programming languages such as caret (Kuhn 2008;Kuhn et al. 2018) for R, scikit-learn (Pedregosa et al. 2011) for Python (Van Rossum et al. 2011), andPRTools (Duin et al. 2007) for MATLAB (The MathWorks Inc. 2017). Packages emil and caret partially overlap in functionality, and have actually been developed independently in parallel for a few years, but both also have some unique features. This section serves to highlight the differences between them.

Performance evaluation of complete procedures
The most notable difference between packages emil and caret is that package emil focuses on evaluating whole computational procedures whereas package caret is only focused on training and tuning them. During training package caret does use resampling based performance evaluation to tune meta-parameters (the inner loop of Algorithm 1), but the package does not contain any function for unbiased performance estimation of the entire procedure. It might be tempting to use the performance estimates obtained during tuning as a substitute, but since the tuning procedure always selects the best performing parameter values there is a risk that the corresponding performance estimate is positively biased (Varma and Simon 2006;Lawless and Yuan 2010). Users of the caret package may of course implement the outer loop of Algorithm 1 themselves to obtain more reliable performance estimates, but this will lead to suboptimal memory usage given the way package caret is implemented.
The memory efficiency and computation time of packages emil (version 2.2.8) and caret (version 6.0-79) was studied on five tasks involving commonly used predictive modeling techniques. Each task was performed in five replicates and monitored using the Syrupy tool (Sukumaran 2015) that repeatedly polled the Unix function ps once per second for the given R processes. Figure 6 shows the computation time and average memory usage over all the replicates. The "ALL" tasks used the data set presented in Section 3.2 with the goal of predicting the samples' estrogen-receptor status from their gene expression profiles, while the "Sonar" task used the data presented in Section 3.3. Task specific details were as follows: • The "ALL-kNN-forest" task consisted of performing kNN-imputation in combination with random forest classification (Breiman 2001) with a pre-selected meta-parameter value, including evaluation by 5-fold cross-validation repeated 3 times.
• The "ALL-PAM" task consisted of performing nearest shrunken centroids classification (Tibshirani, Hastie, Narasimhan, and Chu 2002, NSC) Figure 6: Comparison of memory usage and computation time between packages emil and caret on the 5 benchmarking tasks described in Section 4.1. All curves are averages over five replicates performed sequentially on a high performance machine. Each replicate was run in a separate R process. Garbage collection was performed prior to the fitting of each model and a 3 second waiting period was added to the end of each job to ensure the memory profiler had time to get a final reading. The gray horizontal lines mark the memory requirement of only loading the data set. The small vertical tick marks on the x-axis mark the completion time of individual replicates. The y-axis shows the amount of non-swapped RAM used by the R processes, known as resident set size (RSS).
Tibshirani, Narasimhan, and Chu 2014), including tuning the meta-parameter threshold and evaluation by 5-fold cross-validation repeated 3 times.
• The "ALL-PCA-tree" task consisted of performing PCA pre-processing, selecting the first 20 components, and training decision trees (Breiman, Friedman, Olshen, and Stone 1984) using the rpart package (Therneau et al. 2015), including tuning the meta-parameter maxdepth and evaluation by 3-fold cross-validation repeated twice.
• The "ALL-SVM" task consisted of performing support vector machine (SVM) classification with polynomial kernels using the kernlab package (Karatzoglou, Smola, Hornik, and Zeileis 2004), including tuning the meta-parameter degree and evaluation by 3-fold cross-validation repeated twice.
• The "Sonar-LogitBoost" task consisted of training a single logit boost model (Dettling and Bühlmann 2003) using the caTools package (Tuszynski 2014), including tuning of its meta-parameter nIter and evaluation by external testing on a single test set. This task was taken from package caret's online manual (Kuhn 2015).
Package emil required less memory than package caret on all tasks except "ALL-PAM", mainly because package emil avoids creating some unnecessary in-memory copies of the data set.
When analyzing large data sets (on the order of several gigabytes) it is crucially important to avoid unnecessary copies of the data set, as it may considerably slow down the analysis or sometimes inhibit it entirely. The issue is less serious on the relatively small data sets analyzed in the benchmarking study but the existence of the effect is still apparent, especially when using pre-processing steps such as kNN imputation or PCA. Package emil's core functions fit, tune, and evaluate are designed to never modify the data set in any way, but to leave that to the user-defined pre-processing functions. Since the pre-processing functions operate in a sequential manner on the training and test sets exclusively package emil always leaves the original copy of the data set untouched. This holds true even if a method incorporated into the emil framework requires data in a non-standard form, such as the pamr method for NSC classification. Data should be supplied to the pamr.train function as a list of two elements x and y where x is a matrix with observations as rows and y is a matching response vector. This can easily be accomplished by using a custom pre-processing function 3 .
Some of the advantages of both packages can be combined through calling the emil function evaluate with a modeling procedure with method = "caret". The fitting function will then pass on the data and parameters to the caret function train for tuning and fitting. This may add some additional memory overhead compared to running package emil on its own, but provides a quick way of adding the outer loop of Algorithm 1 to any existing code written for package caret. The example below shows how to define a modeling procedure using package caret to train random forest models 4 .

Other functionality
In addition to the differences highlighted in the previous section, the emil and caret packages both provide some functionality with superior implementation or not present in the other package. The advantages of package emil include a more flexible data pre-processing system (Section 2.3), check-pointing and a more developed progress logging system (Section 2.8), and learning curve analyzes (see ?learning_curve, Duda, Hart, and Stork 2000). The advantages of package caret include a larger number of methods ready to be used by the user, developed parallelization capabilities, and adaptive resampling, which is a technique for reducing the meta-parameter search space as the resampling based testing progresses (Maron and Moore 1997; Shen, Welch, and Hughes-Oliver 2011).

Concluding remarks
When using machine learning-based tools for prediction modeling problems, rigorous performance evaluation is critical in order to obtained unbiased information about the true performance of the prediction models built (Varma and Simon 2006). Resampling based methods such as cross-validation and repeated holdout are useful tools for this but can be misleading unless the test sets by these procedures are excluded from the entire modeling procedure and only used for the final performance evaluation step. The emil package strictly adheres to this principle and provides a versatile toolbox for implementing and evaluating customized modeling procedures for predictive modeling problems in a statistically rigorous way.
Since the applications of predictive modeling are incredibly diverse, the emil framework is designed to be as general and flexible as possible while at the same time staying transparent and light weight. In this paper and the accompanying benchmarking study we have shown several examples of how the components can be combined and reused in different settings to implement alternative routines for pre-processing, parallelization, and parameter tuning, as well as creating custom ensemble classifiers while at the same time staying computationally efficient. This is possible thanks to the API like design of the framework.
While great care has been taken to make the emil package as resource efficient as possible, it should be noted that the memory requirement and computation time of a modeling procedure is heavily dependent on which methods are incorporated into it. The gains by using the emil package will differ from case to case, but a substantial reduction in development time can nevertheless be achieved by not having to re-implement the general analysis framework.
Potential drawbacks of the emil framework is that since the user is empowered to re-implement many features it might require a higher level of programming proficiency than similar packages, for example when setting up cluster parallelization. There are also points where memory efficiency could be improved further, such as delaying the pre-processing of the test sets until the training is completed, but since the sum of the training and test sets never exceeds the size of the entire data set (unless oversampling is used) this would only provide a minor improvement.