rmcfs : An R Package for Monte Carlo Feature Selection and Interdependency Discovery

We describe the R package rmcfs that implements an algorithm for ranking features from high-dimensional data according to their importance for a given supervised classification task. The ranking is performed prior to addressing the classification task per se. The package is a new and extended version of the MCFS (Monte Carlo feature selection) algorithm, an early version of which was published in 2005. It provides an easy R interface, a set of tools to review results, and the new ID (interdependency discovery) component. The algorithm can be used on continuous and/or categorical features (e.g., gene expression and phenotypic data) to produce an objective ranking of features with a statistically well-defined cutoff between informative and non-informative ones. Moreover, a directed ID graph that presents interdependencies between informative features is provided.


Introduction
In the area of feature ranking and selection for high-dimensional supervised classification, very significant progress has been achieved in the past two decades. For a brief account up to 2002, see Dudoit and Fridlyand (2003); for an extensive survey and somewhat later developments, see Saeys, Inza, and Larrañaga (2007). Without going into details, let us note that feature selection can be wrapped around the classifier construction or directly built (embedded) into the classifier construction, or it can be performed prior to addressing the classification task per se by first filtering out noisy features and keeping only informative ones for building a classifier. An early and successful method with embedded feature selection, not mentioned by Saeys et al. (2007), was developed by Tibshirani and others (see Tibshirani, Hastie, Narasimhan, and Chu 2002, 2003).
Our approach to discovering interdependencies builds on identifying features which "cooperate" in determining that some samples belong to one class, other samples to another class, still others to still a further class and so on. It is worthwhile to emphasize that this is completely different from the usual approach, which aims at finding features that are similar in some sense. Instead, it can be said that our way of discovering interdependencies between features amounts to determining multidimensional dependency between the classes and sequences of features. In this sense, we are in fact interested in contextual (or predictive) interdependencies between features, since the dependency in question requires the context of class information.
Let us emphasize that we do not aim at classification. While our approach relies heavily on classifiers, we do not use them for the classification task per se. Indeed, we use classifiers only to: (i) rank features according to their discriminative power, i.e., their ability to distinguish between classes; (ii) discover interdependencies between features. Given the top features found in step (i), one can later use them for classification with any classifier, but this is neither required nor of our interest. And clearly, step (ii) is aimed at something vastly different from merely solving the classification task.
The procedure is particularly well suited to dealing with high-dimensional data including "small n large p problems", i.e., those with a small number of objects (records, samples) versus several orders of magnitude greater number of features for each object.
The procedure from Dramiński et al. (2008) and Dramiński et al. (2016) for Monte Carlo feature selection and interdependency discovery is briefly recapitulated in Section 2. In Section 3, an overview of the rmcfs package (Dramiński and Koronacki 2018) and its main R functions is provided. In Section 4 we demonstrate the use of the rmcfs package by running the MCFS-ID algorithm and building the ID graph for simulated data. We close with concluding remarks in Section 5.

Feature selection - the MCFS part
We begin with a brief recapitulation of our MCFS; see Dramiński et al. (2008), which can be consulted for details as well as rationale and statistical validation of our approach to feature selection.
We consider a particular feature to be important, or informative, if it is likely to take part in the process of classifying samples into classes "more often than not". This "readiness" of a feature to take part in the classification process, termed relative importance of a feature, is measured via intensive use of classification trees. In the main step of the procedure, we estimate relative importance of features by constructing thousands of trees for randomly selected subsets of features.
More precisely, out of all d features, s subsets of m features are randomly selected, m being fixed and m ≪ d, and for each subset of features, t trees are constructed and their performance is assessed (one can easily see that the procedure is essentially the same as that of the random subspace method, a fact the authors were not aware of at the time they wrote their proposal; cf. Ho 1998).
Each of the t trees in the inner loop is trained and evaluated on different, randomly selected training and test sets that come from a split of the full set of training data into two subsets: each time, out of all n samples, 2/3 of the samples are drawn at random for training in such a way as to preserve proportions of classes from the full set of training data, and the remaining samples are used for testing. See Figure 1 for a block diagram of the procedure.
The relative importance of feature $g_k$, $RI_{g_k}$, is defined as

$$ RI_{g_k} = \sum_{\tau=1}^{s \cdot t} (wAcc_\tau)^u \sum_{n_{g_k}(\tau)} GR\big(n_{g_k}(\tau)\big) \left( \frac{\text{no. in } n_{g_k}(\tau)}{\text{no. in } \tau} \right)^v , \qquad (1) $$

where summation is over all s · t trees and, within each τth tree, over all nodes n_{g_k}(τ) of that tree at which the split is made on feature g_k; wAcc_τ stands for the weighted accuracy of the τth tree, GR(n_{g_k}(τ)) stands for the gain ratio for node n_{g_k}(τ), (no. in n_{g_k}(τ)) denotes the number of samples in node n_{g_k}(τ), (no. in τ) denotes the number of samples in the root of the τth tree, and u and v are fixed positive reals (now set to 1 by default; cf. Dramiński et al. 2010). The normalizing factor (no. in τ), which has the same value for all τ, has been included mainly for computational reasons.
With u and v set to 1, there are three parameters, m, s and t, to be set by an experimenter. Note that, overall, s · t trees are constructed and evaluated in the main step of the procedure. Both s and t should be sufficiently large, so that each feature has a chance to appear in many different subsets of features and randomness due to inherent variability in the data is properly accounted for. The choice of the subset size m of features selected for each series of t experiments should take into account the trade-off between the need to prevent informative features from being masked too severely by the relatively most important ones and the natural requirement that s be not too large. Indeed, the smaller m, the smaller the chance of masking the occurrence of a feature. However, a larger s is then needed, since all features should have a high chance of being selected into many subsets of the features. For classification problems of dimension d ranging from several thousand to hundreds of thousands, we have found that taking m equal to a few hundred (say, m = 300-500) and t equal to at most 20 (even t = 5 usually suffices) is a good choice in terms of reliability and overall computational cost of the procedure. Finally, for a given m, s can be made a running parameter of the procedure, and the procedure executed for increasing s until the rankings of the top scoring p% of features prove (almost) the same for successive values of s.
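As a toy illustration of Equation 1, RI can be computed from per-node split records collected over the s · t trees. The data structure below is hypothetical (it is not the package's internal representation); it merely shows the arithmetic of the formula with u = v = 1.

```r
# Sketch: relative importance of one feature from the records of all nodes
# at which a split was made on that feature, mirroring Equation 1.
relative_importance <- function(nodes, u = 1, v = 1) {
  sum(nodes$wAcc^u * nodes$GR * (nodes$n_node / nodes$n_root)^v)
}

# Hypothetical records for a feature appearing in three nodes across two trees:
nodes <- data.frame(
  wAcc   = c(0.9, 0.9, 0.8),  # weighted accuracy of the tree containing the node
  GR     = c(0.5, 0.2, 0.4),  # gain ratio of the split on the feature
  n_node = c(70, 30, 50),     # samples reaching the node
  n_root = c(70, 70, 70)      # samples in the tree's root
)
relative_importance(nodes)
```

The ratio n_node / n_root down-weights splits made deep in a tree, where few samples remain.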

Determining the cutoff value
The above procedure provides one with a ranking of features. However, this ranking as such does not enable one to discern between informative and non-informative features. A cutoff between these two types of features is needed, and we propose to determine it by one of five different methods (one of them still under development). Note that finding the cutoff point amounts to separating features with large enough RI values from the rest.
Available methods:
• Critical angle: This method is based on the plot of the features' RIs in decreasing order of size, with the corresponding features equally spaced along the abscissa. The plot can be seen as a piecewise linear function, where each linear segment joins two neighboring RIs. Roughly speaking, the cutoff (placed on the abscissa) corresponds to the point on the plot where the slope of consecutive segments changes significantly and lastingly from large to small.
• k-means: The method is based on clustering the RI values into two clusters by the k-means algorithm. It sets the cutoff where the two clusters are separated. This method is quite valuable when the data contain a subset of very informative features.
• Permutations (max RI): The method consists of permuting the decision attribute at least 20 times and running the MCFS algorithm for each permutation. The set of the maximal RIs from all these experiments is assumed approximately normally distributed, and a critical value based on the one-sided (upper-tailed) Student's t test (at the 95% significance level) is provided. A feature is declared informative if its RI in the original ranking (without any permutation) exceeds the obtained critical value. A more detailed description of this method is included in Dramiński et al. (2010).
• Permutations (z score): The method consists of permuting the decision attribute at least 30 times and running the MCFS algorithm for each permutation. For each feature, a separate distribution of the obtained RI values is constructed. For each feature, its RI in the original ranking (without any permutation) is compared against the corresponding distribution of RIs from the experiments with permuted decision attribute (the z test is used this time). This method has been described and studied in Bornelöv and Komorowski (2016). It gives similar results to the previous one but needs more MCFS runs. Along with that, however, it gives a separate p value for each feature.
• Contrast attributes: The method consists of permuting all d features to obtain so-called contrast features, including them in the dataset and running the MCFS. Given the RIs of the contrast features, a statistical test can be used to find a cutoff between the informative and non-informative original features. This method is still under development and is currently turned off.
Given our experience and its sound statistical foundation, we recommend the permutations (max RI) method most. However, it may sometimes prove very conservative in the sense that it may yield a highly limited set of informative features. On the other hand, if there are no meaningful features, the method will discover this fact and the set of informative features will prove empty. Finally, one has to admit that using the permutations method requires much caution: properly, the significance level should be chosen adaptively, depending on the data under consideration. Instead of adding this kind of complexity to the way the cutoff is determined, we set (by default) the final cutoff value to be the mean of all the obtained cutoffs.
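The arithmetic behind the permutations (max RI) method can be sketched as follows. The values below are hypothetical, and the exact construction of the critical value in the package may differ in detail; the sketch only illustrates a one-sided t-based bound on the maxima obtained under permuted decision attributes.

```r
# Maximal RI observed in each MCFS run on data with a permuted decision
# attribute (hypothetical values; at least 20 runs are recommended in practice).
max_ri <- c(0.081, 0.095, 0.088, 0.092, 0.085)
k <- length(max_ri)

# One-sided (upper-tailed) 95% critical value under an approximate
# normality assumption for the maxima.
crit <- mean(max_ri) + qt(0.95, df = k - 1) * sd(max_ri) / sqrt(k)

# Features whose RI in the original (unpermuted) ranking exceeds `crit`
# would be declared informative.
crit
```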

Interdependency discovery - the ID part
Once features are ranked by the MCFS procedure, a natural issue to be raised concerns possible interdependencies among the informative features. In this subsection, we briefly describe a way to present such interdependencies in the form of a directed graph.
The interdependencies among features are often modeled using interactions, similarly as in experimental design and analysis of variance. Perhaps the most widely used approach to recognizing interdependencies is finding correlations between features or finding groups of features that behave in some sense similarly across samples. A classical bioinformatics example of this problem is finding co-regulated features, most often genes or, more precisely, their expression profiles. Searching for groups of similar features is usually done with the help of various clustering techniques, frequently tailored to the task at hand. See Smyth, Yang, and Speed (2003), Hastie, Tibshirani, Botstein, and Brown (2001), Saeys et al. (2007), Gyenesei, Wagner, Barkow-Oesterreicher, Stolte, and Schlapbach (2007) and the literature therein. Our approach to interdependency discovery is significantly different in that we focus on identifying features that "cooperate" in determining that a sample belongs to a particular class. The initial idea was presented in Dramiński et al. (2010), but the current version provides directed ID graphs and is based on an improved interdependency measure (cf. Dramiński et al. 2016).
Our ID graph is based on aggregating information provided by all the s · t trees (see Figure 1). However, let us begin by noting that, for a single classification tree, a set of decision rules defining each class is provided by this tree's paths. Each decision rule is produced as a conjunction of conditions imposed on the particular features which in fact point to conditional interdependencies between the features. Our trust in the decision rules that are learned by any single rule-based classifier, and thus in the discovered interdependencies, is naturally limited. Moreover, the classifier is trained on just one training set and therefore our conclusions are necessarily dependent on the classifier and are conditional upon the training set. In the case of classification trees, the problem is aggravated by their high variance, i.e., their tendency to provide varying results even for slightly different training sets. Accordingly, in order to overcome these problems and provide more objective results, the MCFS-ID procedure rests on building thousands of classification trees.
To see how an ID graph is built, let us recall that each node in each of the multitude of classification trees represents a feature on which a split is made. Now, for each node in each classification tree, all its antecedent nodes along the path to which the node belongs can be taken into account; note that each node in a tree has only one parent, and thus for a given node we simply consider its parent, then the parent of the parent, and so on. In practice, the maximum possible depth of such analysis, i.e., the number of antecedents considered (if available before the tree's root is attained), is set to some predetermined value, which is the procedure's parameter (its default value being 5). For each pair [antecedent node → given node] we add one directed edge to our ID graph, from the antecedent node to the given node. Let us emphasize again that a node is equated with the feature it represents, and thus any directed edge found is in fact a directed edge joining two uniquely determined features. The edges are found along the paths in all the s · t MCFS-ID trees. Clearly, the same edge can appear more than once, even in a single tree.
The strength of the interdependence between two nodes (actually two features) connected by a directed edge, termed the ID weight of the edge, or ID weight for short, is equal to the gain ratio (GR) in the given node multiplied by the fraction of objects in the given node relative to the antecedent node. Thus, for node n_k(τ) in the τth tree, τ = 1, . . . , s · t, and its antecedent node n_i(τ), the ID weight of the directed edge from n_i(τ) to n_k(τ), denoted w[n_i(τ) → n_k(τ)], is equal to

$$ w[n_i(\tau) \rightarrow n_k(\tau)] = GR\big(n_k(\tau)\big) \, \frac{\text{no. in } n_k(\tau)}{\text{no. in } n_i(\tau)} , \qquad (2) $$

where GR(n_k(τ)) stands for the gain ratio for node n_k(τ), (no. in n_k(τ)) denotes the number of samples in node n_k(τ), and (no. in n_i(τ)) denotes the number of samples in node n_i(τ).
The final ID graph is based on the sums of all ID weights for each pair [antecedent node → given node]; i.e., for each directed edge found, its ID weights are summed over all occurrences of this edge in all paths of all MCFS classification trees. For a given edge, it is this sum of ID weights which becomes the ID weight of this edge in the final ID graph. The pseudo code in Algorithm 1 describes the calculation. T denotes the set of all s · t trees and D = {1, 2, . . . , depth} with depth being the predetermined number of antecedents considered.
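The aggregation described above can be sketched in a few lines. The per-occurrence edge records below are hypothetical (in the package, they are extracted from the s · t trees); the sketch only shows how repeated occurrences of a directed edge have their ID weights of Equation 2 summed into the final graph.

```r
# Each row: one occurrence of a directed [antecedent -> node] edge on a tree path.
edges <- data.frame(
  from   = c("A1", "A1", "B1"),  # antecedent feature
  to     = c("B1", "B1", "C1"),  # descendant feature
  GR     = c(0.40, 0.30, 0.25),  # gain ratio at the descendant node
  n_node = c(30, 20, 10),        # samples in the descendant node
  n_anc  = c(70, 60, 30)         # samples in the antecedent node
)

# Per-occurrence ID weight (Equation 2).
edges$w <- edges$GR * edges$n_node / edges$n_anc

# Final ID weight of each directed edge: sum over all its occurrences.
aggregate(w ~ from + to, data = edges, FUN = sum)
```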
Note that an edge n_i → n_k from node n_i to node n_k is directed, as is the (possibly found) reverse edge n_k → n_i. Interestingly, in most ID graphs we find that one of these two edges dominates, i.e., has a much larger ID weight than the other. Whenever this happens, it means not only that n_i and n_k form a sound partial decision (a part of a conjunction rule) but also that their succession in the directed rule is not random.
The ID graph is a way to present interdependencies that follow from all of the MCFS classification trees. Each path in a tree represents a decision rule and by analyzing all tree paths we in fact analyze decision rules to find the most frequently observed features that along with other features form good decision rules. The ID graph thus presents some patterns that frequently occur in thousands of classification trees built by the MCFS procedure.
In sum, an ID graph provides a general roadmap that not only shows all the most variable attributes that allow for efficient classification of the objects but, moreover, it points to possible interdependencies between the attributes and, in particular, to a hierarchy between pairs of attributes. High differentiation of the values of ID weights in the ID graph gives strong evidence that some interdependencies between some features are much stronger than others and that they create some patterns/paths calling for interpretation based on background knowledge.
Let us conclude this subsection with a short comparison of our approach to discovering interdependencies between features with that based on random forests (RFs; cf. Wright et al. 2016). Most importantly, we take full advantage of the Monte Carlo mechanism, which makes direct examination of possible interactions superfluous. Indeed, nothing like, e.g., pairwise permutation importance for pairs of features needs to be calculated. Moreover, due to the form of (2), in particular since the summands in the ID weights of pairs of features do not depend in any way on the prediction performance of the trees involved in the calculation of these summands, we can hope for no masking of interaction effects measured by ID weights by marginal effects (as yet, this last claim has not been confirmed by a thorough simulation study). And, last but not least, ours are directed interactions.
Clearly, only ID graphs for strong interdependencies (interactions) are of interest. It is readily seen that such graphs can equally easily be constructed for discovering two-way interactions when both of the two features involved, only one of them, or even neither is of large relative importance, i.e., gives a strong marginal effect (except for the XOR interaction of two features with no marginal effects).
It should be noticed, however, that the ID weights are calculated in a way which makes them incomparable with relative importances of individual features. To put it differently, both measures give us only rankings (from the greatest to the smallest) of, respectively, features with respect to their relative importance, and strengths of pairwise interactions, albeit without direct information on the predictive ability of any interaction.

The R package rmcfs
The R package rmcfs is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=rmcfs. It contains 13 user functions and one example dataset borrowed from Alizadeh et al. (2000). All of them are described as part of the standard R package documentation, and their description is also available through the R built-in help system. Here we focus only on the major rmcfs functionality covered by the following set of functions:
• mcfs() performs Monte Carlo feature selection and interdependency discovery (MCFS-ID) on a given dataset. It uses C4.5 trees (J48) as implemented in Weka 3-6-10; see Hall, Frank, Holmes, Pfahringer, Reutemann, and Witten (2009).
• plot S3 method for class 'mcfs' objects. It plots various aspects of the MCFS-ID result.
• print S3 method for class 'mcfs' objects. It prints basic information about the MCFS-ID results: top features, cutoff values, the confusion matrix obtained from all s · t trees, and the classification rules obtained by the RIPPER (JRip) algorithm.
• build.idgraph() constructs the ID graph based on the result returned by the mcfs function.
Function mcfs accepts the following main parameters:
• data: defines the input data.frame containing all features with the decision attribute included.
• projections: defines the number of subsets with randomly selected features. This parameter is usually set to a few thousand and is denoted in Equation 1 as s. By default it is set to "auto", and then the value is based on the size of the input dataset and the featureFreq parameter.
• projectionSize: defines the number of features in one subset. It can be defined by an absolute value (e.g., 100 denotes 100 randomly selected features) or by a fraction of the input attributes (e.g., 0.05 denotes 5% of input features). This parameter is denoted in Equation 1 as m. If it is set to "auto", then projectionSize equals √d, where d is the number of input features. The minimum number of input features in one subset is 1.
• featureFreq: determines how many times each input feature should be randomly selected when projections = "auto". By default each feature should be drawn 150 times.
• splits: defines the number of splits of each subset. This parameter is denoted in Equation 1 as t and by default set to 5.
• splitSetSize: determines whether to limit the input dataset size. It helps to speed up the computation for datasets with a large number of objects. If the parameter is larger than 1, it determines the number of objects that are drawn at random for each of the s · t decision trees. If splitSetSize = 0 then mcfs uses all objects in each iteration.
• balance: determines the way classes are balanced. It should be set to 2 or higher if the input dataset contains heavily unbalanced classes. Each training subset will then contain all objects from the least frequent class and a randomly selected set of objects from each of the remaining classes. This option helps to select features that are important for discovering a relatively rare class. The parameter defines the maximal imbalance ratio: if the ratio is set to 2, then a training subset will contain a number of objects from each class (but the least frequent one) proportional to the square root of the class size, size(c). If balance = 0, balancing is turned off. If balance = 1, it is turned on but does not change the class sizes. The default value is "auto".
• cutoffPermutations: determines the number of permutation runs. At least cutoffPermutations = 20 is needed for a statistically significant result. The minimum is 3; 0 turns off the permutation method.
• buildID: if TRUE, interdependency discovery is turned on and all ID graph edges are collected.
• finalRuleset: if TRUE, classification rules (by the RIPPER algorithm) are created on the basis of the final set of features.
• finalCV: if TRUE, cross validation (CV) experiments are run on the final set of features. The following set of classifiers is used: C4.5, NB, SVM, kNN, logistic regression and RIPPER.
• finalCVSetSize: limits the number of objects used in the final CV experiment. For each CV repetition, the objects are selected randomly from the uniform distribution.
• finalCVRepetitions: defines the number of repetitions of the CV experiment. The more repetitions, the more stable is the result.
• seed: seed for the random number generator in Java. By default the seed is random. Replication of the result is possible only if threadsNumber = 1.
• threadsNumber: number of threads to use in computation. More threads need more CPU cores, and memory usage is also somewhat higher. It is recommended to set this value equal to or less than the number of available CPU cores.
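Putting the main parameters together, a typical call might look like the sketch below. The alizadeh data.frame is the example dataset distributed with the package; the name of its decision attribute (class) and the exact argument values are assumptions for illustration, not recommendations.

```r
library(rmcfs)

# Illustrative call combining the parameters described above.
result <- mcfs(class ~ ., data = alizadeh,
               projections = 3000,       # s: number of random feature subsets
               projectionSize = 0.05,    # m: 5% of the input features per subset
               splits = 5,               # t: train/test splits per subset
               cutoffPermutations = 20,  # permutation runs for the cutoff
               buildID = TRUE,           # collect ID graph edges
               finalCV = TRUE,           # cross-validate the final feature set
               threadsNumber = 1)        # single thread for reproducibility
```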
Function mcfs produces an S3 object of class 'mcfs' that is a list of the following components:
• data: input data.frame limited to the set of top important features.
• RI: data.frame that contains all features with relevance score sorted from the most relevant to the least relevant. This is the ranking of features.
• ID: data.frame that contains features interdependencies as graph edges. It can be converted into a graph object using the build.idgraph function.
• distances: data.frame that contains convergence statistics of subsequent projections.
• cmatrix: confusion matrix obtained from all s · t decision trees.
• cutoff: data.frame that contains cutoff values obtained by the following methods: criticalAngle, kmeans, permutations (max RI) and the last one representing the mean value based on all cutoff values.
• cutoff_value: the number of features chosen as informative by the method defined by parameter cutoffMethod.
• cv_accuracy: data.frame that contains the classification results obtained by cross validation performed on cutoff_value features. This data.frame exists if finalCV = TRUE.
• jrip: classification rules (produced by the RIPPER algorithm) and related CV statistics obtained for cutoff_value features for this algorithm.
Admittedly, one can become overwhelmed by the sheer number of input parameters but in most applications the mcfs function can be used with the set of default/recommended parameters (as in Section 4).
The plot method for 'mcfs' objects accepts the following main parameters:
• type: determines the type of the plot:
-"ri": plots the top feature set with their RIs as well as the max RI obtained from the permutation experiments. Red color denotes important features.
-"id": plots top ID values obtained from the MCFS-ID.
-"distances": plots distances (convergence diagnostics of the algorithm) between subsequent feature rankings obtained during the MCFS-ID experiment.
-"features": plots the top feature set along with their RIs. It is a horizontal barplot that shows important features in red and non-important ones in gray.
-"cv": plots cross validation results based on the top features.
-"cmatrix": plots the confusion matrix obtained on all s · t trees.
-"heatmap": plots heatmap results based on top features. Only numeric features can be presented on the heatmap.
• size: number of features to plot.
• ri_permutations: if type = "ri" and ri_permutations = "max", then it additionally shows horizontal lines that correspond to max RI values obtained from each single permutation experiment.
• diff_bars: if type = "ri" or type = "id" and diff_bars = TRUE, then it shows the difference values for RI or ID values.
• features_margin: if type = "features", then it determines the size of the left margin of the plot.
• cv_measure: if type = "cv", then it determines the type of accuracy shown in the plot: weighted or unweighted/balanced accuracy ("wacc" or "acc"). If the target attribute is numeric it is possible to review one of the following prediction quality measures: "pearson", "MAE", "RMSE", "SMAPE".
• heatmap_fun: if type = "heatmap", then it determines the calculation of "mean" or "median" within the class to be shown as heatmap color intensity.
• heatmap_colors: if type = "heatmap", then it defines low and high colors on the heatmap.
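For instance, assuming result is an object returned by mcfs, typical plot calls combining the parameters above might look as follows (illustrative values only):

```r
# Sketches of the plot method described above; argument values are arbitrary.
plot(result, type = "ri", size = 50)             # RI ranking of the top 50 features
plot(result, type = "features", size = 10)       # barplot of the top 10 features
plot(result, type = "cv", cv_measure = "wacc")   # cross-validation accuracy
plot(result, type = "heatmap", heatmap_fun = "median")  # per-class medians
```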
Function build.idgraph accepts the following main parameters:
• size: number of top features to select. If size = NA, then size is defined by the mcfs_result$cutoff_value parameter.
• size_ID: number of interdependencies (edges in the ID graph) to be included. If size_ID = NA, then size_ID is defined as size_ID_mult · size.
• outer_ID: if outer_ID = TRUE, then all interactions between a feature from the top feature set (defined by the size parameter) and any other feature are included.
• orphan_nodes: if orphan_nodes = TRUE, then include all nodes, even if they are not connected to any other node (isolated nodes).
• size_ID_mult: if size_ID_mult = 3 there will be 3 times more edges than features (nodes) present in the ID graph. It works only if size = NA and size_ID = NA.
• size_ID_max: maximum number of interactions to be included in the ID graph (the upper limit).
It produces an S3 'idgraph'/'igraph' object that can be plotted in R, exported to graphML (in XML format) or saved as CSV or RDS files.
The plot method for 'idgraph' objects accepts, among others:
• label_dist: space between a node's label and the corresponding node in the plot.

Example
First of all, make sure you have Java installed on your computer, and then install the rmcfs package from the CRAN repository:

R> install.packages("rmcfs")

Before loading the package, set the Java parameters (here we set the maximum size of memory allocated by Java to 2 GB) and then load the package.
R> options(java.parameters = "-Xmx2g")
R> library("rmcfs")

After this, the R environment is ready to run the example below. Note that slightly different results than those reported might be obtained using the replication material because of the algorithm's implementation and the random number generator in the Java version used.

The artificial data
To review and understand the MCFS-ID algorithm, we suggest creating and using an extraordinarily simple example dataset. It consists of objects from 3 classes, A, B and C, that contain 40, 20 and 10 objects, respectively (70 objects altogether). For each object, we create 6 binary features (A1, A2, B1, B2, C1 and C2) that are "ideally" or "almost ideally" correlated with the class feature. If an object's class equals 'A', then its features A1 and A2 are set to the class value 'A'; otherwise A1 = A2 = 0. If an object's class is 'B' or 'C', we proceed analogously, but we introduce some random corruption to 2 observations from class 'B' and to 4 observations from class 'C': in the former case, for each of the two observations, the value 'B' of both attributes B1 and B2 is replaced by '0', and in the latter case, for each of the four observations, the value 'C' of both attributes C1 and C2 is replaced by '0'. The data also contain an additional 500 random numerical features with values distributed uniformly on [0, 1]. Thus we end up with 6 nominal important features (3 pairs with different levels of importance for classification) and 500 random ones.
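The dataset described above can be created along the following lines (a sketch; the paper's replication material may construct it differently, and the seed and noise-column names are arbitrary choices):

```r
set.seed(1)
class <- factor(rep(c("A", "B", "C"), times = c(40, 20, 10)))

# Randomly chosen observations whose informative attributes are corrupted to "0".
badB <- sample(which(class == "B"), 2)
badC <- sample(which(class == "C"), 4)

# One informative attribute: the class label for members of `cls`, "0" otherwise,
# with the values of the corrupted rows reset to "0".
feat <- function(cls, bad = integer(0)) {
  x <- ifelse(class == cls, cls, "0")
  x[bad] <- "0"
  x
}

d <- data.frame(
  A1 = feat("A"),       A2 = feat("A"),
  B1 = feat("B", badB), B2 = feat("B", badB),
  C1 = feat("C", badC), C2 = feat("C", badC)
)

# 500 uninformative features, uniform on [0, 1].
noise <- as.data.frame(matrix(runif(70 * 500), nrow = 70))
colnames(noise) <- paste0("N", seq_len(500))
d <- cbind(d, noise, class = class)
```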

MCFS-ID on artificial data
Let us run mcfs on the created data. If the CPU is a hyper-threading (HT) quad core, we recommend setting up to 8 threads for processing, although the efficiency gain of multithreaded processing is limited by the number of physical CPU cores. However, in the example below, we use 1 thread to get perfect reproducibility, since the implemented multithreaded mechanism is asynchronous. For a standard PC and 1 thread, the run may take around 2 minutes, but we observe a nearly linear dependence between the number of threads running on physical cores and the speed of calculation (see Section 4.3). Parameter cutoffPermutations is set to 5 for this simple example dataset, but for real data we recommend the default of 20.

After successfully running the MCFS-ID algorithm, we can check convergence of the algorithm (see Figure 2). The distance function shows the difference between two consecutive rankings; zero means no changes between two rankings (see the left y-axis). The common part gives the fraction of features that overlap in two different rankings (see the right y-axis). The ranking stabilizes after a number of iterations: the distance tends to zero and the common part tends to 1. beta1 shows the slope of the tangent of a smoothed distance function. If beta1 tends to 0 (the right y-axis), then the distance is given by a flat line.
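A call matching the settings just described might look as follows (assuming the artificial data are stored in a data.frame d with decision attribute class; the seed value is an arbitrary choice):

```r
# Run MCFS-ID on the artificial data with 5 cutoff permutations and a single
# thread for reproducibility.
result <- mcfs(class ~ ., data = d,
               cutoffPermutations = 5,
               seed = 1,
               threadsNumber = 1)
```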

R> plot(result, type = "distances")
Now we can check the cutoff value for the various methods and review the result for the default one. By default, the final cutoff value is equal to the value obtained from the method based on permutations of the decision attribute. The mean obtained from all methods is in our example equal to 11. The mean cutoff value is larger than the actual number of informative features, since the critical angle method takes into consideration only the shape of the RI distribution. The input parameter cutoffMethod of the mcfs function determines the method we want to use as the oracle. The resulting cutoff value is stored in result$cutoff_value, and the user may change it accordingly; this value is used by the plot method.

R> result$cutoff
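To make the averaging over methods concrete, here is a hypothetical sketch. The data frame below only mimics the shape of the per-method cutoff table; its column names and the per-method sizes are invented for illustration (chosen so the mean is 11, as in the example), not output of a real run.

```r
# Hypothetical stand-in for the per-method cutoff table returned by mcfs;
# the method names and sizes here are invented for illustration only.
cutoff <- data.frame(
  method = c("permutations", "criticalAngle", "kmeans"),
  size   = c(6, 21, 6)
)
mean(cutoff$size)  # 11, the mean over all methods

# The final cutoff is kept in result$cutoff_value; a user could override
# it before plotting, e.g., with one of the per-method estimates:
# result$cutoff_value <- cutoff$size[cutoff$method == "permutations"]
```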
We may plot RI values in decreasing order for the top, e.g., 50 features; see Figure 3. The line with red/gray dots gives the RI values, and the blue vertical barplot gives the difference δ between consecutive RI values. Informative features are separated from non-informative ones by the cutoff value and are presented in the plot as red and gray dots, respectively. We can also view all maximal RIs obtained from the permutation experiments (parameter plot_permutations = TRUE): Figure 3 shows these 5 maximal RIs as horizontal red lines. The distribution of these maximal RI values determines the cutoff value.

R> plot(result, type = "ri", size = 50)
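The idea of deriving a cutoff from the permutation maxima can be sketched as follows. The RI values below are invented, and the one-sided confidence bound is only one plausible rule; the package's exact formula may differ.

```r
# Sketch: each of the cutoffPermutations runs on data with a permuted
# decision attribute yields a maximal RI; a one-sided confidence bound
# on these maxima can then serve as the cutoff between informative and
# non-informative features. Values are invented for illustration.
max_ri <- c(0.042, 0.047, 0.051, 0.045, 0.049)
n <- length(max_ri)
alpha <- 0.05
cutoff_ri <- mean(max_ri) + qt(1 - alpha, df = n - 1) * sd(max_ri) / sqrt(n)
cutoff_ri  # features with RI above this bound count as informative
```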
Now, we can review the labels and RIs of the top features. The resulting plot is presented in Figure 4. One can see that all six features are highly important and their RIs are much higher than those of the other features. The set of informative features is flagged in red in the plot. Features A1 and A2 have substantially higher RI values because they "ideally" separate the largest class and in most cases appear in the root node. The second level of a tree may be determined by any of B1, B2, C1 and C2, but in our case features B1/B2 are less corrupted by "noise" and hence they are much more informative than features C1/C2. Finally, we can build and visualize the ID graph. By default, we plot all the edges that connect the 6 informative features as defined by result$cutoff_value.

R> gid <- build.idgraph(result)
R> plot(gid, label_dist = 1)
In the ID graph, as seen in Figure 5, some additional information is conveyed with the help of suitable graphical means. The color intensity of a node is proportional to the corresponding feature's RI. The size of a node is proportional to the number of edges related to this node. The width and darkness of an edge are proportional to the ID weight of this edge. Since we would like to review only the strongest ID weights, let us plot the ID graph with only the 12 top edges (size_ID = 12).

R> gid <- build.idgraph(result, size_ID = 12)
R> plot(gid, label_dist = 1)
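The effect of size_ID can be sketched in base R: conceptually, the graph is pruned to the edges with the largest ID weights. The edge list and weights below are hypothetical and do not come from a real run.

```r
# Hypothetical edge list standing in for the full ID graph; restricting
# the plot to the top size_ID edges amounts to sorting edges by their ID
# weight and keeping the strongest ones.
edges <- data.frame(
  from   = c("A1", "A1", "A2", "B1", "B2", "A2"),
  to     = c("B1", "C1", "B2", "C1", "C2", "C2"),
  weight = c(110, 45, 95, 60, 55, 40)
)
size_ID <- 3
top_edges <- head(edges[order(-edges$weight), ], size_ID)
top_edges  # the size_ID strongest interdependencies
```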
In Figure 6, the top 6 features along with the top 12 ID weights are presented. Notice that the two most important features, A1 and A2, point to all the other ones, while B1 and B2 point to C1 and C2. There are, however, no edges within the duplicated pairs (e.g., A1 and A2). Indeed, such features do not cooperate in distinguishing between classes, since they are the same!

When the execution of MCFS-ID has finished, the procedure runs 10-fold cross validation (CV) with 6 different classifiers on the top features set (see Figure 7). Each CV is repeated 3 times (defined by the finalCVRepetitions parameter) and the mean values of accuracy and weighted accuracy are gathered. Since the weighted accuracy is equal to the mean over all true positive rates (TPR), it is more meaningful for datasets with unbalanced classes. In our example, given the first two features from the ranking, A1 and A2, only 66% weighted accuracy can be achieved: only class A can then be separated, and if all remaining objects are labeled as B, 2 out of 3 classes are classified perfectly (hence wacc = 66%). For 6 and more top features, the accuracy depends on the algorithm and the given CV experiment. The CV plot presents the result for result$cutoff_value features (red label on the x-axis) and its multiples (0.25, 0.5, 0.75, 1, 1.25, 1.5, 2).
R> plot(result, type = "cv", measure = "wacc")

The RIPPER algorithm is a rule-based classifier and thus provides the user with classification rules. The function print() provides, in particular, the rules and their classification ability.
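The weighted accuracy used above is easy to state as code: it is the mean of the per-class true positive rates, so each class contributes equally regardless of its size. The helper function and confusion matrix below are illustrative (not part of the package); the toy matrix mirrors the discussion where class A is separated and classes B and C collapse into B.

```r
# Weighted accuracy = mean of per-class TPRs, computed from a confusion
# matrix with rows = true class and columns = predicted class.
weighted_accuracy <- function(cm) {
  mean(diag(cm) / rowSums(cm))
}

# Toy case: A perfectly separated, B and C both predicted as B,
# giving per-class TPRs of 1, 1 and 0.
cm <- matrix(c(40,  0, 0,
                0, 30, 0,
                0, 30, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
weighted_accuracy(cm)  # 2/3, i.e., wacc = 66%
```

Note that the plain accuracy of the same prediction would be 70/100, which is why wacc is the more honest summary when classes are unbalanced or collapsed.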

MCFS-ID on real data
In order to illustrate the practical usability of the algorithm, we use the Arcene dataset downloaded from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Arcene (Dua and Karra Taniskidou 2017). Arcene's task is to distinguish cancer versus normal patterns from mass-spectrometric data. This is a two-class classification problem with 10000 continuous input variables and 100 objects used for training. This dataset was one of the 5 datasets of the NIPS 2003 feature selection challenge; see Guyon, Gunn, Ben-Hur, and Dror (2005). After running MCFS-ID with its default parameters, we obtain a ranking of features that we use for prediction on the validation set. Starting from 1 and going up to the 200 top features, we repeatedly train an SVM and apply it to 100 new objects (unseen during the feature selection process and the training of the model) from the validation dataset. For over 100 top features the classification accuracy is almost 99%, and for over 175 top features it reaches 100% (see Figure 8); this almost or strictly perfect accuracy follows from the task's low complexity and the small size of the validation set (unfortunately, a larger test set with class labels is unavailable).
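The evaluation protocol (train on the top-k features only, score the held-out validation set, repeat for growing k) can be sketched in base R. Everything below is synthetic: the data are random with 5 artificially informative features, a simple nearest-centroid classifier stands in for the SVM, and the ranking is a naive mean-difference score rather than MCFS output; only the loop structure mirrors the experiment.

```r
# Sketch of the top-k evaluation loop on synthetic data; a nearest-
# centroid classifier stands in for the SVM used in the paper.
set.seed(1)
p <- 50                                    # 50 features, first 5 informative
train_x <- matrix(rnorm(100 * p), 100, p)
train_y <- rep(c(0, 1), each = 50)
train_x[train_y == 1, 1:5] <- train_x[train_y == 1, 1:5] + 2
valid_x <- matrix(rnorm(100 * p), 100, p)
valid_y <- rep(c(0, 1), each = 50)
valid_x[valid_y == 1, 1:5] <- valid_x[valid_y == 1, 1:5] + 2

# naive stand-in for a feature ranking: absolute class-mean difference
ranking <- order(abs(colMeans(train_x[train_y == 1, ]) -
                     colMeans(train_x[train_y == 0, ])), decreasing = TRUE)

acc <- sapply(c(1, 5, 10, 25, 50), function(k) {
  f  <- ranking[1:k]                       # keep only the k top features
  m0 <- colMeans(train_x[train_y == 0, f, drop = FALSE])
  m1 <- colMeans(train_x[train_y == 1, f, drop = FALSE])
  d0 <- rowSums(sweep(valid_x[, f, drop = FALSE], 2, m0)^2)
  d1 <- rowSums(sweep(valid_x[, f, drop = FALSE], 2, m1)^2)
  mean(as.integer(d1 < d0) == valid_y)     # validation accuracy
})
acc  # accuracy as a function of the number of top features used
```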
To verify the convergence of our MCFS heuristic procedure, we run the algorithm three times on the same training data, each time with a different seed. All three runs are made using the same default parameters as before. Having obtained the three rankings, we compare the three sets of top 100 features. As many as 92 features are present in all three rankings, regardless of the starting seed. Below we present the top twenty features from the first run and their positions in the rankings from the remaining two runs.

To analyze the scalability of our current implementation, we run the mcfs procedure with a varying number of threads on three machines with different Intel CPUs. The result, presented in Figure 10, shows an initially nearly linear relation between time and the number of threads, time(th) = time(th = 1)/th, where th is the number of threads, for a 14-core CPU. The experiments show that the gain in speed highly depends on the architecture of the given CPU. Since hyper-threading (HT) technology is based on utilizing the same physical core by two threads simultaneously, it can only boost the speed when the CPU is not well utilized by a single thread. For a multicore system (2 × 14 cores), we noticed that there is no gain in speed from 16 threads on. We profiled the application to look for contention between threads and locks on critical sections, but this proved not to be the cause. There are many other issues that can affect a Java multithreaded program, such as memory bandwidth, L3 cache size limitations, or frequent garbage collector execution (see Qian, Li, Srisa-an, Jiang, and Seth 2015). Nevertheless, multithreading allows a calculation to run 7 times faster with 8 threads than with 1 on a 14-core CPU (49 s vs. 340 s, respectively).
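The stability check across seeds amounts to intersecting the three top-100 lists. The rankings below are synthetic stand-ins constructed to share most of their top features (in practice they would come from the three mcfs runs); only the intersection logic is the point.

```r
# Sketch of the seed-stability check: count the features shared by all
# three top-100 lists. The three rankings are synthetic placeholders
# built to overlap heavily, mimicking three MCFS runs.
set.seed(42)
features <- paste0("V", 1:10000)
ranking1 <- sample(features)
ranking2 <- c(ranking1[1:95],
              sample(features[!features %in% ranking1[1:95]], 5))
ranking3 <- c(ranking1[1:93],
              sample(features[!features %in% ranking1[1:93]], 7))

top100 <- lapply(list(ranking1, ranking2, ranking3), head, 100)
length(Reduce(intersect, top100))  # features common to all three lists
```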

Summary
In this paper, we presented the rmcfs package that can be successfully used for feature selection and interdependency discovery. Here, we discussed one extraordinarily simple usage example and only one example using a real-life dataset, but the MCFS-ID algorithm has proven in past studies to be a very useful technique for limiting the number of features and selecting the informative ones in various real high-dimensional problems; see Dramiński et al. (2008), Kierczak et al. (2009), Khaliq, Leijon, Belák, and Komorowski (2015), Enroth, Bornelöv, Wadelius, and Komorowski (2012), and Dramiński et al. (2016). The ranking of features can be used to review features one by one (starting from the top) even in the case of big data. Using MCFS, one can build rule sets on top features and review their meaning; see Bornelöv, Marillet, and Komorowski (2014). The ID part presents interdependencies between features in a consistent and comprehensive way. It shows not simple correlations but nonlinear relations between features that may reveal causality in the data (to be inferred from or confirmed by suitable background knowledge); see Dramiński et al. (2016). The novelty of the paper lies in presenting: (i) a concise and comprehensive description of an R package that automatically applies proper (default) settings of all crucial MCFS-ID parameters to obtain reliable results for data of any given size; (ii) ways of determining the cutoff value between informative and non-informative features; (iii) a scalability analysis of the current implementation in package rmcfs; (iv) an illustration of using discovered interdependencies (interactions) to advantage in classification tasks.