clustvarsel: A Package Implementing Variable Selection for Model-based Clustering in R

Finite mixture modelling provides a framework for cluster analysis based on parsimonious Gaussian mixture models. Variable or feature selection is of particular importance in situations where only a subset of the available variables provide clustering information. This enables the selection of a more parsimonious model, yielding more efficient estimates, a clearer interpretation and, often, improved clustering partitions. This paper describes the R package clustvarsel which performs subset selection for model-based clustering. An improved version of the methodology of Raftery and Dean (2006) is implemented in the new version 2 of the package to find the (locally) optimal subset of variables with group/cluster information in a dataset. Search over the solution space is performed using either a stepwise greedy search or a headlong algorithm. Adjustments for speeding up these algorithms are discussed, as well as a parallel implementation of the stepwise search. Usage of the package is presented through the discussion of several data examples.


Introduction
Cluster analysis is the search for a priori unknown group structure in data. Model-based clustering is increasingly becoming one of the most popular cluster analysis methods. Model-based clustering is based on Finite mixture models (McLachlan and Peel, 2000), with each component density usually representing a cluster. For continuous data, Gaussian components are usually used to model clusters. Model-based clustering as implemented in the R package mclust (Fraley et al., 2012) allows for automatic selection of the number of components, and selection of parsimonious covariance structures.
In cluster analysis, as in classification or other supervised learning tasks, the inclusion of noise variables, i.e. features without useful group information, can severely degrade the final results. In fact, the presence of noise variables can negatively impact both the estimation of the number of clusters in the data and the recovery of those groups. The new clustvarsel (version ≥ 2.0) R package implements a wrapper method for automatic variable selection in model-based clustering (as implemented in the mclust package). Thus, the addition of the clustvarsel package allows for automatic variable selection to be included in the estimation process. Raftery and Dean (2006) introduced a stepwise variable selection methodology tailored to model-based clustering. An earlier version of clustvarsel (version 1) implemented this methodology. Variables designated as noise variables in this process were not required to be independent of the clustering variables. However, noise variables could be conditionally independent of the clustering, but still linearly dependent on the clustering variables. This linear dependency was modelled using linear regression. Maugis et al. (2009a) extended the framework of Raftery and Dean (2006) by allowing the noise variables to depend on a (possibly null) subset of the clustering variables via stepwise variable selection in the linear regression (see also Maugis et al., 2009b). This allows for a more parsimonious modelling of the relationship between the noise variables and the clustering variables. For more detail on the variable selection framework from both Raftery and Dean (2006) and Maugis et al. (2009a) see Section 2.
Section 3 introduces the main function in the clustvarsel package, and discusses the options for the available arguments. In Section 4, several examples are presented by applying the methodology to both synthetic and real world datasets. Algorithmic speedups are discussed in Section 5, including a description of a parallel implementation of the stepwise greedy search in Section 5.3. The paper concludes with some discussion and final remarks in Section 6.

Methodology
Model-based clustering assumes that the observed data are generated from a mixture of G components, each representing the probability distribution for a different group or cluster (McLachlan and Peel, 2000;Fraley and Raftery, 2002). For continuous data, the density of each mixture component can be described by the multivariate Gaussian distribution. Thus, the general form of a Gaussian finite mixture model is where π g represents the mixing probabilities, so that π g > 0 and G g=1 π g = 1, φ(·) is the multivariate Gaussian density with parameters (µ g , Σ g ) (g = 1, . . . , G). Clusters are ellipsoidal, centred at the mean vector µ g , with other geometric features, such as volume, shape and orientation, determined by Σ g . Parsimonious parameterisation of covariance matrices is available through the eigenvalue decomposition Σ g = λ g D g A g D g , where λ g is a scalar controlling the volume of the ellipsoid, A g is a diagonal matrix specifying the shape of the density contours, and D g is an orthogonal matrix which determines the orientation of the corresponding ellipsoid (Banfield and Raftery, 1993;Celeux and Govaert, 1995). Fraley et al. (2012, Table 1) report some parameterisation of within-group covariance matrices available in the R package mclust, and the corresponding geometric characteristics. Raftery and Dean (2006) discussed the problem of variable selection for model-based clustering by recasting the problem as a model selection procedure. Their proposal is based on the use of BIC to approximate Bayes factors to compare mixture models fitted on nested subsets of variables. A generalisation of their approach was later discussed by Maugis et al. (2009a,b).
Let us suppose that the set of available variables, X , is partitioned into three disjoint parts: the set of previously selected variables, X clust , the variable under consideration for inclusion or exclusion from the active set, X i , and the set of the remaining variables, X other ≡ X \{X clust ∪X i }. Raftery and Dean (2006) showed that the inclusion (or exclusion) of variables can be assessed using the following BIC difference: where BIC clust (X clust , X i ) is the BIC value for the "best" clustering mixture model (i.e. assuming G ≥ 2) fitted using the features set {X clust ∪X i }, whereas BIC not clust (X clust , X i ) is the BIC value for no clustering for the same set of variables. The latter can be written as i.e. the BIC value for the "best" clustering model fitted using the set X clust plus the BIC value for the regression of the candidate variable X i on the variables included in the set X clust .
The difference in BIC score (1) is an approximation of the log of the Bayes factor comparing the model where the variable under consideration, X i , is a clustering variable with the model where the variable is conditionally independent of the clustering. Large, positive values of BIC diff can be taken as an evidence that variable X i is useful for clustering.
In all clustering models, the "best" model is identified with respect to the number of mixture components (assuming G ≥ 2) and to model parameterisations. In the linear regression model term, X i can depend on all the variables in X clust , a subset of them, or none (complete independence). Finally, note that in both equations (1) and (2) the set of remaining variables, X other , plays no role. For more details of the methodology see Raftery and Dean (2006), and for improvements due to the subset selection in the regression model see Maugis et al. (2009a).
Practical implementation of the above methodology requires the use of an algorithm for checking single variables for inclusion/exclusion from the set of selected clustering variables.
A stepwise greedy search algorithm checks the inclusion of each single variable not currently selected into the current set of selected clustering variables at each step. The variable that has the highest evidence of inclusion is proposed and, if its clustering evidence is stronger than the evidence against clustering, it is included. At every exclusion step, the algorithm checks the removal of each single variable in the currently selected set of clustering variables, and proposes the variable that has the lowest evidence of clustering. The proposed variable is removed if the evidence for its being a clustering variable is weaker than the evidence against. This is similar to the idea for stepwise regression and may suffer from the same instabilities mentioned in Miller (1990) inherent in that approach (although this has not been apparent in the simulations and examples tried thus far).
The stepwise algorithm can be implemented in a forward/backward fashion, i.e. starting from the empty set of clustering variable and then continuing to add or remove features until there is no evidence of further clustering variables. It can also be implemented in a backward/forward fashion, i.e. starting from the full set of features as clustering variables and then continuing to remove or add features until there is no evidence of further clustering variables.
Another possible algorithm is the headlong search, which involves potentially checking less variables at each inclusion or exclusion step, and so may be quicker than the stepwise greedy search (at a possible price in terms of performance) for use on datasets with a large number of variables. At each inclusion step, the headlong algorithm only checks single variables not currently in the set of clustering variables until the difference between the BIC for clustering versus not clustering is above a pre-specified upper level (a default value of 0 implies that the evidence for clustering is greater than that for not clustering).
Because the algorithm stops once this criterion is satisfied, it will not necessarily check all the variables available, and the variable selected will not necessarily be the best possible feature at that step. Any variables that are checked during this step, whose difference between the BIC for clustering versus not clustering is below a pre-specified lower level, are removed from consideration for the rest of the algorithm (a default value of −10 means that there is a strong evidence against clustering). Because of this, possibly irrelevant variables can be removed early on in the algorithm and further reduce the number of variables checked at each step.
Similarly, at each exclusion, the algorithm only checks single variables currently in the set of clustering variables until the BIC difference for clustering versus not clustering is below the pre-specified upper level. The algorithm stops checking once a variable satisfies this criterion, and that variable is removed from the set. If the difference in BIC is smaller than the lower level, then the variable is removed from consideration for the rest of the algorithm, otherwise it can still be checked in future inclusion/exclusion steps. See Badsberg (1992) for full details about the headlong algorithm.

The R package clustvarsel
The clustvarsel package can be used to find the (locally) optimal subset of variables with group/cluster information in a dataset with continuous variables. In this section, usage of the main function clustvarsel and its arguments is described.
The clustvarsel package depends on other packages available on CRAN for model fitting (mclust, Fraley et al. (2014)), or for providing some facilities, such as parallelisation (parallel, R Core Team (2014); doParallel, Revolution Analytics and Weston (2013a); foreach, Revolution Analytics and Weston (2013b); iterators, Analytics (2012)), and subset selection in regression models (BMA, Raftery et al. (2013)). By loading the package as usual with library(clustvarsel), it will also take care of making all the other packages available for the current session.
G An integer vector specifying the numbers of mixture components (clusters) for which the BIC is to be calculated. The default is G = 1:9.
search A character vector indicating whether a "greedy" or, potentially quicker but less optimal, "headlong" algorithm is to be used in the search for clustering variables.
direction A character vector indicating the type of search: "forward" starts from the empty model and at each step of the algorithm adds/removes a variable until the stopping criterion is satisfied; "backward" starts from the model with all the available variables and at each step of the algorithm removes/adds a variable until the stopping criterion is satisfied. For the "headlong" search only the "forward" algorithm is available. emModels1 A vector of character strings indicating the models to be fitted in the EM phase of univariate clustering. Possible models are "E" and "V", described in help(mclustModelNames BIC.lower A numerical value indicating the level of BIC difference between clustering and no clustering below which a variable will be removed from consideration in the headlong algorithm. Default is −10. itermax An integer value giving the maximum number of iterations (of addition and removal steps) the selected algorithm is allowed to run for.
parallel This argument allows to specify if the selected "greedy" algorithm should be run sequentially or in parallel. The possible values are: (i) a logical value specifying if parallel computing should be used (TRUE) or not (FALSE, default) for running the algorithm; (ii) a numerical value which gives the number of cores to employ (by default, this is obtained from function detectCores in the parallel package); (iii) a character string specifying the type of parallelisation to use. The latter depends on system OS: on Windows OS only "snow" type functionality is available, whereas on Unix/Linux/Mac OSX both "snow" and "multicore" (default) functionalities are available.
Options (ii) and (iii) imply that the search is performed in parallel. By default the algorithm is run sequentially.
A basic clustvarsel function call needs to input a matrix or data frame containing the data to analyse. Fine tuning is possible by specifying the arguments described above. The following section presents some examples of its usage in practice.

Examples
In this section we present some data analysis examples based on simulated data and on wellknown real datasets.

Simulated data
We consider some of the synthetic data examples described in Maugis et al. (2009a). Samples were simulated for a 10-dimensional feature vector where only the first two variables provide clustering information. These were generated from a mixture of four Gaussian distributions  Table 1 in Maugis et al., 2009a). These range from independence of clustering variables on the other features (model 1 and 2) to cases of increasing degree of dependence of the irrelevant variables on the clustering ones (model 3 to 7). In the following, we focus only on some of the scenarios. For ease of reading, the values of the parameters for such scenarios are reported in Table 1.
The simulation results were evaluated using the following criteria: • Variable Selection Error Rate (VSER) to assess variable selection performance. VSER is defined as the ratio of the number of errors in selecting (or not selecting) variables to the total number of variables in the set. A perfect recovey of clustering variables gives VSER = 0, while VSER can be no greater than 1.
A perfect classification gives ARI = 1, whereas ARI = 0 for a random classification. Table 2 shows the results from a simulation study for the above synthetic data using sample sizes n = 200 and n = 1000. The methods compared are MCLUST, the best GMM using the full Table 1: Parameter settings for the scenarios used to generated synthetic data: β defines the correlation of irrelevant variables on clustering variables, whereas Ω is the covariance structure of the noise component. 0 p indicates the (2 × p) matrix of zeroes, and I p the (p × p) identity matrix.

Scenario Parameters Scenario Parameters
Model 1 Model 4 β = 0.5 0 0 1 0 6 Model 7 β = 0.5 0 2 0 2 0.5 2 0 0 1 0 3 0.5 1 0 3 set of variables, CLUSTVARSEL[fwd] and CLUSTVARSEL[bkw], the best GMM using the subset of relevant clustering variables selected by, respectively, the forward/backward and backward/forward greedy search, and SPARSEKMEANS, a sparse version of k-means algorithm (Witten and Tibshirani, 2010). Because the last method needs the number of clusters to be fixed in advance, we also included in the comparison versions of the methods based on GMMs with the number of components fixed at the true number of clusters, i.e. G = 4. Finally, note that true subset size is 2, so the optimal VSER should be 0, and the best average ARI value attainable, using the true clustering variables and fixed G = 4 components, is about 0.88.
Compared to the performance of CLUSTVARSEL reported in Table 1 of Maugis et al. (2009a), the new version of the algorithm is able to correctly discard irrelevant variables, both when they are independent of the clustering ones and when they are correlated.
When G is fixed at the true number of clusters, MCLUST gives slightly less accurate results for n = 200, except in the case of complete independence (scenario 1). CLUSTVARSEL provides equivalent accuracy, both if a forward/backward search or a backward/forward search is used.
SPARSEKMEANS shows results equivalent to greedy search in term of accuracy, but it tends to select (i.e. assigns weights different from zero to) too many variables. Consequently, the VSER of SPARSEKMEANS is always worse than that of CLUSTVARSEL. When G is unknown, MCLUST often provides inaccurate clustering, in particular when n = 200.
On the contrary, CLUSTVARSEL is generally able to select the true clustering variables (i.e. VSER is near or exactly zero), and also provides very accurate clustering (i.e. ARI is close to 0.88). The only exceptions are scenarios 1 and 4, for the backward/forward search when n = 200. In these cases the number of selected variables is slightly larger, which in turn causes a small degradation of clustering accuracy. However, for n = 1, 000 the forward/backward and backward/forward greedy searches are equivalent.

Crabs data
The crabs dataset in the MASS package contains five morphological measurements on 200 specimens of Leptograpsus variegatus crabs recorded on the shore in Western Australia (Campbell and Mahon, 1974). Crabs are classified according to their color (blue and orange) and sex, giving four groups. Fifty specimens are available for each combination of colour and sex.
R> data(crabs, package = "MASS") R> X = crabs[,4:8] R> Class = with(crabs, paste(sp, sex, sep = "|")) R> First we look at the result obtained using the function Mclust from the mclust package, with the best model selected by BIC for clustering on all the variables, allowing all possible parameterisations and the number of groups to range over 1 to 5: The estimated MAP classification is obtained from mod1$classification, so a table comparing the estimated and the true classifications, the corresponding misclassification error rate and the adjusted Rand index (ARI), can be obtained as follows: R> The same subset is also obtained by using a backward/forward greedy search: R> clustvarsel(X, G = 1:5, direction = "backward") clustvarsel model object: Stepwise ( The identified subset can be used for fitting the final clustering model as follows:  The accuracy of the clustering obtained on the selected subset of variables is obtained as: R>

Coffee data
Data on twelve chemical constituents of coffee for 43 samples were collected from 29 countries around the world (Streuli, 1973). Each coffee sample is either of the Arabica or Robusta variety.
The dataset is available in the R package pgmm.
R> data(coffee, package = "pgmm") R> X = as.matrix(coffee[,3:14]) R> Class = factor(coffee$Variety, levels = 1:2, labels = c("Arabica", "Robusta")) R> table(Class) The Arabica variety appears to be split into two sub-varieties, whereas the Robusta is correctly identified as a single cluster. As a result, a small value of ARI is obtained. Now, we may try variable selection to drop irrelevant features, and see if we can improve upon the above model. The following code uses the backward/forward greedy search for variable selection, which by default is performed over all the covariance decomposition models and numbers of mixture components from 1 up to 9: R> result = clustvarsel(X, direction = "backward") R> result clustvarsel model object: Stepwise ( Then, the clustering model estimated on the selected subset of variables is: Both the covariance parameterisation (EEE) and the number of mixture components (3) used with the selected features subset agree with those from the model using all the variables. The final clustering confirms the structure we already discussed, in particular the two sub-varieties of Arabica coffee.
To show graphically these findings, we may project the data onto a dimension reduced subspace by using the methodology described in Scrucca (2010): From Figure 1 there is an evident separation between Arabica and Robusta coffee samples along the first direction. Moreover, it seems to confirm the non homogeneous group of Arabica samples, which splits in two sub-varieties along the second direction. Witten and Tibshirani (2010, Sec. 3.3.2) discussed an example where five clustering variables are conditionally independent given the cluster memberships, whereas the remaining twenty features are simply independent standard normal variables, also independent from the clustering ones.
We replicated this experiment (denoted by WT) with varying sample sizes (n g cases for each group) and for a set of different techniques. Table 3 reports the variable selection error rate (VSER) and the classification error rate (CER). As already mentioned in Section 4.1, the VSER is defined as the ratio of the number of errors in selecting (or not selecting) variables with respect to the total number of variables considered. The CER between two partitions, which is equivalent to one minus the Rand index (Rand, 1971), is equal to 0 in the case of perfect agreement, and becomes larger for increasing disagreement (for a formal definition see Witten and Tibshirani, 2010). Smaller values of both VSER and CER are better. These two measures have been chosen for the purpose of comparison with the results in Witten and Tibshirani (2010, Tab. 4) and Celeux et al. (2013, Sect. 3.1).
The model with uniformly better performance is the MCLUST model with the first five variables, i.e. the model which most resembles the data generation mechanism. However, this model is not available when we do not know the clustering variables, as we are assuming here.
SPARSEKMEANS performs well in term of accuracy, but the number of selected variables increases with group sample size, and all the features are selected for n g = 50. MCLUST using all the variables and EII parameterisation, both at the hierarchical initialisation step and for the mixture modelling, has quite good accuracy and improves as group sample size increases. CLUSTVARSEL using backward/forward greedy search, with EII parameterisation both for modelling and initialisation also has good accuracy, and improves as group sample size gets larger. For n g = 50 it is equivalent to the best models. However, the VSER is better than for the other methods, and improves as n g gets larger. Thus, for increasing sample size it converges to the true subset size.
Note that forward/backward greedy search is clearly less accurate than the backward/forward search, with the VSER which is also larger, so the performance is overall worst than that of backward/forward search. Finally, results for CLUSTVARSEL using backward/forward greedy search are essentially equivalent to those obtained with the SELVARCLUSTINDEP software of Maugis et al. 5 Adjustments for speeding up the algorithm

Sub-sampling at hierarchical initialisation step
The EM algorithm is initialised in mclust using the partitions obtained from model-based agglomerative hierarchical clustering. Efficient numerical algorithms for approximately maximiz- Table 3: Average values based on 100 simulation runs for the WT simulation scheme with group sample size n g . All models assume the number of clusters G = 3 to be known. For all the MCLUST models, including those fitted by the variables subset selection algorithm, the EII parameterisation was used for both the hierarchical initialization and the mixture modeling. CLUSTVARSEL[fwd] uses the forward/backward greedy search, whereas CLUSTVARSEL[bkw] uses the backward/forward greedy search. Smaller values of both VSER and CER are better.
VSER CER Model n g = 10 n g = 20 n g = 50 n g = 10 n g = 20 n g = 50 When the number of observations is large, we may allow clustvarsel to use only a subset of the observations at the model-based hierarchical stage of clustering, to speed up the algorithm. This is easily done by setting the argument samp = TRUE, and by specifying the number of observations to be used in the hierarchical clustering subset with sampsize.
Consider the following simulation scheme which constructs a medium sized dataset on five dimensions. Only the first two variables contain clustering information, the third is highly correlated with the first one, whereas the remaining features are simply noise variables.
R> system.time(result1 <-clustvarsel(X, G = 1:5, samp = TRUE, sampsize = 2 )) user system elapsed 6. 33 . 9 6. 44 R> system.time(result2 <-clustvarsel(X, G = 1:5)) user system elapsed 9.739 . 44 9.82 Thus, by using sub-sampling for the initial hierarchical clustering we get an algorithm that is 38% faster with the same accuracy: To investigate the effect of sampling as the number of observations increase we conducted a small simulation study by replicating the above simulation setting with different sample sizes and fixed size at 200 observations for choosing the initial starting points. Figure 3 shows the results averaged over 10 replications. Panel (a) reports the computing time required as the sample size grows, whereas panel (b) shows the relative gain from using a subset of observations at the initial hierarchical stage. As can be seen, efficiency improves roughly exponentially as the number of observations increases, with sampling being about 40 times faster at 10, 000 cases.
As the system time required increases linearly for sampling, when no sampling is used at the initial stage the time required increases approximately exponentially. Furthermore, in all the replications the first two variables have been selected by both methods. Hence, the improvement in terms of computational efficiency has not caused any deterioration in terms of accuracy.

Headlong search
When a dataset contains a large number of variables we may find that using the headlong search algorithm option (search = "headlong") is faster than the default greedy search. To show an example we simulated a dataset analogous to the previous one for the clustering variables, then six more irrelevant variables were added, some correlated with the clustering ones, some independent and some correlated among themselves. Then, we may compare the time required by using the headlong method and using the greedy method.
In situations where there are many observations and a large number of variables, subsampling at the hierarchical initialisation step and the headlong search can be used concurrently to improve computational efficiency. A small simulation study was conducted by replicating the previous simulation scheme with different sample sizes. The methods compared are greedy and headlong searches, without and with sampling using sampize = 2 . The results averaged over 10 replications are shown in Figure 4. Without sampling, headlong search is faster than greedy search with a constant relative gain of about 1.7. The use of sampling at the initial hierarchical stage enables us to achieve an exponential relative gain as the sample size increases for both type of searches. Also in this case, the headlong search appears to be about twice as fast as the greedy search.
The speed/optimally tradeoff in a headlong search can be changed by increasing or decreasing the different levels, e.g. by setting the upper level to 10 instead of 0 we would require a variable to have stronger evidence of clustering before it is included, and by setting the lower level to 0 we would remove variables that at any stage have evidence of clustering weaker by any amount than evidence against clustering. replications for clustvarsel using search = "greedy" with and without sampling, and search = "headlong" with and without sampling. A fixed value sampsize = 2 is used throughout. Panel (b) shows the relative gain, calculated as the ratio of system times, of each strategy against the default greedy search with no sampling. All axes are on the logarithmic scale.

Parallel computing
Parallel computing is a form of computation in which the required calculations are performed simultaneously, either on a single multi-core processors machine or on a cluster of multiple computers.
Direct support of parallelism in R is available since version 2.14.0 (released in October 2011) through the package parallel (R Core Team, 2014). This is essentially a merger of the multicore package (Urbanek, 2011) and the snow package (Tierney et al., 2013). The multicore functionality supports parallelism via forking, which is a concept from POSIX operating systems, so it is available on all R platforms except Windows. In contrast, snow supports different transport mechanisms (e.g. socket connections) to communicate between the master and the workers, and it is available on all operating systems. Other approaches to parallel computing in R are available as described in McCallum and Weston (2011). For an extensive list of packages see CRAN task view on High-Performance and Parallel Computing with R at http://cran.r-project.org/web/ views/HighPerformanceComputing.html.
The greedy search discussed in Section 2 constitutes an embarrassingly parallel problem, i.e. one for which little or no effort is required to separate the problem into a number of parallel tasks. Essentially, the sequential evaluation of candidate variables for inclusion or exclusion, which is the most time consuming task, can be done in parallel. For the actual implementation in clustvarsel we used the doParallel package (Revolution Analytics and Weston, 2013a), a "parallel backend" which acts as an interface between the foreach package (Revolution Analytics and Weston, 2013b) and the parallel package. Essentially, it provides a mechanism needed to execute for loops in parallel.
To specify if parallel computing should be used in the evaluation of the BIC diff criterion in (1), the optional argument parallel must be set to TRUE in the clustvarsel function call. In this case all the available cores, as returned by the detectCores function, are used. A numeric value specifying the number of cores to employ can also be specified in the optional argument parallel.
Finally, the parallelisation functionality depends on system OS: on Windows only 'snow' type functionality is available, whereas on Unix/Linux/Mac OSX both 'snow' and 'multicore' (default) functionalities are available.
As an example, consider a sample of n = 200 observations generated according to the simulation scheme described in Section 5.2. We may compare the sequential greedy backward/forward search with a parallel version of the algorithm with the default maximum cores available and by specifying 2 cores (on a MacBook Pro with i5 Intel ® CPU with 4 cores running at 2.3GHz and with 4GB of RAM): R> system.time(result1 <-clustvarsel(X, G = 1:9, direction = "backward")) user system elapsed 1 3.86 .5 9 1 4.774 R> system.time(result2 <-clustvarsel(X, G = 1:9, direction = "backward", parallel = TRUE)) user system elapsed 158.578 5.685 51.254 R> system.time(result2 <-clustvarsel(X, G = 1:9, direction = "backward", parallel = 2)) user system elapsed 119.161 2.887 67.221 In this case, by using 4 cores we were able to halve the computing time, whereas a 35% speed up is achieved using 2 cores.
By using a machine with P processors instead of just one, we would like to obtain an increase in calculation speed of P times. As shown above, this is not the case because in the implementation of a parallel algorithm there are some inherent non-parallelizable parts and communication costs between tasks (Nakano, 2012). Amdahl's Law (Amdahl, 1967) is often used in parallel computing to predict the theoretical maximum speedup when using multiple processors. If f is the fraction of non-parallelizable task and P is the number of processors in use, then the maximum speedup achievable on a parallel computing platform is given by In the limit, the above ratio converges to S max = 1/f , which represents the maximum increase of speed achievable in theory, i.e. by a machine with an infinite number of processors.
To investigate the performance of our parallel algorithm implementation, we conducted a small simulation study using the above simulation setting for increasing numbers of cores. The study was performed on a 24 cores Intel Xeon ® CPU X5675 running at 3.07GHz and with 128GB of RAM. Figure 5 shows the results averaged over 10 replications. The points represent the observed speedup factor (obtained as s P = t 1 /t P , where t P is the system time employed using P cores) for running the backward algorithm with up to 10 cores. The curve represents the Amdahl's Law (3) with f estimated by non-linear least squares. It turns out that the estimated fraction of sequential part of the backward/forward search for variable selection isf = 0.13, which yields a maximum speedup of S max = 7.7. The estimated fraction of non-parallelizable task is estimated as f = 0.13, which gives a maximum speedup achievable by parallelisation of around 7.7 times the sequential algorithm.

Conclusions and future work
This paper has presented the R package clustvarsel which provides a convenient set of tools for variable selection in model-based clustering using a finite mixture of Gaussian densities. Stepwise greedy search and headlong algorithm are implemented in order to find the (locally) optimal subset of variables with cluster information. The computational burden of such algorithms can be decreased by some ad hoc modifications in the algorithms, or via the use of parallel computation as implemented in the package. Examples illustrating the use of the package in practical applications have been presented. Finally, given the vast solution space, other optimisation techniques could be usefully employed. For instance, the use of genetic algorithms as described in Scrucca (2014) will be included in a future release of the package.