Rmixmod: The R Package of the Model-Based Unsupervised, Supervised, and Semi-Supervised Classification Mixmod Library

Mixmod is a well-established software package for fitting mixture models of multivariate Gaussian or multinomial probability distribution functions to a given dataset with either a clustering, a density estimation or a discriminant analysis purpose. The Rmixmod S4 package provides an interface from the R statistical computing environment to the C++ core library of Mixmod (mixmodLib). In this article, we give an overview of the model-based clustering and classification methods implemented, and we show how the R package Rmixmod can be used for clustering and discriminant analysis.


Introduction
Clustering and discriminant analysis (or classification) methods are among the most important techniques in multivariate statistical learning. The goal of cluster analysis is to partition the observations into groups ("clusters") so that the pairwise dissimilarities between observations assigned to the same cluster tend to be smaller than those between observations in different clusters. The goal of classification is to design, from a learning dataset, a decision function that assigns new data to groups known a priori. Mixture modeling supposes that the data are an i.i.d. sample from some population described by a probability density function. This density function is a finite mixture of parametric component density functions, each component modeling one of the clusters. This model is fit to the data by maximum likelihood (McLachlan and Peel 2000).
The Mixmod package (Mixmod Team 2008) is primarily devoted to clustering using mixture models and, to a lesser extent, to discriminant analysis (supervised and semi-supervised situations). Many options are available to specify the models and the strategies to be run. Mixmod makes it possible to fit 28 multivariate Gaussian mixture models for quantitative data and 10 multivariate multinomial mixture models for qualitative data. Estimation of the mixture parameters is performed via the EM, the stochastic EM or the classification EM algorithms. These three algorithms can be chained and initialized in several different ways, leading to original strategies (see Section 2.3). The model selection criteria BIC (Bayesian information criterion), ICL (integrated classification likelihood), NEC (normalized entropy criterion), and cross-validation are proposed, depending on the modeling purpose (see Section 2.4).
Mixmod, developed since 2001, is a package written in C++. Its core library mixmodLib can be interfaced with any other software package or library, or can be used from the command line. It has already been interfaced with Scilab (Scilab Enterprises 2015) and MATLAB (The MathWorks Inc. 2014), see Biernacki, Celeux, Govaert, and Langrognet (2006). Until now, however, it lacked an interface to R (R Core Team 2015). The Rmixmod package provides a bridge between the R statistical computing environment and the C++ core library of Mixmod. Both cluster analysis and discriminant analysis can now be performed in R using Rmixmod. User-friendly outputs and graphs allow for a relevant and appealing visualization of the results. The package is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=Rmixmod.
This paper reviews in Section 2 Gaussian and multinomial mixture models and the Mixmod library. An overview of the Rmixmod package is then given in Section 3 through a description of the main function and of other related companion functions. The practical use of this package is illustrated in Section 4 on toy datasets, for model-based clustering in quantitative and qualitative settings (Section 4.1) and for discriminant analysis (Section 4.2). Section 5 outlines future work of the Mixmod project.

Model-based classifications
The model-based point of view makes it possible to consider all the classification tasks above in a unified manner.
Mixture models. Let $x = \{x_1, \ldots, x_n\}$ be $n$ independent vectors in $\mathcal{X}_1 \times \ldots \times \mathcal{X}_d$, where each $\mathcal{X}_j$ denotes some measurable space, and such that each $x_i$ arises from a mixture probability distribution with density
$$f(x_i|\theta) = \sum_{k=1}^{K} p_k\, h(x_i|\alpha_k), \qquad (1)$$
where the $p_k$'s are the mixing proportions ($0 < p_k < 1$ for all $k = 1, \ldots, K$ and $p_1 + \ldots + p_K = 1$) and $h(\cdot|\alpha_k)$ denotes a $d$-dimensional distribution parameterized by $\alpha_k$. As we will see below, $h$ is for instance the density of a Gaussian distribution with mean $\mu_k$ and variance matrix $\Sigma_k$ and, thus, $\alpha_k = (\mu_k, \Sigma_k)$. The whole parameter vector (to be estimated) of $f$ is denoted by $\theta = (p_1, \ldots, p_K, \alpha_1, \ldots, \alpha_K)$.
Label estimation. From a generative point of view, drawing the sample $x$ from the mixture distribution $f$ requires first drawing a sample of labels $z = \{z_1, \ldots, z_n\}$, with $z_i = (z_{i1}, \ldots, z_{iK})$, where $z_{ik} = 1$ if $x_i$ arises from the $k$th mixture component and $z_{ik} = 0$ otherwise. Depending on whether the sample $z$ is completely unknown, completely known or only partially known, we retrieve an unsupervised, a supervised or a semi-supervised classification problem, respectively. Mixture models are particularly well-suited for modeling these different standard situations since an estimate of any label $z_i$ ($i = 1, \ldots, n$ for unsupervised classification, $i = n + 1$ for a new observation in supervised or semi-supervised classification) can easily be obtained by the following so-called maximum a posteriori (MAP) rule
$$\hat z_{ik} = \begin{cases} 1 & \text{if } k = \arg\max_{k'} t_{k'}(x_i|\theta), \\ 0 & \text{otherwise,} \end{cases}$$
with $t_k(x_i|\theta)$ denoting the conditional probability that the observation $x_i$ arises from group $k$:
$$t_k(x_i|\theta) = \frac{p_k\, h(x_i|\alpha_k)}{f(x_i|\theta)}. \qquad (2)$$
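As an illustration, the MAP rule in Equation 2 is straightforward to apply once the conditional probabilities are available. The following minimal R sketch (our own, not part of Rmixmod; the mixture parameters are made up) computes $t_k(x_i|\theta)$ and the MAP labels for a univariate two-component Gaussian mixture:

R> p <- c(0.4, 0.6)                   # mixing proportions
R> mu <- c(0, 4); sigma <- c(1, 1.5)  # component means and standard deviations
R> x <- c(-0.5, 3.8, 2.1)             # some observations
R> # conditional probabilities t_k(x_i | theta), see Equation 2
R> tik <- sapply(1:2, function(k) p[k] * dnorm(x, mu[k], sigma[k]))
R> tik <- tik / rowSums(tik)
R> # MAP rule: assign each observation to its most probable component
R> zhat <- apply(tik, 1, which.max)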

Parsimonious and meaningful models
The Mixmod library proposes many parsimonious and meaningful models, depending on the type of variables to be considered. Such models provide simple interpretations of groups.

Continuous variables: Fourteen Gaussian models
In the Gaussian mixture model, each $x_i$ is assumed to arise independently from a mixture of $d$-dimensional Gaussian densities with mean $\mu_k$ and variance matrix $\Sigma_k$. In this case we have in Equation 1, with $\alpha_k = (\mu_k, \Sigma_k)$,
$$h(x_i|\mu_k, \Sigma_k) = (2\pi)^{-d/2} |\Sigma_k|^{-1/2} \exp\left\{-\tfrac{1}{2}(x_i - \mu_k)' \Sigma_k^{-1} (x_i - \mu_k)\right\}.$$
Thus, clusters associated with the mixture components are ellipsoidal, centered at the means $\mu_k$, and the variance matrices $\Sigma_k$ determine their geometric characteristics.
Following Banfield and Raftery (1993) and Celeux and Govaert (1995), we consider a parameterization of the variance matrices of the mixture components consisting of expressing the variance matrix $\Sigma_k$ in terms of its eigenvalue decomposition
$$\Sigma_k = \lambda_k D_k A_k D_k', \qquad (3)$$
where $\lambda_k = |\Sigma_k|^{1/d}$, $D_k$ is the matrix of eigenvectors of $\Sigma_k$ and $A_k$ is a diagonal matrix, such that $|A_k| = 1$, with the normalized eigenvalues of $\Sigma_k$ on the diagonal in decreasing order. The parameter $\lambda_k$ determines the volume of the $k$th cluster, $D_k$ its orientation and $A_k$ its shape. By allowing some but not all of these quantities to vary between clusters, we obtain parsimonious and easily interpreted models which are appropriate to describe various group situations (see Table 1). More explanations about the notation used in this table are given below.
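To make the decomposition (3) concrete, the following short R sketch (ours, independent of Rmixmod) extracts the volume, orientation and shape parameters from a given covariance matrix:

R> Sigma <- matrix(c(4, 1.2, 1.2, 1), nrow = 2)  # an example covariance matrix
R> d <- nrow(Sigma)
R> lambda <- det(Sigma)^(1/d)                    # volume parameter |Sigma|^(1/d)
R> eig <- eigen(Sigma)
R> D <- eig$vectors                              # orientation: eigenvector matrix
R> A <- diag(eig$values / lambda)                # shape: normalized eigenvalues, det(A) = 1
R> all.equal(Sigma, lambda * D %*% A %*% t(D))   # recovers Sigma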

Table 1: The 14 Gaussian models, with, for each model, its number of parameters, the nature of its M step and its Rmixmod model name. CF means that the M step is in closed form, IP means that the M step needs an iterative procedure.
The general family. First, we can allow the volumes, the shapes and the orientations of clusters to vary or to be equal between clusters. Variations on assumptions on the parameters $\lambda_k$, $D_k$ and $A_k$ ($1 \leq k \leq K$) lead to eight general models of interest. For instance, we can assume different volumes and keep the shapes and orientations equal by requiring that $A_k = A$ ($A$ unknown) and $D_k = D$ ($D$ unknown) for $k = 1, \ldots, K$. We denote this model by $[\lambda_k DAD']$ (or, shortly, $[\lambda_k C]$ where $C = DAD'$). With this convention, writing $[\lambda D_k A D_k']$ means that we consider the mixture model with equal volumes, equal shapes and different orientations.
The diagonal family. Another family of interest consists of assuming that the variance matrices $\Sigma_k$ are diagonal. In the parameterization (3), this means that the orientation matrices $D_k$ are permutation matrices. We write $\Sigma_k = \lambda_k B_k$ where $B_k$ is a diagonal matrix with $|B_k| = 1$. This particular parameterization gives rise to four models: $[\lambda B]$, $[\lambda_k B]$, $[\lambda B_k]$ and $[\lambda_k B_k]$.

The spherical family. The last family of models consists of assuming spherical shapes, namely $A_k = I$, $I$ denoting the identity matrix. In such a case, two parsimonious models are in competition: $[\lambda I]$ and $[\lambda_k I]$.
Remark. The Mixmod library provides also some Gaussian models devoted to high dimensional data. We do not describe them here since they are not yet available in the Rmixmod package but the reader can refer to the Mixmod website http://www.mixmod.org/ for further information.

Categorical variables: Five multinomial models
We now consider data consisting of $n$ objects described by $d$ categorical variables, with respective numbers of levels $m_1, \ldots, m_d$. The data can be represented by $n$ binary vectors $x_i = (x_i^{jh};\, j = 1, \ldots, d;\, h = 1, \ldots, m_j)$ ($i = 1, \ldots, n$) where $x_i^{jh} = 1$ if the object $i$ belongs to the level $h$ of the variable $j$ and $0$ otherwise. Denoting $m = \sum_{j=1}^{d} m_j$ the total number of levels, the data matrix $x = \{x_1, \ldots, x_n\}$ has $n$ rows and $m$ columns. Binary data can be seen as a particular case of categorical data with $d$ dichotomous variables, i.e., $m_j = 2$ for any $j = 1, \ldots, d$.
The latent class model assumes that the $d$ categorical variables are independent given the latent variable: Each $x_i$ arises independently from a mixture of multivariate multinomial distributions (Everitt 1984). In this case we have in Equation 1
$$h(x_i|\alpha_k) = \prod_{j=1}^{d} \prod_{h=1}^{m_j} \left(\alpha_k^{jh}\right)^{x_i^{jh}}, \qquad (4)$$
with $\alpha_k = (\alpha_k^{jh};\, j = 1, \ldots, d;\, h = 1, \ldots, m_j)$. In (4), we recognize the product of $d$ conditionally independent multinomial distributions with parameters $\alpha_k^j = (\alpha_k^{j1}, \ldots, \alpha_k^{jm_j})$. This model may present problems of identifiability (see for instance Goodman 1974) but most situations of interest are identifiable (Allman, Matias, and Rhodes 2009).
In order to propose more parsimonious models, we present the following extension of the parameterization of Bernoulli distributions used by Celeux and Govaert (1991) for clustering and also by Aitchison and Aitken (1976) for kernel discriminant analysis. The basic idea is to impose on the vector $\alpha_k^j$ the condition of having a unique modal value for one of its components, with the other components sharing the remaining probability mass uniformly. Thus, $\alpha_k^j$ takes the form $(\beta_k^j, \ldots, \beta_k^j, \gamma_k^j, \beta_k^j, \ldots, \beta_k^j)$ with $\gamma_k^j > \beta_k^j$. Since $\sum_{h=1}^{m_j} \alpha_k^{jh} = 1$, we have $(m_j - 1)\beta_k^j + \gamma_k^j = 1$ and, consequently, $\beta_k^j = (1 - \gamma_k^j)/(m_j - 1)$. The constraint $\gamma_k^j > \beta_k^j$ finally becomes $\gamma_k^j > 1/m_j$. Equivalently and meaningfully, the vector $\alpha_k^j$ can be reparameterized by a center $a_k^j$ and a dispersion $\varepsilon_k^j$ around this center with the following decomposition:
• Center: $a_k^j = (a_k^{j1}, \ldots, a_k^{jm_j})$ where $a_k^{jh} = 1$ if $h$ indicates the position of $\gamma_k^j$ (in the following, this position will be denoted $h(k,j)$) and $0$ otherwise.
• Dispersion: $\varepsilon_k^j = 1 - \gamma_k^j$, the probability that the data $x_i$, arising from the $k$th component, are such that $x_i^{jh(k,j)} \neq 1$.
Thus, this allows us to give an interpretation similar to the center and the variance matrix used for continuous data in the Gaussian mixture context. The relationship between the initial parameterization and the new one is given by
$$\alpha_k^{jh} = \begin{cases} 1 - \varepsilon_k^j & \text{if } h = h(k,j), \\ \varepsilon_k^j/(m_j - 1) & \text{otherwise.} \end{cases}$$
Equation 4 can then be rewritten with $a_k = (a_k^j;\, j = 1, \ldots, d)$ and $\varepsilon_k = (\varepsilon_k^j;\, j = 1, \ldots, d)$, giving $h(x_i|a_k, \varepsilon_k)$.

Table 2: Number of free parameters of the five multinomial models. We have $\delta = K - 1$, $* = p_k$ in the case of free proportions and $\delta = 0$, $* = p$ in the case of equal proportions.
In the following, this model will be denoted by $[\varepsilon_k^j]$. In this context, three other models can be defined. We denote by $[\varepsilon_k]$ the model where $\varepsilon_k^j$ is independent of the variable $j$, by $[\varepsilon^j]$ the model where $\varepsilon_k^j$ is independent of the component $k$ and, finally, by $[\varepsilon]$ the model where $\varepsilon_k^j$ is independent of both the variable $j$ and the component $k$. In order to maintain some consistency in the notation, we will also denote by $[\varepsilon_k^{jh}]$ the most general model introduced in the previous section. The number of free parameters associated with each model is given in Table 2.
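As a small illustration (our own sketch, independent of Rmixmod), the following R code maps a center/dispersion pair $(a_k^j, \varepsilon_k^j)$ back to the full multinomial parameter vector $\alpha_k^j$ for one variable $j$ with $m_j$ levels:

R> mj <- 4                        # number of levels of variable j
R> h_mode <- 2                    # position h(k, j) of the modal level
R> eps <- 0.1                     # dispersion epsilon_k^j
R> alpha <- rep(eps / (mj - 1), mj)
R> alpha[h_mode] <- 1 - eps       # modal probability gamma_k^j = 1 - eps
R> sum(alpha)                     # the probabilities sum to 1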

EM and EM-like algorithms focus
Estimation of the mixture parameters is performed either through maximization of the log-likelihood (ML)
$$L(\theta) = \sum_{i=1}^{n} \ln f(x_i|\theta)$$
on $\theta$ via the EM algorithm (expectation maximization, Dempster, Laird, and Rubin 1977) or the SEM algorithm (stochastic EM, Celeux and Diebolt 1985), or through maximization of the completed log-likelihood
$$L_c(\theta, z) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \ln\left(p_k\, h(x_i|\alpha_k)\right) \qquad (5)$$
on both $\theta$ and $z$ via the CEM algorithm (classification EM, Celeux and Govaert 1992). We now describe these three algorithms at iteration $q$. The choice of the starting parameter $\theta^{\{0\}}$ and the stopping rules are both described later.
The EM algorithm. It consists of repeating the following E and M steps:
• E step: Compute the conditional probabilities $t_{ik}^{\{q\}} = t_k(x_i|\theta^{\{q\}})$ for $i = 1, \ldots, n$ and $k = 1, \ldots, K$.
• M step: Update the parameter estimate by $\theta^{\{q+1\}} = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik}^{\{q\}} \ln\left(p_k\, h(x_i|\alpha_k)\right)$.

The SEM algorithm. It is a stochastic version of EM incorporating between the E and M steps a so-called S step restoring stochastically the unknown labels $z$:
• S step: Draw each label $z_i^{\{q\}}$ from the multinomial distribution with parameters $(t_{i1}^{\{q\}}, \ldots, t_{iK}^{\{q\}})$, and perform the M step with these drawn labels in place of the $t_{ik}^{\{q\}}$'s.

It is important to notice that SEM does not converge pointwise. It generates a Markov chain whose stationary distribution is more or less concentrated around the ML estimate. A natural estimate from a SEM sequence $(\theta^{\{q\}})_{q=1,\ldots,Q}$ of length $Q$ is either the mean of the sequence or the parameter value leading to the highest log-likelihood in the whole sequence.
The CEM algorithm. It incorporates a classification step between the E and M steps of EM, restoring the unknown labels $z$ by the MAP rule:
• C step: Set $z_i^{\{q\}}$ by applying the MAP rule (2) to the $t_{ik}^{\{q\}}$'s, i.e., assign each observation to its most probable component, and perform the M step with these labels.

CEM leads to inconsistent estimates (Bryant and Williamson 1978; McLachlan and Peel 2000, Section 2.21) but converges faster than EM since it converges in a finite number of iterations. It also makes it possible to retrieve and generalize standard K-means-like criteria, both in the continuous case (Govaert 2009, Chap. 8) and in the categorical case (Celeux and Govaert 1991).
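For concreteness, here is a minimal, self-contained R sketch of the EM iteration described above for a univariate two-component Gaussian mixture. It is our own illustration; Rmixmod implements all of this internally in C++:

R> set.seed(1)
R> x <- c(rnorm(100, 0, 1), rnorm(100, 4, 1))     # simulated two-group data
R> p <- c(0.5, 0.5); mu <- c(-1, 1); s <- c(1, 1) # starting point theta^{0}
R> for (q in 1:50) {
+    # E step: conditional probabilities t_ik (Equation 2)
+    tik <- sapply(1:2, function(k) p[k] * dnorm(x, mu[k], s[k]))
+    tik <- tik / rowSums(tik)
+    # M step: weighted updates of proportions, means and standard deviations
+    nk <- colSums(tik)
+    p  <- nk / length(x)
+    mu <- sapply(1:2, function(k) sum(tik[, k] * x) / nk[k])
+    s  <- sqrt(sapply(1:2, function(k) sum(tik[, k] * (x - mu[k])^2) / nk[k]))
+  }
R> rbind(p, mu, s)                                # fitted mixture parameters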
Remark on the partial labeling case. Mixmod allows partial labeling for all algorithms: It is straightforward since the known labels remain fixed in the E step for all of them. Writing $z = (z_l, z_u)$ with $z_l$ the known labels and $z_u$ the unknown ones, the log-likelihood is then expressed by
$$L(\theta; x, z_l) = \sum_{i:\, z_i \in z_l} \sum_{k=1}^{K} z_{ik} \ln\left(p_k\, h(x_i|\alpha_k)\right) + \sum_{i:\, z_i \in z_u} \ln f(x_i|\theta) \qquad (6)$$
and the completed log-likelihood, denoted now $L_c(\theta, z_u)$, is unchanged.
Remark on duplicated units. In some cases it happens that some units are duplicated; typically, this occurs when the number of possible values for the units is low compared to the sample size. To avoid entering unnecessarily large lists of units, it is also possible to specify a weight $w_i$ for each unit $y_i$ ($i = 1, \ldots, r$). The set $y_w = \{(y_1, w_1), \ldots, (y_r, w_r)\}$ is strictly equivalent to the set with possible replications $x = \{x_1, \ldots, x_n\}$, and we have the relation $n = w_1 + \ldots + w_r$.
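For instance, a dataset with duplicated rows can be collapsed into (unit, weight) pairs with base R before being passed to Mixmod through its weight option (a sketch with made-up data):

R> x <- data.frame(v1 = c(1, 1, 2, 1), v2 = c(0, 0, 1, 0))   # n = 4 units
R> yw <- aggregate(list(w = rep(1L, nrow(x))), by = x, FUN = sum)
R> yw    # r = 2 distinct units with their weights; sum(yw$w) equals nrow(x)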
Remark on spurious solutions. In the Gaussian case, some solutions with (finite) high log-likelihood value can be uninteresting for the user since they correspond to ill-conditioned estimates of covariance matrices for some mixture components. This corresponds to so-called spurious situations (McLachlan and Peel 2000, Sections 3.10 and 3.11). As far as we know such spurious solutions cannot be detected automatically and have to be discarded by hand.

Strategies for using EM and CEM
Both the likelihood and the completed likelihood functions usually suffer from multiple local maxima in which the EM and CEM algorithms can be trapped. For some runs, in particular with EM, the objective function may also evolve slowly over a long period.
Notice that SEM is not affected by local maxima since it does not converge pointwise, but slow evolution towards the stationary distribution cannot be excluded in some cases.
In order to avoid such drawbacks, Mixmod can act in three ways: chained algorithms, starting strategies and stopping rules. More details can be found in the Mixmod reference manual (Mixmod Team 2008).
Chained algorithms strategies. The three algorithms EM, CEM and SEM can be chained to obtain original fitting strategies (e.g., CEM then EM with results of CEM) taking advantage of each of them in the estimation process.
Initialization strategies. The available initialization procedures are:
• "random": Initialization from a random position is a standard way to initialize an algorithm. This random initial position is obtained by choosing centers at random from the dataset. This simple strategy is repeated several times from different random positions and the position maximizing the likelihood or the completed likelihood is selected.
• "smallEM": A predefined number of EM iterations is split into several short runs of EM launched from random positions. By a short run of EM, we mean that we do not wait for complete convergence but we stop it as soon as the log-likelihood growth is small in comparison to a predefined crude threshold (see details in Biernacki, Celeux, and Govaert 2003). Indeed, it appears that repeating runs of EM is generally profitable since using a single run of EM can often lead to suboptimal solutions.
• "CEM": A given number of repetitions of a given number of iterations of the CEM algorithm is run. One advantage of initializing an algorithm with CEM lies in the fact that CEM converges generally in a small number of iterations. Thus, without consuming a large amount of CPU times, several runs of CEM are performed. Then EM (or CEM) is run with the best solution among all repetitions.
• "SEMMax": A given number of SEM iterations is run. The idea is that a SEM sequence is expected to enter rapidly into the neighborhood of the global maximum of the likelihood function.
Stopping rule strategies. There are two ways to stop an algorithm:
• "nbIterationInAlgo": All algorithms can be stopped after a pre-defined number of iterations.
• "epsilonInAlgo": EM and CEM can be stopped when the relative change of the criterion at hand (L or L c ) is small.

Purpose dependent model selection
It is of high interest to automatically select a model or the number $K$ of mixture components. However, choosing a sensible mixture model is highly dependent on the modeling purpose. Before describing the available criteria, it can be noted that if no information on $K$ is available, it is recommended to vary it between 1 and the smallest integer larger than $n^{0.3}$ (Bozdogan 1993).
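In R, this rule of thumb for the candidate numbers of components simply reads (our own one-liner):

R> n <- 272                       # e.g., a sample size of 272 observations
R> nbCluster <- 1:ceiling(n^0.3)  # candidate values of K, here 1 to 6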

Density estimation
If a density estimation perspective is pursued, the BIC must be preferred. It consists of choosing the model and/or $K$ minimizing
$$\mathrm{BIC} = -2 L(\hat\theta) + \nu \ln n,$$
with $\hat\theta$ the ML estimate and $\nu$ the number of free parameters estimated. The BIC is an asymptotic approximation of the integrated likelihood, valid under regularity conditions, and has been proposed by Schwarz (1978). Despite the fact that these regularity conditions are not fulfilled for mixtures, it has been proved that the criterion BIC is consistent if the likelihood remains bounded (Keribin 2000), and it has been shown to be efficient on practical grounds (see for instance Fraley and Raftery 1998).
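As a minimal illustration with hypothetical values (Rmixmod reports this quantity directly as its "BIC" criterion value):

R> logLik <- -1157.3              # maximized log-likelihood L(theta-hat), made up here
R> nu <- 11                       # number of free parameters of the model
R> n <- 272                       # sample size
R> bic <- -2 * logLik + nu * log(n)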

Unsupervised classification
In the unsupervised setting, three criteria are available: BIC, ICL and NEC. But when pursuing a cluster analysis perspective, ICL and NEC can provide more parsimonious answers.
The integrated likelihood does not take into account the ability of the mixture model to give evidence for a clustering structure of the data. An alternative is to consider the integrated completed likelihood. Asymptotic considerations lead to the ICL criterion, to be minimized (Biernacki, Celeux, and Govaert 2000):
$$\mathrm{ICL} = -2 L_c(\hat\theta, \hat z) + \nu \ln n = \mathrm{BIC} - 2 \sum_{i=1}^{n} \sum_{k=1}^{K} \hat z_{ik} \ln t_k(x_i|\hat\theta),$$
where $\hat z$ denotes the labels obtained by the MAP rule (2) applied to $\hat\theta$. Notice that both expressions of ICL above allow one to consider ICL either as $L_c$ penalized by the model complexity or as BIC penalized by an entropy term measuring the mixture component overlap.
The NEC criterion measures the ability of a mixture model to provide well separated clusters and is derived from a relation highlighting the differences between the maximum likelihood approach and the classification maximum likelihood approach to the mixture problem. It is defined by
$$\mathrm{NEC}_K = \frac{E_K}{L_K - L_1}, \qquad E_K = -\sum_{i=1}^{n} \sum_{k=1}^{K} t_k(x_i|\hat\theta_K) \ln t_k(x_i|\hat\theta_K) \geq 0,$$
with $\hat\theta_K$ the ML estimate of $\theta$ for $K$ components and $L_K = L(\hat\theta_K)$. The index $K$ is used to highlight that NEC is essentially devoted to choosing the number of mixture components $K$, not the model parameterization (Celeux and Soromenho 1996; Biernacki, Celeux, and Govaert 1999). The chosen value of $K$ corresponds to the lowest value of NEC.

Supervised classification
In the supervised setting, note that only the model (not the number of mixture components) has to be selected. Two criteria are proposed in this situation: BIC and cross-validation. For BIC, the completed log-likelihood (5), where $z$ is fixed to its known value, has to be used. The cross-validation criterion (CV) is valid only in the discriminant analysis (supervised) context, and the model leading to the highest CV criterion value is selected. Cross-validation is a resampling method which can be summarized as follows: Consider $V$ random splits of the whole dataset $(x, z)$ into blocks $(x, z)^{(v)}$ ($v = 1, \ldots, V$) of approximately equal sizes. The CV criterion is then defined by
$$\mathrm{CV} = \frac{1}{n} \sum_{v=1}^{V} \sum_{i \in I_v} \delta\left(\hat z_i(\hat\theta^{(v)}), z_i\right),$$
where $I_v$ denotes the indices $i$ of the data included in $(x, z)^{(v)}$, $\delta$ corresponds to the 0-1 cost and $\hat z_i(\hat\theta^{(v)})$ denotes the group to which $x_i$ is assigned when designing the assignment rule from the entire dataset $(x, z)$ without $(x, z)^{(v)}$. When $V = n$, cross-validation is known as the leave-one-out procedure, and, in this case, a fast estimation of the $n$ discriminant rules is implemented in Mixmod in the Gaussian case.

Semi-supervised classification
Two criteria are available in the semi-supervised context (supervised purpose): BIC and CV. For BIC, the partially labeled log-likelihood (6) has to be used. For CV, the whole dataset is split at random into $V$ blocks of approximately equal sizes, including both the labeled and the unlabeled units, to obtain unbiased estimates of the error rate (Vandewalle, Biernacki, Celeux, and Govaert 2010). However, note that the CV criterion is quite expensive to compute in the semi-supervised setting since it requires running an EM algorithm $V$ times to estimate the $\hat\theta^{(v)}$'s.

Mixmod library implementation and related packages
The Mixmod library

The Mixmod core library (mixmodLib) is the main product of the Mixmod software package. Developed since 2001, it has been downloaded from the Mixmod web site http://www.mixmod.org/ about 300 times per year. Distributed under the GNU GPL license, mixmodLib has been enhanced and improved for years (Biernacki et al. 2006). Important work has been done to improve the performance of mixmodLib, which can today treat very large datasets quickly, with accuracy and robustness. Currently, arbitrary "hard" limits for the sample size and for the number of variables are fixed to 1 000 000 and 10 000, respectively. It is possible to change them, but this requires recompiling the source code. The user must also be aware that reaching these limits in practice will essentially depend on the available computing resources.
It contains about 80 C++ classes and can be used from the command line or can be interfaced with any other software package or library (in accordance with the terms of the GNU GPL license). Some of these C++ classes (top level classes) have been created to easily interface mixmodLib. Clustering can be performed with the top level 'XEMClusteringMain' class (using the 'XEMClusteringInput' and 'XEMClusteringOutput' classes), and discriminant analysis with the 'XEMLearnMain' class (using the 'XEMLearnInput' and 'XEMLearnOutput' classes) for the first step (learning) and the 'XEMPredictMain' class (using the 'XEMPredictInput' and 'XEMPredictOutput' classes) for the second step (prediction).
The Rmixmod package also uses the Rcpp package (Eddelbuettel and François 2011), which provides C++ classes that greatly facilitate interfacing C or C++ code in R packages. Since the Rcpp package works only on R versions 2.15 and above, an up-to-date version of R is required for a smooth installation of the package.

Existing related packages
To provide a suitable product for an increasingly large and diverse audience, the Mixmod team has developed four products, available at http://www.mixmod.org/: • mixmodLib (developed since 2001), the core library which can be interfaced with any other software package and can also be used from the command line (for expert users).
• mixmodForMatlab (developed since 2002), a collection of MATLAB functions to call mixmodLib supplemented by some functions to visualize results.
• mixmodGUI (developed since 2009), a very user-friendly software package which provides all the clustering functionalities of mixmodLib; we plan to also make the discriminant analysis functionalities available soon.
• Rmixmod, the R package presented in this paper, which interfaces mixmodLib with R.

Unsupervised classification and density estimation
Cluster analysis can be performed with the function mixmodCluster(). An illustration of the use of this function is given in Section 4.1. This function has two mandatory arguments: a data frame x and a list of numbers of groups. Default values for the model and the strategy will be used unless users specify a list of models with the models option (see Section 3.2) or a new strategy with the strategy option (see Section 3.3). By default only the BIC criterion is used to select models, but users can supply a list of criteria by using the criterion option. In Table 3 the reader will find a summary of all the input parameters of the mixmodCluster() function, with default values for the non-mandatory parameters.

Table 3: List of all the input parameters of the mixmodCluster() function.
• data: Data frame containing quantitative or qualitative data. Rows correspond to observations and columns correspond to variables.
• nbCluster: Numeric vector indicating the numbers of clusters.
• dataType: Character indicating the type of data, either "quantitative" or "qualitative". Set to NULL by default; the type will then be guessed from the variable types.
• models: A 'Model' object defining the list of models to run. For quantitative data, the model "Gaussian_pk_Lk_C" is used by default (see mixmodGaussianModel() in Section 3.2 for specifying other models). For qualitative data, the model "Binary_pk_Ekjh" is used by default (see mixmodMultinomialModel() in Section 3.2 for specifying other models).
• strategy: A 'Strategy' object containing the strategy to run. By default mixmodStrategy() (see Section 3.3) is called.
• criterion: Character vector defining the criterion used to select the best model. The best model is the one with the lowest criterion value. Possible values: "BIC", "ICL", "NEC", or c("BIC", "ICL", "NEC"). Default is "BIC".
• weight: Optional numeric vector with n (number of individuals) elements, to be used when weights are associated with the data.
• knownLabels: Numeric vector of size n, used for semi-supervised classification when some labels are known. Each element corresponds to a cluster assignment.
The mixmodCluster() function returns an instance of the 'MixmodCluster' class. Its two output attributes contain all the results:
• results: A list of 'MixmodResults' objects containing all the results sorted in ascending order according to the given criterion.
• bestResult: A 'MixmodResults' object containing the best model results according to the given criterion.
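A typical call looks as follows (a minimal sketch using the geyser dataset shipped with the package; see Section 4 for complete examples):

R> library(Rmixmod)
R> data("geyser", package = "Rmixmod")
R> xem <- mixmodCluster(geyser, nbCluster = 2:4, criterion = c("BIC", "ICL"))
R> summary(xem)
R> xem["bestResult"]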

Supervised and semi-supervised classification
Supervised and semi-supervised classification can be performed using the mixmodLearn() and the mixmodPredict() functions. Both functions are illustrated in Section 4.2.
mixmodLearn() function. It has two mandatory arguments: a data matrix x and a vector containing the known labels z. As for the mixmodCluster() function, the three arguments models, weight and criterion are available. The default criterion is CV (cross-validation).
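For instance (a sketch using the classical iris data shipped with base R, not one of the paper's examples):

R> learn <- mixmodLearn(iris[, 1:4], knownLabels = as.integer(iris$Species),
+    criterion = c("CV", "BIC"))
R> summary(learn)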

Table 4: List of all the input parameters of the mixmodLearn() function.
• data: Data frame containing quantitative or qualitative data. Rows correspond to observations and columns correspond to variables.
• knownLabels: Numeric vector of size equal to the number of observations. Each element corresponds to a cluster assignment. The maximum value corresponds to the number of clusters.
• dataType: Character indicating the type of data, either "quantitative" or "qualitative". Set to NULL by default; the type will then be guessed from the variable types.
• models: A 'Model' object defining the list of models to run. For quantitative data, the model "Gaussian_pk_Lk_C" is used by default (see mixmodGaussianModel() in Section 3.2 for specifying other models). For qualitative data, the model "Binary_pk_Ekjh" is used by default (see mixmodMultinomialModel() in Section 3.2 for specifying other models).
• criterion: Character vector defining the criterion used to select the best model. Possible values: "BIC", "CV" or c("CV", "BIC"). Default is "CV".
• nbCVBlocks: Integer value defining the number of blocks used to perform the cross-validation. This value is ignored if the CV criterion is not chosen. Default value is 10.
• weight: Optional numeric vector with n (number of individuals) elements, to be used when weights are associated with the data.

In Table 4 the reader will find a summary of all the input parameters of the mixmodLearn() function and default values for the non-mandatory parameters.

Table 5: List of all the input parameters of the mixmodPredict() function.
• data: Data frame containing quantitative or qualitative data. Rows correspond to observations and columns correspond to variables.
• classificationRule: A 'MixmodResults' object which contains the classification rule computed in the mixmodLearn() or mixmodCluster() step.
The mixmodLearn() function returns an instance of the 'MixmodLearn' class. Its two output attributes contain all the results:
• results: A list of 'MixmodResults' objects containing all the results sorted in ascending order according to the given criterion (in descending order for the CV criterion).
• bestResult: A 'MixmodResults' object containing the best model results according to the given criterion.
mixmodPredict() function. It only needs two arguments: a data matrix of the remaining observations and a classification rule (see Table 5). It returns an instance of the 'MixmodPredict' class which contains predicted partitions and probabilities.
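The learning and prediction steps chain together as follows (a sketch continuing the learn object built from the iris data above, here predicting on the learning data themselves):

R> pred <- mixmodPredict(data = iris[, 1:4],
+    classificationRule = learn["bestResult"])
R> summary(pred)
R> mean(pred["partition"] == as.integer(iris$Species))   # empirical accuracy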

Continuous variables: Gaussian models
All the Gaussian models summarized in Table 1 are available in Rmixmod. Users can get all the 28 models by calling mixmodGaussianModel().
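Calls can also be restricted to a subset of models; for instance (a sketch using arguments documented in the package):

R> all.gaussian <- mixmodGaussianModel()          # all 28 models
R> only.general <- mixmodGaussianModel(family = "general",
+    free.proportions = FALSE)                    # general family, equal proportions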

Categorical variables: Multinomial models
All the multinomial models summarized in Table 2 are available in Rmixmod. Users can get all the 10 models by calling mixmodMultinomialModel().
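Similarly, one can request all models or an explicit list (a sketch):

R> all.multinomial <- mixmodMultinomialModel()    # all 10 models
R> some.models <- mixmodMultinomialModel(listModels = c("Binary_pk_E",
+    "Binary_p_E"))                               # an explicit list of models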

Companion function for maximum likelihood estimation strategies
The strategies described in Section 2.3 can be tuned using the mixmodStrategy() function. The mixmodStrategy() function has no mandatory arguments and the default arguments are the ones specified in the Mixmod documentation (Mixmod Team 2008). In Table 6 the reader will find a summary of all the input parameters of the mixmodStrategy() function.

Table 6: List of all the input parameters of the mixmodStrategy() function.
• algo: Character vector with the estimation algorithm(s). Possible values: "EM", "SEM", "CEM", c("EM", "SEM"). Default value: "EM".
• nbTry: Integer value defining the number of tries. nbTry must be a positive integer. Default value: 1.
• initMethod: Character value defining the method used to initialize the algorithm specified in the algo argument. Possible values: "random", "smallEM", "CEM", "SEMMax". Default value: "smallEM".
• nbTryInInit: Integer value defining the number of tries in the initMethod algorithm. nbTryInInit must be a positive integer. Only available if initMethod is "smallEM" or "CEM". Default value: 50.
• nbIterationInInit: Integer value defining the number of "EM" or "SEM" iterations in initMethod. nbIterationInInit must be a positive integer. Only available if initMethod is "smallEM" or "SEMMax". Default values: 5 if initMethod is "smallEM" and 100 if initMethod is "SEMMax".
• nbIterationInAlgo: Integer vector defining the number of iterations if nbIteration is used as a stopping rule for the algorithm(s). Default value: 200.
• epsilonInInit: Numeric value defining the epsilon value in the initialization step. Only available if initMethod is "smallEM". Default value: 0.001.
• epsilonInAlgo: Numeric vector defining the epsilon value for the algorithm(s). Warning: epsilonInAlgo does not make sense if algo is "SEM", so it needs to be set to NaN in that case. Default value: 0.001.
• seed: Random seed used in the random number generator. Default value: NULL.
R> strategy1 <- mixmodStrategy(algo = "CEM", initMethod = "random",
+    nbTry = 10, epsilonInInit = 0.000001)
R> strategy2 <- mixmodStrategy(algo = c("SEM", "EM"),
+    nbIterationInAlgo = c(200, 100), epsilonInAlgo = c(NA, 1e-04),
+    seed = 2408)

It is well-known that the number of local maxima of the log-likelihood function increases with the number of parameters to be estimated. In such a situation, Rmixmod is able to avoid these traps by tuning the input parameters above, in particular by increasing the value of nbTry. The need for this adjustment is easily detected by the fact that multiple re-runs of Rmixmod provide unstable results.
However, the ability of Rmixmod to provide a good search over the log-likelihood function increases the frequency of spurious solutions in the Gaussian case (see the description in Section 2.3.1). Such solutions are then easily discarded by hand.
In addition, for reproducibility purposes, note that Rmixmod also allows exact control of the random seed through the optional seed argument (see Table 6).
sortByCriterion(): After calling the mixmodCluster() or mixmodLearn() method, results are sorted in ascending order according to the first given criterion (in descending order for the CV criterion). This method makes it possible to reorder the list of results according to another given criterion. The input parameters are:
• object: a 'Mixmod' object;
• criterion: a string containing the criterion name.
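For instance, continuing the hypothetical xem object from Section 3.1:

R> xem.icl <- sortByCriterion(xem, "ICL")
R> xem.icl["results"]    # results now reordered according to ICL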
Most of these functions will be illustrated in Section 4.

Graphical functions
Methods for plot, hist and barplot have been implemented for the Rmixmod S4 class 'MixmodResults'. hist is specific to quantitative data and barplot to qualitative data. All these functions are also illustrated in Section 4.
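The call producing the xem.geyser object below is not reproduced in this excerpt; a minimal sketch consistent with Section 3.1 would be:

R> data("geyser", package = "Rmixmod")
R> xem.geyser <- mixmodCluster(geyser, nbCluster = 2:8,
+    criterion = c("BIC", "ICL", "NEC"))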
The output of plot(xem.geyser) is displayed in Figure 1.

Categorical variables: Birds of different subspecies
The birds dataset (Bretagnolle 2007) provides details on the morphology of birds (puffins). Each bird is described by five qualitative variables: one variable for the gender and four variables giving a morphological description of the birds. There are 69 puffins divided into two sub-classes: lherminieri and subalaris (34 and 35 individuals, respectively). Here we run a cluster analysis of the birds with 2 clusters:

R> data("birds", package = "Rmixmod")
R> xem.birds <- mixmodCluster(birds, 2,
+    strategy = mixmodStrategy(seed = 2408))

The plot() method gives the following output in the qualitative case: A multiple correspondence analysis is performed to get a 2-dimensional representation of the dataset, and a bigger symbol is used when observations are similar. The output of plot(xem.birds) is displayed in Figure 2(a).
A barplot() method has also been defined. For each qualitative variable, we obtain:
• a barplot with the frequencies of the modalities;
• for each cluster, a barplot with the probabilities of each modality in that cluster.
The output of barplot(xem.birds) is displayed in Figure 2(b).

Supervised classification
The following example concerns quantitative data but, obviously, discriminant analysis in Rmixmod also works with qualitative datasets.
The outputs and graphs of discriminant analysis with Rmixmod are illustrated using an example where the aim is to predict a company's ability to cover its financial obligations (Jardin and Séverin 2010; Lourme and Biernacki 2011). This is an important question that requires a profound knowledge of the mechanisms leading to bankruptcy. The first sample (year 2002) is made up of 216 healthy firms and 212 insolvent firms. The second sample (year 2003) is made up of 241 healthy firms and 220 insolvent firms. Four financial ratios expected to provide meaningful information about a company's financial health are considered: EBITDA/Total Assets, Value Added/Total Sales, Quick Ratio, Accounts Payable/Total Sales.
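A sketch of the corresponding two-step workflow follows. The column names Year and Health are our assumptions about the layout of the finance data shipped with the package; the actual example may differ in details:

R> data("finance", package = "Rmixmod")
R> # learning step on the 2002 firms (column names assumed here)
R> train <- subset(finance, Year == 2002)
R> test  <- subset(finance, Year == 2003)
R> learn.finance <- mixmodLearn(train[, -(1:2)],
+    knownLabels = as.integer(train$Health))
R> # prediction step on the 2003 firms
R> pred.finance <- mixmodPredict(data = test[, -(1:2)],
+    classificationRule = learn.finance["bestResult"])
R> mean(pred.finance["partition"] == as.integer(test$Health))  # accuracy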

Further works
The Rmixmod package interfaces almost every functionality of the Mixmod library. Some particular initialization strategies and the models dealing with high-dimensional data have not been implemented in the package. But the initialization strategies of most interest are available in Rmixmod, and the package HDclassif (Bergé, Bouveyron, and Girard 2012) has recently been released for clustering and discriminant analysis of high-dimensional data.
We have proposed some tools to visualize outcomes, but data visualization in Rmixmod can still be enhanced. In addition, the supervised and semi-supervised classification currently implemented could be greatly improved, for instance by including a variable selection procedure (see Maugis, Celeux, and Martin-Magniette 2011). Moreover, we encourage users to contribute by suggesting new graphics or other utility functions.
Within the Mixmod project, some other recent advances in model-based clustering are currently being implemented in order to provide associated efficient R packages. This concerns for instance co-clustering (partitioning the rows and columns of a dataset simultaneously) and the clustering of mixed data (dealing with quantitative and qualitative data in the same analysis). The next versions of Rmixmod will include these functionalities.