MixSim : An R Package for Simulating Data to Study Performance of Clustering Algorithms

The R package MixSim is a new tool that allows simulating mixtures of Gaussian distributions with different levels of overlap between mixture components. Pairwise overlap, defined as the sum of two misclassification probabilities, measures the degree of interaction between components and can be readily employed to control the clustering complexity of datasets simulated from mixtures. These datasets can then be used for systematic performance investigation of clustering and finite mixture modeling algorithms. Other capabilities of MixSim include computing the exact overlap for Gaussian mixtures, simulating Gaussian and non-Gaussian data, simulating outliers and noise variables, calculating various measures of agreement between two partitionings, and constructing parallel distribution plots for the graphical display of finite mixture models. All features of the package are illustrated in great detail. The utility of the package is highlighted through a small comparison study of several popular clustering algorithms.


Introduction
The main goal of clustering is to form groups of similar observations while also separating dissimilar ones. Many clustering algorithms have been developed, such as the iterative k-means (Forgy 1965; MacQueen 1967) and k-medoids (Kaufman and Rousseeuw 1990) algorithms; hierarchical (both agglomerative and divisive) algorithms with different merging/splitting criteria called linkages, e.g., Ward's (Ward 1963), single (Sneath 1957), complete (Sorensen 1948), and others; and the probabilistic model-based clustering algorithms, where the observations are assumed to be sampled from an underlying finite mixture model. For a comprehensive review of clustering methods, see Xu and Wunsch (2009).
Clustering is a difficult problem with, consequently, many suggested methodologies, but no uniformly best performer in all situations. Thus, it is of interest to understand the scenarios in which different clustering algorithms perform better, worse, or indifferently. This requires an objective measure that would allow the clustering complexity of a simulated dataset to be controlled and different procedures to be calibrated with regard to this complexity. Performance of clustering algorithms depends on many factors such as cluster representation, orientation, elongation, multimodality, overlap, presence of noise, and others. The majority of procedures aiming to control clustering complexity focus on finding a good measure for the level of separation or overlap among clusters.
There have been many attempts to define clustering complexity and to evaluate clustering algorithms in different settings. We refer the reader to Steinley and Henson (2005) and Maitra and Melnykov (2010) for a comprehensive review, while focusing our discussion here on a few of the most recent proposals. Dasgupta (1999)'s c-separation for two p-dimensional Gaussian densities with mean vectors µ_i and covariance matrices Σ_i for i = 1, 2 holds if c ≤ ||µ_1 − µ_2|| / √(p max{λ^(p)(Σ_1), λ^(p)(Σ_2)}), where λ^(p)(Σ_i) represents the largest eigenvalue of the corresponding covariance matrix Σ_i. c-separation is widely used by researchers but cannot serve as an adequate measure of the interaction between distributions, as it does not take the structure of the two covariance matrices into consideration, relying only on their largest eigenvalues and the norm of the difference of the two mean vectors. Another approach to generating clustered data was proposed by Steinley and Henson (2005). The algorithm is implemented in the MATLAB function OCLUS and conveniently generates clusters of different shapes from various distributions. However, there are some limitations of this approach as well: clusters are marginally independent by construction and the number of overlapping clusters is restricted. A novel separation index was proposed by Qiu and Joe (2006b) and implemented in their package clusterGeneration (Qiu and Joe 2006a) for R (R Development Core Team 2012). In a univariate setting, the developed index is calculated as (q_{l,2} − q_{u,1})/(q_{u,2} − q_{l,1}), where q_{l,1} and q_{u,1} are the lower and upper quantiles of the first cluster and q_{l,2} and q_{u,2} are the corresponding quantiles of the second cluster. The probability associated with the quantiles is an additional parameter, commonly chosen to be equal to 0.05. When two clusters are distant, this index takes values close to 1. It can take negative values, but no less than −1, for clusters that are poorly separated.
The advantage of this index is its simplicity and applicability to clusters of any shape. At the same time, while being exact in the univariate framework, this index cannot be readily generalized to the case of two or more dimensions. The authors propose first finding the best projection onto a one-dimensional space that maximizes the separation of clusters. Then, their index can be employed as in the univariate case. Unfortunately, a substantial amount of information can be lost while projecting multidimensional clusters; hence, conclusions can be partial or incorrect. Very recently, Maitra and Melnykov (2010) have provided the only known exact measure capable of quantifying the interaction between two clusters: the pairwise overlap ω, which is defined for Gaussian mixtures as the sum of two misclassification probabilities and can be calculated in the univariate as well as the multivariate framework. More details about this measure can be found in Section 2. Maitra and Melnykov (2010) also developed algorithms which can be used to simulate data from a finite mixture model with specified clustering complexity. Further, they developed the open-source CARP package (Melnykov and Maitra 2011) to evaluate clustering algorithms from a command-line interface. This paper details the use and applicability of MixSim, an R package with kernel written in C. The package is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=MixSim and allows for the simulation of mixtures with a pre-specified level of average and/or maximum pairwise overlap. Thus, the package can be employed by the user for convenient and systematic comparisons of existing and newly developed clustering procedures. We discuss the use of this package in Section 3, following up with a small utility demonstrator in Section 4. The paper concludes with a discussion in Section 5.

Methodological and algorithmic details
This section briefly defines the basics of the overlap measure and discusses the two mixture simulation schemes adopted by MixSim. It also summarizes several indices frequently used for comparing two partitionings.

Pairwise overlap
As mentioned in Section 1, the notion of pairwise overlap was recently introduced by Maitra and Melnykov (2010). We refer the reader to their paper for a detailed analysis of the measure and its properties. Here, we briefly explain the mechanics behind the measure.
Let X be distributed according to the finite mixture model g(x) = Σ_{k=1}^K π_k φ(x; µ_k, Σ_k), where φ(x; µ_k, Σ_k) is a multivariate Gaussian density of the kth component with mean vector µ_k and covariance matrix Σ_k. Then, the overlap between the ith and jth components is defined as ω_ij = ω_{i|j} + ω_{j|i}, where ω_{j|i} is the misclassification probability that the random variable X originated from the ith component but was mistakenly assigned to the jth component; ω_{i|j} is defined similarly. Thus, ω_{j|i} is given by

ω_{j|i} = Pr[ π_i φ(X; µ_i, Σ_i) < π_j φ(X; µ_j, Σ_j) | X ~ N_p(µ_i, Σ_i) ].

Overlap has nice closed-form expressions in some special cases. For example, when π_i = π_j as well as Σ_i = Σ_j ≡ Σ, we obtain

ω_ij = 2Φ( −(1/2) √( (µ_i − µ_j)' Σ^{−1} (µ_i − µ_j) ) ),

where Φ is the standard normal cumulative distribution function. For spherical clusters with Σ = σ²I, the above reduces to ω_ij = 2Φ( −||µ_i − µ_j|| / (2σ) ). In general, the misclassification probabilities are given by the distribution of a linear combination of noncentral χ² and normal random variables: ω_{j|i} is the probability that

Σ_{l: λ_l ≠ 1} (λ_l − 1) U_l + 2 Σ_{l: λ_l = 1} δ_l W_l  ≤  Σ_{l: λ_l ≠ 1} λ_l δ_l² / (λ_l − 1) − Σ_{l: λ_l = 1} δ_l² + log( π_j² |Σ_i| / (π_i² |Σ_j|) ),

where λ_1, λ_2, . . . , λ_p are the eigenvalues of the matrix Σ_i^{1/2} Σ_j^{−1} Σ_i^{1/2} and γ_1, γ_2, . . . , γ_p are the corresponding eigenvectors, the U_l's are independent noncentral χ² random variables with one degree of freedom and noncentrality parameter λ_l² δ_l² / (λ_l − 1)², with δ_l = γ_l' Σ_i^{−1/2} (µ_i − µ_j), independent of the W_l's, which are independent N(0, 1) random variables. This provides an efficient way of calculating ω_{i|j} and ω_{j|i} due to algorithm AS 155 (Davies 1980), which computes probabilities for linear combinations of noncentral χ² random variables.
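The spherical special case can be checked numerically in a few lines of base R. The function below is an illustrative sketch, not code from the package; the mean vectors and σ are arbitrary example values.

```r
# Sketch: pairwise overlap for two spherical Gaussian components with equal
# mixing proportions and common covariance sigma^2 * I (illustrative values).
omega.spherical <- function(mu1, mu2, sigma) {
  d <- sqrt(sum((mu1 - mu2)^2))   # Euclidean distance between the mean vectors
  2 * pnorm(-d / (2 * sigma))     # omega_ij = 2 * Phi(-||mu1 - mu2|| / (2 sigma))
}

omega.spherical(c(0, 0), c(3, 4), sigma = 1)  # distance 5: 2 * pnorm(-2.5), about 0.0124
```

As expected, the overlap shrinks as the means move apart and grows as σ increases.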
Note that if the covariance matrices are multiplied by some positive constant c, it causes inflation (c > 1) or deflation (c < 1) of the components. Thus, we can manipulate the value of c in order to reach the pre-specified level of overlap ω_ij(c) between the components. According to Maitra and Melnykov (2010), the function ω_ij(c) does not have to be monotone increasing; however, ω_ij(c) enjoys monotonicity in the overwhelming majority of simulated mixtures. In those rare cases when monotonicity is violated, the simplest solution is to drop the current mixture and simulate a new one.
If c → ∞ and clusters are heterogeneous, the above expression reduces to

ω∞_{j|i} = Pr[ Σ_{l: λ_l ≠ 1} (λ_l − 1) U_l ≤ log( π_j² |Σ_i| / (π_i² |Σ_j|) ) ],

where the U_l's are independent central χ² random variables with one degree of freedom. Maitra and Melnykov (2010) do not discuss the case of homogeneous clusters, but it can also be addressed as follows. If all clusters are homogeneous (Σ_i = Σ_j ≡ Σ), the expression for ω_{j|i} reduces to

ω_{j|i} = Φ( −(1/2) √(δ'δ) + log(π_j/π_i) / √(δ'δ) ),

where δ = Σ^{−1/2}(µ_i − µ_j). When c → ∞, δ_l² → 0 for l = 1, 2, . . . , p. This yields ω∞_{j|i} = 0 for π_j < π_i, ω∞_{j|i} = 1/2 for π_j = π_i, and ω∞_{j|i} = 1 for π_j > π_i. This indicates that the value of asymptotic overlap for homogeneous clusters is ω∞_ij = ω∞_{j|i} + ω∞_{i|j} = 1 for any mixing proportions π_i and π_j.
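The limiting behavior for homogeneous clusters can be verified numerically. The sketch below uses the standard equal-covariance misclassification probability, with the Mahalanobis distance shrinking as d/√c when the common covariance is scaled by c; all values are illustrative.

```r
# omega_{j|i} for homogeneous clusters as a function of mixing proportions and
# the Mahalanobis distance d between the means (illustrative sketch).
omega.hom <- function(pi.i, pi.j, d) pnorm(log(pi.j / pi.i) / d - d / 2)

d0 <- 3                               # distance at c = 1
for (c in c(1, 100, 1e4, 1e8)) {
  dc <- d0 / sqrt(c)                  # distance after scaling Sigma by c
  cat(sprintf("c = %.0e: pi_j < pi_i -> %.3f, equal -> %.3f, pi_j > pi_i -> %.3f\n",
              c, omega.hom(0.6, 0.4, dc), omega.hom(0.5, 0.5, dc),
              omega.hom(0.4, 0.6, dc)))
}
# As c grows, the three probabilities approach 0, 1/2, and 1 respectively.
```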

Mixture model and data generation
Having detailed the pairwise overlap and its implementation in Section 2.1, the overlap can now be employed to control the degree of interaction among mixture components. Before we proceed to the next section describing algorithms for generating mixtures with a pre-specified overlap characteristic, we discuss how mixture model parameters are simulated.
Mean vectors of Gaussian mixture components µ_k are obtained as K independent realizations from a uniform p-variate hypercube with bounds specified by the user. Covariance matrices Σ_k are taken as draws from the Wishart distribution with parameter p and p + 1 degrees of freedom. The low number of degrees of freedom provides us with random covariance matrices of different orientation and elongation. Finally, mixing proportions π_k are generated on the [0, 1] interval subject to the restriction Σ_{k=1}^K π_k = 1, with the lower bound pre-specified by the user. To simulate a dataset from a generated mixture, first, cluster sizes are obtained as a draw from a multinomial distribution based on the mixing proportions. Then, the corresponding number of realizations are obtained from each multivariate normal component. Maitra and Melnykov (2010) provided two algorithms that simulate Gaussian mixtures according to pre-specified values of average (ω̄) and/or maximum (ω̌) overlap. The algorithms are briefly summarized below.
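The generation scheme just described can be sketched in base R (variable names here are illustrative; MixSim's own implementation is in C):

```r
# Sketch of the parameter-generation scheme: uniform means, Wishart
# covariances with p + 1 degrees of freedom, bounded mixing proportions,
# multinomial cluster sizes, and Gaussian draws via a Cholesky factor.
set.seed(1)
K <- 3; p <- 2; PiLow <- 0.15

Mu <- matrix(runif(K * p, 0, 1), nrow = K)            # K mean vectors in the unit hypercube
S  <- stats::rWishart(K, df = p + 1, Sigma = diag(p)) # K covariance matrices, a p x p x K array

repeat {                                              # proportions with a lower bound
  Pi <- prop.table(runif(K))
  if (min(Pi) >= PiLow) break
}

n.k <- drop(rmultinom(1, size = 300, prob = Pi))      # cluster sizes
X <- do.call(rbind, lapply(1:K, function(k) {         # sample each Gaussian component
  Z <- matrix(rnorm(n.k[k] * p), ncol = p)
  sweep(Z %*% chol(S[, , k]), 2, Mu[k, ], "+")
}))
```

The rejection loop for the proportions mirrors the role of PiLow in MixSim(): candidate proportions are redrawn until the minimum exceeds the bound.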

Generate mixture model controlling the average or maximum overlap
The first algorithm generates a mixture model with respect to the level of average or maximum overlap. The algorithm consists of the following three steps.
1. Generating initial components. Generate K mixing proportions, mean vectors, and covariance matrices as outlined in Section 2.2.
2. Calculating pairwise overlaps. Compute all pairwise overlaps. Calculate the current estimate of ω̄ (ω̌). If the difference between the current and target values of ω̄ (ω̌) is negligible, stop the algorithm and provide the current parameters.
3. Scaling clusters. Use root-finding techniques to find a covariance matrix multiplier c such that the difference between ω̄(c) and the target ω̄ (ω̌(c) and ω̌) is negligible.
For finding roots, MixSim obtains bounds of an interval that contains the root by considering positive or negative powers of 2, and then applies the approach of Forsythe, Malcolm, and Moler (1980) to find the root.
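This bracketing-plus-zeroin scheme can be sketched in base R. Here a closed-form spherical overlap stands in for the exact pairwise overlap, and uniroot() plays the role of the Forsythe, Malcolm, and Moler (1980) root finder (R's uniroot() is in fact based on that algorithm); the distance and target are illustrative.

```r
# Toy monotone overlap curve: scaling the covariances by c turns the
# spherical overlap into 2 * Phi(-d / (2 * sqrt(c))), here with d = 5.
omega  <- function(c) 2 * pnorm(-5 / (2 * sqrt(c)))
target <- 0.05
f <- function(c) omega(c) - target

lower <- 1; upper <- 1
while (f(upper) < 0) upper <- upper * 2   # expand the bracket by powers of 2
while (f(lower) > 0) lower <- lower / 2
c.star <- uniroot(f, c(lower, upper), tol = 1e-10)$root
omega(c.star)                             # matches the target within tolerance
```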

Generate mixture model controlling the average and maximum overlap
The second algorithm deals with both characteristics ω̄ and ω̌ simultaneously. It can be preferred over the first algorithm to better control the overlap between simulated components.
1. Scaling clusters to reach ω̌. Use the first algorithm to obtain a set of parameters that satisfies ω̌ and fix the two components that produced the highest overlap; their covariance matrices will not be involved in the inflation/deflation process.
2. Finding c∨. Find the largest value of c (say c∨) such that none of the pairwise overlaps ω_ij(c∨) exceeds ω̌. If ω̄(c∨) < ω̄, discard the realization and return to Step 1.
3. Limited scaling. While keeping the two fixed components unchanged, apply Step 3 of the first algorithm to the remaining components to reach the desired ω̄. If the obtained parameters satisfy ω̄ and ω̌, report them. Otherwise, start with Step 1 again.
It can be noticed that not every combination of ω̄ and ω̌ can be obtained. Immediate restrictions are ω̄ ≤ ω̌ and ω̌ ≤ ω̄K(K − 1)/2, where K is the number of mixture components. Also, some combinations of ω̄ and ω̌ can be more difficult to reach than others. In such cases, a higher number of mixture resimulations may be needed to find a targeted mixture.

Classification indices
One application of MixSim is the systematic investigation of the properties of clustering algorithms. In order to assess the level of similarity between partitioning vectors, some measure has to be used. There are multiple indices that have been developed for this purpose - see Meilă (2006) for a detailed review. Here, we summarize those indices that are implemented in MixSim. Meilă (2006) brackets all indices into one of three groups. The first group of indices compares clusterings by counting the pairs of points that are assigned to the same or different clusters under both partitionings. The Rand (1971) index falls into this category and is defined as

R(c_1, c_2) = (N_11 + N_00) / (n(n − 1)/2),

where c_1 and c_2 are the first and second partitioning vectors respectively, N_11 is the number of pairs of points in the same cluster under both c_1 and c_2, and N_00 is the number of pairs in different clusters under both c_1 and c_2; n represents the number of points in the partitioning vectors. The more commonly used modification of R(c_1, c_2) involves adjustment by its expectation E(R(c_1, c_2)), providing the adjusted Rand index (Hubert and Arabie 1985):

AR(c_1, c_2) = ( R(c_1, c_2) − E(R(c_1, c_2)) ) / ( 1 − E(R(c_1, c_2)) ).

Another, albeit less popular, adjustment of the Rand index was proposed by Mirkin (1996): M(c_1, c_2) = n(n − 1)(1 − R(c_1, c_2)).
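The pair-counting indices can be computed directly from a contingency table of the two label vectors. The base-R sketch below is illustrative (MixSim's RandIndex() computes the same quantities in C); the expectation of R is taken under the usual hypergeometric model of random labelings.

```r
# Rand index R, adjusted Rand index AR, and Mirkin index M from two
# partitioning vectors (illustrative base-R sketch).
rand.index <- function(c1, c2) {
  n  <- length(c1)
  ct <- table(c1, c2)
  sum.pairs <- function(x) sum(choose(x, 2))
  N11 <- sum.pairs(ct)                        # pairs together under both partitions
  A   <- sum.pairs(rowSums(ct))
  B   <- sum.pairs(colSums(ct))
  Np  <- choose(n, 2)
  N00 <- Np - A - B + N11                     # pairs apart under both partitions
  R   <- (N11 + N00) / Np
  ER  <- 1 + 2 * A * B / Np^2 - (A + B) / Np  # E[R] under the hypergeometric model
  c(R = R, AR = (R - ER) / (1 - ER), M = n * (n - 1) * (1 - R))
}

rand.index(c(1, 1, 2, 2), c(1, 1, 2, 2))  # identical partitions: R = 1, AR = 1, M = 0
```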
An interesting index proposed by Fowlkes and Mallows (1983) combines two asymmetric criteria of Wallace (1983), W_1(c_1, c_2) and W_2(c_1, c_2):

F(c_1, c_2) = √( W_1(c_1, c_2) W_2(c_1, c_2) ), where W_i(c_1, c_2) = 2N_11 / Σ_{k=1}^{K^(i)} n_k^(i) (n_k^(i) − 1),

with n_k^(i) representing the size of the kth cluster according to the partitioning c_i and K^(i) representing the number of clusters in c_i. Indices R, AR, and F have upper bounds equal to 1, which can be achieved only in the case of identical partitionings, i.e., c_1 = c_2. On the contrary, the Mirkin index reaches 0 for identical partitioning vectors; otherwise, it takes positive integer values. These four indices are implemented in MixSim's function RandIndex(). A second group of indices compares partitionings by set matching. The most well-known index here is the proportion of observations that agree on classification under both partitioning vectors. It may be noted that label-switching plays an important role here as the result is label-dependent. MixSim's function ClassProp() calculates this proportion by considering all possible permutations of labels and choosing the permutation yielding the highest proportion of agreement between the two partitionings. Of course, this approach becomes restrictive for a high number of classes. In this case, heuristic approaches for matching classes, such as that provided by the function matchClasses() from the package e1071 (Meyer, Dimitriadou, Hornik, Weingessel, and Leisch 2012), can be applied.
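The label-matching idea behind ClassProp() can be sketched with a brute-force search over label permutations. This is an illustrative base-R version (MixSim's implementation is in C), feasible only for a small number of classes since the search grows as K!.

```r
# Agreement proportion after optimal label matching (illustrative sketch).
class.prop <- function(c1, c2) {
  labs  <- sort(unique(c2))
  perms <- function(v) {                  # all permutations of a vector, recursively
    if (length(v) <= 1) return(list(v))
    out <- list()
    for (i in seq_along(v))
      out <- c(out, lapply(perms(v[-i]), function(p) c(v[i], p)))
    out
  }
  best <- 0
  for (p in perms(labs)) {
    relab <- p[match(c2, labs)]           # relabel c2 under permutation p
    best  <- max(best, mean(c1 == relab))
  }
  best
}

class.prop(c(1, 1, 2, 2), c(2, 2, 1, 1))  # same grouping, swapped labels: 1
```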
The last category of indices is based on the analysis of the variation of information in two partitioning vectors. Meilă (2006) developed an index defined as

VI(c_1, c_2) = H(c_1) + H(c_2) − 2 I(c_1, c_2),

where H(c_i) is the entropy associated with c_i, defined as H(c_i) = −Σ_{k=1}^{K^(i)} (n_k^(i)/n) log(n_k^(i)/n) for i = 1, 2, and I(c_1, c_2) represents the mutual information between the two partitionings, defined as

I(c_1, c_2) = Σ_{k=1}^{K^(1)} Σ_{r=1}^{K^(2)} (n_kr/n) log( n_kr n / (n_k^(1) n_r^(2)) ),

where n_kr is the number of observations assigned simultaneously to the kth and rth clusters under partitionings c_1 and c_2 respectively. When c_1 = c_2, VI(c_1, c_2) = 0. The upper bound for VI is equal to log n. The MixSim function responsible for calculating VI is called VarInf(). It is worth pointing out that all the indices listed above and implemented in MixSim are symmetric, which means that for any index, Index(c_1, c_2) = Index(c_2, c_1).
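The definition above translates directly into base R. This sketch is illustrative (MixSim's VarInf() implements the same definition in C):

```r
# Variation of information between two partitioning vectors
# (illustrative base-R sketch).
var.inf <- function(c1, c2) {
  n  <- length(c1)
  ct <- table(c1, c2) / n                     # joint proportions n_kr / n
  pk <- rowSums(ct); pr <- colSums(ct)        # marginal cluster proportions
  H  <- function(p) -sum(p[p > 0] * log(p[p > 0]))
  I  <- sum(ct[ct > 0] * log(ct[ct > 0] / outer(pk, pr)[ct > 0]))
  H(pk) + H(pr) - 2 * I                       # VI = H(c1) + H(c2) - 2 I(c1, c2)
}

var.inf(c(1, 1, 2, 2), c(1, 1, 2, 2))  # identical partitions: 0 (up to rounding)
```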

Package description and illustrative examples
In this section we provide a detailed description of MixSim's capabilities, along with illustrations. First, we briefly summarize MixSim's functionality, which includes the following features: simulating Gaussian mixture models with homogeneous or heterogeneous, spherical or ellipsoidal covariance matrices according to the pre-specified level of average and/or maximum overlap; simulating datasets from Gaussian mixtures; calculating pairwise overlap for Gaussian mixture components; simulating outliers, noise, and inverse Box-Cox transformed variables; constructing parallel distribution plots to display multivariate Gaussian mixtures; and calculating various indices for measuring the classification agreement between two examined partitionings.
The complete list of functions included in the package along with their brief descriptions can be found in Table 1.

Simulating mixtures with the function MixSim()
MixSim() is the main function of the package and is responsible for finding a Gaussian mixture model satisfying the user-specified level of average and/or maximum overlap. The command has the following syntax:

MixSim(BarOmega = NULL, MaxOmega = NULL, K, p, sph = FALSE, hom = FALSE, ecc = 0.90, PiLow = 1.0, int = c(0.0, 1.0), resN = 100, eps = 1e-06, lim = 1e06)

The parameters of the function are listed in Table 3.
When both parameters BarOmega and MaxOmega are specified, the second algorithm from Section 2.3 is run. If only one of the above parameters is provided, the first algorithm from Section 2.3 is employed. The smallest allowed number of components K is 2; in this case, BarOmega ≡ MaxOmega. MixSim() allows the simulation of mixtures with spherical or general covariance structure as well as with homogeneous or nonhomogeneous components. In order to better control the shape of the produced components, the parameter ecc, specifying the maximum eccentricity, can be used. Maitra and Melnykov (2010) defined the multidimensional eccentricity by extending its definition for two-dimensional ellipses: e = √(1 − λ_min/λ_max), with λ_min and λ_max being the smallest and largest eigenvalues of the (covariance) matrix respectively. If some simulated dispersion matrices have e > e_max specified by ecc, all eigenvalues are scaled in order to have e_new = e_max. PiLow controls the minimum of the mixing proportions, with PiLow = 1 indicating equality of all mixing proportions. The option int specifies the side of the hypercube on which mean vectors are generated. If int is not provided, the default interval is (0, 1). The last three options resN, eps, and lim described in Table 3 specify technical parameters used in MixSim(). resN sets the maximum number of resimulations allowed when the desired overlap cannot be reached easily. This happens, for example, when the asymptotic maximum or average overlap is lower than the corresponding value specified by the user. eps represents the error bound used in our root-finding procedure and in Davies (1980)'s algorithm approximating the distribution of a linear combination of χ² random variables based on the numerical inversion of the characteristic function. Finally, lim specifies the maximum number of terms allowed in the numerical integration involved in Davies (1980)'s procedure. We next illustrate three sample usages of MixSim().
Simulating a mixture with non-homogeneous mixture components, equal mixing proportions and pre-specified values of maximum and average pairwise overlap

Here, we illustrate the use of MixSim() in simulating a 5-dimensional mixture with 4 components, and average and maximum overlaps equal to 0.05 and 0.15 respectively. The following example ("sec3.1_ex1"), beginning with set.seed(1234) for reproducibility, illustrates the use of the function. Since the parameter PiLow is not specified, the mixture model has equal mixing proportions. As the options sph and hom are not provided, nonhomogeneous and general (ellipsoidal) covariance matrices are simulated. From the output, we can see that mixture components 2 and 4 produce the largest overlap (see vector ex.1$rcMax). The corresponding probabilities of misclassification can be found in the matrix ex.1$OmegaMap: they are 0.0642 and 0.0858. The map of misclassification probabilities as well as the numbers of the components producing the largest pairwise overlap can be conveniently accessed by the function summary(). Both desired values, ω̄ and ω̌, have been reached within the error bound, as we can see from ex.1$BarOmega and ex.1$MaxOmega respectively.
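A call consistent with this description is sketched below (argument names as in Table 3; the exact listing in the published paper may differ slightly):

```r
# Sketch: 5-dimensional, 4-component mixture with average overlap 0.05
# and maximum overlap 0.15 (requires the CRAN package MixSim).
library(MixSim)

set.seed(1234)
(ex.1 <- MixSim(BarOmega = 0.05, MaxOmega = 0.15, K = 4, p = 5))
summary(ex.1)
```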
Simulating a mixture with non-homogeneous spherical mixture components, unequal mixing proportions and pre-specified value of maximum pairwise overlap

The following example ("sec3.1_ex2") simulates a bivariate mixture with three spherical components, mixing proportions no less than 0.1, and maximum pairwise overlap 0.1. Parameter hom is not provided, thus nonhomogeneous components are simulated.
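A call consistent with this description is sketched below (argument names as in Table 3; the exact listing may differ slightly):

```r
# Sketch: bivariate, 3 spherical components, PiLow = 0.1,
# maximum overlap 0.1 (requires the CRAN package MixSim).
library(MixSim)

set.seed(1234)
(ex.2 <- MixSim(MaxOmega = 0.1, K = 3, p = 2, sph = TRUE, PiLow = 0.1))
summary(ex.2)
```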
Simulating a mixture with homogeneous spherical components, equal mixing proportions and pre-specified value of average pairwise overlap

The last illustration ("sec3.1_ex3") of the function MixSim() deals with a 4-dimensional mixture with 2 components. The average overlap is specified at the level 0.05. To increase accuracy, we let eps be equal to 1e-10. Coordinates of mean vectors are simulated between 0 and 10. Since the arguments sph and hom are specified as TRUE, clusters should be spherical and homogeneous.
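A call consistent with this description is sketched below (argument names as in Table 3; the exact listing may differ slightly):

```r
# Sketch: 4-dimensional, 2 homogeneous spherical components,
# average overlap 0.05, means on (0, 10), tighter error bound
# (requires the CRAN package MixSim).
library(MixSim)

set.seed(1234)
(ex.3 <- MixSim(BarOmega = 0.05, K = 2, p = 4, sph = TRUE, hom = TRUE,
                int = c(0, 10), eps = 1e-10))
summary(ex.3)
```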

Calculating exact overlap with the function overlap()
In this section we discuss the capability of MixSim to calculate the exact overlap given the parameters of a Gaussian mixture model. This feature is useful if one wants to estimate the level of clustering complexity of an existing classification dataset, which is another important application of MixSim. There exist numerous classification datasets widely used for testing clustering algorithms. However, the knowledge about these datasets is usually very limited: the investigator typically knows only which clusters are problematic, having no information about the level of interaction among the clusters. The function overlap() provides the user with the map of misclassification probabilities. The command has the following syntax:

overlap(Pi, Mu, S, eps = 1e-06, lim = 1e06)

The parameters accepted by overlap() as well as the values returned by the function are provided in Table 4. All five arguments - Pi, Mu, S, eps, and lim - have the same meaning as in the function MixSim() discussed in Section 3.1. The returned values are also discussed in the same section.

Finding exact overlap for the Iris dataset
Here, we analyze the celebrated Iris dataset of Anderson (1935) and Fisher (1936). The data consist of 150 4-dimensional points measuring the width and length of petals and sepals; there are 50 observations from each of three different species of Iris: Setosa, Versicolor, and Virginica. It is well-known that Virginica and Versicolor are difficult to separate while Setosa creates a very distinct cluster. The following code ("sec3.2_ex1") estimates the parameters of a Gaussian mixture model with three components for Iris, provided the true classification vector. Then, the exact overlap is calculated using the function overlap().
R> data("iris", package = "datasets")
R> p <- ncol(iris) - 1
R> id <- as.integer(iris[, 5])
R> K <- max(id)
R> Pi <- prop.table(tabulate(id))
R> Mu <- t(sapply(1:K, function(k) colMeans(iris[id == k, -5])))
R> S <- sapply(1:K, function(k) var(iris[id == k, -5]))
R> dim(S) <- c(p, p, K)
R> overlap(Pi = Pi, Mu = Mu, S = S)

As we can see from the output, the substantial maximum overlap of 0.0493 is produced by the second and third components: Virginica and Versicolor. At the same time, these two clusters barely overlap with Setosa. This explains why the majority of clustering methods prefer two-cluster solutions combining Virginica and Versicolor together.

Simulating datasets with the function simdataset()
In this section, we discuss the capabilities of MixSim for simulating datasets, along with outliers, from a given (perhaps simulated) mixture model. The function responsible for data generation is called simdataset() and has the following form:

simdataset(n, Pi, Mu, S, n.noise = 0, n.out = 0, alpha = 0.001, max.out = 100000, int = NULL, lambda = NULL)

The arguments and returned values are listed in Table 5. Parameters Pi, Mu, and S have the same meaning as before. The size of a generated dataset is defined as n + n.out, where n.out specifies the number of outliers needed. If the parameter n.out is not specified, no outliers are produced by simdataset(). Parameter max.out specifies the maximum number of outlier resimulations, with the default value of 100000. alpha specifies the ellipsoidal contours beyond which outliers have to be simulated. The number of dimensions for the dataset is defined as dim(Mu)[2] + n.noise. By default, n.noise = 0. The interval int defines the side of the hypercube for outlier simulation. It is also used for simulating noise variables if n.noise is greater than 0. Both outliers and noise variables are simulated from a uniform hypercube. When int is not provided, the interval bounds are chosen to be equal to the smallest and largest coordinates in the mean vectors respectively. The last parameter, lambda, specifies a vector of size dim(Mu)[2] + n.noise that performs an inverse Box-Cox transformation for every coordinate. By default, simdataset() generates datasets from Gaussian mixture models. If the user wants to produce data that are not normally distributed, the argument lambda can be helpful. The procedure uses the multivariate Box-Cox transformation given by x* = (x^λ − 1)/λ, where x and x* represent the original and transformed observations, respectively. x* is approximately normally distributed for some value of λ. We use an idea related to inverting the Box-Cox transformation to provide non-normally distributed data.
The transformation we propose is given by x = (λx* + 1)^(1/λ) − 1. When λ = 1, this is the identity transformation. Thus, the parameter lambda specifies the desired transformation for each coordinate.
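The inverse transformation can be written as a one-line base-R helper. This sketch assumes λ ≠ 0 and λx* + 1 > 0; the helper name and test values are illustrative.

```r
# Inverse Box-Cox map used to produce non-normal coordinates:
# x = (lambda * xstar + 1)^(1 / lambda) - 1 (illustrative sketch).
inv.boxcox <- function(xstar, lambda) {
  # assumes lambda != 0 and lambda * xstar + 1 > 0
  (lambda * xstar + 1)^(1 / lambda) - 1
}

inv.boxcox(c(-0.5, 0, 2), lambda = 1)  # identity: -0.5 0 2
inv.boxcox(1.5, lambda = 2)            # (2 * 1.5 + 1)^(1/2) - 1 = 1
```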
The following examples illustrate the capabilities of the function.

Simulating datasets with outliers
If it is desired to include outliers in a simulated dataset to increase clustering complexity or to check the performance of clustering algorithms in the presence of scatter, the corresponding option n.out has to be specified. The following example demonstrates how simdataset() can be employed for simulating datasets with outlying observations. First, bivariate normal mixtures with 4 components and average overlap 0.01 are simulated similarly to the previous examples. Then, the function simdataset() is employed to generate a dataset of 500 observations and introduce 10 outliers. Figure 2a,b demonstrates two datasets obtained this way. Points in red represent outlying observations; id is equal to 0 for such points. To obtain the plots, the user needs to run the following commands ("sec3.3_ex3a").
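A minimal sketch of such a run is given below (requires the CRAN package MixSim; the plotting choices are illustrative, not the paper's exact commands):

```r
# Simulate a bivariate 4-component mixture with average overlap 0.01,
# then draw 500 observations plus 10 outliers (outliers get id == 0).
library(MixSim)

set.seed(1234)
Q <- MixSim(BarOmega = 0.01, K = 4, p = 2)
A <- simdataset(n = 500, Pi = Q$Pi, Mu = Q$Mu, S = Q$S, n.out = 10)
plot(A$X, col = A$id + 1, pch = 19, xlab = "", ylab = "")  # color by cluster id
```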

Simulating datasets with noise variables
To increase the complexity of clustering or to test procedures that reduce dimensionality, it may be necessary to simulate datasets with noise variables. Here, we illustrate the use of simdataset() for simulating data from a one-dimensional mixture with 4 components and maximum overlap 0.1. One noise variable is added after that. As Figure 2c,d shows, the second variable introduces substantial clustering complexity. Figure 2c can be obtained by running the following code ("sec3.3_ex4c").

Constructing a parallel distribution plot with the function pdplot()
The next function illustrated here is pdplot(), the parallel distribution plot, described in detail by Maitra and Melnykov (2010). This plot is a convenient tool for visualizing multidimensional mixture models and the interaction of their components. The plot displays the principal components of a Gaussian mixture model. The first several components can serve as a good indicator of the major sources of variability in a mixture. Table 6 lists the arguments accepted by pdplot(). Pi, Mu, and S are the usual mixture parameters. file is the name of the pdf file the plot has to be written to. Parameters Nx and Ny provide the numbers of horizontal and vertical smoothing intervals respectively. MaxInt sets the level of maximum color intensity while marg specifies the plot margins. The function call has the following form:

pdplot(Pi, Mu, S, file = NULL, Nx = 5, Ny = 5, MaxInt = 1, marg = c(2, 1, 1, 1))

We illustrate the use of the function with several examples.

Parallel distribution plot for the Iris dataset
For the Iris dataset, we first estimate the mixture parameters using the true classification vector. Then, we employ the pdplot() function to construct a parallel distribution plot and save it into the file "Iris.pdf". The obtained plot is presented in Figure 3a. It can be clearly seen that the blue and green components have substantial overlap, being close to each other in every principal component, while the red one is well separated. This agrees well with our findings in Section 3.2. To construct the plot, the user has to estimate the mixture parameters using the code from the example in Section 3.2 first, and then run the following command:

R> pdplot(Pi = Pi, Mu = Mu, S = S)

Parallel distribution plot for simulated mixtures

This example illustrates the difference between parallel distribution plots for well and poorly separated mixtures. We simulate two 4-dimensional mixtures with 6 components and average overlap 0.001 and 0.05 respectively. Figure 3b,c provides their parallel distribution plots. As can be seen, there is a substantial difference between the patterns in the pictures. In plot b, between-cluster variability dominates over within-cluster variability. This is an indication of well-separated clusters. Plot c suggests that within-cluster variability dominates over between-cluster variability, implying larger overlap. The two plots can be easily reproduced with the following code ("sec3.4_ex1b" and "sec3.4_ex1c").

Calculating indices with functions RandIndex(), ClassProp(), and VarInf()

MixSim provides three functions for comparing two partitioning vectors: RandIndex(), ClassProp(), and VarInf(). Table 7 provides their brief description. The first function returns values of four indices based on counting the pairs of points in clusterings, ClassProp() calculates the agreement proportion for two classifications, while the latter one, VarInf(), computes the variation of information for two clusterings. The functions have the following syntax:

The following examples are illustrations of the use of all three functions.

Other auxiliary capabilities
The last function discussed in this section is perms(), which returns all possible permutations given the number of elements in a vector. Although this function is not directly related to other capabilities of MixSim, it is needed in the construction of ClassProp() since it is a faster implementation of the capability also provided by permn() available from package combinat (Chasalow 2012). The function has the following syntax:

perms(n)
A small example below illustrates the use of the function.
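The behavior of perms() can be reproduced with a short recursive base-R sketch (illustrative; MixSim's perms() is the fast C implementation):

```r
# All n! orderings of 1:n, one permutation per row (illustrative sketch).
all.perms <- function(n) {
  if (n == 1) return(matrix(1, 1, 1))
  sub <- all.perms(n - 1)                       # permutations of n - 1 elements
  do.call(rbind, lapply(1:n, function(i)
    cbind(i, matrix((1:n)[-i][sub], nrow(sub))))) # prefix i, permute the rest
}

all.perms(3)  # a 6 x 3 matrix containing each permutation of 1, 2, 3 once
```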

Illustrative use of the package
This section provides a brief illustration of the utility of MixSim. Here, we investigate and compare the performance of four clustering methods implemented in R. The first method is model-based clustering by means of the EM algorithm; the second and third are the k-means and partitioning around medoids algorithms respectively, while the last method is hierarchical clustering with Ward's linkage. The algorithms are tested on datasets simulated from 4-dimensional mixtures with 6 heterogeneous non-spherical components and equal mixing proportions. The value of the average overlap ω̄ varies from extreme (0.4) to very low (0.001).
For each overlap level, 100 mixtures are simulated with MixSim(), and a dataset of size 200 is then generated from each mixture using the function simdataset(). It is worth mentioning that we obtain exactly one dataset from each simulated mixture. Indeed, one could simulate several datasets from each mixture, or even all datasets from a single mixture model. In this experiment, however, we are interested in examining as widely varying datasets as possible, subject only to the condition that the simulated mixture models satisfy the pre-specified degree of average overlap. Variation in data patterns is substantially higher when we use different mixtures. Hence, we simulated just one dataset from each mixture, although in other problems a different strategy can be preferred.
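One replication of this scheme can be sketched as follows (a hedged sketch assuming the CRAN package MixSim is installed and that RandIndex() returns a component AR; only the k-means competitor is shown, and the seed and overlap level are illustrative):

```r
# Simulate a mixture at a target average overlap, draw one dataset,
# cluster it with k-means, and score the solution.
library(MixSim)

set.seed(1)
Q  <- MixSim(BarOmega = 0.05, K = 6, p = 4)              # mixture at the target overlap
A  <- simdataset(n = 200, Pi = Q$Pi, Mu = Q$Mu, S = Q$S) # one dataset per mixture
cl <- kmeans(A$X, centers = 6, nstart = 10)$cluster      # one of the four competitors
RandIndex(A$id, cl)$AR                                   # adjusted Rand for this replication
```

Looping this over 100 mixtures per overlap level, and adding Mclust(), pam(), and hclust() with Ward's linkage, reproduces the structure of the study.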
The performance of the methods is evaluated based on the adjusted Rand index, the proportion of correct classifications, and the variation of information for two partitionings. We use the true and estimated classifications obtained from each clustering method to calculate the value of each index. Functions RandIndex(), ClassProp(), and VarInf() were employed. We calculate the sample mean µ and standard deviation s for each characteristic over the 100 simulated datasets at every level of ω̄. The function Mclust() from the package mclust (Fraley and Raftery 2006) was employed for the EM algorithm. Functions kmeans() and hclust() were used for the k-means and hierarchical clustering algorithms. The partitioning around medoids algorithm was employed by means of the function pam() from the package cluster (Mächler, Rousseeuw, Struyf, Hubert, and Hornik 2012). The experiment can be reproduced by running the demo "sec4_ex1" or by using the code provided in the supplementary materials.
The obtained results are summarized in Table 8. As we can see, there is no uniform winner among the four clustering methods considered. The results suggest that the k-means algorithm should be preferred over the rest of the field for substantial and moderate overlaps. Model-based clustering, however, is the clear winner for well-separated clusters (ω̄ ≤ 0.01).
Hierarchical clustering with Ward's linkage slightly outperforms k-means for well-separated clusters but loses to it in the cases of substantial and moderate overlap. Overall, k-means performs slightly better than the partitioning around medoids algorithm. All three indices considered agree on the suggested conclusions. Of course, this is a small example demonstrating the potential of MixSim, but it illustrates the utility of the package well.

Table 8: Performance of the model-based, k-means, partitioning around medoids, and hierarchical clustering with Ward's linkage algorithms in clustering 4-dimensional datasets of size 200 with 6 groups. µ and s represent the sample mean and standard deviation for the adjusted Rand index (AR), proportion of correct classifications (P), and variation of information (VI) obtained over 100 replications.

Summary
This paper describes the R package MixSim, which provides a convenient and friendly tool for simulating Gaussian mixture models and datasets according to pre-specified levels of clustering complexity, expressed in terms of the average and maximum pairwise overlap among mixture components. MixSim's functions are described and carefully illustrated with multiple examples. A small study illustrating the utility of the package is provided. The package should be of interest to a broad audience working in the areas of machine learning, clustering, and classification.