High-Dimensional Bayesian Clustering with Variable Selection: The R Package bclust

The R package bclust is useful for clustering high-dimensional continuous data. The package uses a parametric spike-and-slab Bayesian model to downweight the eﬀect of noise variables and to quantify the importance of each variable in agglomerative clustering. We take advantage of the existence of closed-form marginal distributions to estimate the model hyper-parameters using empirical Bayes, thereby yielding a fully automatic method. We discuss computational problems arising in implementation of the procedure and illustrate the usefulness of the package through examples


Introduction
The purpose of cluster analysis is to partition observations into groups such that observations belonging to the same group are more similar than observations belonging to different groups.There are various ways of attributing observations to clusters, but one may classify them into two broad categories: distance-based (nonparametric) and model-based (parametric) techniques.Our approach lies between these: we use a model to define a distance and we implement hierarchical clustering as used in distance-based methods.
Hierarchical clustering using various distance measures is implemented in the R progamming language (R Development Core Team 2012) in packages such as cluster (Mächler et al. 2012) and has two variants, agglomerative clustering and divisive clustering, implemented in the agnes and diana functions of this package respectively.Agglomerative clustering begins with each observation as a separate cluster, successively merges the closest clusters using a dissimilarity measure, and stops when there is just one cluster.Divisive clustering starts with all the observations in one cluster and divides it until each observation forms a single cluster.However, hierarchical clustering is not the only way of grouping data.Another widely used technique is partitioning clustering, as embodied in the k-means algorithm, kmeans, of the package stats.A more robust variant, k-medoids, is coded in the pam function in the package cluster.Unfortunately, partitioning approaches can be hard to visualize, though some graphical tools are available in the packages flexclust (Leisch 2010) and cclust (Dimitriadou 2009).The dendrogram, a tree representation that provides a visual guide to the groupings as the number of clusters changes, is usually unavailable in partitioning algorithms.Many graphical tools are provided in the ape package (Paradis et al. 2004).One partitioning method is Bayesian mixture modeling, which often requires Markov chain Monte Carlo simulation, an example being the finite Gaussian mixtures of the bayesm package (Rossi 2011).Partial tree representation of Markov chain Monte Carlo groupings is feasible through labeltodendro package (Partovi Nia and Stephens 2012).
In many scientific domains modern technology provides data on many more variables than individuals.Cluster analysis is widely used in such cases, and a common difficulty is to provide reasonable statistical models for these low-sample-size-high-dimensional situations.Statistical analysis of such data is difficult partly because of overfitting, for which two main solutions have been proposed: the data are projected to a smaller dimension, or analysis is based only on relevant variables.Our approach is related to the second solution.
In the high-dimensional datasets now arising in biological applications, the key information on clustering may be hidden in a small subset of the variables (Cheeseman and Stutz 1996), and inclusion of other variables may mask the underlying structure.Many model-based clustering procedures depend on ratios of probability densities.When the data dimension greatly exceeds the number of individuals, the probability that two individuals will lie close enough to be considered part of the same cluster approaches zero, if substantial variation occur across all variables (Hall et al. 2005;Ahn et al. 2007).Thus variable selection or projection into a subspace seems necessary when clustering high-dimensional datasets, and this complicates matters further.
A variable may be considered useful for clustering if it defines a mixture, so variable selection in clustering requires the fitting of mixtures with unknown numbers of components on an unknown number of variables (Kim et al. 2006).Researchers have dealt with this in different ways.McLachlan et al. (2002) apply forward selection of variables using univariate significance tests of a single component against mixtures of two components.Wang and Zhu (2008) and Bondell and Reich (2008) implement variable selection using a penalized likelihood.Friedman and Meulman (2004) assign different weights to each variable as a measure of its importance, and have implemented this in the COSA software.Witten and Tibshirani (2010) similarly perform variable selection and provide importance measures by penalization of the dissimilarity matrix, implemented in the sparcl R package (Witten and Tibshirani 2011).Bergé et al. (2012) implemented clustering and discriminant analysis of high-dimensional data in the HDclassif R package using a new parametrization of the Gaussian mixture model which combines the idea of dimension reduction and model constraints on the covariance matrices.Hoff (2006) and Booth et al. (2008) suggest stochastic search to find the optimal clustering.Raftery and Dean (2006) fit a finite Gaussian mixture model and select variables using an approximate Bayes factor; see the R package clustvarsel (Dean and Raftery 2009).Tadesse et al. (2005) suggest use of a reversible jump algorithm for their Bayesian model.Another approach is dimension reduction by principal components analysis (Ghosh and Chinnaiyan 2002), but this may not show which variables are more effective for clustering or carry the best information about the cluster topology (Chang 1983).Liu et al. (2003) combine principal components analysis with variable selection and propose a Gibbs sampler to determine the number of components to be used.In the present paper we use Bayesian variable selection through spike-and-slab models (Mitchell and Beauchamp 1988;George and McCulloch 1997).Our suggested model imposes independence of variables, so selection of variables marginally or conditional on the previous selected variables coincide.
Most of the model-based clustering R packages are inappropriate for high-dimensional data except for HDclassif.The bclust package version 1.3, built for R 2.15.0, is intended to fill this gap.Unlike HDclassif the bclust package implements a Bayesian approach to clustering, with priors for model parameters and for the allocation of subjects to groups.The model and its priors are chosen so that the marginal posterior is analytically tractable, providing a fast algorithm.The marginal posterior is taken as the natural measure of the appropriateness of a grouping.The clustering that maximizes the marginal posterior is taken to be optimal.Since it is not easy to find the maximum a posteriori grouping over all possible partitions, we propose an approximation.The agglomerative path is used to approximate the maximum a posteriori clustering.This gives a visual guide to some of the other possible data allocations, through a dendrogram.The R package implementing the methodology described in this work is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=bclust.

Bayesian model for clustered data
Suppose that T clustering individuals are grouped into C clusters.The univariate random variable y vct is the data of clustering individual t (t = 1, . . ., T c ) in cluster c (c = 1, . . ., C) measured on the continuous variable v (v = 1, . . ., V ).If there are T c observations in cluster c (T = C c=1 T c ), then the data distribution is the same if the observations in cluster c are arbitrarily reordered.Thus f (y 1c . . .y Tcc ) is an exchangeable distribution, and by the general representation theorem (Bernardo and Smith 1994, Chapter 4), there is a conditional distribution f (y c | ξ c ) and a prior distribution function F (ξ c ) such that suggesting use of a Bayesian model.We propose a linear model for the data: where η vc and θ vc are continuous random variables, γ vc and δ v are binary random variables, and µ is a constant.The random variable η vct is noise, supposed to be independent of θ vc and sampled from a Gaussian distribution with zero mean and variance σ 2 η ≥ 0. The random variable θ vc disappears when γ vc = 0 or δ v = 0 but is present when γ vc = δ v = 1.We assume that γ vc and δ v are independent of each other and follow Bernoulli distributions with probabilities p and q, respectively.
In Equation 2, µ represents an overall value for all the variables and individuals.Without loss of generality, our model presupposes that all the variable-wise centers equal zero; thus we suggest subtracting the median of each variable before using our software.The random If θ vc appears for at least one cluster, then the variable v is important, and the importance of variables for clustering is coded in the Bernoulli random variable δ v .One may interpret the probabilities q and p as the proportions of important variables and of the appearance of different cluster means for an important variable.
We propose two families for θ vc : a Gaussian distribution with mean zero and variance σ 2 θ > 0; and an asymmetric Laplace distribution (Bhowmick et al. 2006), used to model heavy tailed and asymmetric effects, centered at zero, and with left-tail and right-tail variances σ 2 θ L > 0 and σ 2 θ R > 0; see Figure 1.Both the Gaussian and the asymmetric Laplace families produce closed form marginal densities for the observations, yielding a fast algorithm.The slab density of y vct in (2) when the effects follow the asymmetric Laplace distribution is plotted in Figures 1  and 2, and has the form In Bayesian variable selection the term spike-and-slab distribution is typically used for the prior distribution.We use the terms spike for a distribution which is concentrated about zero and slab for the distribution with tails much more dispersed than the spike density, whether it is a prior or a marginal density.
If δ v = γ vc = 1, then the variable v and the variable-cluster combination v, c are active.In this case the marginal variance of data in variable-cluster combination v, c equals σ 2 θ + σ 2 η , defining a slab distribution, otherwise the marginal variance equals σ 2 η , giving a spike distribution, see  Sometimes clustering individuals include replicate data.One may omit the replication information and consider each replicate as a clustering individual, but after the data are grouped some of the replicates may then fall into different groups, and this is undesirable.We therefore propose to generalize the model ( 2) by assuming another level of variability between replicates of a clustering individual.If there are R ct replicates of clustering individual t grouped in cluster c, then we propose to generalize (2) to This reduces to (2) if R vc = 1 for all v = 1, . . ., V and c = 1, . . ., C, so below we focus on (3).Comparing (3) with (2), we see that if R vc = 1 for all c = 1, . . ., C, v = 1, . . ., V , then the model is identifiable only with respect to σ 2 η + σ 2 ε .This is important when marginal maximum likelihood is used to estimate the model hyper-parameters for unreplicated data using the replicated model ( 3).In such cases we set σ 2 η = 0 and estimate σ 2 θ .Suppose that the letter y with fewer indices corresponds to an appropriate vector of data.For instance, y v denotes the data vector for variable v and y vc corresponds to the data vector for variable v and cluster c.The marginal density of the Bayesian model ( 3) can be obtained by replacing in (1) and evaluating the integral.The marginal density equals (Partovi Nia 2009) in which the marginal density for each variable is a convex combination of the spike-and-slab densities, given by where is independent of the distribution assumed for the cluster effects θ vc , but f 1 depends on their distribution.If the effect has a Gaussian distribution, then f 1 corresponds to a multivariate Gaussian density with mean vector µ1 and covariance matrix Σ, where Σ is of dimension θ on the main diagonals, and the off-diagonal elements are equal to σ 2 η + σ 2 θ for replications of the same individual and to σ 2 θ for observations from different individuals.
When the effects are distributed according to the asymmetric Laplace distribution with variance σ 2 θ = σ 2 θ L + σ 2 θ R , the rates of the left-and right-tail exponential distributions forming the Laplace density being σ −1 θ L and σ −1 θ R , then where The square matrix A Tc×Tc is positive definite with determinant |A| consisting of main diagonals η .An immediate consequence of ( 6) is resistance of the clustering method to the noise variables, because the data density has two parts.The first is the density of data when the clustering parameter θ vc appears, the so-called slab density f 1 .This guides the clustering procedure when a variable is important for clustering.The second, the spike density f 0 , is the density of the data when the clustering parameter θ vc disappears.This part down-weights the effect of useless variables in clustering and provides a valid clustering procedure when the number of noise variables increases.In the extreme case if the data density consists only of f 0 s, that is with probability one δ v = 0 for all v = 1, . . ., V, or equivalently γ vc = 0 for all v and c = 1, . . ., C, the data play no role in grouping and the clustering posterior equals the prior.
We take the log Bayes factors log as a measure of importance for the variable-cluster combination v, c.The posterior odds is a Bayesian measure of uncertainty for testing two hypotheses and equals the prior odds times the Bayes factor.The posterior odds are only related to data through the Bayes factor and are understood as a data-based measure of the evidence when comparing two hypotheses (Kass and Raftery 1995).

Bayesian clustering paradigm
In Bayesian clustering, the allocation of observations to clusters is regarded as a statistical parameter.Therefore a Bayesian model such as (3) is assumed for data conditional on the grouping structure, and a prior distribution must be adopted for the clusters.Then a search algorithm, often using Markov chain Monte Carlo simulation, is applied to find the maximum a posteriori grouping.In the bclust package, similar to HBC (Savage et al. 2009), a Bioconductor (Gentleman et al. 2004) package, we use an agglomerative search method because it provides a visual guide to the other possible groupings through a dendrogram.
Suppose a data allocation C groups the observations into C clusters, of sizes T 1 , . . ., T C , with total T = C c=1 T c clustering individuals.We assume a multinomial-Dirichlet distribution (Heard et al. 2006) as the allocation prior The clustering posterior is in which f (y | C) is the marginal density of the data for the known allocation C derived in (5), and k > 0 is a fixed value for given data.The normalizing constant k plays no role in agglomerative clustering and may be omitted in numerical calculations.In order to apply Bayesian agglomerative clustering we start with each individual as a single cluster: the number of clusters equals the total number of individuals, C = T , and the number of individuals in cluster c is T c = 1, for all c = 1, . . ., C. In the first step, all pairwise merges are considered.For each pairwise merge, the clustering posterior ( 10) is calculated and the merge that maximizes (10) is applied.We keep g c = log f (C | y), the log posterior for the best merge having c clusters, to use as the dendrogram height.If the best merge according to ( 10) is to join cluster c 1 to c 2 to create the new cluster c, then of course The algorithm then considers all pairwise merges again, and continues until all clusters are merged and all individuals are in one cluster.
The best grouping found using the posterior as the objective function on the agglomerative path is the one that maximizes g c across c = 1, . . ., T .Clearly the groupings associated to g c are sorted in agglomerative order with increasing c, so a dendrogram representation is possible.In order to draw a dendrogram a monotone height function is required, but g c is not necessarily monotone and we use the following transformation.Write g max = max(g c ), and suppose that c max = argmax(g c ) is the number of clusters that maximises g c .For c < c max we define the height of the dendrogram to be h c = g c − g max , which is negative, and for c > c max , we take h c = g max − g c , which is positive.By definition, h c is monotone if g c is unimodal, which is usually the case, and cutting the dendrogram at zero height gives the grouping that maximizes g c .However plotting a dendrogram object in R requires non-negative heights, so we replace h c by h c − min(h c ).

Computational issues in code implementation
In order to accelerate the numerical computations, most of the bclust package is written in standard C and output is imported into R to benefit from its visualization facilities.However, some of the required routines were already available in Fortran, and are called using the F77_CALL function of the BLAS C library.Table 1 summarizes the main functions in the package bclust and Figure 3  O(T 3 ).This can be improved if a Lance-Williams type relationship (Lance and Williams 1967) holds for the posterior function.On the other hand, because the model (3) imposes independent variables, f (y | C) reduces to V v=1 f (y v ) and hence agglomerative clustering is of order O(V T 3 ), linear in the number of variables V .This is encouraging because in high dimensional settings T is small but V is large, so our algorithm is rather fast.However, evaluation of f (y | C) may be time-consuming for large V or T , and computational acceleration is then required.
In order to decide which clusters must be merged, we need to evaluate individual densities for each variable (6).The density evaluation becomes computationally expensive if C is large, as in the early stages of agglomerative clustering.A simple trick to rapidly evaluate f (y v | C) is to use the fact that in the agglomerative method only two clusters will be joined, so the evaluation of the density of two clusters with the past values of f (y vc | C) suffices for evaluation of the new f (y | C).Every time that we evaluate f (y | C), only the joint density of the merging clusters is calculated and f (y | C) is reconstructed by multiplying the lacking components.
The individual density f (y v ) for the Gaussian and the asymmetric Laplace model is composed of products and therefore it is best computed on the log scale.If we write bclust: High-Dimensional Bayesian Clustering with Variable Selection in R When l 0 and l 1 are both very small or very large, the computation of is troublesome, and computer memory may overflow or l may be evaluated as zero.To avoid this we evaluate l after factorizing exp(l 1 ) as This expression is appropriate when l 1 > l 0 , because the exponent function in (13) doesn't explode.There is an obvious variant when l 0 ≥ l 1 .A similar trick is applied for the evaluation of log f (y v ) using log In the Gaussian effects model, log f 1 (y vc ) corresponds to logarithm of a d-variate Gaussian density with mean µ1 and covariance matrix Σ, where d = Tc t=1 R ct , 1 is a unit vector of length d and Σ is a d × d positive definite matrix, that is, Evaluation of this density requires computation of the Mahalanobis distance and the log determinant of Σ.In order to efficiently compute them, let the upper-triangular matrix B d×d denote the Cholesky decomposition of Σ, that is B ⊤ B = Σ.The Cholesky decomposition of a positive definite matrix is efficiently implemented in Fortran and is available in the function dpbtrf of the LAPACK library (Anderson et al. 1999).Because B is upper-triangular, a solution to the system of linear equations is easily obtained by back-solving using the LAPACK function dtrtrs.Hence, x = Σ − 1 2 (y vc − µ1) might be used to evaluate the Mahalanobis distance as in which x i represents the ith element of the vector x.
Once the Cholesky decomposition of Σ is computed, the eigenvalues λ i are also available.
The log density can be obtained by replacing the Mahalanobis distance ( 16) and the log determinant ( 17) in ( 14), yielding We need to apply this procedure for all vectors of data y vc (v = 1, . . ., V, c = 1, . . ., C).
We can save computational time for data in the same cluster but another variable, say y v ′ c (v ′ = v), because for y v ′ c , the covariance matrix Σ and hence B are unchanged, so we do not need to re-calculate the Cholesky decomposition of Σ.However, the back-solving must be updated according to the new data in Bx = y v ′ c − µ1, and the Mahalanobis distance must be recomputed using the new x.
In the asymmetric Laplace model the density f 1 (y vc ) given in ( 8) has a more complicated form.However, the computational difficulty arises only in the calculation of and of Φ, the standard Gaussian cumulative distribution function.The cumulative Gaussian distribution function is available in the Rmath C library and evaluation of the quantities in ( 18) is similar to the Gaussian case.
Therefore, the required quantities are The value of the log density is readily obtained.For data in the same cluster but a different variable, say y v ′ c , the quantities A, d L , and d R are unchanged.Hence, we just need to update x b L and x b R , replace them in ( 19) and can then evaluate f 1 (y v ′ c ) with less computational effort.The positive definite matrix A is exchangeable and therefore |A| and A −1 are available analytically, but using the analytical forms doesn't accelerate the algorithm much.

Code analysis on simulated data
In order to analyze our computer code, a simple factorial experiment was performed with the number of variables V set to 50, 100, 200, 300, 500, 1000 and the number of individuals T set to 10,20,30,40,50,100,200,300.The experiment was run on a desktop PC with Intel Core Duo processor 1.8 MHz, 1 GB RAM and Linux Ubuntu operating system.Each design was fitted 5 times using the Gaussian and the asymmetric Laplace models and the time in seconds required for agglomerative clustering was saved.
Least squares estimates of the parameters (β 0 , β 1 , β 2 ) ⊤ for the linear model log 10 time = β 0 + β 1 log 10 V +β 2 log 10 T are (−6.large T ; see also Table 2.The fitted linear model can be used to predict the time required for agglomerative clustering for large T or V .However, β 0 is computer-dependent.On the equipment mentioned above, the time needed for clustering T = 100 individuals measured on V = 5000 variables is about 39 minutes for the Gaussian model and about 33 minutes for the asymmetric Laplace model.

Clustering toy examples
In this section only fits using the Gaussian model are presented.The asymmetric Laplace model leads to very similar results provided similar hyper-parameter values are used, but hyper-parameter estimation using the asymmetric Laplace model often more difficult.
In order to demonstrate the usefulness of the bclust package first we cluster a toy data set consisting of a cluster of 20 observations independently and identically sampled from a standard bivariate Gaussian distribution with correlation ρ = 0.9, and another cluster of 20 Gaussian variates with mean (4, 0) ⊤ , unit variances and negative correlation ρ = −0.9.This gives a data set in which one variable is more useful for clustering than another.A scatterplot of the data is shown in Figure 4.The generated data violate the variable independence assumption of model ( 3), but, provided the cluster centers are separated reasonably well, ignoring the dependence has little effect on the estimated grouping and the algorithm yields convincing results.
The result of the bclust command is a bclustvs object, similar to the existing R class hclust, but including extra information needed to produce appropriate graphs.Calculation of the importances is provided in the imp function.This imports a bclustvs object and gives the importance measures, the log Bayes factor of variables, log B δ , and the log Bayes factor of variable-cluster combinations, log B γ , for the maximum a posteriori clustering found by the agglomerative method.Negative values of the importances give negative evidence that the variable v or the variable-cluster v, c participate in the optimal clustering.The package provides the following command for plotting variable importances:

R> viplot(imp(cluster.obj)$var)
The bclust package is not designed to handle low dimensional datasets like the bivariate toy data.In order to demonstrate its usefulness in a high-dimensional situation we add 98 standard Gaussian noise variables to the bivariate toy example, yielding a data set with 40 observations and 100 dimensions; we use the same hyper-parameter values as in the bivariate case.The package includes a dendrogram-image-teeth plot, ditplot, of a bclustvs object and draws the dendrogram tree, the image plot of the unreplicated data, and the optimal grouping, in the same figure.
On the left of the figure is the dendrogram tree.To its immediate right is the image plot of the data, with clustering individuals in rows and variables in columns.The package uses the rainbow color scheme as the default coloring scheme for the image plot: the minimum value appears blue, the maximum value appears magenta, and intermediate values are shown with colors that depend on their closeness to the limiting values.The image plot is followed by a teeth plot showing the optimal grouping found by cutting the tree.On the extreme right of the figure is a vertical bar which can be used to represent another arbitrary grouping, here taken to be the correct data clustering available in the plotcol numeric vector.
Our next example includes data simulated from the replicated Gaussian model (3) with R ct = 4 and T = 10 grouped in four clusters with model hyper-parameters σ 2 ε = 1, σ 2 η = 3, σ 2 θ = 25, µ = 0, p = 0.5, q = 0.5.Figure 6 shows the profileplot of the data with red blobs for activated variable-cluster combinations attached to a teethplot and a horizontal bar declaring the activated variables, shown in red.The profileplot command of the bclust package is a handy tool suitable for presentation of replicated data.
Like ditplot, the function dptplot is intended to facilitate visualization of a bclustvs object, but when data are replicated.This function attaches a dendrogram plot to a profile plot and a teeth plot with an optional horizontal bar for variable importances and an optional vertical bar to show an arbitrary grouping.The pre-specified heat colors for the variable importances are determined by the scale proposed by Kass and Raftery (1995), blank for variables having negative Bayes factors and a heat color for positive ones.In the profile plot the variables are sorted automatically according to their importances.The output of the former code appears in Figure 8. Figure 8: The bar plot of the variable Bayes factors using viplot for the simulated data; see also Figure 6.

Clustering real data
When a small proportion of variables are active, i.e., q is small, it is often difficult to estimate the model hyper-parameters using maximum marginal likelihood.We therefore fix q = 1 and estimate the remaining parameters.The resulting procedure is equivalent to setting δ v = 1, for all v = 1, . . ., V in (3).One may use this modified model without variable selection to discriminate or cluster subjects.This model is still resistant to the noise variables, thanks to the variable-cluster selector γ vc , but it does not include variable selection and therefore no variable importance can be computed.Simulation shows that parameter estimation and clustering performs better when q = 1.When noise dominates, estimation of all parameters simultaneously is troublesome, and usually fixing q = 1 provides a better result also.
The clusterings for the gaelle data with and without variable selection are similar, so we report only the result of the variable selection model.The figure below confirms that our Bayesian clustering method clusters the data into five groups and finds 11 of the 43 variables to be important.
The output of the latter code appears in Figure 9.

Figure 1 :
Figure1: Distributions of underlying effects and of measurements.Left: the standard Gaussian density (black), the symmetric Laplace density (blue), and the asymmetric Laplace density with σ 2 θ R /σ 2 θ L = 10 (red), all having zero median and unit variance, giving examples of the distribution of θ vc .Right: the marginal density of a measurement y vct when the variable and the variable-class combination is active, obtained by convolving a standard Gaussian density with the densities on the left.

Figure 2 :
Figure 2: Marginal densities of measurements.Left: examples of a Gaussian spike (solid) and slab (dotted and dashed) densities; the dotted density is obtained by adding a Gaussian effect to a Gaussian noise variable; the dashed density is derived by adding a symmetric Laplace effect to Gaussian noise.Right: Gaussian spike and right-skewed (red dashed) and left-skewed (blue dashed) densities obtained by adding an asymmetric Laplace effect to Gaussian noise.

Figure 2 .
Figure 2. Hence model (2) always gives a Gaussian spike distribution, but depending on the distribution of θ vc , it provides a symmetric or an asymmetric slab distribution.

Figure 3 :
Figure 3: Diagram of function dependencies.Left panel: functions used for computations.Right panel: functions used for visualization.Ellipses denote functions developed in the bclust package and rectangles denote pre-existing R functions.
Denoting the diagonal elements of B, by b ii , we have b ii = λ First we calculate the upper-triangular Cholesky decomposition of A d×d , say B d×d , in which d = T c .Hence log |A| = 2 d i=1 log b ii , and we find the vectors x b L , x b R , x d L , x d R by back-solving the systems of linear equations

Figure 4 :
Figure4: Left panel: scatter plot of the bivariate toy data; two clusters are generated, the first cluster from a standard bivariate Gaussian distribution with correlation 0.9 the second with mean (4, 0) ⊤ , ρ = −0.9, and unit marginal variances.Right panel: The output of bclust fit on the toy example visualized using the generic plot command.

Figure 5 :
Figure 5: The output of bclust fit on the bivariate toy data (Figure 4 left panel), with 98 standard Gaussian noise variables added.The resulting bclustvs object is visualized using the ditplot command.

Figure 6 :
Figure 6: Profile plot of data with four clusters, simulated using the Gaussian model with ten clustering individuals, each having four replicates over 50 variables.The hyper-parameter values are σ2 ε = 1, σ 2 η = 1, σ 2 θ = 25, µ = 0, p = 0.5, q = 0.5.The 22 activated variables are shown using the red horizontal bar.Activated variable-cluster combinations are shown using the red blobs on the profiles for each group.

Figure 7 :
Figure 7: The output of bclust fit on simulated data visualized by dptplot command, see also Figure 6.The left side of the figure includes the posterior-based dendrogram cut at the maximum posteriori point, data are shown in the middle, and the maximum a posteriori grouping on the right side.

Table 1 :
Summary of the functions in the bclust package.

Table 2 :
Average clustering time (in seconds) for different number of variables V and the number of clustering individuals T .
⊤for the asymmetric Laplace model.As expected, β 1 ≈ 1 and β 2 ≈ 3 for both models.The value of β 2 for the asymmetric Laplace model is smaller than that for the Gaussian model, suggesting a more efficient algorithm is implemented for the asymmetric Laplace model especially for Messerli et al. (2007)ce produce high-dimensional data for which classification of a new observation to existing groups or clustering individuals is of interest.In most cases a list of potentially important variables is available and effective discriminating or clustering variables are demanded.Here we consider a subset of the replicated metabolomic data ofMesserli et al. (2007)available in the bclust package as the gaelle data set.The metabolite data consist of 14 mutant samples of the plant Arabidopsis thaliana.Values of 43 potentially important metabolites are measured for each sample using GC-MS technology.These metabolites are supposed to depend on the genetic changes.The data involve two mutants defective in starch bio-synthesis, pgm and isa2; four defective in starch degradation sex1, sex4, mex1, and dpe2; a mutant for comparison that accumulates starch as a pleitropic effect, tpt; four uncharacterized mutants, deg172, deg263, ke103, and sex3; and three wild type plants, WsWT, RLDWT, and ColWT.There are four replicates of all samples except the last, for which there are three.