PReMiuM: An R Package for Profile Regression Mixture Models using Dirichlet Processes

PReMiuM is a recently developed R package for Bayesian clustering using a Dirichlet process mixture model. This model is an alternative to regression models, non-parametrically linking a response vector to covariate data through cluster membership. The package allows Bernoulli, Binomial, Poisson, Normal and categorical response, as well as Normal and discrete covariates. Additionally, predictions may be made for the response, and missing values for the covariates are handled. Several samplers and label switching moves are implemented along with diagnostic tools to assess convergence. A number of R functions for post-processing of the output are also provided. In addition to fitting mixtures, it may additionally be of interest to determine which covariates actively drive the mixture components. This is implemented in the package as variable selection.


Introduction
Profile regression is an alternative to regression models for when one tries to make inference beyond main effects for datasets with potentially correlated covariates. In particular, profile regression non-parametrically links a response vector to covariate data through cluster membership (Molitor et al. 2010). We have implemented this method in the R (R Core Team 2012) package PReMiuM.
PReMiuM performs Bayesian clustering using a Dirichlet process mixture model and it allows Bernoulli, Binomial, Poisson, Normal and categorical response, as well as Normal and discrete covariates. Moreover, predictions may be made for the response, and missing values for the covariates are handled. Several samplers and label switching moves are implemented along with diagnostic tools to assess convergence. A number of R functions for post-processing of the output are also provided. In addition to fitting mixtures, it may additionally be of interest to determine which covariates actively drive the mixture components. This is implemented in the package as variable selection.
In order to demonstrate the PReMiuM package, it is helpful to present an overview of the Dirichlet Process. We begin this section by re-familiarising the reader with such a process, introducing notation that we shall call upon throughout the paper. Formally, let (Ω, F, P ) be a probability space comprising a state space Ω with associated σ-field F and a probability measure P 0 . We say that a probability measure P follows a Dirichlet process with concentration parameter α and base distribution P Θ 0 parametrised by Θ 0 , written P ∼ DP(α, P Θ 0 ) if (P (A 1 ), P (A 2 ), . . . , P (A r )) ∼ Dirichlet(αP Θ 0 (A 1 ), αP Θ 0 (A 2 ), . . . , αP Θ 0 (A r )) (1) for all A 1 , A 2 , . . . , A r ∈ F such that A i ∩ A j = ∅ for all i = j and r j=1 A j = Ω.

The stick-breaking construction
Although Definition (1) is perhaps rather abstract, proof of the existence of such a process has been determined in a variety of ways, using a number of different formulations (Ferguson, 1973 andBlackwell andMacQueen, 1973). In this paper we focus on Dirichlet process mixture models (DPMM), based upon the following simplified constructive definition of the Dirichlet process, due to Sethuraman (1994). If where δ x denotes the Dirac delta function concentrated at x, then P ∼ DP(α, P Θ 0 ). This formulation for V and ψ is known as a stick-breaking distribution. Importantly, the distribution P is discrete, because drawsΘ 1 ,Θ 2 , . . . from P can only take the values in the set {Θ c : c ∈ Z + }.
As noted by many authors (for example Ishwaran andJames, 2001 andKalli et al., 2011) it is possible to extend the above formulation to more general stick-breaking formulations, for example allowing V c ∼ Beta(a c , b c ) independently, resulting in a generalised Dirichlet process, such as the two parameter Poisson-Dirichlet process (Pitman and Yor 1997). Another example is provided by Chung and Dunson (2009) and Rodriguez and Dunson (2011) where the Beta distibutions are replaced by a Probit model, which may be allowed to depend on observed data, for the purpose of capturing dependence such as a spatial structure. The methods and results that we propose within this paper hold for such generalised processes, but at present the package is only coded to implement the Dirichlet process in equation (2).
Typically, because of the complexity of the models based on the stick-breaking construction, inference is made in a Bayesian framework using Markov chain Monte Carlo (MCMC) methods. Until recently, a perceived difficulty in making inference about this model was the infinite number of parameters within the stick breaking construction. Historically, this obstacle has resulted in the use of algorithms that either explore marginal spaces where some parameters are integrated out or use truncated approximations to the full Dirichlet process mixture model, see for example Neal (2000) and Ishwaran and James (2001).
More recently, two alternative innovative approaches to sampling the full DPMM have been proposed. The first, introduced by Walker (2007), uses a novel slice sampling approach, resulting in full conditionals that may be explored by the use of a Gibbs sampler. The difficulty of the proposed approach is the introduction of contraints that complicate the updates of the mixture component weights, leading to potential mixing issues. To overcome this, Kalli et al. (2011) generalise this sampler, adding further auxilliary variables, and report good convergence results, although the authors note that the algorithm is sensitive to these additional parameters. The second distinct MCMC sampling approach was proposed in parallel by Papaspiliopoulos and Roberts (2008). The proposed sampler again uses a Gibbs sampling approach, but is based upon an idea termed retrospective sampling, allowing a dynamic approach to the determination of the number of components (and their parameters) that adapts as the sampler progresses. The cost of this approach is an ingeneous but complex Metropolis-within-Gibbs step, to determine cluster membership.
Despite the apparent differences between the two strategies, Papaspiliopoulos (2008) noted that the two algorithms can be effectively combined to yield an algorithm that improves either of the originals. The resulting sampler was implemented and presented by Yau et al. (2011), and a similar version was presented by Dunson (2009) for DPMM. The current sampler presented in this paper is our interpretation of these ideas, implemented as an R package. This package, called PReMiuM, is based upon efficient underlying C++ code for general DPMM sampling and it is available on CRAN.
In Section 2 we describe formally the Dirichlet process mixture model implemented in PRe-MiuM and the blocked MCMC sampler used. In Section 3 we discuss profile regression and its link with the response and covariate models included in the package. In Section 4 we give an overview of the postprocessing tools available in PReMiuM to learn from the rich output produced by our Bayesian model and in Section 5 we discuss diagnostic tools that we propose to investigate the convergence of the MCMC. Finally, in Section 6 we give a brief overview of the structure of the code and show examples of its use as well as an indication of run times in Section 7.
2. Sampling the Dirichlet process mixture model

Definition and properties
Perhaps the most common application of the Dirichlet process is in clustering data, where it can be used as the prior distribution for the parameters of an infinite mixture model. Consider again the stick breaking construction in Equation 2. For the Dirichlet process mixture model (DPMM), the (possibly multivariate) observed data D = (D 1 , D 2 , . . . , D n ) follow an infinite mixture distribution, where component c of the mixture is a parametric density of the form f c (·) = f (·|Θ c , Λ) parametrised by some component specific parameter Θ c and some global parameter Λ. Defining (latent) parametersΘ 1 ,Θ 2 , . . . ,Θ n as draws from a probability distribution P following a Dirichlet process DP (α, P Θ 0 ) and again denoting the dirac delta function by δ, this system can be written, When making inference using mixture models (either finite or infinite) it is common practice to introduce a vector of latent allocation variables Z. Such variables enable us to explicity characterise the clustering and additionally facilitate the design of MCMC samplers. Adopting this approach and writing ψ = (ψ 1 , ψ 2 , . . .) and Θ = (Θ 1 , Θ 2 , . . .), we re-write Equation 3 as The likelihood of D i associated with the DPMM is simply the first line of Equation 4. Integrating out the latent variable Z i we obtain the more recognisable mixture likelihood The remainder of Equation 4 provides the prior specification for the DPMM, allowing us to write the joint posterior distribution as Of course, additional layers of hierarchy could easily be introduced, for example through hyper-priors for Θ 0 .

The MCMC sampler
A common approach to MCMC sampling from the DPMM is to integrate out V and use a Gibbs sampler on the resulting space. Such samplers are commonly referred to as Pólya urn samplers, since they are motivated by the Pólya urn representation of a Dirichlet process introduced by Blackwell and MacQueen (1973). Ishwaran and James (2001) provide a review of this approach and demonstrate that conditionals of the type p(Z i |Z −i ) can be derived, where Many other authors have focused on developing alternative samplers of this nature, including Neal (2000) and Green (2010). However, as noted by many authors, samplers of this nature, where the allocation of a single observation is conditional on the allocations of all other observations, can often suffer from poor mixing. This motivates the need for an alternative class of samplers that sample from the full model 4.
In the full model, the posterior conditionals for the allocation variables Z depend upon an infinite number of variables V and Θ. One way to bypass this complication is to truncate the definition in Equation 4 to mixtures with C components (of which potentially only a subset will be non-empty). Ishwaran and James (2001) demonstrate that under this model the conditional distributions are standard distributions which can be easily sampled from. This is one of the three samplers implemented in our R package, and we refer to it as truncated.
Although this approach resolves the challenges of sampling from the full model, if C is not chosen to be sufficiently large, then the posterior may be quite different on the truncated model space compared to the full model space. The authors provide some guidance for choice of C, but more recent work by Walker (2007) and Papaspiliopoulos and Roberts (2008) demonstrate techniques that alleviate the need for such a truncation, whilst retaining many of the sampling properties of the full conditionals.
The first step is to borrow the idea of Walker (2007) and introduce auxiliary variables U = (U 1 , U 2 , . . . , U n ) such that the joint posterior can be re-written as Here, 1 A , is the function that takes the value 1 over the set A and 0 elsewhere. By combining this auxiliary variable approach with the notion of retrospective sampling (i.e. adopting a just-in-time approach to sampling empty mixture component parameters, as introduced in Papaspiliopoulos and Roberts, 2008), it is possible to construct an efficient Gibbs sampler for sampling from the joint posterior 7. Integrating U out of Equation 7 with respect to the Lebesgue measure yields the DPMM posterior distribution given in Equation 5 meaning that marginalising samples of the joint distribution over U results in samples from the desired distribution. Kalli et al. (2011) extend the idea of Walker (2007) to a general class of slice samplers by writing where ξ 1 , ξ 2 , . . . is any positive sequence.
When ξ i = ψ i , this corresponds to the efficient Gibbs sampler proposed by Papaspiliopoulos (2008) in the context of DPMM using parameter blocking. This is the second of the three samplers implemented in our R package, and we refer to it as slice dependent, in accordance with Kalli et al. (2011). Most importantly, this design permits the introduction of label switching moves, without which it is very difficult to obtain sufficient mixing. We discuss this in detail in Section 5.
For the last of the three samplers implemented in our R package, that we refer to as slice independent in accordance with Kalli et al. (2011), we set ξ = (κ − 1)κ j−1 with κ = 0.8 as proposed by Kalli et al. (2011).
We continue by defining some new notation which is required to present our slice samplers. First, given the allocation variables Z, define Similarly, given the auxiliary variables U and the vector V , define It is important to emphasise that these values potentially change at each sweep of the sampler, as the underlying variables change, although for simplicity of exposition we have omitted explicitly labelling the parameters with the sweep. The purpose of the variable C is to provide an upper limit on which mixture components need updating at each sweep. Specifically, although there are infinitely many component parameters in the model, since P (Z i = c|U i > ψ c ) = 0, we need only concentrate our updating efforts on those components c for which ψ c > U i for some i = 1, 2, . . . , n. By defining C as in equation 9 it can be shown (see Appendix A.1) that ψ c < U i for all c > C and all i = 1, 2, . . . , n. Assuming that the Markov chain is initialised accordingly and is updated using the correct conditionals, it is possible to show Z ≤ C < ∞ almost surely (details provided in Appendix A.1).
With these definitions in place we make use of the following sets and vectors (which again will change at each sweep) Here the A, P and I are disjoint sets (updated at every sweep of the MCMC algorithm) that partition Z + , with names chosen to denote Active, Potential and Inactive components respectively. It is possible that P = ∅. By definition, all observations are allocated to a mixture component labelled by one of the indices in A. Components labelled with an index in P or I are necessarily empty (i.e. have no observations allocated to them), the difference being that at the next update of the allocation variables, components with labels in P may potentially become non-empty.
The blocked infinite DPMM algorithm can now be defined using the following blocked Gibbs updates to sample from the relevant conditionals.

DPMM algorithm
Suppose we are at sweep t of the sampler. Update as follows: Step A. Compute Z and the set A. Step While this algorithm is somewhat generic, the blocking strategy is clearly highlighted. Further details explaining each of the steps are provided in Appendix A.2. The key idea is that by doing joint updates, we can marginalise out an infinite number of variables when necessary, to ensure that we are always sampling from conditional distributions that depend only upon a finite number of parameters. In particular, after marginalisation, the parameters corresponding to the inactive set I do not contribute to the conditional distributions of the other parameters, so we do not actually need to sample their values. Since these parameters have no contribution to the likelihood, if values are subsequently required they can simply be sampled from the prior retrospectively as necessary. Although the sampler is written as a blocked Gibbs sampler, where it is not possible to sample directly from full conditionals (for example in the update of Θ, depending upon the choices of f and P Θ 0 ) Metropolis-within-Gibbs steps are applied. Depending on the application, the Gibbs updates specified above may comprise several different Gibbs or Metropolis-within-Gibbs steps (for example updating Θ and Λ. Typically, where Metropolis-Hastings updates are required we advocate adopting an adaptive Metropolis-Hastings approach: see Andrieu and Thoms (2008) for a review.

Example Models
The general sampler of the previous section is applicable for many specific models, depending on the choices of f and P Θ 0 . In this section we provide further details for some of the models that are implemented within our software. We detail the prior choices that are made within our implementation, but, of course, alternative priors could be chosen.

Gaussian mixtures
Perhaps the most common model to be implemented under the DPMM framework is the Gaussian mixture model, where D = X for some covariate data X, and X assumes a mixture of Gaussian distributions. Under this setting for each cluster c, the cluster specific parameters are given by Θ c = (µ c , Σ c ), where µ c is a mean vector and Σ c is a covariance matrix. There are no additional global parameters Λ. Under this setting By choosing µ c ∼ Normal(µ 0 , Σ 0 ) and Σ c ∼ InvWishart(R 0 , κ 0 ) (for each c) for our prior model P Θ 0 we have a conjugate model, permitting Gibbs updates for the parameters µ A and Σ A associated the active components, and also those (µ P and Σ P ) associated with the potential components. The choice of values for the hyperparameters Θ 0 = (µ 0 , Θ 0 , R 0 , κ 0 ) is discussed further in section 6.

Mixed mixtures
An alternative model is given by a mixture of some continuous and discrete random variables. Following the notation used above for Gaussian and discrete mixtures, for J 1 continuous random variables and J 2 discrete random variables, where D 1 i is the subset of the continuous random variables in D i and D 2 i is the subset of the categorical random variables in D i . Note that we are assuming independence between continuous and categorical data conditional on the cluster allocations.

Profile regression
Recently, interest has grown in using DPMM as an alternative to regression models, nonparametrically linking a response vector Y to covariate data X through cluster membership. This idea has been pioneered by several authors including Dunson et al. (2008), Bigelow and Dunson (2009) Our presentation is most similar to the latter three of these articles which refer to this idea as "profile regression".
For the case of either Gaussian or Discrete mixtures, as described above, our implementation permits the joint modelling of a response vector, for various response models which we present below. Formally, the data D = (Y , X, W ) is now extended to contain response data Y i and covariate data X i for each individual i, where the contribution of the covariate data to the response may be cluster dependent. There is also the possibility to include additional fixed effects W i for each individual, which are constrained to only have a global (i.e. non-cluster specific) effect on the response Y i .
The data D i is then jointly modelled as the product of a response model and a covariate model, to give the following likelihood: The covariate likelihood f X is of either of the forms presented in Sections 3.1 or 3.2. The likelihood f Y depends upon the choice of response model.

Binary response
Adopting a binary response model, each parameter vector Θ c is extended to include an additional parameter θ c . We also introduce the global parameter vector Λ = β, of the same length L as the fixed effects vector W i , to capture the contribution of these effects. Then, For each cluster c, we adopt a t location-scale distribution for θ c , with hyperparameters µ θ and σ θ with 7 degrees of freedom, as discussed by Molitor et al. (2010). Similarly, for each fixed effect l, we adopt the same prior for β l , but with hyperparameters µ β and σ β .
The components of Θ c corresponding to the covariate model for X retain the possibility of being updated according to Gibbs samples. However, since conjugacy is not achieved with our prior choice, updating θ c for each cluster (and β l for each fixed effect l) requires a Metropolis-within-Gibbs sample. In our implementation we propose the use of adaptive random-walk-Metropolis moves.

Categorical response
The categorical response model that we use is a simple extension of the binary response model of the previous section. In particular, each parameter vector Θ c additionally contains an extra parameter vector θ c = (θ c,1 , θ c,2 , . . . , θ c,R−1 ) of length R − 1, where R is the number of possible categories represented in the response data Y . Treatment of fixed effects is also extended, so that for each response category r = 1, 2, . . . , R − 1, there is a vector β r = (β r,1 , β r,2 , . . . , β r,L ), where β r,l is the coefficient for each fixed effect l (l = 1, 2, . . . , L). This . In our sampler we use the same priors for each θ c,r as for θ c and β r,l as for β l in the binary case, with the resulting observation about Metropolis-within-Gibbs updates remaining true.

Binomial response
By providing a number of trials T i associated with each individual i (in this model an "individual" might correspond to an area or "experiment") we can extend the binary response model to a Binomial response model. In particular, Priors used are identical to the binary case. Molitor et al. (2011) provide an example where this model is employed.

Poisson response
For count-type response data, an alternative to the Binomial model is the Poisson model. Under this model, each individual i is associated with an expected offset E i , and the response is then modelled as Prior models for θ c and β are as above.

Extra variation in the response
For some of the above models, it is possible that we may wish to allow for extra variation in the response. Our sampler is designed to achieve this by alternatively modelling λ i (as defined in the above response models) by Prior distributions for θ c and β are unchanged, but in this model Λ contains an additional parameter, σ 2 ε , for which prior specification is required. For simplicity we work in terms of the precision τ ε = 1/σ 2 ε , and adopt a gamma distribution with shape parameter s τε and rate parameter r τε . This approach permits a simple Gibbs update of this parameter. In order to make inference about this model, it is also necessary to update the latent variables λ i at every sweep of the MCMC sampler. These parameters are considered an extension of Λ (as they are not directly associated with a specific cluster) and are therefore updated in Step F of the DPMM algorithm. Updates to these parameters are done using adaptive Random-walk-Metropolis steps.

Gaussian response
Our sampler is able to handle Normally distributed continuous response data. As for many of the discrete response models, Θ c is extended to contain θ c for each c. As before Λ contains β, but also σ 2 Y . These parameters allow us to write the response model as: We impose the same prior settings as for the discrete response models, with the additional prior on τ Y = 1/σ 2 Y being Gamma(s τ Y , r τ Y ), where s τ Y and r τ Y are the shape and rate hyper parameters that extend Θ 0 . Adopting this conjugate prior, updates for τ Y are simple Gibbs updates.

Variable selection
In addition to fitting mixtures, potentially linking covariates and responses, it may additionally be of interest to determine which covariates actively drive the mixture components, and which share characteristics common to all components. This can be formulated as a question of variable selection. Below we present details of how this idea can be modelled, first in the case of discrete covariates, as considered by Papathomas et al. (2012), and then in the context of Gaussian covariates, a formulation which, as far as we are aware, has not been reported elsewhere. Variable selection for DPMM has also been considered by Chung and Dunson (2009).

Discrete covariates
Following the approach taken by Papathomas et al. (2012) our sampler implements two types of variable selection. We outline these approaches briefly in this section but for full details the reader is referred to this paper.
The first approach is a type of "soft" variable selection, where for each covariate j, there is an associated variable ζ j taking values in [0, 1], which informs whether the variable j is important in terms of supporting a mixture distribution. Letting φ 0,j,k be the observed proportion of covariate j values taking the value k throughout the whole covariate dataset X, we define the new composite parameters φ c,j,k := ζ j φ c,j,k + (1 − ζ j )φ 0,j,k .
The likelihood model is then as in Equation 11 but with φ Z i ,j,k replaced by φ Z i ,j,k . Under this model the vector Λ is extended by ζ = (ζ 1 , ζ 2 , . . . , ζ J ). We replicate the prior structure presented in the original paper, in particular, inducing sparsity with the introduction of additional binary variables ρ j for each j = 1, 2, . . . , J, which also extend Λ. For extended details of the conditional posteriors and updating strategy the interested reader is referred to Papathomas et al. (2012). Of particular note is the fact that conjugacy for φ c,j is no longer retained, meaning that Metropolis-within-Gibbs updates are necessary. We use adaptive Random-walk-Metropolis proposals.
An alternative approach to the variable selection presented above, is a more traditional "hard" (binary) cluster specific variable selection. Under this model, each mixture component c has an associated vector γ c = (γ c,1 , γ c,2 , . . . , γ c,J ), where γ c,j is a binary random variable that determines whether covariate j is important to mixture component c. This allows us to write φ c,j,k := γ c,j φ c,j,k + (1 − γ c,j )φ 0,j,k = φ γ c,j c,j,k (1 − φ c,j,k ) (1−γ c,j ) , which, as in the first variable selection model, is substituted into Equation 11 in place of φ c,j,k , to provide the likelihood for the covariate model. Under this model, each parameter vector Θ c is extended by γ c . We use the same prior specification (and hence posterior conditionals) as the original paper, in particular introducing additional parameters ρ j for each j = 1, 2, . . . , J into Λ. We also observe that using this alternative approach, the binary nature of γ c,j means that direct Gibbs updates for φ c,j,k can still be used (albeit with slightly altered Dirichlet distributions).

Gaussian covariates
The two variable selection methods described above can equally be applied to the Gaussian mixture case. Definingx = (x 1 ,x 2 , . . . ,x J ), wherex j = 1 n n i=1 x i,j is the average sample value of covariate j, we can define the vector µ c = (µ c,1 , µ c,2 , . . . , µ c,J ) where either depending on which variable selection approach is being adopted. We then replace µ c,j with µ c,j in Equation 10. Using identical priors for ζ j or γ c,j and ρ j , these parameters are updated as for the discrete covariate case. The posterior conditional distributions for updating µ c are shown in Appendix A.3, whereas those for updating other parameters are as before, with µ c j replaced with µ c,j as appropriate.

Postprocessing of the MCMC output
The rich posterior output produced can be used to learn about the partition space and its uncertainty. It is useful to show a representative partition, as an effective way to convey the output of the clustering algorithm. Moreover, it is also of interest to assess the uncertainty associated with subgroups of this best partition.
We discuss below the necessary steps. See also Molitor et al. (2010).
1. Computation of the dissimilarity matrix. At each iteration of the sample, we can record pairwise cluster membership and construct a score matrix, with entries equal to 1 for pairs belonging to the same cluster and 0 otherwise. Averaging these matrices over the whole MCMC run leads to a similarity matrix S, which can be then used to identify an optimal partition.
2. Identifying the optimal partition. We have implemented two deterministic clustering procedures to characterise the optimal partition. The first is to use find the best partition by choosing the one which minimises the least-square distance to the matrix S. This approach is fast, but susceptible to Monte Carlo error. The second procedure implemented in the package is partitioning around medoids (PAM) on the dissimilarity matrix 1 − S. This procedure is available in R in the package cluster and it robustly assigns individuals to clusters in a way consistent with matrix S. PAM is implemented for each possible number of clusters up to a specified maximum, and for each fixed number of clusters the best PAM partition is selected. A final representative cluster is then chosen by maximising the average silhouette width across these best PAM partitions.
3. Computation of the average risk and profile and the corresponding credible intervals.
Given an optimal partition P * obtained as above, we examine the MCMC output to assess whether or not the model consistently clusters individuals in a manner similar to P * . Consistent clustering leads to narrower credible intervals.
For example, for Bernoulli response, we obtain a distribution of the baseline risks for each cluster defined by P * . At each iteration of the sampler we compute the average of baseline risks p z i for all individuals within a particular cluster k of the optimal partition. This average baseline risk for cluster k is computed as follows: where n k denotes the number of individuals in cluster k. In a similar way we can compute the distribution of cluster parameters for other response and covariates types.

Mixing of the MCMC algorithm
The likelihood of the DPMM is invariant to the order of cluster labels but the prior specification of the stick breaking construction is not. Therefore, to ensure adequate mixing across orderings, it is important to include label-switching moves. In this package we have implemented the two label switching moves proposed by Papaspiliopoulos and Roberts (2008) as well as a third label switching move proposed by . This latter move updates the new cluster weights so they are something like their expected value conditional upon the new allocations. See ) for more details on the performance of this sampler.
Even with these label switching moves, convergence may be problematic and the user must address this issue using diagnostic tools. One difficulty in this respect is that there are no parameters in the model that can reliably demonstrate convergence. The parameters of the fixed effects tend to converge very quickly, regardless of the underlying clustering, as they are not cluster specific and therefore are not a good indication of the overall convergence. The cluster parameters, such as the θ c 's, cannot be tracked as their number (and their labels) can change from one iteration to the next. The concentration parameter α is not a reliable indicator of convergence either, as discussed in Hastie et al. (2013).
To overcome this challenge, we have implemented the computation of the marginal model posterior p(Z|D) as an additional diagnostic tool. This represents the posterior distribution of the allocations given the data, having marginalised out all the other parameters ). The marginal model posterior is computed for each run of the MCMC and it has proved very effective for our real examples to compare runs with different initialisations and identify runs that were significantly different from others. Our experience suggests that it is harder for the MCMC algorithm to split rather than merge clusters. This means that it is important to initialise the algorithm with a number of cluster which is greater than the number of clusters that the algorithm will convergence to. The marginal model posterior can help to assess what such number is for each specific example.
Finally, while optimal partitions allow visualisation of the result of a clustering algorithm, such an approach must be applied with care as we are not aware of any effective method to quantify nor visualise clustering uncertainty. For this reason, we advise using predictions as an additional tool to assess convergence and visualise the output of the algorithm, as their posterior distributions can be compared across runs using standard methods. More details of using the package for predictions can be found in Section 7. We have observed that these posterior predictive distributions tend to be more stable than optimal partitions.

Software
Our implementation of the DPMM algorithm is available as an R package from CRAN. The software is primarily written in C++ and R.
The sampler implements the algorithm exactly as detailed in the current paper, although continued work is in progress to extend the scope of the software to cover additional models.
The program is further customisable through the specification of a hyperparameters, providing name-value pairs for the various hyperparameters used within the model being run. If the value is not set for a specific hyperparameter, it takes its default value. Default values can be found within the full documentation that is available as part of the software.
Moreover, this package can produce predicted values based on random allocations, or a Rao-Blackwellised estimate of predictions, where the probabilities of allocations are used instead of actually performing a random allocation.

Simulated example
We simulated 1,000 subjects, partitioned into 5 groups in a balanced manner. Ten binary covariates were considered. To demonstrate one of the variable selection approaches within the sampler, only the first eight covariates support a clustering structure. The response is binary. This dataset can be simulated as follows.

R> library(PReMiuM) R> inputs <-generateSampleDataFile(clusSummaryVarSelectBernoulliDiscrete())
We use the default values for all hyperparameters: Dirichlet conjugate priors with a j = 1 for the covariates and We initialised all chains allocating subjects randomly to 20 groups. We run the chain for 10,000 after a burn-in sample of 20,000 iterations. While ensuring convergence is a complex problem, we have observed good stability in all our runs, with results from independent chains virtually identical. Figure 1 shows a box-plot of the posterior distribution for the probabilities of the response and the covariates for the 5 clusters that form the representative clustering.

Variable selection
Note that covariates 9 and 10 in Figure 1 have similar profile probabilities for all clusters, as they have been simulated not to affect the clustering. The variable selection approach will tease out the covariates that contain clustering support and exclude the others from affecting the clustering.
We initialised the chains as in the simulated example above, with additional prior specifications given by p(ρ p ) ≡ Beta(0.5, 0.5).
The algorithm consistently sampled values ρ p in accordance with the simulated data, as shown in Fig. 3. This figure can be reproduced as follows.

Conclusions
The structure of PReMiuM objects gives rise to a wider variety of uses than can be described in detail here. Our intention was to provide a tutorial for Dirichlet process clustering and to illustrate the basic features of the sampler and post-processing tools that we have implemented in PReMiuM to demonstrate its utility. Our long-term goal is to continue to develop this package for analysis on complex and high dimensional datasets as well to increase the flexibility with regards to the data types that can be analysed. We provide the following proposition to support our assertions regarding C .
Proposition 1. Suppose that we have a model with posterior as given in Equation 7. Suppose Z ,U and C are defined as in Section 2.2. Then: (i) ψ c < U i for all i = 1, 2, . . . , n and all c > C almost surely; (ii) C ≥ Z almost surely; and (iii) C < ∞ almost surely.
Proof. We rely on the fact that if V 1 , V c ∼ Beta(1, α) , ψ 1 = V 1 and ψ c = V c l<c (1 − V c ) for c = 2, 3, . . . then ∞ c=1 ψ c = 1. A proof of this result for the DPMM (in terms of more general conditions) is provided by Ishwaran and James (2001). Then: (i) By definition, for all i = 1, 2, . . . , n This implies 1 − ψ Z < 1 − U , meaning By definition of C , this implies C ≥ Z almost surely.
(iii) Since ∞ c=1 ψ c = 1, this is a convergent series. By definition of a convergent series and because U > 0 we have C < ∞ almost surely.

A.2.
Below are additional comments to explain the blocking strategy employed in the DPMM algorithm. We use '·' to denote "all other parameters and data".
Step A. This step is a straightforward calculation of Z , which (potentially) changes at each iteration (with the update of Z). The set A is defined immediately conditional on this value.
Step B. This is a joint update of U and the parameters corresponding to the active components in A, with the inclusion of label switching moves. The principle is to use the identity , as detailed in Walker (2007).
Step C. To compute U is straightforward given the updated value of U from step B.4. The value of Z (and with it the set A) can only change from that computed in Step A if the mixture component corresponding to the old Z was involved in a label switching move, and then only if the component it was switched with was empty. By design of the label switching moves (see Section 5) this means that Z and A can only get smaller, with the consequence that parameters corresponding to a small number of components may be updated twice per MCMC sweep (once in Step B as part of the active components A, and once in Steps D and E as part of the updated potential components P ). This has no ill-effects as long as the most recently updated parameter values are used at each subsequent step.
Step D. This is a joint update of α, V P and V I . The principle is to use the identity p(α, V P , V I |·) = p(α|·)p(V P , V I |α, ·). We proceed by first updating α ∼ p(α|·) = p(α, V P , V I |·)dV P dV I (step D.1 ) and then sampling p(V P , V I |α, ·) (step D.2 ). To update V P , we need to alternate Gibbs samples with checks to evaluate whether the component just updated is C . In this way the set P is determined on the fly. As mentioned in Section 2.1, no actual sampling is done for the inactive components in set I as these would just be samples from the prior and have no impact on the likelihood or any other conditionals in the MCMC sweep.
If the prior for α is a Gamma distribution then it is alternatively possible to sample directly from the conditional with V A also marginalised (see Walker, 2007 andEscobar andWest,1995 for details). We retain our version as any prior for α can be then used.
D.1 Since V P and V I both correspond to empty mixture components, the only contribution to the joint posterior conditional is through the prior. This allows us to easily integrate out V P and V I . Due to the conditional independence, the resulting posterior from which this step samples is p(α|V A , Z). Typically this cannot be sampled directly, so we employ a Metropolis-within-Gibbs move to update α, using an adaptive random-walk-Metropolis proposal on the log-scale.
D.2 We begin by setting C = Z . We then repeat the following two steps until the stop condition is reached. First, check if C c=1 ψ c > 1−U . Next, if the condition is met we set C = C and stop, otherwise we set C = C + 1 and sample V C ∼ Beta(1, α).
Step E. This step updates the parameters Θ P and Θ I from the distribution p(Θ P , Θ I |·).
The set P is fully determined from step D.2. The parameters correspond to empty mixture components, so updated values of Θ P are sampled directly from the prior. As with other inactive parameter Θ I play no part in this MCMC sweep and so are not updated.
E.1 Taking advantage of the conditional independence structure, in this step we update Θ P by doing a Gibbs sample from the prior, such that Θ c ∼ p(Θ c |Θ 0 ) for each c ∈ P . As Θ c may be a vector of parameters, this may involve a number of Gibbs updates per component c. The full details depend upon the choice of P Θ 0 . See Section 3 for examples.
Step F. Here, the global (non-cluster-specific) likelihood parameters Λ associated with f are updated. There are only a finite number of such parameters so no special updates are needed.
F.1 Sample Λ ∼ p(Λ|Θ A , Z, D). Due to the conditional independence structure of the model the update only depends on the current value of the active likelihood parameters Θ A , the allocations Z, and the data D. Λ may contain multiple parameters, so this stage may contain many Gibbs and / or Metropolis-within-Gibbs steps. Full details will depend upon the choice of f and p(Λ), see Section 3 for examples.
Step G. The final step of the algorithm is to update the parameter allocations Z, conditional on the newly updated values of the other parameters.