BNPmix: an R package for Bayesian nonparametric modelling via Pitman-Yor mixtures

This introduction to the R package BNPmix is currently in press in the Journal of Statistical Software. BNPmix is an R package for Bayesian nonparametric multivariate density estimation, clustering, and regression, based on Pitman-Yor mixture models, a flexible and robust generalization of the popular class of Dirichlet process mixture models. A variety of model specifications and state-of-the-art posterior samplers are implemented. In order to achieve computational efficiency, all sampling methods are written in C++ and seamlessly integrated into R by means of the Rcpp and RcppArmadillo packages. BNPmix exploits the capabilities of ggplot2 and implements a series of generic functions to plot and print summaries of posterior densities and of the induced clustering of the data.


Introduction
Bayesian nonparametric (BNP) methods provide flexible solutions to complex problems and data which are not easily described by parametric models (Hjort, Holmes, Müller, and Walker 2010; Müller, Quintana, Jara, and Hanson 2015). A remarkable example is represented by BNP mixtures, flexible models for density estimation and clustering, nowadays a well-established modelling option in the toolbox of statisticians and applied researchers. This line of research was initiated by the Dirichlet process (DP) (Ferguson 1973) mixture of Gaussian kernels by Lo (1984), a contribution which paved the way to the definition of a rich variety of nonparametric mixture models. More recently, increasing interest has been dedicated to mixture models based on nonparametric mixing random probability measures other than the DP, which might provide increased modelling flexibility (e.g., Nieto-Barajas, Prünster, and Walker 2004; Lijoi, Mena, and Prünster 2005a,b, 2007; Argiento, Bianchini, and Guglielmi 2016).
A BNP mixture model for the observations Y^(n) = (Y_1, . . ., Y_n) is defined as

f̃(y) = ∫_Θ K(y; θ) p̃(dθ),     (1)

where Θ is the parameter space, K(·; ·) is a suitable kernel defined on R^p × Θ, and p̃ is a discrete random probability measure on Θ. Assuming that p̃ is distributed as a Pitman-Yor process (PY) (Perman, Pitman, and Yor 1992; Pitman 1995; Pitman and Yor 1997) leads to a model which stands out for being a good compromise between modelling flexibility, and mathematical and computational tractability. Based on this assumption, model (1) can be alternatively written in hierarchical form as

Y_i | θ_i ind~ K(·; θ_i),   θ_i | p̃ iid~ p̃,   p̃ ~ PY(α, ϑ; P_0),     (2)

where PY(α, ϑ; P_0) denotes a PY process with discount parameter α ∈ [0, 1), strength parameter ϑ > −α, and base probability measure P_0 defined on Θ. By exploiting the so-called stick-breaking representation (Sethuraman 1994; Pitman and Yor 1997) of the PY process, the random density f̃ can be equivalently written as an infinite sum, specifically

f̃(y) = ∑_{j≥1} π_j K(y; θ̃_j),     (3)

with θ̃_j iid~ P_0 and π_j = V_j ∏_{l<j} (1 − V_l), where V_j ind~ Beta(1 − α, ϑ + jα). The PY process (with α > 0) and the DP, a special case recovered when α = 0, are characterized by two radically different learning mechanisms (see De Blasi, Favaro, Lijoi, Mena, Prünster, and Ruggiero 2015). The discount parameter plays an important modelling role and has an impact on the induced prior distribution on the number of clusters in the data: the larger α, the flatter and less informative the prior. This, in the context of mixture models, makes the PY process more robust in estimating the clustering structure underlying the data.
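As an illustration, the stick-breaking construction above can be simulated with a few lines of base R. This is a minimal sketch, not part of BNPmix: the function name and the standard normal base measure are our own choices.

```r
# Minimal sketch (not part of BNPmix): draw a truncated stick-breaking
# realization of a PY(alpha, theta; P0) process with P0 = N(0, 1).
py_stick_breaking <- function(n_trunc, alpha, theta) {
  j <- seq_len(n_trunc)
  v <- rbeta(n_trunc, 1 - alpha, theta + j * alpha)  # V_j ~ Beta(1 - alpha, theta + j*alpha)
  w <- v * cumprod(c(1, 1 - v[-n_trunc]))            # pi_j = V_j * prod_{l<j} (1 - V_l)
  list(weights = w, atoms = rnorm(n_trunc))          # atoms theta_j ~ P0
}

set.seed(1)
draw <- py_stick_breaking(1000, alpha = 0.25, theta = 1)
sum(draw$weights)  # close to 1: the truncation captures almost all the mass
```

With a large truncation level the leftover stick mass is negligible, which is why truncated draws of this kind are often used to visualize realizations of the prior.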
While open source software implementing Markov chain Monte Carlo (MCMC) methods for some classes of BNP mixture models is already available, the BNPmix package, described in this paper, aims at filling an existing gap by providing reliable routines which make posterior inference based on PY mixtures possible, straightforward, and efficient. The BNPmix package implements different specifications of model (2), including both univariate and multivariate kernels. The inclusion of concomitant independent variables is also accounted for, by considering mixture models for partially exchangeable data with stratification induced by a categorical covariate (see Foti and Williamson 2015, and references therein), and mixture models for regression problems (in the spirit of De Iorio, Müller, Rosner, and MacEachern 2004). Moreover, while there is consensus that MCMC methods represent the gold standard for carrying out posterior inference for BNP mixtures, it is known that different MCMC simulation schemes feature different properties and thus might be convenient for different purposes. With this in mind, BNPmix implements three state-of-the-art MCMC methods for PY mixture models, henceforth referred to as the marginal sampler, the slice sampler, and the importance conditional sampler, providing the user with the option to choose among them.
The list of R packages implementing MCMC techniques for BNP models is rich. In order to clarify which features are specific to BNPmix and which are shared by other packages, we review state-of-the-art R packages for BNP inference via MCMC. To this end, a list of the models implemented in BNPmix is presented in Table 3 in Appendix A.1, along with the availability of the same models in other packages. Similarly, Table 4 in Appendix A.1 compares the main technical features of BNPmix with those of other packages. While these tables focus on the features of BNPmix, it is worth stressing that the packages considered for the comparison also implement other models, which we review concisely. The DPpackage by Jara, Hanson, Quintana, Müller, and Rosner (2011) is probably the most comprehensive of the packages we considered. It is mainly written in Fortran and consists of a rich collection of functions implementing some of the most successful Bayesian nonparametric and semiparametric models, including DP and dependent Dirichlet process (DDP) mixtures, hierarchical DP, Pólya trees, and random Bernstein polynomials. Regrettably, despite being widely used, the DPpackage was recently archived from the Comprehensive R Archive Network. More recent and, in some cases, more specific R packages for BNP inference via mixture models are PReMiuM (Liverani, Hastie, Azizi, Papathomas, and Richardson 2015), BNPdensity (Barrios, Lijoi, Nieto-Barajas, Prünster, and Kon Kam King 2017), dirichletprocess (Ross and Markwick 2020), BNPMIXcluster (Carmona, Nieto-Barajas, and Canale 2017), and msBP (Canale 2017). The main focus of PReMiuM is the definition of nonparametric regression models linking an outcome (continuous, binary, or discrete) to a set of covariates via dependent nonparametric mixtures. The package implements a rich set of functions to explore the output of the analysis, including the induced clustering structure and the role of each covariate in driving the final model specification. Models are implemented for the DP case and extended to the PY process only by means of approximation. Another interesting feature of PReMiuM is the implementation of a label switching move (Liverani et al. 2015) to improve the mixing of MCMC samplers. BNPdensity implements a Ferguson and Klass algorithm (Ferguson and Klass 1972; Barrios, Lijoi, Nieto-Barajas, and Prünster 2013) for a large class of univariate mixture models, with the mixing random probability measure assumed distributed as a normalized random measure (Regazzini, Lijoi, and Prünster 2003). The package offers the choice of five different kernels (Gaussian, double exponential, gamma, lognormal, and beta) and accounts for the presence of censored data. The dirichletprocess package provides a set of functions allowing users to implement their own DP mixture model: a new object class is introduced, which acts as a building block for a variety of specific statistical models. BNPMIXcluster focuses exclusively on the case of multivariate mixed-scale data (Canale and Dunson 2011; Carmona, Nieto-Barajas, and Canale 2019), a framework for which a marginal MCMC sampler for PY mixture models is implemented. The package msBP only implements the multiscale Bernstein polynomial mixture model of Canale and Dunson (2014). The described set of R packages, overall, offers the possibility of carrying out statistical inference by means of a very broad collection of BNP mixture models. At the same time, when the focus is on the use of the PY process, BNPmix plays a leading role. Finally, it is worth mentioning the increasing attention recently dedicated by the BNP literature to variational methods approximating the posterior distribution (Blei and Jordan 2006; Hughes, Kim, and Sudderth 2015; Campbell, Straub, Fisher III, and How 2015; Tank, Foti, and Fox 2015): the availability of R packages implementing such an approach for BNP models is rather limited though, a notable exception being the package MixDir (Ahlmann-Eltze and Yau 2018), which implements a hierarchical DP mixture of multinomial kernels.
The rest of the paper is organized as follows. Section 2 describes the model specifications accounted for in BNPmix. Section 3 introduces the implemented state-of-the-art MCMC methods for posterior simulation. An overview of the design philosophy of the package is presented in Section 4. In Section 5 the main features and usage of the package are illustrated by means of illustrative analyses of both synthetic and real data. A further comparison with other R packages for BNP inference, and technical details on the parametrization of the implemented models, are provided in the Appendix.

Model specifications
In this section we briefly introduce the different PY mixture models implemented in the BNPmix package. The implemented models can be classified into two main classes, based on whether or not they account for individual covariates. Models in the covariate-free class are obtained by suitably specifying parametric space, kernel, and base measure in (2). Models accounting for covariates can be seen as generalizations of (2) to the context of regression problems, and to the analysis of correlated samples by means of DDPs.
The focus of the BNPmix package is on models obtained by mixing univariate or multivariate Gaussian kernels, or univariate normal regression kernels, specifications which guarantee that the resulting mixture models are dense in the space of regular densities (Lo 1984). In addition, the parametric space and the base measure can be chosen so as to impose constraints on the variance and, in the multivariate case, on the covariance structure of the model. The resulting mixture models are discussed next in more detail. For the sake of clarity, the parametric forms of the distributions introduced throughout the paper are reported in Appendix A.2.

Univariate location PY mixture model
We consider a univariate Gaussian kernel and define a mixture on its location parameter (West 1991). That is, we set θ = µ, and thus Θ = R, and consider the kernel K(y; θ) = φ(y; µ, σ²), where φ(·; µ, σ²) denotes the probability density function of a normal random variable with mean µ and variance σ². Following this specification, the random density in (3) becomes

f̃(y) = ∑_{j≥1} π_j φ(y; µ̃_j, σ²).

In order to achieve conjugacy between kernel and base measure, we assume P_0 is normal, namely µ̃_j iid~ N(m_0, σ²_0), and we assign σ² an inverse gamma prior, that is, σ² ~ IGa(a_0, b_0). The model can be completed by specifying a normal-inverse gamma hyperprior on (m_0, σ²_0), which is tantamount to assuming σ²_0 ~ IGa(a_1, b_1) and m_0 | σ²_0 ~ N(m_1, σ²_0/k_1). Figure 1 displays a graphical representation of the model specification for the univariate location PY mixture model.

Univariate location-scale PY mixture model
As an extension of the previous model specification, we consider a univariate Gaussian kernel and define a mixture on the vector (µ, σ²) composed of the location and scale parameters of the kernel (Escobar and West 1995). That is, we set θ = (µ, σ²), and thus Θ = R × R⁺, and consider the kernel K(y; θ) = φ(y; µ, σ²). Following this specification, the random density in (3) becomes

f̃(y) = ∑_{j≥1} π_j φ(y; µ̃_j, σ̃²_j).

In order to achieve conjugacy, we consider a normal-inverse gamma base measure P_0, namely σ̃²_j iid~ IGa(a_0, b_0) and µ̃_j | σ̃²_j ind~ N(m_0, σ̃²_j/k_0). The model can be completed by assigning independent hyperpriors to the main parameters of the base measure. Specifically, m_0 ~ N(m_1, σ²_1), k_0 ~ Ga(τ_1, ζ_1), and b_0 ~ Ga(a_1, b_1). Figure 2 displays a hierarchical representation of the univariate location-scale PY mixture model.

Multivariate location PY mixture model

We consider a p-dimensional multivariate Gaussian kernel and define a mixture on its location parameter, that is, we set θ = µ, and thus Θ = R^p, and consider the kernel K(y; θ) = φ_p(y; µ, Σ). Following this specification, the random density in (3) becomes

f̃(y) = ∑_{j≥1} π_j φ_p(y; µ̃_j, Σ).

In order to achieve conjugacy, we consider a multivariate normal base measure P_0, that is, µ̃_j iid~ N_p(m_0, S_0), and we endow Σ with an inverse Wishart prior, namely Σ ~ IW(ν_0, Σ_0). The model specification can be completed by assigning a normal-inverse Wishart hyperprior to (m_0, S_0), that is, by assuming S_0 ~ IW(λ_1, Λ_1) and m_0 | S_0 ~ N_p(m_1, S_0/k_1). Figure 3 displays a hierarchical representation of the multivariate location PY mixture model.

Multivariate location-scale PY mixture model: full covariance matrix
As an extension of the previous specification, we consider a p-dimensional multivariate Gaussian kernel and define a mixture jointly on the mean vector µ and the covariance matrix Σ (Escobar and West 1995; Müller, Erkanli, and West 1996). That is, we set θ = (µ, Σ), and thus Θ = R^p × S^p₊, where S^p₊ denotes the space of p × p positive semi-definite matrices, and K(y; θ) = φ_p(y; µ, Σ). Following this specification, the random density in (3) becomes

f̃(y) = ∑_{j≥1} π_j φ_p(y; µ̃_j, Σ̃_j).

In order to achieve conjugacy, we specify a normal-inverse Wishart base measure P_0, namely Σ̃_j iid~ IW(ν_0, Σ_0) and µ̃_j | Σ̃_j ind~ N_p(m_0, Σ̃_j/k_0). The model can be completed by assuming independent hyperpriors on m_0, k_0, and Σ_0, namely m_0 ~ N_p(m_1, S_1), k_0 ~ Ga(τ_1, ζ_1), and a conjugate Wishart hyperprior on Σ_0. Figure 4 displays a hierarchical representation of the multivariate location-scale PY mixture model with full covariance matrix.

Multivariate location-scale PY mixture model: diagonal covariance matrix
A more parsimonious version of the multivariate location-scale PY mixture model presented in the previous section is obtained by constraining the random covariance matrices Σ̃_j to be diagonal (see, e.g., Bouveyron and Brunet-Saumard 2014). This model specification is particularly convenient when p is large. Based on this assumption, the random density in (3) becomes

f̃(y) = ∑_{j≥1} π_j ∏_{r=1}^p φ(y_r; µ̃_jr, σ̃²_jr),

where y_r is the rth component of y, µ̃_j = (µ̃_j1, . . ., µ̃_jp), and Σ̃_j is a diagonal matrix with diagonal σ̃²_j = (σ̃²_j1, . . ., σ̃²_jp). Conditionally on p̃, the marginal model specification for each component is equivalent to the univariate location-scale case. Specifically, the base measure P_0 is the product of p independent normal-inverse gamma distributions, that is, for each r = 1, . . ., p, σ̃²_jr ~ IGa(a_0r, b_0r) and µ̃_jr | σ̃²_jr ~ N(m_0r, σ̃²_jr/k_0r).
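Under the diagonal specification, the density contributed by a single mixture component factorizes over coordinates. A small base-R sketch (our own hypothetical helper, not a BNPmix function) makes this explicit:

```r
# Sketch (hypothetical helper, not part of BNPmix): density of one mixture
# component under the diagonal covariance specification, i.e., a product of
# p univariate normal densities.
diag_component_density <- function(y, mu, sigma2) {
  prod(dnorm(y, mean = mu, sd = sqrt(sigma2)))
}

# At the component mean with unit variances, this equals (2*pi)^(-p/2):
diag_component_density(c(0, 0), mu = c(0, 0), sigma2 = c(1, 1))  # 1/(2*pi)
```

The p-fold product replaces a p × p covariance inversion with p scalar operations, which is the source of the computational savings when p is large.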

Univariate regression PY mixture model
We consider an infinite mixture of regressions model accounting for a d-dimensional vector of independent variables x (Dunson, Pillai, and Park 2007). This is obtained by considering a univariate Gaussian kernel with regressors acting linearly on the location parameter, and by defining a mixture on the regression coefficients β = (β_0, β_1, . . ., β_d), with the scale parameter of the kernel common across different groups. That is, we define x_0 = (1, x′)′, we set θ = β, and thus Θ = R^{d+1}, and consider the kernel K(y; θ) = φ(y; x′_0 β, σ²). Following this specification, the random density in (3) becomes

f̃(y; x) = ∑_{j≥1} π_j φ(y; x′_0 β̃_j, σ²).

In order to achieve conjugacy, we assume that the base measure P_0 is a multivariate normal distribution, that is, β̃_j iid~ N_{d+1}(m_0, S_0), with j ≥ 1. We set an inverse gamma distribution for the common scale parameter, σ² ~ IGa(a_0, b_0). The model can be completed by endowing (m_0, S_0) with a normal-inverse Wishart hyperprior, that is, S_0 ~ IW(ν_1, S_1) and m_0 | S_0 ~ N_{d+1}(m_1, S_0/k_1). Figure 6 displays a hierarchical representation of the univariate regression PY mixture model.

Univariate regression-scale PY mixture model
As an extension of the previous specification, we consider an infinite mixture of regressions model accounting for a d-dimensional vector of independent variables x, obtained by jointly defining a mixture on the regression coefficients and the scale parameter of the univariate Gaussian kernel (Dunson et al. 2007). That is, we set θ = (β, σ²), and thus Θ = R^{d+1} × R⁺, and consider the kernel K(y; θ) = φ(y; x′_0 β, σ²). Following this specification, the random density in (3) becomes

f̃(y; x) = ∑_{j≥1} π_j φ(y; x′_0 β̃_j, σ̃²_j).

In order to achieve conjugacy, the base measure P_0 is set equal to the product of a multivariate normal distribution and an inverse gamma distribution, that is, β̃_j iid~ N_{d+1}(m_0, S_0) and σ̃²_j iid~ IGa(a_0, b_0). The model can be completed by endowing (m_0, S_0) with a normal-inverse Wishart hyperprior, that is, S_0 ~ IW(ν_1, S_1) and m_0 | S_0 ~ N_{d+1}(m_1, S_0/k_1), and by assuming b_0 ~ Ga(τ_1, ζ_1). Figure 7 displays a hierarchical representation of the univariate regression-scale PY mixture model.

Univariate location-scale DDP mixture model
We assume each observation y_i is endowed with a categorical covariate x_i taking values in {1, . . ., L}, which allows observations to be gathered into L distinct groups (y_l1, . . ., y_ln_l), with l = 1, . . ., L. In order to account for heterogeneity across groups, while allowing borrowing of information, we consider the partially exchangeable mixture model proposed by Lijoi, Nipoti, and Prünster (2014). Namely, each group is modelled by means of a mixture with Gaussian kernel and group-specific random probability measure p̃_l. The vector p̃ = (p̃_1, . . ., p̃_L) is distributed as a Griffiths-Milne dependent Dirichlet process (GM-DDP) with parameters ϑ > 0 and z ∈ (0, 1), and base measure P_0 on Θ = R × R⁺, which implies that, marginally, groups are modelled with identically distributed location-scale DP mixtures, obtained by setting α = 0 in (3).
The parameter z controls the dependence across the components of p̃, with the two extremes of the range corresponding to full exchangeability (when z → 0), that is, p̃_1 = p̃_2 = . . . = p̃_L, and independence across groups (when z → 1), that is, p̃_l iid~ DP(ϑ; P_0). Such a characterization of the extreme cases helps in setting a value for z, which by default is fixed equal to 0.5. A formal elicitation of z might be obtained by specifying the value of the correlation between the random variables p̃_l1(A) and p̃_l2(A), with l_1 ≠ l_2 and for a measurable set A (see Equations 12 and 13 in Lijoi et al. 2014), a quantity which is invariant with respect to the choice of A. In order to achieve conjugacy, a normal-inverse gamma base measure is considered, that is, σ̃²_lj iid~ IGa(a_0, b_0) and µ̃_lj | σ̃²_lj ind~ N(m_0, σ̃²_lj/k_0). Figure 8 displays a hierarchical representation of the univariate location-scale DDP mixture model.

Posterior simulation methods
The BNPmix package implements three types of MCMC algorithms for posterior simulation, namely the marginal sampler (Escobar and West 1995; Müller et al. 1996; Neal 2000), the slice sampler (Walker 2007; Kalli, Griffin, and Walker 2011), and the importance conditional sampler (ICS) (Canale, Corradin, and Nipoti 2020). The three sampling schemes are implemented for all the models described in Section 2, with the exception of the univariate location-scale DDP mixture model, for which only the ICS is made available. Different simulation methods might be convenient for different purposes (see the discussion in Canale et al. 2020): with this in mind, the method is offered as an option in the functions of the package, the default being the ICS, whose efficiency has proven the most robust to model specifications (Canale et al. 2020).

Marginal sampler

Escobar and West (1995) and Müller et al. (1996) propose a marginal sampling scheme for DP mixtures of univariate and multivariate Gaussian kernels, respectively. The approach relies on the analytical marginalization of the mixing random probability measure p̃, while the parameters θ = (θ_1, . . ., θ_n) appearing in (2) are updated by means of a Gibbs sampler. The full conditional of each θ_i is reminiscent of the urn scheme of Blackwell and MacQueen (1973), and, for the PY case with generic kernel K(·; ·), is given by

P(θ_i ∈ dt | θ^(−i), y_i) ∝ (ϑ + α k^(−i)) K(y_i; t) P_0(dt) + ∑_{j=1}^{k^(−i)} (n_j^(−i) − α) K(y_i; θ*_j) δ_{θ*_j}(dt),     (4)

where θ^(−i) = (θ_1, . . ., θ_{i−1}, θ_{i+1}, . . ., θ_n), k^(−i) is the number of distinct values θ*_j in θ^(−i), and n_j^(−i) is the frequency of θ*_j in θ^(−i).
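The allocation mechanism behind the urn scheme can be sketched in a few lines of base R. This is our own illustrative code, not part of BNPmix, and it shows the prior-predictive allocation probabilities only, ignoring the kernel terms K(y_i; ·):

```r
# Sketch (not part of BNPmix): PY urn allocation probabilities given the
# cluster frequencies in theta^(-i). An existing cluster j is chosen with
# probability proportional to n_j - alpha; a new cluster is opened with
# probability proportional to theta + alpha * k.
py_urn_probs <- function(freqs, alpha, theta) {
  k <- length(freqs)
  c(freqs - alpha, theta + alpha * k) / (sum(freqs) + theta)
}

py_urn_probs(c(3, 2), alpha = 0.25, theta = 1)
# last element is the probability of opening a new cluster
```

Note how the discount α simultaneously penalizes small existing clusters and inflates the probability of a new one, which is the reinforcement mechanism distinguishing the PY process from the DP (α = 0).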

Slice sampler
Introduced by Walker (2007) for DP mixtures, and improved upon by Kalli et al. (2011), the slice sampler is a conditional method which works by introducing a suitable vector of augmenting random variables U^(n) = (U_1, . . ., U_n). The components of U^(n) are independent uniform random variables and are such that the joint distribution of (Y_i, U_i) is given by

f̃(y_i, u_i) = ∑_{j≥1} 1(u_i < ξ_j) (π_j/ξ_j) K(y_i; θ̃_j),     (5)

where {ξ_j}_{j=1}^∞ is any positive sequence. The random distribution of Y_i, given in (3), is recovered by marginalising (5) with respect to U_i. As a convenient by-product, the problem of dealing with the infinite-dimensional p̃ boils down to dealing with a finite sum, with a random number of terms. Approximate realizations of the posterior distribution of f̃ are obtained by implementing a Gibbs sampler which serves the purpose of marginalising with respect to U^(n). Different choices for the sequence {ξ_j}_{j=1}^∞ can be considered. Following Kalli et al. (2011), BNPmix implements the dependent slice-efficient version of the algorithm by setting ξ_j = π_j, which leads to a convenient simplification of (5), and different versions of the independent slice-efficient sampler obtained by choosing a deterministic sequence {ξ_j}_{j=1}^∞, the default choice being ξ_j = E(π_j). The efficiency of the dependent slice-efficient algorithm, when used for PY mixtures, is compromised if large values of α and ϑ are considered (see the discussion in Canale et al. 2020).
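The key computational gain is that, conditionally on u_i, only the finitely many components with ξ_j > u_i have positive probability. A toy base-R illustration (our own code, not a BNPmix routine):

```r
# Sketch (not part of BNPmix): given a slice variable u and a deterministic
# sequence xi_j, only the components with xi_j > u are active.
active_components <- function(u, xi) which(xi > u)

# Under a DP with theta = 1, E(pi_j) = 0.5^j, so a natural deterministic
# sequence for the independent slice-efficient sampler is:
xi <- 0.5^(1:20)
active_components(0.1, xi)  # only the first three components are active
```

The smaller u_i is, the more components must be considered, which is why the tail behaviour of {ξ_j} drives the trade-off between truncation cost and mixing.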

Importance conditional sampler
The ICS is a conditional sampling strategy for PY mixture models, recently introduced by Canale et al. (2020). It combines a convenient characterization of the posterior distribution of a PY process (Pitman 1996) with a sampling importance resampling (SIR) step. The problem of dealing with the infinite-dimensional p̃ reduces to: 1) evaluating p̃ on a finite-dimensional partition of the parameter space Θ, and 2) sampling, via SIR, from a distribution proportional to K(y_i; t)q(dt), where q is a PY process used as proposal distribution. Approximate realizations of the posterior distribution of f̃ are then obtained by marginalising θ out with a Gibbs sampler. The full conditional distribution of each θ_i, in a similar fashion to Algorithm 8 in Neal (2000), boils down to a discrete distribution with finite support, where (t*_1, . . ., t*_{k_m}) and (m_1, . . ., m_{k_m}) denote, respectively, the distinct values in the m-dimensional exchangeable sample generated from q and their frequencies. The value of m can be chosen arbitrarily, with large values favouring good mixing of the chain at the price of a longer runtime.
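The SIR step in 2) can be illustrated with a self-contained sketch. This is not BNPmix code: we stand in for draws from q with standard normal variates and use a unit-variance Gaussian kernel, and all names are our own.

```r
# Sketch (not part of BNPmix): sampling importance resampling from a
# distribution proportional to K(y; t) q(dt). Draws from the proposal q are
# mimicked by N(0, 1) variates; K is a Gaussian kernel with unit variance.
sir_draw <- function(y, m) {
  t_prop <- rnorm(m)                           # m proposal draws from q
  w <- dnorm(y, mean = t_prop, sd = 1)         # importance weights K(y; t)
  sample(t_prop, size = 1, prob = w / sum(w))  # resample one value
}

set.seed(2)
sir_draw(0.3, m = 10)
```

Increasing m improves the quality of the resampled value, mirroring the role of the m_imp argument discussed in Section 4.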

Package implementation
The BNPmix package consists of three main R functions, wrappers of C++ routines which implement the BNP models described in Section 2 and the MCMC simulation methods introduced in Section 3, along with some user-friendly functions which facilitate the elicitation of prior distributions and the postprocessing of the generated posterior samples. The three main functions are PYdensity, PYregression, and DDPdensity, and implement, respectively, Pitman-Yor mixture models for density estimation and clustering, a Pitman-Yor mixture model for density regression, and a DDP mixture model for density estimation with correlated samples. The spirit of the package is two-fold. On the one side, it provides a ready-to-use suite of functions to estimate a rich variety of BNP models: to this end, all functions provide default specifications of their arguments, thus allowing for a non-informative estimation procedure. On the other side, each function offers a detailed list of arguments, concerning prior parameters, hyperprior specifications, kernel structure, and sampling approach, which can be tuned by the more experienced user so as to formalise and exploit available prior information.
The remainder of the section deals with both what lies under the hood (Section 4.1) and the design of the R wrappers (Section 4.2).

Low level implementation
All the main functions of the BNPmix package are written in C++ and exploit the seamless interface to R provided by Rcpp and RcppArmadillo (Eddelbuettel and François 2011;Eddelbuettel and Sanderson 2014).
All the C++ routines implementing the models listed in Section 2 share a common structure, as they consist of a main function, where all the required objects are initialized, followed by a loop cycling over the number of MCMC iterations. Within this loop, all the method-specific functions are called to update the MCMC status.
Broadly speaking, any MCMC algorithm for posterior simulation under a mixture model setup includes two steps, which consist in i) allocating individual observations to clusters or mixture components, and ii) updating the parameters specific to each group or mixture component. In our C++ implementation, step i) is performed by specific functions, with names of the type clust_update_method_model, where method refers to one of the three MCMC methods discussed in Section 3, while model refers to one of the mixture specifications presented in Section 2. While step i) differs intrinsically from one MCMC method to another, step ii) stays fundamentally unchanged across MCMC methods, as the update of component-specific parameters essentially depends only on the structure of the kernel. In our C++ implementation, step ii) is performed by specific routines, whose names, accellerate_method_model, follow the same logic as those of the functions implementing step i). While method-specific, these functions share most of the C++ code needed to perform common tasks. Such a common structure allows, as a by-product, for a fair comparison between the performance of different MCMC simulation methods.
In order to limit memory usage, all the accellerate_method_model functions are of type void. This conveniently allows us to update, at each iteration of the MCMC, all the quantities needed to produce posterior summaries, e.g., realizations of posterior densities evaluated at a grid of points, or partitions of the data into clusters, while passing on to the next iteration only those required for the evaluation of the full conditional distributions. The number of elements to update can vary across the iterations of the MCMC (see Canale et al. 2020, for details), an aspect which is addressed by means of method-specific functions, named para_clean_method_model, which suitably resize and reorder the arrays where the values taken by the model parameters are stored. In line with the two-fold spirit of the package, however, the option to save all the quantities generated by the MCMC algorithm is offered, so that the user can independently compute any posterior summary that might be of interest.
Finally, functions of the type hyper_accellerate_method_model allow hyperprior distributions to be added on the parameters defining the base measure, thus sparing the user the often daunting task of eliciting those parameters.

Wrappers to the main functions
A BNP mixture model can be fitted with the BNPmix package by calling an R function which, based on its arguments, interfaces with one of the specific C++ subroutines described in Section 4.1. All the R functions require a data set y and a set of arguments referring to the MCMC method, the prior specification, and the form of the produced output. The MCMC parameters are passed through the named list mcmc, which allows the following arguments to be specified:

• niter, the number of MCMC iterations;
• nburn, the number of burn-in iterations to discard;
• nupd, argument controlling the number of iterations to be displayed on screen;
• print_message, control option: if equal to TRUE, the status is printed to standard output every nupd iterations.
The prior specification is passed to the C++ routines through the named list prior.The latter includes the arguments strength (1 by default) and discount (0 by default) for the strength parameter ϑ and the discount parameter α of the Pitman-Yor process, and a set of model-specific arguments, as described in Section 2. If hyper = TRUE, as by default, the parameters defining the base measure are endowed with hyperprior distributions.
Finally, the arguments of the named list output describe the output which is to be returned. Such a list shares a structure which is common to all the functions, and contains:

• out_type, the summaries of the posterior distribution to be returned. If equal to "FULL", the function returns the estimated partition and all the MCMC realizations of the posterior densities. If equal to "MEAN", the function returns the estimated partition and the mean of the sampled densities. If equal to "CLUST", the function only returns the estimated partition;
• out_param, option to return the values taken by cluster- or component-specific parameters. If equal to TRUE, the function returns all the MCMC draws of such parameters;
• grid, a grid of points (or a data frame obtained via expand.grid) where to evaluate posterior densities.
While the general structure of these named lists is common across functions, there exist model-specific arguments which we describe, for each R function, in the remainder of the section.

PY mixture models
The function PYdensity performs univariate and multivariate density estimation and clustering, via Pitman-Yor mixtures with Gaussian kernels. Specific elements of the mcmc list are:

• method, the MCMC algorithm chosen to perform the estimation. Three options are available, namely ICS (method = "ICS"), marginal sampler (method = "MAR"), and slice sampler (method = "SLI");
• model, the specific mixture model, among those described in Section 2, to be fitted. The default choice is the location-scale mixture (model = "LS") for both univariate and multivariate data. Other options are the location mixture (model = "L") for both univariate and multivariate data, and the location-scale mixture with diagonal covariance matrix (model = "DLS") for multivariate data only;
• m_imp (available if method = "ICS"), the size, m in the notation of Section 3, of the exchangeable sample generated from the proposal distribution q in the SIR step of the ICS method. Default is m_imp = 10;
• slice_type (available if method = "SLI"), the specification of the type of slice sampler. Options are "DEP" for dependent slice-efficient, and "INDEP" for independent slice-efficient. Default is slice_type = "DEP".
Finally, the named list prior for PYdensity admits a set of model-specific arguments, for which an exhaustive description is reported in Table 1. The default values of the arguments in the prior list are set so that the expectation of the location component equals the sample mean, and both the variance of the location component and the expectation of the scale component coincide with the sample variance.
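Putting the pieces together, a minimal call might look as follows. This is a hedged usage sketch assuming the BNPmix package is installed; the simulated data and the specific argument values are purely illustrative.

```r
# Illustrative usage sketch, assuming the BNPmix package is installed.
library(BNPmix)
set.seed(42)
y <- c(rnorm(100, -3, 1), rnorm(100, 3, 1))   # two well-separated groups

fit <- PYdensity(y = y,
                 mcmc = list(niter = 5000, nburn = 1000, method = "ICS"),
                 prior = list(strength = 1, discount = 0.1),
                 output = list(grid = seq(-8, 8, length.out = 200)))

plot(fit, show_hist = TRUE)  # posterior mean density over the data histogram
```

Leaving prior unspecified would instead trigger the data-driven defaults described above.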

PY density regression
The function PYregression performs univariate density regression and clustering, via Pitman-Yor mixtures of normal linear regressions. Besides y, the argument x, that is, the values taken by the d regressors, is also required.
The specific arguments of the named list mcmc for PYregression are:

• method, the MCMC algorithm chosen to perform the estimation. Three options are available, namely ICS (method = "ICS"), marginal sampler (method = "MAR"), and slice sampler (method = "SLI");
• model, the specific mixture model, between those described in Section 2, to be fitted. The default choice is the location-scale mixture (model = "LS"); a second option is the location mixture (model = "L");
• m_imp (available if method = "ICS"), the size of the exchangeable sample generated from the proposal distribution q in the SIR step of the ICS method. Default is m_imp = 10;
• m_marginal (available if method = "MAR"), the number of auxiliary variables introduced for the Monte Carlo evaluation of the probability that θ_i takes a new value in (4) (see Algorithm 8 of Neal 2000);
• slice_type (available if method = "SLI"), the specification of the type of slice sampler.
Finally, the model-specific arguments admitted by the named list prior for PYregression are described in Table 2. The default specification of the hyperparameters is such that the intercept has prior expectation equal to the average of y and variance 100, the regression coefficients have prior expectation and variance equal to 0 and 100, respectively, and the prior expectation of the scale component coincides with the sample variance of y.

DDP density estimation
The function DDPdensity performs univariate density estimation for correlated samples via the GM-DDP mixture model with Gaussian kernel. Besides y, an argument named groups is also required. The latter can be thought of as a label assigned to each observation in y, which thus identifies distinct subsamples in y. Unlike the previous two functions, the named list mcmc does not account for the choice of MCMC algorithm, since only the ICS method is implemented for this class of models. Besides the common arguments, the mcmc named list includes, as model-specific argument, m_imp, the size of the exchangeable sample generated from the proposal distribution q in the SIR step of the ICS method.
The base measure of the DDPdensity function is a normal-inverse gamma distribution with parameters m0, k0, a0 and b0, where the first two are the mean and the scale factor defining the normal base measure on the location parameter, and the latter two are the shape and the scale of the inverse gamma base measure on the scale parameter. These parameters can be specified as arguments of the named list prior. Their default specification is such that the expectation of the location component of the base measure is equal to the overall sample mean, obtained by pooling the groups together, and the expectation of the scale component coincides with the overall sample variance. In addition, the prior list admits the argument wei (z in the notation of Section 2), a parameter taking values in (0, 1) (default is 0.5) which controls the strength of the borrowing of information across groups. Finally, notice that, for the DDPdensity function, the argument discount is fixed to 0, while one can tune the argument strength (1 by default), corresponding to the parameter ϑ in the specification of the univariate location-scale DDP mixture model in Section 2.
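For illustration, a possible prior list for DDPdensity could look as follows (values are illustrative; the base measure parameters m0, k0, a0, and b0 are omitted and thus keep their sample-based defaults):

```r
# Illustrative prior list for DDPdensity; the discount parameter is
# fixed to 0 for this class of models.
prior <- list(strength = 1,  # parameter theta of the DDP (default 1)
              wei = 0.5)     # parameter z, controlling borrowing across groups
```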

Other functions
The PYdensity, PYregression, and DDPdensity functions return an object of class BNPdens. The plot method, extended to the BNPdens class by means of the ggplot2 package, produces a plot of the estimated posterior mean density function. Based on the type of model and the dimension of the data, plot can be called with additional arguments. Specifically: • if the BNPdens object is produced by PYdensity and the data are univariate, plot admits the logical argument show_hist which, if TRUE, displays the histogram of the raw data along with the posterior density. The size of the bins can be set with bin_size.
Observations can be displayed on the x-axis by specifying show_points = TRUE. Moreover, if also show_clust = TRUE, the displayed observations are coloured based on the estimated clustering. The col argument controls the colour of the estimated posterior density. Pointwise posterior credible bands of level conf_level around the posterior mean densities can be added by setting show_bands = TRUE; • if the BNPdens object is produced by PYdensity and the data are multivariate, plot generates, for the pair of variables indexed by dimension, the bivariate contour plot of the corresponding bivariate marginal density function. Adding show_points = TRUE or show_clust = TRUE displays the observations in the contour plot and colours them based on the estimated clustering; • if the BNPdens object is produced by PYregression and the number of covariates does not exceed four, the plot function returns the scatterplot of the observations coloured according to the estimated clustering; • if the BNPdens object is produced by DDPdensity, and thus the data consist of two or more subgroups, the plot function returns a wrapped plot, with each subplot corresponding to a specific group. An additional argument, specific to the output of DDPdensity, is wrap_dim, which allows the user to choose the number of rows and columns of the plot.
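The options above can be combined as in the following sketch, where fit denotes a BNPdens object produced by PYdensity on univariate data and the argument values are illustrative:

```r
# Illustrative calls to the plot method for a univariate BNPdens object.
plot(fit, show_hist = TRUE, bin_size = 0.5)       # histogram of the raw data
plot(fit, show_points = TRUE, show_clust = TRUE)  # points coloured by cluster
plot(fit, show_bands = TRUE)                      # pointwise credible bands
```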
Other methods have been extended to the BNPdens class. The print method returns details on the BNPdens object and on the model which produced it. The summary method returns basic information on the model fitting, such as the number of iterations, the number of burn-in iterations, the computational time, and the average number of clusters. The partition method, when applied to BNPdens objects, returns a point estimate for the clustering of the data, based on the partitions visited during the MCMC algorithm. This can be achieved by adopting two distinct loss functions, namely the variation of information loss function (dist = "VI") and Binder's loss function (dist = "Binder"). This method is an efficient C++ implementation of the approach described by Wade and Ghahramani (2018). The BNPmix package also includes a method, named BNPdens2coda, to interface objects of class BNPdens with the coda package.
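A sketch of the partition method applied to a generic fitted object fit of class BNPdens:

```r
# Point estimates of the data partition under the two supported losses.
part_vi     <- partition(fit, dist = "VI")      # variation of information
part_binder <- partition(fit, dist = "Binder")  # Binder's loss
```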
For univariate PY mixture models, the BNPdens2coda method exports a matrix whose first row reports the number of blocks of the partitions visited by the chain, while each of the remaining rows is composed of the values taken by the density at each point of the grid (grid) at the different iterations of the MCMC algorithm. For multivariate and regression PY mixture models, the BNPdens2coda method exports only a vector with the number of blocks of the partitions visited by the chain. For DDP mixture models, the BNPdens2coda method exports a matrix whose first row reports the number of clusters of the partitions visited by the chain, while the other rows report the values taken by the weights of the group-specific processes at each iteration. Finally, the PYcalibrate function allows one to elicit the strength parameter of the Pitman-Yor process by specifying the sample size and by fixing the discount parameter and the expected number of clusters a priori.
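As an example of prior elicitation, PYcalibrate can be used as in the following sketch, where the sample size, discount, and expected number of clusters are illustrative:

```r
# Elicit the strength parameter for a sample of size 100, fixing the
# discount parameter to 0.1 and the prior expected number of clusters to 3.
par <- PYcalibrate(Ek = 3, n = 100, discount = 0.1)
prior <- list(strength = par$strength, discount = 0.1)
```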

Package scalability
Considering the variety of models implemented in the BNPmix package and the possibility of fitting these models to data sets with different levels of complexity in terms of size (n), dimension (p), presence and number of covariates (d), and number of groups (L), we briefly comment on the scalability of the package with respect to these quantities. Based on our experience, when considering the default specifications and without taking into account the time needed to evaluate estimated densities on a given grid of values (i.e., setting out_type = "CLUST" in the output list), the average time per iteration of the functions PYdensity, PYregression, and DDPdensity grows linearly as the sample size n becomes large. The runtime is also affected by the complexity of the data. Specifically, keeping n fixed, a faster-than-linear growth is observed both when the dimension p in PYdensity and when the number of covariates d in PYregression increase. Finally, the function DDPdensity shows a roughly linear growth of its average time per iteration as the number of groups L becomes large. The described behaviour of PYdensity, PYregression, and DDPdensity is not surprising, as it is well known that MCMC methods are severely affected by the dimension of the space that has to be explored by the chain, that is, for the functions implemented in BNPmix, the support of the posterior distribution, which is ultimately connected to the quantities discussed above.
Evaluating the generated densities on a given grid of points at each iteration of an MCMC algorithm can severely affect computational efficiency, an issue which is particularly relevant for large dimension p in PYdensity or large number of covariates d in PYregression. Note, however, that BNPmix conveniently allows users to specify the desired type of output (see Section 4.2), thus avoiding unnecessary computations.

Usage of the package
We next illustrate the use of the BNPmix package by walking the reader through the entire process of density estimation, clustering, and regression via BNP mixture models. We start by showing how to elicit prior distributions, how to run the sampler, and how to process the output of the functions for inferential purposes. In order to do this, we analyze different subsets of the Collaborative Perinatal Project (CPP) data set (Klebanoff 2009), as well as synthetic data. The CPP data set is a rich collection of observations referring to a large prospective study conducted in the United States in the '60s. The goal of the study was to assess the cause of neurological disorders and other pathologies in children. The full data set counts more than 2 300 observations, each consisting of several measurements referring to pregnant women and their babies. The focus of our illustrative analysis is on three quantities: the gestational age (in weeks), the weight of the baby at birth (in g), and the concentration level in maternal serum of DDE (in µg/L), a persistent metabolite of the pesticide DDT, known to have an adverse impact on health (Longnecker, Klebanoff, Zhou, and Brock 2001). This subset of the original data set is available in the BNPmix package via

R> library("BNPmix")
R> data("CPP")

Univariate density estimation
We illustrate here how to perform univariate density estimation and clustering, by focusing on the gestational age for a group of women belonging to one of the 12 university hospitals participating in the study.

R> y <- CPP[CPP$hosp == 11, ]
We consider a DP location-scale mixture model, thus setting the discount parameter α = 0. Previous studies show that the distribution of the gestational age is left skewed, with an irregular shape due to the presence of three subpopulations corresponding to early preterm, preterm, and normal births. For this reason, we fix the prior expected number of clusters to three and choose the value of the strength parameter ϑ accordingly. This can be done by using the PYcalibrate function.
R> DPprior <- PYcalibrate(Ek = 3, n = nrow(y), discount = 0)

We endow the parameters of the base measure with hyperprior distributions with default parameters. The complete prior specification for our BNP mixture model is thus

R> prior <- list(strength = DPprior$strength, discount = 0)

Before running the MCMC algorithm (opting here and henceforth for the default ICS simulation method), we specify the number of MCMC iterations and the grid of points where the density is to be evaluated. This can be done with

R> mcmc <- list(niter = 5000, nburn = 4000)
R> output <- list(grid = seq(30, 46, length.out = 100), out_type = "FULL")

The MCMC algorithm can then be run by calling the PYdensity function and providing all the previously defined lists.
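Putting the pieces together, the fit can be obtained as in the following sketch, consistent with the lists defined above; the object name fit1 is illustrative, and the variable of interest is the gest column of the selected subset y:

```r
# Fit the DP location-scale mixture with the lists defined above;
# y is the subset of CPP corresponding to the 11th hospital.
fit1 <- PYdensity(y = y$gest, mcmc = mcmc, prior = prior, output = output)
plot(fit1, show_hist = TRUE)  # posterior mean density with raw-data histogram
```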

Multivariate density estimation
Next we illustrate how to perform density estimation and model-based clustering for multivariate data. The first illustration focuses on the three continuous variables composing the CPP data set. The clustering structure detected by the model fitted to the object fit.sim can be explored by using the function partition. As point estimate, we consider the partition which, among those visited during the MCMC, minimises the posterior expected loss, where we work under the variation of information framework introduced and studied by Wade and Ghahramani (2018). The dendrogram obtained with complete linkage through the hclust function (displayed in Figure 11) clearly shows the presence of two main clusters. The remaining small clusters, while distinct in our analysis, can be interpreted as one group of outlying observations, in agreement with the simulation scenario we devised.
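The steps above can be sketched as follows; here fit.sim is the BNPdens object mentioned in the text, and psm denotes a hypothetical posterior similarity matrix (pairwise posterior co-clustering probabilities) assumed to be computed from the partitions visited by the chain:

```r
# VI point estimate of the clustering, as in Wade and Ghahramani (2018).
part <- partition(fit.sim, dist = "VI")
# Dendrogram with complete linkage on the posterior dissimilarity,
# as displayed in Figure 11 (psm is assumed to be available).
hc <- hclust(as.dist(1 - psm), method = "complete")
plot(hc)
```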

Density regression
In order to illustrate the usage of the PYregression function, we study how the distribution of gestational age changes with DDE. Following the same argument as in Section 5.1, we fix the prior expected number of clusters to 3 and, given a moderate discount parameter α = 0.25, we elicit the strength parameter via PYcalibrate. We opt for the default specification of the other hyperparameters. As summary of the posterior distribution, we compute the estimated conditional density of gest, conditionally on six values of dde, equal to the minimum observed value and to the empirical quantiles of order 0.25, 0.5, 0.75, 0.95, and 0.99.
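A hedged sketch of the prior elicitation and of the conditioning values just described (variable names follow the CPP data set; the object names are illustrative):

```r
# Elicit the strength parameter for discount 0.25 and 3 expected clusters.
pyp <- PYcalibrate(Ek = 3, n = nrow(CPP), discount = 0.25)
prior <- list(strength = pyp$strength, discount = 0.25)
# Conditioning values for dde: observed minimum and empirical quantiles.
grid_x <- c(min(CPP$dde), quantile(CPP$dde, c(0.25, 0.5, 0.75, 0.95, 0.99)))
```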

Density estimation for correlated samples
We conclude this section by illustrating the usage of the DDPdensity function. The CPP data set consists of observations coming from 12 hospitals: while assuming homogeneity within each hospital seems reasonable, we opt for a model which might account for heterogeneity across hospitals, and thus consider the GM-DDP mixture model with Gaussian kernels described in Section 2. We use an empirical approach and center the base measure parameters on sample summaries, as done by default.
R> mcmc <- list(niter = 5000, nburn = 4000)
R> grid <- seq(30, 46, length.out = 100)
R> output <- list(grid = grid, out_type = "FULL")

The lists defined above are supplied as arguments to the function DDPdensity, which can be run to estimate the posterior mean densities for the 12 hospitals. The latter can be visualized by running the method plot on the output of DDPdensity. Figure 13 displays the estimated densities along with pointwise 95% posterior credible bands. The model successfully accounts for heterogeneity across hospitals: this can be noticed, for example, by observing that the estimated densities for hospitals 3 and 5 are more skewed than those of most of the other hospitals in the study. At the same time, the model allows for borrowing information across hospitals displaying similar distributions: the effect of this is apparent when the posterior density of gest for hospital 11 is compared with the one for the same hospital displayed in Figure 9, which was obtained by ignoring information from the other hospitals.
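A sketch of the resulting call and of the plot producing Figure 13; the object name fit.ddp and the wrap_dim values are illustrative:

```r
# Fit the GM-DDP mixture: the hospital label identifies the subsamples.
fit.ddp <- DDPdensity(y = CPP$gest, groups = CPP$hosp,
                      mcmc = mcmc, output = output)
plot(fit.ddp, wrap_dim = c(3, 4))  # 12 group-specific densities, 3 x 4 grid
```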

Technical details
The results presented in this section were obtained using an Ubuntu 20.04 machine. All the routines were executed on R 4.0.0 with the BNPmix 0.2.7 package, which depends on the Rcpp 1.0.4.6 and RcppArmadillo 0.9.900.1.0 packages. R itself and all the packages used are available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/.

A.2. Base measures and hyperdistributions
Table 5 summarizes the base measures and the hyperdistributions on the base measure parameters for the specifications of univariate and multivariate PY mixture models with Gaussian kernels described in Section 2. The focus of the table is on the models that can be fitted by using PYdensity. The top part of the table refers to the models without hyperpriors on the parameters of the base measure (hyper = FALSE); the bottom part (hyper = TRUE) instead describes the specification of the hyperpriors. For the sake of clarity and in order to avoid any ambiguity, Table 6 reports the parametrization of the relevant probability distributions adopted throughout the paper.

Figure 1 :
Figure 1: Hierarchical representation of the univariate location PY mixture model with Gaussian kernel.

Figure 2 :
Figure 2: Hierarchical representation of the univariate PY process mixture model with location-scale base measure.

Figure 3 :
Figure 3: Hierarchical representation of the multivariate PY process mixture model with location base measure.

Figure 4 :
Figure 4: Hierarchical representation of the multivariate PY process mixture model with location-scale base measure and full covariance matrix.

Figure 5 :
Figure 5: Hierarchical representation of the multivariate PY process mixture model with location-scale base measure and diagonal covariance matrix.

Figure 6 :
Figure 6: Hierarchical representation of the univariate regression PY mixture model.

Figure 7 :
Figure 7: Hierarchical representation of the univariate regression-scale PY mixture model.

Figure 8 :
Figure 8: Hierarchical representation of the univariate location-scale DDP mixture model with Gaussian kernel.

Figure 9 :
Figure 9: Posterior mean density of gest in the 11th hospital and (left) 95% pointwise posterior credible bands along with the histogram of the raw data and (right) observations coloured according to the estimated clustering.

Figure 10 :
Figure 10: Posterior mean densities, univariate marginal and pairwise bivariate marginal, obtained by calling plot on the fitted model fit2.In addition to the posterior mean density value, each pairwise bivariate density shows the data points coloured according to the estimated clustering.

Figure 12 :
Figure 12: Posterior mean conditional univariate densities (and 95% pointwise credible bands) of gest, conditionally on the values of dde reported in the plots' headers.

Figure 13 :
Figure 13: Posterior mean densities and 95% pointwise posterior credible bands of gestational age for the 12 hospitals.

Table 1 :
Parameters for the prior list of the PYdensity function.

Table 5 :
Base measures and optional hyperprior distributions on the parameters of the base measures, for the location (L), location-scale (LS) and, only for the multivariate case, location-scale with diagonal covariance matrix (DLS) PY mixture models. As for the multivariate location-scale PY mixture model with diagonal covariance matrix, assume that r = 1, . . ., p.