hdpGLM : An R Package to Estimate Heterogeneous Effects in Generalized Linear Models Using Hierarchical Dirichlet Process

The existence of latent clusters with different responses to a treatment is a major concern in scientific research, as latent effect heterogeneity often emerges due to latent or unobserved features – e.g., genetic characteristics, personality traits, or hidden motivations – of the subjects. Conventional random-and fixed-effects methods cannot be applied to that heterogeneity if the group markers associated with that heterogeneity are latent or unobserved. Alternative methods that combine regression models and clustering procedures using Dirichlet process are available, but these methods are complex to implement, especially for non-linear regression models with discrete or binary outcomes. This article discusses the R package hdpGLM as a means of implementing a novel hierarchical Dirichlet process approach to estimate mixtures of generalized linear models outlined in Ferrari (2020). The methods implemented make it easy for researchers to investigate heterogeneity in the effect of treatment or background variables and identify clusters of subjects with differential effects. This package provides several features for out-of-the-box estimation and to generate numerical summaries and visualizations of the results. A comparison with other similar R packages is provided.


Introduction
Latent heterogeneity in the effect of explanatory variables is a major concern in science.Uncovering that latent heterogeneity is crucial when the result of the analysis will guide practices and interventions.For instance, increasing the dosage of a new medication may improve symptoms of a disease, but it is a major concern of public health authorities to ensure that a dosage increase does not have the opposite effect (worsening symptoms) for some latent sub-populations whose members are unknown in advance due to, for example, unobserved genetic traits.The COVID pandemic provides a recent example on a global scale of such concerns about adversarial effect of vaccines for latent sub-populations.Here is another example from labor economics: The intensity of a job market training program can be beneficial on average, but can increase marginalization of some groups because of unobserved personality traits.In these examples, the identification of the latent groups with adversarial effects is crucial to prevent undesirable consequences, even though the factor causing heterogeneous effects remains latent and unknown by the researcher.
The presence of latent or unobserved features that create latent effect heterogeneity cannot be avoided in practice.When there is unobserved or unknown features of sub-populations causing latent effect heterogeneity and the membership in those sub-populations is unknown, conventional estimation methods, such as fixed-and random-effects models, are no longer useful to account for group-specific effects because group identifiers are not observed.
Recent developments, combining machine-learning clustering procedures with traditional regression estimation, provide an attractive alternative.An important class of such methods include the finite mixture of regression models (Ng, McLachlan, Wang, Ben-Tovim Jones, and Ng 2006;Villarroel, Marshall, and Barón 2009), and Dirichlet process regression (DPR) (Hannah, Blei, and Powell 2011;Heinzl and Tutz 2013).However, these methods are complex, require expertise in programming languages to implement, and are often difficult to estimate due to the non-convexity of the objective function and the computational cost involved.Moreover, while there are resourceful, reliable, and user-friendly packages available for R (R Core Team 2023) users to estimate conventional regression and fixed-and random-effects models, the same is not true for DPR models, which makes the latter approaches inaccessible for many practitioners, including those who possess intermediate-level programming skills.This article discusses the main methodological, implementation, and numerical features of the software package hdpGLM, available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=hdpGLM.In a nutshell, the software provides an interface for practitioners to investigate latent effect heterogeneity in their regression analysis.The hdpGLM package can be used to identify latent heterogeneity across sub-populations in the association between the explanatory variables and the outcome.The package provides an easy-to-use implementation of a hierarchical Dirichlet process prior (DPP) regression, as proposed in Ferrari (2020), for out-of-the-box estimation of linear and generalized linear models (GLMs), while simultaneously searching for latent clusters in the data that differ because of the value that the linear coefficients have for one or more of the observed covariates.When the estimation doesn not find a cluster, it returns linear coefficients whose posterior average are indistinguishable from the results of conventional regression methods in Re.g., those obtained from the function lm.
The estimation using the hdpGLM package is intuitive and easy to conduct.It adopts standard R symbolic formula notation, as used in the built-in stats package and well-known R functions for regression estimation, such as lm and glm, providing a seamless transition for practitioners from classical models to a more complex regression estimation that accounts for latent effect heterogeneity.The package provides the function hdpGLM to estimate hierarchical and non-hierarchical DPR.This function implements the hierarchical Dirichlet process generalized linear model (hdpGLM) proposed in Ferrari (2020) to estimate a Bayesian Dirichlet process mixture of regression mod-els, and to estimate latent heterogeneity in the effect of the linear coefficients.The function takes R standard regression formulas as an argument, e.g., y ∼ var_1 + var_2 + • • • + var_d and conducts a data-driven search for clusters.It returns K ∈ N sets of d + 1 linear coefficients (intercept included) that best fit, in a Bayesian sense, the correlation between the covariates (var_1, . . ., var_d ) and the outcome (y).The estimation uses MCMC blocked Gibbs sampler for Gaussian outcomes and Hamiltonian Monte Carlo (HMC) within Gibbs for non-Gaussian outputs -e.g., binary dependent variables.
In addition, the default methods coef, print, plot, summary, family, nobs, and predict are supported for objects returned by hdpGLM, making it easy for R users at all levels to manipulate the estimation results.The package also provides a number of post-processing functions for generating tidy summaries, visualizations, and classify data points into clusters based on how the treatment or observed covariates affect the response variable in different latent subgroups found during the estimation.In particular, the function plot builds on the ggplot2 package (Wickham 2016), and users can easily customize its aesthetic elements to fit their needs.The estimation algorithms are fully implemented in C++ (Stroustrup 2013) for efficient computation using the Rcpp API (Eddelbuettel and François 2011;Eddelbuettel and Balamuta 2018).The C++ code is seamlessly integrated in the package and the user can access everything, from estimation to visualization, within R using standard R syntax.
A search on CRAN reveals that the package PReMiuM (Liverani, Hastie, Azizi, Papathomas, and Richardson 2015) is the only other software available that implements DPP regression models in R, and whose goal is similar to the hdpGLM software package.Other packages that implement Dirichlet process models on CRAN are not directly comparable due to differences in the scope and goal of the application.Even the PReMiuM and hdpGLM packages differ in various ways, and are designed to accomplish different goals, making them complementary rather than alternative options to one another.The package PReMiuM provides a rich set of features, including clustering the data, variable selection methods, procedures to handle missing values in the explanatory variables, and built-in Markov Chain Monte Carlo (MCMC) diagnostic checks.However, the Dirichlet process is used to model only the intercept term of the regression models in that package.This means that the hdpGLM is unique in two ways.First, it is the only package that estimates and returns the linear coefficients of each cluster found in the analysis, and investigates heterogeneous effects for all covariates across clusters.In that sense, the goal is very different than in other packages, such as PReMiuM and DPpackage (Jara, Hanson, Quintana, Müller, and Rosner 2011), that use Dirichlet process models.Second, the hdpGLM is the only package that provides the user an option to estimate a hierarchical DPR.The hierarchical DPP regression in the hdpGLM package investigates if the latent effect heterogeneity across clusters is explained by upper-level covariates, as originally proposed in Ferrari (2020).Table 1 gives a summary of the main similarities and differences between the two main packages on CRAN, namely hdpGLM and PReMiuM, for DPR estimation in R. Section 6 compares the performance of these two packages.
In the remainder of this article, I focus on the R implementation of the software package hdpGLM.For completeness, the next section provides a brief self-contained introduction of the main ideas behind the underlying hdpGLM implemented in R with the software hdpGLM.Then, details of the MCMC samplers and some core implementation and design features are presented.Finally, a general analysis workflow with a real application is illustrated.

PReMiuM hdpGLM
Main function profRegr hdpGLM Similarities Types of outcome supported (a)  Discrete or continuous Discrete or continuous MCMC algorithm (b)  Gibbs Gibbs DPR clustering Built-in External

Background
In classical linear regression models, we estimate a single vector of linear coefficients β = (β 0 , . . ., β d ), that is, for β ∈ R Dx+1 the model is: The coefficient β l , for l ∈ {1, . . ., d} represents the marginal association between the outcome y and x l .Under some circumstances, it can be interpreted as the causal effect of x l on y (Angrist and Pischke 2008;Pearl 2009;Imbens and Rubin 2015).One of the assumptions of those models is that every observation comes from the same underlying model.That is, a feature x l affects the outcome in the same way for every observation.It means that, if E [ϵ | X] = 0, for every observation whose value of x l changes, for example, from -10 to 10, ceteris paribus, the average effect on the outcome will be, with probability 1 (certainty), 20β l .The same applies for GLMs, except that β l affects the log odds ratio or another transformation of the average parameter.The core point here is that, by modeling assumptions in linear regressions, the average effect (e.g., β l ) of the covariates (e.g., x l ) is constant; that is, there is no heterogeneity in that effect.
Dirichlet and Dirichlet process regressions, on the other hand, do not assume such a homogeneity on the average effect across subjects (Hannah et al. 2011).Instead, it assumes that some or all features X can have different effects, on average, on the outcome for different groups of subjects.When that is the case, if there are K groups, each one with different average effects, we can describe the model for each group k ∈ {1, . . ., K} as: The same formulation works for a Dirichlet mixture of GLMs, with a little modification in the last two lines of Model 1 above.The Dirichlet process prior (DPP) regression generalizes this model for an unknown, possible infinite number of groups K, so the total number of groups K don't need to be set in advance.In DPP, we substitute the Dirichlet distribution as prior for π in the Model 1 and use a Dirichlet process prior instead.A constructive way to describe the model with a DPP was presented in Sethuraman (1994).It is called the stick-breaking construction (SBC) and for more details, see Teh, Jordan, Beal, and Blei (2006) and Ferrari (2020).If we denote ∆ ∞ the infinity simplex and g −1 the inverse link function (e.g., identity function for Gaussian outcomes, exponential function for Poisson outcomes, inverse-logit for Bernoulli outcomes, and so on, following standard notation for generalized linear models; see McCullagh and Nelder (1989)), then the Model 1 can be written as follows in terms of the SBC and for GLMs: Researchers have applied Model 2 or some variation of it in many different problems, including to relax distributional assumptions on random effects models (Verbeke and Lesaffre 1997;Heinzl and Tutz 2013); estimate population latent heterogeneity when the group membership and number of heterogeneous groups are unknown (Mukhopadhyay and Gelfand 1997;Kleinman and Ibrahim 1998;Hannah et al. 2011;Heinzl and Tutz 2013); model latent instrumental variables to deal with endogeneity of covariates (Ebbes, Wedel, Böckenholt, and Steerneman 2005;Ebbes, Wedel, and Böckenholt 2009); study effect heterogeneity in job market training programs (Aakvik, Heckman, and Vytlacil 2005;Heckman and Vytlacil 2007;Matzkin 2007;Ichimura and Todd 2007;Chen 2007); investigate consumer demand in discrete choice models (Rossi, Allenby, and McCulloch 2012;Rossi 2014); study the length of time political appointees stay in their appointed position (Gill and Casella 2009); discern political priorities of senators (Grimmer 2009); track intraparty voting (Spirling and Quinn 2010); study immigrant turnout in elections (Traunmuller, Murr, and Gill 2015); and investigate dynamic aspects of public support for welfare policies (Stegmueller 2013).Ferrari (2020) proposes a model called hdpGLM which generalizes Model 2 for hierarchical data and estimates context-dependent DPP regressions.The generalization makes it possible to use the model on data sets with and without hierarchical structures.The reader can find detailed information about hierarchical Dirichlet process more generally in Teh et al. (2006).Hierarchical data is very common in science.Examples include students nested within schools, voters nested within cities or states, patients nested within hospitals, people nested within countries, legislators nested within state legislatures, and so on.
Essentially, the main idea of the hdpGLM model is the following: Let's call the upper-level information (e.g., school, cities, states, hospitals, or countries) the context of the lower-level data point (e.g., students, voters, patients, people).The hdpGLM models heterogeneity in the effect of lower-level features (e.g., family income) of subjects (e.g., patients) across latent clusters in each context (e.g., hospitals or cities), and estimates how the within-context latent heterogeneity depends on features of the context (e.g., hospitals' investments in posttreatment care).In that sense, it differs substantially from the goals of conventional hierarchical models -random and fixed effects-or their Dirichlet extension (see Kyung, Gill, andCasella 2009, 2010).
To model the dependence of the clusters on context-level features, the hdpGLM adds a parameter τ ∈ R (Dw+1)×(Dx+1) to the model (2), where D w is the number of context-level covariates (W ), and D x is the dimension of within-context, lower-level covariates (X).The parameter τ captures the dependence of the linear coefficients β on features W of the context.The hdpGLM leaves that dependence undefined to express its generality.The hdpGLM package lets the user express that dependence as an upper-level regression on W , as explained in Section 4. The result of the estimation is a set of vectors of linear coefficients, one for each latent group within each context, and a set of context-level coefficients indicating how context features affect within-context effects and latent heterogeneity.
If we use J to denote the number of contexts (e.g., schools, hospitals, states) in the data, W the context-level features (e.g., school budget, number of teachers per student, etc.), X the lower level features (e.g., student performance, gender, race, etc.), and C i = j, or C ij for short, a variable indicating that the observation i comes from context j, then the hdpGLM model can be formally described as follows: Basically, the Model 3 generalizes DPP regression models.More details can be found in Ferrari (2020).Note that in this model, the coefficients β are three-dimensional real-valued matrices, that is, Dw+1) .Henceforth, the indices lkj in β lkj indicate the coefficient of x l for group k in context j, and β kj represents the entire D x + 1 vector of coefficients of X for cluster k and context j.

MCMC sampler
The package hdpGLM implements Models 2 and 3 for R users using the MCMC algorithm proposed in Ferrari (2020).There are a few options to estimate DPP regression models using MCMC (Ishwaran and Zarepour 2000;Neal 2000;Blei and Jordan 2006;Walker 2007).Ferrari (2020) follows Ishwaran and James (2001) and proposes a blocked Gibbs sampler for the hdpGLM.One of the advantages of the MCMC implementation in the hdpGLM package is that the number of clusters grow "as needed" during the estimation.The MCMC starts with all points classified into a single cluster.Then, it follows the rules in Algorithm 1 and more clusters are generated based on the posterior density.The algorithm uses a normal distribution for f τ and f β , and makes β a linear function of context-level features for each latent cluster, that is When the outcome variable is continuous, these specifications lead to a conjugate prior, and Gibbs update is available for all parameters of the model (see proof in Ferrari 2020).In that case, we substitute the last line of Model 3 with a linear model with Gaussian outcomes, and model the prior parameters as follows: It leads to a Gibbs sampler described in Algorithm 1 (for proof, see Ferrari 2020, Appendix A).Following Ferrari (2020), in the algorithm, Z * represents the cluster labels at the current Algorithm 1 Gibbs Sampler for the hdpGLM with Gaussian mixtures Require: iteration of the MCMC, and Z * C the values between 1 and K that are not in Z * .The number K represents the maximum number of clusters selected by the user (see discussion below in Section 4.2).The subscript j indicates the context.The variable N k is the total number of data points in cluster k, and X jk (or y jk ) the covariates (outcome variable) of the units i in context j classified to k in the current iteration.
When the outcome is not continuous, Gibbs is not available for coefficients β.In that case, Ferrari (2020) proposes a Riemann Manifold Hamiltonian Monte Carlo (RMHMC) update (Girolami and Calderhead 2011) within Gibbs to sample from the posterior distribution of those parameters.HMC is more efficient than traditional Metropolis-Hastings updates to explore the posterior distribution, because it takes advantage of the gradient of the posterior distribution (Neal 2011).The HMC uses an ancillary variable v (called momentum) with the same dimension of β (called position variable).For β kj denoting the linear parameter for cluster k in context j, the Hamiltonian equation is defined as: The HMC (Neal 2011) defines is the posterior distribution of β kj , and f v is the distribution of v.Both U () and K() are convex.Based on the definition of U () and K(), the gradient is:

Algorithm 2 Riemann Manifold Hamiltonian Monte Carlo
Require: Z (t) , β (t) , π (t) , τ (t)  1: for j = 1, . . ., J and for k ∈ Z * j do 2: for l = L − 1 do 6: end for 9: end if 17: end for The leapfrog integration solves this system of simultaneous equations (Girolami and Calderhead 2011;Calin and Chang 2006) using L leapfrog steps of size e.Each step l = 1, . . ., L is defined by: This solution leads to the RMHMC update for the parameter β when the outcome variable is not continuous.Algorithm 2 describes the RMHMC update (see also Neal 2011).

Design philosophy, MCMC implementation, and general syntax
The design choices behind the hdpGLM software package, and the hdpGLM function in particular, seek to: (1) maximize usability for R users at all levels of experience; (2) maximize the integration of the hdpGLM package into commonly-used R package ecosystems; (3) provide out-of-the-box DPP and hierarchical DPP models to practitioners to investigate latent heterogeneity in all explanatory variables of regression models, and; (4) offer a computationallyefficient estimation of DPP and hierarchical DPP regression models.
On the latter goal, the MCMC algorithms that estimate the models are fully implemented in C++ for efficient computation, and all of the package functionalities are available through the R language.The C++ code is integrated in R via the C++ API available in the Rcpp package (Eddelbuettel and François 2011).The MCMC in the hdpGLM package exactly matches Algorithms 1 and 2 presented in Section 3.For the RMHMC algorithm, the implementation followed the convention and adopted a normal distribution for the momentum variable v ∼ N (0, I (Dx+1)×( Dx+1) ) (Neal 2011), which simplifies a term of the leapfrog integrator to: The package is fully implemented for continuous and binary outcome variables, and future versions will include discrete and ordinal outcomes as well.For binary outcome y i ∼ Ber(p kj ), the package uses a logit transformation for the average parameter: For that case, the elements of the RMHMC becomes, for each active cluster k: The DPP regression is a special case of hdpGLM when there is just one context (J=1) and there is no context-level variable W .The algorithm used for DPP regression is the same as algorithm 1, except that it sets E [β] = 0 for the prior distribution of β and, obviously, does not sample τ because that parameter is not used.Section 6 contrasts the performance of the MCMC algorithm implemented in the hdpGLM against the existing alternative in R.
The other goals -integration, usability, and out-of-the-box DPP regression analysis -are achieved in various ways in the package design.Section 5.1 demonstrates the usability and integration features in detail, with a workflow example for practical application.Second, R users don't need to learn new, intricate syntax or a series of different functions to estimate the model and create summaries with the results.The design mimics the syntax of standard, widely-used built-in regression estimation methods from the stats package, such as the lm and glm functions.The hdpGLM function receives the same basic arguments as those other functions, including R default symbolic formulas for regression.Moreover, the underlying method to create design matrices from input data.frames to run the regression, and the default methods to print, summarize, and visualize the results, follow standard R design.For instance, it becomes immediately obvious for R users who have used the builtin lm function the meaning of the summary output from hdpGLM, provided the differences between the underlying models are understood -i.e., clustering and Bayesian methods for hdpGLM regression.
Third, when non-hierarchical DPP regression models are estimated using the hdpGLM package, only one R-standard symbolic formula is required and the argument is mandatory (more details are provided in the next section).When hierarchical DPP regression are estimated, the main function hdpGLM requires an additional formula specifying the upper-level covariates.This design differs from other available options in R to estimate conventional hierarchical models -i.e., fixed and random effects -such as those provided in the lme4 package (Bates, Mächler, Bolker, and Walker 2015).
The option for two separate formulas when hierarchical DPP regression is required is due to several reasons.First, the first formula is mandatory and shares some goals with other widely-used R functions, such as the lm and glm.I opted to follow standard R syntax to facilitate usage and remind users that the model shares similar goals of those other regression estimation functions.In all cases (lm, glm, and hdpGLM), users depart from observed covariates that they want to include in the regression, based on domain knowledge.Users then run the model and get the estimated linear coefficients.So, the first formula reflects that common goal.The main difference is that the hdpGLM function relaxes the modeling assumption of no latent heterogeneity in the association between the covariates and the outcome, as assumed by the lm and glm functions.
The second formula is optional and used only if the user wants to investigate the dependency of the latent heterogeneity on higher-level covariates.Using two separate formula arguments makes the syntax and the documentation cleaner and more straightforward, and the manipulation of those formulas by internal ancillary functions easier to handle.So, it provides an end-to-end benefit, from maintenance and development to the usage of the package.Finally, different from the first formula, the second formula has a different goal that is not matched by any other R package currently on CRAN.When the second formula is used, the hierarchical model estimates the effect of higher-order observed covariates on the latent heterogeneity of the linear effects of lower-level covariates.Different from the first formula, the goal of the second formula is very different from other hierarchical models implemented in R, e.g., for fixed and random effects estimation, such as those provided by the lme4 package.Hence, the package separates formulas instead of adopting lme4-like syntax to remind the user of the important distinction between the estimation goals of these packages.Both the arguments formula1 and formula2 receive standard regression formulas, but only the former is mandatory.For instance, if we set formula2 = NULL (default) and formula1 = y ~x1 + x2, the function hdpGLM will run a DPP regression (Model 2), which will search for latent subgroups that differ on how the outcome y responds to the covariates x 1 and x 2 .When formula2 is also provided, the function estimates a hierarchical DPP (Model 3) instead.

Estimation
Following the same example, suppose the observations (e.g., persons) come from J contexts (e.g., states), and we measured a variable w with context-level information (e.g., state GDP).
If we provided formula2 = y ~w alongside formula1 as above, we can estimate the effect τ of w on the latent clusters across contexts (states).
The argument data receives an R data.frame or a more efficient data structure compatible with it, such as tibble (Müller and Wickham 2023) or data.table(Dowle and Srinivasan 2023).The data must contain the column names specified in the formulas.It is advisable to normalize all the variables in the data before the estimation for algorithm numerical stability.
The user can provide a string to the argument context.idwith the name of the column in the data frame containing the context labels.This not required, but can facilitate postestimation reports because the package provides general as well as context-specific summaries and plots of the clusters and linear coefficients.The argument can be a meaningful label of the contexts.For example, if the context is states, we could have a variable named state in the data with the name of the states.Otherwise, arbitrary labels are assigned to the contexts to identify them.
The argument mcmc receives a named list with the parameters for the MCMC.The list must contain two mandatory named elements, burn.in and n.iter.Additionally, it can receive three optional elements, epsilon, leapFrog, and hmc_iter.The element named burn.inmust be an integer, and it indicates the number of iterations to discard before recording the samples (MCMC burn-in period).The element n.iter is the number of iterations to keep after the burn-in.The other elements of the list are only needed if the user wants to tune in the HMC updates, which is used for non-Gaussian outcomes (see Section 3 and Algorithm 2).The default values follow recommendations from the literature (Neal 2011) and perform well for in a variety of cases (Ferrari 2020) The initial state of the Gibbs sampler comes from the prior distribution of the parameters, and the optional argument constants of the function hdpGLM receives a named list with the constants (prior distributions parameters) of the model specifying those priors.The user can take advantage of the argument constants to conduct sensitivity analyses.The list received by that argument can contain a vector named mu_theta, which is the average parameter of the multivariate normal prior for β, and whose size must match the number of covariates specified in the argument formula1 plus one for the intercept.The first element of the vector will be the prior average for the intercept, the second the prior for the coefficient of the first covariate that appears in formula1, and so on (see the complete example below).The list can also contain a square matrix named Sigma_beta, which is the covariance matrix of the multivariate normal prior of β, and the dimensions must match the size of mu_theta.Users can also set the prior parameter α 0 , which is the parameter of the Beta distribution for the SBC (Model 4), by including in the list an element named alpha with a positive number.Other elements of the list depend on the family of the GLMs used to capture the outcome variable, which is specified in the argument family.If the outcome is continuous and modeled with a Gaussian distribution (family = "gaussian"), then the list can contain an element named s2_sigma, which is the parameter of the scaled inverse-χ 2 distribution of σ 2 k as described in Equation 4. The user can also set df_sigma, which is the parameter v (degrees of freedom of the scaled inverse-χ 2 distribution) in Model 4. Both s2_sigma and df_sigma must be positive numbers.If constants = NULL (default), the package sets mu_beta = (0, ..., 0) and Sigma_beta = 10I, where I is an identity matrix.The dimensions are automatically adjusted to match the model formulas.The prior for the Beta distribution of the SBC is set to alpha = 1.For Gaussian outcome, it sets the priors for σ k as s2_sigma = 10 and df_sigma = 10.These values produce good estimation performance under various case scenarios (Ferrari 2020).
Here is an example of how to use the argument constants to select the prior parameters in a regression with two covariates (see Models 3 and 4 for reference): list(mu_beta = c(0, 0, 0), Sigma_beta = 10*diag(3), df_sigma = 10, s2_sigma = 10, alpha = 1) The user can select the argument K to set the upper bound of the truncated Gibbs sampler.As discussed in Section 3, one of the advantages of the MCMC implementation in the hdpGLM package for efficient computation is that the number of clusters grow "as needed" during the estimation.The MCMC starts with all points classified into a single cluster.Then, it follows the rules in Algorithm 1 and more clusters are generated based on the posterior density.However, although the actual active clusters (i.e., clusters with data points) in each iteration can be small, the prior π on the cluster indicator Z must be large enough to let the active clusters grow as needed.The argument K sets the upper bound for the number of clusters, which is required by the algorithm to set the size of π (see Ishwaran and James (2001) and Ferrari (2020)).For instance, if K = 100 (default), the number of active clusters can grow during the estimation up to 100.If the estimation achieves that upper bound, no additional cluster is created to allocate the data points.Too large of an upper bound K will require more memory.If it is too small, it can artificially truncate the number of clusters and classify the data in less clusters than it should.Users are advised to monitor the number of active clusters -that is, the number of clusters actually occupied with data points -either while the MCMC is running or after the estimation is completed.If the upper bound is achieved at any point, the user can increase it and rerun the analysis (this process will be automated in future versions).
The package provides two ways to monitor the number of occupied clusters to check if K needs to increase.After the estimation is completed, the maximum number of active clusters used at any point during the MCMC is saved in the output of the function hdpGLM (more details about the complete output are provided below).The user can check if that maximum is equal or close to K, in which case the user can increase K.For instance: R> library("hdpGLM") R> set.seed(777)R> sim <-hdpGLM_simulateData(n = 200, nCov = 2, K = 2, nCovj = 1, J = 5, + family = "gaussian") R> df <-sim$data R> mod <-hdpGLM(y ~X1 + X2, y ~W1, data = df, K = 50, + mcmc = list(burn.in= 0, n.iter = 100), n.display = 30) R> mod$max_active [1] 8 In the example above, the maximum number of clusters activated at any point during the MCMC was 8, which is much lower than the value of 50 used for K.It indicates that the algorithm needed much less than 50 clusters to sample from the regions of high density, so the truncation of π to 50 is not affecting any result.
The user doesn not need to wait until the estimation finishes to check the maximum number of active clusters.A partial report will be display in R during the MCMC iterations every time it reaches as many iterations as indicated in the argument n.display of the function hdpGLM.For instance, if n.display = 30, the function hdpGLM will display a report at every other 30 iterations.For example, the report after 60 iterations for the previous model will look like this: The report contains several important pieces of information.It shows the current iteration when the report was displayed (Iteration: 60); the acceptance rate (current and on average) for the parameter β, which is equal to 1 for Gaussian models because they use a Gibbs sampler, but it is usually smaller than 1 when HMC within Gibbs is used to sample β, which is the case for regression with non-Gaussian outcomes.It also shows the maximum number of clusters allowed, which is set by the user using the argument K of the function hdpGLM; the maximum number of activated clusters up to the iteration the report was generated (8 in the example); the number of active clusters in the current iteration (8 in the example); and the percentage of data points in each active cluster in the current iteration (clusters with less than 5% of the data are omitted in the report).
The output of the function hdpGLM is either an object of class 'dpGLM' or 'hdpGLM', depending on which model is estimated (DPP or hierarchical DPP).The object is a list with various elements, and here is an example of the main ones: R> class(mod) [1] "hdpGLM"

R> str(mod)
$ samples : 'mcmc' num [1:2344, 1:6] 1 2 1 2 3 4 5 6 1 2 .The object hdpGLM returned by the function hdpGLM contains an element named samples, which has the samples of the linear coefficients β for all clusters.That element belongs to the class 'mcmc', which is a class defined and used in various specialized R packages to produce diagnostics for MCMC samples, such as the package coda.The element named tau is the posterior distribution of the parameter τ .It is only returned when users estimate a hierarchical DPP regression.
The pik element of the object hdpGLM is a n × K matrix with the estimated probabilities that the observation i = 1, . . ., n belongs to the cluster k = 1, . . ., K, where K is the maximum number of clusters allowed (argument K of the function hdpGLM).The classification of the data points into clusters use those probabilities.

Estimation numerical summaries
The software package hdpGLM uses the default method summary to create a summary of the posterior samples.To illustrate the summary function provided by the hdpGLM package, consider the model fit previously, mod.Suppose we estimate the following model, which examines the effect of a new assignment policy (x 1 ) and teacher personality (x 2 ) on student performance (y).We collected data on many different schools, and measured school investment in extracurricular activities (W 1 ).
The dataset df contains columns named y, X1, X2, and W1 measuring the relevant variables.
We can estimate the coefficients β and search for latent subgroups (k) of students who respond differently to x 1 and x 2 by using formula1 = y ~X1 + X2 in the function hdpGLM, and we can estimate how school investment in extracurricular activities affects the clustering of student response to covariates by using formula2 = y ~W1.After the estimation, if we use the function summary to summarize the object returned by the function hdpGLM it will print the following (suppose df is the name of the data.framewith the school-student data):  -------------------------------Summary statistics of clusters with data points in each context ------------------------------- It gives a comprehensive assessment of the main results.The first block of the summary (named hdpGLM Object) gives the number of iterations used in the MCMC and the number of clusters activated during the estimation.For a hierarchical DPP regression, it also provides the number of contexts, and a summary of the clustering results (average, median, minimum, maximum, and standard deviation of the number of clusters found across contexts).Then, it prints a full list of the posterior summaries for the linear coefficients of each cluster and each context.Lastly, for hierarchical DPP regression, the function summary also prints the posterior summary for the parameter τ , grouped under "Context-level coefficients." The applied examples below illustrate the interpretation of these coefficients.
The package hdpGLM also provides a function summary_tidy to create tidy summaries of the output (Wickham 2014).The tidy summary returns the summary output in a tibble format.Because that function returns a tidy summary, the user can easily subset and slice the summary output to retrieve estimates for particular contexts (e.g., schools) or clusters.

Visualization
The package hdpGLM provides a variety of options to visualize the output of the estimation.It provides a generic method plot that takes the output of the hdpGLM function with the posterior samples and generates plots that include cluster posterior averages.The features of the plot depend on (1) the model estimated (DPP or hierarchical DPP), and (2) the user's choice to visualize aggregate or cluster-specific estimates of the coefficients.
Figure 1 shows an example of the output of the plot function for a non-hierarchical DPP regression estimated using the hdpGLM function with two covariates named x 1 and x 2 in the data.The plot shows the posterior distribution of all linear coefficients β for all clusters.The vertical lines represent the cluster posterior averages.
Users have many built-in options to control plot aesthetics.They can set the argument separate = TRUE in the plot function to plot the clusters separately.In that case, the plots also include the 95% highest posterior density intervals for each cluster and coefficient.The user can also select which coefficients to plot by including an argument term, which receives a vector of strings with the names of the covariates.For instance, if we use plot(mod, separate = TRUE, term = c("X1")), the plot created will display the posterior distribution of all clusters separately with their respective 95% HPD intervals, and it will plot only the distribution of the coefficients of X1.
If one estimated a hierarchical DPP, then the set of linear coefficients are distributed by cluster and contexts.Moreover, it estimates an additional set of coefficients τ representing the context-level effect.The package provides three built-in functions to visualize the results of the model estimation in that case.
Figure 2 is created by estimating a hierarchical DPP regression estimated using the hdpGLM function with two covariates named x 1 and x 2 in the data, and five contexts (e.g., schools) with a context-level feature named W 1 : R> mod <-hdpGLM(y ~X1 + X2, y ~W1, data = sim$data, context.id= "school", + mcmc = list(burn.in= 0, n.iter = 100)) R> plot(mod, context.id= "school", term = c("X1", "X2")) In this example, a variable school in the data contains the labels of the contexts: from School A up to School E. The panels in Figure 2 show the posterior distribution of all linear coefficients β for all contexts.
The package provides two additional plot functions, plot_tau and plot_pexp_beta, to display the effect of context-level features on the posterior expectation of the cluster coefficients β.The function plot_tau displays the posterior distribution of the parameter τ (see Model 3). Figure 3 shows the plot produced by the function plot_pexp_beta.The bottom row shows the posterior average of β kj for each cluster k and context j.The top row shows the association between the context-level feature (W 1 ) and the posterior averages of the β kj .
All the plot functions in the hdpGLM package return a ggplot2 object.Therefore, users of ggplot2 (Wickham 2016) can easily extend the plots and adjust its elements to fit their preferences.

Classification/clustering
After the estimation, we can classify the data points into clusters.It is achieved using the function classify.The classification uses the posterior probability of the cluster membership for each data point.Each point is classified into the cluster it has the highest probability to belong.The function receives two arguments, data and samples.The former receives the data.frameused in the estimation.The latter uses the output of the estimation.It will return the same data frame with an additional column called Cluster indicating the cluster membership of each observation.

Dirichlet Process Regression
This section shows a typical analysis workflow for applied research using the hdpGLM package.It assumes the goal of the analysis is to find latent heterogeneity in the association between the regressors and the outcome, and then cluster the sample based on that heterogeneity.From that standpoint, the typical analysis using hdpGLM follows the same pattern of other regression analyses in R, e.g., using the built-in lm R function.The only additional step for the hdpGLM is the MCMC convergence checks.The analysis starts with a set of observed covariates and an outcome variable, selected by the researchers based on domain knowledge.Then, the researchers estimate the model and check the linear coefficients and their summary statistics.If one is using the lm function, that means looking at the p-values and 95% confidence intervals of the linear coefficients.With the hdpGLM, one checks the posterior mean, 95% highest posterior density (HPD) intervals for each cluster, and the number of clusters found.The example below illustrates this approach.
A classic problem in political science is understanding the effect of people's income, the level of inequality in their neighborhood, and their ideology on their attitudes toward welfare policies (Alt and Iversen 2017;Armingeon and Weisstanner 2021).Consider the toy dataset welfare provided with the hdpGLM package.The dataset contains four variables.The outcome is people's support for welfare policies (support).It is a function of levels of inequality in people's neighborhoods (inequality), their personal income (income), and political ideology (ideology).
As in any Bayesian analysis using MCMC, an important step is to check the MCMC convergence.The user can run MCMC diagnostics directly on the output of the hdpGLM function on the element samples of the list it returns.For instance, using the coda package (Plummer et al. 2006), users can run geweke.diag(mod$samples)and get the Geweke convergence check (Geweke 1992), or check the MCMC standard error (Flegal, Haran, and Jones 2008) with the package mcmcse and the function mcse(mod$samples) (Flegal, Hughes, Vats, Dai, Gupta, and Maji 2021).It is not recommended to use the Gelman-Rubin convergence diagnostic (Brooks and Gelman 1998) due to the multimodal nature of the posterior distribution of the linear coefficients.A good summary of various MCMC convergence diagnostics can be found in (Cowles and Carlin 1996), and the package coda implements the various convergence checks discussed in that article.
Discussions about which model to choose, e.g., between lm and hdpGLM, when the results differ is beyond the scope of this article.Readers are referred to Ferrari (2020), which demonstrates that in a variety of circumstances, if there is no latent subgroups in the sample, the posterior average of the linear coefficients estimated using the hdpGLM function are indistinguishable from those produced by the lm function.The example above shows this feature for the effect of income, as the clusters differ only on the effect of inequality (see also the simulated example on Section 5.2).When clusters are found, one option is to compare root mean squared error (RMSE) of the predictive values, which can be achieved using the function predict, available for the objects returned by both functions.To illustrate the interpretation, in this toy example the effect of inequality is negative and homogeneous in country 3, as indicated by a unimodal distribution of β 1 on Figure 5, but there are latent subpopulations in countries 0, 1, 2, and 4. For some subpopulations in those countries, the effect of inequality is negative, but it is positive for other subgroups (as indicated by a multimodal distribution of β 1 in those contexts).

Hierarchical Dirichlet process regression
The parameter τ captures how the latent heterogeneity in each context is associated with the context-level features, gap in the example.We can plot the posterior distribution of τ using the function plot_tau or visualize its effect directly using the function plot_exp_beta.
Figure 6 uses the function plot_pexp_beta to display the posterior averages of β in each context and cluster as function of gap.
To illustrate the interpretation, when the country gender gap in the provision of public goods (gap) increases, the posterior expectation of the effect of income decreases for all clusters (upper row, third column of Figure 6), but gap does not affect how inequality is associated with the outcome (support for welfare policies) across various clusters in different countries.
As discussed in the previous section, MCMC diagnostics can be conducted using external packages to check convergence of the chain.For instance, users can run geweke.diag(mod$tau)and geweke.diag(mod$samples) to get the Geweke convergence check (Geweke 1992), or use other diagnostic checks for MCMC objects.
To classify the data points into clusters, we run the function classify.In this example, it is classify(data = df, samples = mod).It returns the dataset with a new column named Cluster that indicates the cluster membership of each observation.Ferrari (2020) shows other examples.Note that this is the only package currently on CRAN capable of estimating the latent heterogeneity in the linear coefficients, as illustrated in this example.

Performance comparison between hdpGLM and PReMiuM
Table 3 compares the two packages currently available in R -hdpGLM and PReMiuM -that have similar purposes of estimating DPP regression models.The tests were run serially on an AMD EPYC 7702 CPU at 2.18 GHz on a 64-bit Debian Linux system with 514GB RAM.
The comparison was based on the time the MCMC algorithm takes to run 500 iterations using the packages' default MCMC parameters.A similar performance test was conducted by PReMiuM developers (Liverani et al. 2015), and I mimic their test here.The number of predictors and the size of the data was set in advance, but the number of clusters in the data was randomly selected in each test.
Table 3 shows how the packages scale with the number of subjects.The hdpGLM outperforms the PReMiuM software package in all runs when the outcome variable was Gaussian.The gap in performance increases substantially with the size of the data set.The hdpGLM is around five times faster when the data set had 1,500 sample points and around 17 times faster when there were 5,000 observations.The opposite happens when the outcome variable is binary.The PReMiuM software is faster than the hdpGLM in this case, but the gap does not increase as much with the size of the data.
It is important to note that the two packages have a few overlapping functionalities, but differ in many aspects (see Table 1).One important difference is that the hdpGLM investigates latent heterogeneity on all predictors, not only the intercept, as the PReMiuM does.This extra feature of the hdpGLM package requires a HMC update within Gibbs for non-Gaussian outcomes, as discussed, which accounts for differences in performance for binary outcomes.Nevertheless, the performance of the two packages for binary outcomes are not drastically different as they are for Gaussian outcomes.
There are two limitations of the hdpGLM package, which are more generally related to computational aspects of Gibbs sampling or MCMC.One is scalability; the other is that large moves on the optimization curve are limited due to the restricted changes in each sampling step.In terms of scalability, the estimation may require large memory and processing time for large data sets.Table 3 shows that the package performs well with moderate-size data, and it is much faster than other existing alternatives to estimate similar models, but performance may be an issue depending on the size of the data.There are alternative approaches for DP and hierarchical DP mixtures using sub-cluster splits (Chang andFisher III 2013, 2014) and variational inference (Hughes and Sudderth 2013;Hughes, Kim, and  but these models need to be adapted to be used with hierarchical DP mixtures of GLMs.In future versions, the package hdpGLM should explore these alternative sampling methods and adapt them to estimate the hdpGLM.

Discussion
DPP regression is an alternative to classical regression models.It can be used in a variety of situations when we suspect that there is heterogeneity in the effect of one or more covariates on the response variable.DPP is one of the core approaches in semi-parametric Bayesian regression models and unsupervised learning regression estimation (Abbring and Heckman 2007;Hastie, Tibshirani, and Friedman 2009;Hastie et al. 2009;Hannah et al. 2011;Traunmuller et al. 2015).This paper describes the R package hdpGLM, which implements a generalization of DPP regression models, developed in Ferrari (2020), for easy-to-use estimation, summary, and visualization of DPP regression by R users.Users can find further documentation details and examples at http://www.diogoferrari.com/hdpGLM/.The package is designed to facil-

First
, the hdpGLM package returns the posterior samples in a 'mcmc' object of the R package coda(Plummer et al. 2006).Various other R packages use coda as the main engine for analysis of MCMC outputs, such as the MCMCpack (Martin, Quinn, and Park 2011), R2WinBUGS(Sturtz, Ligges, and Gelman 2005), MCMCglmm(Hadfield 2010), boa(Smith 2007), and many others.The user can use packages specialized in MCMC diagnostics directly on the output object of the estimation function hdpGLM, and take advantage of all the coda facilities to evaluate convergence of the MCMC and visualize sample outputs of DPP regression.

A
single function called hdpGLM handles the estimation of DPPs (Model 2) and hierarchical DPP (Model 3) regression models.It can receive one or two different regression formulas that follow standard R symbolic formula notation.Which model is estimated (DPP or hierarchical DPP) depends on the numbers of formulas provided, and the function automatically selects the appropriate sampler to use.A discussion of the design decisions is presented on Section 4.1.The signature of the function with its default arguments is: hdpGLM(formula1, formula2 = NULL, data, mcmc, family = "gaussian", K = 100, context.id= NULL, constants = NULL, n.display = 1000) . If users want to modify those parameters, epsilon must be a positive number.It is used in the Stormer-Verlet Integrator (a.k.a.leapfrog integrator) to solve the Hamiltonian Monte Carlo in the estimation of the model.The default value is 0.01.leapFrog must be an integer.It indicates the number of steps taken at each iteration of the HMC for the Stormer-Verlet Integrator.The default is 40.Finally, hmc_iter must receive an integer, and it indicates the number of HMC iteration(s) for each Gibbs iteration.The default is 1.An example of the argument mcmc is: list(n.iter= 10000, burn.in= 2000, epsilon = 0.01, leapFrog = 40, hmc_iter = 1)

Figure 1 :
Figure 1: Posterior distribution of β produced by function plot of the hdpGLM package.

Figure 2 :
Figure 2: Posterior distribution of β in each context (e.g., schools) produced by function plot of the hdpGLM package.

Figure 3 :
Figure 3: Posterior average of β for each context and cluster and the effect τ of cluster level features (W 1 ).

Figure 4 :
Figure 4: Association between income and support for welfare (left panels) and inequality and support for welfare (right panels) in each cluster

Figure 5 Figure 5 :
Figure5shows the posterior distribution of the linear coefficients β for each context.The following command generated the plot:

Figure 7 :
Figure 7: Posterior distribution of the β coefficients generated by the function plot of the package hdpGLM after the estimation using function hdpGLM and the simulated data whose true value of the coefficients are shown on Table2.

Table 1 :
(Plummer et al. 2006he hdpGLM package and other alternatives.The symbol ✓ indicates the feature is available, and × indicates it is not available.Note:(a)Current version of PReMiuM supports binary, categorical, count and continuous outcome variables.hdpGLM'salgorithm is built to support all GLM models; current version includes binary and continuous outcomes only; (b) For the hdpGLM, Gibbs is used for Gaussian outcomes, and HMC within Gibbs is required for non-Gaussian outcomes to cluster the data based on all linear coefficients; (c) Current version of hdpGLM includes coef, family, nobs, plot, predict, print, and summary; (d) Diagnostics for the hdpGLM can be computed using specialized packages designed for that purpose, such as coda(Plummer et al. 2006).
If we knew the number of subgroups (K) and the group membership for each observation, we could estimate K separate regressions, one for each group, or include a group indicator as an interactive term.The big advantage of the DPR is that the groups with differential effects don't need to be known or observed in advance.In other words, we can use a DPR model to investigate if there is effect heterogeneity caused by unobserved or omitted variables.The model estimates the group membership, the average effects of each covariate in each group, and the proportion of subjects in each cluster.
)×K , d = D x Denote Z i = k, or Z ik for short, the group membership of subject i on group k.Let π k denote the proportion of subjects on group k.The regression model becomes: of the distribution of the outcome variable of the mixture components: gaussian (Müller and Wickham 2023) Sriutaisuk, and Kim 2020)ham 2016;Gómez-Rubio 2017)available to manipulate the summary output to plot, using packages such as the ggplot2(Valero-Mora 2010;Ginestet 2011;Wilkinson 2012;Wickham 2016;Gómez-Rubio 2017)), or to create tables and reports using package ecosystems such as the tidyverse tools(Wickham et al. 2019;Lee, Sriutaisuk, and Kim 2020)or the R package kableExtra(Zhu 2021)to export the output table to L A T E X.For a non-hierarchical DPP regression, the function summary_tidy returns a tibble(Müller and Wickham 2023)with the posterior summary for β.For a hierarchical DPP regression, it returns a list of two tibble objects, one for β and one for τ : We can use this data to estimate the effect of inequality, income, and ideology on people's support for welfare.Using a classical linear model and the R built-in function lm, one would have the following results: Following the same example, consider the data set welfare2, also included in the hdpGLM package.This new dataset includes the same variables than the welfare dataset, but from five different countries.It has index to indicate the country (country), and the country-level gender gap in country's provision of public good (gap).Here are the first rows of that dataset:We can estimate the effect of the country-level gender gap on the latent heterogeneity we found in the previous example, but now using the hierarchical DPP regression.It is done by running the following code: 1 Either summary or summary_tidy will generate a numerical summary of the posterior samples.Using the summary_tidy function, we get a list of tidy summaries for β and τ .As explained in Section 4.3, the summary_tidy gives the summaries of the posterior distribution for the parameter β, but now for each context (country) as well.Additionally, the same summaries (posterior Mean and 95% HPD intervals) are provided for the parameter τ (see Model 3).

Table 3 :
Sudderth 2015), Comparing the time required to run 500 iterations in the two packages hdpGLM and PReMiuM R currently on CRAN to estimate DPR models.Note: The hdpGLM clusters data after investigating latent-heterogeneity on all linear coefficients, while the PReMiuM package models heterogeneity in the intercept term only.