A Bayesian Approach for Model-Based Clustering of Several Binary Dissimilarity Matrices: The dmbc Package in R

We introduce the new package dmbc that implements a Bayesian algorithm for clustering a set of binary dissimilarity matrices within a model-based framework. Specifically, we consider the case when S matrices are available, each describing the dissimilarities among the same n objects, possibly expressed by S subjects (judges), or measured under different experimental conditions, or with reference to different characteristics of the objects themselves. In particular, we focus on binary dissimilarities, taking values 0 or 1 depending on whether or not two objects are deemed dissimilar. We are interested in analyzing such data using multidimensional scaling (MDS). Differently from standard MDS algorithms, our goal is to cluster the dissimilarity matrices and, simultaneously, to extract an MDS configuration specific to each cluster. To this end, we develop a fully Bayesian three-way MDS approach, where the elements of each dissimilarity matrix are modeled as a mixture of Bernoulli random vectors. The parameter estimates and the MDS configurations are derived using a hybrid Metropolis-Gibbs Markov Chain Monte Carlo algorithm. We also propose a BIC-like criterion for jointly selecting the optimal number of clusters and latent space dimensions. We illustrate our approach using both synthetic data and a publicly available data set taken from the literature. For the sake of efficiency, the core computations in the package are implemented in C/C++. The package also allows the simulation of multiple chains through the support of the parallel package.


Introduction
Data consisting of proximity measures, that is, measurements of the pairwise similarity or dissimilarity between n objects, abound in many fields, ranging from psychology, sociology and market research, to ecology, demography, economics, as well as genomics and linguistics. A dissimilarity matrix D, whose (i, h)-th entry, $d_{ih}$, represents the dissimilarity between objects i and h, is typically analyzed using multidimensional scaling (MDS; for a comprehensive overview see Borg and Groenen 2005). MDS is a factorial technique that aims at finding a set of latent factors (the so-called MDS configuration) in a low-dimensional Euclidean space, such that the distances between objects in this space resemble as much as possible the original $d_{ih}$'s. In its basic formulation, MDS involves a single dissimilarity matrix. Different MDS variants have been introduced in the literature. The classical or metric MDS (CMDS; Torgerson 1952, 1958) assumes that the $d_{ih}$'s are Euclidean distances plus a random error. Nonmetric MDS (Shepard 1962; Kruskal 1964) handles ordinal data, assuming that only the ordering of the dissimilarities is relevant in the determination of the MDS configuration. The ALSCAL algorithm (Takane, Young, and De Leeuw 1977) combines metric and nonmetric solutions and derives the MDS configuration by fitting a function (whose features depend on the type of data) of the $d_{ih}$'s with errors added (see Krantz, Luce, Suppes, and Tversky 1971). Also, maximum likelihood (metric) MDS approaches have been proposed (Ramsay 1977, 1978, 1982; Takane 1978a,b, 1981; Takane and Carroll 1981; Takane and Sergent 1983; Shibayama 1986, 1992), which allow for the selection of the number of MDS factors using likelihood ratio tests or Akaike's criterion (Takane 2007). Adopting a Bayesian approach, Oh and Raftery (2001) introduced a model where the $d_{ih}$'s follow a truncated normal distribution whose expected value is the Euclidean distance based upon the MDS factors. The authors rely on Markov Chain Monte Carlo (MCMC) methods for estimation and also introduce a criterion to select the number of dimensions.
A further major development in the literature on MDS is the so-called three-way MDS, where several dissimilarity matrices, say $D_1, \ldots, D_S$, are taken into account simultaneously (Young 1987; Borg and Groenen 2005; Giguère 2006). This situation typically arises when the pairwise dissimilarities among n objects are judgments expressed by different subjects (usually playing the role of judges or raters), or when they are measured under different experimental conditions, or with reference to different characteristics of the objects themselves. For the sake of exposition, and without loss of generality, throughout the paper we will refer to the first situation, thus assuming that the dissimilarity matrices describe the judgments expressed by S subjects.
Although our approach can in principle be applied to any type of dissimilarities, here we focus on binary dissimilarities, taking only values 0 or 1 depending on whether or not two objects are deemed dissimilar. We concentrate on binary dissimilarities because such data are often analyzed using simplistic approaches. Indeed, MDS is frequently applied to a consensus dissimilarity matrix obtained by summing or averaging the individual ones. In other cases, groups of dissimilarity matrices are formed a priori (based on experimental conditions or on the subjects' characteristics, such as gender or age), and three-way MDS (typically, either replicated MDS, RMDS, or weighted MDS, WMDS) is performed on the consensus matrices formed for each group. Clearly, in the latter case, the groups are identified a priori, without evaluating whether they are actually different and/or internally homogeneous. While this can be reasonable when the analysis is confirmatory, an exploratory approach would instead surely benefit from a data-driven procedure aimed at identifying the possible clusters and their specific MDS configurations. As already mentioned, this makes it possible to ignore possibly negligible differences across subjects, to detect groups of subjects sharing similar opinions, and to better shape and identify their most distinctive traits. Additionally, it simplifies subsequent analyses, which can be based on a reduced number of groups rather than on the original set of subjects. The model implemented in the dmbc package provides a possible solution to achieve these goals. Embedding the model in a Bayesian framework is particularly convenient because it permits a full inferential assessment of the obtained results. Furthermore, we build on the work of Oh and Raftery (2007), and introduce a BIC-like criterion for simultaneously selecting the optimal number of latent dimensions, p, and number of clusters, G.
A distinctive feature of our approach compared to classic RMDS or WMDS is that the latter extract a single MDS configuration, while we obtain a different configuration for each cluster of subjects. From this perspective, our proposal can be regarded as an extension of RMDS, because we assume that there are groups of subjects, not known a priori, whose dissimilarities derive from the same cluster-specific MDS configuration. Instead, our method differs from WMDS, because we assume that all the subjects placed in the same cluster share the same configuration. Note that the more complex Bayesian model introduced by Okada and Lee (2016), which assumes that subjects in the same cluster assign different weights to the common cluster-specific MDS factors, fully integrates RMDS and WMDS. The introduction of individual weights allows capturing differences across subjects, even when they are placed in the same cluster. Our model instead leads to a cluster-wise compromise solution, which can nonetheless be more representative and easier to interpret. In addition, our simpler model has the advantage of allowing the definition of a criterion for model selection.
Before proceeding, it is worth mentioning that our procedure shares some features with the approach introduced by Hoff, Raftery, and Handcock (2002) and extended by Handcock, Raftery, and Tantrum (2007) to model social networks based on the assumption of the existence of a latent (unique) "social space". Specifically, Hoff et al. (2002) refer to a class of random graph models where the nodes represent n social actors, the edge between two nodes signals a specified relation between them, and the probability of observing a relation depends on the distance between the nodes in a latent space. Handcock et al. (2007) extend the model to account for the possible clustering of the n actors based on their positions in the unique latent space. Instead, our goal is to cluster the S subjects according to their judgments about the similarity of the n objects, thus partitioning configurations rather than positions.
More specifically, we model the joint distribution of each subject's dissimilarities as a finite mixture of Bernoulli random vectors, with parameters depending on the cluster-specific MDS factors, and we estimate all the unknown quantities (i.e., the parameters and the latent positions) using a hybrid Metropolis-Gibbs algorithm. Our package also allows multiple MCMC chains to be run simultaneously by exploiting the capabilities of the parallel package (R Core Team 2021). To make the computations more efficient, most of the routines included in the package are written in C/C++ with the support of the Rcpp package (Eddelbuettel and François 2011).
The dmbc package is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=dmbc and can be installed directly using, for example, install.packages("dmbc"). The development version of the package can be retrieved from https://github.com/sergioventurini/dmbc. The paper first briefly reviews the model and then presents the package features in full. More specifically, Section 2 introduces our model-based approach for clustering a set of binary dissimilarity matrices, together with the details of the MCMC algorithm for its estimation. Section 3 illustrates the proposed criterion to simultaneously select the number of latent dimensions and the number of clusters. Section 4 provides a description of the features available in the dmbc package and shows how to apply them to simulated data. In Section 5 we apply the package's functionalities to a data set available in the literature. Finally, Section 6 summarizes and provides some closing remarks.

Model specification and estimation
We consider S binary and symmetric dissimilarity matrices, $D_1, D_2, \ldots, D_S$, so that the (i, h)-th element of $D_j$, $d_{ih,j}$, can only take values 0 or 1, with $d_{ih,j} = d_{hi,j}$ if $i \neq h$, and $d_{ii,j} = 0$. We assume that the S subjects are partitioned into G clusters, with G a priori unknown. To represent the subjects' group memberships we refer to a set of latent variables, $x_j$, such that $P(x_j = g) = \lambda_g$, with $\lambda_g \geq 0$ and $\sum_{g=1}^{G} \lambda_g = 1$. Matrices assigned to the same cluster are assumed to share a common p-dimensional latent configuration space for the n objects, with p also unknown. Therefore, for the i-th object, a latent vector exists, $z_i^g := (z_i \mid x_j = g) = (z_{i1}^g, \ldots, z_{ip}^g)$, reflecting the factor scores possibly underlying the dissimilarity judgments expressed by the subjects in the g-th cluster. Thus, the true dissimilarity between objects i and h for the subjects in the g-th cluster, $\delta_{ih}^g$, can be expressed as the Euclidean distance between $z_i^g$ and $z_h^g$,

$$\delta_{ih}^g = \left\| z_i^g - z_h^g \right\| = \left[ \sum_{k=1}^{p} \left( z_{ik}^g - z_{hk}^g \right)^2 \right]^{1/2}.$$

Conditionally on the cluster, we assume that the observed dissimilarity is a Bernoulli random variable,

$$d_{ih,j} \mid x_j = g \sim \mathrm{Ber}(\pi_{ih}^g), \qquad \mathrm{logit}(\pi_{ih}^g) = \alpha_g + \delta_{ih}^g, \tag{1}$$

or equivalently,

$$P(d_{ih,j} = d \mid x_j = g) = \left(\pi_{ih}^g\right)^{d} \left(1 - \pi_{ih}^g\right)^{1-d}, \qquad d \in \{0, 1\}. \tag{2}$$

Further, the j-th subject's dissimilarities are assumed to be conditionally independent given the latent positions in the configuration space, so that

$$f(d_j \mid x_j = g) = \prod_{i<h} \left(\pi_{ih}^g\right)^{d_{ih,j}} \left(1 - \pi_{ih}^g\right)^{1 - d_{ih,j}}, \tag{3}$$

where $d_j$ denotes the vector whose elements are the $m = n(n-1)/2$ elements in the lower triangular part of $D_j$, and $\pi^g$ denotes the corresponding vector of $\pi_{ih}^g$'s. Hence, the unconditional distribution of $d_j$ turns out to be the mixture

$$f(d_j) = \sum_{g=1}^{G} \lambda_g\, f(d_j \mid x_j = g). \tag{4}$$

We aim at finding the G clusters and at estimating the latent p-dimensional configuration for each cluster. Note that p is assumed to be the same for all the clusters. This may appear a strong assumption, possibly reducing the flexibility of the model. Nonetheless, if the unknown dimensions of the clusters' configurations, $(p_1, \ldots, p_G)$, are actually different and vary between, say, $p_{\min}$ and $p_{\max}$, we can expect that a dimension $p = p_{\max}$ will be selected, and that for a cluster with $p_g < p$, $(p - p_g)$ factors will turn out to be irrelevant. Therefore, the assumption of a constant number of latent dimensions in the different clusters should not have a large impact on the results.
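To make the data-generating process above concrete, the following minimal R sketch simulates a set of binary dissimilarity matrices from the mixture model; all names and parameter values are arbitrary illustrations and are not part of the package:

## Minimal sketch of the data-generating process (illustrative values).
set.seed(101)
n <- 16; p <- 2; G <- 3; S <- 10
lambda <- c(0.3, 0.4, 0.3)                  # mixing proportions
alpha <- c(-3, -1.5, -2)                    # cluster-specific intercepts
x <- sample(1:G, S, replace = TRUE, prob = lambda)        # memberships
Z <- lapply(1:G, function(g) matrix(rnorm(n * p), n, p))  # latent configurations
diss <- vector("list", S)
for (j in 1:S) {
  delta <- as.matrix(dist(Z[[x[j]]]))       # cluster-specific distances
  pi_g <- plogis(alpha[x[j]] + delta)       # inverse logit of Equation 1
  d <- matrix(rbinom(n * n, 1, pi_g), n, n)
  d[upper.tri(d)] <- t(d)[upper.tri(d)]     # enforce symmetry
  diag(d) <- 0
  diss[[j]] <- as.dist(d)
}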

Model estimation
The adoption of a fully Bayesian approach requires the specification of prior distributions for all the unknown parameters. For the latent positions we assume a p-variate normal distribution, $z_i^g \sim N_p(0, \eta_g I_p)$. The assumption of a spherical covariance matrix avoids an excessive number of parameters, but in principle it could be relaxed. For $\eta_g$ we use an inverse gamma distribution with mode $b_g/(a_g + 1)$, $\eta_g \sim IG(a_g, b_g)$. For the logit cluster-specific intercept in Equation 1, we assume that $\alpha_g \sim N(0, \sigma_g^2)$, with $\sigma_g^2 \sim IG(a_0, b_0)$. Finally, we use a G-dimensional Dirichlet distribution for the vector of mixing probabilities, $\lambda \sim \mathcal{D}_G(\nu_1, \ldots, \nu_G)$. All the parameters are assumed to be a priori independent. Based on these specifications, the full conditional posterior distributions of the model parameters are:

$$p(Z^g \mid \text{else}) \propto \tilde{\ell}(Z^g, \alpha_g, x) \prod_{i=1}^{n} \phi_p(z_i^g \mid 0, \eta_g I_p), \tag{5}$$

$$(\eta_g \mid \text{else}) \sim IG\left(a_g + \frac{np}{2},\; b_g + \frac{1}{2} \sum_{i=1}^{n} \|z_i^g\|^2\right), \tag{6}$$

$$p(\alpha_g \mid \text{else}) \propto \tilde{\ell}(Z^g, \alpha_g, x)\, \phi_1(\alpha_g \mid 0, \sigma_g^2), \tag{7}$$

$$(\sigma_g^2 \mid \text{else}) \sim IG\left(a_0 + \frac{1}{2},\; b_0 + \frac{\alpha_g^2}{2}\right), \tag{8}$$

$$P(x_j = g \mid \text{else}) \propto \lambda_g\, f(d_j \mid x_j = g), \tag{9}$$

$$(\lambda \mid \text{else}) \sim \mathcal{D}_G(\nu_1 + S_1, \ldots, \nu_G + S_G), \tag{10}$$

where $x = (x_1, \ldots, x_S)$, $Z^g = (z_1^g, \ldots, z_n^g)$, $\phi_p(\cdot \mid \mu, \Sigma)$ indicates the p-dimensional multivariate normal density with mean vector $\mu$ and covariance matrix $\Sigma$, and $\tilde{\ell}(Z^g, \alpha_g, x)$ represents the contribution of the g-th cluster to the overall likelihood:

$$\tilde{\ell}(Z^g, \alpha_g, x) = \prod_{i<h} \left(\pi_{ih}^g\right)^{d_{ih,+}^g} \left(1 - \pi_{ih}^g\right)^{S_g - d_{ih,+}^g},$$

with $d_{ih,+}^g = \sum_{j=1}^{S} d_{ih,j}\, I(x_j = g)$ and $S_g = \sum_{j=1}^{S} I(x_j = g)$. The notation $(\cdot \mid \text{else})$ in Equations 5-10 indicates conditioning on all the other unknowns in the model.
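To fix ideas, the conjugate draws in Equations 6, 8 and 10 reduce to standard simulations; the sketch below illustrates them, with the current state of the sampler replaced by placeholder values (none of the names belong to the package):

## Placeholder state for one iteration of the sampler (illustrative values).
set.seed(1)
n <- 16; p <- 2; G <- 3; S <- 10
a_g <- 1.5; b_g <- 0.5; a_0 <- 0.1; b_0 <- 0.1; nu <- rep(1, G)
Z_g <- matrix(rnorm(n * p), n, p)        # current latent configuration
alpha_g <- -2                            # current cluster intercept
x <- sample(1:G, S, replace = TRUE)      # current cluster memberships
rinvgamma <- function(a, b) 1 / rgamma(1, shape = a, rate = b)
## Equation 6: eta_g | else ~ IG(a_g + np/2, b_g + sum_i ||z_i^g||^2 / 2)
eta_g <- rinvgamma(a_g + n * p / 2, b_g + sum(Z_g^2) / 2)
## Equation 8: sigma2_g | else ~ IG(a_0 + 1/2, b_0 + alpha_g^2 / 2)
sigma2_g <- rinvgamma(a_0 + 1 / 2, b_0 + alpha_g^2 / 2)
## Equation 10: lambda | else ~ Dirichlet(nu + cluster sizes), via gammas
S_counts <- tabulate(x, nbins = G)
w <- rgamma(G, shape = nu + S_counts, rate = 1)
lambda <- w / sum(w)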
To simulate from the posterior distributions, an MCMC algorithm is used, which iterates over the model parameters, latent positions and cluster memberships, updating them in turn. We use a Gibbs sampler for the parameters whose full conditional posterior distribution is available, while we adopt a Metropolis-Hastings step otherwise.
By cycling over the clusters, that is, for g = 1, . . . , G, we use a Metropolis-Hastings step to update the $(Z^g)^{(t)}$'s, sampling the n objects in a random order. For the i-th sampled object, a normal proposal density centered at the current value is used, $(z_i^g)^* \sim N_p\big((z_i^g)^{(t)}, \gamma_Z^2 I_p\big)$, and the candidate configuration $(Z^g)^*$ is accepted with probability

$$\min\left\{1,\; \frac{\tilde{\ell}\big((Z^g)^*, \alpha_g^{(t)}, x^{(t)}\big)\, \phi_p\big((z_i^g)^* \mid 0, \eta_g^{(t)} I_p\big)}{\tilde{\ell}\big((Z^g)^{(t)}, \alpha_g^{(t)}, x^{(t)}\big)\, \phi_p\big((z_i^g)^{(t)} \mid 0, \eta_g^{(t)} I_p\big)}\right\}.$$

Note that we slightly abuse the notation here, since the $z_i^g$ are updated one at a time. Indeed, in the expression above it is $(Z^g)^* = \big(z_1^g, \ldots, (z_i^g)^*, \ldots, z_n^g\big)$, with only one object's vector of latent positions, the i-th, updated. A further Metropolis-Hastings step, based on a normal proposal centered at the current value, is used to update $\alpha_g^{(t)}$. The variances of the proposal distributions, $\gamma_Z^2$ and $\gamma_\alpha^2$, are chosen to achieve a good performance of the algorithm in terms of acceptance rate. Preliminary testing suggested that setting $\gamma_Z^2 = 1.5$ and $\gamma_\alpha^2 = 0.75$ provides satisfactory results in terms of acceptance rates (i.e., between 20% and 30%) in most cases.
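A compact sketch of the random-walk Metropolis step for $\alpha_g$ may help fix ideas; it continues the placeholder state above, with d_sum and S_g standing in for the cluster-g data summaries entering $\tilde{\ell}$ (again, none of these names belong to the package):

## Placeholder data summaries for cluster g: S_g subjects, and
## d_sum[i, h] = number of subjects in the cluster with d_ih = 1.
S_g <- 4
d_sum <- matrix(rbinom(16 * 16, S_g, 0.3), 16, 16)
## Log of the cluster-g likelihood contribution (the l-tilde function above).
loglik_g <- function(Z, alpha, d_sum, S_g) {
  delta <- as.matrix(dist(Z))
  pi_g <- plogis(alpha + delta)
  lt <- lower.tri(delta)
  sum(d_sum[lt] * log(pi_g[lt]) + (S_g - d_sum[lt]) * log(1 - pi_g[lt]))
}
## One Metropolis update, with the default proposal variance 0.75.
alpha_star <- rnorm(1, mean = alpha_g, sd = sqrt(0.75))
log_ratio <- loglik_g(Z_g, alpha_star, d_sum, S_g) -
  loglik_g(Z_g, alpha_g, d_sum, S_g) +
  dnorm(alpha_star, 0, sqrt(sigma2_g), log = TRUE) -
  dnorm(alpha_g, 0, sqrt(sigma2_g), log = TRUE)
if (log(runif(1)) < log_ratio) alpha_g <- alpha_star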
Every MCMC iteration for the cluster-specific parameters is completed by updating $\eta_g$, $\sigma_g^2$, $x$, and $\lambda$, using their full conditionals, shown in Equations 6, 8, 9 and 10, respectively.
We compute the parameter estimates as the posterior means from the corresponding MCMC chain after deleting the burn-in iterations.

Initialization
To initialize the MCMC simulation, starting values for all the model's parameters must be provided. As for the initial partition $x^{(0)}$, three different approaches are available in the package (see Section 4 for the corresponding syntax specification):

1. Select the initial partition randomly.

2. Partition the S subjects into G groups by applying a Ward hierarchical clustering algorithm to the vectors of observed dissimilarities $d_j$, $j = 1, \ldots, S$, using a specified distance measure (Everitt, Landau, Leese, and Stahl 2011).

3. Provide a user-defined initial partition.
In our experience, the choice of the initial partition does not seem to systematically affect the final results of the analysis. In any case, in all the examples reported in this paper, we set the initial partition randomly, to let the algorithm learn from the data without being influenced by any specific initial choice.
No matter how the initial partition is chosen, the corresponding vector of cluster proportions, $\lambda^{(0)} = (n_1/S, \ldots, n_G/S)$, where $n_g$ is the number of subjects initially assigned to cluster g, is used to initialize $\lambda$. Given the initial partition, the starting p-dimensional MDS configuration for the g-th cluster, $(Z^g)^{(0)}$, is derived through classical metric MDS from a consensus matrix $D^g$ obtained by summing the dissimilarity matrices of the subjects assigned to cluster g.
To initialize $\eta_g$ we use the average sample variance computed over the p coordinates of $(Z^g)^{(0)}$, denoted by $\bar{s}_g^{(0)}$. For $\alpha_g$ we rely on the means of the elements of the $d_j$'s in the g-th cluster, $\bar{d}_{ih}^g$, and on the overall mean of the dissimilarities in the g-th cluster, $\bar{d}^g$. For each pair of objects, we define the artificial binary dissimilarity $\tilde{d}_{ih}^g$ taking value 1 or 0 depending on whether or not $\bar{d}_{ih}^g > \bar{d}^g$. We fit the logistic regression model (Equation 1) relating the $\tilde{d}_{ih}^g$'s to the distances computed using the initial configuration $(Z^g)^{(0)}$, and take the intercept estimate of this model as $\alpha_g^{(0)}$ and the squared standard error of the estimate itself as a starting value for $\sigma_g^2$, the variance of $\alpha_g$.
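The initialization just described can be sketched in a few lines of plain R; here diss is a list of 'dist' objects like the one simulated earlier, and all names are again hypothetical:

## Sketch of the initialization for a given cluster g.
g <- 1; p <- 2
x0 <- sample(1:3, length(diss), replace = TRUE)  # random initial partition
Sg <- sum(x0 == g)
Dg <- Reduce("+", lapply(which(x0 == g), function(j) as.matrix(diss[[j]])))
Z0 <- cmdscale(Dg, k = p)                      # classical metric MDS
eta0 <- mean(apply(Z0, 2, var))                # average variance of the p coordinates
dbar <- Dg[lower.tri(Dg)] / Sg                 # pairwise means in cluster g
d_art <- as.integer(dbar > mean(dbar))         # artificial binary dissimilarities
delta0 <- as.matrix(dist(Z0))[lower.tri(Dg)]   # distances from Z0
fit <- glm(d_art ~ delta0, family = binomial)  # logistic regression (Equation 1)
alpha0 <- unname(coef(fit)[1])                 # starting value for alpha_g
sigma2_0 <- summary(fit)$coefficients[1, 2]^2  # squared SE of the intercept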
The hyper-parameters $(a_0, b_0, a_1, b_1, \ldots, a_G, b_G, \nu_1, \ldots, \nu_G)$ need to be specified by the user. As for $\eta_g$, we set $a_g = 3/2$ and $b_g = \frac{1}{2}\bar{s}_g^{(0)}$, so that the prior mean of $\eta_g$ is fixed at $\bar{s}_g^{(0)}$. We use a noninformative prior for $\sigma_g^2$, choosing $a_0 = b_0 = 10^{-1}$. Finally, for the prior distribution of $\lambda$ we follow the standard approach in the literature on Bayesian finite mixtures (Robert 1996) and set $\nu_g = 1$ for $g = 1, \ldots, G$. Tests on the effect of changing the values of the hyper-parameters showed little impact on the final numerical results, and almost no effect on the qualitative conclusions for all the data sets we analyzed.

Identifiability and post-processing
Since the Euclidean distance is invariant under translation, rotation, and reflection, so is the likelihood. To solve the consequent identifiability issues, we follow a standard approach (see for example Oh and Raftery 2001; Hoff et al. 2002) and post-process the sampled latent positions by applying a Procrustes transformation (Gower and Dijksterhuis 2004; Borg and Groenen 2005). Thus, the resulting configuration $Z^g$, $g = 1, \ldots, G$, is rotated toward maximum similarity with a reference configuration, which we set equal to the configuration at the last iteration of the MCMC algorithm.
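As an illustration of this post-processing step, the sketch below rotates the sampled configurations of one cluster toward the configuration at the last iteration, using the procrustes() function from the MCMCpack package, which is also what dmbc relies on internally (see Section 4.2); Z_chain is a hypothetical array of sampled configurations:

library("MCMCpack")
Z_chain <- array(rnorm(100 * 8 * 2), dim = c(100, 8, 2))  # placeholder draws
T_it <- dim(Z_chain)[1]
Z_ref <- Z_chain[T_it, , ]                # reference: last MCMC iteration
for (t in seq_len(T_it - 1)) {
  proc <- procrustes(Z_chain[t, , ], Z_ref, translation = TRUE)
  Z_chain[t, , ] <- proc$X.new            # rotated (and translated) configuration
}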
The likelihood is also invariant under a permutation of the clusters' labels (Stephens 2000). To fix this label switching issue, we choose the relabeling approach proposed by Celeux, Hurn, and Robert (2000).

Model choice
In the previous section we described the inferential procedure for given G, the number of clusters, and p, the dimension of the latent spaces. The proper choice of p and G is a relevant issue in MDS and in cluster analysis, respectively. Embedding the problem in a Bayesian framework makes it possible to develop selection criteria. Specifically, we extend to our case the approach proposed by Oh and Raftery (2007) to simultaneously choose p and G; they consider the case of a single numerical dissimilarity matrix, with the aim of clustering the objects.
Following Oh and Raftery (2007), we propose to select (p, G) based on $p(\hat{Z}_{pG}, p, G \mid D)$, the posterior density function of (Z, p, G) given the data, D, evaluated at $Z = \hat{Z}_{pG}$, the estimated posterior mean of the latent configuration given (p, G). Note that

$$p(\hat{Z}_{pG}, p, G \mid D) = c\, f(D \mid \hat{Z}_{pG}, p, G)\, p(\hat{Z}_{pG}, p, G), \tag{11}$$

where c is a constant, and $f(D \mid \hat{Z}_{pG}, p, G)$ and $p(\hat{Z}_{pG}, p, G)$ are respectively the marginal likelihood and the marginal prior:

$$f(D \mid \hat{Z}_{pG}, p, G) = \int f(D \mid \hat{Z}_{pG}, \theta, p, G)\, p(\theta \mid p, G)\, d\theta, \qquad p(\hat{Z}_{pG}, p, G) = p(\hat{Z}_{pG} \mid p, G)\, p(p, G). \tag{12}$$

Assuming equal prior probabilities for all $p \in (p_{\min}, p_{\max})$ and for all $G \in (G_{\min}, G_{\max})$, Equation 12 can be written as

$$p(\hat{Z}_{pG}, p, G \mid D) \propto f(D \mid \hat{Z}_{pG}, p, G)\, p(\hat{Z}_{pG} \mid p, G). \tag{13}$$

The marginal prior of Z for given p and G, obtained by integrating $\eta_g$ out of the normal prior of the latent positions, is

$$p(\hat{Z}_{pG} \mid p, G) = \prod_{g=1}^{G} \frac{b_g^{a_g}\, \Gamma(a_g + np/2)}{\Gamma(a_g)\, (2\pi)^{np/2}} \left( b_g + \frac{1}{2} \sum_{i=1}^{n} \|\hat{z}_i^g\|^2 \right)^{-(a_g + np/2)}. \tag{14}$$

The marginal likelihood is not available in closed form, but it can be approximated using the Laplace-Metropolis method (Lewis and Raftery 1997). More specifically, by defining the 3G-dimensional vector $\theta = (\alpha, \sigma^2, \lambda)$, we obtain

$$f(D \mid \hat{Z}_{pG}, p, G) \approx (2\pi)^{3G/2}\, |H^*|^{1/2} \exp\{h(\theta^*)\}, \tag{15}$$

where $h(\theta) = \log\big[f(D \mid \hat{Z}_{pG}, \theta, p, G)\, p(\theta \mid p, G)\big]$, $\theta^*$ is the value of $\theta$ at which $h(\cdot)$ attains its maximum, and $H^*$ is minus the inverse Hessian of $h(\cdot)$ evaluated at $\theta^*$. Note that when G = 1, it is $\theta = (\alpha, \sigma^2)$.
As suggested by Lewis and Raftery (1997), we compute $\theta^*$ as the multivariate median of the posterior sample $\{\theta^{(\ell)} : \ell = 1, \ldots, L\}$, defined as $\theta^* = \arg\min_q \sum_{\ell=1}^{L} \|\theta^{(\ell)} - q\|$. As for $H^*$, note that asymptotically it equals the (marginal) posterior covariance matrix. To limit the impact of possible outliers, we use a robust estimator of the posterior covariance matrix, namely the minimum covariance determinant estimator (see Rousseeuw 1984; Rousseeuw and Van Driessen 1999).
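For illustration, both ingredients can be computed along the following lines; theta_sample is a hypothetical matrix of posterior draws of $(\alpha, \sigma^2, \lambda)$ (with one element of $\lambda$ dropped to avoid a singular covariance matrix), and the search for the multivariate median is restricted, for simplicity, to the sampled points:

library("robustbase")                           # for the MCD estimator
theta_sample <- matrix(rnorm(500 * 4), 500, 4)  # placeholder posterior draws
## Multivariate (L1) median over the sampled points:
l1_dist <- function(q) sum(sqrt(rowSums(sweep(theta_sample, 2, q)^2)))
theta_star <- theta_sample[which.min(apply(theta_sample, 1, l1_dist)), ]
## Robust estimate of the posterior covariance matrix, used for H*:
H_star <- covMcd(theta_sample)$cov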
As pointed out in Oh and Raftery (2001, 2007), a criterion based on $p(\hat{Z}_{pG}, p, G \mid D)$ would favor larger values of p. Indeed, $p(\hat{Z}_{pG} \mid p, G)$ is positively related to the scale of Z, which tends to decrease as p increases, even without any improvement in fit. To avoid this shrinking effect, two configurations should be compared using the same number of dimensions. Therefore, to compare dimensions p and (p − 1) we refer to $\hat{Z}_p$ and to $\hat{Z}_p^* = (\hat{Z}_{p-1} : 0)$, a p-dimensional configuration obtained by adding to $\hat{Z}_{p-1}$ a factor with all coordinates equal to zero. Note that $\hat{Z}_p^*$ returns the same likelihood as $\hat{Z}_{p-1}$, and that if (p − 1) is the correct dimension, then the optimal solution in the p-dimensional space should be (close to) $\hat{Z}_p^*$.
In both configurations, $\alpha$ and $\sigma^2$ are integrated out of the likelihood via $\int\!\!\int f(D \mid Z, \alpha, \sigma^2, p, G)\, p(\alpha \mid \sigma^2)\, p(\sigma^2)\, d\alpha\, d\sigma^2$, and the comparison between configurations with dimensions p and (p − 1) can be based on

$$\frac{p(\hat{Z}_p, p, G \mid D)}{p(\hat{Z}_{p-1}, p-1, G \mid D)\, A_p}, \tag{16}$$

where $A_p$ can be regarded as a correction factor to the posterior density ratio of $\hat{Z}_p$ and $\hat{Z}_{p-1}$ for the shrinking effect. Using Equation 14 one can show that

$$A_p = \frac{\Gamma(a + np/2)}{\Gamma\big(a + n(p-1)/2\big)\, (2\pi)^{n/2}} \left( b + \frac{1}{2} \sum_{i=1}^{n} \sum_{k=1}^{p-1} \hat{z}_{ik}^2 \right)^{-n/2}, \tag{17}$$

where a and b are the hyper-parameters corresponding to G = 1, and $\hat{z}_{ik}$ is the estimate of the latent space coordinate (for the i-th object and the k-th factor) in the (p − 1)-dimensional configuration corresponding to G = 1. Since the shrinking effect regards only p and not G, we use the same $A_p$ for all values of G given the same value of p, as suggested in Oh and Raftery (2007). Hence, to choose simultaneously p and G, we propose the following information criterion:

$$\mathrm{DCIC}_{pG} = -2 \log f(D \mid \hat{Z}_{pG}, p, G) - 2 \log p(\hat{Z}_{pG} \mid p, G) - 2 \sum_{q=2}^{p} \log A_q, \tag{18}$$

where Equation 16 has been used to derive the final expression. The values of p and G corresponding to the minimum $\mathrm{DCIC}_{pG}$ should be selected as the best solution. The encouraging results obtained in a small simulation study (not reported here) suggest the reliability of the proposed criterion for the selection of the correct model.

The dmbc package
The dmbc package allows fitting the model described above and computing the $\mathrm{DCIC}_{pG}$ index for the sequence of models with latent space dimension up to p and number of clusters up to G. Additionally, it includes functionalities for summarizing and plotting all the results. In this section, we provide a detailed description of the package's features.

Object classes
The package uses the S4 system and is built around six classes, whose features and slots are described below.
'dmbc_data': Defines instances representing the input data of a dissimilarity model-based clustering (DMBC) analysis. The class's slots are: diss, a list containing the subjects' dissimilarity matrices, which must be objects of class 'dist'; n, the number of objects assessed by each subject; S, the number of subjects. A further class member, family, specifies the distribution of the dissimilarities; it is not used in the current version of the package and, since we limit attention to binary dissimilarities, it is at the moment internally constrained to "binomial".
'dmbc_model': Defines objects providing the main elements of a DMBC model, namely the dimension of the latent space, p, the number of clusters, G, and family having the same content as described above.
'dmbc_fit': Defines objects containing the results of a DMBC model fitted using a single MCMC chain. It includes: the simulated chains for the model's parameters; the Metropolis-Hastings acceptance rates; the list of the observed dissimilarity matrices (diss); a list (dens) with the log-likelihood, log-prior and log-posterior values at each iteration; the control parameters; the prior hyper-parameter choices; the parameters of the proposal distributions for the latent positions and the cluster-specific random effects $\alpha_g$; a list (dim) containing the model's dimensions (i.e., n, p, G and S); and the model characteristics (i.e., those described above for the 'dmbc_model' class).
'dmbc_fit_list': Contains a unique member, results, that consists of a list of 'dmbc_fit' objects corresponding to the independent MCMC simulated chains.
'dmbc_config': Provides the final estimates: Z.est and Z.sd contain, respectively, the estimates of the latent configurations for the n objects in each cluster and the corresponding posterior standard deviations, cluster provides the estimates of the subjects' cluster memberships, and est indicates the estimate type (either posterior means, medians, or modes, or the estimates corresponding to the maximum likelihood value attained during the simulation); for convenience, general information about the fitted model (n, p, etc.) is also included. Finally, chain is the ID number of the chain used to compute the estimates, and labels is an optional character vector of labels to tag the objects in graphical visualizations.
'dmbc_ic': Provides information about the comparison of DMBC models based on different values of p and/or G, and specifically the results of the computations discussed in Section 3. In particular, the class slots correspond to the different components of the DCIC index (i.e., logprior, logmlik, logcorrfact and DCIC, defined in Equations 14, 15, 17 and 18, respectively), the posterior estimates of the compared models (post.est), the estimate type (est), and a list (res_last_p) of 'dmbc_fit_list' objects with the results of the fitted DMBC models corresponding to the maximum value of p. The last slot should be used as a starting point in case one wants to extend the comparison to models with higher dimensions. Indeed, as discussed above, the DCIC index is computed incrementally with respect to the value of p.
For all these classes, the package provides specific methods for printing, summarizing and plotting their contents, which are illustrated in more detail in the following.

Model fitting
The package's main function is dmbc(), whose primary aim is to set up the problem, verify the coherency and consistency of the inputs, and fill in with default values all the inputs not specified in the function call. In addition, when multiple MCMC chains are simulated, the function dispatches the call to perform the requested type of parallel operation.
More operationally, fitting a DMBC model requires the three steps described below.
Step 1. Specification, at least partial, of a set of tuning parameters controlling the model's fitting procedure:

• nsim: The number of MCMC iterations (after burn-in).
• burnin: The number of burn-in iterations.
• thin: The number of iterations between consecutive draws.
• nchains: The number of MCMC chains to run.
• threads: The number of cores to use in the computations.
• seed: An optional random seed.
• parallel: A string indicating the type of parallel operation to use for running multiple chains simultaneously. Admissible values are "no" (i.e., serial computation, one chain at a time), "multicore" and "snow". Specifically, multicore and snow are the two original packages that have been merged into the parallel package that we use internally (R Core Team 2021). We offer the possibility to choose between "multicore" and "snow" because the first option, which exploits the mclapply() function from the parallel package, is not available on Windows systems.
• z.prop: The variance of the normal proposal distribution for the latent positions (denoted as $\gamma_Z^2$ in Section 2.1).
• alpha.prop: The variance of the normal proposal distribution for the cluster-specific random effects $\alpha_g$ (denoted as $\gamma_\alpha^2$ in Section 2.1).
• random.start: A Boolean value indicating whether the initial cluster membership should be chosen randomly or not (see the next two arguments for alternatives).
• partition: A numeric vector containing the user-defined initial partition.
• method: If the initial partition is not chosen randomly (i.e., random.start = FALSE), a Ward hierarchical clustering algorithm is applied to the observed dissimilarities using the distance measure specified through this argument. The allowed distances are those available in the dist() function, that is either euclidean, maximum, manhattan, canberra, binary or minkowski, with the default set to the Manhattan distance.
• procrustes: A Boolean value indicating whether to post-process the simulated MCMC chains by applying a Procrustes transformation.
• relabel: A Boolean value indicating whether to post-process the simulated MCMC chains by running the relabeling algorithm by Celeux et al. (2000). Note that turning off the relabeling dramatically reduces the computational burden of the whole algorithm.
• store.burnin: A Boolean value indicating whether the burn-in iterations should be stored in the final output.
• verbose: A Boolean value indicating whether information on the progress of the MCMC algorithm should be printed in the R console.
To support the user in setting these parameters, the package includes the dmbc_control() auxiliary function, which can be used to specify all or only a subset of the above parameters; those not explicitly provided are automatically set to their default values. Section 4.3 provides more details about the meaning of the most relevant tuning parameters and the corresponding default values.
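For instance, a control list overriding some of the defaults could be specified as follows (the values shown are purely illustrative):

R> ctrl <- dmbc_control(nsim = 10000, burnin = 5000, thin = 1,
+    nchains = 2, threads = 2, seed = 101, parallel = "snow",
+    z.prop = 1.5, alpha.prop = 0.75)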
Step 2. Choice of the prior hyper-parameters (see Section 2.1) using a list with elements:

• eta: A list of two vectors, a and b, both of size G, containing the hyper-parameters of the inverse gamma distributions of the prior variances of the $z_i^g$'s.
• sigma2: A list with two numeric scalars, a and b, containing the hyper-parameters of the inverse gamma distribution of the prior variance of $\alpha_g$.
• lambda: A vector of size G containing the parameters of the Dirichlet prior distribution of λ.
The package includes an auxiliary function, dmbc_prior(), to assist the user in the proper setting of the prior hyper-parameters, which works similarly to the dmbc_control() seen above.
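For example, for a model with G = 3 clusters, a prior specification mirroring the structure above might look as follows (the numerical values are illustrative; in particular, the defaults for the b elements of eta depend on the initial configurations, as discussed in Section 2.2):

R> prior <- dmbc_prior(eta = list(a = rep(1.5, 3), b = rep(0.75, 3)),
+    sigma2 = list(a = 0.1, b = 0.1), lambda = rep(1, 3))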
Step 3. Call of the dmbc() function. After a preliminary check of the validity of the control and prior arguments passed, the function internally generates the model's parameter starting values using the dmbc_init() function, and it dispatches the computation using the chosen parallel framework. dmbc() internally calls dmbc_fit(), the workhorse function that performs all the computations. In particular, for each chain dmbc_fit() performs the following substeps.
• Sub-step 3a: It runs the MCMC simulation by calling the internal dmbc_mcmc() C routine. • Sub-step 3b: It post-processes the latent position chains by applying a Procrustes transformation as described in Section 2.3. This step is performed using the procrustes() function from the MCMCpack package (Martin, Quinn, and Park 2011). • Sub-step 3c: It relabels all the parameter chains to take care of the label-switching issue. This is implemented internally in the dmbc_relabel() C routine.
Note that sub-steps 3b and 3c are performed only if the procrustes and relabel options in dmbc_control() are both set to TRUE (the default). After these computations, the control returns to the dmbc() function, which may perform an optional final post-processing step (i.e., Procrustes transformation and relabeling) to try to make the results of the different chains more consistent in terms of interpretation. This optional final step is activated by setting the post_all argument of dmbc() to TRUE.
The output returned by dmbc() is an object of class 'dmbc_fit_list'.
To describe the features of the package, we refer to the simulated data stored in the simdiss object, which contains S = 10 binary dissimilarity matrices. For a preliminary exploration of the data, it is possible to use the plot() function, which displays the dissimilarity matrices using a grid of rectangles with colors corresponding to their values (0 or 1):

R> data("simdiss", package = "dmbc")
R> plot(simdiss, colors = c("white", "darkgray"), cex.font = 0.75)

The picture, reported in Figure 1, highlights that some of the subjects share a common pattern of 0s and 1s. This is particularly evident for subjects 4 to 8, whose matrices present a top-right block of 0s.
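A DMBC model with, say, p = 2 and G = 3 can then be fitted with a call like the following, reusing the ctrl and prior objects built in the previous steps (the values of p and G are illustrative here; their selection is discussed in the model selection section below):

R> res <- dmbc(simdiss, p = 2, G = 3, control = ctrl, prior = prior)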

Diagnostics
The standard approach in the estimation of Bayesian models through MCMC is to let the algorithm run for a prespecified number of iterations, to allow the distribution of the sampled values to converge to the "target" distribution, that is, the posterior distribution of the model's parameters. Therefore, a common step in every Bayesian data analysis is to inspect the results of the MCMC simulation to check whether the algorithm did actually converge. Even though an in-depth discussion of these procedures is beyond the scope of this paper, we provide here a brief summary of the most frequently adopted strategies, referring to the classical reference by Gelman and Rubin (1992) and to the recent review by Roy (2020) for a more formal and detailed description.
A very common practice consists in discarding (burn-in) the first part of the simulated chains, to avoid including in the computation of the estimates parameter values belonging to regions of low (posterior) density. In addition, to reduce the autocorrelation in the sampled values, one can thin the sample, retaining only one draw every k iterations.
Based on the theoretical considerations discussed in the references above and on the extensive experimentation we performed during the development of the package, we chose some default values to support the user in setting these critical tuning parameters. More specifically:

• The default for the length of the burn-in period (burnin in our package) was set to 5,000, that is, half of the total number of iterations. This is a commonly adopted strategy. Even so, when convergence is slower, it is advisable to increase this value.
• The default for the number of iterations after burn-in (nsim in the package) was set to 10,000, which has been shown to provide good results in most of the cases we considered.
• For thinning (thin in the package), the default is set to 1, so that no thinning is applied. Indeed, we think that this choice should be left to the user, because it depends on the complexity of the problem and on possible preliminary results.
A further set of parameters that typically has a strong impact on the "goodness" of the sampled values is given by the variances of the proposal distributions from which the algorithm draws new candidate values. On the one hand, too large values of these variances may produce too many rejections, and the algorithm will probably get stuck. On the other hand, too small values will produce too many acceptances, preventing the algorithm from exploring regions of the parameter space with potentially higher density. In our algorithm, there are two such variances: the variance of the proposal distribution for the latent coordinates $z_{ik}^g$, and that for the cluster-specific intercepts $\alpha_g$ (see Section 2.1 for details). The default values for these parameters were set to 1.5 and 0.75, respectively, which worked smoothly in most examples. However, it is very difficult to suggest proposal variance values that work well in every situation, and some trial runs are always advisable (at least to confirm that the default values work fine).
Beyond the setting of the tuning parameters, it is always crucial to monitor the algorithm and to assess whether it converged. The simplest, but probably most powerful, graphical tool in this respect is the trace plot, which reports the parameter values sampled at each iteration. When the trace plot shows a "stationary" series, that is, a sequence of values with no trend and constant variability, one can tentatively conclude that the algorithm achieved convergence.
For the sake of convenience, we linked our package to others developed for MCMC diagnostics. More specifically, the function dmbc_fit_to_mcmc.list() converts the output returned by dmbc() into an object of class 'mcmc.list', which can be inspected using the functions of the coda package (Plummer, Best, Cowles, and Vines 2006), the most popular package for the analysis of MCMC results. Furthermore, our package can produce a variety of graphical visualizations for monitoring the simulated chains. These graphs are generated using functions from the bayesplot package (Gabry, Simpson, Vehtari, Betancourt, and Gelman 2019). In the next sections we provide some examples of the visualizations available with dmbc().
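For instance, assuming res is the 'dmbc_fit_list' object returned by dmbc() in the example above, the chains can be inspected with coda as follows:

R> library("coda")
R> res.coda <- dmbc_fit_to_mcmc.list(res)
R> summary(res.coda)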

Exploration and visualization of results
For a summary of the obtained results, one can apply the summary() function to the object of class 'dmbc_fit_list' returned by dmbc(). The function first converts the output into an 'mcmc.list' object and then applies the corresponding summary() method. The summary method for 'dmbc_fit_list' allows specifying two arguments. The first, include.burnin, controls whether the burn-in iterations (assuming they have been stored in the original call) should be used when computing the summaries. The second, summary.Z, controls the printout of the summary for the latent position estimates (output not shown here).

The model's estimates of the objects' latent positions and of the subjects' cluster memberships can be obtained by calling the dmbc_get_configuration() function, which returns an object of class 'dmbc_config' (see Section 4.1). By default, dmbc_get_configuration() uses the results of the first simulated chain to compute the summaries; it is possible to use another chain by setting the chain argument accordingly (output not shown here). To access the cluster memberships more explicitly, one can use the clusters() generic method, which is imported from the modeltools package (Hothorn, Leisch, and Zeileis 2020):

R> clusters(est)

[1] 1 1 1 2 2 2 2 2 3 3

Note that subjects 4 to 8, who exhibited common dissimilarity patterns (see Figure 1), are correctly clustered together.
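For reference, the est object used above and in the plots below is the 'dmbc_config' returned by dmbc_get_configuration(); it can be produced with a call along the following lines (the argument values shown are illustrative):

R> est <- dmbc_get_configuration(res, chain = 1, est = "mean")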
The package includes a method for plotting the objects in the estimated latent space, possibly identifying the points in the diagram with the labels provided in the call to dmbc_get_configuration(). If not explicitly given, points in the diagram are labeled from 1 to n:

R> set.seed(101) # needed to replicate the label positioning
R> library("bayesplot")
R> color_scheme_set(rep("black", 6))
R> graph <- plot(est, size = 2, size_lbl = 3, label_objects = TRUE)
R> graph + panel_bg(fill = "white", color = NA)

The plot of the latent configuration estimates shows that the subjects clustered in different groups have different perceptions about the similarities/dissimilarities among the objects they evaluated.
The results of a fitted DMBC model can be represented graphically in many different ways. In particular, as already mentioned in Section 4.3, the dmbc package exploits the capabilities of bayesplot, a package built on top of ggplot2 (Wickham 2016) that focuses on the visualization of MCMC simulations. To visualize the results of a fitted DMBC model, one needs to call the plot() function specifying the type of the desired plot in the what argument. All the possibilities offered by bayesplot have been mirrored in dmbc. In Figures 3-5 we provide some examples of the available plot types, which are commonly used to graphically assess the algorithm's convergence:

R> color_scheme_set("brightblue")
R> plot(res, what = "areas_ridges", regex_pars = "lambda",
+    include.burnin = TRUE)
R> color_scheme_set("mix-blue-red")
R> plot(res, what = "trace", regex_pars = "eta", include.burnin = TRUE,
+    transformation = "log",
+    facet_args = list(labeller = ggplot2::label_parsed))
R> color_scheme_set("viridisA")
R> plot(res, what = "pairs", regex_pars = "alpha", diag_fun = "hist",
+    off_diag_fun = "hex", include.burnin = TRUE)

In particular, Figure 3 shows the density plots of the $\lambda_g$ parameters. A unimodal density for the values from an MCMC simulation, like those represented in the figure, typically provides evidence in favor of convergence. Figure 4 reports the trace plots for the log-transformed $\eta_g$ parameters; the logarithmic transformation is particularly useful for variances, which typically exhibit right-skewed distributions. The plots show no trend or change in variability, suggesting that the algorithm reached convergence also for these parameters. Finally, Figure 5 displays the histograms of the simulated $\alpha_g$ values, together with the graphical representation of their pairwise joint distributions. Again, no unexpected patterns are visible in these diagrams, which further corroborates the conclusion that the algorithm converged.

Model selection
The dmbc package provides the dmbc_IC() function, which implements the approach for model selection described in Section 3. Together with the data and the control and prior choices, as for dmbc(), this function also requires specifying pmax, the maximum value of the latent space dimension one wants to consider, and Gmax, the maximum number of cluster solutions to compare. In addition, one may choose the estimate type to use in the computations. By default, the posterior means are used, but other available possibilities are posterior medians or modes, or the estimates corresponding to the maximum likelihood value attained during the simulation. dmbc_IC() returns an object of class 'dmbc_ic' containing all the results, and in particular the values of the DCIC index corresponding to the requested models.
As an illustration, we complete the analysis of the simulated data by comparing the DMBC models with p from 1 to 3 and with G from 1 to 4, with priors set to the default values.
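The corresponding call can be sketched as follows, reusing the control object built earlier (the variable names are ours, and fitting the whole grid of models can take a while):

R> pmax <- 3
R> Gmax <- 4
R> ic <- dmbc_IC(simdiss, pmax = pmax, Gmax = Gmax, control = ctrl)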

The package also includes a convenient procedure for updating the evaluation of p and G by increasing their maximum values. This can be done using the update method for 'dmbc_ic' objects, imported from the stats4 package (R Core Team 2021). For example, the following code adds the DCIC indices for models with p = 4 to those previously obtained, while maintaining the same range of values for G (output not reported here):

R> new.pmax <- pmax + 1
R> new.ic <- update(ic, pmax = new.pmax, Gmax = Gmax)
R> plot(ic, size = c(4, 1.5))

Empirical application
We refer to data presented in Takane, Jung, and Takane (2009), relative to judgments on the similarity among n = 18 animals expressed by S = 20 subjects. Each subject was asked to divide the animals into as many groups as needed, based on their similarity. In their work, Takane et al. (2009) build a summary dissimilarity matrix whose (i, j)-th entry contains the number of subjects who did not place the i-th and the j-th animal in the same group. A standard (nonmetric) MDS with the Euclidean distance was used to analyze the data, and a three-dimensional solution was chosen "primarily for ease of presentation" (Takane et al. 2009, p. 233).
This approach is based on the implicit assumption that the subjects' dissimilarity matrices are homogeneous, so that a consensus matrix effectively summarizes the data. However, one could be interested in investigating whether such an assumption is supported by the data, and in studying in more depth the possible differences in the subjects' perceived dissimilarities, by identifying clusters of subjects with similar perceptions. Therefore, for each subject we build a dissimilarity matrix whose entries signal, for each pair of animals, whether they were placed in the same group (0) or not (1). The resulting binary matrices, available in the animals object within the package, can be visualized as in Figure 7:

R> data("animals", package = "dmbc")
R> plot(animals, colors = c("white", "darkgray"), cex.font = 0.75)

We first apply the procedure to simultaneously select p and G based on the DCIC index. Thus, for each combination of (p, G), with 1 ≤ p ≤ 5 and 1 ≤ G ≤ 5, we apply our algorithm using 20,000 MCMC iterations, with the first 10,000 used for warm-up (the ranges for p and G were chosen based on a preliminary analysis with a reduced number of iterations). The prior and proposal variance parameters are set as illustrated in Section 2.1.
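A sketch of the corresponding calls is shown below (the control values follow the setup just described, while the seed is an arbitrary choice):

R> ctrl.an <- dmbc_control(nsim = 10000, burnin = 10000, seed = 101)
R> ic.an <- dmbc_IC(animals, pmax = 5, Gmax = 5, control = ctrl.an)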

Conclusion
In this work, we introduced dmbc, an R package that extends to three-way binary data the Bayesian multidimensional scaling approach by Oh and Raftery (2007). Our procedure provides a partition of the set of dissimilarity matrices into G groups, and simultaneously returns a p-dimensional MDS configuration for each cluster. Embedding the problem in a Bayesian framework has the advantage of allowing the development of a Bayesian criterion to jointly select G and p. Furthermore, it is possible to obtain credibility intervals for all the estimated parameters as well as for the objects' coordinates in the different clusters' configurations.
Our procedure is flexible and includes RMDS and WMDS as particular cases. If the configuration underlying the identified clusters is unique and the individual differences are only due to a different system of weights, then the cluster-specific configurations will be similar except for cluster-specific scaling factors. Clearly, a model based on a unique configuration with a possibly different system of weights is easier to interpret than one producing many configurations. Even so, the comparison of the configurations across the clusters can be pursued using a Procrustean analysis (see for example Gower and Dijksterhuis 2004), rotating the estimated configurations to maximum similarity with a reference configuration.
Even if our focus here is very specific, a nice feature of our model-based three-way MDS is its versatility. Indeed, the model and the package can be extended along different directions. First, our approach can be adapted to deal with multinomial and numerical (i.e., non-binary) dissimilarity data by suitably modifying the assumption (Equation 3) on the dissimilarity distribution. Second, it can be enriched by introducing subject-level covariates, such as sociodemographic characteristics, assuming that they influence either the cluster membership, or the linear predictor (Equation 1), or both. Third, it is possible to allow for asymmetric dissimilarity matrices. All these ideas are currently under development.
One limitation of the model regards the assumption that the cluster-specific latent spaces have the same dimension, p. This might lead to overestimating the latent space dimension for some clusters. A simple way to overcome this issue is to apply principal components analysis to the estimated cluster-specific configurations, to assess whether one or more clusters present redundant dimensions. Although one could envision an extension of the model that allows for a different number of dimensions in the configuration spaces, this would come at the cost of a more complicated procedure for selecting the optimal value of p in the different clusters.
In the MCMC algorithm presented in the paper we use normal proposal densities. This choice is motivated by the difficulty of identifying suitable distributions in a complex multivariate setting like the one analyzed here. An alternative strategy, widely discussed in the literature, is the surrogate proposal approach, used for example in Gormley and Murphy (2010), which approximates a target distribution by simplifying its terms through a quadratic multivariate Taylor expansion. This allows constructing proposal distributions that better approximate problematic full conditional distributions from which sampling is required, and lets the Metropolis-Hastings algorithm explore the parameter space more efficiently.
A further limitation of our proposal concerns the DCIC criterion. For given values of p and G, the computation of the DCIC requires the estimation of all models with dimensions from 1 to (p − 1), which does not permit focusing on a restricted set of (hypothesized) promising dimensions. The estimation of the model over a wide range of dimensionalities can therefore be very time consuming. One possibility to overcome this issue is to adapt to our context the approach introduced by Oh (2012) for choosing the MDS dimensionality in the case of a single dissimilarity matrix. In this procedure, each factor in an MDS configuration is assigned a prior that is a mixture of a point mass at 0 and a continuous distribution over the rest of the parameter space. This prior specification leads to a marginal posterior distribution that is a mixture of the same form as the prior, where the mixing weight of the continuous distribution can be regarded as a measure of significance for the factor itself. Hence, in such a framework the optimal dimension of the configuration coincides with the number of factors having high posterior probabilities of being non-zero.
Our methodology does not account for uncertainty in the number of clusters G and the latent space dimension p, which are both assumed to be unknown but fixed quantities. This clearly represents a further direction of research, which can be carried out using a trans-dimensional MCMC approach, such as reversible jump sampling (Green 1995; Richardson and Green 1997; Brooks, Giudici, and Roberts 2003), in which the dimension of the parameter space can change from one iteration to the next. Alternatively, one can achieve a higher degree of flexibility by adopting a nonparametric perspective, for example using Dirichlet process mixtures (see for example Ferguson 1973; Escobar and West 1995; MacEachern and Müller 1998; Kyung, Gill, and Casella 2010; Müller, Quintana, Jara, and Hanson 2015, which provides a comprehensive survey of these models).
Still with reference to the selection of the best model, we note that, up to a certain extent, the DCIC index tends to favor an increase in the number of clusters over an increase in the number of dimensions of the MDS configuration. As a consequence, when the sample contains a large number of subjects with very peculiar perceptions, one may get a DCIC profile showing a very slow monotonic decay with respect to the number of clusters. In these cases, model selection becomes an issue, since one may find a "global" optimum only by selecting a very high value for G. When this occurs, we suggest adopting a more heuristic procedure that mimics the strategy used to identify the dimension of factorial configurations through the well-known scree plot. Thus, in these situations we suggest selecting as a reasonable value of G the point where the slope of the DCIC curve levels off, showing the classical "elbow".
As a final remark, regarding the computational efficiency of the implementation, we note that the algorithm scales well with respect to the number of subjects (S) and the number of clusters (G), while it tends to slow down as the number of objects (n) increases.