kamila: Clustering Mixed-Type Data in R and Hadoop

In this paper we discuss the challenge of equitably combining continuous (quantitative) and categorical (qualitative) variables for the purpose of cluster analysis. Existing techniques require strong parametric assumptions or difficult-to-specify tuning parameters. We describe the kamila package, which includes a weighted k-means approach to clustering mixed-type data, a method for estimating weights for mixed-type data (Modha-Spangler weighting), and an additional semiparametric method recently proposed in the literature (KAMILA). We include a discussion of strategies for estimating the number of clusters in the data, and describe the implementation of one such method in the current R package. Background and usage of these clustering methods are presented. We then show how the KAMILA algorithm can be adapted to a map-reduce framework, and implement the resulting algorithm using Hadoop for clustering very large mixed-type data sets.


Introduction
Cluster analysis, also referred to as unsupervised learning, comprises a set of techniques that seek to identify structure in a data set without reference to a known set of labeled "training" data vectors (Hastie, Tibshirani, and Friedman 2009). In this paper we introduce software for clustering data consisting of mixed continuous and categorical variables. Although there are various existing approaches for clustering mixed-type data, they suffer from significant drawbacks including loss of information due to discretization, arbitrary weighting of continuous versus categorical variables (e.g., as in dummy coding or the similarity metric of Gower 1971), or strong parametric assumptions (e.g., as in parametric mixture models). We review these existing approaches in Section 2, and in Section 3 discuss an often overlooked technique introduced by Modha and Spangler (2003) for estimating appropriate weights for combining continuous and categorical variables. In Section 4 we describe the KAMILA (k-means for mixed large data) algorithm and underlying data model, and in Section 5 we introduce the R (R Core Team 2017) package kamila (Foss and Markatou 2018) implementing a weighted k-means algorithm, Modha-Spangler weighting, and the KAMILA algorithm for clustering mixed-type data. Package kamila is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=kamila. In Section 6 we introduce software for implementing KAMILA on a distributed file system using Hadoop (Apache Software Foundation 2016a). Section 7 offers conclusions, while Appendix A discusses the Hadoop implementation of the KAMILA algorithm.

Related work
Clustering techniques can generally be divided into hierarchical methods and partitioning methods. Hierarchical methods involve constructing a series of nested partitions of the data set; these can either involve iteratively splitting a cluster from the previous step (divisive techniques) or iteratively merging two clusters from the previous step (agglomerative techniques). The diana function in R package cluster (Maechler, Rousseeuw, Struyf, Hubert, and Hornik 2015) implements divisive clustering. The function agnes in package cluster (Maechler et al. 2015) and the function hclust implemented in each of the packages stats (R Core Team 2017), fastcluster (Müllner 2013), and flashClust (Langfelder and Horvath 2012) are commonly used functions for agglomerative clustering. A compromise between agglomerative and divisive clustering is developed by Chipman and Tibshirani (2006), and implemented in R package hybridHclust of Chipman, Tibshirani, and Hastie (2015). Partitioning methods generally involve a single split of the data set into mutually exclusive and exhaustive clusters. Well-known examples are the k-means algorithms (Hartigan and Wong 1979; MacQueen 1967) and alternatives such as the k-medoids algorithms, e.g., the PAM (partitioning around medoids) algorithm of Kaufman and Rousseeuw (1990). The Hartigan-Wong, Lloyd, and MacQueen k-means algorithms are implemented by the kmeans function in the stats package (R Core Team 2017). Kernel k-means uses kernels to project the data into a non-linear feature space before applying k-means, and is implemented in the kkmeans function in the kernlab package (Karatzoglou, Smola, Hornik, and Zeileis 2004). The function kcca in package flexclust (Leisch 2006) implements generalizations of k-means using arbitrary centroid statistics and distance measures.
Trimmed k-means, which trims a specified proportion of extreme values in each cluster, is implemented in packages trimcluster (Hennig 2012) and tclust (Fritz, García-Escudero, and Mayo-Iscar 2012). The pam function in package cluster (Maechler et al. 2015) implements k-medoids. The clara function in the same package implements a computationally efficient version of k-medoids in which the partitioning steps are executed on sub-sampled data and then applied to the entire data set. This efficient algorithmic structure makes clara well suited for large data sets that can be stored in RAM. At the end of this section we discuss existing software for data sets too large to store on a single compute node. The finite mixture model (e.g., Lindsay 1995; McLachlan and Basford 1988; McLachlan and Peel 2000; Titterington, Smith, and Makov 1985) is another partitioning method in which each cluster is assumed to follow some parametric distribution, the parameters of which are then typically estimated using the EM (expectation-maximization) algorithm (Dempster, Laird, and Rubin 1977). Various finite mixture models are implemented in R. Finite Gaussian mixture models can be fit using the packages EMCluster (Chen and Maitra 2015), mclust (Fraley, Raftery, Murphy, and Scrucca 2012), and mixture (Browne, ElSherbiny, and McNicholas 2015); multinomial or gamma mixtures can be fit using package mixtools (Benaglia, Chauveau, Hunter, and Young 2009), and mixtures of skew-normals can be fit using package mixsmsn (Prates, Cabral, and Lachos 2013). For an overview of packages available on CRAN for cluster analysis and mixture models see the corresponding CRAN Task View (Leisch and Grün 2017).
Although there is an abundance of clustering techniques designed for a single data type, fewer exist for handling mixed-type data. In fact, many of the most common approaches to clustering mixed-type data involve imperfect usages of techniques designed for a single data type. One common strategy is to first dummy code the categorical variables, and then apply some technique designed for continuous data, such as k-means. For a categorical variable $X$ with $\ell$ distinct categorical levels, the dummy coding step generally involves creating $\ell$ distinct 0-$c$ indicator variables, one for each of the $\ell$ levels, where 0 indicates absence and $c \in \mathbb{R}$ denotes presence of that particular level in $X$. A problem arises in the selection of $c$; an excessively large $c$ will overemphasize the categorical variables over the continuous variables in the final clusters, whereas an excessively small $c$ will under-emphasize the categorical variables. As demonstrated in Foss, Markatou, Ray, and Heching (2016), the selection of $c$ is a difficult problem without a clear solution; even in very simple mixed-type data sets the ideal choice of $c$ can fluctuate based on the number of variables used in the analysis and the degree of separation between the underlying clusters in the data. The common approach of arbitrarily setting $c = 1$ is by no means an acceptable solution to the mixed-type data clustering problem. Another common approach of standardizing all variables (in particular, each dummy-coded variable) to unit variance is also inadequate; it amounts to an equally arbitrary choice of $c_j \approx 1/\sqrt{p_j(1 - p_j)}$, where $p_j$ is the frequency of observations with that categorical level present in the $j$-th dummy-coded variable. This has the effect of up-weighting categories the further they deviate from $p_j = 0.5$.
While it is trivial to imagine an application where this could be appropriate, it is equally trivial to imagine scenarios where the opposite effect is desired (i.e., two near-majority categories of interest, with a small proportion of rare categories of minimal interest). See Section 5.1 for an illustration of the difficulty in selecting appropriate weights.
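The sensitivity to the dummy-coding constant $c$ can be illustrated with a short simulation. The following sketch uses base R only; the data and the helper cluster_with_c are illustrative and not part of the kamila package.

```r
# Sketch: how the dummy-coding constant c shifts k-means results on a toy
# two-cluster data set with weak continuous but strong categorical separation.
# All data and helper names here are illustrative, not from the kamila package.
set.seed(1)
n <- 100
grp    <- rep(1:2, each = n / 2)
conVar <- rnorm(n, mean = c(0, 0.3)[grp])             # weak continuous signal
catVar <- ifelse(runif(n) < c(0.9, 0.1)[grp], 1, 0)   # strong categorical signal

cluster_with_c <- function(c_val) {
  dat <- cbind(conVar, c_val * catVar)   # 0-c coding of the categorical variable
  kmeans(dat, centers = 2, nstart = 20)$cluster
}

# Cross-tabulate recovered clusters against the true groups for two choices of c:
# a small c effectively ignores the categorical variable; a large c lets it dominate
table(smallC = cluster_with_c(0.1), truth = grp)
table(largeC = cluster_with_c(5),   truth = grp)
```

With $c = 5$ the partition is driven almost entirely by the categorical variable, while with $c = 0.1$ it is driven by the weakly separated continuous variable; neither extreme is correct in general, which is precisely the dilemma described above.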
A second common strategy for clustering mixed-type data is to use a distance metric compatible with mixed data, such as Gower's distance (Gower 1971), and then use a clustering method that depends only on the distances between the data points (e.g., agnes, pam, or various other functions in the cluster package). However, each variable in Gower's distance must be assigned a user-specified weight determining its relative contribution to the distance, which presents essentially the same dilemma as the choice of c in the dummy coding approach above. A poor choice of weights will result in certain variables being over- or under-emphasized, and in particular, it is unclear how to properly weight the continuous variables relative to the categorical variables for a given data set. In Table 2 of Foss et al. (2016), a range of weighting strategies for Gower's distance and dummy coding were compared in a simulation study using three variables (one continuous, two categorical). The continuous weight was fixed at 1.0 while the categorical weights spanned a broad range of values. Two models were used to generate data: one in which the continuous variable contained more useful information regarding cluster structure, and one in which the categorical variables contained more useful information. Although the top performing method in this simulation (KAMILA, which requires no choice of weights, as described in Section 4) achieved adjusted Rand index (ARI; Hubert and Arabie 1985) scores of 0.99 in both conditions, weighting strategies for Gower's distance and dummy coding could only perform well in one condition at a time. For example, the best score for Gower's distance in condition 1 (ARI = 0.984) used categorical weights of 0.1, but could only achieve an ARI of 0.688 in condition 2. The best score for Gower's distance in condition 2 (ARI = 0.849) used categorical weights of 0.4, but could only achieve an ARI of 0.872 in condition 1. Similar trade-offs were apparent for dummy coding.
We note that the optimal weight can fluctuate unpredictably based on features such as the number of variables and the number of categorical levels; the best weights mentioned above do not generalize to other data conditions. Numerous other existing strategies for clustering mixed-type data involve this same problem of requiring a user-specified weight determining the relative contribution of continuous versus categorical variables. The k-prototypes method of Huang (1998) uses a distance metric equal to the squared Euclidean distance between the continuous components plus the user-specified constant γ times the matching distance between the categorical components (Huang 1998, p. 291, Equation 9). The package clustMixType (Szepannek 2016) implements k-prototypes clustering in R. In Azzalini and Menardi (2014) (with associated package pdfCluster), a novel density-based clustering scheme is proposed for continuous data, and a possible extension to mixed-type data using a distance such as Gower's distance is mentioned briefly. The method of Friedman and Meulman (2004) uses a Gower-like distance metric, and introduces an interesting strategy for selecting variable weights, but explicitly relies on the assumption that setting each weight to 1/p is equivalent to giving the same influence to all attributes regardless of data type (here p denotes the number of variables of any type). In Hennig and Liao (2013), a weighting scheme is proposed for mixed-type data that takes into account the number of levels in each categorical variable (and optionally the number of observations in each categorical level). The method does not, however, take into account other crucial aspects of the data that influence the adequate choice of weights, such as the number and quality of the variables; as a result, it performs worse than other methods for mixed-type data (see Foss et al. 2016 for more extensive discussion and examples). As described above and discussed in Foss et al.
(2016), the proper choice of weights fluctuates in a data-dependent manner, rendering any fixed choice of weights unable to balance continuous and categorical components in any general sense.
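The weight sensitivity of Gower's distance can be seen directly with the cluster package functions daisy and pam mentioned above. The data below are simulated purely for illustration, and the two weight vectors echo the 0.1 versus 0.4 categorical weights discussed above.

```r
# Sketch: user-specified variable weights in Gower's distance via cluster::daisy,
# followed by PAM on the resulting dissimilarities. Simulated data, illustrative only.
library(cluster)
set.seed(2)
n <- 60
dat <- data.frame(
  x1 = rnorm(n),                                          # continuous
  x2 = factor(sample(letters[1:3], n, replace = TRUE)),   # categorical
  x3 = factor(sample(letters[1:2], n, replace = TRUE))    # categorical
)

# Gower dissimilarities under two weightings of (x1, x2, x3)
d_conHeavy <- daisy(dat, metric = "gower", weights = c(1.0, 0.1, 0.1))
d_catHeavy <- daisy(dat, metric = "gower", weights = c(1.0, 0.4, 0.4))

# The induced two-cluster PAM partitions can differ substantially
cl1 <- pam(d_conHeavy, k = 2, diss = TRUE)$clustering
cl2 <- pam(d_catHeavy, k = 2, diss = TRUE)$clustering
table(conHeavy = cl1, catHeavy = cl2)
```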
The clustering technique of Modha and Spangler (2003) offers a potential solution to this dilemma, and is described in Section 3. The algorithm is structured similarly to k-prototypes, using a weighted combination of squared Euclidean distance and cosine distance for the continuous and categorical variables, respectively. The weighting between continuous and categorical variables is identified through a brute-force search. Modha-Spangler weighting performs well in various conditions. However, Figure 4 in Foss et al. (2016) depicts simulation results suggesting that it performs poorly relative to a competing method (KAMILA clustering, described in Section 4) when clusters are poorly separated in the continuous variables but well separated in the categorical variables. Furthermore, unlike finite mixture models, this method is unable to adjust the relative contribution of individual variables of the same type. To our knowledge, the method of Modha and Spangler has not previously been implemented in R, nor in any other statistical software package.
Finite mixture modeling is another technique that does not require user-specified weights for the continuous versus categorical variable contribution. For mixed-type data, a popular model is the joint normal-multinomial mixture model (Hunt and Jorgensen 2011). Finite mixture models are often able to achieve a favorable balance between continuous and categorical variables when their parametric assumptions are reasonably accurate; however, they often perform poorly when those assumptions are strongly violated. For examples of this behavior, see Foss et al. (2016), in which the performance of mixture models is studied in a series of Monte Carlo studies that systematically vary the severity of parametric violations. Specifically, Foss et al. (2016) show that models relying on the normality assumption are particularly vulnerable to skewed and heavy-tailed cluster distributions. Normal-multinomial mixture models are implemented in the R packages clustMD (McParland 2015), fpc (Hennig 2015a), and Rmixmod (Langrognet, Lebret, Poli, and Iovleff 2016).
In a manner similar to finite mixture models, the KAMILA method of Foss et al. (2016) is able to achieve a favorable balance between continuous and categorical variables without requiring user-specified weights. In order to decrease the susceptibility to violations of parametric assumptions, the continuous components are modeled using a general class of elliptical distributions. Categorical variables are modeled as mixtures of multinomial random variables, thus not requiring dummy coding. Unlike the method of Modha and Spangler (2003), KAMILA does not require a brute-force search to identify an appropriate balance between continuous and categorical variables. We describe KAMILA in detail in Section 4.
Existing clustering software for very large data sets relies heavily on methods designed for continuous data only, and on k-means clustering in particular; it is thus vulnerable to the drawbacks enumerated above when used with mixed-type data. See the results of the simulation in Section 5.1 for an illustration of the limitations of using k-means for mixed-type data, and see Foss et al. (2016) for additional results and discussion of the limitations. The Mahout project (Apache Software Foundation 2016c) implements k-means, fuzzy k-means, and streaming k-means, as well as a spectral clustering algorithm that involves running k-means on eigenvectors of the graph Laplacian of the original data. The In-Memory Statistics for Hadoop Suite (SAS Corporation 2016) implements both k-means and DBSCAN, a density-based clustering technique that likewise requires a suitable distance metric. The Microsoft Azure Machine Learning Studio (Microsoft Corporation 2016) and the ScaleR package (Revolution Analytics 2016b) both rely on the standard k-means algorithm for clustering data sets. The R package pmclust, built on the pbdMPI framework (Ostrouchov, Chen, Schmidt, and Patel 2012), implements Gaussian mixture modeling for continuous variables on large computing clusters using the single program/multiple data (SPMD) programming style. In Section 6 we introduce an alternative clustering strategy for large mixed-type data sets that implements the KAMILA clustering algorithm in Hadoop streaming mode. In addition to achieving a superior balance between continuous and categorical variables compared to standard k-means approaches, our method can drastically reduce storage and computational requirements by not requiring dummy coding of variables. For example, the airline data set analyzed in Section 6 would require over 25 times the number of variables if its categorical variables were dummy coded.

Modha-Spangler clustering
The work of Modha and Spangler (2003) presents a general framework for clustering data sets with multiple data types, although it focuses primarily on the special case of mixed continuous and categorical variables. Let $x_i = (F_{(i,1)}, F_{(i,2)}, \ldots, F_{(i,m)})$ denote the $i$-th data vector with $m$ different data types and $i = 1, 2, \ldots, N$. Modha and Spangler (2003) define a weighted distortion measure for mixed-type data as
$$D_\alpha(x_1, x_2) = \sum_{\ell=1}^{m} \alpha_\ell \, D_\ell(F_{(1,\ell)}, F_{(2,\ell)}),$$
where each of the $m$ data types is assigned its own distance metric $D_\ell$ and a corresponding weight in $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m)$, with $\alpha_\ell \geq 0$ and $\sum_{\ell=1}^{m} \alpha_\ell = 1$. For a fixed weighting $\alpha$, the proposed algorithm in Modha and Spangler (2003) seeks to partition the data vectors into the $k$ partitions $\{\pi_u^*\}_{u=1}^{k}$ minimizing the total distortion
$$\sum_{u=1}^{k} \sum_{x \in \pi_u} D_\alpha(x, c_u),$$
where $c_u$ denotes the centroid corresponding to the $u$-th cluster, whose definition depends on the particular choices of distance metrics $D_\ell$ for each data type. Modha and Spangler (2003) use a method analogous to the k-means algorithm to approximate the optimal partition $\{\pi_u^*\}_{u=1}^{k}$. In order to select the optimal weights $\alpha^*$, Modha and Spangler (2003) define the average within-cluster distortion for the $\ell$-th data type as
$$W_\ell(\alpha) = \sum_{u=1}^{k} \sum_{x \in \pi_u^*(\alpha)} D_\ell(F_\ell, c^*_{(u,\ell)}(\alpha)),$$
and the average between-cluster distortion as
$$B_\ell(\alpha) = \sum_{i=1}^{N} D_\ell(F_{(i,\ell)}, \bar{c}_\ell) - W_\ell(\alpha),$$
where $\bar{c}_\ell$ denotes the centroid of the $\ell$-th data type taken across all $N$ data vectors.
Finally, assuming no missing data, the method of Modha and Spangler (2003) aims to identify the weighting vector $\alpha^*$ such that
$$\alpha^* = \arg\min_{\alpha} \prod_{\ell=1}^{m} \frac{W_\ell(\alpha)}{B_\ell(\alpha)}.$$
The weighting is identified through a brute-force search: the algorithm is run repeatedly for a grid of the possible weightings $\alpha$, with selection as described above. We focus primarily on the case where $m = 2$, with the first data type consisting of continuous variables and the second of 0-1 dummy-coded categorical variables. In this case, Modha and Spangler (2003) use squared Euclidean distance for $D_1$ and cosine distance for $D_2$.
In the current R package we provide a flexible implementation of the Modha-Spangler method in which the underlying clustering technique can be specified by the user. We delegate the issues of initialization and stopping rules to the underlying clustering technique. We leave for future investigation the interesting question of how optimal settings for initialization and stopping rules change when moving from a simple application of a clustering method to the Modha-Spangler approach. The granularity of the brute-force search over $\alpha_1$ can be specified by the user, with a default of ten equal subdivisions of $[0, 1]$.
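The structure of the brute-force search can be sketched in a few lines of base R. For brevity, this sketch uses squared Euclidean distance for both data types (the original method uses cosine distance for the dummy-coded categorical part), so it illustrates the form of the search rather than reproducing the package's gmsClust function.

```r
# Sketch of the Modha-Spangler brute-force search over the continuous weight
# alpha, using k-means as the base clusterer and squared Euclidean distance
# for both data types (a simplification of the original method). Illustrative only.
set.seed(3)
n    <- 100
grp  <- rep(1:2, each = n / 2)
conD <- cbind(rnorm(n, mean = c(0, 2)[grp]))
catD <- cbind(ifelse(runif(n) < c(0.8, 0.2)[grp], 1, 0))  # dummy-coded

# Within-to-between distortion ratio W_l / B_l for one data type
wb_ratio <- function(x, cl) {
  w <- sum(sapply(unique(cl), function(u)
    sum(scale(x[cl == u, , drop = FALSE], scale = FALSE)^2)))
  tot <- sum(scale(x, scale = FALSE)^2)
  w / max(tot - w, .Machine$double.eps)
}

alphas <- seq(0.1, 0.9, by = 0.1)   # grid over the continuous weight
objs <- sapply(alphas, function(a) {
  # scale the variables so squared distances are weighted by a and (1 - a)
  cl <- kmeans(cbind(sqrt(a) * conD, sqrt(1 - a) * catD),
               centers = 2, nstart = 10)$cluster
  wb_ratio(conD, cl) * wb_ratio(catD, cl)   # product of per-type ratios
})
alphas[which.min(objs)]   # selected continuous weight
```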

Model
As described in Foss et al. (2016), we assume that our data set consists of $N$ independent and identically distributed observations of a $(P + Q)$-dimensional vector of random variables $(V, W)$ that follow a finite mixture distribution with $G$ components, where $V$ is a $P$-dimensional vector of continuous random variables and $W$ is a vector of $Q$ categorical random variables, and where the $q$-th element of $W$ has $L_q$ categorical levels denoted $1, 2, \ldots, L_q$, with $q = 1, 2, \ldots, Q$. The vectors $V$ and $W$ may be dependent, but under the local independence assumption (e.g., see Hennig and Liao 2013; Hunt and Jorgensen 2011), $V$ and $W$ are independent within any particular cluster. Given membership in the $g$-th cluster, we model $V$ as a vector following a finite mixture of elliptical distributions with individual component density functions $f_{V,g}(v; \mu_g, \Sigma_g)$, where $g$ indexes cluster membership, $\mu_g$ denotes the $g$-th centroid, and $\Sigma_g$ the $g$-th scaling matrix. The specific form of the density function $f_{V,g}$ is described below. Given membership in the $g$-th cluster, we model $W$ as a vector following a finite mixture of multinomials with individual component probability mass functions $f_{W,g}(w) = \prod_{q=1}^{Q} m(w_q; \theta_{gq})$, where $m(\cdot\,;\cdot)$ is the multinomial probability mass function, and $\theta_{gq}$ is the multinomial parameter vector for the $g$-th component of the $q$-th categorical variable. Given membership in the $g$-th cluster, under the local independence assumption, the joint density of $(V, W)$ is
$$f_{V,W,g}(v, w) = f_{V,g}(v; \mu_g, \Sigma_g) \prod_{q=1}^{Q} m(w_q; \theta_{gq}),$$
with the overall density unconditional on cluster membership given by
$$f_{V,W}(v, w) = \sum_{g=1}^{G} \pi_g \, f_{V,g}(v; \mu_g, \Sigma_g) \prod_{q=1}^{Q} m(w_q; \theta_{gq}), \tag{1}$$
where $\pi_g$ denotes the prior probability of observing the $g$-th cluster.

Radial kernel density estimation
Proposition 2 in Foss et al. (2016) states that if $X \in \mathbb{R}^p$ follows a spherically symmetric distribution with center $\mu$, then
$$f_X(x) = \frac{f_R(r)\,\Gamma(p/2)}{2\pi^{p/2}\,r^{p-1}}, \tag{2}$$
where $r = \sqrt{(x - \mu)^\top (x - \mu)}$, $R = \sqrt{(X - \mu)^\top (X - \mu)}$, and $f_R$ is the probability density of $R$. We proceed to construct $\hat{f}_R$ using a (univariate) kernel density estimation scheme, which is then substituted into (2) in place of $f_R$. Note that $X$ corresponds to the vector $V$ above within a particular cluster, and that using a scaling matrix $\Sigma_g$ this result can be extended to elliptical distributions. The kamila function currently uses $\Sigma_g$ equal to the identity matrix; future work will investigate how best to extend KAMILA clustering to allow for more flexible specifications, such as $\Sigma_g = \Sigma_{g'}$ for all $g, g'$, i.e., all scaling matrices equal across clusters, and $\Sigma_g \neq \Sigma_{g'}$ for $g \neq g'$.
This univariate kernel density estimation scheme avoids the drawbacks of a multivariate kernel density estimator: namely that multivariate kernel density estimation is computationally expensive, and tends to over-fit data points (Scott 1992, Chapter 7).
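The construction above can be sketched in base R: estimate $f_R$ with a univariate KDE of the observed radii, then recover the $p$-dimensional density through the relation in Proposition 2. The code below is a self-contained illustration (with the scaling matrix fixed to the identity, as in the package), not the package's internal implementation.

```r
# Sketch of the radial density estimator: a univariate KDE of the radii,
# converted to a p-dimensional spherical density via
# f_V(v) = f_R(r) * gamma(p/2) / (2 * pi^(p/2) * r^(p-1)). Illustrative only.
set.seed(4)
p <- 3
X <- matrix(rnorm(300 * p), ncol = p)   # spherical sample centered at 0
r <- sqrt(rowSums(X^2))                 # observed radii

# Silverman's rule-of-thumb bandwidth, as cited in the text
h  <- 0.9 * min(sd(r), IQR(r) / 1.34) * length(r)^(-1/5)
fR <- density(r, bw = h)                # univariate KDE of the radii

# Estimated p-dimensional density at a point v
fV_hat <- function(v) {
  rv <- sqrt(sum(v^2))
  fr <- approx(fR$x, fR$y, xout = rv, rule = 2)$y
  fr * gamma(p / 2) / (2 * pi^(p / 2) * rv^(p - 1))
}
# For this standard normal sample, the true density at (1, 0, 0) is
# (2 * pi)^(-3/2) * exp(-1/2); fV_hat should be reasonably close
fV_hat(c(1, 0, 0))
```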

Algorithm description
KAMILA proceeds by estimating the unknown parameters of (1) via an iterative process similar to the EM algorithm, as shown in Algorithm 1 of Foss et al. (2016). At the $t$-th iteration of the algorithm, let $\hat{\mu}_g^{(t)}$ denote the estimator of the centroid of population $g$, and let $\hat{\theta}_{gq}^{(t)}$ denote the estimator of the parameters of the multinomial distribution corresponding to the $q$-th discrete random variable drawn from population $g$.
These parameter estimates can be initialized through random uniform draws, draws from the observed data points, or with initial centroids selected through some other mechanism (e.g., a preliminary clustering round). Using centroids identified through previous rounds of clustering may be appropriate in certain circumstances, but in general, replacing a small subproblem (i.e., initialization) with the entire original problem (clustering) seems needlessly complex. Any additional computational resources may be better spent on extending the number of initializations and iterations of KAMILA rather than on an entirely separate clustering algorithm. We have found that initializing $\hat{\mu}_g^{(0)}$ for each $g = 1, 2, \ldots, G$ with random draws from the observed continuous data vectors appears to offer a modest advantage over random draws from a uniform distribution with marginal ranges equal to the sample ranges of the data. Initializing $\hat{\theta}_{gq}^{(0)}$ for each $g$ and each $q = 1, 2, \ldots, Q$ is done with a draw from a Dirichlet distribution (Kotz, Balakrishnan, and Johnson 2000) with shape parameters all equal to one, i.e., a uniform draw from the simplex in $\mathbb{R}^{L_q}$.
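The two default initialization draws can be sketched as follows; the Dirichlet(1, ..., 1) draw is obtained by normalizing i.i.d. standard gamma variates. All object names are illustrative.

```r
# Sketch of the initialization step: centroids drawn from the observed data,
# and multinomial parameters drawn uniformly from the simplex. Illustrative only.
set.seed(5)
G <- 3                                      # number of clusters
conData   <- matrix(rnorm(200), ncol = 2)   # toy continuous data
catLevels <- 4                              # levels L_q of one categorical variable

# Continuous centroids: random draws from the observed continuous data vectors
mu0 <- conData[sample(nrow(conData), G), , drop = FALSE]

# Categorical parameters: Dirichlet(1, ..., 1), i.e., a uniform draw from the
# simplex in R^{L_q}, via normalized i.i.d. gamma draws
rdirichlet1 <- function(L) { g <- rgamma(L, shape = 1); g / sum(g) }
theta0 <- t(replicate(G, rdirichlet1(catLevels)))
rowSums(theta0)   # each row is a probability vector summing to 1
```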
The estimation procedure proceeds iteratively, with each iteration consisting of two broad steps: the partition and the estimation step. The partition step assigns each observation to a cluster, and the estimation step re-estimates the parameters of interest using the new cluster memberships.
The minimum distance is then calculated for the $i$-th observation as $r_i^{(t)} = \min_g (d_{ig}^{(t)})$, where $d_{ig}^{(t)}$ denotes the Euclidean distance between the $i$-th observation and the $g$-th centroid $\hat{\mu}_g^{(t)}$. The kernel density estimate of the minimum distances is constructed as
$$\hat{f}_R^{(t)}(r) = \frac{1}{N h^{(t)}} \sum_{i=1}^{N} k\!\left(\frac{r - r_i^{(t)}}{h^{(t)}}\right), \tag{3}$$
where $k(\cdot)$ is a kernel function and $h^{(t)}$ is the corresponding bandwidth at iteration $t$. We currently use the Gaussian kernel, with bandwidth $h = 0.9\,A\,N^{-1/5}$, where $A = \min(\hat{\sigma}, \hat{q}/1.34)$, $\hat{\sigma}$ is the sample standard deviation, and $\hat{q}$ is the sample interquartile range (Silverman 1986, p. 48, Equation 3.31). The function $\hat{f}_R^{(t)}$ is used to construct $\hat{f}_V^{(t)}$ as shown in Section 4.2. We assume that the $Q$ categorical variables are independent within a given population $g$ (i.e., local independence), and calculate the probability of observing the $i$-th vector of categorical variables given population membership as $c_{ig}^{(t)} = \prod_{q=1}^{Q} m(w_{iq}; \hat{\theta}_{gq}^{(t)})$, where $m(\cdot\,;\cdot)$ is the multinomial probability mass function.
We assign the $i$-th object to the population $g$ that maximizes the function
$$H_i^{(t)}(g) = \log\!\left[\hat{f}_V^{(t)}(d_{ig}^{(t)})\right] + \log\!\left[c_{ig}^{(t)}\right]. \tag{4}$$
In each iteration $t$, the latest partition of the $N$ observations is used to calculate $\hat{\mu}_{gp}^{(t+1)}$ and $\hat{\theta}_{gq}^{(t+1)}$ for all $g$, $p$, and $q$. If $\Omega_g^{(t)}$ denotes the set of indices of observations assigned to population $g$ at iteration $t$, we can then calculate the parameter estimates by
$$\hat{\mu}_{gp}^{(t+1)} = \frac{1}{|\Omega_g^{(t)}|} \sum_{i \in \Omega_g^{(t)}} v_{ip}, \qquad \hat{\theta}_{gq\ell}^{(t+1)} = \frac{1}{|\Omega_g^{(t)}|} \sum_{i \in \Omega_g^{(t)}} I\{w_{iq} = \ell\},$$
where $I\{\cdot\}$ denotes the indicator function and $|A|$ denotes the cardinality of the set $A$.
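One partition/estimation iteration can be sketched in base R as follows, with the identity scaling matrix and a single categorical variable. All object names are illustrative; this is not the package's implementation, which is written in C++.

```r
# Sketch of one KAMILA iteration: partition step (radial KDE of the minimum
# distances plus multinomial log-probabilities), then estimation step.
# Illustrative only.
set.seed(6)
n <- 100; G <- 2; p <- 2
V <- matrix(rnorm(n * p), ncol = p)
W <- sample(1:3, n, replace = TRUE)        # one categorical variable, 3 levels
mu    <- V[sample(n, G), , drop = FALSE]   # current centroid estimates
theta <- matrix(1 / 3, nrow = G, ncol = 3) # current multinomial estimates

# Partition step
d    <- sapply(1:G, function(g) sqrt(rowSums(sweep(V, 2, mu[g, ])^2)))
rmin <- apply(d, 1, min)                   # minimum distances r_i
fR   <- density(rmin)                      # univariate KDE of the r_i
fV   <- function(r) approx(fR$x, fR$y, xout = r, rule = 2)$y *
          gamma(p / 2) / (2 * pi^(p / 2) * pmax(r, 1e-12)^(p - 1))
logc <- sapply(1:G, function(g) log(theta[g, W]))   # categorical log-probs
H    <- log(apply(d, 2, fV)) + logc        # objective H_i(g)
cl   <- max.col(H)                         # assign to the maximizing population

# Estimation step: re-estimate centroids and multinomial parameters
mu    <- t(sapply(1:G, function(g) colMeans(V[cl == g, , drop = FALSE])))
theta <- t(sapply(1:G, function(g) tabulate(W[cl == g], 3) / sum(cl == g)))
table(cl)
```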
The kamila R package uses the straightforward rule of stopping a run when group membership remains unchanged from one iteration to the next. When running KAMILA with particularly large data sets, e.g., in a map-reduce framework (see Section 6), this stopping rule requires the storage and comparison of cluster memberships for two consecutive iterations at a time, which can be computationally expensive. A stopping rule that avoids this cost instead uses two quantities, $\epsilon_{\mathrm{con}}^{(t)}$ and $\epsilon_{\mathrm{cat}}^{(t)}$, which measure the change in the estimated centroids $\hat{\mu}$ and multinomial parameters $\hat{\theta}$ over the previous $k$ iterations, and stops when both are less than some chosen threshold(s). If membership remains unchanged from one iteration to the next, then $\epsilon_{\mathrm{con}}$ and $\epsilon_{\mathrm{cat}}$ will both be zero; thus, selecting very low thresholds will be similar to the initially proposed stopping rule. We have found that using $k = 1$ and thresholds of around 0.01 or lower yields satisfactory results for data sets with variables numbering in the dozens.
After all runs have ended, we select the best partitioning based on an objective function.

Equation 4 above suggests selecting the partition that maximizes $\sum_{i=1}^{N} \max_g H_i(g)$ across all runs. An alternative objective function, similar in structure to the Modha-Spangler objective described in Section 3, is to minimize the product of the categorical negative log-likelihood and the within- to between-cluster distance ratio of the continuous variables, that is, selecting the partition that minimizes
$$Q_{\mathrm{con}} \times \left( -\sum_{i=1}^{N} \log c_{i g(i)}^{(t)} \right).$$
The quantity $Q_{\mathrm{con}}$ is defined as $Q_{\mathrm{con}} = W_{\mathrm{con}} / B_{\mathrm{con}}$, where $W_{\mathrm{con}} = \sum_{i=1}^{N} \| x_i - \hat{\mu}_{g(i)} \|^2$, $g(i): \{1, \ldots, N\} \to \{1, \ldots, G\}$ is a function that maps the observed data point index to its assigned cluster index, and $B_{\mathrm{con}} = T_{\mathrm{con}} - W_{\mathrm{con}}$, where $T_{\mathrm{con}}$ is the total sum of squares $\sum_{i=1}^{N} \| x_i - \bar{\mu} \|^2$ with $\bar{\mu}$ denoting the sample mean taken across all continuous variables. The latter objective function is currently the default used in the kamila package, although future work will address in greater detail the best possible objective functions that can be used for KAMILA clustering.
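For a given partition, the default objective can be computed in a few lines; the sketch below uses simulated data and a k-means partition purely for illustration.

```r
# Sketch: computing the default KAMILA objective Q_con * (categorical negative
# log-likelihood) for a given partition. Simulated data, illustrative only.
set.seed(7)
n  <- 80
V  <- matrix(rnorm(n * 2), ncol = 2)               # continuous variables
W  <- sample(1:2, n, replace = TRUE)               # one categorical variable
cl <- kmeans(V, centers = 2, nstart = 5)$cluster   # some partition

# Q_con = W_con / B_con with B_con = T_con - W_con
mu_g  <- rowsum(V, cl) / as.vector(table(cl))      # per-cluster means
W_con <- sum((V - mu_g[cl, ])^2)                   # within-cluster sum of squares
T_con <- sum(scale(V, scale = FALSE)^2)            # total sum of squares
Q_con <- W_con / (T_con - W_con)

# Categorical negative log-likelihood under per-cluster multinomial estimates
theta <- t(sapply(1:2, function(g) tabulate(W[cl == g], 2) / sum(cl == g)))
negLL <- -sum(log(theta[cbind(cl, W)]))

Q_con * negLL   # the quantity minimized across runs
```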

The R package kamila
The kamila package provides three clustering functions: wkmeans, a simple wrapper for the stats::kmeans function modified to handle mixed-type data; kamila, described in detail in Section 4; and gmsClust, a flexible version of Modha-Spangler clustering described in Section 3 that can be modified to use any base clustering method and objective function. The gmsClust function defaults to using wkmeans with dummy coding of categorical variables and squared Euclidean distance. Only the kamila and gmsClust functions are intended for routine clustering use; wkmeans is included primarily as the default clustering technique of gmsClust, while also serving to illustrate fundamental challenges associated with using manual weights for clustering mixed-type data.
One heuristic approach to clustering mixed data is to use k-means with dummy-coded categorical variables, implemented here with the wkmeans function. While the default of 0-1 dummy coding is most common, this choice of weights is not optimal in any general sense; unlike the use of dummy coding in a regression context, the choice of weights can yield completely different clustering results. Generally speaking, the higher the scalar c used in 0-c coding, the greater the influence of the categorical variables on the resulting clustering. In the wkmeans function, the conWeight parameter specifies the weighting of the continuous relative to the categorical variables, and must be an element of [0, 1]; the continuous variables are each multiplied by conWeight, while the 0-1 dummy-coded categorical variables are each multiplied by (1 - conWeight). The "best" choice of weight in this weighted k-means approach can change dramatically based on the characteristics of the data set, and is generally difficult to specify (Foss et al. 2016). In the next section, we illustrate this difficulty by showing the performance of two choices of weights on two different data sets.
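A minimal usage sketch follows, assuming the kamila package is installed and using its documented wkmeans interface together with the package's dummyCodeFactorDf helper; the data are simulated here purely for illustration.

```r
# Usage sketch: wkmeans with two different continuous weights on simulated
# mixed-type data. Assumes the kamila package (and its dummyCodeFactorDf
# helper) is available; data and results are illustrative only.
library(kamila)
set.seed(8)
n   <- 200
grp <- rep(1:2, each = n / 2)
conVars    <- data.frame(x = rnorm(n, mean = c(0, 2)[grp]))
catVarsFac <- data.frame(y = factor(ifelse(runif(n) < c(0.8, 0.2)[grp], "a", "b")))
catVarsDum <- dummyCodeFactorDf(catVarsFac)   # 0-1 dummy coding

# conWeight in [0, 1]: continuous variables scaled by conWeight,
# dummy-coded categorical variables by (1 - conWeight)
resLow  <- wkmeans(conVars, catVarsDum, conWeight = 0.1, nclust = 2)
resHigh <- wkmeans(conVars, catVarsDum, conWeight = 0.9, nclust = 2)

# The two weightings can produce quite different partitions
table(low = resLow$cluster, high = resHigh$cluster)
```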
Selecting the optimal clustering technique remains a challenge, in part because the optimal technique is highly dependent upon the data set and a user's analytic objectives (Hennig 2015b). Expected characteristics of the clusters can guide the selection of clustering technique. If the clusters are well approximated by the normal distribution, then the normal-multinomial model is recommended. If the clusters depart from normality, however (e.g., due to skewness or heavy tails), then a model relying on normality should be avoided (Foss et al. 2016) in favor of KAMILA or the Modha-Spangler technique. KAMILA is particularly well suited to handle large data sets (hence the acronym), both in terms of algorithmic efficiency and its ability to identify useful latent structure. Depending on the scenario, Modha-Spangler may be more appropriate if the sample size is small, particularly if the underlying clusters are not well separated (well-separated clusters tend to be identified easily by any method). In particular, KAMILA (and other techniques relying on multinomial mixture models) can suffer if the sample size is small compared to the number of levels in the categorical variables (e.g., sample sizes of 100-250 with variables containing 10+ categorical levels and overlapping clusters). Modha-Spangler is often more robust in this type of scenario with categorical sparsity (small sample size relative to the number of categorical levels). The kamila function incorporates a categorical smoother which may be used to reduce the ill effects of categorical sparsity: if a user suspects that categorical sparsity is adversely affecting performance, the parameter catBw may be increased to 0.1 or 0.2. A trade-off is that an increased bandwidth when there is no categorical sparsity will reduce the impact of the categorical variables.
The selection of the number of clusters remains a fundamental challenge in clustering (see, e.g., Cooper and Milligan 1988; Hennig and Lin 2015; Milligan and Cooper 1985; Tibshirani and Walther 2005). Recognizing that no method is superior in all circumstances, we include an implementation of the prediction strength method of Tibshirani and Walther (2005) in the kamila function. Strengths and weaknesses of the prediction strength method have been discussed in the general case (Hennig and Lin 2015; Tibshirani and Walther 2005) as well as in the context of KAMILA clustering by Foss et al. (2016, Section 3.3). The prediction strength method tends to favor a small number of clusters, although this depends on a user-specified threshold. While a threshold of 0.8-0.9 is recommended in Tibshirani and Walther (2005) for well-separated data (the default in the kamila function is 0.8), it is less clear what threshold should be used if greater cluster overlap is expected. A lower threshold will, ceteris paribus, yield an equal or larger number of clusters. In Foss et al. (2016) a threshold of 0.6 yields useful results in a particular application, and Hennig and Lin (2015) suggest using a simulation-based approach in the spirit of Tibshirani, Walther, and Hastie (2001) to calibrate the selection process. In any of the clustering functions in the kamila package, the number of clusters can also be manually selected by the user.
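A usage sketch of the prediction strength option follows, assuming the kamila package is installed; argument and output names follow its documentation, and the data are simulated for illustration.

```r
# Usage sketch: estimating the number of clusters with the prediction strength
# option of kamila(). Assumes the kamila package is installed; simulated data.
library(kamila)
set.seed(9)
n   <- 300
grp <- rep(1:3, length.out = n)
conVars    <- data.frame(x1 = rnorm(n, mean = c(-3, 0, 3)[grp]), x2 = rnorm(n))
catVarsFac <- data.frame(y = factor(c("a", "b", "c")[grp]))

kamRes <- kamila(conVars, catVarsFac, numClust = 2:4, numInit = 10,
                 calcNumClust = "ps")   # prediction strength, default threshold 0.8
kamRes$nClust$bestNClust                # estimated number of clusters
```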
Theoretical analysis of the run-time of k-means has proven surprisingly difficult (e.g., see Arthur, Manthey, and Röglin 2011), to say nothing of more complex algorithms such as Modha-Spangler and finite mixture models. In typical applied conditions, however, these algorithms consistently display run-times linear in the sample size (e.g., Figure 7 in Foss et al. 2016). The brute-force optimization of the Modha-Spangler algorithm yields a linear run-time with a larger slope than k-means and KAMILA. This discrepancy is amplified in our R implementations of the Modha-Spangler (gmsClust) and KAMILA (kamila) algorithms, since the computationally intensive portions of kamila are written in C++.

Basic usage and simulation
We begin by illustrating the basic usage and performance of the methods in the kamila package. To demonstrate the properties of the methods, we first use the genMixedData function, a simple data generation tool included in the package. It allows the creation of mixed-type data sets in which we carefully control the degree of cluster separation/overlap separately in the continuous and categorical variables. We parameterize the overlap between two clusters as the overlapping area under their densities, with zero corresponding to complete separation and one corresponding to complete overlap; genMixedData specifies univariate overlap for each variable separately. The parameter nConVar controls the number of continuous variables; nConWithErr controls how many of these variables have an overlap of level conErrLev, while the remaining continuous variables have an overlap of 0.01. The categorical variables are controlled analogously with the parameters nCatVar, nCatWithErr, and catErrLev. The parameter popProportions is a vector of probabilities controlling the relative frequency of observations drawn from each cluster. Currently only two-cluster data sets are supported, as controlling overlap between clusters in k-cluster simulated data sets is less straightforward (e.g., see Maitra and Melnykov 2010). We note that simulated mixed-type data with a more intricate dependency structure between variables might be generated using the PoisBinOrdNor package (Amatya and Demirtas 2015; Demirtas, Hu, and Allozi 2016) to simulate data separately for each cluster.
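Under these definitions, a call to genMixedData might look as follows. The sample size, the nCatLevels argument (levels per categorical variable), and the specific overlap values chosen here are illustrative assumptions, not values from the text:

```r
library(kamila)

# Generate a two-cluster mixed-type data set; argument values are
# illustrative, and nCatLevels (levels per categorical variable) is an
# assumption about the exact argument name
dat <- genMixedData(
  sampSize = 500, nConVar = 2, nCatVar = 2, nCatLevels = 4,
  nConWithErr = 1, nCatWithErr = 1, popProportions = c(0.5, 0.5),
  conErrLev = 0.3, catErrLev = 0.3
)

# Standardize continuous variables and convert categorical codes to factors
conVars <- as.data.frame(scale(dat$conVars))
catVars <- as.data.frame(lapply(as.data.frame(dat$catVars), factor))
```

The resulting conVars and catVars data frames can then be passed to the clustering functions described below.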

Byar data analysis
We illustrate our R package kamila on a real data set consisting of fifteen variables measured on 475 patients with prostate cancer. We cluster the data set with gmsClust, using a k-medoids algorithm as the base clustering algorithm, and with kamila.
We begin by dropping the height, weight, patient ID, and outcome variables, and converting categorical variables from integers to factors.
R> data("Byar", package = "clustMD")
R> Byar$Serum.prostatic.acid.phosphatase <- log(
+    Byar$Serum.prostatic.acid.phosphatase)

Figure 2: Performance of k-means, Modha-Spangler, and KAMILA clustering algorithms on two different data structures (vertical axis: adjusted Rand index). White boxplots denote results from a balanced data set in which both the continuous and categorical variables contain useful information regarding cluster structure, and gray boxplots denote results from an unbalanced data set in which the categorical variables are more useful than the continuous. "W5/5" denotes the k-means algorithm with a weight of 0.5 applied to both continuous and categorical variables during clustering, and "W2/8" denotes weights of 0.2 and 0.8, respectively.

We use the kamila package to cluster the data using Modha-Spangler clustering. In the Modha-Spangler optimization framework, the optimal choices of distance metric and clustering method depend on the data set at hand and the user's clustering goals; we do not wish to endorse one particular method over others. Thus, we allow the user to input any custom distance metric and clustering algorithm into our Modha-Spangler implementation, and we illustrate this functionality here. Instead of the default k-means algorithm, we use the Modha-Spangler framework to optimize the k-medoids algorithm PAM (partitioning around medoids; Kaufman and Rousseeuw 1990) using Gower's distance (Gower 1971). This requires using continuous variables that have been normalized to the range [0, 1] by subtracting the minimum value and dividing by the range.
R> rangeStandardize <- function(x) {
+    (x - min(x)) / diff(range(x))
+ }
R> conVars <- as.data.frame(lapply(conVars, rangeStandardize))

We then write functions implementing the L1 distance for continuous variables and the matching distance for categorical variables (the two component distances required by Gower's distance). These functions must accept as input two rows of a data frame and return a scalar distance measure. Note that the default behavior of a data frame of factors differs substantially depending on whether it contains one or multiple variables; using as.integer to convert factors before comparison avoids comparing incompatible types.
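A sketch of these two distance functions is shown below; the function names are illustrative, not fixed by the package:

```r
# L1 (Manhattan) distance between two one-row data frames of continuous
# variables; unlist converts each row to a plain numeric vector
L1Dist <- function(v1, v2) {
  sum(abs(unlist(v1) - unlist(v2)))
}

# Matching distance between two one-row data frames of categorical
# variables; converting each factor with as.integer avoids comparing
# incompatible types (assumes both rows come from the same data frame,
# so factor levels are shared)
matchingDist <- function(v1, v2) {
  i1 <- vapply(v1, as.integer, integer(1))
  i2 <- vapply(v2, as.integer, integer(1))
  sum(i1 != i2)
}
```

These two pieces can then be combined into a Gower-type dissimilarity within the Modha-Spangler objective; consult the gmsClust documentation for the exact form in which custom distances are supplied.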
We now run a KAMILA clustering procedure on the same data.
R> set.seed(6)
R> numberOfClusters <- 2:10
R> kmResPs <- kamila(conVars, catVars, numClust = numberOfClusters,
+    numInit = 10, maxIter = 50, calcNumClust = "ps")

Prediction strength values for each number of clusters tested are shown in Figure 5. The selected number of clusters is two. In the table below, we see that the two-cluster KAMILA solution is identical to the three-cluster KAMILA solution except that the two clusters with the lowest proportion of deaths due to prostate cancer are merged.

Hadoop implementation
We illustrate here how to implement KAMILA on very large data sets stored on distributed file systems. We use the map-reduce computing model (Dean and Ghemawat 2008) as implemented in Hadoop (Apache Software Foundation 2016a; White 2009). We also make use of the Hadoop streaming utility (Apache Software Foundation 2016b), available in the standard Hadoop distribution. Hadoop streaming allows mappers and reducers to be written in any programming language to be integrated into an otherwise standard Hadoop map-reduce run; this allows for rapid development of custom applications. In our case, it offers a way to implement map-reduce runs that have access to the rich set of objects and methods available in R.
There are multiple existing R packages that facilitate computing with data sets stored on distributed file systems. The pbdR project (Ostrouchov et al. 2012) is a collection of packages for high-performance computing on distributed systems, including interfaces to MPI (The MPI Forum 2015) and ScaLAPACK (Blackford et al. 1997). Interfacing with Hadoop can be accomplished with the packages RHIPE (Guha 2012), rmr2 (Revolution Analytics 2016a), or hive (Feinerer and Theussl 2015), while interfacing with Spark (Apache Software Foundation 2016d) can be accomplished using the R package sparklyr (Luraschi, Ushey, Allaire, and The Apache Software Foundation 2016) or the SparkR package (Venkataraman 2013) included in Spark versions ≥ 1.4. Although we opted for a bare-bones implementation of Hadoop streaming in the current work to minimize software dependencies, future implementations may draw upon the packages mentioned above to allow distributed computations to be executed from within the R environment.
The algorithm used in the Hadoop implementation of KAMILA is shown in Algorithm 1. Unless otherwise noted, it is structured in fundamentally the same manner as detailed in Section 4 and in Algorithm 1 in Foss et al. (2016), although certain modifications were necessary to accommodate map-reduce operations. Initializations of the continuous variables are not drawn from the entire data set, but from a subset of the continuous data points (around 2,000 to 5,000 points are adequate) to reduce computation time. In the REDUCE 1 step of Algorithm 1, the values of r_i^(t) are binned to avoid the computation required to store a vector of length N between the first and second map-reduce runs. For the calculation of the radial kernel density estimate between map-reduce runs 1 and 2, calculations are performed assuming the points are drawn from the mid-points of the bins; the variance of the binned values is computed using Sheppard's correction (Stuart and Ord 1998), in which a correction factor of h^2/12 is subtracted from the standard variance estimator, where h is the bin width. The stopping rule given in Equation 5 is used in order to avoid the costly storage and comparison of two vectors of length N at every iteration.
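The binned variance computation with Sheppard's correction can be sketched as follows; this illustrative function is not the package's internal implementation:

```r
# Variance of binned values with Sheppard's correction: treat each value
# as falling at its bin mid-point, compute the usual (n - 1) variance
# estimator, then subtract the correction factor h^2 / 12 (h = bin width)
sheppardVar <- function(binMids, counts, h) {
  n <- sum(counts)
  mu <- sum(counts * binMids) / n
  rawVar <- sum(counts * (binMids - mu)^2) / (n - 1)
  rawVar - h^2 / 12
}
```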
Compared to KAMILA, the k-means and Modha-Spangler algorithms pose problems in a high-performance computing environment with large mixed-type data sets. In addition to the substantial problems balancing continuous and categorical contributions described in Section 2, k-means may cause difficulties with large data sets due to the computational demands of dummy coding categorical variables. It is not uncommon to encounter large data sets with categorical variables containing a large number of levels; since each dummy-coded level requires its own variable, the corresponding storage and run-time demands can quickly become challenging to handle. Each iteration of the KAMILA algorithm requires two passes through the continuous variables and one through the categorical, and thus has run-time approximately proportional to 2p + q, where p is the number of continuous variables and q the number of categorical variables. One k-means iteration has run-time proportional to p + q*, where q* is the number of dummy-coded categorical levels. If p = q and there are ℓ levels per categorical variable, the ratio of these run-times is (p + q*)/(2p + q) = (ℓ + 1)/3. Thus, if every level is given its own dummy variable, computation times will be comparable only when all variables are binary. Since variables often have many levels, KAMILA will generally outperform k-means in this context. (We note that simplex coding (McCane and Albert 2008) can reduce this ratio slightly, to ℓ/3.) Since the standard formulation of Modha-Spangler is based on k-means, it suffers from these same computational challenges, but with run-times multiplied by the number of evaluations in its brute-force optimization scheme.
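The run-time comparison above can be made concrete with a small illustrative function (not part of the package):

```r
# Approximate per-iteration cost ratio of dummy-coded k-means to KAMILA:
# p continuous variables, q categorical variables with ell levels each,
# so q* = q * ell dummy-coded columns; KAMILA costs ~2p + q per iteration
kmeansToKamilaRatio <- function(p, q, ell) {
  qStar <- q * ell
  (p + qStar) / (2 * p + q)
}

kmeansToKamilaRatio(p = 5, q = 5, ell = 2)   # binary variables: ratio 1
kmeansToKamilaRatio(p = 5, q = 5, ell = 11)  # many levels: ratio 4
```

With p = q the ratio reduces to (ℓ + 1)/3, matching the expression in the text.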
(Each initialization of Algorithm 1 iterates until the stopping rule is satisfied; the output is the partition that maximizes the chosen objective function.)
A description of a computing environment for running Hadoop can be found in Appendix A of Theußl, Feinerer, and Hornik (2012). The environment modules package (http://www.modules.sourceforge.net/) is used to manage packages and environment variables. We do not expect all interested parties to have the same setup, and thus some adaptation may be required to implement our programs; however, we have separated out the components such that the mappers and reducers can be conveniently integrated into alternative setups using Hadoop streaming.

Running the analysis in Hadoop
Here we give a brief overview of the code implementing KAMILA in Hadoop. First we navigate to the relevant directory where the supplementary material is contained and list the contents:

$ ls -1F --gr kamilaStreamingHadoop/
BASH/
csv/
py/
R/
Rnw/
kamila.slurm
kmeans.slurm
LICENSE
preprocessing.slurm
README.md

Note that the sources may also be downloaded from GitHub (https://github.com/ahfoss/kamilaStreamingHadoop.git).
As stated above, our methods may require adaptation to the particular platform of interest. The file preprocessing.slurm is a SLURM batch submission script showing an example preprocessing pipeline using SQLite (Hipp 2013, version 3.7.17). Three data files are required to run the clustering program. In addition to the main data file (which currently must be in CSV format), the program requires a TSV file with one row per categorical variable describing basic metadata, including column indices, number of levels, and the names and counts of the categorical levels. The required formatting is described in README.md, and the script py/preprocKamila.py gives example code for generating the required inputs from a raw data set. Categorical variables should not be dummy coded; the levels of each categorical variable should be replaced by the integers 1-M, where M is the number of categorical levels in that variable. Finally, as described above, a small random subset of the continuous data points (drawn without replacement; about 2,000 to 5,000 should suffice) should be placed in a separate data file. The file kamila.slurm is a SLURM batch submission script that implements KAMILA clustering in streaming Hadoop. The SLURM options and BASH variables defined in the initial lines of kamila.slurm must be set to match the specifications of the job at hand. Detailed instructions for these settings can be found in README.md and in Appendix A.
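A minimal sketch of the required recoding step is shown below; the repository's py/preprocKamila.py performs the real preprocessing, so this R version is only illustrative:

```r
# Replace the levels of each categorical variable with the integers 1-M,
# where M is the number of levels in that variable (factor() assigns
# integer codes in the sorted order of the observed levels)
recodeCats <- function(df) {
  as.data.frame(lapply(df, function(x) as.integer(factor(x))))
}

toy <- data.frame(color = c("red", "blue", "red"),
                  size = c("S", "L", "M"))
recodeCats(toy)
```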
Once kamila.slurm and the data set are set up, the job can be submitted to the SLURM job manager using the sbatch command. In order to illustrate the required data format, a sample analysis can be executed from the package as follows from a linux terminal. First, a toy data set csv/sample.csv can be generated using an included R script:

$ cd kamilaStreamingHadoop/
/kamilaStreamingHadoop$ Rscript R/genData.R
/kamilaStreamingHadoop$ ls -1sh csv/
total 604K
604K sample.csv

A preprocessing script formats the data and generates the required metadata as described above; a SQLite data base is created containing other useful summary information. The preprocessing.slurm script can be run using the SLURM job manager:

/kamilaStreamingHadoop$ sbatch preprocessing.slurm

Next, cluster the data:

/kamilaStreamingHadoop$ sbatch kamila.slurm

Results of the analysis will be stored in the directory output-kamila-####, where #### denotes the SLURM job ID. See Appendix A for a more detailed description of the raw output file structure.
Finally, a sample script can be used to generate useful summary information about the clustering procedure and results using the script Rnw/kamilaSummary.Rnw. The JOBID variable in the section "User-Supplied Values" of kamilaSummary.Rnw must be changed to the job ID used by SLURM to run kamila.slurm.

Airline data set
We illustrate our Hadoop software on the airline data set used in the ASA Data Expo 2009 (American Statistical Association 2016, http://stat-computing.org/dataexpo/2009/, accessed April 2016). All files necessary for downloading the data and conducting the analysis are part of the supplementary material. In addition, they can be found on GitHub (https://github.com/ahfoss/airline-hadoop-analysis); the analysis can be reproduced by running the commands found in the README file.

Figure 6: Results of clustering the airline data using KAMILA. A subset of the data is plotted using the first two principal components of the sub-sample. Cluster number is depicted by color, with the number within the plotted points depicting the iteration number of the best k-means run. Loadings of the variables on the PCs are shown in Table 6.
Summary statistics for the variables used are given in Tables 1 and 2. Continuous variables were z-normalized, and categorical variables were re-coded using unique integers for each categorical level. There were a total of 335 and 331 terminals in the origin and destination variables, respectively; we restricted the analysis to the 50 most prevalent terminals in both cases and grouped the remaining terminals into an "other" category. We note that if we had been forced to rely on a method requiring dummy coding, we would have required 145 variables to represent the categorical variables instead of five, a 2800% increase in the number of variables required. Based on the approximate calculations described in the opening of Section 6, this corresponds roughly to an 8× increase in computation time for k-means over KAMILA.

Table 8: Cluster-specific minimum, mean, and maximum for the continuous arrival delay in minutes (ArrDelay). A negative delay corresponds to an event that occurred earlier than the originally scheduled time.

Figure 7: The most prominent airlines in cluster 8 of the airline data analysis. Pearson residuals were calculated for the cross-tabulations of cluster membership and airport, separately for origin and destination airports. Residuals for only cluster 8 were then plotted. Higher numbers represent increased counts compared to the expected counts if the cluster were unrelated to the variable in question.

Results of the airline analysis
We ran the KAMILA analysis using 16 initializations with a maximum of 10 iterations per initialization (as shown in Algorithm 1, each iteration includes two map-reduce steps). Four runs of four initializations were run in parallel, with the data split into 80 chunks for the map-reduce steps. Our computing setup used 320 cores and ran for a wall time of 72 hours, corresponding to a little more than two and a half years of CPU time.
A plot of the clustering results for the airline data is shown in Figure 6. This plot shows a subset of the data plotted on the first two principal components of the sub-sampled continuous variables. Loadings of the continuous variables on these principal components are given in Table 6, and account for 70.5% of the variance in the subset.
Cluster-specific minimum, maximum, and mean values are shown for the air time variable in Table 7, and for arrival delay in Table 8. Figure 6 shows that the most prominent clusters with regard to continuous variables are cluster 1, which appears to capture longer flights, and cluster 8, which appears to capture flights with larger arrival and departure delays. This pattern is consistent across all airports in the data set.
Counts of flights by day of the week for each cluster are shown in Table 10, with associated Pearson residuals in Table 11. An interesting observation is that counts for Saturday and Sunday are consistently more often over- or under-represented than is the case for any other day of the week, despite the fact that Saturday and Sunday are the least frequent days for flying overall (see Table 2). Inspecting counts of flights for each carrier by day of the week (Table 12) and associated Pearson residuals (Table 13), it appears that flights on Sunday depart most dramatically from an independence model. This is especially apparent in Figure 8, a boxplot depicting Pearson residuals of carrier-by-day counts (i.e., Table 13) aggregated across carriers. This suggests that airlines differ markedly in how many flights they offer on weekends, and on Sunday in particular. While more work would be needed to identify the precise cause, one possible reason could be that some airlines choose not to offer many Sunday flights due to lower demand on weekends. The difference could also be related to contrasting company policies regarding flight attendants' work on Sunday.

Conclusion
Although mixed-type data is quite common across many fields, effective strategies for clustering such data sets are surprisingly difficult to find. As discussed above, this is true even in the context of R, but it is particularly true when the data at hand is too large to be clustered in a standard R environment. Existing methods use arbitrary strategies for controlling the continuous and categorical contribution to the overall clustering, often resulting in undesirable solutions dominated by one type or the other. In this paper we introduced the R package kamila that offers multiple solutions to this problem. Additionally, we provide examples using Hadoop illustrating the implementation of KAMILA using a map-reduce model on a large data set stored on a distributed file system. Clustering mixed-type data continues to be a challenging problem; as these methods are refined, and as other effective strategies are discovered, we intend for the kamila package to serve as a centralized platform for implementing the most effective techniques in this area.

A. KAMILA Hadoop submission file
The main SLURM batch submission script kamila.slurm should be modified to fit your computing setup and data. The first set of choices involve options for the SLURM workload manager; consult the sbatch documentation (SLURM Team 2016) for further information.
The remainder of the user-specified options are BASH variables. The first three variables are the file names of the three data files described in Section 6.1:
• DATANAME: the file name of the primary CSV data file.
• SUBSAMP_DATANAME: the file name of the sub-sampled data file.
• CATINFO_DATANAME: the file name of the TSV file.
The fourth variable, DATADIR, is the directory of the data files, i.e., the full path of the primary data file is ./$DATADIR/$DATANAME. The remainder of the environment variables control the behavior of the KAMILA algorithm:
• INITIAL_SEED: integer. A random seed for the random number generator, for reproducibility.
• NUM_CLUST: integer. KAMILA partitions the data into this many clusters.
• NUM_MAP: integer. The default number of map tasks, passed to Hadoop streaming option mapred.map.tasks.
• NUM_REDUCE: integer. The default number of reduce tasks, passed to Hadoop streaming option mapred.reduce.tasks.
• NUM_CHUNK: integer. Each cluster is split into this many chunks, and each chunk is assigned its own key (i.e., about NUM_CLUST × NUM_CHUNK keys total). This is useful if the number of available reducers greatly exceeds NUM_CLUST.
• NUM_INIT: integer. The number of distinct initializations to use.
• MAX_NITER: integer. The maximum number of iterations computed for each distinct initialization.
• EPSILON_CON and EPSILON_CAT: positive reals. Parameters controlling the stopping rule. The closer to zero the more stringent the rule and thus the more likely each initialization will simply run for MAX_NITER iterations. The run is stopped if the summed absolute deviation of the centroid parameters in both the continuous and categorical variables are less than EPSILON_CON and EPSILON_CAT, respectively, from one iteration to the next. A reasonable value is the total deviation you would accept in the estimated centroid parameters relative to the true parameters, which depends on the data and the user's analysis needs. See the software paper cited above for more information.
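The stopping rule governed by EPSILON_CON and EPSILON_CAT can be sketched as follows; the function is an illustration of the rule as described above, not the implementation used in the Hadoop scripts:

```r
# Stop when the summed absolute change in the continuous and categorical
# centroid parameters from one iteration to the next falls below
# EPSILON_CON and EPSILON_CAT, respectively
stopRule <- function(conOld, conNew, catOld, catNew, epsCon, epsCat) {
  sum(abs(conNew - conOld)) < epsCon && sum(abs(catNew - catOld)) < epsCat
}
```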
The file output-kamila-####/best_run/allClustQual.txt contains a list of the objective criterion values used to select the best run. The directory output-kamila-####/run_i/ contains information on the i-th run; within each run's directory, iter_j/ stores information on the j-th iteration. The primary results files, output-kamila-####/run_i/stats/finalRunStats.RData, contain two R lists: runSummary and clustSummary.
The output-kamila-####/run_i/iter_j/ directories contain the centroids of the current run/iteration stored as an RData file, along with other output from the reducers for that iteration.
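Iterating over this directory structure from R might look as follows; the job ID 1234 is a hypothetical stand-in for the SLURM job ID in the directory name:

```r
# Walk the run directories of a hypothetical job (ID 1234) and print each
# run's objective criterion; loading finalRunStats.RData places the
# runSummary and clustSummary lists in the current environment
outDir <- "output-kamila-1234"
runDirs <- list.files(outDir, pattern = "^run_", full.names = TRUE)
for (rd in runDirs) {
  load(file.path(rd, "stats", "finalRunStats.RData"))
  cat(basename(rd), ":", runSummary$clusterQualityMetric, "\n")
}
```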
The runSummary object contains overall summary statistics for the clustering: • clusterQualityMetric: the quantity used to select the best solution over all the runs.
• totalEucDistToCentroid: the total Euclidean distance from each continuous point to its assigned centroid.
• totalCatLogLik: the log-likelihood of the categorical centroids with respect to the categorical variables.
The clustSummary list contains one element for each cluster. Each cluster's element contains a list of length five with the elements: • count: the number of observations assigned to this cluster.
• totalDistToCentroid: the total Euclidean distance from each member of the current cluster to the centroid (along the continuous variables only).
• totalCatLogLik: the log-likelihood of the categorical centroids of the current cluster with regard to the categorical variables.
• con: means, minima, and maxima along each continuous variable for members of the current cluster.
• catfreq: frequencies of observations of each level of each categorical variable within the current cluster. Frequencies are calculated by counting the number of observations in the current cluster at a particular level of a particular variable and dividing by the number of observations in the current cluster.

A.2. Reading Hadoop results files into R
The document Rnw/kamilaSummary.Rnw (written as a knitr document; Xie 2015) included in the kamilaStreamingHadoop repository provides code for generating plots and tables of summary statistics from the output objects described in Section A.1. The context and execution of this script are described in Section 6.1. In the interest of brevity we do not discuss every line of Rnw/kamilaSummary.Rnw, but highlight certain sections of the code that illustrate strategies for querying SQLite data bases and iterating through the Hadoop results directory.
We begin by obtaining summary statistics from the SQLite data base with path and filename stored in the variable dataBaseName.
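Assuming the DBI and RSQLite packages, this query step might look as follows; the table name summaryStats is an assumption, since the actual schema is created by the preprocessing script:

```r
library(DBI)

# Connect to the SQLite data base produced during preprocessing;
# dataBaseName holds its path and file name
con <- dbConnect(RSQLite::SQLite(), dbname = dataBaseName)

# Inspect the available tables, then pull summary statistics from a
# hypothetical summaryStats table
dbListTables(con)
summaryStats <- dbGetQuery(con, "SELECT * FROM summaryStats")

dbDisconnect(con)
```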
We construct Figure 6 by projecting a subset of the data onto its first two principal components, with data points colored by final cluster membership. We assume that this subset has already been created and stored as a separate CSV file with path and file name subsampleDataFile. Upon this plot we superimpose the cluster centroids at each iteration of the clustering procedure, labeled by iteration number. This is accomplished with the following code.