blockcluster : An R Package for Model-Based Co-Clustering

Simultaneous clustering of rows and columns, usually designated by bi-clustering, co-clustering or block clustering, is an important technique in two-way data analysis. A standard and efficient approach has recently been proposed based on the latent block model (Govaert and Nadif 2003), which takes into account the block clustering problem on both the individual and variable sets. This article presents our R package blockcluster for co-clustering of binary, contingency and continuous data based on these models. In this document, we give a brief review of model-based block clustering methods and show how the R package blockcluster can be used for co-clustering.


Introduction
Cluster analysis is an important tool in a variety of scientific areas such as pattern recognition, information retrieval, micro-array analysis, data mining, and so forth. Although many clustering procedures, such as hierarchical clustering, k-means or self-organizing maps, aim to construct an optimal partition of objects or, sometimes, of variables, there are other methods, called block clustering methods, which consider the two sets simultaneously and organize the data into homogeneous blocks. Let x denote an n × d data matrix defined by x = {x_ij; i ∈ I and j ∈ J}, where I is a set of n objects (rows, observations, cases, etc.) and J is a set of d variables (columns, attributes, etc.). The basic idea of these methods is to make permutations of objects and variables in order to draw a correspondence structure on I × J. For illustration, consider Figure 1, where a binary data set defined on a set of n = 10 individuals I = {A, B, C, D, E, F, G, H, I, J} and a set of d = 7 binary variables J = {1, 2, 3, 4, 5, 6, 7} is re-organized into a set of 3 × 3 clusters by permuting the rows and columns.
In recent years, co-clustering has found numerous applications in fields ranging from data mining and information retrieval to biology and computer vision. Dhillon (2001) published an article on text data mining in which documents and content (words) are clustered simultaneously using bipartite spectral graph partitioning. This is quite a useful technique, for instance, to manage a huge corpus of unlabeled documents. Xu, Zong, Dolog, and Zhang (2010) present another co-clustering application (again using a bipartite spectral graph) to understand subset aggregates of web users by simultaneously clustering the users (sessions) and the page view information. Giannakidou, Koutsonikola, Vakali, and Kompatsiaris (2008) employed a similarity metric based co-clustering technique for a social tagging system. In the field of bioinformatics, co-clustering is mainly used to find structures in gene expression data. This is useful, for instance, to find a set of genes corresponding to a particular kind of disease. Some of the pioneering material in this context can be found in Kluger, Basri, Chang, and Gerstein (2003). Recently, many co-clustering based algorithms have also been developed to target computer vision applications. For instance, Qiu (2004) demonstrated the utility of co-clustering in image grouping by simultaneously clustering images with their low-level visual features. Guan, Qiu, and Xue (2005) extended this work and presented the opportunity to develop a novel content based image retrieval system. Similarly, Rasiwasia and Vasconcelos (2009) used co-clustering to model scenes. We have presented here only a very brief survey demonstrating the diversity of co-clustering applications. Unfortunately, many of the algorithms mentioned above were developed for one particular context only and cannot be considered as a generic framework. Here, we take the opportunity to emphasize that the algorithms used in our software can be adapted to most of the above contexts.
In practice, several software packages exist for block clustering, such as Bicat (Barkow, Bleuler, Prelic, Zimmermann, and Zitzler 2006), biclust (Kaiser and Leisch 2008) and BiGGEsTS (Goncalves, Madeira, and Oliveira 2009); see Madeira and Oliveira (2004) for a review. In this paper, we introduce our R (R Core Team 2016) package blockcluster (Iovleff and Singh Bhatia 2017), which provides a bridge between a newly developed C++ library and the R statistical computing environment. The R package blockcluster makes it possible to estimate the parameters of the co-clustering models (Govaert and Nadif 2003) for binary, contingency and continuous data. This package is unique in terms of the generative models it implements (latent block models) and the algorithms it uses (BEM, BCEM); in addition, special attention has been paid to designing the library so that very large data sets can be handled in reasonable time. The R package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=blockcluster. The rest of the article is organized as follows. In Section 2, we give the mathematical foundation (without going into detailed proofs) of latent block models for binary, contingency and continuous datasets. In Section 3, we elaborate various details of the blockcluster package and illustrate its usage with the help of several examples. This part can also act as a first-hand comprehensive tutorial for readers who are interested in running the blockcluster package directly. Finally, we illustrate our package on real datasets in Section 4 and conclude with some remarks in Section 5.

Mixture models: Classic formulation
In model-based clustering, it is assumed that the data arise from a mixture of underlying probability distributions, where each component k of the mixture represents a cluster. Hence the data matrix x = (x_1, . . . , x_n) is assumed to be an i.i.d. sample generated from a probability distribution with density

f(x_i; θ) = Σ_k π_k f_k(x_i; α),

where f_k is the density function of the kth component. Generally, these densities belong to the same parametric family (say, for example, Gaussian). The parameters (π_k), k = 1, . . . , g, give the probabilities that an observation belongs to the kth component, and the number of components g in the mixture is assumed to be known. The parameter θ of this model is the vector (π_1, . . . , π_g, α) containing the parameter α and the mixing proportions (π_1, . . . , π_g). Using this, the density of the observed data x can be expressed as

f(x; θ) = Π_i Σ_k π_k f_k(x_i; α).   (1)

An important property of the pdf f(x; θ), shown by Govaert and Nadif (2003), is that the density (1) can be rewritten as

f(x; θ) = Σ_{z∈Z} p(z) f(x|z; α),   (2)

where Z denotes the set of all possible partitions of I into g clusters, f(x|z; α) = Π_i f_{z_i}(x_i; α) and p(z) = Π_i π_{z_i}. With this formulation, the data matrix is assumed to be a sample of size 1 from a random (n, d) matrix.

Latent block model: General formulation
Let the dataset x be defined on a set I with n elements (individuals) and a set J with d elements (variables). We represent a partition of I into g clusters by z = (z_11, . . . , z_ng), with z_ik = 1 if i belongs to cluster k and z_ik = 0 otherwise; we write z_i = k if z_ik = 1 and denote by z_.k = Σ_i z_ik the cardinality of row cluster k. Similarly, we represent a partition of J into m clusters by w = (w_11, . . . , w_dm), with w_jℓ = 1 if j belongs to cluster ℓ and w_jℓ = 0 otherwise; we write w_j = ℓ if w_jℓ = 1 and denote by w_.ℓ = Σ_j w_jℓ the cardinality of column cluster ℓ. Furthermore, for any quantity V indexed by rows, columns or clusters, a dot in a subscript denotes summation over the corresponding index; for instance V_.k = Σ_i V_ik. To study the block clustering problem, the previous formulation (2) is extended to the block mixture model defined by the following probability density function

f(x; θ) = Σ_{u∈U} p(u) f(x|u; θ),

where U denotes the set of all possible labelings of I × J and θ contains all the unknown parameters of this model. By restricting this model to a set of labelings of I × J defined by a product of labelings of I and J, and further assuming that the labelings of I and J are independent of each other, we obtain the decomposition

f(x; θ) = Σ_{(z,w)∈Z×W} p(z) p(w) f(x|z, w; θ),

where Z and W denote the sets of all possible labelings z of I and w of J.
By extending the latent class principle of local independence to our block model, each data point x_ij is assumed to be independent once z_i and w_j are fixed. Hence we have

f(x|z, w; α) = Π_{i,j} f_{z_i w_j}(x_ij; α),

where f_{z_i w_j}(x_ij; α) is either a probability density function (pdf) defined on the real set R or a discrete probability function. Denoting θ = (π, ρ, α), where π = (π_1, . . . , π_g) and ρ = (ρ_1, . . . , ρ_m) are the vectors of probabilities π_k and ρ_ℓ that a row and a column belong to the kth row component and to the ℓth column component respectively, we obtain the latent block mixture model with pdf

f(x; θ) = Σ_{(z,w)∈Z×W} Π_i π_{z_i} Π_j ρ_{w_j} Π_{i,j} f_{z_i w_j}(x_ij; α).   (4)

Using the above formulation, the randomized data generation process can be described by the three steps row labelings (R), column labelings (C) and data generation (X) as follows:
(R) Generate the labelings z = (z_1, . . . , z_n) according to the distribution π = (π_1, . . . , π_g).
(C) Generate the labelings w = (w_1, . . . , w_d) according to the distribution ρ = (ρ_1, . . . , ρ_m).
(X) For i = 1, . . . , n and j = 1, . . . , d, generate the value x_ij according to the distribution f_{z_i w_j}(·; α).
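For illustration, the following R sketch simulates a small binary data matrix according to these three steps; the parameter values are arbitrary and purely illustrative.

## Simulate from a binary latent block model following steps (R), (C), (X).
set.seed(1)
n <- 200; d <- 50                       # numbers of rows and columns
g <- 2; m <- 3                          # numbers of row and column clusters
pi  <- c(0.6, 0.4)                      # row mixing proportions
rho <- c(0.3, 0.4, 0.3)                 # column mixing proportions
p   <- matrix(c(0.9, 0.1, 0.5,
                0.2, 0.8, 0.4), nrow = g, byrow = TRUE)  # Bernoulli p_kl per block
z <- sample(1:g, n, replace = TRUE, prob = pi)    # (R) row labels
w <- sample(1:m, d, replace = TRUE, prob = rho)   # (C) column labels
x <- matrix(rbinom(n * d, 1, p[z, w]), n, d)      # (X) x_ij ~ Bernoulli(p[z_i, w_j])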

Model parameter estimation
EM based algorithms are the standard approach to estimate the model parameters by maximizing the observed data log-likelihood. In our case, the complete data are represented as a vector (x, z, w), where the unobservable vectors z and w are the labels. The complete data log-likelihood can then be written

L_C(z, w, θ) = Σ_{i,k} z_ik log π_k + Σ_{j,ℓ} w_jℓ log ρ_ℓ + Σ_{i,j,k,ℓ} z_ik w_jℓ log f_kℓ(x_ij; α).

The EM algorithm maximizes the log-likelihood L(θ) iteratively by maximizing the conditional expectation Q(θ, θ^(c)) of the complete data log-likelihood given a previous current estimate θ^(c) and x. Unfortunately, difficulties arise owing to the dependence structure in the model and, more precisely, in determining the conditional expectations e_ikjℓ^(c) of the products z_ik w_jℓ given x and θ^(c). To solve this problem, an approximate solution is proposed (Govaert and Nadif 2003) using the Hathaway (1986) and Neal and Hinton (1998) interpretation of the EM algorithm. By constraining the conditional probability of (z, w) given x to factorize into a distribution on z and a distribution on w, it can be shown that the criterion to be maximized becomes, in the case of the latent block model, the fuzzy clustering criterion proposed by Neal and Hinton (1998),

F_C(t, r; θ) = L_c(t, r; θ) + H(t) + H(r),   (6)

where t = (t_ik) and r = (r_jℓ) are the fuzzy row and column classifications, H(t) = −Σ_{i,k} t_ik log t_ik and H(r) = −Σ_{j,ℓ} r_jℓ log r_jℓ are entropy terms, and L_c is the fuzzy complete-data log-likelihood associated with the latent block model, obtained from the complete data log-likelihood above by replacing z_ik, w_jℓ and z_ik w_jℓ with t_ik, r_jℓ and t_ik r_jℓ. This is a variational approach and its objective is to maximize the function F_C. In doing so, the maximization of the likelihood is replaced by the maximization of an approximation of this likelihood defined by L̃(θ) = max_{t,r} F_C(t, r; θ). To simplify some notation, we will also use the restricted complete data log-likelihood L_C(z, w, α) and the restricted fuzzy complete-data log-likelihood L_c(t, r, α), obtained by dropping the terms involving the proportions π and ρ.

Algorithms
Here we give a brief outline of two algorithms used in our package.

Block expectation maximization (BEM) algorithm
The fuzzy clustering criterion given in (6) can be maximized using the EM algorithm. Here we only outline the various expressions evaluated during the E and M steps.

E-Step:
Here we compute the conditional row class probabilities t_ik and the conditional column class probabilities r_jℓ, given the current parameter estimates and the current value of the other set of class probabilities; a sketch of their general form is given below.
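Following the variational formulation of Govaert and Nadif (2003), these conditional probabilities can be written as follows (a sketch in the notation introduced above; each quantity is then normalized so that the probabilities sum to one over k and ℓ respectively):

t_{ik} \propto \pi_k \exp\Big( \sum_{j,\ell} r_{j\ell} \log f_{k\ell}(x_{ij}; \alpha) \Big),
\qquad
r_{j\ell} \propto \rho_\ell \exp\Big( \sum_{i,k} t_{ik} \log f_{k\ell}(x_{ij}; \alpha) \Big).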

M-Step:
Here we calculate the row proportions π and the column proportions ρ. The maximization of F_C w.r.t. π and w.r.t. ρ is obtained by maximizing Σ_k t_.k log π_k and Σ_ℓ r_.ℓ log ρ_ℓ respectively, which leads to π_k = t_.k / n and ρ_ℓ = r_.ℓ / d. The estimate of the model parameters α is obtained by maximizing the restricted fuzzy complete-data log-likelihood Σ_{i,j,k,ℓ} t_ik r_jℓ log f_kℓ(x_ij; α), which depends on the pdfs f_kℓ and is treated separately in the subsections discussing the block mixture models for the binary, contingency and continuous cases.

Block classification expectation maximization (BCEM) algorithm
To apply the complete maximum likelihood (CML) approach for parameter estimation of the latent block model (4), the partitions z and w are added to the parameters to be estimated, and the objective is to maximize the complete data log-likelihood associated with the latent block model (4). Unlike the BEM situation, this maximization does not require an approximation and can be done, for instance, by alternately maximizing the complete data log-likelihood with z fixed and then with w fixed. This algorithm can be seen as a variant of the BEM algorithm described above: to perform it, it is sufficient to add a C-step to each of the EM steps, which converts the t_ik's and r_jℓ's into a discrete classification before performing the M-step, by assigning each individual and each variable to the cluster having the highest posterior probability.

Binary data
The parameter α of the underlying distribution of a binary data set is given by the matrix p = (p_kℓ), k = 1, . . . , g and ℓ = 1, . . . , m, where p_kℓ ∈ (0, 1), and the probability distribution of each block is the Bernoulli distribution

f_kℓ(x_ij; p) = (p_kℓ)^{x_ij} (1 − p_kℓ)^{1 − x_ij}.

We re-parameterize the model density by replacing the parameters p_kℓ of the Bernoulli mixture model with the following parameters:
• The binary value a_kℓ, which acts as the center of block (k, ℓ) and gives, for each block, the most frequent binary value;
• The value ε_kℓ, belonging to the interval ]0, 1/2], that characterizes the dispersion of block (k, ℓ) and represents, for each block, the probability of observing a value different from the center.
For this model, the row and column class conditional probabilities follow from Equations 10 and 11. Furthermore, we can write the restricted complete-data log-likelihood for this model in terms of the previously defined parameters (Equation 13), where y_kℓ = Σ_{i,j} t_ik r_jℓ x_ij. It can easily be shown that the values of the model parameters maximizing Equation 13 have a simple closed form; a sketch of the resulting updates is given below. For a detailed discussion of Bernoulli latent block models we kindly refer our readers to Govaert and Nadif (2008).
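As an illustration of these closed-form updates, the following R sketch computes a_kℓ, ε_kℓ and the mixing proportions from the current fuzzy classifications; it follows our reading of Govaert and Nadif (2008) and is not the package's internal implementation.

## Sketch of the closed-form M-step for the Bernoulli latent block model.
## tik: n x g matrix of row class probabilities t_ik
## rjl: d x m matrix of column class probabilities r_jl
## x:   n x d binary data matrix
binary_mstep <- function(x, tik, rjl) {
  y     <- t(tik) %*% x %*% rjl             # y_kl = sum_ij t_ik r_jl x_ij
  denom <- colSums(tik) %o% colSums(rjl)    # t_.k * r_.l
  phat  <- y / denom                        # block Bernoulli parameters p_kl
  a     <- ifelse(phat > 0.5, 1, 0)         # block centers a_kl
  eps   <- pmin(phat, 1 - phat)             # block dispersions epsilon_kl
  list(a = a, epsilon = eps,
       pi = colMeans(tik), rho = colMeans(rjl))
}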

Contingency data
To apply the latent block mixture model to contingency data, it is assumed that, for each block (k, ℓ), the values x_ij are distributed according to a Poisson distribution P(µ_i ν_j γ_kℓ), where the Poisson parameter is split into µ_i and ν_j, the effects of row i and column j respectively, and γ_kℓ, the effect of block (k, ℓ) (Govaert and Nadif 2010). Then we have

f_kℓ(x_ij; α) = exp(−µ_i ν_j γ_kℓ) (µ_i ν_j γ_kℓ)^{x_ij} / x_ij!,

where α = (µ, ν, γ) with µ = (µ_1, . . . , µ_n), ν = (ν_1, . . . , ν_d) and γ = (γ_11, . . . , γ_gm).
Unfortunately, this parameterization is not identifiable. It is therefore not possible to estimate µ_i, ν_j and γ_kℓ simultaneously. The following two solutions are considered to tackle this problem:
• µ_i and ν_j are known: in this case, it only remains to estimate the matrix γ = (γ_kℓ).
• Constraints are imposed on the parameter θ = (π, ρ, µ, ν, γ): in this situation, the following constraints appear to make the model identifiable, although this has not yet been proved.
The transformation of the parameter θ into a parameter θ′ defined by π′_k = π_k and ρ′_ℓ = ρ_ℓ shows that it is possible to parameterize the model with the constraints Σ_k π_k γ_kℓ = Σ_ℓ ρ_ℓ γ_kℓ = 1/N (Equation 15), where N is any positive real number.

Case-1:
The row and column effects are known.
In this situation, α = γ = (γ_11, . . . , γ_gm), and f_kℓ(x_ij; γ) can be factored into a term that does not depend on (k, ℓ) and the function g_kℓ(x_ij; γ) = exp(−µ_i ν_j γ_kℓ) (γ_kℓ)^{x_ij}. We can therefore replace the function f_kℓ by the function g_kℓ in the log-likelihoods and in the expressions for t_ik and r_jℓ. Hence, using Equations 10 and 11, the row and column class conditional probabilities can be computed. The restricted complete-data log-likelihood (Equation 16) involves the quantities y_kℓ = Σ_{i,j} t_ik r_jℓ x_ij, and its maximization gives γ_kℓ = y_kℓ / (Σ_i t_ik µ_i Σ_j r_jℓ ν_j).

Case-2:
The row and column effects are unknown.
Using the constraints given by Equation 15, we obtain E(x_i.) = µ_i and E(x_.j) = ν_j. We can therefore propose the margins x_i. and x_.j as estimators of µ_i and ν_j, and take the results of Case-1 to calculate the row and column class conditional probabilities by replacing µ_i and ν_j with x_i. and x_.j respectively. Furthermore, the derivation of γ_kℓ is exactly the same as in the previous case. Using this particular form of γ_kℓ, the calculation of t_ik and r_jℓ is simplified.

Continuous data
In this case, the continuous data are modeled using the unidimensional normal distribution. Hence the density for each block is given by

f_kℓ(x_ij; α) = (1 / √(2π σ²_kℓ)) exp( −(x_ij − µ_kℓ)² / (2σ²_kℓ) ).

The parameters of the model are α = (α_11, . . . , α_gm), where α_kℓ = (µ_kℓ, σ²_kℓ). For this model, the row and column class conditional probabilities follow from Equations 10 and 11. The computation of the parameters α is given by the maximization of the restricted log-likelihood, which in this case amounts to minimizing block-wise weighted least-squares criteria in µ_kℓ and σ²_kℓ. The resulting values of µ_kℓ and σ²_kℓ are the block-wise weighted mean and variance; a sketch of these updates is given below.
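For reference, a sketch of the resulting updates is given below; these are the weighted mean and variance of block (k, ℓ) under the current fuzzy classifications:

\mu_{k\ell} = \frac{\sum_{i,j} t_{ik}\, r_{j\ell}\, x_{ij}}{t_{.k}\, r_{.\ell}},
\qquad
\sigma^2_{k\ell} = \frac{\sum_{i,j} t_{ik}\, r_{j\ell}\, (x_{ij} - \mu_{k\ell})^2}{t_{.k}\, r_{.\ell}}.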

Block clustering with package blockcluster
blockcluster is a newly developed R package for co-clustering of binary, contingency and continuous data. The core library is written in C++, and the blockcluster API acts as a bridge between the C++ core library and the R statistical computing environment. A brief introduction of the C++ library is given in Appendix A. In the subsequent sections, we provide a detailed explanation of the various functionalities and utilities of the package, with the help of examples wherever possible.

Strategy for parameters estimation
Correct parameter estimation in EM based algorithms has always been a challenging problem due to the presence of local optima. In the case of co-clustering algorithms, this problem can be even more pronounced, and the final results are directly affected by the choice of strategy. As explained in the previous sections, we have two block clustering algorithms, namely BEM and BCEM, and the corresponding ways of running them in our library are named "XEMStrategy" and "XCEMStrategy" respectively. Both strategies are equivalent except that they run different algorithms. The various steps involved in "XEMStrategy" (equivalently in "XCEMStrategy") are:
1. Set the pseudo log-likelihood L_XEM to the minimum possible value.
2. Run the BEM algorithm a pre-defined number of times (we denote this number by nbxem in our package) as follows:
(a) Set the pseudo log-likelihood L_xem to the minimum possible value.
(b) Initialize the model parameters.
(c) Run the algorithm with a high tolerance (epsilon) and a few iterations using the initialized parameters, and calculate the current pseudo log-likelihood L_current.
(d) If L_current is greater than L_xem, copy the currently estimated parameters to α_start and set L_xem to L_current.
3. Starting with the parameters α_start, run the algorithm with a low value of epsilon (low tolerance) and a high number of iterations, and calculate the current pseudo log-likelihood L_current.
4. If L_current is greater than L_XEM, copy the currently estimated parameters to α_final and set L_XEM to L_current.
5. Repeat Steps 2 through 3 a pre-set number of times (we denote this number by nbtry in our package) in order to obtain a good estimation of the parameters.
Although the above strategy does not explore all possible solutions, and hence cannot guarantee that the global optimum is reached, it is still a very promising approach in practice. The tuning of the values of nbxem and nbtry needs to be done intuitively, and can have a substantial effect on the final results. A good way to set these values is to run the co-clustering a few times and check whether the final log-likelihood is stable. If not, one may need to increase these values. In practice, it is better to increase nbxem, as it can lead to better (more stable) results without increasing the running time too much.
In the package blockcluster, we have a function called cocluststrategy which allows one to set the values of various input parameters, including nbxem and nbtry. In the following example, we obtain an object defaultstrategy of S4 class 'strategy' by calling the function cocluststrategy without any arguments, and then call the overloaded summary function to see the various default values.
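A minimal session along these lines is sketched below; the exact set of default values printed by summary depends on the installed package version.

R> library("blockcluster")
R> defaultstrategy <- cocluststrategy()
R> summary(defaultstrategy)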

Model initialization
Initializing the model parameters is arguably the most important step in obtaining a good final estimate of the model parameters. The three most important initialization strategies for EM algorithms are:
• Random: In this method, the model parameters are initialized randomly using the input data. One may run several random initializations and choose the one which gives the best log-likelihood.
• CEM: The model parameters are first initialized using random values by making use of the input data, and then the CEM algorithm (Celeux and Govaert 1992) is run for a small number of iterations with a high tolerance. The CEM algorithm directly optimizes the completed log-likelihood by incorporating a classification step, based on the MAP, between the E and M steps. In general, the CEM algorithm converges very quickly (not necessarily to a good optimum), which is an advantage as little time is spent in initialization compared to the overall run time.
• FuzzyCEM (Stochastic EM): This method is equivalent to CEM except that, instead of classifying samples into clusters based on the MAP, the class of each sample is drawn stochastically from the corresponding class probability distribution.

Unfortunately, in blockcluster, we do not have the freedom to initialize every model with all of the methods above. Instead, for every model we currently have only one possible initialization method (mainly CEM), as shown in Table 1. Hence the initialization strategy is selected automatically by the package based on the input model; this is the reason why there is no default initialization strategy. Providing as many initialization strategies as possible for the various models is high on the priority list for future work.

Binary data
We have simulated binary data using the data generation process explained in Section 2 and with parameters given in Table 2. The data consist of 1000 rows (samples) and 100 columns (variables) with two clusters on rows and three clusters on columns.
The following R commands show how to load the package, process the data and visualize/summarize the results using blockcluster.
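A minimal sketch of such a session is given below; the dataset name binarydata and the entry point coclusterBinary() are assumptions based on the package documentation and may differ between package versions.

R> library("blockcluster")
R> data("binarydata", package = "blockcluster")
R> out <- coclusterBinary(binarydata, nbcocluster = c(2, 3))
R> summary(out)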

R> plot(out, asp = 0)
To plot the various block distributions (Figure 2(b)), the following R command is used with type set to "distribution" (type is "cocluster" by default).
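For instance, assuming the object out from the example above:

R> plot(out, type = "distribution")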

Contingency data
In this case, we simulated contingency data as explained earlier in Section 2 and using the parameters given in Table 3. Again the data dimensions are 1000 × 100 with two clusters on rows and three clusters on columns.
The following R commands load the simulated contingency data into the R environment, co-cluster the data, summarize the results and show the graphics, respectively.
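A sketch of such a session follows; the dataset name contingencydataunknown and the function coclusterContingency() are assumptions based on the package documentation.

R> data("contingencydataunknown", package = "blockcluster")
R> out <- coclusterContingency(contingencydataunknown, nbcocluster = c(2, 3))
R> summary(out)
R> plot(out)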

Continuous data
In this final example, we generated continuous data using parameters given in Table 4. The dataset again contains 1000 rows and 100 columns with two clusters on rows and three clusters on columns.
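A corresponding sketch for the continuous case is given below; the dataset name gaussiandata and the function coclusterContinuous() are assumptions based on the package documentation.

R> data("gaussiandata", package = "blockcluster")
R> out <- coclusterContinuous(gaussiandata, nbcocluster = c(2, 3))
R> summary(out)
R> plot(out)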
Co-clustering has not been exploited in this context so far to the best of our knowledge. Hence it seems this offers a new opportunity for co-clustering.

Document clustering
Document clustering is yet another data mining technique where co-clustering seems to be very useful, as demonstrated in Dhillon (2001). Here we run our package on one of the datasets used in Dhillon (2001), which is publicly available at ftp://ftp.cs.cornell.edu/pub/smart. We mix the datasets Medline (1033 medical abstracts) and Cranfield (1398 aeronautical abstracts), leading to a total of 2431 documents. Furthermore, we use all the words (excluding stop words) as features, making a total of 9275 unique words. The data matrix consists of words on the rows and documents on the columns, with each entry giving the term frequency, that is, the number of occurrences of the corresponding word in the corresponding document. To run our package, we assume that the term frequency follows a Poisson distribution. Hence we can apply the model "pik_rhol_unknown", available in our package for contingency (Poisson family) datasets with unknown row and column effects. Table 5 shows the confusion matrix and compares our results with the classical bipartite spectral graph partitioning algorithm of Dhillon (2001); we obtained 100 percent correct classification. Figure 6 depicts the 2 × 2 checkerboard pattern in the data matrix, hence confirming the more frequent occurrence of a particular set of words in one document cluster and vice versa. Please note that the data matrix images are extremely sparse (data points almost invisible) and have been processed using simple image processing tools for visualization purposes only.

Figure 6: Original data matrix with words on rows and documents on columns (a), and checkerboard pattern in the words-by-documents matrix obtained after performing co-clustering (b).
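To reproduce such an analysis with blockcluster, a run along the following lines could be used; this is a sketch in which the object termdoc, holding the words-by-documents term-frequency matrix, is hypothetical, and the function and argument names are assumptions based on the package documentation.

R> out <- coclusterContingency(termdoc, nbcocluster = c(2, 2),
+    model = "pik_rhol_unknown")
R> summary(out)
R> plot(out)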

Conclusion and future directions
This article presents our newly developed R package for co-clustering algorithms based on latent block models. In the first part of the article, we have presented a brief mathematical foundation of co-clustering algorithms based on latent block models. In the second part, we have given details on the strategy adopted to run these algorithms. Furthermore, we have completed the presentation of the blockcluster package by providing examples on simulated datasets for each data type. Finally, we have shown the applicability of our software on real datasets. As the package facilitates co-clustering for all types of data, it provides a generic framework and a one-stop solution for a wide variety of co-clustering applications.
We are currently working on new initialization strategies as well as model selection criteria, which should be part of upcoming releases of the package. The next release of the package will provide support for categorical models and a Gibbs EM algorithm (Keribin, Brault, Celeux, and Govaert 2015) for parameter estimation, as well as semi-supervised co-clustering in which some row and/or column classes are provided by the user.

A. The core library: Structure and usage
In this paper we have given a detailed account of the usage of the R package blockcluster to estimate the parameters of latent block models, but behind the scenes the computational kernel lies in the C++ core library. The purpose of this section is only to briefly introduce this kernel, partly for developers who might be interested in integrating the library with their own code; we highly recommend using the R package blockcluster for research purposes, as many utilities (for instance the graphics) are absent from the C++ code.

A.1. Library structure
The C++ library is based on the well-known strategy design pattern (Freeman, Robson, Bates, and Sierra 2004), and its (not fully complete) architecture is shown in Figure 7. The choice of architecture is motivated by the fact that it allows the library to be extended easily (accommodating future research) with as few changes as possible to the existing code.
Hence it follows the famous design principle "open for extension but closed for modification" (Freeman et al. 2004). Furthermore, we have used OpenMP (Dagum and Menon 1998) to provide a parallel implementation of the software by exploiting data and task parallelism in latent block model estimation. By default, the library will make use of multiple cores on machines that support OpenMP.

A.2. Usage
In our library, we have provided a simple main function (to illustrate its usage) that can be run from the console as follows.
$ coclustmain -f path/to/optionfile.txt

coclustmain is a binary executable file which takes the path to a text file as a command line argument. A sample input text file is shown below.
[InputOptions]
DataType = Binary
nbrowclust = 2
nbcolclust = 3
DataFileName = Data/BinaryData/testbinarypikrholepskl.dat
nbinititerations = 10
initepsilon = .01
nbtry = 1
nbxem = 1
epsilon_xemstart = .00001
epsilon_xem = .00000001
nbiterations_xemstart = 50
nbiterations_xem = 50
epsilon_int = 0.01

This file contains all the input options needed to run the library. The only mandatory inputs are the data type, number of row/column clusters and the path to the data file. As the reader may have already guessed, this file is more or less similar to the blockcluster package function cocluststrategy() where the optional parameters in the sample text file are equivalent to arguments of the function. By looking at the coclustmain source file, one can easily understand how to integrate the core library with their own C++ codes.