CoClust : A Python Package for Co-Clustering

Co-clustering (also known as biclustering) is an important extension of cluster analysis: it simultaneously groups the objects and the features of a matrix, resulting in row and column clusters that are both more accurate and easier to interpret. This paper presents the theory underlying several effective diagonal and non-diagonal co-clustering algorithms, and describes CoClust, a package which provides implementations of these algorithms. The quality of the results produced by the implemented algorithms is demonstrated through extensive tests performed on datasets of various size and balance. CoClust has been designed to complement, and easily interface with, popular Python machine learning libraries such as scikit-learn.


Introduction
In the era of data science, clustering various kinds of objects (documents, genes, customers) has become a key activity, and many high-quality packaged implementations are provided for this purpose by popular software such as the base package stats for R (R Core Team 2019), skmeans (Hornik, Feinerer, Kober, and Buchta 2012), kernlab (Karatzoglou, Smola, Hornik, and Zeileis 2004), NbClust (Charrad, Ghazzali, Boiteau, and Niknafs 2014), CLUTO (Karypis 2003), scikit-learn (Pedregosa et al. 2011), SciPy (Jones, Oliphant, Peterson et al. 2001, including the scipy.cluster module), nltk (Bird, Klein, and Loper 2009, with the nltk.cluster module), and Weka (Hall, Frank, Holmes, Pfahringer, Reutemann, and Witten 2009). A natural extension of standard cluster analysis is co-clustering, where objects and features are simultaneously grouped into meaningful blocks called co-clusters or biclusters, thus making large datasets easier to handle and interpret. In fact, since the seminal work of Hartigan (1972), co-clustering has found applications in many areas such as bio-informatics (Cheng and Church 2000; Madeira and Oliveira 2004; Van Mechelen, Bock, and De Boeck 2004; Tanay, Sharan, and Shamir 2005; Cho and Dhillon 2008; Gupta and Aggarwal 2010; Hanczar and Nadif 2010) and web mining (Xu, Zong, Dolog, and Zhang 2010). Several families of co-clustering methods have been proposed, including spectral methods, methods based on matrix factorization, probabilistic (model-based) methods, and modularity-based methods, which co-cluster binary or contingency matrices by maximizing an adapted version of the modularity measure traditionally used for networks. Another dimension for characterizing co-clustering algorithms is to distinguish between general partitional co-clustering algorithms and those that seek to discover a diagonal structure (which is displayed to the user in the form of a block-diagonal matrix). In the former case, the requested number of row clusters can be different from the requested number of column clusters, while in the latter the two numbers obviously have to be the same in order to produce a diagonal structure.
Diagonal algorithms assign each row and each column to exactly one co-cluster. For text mining applications, the attractiveness of this approach lies in its simplicity: Each set of documents is automatically labeled by a set of terms. However, there may be occasions where one may want to associate a set of documents with several sets of terms. In this case, general, non-diagonal co-clustering methods may be more suitable.
Spectral methods and methods based on matrix factorization are fast and lend themselves to simple implementations. However, as we noted in a recent comparative study whose preliminary results can be found in Ailem et al. (2015), they are clearly outperformed, in terms of accuracy, by the other cited methods when co-clustering document-term matrices. Also, as indicated above, spectral co-clustering methods are already available in scikit-learn. Probabilistic co-clustering methods deliver better accuracy, and several open-source implementations already exist in the form of the above-mentioned R packages blockcluster and blockmodels. The CoClust package presented in this paper therefore provides scikit-learn-compatible implementations of two modularity-based methods and one information-theoretic method. It is freely available under a BSD-3 license at https://pypi.python.org/pypi/coclust.
The first two algorithms, CoclustMod (Ailem et al. 2015, 2016) and CoclustSpecMod (Labiod and Nadif 2011), are recent representatives of the family of block-diagonal co-clustering algorithms. These algorithms have several advantages. First, being block-diagonal algorithms, they directly produce interpretable descriptions of the resulting document clusters since each document cluster is directly associated with one term cluster. Second, the experimental results described in Ailem et al. (2015) have shown that these algorithms can adapt to various kinds of matrices (whether they are binary, contingency, weighted or unweighted matrices). In addition to this flexibility, they outperform popular, older block-diagonal algorithms such as the above-mentioned well-known "spectral co-clustering" algorithm.
The third algorithm (CoclustInfo) is based on an information-theoretic approach. Information-theoretic co-clustering has become very popular in the text mining community since the above-mentioned work by Dhillon et al. (2003). Its main benefits are speed of convergence and scalability. CoclustInfo is an implementation of the CROINFO algorithm described in Govaert and Nadif (2013, 2018).
Last but not least, the three algorithms have the advantage of simplicity since they can be implemented using simple alternated iterative procedures where one of the partitions is fixed (say, the column partition) and a better partition of the other set (say, the row partition) is searched.
The outline of the paper is as follows. We first review the theory underlying these algorithms (Section 2) before presenting the interface and API of the module that implements them (Section 3) and conclude with a benchmark that demonstrates the effectiveness of the software compared to other algorithms (Section 4).

Theory
As said in the introduction, co-clustering algorithms simultaneously cluster rows and columns into partitions, and a pair consisting of a row cluster and a column cluster determines a co-cluster, which is a submatrix of the original data matrix. Diagonal algorithms assign each row and each column to exactly one co-cluster, so the rows and columns can be rearranged in such a way that the co-clusters form a kind of diagonal (see for example Figure 1). Non-diagonal algorithms, on the other hand, do not have this restriction, and rearranging the rows and columns according to the partitions found may result in structures such as the one shown in Figure 2.
The CoClust package provides three co-clustering algorithms (two diagonal and one non-diagonal). The first two perform co-clustering by maximizing the modularity of bipartite graphs, while the third one uses the information-theoretic notion of mutual information to define its criterion. Before describing these notions, we first give some general notations that we will use throughout the paper.

General notations
We will consider the partition of the set I of n objects and the set J of d attributes into g non-overlapping clusters, where g is greater than or equal to 2. Let us define an n × g indicator matrix $z = (z_{ik})$ and a d × g indicator matrix $w = (w_{jk})$. The kth row cluster is defined by the set of rows i such that $z_{ik} = 1$. In the same manner, the kth column cluster is defined by the set of columns j such that $w_{jk} = 1$. Finally, X is the matrix used as input to all the methods described in this paper; X can be of any kind provided it is a matrix with non-negative entries (e.g., a graph adjacency matrix, or a document-term matrix, depending on the application domain).
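These notations can be made concrete with a small sketch (an illustration with a hypothetical helper, not part of the CoClust package): the indicator matrices z and w are simply one-hot encodings of label vectors.

```python
import numpy as np

def labels_to_indicator(labels, n_clusters):
    """Build a binary indicator matrix from a vector of cluster labels.

    Row i of the result has a single 1, in the column of the cluster
    that element i belongs to (z_ik = 1 iff element i is in cluster k).
    """
    labels = np.asarray(labels)
    indicator = np.zeros((labels.shape[0], n_clusters), dtype=int)
    indicator[np.arange(labels.shape[0]), labels] = 1
    return indicator

# Row partition of n = 4 objects into g = 2 clusters.
z = labels_to_indicator([0, 0, 1, 1], 2)

# Column partition of d = 3 attributes into the same number of clusters.
w = labels_to_indicator([0, 1, 1], 2)
```

Each row of z (and of w) sums to one, reflecting the fact that every object and every attribute belongs to exactly one cluster.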

Modularity-based, block-diagonal co-clustering
The first family of algorithms implemented in the CoClust package consists of two algorithms (CoclustMod and CoclustSpecMod) that seek an optimal block-diagonal clustering, meaning that objects and features have the same number of clusters and that, after proper permutation of the rows and columns, the algorithm produces a block-diagonal matrix as a result (see Figure 1). In the context of document-term matrices, this co-clustering model has the advantage of directly producing interpretable descriptions of the resulting document clusters.
A notable block-diagonal co-clustering algorithm is the bipartite spectral graph partitioning algorithm described in Dhillon (2001). Inspired by previous work on spectral graph clustering, this algorithm finds the optimal minimum cut partitions in a bipartite document-term graph by computing the second left and right singular vectors of the normalized document-term matrix, thus using a real relaxation of the discrete optimization problem. The block-diagonal algorithms implemented in CoClust follow a completely different approach: They try to maximize a measure of the concentration of edges within co-clusters compared with the random distribution of edges between all nodes regardless of the co-clusters. This criterion is an adaptation to the bipartite case of the standard "graph modularity". Before describing the two algorithms, it is therefore useful to review this notion of "bipartite graph modularity".

Bipartite graph modularity (BGM)
In this section we first review the standard graph modularity measure, and show how to adapt it so that it can be used in the co-clustering context.
Modularity is a quality criterion often used for detecting communities in graphs, and it has received considerable attention in several disciplines since the seminal work by Newman and Girvan (2004). Intuitively, modularity compares the number of edges inside a cluster of nodes with the expected number if the edges in the graph were placed at random. Given a graph G = (V, E), let X be the binary, symmetric adjacency matrix with entries $x_{ii'}$: $x_{ii'} = 1$ if there is an edge between the nodes i and i', and $x_{ii'} = 0$ otherwise. Finding a partition of the set of nodes V into homogeneous subsets leads to the resolution of the following integer linear program: $\max_c Q(X, c)$, where $Q(X, c)$ is the modularity measure

$$Q(X, c) = \frac{1}{2|E|} \sum_{i,i'} \left( x_{ii'} - \frac{x_{i.} \, x_{i'.}}{2|E|} \right) c_{ii'}.$$

In this formula, c is a binary matrix defined by $c_{ii'} = \sum_{k=1}^{g} z_{ik} z_{i'k}$, meaning that $c_{ii'}$ is 1 when nodes i and i' are in the same group and 0 otherwise. In addition, |E| is the number of edges and $x_{i.} = \sum_{i'} x_{ii'}$ is the degree of i.
In summary, we seek a binary matrix c which is defined as $zz^\top$ and models a partition in a relational space, thus having the properties of an equivalence relation: reflexivity ($c_{ii} = 1$), symmetry ($c_{ii'} = c_{i'i}$) and transitivity ($c_{ii'} + c_{i'i''} - c_{ii''} \leq 1$). In a bipartite context, the basic idea is to model the simultaneous row and column partitions using a relation c defined on I × J. Noting that $c = zw^\top$, the general term can be expressed as follows: $c_{ij} = 1$ if object i is in the same block as attribute j and $c_{ij} = 0$ otherwise; then $c_{ij} = \sum_{k=1}^{g} z_{ik} w_{jk}$. Now, given a rectangular matrix X defined on I × J, modularity can be reformulated as follows in the co-clustering context:

$$Q(X, c) = \frac{1}{x_{..}} \sum_{i,j} \left( x_{ij} - \frac{x_{i.} \, x_{.j}}{x_{..}} \right) c_{ij},$$

where $x_{..} = \sum_{i,j} x_{ij} = |E|$ is the total weight of the edges, $x_{i.} = \sum_j x_{ij}$ (the degree of i in the binary case and the sum of the weights in the contingency and continuous cases) and $x_{.j} = \sum_i x_{ij}$ (the degree of j in the binary case and the sum of the weights in the contingency and continuous cases). This modularity measure can also take the following trace form:

$$Q(X, c) = \frac{1}{x_{..}} \operatorname{Trace}\left[ (X - \delta)^\top c \right], \quad \text{where } \delta = (\delta_{ij}) \text{ with } \delta_{ij} = \frac{x_{i.} \, x_{.j}}{x_{..}}.$$

As this objective function is linear with respect to c ($c = zw^\top$), and as the constraints that c must respect are linear equations, the problem can theoretically be solved using an integer linear programming solver. However, this problem is NP-hard, and as a result, in practice, we use heuristics for dealing with large datasets.
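The bipartite modularity of a given co-clustering can be computed directly from its definition, as in the following sketch (an illustration, not the package's internal code):

```python
import numpy as np

def bipartite_modularity(X, z, w):
    """Q(X, c) = (1/x..) * sum_ij (x_ij - x_i. x_.j / x..) * c_ij,
    where c = z w^T encodes the co-clustering."""
    X = np.asarray(X, dtype=float)
    total = X.sum()                     # x..
    row_margins = X.sum(axis=1)         # x_i.
    col_margins = X.sum(axis=0)         # x_.j
    delta = np.outer(row_margins, col_margins) / total  # independence term
    c = z @ w.T                         # c_ij = sum_k z_ik w_jk
    return float(((X - delta) * c).sum() / total)

# A perfectly block-diagonal matrix with two co-clusters.
X = np.array([[2, 0], [0, 2]])
z = np.array([[1, 0], [0, 1]])
w = np.array([[1, 0], [0, 1]])
Q = bipartite_modularity(X, z, w)  # 0.5 for this example
```

For this perfectly separated example the modularity is 0.5, while any co-clustering of a matrix with independent margins would score close to zero.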

CoclustMod: Co-clustering by alternated maximization of BGM
In this section we describe the theory underlying CoclustMod, one of the two block-diagonal algorithms provided by the CoClust package (Ailem et al. 2015, 2016).

Proposition 1. Let X be an (n × d) matrix with non-negative entries and c be an (n × d) matrix defining a block seriation. The modularity measure Q(X, c) can be rewritten as

$$Q(X, c) = \frac{1}{x_{..}} \operatorname{Trace}\left[ (X^w - \delta^w)^\top z \right] = \frac{1}{x_{..}} \operatorname{Trace}\left[ (X^z - \delta^z)^\top w \right],$$

where $X^w = Xw$, $\delta^w = \delta w$, $X^z = X^\top z$ and $\delta^z = \delta^\top z$.

A proof of Proposition 1 can be found in Ailem et al. (2015). This proposition is at the heart of the CoclustMod algorithm since it allows the modularity to be maximized by alternately maximizing $Q(X^w, z)$ and $Q(X^z, w)$. The optimal classification binary matrices $z^*$ and $w^*$ are respectively defined by $z^* = \arg\max_z \operatorname{Trace}[(X^w - \delta^w)^\top z]$ and $w^* = \arg\max_w \operatorname{Trace}[(X^z - \delta^z)^\top w]$. In fact, $Q(X^w, z)$, and hence the modularity, can be maximized by assigning each row i to the cluster k maximizing $(X^w - \delta^w)_{ik}$. The same kind of argument applies for $Q(X^z, w)$, which leads to the different steps presented in Algorithm 1.

Algorithm 1: CoclustMod.
Input: binary or contingency data X, number of clusters g.
Output: partition matrices z and w.
1. Initialize the column partition w.
2. repeat
3.   Compute $X^w = Xw$ and $\delta^w = \delta w$; set $z_{ik} = 1$ if $k = \arg\max_{k'} (X^w - \delta^w)_{ik'}$ and $z_{ik} = 0$ otherwise, $\forall i$.
4.   Compute $X^z = X^\top z$ and $\delta^z = \delta^\top z$; set $w_{jk} = 1$ if $k = \arg\max_{k'} (X^z - \delta^z)_{jk'}$ and $w_{jk} = 0$ otherwise, $\forall j$.
5. until convergence of the modularity
6. return z and w.
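One row-update of this alternated maximization can be sketched as follows (a simplified illustration of the assignment rule, not the package implementation; initialization and the convergence test are left out):

```python
import numpy as np

def update_rows(X, w):
    """One CoclustMod-style row update: assign each row i to the
    cluster k maximizing (X^w - delta^w)_ik, with X^w = X w,
    delta^w = delta w and delta_ij = x_i. x_.j / x.."""
    X = np.asarray(X, dtype=float)
    total = X.sum()
    delta = np.outer(X.sum(axis=1), X.sum(axis=0)) / total
    scores = (X - delta) @ w            # equals X^w - delta^w
    labels = scores.argmax(axis=1)
    z = np.zeros((X.shape[0], w.shape[1]), dtype=int)
    z[np.arange(X.shape[0]), labels] = 1
    return z

# Two row/column blocks; w is the (here, known) column partition.
X = np.array([[5, 5, 0, 0],
              [5, 5, 0, 0],
              [0, 0, 5, 5],
              [0, 0, 5, 5]])
w = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
z = update_rows(X, w)   # recovers the two row clusters
```

The column update is symmetric: apply the same rule to $X^\top$ and the current z.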

CoclustSpecMod: Co-clustering by spectral maximization of BGM
In this section we describe the theory underlying CoclustSpecMod, another block-diagonal algorithm provided by the CoClust package. In the same way as CoclustMod, CoclustSpecMod sees modularity-based co-clustering as a trace maximization problem, but with two important differences, as described in Labiod and Nadif (2011). First, it uses normalized versions of the z and w matrices, and second, it maximizes modularity using a spectral approach, which contrasts with the direct maximization performed by CoclustMod.
The use of a normalized modularity matrix is motivated by the desire to balance the row and column cluster sizes. The z matrix is therefore replaced by $\tilde{z} = z h^{-1/2}$, where h is a diagonal matrix whose kth diagonal element contains the number of elements in the kth row cluster. In the same way, the w matrix is replaced by $\tilde{w} = w f^{-1/2}$, where f is a diagonal matrix whose kth diagonal element contains the number of elements in the kth column cluster. The modularity problem then amounts to the following trace maximization problem:

$$\max_{\tilde{z}, \tilde{w}} \operatorname{Trace}\left[ \tilde{z}^\top (X - \delta) \tilde{w} \right] \quad \text{subject to } \tilde{z}^\top \tilde{z} = I \text{ and } \tilde{w}^\top \tilde{w} = I.$$

This maximization is performed using a spectral approach consisting of the following steps:

1. Scale the modularity matrix.
2. Approximate the scaled matrix using SVD.
3. Use the matrices produced by the SVD decomposition to form a new matrix, then apply a clustering algorithm (e.g., k-means) to cluster the new matrix.

Step 1 is performed as follows. Let $B = X - \delta$ be the bipartite modularity matrix. A scaled version $\tilde{B}$ of this matrix is computed as

$$\tilde{B} = D_r^{-1/2} \, B \, D_c^{-1/2},$$

where $D_r = \operatorname{diag}(X \mathbf{1})$ and $D_c = \operatorname{diag}(X^\top \mathbf{1})$. In Step 2, $\tilde{B}$ is approximated as $\sum_{k=1}^{g-1} \tilde{u}_k \lambda_k \tilde{v}_k^\top$, where the $\tilde{u}_k$ and $\tilde{v}_k$ are derived from the left and right singular vectors $u_k$ and $v_k$ as $\tilde{u}_k = D_r^{-1/2} u_k$ and $\tilde{v}_k = D_c^{-1/2} v_k$. Finally, the matrices $\tilde{U}$ and $\tilde{V}$ formed by these vectors are stacked to form a matrix $Q = [\tilde{U}; \tilde{V}]$, which is given as input to a clustering algorithm such as k-means. The different steps of CoclustSpecMod are presented in Algorithm 2.

Algorithm 2: CoclustSpecMod.
Input: data X, number of clusters g.
Output: partition matrices R and C.
1. Form the affinity matrix X.
2. Define $D_r$ and $D_c$ to be the diagonal matrices $D_r = \operatorname{diag}(X \mathbf{1})$ and $D_c = \operatorname{diag}(X^\top \mathbf{1})$.
3. Compute the scaled bipartite modularity matrix $\tilde{B}$.
4. Compute the first $g - 1$ singular triplets of $\tilde{B}$ and form the matrix $Q = [\tilde{U}; \tilde{V}]$.
5. Cluster the rows of Q into g clusters by using k-means.
6. Assign object i to cluster $R_k$ if and only if the corresponding row of the matrix Q was assigned to cluster $R_k$, and assign attribute j to cluster $C_k$ if and only if the corresponding row of the matrix Q was assigned to cluster $C_k$.
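The scaling and SVD steps can be sketched as follows (an illustrative reconstruction under the scaling stated above, not the package code; the final k-means step is omitted):

```python
import numpy as np

def spectral_embedding(X, g):
    """Form the scaled bipartite modularity matrix, keep its g - 1 leading
    singular triplets, and stack the rescaled singular vectors into the
    matrix Q whose rows are then clustered (here with k-means, omitted)."""
    X = np.asarray(X, dtype=float)
    total = X.sum()
    B = X - np.outer(X.sum(axis=1), X.sum(axis=0)) / total  # modularity matrix
    d_r = X.sum(axis=1)    # diagonal of D_r
    d_c = X.sum(axis=0)    # diagonal of D_c
    B_scaled = B / np.sqrt(np.outer(d_r, d_c))              # D_r^-1/2 B D_c^-1/2
    U, s, Vt = np.linalg.svd(B_scaled)
    U_tilde = U[:, :g - 1] / np.sqrt(d_r)[:, None]          # D_r^-1/2 u_k
    V_tilde = Vt[:g - 1].T / np.sqrt(d_c)[:, None]          # D_c^-1/2 v_k
    return np.vstack([U_tilde, V_tilde])   # Q, of shape (n + d, g - 1)

X = np.array([[5, 5, 0, 0],
              [5, 5, 0, 0],
              [0, 0, 5, 5],
              [0, 0, 5, 5]])
Q = spectral_embedding(X, 2)
```

On this two-block example the single embedding coordinate separates the blocks by sign: rows and columns of the first block share one sign, those of the second block the other.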

Information-theoretic co-clustering
In this section we describe the notions underlying the third algorithm, CoclustInfo, provided by the CoClust package. In contrast to the previously described algorithms, CoclustInfo takes an information-theoretic approach and uses mutual information to define its criterion (Govaert and Nadif 2013, Chapter 4). Another important difference is that this algorithm does not seek to discover a block-diagonal structure like the previously described algorithms. The requested number of row clusters can be different from the requested number of column clusters. A representative example of the kind of matrix obtained when using CoclustInfo is shown in Figure 2.

Initial contingency table and associated joint distribution
Let X be an n × d contingency table, such as the example shown in Table 1 (left). This table can be associated with two categorical random variables I and J whose sample joint probability distribution $P_{IJ} = (p_{ij})$ is represented by an n × d matrix defined by $p_{ij} = x_{ij} / x_{..}$. An example of the probability matrix corresponding to our sample contingency matrix is shown in Table 1 (right). As for other random variables, the association between the two categorical variables I and J can be measured using mutual information. Intuitively, the mutual information between two variables compares the observed frequencies in the data with the expected frequencies under the null hypothesis of no association. The mutual information between the two variables I and J is expressed as

$$I(P_{IJ}) = \sum_{i,j} p_{ij} \log \frac{p_{ij}}{p_{i.} \, p_{.j}}.$$
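The joint distribution and its mutual information can be computed in a few lines (a sketch, not the package's code):

```python
import numpy as np

def mutual_information(X):
    """I(P_IJ) = sum_ij p_ij log(p_ij / (p_i. p_.j)) for the joint
    distribution associated with a contingency table X."""
    P = np.asarray(X, dtype=float) / np.sum(X)          # p_ij = x_ij / x..
    expected = np.outer(P.sum(axis=1), P.sum(axis=0))   # p_i. * p_.j
    mask = P > 0                                        # 0 log 0 = 0 by convention
    return float((P[mask] * np.log(P[mask] / expected[mask])).sum())

# Perfect association between rows and columns: I = log 2.
mi_assoc = mutual_information(np.array([[2, 0], [0, 2]]))

# No association (independent margins): I = 0.
mi_indep = mutual_information(np.array([[1, 1], [1, 1]]))
```

The two extreme cases illustrate the intuition: mutual information is zero under independence and maximal when one variable determines the other.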

Aggregated contingency table and associated joint distribution
In this section, we describe the new contingency table and associated joint distribution that can be derived when simultaneously aggregating the rows and the columns of a contingency table X according to a couple of partitions of the sets I and J. In fact, if z and w are partitions of the set I of the rows and the set J of the columns of X into g clusters and m clusters respectively, then a new two-way contingency table $X^{zw} = (x^{zw}_{k\ell})$, associated with two categorical random variables taking values in the sets K = {1, …, g} and L = {1, …, m}, can be obtained by merging the rows and columns according to the partitions z and w:

$$x^{zw}_{k\ell} = \sum_{i,j} z_{ik} \, w_{j\ell} \, x_{ij}.$$

The distribution that can be associated with z and w is the distribution $P^{zw}_{KL} = (p^{zw}_{k\ell})$ defined by $p^{zw}_{k\ell} = \sum_{i,j} z_{ik} w_{j\ell} p_{ij}$; summing over all k and ℓ shows that $P^{zw}_{KL}$ is indeed a distribution. Moreover, it can be noticed that the row margins $\sum_\ell p^{zw}_{k\ell}$ of this new distribution are equal to $\sum_i z_{ik} p_{i.}$; as they do not depend on the partition w, they will be denoted $p^z_{k.}$. Similarly, the column margins $\sum_k p^{zw}_{k\ell}$ are equal to $\sum_j w_{j\ell} p_{.j}$ and will be denoted $p^w_{.\ell}$. For instance, the aggregation of the rows and columns of the data according to the partitions z = (1, 1, 2, 2, 3, 3) and w = (1, 1, 1, 2, 2) leads to the contingency table $X^{zw}$ and the distribution $P^{zw}_{KL}$ reported in Table 2. Table 2 gives the original distribution $P_{IJ}$ and the distribution $P^{zw}_{KL}$ obtained after aggregating the rows and columns; as can be seen in this example, the two distributions are similar. Applying mutual information to the $P^{zw}_{KL}$ distribution, we obtain the following measure:

$$I(P^{zw}_{KL}) = \sum_{k,\ell} p^{zw}_{k\ell} \log \frac{p^{zw}_{k\ell}}{p^z_{k.} \, p^w_{.\ell}}.$$
One can then express the loss in mutual information incurred when passing from the original probability matrix to the aggregated matrix as

$$I(P_{IJ}) - I(P^{zw}_{KL}),$$

which is the Kullback-Leibler divergence between the two distributions $P_{IJ}$ and $P^{zw}_{KL}$. This difference in mutual information is exactly the criterion minimized by the CoclustInfo algorithm. It has been shown (Govaert and Nadif 2013) that this loss in mutual information can equivalently be expressed as

$$W_I(z, w, \gamma) = I(P_{IJ}) - I(R^{zw\gamma}_{IJ}),$$

where $R^{zw\gamma}_{IJ} = (r^{zw\gamma}_{ij})$ is a distribution depending on the partitions z and w and on a parameter γ, defined by $r^{zw\gamma}_{ij} = p_{i.} \, p_{.j} \sum_{k,\ell} z_{ik} w_{j\ell} \gamma_{k\ell}$. The parameter $\gamma = (\gamma_{k\ell})$ corresponds to a matrix of size (g, m) where each $\gamma_{k\ell}$ plays the role of the centroid of the co-cluster (k, ℓ), with $\gamma_{k\ell} > 0$ for all k, ℓ and $\sum_{k,\ell} p^z_{k.} \, p^w_{.\ell} \, \gamma_{k\ell} = 1$. It can be shown that $R^{zw\gamma}_{IJ}$ is a distribution which, in addition, has the same row and column margins as the $P_{IJ}$ distribution.
Algorithm 3: CoclustInfo.
Input: contingency data X, number of row clusters g, number of column clusters m.
Output: partition matrices z and w.
repeat
  repeat
    Step 1. $z_{ik} = 1$ if $k = \arg\max_{1 \leq k' \leq g} \sum_\ell p^w_{i\ell} \log \gamma_{k'\ell}$ and $z_{ik} = 0$ otherwise, $\forall i$.
    Step 2. $\gamma_{k\ell} = \frac{p^{zw}_{k\ell}}{p^z_{k.} \, p^w_{.\ell}}$, $\forall k, \ell$.
  until convergence
  repeat
    Step 3. $w_{j\ell} = 1$ if $\ell = \arg\max_{1 \leq \ell' \leq m} \sum_k p^z_{kj} \log \gamma_{k\ell'}$ and $w_{j\ell} = 0$ otherwise, $\forall j$.
    Step 4. $\gamma_{k\ell} = \frac{p^{zw}_{k\ell}}{p^z_{k.} \, p^w_{.\ell}}$, $\forall k, \ell$.
  until convergence
until convergence
return z and w.
Before seeing how the $W_I(z, w, \gamma) = I(P_{IJ}) - I(R^{zw\gamma}_{IJ})$ criterion can be optimized in practice, it is worth noting here that it is a generalization of the criterion proposed for the well-known ITCC algorithm (Dhillon et al. 2003). The minimization of the criterion $W_I(z, w, \gamma)$ can be obtained by alternating the three computations: $z = \arg\min_z W_I(z, w, \gamma)$, $w = \arg\min_w W_I(z, w, \gamma)$ and $\gamma = \arg\min_\gamma W_I(z, w, \gamma)$.
More precisely, it has been shown in Govaert and Nadif (2013) that the minimization of $W_I(z, w, \gamma)$ for fixed w and γ is obtained by assigning each row i to the cluster k maximizing $\sum_\ell p^w_{i\ell} \log \gamma_{k\ell}$, where $p^w_{i\ell} = \sum_j w_{j\ell} p_{ij}$. Similarly, in the computation of w, the minimization of $W_I(z, w, \gamma)$ for fixed z and γ is obtained by assigning each column j to the cluster ℓ maximizing $\sum_k p^z_{kj} \log \gamma_{k\ell}$, where $p^z_{kj} = \sum_i z_{ik} p_{ij}$. Finally, for the computation of γ, the solution is given by $\gamma_{k\ell} = \frac{p^{zw}_{k\ell}}{p^z_{k.} \, p^w_{.\ell}}$ for all k, ℓ. These different steps are summarized in the pseudo-code shown in Algorithm 3.
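One z-update of this scheme can be sketched as follows (an illustration, not the package implementation; a small constant guards against log 0 when a co-cluster is empty):

```python
import numpy as np

EPS = 1e-12  # guard against log(0) for empty co-clusters

def update_row_partition(P, z, w):
    """One CoclustInfo-style row update: recompute gamma from the current
    partitions, then assign each row i to the cluster k maximizing
    sum_l p^w_il log(gamma_kl)."""
    p_w = P @ w                         # p^w_il = sum_j w_jl p_ij
    p_zw = z.T @ P @ w                  # aggregated distribution P^zw
    pz = z.T @ P.sum(axis=1)            # row margins p^z_k.
    pw = P.sum(axis=0) @ w              # column margins p^w_.l
    gamma = p_zw / np.outer(pz, pw)     # gamma_kl = p^zw_kl / (p^z_k. p^w_.l)
    scores = p_w @ np.log(gamma + EPS).T
    labels = scores.argmax(axis=1)
    z_new = np.zeros_like(z)
    z_new[np.arange(z.shape[0]), labels] = 1
    return z_new

X = np.array([[5, 5, 0, 0],
              [5, 5, 0, 0],
              [0, 0, 5, 5],
              [0, 0, 5, 5]])
P = X / X.sum()
w = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])   # true column partition
z0 = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])  # deliberately wrong start
z1 = update_row_partition(P, z0, w)              # corrects the row partition
```

Starting from a deliberately wrong row partition, a single update with the correct column partition already recovers the two row blocks.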

Software
The CoClust package provides a set of convenient command-line tools that make it possible to launch a co-clustering task on a dataset simply by providing the suitable parameters. In addition, for Python (Van Rossum and others 2011) developers, it also exposes an API designed to provide seamless integration with the scikit-learn library.

Command-line scripts
The two scripts included in the CoClust package are:
• coclust, which runs a co-clustering algorithm;
• coclust-nb, which provides recommendations to select the best number of co-clusters.

Running a co-clustering algorithm: The coclust script
The coclust script can be invoked from the command line to run an algorithm on a data matrix. It also provides parameters for the evaluation of the results.
The user has to select an algorithm, which is given as the first argument to coclust. The choices are:
• modularity;
• specmodularity;
• info.
These choices correspond to the CoclustMod, CoclustSpecMod, and CoclustInfo algorithms respectively.
The other options that have to be given depend on the algorithm. Some of them are however common to all algorithms:
• those describing the input matrix;
• those used for the evaluation;
• some of the output and algorithm parameters.
The input matrix can be given as a MATLAB (The MathWorks Inc. 2017) file or a text file. For the MATLAB file, the key corresponding to the matrix must be given. For the text file, each line should describe an entry of a matrix with three columns: the row index, the column index and the value. The separator is given by a script parameter.
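The three-column text format can be parsed into a sparse matrix along these lines (a sketch using SciPy, not the script's actual loader; the separator is a parameter, as in the script):

```python
from scipy.sparse import coo_matrix

def load_triplet_text(lines, sep=","):
    """Parse lines of the form 'row<sep>col<sep>value' into a sparse matrix."""
    rows, cols, vals = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        i, j, v = line.split(sep)
        rows.append(int(i))
        cols.append(int(j))
        vals.append(float(v))
    return coo_matrix((vals, (rows, cols)))

sample = ["0,0,2", "0,1,1", "1,1,3"]
X = load_triplet_text(sample).toarray()
```

The resulting COO matrix can be fed directly to any of the co-clustering classes, which accept SciPy sparse input.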
The names of the parameters are given in the sections corresponding to the algorithms.
CoclustMod algorithm. All the options available for the CoclustMod algorithm are summarized below:

Detecting the best number of co-clusters: The coclust-nb script
The coclust-nb script is a command-line script, which takes almost the same arguments as coclust modularity. A summary is given below:

coclust-nb [-h] [-k MATLAB_MATRIX_KEY | -sep CSV_SEP]
           [--output_row_labels OUTPUT_ROW_LABELS]
           [--output_column_labels OUTPUT_COLUMN_LABELS]
           [--reorganized_matrix REORGANIZED_MATRIX]
           [--from FROM] [--to TO] [-m MAX_ITER] [-e EPSILON]
           [--n_runs N_RUNS] [--visu]
           INPUT_MATRIX
The number of co-clusters being unknown, the from and to parameters serve to define the range within which the number has to be searched:

--from=2  minimum number of co-clusters
--to=10   maximum number of co-clusters

For example, for the CSTR dataset, the best number of co-clusters found by the following command (run from the command line) is 4, the same as the real number of clusters:

$ coclust-nb datasets/cstr.csv --seed=1 --n_runs=30 --max_iter=60 --visu

In this example, the best modularity is 0.478638 and is attained for 4 co-clusters. The evolution of the modularity against the number of co-clusters is plotted in Figure 3.

Python API
Each algorithm is implemented in a specific class, all sharing the same methods as those used by scikit-learn, thus making it very easy to integrate them with this library. The following sections contain usage examples for each of the three algorithms. The examples also demonstrate how to load matrices stored in different formats.

CoclustMod usage
The following example shows how to load the CLASSIC3 dataset from a MATLAB file. The MATLAB file is loaded using a function provided by the SciPy library. A matrix is then extracted from the MATLAB dictionary and stored in the variable X. A co-clustering model with 3 co-clusters is then created and receives the X matrix as input. Then, after displaying the maximum modularity value as well as its evolution over the iterations (Figure 4), representations of the obtained term clusters are produced via the plot_cluster_top_terms and get_term_graph functions (Figures 5 and 6). The plot_cluster_top_terms function displays the top terms of each cluster (Figure 5), while get_term_graph extracts the n most frequent terms in a given term cluster along with the k most similar (in terms of cosine similarity) neighbors of each of these most frequent terms.

CoclustSpecMod usage
In this example, the CSTR dataset is imported as a CSV file. The first line of the file is the number of rows followed by the number of columns and the number of clusters the model is fitted with. The other lines are tuples of the form (row number, column number, value). The spectral modularity based model is fitted, and the plot_cluster_sizes function (available in the visualization module) is then used to display the sizes of the document and term clusters (see Figure 7).

CoclustInfo usage
In this example, the CLASSIC3 dataset is imported from a MATLAB file. A model is created and fitted. A graph showing the evolution of the criterion is displayed along with the last γ_kl matrix obtained at the end of the execution. This matrix makes it possible to visually spot the most cohesive co-clusters produced by the algorithm (see Figure 8).

Combined usage
The following example shows how easy it is to run several algorithms on the same dataset and then plot the resulting reorganized matrices in order to have a first visual grasp of what can be expected from the different algorithms. A plot of three different reorganized matrices for the CSTR dataset is shown in Figure 9.
Figure 9: Using the plot_reorganized_matrix utility function to plot three reorganized matrices for the CSTR dataset.

Example of integration with scikit-learn
In the following example, the scikit-learn library is used to import the corpus of documents NG20 (see Section 4.1), select only five categories, and create a document-term matrix. This example shows how easy it is to include an algorithm of the CoClust package in a scikit-learn 'Pipeline'.

Description of datasets
To assess the performance of the three implemented algorithms, we tested them on 8 datasets of different size, sparsity and balance. The characteristics of each dataset are reported in Table 3.
• The CSTR dataset was previously used in Li (2005) and includes the abstracts of technical reports published in the Department of Computer Science of Rochester University. These abstracts were divided into 4 research fields: natural language processing (NLP), robotics/vision, systems, and theory.
• SPORTS is a dataset from the CLUTO toolkit (Karypis 2003), and is the same as that used in Zhong and Ghosh (2005).
• REVIEWS is also a standard dataset used by the CLUTO toolkit.
• WEBACE (Ding and Li 2007) contains news articles partitioned across 20 different topics obtained from the WEBACE project (Han et al. 1998).
• RCV1 (Cai and He 2012) is a subset of a newswire stories corpus made available by Reuters containing 4 categories: C15, ECAT, GCAT, and MCAT.
• Finally, NG20 is the 20 Newsgroups dataset.

In addition to the three algorithms included in the package (denoted as CoclustMod, CoclustSpecMod and CoclustInfo in the experiments), we also included in the comparison the implementations of the two co-clustering algorithms available in scikit-learn, denoted as SpectralBi and SpectralCo in our experiments.

Setup
The experiments were performed on a standard workstation (CPU: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz; Memory: 8192 MB DDR3 @ 1600 MHz). The reported results were obtained by running each algorithm 100 times with random initialization and averaging over the best 50 executions. For SpectralBi and SpectralCo, the default parameter values were used, except of course for the numbers of clusters, which were set to the same values as for the other algorithms. The document-term contingency matrices were used in their original form without any pre-processing or weighting.
To evaluate the performance of the algorithms, we compared the results they generated with the true classes by computing the clustering accuracy, the adjusted Rand index (ARI) and the normalized mutual information (NMI), the number of requested clusters corresponding to the values in the fourth column of Table 3.

Clustering accuracy, denoted Acc, measures the extent to which each cluster contains data points from the corresponding class. It is defined by

$$\mathrm{Acc} = \frac{1}{n} \max_{\sigma} \sum_{k=1}^{g} T(C_k, L_{\sigma(k)}),$$

where the maximum is taken over the one-to-one mappings σ between clusters and classes, $C_k$ is the kth cluster in the final results, $L_\ell$ is the true ℓth class, and $T(C_k, L_\ell)$ is the number of entities that belong to class ℓ and are assigned to cluster k. The greater the clustering accuracy, the better the clustering performance. NMI (Strehl and Ghosh 2003) is estimated by

$$\mathrm{NMI} = \frac{\sum_{k,\ell} \frac{N_{k,\ell}}{n} \log \frac{n \, N_{k,\ell}}{N_k \hat{N}_\ell}}{\sqrt{\left( \sum_k \frac{N_k}{n} \log \frac{N_k}{n} \right) \left( \sum_\ell \frac{\hat{N}_\ell}{n} \log \frac{\hat{N}_\ell}{n} \right)}},$$

where $N_k$ denotes the number of objects contained in the cluster $C_k$ ($1 \leq k \leq g$), $\hat{N}_\ell$ is the number of objects belonging to the class $L_\ell$ ($1 \leq \ell \leq g$), and $N_{k,\ell}$ denotes the number of objects that are in the intersection between the cluster $C_k$ and the class $L_\ell$. The larger the NMI, the better the quality of the clustering. For ARI, we used the implementation provided by scikit-learn.
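The two measures can be sketched in a few lines (an illustration, not the benchmark code; the optimal cluster-to-class mapping in Acc is found with the Hungarian algorithm from SciPy):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def contingency(pred, true):
    """T[k, l] = number of objects in cluster k and class l."""
    T = np.zeros((pred.max() + 1, true.max() + 1), dtype=int)
    for c, t in zip(pred, true):
        T[c, t] += 1
    return T

def accuracy(pred, true):
    """Acc: best one-to-one cluster-to-class mapping (Hungarian algorithm)."""
    T = contingency(pred, true)
    rows, cols = linear_sum_assignment(-T)   # maximize matched counts
    return T[rows, cols].sum() / len(pred)

def nmi(pred, true):
    """NMI (Strehl and Ghosh 2003): I(pred; true) / sqrt(H(pred) H(true))."""
    T = contingency(pred, true).astype(float)
    n = T.sum()
    pk, pl = T.sum(axis=1) / n, T.sum(axis=0) / n
    P = T / n
    mask = P > 0
    mi = (P[mask] * np.log(P[mask] / np.outer(pk, pl)[mask])).sum()
    entropy = lambda p: -(p[p > 0] * np.log(p[p > 0])).sum()
    return mi / np.sqrt(entropy(pk) * entropy(pl))

pred = np.array([1, 1, 0, 0])
true = np.array([0, 0, 1, 1])   # same partition, labels permuted
acc = accuracy(pred, true)      # 1.0: accuracy is permutation-invariant
```

Both measures are invariant to a relabeling of the clusters, which is why the permuted example above still scores perfectly.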

Results and discussion
The results presented in Tables 4, 5 and 6 clearly show that the three CoClust implementations outperform the spectral implementations available in scikit-learn in terms of NMI and clustering accuracy. More precisely, CoclustInfo and CoclustMod perform better than their spectral competitors (including CoclustSpecMod). Also, CoclustMod provides an easy way of estimating the appropriate number of clusters (a feature implemented by the coclust-nb script described in Section 3.1). There is, however, a marked difference in execution time between CoclustInfo and CoclustMod (Table 7). A drawback of CoclustMod is that it has to handle a non-sparse modularity matrix, which is both time and memory consuming. As a consequence, in terms of execution time, CoclustMod is the slowest of the five compared algorithms. In contrast, the implementations available in scikit-learn are very fast but, as already said, have significantly lower accuracy, ARI and NMI scores.
In these tables, the number of row and column clusters requested is specified within parentheses after the dataset name. The datasets and benchmark code can be found in the benchmark directory of the package.

The displayed results are those obtained using the technical environment available as of this writing. They may slightly differ in the future with the evolution of the external resources used in the implementation (e.g., NumPy, SciPy, etc.).

Conclusion
Co-clustering is an important technique in the era of so-called "big data" since it makes it possible to compress large, high-dimensional matrices. However, few tools were available so far for the Python community, and the CoClust package therefore aims at filling this gap. By presenting and contrasting the theory and implementation of two distinct families of co-clustering algorithms (block-diagonal and non-diagonal algorithms), the paper also provides the reader with a representative survey of methods available in the co-clustering field.
Experimental results show that the three implemented algorithms adapt well to datasets of various balance and sparsity and achieve good co-clustering performance in many settings. In particular, they clearly outperform the previously available Python implementations of co-clustering algorithms in terms of result quality.
In the future we plan to include model-based co-clustering algorithms. We will more specifically focus on algorithms based on the Poisson latent-block model (Govaert and Nadif 2018; Ailem, Role, and Nadif 2017a,b; Salah and Nadif 2019), but with extensions of these models to better take into account data sparsity. Adding post-processing tools for facilitating the interpretation of the produced co-clusters is another path for future work.