Unsupervised Clustering and Meta-Analysis Using Gaussian Mixture Copula Models

Methods for clustering in unsupervised learning are an important part of the statistical toolbox in numerous scientiﬁc disciplines. Tewari, Giering, and Raghunathan (2011) proposed to use so-called Gaussian mixture copula models (GMCM) for general unsupervised learning based on clustering. Li, Brown, Huang, and Bickel (2011) independently discussed a special case of these GMCMs as a novel approach to meta-analysis in high-dimensional settings. GMCMs have attractive properties which make them highly ﬂexible and therefore interesting alternatives to other well-established methods. However, parameter estimation is hard because of intrinsic identiﬁability issues and intractable likelihood functions. Both aforementioned papers discuss similar expectation-maximization-like algorithms as their pseudo maximum likelihood estimation procedure. We present and discuss an improved implementation in R of both classes of GMCMs along with various alternative optimization routines to the EM algorithm. The software is freely available in the R package GMCM . The implementation is fast, general, and optimized for very large numbers of observations. We demonstrate the use of package GMCM through diﬀerent applications.


Introduction
Unsupervised learning based on cluster analysis is an important discipline in many fields of science and engineering to detect groups of data with similar properties.Gaussian mixture models (GMMs) are perhaps the most widely used method for model based clustering of continuous data.However, the assumption of jointly normally distributed clusters in GMMs is often violated.Tewari et al. (2011) presented the semi-parametric class of Gaussian mixture copula models (GMCMs) for general clustering in unsupervised learning and highlighted them as a flexible alternative to GMMs when obviously non-normally distributed clusters are present.The attractiveness of the GMCMs is predominantly due to an invariance under all monotone increasing marginal transformations of the variables.This scale invariance of the variables stems from the rank-based nature of copula models and makes the GMCMs highly versatile.
The GMCMs have found some success in applications after Li et al. (2011) independently from Tewari et al. (2011) proposed using a special-case for a non-standard meta-analysis methodology named reproducibility analysis.Their method has been adopted by the EN-CODE project (Bernstein, Birney, Dunham, Green, Gunter, and Snyder 2012, p. 58;The ENCODE Consortium 2011, p. 15) and applied on ChIP (chromatin immunoprecipitation) sequencing data.The meta-analysis approach with GMCMs consists of clustering genes or features that agree on statistical evidence and those that do not.In other words, the features are clustered into a reproducible and an irreproducible group.The flexibility of the GMCMs makes them suitable for meta-analysis of multiple similar experiments.
The work of Li et al. (2011) is especially important in genomics as both data and results are subject to substantial variability due to limited samples sizes, high-dimensional feature spaces, dependence between genes, and confounding technological factors.This high variability has brought into question the reliability and reproducibility of many genomic results (Ioannidis, Ntzani, Trikalinos, and Contopoulos-Ioannidis 2001;Ein-Dor, Zuk, and Domany 2006;Tan, Downey, and Spitznagel 2003).Others, however, argue that the lack of reproducibility is only superficial (Zhang et al. 2008).Together with a rapid evolution of many different highthroughput technologies and vast online repositories of publicly available data, this motivates the need for a robust and flexible meta-analysis toolbox, which can evaluate or aggregate results of multiple experiments even across confounding factors such as differing technologies.
The high flexibility of the GMCMs comes at a cost, however.The likelihood is difficult to evaluate and maximize, partly because of intrinsic identifiability problems as we describe in detail later.We have solved some of the issues and implemented them in the R (R Core Team 2016) package GMCM (Bilgrau, Boegsted, and Eriksen 2016).
Although copula theory is an elegant way of approaching rank-based methods, we present the GMCMs in a more traditional fashion.We refer to the general model of Tewari et al. (2011) simply as the general model or general GMCM and the special case model of Li et al. (2011) is referred to as the special model or special GMCM.
In the following, we present the general GMCM followed by the special GMCM and the derivation of the likelihood function.Subsequently, the key features of package GMCM are presented and compared to the idr package (Li 2014).The technical details of the problematic maximization of the likelihood are then discussed.Finally, our package is evaluated based on different applications before concluding with a discussion of GMCMs.This document was prepared and generated using knitr (Xie 2013), a dynamic report generation tool inspired by Sweave (Leisch 2002), and the R packages Hmisc (Harrell, Jr 2016) and RColorBrewer (Neuwirth 2014).The simulation study was carried out using parallel computing with doMC and foreach (Kane, Emerson, and Weston 2013;Revolution Analytics and Weston 2015;Revolution Analytics 2015).

The general GMCM for clustering
We consider a large p × d matrix [x gk ] of observed values where the rows are to be clustered into m groups.The general GMCM assumes an m-component Gaussian mixture model (GMM) as a latent process, Z = (Z 1 , . . ., Z d ) , with the following distribution GMM: where H ∈ {1, 2, . . ., m} corresponds to the class and α 1 , . . ., α m are the mixture proportions satisfying α h > 0 for h = 1, . . ., m and m h=1 α h = 1.Thus, the latent GMM is parameterized by We denote the joint and kth marginal cumulative distribution functions (CDF) of the GMM by respectively, where Φ and Φ k are the joint and kth marginal CDFs of the multivariate normal distributions, respectively.Analogous equations hold for the joint and marginal probability density functions (PDF) which we denote by lower-case γ and γ k .
Let X = (X 1 , . . ., X d ) be an observation with known marginal CDFs F 1 , . . ., F d and assume the relationship between the observed and the latent variables.By Equation 2 and the probability integral transform the vector U = (U 1 , . . ., U d ) where When F 1 , . . ., F d are known we can derive an expression for the likelihood of this model.For later use we simplify the notation by introducing the vector functions Γ where Θ is the parameter space.The vector function Γ • applies the kth marginal transformation Γ k on the kth entry of the observation and similarly does F • .Again by the probability integral transform, Z is transformed by Γ • into the marginally uniformly distributed random vector U with CDF The PDF c of U is computed by the change of variables theorem or by differentiation of C using the multivariable chain rule.If we abbreviate the notation by not explicitly stating the dependence on the parameters θ, the PDF is given by since the Jacobian matrix J Γ −1 • (u) is diagonal.The CDF C and PDF c are the so-called copula and copula density of the GMM model, respectively (Nelsen 2006).Hence U is distributed according to the Gaussian mixture copula density c, and the observation X is some marginal transformation of U .The model is thus completely specified by GMCM: From this, we see that the GMCM operates on three levels: a latent level Z, a copula level U , and an observed level X. Figure 1 (A-C) illustrates the three levels of a 2-dimensional 3-component GMCM.Here, F • and F −1 • map panel A to B and B to A, respectively.Likewise, Γ • defines the mapping between panels C and B.
To assess the class of an observation, Tewari et al. (2011) proposed using which is the a posteriori probability that the observation was generated from component h.
To decide the class for the observation, the maximum a posteriori (MAP) estimate can be used.That is, the h corresponding to max h (κ h ).

The special-case GMCM for meta-analysis
In the Li et al. (2011) reproducibility analysis, the p × d matrix [x gk ] consists of test statistics or p values testing the same null hypothesis for a large number p of, e.g., genes for each of d ≥ 2 studies.Rows correspond to genes, indexed by g, and columns to experiments, indexed by k.Without loss of generality, large values are considered to be indicative of the alternative hypothesis.A prototypical example in genomics is a matrix of transformed p values for the hypothesis of no differential expression of genes between treatment and control groups for two or more experiments.The task is here to determine which genes g are jointly significant in all experiments.Ordinary meta-analysis methodologies involve combining confidence intervals of effect sizes, test statistics, or p values in a row-wise manner and assessing the significance whilst controlling for the number of false positives (Owen 2009).Li et al. (2011) proposed a special case of Equation 4 with m = 2 components corresponding to whether the null or alternative hypothesis is true, where h = 1 corresponds to spurious signals and h = 2 to genuine ones.Hence α 1 and α 2 = 1 − α 1 are the fraction of spurious and genuine signals, respectively.Li et al. (2011) further imposed the following constraints on the parameters: µ 1 = 0 d×1 = (0, 0, . . ., 0) , and where ρ ∈ [−(d − 1)  Li et al. (2011) defined the local irreproducibility discovery rate of an observation as analogously to the local false discovery rate (lFDR) of Efron (2004Efron ( , 2005Efron ( , 2007)).Notice, that Equation 5 coincides with Equation 8 for the special model.As the multiple testing problem is present when more observations are obtained, an adjusted irreproducibility discovery rate was also defined by Li et al. (2011): where I α = {u | idr(u) < α}, i.e., the probability of a gene being non-reproducible while in the rejection region.The adjusted IDR(α) relates to idr in the same manner as the marginal false discovery rate (mFDR) relates to the lFDR.

The GMCM likelihood function
Suppose we have observed p i.i.d.samples x 1 = (x 11 , . . ., x 1d ), . . ., x p = (x p1 , . . ., u pd ) from Equation 4which can be arranged into the observation matrix introduced in Section 2.1.From these, the marginal uniform variables u 1 = F • (x 1 ) = (u 11 , . . ., u 1d ), . . ., u p = F • (x p ) = (u p1 , . . ., u pd ) are computed and are independent and identically distributed according to the copula density of Equation 3. The log-likelihood is thus given by since the Jacobian arising from transformation F • is not dependent on θ (and thus constant when optimizing with respect to θ).
In practice, F 1 , . . ., F d are unknown and estimated by the empirical CDF Hence the pseudo-observations of u gk are plugged into the log-likelihood and the maximizing parameters are determined.However, since p is large, The GMCM is rank-based since plugging a variable into its empirical CDF corresponds to a particular ranking scheme in which the lowest value is awarded rank 1 and ties are given their largest available rank.To avoid infinities in the computations ûgk is rescaled by the factor p/(p + 1).
The usage of ûgk violates the assumption of independent observations as the ranking introduces dependency between the observations.The introduced dependency is arguably negligible when p is large.We ignore this problem and refer to Chen, Fan, and Tsyrennikov (2006) and the references therein for a more detailed discussion of this problem which is common to all copula model estimation procedures.

Package overview
The GMCM package currently has 14 user visible functions of which the majority are for convenience.The functions are presented in Table 1 and the GMCM reference manual.Two different parameter formats are used depending on if the special or general model are used.In the general model a specially formatted list of parameters is used, which has the name theta in the function arguments.The function rtheta generates such a prototypical list with random parameters and is.theta conveniently tests if the argument is properly formatted.If the special model is to be used, the required parameters are simply given in a numeric vector (α 1 , µ, σ, ρ) of length 4, with name par in the arguments.The functions meta2full and full2meta provide the easy conversion between the general theta and the special par format.
The most important functions are fit.full.GMCM and fit.meta.GMCM.They allow to fit the general and special GMCMs, respectively.The method argument of these functions specifies the optimization routine to be used.If the general model is used get.probreturns a matrix of posterior probabilities κ gk as defined in Equation 5.In the special model, the function get.IDR is used to compute local idr (i.e., the posterior probability of belonging to the irreproducible component) and adjusted IDR values.
The functions SimulateGMMData and SimulateGMCMData allow to simulate observations from the models specified in Equations 1 and 4, respectively.
Beside the following tutorial, a small usage example of the special model can also be run using help("GMCM").All simulations and computations were carried out on a regular laptop (1.7 GHz Intel Core i5, 4GB DDR3 RAM).

Using the package
We proceed with a small tutorial to present the use of the package.As an illustration, we load the package and simulate 10,000 observations from a 2-dimensional 3-component GMCM with randomly chosen parameters: The sim object is a list containing the matrix of the realized latent process (sim$z), the matrix of true realizations from the GMCM density (sim$u), the formatted parameters (sim$theta), and the component from which each observation is realized (sim$K).Figure 2 shows the realized data.
Subsequently, we select a starting estimate from the data, fit the ranked observed data using Nelder-Mead ("NM"), and compute the posterior probabilities of each observation belonging to each component: Fast evaluation of the multivariate Gaussian PDF.
Table 1: Overview of the user visible functions and their purpose in approximate order of importance.Please consult the documentation (e.g., help("Uhat")) for function arguments and return types.The relevant equations are indicated right-justified.variances can be computed.The correlations in all components are taken to be zero.This usually provides reasonable initial parameters.Objections could be raised against using such a procedure on the rank and not the latent level.However, as we are only interested in the relative position of the components, this often provides as reasonable starting parameters.The function fit.full.GMCM does the actual optimization of the likelihood to arrive at the maximum likelihood estimate (MLE).The default Nelder-Mead (NM) procedure converged in 499 iterations in about 6 seconds.
In serious applications the starting values should be chosen carefully and the algorithm ought to be started from different positions of the parameter space to investigate the stability and uniqueness of the MLE.The estimate with the largest likelihood should then be returned.
The confusion matrix obtained when comparing the GMCM clustering results to the true class labels, see Table 2, indicates an accuracy of 99.4%.In (unfair) comparison, the kmeans algorithm has an accuracy of 90.3%. Figure 3 shows the clustering results and simple model checks by simulating from the fitted parameters.Though a high clustering accuracy is achieved, we see from the model check in Figure 3B compared to Figure 2A that the underlying parameters are not really identifiable.However, we see from panel C, that the fitted parameters model the observed ranks closely and thus provide a high predictive accuracy.

Runtime and technical comparison
For the special model, the GMCM package allows for an arbitrary number of dimensions (or experiments) d to be included whereas the idr package only supports d = 2. (PEM) algorithm compared to the idr package.The optimization procedures such as Nelder-Mead (NM), simulated annealing (SANN), and others which only rely on evaluations of the likelihood further reduce the runtime compared to the PEM.
Run and iteration times for an increasing number of observations are see Table 3 for a simulated dataset with parameters (α 1 , µ, σ, ρ) = (0.7, 2, 1, 0.9).The algorithms were all run with the starting values (0.5, 2.5, 0.5, 0.8).The parameters were chosen such that the idr package does not converge prematurely.
To assess the optimization routines in the idr and GMCM packages, 1, 000 datasets with 10,000 observations were simulated from the special model with parameters θ = (0.9, 3, 2, 0.5).
The special model was fitted to each of the datasets using each of the available routines with random initial parameter values.Figure 4 shows the results from the fitting procedures.
The maximum number of iterations was set to 2,000.The SANN procedure was given 3,000 iterations.
The clusters of parameter estimates away from the true values seen in Figure 4 presumably correspond to local maxima of the likelihood.Hence many of the procedures are fairly often caught in such local maxima.Interestingly, while the estimates of the standard deviation σ and correlation ρ for the PEM algorithm seem to be biased, the algorithm achieved a high clustering accuracy.We also see that the PEM algorithms in GMCM and idr behave quite differently.The maximal number of iterations, 2000, was hit only by the PEM algorithm 274 and 18 times for the idr and GMCM packages, respectively.Also notable is the factor 555 reduction in total runtime between the fastest and slowest fitting procedures.
All warnings produced by the PEM algorithm in idr were equal to "NaNs produced".PEM in GMCM only warned that the maximum number of iterations was reached.The errors produced by SANN and L-BFGS-B seem to arise when the estimates of the covariance matrix became singular.The vast majority of the errors for L-BFGS-B were divergence to nonfinite likelihood values.The unique error thrown by PEM (idr), "missing value where TRUE/FALSE needed", seems to stem from a simple bug.
Considering computational efficiency and robustness, accuracy, and precision of parameter estimates, we chose the Nelder-Mead as the default optimization procedure.

Availability of the package
The GMCM package is open-source and available both from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=GMCM and from the GitHub repository at https://github.com/AEBilgrau/GMCM.git for bug reports as well as easy forking and editing.

Maximizing the likelihood
The optimization of the likelihood function in Equation 10 is non-trivial.There exists no closed form expression for Γ −1 k .Furthermore there are intrinsic problems of identifiability of the GMCM parameters.These problems will greatly affect any estimation procedure.
Both Li et al. (2011) and Tewari et al. (2011) make use of a pseudo EM (PEM) algorithm to find the maximizing parameters.Tewari et al. (2011) use the PEM as a "burn-in" and then switch to a gradient descent algorithm.In both papers the likelihood function of the GMM, GMM , specified by Equation 1, and the estimators for the corresponding EM algorithm are derived.The PEM algorithm then iteratively alternates between estimating pseudoobservations ẑgk = Γ −1 k (û gk ; θ) and subsequently updating θ by an E and M step.While this intuitively is a viable approach, it effectively ignores the Jacobian of Equation 3 as the transformation Γ −1 depends on the parameters θ.In short, the wrong likelihood is thus optimized and a pseudo (or quasi) maximum likelihood estimate is found.This may yield an inefficient optimization routine and biased parameter estimates.This problem of the PEM is appreciated by Tewari et al. (2011).
A fundamental problem with the PEM algorithm is the alternating use of pseudo-observations and parameter updates.The pseudo data is not constant in the GMM which implies no guarantee of convergence nor convergence to the correct parameters.
The PEM algorithm alternates between updating parameter estimates and pseudo data which results in the following log-likelihood values, given in the order of computation.Conventionally, convergence is established when the difference of successive likelihoods is smaller than some > 0. The implementation of Li et al. (2011) through the R package idr determines convergence if where > 0 is pre-specified.However, an increase in successive likelihoods is only guaranteed by the EM algorithm when the (pseudo) data are constant.Since both, the pseudo data and parameter estimates, have changed the above difference can be, and often is to our experience, negative.In the idr package this sometimes happens in the first iteration without warning.Such cases arguably stop the procedure prematurely since any negative difference obviously is smaller than some positive .The EM algorithm only guarantees that the difference is non-negative and thus might be more suitable for determining convergence.
The PEM convergence criterion used by Tewari et al. (2011) corresponds to determining if the difference in successive parameter estimates is sufficiently small while recording the highest observed likelihood estimate.This partly remedies the problem.However, the PEM still inherits the conventional problems of the EM algorithm.It often exhibits slow convergence and offers no guarantee of finding the global optimum.
Our software package GMCM offers fast optimization of both the general and special model.Our implementation of the PEM algorithm supports various convergence conditions.By default, it determines on convergence if and returns the parameters which yield the largest likelihood value.These are not necessarily the parameters obtained in the last iteration.The internal function PseudoEMAlgorithm is called when fit.full.GMCM or fit.meta.GMCM are run with method = "PEM".
Instead of the EM approach, however, we propose optimizing the GMCM likelihood function in Equation 10 using procedures which rely only on likelihood evaluations.To make this a feasible approach considerable effort has been put into evaluating the log-likelihood function of Equation 10 in a fast manner by implementing core functions in C++ using Rcpp and RcppArmadillo (Eddelbuettel 2013;Eddelbuettel and François 2011;François, Eddelbuettel, and Bates 2016).With fast likelihood evaluations the standard optimization procedure optim in R is used with various optimization procedures, such as Nelder-Mead (also known as the amoeba method), simulated annealing, and BFGS quasi-Newton methods.
When the parameters are passed to optim we use various transformations to reformulate the optimization problem as an unconstrained one.We logit-transform the mixture proportions.
In the general model, a Cholesky decomposition combined with a log-transformation is used to ensure positive definiteness of the covariance matrices.In the special model, the variance σ 2 is ensured to be positive due to a log-transform.The restriction on the correlation ρ to be in the interval [−(d − 1) −1 , 1] is guaranteed by an affine and logit function composition.
Additional speed has also been gained by implementing a faster inversion of the marginals Γ k .
Similarly to Li et al. (2011), we linearly interpolate between function evaluations.However, we distribute the default 1,000 function evaluations to the components according to the current estimate of the mixture proportions.The determined number of function evaluations for component h within the kth dimension is then sampled equidistantly in the interval µ hk ± a √ Σ hkk where a = 5 by default.Lastly, the monotonicity of Γ k is used to quickly invert the function by reflection around the identity line.Furthermore, we approximate the mixture CDF Γ k by using the approximation of the error function erf(x) ≈ 1 − (a 1 t + a 2 t 2 + a 3 t 3 ) exp(−x 2 ) where t = 1/(1 + bx) and a 1 , a 2 , a 3 , and b are constants (Abramowitz and Stegun 1970, p. 299;Hastings, Hayward, and Wong 1955).

Identifiability of parameters
The model suffers from unidentifiable parameter configurations.As a consequence of the GMCM invariance to translations only relative distances between the location parameters µ 1 , . . ., µ m can be inferred.We arbitrarily anchor the first component at µ 1 = 0 as a partial solution.To account for scaling invariance, the first component is required to have unit variance in each dimension, that is Σ 1kk = 1 for all k.However, problems of identifiability persist in a number of scenarios.In cases where two or more components in the latent GMM are well-separated from each other the relative distances and component variances are not identifiable for all practical purposes.For example in the special GMCM, the parameter configuration θ = (0.5, 10, 1, 0) will be indistinguishable from (0.5, 100, 1, 0).The ranking destroys all information about the relative variances and distances between the well-separated components.
The clustering might also easily fail when the location and variance parameters for two or more components are similar along the same dimension.Suppose for example that µ 1 = (0, 0), µ 2 = (4, 0), and Σ 1 = Σ 2 = I 2×2 , i.e., the location and variance parameters are equal along the ordinate axis.In such cases, the ranking will create a homogeneous cluster which cannot be easily separated.
Even though the parameters may not be fully uniquely estimable in all cases, the general model can still be an effective clustering algorithm as measured by clustering accuracy.thus the following clustering should be carefully interpreted.If the parameter estimates approach any of the given numbers then the remaining parameters, represented by dots, are effectively non-identifiable.For example in Situation 1, if the mixture proportion α 1 approaches 1 then the remaining parameters can easily diverge as they no longer contribute to the likelihood.In Situation 2 where θ = (0, •, •, ρ) extra caution should be displayed if ρ becomes substantially different from zero as all observations will be deemed reproducible.
While the above corrections somewhat remedy these issues, the three situations can still be observed, especially when data consisting of nearly pure noise is supplied.

Reproducibility of microarray results
In molecular biology, microarrays are often used to screen large numbers of candidate markers for significant differences between case and control groups.Microarrays simultaneously probe the DNA composition or transcribed RNA activity of multiple genes in a biological sample.
The number of probes ranges in the orders of 10,000 to 6,000,000, depending on the specific microarray.
In the study of haematological malignancies it is of biological interest to know how normal B-lymphocytes develop (Lenz and Staudt 2010;Rui, Schmitz, Ceribelli, and Staudt 2011;Küppers 2005).Hence, B-cells from removed tonsil tissue of six healthy donors were sorted and isolated using fluorescence-activated cell sorting (FACS) into five subtypes of B-cells: Naïve (N) B-cells, Centrocytes (CC), Centroblasts (CB), Memory (M) B-cells, and Plasmablasts (PB).As part of the immune response to an infection, the CBs proliferate rapidly and become CCs within the so-called germinal centers (GC).The 6 × 5 samples were profiled with Affymetrix GeneChip HG-U133 plus 2.0 (U133) microarrays (see Bergkvist et al. 2014, for further details).
It is, e.g., of interest to identify which gene expressions have been altered within the GCs from which the CCs and CBs come.We therefore tested the hypothesis of no difference in genetic expression between CC and CB samples against N, M, and PB samples for all the gene expressions present on the U133 array.Since gene profiling technologies are rapidly evolving the experiment was later repeated with new donors and on the newer GeneChip Human Exon 1.0 ST (Exon) microarray.
The 30 samples on the U133 arrays and the 30 samples on Exon arrays were pre-processed and summarized to gene level separately and independently using the RMA algorithm with the Bioconductor (Gentleman et al. 2004) package affy (Gautier, Cope, Bolstad, and Irizarry 2004) using custom CDF files (Dai et al. 2005).This pre-processing resulted in the genetic expression levels of 37,923 probe-sets for the U133 arrays and 19,750 probe-sets for the Exon arrays both annotated with Ensembl gene identifiers (ENSG identifiers).
Each experiment was analyzed separately using a mixed linear model and the empirical Bayes approach using the limma package (Smyth 2004) to test the hypothesis of no differential expression for each gene between the CC + CB and the N + M + PB groups.The tests yield two lists of p values for the U133 and Exon arrays.
The p value lists were reduced to the 19,577 common genes present on both array types and combined into a matrix [x gk ] 19577×2 where x gk is one minus the p value for varying gene expression for gene g in experiment k ∈ {U133, Exon}.
To determine the genes which are differentially expressed in a reproducible way, the special GMCM was fitted with the Nelder-Mead optimization procedure using fit.meta.GMCM.The procedure was started in 3 different starting values and the estimate with the largest loglikelihood was chosen.The best estimate converged in 311 iterations.Subsequently, the local and adjusted IDR values were computed with get.IDR.A total of 3,546 genes (18.1%) were found to have an adjusted IDR value below 0.05 and deemed reproducible.The results are illustrated in Figure 5 along with the parameter estimates.The algorithm successfully picks p values which are high-ranking in both experiments.
If the MAP estimate is used -which corresponds to a local idr value less thană0.5 -then 4, 510 (23%) genes are deemed reproducible.ăThis percentage isăconsistentăwith the estimate of the mixture proportion α 1 = 0.71 of the null component.
Note, since no biological ground truth is available, the accuracy cannot be determined.However, since genes which are not differentially expressed are expected to be irreproducible the accuracy may be high.
For comparison, the number of genes marginally significant at 5% significance level after Benjamini-Hochberg (BH) correction (Benjamini and Hochberg 1995) is 3,968 and 6,713 for the U133 and Exon experiments, respectively.The number of commonly significant genes (i.e., simultaneously significant in both experiments) is 3,140 or 16%.This corresponds to the common approach of using Venn diagrams.
The list of reproducible genes, which can be ranked by their idr values, provides a more accessible list of genes for further biological down-stream analyses than the unordered list of genes obtained by the Venn diagram approach.
The p values from the experiments are available in GMCM via data("u133VsExon", package = "GMCM").

Effects of cryopreservation on reproducibility
Cryopreservation is a procedure for preserving and storing tissue samples by cooling them to sub-zero temperatures.It is convenient for researchers and a crucial component of biobanking.
Cryopreservation is usually assumed by default to alter the biological sample since many cryopreserving substances are toxic, the freezing procedure may damage the sample due to ice crystallization, and it may induce cellular stress response.Fresh is therefore considered favorable to cryopreserved tissue.Few studies have analyzed the effect of the cryopreservation on phenotyping and gene expression.Recently, we studied cryopreservation to gauge the actual impact of the cryopreservation on global gene expression in a controlled comparison of cryopreserved and fresh B-lymphocytes.Similarly to the above, the B-cells were prepared from peripheral blood of 3 individual healthy donors and FACS sorted into 2 × 4 B-cell subtypes, Immature (Im), Naïve (N), Memory (M), and Plasmasblasts (PB).Half of the samples were cryopreserved and thawed prior to the gene expression profiling using the Exon array while the other half was profiled fresh.The resulting data was pre-processed using RMA (see Rasmussen et al. 2014, for further details).As a supplement to ?, we performed a reproducibility analysis using the special model.The analysis nowăpresented hereăwas originally omittedăfrom ?due to our concerns about complexity and the length added to the manuscript.
If cryopreservation has relatively negligible effects on global screenings, then a high reproducibility should be expected for differential expression analyses within the fresh and frozen samples -however only for the true differentially expressed genes.For each probe set, the samples were analyzed using linear mixed models as described in Rasmussen et al. (2014) and the hypothesis of no differential expression between pre (Im + N) and post germinal center (M + PB) cells was tested for both fresh and cryopreserved samples separately to mimic the situations where only fresh or frozen samples are available.The special GMCM was fitted using the resulting absolute value of the test statistics to determine the level of reproducibility of each probe set.Local and adjusted irreproducible discovery rates were computed for all probe sets and this level of reproducibility was discretized into three groups: highly reproducible (IDR g < 0.05, cf.Equation 9), reproducible (idr g < 0.5, cf.Equation 8), and irreproducible (idr g ≥ 0.5).
The best parameter estimate of 40 fits was θ = (α 1 , µ, σ, ρ) = (0.73, 1.08, 1.32, 0.86).The reproducibility analysis deemed 1,667 (8.9%), 1,402 (7.5%), and 15,639 (83.6%) genes highly reproducible, reproducible, and irreproducible, respectively.Figure 6 shows these classifications of the p values for differential expression between pre and post germinal cells for the fresh and frozen samples.The total of 3,069 (16.4%) reproducible probe sets seems quite high and agrees with the estimated mixture proportion of 0.73.Again, the model correctly captures the genes with simultaneously low p values.Recall also that non-differentially expressed genes are expected to be irreproducible and the actual accuracy is thus much higher although it (again) cannot easily be estimated when no biological ground truth is available.
Naturally, one might wonder whether genes changed due to cryopreservation to a large extent are deemed irreproducible.The paired design allowed us to investigate this hypothesis.The hypothesis of no difference in expression between fresh and frozen samples for each gene was therefore tested and the significant BH-adjusted p values at the 5% level are highlighted in Figure 6.The expectation above was then tested using a test for non-zero Spearman correlation between the p values and idr values which yielded a non-significant correlation (ρ = 0.009, p = 0.21).In other words, high evidence for a change between fresh and frozen is not associated with greater irreproducibility (idr).Alternatively, a Fisher's exact test also did not yield a difference in odds (odds ratio = 0.67, 95% CI = (0.36, 1.32), p value = 0.23) of having a BH-adjusted significant change due to cryopreservation in the reproducible group (odds = 48/(15591−48)) compared to the irreproducible (odds = 14/(3055−14)).Thus there is no evidence for an over-representation of the irreproducible genes among the significant ones.We might thus conclude that though some genes change due to cryopreservation, the differential analysis between subgroups to a great extent still yields the same results whether the samples are fresh or frozen.
Lastly, notice that some genes in the lower left corner of Figure 6 (A-C) near the origin are also being deemed reproducible.This is an artifact of the model due to the high correlation of ρ = 0.86 in the reproducible component.
The p values and test scores are available in GMCM via data("freshVsFrozen", data = "GMCM").

Image segmentation using the general GMCM
In computer vision and graphics, image segmentation is useful to simplify and extract features of pictures.To illustrate the flexibility of the model and the computational capability of the The JPEG image can be represented as a 1,447,500×3 matrix where each column corresponds to a color channel in the RGB color space and each row corresponds to a pixel and observation in the GMCM.The values are in this case in the interval [0, 1].
A 3-dimensional, 10-component GMCM was fitted using the PEM algorithm which resulted in the middle image of Figure 7.The segmented colors were chosen using the location estimates μ1 , . . ., μ10 .That is, the three dimensional vector F −1 • (Γ • ( μh ; θ)) ∈ [0, 1] 3 in the RGB space was used as the color of cluster h.Alternatively, the average RGB value of each cluster could be used.
For comparison the 1.4 Mpx image was also segmented with the k-means algorithm.The results are seen at the bottom of Figure 7.The final color given to each cluster corresponded to the mean estimated by the algorithm.
As can be seen, the k-means algorithm and the GMCM yield quite different segmentations and different details of the image are captured.For example, the GMCM seems to capture more details of the bottom of the orange external tank.However perhaps erroneously, the GMCM also clusters the black left edge of the photo together with a light cluster.Which method is superior method depends on the application at hand.We acknowledge that disregarding spatial correlations between pixels is quite naïve.However, this example should illustrate the computational capability of the package to handle large datasets with a high number of clusters.
The package jpeg was used to read, manipulate, and write the JPEG image from R (Urbanek 2014).

Concluding remarks
The software for the gradient descent algorithm used by Tewari et al. (2011) to arrive at a maximum likelihood estimate is written in the proprietary language MATLAB (The Math-Works Inc. 2014) and also not provided as open source software.Hybrid procedures, similar to the one proposed by Tewari et al. (2011), can easily be constructed with the GMCM package.The GMCM package solves some of the previously described issues regarding the maximum likelihood estimation and provides a considerable speed-up in computation times.However, there seems to be no complete remedy for all of the challenges of the GMCMs.As stated, the transformation into uniform marginal distributions by ranking will result in a loss of information about the distance between components that are well separated.
The intrinsic identifiability problems of GMCMs may in practice often not be a big issue.Even though the parameters of the assumed underlying GMM can be difficult to estimate due to the flat likelihood function, the clustering accuracy can still be very high.Furthermore, the actual parameters, except perhaps the mixture proportions, do often not seem to be of particular interest in applications.Hence, the merit of the GMCMs should be measured by predictive accuracy which still remains to be explored.In this respect, we believe that the theoretical and practical properties of the special GMCM and IDR approach should be studied further and compared to common p value combining meta-analyses, such as the methods of Fisher, Stouffer, Wilkinson, Pearson, and others, see, e.g., Owen (2009).Interestingly, and perhaps also of slight concern, it can be seen that the IDR approach would be deemed unreasonable by Condition 1 in Birnbaum (1954) whenever ρ = 0.It is unclear whether the method fulfills properties such as admissibility (Birnbaum 1954) and relative optimality in Bahadur's sense (Littell and Folks 1971).
The simulation study in Section 3.3 revealed relatively many errors thrown by the GMCM package.We are committed to pinpoint the exact sources of the errors and provide fixes in future versions.We suspect the errors encountered are due to divergence of the parameters and should therefore be treated as such.With this in mind we believe that software should fail loudly with an error or a warning when it indeed fails.
In conclusion, the GMCM package provides a fast implementation of a flexible and widely applicable tool proposed for reproducibility analysis and unsupervised clustering.The flexibility and applicability are however gained at the cost of a complicated likelihood function.

Figure 1 :
Figure 1: From left to right the observed, copula (or rank), and latent process are shown.The first and second row of panels illustrate 10,000 realizations from the general and special model, respectively.The component from which the realizations come is indicated by color and point-type.Each dimension in the special model corresponds to an experiment where simultaneously high values in both experiments are indicative of good reproducibility.

Figure 2 :
Figure 2: Panel A shows realizations from the latent process and panel B the corresponding marginally uniformly distributed process.Note, that while B shows the true realizations from the GMCM u g , the ranked observed values ûg are almost visually identical because of the relative large number of observations.

Figure 3 :
Figure 3: In panel A the estimated class labels of the observations are indicated by color and point-type.To allow to check the model panels B and C show 10,000 realizations from the GMM and GMCM using the fitted parameters.

Figure 4 :
Figure 4: Parameter fitting results for the different optimization procedures.From left to right, the first four panels show plots of the fitted parameter estimates.The true parameter values are plotted as vertical lines.Next, the mean clustering accuracy (and standard deviation), total runtime in minutes for all 1000 fits, and the number of warnings and errors are shown.The last panel shows the number of iterations for each fit.The black vertical lines indicate the median.

Figure 5 :
Figure 5: Panel A shows a plot of the scaled ranks of p values for the Exon experiment against the scaled ranks of the p values for the U133 experiment.Presumably, genes located in the upper left or lower right corner of the plots are false positive results in either experiment.Panel B shows the estimated latent GMM process.The fitted parameters shown are used to marginally transform panel A into the B.

Figure 6 :
Figure 6: Results from the reproducibility analysis of cryopreserved samples.Panel A shows the p values p g for the test of no differential expression between the pre-and post germinal center groups for fresh and frozen samples.Panel B shows the corresponding ranked p values ûg , and panel C shows the estimated latent process ẑg .The estimated level of reproducibility for each probe set is color coded according to the legend in panel C. Genes significantly different across fresh and frozen samples are plotted as red squares regardless of the reproducibility level.

Figure 7 :
Figure 7: Top: the original 1.4 Mpx JPEG image of the space shuttle Atlantis' climb to orbit during mission STS-27 in December 1988.Middle: the image segmented into 10 colors by the GMCM.Bottom: the image segmented into 10 colors by the k-means algorithm.Image credit: NASA.

Table 3 :
Runtime comparisons of the idr and GMCM packages with increasing number of observations p.The benchmarked optimization procedures are the pseudo EM algorithm (PEM) and the Nelder-Mead (NM) method.The runtime is given in seconds.The last column shows the relative speed per iteration compared to the fastest procedure.
The GMCM package considerably decreases the per iteration runtime of the pseudo expectation-maximization

Table 4
describes three situations in the special model where the parameter estimates and Situation α 1

Table 4 :
Equivalent optima for data corresponding to pure noise.A dot (•) denotes an arbitrary value.The given values need only to be approximate.