lmdme: Linear Model on Designed Multivariate Experiments in R

The lmdme package decomposes analysis of variance (ANOVA) through linear models on designed multivariate experiments, allowing ANOVA-principal component analysis (APCA) and ANOVA-simultaneous component analysis (ASCA) in R. It also extends both methods with the application of partial least squares (PLS) through the specification of a desired output matrix. The package is freely available from Bioconductor and licensed under the GNU General Public License.ANOVA decomposition methods for designed multivariate experiments are becoming popular in “omics” experiments (transcriptomics, metabolomics, etc.), where measurements are performed according to a predefined experimental design, with several experimental factors or including subject-specific clinical covariates, such as those present in current clinical genomic studies. ANOVA-PCA and ASCA are well-suited methods for studying interaction patterns on multidimensional datasets. However, currently an R implementation of APCA is only available for Spectra data in the ChemoSpec package, whereas ASCA is based on average calculations on the indices of up to three design matrices. Thus, no statistical inference on estimated effects is provided. Moreover, ASCA is not available in an R package.Here, we present an R implementation for ANOVA decomposition with PCA/PLS analysis that allows the user to specify (through a flexible formula interface), almost any linear model with the associated inference on the estimated effects, as well as to display functions to explore results both of PCA and PLS. We describe the model, its implementation and two high-throughput microarray examples: one applied to interaction pattern analysis and the other to quality assessment.


Introduction
Current "omics" experiments (proteomics, transcriptomics, metabolomics or genomics) are multivariate in nature. Modern technology allows us to explore the whole genome or a big subset of the proteome, where each gene/protein is in essence a variable explored to elucidate its relationship with an outcome. In addition, these experiments are including an increasing number of experimental factors (time, dose, etc.) from design or subject-specific information, such as age, gender and linage, and are available for analysis. Hence, to decipher experimental design or subject-specific patterns, some multivariate approaches should be applied, with principal component analysis (PCA) and partial least squares regression (PLS) being the most common. However, it is known that working with raw data might mask information of interest. Therefore, analysis of variance (ANOVA)-based decomposition is becoming popular to split variability sources before applying such multivariate approaches. Seminal works on genomics were that of De Haan et al. (2007) on ANOVA-PCA (APCA) and of Smilde et al. (2005) on ANOVA-SCA (ASCA) models. However, to the best of our knowledge R implementation of APCA is only available for Spectra data, ChemoSpec R package by Hanson (2012). Regarding ASCA, as there is no R package for this model, it can only be used by uploading script-function files resulting from a MATLAB code translation (Nueda et al. 2007). In addition, ASCA only accepts up to three design matrices, which limits its use and makes it difficult. Moreover, coefficient estimations are based on average calculations using binary design matrices, without any statistical inference over them. Here, we provide a flexible linear model-based decomposition framework. Almost any model can be specified, according to the experimental design, by means of a flexible formula interface. Because coefficient estimation is carried out by means of maximum likelihood, statistical significance is naturally given. The framework also provides both capacity to perform PCA and PLS analysis on appropriate ANOVA decomposition results, as well as graphical representations. The implementation is well-suited for direct analysis of gene expression matrices (variables on rows) from high-throughput data such as microarray or RNA-seq experiments. Below we provide an examples to introduce the user to the package applications, through the exploration of interaction patterns on a microarray experiment.

The model
A detailed explanation of ANOVA decomposition and multivariate analysis can be found in Smilde et al. (2005) and Zwanenburg et al. (2011). Briefly and without the loss of generality, let us assume a microarray experiment where the expression of (G 1 , G 2 , . . . , G g ) genes are arrayed in a chip. In this context, let us consider an experimental design with two main factors: A, with a levels (A 1 , A 2 , . . . , A i , . . . , A a ) and B, with b levels (B 1 , B 2 , . . . , B j , . . . , B b ), with replicates R 1 , R 2 , . . . , R k , . . . , R r for each A × B combination levels. After preprocessing steps are performed as described in (Smyth 2004), each chip is represented by a column vector of gene expression measurements of g × 1. Then, the whole experimental data is arranged into a g × n expression matrix (X), where n = a × b × r. In this data scheme, single gene measurements across the different treatment combinations (A i × B j ) are presented in a row on the X matrix, as depicted in Figure 1. An equivalent X matrix structure needs to be obtained for 2D-DIGE or RNA-seq experiments and so forth.  Regardless of data generation, the ANOVA model for each gene (row) in X can be expressed as (1): Where x ijk is the measured expression for "some" gene, at combination "ij" of factors A and B for the k replicate; µ is the overall mean; α, β and α × β are the main and interaction effects respectively; and the error term ε ijk ∼ N (0, σ 2 ). In addition, (1) can also be expressed in matrix form for all genes into (2): Where X l , E matrices are of dimension g ×n and contain the level means of the corresponding l − th term and the random error respectively. However, in the context of linear models X l can also be written as a linear combination of two matrix multiplications in the form of (3): Where B l and Z l are referenced in the literature as coefficient and model matrices of dimensions g × m (l) and n × m (l) , respectively, and m (l) is the number of levels of factor l. The first term is usually called intercept, with B µ = µ and Z µ = 1 being of dimension g × 1 and n × 1, respectively. In this example, all Z l are binary matrices, identifying whether a measurement belongs ("1") or not ("0") to the corresponding factor. In the implementations provided by Smilde et al. (2005) and Nueda et al. (2007), the estimation of the coefficient matrices is based on calculations of averages using the design matrix (up to three design matrices Z α,β,αβ ), to identify the average samples. In theory, these authors fully decompose the original matrix as shown in (1). On the contrary, in this package the model coefficients are estimated, iteratively, by the maximum likelihood approach, using the lmFit function provided by limma package (Smyth et al. 2011). Consequently, three desirable features are also incorporated: 1. Flexible formula interface to specify any potential model. The user only needs to provide: i) The gene expression matrix (X), ii) The experimental data.frame (design) with treatment structure, and iii) The model in a formula style, just as in an ordinary lm R function. Internal model.matrix call, will automatically build the appropriate Z matrices, overcoming the constraint on factorial design size, and tedious model matrix definitions.
2. Hypothesis tests on coefficient B l matrices. A T test is automatically carried out for the s − th gene model, to test whether or not the o − th coefficient is equal to zero, i.e., In addition, an F test is performed to simultaneously determine whether or not all b so are equal to zero.
3. Empirical Bayes correction can also be achieved through the eBayes limma function. It uses an empirical Bayes method to shrink the row/gene-wise sample variances towards a common value and to augment the degrees of freedom for the individual variances (Smyth 2004).
By contrast, De Haan et al. (2007) estimate the main and interaction effects by overall mean subtraction. Hence, genes need to be treated as an additional factor. Meanwhile, in Smilde et al. (2005) and Nueda et al. (2007) implementations, the estimations are obtained on a geneby-gene basis, as in (1). Therefore, in a two-way factor experiment, such as time × oxygen, De Haan's model includes two additional double interactions and a triple interaction, because genes are treated as a factor, unlike the models of Smilde and Nueda.

The decomposition algorithm
The ANOVA model (2) is decomposed iteratively using (3), where in each step the l − th coefficientsB l ,Ê l matrices andσ 2 l are estimated. Then, the particular term contribution matrixX l =B l Z l is subtracted from the preceding residuals to feed the next model, as depicted in (4): Where the hat (" ∧ ") denotes estimated coefficients. In this implementation, the first step always estimates the intercept term, i.e., formula=∼1 in R style, withB µ =μ and Z µ = 1. The following models will only include the l − th factor without the intercept, i.e., formula=∼lth_term-1, where lth term stands for α, β or αβ in this example. This procedure is quite similar to the one proposed by Harrington et al. (2005).

PCA and PLS analyses
These methods explain the variance/covariance structure of a set of observations (e.g., genes) through a few linear combinations of variables (e.g., experimental conditions). Both methods can be applied to the l − th ANOVA decomposed step of (4) to deal with different aspects: PCA concerns with the variance of a single matrix, usually with the main objectives of reducing and interpreting data. Accordingly, depending on the matrix to which it is applied, there are two possible methods: ASCA, when PCA is applied to coefficient matrix,B l , (Smilde et al. 2005); and APCA when PCA is calculated on the residual, E l−1 . The latter is conceptually an ASCA and is usually applied to, X l + E, i.e., the mean factor matrix X l , plus the error of the fully decomposed model E of (1) PLS not only generalizes but also combines features from PCA and regression to explore the covariance structure between input and some output matrices, as described by Abdi and Williams (2010) and Shawe-Taylor and Cristianini (2004). It is particularly useful when one or several dependent variables (outputs -O) must be predicted from a large and potentially highly correlated set of independent variables (inputs). In our implementation, the input can be either the coefficient matrixB l or the residualÊ l−1 . According to the choice, the respective output matrix will be a diagonal O=diag(nrow(B l )) or design matrix O = Z l . In addition, users can specify their own output matrix, O, to verify a particular hypothesis. For instance, in functional genomics it could be the Gene Ontology class matrix as used in gene set enrichment analysis (GSEA) by Subramanian et al. (2005).
When working with the coefficient matrix, the user will not have to worry about the expected number of components in X (rank of the matrix, given the number of replicates per treatment level), as suggested by Smilde et al. (2005), because the components are directly summarized in the coefficientB l matrix. In addition, for both PCA/PLS, the lmdme package (Fresno and Fernández 2013a) also offers different methods to visualize results, e.g., biplot, loadingplot and screeplot or leverage calculation, in order to filter out rows/genes as in Tarazona

Example
In this section we provide an overview of lmdme package by Fresno and Fernández (2013a). The example consists of an application of the analysis of gene expression interaction pattern, where we address: How to define the model, undertake ANOVA decomposition, perform PCA/PLS analysis and visualize the results. From here onwards, some outputs were removed for reasons of clarity and the examples were performed with options(digits=4).

Package overview
The original data files for the first example are available at Gene Expression Omnibus (Edgar et al. 2002), with accession GSE37761 and stemHypoxia package (Fresno and Fernández 2013b) on the Bioconductor website. In this dataset, Prado-Lopez et al. (2010) studied differentiation of human embryonic stem cells under hypoxia conditions. They measured gene expression at different time points under controlled oxygen levels. This experiment has a typical two-way ANOVA structure, where factor A stands for "time" with a = 3 levels {0.5, 1, 5 days}, factor B stands for "oxygen" with b = 3 levels {1, 5, 21%} and r = 2 replicates, yielding a total of 18 samples. The remainder of the dataset was excluded in order to have a balanced design, as suggested by Smilde et al. (2005) to fulfil orthogonality assumptions in ANOVA decomposition. First, we need to load stemHypoxia package to access R objects calling data("stemHypoxia"), which will then load the experimental design and gene expression intensities M.
R> library("stemHypoxia") R> data("stemHypoxia") Once the preprocessing of the experiment data is completed, library("lmdme") should be loaded. This instruction will automatically load the required packages: limma (Smyth et al. 2011) and pls (Mevik et al. 2011). Once the data are loaded, the ANOVA decomposition of Section 2.1 can be carried out using (4) calling lmdme function with the model formula, actual data and experimental design. Model:~time * oxygen Model decomposition: Step Names Formula CoefCols 1 1 (Intercept)~1 1 2 2 time~-1 + time 3 3 3 oxygen~-1 + oxygen 3 4 4 time:oxygen~-1 + time:oxygen 9 The results of lmdme will be stored inside the fit object, which is an S4 R class. By invoking the fit object, a brief description of the data and design used are shown as well as the Model applied and a summary of the decomposition. This data.frame describes the applied Formula and Names for each Step, as well as the amount of estimated coefficients for each gene (CoefCols). At this point, we can choose those subjects/genes in which at least one interaction coefficient is statistically different from zero (F test on the coefficients) with a threshold p value of 0.001 and perform ASCA on the interaction coefficient term, and PLS against the identity matrix (default option).
For visual clarity, xlabs are changed with the "o" symbol, instead of using the rownames(M) with manufacturer ids, and second axis with the expand=0.7 option to avoid cutting off loading labels. In addition, PLS biplot is modified from the default pls behavior to obtain a graph similar to ASCA output (which="loadings"). Accordingly, ylabs is changed to match the corresponding coefficients of the interaction term and var.axes is set to TRUE. The ASCA biplot of the first two components (see Figure 2(a)), explain over 70% of the coefficient variance. The genes are arranged in an elliptical shape. Thus, it can be observed that some genes tend to interact with different combinations of time and oxygen. A similar behavior is observed in PLS biplot of Figure 2(b).
The interaction effect on the fit object can also be displayed using the loadingplot function (see Figure 3). For every combination of two consecutive levels of factors (time and oxygen), the figure shows an interaction effect on the first component, which explains 50.61% of the total variance of the "time:oxygen" term.
R> loadingplot(fit, term.x="time", term.y="oxygen") In the case of an ANOVA-PCA/PLS analysis, the user only needs to change the type="residuals" parameter in the decomposition function and perform a similar exploration.