Supervised Multiblock Analysis in R with the ade4 Package

This paper presents two novel statistical analyses of multiblock data using the R language. It is designed for data organized in ( K +1) blocks (i.e., tables) consisting of a block of response variables to be explained by a large number of explanatory variables which are divided into K meaningful blocks. All the variables – explanatory and dependent – are measured on the same individuals. Two multiblock methods both useful in practice are included, namely multiblock partial least squares regression and multiblock principal component analysis with instrumental variables. The proposed new methods are included within the ade4 package widely used thanks to its great variety of multivariate methods. These methods are available on the one hand for statisticians and on the other hand for users from various ﬁelds in the sense that all the values derived from the multiblock processing are available. Some relevant interpretation tools are also developed. Finally the main results are summarized using overall graphical displays. This paper is organized following the diﬀerent steps of a standard multiblock process, each corresponding to speciﬁc R functions. All these steps are illustrated by the analysis of real epidemiological datasets.


Introduction
Multivariate methods are widely used to summarize data tables where the same experimental units (individuals) are described by several variables. In some circumstances, variables are divided in different sets leading to multiblock (or multitable) data stored in K different tables (X 1 , . . . , X K ). In this case, each data table provides a typology of individuals based on different descriptors and several statistical techniques have been developed and implemented in Figure 1: Examples of (K + 1) multiblock data.
R (R Core Team 2018) to identify the similarities and discrepancies between these typologies (e.g., Dray, Dufour, and Chessel 2007;Lê, Josse, and Husson 2008). A more complex situation is encountered when Y is a block of several variables to be explained by a large number of explanatory variables organized in K blocks (X 1 , . . . , X K ). This (K + 1) multiblock data are found in various fields including process monitoring (e.g., Kourti 2003), chemometrics (Kohonen, Reinikainen, Aaljoki, Perkio, Vaananen, and Hoskuldsson 2008), sensometrics (Måge, Menichelli, and Naes 2012), social sciences, market studies, ecology (Hanafi and Lafosse 2001) or epidemiology (Bougeard, Lupo, le Bouquin, Chauvin, and Qannari 2012). Some examples of multiblock data are given in Figure 1. In community ecology, the environmental filtering hypothesis suggests that species abundances would be filtered hierarchically, first by large-scale environmental factors (e.g., climate), and subsequently by landscape structure and fine-scale environmental factors. In veterinary epidemiology, the expression of an animal disease could be related to different factors including feeding practices, hygiene, husbandry practices or treatments.
Multiblock methods preserve the original structure of the data and thus allows one to analyze the (K + 1) tables simultaneously. They can be used to select explanatory variables in the datasets (X 1 , . . . , X K ), generally numerous and quasi-collinear, that are strongly related with the dependent variables in Y. After this selection step, multiblock methods can also be used to identify the complex links between explanatory and dependent tables both at the variable and block levels. Although methods designed for the analysis of (K +1) tables have been available for a few years, with a straightforward and single eigensolution, there are few published applications. The main reason for this lack of interest is probably the poor availability of free statistical software implementing these methods. Multiblock partial least squares regression is only available in the free Multi-Block Toolbox (Van den Berg 2004) of the commercial software package MATLAB (The MathWorks, Inc. 2015). However, this method is designed for more complex multiblock data (K explanatory datasets to explain K dependent ones) and there is no proof of the convergence of the associated iterative algorithm. In R (R Core Team 2018), no methods are implemented to deal with (K + 1) tables and two unsatisfactory solutions can be envisaged. A first alternative is to ignore the multiblock structure of the explanatory variables so that a two-table technique can be applied. For instance, partial least squares regression (pls package; Mevik and Wehrens 2007) or redundancy analysis (pcaiv function in ade4;  can be used to study the link between the merged explanatory dataset X and the dependent table Y. On the other hand, the user may also use methods developed for more complex data structures, such as partial least squares path modeling (plspm package; Sanchez, Trinchera, and Russolillo 2017). However, this method is not specifically designed for a (K + 1) structure and the iterative algorithm convergence is only practically encountered but no formal proofs have been provided (Henseler 2010).
We implemented new statistical and graphical functionalities to analyze multiblock (K + 1) data in the ade4 package for R. This package provided classes, methods and functions to handle and analyze multivariate datasets organized in one , two or K tables . We propose additional tools for the statistical analysis of (K + 1) datasets with explanatory and modeling methods in ade4. We implemented two multiblock methods that are based on the optimization of a criterion with a direct eigensolution. The first method is multiblock partial least squares regression (MBPLS) applied to the particular case of a single response dataset Y (Wold 1984). The second one is multiblock principal component analysis with instrumental variables (MBPCAIV) also called multiblock redundancy analysis (Bougeard, Qannari, and Rose 2011). We detail preliminary data manipulation, present the two selected multiblock methods and give advice to select the most relevant. We illustrate the main advantages provided by the methods and the overall descriptive graphical displays. Multiblock methods are also devoted to modeling purpose and we thus propose a crossvalidation procedure to select the optimal model dimension and diagnostic plots to describe its quality. Lastly, we detail the optimal model interpretation at the variable and at the block levels. The whole procedure is illustrated by the analysis of a real epidemiological dataset.

Data manipulation
We consider a response dataset Y with M variables and K explanatory datasets X k with P k variables (k = 1, . . . , K). The merged dataset X is defined as X = [X 1 | . . . |X K ] and contains P = K k=1 P k explanatory variables. All these variables are measured on the same N individuals and are centered.
Note that we added some restrictions concerning the number of individuals (N ≥ 6; defined from our experience in multiblock analyses), the explanatory variables (P k ≥ 2) for k = (1, . . . , K). The dependent block Y may contain a single variable (M ≥ 1). No missing values are allowed.
The ade4 package provides the class 'ktab' that should be used to store the multiblock explanatory datasets (X k ). Variables from the same block must be contiguous. Different procedures can be used to create a 'ktab' object. In the following example, we illustrate the use of the ktab.data.frame function. The data [Y|X 1 |X 2 ] are stored in the douds object available in ade4. The dataset Y contains 27 variables, the first explanatory table called "River" contains 4 variables and the second one called "Chemicals" has 7 variables. R> library("ade4") R> data("doubs", package = "ade4") R> Y <-doubs$fish R> X <-doubs$env R> blo <-c(4, 7) R> tab.names <-c("River", "Chemicals") R> ktabX <-ktab.data.frame(df = X, blocks = blo, tabnames = tab.names) We implemented new functions to manipulate easily multiblock data. For instance, ktabX[1, 1:5, 1:3] can be used to select data for the first five individuals and the first three variables of the first table. The dependent dataset Y should be analyzed by a one-table method providing an object of class 'dudi'. For instance, dudi.pca can be used to apply principal component analysis. Note that the transformation selected for the dependent variables in the call of the dudi.pca function (centering and scaling in this example) will also be applied in the subsequent multiblock analysis.
In the rest of the paper, we consider a real dataset (chickenk in ade4) to illustrate the use of multiblock methods. This example concerns the overall risk factors for losses in broiler chickens (Y) described by (M = 4) variables (the first-week mortality rate, the mortality rate during the rest of the rearing, the mortality rate during the transport to the slaughterhouse and the condemnation rate at slaughterhouse). The (P = 20) explanatory variables are organized in (K = 4) thematic blocks related to the successive production stages of broiler chickens: the farm structure (X 1 , 5 variables, "FarmStructure"), the flock characteristics at placement (X 2 , 4 variables, "OnFarmHistory"), the flock characteristics during the rearing period (X 3 , 6 variables, "FlockCharacteristics") and the transport-lairage conditions, slaughterhouse and inspection features (X 4 , 5 variables, "CatchingTranspSlaught"). All these variables are measured on (N = 351) broiler chicken flocks. See ?chickenk for further details and Lupo et al. (2009) for a complete description. R> data("chickenk", package = "ade4") Several questions can be associated to this epidemiological dataset: 1. Are there some relationships between losses in broiler chickens Y = (y 1 , . . . , y Q ) and variables measured at the different production stages X = (x 1 , . . . , x P )?
2. Do the chicken flocks (N ) have the same features in terms of their production stage description (X) in relation with losses (Y)?
3. Are there significant links between all the variables describing the production stages X = (x 1 , . . . , x P ) and each type of losses Y = (y 1 , . . . , y Q )?
4. Is it possible to sort the effects of all the variables describing the production stages X = (x 1 , . . . , x P ) in relation with the overall losses (Y)?
5. Is it possible to sort the effects of the various production stages (X 1 , . . . , X K ) in relation with the overall losses (Y)?
Multiblock methods help to answer these different questions. The first two questions relate to a descriptive aim and will be handled in Section 4. The three last questions pertain to the modeling framework and will be treated in Section 6.

Multiblock methods
We implemented multiblock partial least squares regression (MBPLS) applied to the particular case of a single response dataset Y (Wold 1984) and multiblock principal component analysis with instrumental variables (MBPCAIV; Bougeard et al. 2011). In comparison with MBP-CAIV, the method MBPLS is less sensitive to multicollinearity within explanatory blocks.
For the case where only one block of variables is explained, Westerhuis, Kourti, and Mac-Gregor (1998) among others, showed that the solution obtained from the iterative algorithm, originally devoted to K datasets to be explained with K other ones, is equivalent to the solution obtained from a standard PLS regression of Y and the merged dataset X. The method MBPCAIV considers the multiblock structure of data and leads to a model with a better fitting ability than MBPLS but it is unstable when explanatory blocks contain quasi-collinear variables. The user has to select the most relevant method by making a trade-off between stable and predictive model according to the data structure and the aims of the study.
Both MBPCAIV and MBPLS methods can be considered as the analysis of (K + 1) triplets, a dependent one (Y, Q Y , D) and K explanatory ones (X k , Q X k , D) for k = (1, . . . , K). One can notice that the following algebraic presentation of multiblock methods as the analysis of (K + 1) triplets is consistent with the one of all the methods developed in ade4 for one, two and K tables . The simultaneous analysis of these triplets is provided by the crossing products of X k and Y, i.e., Y DX k , with the metric D = 1 N I N where I N is the N × N identity matrix. Multiblock methods seek a smaller dimension space to represent the main relationships between variables and individuals. They are based on the analysis of the K triplets (Y DX k , Q Y , Q X k ) where Q Y is usually equal to I M and Q X k = I P k for MBPLS or Q X k = (X k DX k ) − for MBPCAIV (where − stands for the generalized inverse). Since the generalized inverse leads to the non-uniqueness of the eigenvalue decomposition, results for MBPCAIV may slightly differ when replicated in case of high multicollinearity. The solution is given by For MBPCAIV, this quantity can be rewritten as k VAR(P X k DYv (1) ) where P X k is the projector onto X k . The constraint t (1) k D = 1 is added and the latent variables, derived from the eigensolution, are given by t For MBPLS, the quantity maximized is equal to k VAR(X k DYv (1) ) = VAR(X DYv (1) ). In this case, the constraints w (1) k I P k = 1 and w (1) I P = 1 are added so that the first order solution is also given by To obtain the second order solution, for MBPCAIV and MBPLS, the variables in the datasets (X 1 , . . . , X K ) are deflated by a regression onto the first global component t (1) . The same maximization is then performed but the original datasets are replaced by the residuals obtained in the deflation step. Subsequent components are obtained by reiterating this process. The reader can consult Bougeard et al. (2011) for more details.
With ade4, MBPCAIV is obtained by: R> res <-mbpcaiv(dudiY, ktabX, scale = TRUE, option = "uniform", The MBPLS is performed by replacing the call to mbpcaiv by a call to the mbpls function. The first two arguments refer to the dependent and the explanatory datasets. Variable scaling of the explanatory dataset can be set by the scale argument (default is TRUE). The scaling of the dependent dataset has been previously defined in the first call to dudi.pca. The argument option defines the block weighting. The "none" option corresponds to no block weighting, "uniform" corresponds to the case where the merged explanatory dataset X (resp. Y) has a total variance equal to one and each of the K explanatory blocks to 1/K (Westerhuis and Coenegracht 1997). The argument scannf allows to display the scree plot of eigenvalues to help with the choice of the number of the latent variables to be interpreted (default is TRUE).
The optional nf argument indicates the number of selected dimensions (defaults is 2). The object res contains the different outputs of the analysis:

Descriptive interpretation tools
Multiblock analyses are descriptive methods that provide an overview of the relationships between variables, blocks and individuals. They can be used to answer questions 1 and 2 (Section 2). The global latent variables t, linear combinations of the explanatory variables, orthogonal by construction, provide the overall graphical displays. The explanatory variables are then depicted by their loadings w * where the component is given by for a given dimension h, as a consequence of the deflation procedure (Wold, Martens, and Wold 1983). The dependent variables are represented by their regression coefficients onto these latent variables ( The decomposition of inertia into successive dimensions indicates the quantity of information extracted by the global latent variables t (element lX in res). The rank of the analysis is given by the element rank. A comprehensive summary of the first dimensions is provided by the summary function. For each dimension, the eigenvalues, inertia percentage and cumulated inertia percentage are given. The percentage and the cumulated percentage of the inertia of each dataset, X, Y and (X 1 , . . . , X K ), explained by the global latent variables are also provided (VarY and VarYcum) for the dependent dataset:

Multiblock principal component analysis with instrumental variables
Class: multiblock mbpcaiv Call: mbpcaiv(dudiY = dudiY, ktabX = ktabX, scale = TRUE, option = "uniform", scannf = FALSE, nf = 5) Graphical tools to represent the outputs of the analysis are provided in the new adegraphics package (Dray and Siberchicot 2018). The main results are provided by the plot method of the 'multiblock' class. By default, the first two global latent variables t (element lX) are used but higher order representations can be selected using the arguments xax (defaults to 1) and yax (defaults to 2).
R> library("adegraphics") R> plotmbpcaiv <-plot(res, xax = 1, yax = 2) Results are illustrated in Figure 2. The first plot (top right corner) depicts the similarities between individuals (i.e., the 351 chicken flocks). The scree plot of eigenvalues is represented in the top left corner. Relationships between blocks are depicted in the third plot (bottom left corner) by the squared covariance between the partial latent variables t k (Tl1) and u (lY). The fourth plot (bottom middle) depicts the dependent variables by their projection on the latent variables c (Yco). The fifth plot (bottom right corner) depicts simultaneously the 20 explanatory variable loadings w * (faX). These two last plots are usually interpreted together to identify the relationships between explanatory and dependent variables. Graphical functionalities of adegraphics are based on the package lattice (Sarkar 2008(Sarkar , 2017. Classes are provided to store simple and multiple graphics as objects. It is thus possible to extract and modify easily a single plot from these multiple graphical outputs. For instance, we updated some aesthetic properties of the plot of the individuals and added colors corresponding to different values of stress during the rearing period. R> class(plotmbpcaiv) [1] "ADEgS" attr(,"package") [1] "adegraphics" Row scores (X) Figure 3: Updated graphical representation of the 351 chicken flocks colored according to stress during the rearing period. Red symbols correspond to stress occurrences (feeding system defection, electrical defect, etc.), blue symbols to the absence of stress.

Selection of the optimal number of latent variables
Multiblock methods can also be used for predictive purposes and the first step is to select the optimal model dimension. The global latent variables can be expressed as linear combinations of X, i.e., t (h) = Xw * (h) and the dependent dataset Y can be split up into the same latent variables such as  = (1, . . . , H). The optimal model is obtained by selecting the number of latent variables with a two-fold cross-validation (Stone 1974). The dataset is split in a calibration and a validation sets, this procedure being repeated several times. Among all the models corresponding to the various values of h, the optimal model is retained, as a compromise between a good fitting ability (minimization of the root mean square error of calibration RMSE q q q q q q q q q q q q q q q q q q q q

RMSEc RMSEv
Root Mean Square Error Root Mean Square Error q q q q q q q q q q q q q q q q q q q q Root Mean Square Error validation is provided by the testdim function which has three arguments: the multiblock object, the number of repetitions (nrepet) and the lower and upper quantiles to compute (defaults respectively to 0.25 and 0.75) to get confidence intervals. To get reliable results, a minimum number of 100 repetitions is imposed.
R> set.seed(123456) R> testdim.chik <-testdim(res, nrepet = 100) R> class(testdim.chik) [1] "krandxval" R> plot(testdim.chik) Results are summarized in Figure 4 by means and associated confidence intervals for RMSE V . This plot helps to select the optimal number of latent variables to be introduced in the model by making a trade-off between a stable and a predictive model. In this example, a model with four latent variables both optimizes (i.e., minimizes) fitting and prediction abilities.

Interpretation of the optimal model
Lastly, we provide tools to identify, in the optimal model, significant relationships between explanatory and dependent variables at the variable and at the block levels. Bootstrapping simulations are applied to the three main predictive parameters (regression coefficients, cumulated variable importance index and cumulated block importance index) to provide confidence intervals, computed by the non-Studentized pivotal method (Carpenter and Bithell 2000). The regression coefficients of the optimal model measure the links between each explanatory and dependent variable. A coefficient is considered significant if the bootstrapped 95% confidence interval does not contain the threshold value 0. If the number of dependent variables in Y is large, the interpretation of these coefficients is difficult. In this case, it is more suitable to measure the contribution of each explanatory variable to the explanation of the whole dependent block Y. The variable importance index (vip) is proposed, derived from the squared explanatory loadings w * (h) 2 , weighted according to the associated block importance a (h) 2 k and expressed as a percentage for each dimension h. The cumulated variable importance index (vipc) sums these quantities over all the optimal components under study and weights them according to the amount of the relative importance of each dimension λ (h) , also expressed as a percentage. Each explanatory variable is considered to be significantly associated with Y when the 95% confidence interval does not contain the threshold value 1/P , P being the number of explanatory variables. Lastly, the block importance index (bip) is proposed to assess the contributions of the blocks (X 1 , . . . , X K ) in the modeling process. It is computed from the coefficients (a (h) 2 1 , . . . , a (h) 2 K ) which measure the link between Y and (X 1 , . . . , X K ). If the optimal model contains several components, the cumulated block importance index (bipc) is based on the weighted average of the bip indexes, taking as weights the relative importance of each dimension λ (h) . Both these quantities are expressed as percentages. A block is considered to be significantly associated with the dependent dataset if the 95% confidence interval does not contain the threshold value 1/K, K being the number of blocks. All the details are given in Bougeard et al. (2011).
We implemented new classes 'randboot ' and 'krandboot' and methods (print, plot) to manage the outputs of bootstrap simulations. For multiblock methods, the randboot method for 'multiblock' objects can be used. It takes three arguments: the multiblock object, the number of repetitions (nrepet) and the optimal number of dimension of the model (optdim). By default, the number of repetitions is equal to 199 but a higher number is required to get more stable results. The randboot method for 'multiblock' objects returns a list with 3 elements XYcoef, bipc and vipc. The first element is a list of 'randboot' objects, the two others are object of the class 'randboot'.
From these results, it follows that each variable in Y is significantly related to a specific set of explanatory variables. Firstly, the first-week mortality rate ("Mort7") is related to four variables, two of which pertain to the farm structure. The mortality rate during the rest of the rearing ("Mort") is significantly linked with seven variables, four of which pertain to the flock characteristics during the rearing period. The mortality rate during the transport to the slaughterhouse ("Doa") is associated with eight variables, among which four pertain to the catching, transport-lairage conditions, slaughterhouse and inspection features. Finally, the condemnation rate at slaughterhouse ("Condemn") is related to fourteen variables, six of these variables refer to the flock characteristics during the rearing period. Some explanatory variables are specifically related to one variable in Y, e.g., the chick homogeneity (from X 2 ), whereas others are linked with up to three (out of four) variables in Y, e.g., the genetic strain (from X 3 ). Therefore, to sort these explanatory variables by a global order of priority thus highlighting their overall contribution to the explanation of the Y block, the vipc values can be used. It turns out that four explanatory variables have a significant impact on the overall losses: the stress occurrence during rearing ("Stress"), the type of loading system ("LoadType"), the stocking density in transport crates ("StockingD") and the average duration of waiting time on lairage ("Dlairage"). Finally, the relative importance of the four production stages in the overall losses explanation highlights the significant importance of the flock characteristics during the rearing period (X 3 block). The interested reader may refer to Bougeard et al. (2012) for a detailed interpretation of the results.

Conclusion and perspectives
We provide new tools to improve the handling and the statistical analysis of multiblock (K + 1) datasets in the ade4 package for R. These methods preserve the original structure of the data and provide an adapted framework that combines tools from factorial analysis and regression methods. Traditional graphical outputs are completed by cross-validation and bootstrap procedures to select and interpret the optimal model.
The framework of multiblock methods provide several tools and could be enriched by considering hierarchical-structured design of individuals frequently met in biological surveys (e.g., individuals partitioned in groups corresponding to different treatments). Multiblock methods can also be directly adapted to the explanation of several dependent datasets (Y 1 , . . . , Y K ) to handle more complex data structures. We hope that our implementation of multiblock methods and future developments will facilitate the use of these techniques and thus improve the statistical analysis of (K + 1) datasets.