Simultaneous Analysis in S-PLUS : The SimultAn Package

In this paper we describe the SimultAn package dedicated to simultaneous analysis. Simultaneous analysis is a new factorial methodology developed for the joint treatment of a set of several data tables. Since the ﬁrst stage of simultaneous analysis requires a correspondence analysis of each table the package comprises two parts, one for correspondence analysis and one for simultaneous analysis. The package can be used to perform classical correspondence analysis of frequency/contingency tables as well as to perform simultaneous analysis of a set of frequency/contingency tables. In this package, functions for computation, summaries and graphical visualization in two dimensions are provided, including options to display partial rows and supplementary points.


Introduction
In this paper we present the SimultAn package, a package for simultaneous analysis (SA) with S-PLUS (Insightful Corp. 2003). The main reason for developing this package is to provide users with a tool for applying this new methodology which is not available elsewhere.
SA is a factorial method developed for the joint treatment of a set of several data tables, especially frequency tables whose row margins are different, for example when the tables are from different samples or different time points, without modifying the internal structure of each table. In the data tables rows must refer to the same entities, but columns may be different.
SA combines the basic idea of intra-analysis (Escofier and Drouet 1983) to maintain the internal structures of each contingency table in overall analysis and certain characteristics of multiple factor analysis (MFA) Pagès 1984, 1998) to balance the influence of the tables in the overall analysis and provides a joint description of the different structures contained within each table as well as a comparison of them.
More details about this method and its several applications can be found in Goitisolo (2002) and in Zárraga and Goitisolo (2002, 2008 The SimultAn package contains the following functions: CorrAn and SimAn, which allow the user to perform correspondence analysis (CA) and SA, respectively. Despite there being several S-PLUS functions to perform data analysis (Everitt 2005), (Heiberger and Holland 2004), (Crawley 2002) as well as to perform CA (Beh 2003), (Venables and Ripley 1999), (Everitt 1994) the CorrAn function is implemented in the package because the first stage of SA entails a CA of each frequency or contingency table. Nevertheless, users can perform the CA of any table or even of the concatenation of the tables. If the user wants to perform a direct SA of all the tables, there is no need first to perform the CA of each table since CorrAn is also implemented in SimAn. The summary.CorrAn and plot.CorrAn and summary.SimAn and plot.SimAn functions give summaries and graphical representations of CA and SA, respectively.
This paper consists of three further sections. Section 2 briefly describes simultaneous analysis. Section 3 demonstrates the application of the parameters of CorrAn and SimAn, what values they can take and what the parameters do. Section 4 provides some concluding remarks.
The package has been written using S-PLUS version 6.2 (Insightful Corp. 2003) for Windows (Professional Edition). The basics of S-PLUS can be found in Krause and Olson (2002).
Each of them classifies the answers of n ..g individuals with respect to two categorical variables. All the tables have one of the variables in common, in this case the row variable with categories I = {1, . . . , i, . . . , I}. The other variable of each contingency table can be different or the same variable observed at different time points or in different subsamples. On concatenating all these contingency tables, a joint set of columns J = {1, . . . , j, . . . , J} is obtained. The element n ijg corresponds to the total number of individuals who simultaneously choose the categories i ∈ I of the first variable and j ∈ J g of the second variable, for  denoted in the usual way, for example, n i.g = j∈Jg n ijg , and n denotes the grand total of all G tables.
In order to maintain the internal structure of each table G, SA begins by obtaining the relative frequencies of each table as usually done in CA: so that i∈I j∈Jg f g ij = 1 for each table g. SA is carried out in two stages.

Stage one: CA of each contingency table
Because in SA it is important for each table to maintain its own structure, the first stage carries out a classical CA of each of the G contingency tables. The separate analyses of the G tables also allow us to check for the existence of structures common to the different tables. From these analyses it is possible to obtain the weighting used in the next stage.
CA is a well-known exploratory multivariate technique for the graphical and numerical analysis of contingency tables. The bibliography of publications on CA is so wide that we will not focus on its methodology. We direct interested readers to various texts where CA is described, notably Benzécri (1973Benzécri ( , 1992, Greenacre (1984Greenacre ( , 1993, Lebart, Morineau, and Warwick (1984) and Lebart, Morineau, and Piron (2006). CA on the gth contingency table can be carried out by calculating the singular value decomposition (SVD) of the matrix X g , whose general term is: Let D g r and D g c be the diagonal matrices whose diagonal entries are respectively the marginal row frequencies f g i. and column frequencies f g .j . From the SVD of each table X g we retain the first squared singular value (or eigenvalue, or principal inertia), denoted by λ g 1 .

Stage two: Joint analysis of the tables
In the second stage, in order to balance the influence of each table in the joint analysis, as measured by the inertia, and to prevent the joint analysis from being dominated by a particular table, SA includes a weighting on each table, α g . The choice of the weighting α g depends on the aims of the analysis and on the initial structure of the information, and different values may be used (Zárraga and Goitisolo 2006). The most frequently used is α g = 1/λ g 1 , where λ g 1 denotes the first eigenvalue (square of first singular value) of table g.
As a result, SA proceeds by performing a principal component analysis (PCA) of the matrix: The PCA results are also obtained using the SVD of (3), giving singular values √ λ s on the sth dimension and corresponding left and right singular vectors u s and v s . The squared singular values or eigenvalues are called principal inertias in the context of CA and their sum s λ s is equal to the total inertia or total variance in the data table.
The projections on the sth axis of the columns are calculated as principal coordinates: where D c (J × J), is a diagonal matrix of all the column masses, that is all the D g c . One of the aims of the joint analysis of several data tables is to compare them through the points corresponding to the same row in the different tables. These points are referred to here as partial rows and denoted by i g .
The projection on the sth axis of each partial row is denoted by F s (i g ) and the vector of projections of all the partial rows for table g is denoted by F (g) s : Comparison of partial rows is complicated, especially when the number of tables is large. Therefore each partial row is compared with the (overall) row, projected as: where D w is the diagonal matrix whose general term is g∈G f g i. . The choice of this matrix D w allows us to expand the projections of the (overall) rows to keep them inside the corresponding set of projections of partial rows, and is appropriate when the partial rows have different weights in the tables. With this weighting, the projections of the overall and partial rows are related as follows: Thus, the projection of a row is a weighted average of the projections of partial rows. It is closer to those partial rows that are more similar to the overall row in terms of the relation expressed by the axis and have a greater weight than the rest of the partial rows. The dispersal of the projections of the partial rows with regard to the projection of their (overall) row indicates discrepancies between the same row in the different tables.
Notice that if f g i. is equal in all the tables, then F s = g∈G s , that is, the overall row is projected as the average of the projections of the partial rows.
The projection of a partial row on axis s is, except for the factor α g /λ s , the centroid of the projections of the columns of table g: The projection of an overall row on axis s is, except for the coefficients α g /λ s , the weighted average of the centroids of the projections of the columns for each table: To compare the different tables, SA provides measurements of the relation between the factors of the different analyses. For seeking the relation between factors of the individual CA of the tables, the correlation coefficient can be used to measure the degree of similarity between the factors of the separate CA of different tables. This is possible when the marginals f g i. are equal.
The relation between the factors s and s of the tables g and g respectively would be calculated with the correlation coefficient: where F g s (i) and F g s (i) are the projections on the axes s and s of the separate CA of the tables g and g respectively and where λ g s and λ g s are the inertias associated with these axes. When f g i. are not equal, the generalized correlation (Zárraga and Goitisolo 2003) is used to calculate the relation between the factors s and s of the tables g and g respectively: This measurement allows us to verify whether the factors of the separate analyses are similar and check the possible rotations that occur.
Likewise, it is possible to calculate for each factor s of the SA the relation with each of the factors s of the CA separate analyses of the different tables: If all the analyzed frequency tables have the same row weights this measurement is reduced to: that is the classical correlation coefficient between the factors of the separate CA and the factors of SA.
The relation between the overall rows and the partial rows can be calculated in order to verify whether a table is well represented in each of the factors of the simultaneous analysis. The generalized correlation between the projections of the partial rows, F (g) s , and the projections of the overall rows, F s , is calculated as: By considering the tables as groups of columns, in the sense of MFA (Escofier and Pagès 1998), it may be also interesting to project the tables on the axes of SA. The projection of table g onto the sth axis, H s (g), is obtained as: where Inertia s (g) represents the sum of the projected inertias of columns of table g on axis s.
Because of the weighting α g = 1/λ g 1 , the projections H s (g) take values between 0 and 1 and the contributions of each table to the formation of the axis s, H s (g)/λ s , show that the influence of the tables in the joint analysis is balanced.
A value of H s (g) close to 1 would indicate that the axis of the joint analysis is a direction of major inertia for table g, whereas a weak value would indicate that the axis is a direction of very weak inertia for table g.
Consequently, adding the projections of all the tables gives the projected inertia on the axis: As the projection of each group on the first axis is less than or equal to one, the projected inertia onto it is between zero and G, the number of tables.

Application
To illustrate the outputs and graphs of SimultAn we show two examples -one for CA and one for SA -using the traffic.dat dataset, which is available in the package. The labels of the age categories are preceded by M and F to distinguish which table they belong to (i.e., M18to20 for males and F18to20 for females).
In order to study the interaction between the kind of vehicle driven and the driver's age group for each gender we consider the analysis of two tables, one for each gender, where the 11 rows correspond to the kinds of vehicle driven and the 9 columns to the drivers' age groups.We therefore have one table for the 57420 male drivers and another for the 12573 female drivers involved in road accidents (Zárraga and Goitisolo 2009).
The SimultAn package comprises two parts. The first is for the classical CA of each of the two tables or the classical CA of the concatenation of the tables including (or not) the supplementary elements. The CorrAn function computes the CA. The second part allows the simultaneous analysis of the two tables to be performed using the SimAn function.

Correspondence analysis
The first step in the analysis is to read the dataset and select the rows and columns to be used as active or supplementary elements in the analysis.
The instruction for reading the data is: The instruction for selecting rows and columns from the dataset depends on the aim of the user. Let us assume that the user is interested in analysing the table corresponding to male drivers as active, i.e., all 11 rows corresponding to vehicles and the first 9 columns corresponding to the ages of male drivers.
The selection of the data would be: S> dataCA <-data[1:11, 1:9] The CorrAn function computes the CA of the selected data. The instruction to perform the CA is: S> CorrAn.out <-CorrAn(data = dataCA) The following output of CA is given by corresponding to the total inertia, as a measure of the total variance of the data table, to the eigenvalues or principal inertias as well as to the percentages of explained inertia and cumulated percentages of explained inertia for all possible dimensions.
The output also contains, for rows and columns, the masses in % (100fi, 100fj), the chisquared distances of points to their average (d2) and, by default restricted to the first two dimensions s = 1, 2, the projections of points on each dimension or principal coordinates (Fs, for rows, Gs for columns), contributions of the points to the dimensions and squared correlations (ctrs and cors, respectively). All the arguments of the CorrAn function can be consulted by the function names():

S> names(CorrAn)
which gives the parameters: "data" "sr" "sc" "nd" "dp" The CorrAn function offers the inclusion of supplementary elements, the parameters sr and sc indicate the supplementary rows and supplementary columns, respectively. The parameter nd indicates the dimensionality of the solution, which by default is nd = 2, and the parameter dp indicates the number of decimal places for numerical results, which by default is dp = 2, except for eigenvalues and percentages of inertia.
Assume that the CA is performed on the traffic data where the category of vehicle Other (i.e., the eleventh row) and the category of unknown age for males drivers MUA (i.e., the ninth column) are treated as supplementary points. The instruction to perform this CA is: S> CorrAn.out <-CorrAn(data = dataCA, sr = 11, sc = 9) In the corresponding section of the output of The graphical representation of the results of CA is created with the following instruction: S> plot(CorrAn.out, s1 = 1, s2 = 2) where the parameters s1 = 1 and s2 = 2 indicate that the graph is given for the first two dimensions. A plot of e.g., the second and the third dimensions is obtained by setting s1 = 2 and s2 = 3. Notice that in this case the argument nd of the CorrAn function must be at least 3. If there are supplementary elements in the analysis, the plot.CorrAn function provides two graphical outputs, one for the active elements and the second one for both active and supplementary elements. Figure 2 shows the display of active and supplementary points in the first two dimensions.
A list of all available results given by CorrAn() can be obtained with the instruction: The output is structured as a list object: totalin Total inertia.
resin Results of inertia.
resi Results of active rows.
resj Results of active columns.
resisr Results of supplementary rows.
resjsc Results of supplementary columns.
X Matrix X to be diagonalized.
totalk Total of data table.
I Number of active rows.
namei Names of active rows.
fi Marginal of active rows.
Fs Projections of active rows.
d2i Chi-square distance of active rows to their average.
J Number of active columns.
namej Names of active columns.
fj Marginal of active columns.
Gs Projections of active columns.
d2j Chi-square distance of active columns to their average.

Isr Number of supplementary rows.
nameisr Names of supplementary rows.

Simultaneous analysis
In order to perform the SA of a set of tables, after reading the data with the instruction: S> dataSA <-read.table(file = "traffic.dat") it is necessary to select which tables are to be jointly analyzed and which are the active and supplementary elements (if any).
Remember that the traffic.dat dataset contains two tables, one for each gender, where the 11 rows correspond to the kinds of vehicle driven and the 18 columns to the drivers' age groups. Columns 1 to 9 correspond to male drivers' age groups and columns 10 to 18 to female drivers' age groups.
The SimAn function computes the SA of the tables: S> SimAn.out <-SimAn(data = dataSA, G = 2, acg = list(1:9, 10:18), + weight = 2, nameg = c("M", "F")) where data = dataSA indicates which data set is to be analyzed, the parameter G indicates the number of tables to be jointly analyzed, G = 2 in this example, and acg indicates the active columns of each table. The parameter weight refers to the weighting α g (Section 2.2). Three values are possible, weight = 1 means α g = 1, weight = 2 means α g = 1/λ g 1 , where λ g 1 denotes the first eigenvalue (square of first singular value) of table g and is given by default, and weight = 3 means α g = 1/Total Inertia (g). Since in SA the rows are the same for all the tables, the parameter nameg allows the user to distinguish in the interpretation of the results as well as in the graphical representations which partial rows belong to each table.
In this example we have chosen M for the first table, the table of male drivers, and F for the second one, the table of female drivers. By default, if this parameter is not indicated, partial rows of the first table will be identified as G1 followed by the name of the row, partial rows of the second table as G2 followed by the name of the row and so on. The nameg argument also allows the different tables in the analysis to be identified.
More optional arguments for the SimAn() function include an option for setting the dimensionality of the solution (nd), an option for setting the number of decimal places for numerical results (dp) and options for selecting rows and /or columns as supplementary elements (sr and sc, respectively). By default, the parameters nd and dp take the value 2.
All the above parameters are given by:

S> names(SimAn)
The following instruction gives the summary output of SimAn:

S> summary(SimAn.out)
In its first stage SA performs a CA of each table (Section 2.1), so the output contains the separate CA of each table, as provided by the CorrAn function in Section 3.1, i.e., total inertia, explained inertias by the dimensions and, for rows and columns, the masses, distances, principal coordinates, contributions of the points to the dimensions and squared correlations. The following output is given for the table of male drivers (table M) and for the table of  female drivers (table F) The joint analysis of all the tables is performed in the second stage of the SA (Section 2.2). The output of SimAn contains the total inertia, the eigenvalues, percentages of explained inertia and cumulated percentages of explained inertia for all dimensions. In SA, with the weight used by default α g = 1/λ g 1 , the total inertia is a weighted average of the inertias of the tables and the first eigenvalue takes values between 1 and the number of tables, in this case 2. The first eigenvalue is 1.77, close to its maximum value of 2. This value indicates that the first axis of the SA is an axis of major inertia in the two data tables.
As with simple CA (Section 3.1), the output also contains, for the overall rows and for the columns of the two tables, the masses (pi, 100fjg), chi-squared distances (d2) and, by default restricted to the first two dimensions s = 1, 2, projections of points on each dimension or principal coordinates (Fs, for rows, Gs, for columns), contributions of the points to the dimensions and squared correlations (ctrs and cors, respectively). Partial rows only play a role in the analysis in building the overall rows and are projected (Fs) as supplementary elements. Therefore they do not contribute, on their own, to the axes but squared correlations (cors) are still meaningful measures of how well they are represented by the axes and are included in the output. Notice how the parameter nameg = c("M", "F") allows the partial rows of each table to be identified by putting an M and an F before the name of the rows. The output of summary.SimAn also contains the relations between overall rows and partial rows, the relations between the factors of the CA of the different tables, the relations between the factors of the SA and the factors of the separate CA of the different tables, the projections of the tables and the contributions of each table to the principal axes (Section 2.2).
The SimAn function also allows supplementary elements (rows and/or columns) to be included in the analysis. Assume that row 11 (category of vehicle Other) takes part in the analysis as a supplementary row.
The instruction to perform this SA is: S> SimAn.out <-SimAn(data = dataSA, G = 2, acg = list(1:9, 10:18), + weight = 2, nameg = c("M", "F"), sr = 11) In the corresponding section of the output of The graphical representation of the results from SA is created with the following instruction: S> plot(SimAn.out, s1 = 1, s2 = 2) where the parameters s1 = 1 and s2 = 2 (by default) indicate that the graph is given for the first two dimensions. A plot of e.g., the second and the third dimensions is obtained by setting s1 = 2 and s2 = 3. Notice that in the latter case nd, the dimensionality of the solution, must be at least 3. A large number of elements can take part in the SA of a set of tables depending on the number of tables and on the dimensions of the tables. In order to facilitate interpretation, the plot.SimAn function gives several graphical outputs for active  and supplementary points (if any) including the ones corresponding to the separate CA of each table. Figures 3 and 4 show, for the analysis in which all the points are active elements, the displays of overall rows and columns and overall rows and partial rows, respectively.
Each table is identified with a different color and the columns and partial rows of each table are in the same color and have the same symbol.
Two additional graphs, one for the projections of the tables ( Figure 5) and one for the relations between factors of CA and SA ( Figure 6) are also provided by plot.SimAn.
A list of all available results given by SimAn() can be obtained with the instruction: S> names(SimAn.out) The output is structured as a list object: totalin Total inertia.
resin Results of inertia.
resi Results of active rows.
resig Results of partial rows.
resj Results of active columns.
Fsg Projections of each table.
ctrg Contribution of each table to the axes.
riig Relation between the overall rows and the partial rows.
rCACA Relation between separate CA axes.
rCASA Relation between CA axes and SA axes.
Fsi Projections of active rows.
Fsig Projections of partial rows.
Gs Projections of active columns.
allFs Projections of rows and partial rows in an array format.
allGs Projections of columns in an array format.
I Number of active rows.
maxJg Maximum number of columns for a table.
G Number of tables.
namei Names of active rows.
nameg Prefix for identifying partial rows, tables, etc.
resisr Results of supplementary rows.
resigsr Results of partial supplementary rows.
Fsisr Projections of supplementary rows.
Fsigsr Projections of partial supplementary rows.
allFssr Projections of rows and partial supplementary rows in an array format.
Isr Number of supplementary rows.
nameisr Names of supplementary rows.
resjsc Results of supplementary columns.
Gssc Projections of supplementary columns.
allGssc Projections of supplementary columns in an array format.
Jsc Number of supplementary columns.
namejsc Names of supplementary columns.
CAret Results of CA of each table to be used in summary and plot functions.

Conclusion
We have presented the SimultAn package for simultaneous analysis of a set of contingency tables not available elsewhere. This package also allows the user to perform correspondence analysis of any table, of any combination of tables and even of the concatenation of all tables. All the features of the package are explained and illustrated in this paper using the dataset traffic.dat, which is available in the package.