MultipleCar: A Graphical User Interface MATLAB Toolbox to Compute Multiple Correspondence Analysis

In this paper we present the toolbox MultipleCar, which is a general program for computing multiple correspondence analysis and which was designed using a graphical user interface. The procedures implemented in MultipleCar are the usual ones that are already available in other applications, plus some additional procedures. MultipleCar makes it possible to compute (1) joint correspondence analysis, and (2) orthogonal and oblique rotation of coordinates. Although MultipleCar was developed in MATLAB, we compiled it as a standalone application for Windows operating systems based on graphical user interfaces. Users can decide whether to use the advanced MATLAB version of MultipleCar or the standalone version (which does not require any programming skills).


Introduction
In this paper we present the program MultipleCar, which was designed using a graphical user interface. It is a general program for computing multiple correspondence analysis. MultipleCar uses traditional procedures and indices as well as more recent developments not included in commercial or free shared packages.
Correspondence analysis (CA) is a popular method that displays the associations between two categorical variables. The method was developed in France by Benzécri in 1973 but only gained popularity outside France after textbooks were published in English 10 years later (Greenacre 1984; Lebart, Morineau, and Warwick 1990). Multiple correspondence analysis (MCA) can be seen as an extension of correspondence analysis that simultaneously analyzes more than two categorical variables. As well as displaying the associations between two categorical variables, MCA makes it possible to study (bivariate) relationships between several categorical variables and display the relationships between observations. Mathematically, MCA and its variations can be defined in several ways. Although the differences between formulations are small and relatively straightforward from a mathematical point of view, practitioners may get confused when interpreting different MCA outcomes, or when deciding how to analyze data themselves. In fact, there appears to be some ambiguity surrounding the formulation of MCA. We briefly introduce CA and MCA and some of its variants.

Correspondence analysis (CA)
Correspondence analysis is a well-known, established method. Excellent descriptions can be found in Greenacre (1984), and Lebart et al. (1990). Here we give a concise overview of some of the most important mathematical relationships that are useful for understanding the software presented and its options. Using Greenacre's notation, we can summarize CA as follows. Let $P$ denote an $n_r \times n_c$ data matrix with nonnegative elements that sum to 1. That is, $1_{n_r}^\top P 1_{n_c} = 1$, where, generically, $1_q$ denotes a $q$-dimensional vector of ones. Correspondence analysis amounts to the following least-squares approximation problem:
$$\min_{R,\,C} \; \left\| \tilde{P} - D_r^{1/2} R C^\top D_c^{1/2} \right\|^2 \tag{1}$$
subject to $C^\top D_c C = I_k$, where $\tilde{P} = D_r^{-1/2} (P - r c^\top) D_c^{-1/2}$, $r = P 1_{n_c}$, $c = P^\top 1_{n_r}$, and $D_r$ and $D_c$ are the corresponding diagonal matrices (i.e., $D_r 1_{n_r} = r$ and $D_c 1_{n_c} = c$). The so-called row and column coordinate matrices $R$ and $C$ are of rank $k$, where $k$ is the dimensionality of the approximation. This rank must be chosen by the user. Often $k = 2$ is chosen as this allows for immediate graphical displays of the data.
This least-squares problem can be solved by using the singular value decomposition
$$\tilde{P} = U \Lambda V^\top, \tag{2}$$
where $U$ and $V$ are orthonormal and $\Lambda$ is a diagonal matrix with the singular values $\lambda_i$ on its diagonal in descending order. By selecting only the first $k$ columns of $U$ and $V$ and the corresponding singular values, a $k$-dimensional least-squares approximation of $\tilde{P}$ is obtained. The resulting coordinate matrices are
$$R = D_r^{-1/2} U_k \Lambda_k^{\alpha} \quad \text{and} \quad C = D_c^{-1/2} V_k \Lambda_k^{1-\alpha}. \tag{3}$$
Depending on the choice of $\alpha$, the row (column) coordinates are referred to as principal (standard) coordinates if $\alpha = 1$, standard (principal) coordinates if $\alpha = 0$, or symmetrical coordinates if $\alpha = 1/2$. Furthermore, the quality of the $k$-dimensional approximation can be assessed by considering the so-called inertias. That is, the proportion of inertia accounted for by the first $k$ dimensions is
$$\frac{\sum_{i=1}^{k} \lambda_i^2}{\sum_{i=1}^{\kappa} \lambda_i^2}, \tag{5}$$
where $\kappa$ denotes the rank of $\tilde{P}$. The row and column coordinates are closely related through so-called transition formulae. This implies that rather than using (3) to separately construct the coordinate matrices, one set of coordinates (or singular vectors) can be used to obtain the other set. For example,
$$R = D_r^{-1/2} \tilde{P} V_k \Lambda_k^{\alpha - 1} = D_r^{-1/2} \tilde{P} D_c^{1/2} C \Lambda_k^{2(\alpha - 1)}. \tag{6}$$
The set of coordinates as defined in (3) constitutes a so-called biplot as the inner product $D_r^{1/2} R C^\top D_c^{1/2}$ approximates the data.
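The quantities entering the decomposition can be illustrated with a small sketch. It is written in plain Python rather than taken from MultipleCar, and the function name and the 2 × 2 table of counts are hypothetical. It builds the matrix of standardized residuals $\tilde{P}$ from a table of counts and computes the total inertia as the sum of its squared elements, which equals the sum of all squared singular values.

```python
def standardized_residuals(counts):
    """Return (P_tilde, total_inertia) for a two-way table of counts."""
    n = sum(sum(row) for row in counts)
    P = [[x / n for x in row] for row in counts]    # correspondence matrix, sums to 1
    r = [sum(row) for row in P]                     # row masses, r = P 1
    c = [sum(col) for col in zip(*P)]               # column masses, c = P' 1
    # Element (i, j) of D_r^(-1/2) (P - r c') D_c^(-1/2)
    P_tilde = [
        [(P[i][j] - r[i] * c[j]) / (r[i] * c[j]) ** 0.5 for j in range(len(c))]
        for i in range(len(r))
    ]
    # Squared Frobenius norm = sum of the squared singular values of P_tilde
    total_inertia = sum(x * x for row in P_tilde for x in row)
    return P_tilde, total_inertia


# Hypothetical 2 x 2 table of counts
P_tilde, inertia = standardized_residuals([[40, 10], [10, 40]])
print(P_tilde, inertia)
```

For this table every element of $\tilde{P}$ has absolute value 0.3, so the total inertia is 0.36; the single nonzero singular value is therefore 0.6.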
Note that the choice of $\alpha$ in (3) influences the interpretation of the sets of points in a biplot. In particular, distances between the principal coordinate points are (approximated) chi-squared distances. On the other hand, the standard coordinate points, which are scaled to have weighted squared length equal to one, merely indicate directions. The biplot relationship ensures that the principal coordinate points can be projected onto the directions indicated by the standard coordinates to retrieve the approximated data values (i.e., the low-dimensional approximation of $\tilde{P}$). In a symmetric biplot, the projections can be used in a similar fashion. See Greenacre (1993) or Van de Velden and Kiers (2005) for details on the relationship of correspondence analysis and biplots.
In exploratory factor analysis (EFA) coordinates are usually inspected to explain the meaning of the $k$ dimensions, and the best possible solution is the one which is easiest to interpret. In order to maximize the simplicity of coordinates, they are usually (orthogonally or obliquely) rotated. In this biplot scenario, row and column points can both be rotated without influencing the approximation. To see this, let $T$ denote a rotation matrix for which $T^{-1}$ exists. Post-multiplication of the row coordinates $R$ by $T$ and of the column coordinates $C$ by $(T^{-1})^\top$ does not change the approximation, because $R T \, (C (T^{-1})^\top)^\top = R T T^{-1} C^\top = R C^\top$. For an orthogonal rotation, $(T^{-1})^\top = T$, so both sets of points are rotated by the same matrix. This rotational freedom was exploited by Van de Velden and Kiers (2005) and Lorenzo-Seva, Van de Velden, and Kiers (2009) to increase interpretability of the solutions using orthogonal and oblique rotations, respectively.
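The invariance of the approximation under such a transformation can be checked numerically. The following plain Python sketch uses hypothetical coordinate matrices and an arbitrary invertible (oblique) matrix $T$; none of the values or helper names come from MultipleCar.

```python
def matmul(A, B):
    """Matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def inverse_2x2(T):
    (a, b), (c, d) = T
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


R = [[0.8, 0.1], [-0.3, 0.5], [0.2, -0.6]]   # hypothetical row coordinates (3 x 2)
C = [[1.0, 0.2], [-0.5, 0.9]]                # hypothetical column coordinates (2 x 2)
T = [[1.0, 0.4], [0.0, 1.0]]                 # invertible but not orthogonal

R_rot = matmul(R, T)                          # rows are post-multiplied by T
C_rot = matmul(C, transpose(inverse_2x2(T)))  # columns by the transpose of the inverse of T

before = matmul(R, transpose(C))              # inner products R C'
after = matmul(R_rot, transpose(C_rot))       # inner products after the transformation

max_difference = max(
    abs(x - y) for row_b, row_a in zip(before, after) for x, y in zip(row_b, row_a)
)
print(max_difference)
```

The printed difference is zero up to rounding error, so the biplot inner products are unchanged even though both sets of coordinates have moved.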
Before rotations are computed, it is not uncommon in the context of EFA for the loading matrices to be weighted. After rotation, the original distances of points to the origin are then re-established. In the context of CA, weighting can also be used. In fact, due to the specific weighting of coordinates in CA, infrequently observed points may be positioned relatively far from the origin. Consequently, these points may have a significant impact on the rotation angle. Pre-multiplying the coordinate matrices by $D_r$ (for the row points) and $D_c$ (for the column points) before rotation removes this effect. Alternatively, as is common practice in EFA, a row-wise normalization can be used. See Lorenzo-Seva et al. (2009) for details on these alternative options for weighting CA coordinates for rotation.
Note that, rather than calculating the singular value decomposition (2), the solution can also be found by using the eigendecomposition
$$\tilde{P}^\top \tilde{P} = V \Lambda^2 V^\top. \tag{7}$$
The column coordinates can then be obtained directly from the second formula in (3) whereas the solution for the rows can be obtained by applying (6). This procedure can be useful when the number of rows (columns) of the original data matrix is much larger than the number of columns (rows).

Multiple correspondence analysis (MCA)
Multiple correspondence analysis (MCA) is a method that allows the researcher to analyze data on more than two categorical variables. The name may be a little misleading because it suggests that the method is not the same as CA. In fact, MCA is CA applied to a so-called indicator matrix. That is, the categorical data are coded by constructing so-called indicator matrices. For the $j$th categorical variable we define $Z_j$ to be the $n \times p_j$ indicator matrix, where $n$ denotes the number of observations, $p_j$ the number of categories for variable $j$, and the $(i, l)$th element of $Z_j$ is one if individual $i$ selected category $l$ of variable $j$. All other elements are zero. We can construct these indicator matrices for all categorical variables and collect them in a so-called super-indicator matrix $Z = (Z_1, Z_2, \ldots, Z_p)$, which has $n_c = \sum_j p_j$ columns. For each categorical variable, an observation corresponds to exactly one category. Hence, $Z_j 1_{p_j} = 1_n$ and $1_n^\top Z 1_{n_c} = np$. Let $z = Z^\top 1_n$ denote the vector of category counts, and define an $n_c \times n_c$ diagonal matrix $D_z$ satisfying $D_z 1_{n_c} = z$. Inserting $Z$, scaled to sum to one (that is, $Z/(np)$), for $P$ in the CA equations of the previous section, we get $r = n^{-1} 1_n$, $c = (np)^{-1} z$, and
$$\tilde{P} = \frac{1}{\sqrt{p}} \left( Z - \frac{1}{n} 1_n z^\top \right) D_z^{-1/2}.$$
Alternatively, we can write
$$\tilde{P} = \frac{1}{\sqrt{p}} M Z D_z^{-1/2},$$
where $M = I_n - n^{-1} 1_n 1_n^\top$ denotes the centering matrix. Hence, using (7) to obtain the MCA solution we get
$$\frac{1}{p} D_z^{-1/2} Z^\top M Z D_z^{-1/2} = V \Lambda^2 V^\top, \tag{9}$$
and it is clear that MCA is closely related to PCA as it amounts to finding the eigendecomposition of the (normalized) covariance matrix.
Although MCA is defined as CA applied to a super-indicator matrix, (9) shows that the calculations can be based directly on the so-called Burt matrix: $B = Z^\top Z$. The Burt matrix is a square symmetric matrix consisting of all cross-tabulations for all combinations of the variables. That is, the Burt matrix contains counts of co-occurrences for all combinations of categories. The diagonal blocks, $Z_j^\top Z_j$ (for $j = 1$ to $p$), of $B$ contain the marginal frequencies (i.e., counts for each category) for all variables.
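The coding described above can be sketched in a few lines of plain Python. The function names and the tiny data matrix are hypothetical and do not reflect how MultipleCar stores its matrices; the sketch only assumes that each column of the raw data holds category codes 1, 2, ..., $p_j$.

```python
def indicator_matrix(X, n_categories):
    """Super-indicator matrix Z for raw data X (rows = observations,
    columns = variables, codes 1..p_j in column j)."""
    Z = []
    for row in X:
        z_row = []
        for code, p_j in zip(row, n_categories):
            block = [0] * p_j
            block[code - 1] = 1        # exactly one category per variable
            z_row.extend(block)
        Z.append(z_row)
    return Z


def burt_matrix(Z):
    """Burt matrix B = Z'Z: co-occurrence counts for all pairs of categories."""
    cols = list(zip(*Z))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


# Hypothetical data: 4 observations, 2 variables with 2 and 3 categories
X = [[1, 2], [2, 1], [1, 3], [2, 3]]
Z = indicator_matrix(X, [2, 3])
B = burt_matrix(Z)

for row in B:
    print(row)
```

Each row of `Z` sums to the number of variables, the diagonal of `B` holds the category counts, and the off-diagonal elements within a diagonal block are zero because an observation cannot fall in two categories of the same variable.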
Considering the eigenvalue decomposition of the (appropriately scaled) Burt matrix rather than the super-indicator matrix is much more efficient, in particular when the number of observations, $n$, is large. Moreover, although coordinates for the observations do not follow from the eigendecomposition of the Burt matrix, it is straightforward to calculate these coordinates using the transition formula (6). Note, however, that in the analysis of $B$ the squared singular values are obtained. This poses some issues concerning the scaling of coordinates as well as the calculation of explained inertia. For the calculation of the explained inertia (cf. (5)) as well as for the appropriately scaled coordinates (i.e., the coordinates satisfying (4)) this needs to be taken into account.
As Chavent, Kuentz-Simonet, and Saracco (2012) pointed out, despite the close relationship between EFA and MCA, rotation in MCA has not received much attention. In the general context of rotation in PCAMIX, a principal component method for the mixture of qualitative and quantitative variables, Kiers (1991) proposed orthogonal rotation applied to MCA. An application using a real data set that illustrates the advantages of using orthogonal rotation in MCA can be found in Chavent et al. (2012). In addition, Adachi (2004) studied the applicability of oblique rotations in multiple correspondence analysis.

Joint correspondence analysis (JCA)
The analysis of the Burt matrix reveals an important property of MCA: The cross-tabulations between all variables are simultaneously approximated (in a least-squares sense). Using the definitions for $Z$ and $z$, the left hand side of (9) becomes
$$S = \frac{1}{p} D_z^{-1/2} \left( B - \frac{1}{n} z z^\top \right) D_z^{-1/2}.$$
In MCA, this matrix is approximated. However, the approximation of the diagonal blocks may not be of interest. To remedy this, Greenacre (1988) proposed joint correspondence analysis (JCA). In JCA, coordinates are obtained for the categories of the categorical variables in such a way that the off-diagonal blocks of the Burt matrix (i.e., the cross-tabulations for all pairs of variables) are approximated in a least-squares sense whilst the approximation of the diagonal blocks is ignored.
The matrix $S$ can be decomposed as $S = S_u + S_d$, where $S_d$ is an $n_c \times n_c$ block diagonal matrix with the $p_j \times p_j$ diagonal blocks of $S$ as diagonal blocks. Then, to find this least-squares solution, the following five-step algorithm can be used:
(1) Initialize the algorithm using $W$, a $q \times k$ matrix whose columns are eigenvectors of $S$ associated with the $k$ largest eigenvalues and whose squared length is equal to the associated eigenvalue.
(2) Compute the eigenvalues and associated eigenvectors of $S_t$ and determine $A$ and $\Gamma$ in such a way that the columns of $A$ are appropriately scaled eigenvectors of $S_t$ associated with the $k$ largest eigenvalues of $S_t$.
This algorithm is easily implemented and appears to converge sufficiently fast in practice. Unlike in MCA, the JCA solutions for different choices of dimensionality are not nested.
In order to assess the quality of the JCA solution, we compare the sum of squared residuals (Rss) with the total variation between the categories of the different variables (Tss). Here the residuals are taken with respect to the final approximation $AA^\top$ of $S$.
Just as MCA can be regarded as the principal component analysis of categorical variables, JCA can be regarded as a multigroup factor analysis of categorical data (Van de Velden 2000). Once again, despite the close relationship between EFA and JCA, rotation in JCA has received no attention from researchers.

Software packages available to compute MCA
CA and MCA are available in most statistical software packages. However, the implementation in these functions is typically kept to a minimum, and JCA is frequently omitted. In addition, none of the major commercial programs offers recent methodological developments.
The software for computing the most recent methodological developments seems to have been developed mainly in R (R Core Team 2019). The most elaborate package is ca (Nenadić and Greenacre 2007): It is an R package that makes it possible to include supplementary points and adjust eigenvalues for improved fit. It also allows for the corresponding adjustments of contributions, joint correspondence analysis (JCA) and subset analysis. The main drawback of ca is that it was not developed with a graphical user interface. Lê, Josse, and Husson (2008) offer to compute MCA in R as a part of the FactoMineR package, which is dedicated to multivariate data analysis that is implemented within the Rcmdr environment ( However, their toolbox focuses on the multiple correspondence analysis of fuzzy coded data sets.
In conclusion, although in the context of R there are several packages that perform CA and MCA, MATLAB users currently only have a few basic functions at their disposal. We therefore decided to develop MultipleCar as a MATLAB application. Also, to make the procedures accessible to researchers without R or MATLAB programming skills, we created a compiled release of our software.

Overall description of MultipleCar
We have developed MultipleCar in MATLAB 2019a, and compiled it to be run in Microsoft Windows 64-bit operating systems. We provide the source code to be used as a typical MATLAB toolbox, but also the standalone version to be run under Windows. We tested MultipleCar on several computers with different versions of Windows (7/8/10) and found that it works correctly.
The main characteristics of MultipleCar are:
1. MATLAB advanced users can use MultipleCar as a typical toolbox, and they can analyze their data using the command line, or the graphical user interface.
2. MultipleCar can be used as a standalone Windows application, and the user does not need to have MATLAB installed on the computer. The MultipleCar graphical user interface can be used to control the whole toolbox, and no command lines are needed.
3. MultipleCar implements the most important features already available in the various R packages: MCA based on the indicator matrix, MCA based on the Burt matrix, the inclusion of supplementary points, adjustment of eigenvalues for improved fit, and JCA. In addition, it is the only toolbox that implements the orthogonal and oblique rotation of coordinates.
These characteristics make MultipleCar the most advanced MATLAB toolbox for computing MCA and JCA. In addition, the graphical user interface makes it very helpful for applied researchers with no knowledge of MATLAB or R data analysis programming.

Procedures implemented in MultipleCar
MultipleCar has been developed to compute multiple correspondence analysis. Below we describe the procedures used in detail. Multiple correspondence analysis can be computed from two different kinds of input matrix: the indicator matrix or the Burt matrix. The suitability of the matrix to be analyzed is assessed by three tests: chi-square, total inertia, and Cramer's V index. In MultipleCar the number of dimensions to be retained must be specified. The program computes the principal inertias and adjusted principal inertias from the Burt matrix.
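For a single two-way contingency table, the three indices mentioned above are linked by simple formulas: the total inertia is the chi-square statistic divided by the number of observations, and Cramér's V is the square root of the total inertia divided by the smaller of (rows − 1) and (columns − 1). The plain Python sketch below illustrates these textbook relationships; the function name and the table are hypothetical, and how MultipleCar applies the tests to the indicator or Burt matrix is not specified here.

```python
def association_indices(counts):
    """Chi-square statistic, total inertia and Cramer's V of a two-way table."""
    n = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    chi2 = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # under independence
            chi2 += (observed - expected) ** 2 / expected
    inertia = chi2 / n
    cramers_v = (inertia / min(len(row_totals) - 1, len(col_totals) - 1)) ** 0.5
    return chi2, inertia, cramers_v


# Hypothetical 2 x 2 table of counts
chi2, inertia, cramers_v = association_indices([[40, 10], [10, 40]])
print(chi2, inertia, cramers_v)
```

For this table the chi-square statistic is 36, the total inertia 0.36 and Cramér's V 0.6.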
MultipleCar can compute both MCA and JCA. We regard multiple correspondence analysis as pure exploratory data analysis. From this perspective, the user should be able to inspect the data set from any point of view. To allow this flexibility in the exploratory analysis, users can switch between different principal coordinates settings, that is, different choices for $\alpha$ in (3). For MCA, the program allows four coordinate configurations:
1. variables as principal coordinates;
2. observations as principal coordinates;
3. both variables and observations as principal coordinates (sometimes referred to as the French model; in this case, the row and column points do not constitute a so-called biplot: one set of points cannot be projected on the other to retrieve the approximated data values);
4. biplot symmetrical coordinates (both variable and observation coordinates are scaled to constitute a so-called biplot).
When MultipleCar is used to compute JCA, only one configuration is allowed: variables as principal coordinates. Users can decide to graphically represent just the variable coordinates or to include the observation points, too.
The quality of the coordinates is assessed by computing a variety of indices. The absolute contributions indicate how much a coordinate contributed to the inertia described along the corresponding axis. A relatively high absolute contribution for a particular row indicates that the row had an important influence on determining the position of the axis. The relative contributions are the squared correlations between a variable category and the principal axes. The relative contributions indicate how well a certain point (i.e., the coordinate of a variable category) is represented by a particular axis. They can be interpreted as the amount of inertia that an axis contributed to the inertia of a point.
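These two indices can be sketched as follows in plain Python. The sketch assumes the usual CA definitions: the absolute contribution of a point to an axis is its mass times its squared principal coordinate, divided by the inertia of that axis, and the relative contribution is the squared coordinate divided by the squared distance of the point to the origin over all dimensions. The function name, the coordinates and the masses are hypothetical, and the exact output layout of MultipleCar may differ.

```python
def contributions(F, masses):
    """Absolute and relative contributions from full-dimensional
    principal coordinates F (points x dimensions) and point masses."""
    n_dims = len(F[0])
    # Inertia of each axis: mass-weighted sum of squared coordinates
    axis_inertia = [
        sum(m * row[s] ** 2 for m, row in zip(masses, F)) for s in range(n_dims)
    ]
    absolute = [
        [m * row[s] ** 2 / axis_inertia[s] for s in range(n_dims)]
        for m, row in zip(masses, F)
    ]
    # Squared cosines; a point located exactly at the origin would need special handling
    relative = [
        [row[s] ** 2 / sum(x * x for x in row) for s in range(n_dims)] for row in F
    ]
    return absolute, relative


F = [[0.6, 0.0], [-0.3, 0.4], [-0.3, -0.4]]   # hypothetical principal coordinates
masses = [0.2, 0.4, 0.4]                      # hypothetical masses, summing to 1
absolute, relative = contributions(F, masses)
print(absolute)
print(relative)
```

The absolute contributions sum to one over the points of each axis, whereas the relative contributions sum to one over the axes for each point, which is why the latter are only complete when all dimensions are included.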
MultipleCar allows orthogonal varimax rotation and oblique quartimin rotation. In addition, in the context of EFA, loading matrices are frequently weighted before rotations are computed. After rotation, the original distances of points from the origin are re-established, so that the interpretation is not affected by the weights applied. This common practice in EFA is also applied to the context of correspondence analysis (see Lorenzo-Seva et al. 2009). Re-scaling coordinates using the masses of the corresponding categories prevents infrequently observed points from playing an important role in determining the rotation angle. When the row-wise normalization of the matrix to be rotated is computed, all the coordinates have the same influence on the final position of the axes. MultipleCar allows users to decide which one of these weighting schemes they wish to use (if any).

Input and output
To run the standalone program, MultipleCar must be used on the Windows operating system. To run the graphical user interface with MATLAB as a toolbox, the following command line must be executed in the MATLAB prompt

>> MultipleCar
Once the main window of MultipleCar has been opened, the data can be loaded and the analysis configured. Figure 1 shows the graphical user interface of MultipleCar. The input consists of an ASCII file containing scores on the variables, the number of categories in each variable, the labels for variable categories (optional), the variables to be considered as supplementary points (optional), and the number of dimensions to retain. Alternatively, an ASCII file containing the indicator matrix or the Burt matrix can be used. Finally, users who are familiar with MATLAB can choose to read the data stored in their own MATLAB files. The output consists of the indices explained above and is stored in the ASCII file OUPUT.TXT.
If the output information is too detailed, users can choose a simplified output option. In addition, the MAP option displays the coordinates in a bidimensional graph. Figure 1 shows the main window of MultipleCar, while Figures 2 and 3 show how to configure the program in order to analyze a data set. We illustrate and explain these steps in the next section by means of an example.

An illustrative example
From the Eurostat database (see http://ec.europa.eu/eurostat), we recorded three variables for 287 regions in Europe during the year 2013: (1) Youth employment rate, (2) Youth long-term unemployment rate, and (3) Country of the region. The population under study consisted of individuals between 15 and 29 years old, and long-term unemployment meant having been without work for 12 months or longer. The employment rates in the 287 regions ranged from 16.5% (in Dytiki Macedonia, Greece) to 75.2% (in Ostschweiz, Switzerland) and long-term unemployment ranged from 0.5% (in Prague, Czech Republic) to 37.4% (in Ceuta, Spain). We computed a new categorical Employment variable coded as: Very low (employment rates between 16.5 and 31.2%), Low (employment rates between 31.3 and 45.9%), High (employment rates between 46.0 and 60.5%), and Very high (employment rates between 60.6 and 75.2%). We computed a new categorical Long-Term Unemployment variable coded as: Very low (unemployment rates between 0.5 and 9.7%), Low (unemployment rates between 9.8 and 19.0%), High (unemployment rates between 19.1 and 28.2%), and Very high (unemployment rates between 28.3 and 37.4%). Finally, each of the 287 regions belonged to one of the 32 countries listed in Table 1.
Table 1: Adjusted principal inertias based on eigenvalues of the Burt matrix. Average adjusted explained variance: 5.3% (dimensions explaining less variance should be excluded from the map; equivalent to Kaiser's one-eigenvalue rule in EFA).
We analyze the three categorical variables (Employment, Long-Term Unemployment, and Country) using multiple correspondence analysis with MultipleCar. In order to read the data and to set the program as presented in Figure 1, the user must follow the five steps presented in Figures 2 and 3. These steps are:
• Step 1: Read the data using the File tab. Data can be read either as a text file in ASCII format or as a MATLAB (*.mat) file and are stored in a data matrix named X. The program expects to find a matrix in which each column corresponds to a categorical variable, and the categories are coded as category numbers (consecutive, starting with 1).
• Step 2: Labels related to each variable in matrix X can be read from a text file in ASCII format and are stored in a text variable named labels. Each label is expected to be stored in a row of the text file. The number of rows should be equal to the total number of categories and the order of labels should correspond to the category numbers.
• Step 3: By clicking the arrow next to Data, the matrix X can be selected for analysis. When the matrix is selected, the vector MCAR_maxima_in_rawdata is computed by the program. This vector contains the maximum value for each variable.
• Step 4: The vector MCAR_maxima_in_rawdata is selected to define the number of categories in each variable contained in matrix X. Alternatively, the user can load a vector using the same procedure described in Step 1, and select it to define the number of categories for each variable.
• Step 5: The text variable labels is selected to assign labels to the categories of the variables. If no labels are given, the output will just show the corresponding numerical value.
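The input conventions in these steps can be checked before the files are loaded. The plain Python sketch below is not part of MultipleCar: the helper names are hypothetical, and it only mirrors the stated rules that codes are consecutive from 1, that the maxima vector gives the number of categories per variable, and that there is one label per category.

```python
def category_maxima(X):
    """Largest category code in each column of X (one column per variable)."""
    return [max(column) for column in zip(*X)]


def codes_are_consecutive(X):
    """True if every column uses exactly the codes 1, 2, ..., max."""
    return all(
        set(column) == set(range(1, max(column) + 1)) for column in zip(*X)
    )


def labels_match(maxima, labels):
    """True if there is exactly one label per category, over all variables."""
    return len(labels) == sum(maxima)


# Hypothetical data matrix and labels (2 variables with 2 and 3 categories)
X = [[1, 2], [2, 1], [1, 3], [2, 3]]
labels = ["Low", "High", "North", "Centre", "South"]

maxima = category_maxima(X)
print(maxima, codes_are_consecutive(X), labels_match(maxima, labels))
```

A column in which a code is skipped (for example, codes 1 and 3 only) would fail the second check; under the maxima rule such a column would be treated as having three categories, one of them empty.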
To compute an MCA solution using the default values, the user needs to click the button Compute. The program computes the 247 × 50 indicator matrix and asks the user to indicate the number of dimensions to retain. This is the dimensionality of the solution. Note that, for CA and MCA, the solutions are nested. That is, the choice of the number of dimensions does not affect the (unrotated) solution. It merely removes or adds coordinates. Therefore, some initial choice of the number of dimensions can be used and, based upon the output, a final choice can be made. For JCA, solutions are not nested. In this case, the analysis needs to be re-run for different choices of $k$ in order to select $k$ based on the explained inertias of solutions of different dimensionalities. In our example, on the basis of the adjusted inertias, we decided to retain three dimensions, which explained 84.6% of the total inertia (see Table 1).
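A widely used adjustment of the principal inertias of the Burt matrix is the one described by Greenacre; the plain Python sketch below implements that formula on the assumption, not confirmed by the text, that MultipleCar follows the same definition. With $Q$ variables, $J$ categories in total and Burt principal inertias $\lambda_s$, the adjusted inertia is $(Q/(Q-1))^2 (\sqrt{\lambda_s} - 1/Q)^2$ for every dimension with $\sqrt{\lambda_s} > 1/Q$, and percentages are taken relative to $\frac{Q}{Q-1}\left(\sum_s \lambda_s - \frac{J-Q}{Q^2}\right)$. The function name and input values are hypothetical.

```python
def adjusted_inertias(burt_inertias, n_variables, n_categories):
    """Adjusted principal inertias and their percentages (Greenacre's formula).
    burt_inertias: principal inertias (eigenvalues) of the Burt matrix analysis."""
    Q, J = n_variables, n_categories
    total = Q / (Q - 1) * (sum(burt_inertias) - (J - Q) / Q ** 2)
    adjusted = [
        (Q / (Q - 1)) ** 2 * (lam ** 0.5 - 1 / Q) ** 2
        for lam in burt_inertias
        if lam ** 0.5 > 1 / Q          # only dimensions above the 1/Q threshold
    ]
    percentages = [100 * a / total for a in adjusted]
    return adjusted, percentages


# Hypothetical case: 2 variables with 2 categories each
adjusted, percentages = adjusted_inertias([0.64, 0.04], n_variables=2, n_categories=4)
print(adjusted, percentages)
```

In this two-variable case the single adjusted inertia is 0.36 and accounts for 100% of the adjusted total. With more than two variables the percentages generally do not add up to 100%.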
MCA outcomes are typically presented as a display of the coordinates in a bidimensional graph. However, when more than two dimensions are retained, the graphical presentation becomes more complex. Figure 4 shows the bidimensional graphs in three panels, one for each pair of dimensions. Understanding the information contained in these graphs is not straightforward. In this case, it is preferable to interpret the rotated principal coordinates. In our example, Bentler's simplicity index (1977) is 0.586 before rotation and 0.965 afterwards.
We therefore advise interpreting the rotated coordinates, which are shown in Table 2. It can be useful to label dimensions by using the rotated coordinate values. Lorenzo-Seva et al. (2009) proposed comparing the squared coordinates to the corresponding mean. Only coordinates whose squared values are larger than the mean are considered to be salient coordinates. These salient coordinates can be used to assign labels to the dimensions. The salient coordinates are marked in Table 2 with an asterisk (*).
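The salience rule can be sketched in plain Python. The sketch assumes that "the corresponding mean" is the mean of the squared coordinates within each dimension; the function name and the coordinate values are hypothetical and are not those of Table 2.

```python
def salient_mask(coords):
    """Flag coordinates whose squared value exceeds the mean squared
    coordinate of their dimension (column)."""
    n_rows, n_dims = len(coords), len(coords[0])
    mask = [[False] * n_dims for _ in range(n_rows)]
    for s in range(n_dims):
        squared = [row[s] ** 2 for row in coords]
        mean_squared = sum(squared) / n_rows
        for i, value in enumerate(squared):
            mask[i][s] = value > mean_squared
    return mask


# Hypothetical rotated coordinates: 4 categories, 2 dimensions
coords = [[0.9, 0.1], [-0.8, 0.0], [0.1, 0.7], [0.0, -0.2]]
mask = salient_mask(coords)

for row, flags in zip(coords, mask):
    print(["%.1f%s" % (x, "*" if f else "") for x, f in zip(row, flags)])
```

Because the comparison uses squared values, large negative coordinates are flagged as well, which is what allows a dimension to be interpreted as bipolar.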
The salient coordinates in the first dimension suggest that this dimension is bipolar: One pole is high employment (salient positive coordinates), whereas the other is medium (i.e., values between low and high) and long-term unemployment (salient negative coordinates). The countries are ordered along this bipolar dimension according to their employment/long-term unemployment levels. For example, Estonia, Finland, Latvia, Malta, Sweden, and Germany are on the high employment pole, whereas Ireland, Slovakia, Bulgaria, Portugal, Spain, Croatia, and Cyprus are on the medium and long-term unemployment pole. This dimension could be labeled as Countries with non-extreme employment/long-term unemployment levels. Salient coordinates in the second dimension suggest a very low level of employment and a very high long-term unemployment in some countries: Macedonia, Greece, and to some extent Italy. This dimension could be labeled as Countries with a very difficult employment situation. Salient coordinates in the third dimension suggest a very high level of employment in some countries: Denmark, the Netherlands, Switzerland, and Norway. This dimension could be labeled as Countries with very high employment. Note that the same conclusions can be drawn from the maps in Figure 4. However, as the number of dimensions $k$ was larger than 2, the interpretation of the rotated coordinates is more straightforward.

Program limitations
The number of variables, categories and observations in the data set is not limited. However, when large data sets are analyzed, and depending on the characteristics of the computer (processor chip, memory, etc.), computing the indicator matrix can take some time. To speed up the analysis, the indicator and the Burt matrices are given to the user afterwards so that these matrices can be saved and used for future analyses without needing to compute them again. In addition, if the analysis is run more than once in the same session, MultipleCar does not compute the indicator matrix again but uses the one that has already been computed.

Program availability
MultipleCar can be downloaded for free from http://psico.fcep.urv.cat/utilitats/CorrespondenceAnalysis. The user can download a standalone version of the program to be run on Windows 64-bit operating systems. Alternatively, the MultipleCar toolbox can be run as a MATLAB script. Note that, in order to use MultipleCar as a standalone application, users who do not have MATLAB installed on their computer first need to download a MATLAB runtime program. The site also provides a manual that includes video tutorials on how to use MultipleCar, and some example data sets.