Jan de Leeuw and the French School of Data Analysis

The Dutch and the French schools of data analysis differ in their approaches to the question: How does one understand and summarize the information contained in a data set? The commonalities and discrepancies between the schools are explored here with a focus on methods dedicated to the analysis of categorical data, which are known either as homogeneity analysis (HOMALS) or multiple correspondence analysis (MCA).


Introduction
In the 1960s two currents of research emerged in the spirit of Tukey's exploratory data analysis (Tukey 1962): the French school and the Dutch school. Researchers in these schools were outliers in the statistical landscape of the time, in which most research was performed in the framework of probability models. What can be highlighted is that many of the modern arguments about data sciences, machine learning, statistics (see Donoho 2015, for a nice overview), and inference (see for instance the ASA statement on p-values in Wasserstein and Lazar 2016) were already debated. We feel that the way both schools tackled problems and data was a bit ahead of its time.
The French school of "analyse des données" (data analysis) was led by Jean-Paul Benzécri, a mathematician and linguist, who encouraged the idea of "letting the data speak for themselves". One of his famous quotes (Benzécri 1973, p. 6, Tome 2) starts with, "The model must follow the data, not the other way around," and ends with, "What we need is a rigorous method which extracts structures from the data." He described "statistical analysis as a tool [...]
It is of course more difficult for us to talk about the Dutch school and to reflect on Jan's views of statistics, models, and inference without taking the risk of misrepresenting his thoughts. In addition, his views may have evolved through the years. So we would advise the reader to consider two articles we really enjoyed and that may reflect some of his current ideas: "Models of Data" (De Leeuw 2005) and "Statistics and the Sciences" (De Leeuw 2011b). We also refer the reader to the interview of De Leeuw (De Leeuw 2011a) given on the occasion of the International Conference on Correspondence Analysis and Related Methods (CARME) in 2011. We see there many similarities with our day-to-day practice of statistics, in which we think about the encoding of the data, the use of statistics "as tools for data analysis", concerns about stability, etc. What we can say for sure is that the absence of models is also a strong characteristic of the Dutch school. In addition, the methods of this school, known as Gifi's methods (Gifi 1990; Michailidis and De Leeuw 1998), also reduce the dimensionality and respect the nature of the data, whether categorical or ordinal, for instance. For both schools, coding categorical variables with the indicator matrix of dummy variables and then treating them as Gaussian, for instance, is almost a crime.
Another strong feature of both schools is that the approaches are completely unsupervised in the sense that very often there is no distinction between explanatory variables and a response variable, or in other words, there is no Y variable.
What differs between the schools is the manner in which problems are presented and solved: mainly through projections in the French school, and through the definition of a loss function, minimized by an alternating least squares (ALS) algorithm, together with transformations of the variables in the Dutch school. The difference between these points of view implies different research focuses and developments. It is very interesting to see how the way a problem is written influences the stream of ideas. As we will see in what follows, the projection point of view facilitates the introduction of supplementary elements to enhance the interpretation of the graphical outputs, whereas the ALS point of view easily enables the introduction of constraints in the optimization problem.
In this paper, to illustrate the commonalities and discrepancies between the French and Dutch schools, we focus on the method dedicated to the analysis of categorical data, known either as multiple correspondence analysis (MCA) or homogeneity analysis (HOMALS). We start by reviewing both approaches and by presenting how these methods have been extended to deal with missing values. Then, we illustrate the approaches on a survey data set describing genetically modified organisms. Finally, we show how Jan's developments influenced the French school.

HOMALS and multiple correspondence analysis
HOMALS and MCA have been successfully applied to describe the relationship between categorical variables in many fields such as the social sciences, marketing, health, psychology, educational research, political science, genetics, etc. (Greenacre and Blasius 2006). They are often used to analyse survey data where participants answer many questions. Let us consider a dataset with $n$ rows and $J$ categorical variables $v_j$, $j = 1, \dots, J$, with $K_j$ categories each. The data are coded using the indicator matrix of dummy variables, denoted $G_{n \times K}$ with $K = \sum_j K_j$, where $g_{ijk} = 1$ if person $i$ selects category $k$ of variable $j$ and $g_{ijk} = 0$ otherwise, as illustrated below for three variables with respectively three, three and two categories.
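As a minimal sketch of this coding, the following Python snippet builds $G$ for three variables with three, three and two categories. The variable names, category labels and six respondents are hypothetical and serve only as an illustration, and `indicator_matrix` is a helper written for this example, not a function of any MCA or HOMALS package.

```python
import numpy as np

# Hypothetical answers of six respondents to three categorical questions
# (three, three and two categories respectively).
answers = {
    "opinion": ["for", "against", "neutral", "against", "for", "neutral"],
    "age": ["young", "middle", "old", "old", "young", "middle"],
    "sex": ["F", "M", "F", "F", "M", "M"],
}

def indicator_matrix(columns):
    """Return the n x K indicator matrix G and the K column labels."""
    blocks, labels = [], []
    for name, values in columns.items():
        values = np.asarray(values)
        categories = np.unique(values)  # sorted distinct categories of v_j
        # Block G_j: entry (i, k) is 1 when respondent i chose category k.
        blocks.append((values[:, None] == categories[None, :]).astype(int))
        labels += [f"{name}={c}" for c in categories]
    return np.hstack(blocks), labels

G, labels = indicator_matrix(answers)
print(labels)
print(G)
```

Each row of `G` sums to $J = 3$ because every respondent selects exactly one category per variable, and each column sum is the count of respondents in that category.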

Classical MCA presentation
Historically, Lebart (Lebart and Tabard 1973) had the idea to apply correspondence analysis (CA) to the indicator matrix $G$ (Lebart and Saporta 2014). This strategy yields very interesting results with new properties: this is how MCA was born, and this remains its most common definition. Another nearly equivalent way to perform MCA consists of applying CA to the Burt matrix $B = G^\top G$, which is the matrix of all pairwise associations between the variables. Note that in this table, the information on rows is lost. The final common presentation of MCA consists of performing PCA on the indicator matrix $G$ with specific row and column weights. The choice of weights ensures the properties of the method, such as the chi-square interpretation of the distances between rows, as well as the fact that the principal components are "new" variables that are maximally related to the set of variables, with the relationship measured here by the squared correlation ratio ($\eta^2$) of analysis of variance. More precisely, let $X_{n \times S}$ denote the matrix of principal components (scaled to norm 1), also known as the normalized scores, i.e., the normalized coordinates of the $n$ observations on the $S$ axes. They satisfy the following property (Saporta 1988a):
$$x_s = \arg\max_{x \in \mathbb{R}^n} \sum_{j=1}^{J} \eta^2(x, v_j) \qquad (1)$$
with the constraint that $x_s$ (the $s$th column of $X$) has norm equal to 1 and is orthogonal to $x_{s'}$ for all $s' < s$. This expression strengthens the presentation of MCA as an extension of PCA. It also supports the practice of applying a clustering method, such as k-means or a hierarchical clustering algorithm, to the first $S$ principal components of MCA. Indeed, doing so makes it possible both to work with continuous variables that summarize the categorical variables and to remove some noise (assuming that the last dimensions are restricted to noise), which stabilizes the clustering (Husson, Lê, and Pagès 2010). Note that the complementarity between clustering and principal components methods is usual in the French school of data analysis.
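The weighted PCA presentation and the correlation-ratio property can be checked numerically. The sketch below is only an illustration under stated assumptions: the data are hypothetical, the helper names are ours, and the scaling $\frac{1}{\sqrt{nJ}}(G - M)D_p^{-1/2}$, with $D_p$ the diagonal matrix of category proportions, is one common convention that may differ from other presentations by scaling factors. Under this convention, the first eigenvalue should equal the average squared correlation ratio between the first score vector and the $J$ variables.

```python
import numpy as np

# Hypothetical coded answers: 8 respondents, 3 variables with 3, 3 and 2 categories.
data = np.array([
    [0, 0, 0], [0, 1, 0], [1, 1, 1], [1, 2, 1],
    [2, 2, 1], [2, 0, 0], [0, 1, 1], [1, 0, 0],
])
n, J = data.shape

def indicator(codes):
    """n x K_j indicator block of one categorical variable."""
    return (codes[:, None] == np.unique(codes)[None, :]).astype(float)

def correlation_ratio(x, codes):
    """Squared correlation ratio: between-category over total sum of squares."""
    total = np.sum((x - x.mean()) ** 2)
    between = sum(
        np.sum(codes == c) * (x[codes == c].mean() - x.mean()) ** 2
        for c in np.unique(codes)
    )
    return between / total

G = np.hstack([indicator(data[:, j]) for j in range(J)])
p = G.mean(axis=0)  # column means of G, i.e. category proportions

# Centre G, weight column k by 1 / sqrt(p_k), and scale by 1 / sqrt(n J).
Z = (G - p) / np.sqrt(p) / np.sqrt(n * J)
U, singular_values, Vt = np.linalg.svd(Z, full_matrices=False)
eigenvalues = singular_values ** 2

x1 = U[:, 0]  # first normalized score vector (norm 1)
mean_eta2 = np.mean([correlation_ratio(x1, data[:, j]) for j in range(J)])
print(eigenvalues[0], mean_eta2)
```

With this scaling the eigenvalues also sum to $(K - J)/J$, the total inertia of the indicator matrix, which gives a quick sanity check on the weights.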
That there are three presentations of MCA can be seen as a strength of the method (Husson and Josse 2014). Whatever the point of view used, the MCA solution can be obtained by performing the generalized SVD (Greenacre 1984) of the triplet (data, column weights, row weights) $\left(G - M,\; J^{-1} D_\Sigma^{-1/2},\; n^{-1} I_n\right)$, with $D_\Sigma$ the diagonal matrix of the column margins of the matrix and $M$ the matrix in which each row is equal to the vector of the means of each column of $G$. It boils down to performing an ordinary SVD, $U \Lambda^{1/2} V^\top$, of the matrix $G - M$ rescaled by these row and column weights. MCA can also be defined as finding the best approximation of $G - M$ by a matrix of rank $S$ according to the Hilbert-Schmidt norm associated with these weights. The solution is given by $A = V \Lambda^{1/2}$ and $X = U$ truncated at order $S$. MCA mainly consists of interpreting the graphical outputs, where rows are represented with $U \Lambda^{1/2} D_\Sigma^{1/2}$ and categories are represented with $V \Lambda^{1/2} D_\Sigma^{1/2}$. There are different choices regarding the graphical representations; the previous system is known as the French coordinates. In addition, we usually emphasize the (pseudo) barycentric principle, which helps in interpreting simultaneous graphical displays: a column category point is, apart from scaling factors, the centroid of the observations belonging to that category, and a row point is, also apart from scaling factors, at the barycenter of the categories it belongs to. This property is at the origin of an additional way to introduce MCA known as "dual scaling" and popularized by Nishisato (1980). Note also that examining both the rows and the columns of a data set is already a step away from the classical inferential framework, where the rows are often a sample from a larger population and useful only in that they provide information on the relationship between variables.

Classical HOMALS presentation
HOMALS (De Leeuw and Van Rijckevorsel 1980; Gifi 1990; Michailidis and De Leeuw 1998) is defined using the concept of "quantification". The quantification of the rows is represented by a matrix $X_{n \times S}$, the quantification of the categories by the matrix $Y_{K \times S}$, which stacks the blocks $Y_1, \dots, Y_J$ (with $Y_j$ of size $K_j \times S$), and the quantification of each variable by $G_j Y_j$, of size $n \times S$, where $G_j$ denotes the columns of $G$ associated with variable $j$. Homogeneity analysis is defined using a loss function which represents a criterion of departure from homogeneity:
$$L(X, Y) = \frac{1}{J} \sum_{j=1}^{J} \| X - G_j Y_j \|^2 \qquad (2)$$
The HOMALS solution minimizes criterion (2) with the constraint that $x_s$ has norm equal to 1 and is orthogonal to $x_{s'}$ for all $s' < s$.
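Contrary to MCA, which is solved by SVD, homogeneity analysis uses an alternating least squares algorithm in which each step $\ell$ has three substeps (at iteration $\ell = 0$, arbitrary row scores $X^0$ are used):
1. $Y_j^{\ell} = D_j^{-1} G_j^\top X^{\ell-1}$ for each variable $j$, with $D_j = G_j^\top G_j$ the diagonal matrix of the category counts, so that each category is quantified by the centroid of the rows that selected it.
2. $Z^{\ell} = \frac{1}{J} \sum_{j=1}^{J} G_j Y_j^{\ell}$.
3. $X^{\ell}$ is defined as the orthonormalized version of $Z^{\ell}$.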
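The alternating scheme can be sketched in a few lines of Python. This is an illustration rather than the implementation of any HOMALS package: it assumes the usual three substeps (category quantifications as centroids of the row scores, averaging of the quantified variables, orthonormalization), uses a QR decomposition of the centred matrix for the last substep, and runs on hypothetical data. The function names `indicator_blocks`, `orthonormalize` and `homals_als` are ours.

```python
import numpy as np

def indicator_blocks(data):
    """List of indicator blocks G_j, one per column of integer codes."""
    return [
        (data[:, j][:, None] == np.unique(data[:, j])[None, :]).astype(float)
        for j in range(data.shape[1])
    ]

def orthonormalize(Z):
    """Centre the columns of Z, then orthonormalize them with a QR decomposition."""
    Q, _ = np.linalg.qr(Z - Z.mean(axis=0))
    return Q

def homals_als(blocks, n_dims=2, n_iter=200, seed=0):
    """Unconstrained homogeneity analysis by alternating least squares."""
    n, J = blocks[0].shape[0], len(blocks)
    rng = np.random.default_rng(seed)
    X = orthonormalize(rng.standard_normal((n, n_dims)))  # arbitrary start
    for _ in range(n_iter):
        # Substep 1: category quantifications are centroids of the row scores.
        Y = [(G.T @ X) / G.sum(axis=0)[:, None] for G in blocks]
        # Substep 2: average of the quantified variables G_j Y_j.
        Z = sum(G @ Yj for G, Yj in zip(blocks, Y)) / J
        # Substep 3: orthonormalize to obtain the new row scores.
        X = orthonormalize(Z)
    # Quantifications and loss for the final row scores.
    Y = [(G.T @ X) / G.sum(axis=0)[:, None] for G in blocks]
    loss = sum(np.sum((X - G @ Yj) ** 2) for G, Yj in zip(blocks, Y)) / J
    return X, Y, loss

# Hypothetical coded answers: 8 respondents, 3 variables with 3, 3 and 2 categories.
data = np.array([
    [0, 0, 0], [0, 1, 0], [1, 1, 1], [1, 2, 1],
    [2, 2, 1], [2, 0, 0], [0, 1, 1], [1, 0, 0],
])
X, Y, loss = homals_als(indicator_blocks(data))
print(loss)
```

The sketch assumes that every category is observed at least once, so that the centroid in the first substep is well defined.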
The homogeneity analysis framework makes it easy to add constraints. It is common, for instance, to impose a rank constraint on the $Y_j$; often rank 1 is chosen. This can be done simply by adding, after step (1) of the algorithm, a step where $Y^c$ is defined as the best rank 1 approximation to $Y$. De Leeuw and Mair (2009) highlighted the fact that such a constraint may make the interpretation easier since it leads to a more parsimonious representation. In addition, such a constraint can be a way to avoid horseshoe effects if such effects are not desirable. In addition to a rank constraint, a level constraint can be imposed to reflect the data type, i.e., ordinal or numerical variables. The idea is to respect the nature of the variables by preserving the original order of the categories, for instance. Thus, the categories of an ordinal variable will be ordered as well on the low-dimensional graphical representation.
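A rank 1 step of this kind can be sketched with a truncated SVD. The snippet below is a simplified illustration: it computes the best rank 1 approximation in the unweighted least squares sense, whereas implementations may weight the categories by their frequencies. The quantification values and the function name `best_rank_one` are hypothetical.

```python
import numpy as np

def best_rank_one(Y):
    """Closest rank 1 matrix to Y in the unweighted least squares sense."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return s[0] * np.outer(U[:, 0], Vt[0])

# Hypothetical quantifications of a four-category variable in S = 2 dimensions.
Y_j = np.array([[-1.2, 0.5], [-0.3, 0.4], [0.4, -0.1], [1.1, -0.6]])
Y_c = best_rank_one(Y_j)
print(Y_c)
```

Because all rows of `Y_c` are multiples of a single direction, the category points of the variable fall on a straight line through the origin of the map.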
Note also that the HOMALS framework allows the definition of variable transformations with other restrictions on the quantification matrices $Y_j$, which gives new methods such as a nonlinear version of PCA (De Leeuw 2014).
We should mention that Jan was aware of the work of Benzécri and was influenced by Van De Geer's books on multivariate analysis from a graphical perspective. So he gave an extra perspective by including the optimization framework and this point of view was favored by the Dutch school.

Connection between HOMALS and MCA
Both HOMALS and MCA are dedicated to the analysis of categorical data and represent the data in a lower dimensional space with row coordinates $X$ and category coordinates $Y$. The connection between criteria (1) and (2) is straightforward, by plugging in the centroid $Y_j = (G_j^\top G_j)^{-1} G_j^\top X$. Thus, MCA and HOMALS (in its simplest form, without constraints) lead to exactly the same graphical representations and analysis.
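The plug-in step can be spelled out as follows. This is a sketch under assumptions that we state explicitly: the loss is taken in the usual Gifi form $L(X, Y) = \frac{1}{J} \sum_j \| X - G_j Y_j \|^2$, the columns $x_s$ of $X$ are centred with unit norm, and $P_j = G_j (G_j^\top G_j)^{-1} G_j^\top$ (a notation introduced here) denotes the orthogonal projector onto the columns of $G_j$, so that $G_j Y_j = P_j X$ at the centroid.

```latex
\begin{align*}
\min_{Y} L(X, Y)
  &= \frac{1}{J} \sum_{j=1}^{J} \| X - P_j X \|^2
   = \frac{1}{J} \sum_{j=1}^{J}
     \left( \operatorname{tr}(X^\top X) - \operatorname{tr}(X^\top P_j X) \right) \\
  &= S - \frac{1}{J} \sum_{j=1}^{J} \sum_{s=1}^{S} x_s^\top P_j x_s
   = S - \frac{1}{J} \sum_{s=1}^{S} \sum_{j=1}^{J} \eta^2(x_s, v_j).
\end{align*}
```

For a centred, unit-norm $x_s$, the quantity $x_s^\top P_j x_s$ is the between-category sum of squares of $x_s$ for variable $j$ divided by a total sum of squares equal to 1, that is, the squared correlation ratio. Minimizing the loss over $X$ therefore amounts to maximizing the sum of squared correlation ratios.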
However, due to the difference of starting points, we feel that the algorithms as applied in practice are more different than they initially appear. The strongest point of the Gifi methods is their use of advanced optimization techniques, and Jan is a pioneer in this domain; one can mention, for instance, his work on majorization algorithms, now known as majorization-minimization (MM) algorithms (De Leeuw and Heiser 1977). On the other hand, the extensive use of SVD has led to developments in the matrix completion framework, as illustrated in Section 2.4. Once again, this highlights the very modern aspects of these schools, since both optimization techniques and the SVD have gained huge popularity in the past decade due to their ability to address problems involving high dimensional data.
In the next section, we discuss missing values. HOMALS and MCA approach missing values differently, which can be explained by the differing formulation of the methods.

Handling missing values in HOMALS and MCA
A first possibility to manage missing values consists of adding a column to the indicator matrix for each variable with missing data. In this case, missing values for a variable are considered as a new category and not as one of the observed categories. Then, classical HOMALS or MCA can be applied to this new complete data set. Note that this strategy makes sense for missing not at random (MNAR) data (Little and Rubin 1987, 2002), for instance, or to inspect the missing data pattern (Josse, Chavent, Liquet, and Husson 2012).
Other ways are available and they differ in HOMALS and MCA with respect to their strategy and their results.

Missing values in HOMALS
In HOMALS, missing observations are simply coded as zero rows in the indicator matrix: if object $i$ is missing on variable $j$, then row $i$ of $G_j$ sums to 0; otherwise it sums to 1, since the category entries are disjunctive. Then, whatever the coding, all row sums of $G_j$ are collected in a diagonal matrix $M_j$, and the criterion $L(X, Y)$ is written by introducing these matrices $(M_j)_{j=1,\dots,J}$:
$$L(X, Y) = \frac{1}{J} \sum_{j=1}^{J} \operatorname{tr}\left( (X - G_j Y_j)^\top M_j (X - G_j Y_j) \right)$$
We note that this approach seems a very natural way to "skip" the missing values in the optimization problem. This strategy is also known as missing passive (Meulman 1982) and has been extended in the framework of MCA by Escofier (1987) under the name missing passive modified margin. Van der Heijden and Escofier (2003) and Josse et al. (2012) discuss the advantages and drawbacks of both approaches.
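The effect of the matrices $M_j$ can be illustrated with a small sketch. It assumes the weighted form of the loss, $\frac{1}{J} \sum_j \operatorname{tr}\left((X - G_j Y_j)^\top M_j (X - G_j Y_j)\right)$, and uses hypothetical data; the code `-1` for a missing answer and the function names `passive_block` and `passive_loss` are conventions of this example only.

```python
import numpy as np

MISSING = -1  # hypothetical code used here for a missing answer

def passive_block(codes, n_categories):
    """Indicator block G_j whose rows are all zero where the answer is missing."""
    codes = np.asarray(codes)
    return (codes[:, None] == np.arange(n_categories)[None, :]).astype(float)

def passive_loss(X, blocks, quantifications):
    """Loss in which M_j = diag(row sums of G_j) switches off missing rows."""
    total = 0.0
    for G, Y in zip(blocks, quantifications):
        M = np.diag(G.sum(axis=1))  # 1 if observed, 0 if missing
        R = X - G @ Y
        total += np.trace(R.T @ M @ R)
    return total / len(blocks)

codes = np.array([0, 1, MISSING, 1, 0])  # the third respondent did not answer
G = passive_block(codes, n_categories=2)
X = np.array([[-1.0], [0.5], [2.0], [0.7], [-0.8]])  # hypothetical row scores
# Centroid update: G.T @ G only counts observed rows, so the missing row is skipped.
Y = np.linalg.solve(G.T @ G, G.T @ X)
print(passive_loss(X, [G], [Y]))
```

The third respondent has a large score but contributes nothing to the loss or to the category centroids for this variable, which is the sense in which the missing value is "skipped".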

Missing values in MCA
Since MCA can be presented as a particular PCA with some metrics, the approach used to handle missing values in PCA has been extended to MCA by Josse et al. (2012). In PCA, for a data matrix $Z$, it consists of ignoring the missing values by minimizing the reconstruction error over all non-missing elements. This is done by introducing a weight matrix $W$ (with $w_{ij} = 0$ if $z_{ij}$ is missing and $w_{ij} = 1$ otherwise) in the PCA least squares criterion:
$$\| W * (Z - X A^\top) \|^2$$
with $*$ the Hadamard product and $X A^\top$ the rank $S$ reconstruction built from the principal components $X$ and the loadings $A$. This criterion can be minimized either using an alternating weighted least squares algorithm or using iterative PCA (Kiers 1997; Josse, Pagès, and Husson 2009). The latter consists of randomly imputing the missing entries, performing PCA on the completed matrix, and then using the principal components and loadings to impute the missing values. The steps of estimation and imputation are repeated until convergence. From this iterative PCA algorithm, an algorithm called "iterative MCA" has been derived in Josse et al. (2012); it takes into account the features of MCA, such as updates of the column margins.
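The estimation and imputation steps of iterative PCA can be sketched as follows. This is a bare-bones illustration and not the algorithm of Josse et al. (2012): there is no regularization, the initial imputation uses the observed column means rather than random values (an assumption of this sketch), and the data are simulated. The function name `iterative_pca` is ours.

```python
import numpy as np

def iterative_pca(Z, n_components, n_iter=500, tol=1e-12):
    """Impute the NaN entries of Z by alternating PCA estimation and imputation."""
    Z = np.asarray(Z, dtype=float)
    missing = np.isnan(Z)
    # Initial imputation with observed column means (assumption of this sketch).
    filled = np.where(missing, np.nanmean(Z, axis=0), Z)
    errors = []
    for _ in range(n_iter):
        # Estimation: PCA of the completed matrix via the SVD of the centred data.
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        fitted = mean + (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]
        # Imputation: only the missing cells are replaced by their fitted values.
        filled = np.where(missing, fitted, Z)
        # Reconstruction error over the observed cells only.
        errors.append(np.sum((Z[~missing] - fitted[~missing]) ** 2))
        if len(errors) > 1 and errors[-2] - errors[-1] < tol:
            break
    return filled, errors

# Simulated data: one underlying dimension plus noise, with three missing cells.
rng = np.random.default_rng(0)
Z_full = rng.standard_normal((20, 1)) @ rng.standard_normal((1, 4))
Z_full += 0.1 * rng.standard_normal((20, 4))
Z_missing = Z_full.copy()
Z_missing[[0, 3, 7], [1, 2, 0]] = np.nan
completed, errors = iterative_pca(Z_missing, n_components=1)
print(len(errors), errors[-1])
```

The observed cells are never modified, and the error over the observed cells does not increase from one iteration to the next, which is what makes the procedure usable both as an estimation method and as a single imputation method.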

Comparison between both strategies
Both approaches aim at skipping missing values by introducing a weight matrix in the criterion. However, they lead to very different results, as discussed in Josse et al. (2012). As mentioned in Section 2.3, the criterion and the choice of an algorithm have an impact on the properties that are highlighted or, sometimes worse, overlooked. For instance, iterative MCA can be seen as a matrix completion method, which can be interesting in itself (Audigier, Husson, and Josse 2016). Of course, imputation is also possible with HOMALS, although it is less natural since it does not show up in the algorithm.
What can be noted is that the missing passive strategy used in HOMALS was proposed in Benzécri (1973, p. 327), but it has been criticized by the French (Van der Heijden and Escofier 2003; Josse et al. 2012) because many MCA properties are lost. On the contrary, in iterative MCA, the strategy to handle missing values is based on a criterion that is minimized with an iterative algorithm, which is more in the spirit of the Dutch school.

Example: Survey on the perception of genetically modified organisms
To illustrate the methods, we use an example of a survey describing genetically modified organisms (GMOs). These data are described in Husson, Josse, Lê, and Mazet (2011) and are available at Husson, Josse, Lê, and Mazet (2009). The questionnaire contains 16 questions directly linked to the participants' opinion of GMOs. For instance, "Do you feel implicated in the debate about GMOs (a lot, to a certain extent, a little, not at all)?"; "What is your view of GMO cultivation in France (very favourable, favourable, somewhat against, totally opposed)?" and so on. The questionnaire also contains five socio-demographic variables: sex, professional status (farmer, student, manual labourer, senior management, civil servant, accredited professional, technician, retailer, other profession, unemployed, retired), age (-25 years, 25-40 years, 40-60 years, +60 years), "Is your profession or education in any way linked to agriculture, the food industry or the pharmaceutical industry (Yes/No)?", and "Which political movement do you most adhere to (extreme left, green, left, liberal, right, extreme right)?". The aim of the data analysis is first to characterise the respondents in terms of their [...] (Lê, Josse, and Husson 2008).
Figure 1 gives the graph obtained by HOMALS and MCA for the active categories (the two graphs are the same). On the negative side of the first principal component, we can observe those people who feel implicated by the debate surrounding GMOs and who are somewhat against their use (through the categories they chose). On the positive side, we can see those people who do not feel implicated by the debate surrounding GMOs and who are in favour of their use. Along the second principal component, we can also observe those people with less distinct opinions who feel somewhat implicated by the debate surrounding GMOs and who are somewhat against their use. More interpretation is given in Husson et al. (2010).
The representation of supplementary variables in Figure 2 reveals, first, a strong structure for both of the variables profession and identification with a political movement and, second, no particular structure for the variables age, sex, or profession in relation to agriculture, the food industry, and the pharmaceutical industry. The categories senior management, unemployed, and retired are in opposition to the categories technician and manual labourer, with civil servant between the two groups. Similarly, the category right is opposed to the categories green and extreme left, with left in the middle.

[Figure 1: MCA factor map, active categories]
Since some variables in the questionnaire are naturally ordered, we can use HOMALS with the rank 1 constraint (Figure 3) and with the constraint that the categories are ordered (Figure 4). As expected, the categories of the ordinal variables are on a straight line and the order is preserved.
Note also that some attempts have been made in MCA to add constraints; see Benzécri (1973, pp. 261-287) and Beh and Lombardo (2014, Chapter 6). However, the inclusion of constraints is less straightforward than in HOMALS, and constraints are consequently used less by the French school. This lack of use can also be explained by the fact that they have never been implemented. Software is an incredibly powerful tool to popularize methods, and the implementation and availability of methods in software may explain why some practices (even when flawed) are still in use.

Influence of Jan de Leeuw's work on
[...] Benzécri, Henri Caussinus, and above all to the dissertation of Brigitte Escofier-Cordier entitled "l'analyse des correspondances" (Cordier 1965). In turn, the influence of Jan de Leeuw was felt as early as the late 1970s in two areas. The main media were the Revue de Statistique Appliquée and Statistique et Analyse des Données.
The two areas in which the influence of De Leeuw was felt were:
• Optimal scaling, where categorical variables, either ordinal or nominal, are optimally transformed into discrete numerical variables, enabling the use of methods like regression, PCA and discriminant analysis. De Leeuw's dissertation was rapidly known in France and was referred to in Bouroche, Saporta, and Tenenhaus (1975), Saporta (1975), and Tenenhaus (1977). The series of papers by Young, Takane, and De Leeuw in Psychometrika were very influential: see e.g., Dupont-Gatelmand (1979). Nonlinear PCA, which is intimately connected to optimal scaling (De Leeuw 1988; Gifi 1990), inspired many French works more or less directly until the end of the 1990s; see for instance Ferraty (1997).

Contribution of optimization methods in recent work
Jan de Leeuw has been a forerunner in developing and using optimization techniques (see for instance De Leeuw 2016). The block-relaxation algorithms (De Leeuw 1994) or, in more modern words, block coordinate descent, together with majorization-minimization (De Leeuw and Heiser 1977), are used for instance in the regularized generalized canonical correlation analysis (RGCCA) method (Tenenhaus and Tenenhaus 2011) for multi-block data analysis, which concerns the analysis of several sets of variables (blocks) observed on the same group of individuals. The main aims of RGCCA are: (i) to study the relationships between blocks and (ii) to identify subsets of variables of each block which are active in their relationships with the other blocks. RGCCA is based on a monotonically convergent iterative algorithm and has the distinct advantage of being formulated as an explicit optimization problem.

Appendix 4
My first, half-missed, encounter with Jan was in April 1976 on the occasion of a symposium on Optimal Scaling during the spring meeting of the Psychometric Society. It was my first trip to the US and, suffering from jet-lag, I collapsed early in my bed. That's when I got a phone call from Jan offering to get acquainted. I stammered a few words and then I fell asleep again. We finally met the next day.
There is an unfortunate typo in the title of my talk: it was about "nominal", and not about "normal", variables.
The European Meeting of Statisticians, organised from 6 to 11 September 1976 in Grenoble (France) under the auspices of the European Regional Committee of the Bernoulli Society, gave us the opportunity to form a better relationship. I especially remember a lunch organised by Gérard Drouet d'Aubigny in Sassenage, a village near Grenoble: Jan van Rijckevorsel, Jean-Marie Bouroche, Michel Tenenhaus, and a few others were there. On the menu there was a very French, and not very vegetarian, special sausage: the "andouillette", but I do not remember if Jan de Leeuw tasted it! In the proceedings of the Grenoble meeting, one can find a paper by Jan (De Leeuw 1977) as well as the one by the French trio (Bouroche, Saporta, and Tenenhaus 1977), which referred to Jan's communication at the Spring Meeting of the Psychometric Society a few months before.
We had many opportunities to see each other afterwards in various meetings as well as at