The R Package groc for Generalized Regression on Orthogonal Components

The R package groc for generalized regression on orthogonal components contains functions for the prediction of q responses using a set of p predictors. The primary building block is the grid algorithm used to search for components (projections of the data) which are most dependent on the response. The package offers flexibility in the choice of the dependence measure which can be user-defined. The components are found sequentially. A first component is obtained and a smooth fit produces residuals. Then, a second component orthogonal to the first is found which is most dependent on the residuals, and so on. The package can handle models with more than one response. A panoply of models can be achieved through package groc: robust multiple or multivariate linear regression, nonparametric regression on orthogonal components, and classical or robust partial least squares models. Functions for predictions and cross-validation are available and helpful in model selection. The merit of a fit through cross-validation can be assessed with the predicted residual error sum of squares or the predicted residual error median absolute deviation which is more appropriate in the presence of outliers.


Introduction
The R (R Core Team 2015) package groc (Bilodeau and Lafaye de Micheaux 2015) is for generalized regression on orthogonal components.It bears some similarities with the package pls (Mevik, Wehrens, and Liland 2013) used to fit partial least squares (PLS) models.In the multivariate case, using the SIMPLS algorithm of de Jong (1993), package pls finds two linear combinations, one of responses and another of predictors, for which their covariance is maximum.The linear combination of predictors is called the first PLS component.The second PLS component is obtained by repeating this optimization of covariance on the deflated responses and predictors.The deflated responses are obtained by computing the residuals of the classical linear regression of the responses on the first PLS component, and similarly for the deflated predictors.Package groc generalizes this scheme.It finds two linear combinations, one of responses and another of predictors, for which a general measure of dependence is maximum.The deflated responses in package groc may take several forms depending on the application one has in mind.It can be the residuals of a linear regression, a robust linear regression, or a nonparametric regression of the responses on the first component.The deflated predictors are computed to cover the linear span of predictors orthogonal to the first component.The deflated predictors considered in package groc are the residuals of the classical or robust linear regression of the predictors on the first component.Package groc is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=groc.A brief description of related regression methods follows to situate package groc in the existing literature.The multiple linear regression model has a mean function of the form A non-parametric generalization is the generalized additive model (GAM) given by E(y) = g 1 (x 1 ) + • • • + g p (x p ), with unknown functions g j estimated by the backfitting algorithm of Buja, Hastie, and Tibshirani (1989).A GAM fit can be obtained with the function gam of the package mgcv (Wood 2006(Wood , 2015)).These two models are defined in terms of the original predictors x1 ,. . .,x p .When the predictors are multicollinear, a better approach to a multiple linear regression model is the PLS model The number h of components is determined by cross-validation.With the original predictors written as the matrix X = (x 1 , . . ., xp ), each PLS component is a linear combination t i+1 = Xr i+1 obtained sequentially as the solution of the optimization where the second argument to COV is the residual of the multiple linear regression of y on the previous components t 1 , . . ., t i .This formulation was used to make the comparison of package pls to package groc and projection pursuit regression (PPR) clearer.In fact, because of the linearity and orthogonality properties of PLS, y could be used equivalently to the residuals in the second argument.The model in package groc can be viewed as a generalization to PLS since it is a generalized additive model defined in terms of orthogonal components.The orthogonal components are obtained sequentially by solving the optimization where D is a general dependence measure, and letting t i+1 = Xr i+1 .This algorithm requires an evaluation of the dependence measure for each candidate direction r.Once t i+1 is found, the next function g i+1 is found by the backfitting algorithm of GAM which simultaneously updates the previous functions g 1 , . . ., g i .In the special parametric case g i (t i ) = c i t i and D = COV, the function groc from package groc reduces to the function plsr from package pls.The groc proposal is more flexible than pls and is more restrictive than projection pursuit regression (function ppr in package stats) which fits a model as in Equation 1 without imposing the constraint of orthogonality of components.The sequential algorithm proposed in ppr solves the optimization Here, for each direction r one must find the smooth fit S r which minimizes the residual sum of squares.The optimal value of r denoted r i+1 yields the next component t i+1 = Xr i+1 and the estimated function g i+1 = S r i+1 .Friedman and Stuetzle (1981) also suggest to update the functions g 1 , . . ., g i+1 by backfitting the GAM model based on the components t 1 , . . ., t i+1 .
Regardless of the fitting method used, prediction mean squared error is small when a good compromise between squared bias and variance is reached.Increasing flexibility of the models is generally accompanied by increasing variances of predictions and decreasing squared biases.It is expected that groc yields higher variances than plsr but smaller than ppr, with reverse ordering when considering squared biases.Depending on the particular data analysis task at hand, one may choose the best fitting method by cross-validation in which case groc becomes one among several competitors.
By an appropriate choice of the dependence measure D and of the smoother used in the GAM fit, the flexibility of groc can be used to fit robust multiple or multivariate linear regression, nonparametric regression on orthogonal components, and classical or robust partial least squares models.The groc model situated between pls and ppr has not been considered as such in the literature.The numerical grid algorithm originally proposed by Croux, Filzmoser, and Oliveira (2007) and implemented in the package pcaPP (Filzmoser, Fritz, and Kalcher 2014) for projection-pursuit robust principal component analysis is also used for the first time in the regression context to find optimal components.The grid algorithm could also be used to search for optimal components of ppr models in replacement to the algorithm of Rosenbrock (1960) used by Friedman and Stuetzle (1981).However, it is not clear which numerical algorithm does the most extensive search over the unit sphere since the algorithm of Rosenbrock (1960) in the context of ppr is not described in detail in Friedman and Stuetzle (1981).
The framework is oriented towards prediction and not towards interpretation of components or inference about model parameters.Prediction accuracy is measured by cross-validation, a practice that is common in machine learning.The usefulness of package groc is illustrated on several examples using simulated and mostly real data.Examples of real data analysis are: the influence of five anatomical factors on wood specific gravity, volume of timber as a function of height and girth of trees, densities of NIR spectra of PET yarns, a particle physics experiment to predict the combined energy of a particle using six predictors, and the prediction of six attributes from a sensory panel using five physico-chemical quality parameters of olive oil.
The paper is organized as follows.Section 2 treats models with a single response.The first component is defined in Section 2 and the grid algorithm for its computation is described in Section 2.1.The grid algorithm in package groc is similar to the one described in Croux et al. (2007) and integrated in the package pcaPP for robust principal component analysis.Models with only one component comprise the robust multiple linear regression model in Section 2.2 and nonparametric regression on the first component in Section 2.3.The algorithm to sequentially obtain the first few components is described in Section 3. The numerical accuracy of the package groc is shown on a real data example where the grid algorithm reproduces the results of the package pls for which the optimization problem has an explicit solution.Section 4 treats model selection through cross-validation.The function grocCrossval based on the predict method for 'groc' objects computes the PRESS (predicted residual errors sum of squares) and the PREMAD (predicted residual errors median absolute deviation) which may be more appropriate for data with vertical outliers and bad leverage points.As a dependence measure, the distance covariance of Székely, Rizzo, and Bakirov (2007) is also presented in Section 4 and its usefulness is shown through an analysis of simulated data with quadratic functions of two components, borrowed from Friedman and Stuetzle (1981), where package groc is compared to the packages pls and ppr.The orthogonal invariance of package groc in Section 5 is used to speed up computations through a preprocessing step in Section 5.1 which replaces the predictors by their principal components whenever the number of predictors surpasses the number of observations.Section 6 generalizes the algorithms to multivariate models with several responses.In Section 7, the package groc is used to fit robust multivariate partial least squares models and, as a particular case, robust multivariate linear models.The robustness of package groc allows the identification of vertical outliers as well as bad leverage points.The function plot method for 'groc' objects graphs robust Mahalanobis distances of residuals versus robust Mahalanobis distances of components for this identification.An example shows that package groc is not affected by vertical outliers contrary to the MATLAB (The MathWorks Inc. 2014) package prm (Daszykowskia, Serneels, Walczak, Espen, and Croux 2013;Daszykowskia, Serneels, Kaczmarek, Espen, Croux, and Walczak 2007) for partial robust M-regression in Serneels, Croux, Filzmoser, and Espen (2005).The R code to reproduce all examples is available in the supplementary material.

The first component
We begin with regression models with multiple predictors and a single response.Höskuldsson (1998) gives the following interpretation of partial least squares regression: the first component is the linear combination of predictors with maximal covariance with the response.Partial least squares regression is available in the package pls through the function plsr.This proposal replaces the covariance by a general dependence measure.For n observations, the data consists of the response vector y.The observations form the n × p design matrix X = (x 1 , . . ., x n ) , where p is the number of predictors.It should be emphasized that x i denotes case i, whereas xj denoted variable j in Section 1.This means that x i is the ith row of X and xj is its jth column.All variables are centered at their means.The first direction of partial least squares is obtained from the optimization where COV(t, u) = t u is the dot product.The first component is t 1 = Xr 1 .Now, let D(t, u) be a measure of mutual dependence (thus more general than the covariance) between Algorithm 1 Grid algorithm for optimal direction Require: X, y 1: r = e 1 Initialization 2: for i = 1, . . ., N c do 3: for k = 1, . . ., N g do 4: end for 6: for j = 1, . . ., p do 7: end for 10: end for 11: return r Optimal direction the vectors t and u, such as for example the distance covariance of Székely et al. (2007) or the statistic of Deheuvels described in Genest, Quessy, and Rémillard (2006).In our context of generalized regression on orthogonal components (groc), the optimization problem becomes The first component is again t 1 = Xr 1 .In this approach, the component is a linear combination of the original variables, which is not the case for kernel methods as in Rosipal and Trejo (2001).All dependent statistics considered in this paper are either nonnegative, or else, their sign can be inverted by a sign inversion of r.Hence, it is assumed without loss of generality that D ≥ 0.
Once the first component t 1 is found, a smooth fit of the response y on t 1 is computed to obtain a predictive model ŷ = g 1 (t 1 ).

Grid algorithm for the first component
The algorithm for determining the first direction r 1 is inspired from the grid algorithm in Croux et al. (2007) for robust principal components.In our context, the grid algorithm is as follows.Let e j be the p-dimensional unit vector with a one in position j and zeros elsewhere.
As described in Croux et al. (2007), during one of N c cycles, a sequence of p grid searches over planes is carried out; the jth grid search updates the jth coordinate of r.When the second cycle starts, it is assumed that r is already pointing in the right direction but still needs local improvement.Hence, the grid search will not look over the whole plane, but only over the half-plane determined by the angles [−π/2, +π/2).After every cycle, the search is limited to a more narrow interval of angles, but keeping the number N g of grid points constant for every search.The grid Algorithm 1 searches for the global optimum for dependence functions D which need not be differentiable or even continuous as is the case when ranks are used.
The original grid algorithm in Croux et al. (2007) was proposed to find orthogonal principal components based on data X.For each candidate direction r, it computes a robust measure of scale of the component, say σ(Xr).Algorithm 1 is the original grid algorithm with the exception that the dependence measure D(Xr, y) in step 7 is used in replacement to the measure of scale σ(Xr).

Robust multiple linear regression
The least squares fit of a multiple linear regression finds the linear combination of predictors whose squared Pearson correlation with the response is the largest.A robust fit is obtained by using a robust correlation.A robust covariance matrix is the orthogonalized Gnanadesikan-Kettenring pairwise estimator (Maronna and Zamar 2002).The R package robust A robust multiple linear regression can be fitted in two steps: 1. Find the optimal direction and obtain the first component t 1 = Xr 1 .
2. Fit a robust simple linear regression (with a single predictor) through the origin ŷ = c 1 t 1 .
The fitted multiple linear regression becomes where β 1 = c 1 r 1 .This robust multiple linear regression requires the repeated evaluation of robust bivariate correlations and a single robust simple linear regression.
Example 1.The data wood in the package robustbase (Todorov and Filzmoser 2009;Rousseeuw et al. 2015) is used to illustrate the proposed robust fitting method.The original data are from Draper and Smith (1966) and were used to determine the influence of five anatomical wood factors (x1 to x5) on wood specific gravity (y).These data were contaminated by replacing a few observations with outliers in Rousseeuw and Leroy (1987).
The robust fit was computed with the function groc.The option method = "lts" for least trimmed squares specifies the robust simple linear regression of step 2. Two common choices of the breakdown point are 0.5 for the highest possible breakdown and 0.25, the default in groc, which gives a better compromise between efficiency and breakdown.In Figure 1, which contains a plot of robust Mahalanobis distances for residuals versus robust Mahalanobis distances for components, four bad leverage points were found as in the analysis of Rousseeuw and Leroy (1987, p. 245).Identification of outliers was made by robustly scaling residuals with the τ scale of Yohai and Zamar (1988).The function scaleTau2 of the package robustbase was used for that purpose.Scaled residuals in absolute value greater than the square root of the 0.975 quantile of a chi-square distribution with 1 degree of freedom (shown with a dashed line in Figure 1) were flagged as outliers.Identification of leverage points was done similarly using the component.q q q q q q q q q q q q q q q q q q q q 0.0 0.5 1.0 1.5 2.0 The squared robust correlation between the response y and the fitted value ŷ is obtained as follows: R> corrob(wood$y, fitted(out))^2 [1] 0.9518

Nonparametric regression on the first component
A dependence measure suited to monotonic relations is Spearman's rank correlation which is the correlation between two vectors of ranks where R(t i ) denotes the rank of t i among t 1 , . . ., t n .
The first component t 1 is determined to maximize Spearman's correlation with the response y.A smoothing spline is then fitted to the scatterplot of y versus t 1 .
Nguyen and Rojo ( 2009) considered the optimization problem where R(y) is the vector of ranks corresponding to y and R(X) is the matrix of ranks corresponding to predictors computed columnwise, i.e., separately for each predictor.The transformation to ranks is done in an attempt to robustify the partial least squares regression.It must be noted, however, that R(X)r is different from R(Xr), hence the solution to problem (3) does not give the component Xr with the largest Spearman's correlation with the response.In applications where the response may take negative values, the Box-Cox transformation requires adding a somewhat arbitrary constant to achieve positivity.In this case, the groc fit can be applied directly which may be considered an advantage.q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q −6 −4

Subsequent components
Initially, let X 0 = X and y 0 = y.Having computed the first component, t 1 = X 0 r 1 , each predictor in X 0 is regressed on t 1 to obtain the residuals Since residuals are orthogonal to predictors, then X 1 r is orthogonal to t 1 , for all r.The residuals are then computed where g 1 is the result of some smooth fit applied to the scatterplot of y 0 and t 1 .In other words, after the first component t 1 is found, a fit is made of y 0 using t 1 as the predictor in a smooth fit.The residual of this fit is y 1 .If the user is satisfied with this residual, then only one component is used.Otherwise, a search is done to find some structure in the residual in the orthogonal complement of the space spanned by t 1 .
For the second component, optimize as before using the grid Algorithm 1.Then, t 2 = X 1 r 2 is the second component.
is the residual of the linear regression through the origin of y 0 on t 1 , where c 1 = t 1 y 0 /(t 1 t 1 ).In this case, for the second component, since Returning to groc, all the components can be computed sequentially.Having computed the first few components t 1 , . . ., t h , compute Then, the residual of a generalized additive model (GAM) fit of y 0 on the first h components is For all r, X h r is orthogonal to t 1 , . . ., t h .Therefore, the optimization problem for the next component is The next orthogonal component is t h+1 = X h r h+1 .
In the GAM fit (4), although the components are orthogonal, the last function g h is fitted and the previous functions g 1 , . . ., g h−1 are updated by backfitting to improve the fit.Backfitting could be ignored to speed up computations.This is unlike plsr, in which case the last coefficient c h in the linear function g h (t h ) = c h t h is obtained by ordinary least squares and the previous coefficients c 1 , . . ., c h−1 remain unchanged.
The method in groc for computing the fit on the first h components is summarized in Algorithm 2.
Example 3. Consider again the data wood with one response and five predictors of Example 1.This example is used to show the numerical accuracy of Algorithms 1 and 2. Recall that groc with covariance as the dependence statistic and lm (ordinary least squares) as the smoother is equivalent to plsr in the sense that they are two solutions to the same optimization problem.However, unlike plsr which relies on an explicit solution of the optimization, groc does an extensive search of directions on the unit sphere in five dimensional Euclidean space using the grid algorithm.The much smaller relative difference for h = p (five here) components is expected.Indeed, plsr and groc with the option method = "lm" produce a least squares regression on the p extracted components which are linearly independent and even orthogonal.In both cases, the matrices of components which may differ have the form T = XA, where the p × p matrices A are non singular.Since fitted values of least squares regression are invariant to non singular linear transformations of the predictors, the two methods numerically coincide in this case.This holds regardless of the dependence function D used in groc.Due again to the invariance property, when h = p, a groc fit with the option method = "lts" produces the same result as a least trimmed squares fit of y on the original predictors X.

Model selection
The merits of several models are compared by cross-validation.A cross-validation criterion is defined in terms of the absolute error of predictions |y − ŷ|.The package pls has the function crossval to compute the predicted residual errors sum of squares (PRESS) by cross-validation which is the sum of squares of |y − ŷ| over all predictions made.The corresponding function for package groc is grocCrossval.In addition to PRESS, grocCrossval also computes the predicted residual errors median absolute deviation (PREMAD) which is the median of |y − ŷ| over all predictions.It is more relevant for model selection in the presence of outliers.
For the next example, we introduce a dependence measure which may not be known to most users.For accuracy of the optimization by the grid algorithm, dependence measures which are smooth functions of the data should be preferred.A smooth dependence measure is the squared distance covariance of Székely et al. (2007) where where ϕ T,U is the joint characteristic function, and ϕ T , ϕ U are the marginal versions.It is thus universal in the sense that it can detect any kind of dependence.A similar dependence measure is the statistic of Deheuvels described in Genest et al. (2006).It is a discontinuous function of the data since it is computed from ranks.It is also universal as it consistently estimates in terms of distribution functions.Our preference, however, is for the distance covariance of Székely et al. (2007) which leads to greater numerical accuracy of the optimization in the grid algorithm.
Example 4. A synthetic example is the nonlinear regression model with conditional mean Friedman and Stuetzle (1981) to illustrate the method of projection pursuit regression.This model can be rewritten as a sum of two quadratic functions of linear functions of x 1 and x 2 , The predictors x 1 and x 2 are independently and uniformly distributed over the interval (−1, 1) and the errors are normally distributed with a variance of 0.04.The sample size is n = 200.Two models were fitted to these simulated data: a plsr model and a groc model using the distance covariance dcov and smoothing splines.The functions found by smoothing splines of components in Figure 3 resemble two quadratic functions of opposite signs as in Equation 6.Since the true model is not additive, a GAM fit using smoothing splines of the original variables would be unable to find this structure in the data where our method was successful.This example is concluded with a comparison of groc with projection pursuit regression (ppr).Since there is no cross-validation function for ppr, the value of PRESS for ten-fold cross-validation is evaluated directly by deleting subsets of 20 consecutive observations.It ensures that predictions for groc and ppr are made exactly at the same data points.Evaluation of PRESS for a ppr fit is very easy since the function predict may be applied to an R object of class 'ppr'.In terms of PRESS, predictions with groc were better than with ppr with a PRESS value of 10.21 for the ppr model with two components.The ppr fit was computed with the option optlevel = 3, the highest level of thoroughness of the optimization routine.It may be argued, however, that this example favors groc because the two components x 1 + x 2 and x 1 − x 2 are uncorrelated.Components in ppr are generally not orthogonal.

Orthogonal invariance
Let r 1 be the solution to the generalized optimization problem (2) for the first component.Consider the orthogonal transformation X → XG for an orthogonal p × p matrix G.The optimization problem for the transformed data is sup r =1 D(XGr, y).
Since Gr runs over the unit sphere when r does, the solution is G r 1 .Hence, the first component is invariant to this orthogonal transformation since The orthogonal invariance of the other components follows by induction as in Denham (1995) for partial least squares.

Preprocessing of the data
Denham (1995) wrote: "By the use of orthogonal transformations it is therefore possible to reduce the original problem to a much smaller equivalent problem in which p is effectively reduced to the rank of the centered matrix, X, which for typical infrared problems will be n − 1."It is most important to reduce computational time.Remember that the grid Algorithm 1 for each direction searches for a unit vector in a p dimensional space.This can be a burden especially when p is much larger than n which is often the case for partial least squares applications.The following preprocessing can be done as soon as p > n.
It consists of entering Algorithm 2 for the first h orthogonal components not with X but with principal components derived from X.For mean centered X, a principal component analysis gives XH = U, where X is n × p of rank a (say), H is p × a with orthonormal columns (loadings) and U is the matrix n × a of principal components (scores).In fact, if all principal components were calculated we would have where (H, F p×(p−a) ) is now p×p orthogonal.The last p−a principal components are degenerate at 0; all the variability is explained with the first a components.The data actually belongs to an a dimensional space and after the appropriate rotation the last p − a components of the data are all 0. So X and U are the same data apart from the rotation.As a consequence, the components from X are the same as the one from U. Indeed, because of the orthogonal invariance, the components from X are the same as those from (U, 0 n×(p−a) ).The following condition on the dependence statistic is assumed.
The dependence statistic, after the orthogonal transformation, is where f is the vector of the first a components of r, which satisfies f ≤ r = 1.Then, the first direction is the solution to sup However, since D can always increase, or remain constant, when f is multiplied by a scalar c ≥ 1, it follows that the solution is always on the boundary, f = 1.
Example 5. Orthogonal invariance is shown empirically by way of the chemometric data yarn of the pls package.The response is density and the predictors are the NIR spectra.There are 28 observations and 268 predictors.The function plsr does not preprocess to principal components, whereas, our function groc preprocesses the data whenever n < p.
The maximal number of components in this case is 27; it is 26 when the merit of the fit is evaluated by leave-one-out cross-validation.The relative differences between fitted values with and without preprocessing were for all observations, response variables and number of components less than 0.03% even in high dimensions.Since optimal partial least squares directions have an explicit solution, it cannot be expected that groc which did an extensive search by the grid Algorithm 1 in about 1.6 sec performs as fast as plsr which required only 0.02 sec.The PRESS values produced respectively by plsr and groc were also very close and suggested a model with 12 components.On the grounds of the PREMAD values, a model with only 8 components may be suitable.Example 6.The data prim7 of the groc package is a particle physics experiment analyzed by projection pursuit regression in Friedman and Stuetzle (1981).It has 7 variables on 500 observations.In our approach, a nonlinear model to predict the combined energy of the three π mesons X 1 using the other variables X 2 , . . ., X 7 as predictors was fitted using the distance covariance measure and smoothing splines.The orthogonal invariance is numerically verified in this example with n ≥ p. Preprocessing must be done separately because the function groc does preprocessing only when n < p.The first three directions r i , i = 1, 2, 3, without preprocessing R> data("prim7", package = "groc") R> prim7.out<-groc(X1 ~., ncomp = 3, D = dcov, method = "s", data = prim7) R> prim7.out$R
The following example illustrates the usefulness of package groc in the context where n < p and the simulated model is nonlinear with two components.
Example 7. The simulated model is where x 1 ,. . ., x 5 are independently and uniformly distributed over the interval (−1, 1) and the errors are normally distributed with a variance of 1.Note that x 3 , . . ., x 5 are part of the data but not involved in the data generating process.The sample size is n = 50.The data also contain 50 more predictors perfectly collinear with the first 5 predictors resulting in a situation where n = 50 and p = 55.It should be noted that the two variables x 1 + x 2 and x 1 + 5x 2 are not uncorrelated, and therefore, the two components are not approximately orthogonal.A model is fitted with plsr, ppr and groc always using two components.For each of the three fits, the correlation between the actual response and the fitted value is computed and the user time is recorded.This model fitting scheme is repeated 30 times.Average correlations and user times over these 30 fits are finally reported in the following order: plsr, ppr, and groc.The average correlations [1] 0.3193 0.9228 0.9589 achieved by ppr and groc are both very high as compared to plsr.In terms of average user times, [1] 0.004933 3.511100 0.903033 plsr is the clear winner with groc being about four times as fast as ppr.

The multivariate case
The response matrix Y is now of dimension n × q.Initially, let X = X 0 and Y = Y 0 .The first direction r 1 is the solution for r in the optimization sup The first component is t 1 = X 0 r 1 .Having computed the components t 1 , . . ., t h , compute the deflated design matrix Then, for each response variable, compute the residuals from a GAM fit where g 1j , . . ., g hj is the result of a GAM fit of the jth variable (column) in Y 0 on t 1 , . . ., t h .The next orthogonal component is t h+1 = X h r h+1 , where r h+1 is the solution for r in the optimization sup When D is the covariance and ordinary least squares is the smoother in the GAM fit, groc reproduces the plsr model estimated by the SIMPLS algorithm of de Jong (1993).A justification in terms of the optimization of the objective function ( 8) is given in Boulesteix and Strimmer (2007).The optimization ( 8) is done using the grid Algorithm 1, in which at every cycle, a search over r, with s fixed, is followed by a search over s, with r fixed, alternating this way until convergence in a similar way as in the nonlinear iterative partial least squares (NIPALS) algorithm of Wold (1975).
Example 8.The data oliveoil of the pls package contains scores on 6 attributes from a sensory panel and measurements of 5 physico-chemical quality parameters on 16 olive oil samples.The first five oils are Greek, the next five are Italian and the last six are Spanish.
A comparison of plsr with the option method = "simpls" and groc with the options D = covariance and method = "lm" gives relative differences less than 0.009% between predicted values for all observations, response variables and number of components.The PRESS values from leave-one-out cross-validation for each response and any number of components are given next.

Robust multivariate partial least squares regression
Multivariate partial least squares regression by the SIMPLS algorithm can be computed by groc with the options D = covariance and method = "lm".Although, not as fast as plsr with the option method = "simpls", groc is flexible and can be adapted to produce a robust fit able to withstand vertical outliers and bad leverage points.For this purpose, a robust covariance using the robust τ scale estimate of Yohai and Zamar (1988) is defined as suggested by Gnanadesikan and Kettenring (1972) In Algorithm 2, the GAM fit ŷ0 = g 1 (t 1 ) + is replaced by a least trimmed squares regression.The deflation of X by ordinary least squares, is also robustified by least trimmed squares (lts).By default, all the least trimmed squares regressions use a breakdown point of 0.25 for a good compromise between robustness and efficiency.No multivariate implementation of least trimmed squares is available.Hence, the coefficients b i are replaced by the coefficients of lts regressions computed for each predictor separately.The deflation of X finally computes the residuals of these lts regressions.The groc function can be used with the option plsrob = TRUE.
The flexibility of the package groc allows to fit robust partial least squares regression by robustifying each step of Algorithm 2.
Example 9.The data cookie described in Osborne, Fearn, Miller, and Douglas (1984) contains measurements from quantitative NIR spectroscopy.It is part of the ppls package (Krämer, Boulesteix, and Tutz 2008;Kraemer and Boulesteix 2014) which fits only models with a single response.The data arise from an experiment done to test the feasibility of NIR spectroscopy to measure the composition of biscuit dough pieces (formed but unbaked biscuits).A sample set of size 40 was made up, with the standard recipe varied to provide a large range for each of the four constituents under investigation: fat, sucrose, dry flour, and water.The calculated percentages of these four ingredients represent the 4 responses.Observation 23 was identified as an outlier by Osborne et al. (1984).An NIR reflectance spectrum is available for each dough piece.The spectral data consist of 700 points measured from 1100 to 2498 nanometers (nm) in steps of 2 nm.The first and last 50 points of the spectrum were discarded because of lower instrumental reliability.Then, the data were transformed as in Hubert, Rousseeuw, and Verboven (2002).Firstly, a logarithmic transformation of NIR reflectance spectrum was applied to eliminate drift and background scatter.Secondly, we used first differences to remove constants and sudden shifts.After this preprocessing, we ended up with a design matrix with 40 observations and 600 predictors.Figure 4 shows the preprocessed reflectance spectrum.The number of components using plsr was found with the leave-one-out cross-validation PRESS criterion averaged over all 4 responses.The plot in Figure 5 suggests a model with 3 components.The correlations between observed response and fitted value are very large for the 4 responses: 0.9787, 0.9335, 0.9185 and 0.9426.Outlier detection was done using 3 components with plsr and groc with the plsrob = TRUE option.For groc, robust Mahalanobis distances were computed in the space of components where μi and τi are the location and scale τ estimates of Yohai and Zamar (1988) from the component t i with components t 1i , . . ., t ni .Since components are orthogonal, Mahalanobis groc: Generalized Regression on Orthogonal Components in R q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0 10 20 30 Critical threshold for squared Mahalanobis distances were determined as the 0.975 quantile of a chi-squared distribution with degrees of freedom equal to the dimension of the corresponding space.The robust correlations evaluated by corrob between each response and its corresponding fitted value by groc (0.961, 0.9618, 0.9839 and 0.9888) are now even stronger.Another method called partial robust M-regression (prm) in Serneels et al. (2005) is faster but not as robust as groc.A MATLAB routine for models with a single response is implemented in (Daszykowskia et al. 2013).Two reasons for the lack of robustness of prm are: 1. Their weight function for residuals in models with a single response is w r i = f êi σ , c , q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0 1 2 3 4 0 1 2 3 4 5 6

Classical distances
Components Residuals 7 21 23 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0 1 2 3 4 5 0 10 20 30 40 50 60 70 where

Components
and c is a tuning constant, taken as c = 4.The estimate σ is the robust scale estimate qn of Rousseeuw and Croux (1993).Therefore, the weights are never zero even for points with very large standardized residuals.The same remark holds for the weights measuring the leverage in the space of components.
2. But most importantly, in the first step of their routine, they define residual weights from the residuals taken as y i − med j y j and leverage weights using , c from the original cases x i .The notation med L 1 stands for the L 1 median also called the spatial median.Incorrect identification of outliers and leverage points in the first step of their routine may lead to a local minimum corresponding to a nonrobust solution.This happens for data with vertical outliers which are neither outlying in the space of y i 's, nor in the space of x i 's.
An artificial dataset best illustrates this caveat.
Example 10.Consider the data with n = 30 generated from the model y = t 1 + , where t 1 is uniform on the interval (−1, 1) and is normal with a standard deviation of 0.05.Vertical outliers are induced at observations 14, 15, and 16 by adding the constant 0.5 to the y values.
Next, the prm and groc methods are applied to an augmented dataset containing the response y, the predictor t 1 , and two perfectly collinear predictors 2t 1 and −1.5t 1 .Figure 7 exhibits the data and the fitted values.The lack of robustness of prm is evident from its fitted line which is attracted by the vertical outliers.q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q −0.When the design matrix X is of full column rank p, a multivariate least squares regression is equivalent to a partial least squares regression by the SIMPLS algorithm with p components.Hence, groc can also be used for robust multivariate least squares regression.This is now illustrated with a last example on a real data related to pulp fiber.
Example 11.Cited from Rousseeuw, Aelst, Driessen, and Gulló (2004): "The dataset contains measurements of properties of pulp fibers and the paper made from them.The aim is to investigate relations between pulp fiber properties and the resulting paper properties.The dataset contains n = 62 measurements of the following four pulp fiber characteristics: arithmetic fiber length, long fiber fraction, fine fiber fraction, and zero span tensile.The four paper properties that have been measured are breaking length, elastic modulus, stress at failure, and burst strength."As in the previous example, a robust multivariate linear regression is fitted by groc with the option plsrob = TRUE with four components.Identification of outliers and leverage points is done again with robust Mahalanobis distances in both spaces: components and residuals.The cutoff point is the same in both spaces since here p = q = 4, and is taken as the square root of the 0.975 quantile of the chi-squared distribution with four degrees of freedom as in Rousseeuw et al. (2004).Our result is given in Figure 8 which is very close to their Figure 3. Citing again from Rousseeuw et al. (2004): "By exploring the origin of the collected data we found out that all but the last four pulp samples (observations 59-62) were produced from fir wood." and "For example, observation 62 is the only sample from a chemi-thermomechanical pulping process, observations 60 and 61 are the only samples from a solvent pulping process, and observations 51, 52 and 56 are obtained from a kraft pulping process."The bad leverage points, especially observations 60 and 61, exert an undue influence on the classical SIMPLS fit which gives small residuals at these points on the left panel of Figure 8. q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0 1 2 3 4 5 61 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0 5 10 15  -21.470 -4.747 -7.911 -23.143 vary approximately between 5% and 20% for all four responses.

Conclusion
When the true model between responses and predictors is nonlinear, groc naturally outperforms pls (partial least squares).It can even perform better than ppr (projection pursuit regression) because the orthogonality constraints between components of package groc yields a reduction in variability of predictions, as compared to ppr, which can largely compensate for the increased squared biais.When the number of predictors is large, package groc can also outperform ppr in terms of computer user time.
When the true model between responses and predictors is linear, package groc is more robust than package pls and prm (partial robust M-regression).It is thus better suited for the identification of outliers.It could be used in conjunction with package pls by identifying outliers with package groc, and performing the final analysis on the good data with package pls.
(Wang et al. 2014) contains a function covRob which computes this estimator with the option estim = "pairwiseGK".It also has the option corr = TRUE for the correlation matrix.Our package groc has its own function corrob to compute this robust correlation.Our corrob function uses non-default values for the arguments of the original covRob function.Defining our own function corrob prevents to allow passing extra argument values to the D function.Note that our preference is to let the user define their own function D with only two arguments.

Figure 1 :
Figure 1: Diagnostic plot for the data wood in Example 1. Robust Mahalanobis distances for residuals versus robust Mahalanobis distances for components.

Figure 2 :
Figure 2: Smoothing spline of volume on first component determined by Spearman correlation.

Figure 3 :
Figure 3: Plot of fitted functions by a groc model with components found by the distance covariance measure and smoothing splines.

Figure 5 :
Figure 5: Average PRESS values over all responses as a function of the number of components.

Figure 6
reveals a bad leverage point (observation 23) two other influential points (observations 7 and 21).These 3 points are much more singled out by the robust fit which also identified other points with large residuals.The robust distances on the right panel of Figure 6 may differ slightly for different runs of the program since lts does an approximation by an exhaustive enumeration up to 5000 samples with the default option nsamp = "best".

Figure 6 :
Figure 6: Diagnostic plots for the data cookie in Example 9.

Figure 7 :
Figure 7: Plot of the artificial data of Example 10.

Figure 8 :
Figure 8: Diagnostic plots for the data pulpfiber in Example 11.
The relative differences, in percentages, between fitted values Note that the PRESS values for plsr (not shown) can be obtained using subsetting with plsr.v$validation$PRESS.