PCovR: An R Package for Principal Covariates Regression

In this article, we present PCovR , an R package for performing principal covariates regression (PCovR; De Jong and Kiers 1992). PCovR was developed for analyzing regression data with many and/or highly collinear predictor variables. The method simultaneously reduces the predictor variables to a limited number of components and regresses the criterion variables on these components. The ﬂexibility, interpretational advantages, and computational simplicity of PCovR make the method stand out between many other regression methods. The PCovR package oﬀers data preprocessing options, new model selection procedures, and several component rotation strategies, some of which were not available in R up till now. The use and usefulness of the package is illustrated with a real dataset, called psychiatrists .


Introduction
Principal covariates regression (PCovR) was proposed by De Jong and Kiers (1992) to deal with the interpretational and technical problems that are often encountered when applying linear regression analysis using a relatively high number of predictor variables -say, higher than 10.Indeed, when interpreting a particular regression weight, in principle all other predictors and corresponding regression weights have to be taken into account.Furthermore, chances get higher that at least some of the predictor variables will be highly correlated with a linear combination of the other predictor variables.In the latter case, parameter estimates may become unstable, in that removing or adding one single observation can dramatically alter the regression weights, which is the so-called bouncing beta problem (Kiers and Smilde 2007).
In PCovR, the predictor variables are reduced to a limited number of components and the criterion variables are regressed on these components.Specifically, the components are linear combinations of the predictor variables that are constructed in such a way that they summarize the predictor variables as good as possible, but at the same time allow for an optimal prediction of the criterion variables.As the user may choose the extent to which both aspects (good summary of predictors, optimal prediction of criteria) play a role when constructing the components, PCovR is a flexible approach that subsumes principal components regression (Jolliffe 1982) and reduced-rank regression (Izenman 1975) as special cases.As is often the case with component techniques, the components have rotational freedom (including reflectional and permutational freedom) which can be exploited to enhance the interpretability of the PCovR parameters.Another attractive feature of PCovR is that a closed form solution exists, as optimal model estimates can be obtained by conducting one single eigenvalue decomposition.However, the flexibility of PCovR (number of components to be used, extent to which summarizing the predictors and predicting the criteria are emphasized, rotational freedom) has as downside that the user has to choose among a huge number of possible solutions.Up to now, no software was available to assist users in this daunting task.
To be sure, other dimension reduction methods exist for solving the above described problems, such as the R (R Core Team 2015) package pls (Mevik, Wehrens, and Liland 2013) for partial least squares (PLS; Wold, Ruhe, Wold, and Dunn III 1984), which is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=pls.However, whereas PLS focuses explicitly on the prediction of the criterion block, PCovR allows to flexibly balance appropriate reduction of the predictors and accurate prediction of the criteria.Hence, it is no surprise that Kiers and Smilde (2007), who compared the performance of PCovR (using three different weighting schemes) and PLS in five different simulation settings, demonstrated that, for each setting, at least one PCovR weighting scheme yielded better or equally good results than PLS.The method that resembles PCovR the most is exploratory structural equation modeling (ESEM; Asparouhov and Muthén 2009), implemented in the commercial software package Mplus (Muthén andMuthén 1998-2011).However, ESEM does not have the unique combination of flexibility and computational simplicity that is typical for PCovR.In this paper we present a new R package, the PCovR package (Vervloet, Kiers, and Ceulemans 2015), available from CRAN at http://CRAN.R-project.org/package=PCovR, to perform principal covariates regression in R. Several rotation options are provided in this package, including some rotation strategies that were not available in R up to now, as well as some new model selection procedures.
The remainder of this paper is structured as follows: First, we will recapitulate PCovR analysis, by discussing the data and the associated preprocessing, model formulae, loss function, model estimation, and model selection.Next, we will describe the usage of the PCovR package, by giving a step-by-step overview of the available options.

Data
To run a PCovR analysis, two matrices are needed.A first matrix, X, contains the information regarding the J predictors under study and the second, Y, holds the data on the K criteria.These predictors and criteria are measured for the same N observations.When applying dimension reduction methods, appropriate preprocessing of the data is important, as it will influence the obtained results.Here we consider two different forms of preprocessing: centering and scaling.As PCovR is based on the principles of principal component analysis, centering of X (i.e., setting the mean of each predictor to zero) is necessary to model the correlation or covariance structure of the data.Centering of Y is not necessary, but may enhance the interpretation of the regression weights and discards the need for an intercept (which can however be easily computed if desirable).
Regarding scaling, in component analysis, variables that have a larger variance influence the obtained components more.If such differences in variance may be arbitrary, e.g., caused by response tendencies or differences in response scales, it is advised to normalize the data (i.e., scale each variable to a variance of one).An additional advantage of normalizing the data, is that the obtained regression weights for a specific criterion variable can be compared in size.

Model
In PCovR, the J predictors are reduced to R new variables, called components: where T is an N × R component score matrix that contains the scores of the N observations on the R components, P X is the R × J loading matrix that contains the loadings of the predictor variables on the components, E X are the residual X scores and W is a J × R weight matrix.
The criteria in Y are regressed on the components instead of on the predictors: where the matrix P Y (R × K ) contains the resulting regression weights for each of the K criteria (Kiers and Smilde 2007) and E Y contains the residual Y scores.Note that when R equals J, PCovR boils down to standard multivariate multiple regression.
To partly identify the solution (without loss of fit), the variances of the component scores (i.e., the columns of T) are fixed at 1.This implies that in case the predictors are standardized and the components are orthogonal, the loadings in P X equal the correlations between the respective components and variables.
Each PCovR solution has rotational freedom.Indeed, premultiplying P X and P Y by a random transformation matrix B and postmultiplying T by B −1 , does not alter the reconstructed X scores or the predicted Y scores.In empirical practice, researchers may take advantage of this rotational freedom to enhance the interpretability of the components.Specifically, they may orthogonally or obliquely rotate the loading matrix towards an a priori specified target structure (Browne 1972a,b) or towards simple structure.
A simple structure implies that there is only one non-zero loading per variable and there are more than R and fewer than J zero-loadings per component (Browne 2001).The components can then be labeled by considering what the variables with a clear non-zero loading on a component have in common.To approximate simple structure, several rotation criteria have been proposed.For the PCovR case the following criteria seem useful: Varimax, Weighted Varimax, Quartimin, Simplimax, and Promin.
Varimax (Kaiser 1958) is a very popular orthogonal rotation criterion that maximizes the sum of the variances of the squared loadings: Weighted Varimax (Cureton and Mulaik 1975) is an oblique variant of Varimax, in which the simplest variables (i.e., the ones with only one high loading) have more influence on the rotation than the complex variables (i.e., the ones with multiple high loadings).
Quartimin (Carroll 1953) is an oblique rotation strategy that minimizes the sum across variables and across component pairs of the cross-products of the squared loadings: Simplimax (Kiers 1994) finds the target matrix that can be approximated best by rotation, among all simple-structure target matrices (i.e., matrices with as few as possible non-zero loadings per variable) that have a specified number of zero elements.This number can be chosen a priori or by comparing the rotation function values for different numbers of zeros, retaining the number after which the improvement in function value levels off (similarly to the scree test procedure).
Promin (Lorenzo-Seva 1999) applies oblique target rotation using the Weighted Varimax solution as the target matrix.

Loss function
One of the key features of PCovR is that the reduction of the predictors to components and the prediction of the criteria by those components is conducted simultaneously.To this end, a weighting parameter α is used, ranging between 0 and 1, that determines to what degree the reduction and prediction parts of the model are emphasized.Specifically, in a PCovR analysis, the following loss function value L is minimized: with A being the Frobenius matrix norm of a matrix A. Note that PCovR analyses with α values of 0 and 1 correspond, respectively, to reduced-rank regression and principal component regression (Smilde and Kiers 1999).

Estimation
Given a specific α value and number of components R, a closed form solution exists for estimating the PCovR parameters.Specifically, T is estimated by computing the first R eigenvectors of the matrix in which H X is the projection matrix which projects Y on X. P X and P Y can then be calculated, respectively, as and (De Jong and Kiers 1992).Finally, T, P X , and P Y , are rescaled such that the columns of T have a variance of 1.

Model selection
As the most appropriate number of components R is often unknown in practice, PCovR model selection involves two decisions: selecting the number of components and tuning the α value.
In the literature, most authors solve this procedure by performing cross-validation (Smilde and Kiers 1999;De Jong and Kiers 1992).As this simultaneous selection procedure may become rather time-consuming, in case the data are large and one considers many different R and α values, we also propose a new and fast sequential procedure, in which we first select the α value using maximum likelihood principles and, subsequently, the number of components by means of a generalization of the well-known scree test.Of course, also substantive arguments based on interpretability of the obtained solution, previous research or theory can be taken into account (Ceulemans and Kiers 2006).

Simultaneous procedure
The simultaneous procedure selects the optimal α and R values by performing leave-one-out cross-validation (Hastie, Tibshirani, and Friedman 2001).In leave-one-out cross-validation, one conducts N PCovR analyses, each time discarding a different observation.Next, for each discarded observation, reconstructed y CV n scores are computed given the PCovR estimates for the other observations: where x n contains the predictor scores of the nth observation, and W n and P Y,n indicate the W and P Y matrix of the analysis in which the nth observation was discarded.Finally, the leave-one-out cross-validation fit is calculated as This is done for each combination of an α and an R value, and the α and R values that maximize Q 2 Y are retained.Note that this strategy requires a relatively high computational effort, because the PCovR analysis needs to be performed N times for each (α,R) combination that is considered.In order to save computation time, it is possible to perform k -fold crossvalidation (Hastie et al. 2001).This implies that one discards more than one observation in each of the k cross-validation steps by splitting the data in k roughly equal-sized parts and omitting all observations that belong to a particular part in the corresponding step.

Sequential procedure
In the sequential selection procedure, we first tune α on the basis of maximum likelihood principles (Vervloet, Van Deun, Van den Noortgate, and Ceulemans 2013): Given the assumption that the error on the predictor block (E X ) and the error on the criterion block (E Y ) is drawn from a normal distribution with mean 0 and variances of, respectively, σ 2 E X and σ 2 E Y , the α value that will maximize the likelihood of the data given the model is equal to: This value is approximated by replacing the variances σ 2 E X and σ 2 E Y by estimates.An estimate for σ 2 E X can be obtained by applying principal component analysis to X and determining the optimal number of components through a scree test (see Section 3); the estimate equals the associated percentage of unexplained variance.The estimate for σ 2 E Y boils down to the percentages of unexplained variance when Y is regressed on X.This approach for estimating σ 2 E X and σ 2 E Y was based on the work of Wilderjans, Ceulemans, Van Mechelen, and Van den Berg (2011).
Next, to select the optimal (i.e., good fit without being overly complex) number of components, we compute the solutions with R min − 1 to R max + 1 components and the corresponding weighted sum of the percentage of variance accounted for in X and in Y: Note that for R = 0, this sum equals 0. Subsequentially, among the R min to R max values, the optimal R value is the one that has the highest st value (Cattell 1966;Wilderjans, Ceulemans, and Meers 2013): Indeed, when plotting VAF R,sum as a function of R, the solution with the highest st value will be the one after which VAF R,sum levels off, implying that additional components do not significantly increase the fit of the solution.
The sequential procedure can be amended with cross-validation steps.First, when selecting the optimal R value given α M L , one can replace the scree ratio step by a step in which the cross-validation fit for different R values is computed and the R value that maximizes this fit is retained.Alternatively, once the optimal R value given α M L , has been determined in the scree step, one can add a third step that consists of a cross-validation procedure to empirically assess the α value that optimizes the prediction of future data.Indeed, α M L may be inaccurate if the assumptions about the error variances -each predictor (resp., criterion) has the same error variance, the error of different predictors (resp., criteria) is uncorrelatedare violated.
Preprocessing the data

Model selection
Rotating and interpreting the component matrices Figure 1: Flowchart of the different steps in a PCovR analysis.
In practice, it may happen that these four procedures point towards a different solution, as will also be the case for our illustrative dataset (see below).Indeed, the scree test based procedures favor a solution with a few components that each explain a lot of variance whereas the cross-validation based methods focus on the prediction of future data, which can of course (slightly) improve by including components that explain a small amount of variability only, but in a consistent way.In such cases, following Hastie et al. (2001), we recommend to retain the more parsimonious model among the indicated models (i.e., fewer components) if the corresponding cross-validation fit is only slighly lower.

Using the PCovR package
In this section, we discuss how a PCovR analysis can be conducted in R. First, one loads the PCovR package:

R> library("PCovR")
To load the dataset psychiatrists and start the analysis, one types R> data("psychiatrists", package = "PCovR") R> X <-psychiatrists$X R> Y <-psychiatrists$Y R> results <-pcovr(X, Y) which runs the analysis with the default analysis settings and saves the output in a list variable, that is called results here.This command requires X and Y to be numerical data frames.Of course, as will be explained below, the command can be further refined, to use other analysis settings.The three main steps of a PCovR analysis are: (1) preprocessing the data, (2) determining the number of components R and choosing the α weight, and (3) rotating and interpreting the solution (Figure 1).These steps will be illustrated with a real-data example, called psychiatrists (Van Mechelen and De Boeck 1990), that is included in the PCovR package.
The data contain the scores of 30 Belgian psychiatric patients on 23 psychiatric symptoms and 4 psychiatric disorders (toxicomania, schizophrenia, depression, and anxiety disorder).Each score is the summated score of the binary symptom and disorder scores that were given by 15 different psychiatrists.In our analysis, we will examine the extent to which the degree of toxicomania, schizophrenia, depression and anxiety disorder, can be predicted by the 23 psychiatric symptoms (see Figure 2).Applying ordinary least squares regression (OLS) would not be appropriate, because the different symptoms are very likely to have a high degree of multicollinearity (indeed, 18 of the 23 variance inflation factors are larger than 5).Furthermore, PCovR is useful here to gain insight into the correlation structure of the data.

Preprocessing the data
The PCovR package includes two preprocessing options, which can be applied to X and/or Y. Specifically, it is possible to only center the data (prepX = "cent", prepY = "cent").However, the default option is to standardize the data (prepX = "stand", prepY = "stand"), which implies that X and/or Y are centered and normalized (i.e., each variable has a mean of zero and a standard deviation of one).

Model selection
The package offers the simultaneous and sequential model selection procedures that were described in the Section 2. A sequential approach needs less computational time as is shown in Table 1 for the psychiatrists dataset.Note that the assessment of the optimal R value can be overruled, in case one is only interested in the solutions with a particular R value.In particular, when specifying the input parameter R, Rmin and Rmax will be ignored, and the specified number of components will be used when running the analysis and determining α.

Simultaneous procedure
The simultaneous procedure (modsel = "sim") that was explained earlier performs leaveone-out cross-validation for a range of α and R values.By default, 100 α values between 0.01 and 1 are explored, but alternatively, a vector (or scalar) of choice can be specified with the parameter weight.The same holds for the number of components, which is by default initialized as R min = 1 and R max = J/3.The α and R value combination that maximizes Q 2 Y is retained.Note that the parameter fold can be used to alter the number of roughly equal-sized parts in which the data are split for cross-validation.The default value of fold is "LeaveOneOut", implying that the data is split in N parts.For the psychiatrists dataset, the 6-component solution with α = 0.79 has the highest cross-validation fit.

Sequential procedure
The fastest and therefore default model selection setting (modsel = "seq") implies a sequential procedure in which α is determined on the basis of maximum likelihood principles (11), unless a specific α value is imposed by the user (e.g., weight = 0.50).For instance, for the psychiatrists dataset, α M L = 0.21 if the error variance ratio is determined on the basis of a principal component analysis of X and a regression of Y on X (see Section 2), which is the default option.Among all models with the selected α value and a number of components between R min and R max , the solution is picked that has the highest st value (13), which is the 3-component solution when using the default R min and R max values.
The package also provides two sequential procedures that incorporate a cross-validation step (modsel = "seqRcv" and modsel = "seqAcv").seqRcv also starts with the selection of α based on maximum likelihood principles, but in the next step, R is determined using leaveone-out cross-validation."seqAcv" is identical to the default procedure, but has an extra step: after the selection of R using α M L , leave-one-out cross-validation is applied to choose the α value among the values specified in weight.For the psychiatrists example, these procedures retain a solution with α = 0.21 and R = 4 and with α = 0.49 and R = 3, respectively.
For this particular dataset, it can be seen that the simultaneous selection procedure points towards more components than the three sequential procedures.To better understand this difference, it is instructive to inspect Figure 3 which displays the cross-validation fit for all models: R> plot(pcovr(X, Y, modsel = "sim")) This plot indeed shows that the best cross-validation fit is achieved with a 6-component model and a relatively high α value.However, the cross-validation fit of the solutions that are retained by the sequential procedures are only slightly lower, whereas these solutions are clearly more parsimonious.Therefore, following Hastie et al. (2001) who recommend to also have a look at models with similar cross-validation fits, but lower complexity, we decided to retain the 3-component model with α equal to 0.49.Indeed, among the solutions retained by the sequential approaches, this solution has the highest cross-validation fit.
Some of the rotation criteria require extra input arguments.For both orthogonal and oblique target rotation, a target matrix has to be specified with the argument target.When using Simplimax, the user has to specify a number of zero elements with the argument zeroloads (which equals J by default).
The interpretation of a specific PCovR solution usually starts with the inspection of the loading matrix, which can be requested by the command

R> results$Px
For instance, Table 2, which displays the Varimax rotated loadings of the α = 0.49 and R = 3 solution for the psychiatrists data, reveals that the first component has highly positive loadings for all depressive symptoms (e.g., "depression", "suicide" and "social isolation") and a negative loading for hallucinations.The second component seems to indicate the amount of substance abuse (e.g., "narcotics" and "alcohol"), while the third component reflects inappropriate behavior (e.g., "inappropriate", "social leveling", "desorganised speech", "routine") versus fear.These three components are uncorrelated.Note that Varimax was used here, be- cause it is the only built-in exploratory rotation criterion that yields orthogonal components, which enhances the interpretability of the regression coefficients, and because the resulting component loadings were sufficiently clear.
After labeling the components, the regression weights (results$Py) and component scores (results$Te) can also be interpreted.The regression weights indicate to which extent the criteria can be predicted on the basis of the components and the component scores reflect how each individual scores on these components.From the regression weights for the psychiatrists dataset in Table 2, it can be concluded that the degree of depressive symptomatology versus hallucinations of individuals is a strong predictor of both depression (positive relation) and schizophrenia (negative relation).Substance abuse can predict both toxicomania (positive relation) and schizophrenia (negative relation).The third component, inappropriate behavior versus fear is associated with schizophrenia (positive relation) and anxiety disorder (negative relation).

Conclusion
The main features of the R package PCovR have been explained and illustrated in this paper, using the dataset psychiatrists that is available in the package.PCovR is a package for performing principal covariates regression, a method developed by De Jong and Kiers (1992).The package depends on the packages GPArotation (Bernaards and Jennrich 2005), ThreeWay (Giordani, Kiers, and Del Ferraro 2014), MASS (Venables and Ripley 2002), and Matrix (Bates and Mächler 2014).

Figure 2 :
Figure 2: Visualization of the real-data example and the PCovR decomposition.

Figure 3 :
Figure 3: Cross-validation fit for the results of the simultaneous model selection procedure.

Table 1 :
Overview of the available model selection procedures and their outcomes for the psychiatrists dataset.The analyses were run with R version 3.0.2using an Intel Core 2 Duo P9700 processor.

Table 2 :
Varimax rotated loadings and associated regression weights of the α = 0.49 and R = 3 solution for the psychiatrists dataset.The highest loadings (i.e., with an absolute value higher than 0.35) are shown in bold.