Permutation Tests for Regression, ANOVA, and Comparison of Signals: The permuco Package

Recent methodological researches produced permutation methods to test parameters in presence of nuisance variables in linear models or repeated measures ANOVA. Permutation tests are also particularly useful to overcome the multiple comparisons problem as they are used to test the eﬀect of factors or variables on signals while controlling the family-wise error rate (FWER). This article introduces the permuco package which implements several permutation methods. They can all be used jointly with multiple comparisons procedures like the cluster-mass tests or threshold-free cluster enhancement (TFCE). The permuco package is designed, ﬁrst, for univariate permutation tests with nuisance variables, like regression and ANOVA; and secondly, for comparing signals as required, for example, for the analysis of event-related potential (ERP) of experiments using electroencephalography (EEG). This article describes the permutation methods and the multiple comparisons procedures implemented. A tutorial for each of theses cases is provided.


Introduction
Permutation tests are exact for simple models like one-way ANOVA and t test (Lehmann and Romano 2008, pp. 176-177). Moreover it has been shown that they have some robust properties under non normality (Lehmann and Romano 2008). However they require the assumption of exchangeability under the null hypothesis to be fulfilled which is not the case in a multifactorial setting. For these more complex designs, Janssen and Pauls (2003), Janssen (2005), Pauly, Brunner, and Konietschke (2015) and Konietschke, Bathke, Harrar, and Pauly (2015) show that permutation tests based on non exchangeable data can be exact asymptotically if used with studentized statistics. Another approach to handle multifactorial designs is to transform the data before permuting. Several authors (Draper and Stoneman 1966;Freedman and Lane 1983;Kennedy 1995;Huh and Jhun 2001;Dekker, Krackhardt, and Snijders 2007;Kherad-Pajouh and Renaud 2010;ter Braak 1992) have proposed different types of transformations and Winkler, Ridgway, Webster, Smith, and Nichols (2014) gave a simple and unique notation to compare those different methods.
Repeated measures ANOVA including one or more within subject effects are the most widely used models in the field of psychology. In the simplest case of one single random factor, an exact permutation procedure consists in restricting the permutations within the subjects. In more general cases, free permutations in repeated measures ANOVA designs would violate the exchangeability assumption. This is because the random effects associated with subjects and their interactions with fixed effects imply a complex structure for the (full) covariance matrix of observations. It follows that the second moments are not preserved after permutation. Friedrich, Brunner, and Pauly (2017a) have derived exact asymptotic properties in those designs for a Wald-type statistic and Kherad-Pajouh and Renaud (2015) proposed several methods to transform the data following procedures developed by Kennedy (1995) or Kherad-Pajouh and Renaud (2010).
For linear models, permutation tests are useful when the assumption of normality is violated or when the sample size is too small to apply asymptotic theory. In addition they can be used to control the family wise error rate (FWER) in some multiple comparisons settings (Troendle 1995;Maris and Oostenveld 2007;Smith and Nichols 2009). These methods have been successfully applied for the comparison of experimental conditions in both functional magnetic resonance imaging (fMRI) and electroencephalography (EEG) as they take advantage of the spatial and/or temporal correlation of the data.
The aim of the present article is to provide an overview of the use of permutation methods and multiple comparisons procedures using permutation tests and to explain how it can be used in R (R Core Team 2021) with the package permuco (Frossard and Renaud 2019). The package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=permuco. Note that the presentation and discussion of the available packages that handle permutation tests in related settings is deferred to Section 5.1, where all the notions are introduced. Appendix A shows a comparison of the relevant code and outputs. But first, Section 2 focuses on fixed effect models. It explains the model used for ANOVA and regression and the various permutation methods proposed in the literature. Section 3 introduces the methods for repeated measures ANOVA. Section 4 explains the multiple comparisons procedures used for comparing signals between experimental conditions and how permutation tests are applied in this setting. Section 5 describes additional programming details and some of the choices for the default settings in the permuco package. Section 6 treats two real data analyses, one from a control trial in psychology and the second from an experiment in neurosciences using EEG.

Model and notation
For each hypothesis of interest, the fixed effects model (used for regression or ANOVA) can always be written as where y n×1 is the response variable, D n×(p−q) X n×q is a design matrix split into the nuisance variable(s) D (usually including the intercept) and the variable(s) of interest X associated with the tested hypothesis. D and X may be correlated and we assume without loss of generality that D X is a full rank matrix. The parameters of the full model η ⊤ are also split into the parameters associated with the nuisance variable(s) η and the one(s) associated with the variable(s) of interest β. ǫ is an error term that follows a distribution (0, σ 2 I n ). The hypothesis tested writes The permutation test is exact under the null hypothesis for finite samples if the data are exchangeable under the null hypothesis. This assumption is not fulfilled in model in Equation 1 as we cannot control the influence of the nuisance term Dη when permuting. In fact, under the null hypothesis in Equation 2, the responses follow a distribution (Dη, σ 2 I n ) which are not exchangeable due to the presence of unequal first moments. Pauly et al. (2015) show however that permuting the responses and using a Wald-type statistic is an asymptotically exact procedure in factorial designs. Another approach, which is the focus of this paper, is to transform the data prior to the permutation. Those transformation procedures are what will be called permutation methods. They are described in Section 2.2 and are implemented in permuco.
The permutation of a vector v is defined as P v and the permutation of the rows of a matrix M as P M where P is a permutation matrix (Gentle 2007, pp. 66-67). For any design matrix M , its corresponding "hat" matrix is Greene 2011, pp. 24-25). The full QR decomposition is

Permutation methods for linear models and factorial ANOVAs
The discussed permutation methods are functions that transform the data in order to reduce the effect of the nuisance variables. They can be computed for all permutations P ∈ P where P is the set of all n P distinct permutation matrices of the same size. For any permutation matrix P , a given permutation method will transform the observed data {y, D, X} into the permuted data {y * , D * , X * }. The permuco package provides several permutation methods that are summarized in Table 1 using a notation inspired by Winkler et al. (2014).
The default method of permuco is the freedman_lane method that works as follows: we first fit the "small" model which only uses the nuisance variables D as predictors. Then, we permute its residuals and add them to the fitted values. Theses steps produce the permuted response variable y * which constitutes the "new sample". It is fitted using the unchanged design D and X. In this procedure, only the residuals are permuted and they are supposed to share the same expectation (of zero) under the null hypothesis. For each permutation, the effect of nuisance variables is hence reduced. Using the above notation, the fitted values of the "small" model can be written as H D y and its residuals R D y. Its permuted version is pre-multiplied by a permutation matrix, e.g., P R D y. The permuted response variable is therefore simply written as y * = H D y + P R D y = (H D + P R D )y, as displayed in Table 1. The permuted statistics (e.g., t or F statistics) are then computed using y * and the unchanged design matrices D * = D and X * = X. method/Authors y * D * X * manly (Manly 1991) P y D X draper_stoneman (Draper and Stoneman 1966) y D P X dekker (Dekker et al. 2007) y D P R D X kennedy (Kennedy 1995) (P R D )y R D X huh_jhun (Huh and Jhun 2001) ( (Freedman and Lane 1983) (H D + P R D )y D X terBraak (ter Braak 1992) (H X,D + P R X,D )y D X All the remaining permutation methods are also summarized by the transformation of y, D and X into y * , X * and D * and are explained next. The manly method simply permutes the response (this method is sometimes called raw permutations). Even if this method does not take into account the nuisance variables, it still has good asymptotic properties when using studentized statistics. draper_stoneman permutes the design of interest (note that without nuisance variables permuting the design is equivalent to permuting the response variable). However, this method ignores the correlation between D and X that is typically present in regressions or unbalanced designs. For the dekker method, we first orthogonalize X with respect to D, then we permute the design of interest. This transformation reduces the influence of the correlation between D and X and is more appropriate for unbalanced design. The kennedy method orthogonalizes all of the elements (y, D and X) with respect to the nuisance variables, removing the nuisance variables in the equation, and then permutes the obtained response. Doing so, all the design matrices lie in the span of X, a sub-space of observed design X and D. However this projection modifies the distribution of the residuals that lose exchangeability (R D y ∼ (0, R D σ 2 ) for original IID data). The huh_jhun method is similar to kennedy but it applies a second transformation (V ⊤ D ) to the data to ensure exchangeability (up to the second moment, V ⊤ D R D y ∼ (0, I n−(p−q) σ 2 )). The V D matrix comes from the Equation 3 and has a dimension of n × (n − (p − q)). It implies that the P 's matrices for the huh_jhun method have smaller dimensions. The terBraak method is similar to freedman_lane but uses the residuals of the full model. This permutation method creates a new response variable y * which assumes that the observed value of the estimateβ|y is the true value of β. Computing the statistic using y * , X, D would not produce a permutation distribution under the null hypothesis. To circumvent this issue, the method changes the null hypothesis when computing the statistics at each permutation to H 0 : β =β|y = (X ⊤ R D X) −1 X ⊤ R D y|y. The right part of this new hypothesis corresponds to the observed estimate of the parameters of interest under the full model, and implicitly uses a pivotal assumption. Note that terBraak is the only method where the statistic computed with the identity permutation is different from the observed statistic. The notation R D,X means that the residuals matrix is based on the concatenation of the matrices D and X. See Section 5.2 for advises on the choice of the method.
For each of the methods presented in Table 1, permutation tests can be computed using different statistics. For univariate or multivariate β parameters, the permuco package implemented a F statistic that constitutes a marginal test (or "type III" sum of square) (Searle 2006, pp. 53-54). For a univariate β 1×1 , one-and two-sided tests (based on a t-statistic) are also implemented. We write the F statistic as When q = 1, the t statistic is where the numerator is the estimate of β under the full model. Note that the statistic can be simplified by a factor of (X ⊤ R D X) −1/2 . The two statistics are function of data. They lead to the general notation t = t(y, D, X) when applied to the observed data and to t * = t(y * , D * , X * ) when applied to the permuted data. The permuted statistics constitute the set T which contains the t * for all P ∈ P. We define the permuted p value as p = 1 n P t * ∈T I (|t * | ≥ |t|), for a two-tailed t test, p = 1 for an upper-tailed t test or an F test and finally p = 1 n P t * ∈T I (t * ≤ t), for a lower-tailed t test, where I(·) is the indicator function.

Model and notation
We write the repeated measures ANOVA model in a linear mixed effects form: where y n×1 is the response, the fixed part of the design is split into the nuisance variable(s) , and the variable(s) of interest X n×(p 1 ) . The specificity of the repeated measures ANOVA model allows us to split the random part into E 0 n×(p 0 2 −q 0 2 ) and Z 0 n×q 0 2 which are the random effects associated with D and X respectively (Kherad-Pajouh and Renaud 2015). The fixed ǫ ∼ (0, σ 2 I). The matrices associated with the random effects E 0 and Z 0 can be computed using where D 0 within and X 0 within are overparametrized matrices and are associated with the within effects in the design matrices D and X. Z 0 ∆ is the overparametrized design matrix associated to the subjects and * is the column-wise Khatri-Rao product (Khatri and Rao 1968). Since the matrices E 0 and Z 0 are overparametrized and colinear to the intercept or between-participant effects they cannot directly be used to compute their corresponding sums of squares. We need versions that are constrained into their respective appropriate sub-spaces: The matrices E and Z are respectively of rank p 2 −q 2 and q 2 and are the ones used to compute F statistics. Formally, the hypothesis of interest associated with Equation 6 writes:

Permutation methods for repeated measures ANOVA
Similarly to the fixed effects model, we can test hypotheses using permutation methods (Kherad-Pajouh and Renaud 2015). The ones that are implemented in the permuco package are given in Table 2. The two methods are based on a similar idea. By pre-multiplying the design and response variables by R D or R D,E , we orthogonalize the model to the nuisance variables. This procedure can be viewed as an extension of the kennedy procedure (see Table 1) to repeated measures ANOVA.
The hypothesis in Equation 9 is tested based on the conventional F statistic for repeated measures ANOVA: As for the fixed effects model, the statistic is written as a function of the data t = t(y, D, X, E, Z) and the permuted statistic t * = t(y * , D * , X * , E * , Z * ) is a function of the permuted data under the chosen method. The p value is defined as in the fixed effect case.
Here is a small example of the creation of the matrices for the F statistic in repeated measures ANOVA. In a balanced design with 12 participants, 1 between-participants factor B 2 with 2 levels and 1 within-participants factor W 3 with 3 levels, assuming the test of the main effect of B 2 , the denominator of Equation 10 represents the sum of squares associated to the participants. The matrix Z 0 has one column for each participant coding with 0 and 1 for the participant. It is overparametrized as it has a dimension 36 × 12 and a rank of 12. However, the matrix Z 0 is not orthogonal the fixed part of the design, especially the intercept and the main effect of B 2 . Computing the sum of squares using directly Z 0 would also consider the effect of the intercept and of B 2 in addition of the effect of the participants. If we only want the sum of squares associated to the participants, we must reduce the rank of Z 0 which means, geometrically, orthogonalizing Z 0 to the intercept and to the matrix associated to B 2 . Moreover, we are not interested by the estimations of the parameters γ but only by the projection of y into Z 0 which means that any matrices spanning the appropriate space is a potential candidate for B 2 . Hence, we only have to orthogonalize Z 0 to the fixed part of the design which is done using Equation 8. It creates the matrix Z with a dimension of 36 × 12 but a rank of 10. Note that most of the columns of [D X] are not useful when computing R D,X as the matrix Z 0 is already orthonogonal to the part of the design coding the effects of W 3 and the interaction between B 2 and W 3 .

Signal and multiple comparisons
In EEG experiments, researchers are often interested in testing the effect of conditions on the event-related potential (ERP). It is a common practice to test the signals at each time point of the ERP. In that kind of experiments, thousands of tests are typically carried out (e.g., one measure every 2 ms over 2 seconds) and the basic multiple hypotheses corrections like Bonferroni (Dunn 1958) are useless as their power is too low. Troendle (1995) proposed a multiple comparisons method that considers the correlation between the resampling data. This method does not specifically use the time-neighborhood information of a signal but uses wisely the general correlation between the statistics and may be used in more general settings.
Better known, the cluster-mass test (Maris and Oostenveld 2007) has shown to be powerful while controlling the family-wise error rate (FWER) in EEG data analysis. And recently using a similar idea, the threshold-free cluster-enhancement (TFCE) was developed for fMRI data (Smith and Nichols 2009) and EEG data (Pernet, Latinus, Nichols, and Rousselet 2015), but usually presented only with one factor.
All these approaches use permutations and are compatible with the methods displayed in Tables 1 and 2, as shown next. In addition to multiple comparisons procedures that use permutation, the well-known Bonferroni and Holm (Holm 1979) corrections and the control of the false positive rate by Benjamini and Hochberg (1995) are also implemented in permuco.

Model and notation
We can construct a model at each time point s ∈ {1, . . . , k} for the fixed effects design as where y s is the response variable for all observations at time s and each of the k models are the same as Equation 1. D and X, the design matrices, are then identical over the k time points. The aim is to test simultaneously all k hypotheses H s 0 : β s = 0 vs. H s 1 : β s = 0 for s ∈ {1, . . . , k} while controlling for the FWER through the k tests. Likewise, the random effects model is written: where each of the k models are defined as in Equation 6 and, similarly, we are interested to test the k hypotheses H s 0 : β s = 0 vs. H s 1 : β s = 0 for s ∈ {1, . . . , k}. For both models, we choose one of the permutation methods presented in Tables 1 or 2 and compute the k observed statistics t s , the k sets of permutated statistics T s , which lead to k raw or uncorrected p values.
To correct them, the k sets of permutated statistics T s can be analyzed as one set of multivariate statistic. It is done simply by combining the k univariate permutation-based distributions into a single k-variate distribution which maintains the correlation between tests. For each permutation, we simply combine all k univariate permuted statistics t * 1 , . . . , t * k into one multivariate permuted statistic t * = [t * 1 . . . t * k ] ⊤ . The three multiple comparisons procedures described below are all based on this multivariate distribution and take advantage of the correlation structure between the tests.
for each P ∈ P do 5: Return the maximum over the Control for a stepwise procedure by:

Troendle's step-wise resampling method
The method developed by Troendle (1995) takes advantage of the form of the multivariate resampling distribution of the t * s . If we assume that t s is distributed according to T s then by ordering the observed statistics t s we obtain t (1) ≤ · · · ≤ t (s) ≤ · · · ≤ t (k) with their corresponding k null hypotheses H (1) ≤ · · · ≤ H (s) ≤ · · · ≤ H (k) . Then Troendle (1995) Secondly, if we reject H (k) and want to test H (k−1) , we can safely assume that H (k) is false while controlling the FWER. Either H (k) is true and we already made a type I error or was wrong and we can go as if H (k) was absent. We can then update our decision rule for testing We continue until the first non-significant result and declare all s with a smaller t statistic as non-significant.
This procedure is valid in a general setting and is easly implemented for permutation tests. The permuted sets T s is interpreted as a nonparametric distribution of the T s and based on Troendle (1995), we use Algorithm 1 to compute the corrected p value.

Cluster-mass statistic
The method proposed by Maris and Oostenveld (2007) for EEG rely on a continuity argument that implies that an effect will appear into clusters of adjacent timeframes. Based on all timespecific statistics, we form these clusters using a threshold τ as follows (see Figure 1). All the adjacent time points for which the statistics are above this threshold define one cluster C i for i ∈ [1, . . . , n c ], where n c is the number of clusters found in the k statistics. We assign to each time point in the same cluster C i , the same cluster-mass statistic m i = f (C i ) where f is a function that aggregates the statistics of the whole cluster into a scalar; typically the sum of the F statistics or the sum of squared of the t statistics. The cluster-mass null distribution M is computed by repeating the process described above for each permutation. The contribution of a permutation to the cluster-mass null distribution is the maximum over all cluster-masses for this permutation. This process is described in Algorithm 2.
To test the significance of an observed cluster C i , we compare its cluster-mass m i = f (C i ) with the cluster-mass null distribution M . The p value of the effect at each time within a cluster C i is the p value associated with this cluster, i.e., p i = 1 In addition to the theoretical properties of this procedure (Maris and Oostenveld 2007), this method makes sense for EEG data analysis because if a difference of cerebral activity is Algorithm 2 Cluster-mass null distribution M .

3:
Find the n * c clusters C * i as the sets of adjacent time points which statistic is above τ .

4:
Compute the cluster-mass for each cluster m * i = f (C * i )

5:
Return the maximum value over the n * c values m * i . Here 4 clusters are found using a threshold τ = 4. Using the sum to aggregate the statistics, for each cluster i, the shaded area underneath the curve represents its cluster-mass m i .
believed to happen at a time s for a given factor, it is very likely that the time s + 1 (or s − 1) will show this difference too.

Threshold-free cluster-enhancement
Although it controls (weakly) the FWER for any a priori choice of threshold, the result of the cluster-mass procedure is sensitive to this choice. The TFCE (Smith and Nichols 2009) is closely related to the cluster-mass but gets rid of this seemingly arbitrary choice. It is defined at each time s ∈ [1, . . . , k] for the statistics t s as where e(h) is the extend at the height h and it is interpreted as the length of a cluster for a threshold of h. E and H are free parameters named the extend power, and the height power respectively. t 0 is set close to zero. Figure 2 illustrates how the TFCE statistic is computed for a given time point s.
We construct the TFCE null distribution U by applying the formula in Equation 13  Algorithm 3 Threshold-free cluster-enhancement null distribution U . 1: for each P ∈ P do 2: Compute the k permuted statistics t * s for s ∈ {1, . . . , k} 3: Compute the k enhanced statistics u * s using a numerical approximation of the integral in Equation 13 4: Return the maximum over the k value u * s (see Algorithm 3). In practice, the integral in Equation 13 is approximated numerically using small dh ≤ 0.1 (Smith and Nichols 2009;Pernet et al. 2015).
At time s, the statistic t s will be modified using the formula in Equation 13. The formula can be viewed as a function of characteristics in the grey area (its area in the special case where both E and H are set to 1).
To test the significance of a time point s we compare its enhanced statistics u s with the threshold-free cluster-enhancement null distribution U . For an F test we define the p value as p s = 1 n P u * ∈U I(u * ≥ u s ).

Interpreting cluster based inference
The cluster-mass test and the TFCE are methods based on clustering the data and the interpretation of significant findings is then not intuitive. First, note that the Troendle's method is not based on clustering and does not have these issues. Its interpretation is straight-forwards as we can interpret individually each discovery. For the cluster-mass test the interpretation should be done at a cluster level: a significant cluster is a cluster which contains at least one significant time-point. It follows that the cluster-mass test does not allows the interpretation of the precise time location of clusters (Sassenhagen and Draschkow 2019). Intuitively, the cluster-mass test is a two steps procedure: first, it aggregates time-points into clusters, and then summarizes them using the cluster-mass. The inference is only performed at the second step which looses any information on the shape and size of the clusters. It implies that the interpretation of individual time-point is proscribed. Finally, the transformation of the TFCE statistic is an integration over all thresholds of cluster statistics (Smith and Nichols 2009). Therefore, the TFCE does not allow an interpretation of each time-point individually either as it also summarizes statistics using the concept of clusters. Thus, the interpretation of individual time-point must also involves it. Therefore, a significant time-point must be interpreted as a time-point being part of at least one significant cluster (among all clusters formed using all thresholds), where a significant cluster contains at least one significant time-point.

Comparison of packages
Several packages for permutation tests are available for R in CRAN. Since permutation tests have such a variety of applications, we only review packages (or the part of packages) that handle regression, ANOVA or comparison of signals.
For testing one factor, the perm (Fay and Shaw 2010), wPerm (Weiss 2015) and coin (Hothorn, Hornik, Van De Wiel, Zeileis et al. 2008) packages produce permutation tests of differences of locations between two or several groups. The latter can also test the difference within groups or block, corresponding to a one within factor ANOVA.
The package lmPerm (Wheeler and Torchiano 2016) produces tests for multifactorial ANOVA and repeated measures ANOVA. It computes sequential (or Type I) and marginal (or Type III) tests for factorial ANOVA and ANCOVA but only the sequential is implemented for repeated measures, even when setting the parameter seqs = FALSE. The order of the factors will therefore matter in this case. The permutation method consists in permuting the raw data even in the presence of nuisance variables, which correspond to the manly method, see Table 1. For repeated measures designs, data are first projected into the "Error()" strata and then permuted, a method that has not been validated (to our knowledge) in any peer-reviewed journal. Additionally, lmPerm by default uses a stopping rule based on current p value to define the number of permutations. By default, the permutations are not randomly sampled but modified sequentially merely on a single pair of observations. This speeds up the code but the quality of the obtained p value is not well documented.
The flip package (Finos 2018) produces permutation and rotation tests (Langsrud 2005) for fixed effects and handles nuisance variables based on methods similar to the huh_juhn method of Table 1. It performs tests in designs with random effects only for singular models (e.g. repetition of measures by subjects in each condition) with method based on Basso and Finos (2012) and Finos and Basso (2014) to handle nuisance variables.
The GFD package (Friedrich, Konietschke, and Pauly 2017b) produces marginal permutation tests for pure factorial design (without covariates) with a Wald-type statistic. The permutation method is manly. This method has been shown to be asymptotically exact even under heteroscedastic conditions . Moreover, Friedrich, Konietschke, and Pauly (2021) generalize these tests to multivariate data like MANOVA models.
To our knowledge, only the permuco package provides tests for comparison of signals.
The codes and outputs for packages that perform ANOVA/ANCOVA are given in Appendix A.1 and in Appendix A.2 for repeated measures. For fixed effects, this illustrates that permuco, flip and lmPerm handle covariates and are based on the same statistic (F ) whereas GFD uses the Wald-type statistic. It also shows that flip is testing one factor at a time (main effect of sex in this case) whereas the other packages produce directly tests for all the effects. Also, the nuisance variables in flip must be carefully implemented using the appropriate coding variables in case of factors. Note that lmPerm centers the covariates using the default setting and that it provides both marginal (Type III) or sequential (Type I) tests.
Concerning permutation methods, only the manly method is used for both lmPerm and GFD, the flip package uses the huh_jhun method, whereas multiple methods can be set by users using the permuco package. Note also that different default choices for the V matrix as implemented in flip (based on eigendecomposition) and permuco (based on QR decomposition) packages lead to slightly different results (see Table 1 for more information on the permutation methods).
Finally, concerning repeated measures designs, flip cannot handle cases where measures are not repeated in each condition for each subject, and therefore cannot be compared in Appendix A.2. As already said, lmPerm produces sequential tests in repeated measures designs and permuco produces marginal tests. This explains why, with unbalanced data, only the last interaction term in each strata produces the same statistic.

Permutation methods
For the fixed effects model, simulations (Kherad-Pajouh and Renaud 2010; Winkler et al. 2014) show that the method freedman_lane, dekker, huh_jhun and terBraak perform well, whereas manly, draper_stoneman and kennedy can be either liberal or conservative. Moreover Kherad-Pajouh and Renaud (2010) provide a proof for an exact test of the huh_jhun method under sphericity. Note that huh_jhun will reduce the dimensionality of the data and if n − (p − q) ≤ 7 the number of permutations may be too low. Based on all the above literature the default method for the permuco package is set to freedman_lane.
For the random effects model, Kherad-Pajouh and Renaud (2015) show that a more secure approach is to choose the Rde_keradPajouh_renaud method.
All n! permutations are not feasible already for moderate sized datasets. A large subset of permutation is used instead, and it can be tuned with the np argument. The default value is np = 5000. Winkler, Ridgway, Douaud, Nichols, and Smith (2016) recall that with np = 5000 the 0.95% confidence interval around p = 0.05 is relatively small: [0.0443; 0.0564]. For replicability purpose, the P argument can be used instead of the np argument. The P argument needs a Pmat object which stores all permutations. For small datasets, if the np argument is greater than the number of possible permutations (n!), the tests will be done on all permutations. This can be also be selected manually by setting type = "unique" in the Pmat functions. Given the inequality sign in the formulas for the p value described at the end of Section 2.2, the minimal p value is 1/np, which is a good practice for permutation tests. Moreover this implies that the sum of the two one-sided p values is slightly greater than 1.
The huh_jhun method is based on a random rotation that can be set by a random n × n matrix in the rnd_rotation argument. This random matrix will be orthogonalized by a QR decomposition to produce the proper rotation. Note that the random rotation in the huh_jhun method allows us to test the intercept, which is not available for the other methods.

Multiple comparisons
The multcomp argument can be set to "bonferroni" for the Bonferroni correction (Dunn 1958), to "holm" for the Holm correction (Holm 1979), "benjamini_hochberg" for the Benjamini-Hochberg method (Benjamini and Hochberg 1995), to "troendle", see Section 4.2, to "clustermass", see Section 4.3 and to "tfce", see Section 4.4. Note that in the permuco package, these 6 methods are available in conjunction with permutation, although the first 3 methods are general procedures that could also be used in a parametric setting.
For the "clustermass" method, the threshold parameter of the cluster-mass statistic is usually chosen by default at the 0.95 quantile of the corresponding univariate parametric distribution; but the FWER is preserved for any a priori value of the threshold that the user may set. The mass function is specified by the aggr_FUN argument. It is set by default to the sum of squares for a t statistic and the sum for an F . It should be a function that returns a positive scalar which will be large for an uncommon event under the null hypothesis (e.g., use the sum of absolute value of t statistics instead of the sum). It can be tuned depending on the expected signal. For the t statistic, typically, the sum of squares will detect more efficiently high peaks and the sum of absolute values will detect more efficiently wider clusters.
For the "tfce" method, the default value for the extend parameter is E = 0.5 and for the height H = 2 for t tests and, for F test, it is E = 0.5 and H = 1 following the recommendations of Nichols (2009) andPernet et al. (2015). The ndh parameter controls the number of steps used in the approximation of the integral in Equation 13 and is set to 500 by default.
The argument return_distribution is set by default to FALSE but can be set to TRUE to return the large matrices (n P × k) with the value of the permuted statistics.
The algorithm and formula presented in the previous sections may not be efficient for very large size of data. When available, they are implemented in a more efficient way in permuco. For example, to reduce the computing time, the permuted statistics are computed through a QR decomposition using the qr, qr.fitted, qr.resid or qr.coef functions.

Fixed effects model
The emergencycost dataset contains information from 176 patients from an emergency service (Heritier, Cantoni, Copt, and Victoria-Feser 2009). The variables are the sex, the age (in years), the type of insurance (private/semiprivate or public), the length of the stay (LOS) and the cost. These observational data allow us to test which variables influence the cost of the stay of the patients. In this example, we will investigate the effect of the sex and of the type of insurance on the cost and we will adjust those effects by the length of the stay. To this end, we perform an ANCOVA and need to center the covariate.

R> emergencycost$LOSc <-scale(emergencycost$LOS, scale = FALSE)
The permutation tests are obtained with the aovperm function. The np argument sets the number of permutations. We choose to set a high number of permutations (np = 100000) to reduce the variablity of the permutation p values so that they can safely be compared to the parametric ones. The aovperm function automatically converts the coding of factors with the contr.sum which allows us to test the main effects of factors and their interactions.

R> mod_cost_0 <-aovperm(cost~LOSc * sex * insurance, data = emergencycost, + np = 100000) R> mod_cost_0
Anova The interaction LOSc:insurance is significant both using the parametric p value 0.0116 and the permutation one 0.0233 using a 5% level. However, the difference between these 2 p values is 0.0117 which is high enough to lead to different conclusions e.g., in case of correction for multiple tests or a smaller α level.
If we are interested in the difference between the groups for a high value of the covariate, we center the covariate to the third quantile (14 days) and re-run the analysis.
If the researcher has an a priori oriented alternative hypothesis H A : β sex=M > β sex=F , the lmperm function produces one-sided t tests. To run the same models as previously, we first need to set the coding of the factors with the contr.sum function before running the permutation tests. The effect sex1 is significant for both the parametric one-sided p value, p = 0.007, and the permutation one-sided p value, p = 0.0211. It indicates that when the length of the stay is high, men have a shorter cost than women.

R> contrasts(emergencycost$insurance) <-contr.sum R> contrasts(emergencycost$insurance)
To test the effect of the sex within the public insured persons (called simple effect), we change the coding of the factors inside the data.frame using the contr.treatment function and disable the automatic recoding using the argument coding_sum = FALSE. The sex row can be interpreted as the effect of sex for the public insured persons for an average length of stay. Both the parametric p = 0.0003 and permutation p value, p = 0.0003, show significant effect of sex within the public insured persons.

R> contrasts(emergencycost$insurance) <-contr.treatment
Given the skewness of the data for each case where the permutation test differs from the parametric result, we tend to put more faith on the permutation result since it does not rely on assumption of normality.

Repeated measures ANCOVA
The jpah2016 dataset contains a subset of a control trial in impulsive approach tendencies toward physical activity or sedentary behaviors. It contains several predictors like the body mass index, the age, the sex, and the experimental conditions. For the latter, the subjects were asked to perform different tasks: to approach physical activity and avoid sedentary behavior (ApSB_AvPA), to approach sedentary behavior and avoid physical activity (ApPA_AvSB) and a control task. The dependent variables are measures of impulsive approach toward physical activity (iapa) or sedentary behavior (iasb). See Cheval, Sarrazin, Pelletier, and Friese (2016) for details on the experiment. We will analyze here only a part of the data.

R> jpah2016$bmic <-scale(jpah2016$bmi, scale = FALSE)
We perform the permutation tests by running the aovperm function. The within subject factors should be written using + Error(...) similarly to the aov function from the stats package: R> mod_jpah2016 <-aovperm(iapa~bmic * condition * time + Error(id/(time)), + data = jpah2016, method = "Rd_kheradPajouh_renaud") Warning message: In checkBalancedData(fixed_formula = formula_f, data = cbind(y, : The data are not balanced, the results may not be exact. A warning message is issued if the design is not fully balanced, as some exactness properties of the tests are no longer warranted. However, the method from Kherad-Pajouh and Renaud (2015) can still be applied as the within-subject factor (time) is balanced. The results are shown in an ANOVA table by printing the object:

R> mod_jpah2016
Resampling test using Rd_kheradPajouh_renaud to handle nuisance variables and 5000 permutations.  This analysis reveals a significant p value for the effect of the interaction bmic:condition with a statistic F = 5.4269 , which lead to a permutation p value p = 0.0224 not far from the parametric one. For this example, the permutation tests backs the parametric analysis. The permutation distributions can be viewed using the plot function like in Figure 3.

EEG experiment in attention shifting
attentionshifting_signal and attentionshifting_design are data provided in the permuco package. They come from an EEG recording of 15 participants watching images of either neutral or angry faces (Tipura, Renaud, and Pegna 2019). Those faces were shown at a different visibility: subliminal (16ms) and supraliminal (166ms) and were displayed to the left or to the right of a screen. The recording is at 1024 Hz for 800 ms. Time 0 is when the image appears (event-related potential or ERP  dataset contains the ERP of the electrode O1. The design of experiment is given in the attentionshifting_design dataset along with the laterality, sex, age, and 2 measures of anxiety of each subjects, see Table 3.
As almost any ERP experiment, the data is designed for a repeated measures ANOVA. Using the permuco package, we test each time points of the ERP for the main effects and the interactions of the variables visibility, emotion and direction while controlling for the FWER. We perform F tests using a threshold at the 95% quantile, the sum as a clustermass statistics and 5000 permutations. We handle nuisance variables with the method Rd_kheradPajouh_renaud: R> electrod_O1 <-+ clusterlm(attentionshifting_signal~visibility * emotion * direction + + Error(id/(visibility * emotion * direction)), + data = attentionshifting_design) The plot method produced a graphical representation of the tests that allows us to see quickly the significant time frames corrected by clustermass. The results are shown in Figure 4.

R> plot(electrod_O1)
Only one significant result appears for the main effect of visibility. This cluster is corrected using the clustermass method. The summary of the clusterlm object gives more information about all clusters for the main effect of visibility, whether they are driving the significant effect or not:   Figure 4: The plot method on a clusterlm object displays the observed statistics of the three main effects and their interactions. The dotted horizontal line represents the threshold which is set by default to the 95% percentile of the statistic. For this dataset, one cluster is significant for the main effect of visibility using the clustermass method, as shown by the red part. The summary method gives more details. There is a significant difference between the two levels of visibility. This difference is driven by one cluster that appears between the measures 332 and 462 which correspond to the 123.7 ms and 250.9 ms after the event. Its cluster-mass statistic is 3559.1 with an associated p-value of 0.0012. The threshold is set to 4.60011 which is the 95% percentile of the F statistic. If we want to use other multiple comparisons procedures, we use multcomp argument: R> full_electrod_O1 <-+ clusterlm(attentionshifting_signal~visibility * emotion * direction + + Error(id/(visibility * emotion * direction)), + data = attentionshifting_design, P = electrod_O1$P, + method = "Rde_kheradPajouh_renaud", multcomp = c("troendle", + "tfce", "clustermass", "bonferroni", "holm", "benjaminin_hochberg")) Note that we retrieve the very same permutations as previous model by using the P argument. The computation time for those tests is reasonably low: it takes less than 12 minutes on a desktop computer (i7 3770CPU 3.4GHz, 8Go RAM) to compute the 7 permutation tests with all the multiple comparisons procedures available. To see quickly the results of the thresholdfree cluster-enhancement procedure, we set the multcomp argument of plot to "tfce" as shown in Figure 5.

R> plot(full_electrod_O1, multcomp = "tfce", enhanced_stat = TRUE)
The TFCE procedure gets approximately a similar effect. However the time-points around 400 (190 ms) are not part of significant effect. If the curves in the TFCE plot happen to to show some small steps (which is not the case in Figure 5) it may be because of a small number of terms in the approximation of the integral of the tfce statistics of Equation 13. In that case it would be reasonable to increase the value of the parameter ndh.
Finally, to be able to interpret individually each time-point, we can use the troendle multiple comparisons procedure whose results are visualized by plotting the full_electrod_O1 object. A similar period is detected for the main effect of visibility.

R> plot(full_electrod_O1, multcomp = "troendle")
To interpret individually each time-point in Figure 6, we extract the significant time-points (with an α level of 5%) using the summary method, setting the multcomp parameter to "troendle". We find that the main effect of visibility begin at 129.6 ms after the event.

Conclusion
This article presents recent methodological advances in permutations tests and their implementation in the permuco package. Hypotheses in linear models framework or repeated measures ANOVA are tested using several methods to handle nuisance variables. Moreover permutations tests can solve the multiple comparisons problem and control the FWER trough cluster-mass tests or TFCE, and the clusterlm function implements those procedures for the analysis of signals, like EEG data. Section 6 illustrates some real data example of tests that can be performed for regression, repeated measures ANCOVA and ERP signals comparison.
We hope that further developments of permuco expand cluster-mass tests to multidimensional adjacency (space and time) to handle full scalp ERP tests that control the FWER over all electrodes. An early version of the functions are already available in the the following repository: https://github.com/jaromilfrossard/permuco4brain. Another evolution will concern permutation procedures for mixed effects models to allows researchers to perform tests in models containing participants and stimuli specific random effects. Indeed, we plan to include in permuco the re-sampling test presented by Bürki, Frossard, and Renaud (2018) as they show that, first, using F statistic (by averaging over the stimuli) in combination with cluster-mass procedure increases the FWER and, secondely, that a re-sampling method based on the quasi-F statistic (Clark 1973;Raaijmakers, Schrijnemakers, and Gremmen 1999) keeps it much closer to the nominal level of 5%.