nparcomp : An R Software Package for Nonparametric Multiple Comparisons and Simultaneous Conﬁdence Intervals

One-way layouts, i.e., a single factor with several levels and multiple observations at each level, frequently arise in various ﬁelds. Usually not only a global hypothesis is of interest but also multiple comparisons between the diﬀerent treatment levels. In most practical situations, the distribution of observed data is unknown and there may exist a number of atypical measurements and outliers. Hence, use of parametric and semipara-metric procedures that impose restrictive distributional assumptions on observed samples becomes questionable. This, in turn, emphasizes the demand on statistical procedures that enable us to accurately and reliably analyze one-way layouts with minimal condi-tions on available data. Nonparametric methods oﬀer such a possibility and thus become of particular practical importance. In this article, we introduce a new R package nparcomp which provides an easy and user-friendly access to rank-based methods for the analysis of unbalanced one-way layouts. It provides procedures performing multiple comparisons and computing simultaneous conﬁdence intervals for the estimated eﬀects which can be easily visualized. The special case of two samples, the nonparametric Behrens-Fisher problem, is included. We illustrate the implemented procedures by examples from biology and medicine.


Introduction
In many experiments more than two treatment groups are involved. Hereby, the global null hypothesis, i.e., no impact of the treatment on the response, is often not the main question.
Multiple comparisons, e.g., multiple Dunnett-type many-to-one (Dunnett 1955) comparisons, or Tukey-type all-pairs comparisons (Tukey 1953), with an accompanying computation of simultaneous confidence intervals (SCI), are particularly of practical importance. By controlling the familywise error rate in the strong sense, the SCI and the multiple comparison decision, e.g., by multiplicity adjusted p-values, must be compatible with each other. This means, it cannot occur that an individual null hypothesis has been rejected by the multiple comparison procedure, but the corresponding SCI contains the value null coming from the null hypothesis (Konietschke, Hothorn, and Brunner 2012a). It is well known that the classical Bonferroni adjustment can be used to perform multiple comparisons as well as for the computation of compatible SCI. This approach, however, has a low power, particulary when the test statistics are not independent. Bretz, Genz, and Hothorn (2001) propose exact multiple contrast tests (MCTP) and SCI for means of independent and homoscedastic normal samples. The procedures allow for testing arbitrary contrasts, e.g., Dunnett-type, Tukey-type or changepoint comparisons (Hirotsu 1997) and take the correlation between the test statistics into account. Thus, MCTP provide an extensive tool for the computation of compatible SCI. Hereby, the SCI are computed as in the univariate case as "Mean ± t 1−α,ν (R)-quantile · Standard Error", just by replacing the common univariate t 1−α,ν -quantile by an equicoordinate quantile t 1−α,ν (R) coming from a multivariate t distribution with correlation matrix R and ν degrees of freedom. Multiplicity adjusted p-values for the individual hypotheses are computed by using the cumulative multivariate t distribution function (Genz and Bretz 2009) instead of the univariate t distribution function. Therefore, the results of these procedures can be easily interpreted and are particularly of practical importance. Konietschke, Bösiger, Brunner, and Hothorn (2013) compare exact multiple contrast tests with the ANOVA and conclude that both procedures have comparable power. For a comprehensive overview of parametric multiple contrast tests and SCI we refer the reader to Bretz, Hothorn, and Westfall (2010) and references therein. In particular, parametric methods are numerically available using the R-package multcomp (Hothorn, Bretz, and Westfall 2008). We note that the parametric procedures use the critical values from the extreme tail portion of the multivariate t distribution, which is the portion most sensitive to nonnormality. Therefore the problem of robustness will be more serious for SCI compared to individual intervals.
Nonparametric inferences, i.e., without assuming a specific distribution of the data, however, arise in a variety of problems in biomedical research, e.g., in case of skewed data or ordered categorical data. While parametric inferences usually deal with differences between population means, there is an increasing focus in medicine on effect size measures on an individual basis (Browne 2010). For two independent samples, say group 1 and group 2, the relative effect size measure represents the probability that a randomly chosen subject in treatment group 1 reveals a smaller response value X than a randomly chosen subject from treatment group 2 with response value Y . If p < 1/2, then the values in group 1 tend to be larger than those in group 2. If p = 1/2, none of the observations tend to be smaller or larger.
It is the aim of the present paper to introduce an R extension package called nparcomp (Konietschke 2015), which can be used to compute nonparametric MCTP and SCI for relative treatment effects given in (1). We hereby propose algorithms for single step MCTP proposed by Steel (1960), Konietschke et al. (2012a) as well as stepwise procedures derived by Gao, Alvo, Chen, and Li (2008). The package is online available and can be downloaded from the Comprehensive R Archive Network (CRAN), see Konietschke (2015).
The paper is organized as follows. In Section 2 the statistical model, purely nonparametric effects and hypotheses are introduced. In Section 3 the use of multiple contrast test procedures and simultaneous confidence intervals are explained, while Section 4 demonstrates the application of stepwise procedures. The different routines of the software package nparcomp are explained in Section 5. The paper closes with a discussion and an outlook for future projects.

Statistical model, effects, and hypotheses
We consider a completely randomized one-way layout with a treatment groups and n i independent replications within the ith treatment group. Without specifying an explicit distribution (e.g., normal distribution) the statistical model can be described by where F i (x) = P (X ik < x) + 1/2P (X ik = x) denotes the average of the left and right continuous version of the distribution function. The statistical model does not include any parameters, such as means, which can be used to describe treatment effects. Therefore, the marginal distribution functions are used to describe treatment effects by where H = 1 a a j=1 F j denotes a mean distribution in its unweighted (Brunner and Puri 2001;Gao et al. 2008) form. Here Z represents a random variable with distribution H being independently distributed from X i1 . These effects are called unweighted relative effects (Gao et al. 2008;Konietschke et al. 2012a). They can be interpreted as the probability that an observation Z -randomly chosen from all observations -has a smaller value than a randomly chosen observation from sample i. In the case of p i > 1/2 data from sample i tend to larger values than Z. If p i = 1/2, neither X i1 nor Z tends to larger or smaller values. In particular, if p i < p j , then the values in group i tend to be smaller than those in group j; if p i = p j , none of the observations tend to be smaller or larger. Figure 1 illustrates these relations for two normal distributions.
In the special case of independent ordinal data p i is also called ordinal effect size measure (Ryu and Agresti 2008;Ryu 2009). Gao et al. (2008) propose rank based multiple stepwise procedures for testing the Dunnetttype (Dunnett 1955) multiple comparisons as well as the Tukey-type (Tukey 1953) multiple comparisons

Hypotheses
formulated in terms of the distribution functions F 1 , . . . , F a of the data. All test procedures for H F 0 , however, are limited to testing problems and cannot be used to construct confidence intervals for the underlying treatment effects. Therefore, Konietschke et al. (2012a) propose multiple contrast test procedures and SCI for the effects p. The procedures allow for an arbitrary user-defined contrast matrix where each row vector c of C is one contrast, i.e., each row sum of the contrast matrix is zero by definition. For example, multiple comparisons to a control are expressed by all-pairwise comparisons are formulated by and Williams-type (Williams 1972;Bretz 2006;Konietschke and Hothorn 2012) comparisons are expressed by using the contrast matrix For a comprehensive overview of different kinds of contrasts we refer the reader to Bretz et al. (2001). We note that the hypothesis in the classical Behrens-Fisher model is contained in this general setup as a special case. This is easily seen from the fact that p i = 1/2 if H and F i are both symmetric distributions with the same center of symmetry. The nonparametric hypothesis H F 0 : CF = 0 is very general and implies H p 0 : Cp = 0: H F 0 : CF = 0 ⇒ H p 0 : Cp = C HdF = HdCF = 0. The shape of the distribution functions can differ even under the null hypothesis. In the special case of quite restrictive location models F i (x) = F (x − µ i ), i = 1, . . . , a, the nonparametric and parametric hypotheses in terms of the location parameters µ i are equivalent. For a detailed discussion of the hypotheses formulated above we refer to Akritas, Arnold, and Brunner (1997) and Brunner and Munzel (2000). Furthermore, a nonparametric procedure for testing independence in distribution for categorical variables between two or more populations is suggested in Finos and Salmaso (2004).

Multiple contrast test procedures for
). The test statistics T p are collected in the vector It can be shown that T follows a multivariate normal distribution with expectation 0 and correlation matrix R = (r m ) ,m=1,...,q , asymptotically. Since R is unknown, it must be replaced by a consistent estimator R (Konietschke et al. 2012a). The individual null hypothesis H ( ) where t 1−α,ν ( R) denotes the two-sided equicoordinate quantile from the multivariate t -distribution with approximated degree of freedom ν and correlation matrix R (Konietschke et al. One-sided confidence intervals can be computed by replacing the two-sided quantile t 1−α,ν with its one-sided version (Konietschke and Hothorn 2012;Konietschke et al. 2012a). Note that the SCI defined in (11) may be not range preserving, i.e., the lower bounds can be smaller than −1 and the upper bounds can be larger than 1. Konietschke et al. (2012a) therefore propose the Fisher-approximation for the construction of range-preserving SCI. We note that the rank-based MCTP's control the familywise error rate in the strong sense, which follows from the fact that the set of hypotheses C and the corresponding test statistics T constitute a joint-testing family. The proofs are given in Konietschke et al. (2012a).

The two sample problem
In particular, inference methods for testing the null hypothesis H 0 : p 1 = p 2 with twoindependent samples X ik ∼ F i , i = 1, 2; k = 1, . . . , n i , occur frequently in practical applications. Testing the null hypothesis H F 0 : F 1 = F 2 can be realized with the Wilcoxon-Mann Whitney test. We note that the relative treatment effect p = p 1 − p 2 + 1/2 = P (X 11 < X 21 ) + 1/2P (X 11 = X 21 ).
can be easily rewritten as given in (1). If p > 1/2, the observations in sample 2 tend to be larger than the observations in sample 1. No data tend to be larger or smaller if p = 1/2. Therefore the hypothesis of no treatment effect is formulated by H p 0 : p = 1/2, which is known as the nonparametric Behrens-Fisher problem (Brunner and Munzel 2000). We note that the testing problem H p 0 : p = 1/2 is not equivalent to location-scale testing problem H 0 : F 1 (t) = F 2 (t) versus H 1 : F 2 (t) = F 1 t−µ σ with µ = 0 or σ = 1. Location-scale testing implies that different variances and /or locations may result in significant treatment effects. In the Behrens-Fisher situation one is interested in testing location effects only. Here, even under the null hypothesis, the marginal distribution functions in the different groups may have different shapes, and are not assumed to be equal. Inference methods for location-scale problems are considered by Marozzi (2009Marozzi ( , 2013. The package nparcomp provides the function npar.t.test which can be used to test the null hypothesis H p 0 : p = 1/2 by using the approximate Brunner-Munzel test (Brunner and Munzel 2000) or by using the approximate studentized permutation test proposed by Neubert and Brunner (2007). Further (1 − α)-range preserving confidence intervals using the logit or probit transformation are available (Konietschke 2009). As only asymptotic and approximate procedures are provided by the package, we recommend the use of these procedures with medium sample sizes n i ≥ 8.

4.
Stepwise procedures for H F 0 Gao et al. (2008) have shown that the vector of estimates √ N C p is asymptotically multivariate normal with mean 0 and a certain covariance matrix Σ under the null hypothesis H F 0 . To address the problem of multiple comparisons to a control as described in (4), Gao et al. (2008) have shown that the distribution of the corresponding vector of test statistics T F = √ N c p/ σ , = 1, . . . , a − 1, satisfies the multivariate totally positive of order two (MTP2) condition. Therefore, Hochberg's step-up procedure (Hochberg 1986) is applicable to correct the individual p-values for multiplicity. This procedure rejects the individual null hypothesis H With regard to all pairwise comparisons defined in (5), Gao et al. (2008) generalize various single step and stagewise procedures for H F 0 . Hereby, the modified Campbell and Skillings (Campbell and Skillings 1985) procedure is recommended. At the initial step the treatments are ordered and labelled 1, . . . , a according to their effects p i and this same labelling is used in the subsequent steps. At the (a − p + 1)th step (p = 2, . . . , a), subsets of the form {j, j + 1, . . . , j + p − 1} are tested if and only if they have not been retained as homogeneous by implication at a previous step. For further details we refer to Gao et al. (2008).

Software
In this section we present the package nparcomp and provide examples that illustrate how the contained functions can be used to analyze the introduced two-sample problem or perform multiple comparisons via single step or stepwise procedures.

Nonparametric Behrens-Fisher problem
In this setting we analyze a two-sample design where we neither assume homogeneous variances nor any similarities in the shape of the distribution functions. By means of a data example, the functionality will be described explicitely. Consider the numbers of implantations data set available in the nparcomp package. In a fertility trial with 29 female Wistar rats the experimenter wanted to test if an active treatment influences the fertiliy of the rats. Therefore n 1 = 12 rats received a control while n 2 = 17 rats were administered a treatment. A first step in the analysis could be to test whether the numbers of implantations in the treatment group differ from the numbers of implantations in the control group. To do so we can use the function npar.t.test in the following way, testing H p 1 : p = 1/2 where p denotes the relative effect between the two treatment groups. By default the alternative is set to "two.sided". To obtain one-sided tests one can choose between "less" and "greater". We specify method = "t.app" to get an approximation by a t distribution as the asymptotic method. Here npar.t.test also provides the options "logit", "probit" and "normal" performing logit/probit transformations or a normal approximation. A studentized permutation test (Neubert and Brunner 2007) can be obtained by setting method = "permu". The confidence level is selected via the parameter conf.level. To get a structured overview of the outcome one can use the S3 method summary: The numbers of implantations tend to be larger in the verum group ( p = 0.743). The null hypothesis H 0 : p = 1/2 is significantly rejected at 5% level of significance. The nparcomp package provides a S3 method to visualize the estimator and its confidence interval: The obtained plot is shown in Figure 2.

Simultaneous inferences: Single step procedures
Here we use two examples how to analyze a one-way layout with a ≥ 3 groups to demonstrate the single step procedure described in Section 3. First, consider the relative liver weights trial, which was originally analyzed by Brunner and Munzel (2002), p. 97. The data is contained in the package. The response variable is the relative liver weight from n 1 = 8 rats in the negative control and n 2 = 7, n 3 = 8, n 4 = 7 and n 5 = 8 rats in the dose groups. The interest is in simultaneous many-to-one comparisons, i.e., to test the null hypotheses H p 0 : p 1 = p j , j = 2, 3, 4, 5, simultaneously. The function mctp provides different kinds of contrast matrices: "Tukey" for all-pairs comparisons, "Dunnett" for many-to-one comparisons, "Sequen" for sequential contrasts, "Williams" for Williams-type trend contrasts, AVE for average contrasts. In addition "Changepoint", "McDermott", "Marcus" and "UmbrellaWilliams" contrast are available (Konietschke, Libiger, and Hothorn 2012b). The user can also enter a "UserDefined" q × a contrast matrix containing the contrast coefficients in argument contrast.matrix. We apply mctp to the data set specifying type = "Dunnett" to perform many-to-one comparisons. Just like in the case of npar.t.test the alternative is set to "two.sided" by default. To obtain one-sided tests one has to choose between "less" and "greater". There are three options setting the asymptotic method: "mult.t" for a multivariate t distribution, "fisher" for the Fisher-approximation, "normal" for a normal approximation. The confidence level is set via conf.level: R> data("liver") R> tox.trial <-mctp(weight~dosage, data = liver, type = "Dunnett", + conf.level = 0.95, asy.method = "fisher", info = FALSE) R> summary(tox.trial) The relative liver weights tend to larger values for increasing dose ( p = (0.27, 0.32, 0.36, 0.69, 0.85) ). Simultaneous 95%-confidence intervals for the effects δ j = p j − p 1 as well as multiplicity adjusted p-values are displayed in the Analysis section of the output. Here, significant differences at 5% level occur between the negative control and both groups 4 and 5, respectively. A confidence interval plot is achieved via: For each individual null hypothesis the estimator and its 95% confidence interval is plotted (Figure 3). Here we can see the advantage of simultaneous confidence intervals. Whenever an individual hypothesis has been rejected by the multiple contrast test, the corresponding simultaneous confidence interval does not include the null.
In the second example we will analyze the colorectal cancer data set (Ryu 2009) contained in the package. It consists of 174 patients suffering from colorectal cancer which were randomly assigned to one out of three treatment regimens: IFL (irinotecan, bolus fluorouracil, and leucovorin), FOLFOX (infused fluorouracil, leucovorin, and ocaliplatin), and IROX (irinotecan and oxaliplatin). The response variable is the patients' appetite scores at their third visit for each treatment. There are five ordered categories: (1) normal, (2) not always good, (3) don't really enjoy, (4) force to eat, (5) can't stand. Ryu (2009) already analyzed this data set when presenting a score method with studentized range distribution and accompanied simultaneous confidence intervals for the ordinal effect measures. We will compare our results with the proposed method. The interest is in finding statistical differences between two treatment regimens. Therefore simultaneous all-pairs comparisons, i.e., testing the null hypotheses H p 0 : p i = p j , j = 1, 2, 3, simultaneously, will be performed. To this end we apply mctp to the data set specifying type = "Tukey" to stay compatible with Ryu (2009) The appetite scores in the IFL group ( p IFL = 0.417) tend to be smaller than those in the other two groups ( p IROX = 0.508, p FOLFOX = 0.574). The Analysis section contains simultaneous 95%-confidence intervals for the effects δ ij = p i − p j , i > j, as well as multiplicity adjusted p-values. A significant difference at 5% level occurs only between IFL and FOLFOX. This means the FOLFOX regimen leads to higher appetite scores than the IFL treatment. Thus one can infer that FOLFOX causes less severe appetite problems than IFL. We further note the three groups are not genuine random samples taken from a parent distribution, but that they are obtained through randomization of a non random sample. Therefore caution should be paid when drawing inferences (see Pesarin and Salmaso (2010), p. 10ff.). Ryu (2009) applies a score method based on pairwise relative effects to this dataset and obtains the same decisions. Only the 95%-confidence interval for the comparison between IFL and FOLFOX ([0.546,0.751]) does not include 1/2. However, note that an analysis based on pairwise effects can lead to paradox decisions and therefore global rank based procedures are preferred. We further note that Munzel and Hothorn (2001) propose non-parametric multiple contrast tests and SCI for pairwisely defined relative effects. In particular, Munzel and Hothorn (2001) implements the results in the R package npmc, which was, however, removed from CRAN due to abundance.

Simultaneous inferences: Stepwise procedures
In the previous section we have already seen how to perform many-to-one comparisons applying the single-step procedure mctp to the relative liver weights data set. Now we want to show an example of how to analyze a one-way layout with a ≥ 3 groups using the stepwise procedure introduced by Gao et al. (2008) presented in Section 4. The function gao implements a stepwise procedure which can only be used to perform nonparametric multiple tests for manyto-one comparisons. It is contained in the nparcomp package and can be called as follows. By default the first group by lexicographical ordering is handeled as control group. However, the user can specify it via the parameter control entering a character string defining the control group: R> data("liver") R> gao(weight~dosage, data = liver, alpha = 0.05) #----Xin Gao et al's (2008) Non-Parametric Multiple Test Procedure #----Type of Adjustment: Hochberg #----Level of significance = 0.05 #----The procedure compares if the distribution functions F() are equal. The FWER is strongly controlled #----This function uses pseudo ranks of the data! #----Reference: Gao, X. et al. (2008) The output mainly consists of two sections -the single analysis which contains raw p-values accompanied by Bonferroni-and Holm-adjusted p-values and the CS analysis which realizes the adjustment obtained by modifying Campbell and Skillings stepwise procedure. In simulations the CS-adjustment shows the best performence and is therefore recommended for real data evaluations. Here, the overall null hypothesis, i.e., H F 0 : F i = F j , ∀i, j = 0, 1, 2, 3, is rejected, because three individual hypotheses are rejected at α = 5% level of significance. All comparisons among the distributions from the control group and any active treatment group show an significant effect. In particular, all pairwise comparisons among the distributions from the active treatments do not demonstrate any significance at 5% level. Summing up we can find a significant effect of the active treatment, but we have also seen that this significance is only due to differences among the distributions from the reaction times between the control and active treatment groups, respectively.

Conclusions and future work
The R package nparcomp implements a broad range of rank-based nonparametric methods for multiple comparisons. The single step procedures provide local test decisions in terms of multiplicity adjusted p-values and simultaneous confidence intervals. They further allow for user-defined contrasts. In particular, paradox results in terms of Efron's paradox dice cannot occur (Thangavelu and Brunner 2006). The derivation of stepwise confidence intervals, however, is part of future research. A notable novel feature of nparcomp is that it can easily be extended to the anaylsis of factorial designs. We plan to update the package nparcomp on a regular basis with new nonparametric statistical procedures available for multiple comparisons. In addition, we plan to undertake a major update of the code and release nparcomp in the S4 style.
All implemented methods in the package are based on asymptotic results on the distribution of rank statistics. We plan to derive adequate resampling-and permutation based methods to approximate the distribution of the statistics for very small sample sizes. For example, optimal subset procedures and weighted methods controlling the familywise error rate are proposed by Finos and Salmaso (2005), Finos and Salmaso (2006) and Finos and Salmaso (2007). Permutation tests for umbrella alternatives and multivariate problems are considered in Basso and Salmaso (2011) and Pesarin and Salmaso (2012).