Wilcoxon Rank-Based Tests for Clustered Data with R Package clusrank

Wilcoxon Rank-based tests are distribution-free alternatives to the popular two-sample and paired t-tests. For independent data, they are available in several R packages such as stats and coin. For clustered data, in spite of the recent methodological developments, there did not exist an R package that makes them available at one place. We present a package clusrank where the latest developments are implemented and wrapped under a unified user-friendly interface. With different methods dispatched based on the inputs, this package offers great flexibility in rank-based tests for various clustered data. Exact tests based on permutations are also provided for some methods. Details of the major schools of different methods are briefly reviewed. Usages of the package clusrank are illustrated with simulated data as well as a real dataset from an ophthalmological study. The package also enables convenient comparison between selected methods under settings that have not been studied before and the results are discussed.


Introduction
The Wilcoxon rank-sum and signed-rank tests are important tools for two-group comparisons and paired comparisons, respectively.Unlike their counterparts under the normality assumption, they are attractive because they are rank-based without the need of distributional assumptions.Nonetheless, standard versions of such tests presume independent data, and cannot be applied to clustered data which frequently arise in many fields.Clustered data consist of data obtained from correlated observations from sub-units or members in each cluster, where clusters may be independent but measures from members within each cluster are not.For example, in longitudinal studies or familial studies, measures from observations of the same subject or the same family are not independent but correlated.The effective sample size for clustered data will be different from the number of observations in clusters due to intracluster dependence.Often times, because of the positive intracluster dependence, the variances of the test statistics are underestimated, and as a consequence, the resulting p-values are smaller than what they should be.The popular generalized estimating equations (GEE) approach (Liang and Zeger 1986) provides a general regression modeling strategy that considers intracluster dependence, and can be used to compare two groups as a special case.Unlike rank-based procedures, however, it is not invariant to monotonic transformations of the data.
Several recent developments have extended the Wilcoxon rank-sum test to allow two-sample comparisons for clustered data.Rosner and Grove (1999) proposed a Mann-Whitney Ustatistic for clustered data which corrects the variance of the test statistic for four types of intracluster correlation, but did not provide large sample theory.Rosner, Glynn, and Lee (2003) proposed an extended Wilcoxon rank-sum test under the assumptions that all subunit observations (or members) from the same cluster (i.e., subject) belong to the same treatment group, that observations within any cluster are exchangeable, and that the intracluster dependence does not vary across groups.With a test statistic in similar form as the standard Wilcoxon rank-sum test after ranking all the observations combined, they derived the asymptotic mean and variance under the clustered setting that accommodates unequal cluster sizes and possible stratification.Rosner, Glynn, andLee (2006a) extended their (2003) approach to accommodate the situation where members of a single cluster may be assigned to different treatment groups, but still assumes exchangeability with the same intracluster dependence across groups.The assumptions of Rosner et al. (2003) were relaxed in the approach of Datta and Satten (2005), which is based on within-cluster resampling (Hoffman, Sen, and Weinberg 2001) and remains valid when the cluster sizes are informative.More recently, Dutta and Datta (2016b) extended the idea of within-cluster resampling to further accommodate the case where the number of members in a group within a cluster is informative.
On the other hand, for the one-sample problems or paired comparison problems, Rosner, Glynn, and Lee (2006b) extended the Wilcoxon signed-rank test to clustered data by adjusting the variance of the standard test statistic, assuming a common intracluster correlation across clusters.The cluster sizes are allowed to vary, but the method does not consider the informative cluster sizes where the distribution of paired difference within a cluster depends on the cluster size.Datta and Satten (2008) proposed a signed-rank test based on sampling members within cluster that accounts for informative cluster sizes.
Existing implementations of Wilcoxon rank-sum and signed-rank tests are mostly standard versions where the data are assumed to be independent.They have long been available in standard software, such as wilcox.test in the built-in package stats of R (R Core Team 2016), PROC NPAR1WAY of SAS (SAS Institute Inc. 2013), and ranksum and signrank of Stata (StataCorp 2015).Permutation methods are available in StatXact (Cytel Inc. 2013).These implementations cannot, however, handle clustered data in general.One exception is, for instance, multi-center randomized clinical trials, where the centers can be viewed as blocks across the treatment groups.In this case, the R package coin (Hothorn, Hornik, van de Wiel, and Zeileis 2008), which provides a powerful toolkit for conditional inferences, can be applied.The general situation where the sampling unit is cluster, however, is not under the inference framework of the package coin.
Despite the popularity of clustered data arising from a wide range of applications such as biomedical and social science studies, the extensions of Wilcoxon rank-sum and signed-rank tests reviewed above have not been implemented in R until very recently.The package clusrank that we developed made first appearance on the Comprehensive R Archive Network (CRAN) in December, 2015.The package provides implementation of these recently available rank-sum tests (Rosner et al. 2003;Datta and Satten 2005;Rosner et al. 2006a) and signed-rank tests (Rosner et al. 2006b;Datta and Satten 2008) for clustered data.The methods are grouped into two categories by their authors: RGL for those by Rosner, Glynn, and Lee; and DS for those by Datta and Satten.Note that the RGL methods are available in SAS codes from Dr. Rosner's website (https://sites.google.com/a/channing.harvard.edu/bernardrosner/channing/)and in Stata (StataCorp 2015) package cluswilcox from Dr. Lee's website (http://cls.umd.edu/mtlee/).R codes for the DS methods are available from Dr. Datta's website (http://www.somnathdatta.org/softwar which was recently put into R package ClusterRankTest (Dutta and Datta 2016a) in April, 2016.Modeled after the familiar function wilcox.test in the R built-in package stats, the package clusrank that we developed unifies both RGL and DS methods under a user-friendly interface that accommodates the specifications from these methods.This makes it very easy for users to compare the performances of different approaches under various settings; see Section 6.
The rest of the article is organized as follows.The recently available Wilcoxon rank-sum tests and signed-rank tests for clustered data are briefly reviewed in Sections 2 and 3, respectively.The usage of the unified user-level function and major input arguments are described in Section 4. Illustrations of how to access the implemented methods using both simulated data and a real dataset from an ophthalmological study are presented in Section 5.In Section 6, comparisons of selected methods that have not been studied previously are reported, which are made very easy by the package using only a few lines of codes.A discussion concludes in Section 7.

Rank-sum test for clustered data
The Wilcoxon rank-sum test is used for two-sample comparison.Let X ij be the jth observation in the ith cluster, 1 ≤ i ≤ N , 1 ≤ j ≤ n i , where n i is the size of cluster i.Let δ ij be the group indicator of X ij ; δ ij = 1 if X ij is in group 1, and δ ij = 0 if X ij is in group 2. Let R ij be the rank of X ij among all the observations.The observed data consist of (X, δ) = {X ij , δ ij : 1 ≤ j ≤ n i ; 1 ≤ i ≤ N }.Clusters are assumed to be independent, while subunit observations within each cluster are not.The null hypothesis H 0 to be tested is that there is no difference of the measures of location of the two groups; i.e., the distribution of X ij remains the same regardless of the group indicator δ ij .

RGL method with cluster-level grouping
The RGL Wilcoxon rank-sum test (Rosner et al. 2003) was designed for the scenario where the treatment group is assigned at the cluster-level: i.e., δ ij = δ i for all 1 ≤ j ≤ n i .Define R i+ = n i j=1 R ij , the sum of observed ranks of the subunits in the ith cluster.The Wilcoxon rank-sum statistic is (1) The rationale of the RGL test procedure is random permutation conditioning on the observed R i+ , i = 1, . . ., N .Like all permutation-based approaches, the RGL method assumes that subunit observations within each cluster are exchangeable and that the intracluster dependence remains the same across groups.To derive the sampling distribution of W given R i+ 's, Rosner et al. (2003) stratified on the cluster sizes and investigated R i+ between two groups for each cluster size.Let G be the maximum cluster size; i.e., G = max 1≤i≤N n i .Then W in Equation ( 1) can be written as where I g is the set of indices of clusters whose size is g and W g = i∈Ig δ i R i+ .The null distribution of W conditioning on R i+ 's is obtained by combining all possible permutations of W g for each cluster size g ∈ {1, . . ., G}.Let N g be the number of clusters of size g, among which m g are in group 1 and n g are in group 2. The total number of permutation is G g=1 Ng mg .When N is large, exhaustive permutation is infeasible, and Rosner et al. (2003) proposed an asymptotic test statistic Z = (W − E(W ))/ VAR(W ), where E(W ) = G g=1 E(W g ), and VAR(W ) = G g=1 VAR(W g ).Under H 0 , for clusters with size g, the distribution of δ i is Bernoulli with probability m g /N g , and we have g , and ].The specific expressions needed are shown to be E(W g ) = m g R ++,g /N g and where R ++,g = i∈Ig R i+ .Rosner et al. (2003) showed that under mild conditions, the asymptotic distribution of the test statistic Z is standard normal.The method can be extended to the case of stratified data.Rosner et al. (2003) compares groups at each cluster size, the imbalance of sample sizes between two groups across cluster size strata may result in inefficiency (Datta and Satten 2005, p.911).If only one group shows up at a cluster size, the corresponding data will be ignored as no comparison at this cluster size can be made.Further, the rank-sum statistic scores clusters by the sum of ranks of cluster members, which is expected to perform best if intracluster dependence is weak; otherwise, when the effective number of independent observations per cluster becomes smaller, it overweights larger clusters, and, hence, may have lower efficiency.

DS method with subunit-level grouping
Unlike the RGL method, the DS method allows subunit observations within the same cluster to have different group memberships.The rationale is rooted in the within-cluster resampling principal of Hoffman et al. (2001).The test statistic is constructed from randomly picking one observation from each cluster to form a pseudo-sample and averaging the standard Wilcoxon rank-sum statistic over all possible pseudo-samples.Let X * i be a random pick from the ith cluster in the pseudo-sample, and δ * i be its group membership.The Wilcoxon rank-sum statistic for the pseudo-sample is where R * i is the rank of X * i among the pseudo-sample.The test statistic of the DS method is

VAR(S)
, where S = E(W * |X, δ).Datta and Satten (2005) derived the quantities needed to calculate the test statistic, using mid-ranks to allow for ties in the data.Let k=1 I(X ik ≤ x) be the empirical distribution function of the observations from cluster i and define The expectation turns out to be The variance term VAR(S) can be estimated by with n j1 being the number of subunits in group 1 from cluster j, and F = N i=1 n i F i / N i=1 n i , the pooled empirical distribution function of the observations.The asymptotic distribution of Z is standard normal under mild conditions (Datta and Satten 2005, p.910).
The DS method can be generalized to the comparison of location among m treatment groups, m ≥ 3, with the test statistic constructed from a quadratic form of group-wise rank-sum vector.This method allows arbitrary intracluster dependence structure (not necessarily exchangeable as assumed in the RGL method) within each cluster and remains valid when treatment affects the correlation structure.However, this test cannot be applied to strictly contralateral data (e.g., when each subject in an eye study have exactly one eye under each treatment) due to violation of the assumptions required by the asymptotic theory.Rosner et al. (2006a) extended the RGL method to allow subunit-level grouping; i.e., for each cluster i, treatment group indicator δ ij may take different values for j = 1, . . ., n i .The idea of this method can be easily explained with balanced data, where all cluster sizes are all the same; i.e., n i = g for all i.The rank-sum statistic is

RGL method with subunit-level grouping
where R ij is the rank of X ij among all the observed data.A cluster may have q ∈ {0, 1, . . ., g} subunits in group 1.The sampling distribution of W g,N is derived from a two-stage randomization: first, each cluster i is randomly assigned to a random number Q i according to the observed grouping distribution Q; then, within cluster i, a random Q i out of g subunits are assigned to group 1 while all the rest are assigned to group 2. Essentially, the first stage determines how many subunits are in group 1 in a cluster and the second stage determines which they are.The two-stage randomization process can be used to devise a random permutation test to exhaust all possibilities for small N and g.
It can be shown that where N q is the number of clusters with q members in group 1.The variance of W g,N can be estimated by }, and VAR and Ê are operated on the empirical distribution of Q. Rosner et al. (2006a) showed that converges to a standard normal distribution N → ∞ provided that lim N →∞ N q /N = ξ q , where 0 ≤ ξ q ≤ 1, if 0 < q < g; or, 0 ≤ ξ q < 1, if q ∈ {0, g}.This test is equivalent to the RGL test for balanced data when the treatment is assigned at the cluster-level.
For unbalanced data, let N (g) be the number of clusters of size g such that N = G g=1 N (g) , where G = max 1≤i≤N n i is the maximum cluster size.A test procedure can be constructed by efficiently combining W g,N (g) across all g ∈ {1, . . ., G}. Rosner et al. (2006a) proposed to base the test on a combined estimator θN of θ, the probability that an observation in group 1 is greater than that of an observation in group 2, which is 1/2 under the null hypothesis.

Signed-rank test for clustered data
The Wilcoxon signed-rank test is often used for paired data comparisons.Let X ij be the paired-difference score for the jth pair in the ith cluster, i = 1, . . ., N , j = 1, . . ., n i .The null hypothesis H 0 is that the marginal distribution of X ij is symmetric around 0. Note that unlike the rank-sum test, all subjects belong to a single group in the paired comparison setting.Let R ij be the rank of

RGL method: uninformative cluster size
The RGL method for the rank-sum test (Rosner et al. 2003) was adapted to the signed-rank test in Rosner et al. (2006b).For balanced data where n i = g for all i, the clustered WIlcoxon signed-rank statistic is where S i+ = g j=1 S ij is the rank sum within the ith cluster and only nonzero X ij are considered in the computation of signed-ranks.The null sampling distribution of T can be obtained from a randomization at the cluster-level conditional on S i+ 's.Let δ i , i = 1, . . ., N , be independent and identically distributed random variables with equal probability being 1 and −1; this distribution has variance 1.The conditional distribution of T given S i+ 's is the same as that of T p = N i=1 δ i S i+ .For small N , it is possible to assess the significance of T by enumerating all 2 N possibilities and computing the tail probability.A large sample test can be constructed with E(T ) = 0 and For unbalanced data, Rosner et al. (2006b) considered a stratified statistic where Si = S i+ /n i and w i = 1/VAR( Si ) under H 0 .The variance estimator VAR( Si ) is obtained assuming a shared intracluster correlation coefficient.The randomization distribution of T given Si 's is that of T p = N i=1 δ i w i Si , which facilitates a random permutation test for small N .The large sample test statistic is , where ŵi = 1/ VAR( Si ).It converges to a standard normal distribution as N → ∞ provided that G < ∞ and lim N →∞ N g /N → ξ g where 0 ≤ ξ g ≤ 1 for all g ∈ {1, . . ., G}.This method assumes that the cluster size distribution is uninformative, and, hence, is not valid when the distribution of paired differences within a cluster depends on the cluster size.

DS method: informative cluster size
Datta and Satten (2008) followed the same principle of within-cluster resampling as in Datta and Satten (2005) in the context of clustered paired data to develop a signed-rank test.This method allows informative cluster sizes as long as the marginal distributions of X ij 's are identical for all i and j.Suppose that from the ith cluster, paired difference X ij , denoted by X * i , is randomly picked and pooled to form a pseudo-sample.Let R * i be the mid-rank of |X * i |, i = 1, . . ., N to allow ties in the data, and let where F i (x) is the empirical distribution function of the observations in cluster i at x and F i (x−) is the left limit of F i (x) at x as in the DS method for the rank-sum test.It turns out that where The standardized test statistic Z = T / VAR(T ) converges to a standard normal distribution under mild conditions.
Note that the null distribution being tested using the DS method is the distribution of the paired difference of a randomly selected pair from a randomly selected cluster, regardless of the cluster size.Whereas the null distribution of most signed-rank tests is the common distribution of a randomly selected paired difference conditional on the size of the cluster it belongs to.Therefore the latter framework is a special case of the former.Also, the DS method accounts for the cluster size by assigning equal weight to each cluster instead of each paired difference, e.g., a paired difference from a larger cluster will be assigned a smaller weight than a paired difference from a smaller cluster.

Usage
Package clusrank provides a unified interface to all the methods reviewed in Sections 2 and 3 through function clusWilcox.test:

R> library(clusrank) R> args(clusWilcox.test)
function (x, ...) NULL Argument x can be either a numeric vector or a formula.The default interface is called if x is a numeric vector, in which case the interface is designed to mimic that of the function wilcox.test:R> args(getS3method("clusWilcox.test", "default")) function (x, y = NULL, cluster = NULL, group = NULL, stratum = NULL, data = NULL, alternative = c("two.sided","less", "greater"), mu = 0, paired = FALSE, exact = FALSE, B = 2000, method = c("rgl", "ds"), ...) NULL The arguments x, y, alternative, mu, paired, and exact have the same meaning as those in the familiar default interface of wilcox.test.Clustered rank-sum test is requested if paired = FALSE; clustered signed-rank test is requested if paired = TRUE.For both tests, the RGL and DS methods are requested with method set to be rgl and ds, respectively.
For clustered rank-sum tests, x, cluster, and group are required, which are of the same length; cluster and group specify the cluster membership and group membership, respectively.Argument y is not used for clustered rank-sum tests.The group assignment can be at either the cluster or the subunit level.When using RGL method for data with treatment group assigned at the cluster-level, an optional argument stratum can be specified to account for the stratification and therefore provide a more powerful test.The variables x, cluster, group and stratum can be found from a data frame specified by argument data.For clustered signed-rank tests, x and cluster are required while group is not needed because the data are paird differences.Argument x can be the parid difference between the pre-and post-treatment observations; alternatively, x and y can specify the pre-and post-treatment observations, respectively.This interface is also similar to that of wilcox.test.
The exact version of the test is requested by exact = TRUE.In this case, argument B is the number of random permutations to approximate the exact test, with default value 2000.The truly exact test is available for the RGL clustered rank-sum test with cluster-level grouping and the RGL clustered signed-rank test, which can be requested by setting B = 0; since this test is very computing intensive, it is recommended to be used only with small samples.
The formula interface mimics that of the wilcox.testtoo with arguments formula, subset and na.action: R> args(getS3method("clusWilcox.test", "formula")) function (formula, data = parent.frame(),subset = NULL, na.action = na.omit,alternative = c("two.sided","less", "greater"), mu = 0, paired = FALSE, exact = FALSE, B = 2000, method = c("rgl", "ds"), ...) NULL For clustered rank-sum tests, the left hand side of formula should be the data vector to be tested, and the right hand side of formula should contain the variables indicating group, cluster and possibly stratum.Except for the group variable, other variables on the right need to be indicated with special terms; for example, in formula z ~group + cluster(cid) + stratum(sid), group identifies the grouping variable, cid identifies the clusters, and sid identifies the stratum.See Section 5 for detailed illustrations.For clustered signed-rank test, the left hand side of the formula is the paired difference and the right hand side comprises the cluster id.Neither group or stratum is applicable for signed-rank tests.Other arguments are identical to those in the default interface.

Rank-sum test for clustered data
We use a scheme that is similar to the simulation study in Datta and Satten (2005) to generate data for clustered rank-sum test with both balanced and unbalanced data.For group g ∈ {0, 1}, the observations in a cluster is generated as where Z g is a standard multivariate normal random vector with mean zero and an exchangeable or autoregressive of order 1 (AR1) correlation matrix with correlation parameter ρ g , and δ is the group difference.The AR1 correlation structure is common in longitudinal studies, and it can be used to investigate the robustness of the methods when the intra-cluster correlation is not exchangeable.A correlation matrix of dimension dim with correlation parameter rho can be generated as follows, with ex for exchangeable and ar1 for AR1.
Below is a simple implementation that allows different levels of grouping (cluster-level and subunit-level) and unequal cluster sizes.Package mvtnorm (Genz, Bretz, Miwa, Mi, Leisch, Scheipl, and Hoth 2016;Genz and Bretz 2009)

) dat else dat[-drop, ] + }
There are two required inputs: nclus for the number of clusters in each group and maxclsize for the maximum cluster size.The difference between groups is specified by delta.The group specific intra-cluster correlation parameter ρ g is set by rho, a numeric vector of length 2, one for each group.The corr argument specifies the function to construct correlation matrix with correlation coefficients in rho.To allow unequal cluster sizes, misrate specifies the missing rate at which an subunit in a cluster to be excluded at random; when misrate = 0, balanced data with equal cluster size are generated.The grouping level is controlled by the logical argument clusgrp: TRUE for cluster-level and FALSE for subunit-level.The function returns a data set with three columns: observation x, grouping id grp, and cluster id cid.
For the replicability of the following demonstration, a random seed is set.

R> set.seed(1234)
To illustrate the clustered rank-sum test, a data set with 10 clusters of size 3 in each group is generated with δ = 0, exchangeable correlation structure, ρ 0 = ρ 1 = 0.9, and cluster-level treatment assignment.
R> dat.cl <-datgen.sum(10,3, 0, c(.9, .9),ex, 0, TRUE) The first and last 6 rows of the data frame looks like this: R> cbind (head(dat.cl, 6) The test statistics from the two methods are very close to each other.Note that, when using the code provided online by the authors of Rosner et al. (2003) and Datta and Satten (2005), statistics with different signs may occur.This is a result of using the opposite group to calculating the statistic.To get the matching results, one just needs to switch the group ids.
For data with cluster-level groups, the RGL method allows an extra stratum variable to accommodate stratification in the data.For illustration, we simply add an extra column strat to the dataset and perform the RGL test:

Signed-rank test for clustered data
To illustrate clustered signed-rank tests, we generate data from a scheme slightly modified from simulation scenario 1 in Datta and Satten (2008).The paired differences in a cluster are generated as where Z is a multivariate normal random vector with mean δ and an exchangeable or AR1 correlation structure with parameter ρ.This is implemented by the following function: R> datgen.sgn<-function (nclus, maxclsize, delta = 0.,

A real data example
The package contains a real dataset named amd from a retrospective observational study (Ferrara and Seddon 2015) that aims to characterize the phenotype associated with a rare variant in the complement factor H (CFH) R1210C, a protein involved in the age-related macular degeneration (AMD).The data contains measures of 283 eyes from 143 patients, among whom 62 had the rare variant and 81 did not.The clusters are the patients and the subunits are the eyes.The outcome variable was the AMD grading score based on the Age-Related Maculopathy Staging (CARMS), which has 5 levels.The first 3 levels correspond to the size of drusen which is an intermediate marker of AMD, while level 4 and 5 correspond to different types of AMD, with level 4 indicating the presence of geographic atrophy (GA) and level 5 indicating the presence of choroidal neovascularization (CNV).The Wilcoxon rank-sum test is to be carried out on two subsets: 1) the subset of observations with CARMS grade 1, 2, 3 or 4; and 2) the subset of observations with CARMS grade 1, 2, 3 or 5.In the amd data, the outcome variable is CARMS; the patients ids indicating the clusters are stored in variable ID; the rare variant grouping at the cluster (patient) level is indicated by variable Variant.
We apply both the RGL and DS method to the first subset: R> data(amd) R> clusWilcox.test(CARMS ~Variant + cluster(ID), data = amd, + subset = CARMS %in% c (1,2,3,4) The p-values from both tests are close and less than 0.001, which implies strong evidence of an association between the presence of CFH R1210C rare variant and the CARMS grade with GA as the advanced stage.
For the RGL method, a stratifying variable Agesex which categorizes the patients into 6 strata by their age and gender can be used as a control variable: Clustered Wilcoxon rank sum test using Rosner-Glynn-Lee method data: CARMS; group: Variant; cluster: ID; stratum: Agesex; (from amd) number of observations: 196; number of clusters: 112 Z = -4.0797,p-value = 4.509e-05 alternative hypothesis: true difference in locations is not equal to 0 The p-value is still less than 0.001 after controlling for age and gender.
We then apply the two tests to the second subset: R> clusWilcox.test(CARMS ~Variant + cluster(ID), data = amd, method = "rgl", + subset = CARMS %in% c(1, 2, 3, 5)) Clustered Wilcoxon rank sum test using This time, the p-values of the two approaches are easily discernable, which is quite possible because the methods are based on different assumptions.The DS method reports a p-value of 0.006312, in contrast to 0.06455 from the RGL method.The results suggest association between the presence of CFH R1210C rare variant and the symptom with CNV as the advanced stage.Again, the RGL method can be applied with age and gender controlled: Clustered Wilcoxon rank sum test using Rosner-Glynn-Lee method data: CARMS; group: Variant; cluster: ID; stratum: Agesex; (from amd) number of observations: 224; number of clusters: 121 Z = -1.8519,p-value = 0.06404 alternative hypothesis: true difference in locations is not equal to 0 The p-value from the RGL method remains virtually unchanged after controlling for the age/gender strata.

A simulation study
Comparisons of the RGL and DS methods under some common scenarios have not been studied in the recent literature.Such comparison can be easily done with the package clusrank.
Using the two data generation functions defined above, we conduct a simulation study comparing their sizes and powers in a few settings.The following function generates replicates of data for a given scenario and returns the empirical power for a given significance level.The first two arguments, nrep and level specify the number of replications and the desired significance level, respectively.Argument paired is a logical scalar to switch between the two methods, TRUE for signed-rank tests and FALSE for rank-sum tests.Other arguments have the same meanings as those in datgen.sumor datgen.sgndepending on the value of paired.The last argument ... is used to supply clusgrp, which is only needed by datgen.sumfor the level of groups.The function returns the empirical rejection rates of the RGL and DS methods for the setting defined by the inputs.

R>
As an example, consider the rank-sum tests in a setting with cluster-level grouping, each group containing 20 clusters of size 3, with exchangeable correlation parameter ρ = 0.5 in both groups.The group difference is set to be δ = 0. We do this experiment with 1000 replicates at significance level 0.05: R> simpower (1000, 0.05, FALSE, 20, 3, 0.0, c(0.5, 0.5), ex, 0., clusgrp = TRUE) rgl ds 0.052 0.056 The empirical sizes of both methods are close to the nominal level 0.05.
Similarly, a comparison for signed-rank tests can be done.This time we use an AR1 correlation setting with ρ = 0.5.(1000, 0.05, TRUE, 20, 3, 0.0, 0.5, ar1, 0.) rgl ds 0.055 0.057 Again, both methods have empirical sizes close to the nominal level.It means that the RGL method is robust to the violation of the exchangeability assumption in this setting.

Rank-sum tests for clustered data
Comparison between the RGL and DS methods for the rank-sum tests has never been done previously under subunit-level treatment group.Nor has it been done when the intracluster dependence is not exchangeable.Therefore we will present these comparisons in this session.Consider two equal size groups, with 20 or 50 clusters, and with exchangeable or AR1 intracluster correlation structure as specified in datgen.sum.The correlation parameters for the two groups were set to be (0.1, 0.1), (0.5, 0.5), or (−0.1, 0.9).The maximum cluster size G = max i n i has three levels: 2, 5, and 10.The simpower function makes the comparison very easy, and the misrate argument allows cluster sizes to be random, which is an additional scenario of interest that has not been compared before.Two missing rates, 0 and 0.5, were considered, representing balanced data and unbalanced data with missing completely at random, respectively.This is different from the informative cluster size setting studied in Datta and Satten (2005), where cluster size depends on group.The group difference δ was set to be 0, 0.2 and 0.5.For each setting, empirical rejection rates was obtained from 4000 replicates.
The results are summarized in Table 1-2.The empirical size (δ = 0) of both methods are close to the nominal level 0.05 in all the settings considered in this study.The empirical power (δ = 0) for both methods increases as the cluster size increases, and as the missing rate decreases.For balanced data, the powers of the two tests are very close regardless of the level of the treatment group assignment and the intracluster configurations.For unbalanced data from maximum cluster size 10 and missing rate 0.5, the DS method has higher power with cluster-level group assignment, while the RGL method has higher power with subunitlevel group assignment.As this setting has average cluster size 5, it can be compared with the balanced data setting with cluster size fixed at 5. Both methods have higher power in the random cluster size setting when the treatment group is assigned at the subunit-level.Under cluster-level group assignment, it appears that the RGL method has lower power with random cluster size, while the DS method has lower power with fixed cluster size, though the differences are not big.Both tests seem to be robust to the intracluster correlation structure.

Signed-rank tests for clustered data
For the signed-rank test, Datta and Satten (2008) did not have settings with completely random cluster size, and their non-exchangeable intracluster dependence is different from our AR1 setting.We considered settings similar to those for the rank-sum test.The paired differences were generated for 20 or 50 clusters.The intracluster correlation parameter was set to be 0.1, 0.5, and 0.9.The true difference δ was set to be 0, 0.2 and 0.5.For each setting, 4000 replicates were generated.both tests increases as the cluster size increases, and as the intracluster dependence decreases.

Selected results are summarized in
Completely random cluster size reduces the powers in comparison to the cases where the cluster sizes are fixed at their means.In all the settings considered here, the two methods preformed similarly.

Discussion
Clustered data are frequently encountered in scientific research, and rank-based tests for clustered data are an indispensable tool like their counterparts, the Wilcoxon tests for independent data.The package clusrank provides two schools, the RGL method and the DS method, of the recently developed rank-sum tests and signed-rank tests in a unified, higher-level, userfriendly interface.Users need to be aware of the applicability of these tests when using them.For example, the RGL method assumes exchangeability within clusters and does not account for informative cluster sizes; the DS method cannot be applied to contralateral designs with exactly one subunit in each group within a cluster; both asymptotic tests require that the number of clusters to not to be too small.
Implementation of other rank-based methods for clustered data can be considered in future development of the package clusrank.For the rank-sum test, the case of informative group size within a cluster (Dutta and Datta 2016b) would be a useful addition, though it is available in the package ClusterRankTest (Dutta and Datta 2016a).For the signed-rank test, Larocque (2005) has a variance estimator based on certain sums of squares over independent clusters.Sign tests and signed-rank tests for multivariate clustered data (Larocque 2003;Larocque, Nevalainen, and Oja 2007;Haataja, Larocque, Nevalainen, and Oja 2009) and multilevel data (Larocque, Nevalainen, and Oja 2008) have been studied.Some tests allow the distributions in two groups to have different scales and/or shapes under the null hypothesis (Larocque, Haataja, Nevalainen, and Oja 2010).Making these tests available would be of interest to many users.When covariates are available, rank regression for clustered data (e.g., Wang and Zhu 2006;Wang and Zhao 2008;Fu, Wang, and Bai 2010) would be the next step.
is used to generate multivariate normal random vectors.

Table 1 :
In all settings in this study, the empirical sizes of both methods are close to the nominal level, including the cases under of AR1 correlation where the exchangeability assumption for the RGL method is violated The empirical power of Empirical rejection percentage of the RGL and the DS methods for rank-sum tests at nominal significance level 0.05 when intracluster correlation is exchangable.The results are based on 4000 datasets.Each group contains same number of clusters.

Table 2 :
Empirical rejection percentage of the RGL and the DS methods for rank-sum tests at nominal significance level 0.05 when intracluster correlation is AR1.The results are based on 4000 datasets.Each group contains same number of clusters.

Table 3 :
Empirical rejection percentage of the RGL and DS methods for signed-rank tests at nominal significance level 0.05 with exchangable intracluster correlation.The results are based on 4000 datasets.Each group contains same number of clusters.

Table 4 :
Empirical rejection percentage of the RGL and DS methods for signed-rank tests at nominal significance level 0.05 with AR1 intracluster correlation.The results are based on 4000 datasets.Each group contains same number of clusters.