A C++ Program for the Cram´er-von Mises Two-Sample Test

As larger sets of high-throughput data in genomics and proteomics become more readily available, there is a growing need for fast algorithms designed to compute exact p values of distribution-free statistical tests. We present a program for computing the exact distribution of the two-sample Cram´er-von Mises test statistic under the null hypothesis that the two samples are drawn from the same continuous distribution. The program makes it possible to handle substantially larger sample sizes than earlier proposed computational tools. The C++ source code for the program is published with this paper, and an R package is under development.


Introduction
High-throughput technologies such as gene expression microarrays and proteomics are opening up a new era in biology and medicine. Such technologies yield an abundance of valuable quantitative information posing new challenges to the statistician. It has become common practice to use microarray technology for selecting "interesting" genes by comparing expression levels of thousands of genes in two different phenotypes. Modern methods for finding such differentially expressed genes typically employ two-sample statistical tests combined with multiple testing procedures to guard against Type 1 errors (see Dudoit et al. 2003;Simon et al. 2003;Speed 2003;McLachlan et al. 2004;Lee 2004;Wit and MacClure 2004, for reviews). As larger sets of microarray gene expression data become more readily available, nonparametric methods for testing multiple two-sample hypotheses in microarray data analysis are beginning to be more appreciated (Stamey et al. 2001;Grant et al. 2002;Troyanskaya et al. 2002;Xiao et al. 2004;Guan and Zhao 2005;Lee et al. 2005;Qiu et al. 2006, to name a few). For example, the Cramér-von Mises test appears to be quite competitive with the t test even when its power is assessed by simulating normally distributed log-expression levels under location alternatives (Qiu et al. 2006), conditions under which the t test is known to be optimal. The Cramér-von Mises test can provide a substantial gain in power as compared to the t test under some other departures from the null hypothesis.
In applications to microarrays and other types of high-throughput data, distribution-free statistical tests call for fast algorithms for computing exact p values because relevant asymptotic results (see Csörgö and Faraway 1996, for the Cramér-von Mises test) do not provide the required accuracy of approximation in the tail region of the corresponding limiting distribution. The reason is that the widely-used multiple testing procedures controlling the family-wise error rate (FWER) are focused on the region of very small p values. To illustrate this point, consider the following example. Suppose that one wishes to find changes in the marginal distributions of expression levels of 12558 genes when comparing two groups of patients of equal size n = m = 43 as in the application reported by Qiu et al. (2006). In other words, there are 12558 two-sample hypotheses to be tested in this setting. Consider a particular gene for which the observed Cramér-von Mises statistic value equals A = 2.2253921 with the exact and asymptotic p values being equal to 2.115 × 10 −6 and 3.994 × 10 −6 , respectively. The Bonferroni-adjusted p values are, therefore, equal to .02656 and .05015 respectively. Let the statistic value for another gene be equal to B = 2.1193889 so that the exact and asymptotic Bonferroni-adjusted p values are .0493 and .0866, respectively. As a result, all the genes with values of the Cramér-von Mises test statistic falling in the interval (B, A) will be declared differentially expressed when using exact p values, but they will not be selected if asymptotic p values are used at the same FWER level of 0.05. This example shows that the development of efficient numerical algorithms for computing exact p values has no sound alternative. Such algorithms should be designed to handle large sample sizes, that is situations where direct rearrangement methods are computationally prohibitive. In this paper, we present a C++ software that implements a numerical algorithm for computing the exact distribution of the two-sample Cramér-von Mises test statistic.
The Cramér-von Mises two-sample test is one of the best-known distribution-free two-sample tests. It is based on the statistic T 2 defined below. Suppose two independent samples x 1 , x 2 , . . . , x m and y 1 , y 2 , . . . , y n are drawn independently from two distributions with continuous cumulative distribution functions F (·) and G(·), respectively. Based on those samples, we want to test the null hypothesis H 0 : F = G against the two-sided alternative H 1 : F = G. The Cramér-von Mises test statistic T 2 is given by where F m and G n are the empirical distribution functions associated with the respective samples (x i ) and (y j ). This statistic and the test based on it (rejecting H 0 if the value of T 2 is "too large") were first studied by Anderson (1962) as a 2-sample variant of the goodness-of-fit test introduced by Cramér (1928) andvon Mises (1931).
The statistic (1) has a simple meaning. Move the m+n points x 1 , x 2 , . . . , x m and y 1 , y 2 , . . . , y n , without changing their mutual order, to their new positions, which are 1/(m + n), 2/(m + n), . . . , (m + n)/(m + n) = 1. Let {ξ 1 , . . . , ξ m } and {η 1 , . . . , η n } be two subsets of the set {1/(m + n), 2/(m + n), . . . , 1} formed by the x i 's and y j 's, respectively, and let F * m and G * n be the corresponding empirical distribution functions. Then T 2 equals, up to a constant factor depending only on m and n, the squared L 2 -distance between F * m and G * n . Anderson (1962) and Burr (1963) studied the distribution of T 2 under H 0 (which, obviously, does not depend on the distribution F = G) and provided some tables. In particular, Anderson (1962) listed some percentiles for T 2 when m, n ≤ 7. Burr (1964) did the same for m + n ≤ 17. However, the methods they used rely on the listing of all splittings of the set {1, 2, . . . , m + n} into a pair of subsets of cardinalities m and n (such splittings are often imprecisely called "permutations"). As a result, these methods are only applicable to small samples. Burr (1963) proposed an iterative method, which does not use permutations and is less computationally intensive. Using it, he found the exact distribution of T 2 for m = n = 10. Burr presented his algorithm for the case m = n only, but it works for the general case as well (Hájek andŠidák 1967). It is worth mentioning that the set of selected quantiles for T 2 with m = n ≤ 23 by Zajta and Pandikow (1977) are still based on permutations. The software proposed in the present paper makes it possible to handle substantially larger sample sizes m and n.
The paper is organized as follows. In section 2, the necessary technical details of the basic numerical algorithm (based on Burr's idea) are given. Section 3 describes the usage of the software. The C++ source code for the software is published with this paper and an R package is under development.

Basic ideas behind the design of the software
As we have mentioned in the Introduction, the design of the software is based on ideas originally explored by Burr (1963). The recurrence relations we use are akin to those given by Hájek andŠidák (1967).

Tabulating the distribution function
Let z 1 , z 2 , . . . , z m+n be the pooled and ordered sample of x 1 , x 2 , . . . , x m and y 1 , y 2 , . . . , y n . Under the assumption that each sample is drawn from a continuous population, the pooled sample almost surely has no ties. The test statistic only depends on the differences F m (z t )−G n (z t ) of the two empirical distribution functions at the observed values.
where L is the least common multiple of m and n. It follows that h t 's (t = 0, 1, 2, . . . , m + n) are integers and T 2 differs only by a constant factor from the integer-valued random variable Specifically, T 2 = (mn/(m + n) 2 L 2 )ζ. Then, the purpose is to find the distribution of ζ. To do that, it suffices to find the frequency function of ζ.
Throughout the paper, by a frequency function we understand a non-negative integer-valued function f on the set Z of all integers, such that the set {i ∈ Z : f (i) = 0} is non-empty and finite.
In computer, such a frequency function can be represented either by an array of values f (i) on a sufficiently long "interval" {0, 1, 2, . . . , I} (with I so large that f (i) = 0 if i > I), or by an array of pairs of integers (i, f (i)), where we may only store the pairs with f (i) = 0. We will call an array of either kind a frequency table and will use the terms "frequency function" and "frequency table" interchangeably. Note that in the actual program the second method of storage of frequency tables is used.
Let B be a finite set with the uniform probability measure on it, so that every element b ∈ B has probability 1/|B|. (Note that | · |, for a finite set ·, stands for its cardinality.) Let η be an integer-valued random variable on B, i.e., a mapping η : B → Z. Then by the frequency function of η (over B) we understand the function f η : Z → Z + defined by Obviously, the probability of the event η = i, i ∈ Z, equals f η (i)/|B|. Note that if the random variable η is non-negative (which will always be the case in this paper), then f η (i) = 0 for all i < 0.
Since F m (z m+n ) = G n (z m+n ) = 1, we have h m+n = 0. Let h 0 = 0, then the sequence (h t ) m+n t=0 , for a given pair of samples (x i ) and (y j ), satisfies the following relation: The sequence (h t ) m+n t=0 can be represented by a broken line (or, briefly, a path) on the plane R 2 (see Figure 1), joining the points (t, h t ), t = 0, 1, 2, . . . , m + n (see Figure 1). Note that the path starts at the point (0, 0), ends at the point (m + n, 0), and all of its m + n + 1 vertices belong to the lattice Z 2 ; m legs of the path are parallel translations of the vector (1, a), and the other n legs -translations of the vector (1, −b) (see Figure 1). There are totally m+n m such paths, and under the null hypothesis all of them are equally likely.
It follows from (3) and h 0 = h m+n = 0 that the possible values of h t (for a fixed t, 0 ≤ t ≤ m + n), form a finite arithmetic progression H t with the common difference a + b: where and The set H t contains at most s = min(m, n) + 1 points (exactly s points if t is between m and n) and H m+n = H 0 = {0} (see Figure 1).
In other words, f + t,d (s) is the number of paths from (0, 0) to (t, d), such that the sum (6) equals s. Apparently, f + m+n,0 is the frequency function of (2), and the sum of all its values (frequencies) is the total number of paths joining (0, 0) and (m + n, 0), i.e., m+n m . Hence, the probability mass function of (2) at point i ∈ Z (i.e., the probability that ζ = i) equals m + n m −1 f + m+n,0 (i).
For z ∈ Z, denote by I[z] the frequency function which is one at z, and zero elsewhere. It is obvious that f + 0,0 = I[0]. More generally, because there is exactly one path joining the points (0, 0) and (t, −bt), and similarly for the points (0, 0) and (t, at).
Any path joining the points (0, 0) and (t, d) (1 ≤ t ≤ m + n) should pass one of the two points (t − 1, d − a) and (t − 1, d + b). Therefore, which can also be rewritten as: Here S r , r ∈ Z, is the right shift operator: given a function f (·) on Z, the function S r f may be defined as follows: Recurrence relation (9), together with equations (8), allows to compute frequency functions f + t,d recursively, starting from f + 0,0 = I[0] and ending up with the function f + m+n,0 . However, the algorithm can be simplified if we change the "coordinates" t, d in the parallelogram formed by all paths from (0, 0) to (m + n, 0) (see Figure 1) to new coordinates u, v. These coordinates are defined as follows.
For a given path h i , i = 0, 1, . . . , t, from (0, 0) to (t, d), let u := |{j : 1 ≤ j ≤ t, h t −h t−1 = −b}| and v := |{j : 1 ≤ j ≤ t, h t −h t−1 = a}| (in other words, u and v are the number of y's and x's, respectively, among the first t z's). It follows that, given t and d, we have v+u = t, av−bu = d, which implies that u and v are uniquely determined by t and d. It is also clear that the pair (u, v) takes all values in Z 2 , such that We can use the new coordinates u and v to label the frequency functions f + t,d , putting while (8) transforms into equalities g + u,0 = I[b 2 u(u + 1)(2u + 1)/6], 0 ≤ u ≤ n and Note that g + n,m = f + m+n, 0 . Relations (11), (12) and (13) lead to the following algorithm for computing the frequency functions g + u,v .
Note that, as soon as the inner v-loop is complete, the frequency tables g + u−1,v , 0 ≤ v ≤ m, are not needed any longer, and the memory they occupy can be freed.
In the case m = n, the following algorithm allows to store only half of necessary frequency tables using the identity f + t,−d ≡ f + t,d .
We have conducted numerical experiments to compare the computed tail probabilities with those tabulated by Burr (1963). In Burr's tables, the values of the probability mass function are represented as ordinary fractions and they are in complete agreement with the results of our computations.

Computing p values
In practice, one usually needs tail probabilities of the test statistic (p values), rather than probabilities of individual values. Of course, we can compute the individual probabilities and then find the tail probabilities by summation. In this subsection, we describe an alternative way of computing them, which allows larger sample sizes.
over all paths connecting the points (t, d) and (m + n, 0). In view of symmetry, Proposition . Let M = [(m + n)/2] ([z] stands for the integer part of z). If m + n is even, the frequency function f + m+n,0 is If m + n is odd, Here * denotes the convolution of frequency functions: (g * h)(i) = k∈Z g(k)h(i − k).
Remark. Note that f + M,d ≡ 0 if d < l M or d > u M (see (4)).
Proof. Let f M d (d ∈ H M ) stand for the frequency function of (2) over all paths that connect the points (0, 0) and (m + n, 0) through the point (M, d). Any path connecting the points (0, 0) and (m + n, 0) should pass one and only one of the points (M, d), d ∈ H M . Therefore, the function f + m+n,0 can be decomposed into the sum of functions f M d , d ∈ H M : Suppose m + n is even, so that m + n = 2M ; since the path's part connecting (0, 0) with (M, d) and the part connecting (M, d) with (m + n, 0) can be chosen independently, we have (The argument i − d 2 , rather than i, of f M d is due to the fact that, adding the two sums (6) and (14), we "almost" obtain the sum (2), but take h 2 t into accout twice.) Equivalently, we Applying (18), (19) and (15), we obtain (16).
From now on we will assume that m ≤ n.
Formulas (16) and (17) show that, in order to tabulate the probability mass function of (2), it is sufficient to have the frequency tables for some pair (u, v) with u + v = M and av − bu = d, these frequency tables actually are {g 0,M , g 1,M −1 , . . . , g m,M −m }. Therefore, they can be obtained using the following algorithm.
It follows from the assumption (21) that l M = −bM (see (5)), so that each integer d ∈ H M equals one of the numbers Using these facts, we can re-write (16) as and (17) as Convoluting large frequency tables is computationally intensive. However, if we need to compute only several p values, this can be done fast.
Associate with each frequency function f its tail function f ≥ defined as We have where Assume Q is a given value of (2),then the corresponding p value, P , is Using (25) and (26), we can derive from (23) and (24) that if m + n is even, and if m + n is odd.
Now, the only need is to efficiently evaluate quantities of the following type: where f and g are two frequency functions and c an integer. We proceed further along the lines of the work by van de Wiel (2001).
Let f i = f (u i ), i = 1, 2, . . . , k, be all the non-zero values of the frequency function f , and g j = g(v j ), j = 1, 2, . . . , l, be those of g. We assume that both sequences u i and v j are strictly increasing. By the definition of R, where Note that all r i can be found using one linear run through both arrays u i and v j : we put i = 1, j = k + 1 and decrease j until the inequality v j ≥ c − u i fails or j = 0; the previous j is r 1 . We increase i by 1 and continue to decrease j until the same inequality fails or j = 0; the previous j is r 2 ; etc. Now we can compute the G i 's recursively: In view of equations (29) and (30), the p value P = Pr{ζ ≥ Q} can now be computed using the following algorithm.
We have made some additional improvements in the code to increase its efficiency. We do not describe these small modifications at length, lest the exposition become too cumbersome. Technical details are available from the corresponding author upon request.

Usage
The XCVMTest program allows to compute the exact null distribution of the Cramér-von Mises test statistic, as well as p values corresponding to given values of the statistic. The program works in command line mode, its C++ source code is available along with this paper. There are four ways to use the program.

A. XCVMTest m n
The program computes the distribution of the Cramér-von Mises statistic: its values, their probabilities and the corresponding p values. In addition, the output contains the related integers (see Section 2.1): ζ (the scaled statistic), f (ζ) (the frequency) and j≥ζ f (j) (the cumulative frequency). The command line arguments m and n are the sample sizes. E.g., typing XCVMTest 10 10 gives Table 2 in Burr (1963). The output is self-explanatory.
The arguments m and n are the sample sizes; t 1 , . . . , t r (r ≥ 1) are the given values of the Cramér-von Mises statistic. The output contains pairs consisting of the statistic values and the corresponding p values. (More precisely, the given values are first re-scaled to the ζ-scale (see Section 2.1) and rounded to the nearest integer, then the corresponding p values are computed.) The values -d and -f of the optional parameter have the following meaning: the option -d implies computing the full distribution of the Cramér-von Mises statistic (see Section 2.1), the p values being then obtained by summation. The option -f implies a more efficient computation using convolutions (see Section 2.2), which reduces the memory load and, therefore, extends the range of pairs (m, n) for which the computation is possible. The default option is -f.

C. XCVMTest [-d|-f] m n StatFileName
The only difference from B is that the given values of the Cramér-von Mises statistic are being read from a text file (whose name is the last argument of the command) rather than from the command line. The file consists of the statistic values separated with delimiters; possible delimiters are: the space, the tab character, and the end-of-line character. The values -d and -f of the optional parameter have the same meaning as in B.

D. XCVMTest [-d|-f] DataFileName
The program evaluates the Cramér-von Mises statistic and the corresponding p value for each pair of samples contained in the text file whose name is the last argument of the command.
The values -d and -f of the optional parameter have the same meaning as in B.
If the program, at some step of computation, cannot allocate enough memory, it displays a message to this effect and stops.
Our computation experiments were carried out on a UNIX workstation (Sunfire V480) with 16.3GB RAM, 4 × 8.0MB Cache and 4 × 1200MHz CPU. The computation was still successful for the following sample sizes m and n in the extreme cases m = n (L = n) and m = n−1 (L = n(n − 1)): • A: m=n=200 and m=60, n=61; • B, C, D with optional parameter -d: m=n=200 and m=64, n=65; • B, C, D with optional parameter -f: m=n=250 and m=80, n=81.
The following table presents the computing time for various pairs of sample sizes (m, n) for each option.