Rational Arithmetic Mathematica Functions to Evaluate the One-sided One-sample K-S Cumulative Sampling Distribution

One of the most widely used goodness-of-fit tests is the Kolmogorov-Smirnov (K-S) family of tests, which has been implemented by many computer statistical software packages. To calculate a p value (evaluate the cumulative sampling distribution), these packages use various methods including recursion formulae, limiting distributions, and approximations of unknown accuracy developed over thirty years ago. Based on an extensive literature search for the one-sided one-sample K-S test, this paper identifies two direct formulae and five recursion formulae that can be used to calculate a p value and then develops two additional direct formulae and four iterative versions of the direct formulae for a total of thirteen formulae. To ensure accurate calculation by avoiding catastrophic cancelation and eliminating rounding error, each formula is implemented in rational arithmetic. Linear search is used to calculate the inverse of the cumulative sampling distribution (find the confidence interval bandwidth). Extensive tables of bandwidths are presented for sample sizes up to 2,000. The results confirm the hypothesis that as the number of digits in the numerator and denominator integers of the rational number test statistic increases, the computation time also increases. In comparing the computational times of the thirteen formulae, the direct formulae are slightly faster than their iterative versions and much faster than all the recursion formulae. Computational times for the fastest formula are given for sample sizes up to fifty thousand.


Introduction
The Kolmogorov-Smirnov (K-S) family of tests is one of the most widely used goodness-of-fit tests and is included in many nonparametric statistics texts (see recent texts Gibbons and Chakraborti (2003), Sprent and Smeeton (2001), Conover (1999), Daniel (1990)). These include the one-sided one-sample, two-sided one-sample, one-sided two-sample, and the two-sided two-sample tests. The K-S family of tests also includes restricted range tests (comparing distributions over a portion of their range) and ratio tests (the ratio of one distribution to another).
For sample size n, the most common K-S test is the two-sided one-sample test which uses the maximum absolute distance D_n between the hypothesized continuous cumulative distribution F(x) and the empirical cumulative distribution F_n(x), D_n = sup_{−∞<x<∞} |F_n(x) − F(x)|, as the random variable. In a hypothesis testing application, computing the test statistic d is relatively easy compared with evaluating the cumulative sampling distribution to determine the p value, P[D_n ≥ d]. The cumulative sampling distribution is a piecewise polynomial that is different for each sample size n and whose complexity rapidly grows with increasing n, so that it has not even been generated, let alone used, for n > 31 (see Ruben and Gambino (1982) and Drew, Glen, and Leemis (2000)). Consequently, the limiting distribution, various recursion formulae, and various approximations have been used to evaluate the cumulative sampling distribution. In addition, many computer statistical software packages such as SPSS, STATISTICA, R, Numerical Recipes, and IMSL include K-S tests. Although a recursion formula will theoretically determine the p value P[D_n ≥ d] for a particular value d of the test statistic, the complexity of the formula is such that roundoff error and catastrophic cancelation can greatly reduce the accuracy of the calculations. Since most procedures used today were developed on pre-1978 computers where only machine precision was available, the accuracy of their results is not known exactly. Consequently, recursion formulae have only been used to generate tables for sample sizes of n ≤ 40, and various approximations of unknown accuracy have been used for n > 40. Using a recursion formula and rational arithmetic, Brown and Harvey (2005) were able to compute p values for sample sizes up to two thousand, n = 2,000.
In addition to the two-sided one-sample case (absolute difference between hypothesized and empirical), the one-sided one-sample (difference between hypothesized and empirical) cumulative sampling distribution is a complex series that can also be evaluated by recursion formulae. Indeed, many computer statistical software packages that implement both the two-sided and one-sided one-sample K-S test use different methods to calculate the p values. Table 1 summarizes the strategies used by some of these commercial statistical packages to calculate two-sided and one-sided one-sample p values. Although these packages compute the K-S test statistic in the same way, there is considerable difference in the way they evaluate the cumulative sampling distribution (calculate p values). The Numerical Recipes statistical subroutines in Press, Teukolsky, Vetterling, and Flannery (1992) use an approximation by Stephens (1970) to generate p values for the two-sided one-sample K-S cumulative sampling distribution. SPSS 15.0 (SPSS Inc. 2006) does not state how the p value for the two-sided one-sample K-S test is calculated. However, in 2002, the manual for SPSS 11.0 stated that the statistical software package used a modification of the limiting distribution derived by Feller (1948) and used by Smirnov (1948). Assuming SPSS has not changed how the p value is calculated, the 2002 method is listed in Table 1. The STATISTICA (StatSoft, Inc. 2006) software package uses the critical values tabulated by Massey (1950) and Massey (1951) to generate their p values. IMSL (Visual Numerics 2006) computes both the one-sided and two-sided K-S test statistics and then gives p values for each test. Specifically, for the one-sided K-S test and sample sizes n ≤ 80, IMSL uses a recursion formula in Conover (1972) to compute the exact p values for the one-sided sampling distribution, but for large sample sizes

Statistical     Type of         Used To
Software        One-sample      Calculate
Package         K-S Test        p values
IMSL            One-sided       For n ≤ 80, recursion formula by Conover (1972).
                Two-sided       Double the corresponding one-sided p value.
SPSS            Two-sided       Modification of limiting distribution derived by Feller (1948).
STATISTICA      Two-sided       Critical values tabulated by Massey (1950) and Massey (1951).

Table 1: Methods used by statistical software packages to calculate one-sample K-S p values.

n > 80, it uses the one-sided limiting distribution. IMSL then doubles the one-sided p values to get the corresponding two-sided p values. Like IMSL, the R statistical software package computes both the one-sided and two-sided K-S test statistics but uses different methods to compute the p values. For large sample sizes n > 100, R Development Core Team (2006) states that the asymptotic distributions are used (presumably the formula by Feller (1948) for the one-sided K-S test and the formula by Kolmogorov (1933) for the two-sided case). For small sample sizes n ≤ 100, the one-sided K-S test uses the direct formula by Smirnov (1944) to calculate the p value and the two-sided K-S test uses the matrix formula of Durbin (1973) as implemented by Marsaglia, Tsang, and Wang (2003). In 2002, the R statistical software package instead computed the one-sided p values using the techniques in Conover (1972) and doubled the one-sided p value to get the two-sided p value. It is not clear whether R in 2002 also used the one-sided limiting distribution for large sample sizes.
In addition to the one-sample K-S cumulative sampling distributions, there are sampling distributions for many other K-S type statistics including two-sample (difference between two empirical distributions), restricted range (two distributions compared over a portion of their range), and ratios (the ratio of one distribution to another). For the two-sample K-S case with sample sizes m and n, the one-sided cumulative sampling distribution is a simple formula for m = n and a complex series for m ≠ n. The two-sided two-sample K-S cumulative sampling distribution is a complex series for m = n while a recursion formula is needed for m ≠ n.
There are many K-S one-sided one-sample restricted range and ratio cumulative sampling distributions whose formulae are known but are very complex expressions.
Given the advances in computing power and computational software in the past thirty years, it is time to reconsider the entire area and, if possible, devise techniques to accurately and quickly evaluate K-S cumulative sampling distributions. Since the entire K-S cumulative sampling distribution area is so large, the question is where we should begin a comprehensive evaluation of the alternate formulae for the various K-S cumulative sampling distributions. Because of the large number and complexity of the formulae, the one-sided one-sample K-S restricted range and ratio areas are not good starting points. Similarly, since two-sample tests require more computational work than one-sample tests, the two-sample area should be done after the one-sample area. The one-sample area contains two tests, the two-sided one-sample K-S test and the one-sided one-sample K-S test. Since Brown and Harvey (2005) have already investigated the two-sided one-sample case, this paper will investigate the one-sided one-sample case.
This paper reviews the one-sided one-sample K-S cumulative sampling distribution formulae, devises rational arithmetic implementations of each formula, verifies the validity of each implementation by determining if each implementation gets exactly the same p value over a broad range of examples, develops an efficient method to calculate the bandwidth (the inverse of the cumulative sampling distribution), and finally compares the computational times needed for each implementation to determine the fastest formula.

One-sided one-sample K-S sampling distribution formulae
There are two one-sided one-sample random variables: the one-sided upper random variable D_n^+ = sup_{−∞<x<∞} (F_n(x) − F(x)) and the one-sided lower random variable D_n^− = sup_{−∞<x<∞} (F(x) − F_n(x)). Since by symmetry D_n^+ and D_n^− have the same cumulative sampling distribution, D_n^+ is used to represent both cases. Based on an extensive literature search for the one-sided one-sample K-S test, this paper identifies two direct formulae and five recursion formulae that can be used to calculate a p value, P[D_n^+ ≥ d^+], and then develops two additional direct formulae and four iterative versions of the direct formulae for a total of thirteen formulae. Table 2 contains a summary of the thirteen formulae which are developed in this section.
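Before any sampling distribution is evaluated, the observed statistic d^+ must be computed from the sample. As an illustration only (the paper's own implementations are in Mathematica), a minimal Python sketch of d^+ for a hypothesized continuous cdf, using the standard fact that the supremum of F_n(x) − F(x) is attained just at an order statistic:

```python
# Illustrative sketch, not the paper's code: compute the one-sided upper
# K-S statistic d+ = sup_x (F_n(x) - F(x)) = max_i (i/n - F(x_(i))).

def ks_one_sided_statistic(sample, cdf):
    """Return d+ for a sample against a hypothesized continuous cdf."""
    xs = sorted(sample)
    n = len(xs)
    # The supremum of F_n(x) - F(x) occurs at an order statistic x_(i),
    # where F_n jumps to i/n; d+ is never negative.
    return max(0.0, max(i / n - cdf(x) for i, x in enumerate(xs, start=1)))

# Example: three observations tested against the uniform distribution on [0, 1].
d_plus = ks_one_sided_statistic([0.1, 0.4, 0.7], lambda x: x)  # approx 0.3
```

The statistic itself may be computed in floating point; the paper's point is that the subsequent p value calculation should be exact, which motivates converting d^+ to a rational number before evaluating the sampling distribution.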

Direct formulae
A closed form expression of the one-sided one-sample K-S cumulative sampling distribution was developed by Smirnov (1944) and verified by many scholars including Feller (1948) and Birnbaum and Tingey (1951). For 0 < d^+ ≤ 1 and sample size n, Smirnov's formula, denoted by SmirnovD in this paper, is shown in the first row of Table 3, where ⌊n(1 − d^+)⌋ is the greatest integer less than or equal to n(1 − d^+). Dwass (1959) derived a different formula denoted by DwassD that is also shown in Table 3. Both the SmirnovD and DwassD formulae are also derived in Durbin (1973). A second form of the Smirnov distribution denoted by SmirnovAltD and a second form of the Dwass distribution denoted by DwassAltD are derived by factoring 1/n^(n−1) out of their respective formulae. The alternate forms, SmirnovAltD and DwassAltD, shown in Table 3 may be faster than the original formulations because the terms inside the summation are simpler rational numbers.
In most applications, the test statistic d^+ is less than 0.5 and usually much less than 0.5, which means the number of terms ⌊n(1 − d^+)⌋ + 1 in the SmirnovD and SmirnovAltD formulae can be close to the sample size n.

Table 2: Thirteen formulae to calculate a K-S one-sided one-sample p value.

In comparison, the number of terms ⌊nd^+⌋ + 1 in the DwassD and DwassAltD formulae is much less than the number in the SmirnovD and SmirnovAltD formulae and should therefore take less computation time. On the other hand, all the terms in the SmirnovD and SmirnovAltD formulae are positive while the terms in the DwassD and DwassAltD formulae alternate signs. This means that the Dwass formulae are much more susceptible to error than the Smirnov formulae. Note that implementing the formulae in rational arithmetic removes the issue of computational error as all computations are exact. This will be discussed in detail in Section 3.1.
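The exact evaluation described above can be carried out in any system with rational arithmetic. The paper's implementations are in Mathematica; as an illustrative analogue, the Smirnov closed form as given by Birnbaum and Tingey (1951), P[D_n^+ ≥ d^+] = d^+ Σ_{j=0}^{⌊n(1−d^+)⌋} C(n, j)(d^+ + j/n)^(j−1)(1 − d^+ − j/n)^(n−j), can be sketched with Python's fractions module:

```python
# Illustrative rational-arithmetic sketch of the SmirnovD direct formula
# (Smirnov 1944 / Birnbaum and Tingey 1951); the paper's code is Mathematica.
from fractions import Fraction
from math import comb, floor

def smirnov_p(n, d):
    """Exact P[D_n^+ >= d] for rational d in (0, 1], evaluated entirely
    in rational arithmetic so there is no rounding or cancelation error."""
    d = Fraction(d)
    total = Fraction(0)
    for j in range(floor(n * (1 - d)) + 1):
        term = Fraction(comb(n, j))
        term *= (d + Fraction(j, n)) ** (j - 1)   # exponent -1 at j = 0 is fine
        term *= (1 - d - Fraction(j, n)) ** (n - j)
        total += term
    return d * total

p = smirnov_p(2, Fraction(1, 2))   # exact rational result: 1/4
```

The j = 0 term reduces to (1 − d^+)^n, so the sum above reproduces the familiar Birnbaum-Tingey form exactly; every intermediate value is a ratio of integers, matching the paper's rationale for rational arithmetic.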

Iterative formulae
Each of the four formulae in Table 3 can be transformed into an iterative formula which might be faster than the original formula. The Smirnov formula will be used to illustrate the process. Let γ_j be the value of the jth term in the series and let x_j be the iterative factor that converts γ_{j−1} to γ_j so that γ_j = x_j γ_{j−1}. For the Smirnov term γ_j = C(n, j)(d^+ + j/n)^(j−1)(1 − d^+ − j/n)^(n−j), the ratio is

x_j = γ_j/γ_{j−1} = ((n − j + 1)/j) · (d^+ + j/n)^(j−1)/(d^+ + (j − 1)/n)^(j−2) · (1 − d^+ − j/n)^(n−j)/(1 − d^+ − (j − 1)/n)^(n−j+1).

Note that the x_j must be the same for the SmirnovD and SmirnovAltD formulae since the SmirnovAltD formula is simply the SmirnovD formula with 1/n^(n−1) factored out.
Table 3: K-S one-sided one-sample direct formulae, where ⌊n(1 − d^+)⌋ is the greatest integer less than or equal to n(1 − d^+).
For the DwassD and DwassAltD formulae, let y j be the iterative factor that converts γ j−1 to γ j so that γ j = y j γ j−1 . The following derives y j for the DwassD and DwassAltD formulae.
Using the same process as above, the iterative formula for SmirnovD denoted by SmirnovI can be derived. Similarly, the iterative formulae for SmirnovAltD, DwassD, and DwassAltD can be derived and are denoted by SmirnovAltI, DwassI, and DwassAltI respectively. The results are shown in Table 4.
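As a concrete sketch of the iterative idea (again in Python's fractions rather than the paper's Mathematica), each term γ_j of the Smirnov sum is obtained from γ_{j−1} by multiplying by the ratio x_j = γ_j/γ_{j−1}, so no binomial coefficient or power is recomputed from scratch:

```python
# Illustrative sketch of an iterative (SmirnovI-style) evaluation: each term
# of the Smirnov sum is the previous term times x_j = gamma_j / gamma_{j-1}.
from fractions import Fraction
from math import floor

def smirnov_p_iterative(n, d):
    """Exact P[D_n^+ >= d], computed term by term via the iterative factor."""
    d = Fraction(d)
    gamma = (1 - d) ** n / d          # j = 0 term: C(n,0) d^(-1) (1-d)^n
    total = gamma
    for j in range(1, floor(n * (1 - d)) + 1):
        x_j = (Fraction(n - j + 1, j)                       # C(n,j)/C(n,j-1)
               * (d + Fraction(j, n)) ** (j - 1)
               / (d + Fraction(j - 1, n)) ** (j - 2)
               * (1 - d - Fraction(j, n)) ** (n - j)
               / (1 - d - Fraction(j - 1, n)) ** (n - j + 1))
        gamma *= x_j
        total += gamma
    return d * total
```

Because every x_j is a rational number, the iterative version produces exactly the same rational p value as the direct formula, which is the consistency check the paper performs across all thirteen implementations.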

Table 4: K-S one-sided one-sample iterative formulae, where ⌊n(1 − d^+)⌋ is the greatest integer less than or equal to n(1 − d^+).
In addition to the four direct formulae and four iterative formulae, five recursion formulae to compute the one-sided one-sample K-S p value have been derived and are presented in chronological order in the next five subsections.

Daniels recursion formula
Daniels (1945) derived a difference equation that was later restated by Noe and Vandewiele (1968). The form of Daniels' recursion formula (referred to henceforth as Daniels) shown below is derived by solving the difference equation for Q_i(1). The recursion formulae use the test statistic d^+ = t/n or t = d^+ n.

Noe and Vandewiele recursion formula
Since the Daniels recursion formula has both positive and negative terms, Noe and Vandewiele (1968) derived an alternate recursion formula that has only non-negative terms. Noe (1972) later added a correction to this recursion formula. The particular form of the recursion formula (referred to henceforth as Noe) listed below containing Noe's correction is taken from Shorack and Wellner (1986), page 363, formulas (24) through (28).

Steck recursion formula
Steck (1969) derived the recursion formula (referred to henceforth as Steck) shown below that was later listed in Shorack and Wellner (1986).
Although not stated explicitly, this appears to be the recursion formula used by the IMSL and R statistical software packages.

Bolshev recursion formula
Kotelnikov and Chmaladze (1983) used the recursion formula (referred to henceforth as Bolshev) shown below that was later called the Bolshev recursion in Shorack and Wellner (1986).

Computational and research issues
The thirteen formulae presented in the last section are complex. Consequently, implementing them raises certain computational issues that need to be studied. The following are the three major computational questions that need to be resolved before the thirteen formulae are implemented.
1. What type of computational arithmetic should be used?
2. What arithmetic form should be used to input the test statistic d + to the thirteen formulae?
3. What is the most efficient way to calculate the binomial coefficients?
These questions will be considered and answered in the order shown above. After these implementation questions have been resolved, the following four research questions will be answered.
1. What is the best way to calculate the bandwidth (the inverse of the cumulative sampling distribution)? The answer to this question will be used to compute and present detailed bandwidth tables.

2. What is the relationship between computation time and sample size n?
3. What is the fastest formula?
4. What is the relationship between the accuracy of the test statistic d + and the computation time?

Computational options
Using current computational software, the formulae can be implemented using rational arithmetic, arbitrary precision arithmetic, or machine precision arithmetic. Rational arithmetic stores every number as a ratio of two integers (a rational number) where each integer can have as many decimal digits as needed to express the number exactly. Although the speed of rational arithmetic declines as the number of digits in the numerator/denominator integers increases, it has the advantage of no error as long as no irrational numbers are used.

Conversely, machine precision arithmetic specifies the number of decimal digits (usually less than twenty and determined by the computer hardware) to use in computations, so it is subject to roundoff error and catastrophic cancelation. Catastrophic cancelation occurs when one number is subtracted from another number of about the same value. For example, if 123.345679 is subtracted from 123.345689, both with nine decimal digits of precision, then the result is 0.000010 with only two decimal digits of precision. Although machine precision is fast, it is possible to significantly degrade the accuracy and, even worse, not be aware that the accuracy has been reduced.

Arbitrary precision arithmetic is like machine precision except that the number of decimal digits of precision is not dependent on the computer hardware and the user can specify the number of decimal digits of precision. Although arbitrary precision is slower than machine precision, it is faster than rational arithmetic. In addition, the software system Mathematica, Wolfram (2003), keeps track of the resulting precision rp so that for the example above, Mathematica would also know that the result 0.000010 had a precision of rp = 2. The trick in using arbitrary precision arithmetic is specifying the precision to be used in internal calculations (internal precision ip) so that the final answer has a specified desired precision dp or greater.
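The cancelation example can be reproduced directly. A small Python comparison (illustrative only; the paper works in Mathematica) of floating-point subtraction against the exact rational result:

```python
# Catastrophic cancelation demo: nearly equal numbers subtracted in floating
# point lose most of their significant digits; rational arithmetic does not.
from fractions import Fraction

a, b = 123.345689, 123.345679
diff_float = a - b                  # close to, but not exactly, 0.000010

# The same subtraction in rational arithmetic is exact.
diff_exact = Fraction(123345689, 10**6) - Fraction(123345679, 10**6)
```

The floating-point difference carries only the few digits that survived the cancelation, while the rational difference is exactly 1/100000, which is the property the paper relies on for exact p values.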
In other words, the user must specify both ip and dp so that the final answer has rp ≥ dp. Since all the K-S formulae can be modified so that no irrational numbers are used, this paper will use rational arithmetic implementations as they produce exact p values with no error. Future research will develop machine precision and arbitrary precision implementations whose accuracy will then be verified by the rational arithmetic implementations in this paper.
In terms of accuracy, rational arithmetic gives the exact probability (no error) as long as the test statistic d^+ can be expressed exactly as a rational number; it cannot be an irrational number like π/100. The only way d^+ can be an irrational number is if the hypothesized distribution F(x) for some x is an irrational number, because by definition the empirical distribution F_n(x) is a rational number (i/n for i = 0, 1, 2, . . . , n). In such cases d^+ can be approximated arbitrarily closely by rational numbers above and below d^+. These are then used to calculate the p value P[D_n^+ ≥ d^+] to any desired accuracy. Thus, rational arithmetic either provides the exact p value if d^+ is a rational number or can get as close as the user desires if d^+ is an irrational number.

Test statistic complexity
Using rational arithmetic, the sample size n and the test statistic d^+ can produce a p value with many digits in the numerator and denominator integers. For example, when n = 200 and d^+ = 13/200, d^+ has two digits in the numerator and three digits in the denominator.

A rational number implementation of each of the thirteen formulae has two inputs: the sample size n, which is by definition a rational number (an integer), and the test statistic d^+, which is a number between zero and one, 0 ≤ d^+ ≤ 1. If d^+ is neither zero, one, nor an irrational number, then d^+ can be expressed as either a rational number or an arbitrary precision number. For example, with n = 100 and d^+ = 0.183683, d^+ can be used in the program as the rational number d^+ = 183683/1000000 or as the arbitrary precision number d^+ = 0.183683 while all the rest of the computations in the Mathematica program are in rational arithmetic. Since Mathematica treats rational numbers and arbitrary precision numbers differently, the same Mathematica program that implements the SmirnovD formula in rational arithmetic will yield different probabilities depending on whether d^+ is used as a rational number or an arbitrary precision number in the program. When used as a rational number in the Mathematica program, d^+ = 183683/1000000 produces a rational number probability with 597 digits in the numerator and 600 digits in the denominator ([597/600] numerator/denominator digits) that, when converted to an arbitrary precision number with 20 decimal digits of accuracy, yields P[D_100^+ ≥ d^+] = 0.0010000109813850096033. However, when used as an arbitrary precision number in the Mathematica program, d^+ = 0.183683 yields a different p value, P[D_100^+ ≥ d^+] = 29398345/29398022169 = 0.0010000109813850110101, than that produced by the rational number d^+.
Since the correct probability is the one produced by the rational number input, the input d^+ is always converted to a rational number before it is used in any Mathematica program.
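The same conversion pitfall exists in any rational-arithmetic system, not just Mathematica. In Python's fractions module, for example, constructing the rational from the decimal string reproduces 183683/1000000 exactly, while constructing it from the floating-point literal silently goes through binary representation (an illustrative analogue of the paper's point):

```python
# Converting a decimal test statistic to an exact rational (illustrative).
from fractions import Fraction

d_from_string = Fraction("0.183683")     # exactly 183683/1000000
d_from_float = Fraction(0.183683)        # exact value of the nearest double,
                                         # which is NOT 183683/1000000
```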

Calculating a series of binomial coefficients
Since all thirteen formulae use the binomial coefficient in almost every term, an important consideration is how to calculate them. Although the notation varies in each formula, let C(k, j) represent the binomial coefficient. The four iterative formulae include the binomial coefficient in the iterative factors x_j and y_j, so that the binomial coefficient is automatically included and need not be calculated separately. However, every one of the direct formulae and the recursion formulae uses the binomial coefficient in each term of the summations. In each of these summations, the binomial coefficient C(k, j) for each succeeding term changes by adding one to j, so that each formula uses a series of binomial coefficients. The efficiency of the method for calculating the series of binomial coefficients as rational numbers will affect the speed of the implementation of each formula. The following two competing methods can be used to compute every binomial coefficient in the series as a rational number: RNBF, which computes each coefficient directly with the Mathematica Binomial function, and RNIC, which computes each coefficient iteratively from the preceding one. The difference in time increases as the sample size n increases. This implies that as the number of terms grows, a RNBF implementation will eventually exceed the time needed by the corresponding RNIC implementation. In addition, RNBF uses the Mathematica Binomial function, so any code using the RNBF method must be implemented in Mathematica, while the RNIC method can be implemented using any rational arithmetic software. Thus RNIC is more portable than RNBF, which is another reason for adopting RNIC. As a result, this paper will use the RNIC method exclusively to calculate the binomial coefficients for the thirteen formulae.
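The RNIC idea, updating C(n, j) from C(n, j−1) rather than recomputing each coefficient from scratch, can be sketched in a few lines (Python used for illustration; the paper's RNIC is written in Mathematica):

```python
# Iterative generation of a series of binomial coefficients (RNIC-style sketch):
# C(n, j) = C(n, j-1) * (n - j + 1) / j, one cheap update per coefficient.
from math import comb

def binomial_series(n, jmax):
    """Yield C(n, 0), C(n, 1), ..., C(n, jmax) iteratively."""
    c = 1                            # C(n, 0)
    yield c
    for j in range(1, jmax + 1):
        c = c * (n - j + 1) // j     # the integer division is exact here
        yield c

coeffs = list(binomial_series(20, 20))
```

Each update costs one multiplication and one exact integer division, independent of how large n is, which matches the portability and speed arguments made for RNIC above.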

Implementations of the thirteen formulae
All the Mathematica code generated for this paper is contained in one Mathematica file named KS1SidedOneSampleRational.nb which is divided into 23 sections. Each section contains one program and sample output. Table 5 contains a list of all the formulae, their type, the Mathematica function name implementing them, and the section number in file KS1SidedOneSampleRational.nb that contains the Mathematica code and sample output.
Calculating the one-sided bandwidth d^+(n, α, ρ)
In addition to calculating the p value for hypothesis testing, the one-sided one-sample K-S cumulative sampling distribution can be used to construct a one-sided confidence band around the empirical distribution F_n(x). The bandwidth of a one-sided confidence band with confidence coefficient 1 − α and sample size n is the value of the test statistic d^+ that satisfies P[D_n^+ ≥ d^+] = α. Determining a bandwidth d^+ for a particular sample size n and confidence coefficient 1 − α means evaluating the inverse of the cumulative sampling distribution, which can only be done by search techniques such as binary search. Unlike the p value, a bandwidth d^+ cannot in practice be determined exactly because the search technique may not converge to the exact value. For example, binary search with starting values of 0 and 1 would never find d^+ = 1/3 and would iterate forever. Thus, search techniques are designed to stop when a specified accuracy is reached. Let d^+(n, α, ρ) represent the bandwidth rounded to ρ significant digits for sample size n and confidence coefficient 1 − α. Note that bandwidth d^+(n, α, ρ) is also the hypothesis testing critical value for an α level of significance.
Finding the bandwidth d + (n, α, ρ) is a three step process: (1) use an approximation to find an initial value close to d + (n, α, ρ), (2) use the initial value to find upper and lower bounds on d + (n, α, ρ), and (3) use a search procedure to determine d + (n, α, ρ) between the lower and upper bounds.
The first step will use the approximation of Maag and Dicaire (1971) to find the initial value. Since the initial value found by the approximation in the first step should be fairly close to the actual value, the second step gradually increases the distance away from the initial value until a lower and an upper bound on the actual value are found. Although there are many search techniques that can be used in the third step to determine the bandwidth, this paper will consider the two most common techniques: binary search and linear search. Note that binary search takes the midpoint between the upper and lower bounds as the next value to test while linear search uses linear interpolation to find the next value. Preliminary computational experience showed that linear search was always faster than binary search, so the following linear search algorithm is used to find the bandwidth d^+(n, α, ρ).
Step 2 (Determine If Initial Value Is Lower Or Upper Bound): Calculate p = P[D_n^+ ≥ d^+]. If p > α, then d^+ is a lower bound: set d^+_L = d^+, set p_L = p, and go to Step 3. Otherwise, d^+ is an upper bound: set d^+_U = d^+ and go to Step 6.

Step 3 (Determine an Upper Bound): Convert d^+_L to a numerator integer dnumerator_L and a denominator integer ddenominator_L, where ddenominator_L is a power of ten and d^+_L = dnumerator_L/ddenominator_L. If the number of digits of precision does not exceed four, ρ ≤ 4, set the increment inc = 1. Otherwise ρ > 4 and set the increment inc = 10^(ρ−5). Go to Step 4.

Step 4 (Construct and Test a Possible Upper Bound): Set dtry = dnumerator_L + inc and calculate p = P[D_n^+ ≥ dtry/ddenominator_L]. If p > α, then a new lower bound has been found: go to Step 5. Otherwise, the initial upper bound has been found: set dnumerator_U = dtry, set ddenominator_U = ddenominator_L, set p_U = p, and go to Step 9.

Step 6 (Determine a Lower Bound): Convert d^+_U to a numerator integer dnumerator_U and a denominator integer ddenominator_U, where ddenominator_U is a power of ten and d^+_U = dnumerator_U/ddenominator_U. If the number of digits of precision does not exceed four, ρ ≤ 4, set the increment inc = 1. Otherwise ρ > 4 and set the increment inc = 10^(ρ−5). Go to Step 7.

Step 7 (Construct and Test a Possible Lower Bound): Set dtry = dnumerator_U − inc and calculate p = P[D_n^+ ≥ dtry/ddenominator_U]. If p < α, then a new upper bound has been found: go to Step 8. Otherwise, the initial lower bound has been found: set dnumerator_L = dtry, set ddenominator_L = ddenominator_U, set p_L = p, and go to Step 9.

Step 9 (Linear Search Iteration): If p > α, then a new lower bound has been found: go to Step 10. Otherwise, a new upper bound has been found: go to Step 11.

Step 10 (Linear Search New Lower Bound): Set dnumerator_L = dtry, set p_L = p, and go to Step 9.

Step 11 (Linear Search New Upper Bound): Set dnumerator_U = dtry, set p_U = p, and go to Step 9.

Step 13 (Bandwidth Found): Terminate the algorithm with the bandwidth d^+.
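A simplified, self-contained sketch of inverting the distribution (binary search over numerators with denominator 10^ρ, rather than the paper's full bracketed linear search, and Python fractions rather than Mathematica) illustrates the idea; the p value routine restates the Smirnov/Birnbaum-Tingey direct formula:

```python
# Illustrative sketch only: invert P[D_n^+ >= d] on a grid of rationals
# k/10^rho by bisection. The paper instead brackets a Maag-Dicaire initial
# value and refines it by linear (interpolation) search.
from fractions import Fraction
from math import comb, floor

def smirnov_p(n, d):
    """Exact P[D_n^+ >= d] via the Smirnov / Birnbaum-Tingey closed form."""
    d = Fraction(d)
    total = Fraction(0)
    for j in range(floor(n * (1 - d)) + 1):
        total += (Fraction(comb(n, j)) * (d + Fraction(j, n)) ** (j - 1)
                  * (1 - d - Fraction(j, n)) ** (n - j))
    return d * total

def bandwidth(n, alpha, rho):
    """Smallest d = k/10**rho with P[D_n^+ >= d] <= alpha (simplified)."""
    alpha = Fraction(alpha)
    lo, hi = 0, 10 ** rho        # invariant: P(lo/10^rho) > alpha >= P(hi/10^rho)
    while hi - lo > 1:
        mid = (lo + hi) // 2     # bisection; P is nonincreasing in d
        if smirnov_p(n, Fraction(mid, 10 ** rho)) <= alpha:
            hi = mid
        else:
            lo = mid
    return Fraction(hi, 10 ** rho)

d3 = bandwidth(10, Fraction(5, 100), 3)   # rho = 3 bandwidth for n = 10
```

Since each trial value is a rational number, every p value evaluated during the search is exact; only the stopping grid (ρ digits) limits the accuracy of the bandwidth, as described above.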
The linear search algorithm is implemented using the direct formulae: SmirnovD, SmirnovAltD, DwassD, and DwassAltD. The section number in file KS1SidedOneSampleRational.nb containing the Mathematica function that implements the linear search algorithm for each direct formula is listed in Table 6.
Tables 7 and 8 contain computational experience of the linear search algorithms for each direct formula to find the bandwidths. Since the DwassAltD linear search algorithm is faster than the other three direct formula implementations, it will be used in the remainder of the paper to find bandwidths. The Mathematica function KS1SidedOneSampleBandwidthsToFile contained in Section 18 of the KS1SidedOneSampleRational.nb file finds bandwidths using linear search with DwassAltD and writes these bandwidths to a comma delimited file for input into Excel and a text file that can be used as the input into timing programs. The text file contains bandwidths where every digit in a bandwidth is output separately so the bandwidth can be reconstructed to any desired accuracy. These text files will be used as input files to produce the computational experience in Sections 6, 8, and 9.

Computational experience comparing all thirteen formulae
This section compares the computer time needed to calculate the same p value by all thirteen formulae with the objective of determining the fastest formula. Using the program in Section 5, bandwidths are generated for sample sizes n = 1000, 2500, 5000 and for α = 0.001, 0.01, 0.1, 0.25, 0.5, 0.9. The resulting bandwidths with ρ = 20 digits of precision are put into data file KS1SidedOneSampleBandwidthsN1000to5000.dat and Excel file KS1SidedOneSampleBandwidthsN1000to5000.csv. The six values α = 0.001, 0.01, 0.1, 0.25, 0.5, 0.9 were chosen so that the entire range of α from 0 to 1 had some representation but the range of greatest p value interest, from 0.001 to 0.1, has the most representation.
To illustrate the results, Table 12 contains the ρ = 20 bandwidths for both α = 0.001 and α = 0.9. In addition, Table 12 contains all the ρ = 3 bandwidths in both decimal and rational form.

Table 11: Bandwidth d^+(n, α, ρ = 6) to six digits of precision for n = 920 to n = 2,000.

One approach is to compare the computation times needed to produce the same p value across various sample sizes. In order to do this, we need to calculate the value of the test statistic that yields a specified p value α for a sample size n. In other words, we will use the bandwidth d^+(n, α, ρ) where P[D_n^+ ≥ d^+(n, α, ρ)] ≈ α (see Section 5). For ρ = 3, the Mathematica function TimingKS1SidedOneSampleRationalDirectFormulae contained in Section 19 of the KS1SidedOneSampleRational.nb file inputs the test statistics listed in Table 12 and produces the timings in Table 13. The fastest direct formula in Table 13 was DwassAltD, followed in order by DwassD, SmirnovAltD, and SmirnovD. Since the number of terms in the SmirnovD and SmirnovAltD formulae for the same sample size n increases with increasing α, we would expect that the time would also increase with α. This is the pattern followed by the times in Table 13 with two exceptions which will be dealt with in Section 8: the time decreases for n = 1,000 going from α = 0.25 to α = 0.5 and the time also decreases for n = 2,500 going from α = 0.01 to α = 0.1.

A companion Mathematica timing function in the KS1SidedOneSampleRational.nb file inputs the test statistics listed in Table 12 and produces the timings shown in Table 14 for all direct and iterative formulae. In this timing program, the p values produced by the direct and iterative formulae for the same sample size n and test statistic d^+ are compared and an error message is written if they are not all exactly equal. For all the computational experience performed, no error message was ever generated, which is an indication that the Mathematica implementations of the
The results clearly show that the direct formulae are faster than their corresponding iterative formulae. Since the DwassAltD direct formula is the fastest of all the direct and iterative formulae, it will be used as a comparison in the computational experience of the recursion formulae.
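To make concrete what a direct formula computes, the classical Smirnov (Birnbaum-Tingey) closed form for the one-sided p value can be evaluated in exact rational arithmetic. The paper's implementations are in Mathematica; the sketch below uses Python's `fractions.Fraction` as a stand-in for Mathematica rationals, and the function name `smirnov_p` is ours, not the paper's SmirnovD code.

```python
from fractions import Fraction
from math import floor

def smirnov_p(n, d):
    """Exact one-sided K-S p value P[D+_n >= d] from the
    Smirnov / Birnbaum-Tingey direct sum, in rational arithmetic."""
    if d <= 0:
        return Fraction(1)
    if d >= 1:
        return Fraction(0)
    d = Fraction(d)
    total = Fraction(0)
    binom = 1  # C(n, 0), updated iteratively (no factorials needed)
    for j in range(floor(n * (1 - d)) + 1):
        total += binom * (d + Fraction(j, n)) ** (j - 1) \
                       * (1 - d - Fraction(j, n)) ** (n - j)
        binom = binom * (n - j) // (j + 1)  # becomes C(n, j+1)
    return d * total

# n = 2, d = 1/2: direct enumeration of the order statistics gives 1/4
print(smirnov_p(2, Fraction(1, 2)))  # -> 1/4
```

Because every intermediate quantity is a reduced ratio of integers, the returned p value is exact, which is what allows the paper's timing program to demand bit-for-bit equality across all thirteen formulae.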

Recursion formulae computational experience
For ρ = 3, the Mathematica function TimingKS1SidedRecursionFormulaeDwassAltD contained in Section 21 of the KS1SidedOneSampleRational.nb file inputs the test statistics and produces the timings for the recursion formulae and DwassAltD. Preliminary testing of the recursion formulae indicated that they are much slower than the direct and iterative formulae. Specifically, the computer times needed by the recursion formulae for the sample sizes n = 1000, 2500, 5000 used in the computational experience of the direct and iterative formulae are excessive. Consequently, the computational experience for the recursion formulae and the direct formula DwassAltD reported in Table 15 is for the sample sizes n = 100, 200, 300, 400, 500. The fastest recursion formulae are Conover and Steck, with Conover being faster for small α and Steck being faster for large α. However, both are much slower than the direct formula DwassAltD. Since each direct formula is faster than its iterative counterpart, and since each direct formula is faster than every recursion formula, the rest of the paper will concentrate on the direct formulae.

Rational number precision for the test statistic and p value
For various sample sizes n and test statistics d + chosen to give a good representation of the full range of the sampling distribution, Table 16 reports the approximate probability (p value) to four digits and the number of digits in the numerator and denominator integers of the rational number for the actual p value P [D + n ≥ d + ]. Table 16 also reports the time in seconds to compute the actual p value for the four direct formulae (SmirnovD, SmirnovAltD, DwassD, and DwassAltD). As the sample size n increases, the computational time and the number of numerator/denominator digits in the actual p value P [D + n ≥ d + ] also increase. For a fixed sample size n, the number of numerator/denominator digits in the actual p value increases as the value of P [D + n ≥ d + ] increases. However, for a fixed sample size n, the relationship between computational time and the p value varies across the formulae: the computational time for SmirnovD and SmirnovAltD increases with increasing p values, while the computational time for DwassD and DwassAltD decreases with increasing p values. An exception occurs for sample size n = 3, 000 and approximate p value 0.1008, probably because the number of numerator/denominator digits of the test statistic d + for approximate p value 0.1008 is less than that for the approximate p values 0.001 and 0.4996. If this is the reason for the exception, then for a fixed sample size n and a fixed confidence coefficient α, computation time should increase as the rational number precision ρ increases. This hypothesis is tested in the next section.

Computational times comparing rational number precisions
As stated in Section 7, the large number of numerator/denominator digits in the p values P [D + n ≥ d + ] suggests that computational time varies with the number of numerator/denominator digits in the test statistic d + . Specifically, for a fixed sample size n and a fixed confidence coefficient α, computation time should increase as the rational number precision ρ increases. Using the function TimingKS1SidedOneSampleRationalDirectFormulaeBYrnpForAlpha contained in the KS1SidedOneSampleRational.nb file, the time to compute P [D + n ≥ d + (n, α, ρ)] for sample sizes n = 1000, 2000, 3000, rational number precisions ρ = 3, 6, 9, 12, 15, and confidence coefficients α = 0.001, 0.9 is recorded in Table 17. This table shows that for all four direct formulae (SmirnovD, SmirnovAltD, DwassD, and DwassAltD) the hypothesis is correct: computation time significantly increases as the rational number precision ρ increases.
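The digit-growth mechanism behind this hypothesis is easy to exhibit. In the illustrative sketch below (ours, not the paper's code, again using `fractions.Fraction` in place of Mathematica rationals), the same sample-size power is applied to a test statistic rounded to ρ = 3 versus ρ = 12 digits; the reduced numerator and denominator of the result grow roughly in proportion to ρ, and big-integer arithmetic cost grows with digit count.

```python
from fractions import Fraction

def nd_digits(q):
    """Digit counts of the reduced numerator and denominator of a rational."""
    return len(str(q.numerator)), len(str(q.denominator))

n = 50
d3 = Fraction(123, 10**3)                  # test statistic at rho = 3 digits
d12 = Fraction(123_456_789_012, 10**12)    # same leading digits at rho = 12

# A single representative factor of the direct formulae, (1 - d)^n,
# already shows how the integers grow with the precision of d:
print(nd_digits((1 - d3) ** n))
print(nd_digits((1 - d12) ** n))
```

Every term of a direct formula involves such powers, so the full p-value computation inherits this digit growth, and with it the growth in computation time reported in Table 17.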

The fastest formula, DwassAltD
The computational experience in the previous sections shows that the direct formula DwassAltD is the fastest. The Mathematica function TimingKS1SidedOneSampleRationalDwassAltD contained in Section 23 of the KS1SidedOneSampleRational.nb file produces timings for the direct formula DwassAltD. To get a better idea of the efficiency of DwassAltD, Table 18 contains computational experience for large sample sizes from n = 10, 000 to n = 50, 000 and two rational number precisions, ρ = 3 and ρ = 6. As expected, the computational times increase as ρ increases.

Conclusions
Thirteen different formulae for calculating the one-sided one-sample K-S p value were evaluated. These consisted of seven formulae (two direct formulae, SmirnovD and DwassD, and the five recursion formulae of Daniels, Noe, Steck, Conover, and Bolshev) that were identified after an extensive literature search. In addition, two alternate direct formulae, SmirnovAltD and DwassAltD, and the iterative versions of the four direct methods (SmirnovI, DwassI, SmirnovAltI, and DwassAltI) were derived. All thirteen formulae yielded identical p values, but computational times varied considerably. The evaluation of these times for such a large number of formulae proceeded in three cascading stages. First, of the four direct formulae, DwassAltD is the fastest. Second, the times for DwassAltD are faster than the times for the four iterative formulae (SmirnovI, DwassI, SmirnovAltI, and DwassAltI). Finally, DwassAltD is faster than the recursion formulae of Daniels, Noe, Steck, Conover, and Bolshev, with Noe being the most inefficient recursion formula. For DwassAltD, the computational time to calculate the bandwidth d + (n, α, ρ) increases with increasing digits of precision ρ. Since most applications will not need critical values with more than six digits of precision, the bandwidths for ρ = 6, for representative sample sizes from n = 2 to n = 2000 and p values α = 0.2, 0.1, 0.05, 0.02, 0.01, 0.001, were tabulated. To get a better sense of the efficiency of DwassAltD, the computational times to calculate p values for large sample sizes (n = 10, 000 to 50, 000 in steps of 10, 000), α = 0.001, 0.01, 0.1, 0.25, 0.5, 0.9, and ρ = 3 and 6 were tabulated.
The following are the three computational questions posed in Section 3 along with the conclusions developed in this paper.
1. What type of computational arithmetic should be used? Section 3.1 concludes that unlike arbitrary precision and machine precision, rational arithmetic eliminates all rounding and catastrophic cancelation errors so rational arithmetic in Mathematica was used exclusively in this paper.
2. What arithmetic form should be used to input the test statistic d + to the thirteen formulae? When using rational arithmetic to compute the p values, Section 3.2 concludes that the test statistic d + must be input as a rational number; otherwise, the correct p value will not be produced.
3. What is the fastest way to calculate the binomial coefficient in the direct and recursion formulae: (1) the Rational Number Binomial Function (RNBF) method or (2) the Rational Number Iterative Calculation (RNIC) method? Section 3.3 concludes that the RNIC method is slightly faster and more portable than the RNBF method, so this paper used the RNIC method.
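Two of these conclusions can be demonstrated outside Mathematica. The Python sketch below (our illustration, with `fractions.Fraction` standing in for Mathematica rationals) shows (a) why the test statistic must enter as a rational rather than a machine float, and (b) an iterative binomial-coefficient calculation in the spirit of the RNIC method; `binom_iter` is our name, not the paper's function.

```python
from fractions import Fraction

# (a) A float literal is converted to the exact binary rational it
# actually stores, which is NOT the decimal rational the user intended,
# so any p value computed from it answers the wrong question:
assert Fraction(0.1) != Fraction(1, 10)
den = Fraction(0.1).denominator
assert den & (den - 1) == 0  # denominator is a power of two, not of ten

# (b) An iterative binomial coefficient, avoiding large factorials:
# C(n, k) = C(n, k-1) * (n - k + 1) / k, every intermediate an exact integer.
def binom_iter(n, j):
    c = 1
    for k in range(1, j + 1):
        c = c * (n - k + 1) // k  # this division is always exact
    return c

print(binom_iter(52, 5))  # -> 2598960
```

The iterative product form never materializes n!, which is the kind of saving the RNIC method exploits relative to a built-in binomial function.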
The following are the four research questions posed in Section 3 along with the conclusions developed in this paper.
1. What is the best (fastest) way to calculate a bandwidth: (1) binary search or (2) linear search? Preliminary computational experience in Section 5 compared the binary search algorithm to the linear search algorithm and concluded that the linear search algorithm is faster.
2. What is the relationship between computation time and sample size n? All the computational experience in Section 6 showed that computational time increases with increasing sample size n. This increase in computational time reflects the fact that the number of terms in every one of the thirteen formulae increases with increasing sample size.
3. Using rational arithmetic, what is the fastest formula? Section 6 compares the computational times for all thirteen formulae and concludes that DwassAltD is the fastest. Section 9 gives the computational times for DwassAltD for sample sizes n up to n = 50, 000, from which we conclude that sample size n = 50, 000 is a practical upper limit for calculating p values using rational arithmetic.
4. How does the computation time change as the accuracy of the test statistic d + increases? Section 8 showed that computational time increases as the number of numerator/denominator digits increases in the rational arithmetic representation of the test statistic d + .
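The linear-search inverse referred to in question 1 can be sketched end to end. The code below is our hedged reconstruction, not the paper's program: it assumes the bandwidth d + (n, α, ρ) is the smallest ρ-digit rational d with P [D + n ≥ d] ≤ α, steps linearly in increments of 10^-ρ, and uses the Smirnov (Birnbaum-Tingey) direct sum as the exact p-value routine.

```python
from fractions import Fraction
from math import floor

def p_value(n, d):
    """Smirnov / Birnbaum-Tingey direct sum for P[D+_n >= d], exact."""
    if d <= 0:
        return Fraction(1)
    if d >= 1:
        return Fraction(0)
    total, binom = Fraction(0), 1
    for j in range(floor(n * (1 - d)) + 1):
        total += binom * (d + Fraction(j, n)) ** (j - 1) \
                       * (1 - d - Fraction(j, n)) ** (n - j)
        binom = binom * (n - j) // (j + 1)
    return d * total

def bandwidth(n, alpha, rho):
    """Linear search: smallest d = k / 10^rho with P[D+_n >= d] <= alpha."""
    step = Fraction(1, 10**rho)
    d = step
    while p_value(n, d) > alpha:
        d += step
    return d

# n = 2: P[D+_2 >= d] = 1 - d - d^2 for 0 < d <= 1/2, so at alpha = 1/4
# the rho = 1 bandwidth is exactly 1/2
print(bandwidth(2, Fraction(1, 4), 1))  # -> 1/2
```

Because p_value is monotonically decreasing in d, a linear scan from below finds the first crossing; the comparison p_value(n, d) > alpha is exact since both sides are rationals, so no rounding can misplace the bandwidth.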
Areas of future research
Dvoretzky, Kiefer, and Wolfowitz (1956) gave an inequality showing that the two-sided one-sample K-S probability can be approximated by doubling the one-sided one-sample K-S probability. Owen (1962) used this approximation to tabulate two-sided critical values. Future work on the computation of the one-sided one-sample K-S test should investigate other arithmetics that may prove more efficient and/or more portable than rational arithmetic. Specifically, arbitrary precision arithmetic implementations, where the internal precision ip is selected so that the final p value will have a resulting precision rp greater than or equal to the user-specified desired precision dp (rp ≥ dp), may produce faster computational times. Once K-S p values can be calculated accurately by arbitrary precision arithmetic, the error in machine precision implementations can be studied. Since the computer statistical software packages listed in Table 1 use machine precision arithmetic to generate p values, it is important to devise machine precision techniques to calculate K-S p values whose accuracy is known. Another area of future research is to evaluate the error in calculating the one-sided one-sample K-S p values using the limiting distribution and other approximations.
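The arbitrary-precision idea can be illustrated with Python's decimal module (our illustration; the paper leaves the choice of implementation open): set an internal precision ip comfortably above the desired precision dp, then check the rounded result against the exact rational value.

```python
from decimal import Decimal, getcontext
from fractions import Fraction

# Exact rational value of one representative factor of the direct
# formulae, (1 - d)^n, for d = 0.123 and n = 50:
exact = (1 - Fraction(123, 1000)) ** 50

# Arbitrary precision stand-in: internal precision ip = 25 digits,
# chosen above a desired precision of, say, dp = 10 digits.
getcontext().prec = 25
approx = (1 - Decimal("0.123")) ** 50  # rounded to ip significant digits

print(approx)
```

The exact rational carries hundreds of digits, while the arbitrary-precision value carries only ip of them; the open research question is how large ip must be, across the whole formula, to guarantee the delivered digits are correct to dp.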
Finally, it is always possible that clever implementations of some of the formulae may change their relative efficiency and DwassAltD may not be the fastest. For example, arbitrary precision implementations may considerably change the relative efficiencies of the thirteen formulae.