DixonTest.CriticalValues: A Computer Code to Calculate Critical Values for the Dixon Statistical Data Treatment Approach

The computer program DixonTest.CriticalValues is written in VB.NET to extend the quadrature approach for calculating the critical values of Dixon's ratios with an accuracy of up to 6 significant digits. Its use in creating critical value tables in Excel is illustrated.


Introduction
Statistical tests are used for the identification and rejection of extreme values, commonly called outliers, in a dataset, and their application to analytical data treatment is common in all branches of science (Lohginger 1999; Verma, Quiroz-Ruiz, and Diaz-Gonzalez 2008). Komsta (2006) developed a software package, outliers, for analytical data processing using various statistical tests in the programming language R (R Core Team 2013). Dixon's ratio test, or Q-test, is one of the methods for identifying extreme values in a sample taken from a normal distribution (Dixon 1950, 1951; Dean and Dixon 1951; Rorabacher 1991). The implementation of this approach requires the calculation of critical values. Dixon (1951) created critical value tables for samples of up to 30 data points, with uncertainty in the third digit. Komsta (2006) used Dixon's critical values. Several researchers have calculated the critical values for a larger number of data points and with higher accuracy (Lohginger 1999; Miller and Miller 2005; McBane 2006; Verma et al. 2008). Two types of numerical methods are used to evaluate the underlying integral equation: Monte Carlo simulation (Efstathiou 1992; Verma et al. 2008) and quadrature (Dixon 1950, 1951; McBane 2006).
In this article the quadrature approach is extended by programming the generating functions for the Hermite (Riley, Hobson, and Bence 2006; Weisstein 2011), Legendre (Riley et al. 2006; Weisstein 2011) and half-range Hermite (Steen, Byrne, and Geldard 1969; Gautschi 1994; Ball 2003) polynomials, which are used to evaluate definite integrals by numerical quadrature. The program, DixonTest.CriticalValues, is written as an object library (DLL; dynamic link library) in VB.NET, which permits its use in any other programming environment on any platform where the .NET framework and its virtual environment, the Common Language Runtime (CLR), are installed (Venkat 2012). The CLR contains a component called the just-in-time (JIT) compiler, which converts the intermediate language into native code for the underlying operating system. A demonstration program, DixonDemo, is written in VB.NET. The creation of critical value tables in Excel is also illustrated.

Theoretical aspects
For the sake of completeness, the procedure of Dixon (1951) and McBane (2006) is summarized here. The joint probability density function for a set of n random observations ($x_1, x_2, \ldots, x_n$) from a normal distribution with mean $\mu$ and standard deviation $\sigma$ is given by Equation 1. For the ordered set of observations ($x_1 \le x_2 \le \cdots \le x_n$), Dixon (1951) defined the ratio $r_{j,i-1} = (x_n - x_{n-j})/(x_n - x_i)$, where the subscripts i and j indicate the number of suspected outliers at the lower and upper ends of the data set, respectively. The ratios are independent of $\mu$ and $\sigma$ of the distribution; therefore, the joint density function is equally valid for the standard normal distribution (i.e., $\mu = 0$ and $\sigma = 1$).
Dixon's ratios are functions of three of the n data values ($x_i$, $x_{n-j}$ and $x_n$), and the corresponding joint probability density function is obtained by integrating Equation 1 over all points except these three. The variables are then changed from ($x_i$, $x_{n-j}$, $x_n$) to ($t$, $u$, $r$) through the transformations $t^2 = (1 + r^2)v^2/2$ and $u^2 = 3x^2/2$, where $x = x_n$, $v = x_n - x_i$ and $r = (x_n - x_{n-j})/v$, and the joint density function is integrated over $x$ and $v$ ($-\infty < x < \infty$, $0 \le v < \infty$) to obtain a function of $r$ alone (Equation 2). For simplicity, $J(x, r, v)$ is defined by Equation 3. Substituting the values of the $\phi$'s and Equation 3 into Equation 2 gives Equation 4, where $N = \frac{n!}{(i-1)!\,(n-j-i-1)!\,(j-1)!}\,(2\pi)^{-3/2}$. After further simplification, Equation 4 is converted to an integral form suitable for Gauss-Hermite quadrature (Equation 5), where $x(u) = u\sqrt{2/3}$ and $v(t, r) = t\sqrt{2/(1 + r^2)}$.
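The chain of steps described above can be sketched explicitly. The following is a reconstruction from the definitions given in the text (the constant $N$, the variables $x$, $v$ and $r$, and the substitutions for $t$ and $u$), offered as a consistency check rather than a verbatim statement of Equations 2 to 5. The symbol $\Phi$, denoting the standard normal cumulative distribution function, is introduced here as an assumption, and $x_i = x - v$, $x_{n-j} = x - rv$ follow from the definitions of $v$ and $r$.

```latex
\begin{align*}
P(r) &= N \int_{-\infty}^{\infty} dx \int_{0}^{\infty} dv \; v \, J(x, r, v)\,
        \exp\!\left\{-\tfrac{1}{2}\left[(x - v)^2 + (x - rv)^2 + x^2\right]\right\}, \\
J(x, r, v) &= \left[\Phi(x - v)\right]^{i-1}
              \left[\Phi(x - rv) - \Phi(x - v)\right]^{n-j-i-1}
              \left[\Phi(x) - \Phi(x - rv)\right]^{j-1}, \\
\tfrac{1}{2}\left[(x - v)^2 + (x - rv)^2 + x^2\right]
     &= \tfrac{3}{2}x^2 + \tfrac{1}{2}(1 + r^2)v^2 - (1 + r)\,x v
      = u^2 + t^2 - (1 + r)\,x v, \\
P(r) &= \frac{2N}{\sqrt{3(1 + r^2)}}
        \int_{-\infty}^{\infty} du \, e^{-u^2} \int_{0}^{\infty} dt \, e^{-t^2} \;
        v \, J(x, r, v) \, e^{(1 + r)\,x v},
        \qquad x = u\sqrt{2/3}, \quad v = t\sqrt{2/(1 + r^2)}.
\end{align*}
```

The factor $v$ in the integrand is the Jacobian of the change from ($x_i$, $x_{n-j}$, $x_n$) to ($x$, $v$, $r$), and the prefactor $2/\sqrt{3(1 + r^2)}$ arises from $dx\,dv = \sqrt{2/3}\,\sqrt{2/(1 + r^2)}\;du\,dt$. The weight functions $e^{-u^2}$ on the full line and $e^{-t^2}$ on the half line are what make the Hermite and half-range Hermite rules applicable.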
The numerical integration over u, with limits −∞ to +∞, is performed with Gauss-Hermite quadrature. The integration limits for t are 0 to +∞, so this integration may be performed with the half-range Hermite polynomials (Steen et al. 1969; Gautschi 1994; Ball 2003). Different authors have generated the half-range Hermite polynomials in different ways using the recurrence formula approach. Instability in this approach creates fluctuations beyond a certain degree of the polynomials (Steen et al. 1969), which was overcome by the brute-force use of arbitrary precision arithmetic (Gautschi 1994; Ball 2003). Ball (2003) presented a new approach that starts from asymptotic initial Gauss values and refines them iteratively from the higher to the lower side (backward), and obtained abscissas and weights of the half-range Hermite quadrature identical to those of Gautschi (1994). In this work the approaches of Steen et al. (1969) and Ball (2003) for generating the half-range Hermite polynomials are both implemented in order to compare them in the calculation of the total probability density and the critical values.
According to the quadrature integration approach, the approximate solution of Equation 5 for the joint probability density $P(r)$ is expressed by Equation 6, where $u_k$ is the kth abscissa of the $n_H$-degree Hermite polynomial (written in short as n-Hermite) and $w_{n_H}(u_k)$ is the corresponding weight. Similarly, $t_l$ and $w_{n_{HH}}(t_l)$ are the corresponding parameters for the $n_{HH}$-degree half-range Hermite polynomial, either that of Gautschi (1994) and Ball (2003), denoted half Hermite Ball, or that of Steen et al. (1969), denoted half Hermite SBG. The variable r ranges from 0 to 1. Riley et al. (2006) and Weisstein (2011) explained the estimation of error in numerical quadrature; however, the error calculation must be performed at each step of the program (Kumar 2011). Here we implement a simple procedure to estimate the accuracy (precision) of the critical values for Dixon's ratios by calculating the accuracy of the total probability density (or cumulative density function, CDF).
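To make the role of the abscissas $u_k$ and weights $w(u_k)$ concrete, the following Python sketch generates Gauss-Hermite nodes by Newton iteration on the orthonormal Hermite recurrence (a scheme described in Press et al. 1992) and applies the weighted sum to a test integrand. It is not the VB.NET code of DixonTest.CriticalValues; the function names `gauss_hermite` and `hermite_integral` are hypothetical, and the half-range Hermite rule needed for t is not covered.

```python
import math


def gauss_hermite(n, eps=3e-14, max_iter=50):
    """Return abscissas and weights for the weight function exp(-u**2)."""
    x = [0.0] * n
    w = [0.0] * n
    z = 0.0
    for i in range((n + 1) // 2):
        # Empirical initial guesses for the roots, largest root first.
        if i == 0:
            z = math.sqrt(2 * n + 1) - 1.85575 * (2 * n + 1) ** (-1.0 / 6.0)
        elif i == 1:
            z -= 1.14 * n ** 0.426 / z
        elif i == 2:
            z = 1.86 * z - 0.86 * x[0]
        elif i == 3:
            z = 1.91 * z - 0.91 * x[1]
        else:
            z = 2.0 * z - x[i - 2]
        for _ in range(max_iter):
            # Orthonormal Hermite recurrence evaluated at z.
            p1, p2 = math.pi ** -0.25, 0.0
            for j in range(1, n + 1):
                p1, p2 = (z * math.sqrt(2.0 / j) * p1
                          - math.sqrt((j - 1) / j) * p2), p1
            pp = math.sqrt(2.0 * n) * p2  # derivative of the degree-n polynomial
            dz = p1 / pp
            z -= dz  # Newton step
            if abs(dz) <= eps:
                break
        x[i], x[n - 1 - i] = z, -z  # roots are symmetric about zero
        w[i] = w[n - 1 - i] = 2.0 / (pp * pp)
    return x, w


def hermite_integral(f, n):
    """Approximate the integral of exp(-u**2) * f(u) over (-inf, +inf)."""
    nodes, weights = gauss_hermite(n)
    return sum(wk * f(uk) for uk, wk in zip(nodes, weights))


if __name__ == "__main__":
    # The integral of exp(-u**2) * u**2 is sqrt(pi) / 2; print both for comparison.
    print(hermite_integral(lambda u: u * u, 10), math.sqrt(math.pi) / 2)
```

In the double sum for $P(r)$, an outer loop of this kind over $u_k$ would be combined with an inner loop over the half-range abscissas $t_l$, each term being multiplied by both weights.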
The critical values for the Dixon ratios are calculated by integrating Equation 6 from r = 0 to R, and the integral is evaluated by numerical quadrature using the Legendre polynomials (McBane 2006). The generating functions for the Legendre polynomials are taken from Riley et al. (2006) and Weisstein (2011). The numerical solution is performed here in two steps: (i) integrating for the total probability density (i.e., changing the limits from r = 0 to 1 to z = −1 to +1) and (ii) integrating for the cumulative probability density up to R (i.e., changing the integration limits from r = 0 to R to z = −1 to +1). For a given value of CDF, the critical value R is calculated by the bisection method (Press, Teukolsky, Vetterling, and Flannery 1992) using the initial brackets CDF = 0 at R = 0 and CDF = 1 at R = 1.
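The two-step procedure can be illustrated with a short Python sketch. It is a toy under stated assumptions, not the DixonTest.CriticalValues code: the density `toy_density` (p(r) = 2r on [0, 1], whose exact CDF is R²) stands in for the quadrature approximation of P(r), and the names `gauss_legendre`, `cdf` and `critical_value` are hypothetical. The Legendre nodes are found by Newton iteration, the interval [0, R] is mapped to z in [−1, +1], and R is located by bisection between R = 0 and R = 1.

```python
import math


def gauss_legendre(n, eps=1e-14, max_iter=100):
    """Return Gauss-Legendre abscissas and weights on [-1, +1]."""
    z_nodes, weights = [0.0] * n, [0.0] * n
    for i in range((n + 1) // 2):
        z = math.cos(math.pi * (i + 0.75) / (n + 0.5))  # initial guess
        for _ in range(max_iter):
            p1, p2 = 1.0, 0.0
            for j in range(1, n + 1):  # three-term recurrence for P_j(z)
                p1, p2 = ((2 * j - 1) * z * p1 - (j - 1) * p2) / j, p1
            dp = n * (z * p1 - p2) / (z * z - 1.0)  # derivative of P_n at z
            dz = p1 / dp
            z -= dz  # Newton step
            if abs(dz) <= eps:
                break
        z_nodes[i], z_nodes[n - 1 - i] = -z, z
        weights[i] = weights[n - 1 - i] = 2.0 / ((1.0 - z * z) * dp * dp)
    return z_nodes, weights


def toy_density(r):
    """Stand-in for P(r); its exact CDF is R**2 (illustrative assumption)."""
    return 2.0 * r


def cdf(density, upper, nodes, weights):
    """Integrate density from 0 to upper after mapping r to z in [-1, +1]."""
    half = 0.5 * upper  # r = half * (z + 1), dr = half * dz
    return half * sum(w * density(half * (z + 1.0))
                      for z, w in zip(nodes, weights))


def critical_value(density, target, n_legendre=12, tol=1e-10):
    """Bisect on R in [0, 1] until the CDF at R matches target."""
    nodes, weights = gauss_legendre(n_legendre)
    total = cdf(density, 1.0, nodes, weights)  # step (i): total probability
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(density, mid, nodes, weights) < target:  # step (ii)
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi), total


if __name__ == "__main__":
    r_crit, total = critical_value(toy_density, 0.9)
    print(r_crit, math.sqrt(0.9), total)
```

Returning the total probability alongside R mirrors the use of the total probability as an accuracy check: a total that deviates from 1 signals that the digits of R beyond that deviation should not be trusted.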
In this work the FORTRAN program of McBane (2006) is rewritten in VB.NET. The programming of the special functions (Hermite, Legendre and half-range Hermite) is included, which makes the program versatile for comparative studies. The present program, DixonTest.CriticalValues, provides an accuracy of up to 6 decimal places in the critical values and can be used in Microsoft Windows-based software including Excel.

Critical values calculation
The present computer program, DixonTest.CriticalValues, is written as an object library (DLL) in VB.NET. A namespace, DixonTest, is created, which will be used in future to include all the programs associated with the Dixon statistical test methods. Presently, it contains one class, 'CriticalValues'. The properties of the class are given in Table 1. A class encapsulates data and methods, and serves as a blueprint for creating objects. The functionality of a class may be extended without knowledge of its code. The use of the class in Excel is presented later.

DixonDemo: A demonstration program
To illustrate the calculation of critical values, a demonstration program, DixonDemo, is written in VB.NET. Figure 1 shows the graphical user interface of the program. The user provides the values of nDixon, iDixon, jDixon, Alpha, n-Hermite, n-HHermite, half Hermite (Ball), n-Legendre and accuracy (RAccuracy). On pressing the Calc button, the total probability, the critical value, and the corresponding cumulative probability are calculated as shown in Figure 1. The accuracy indicates the number of accurate digits after the decimal point in the critical value. One should check the number of accurate digits in the total probability in order to assign the value of RAccuracy. The program runs for any value of RAccuracy, but the accuracy of the critical value is one digit less than the number of accurate digits in the total probability. If the half Hermite (Ball) check-box is selected, the program uses the half-range Hermite polynomials of Ball (2003); otherwise it uses those of Steen et al. (1969). The upper limit for the degree of n-HHermite SBG is 15; there is some instability in the polynomials beyond this degree (Steen et al. 1969). Figure 1 shows the calculation for nDixon = 7, iDixon = 2, jDixon = 1, Alpha = 0.1 (i.e., for the ratio $r_{11}$ or $r_{1,1}$ with 90% confidence interval) using the half Hermite SBG. There are seven accurate digits in the total probability density, so the critical value is 0.532944 (i.e., accurate to at least six decimal places). However, the total probability is 0.9996055 when calculated using the half Hermite Ball polynomial while keeping the same values for the other parameters. Thus the accuracy reduces to 3 digits instead of 7 in the total probability and consequently to 2 digits in the critical values.

Table 1: Properties of the class 'CriticalValues'.
Property | Access | Description
Alpha | Read-write | A normalized value to represent the confidence interval (e.g., for 95% confidence interval, Alpha = 0.05).
RAccuracy | Write only | An integer value which denotes the number of accurate digits after the decimal point in the calculated value of rCritical.
TotalProbility | Read only | Total probability for a given value of nDixon, iDixon, and jDixon.
rCritical | Read only | Critical value for a given value of nDixon, iDixon, and jDixon.
rProbabilty | Read only | Cumulative probability for a given value of nDixon, iDixon, jDixon and Alpha.

Figure 2 presents a comparison of the total probability calculated using the half Hermite Ball and half Hermite SBG polynomials. The values of the other parameters were kept the same in both cases. The degree of the half Hermite polynomial is plotted on the x-axis and the total probability on the y-axis; to amplify the probability scale, the total probability was multiplied by $1 \times 10^6$ and $9 \times 10^5$ was then subtracted from it. The values of the common parameters are nDixon = 7, iDixon = 2, jDixon = 1, Alpha = 0.0, n-Legendre = 12. The value of n-half Hermite for SBG goes only up to 15, as the polynomial is unstable for higher values. The value of n-Hermite is set to 5, 10, 15 and 20 in the respective panels. For n-Hermite = 5 and 10, the total probability fluctuates similarly for both half Hermite Ball and half Hermite SBG, whereas for half Hermite SBG the behavior becomes more stable at n-Hermite = 15 and is stable at n-Hermite = 20. The half Hermite Ball, however, shows the same fluctuating behavior even for n-Hermite = 15 and 20. Thus the half Hermite polynomial of Steen et al. (1969) provides better results for these calculations when the value of nDixon is less than 10. Adequate accuracy in the critical value is obtained at minimum expense with n-half Hermite = 10 for the SBG polynomials.

Table 2: Comparison of critical values for the ratio $r_{10}$ for Alpha = 0.05 (i.e., confidence interval = 95%). For nDixon > 50, there is accuracy up to at least 4 digits.
Figure 3 presents the behavior of the total probability for different values of nDixon while keeping the other parameters the same (i.e., iDixon = 2, jDixon = 1, n-Hermite = 30, n-Legendre = 12). For nDixon = 12, the behavior of the total probability is the same and stable for both types of n-half Hermite polynomials; however, on further increasing nDixon, fluctuations appear in the first part of the curve. The highest degree of half Hermite SBG is 15, so the total probability and critical value calculations have higher uncertainty. In the case of nDixon = 100, the behavior is stable for degrees of n-half Hermite Ball ≥ 45. Thus the half Hermite Ball provides better results for nDixon > 12.
Table 2 presents the comparison of critical values for the ratio $r_{10}$. The total probability to 8 decimal digits is also shown in the table. One has to take into account that, in the quadrature approach, the accuracy decreases with increasing total number of data points and is at least 4 significant digits for nDixon = 100. There is good agreement up to 4 decimal places between the values calculated using the Monte Carlo simulation (Verma et al. 2008) and the quadrature method (this work). However, the execution time for the calculation of critical values by quadrature is of the order of seconds, whereas the Monte Carlo simulation requires months of execution time.
The accuracy required in the critical values for analytical data treatment depends on two aspects: the total number of data points and the precision of the confidence interval. Let us consider an example for the Dixon ratio $r_{10}$: nDixon = 30, iDixon = 1, jDixon = 1, Alpha = 0.05 (i.e., 95% confidence interval). The values of n-Hermite, n-half Hermite, HermiteBall, and n-Legendre are 30, 15, False and 10, respectively, and let the permitted uncertainty in the confidence interval be ±0.5% (i.e., confidence interval is between 94.5% and 95.5%

Using DixonTest.CriticalValues in Excel
The library (DixonTest.dll) can be used from any programming language in the Windows environment. Its use in Excel to create critical value tables is explained here, since most geochemists use Excel for geochemical data management and calculations.
The installation and uninstallation of the library, DixonTest.dll, are explained in the file ReadMe.PDF, which is available as supplementary material. Verma (2003) explained the procedure for writing a function in the personal workbook of Excel. A macro stored in this location is available to all workbooks. To use the class 'CriticalValues' in the library DixonTest, the following steps are needed:
1. Add reference to library: press the "Developer" tab on the Excel ribbon and then press the "Visual Basic" button. This opens the Visual Basic environment. Now, select the menu Tools→References and set the reference to the type library, DixonTest.tlb, which in turn references the library DixonTest.dll.

2. Creating functions: write the following code in a module in the personal workbook (PERSONAL.XLSB). There are two functions, TotalProbability and RCriticalValues.
Public Function RCriticalValues(nDixon As Integer, iDixon As Integer, _
    jDixon As Integer, alpha As Double, nH As Integer, _
    HH As Boolean, nHH As Integer, nL As Integer, _
    RAcc As Integer)
Save the personal workbook. Now, the functions can be used in any workbook.
Similarly, the values of nDixon can be changed as needed. Once the table is calculated, it has to be copied and saved in another worksheet with "save as value type". The Excel workbook is also available as supplementary material. The calculation of the total probability, which indicates the number of accurate digits in the critical values, is shown in the last column.

Conclusions
The present computer program DixonTest.CriticalValues is fast, flexible and efficient for the calculation of critical values for the Dixon statistical tests, with an accuracy of up to 6 significant digits. The accuracy of the critical values is estimated through the calculation of the total probability density and is taken as one digit less than the number of accurate digits in the total probability. An accuracy of three digits in the critical values is sufficient for the statistical treatment of an analytical data set of up to 100 points. The program can easily be used from other environments such as Excel.
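The one-digit-less rule for the accuracy of the critical values can be illustrated with a small Python sketch. It is one possible reading of the rule, not code from DixonTest.CriticalValues: the function names are hypothetical, and counting accurate digits through the base-10 logarithm of the deviation of the total probability from 1 is an assumption.

```python
import math


def accurate_digits(total_probability):
    """Count the decimal digits in which total_probability agrees with 1."""
    deviation = abs(1.0 - total_probability)
    if deviation == 0.0:
        return 15  # assumed cap, roughly the limit of double precision
    return max(0, math.floor(-math.log10(deviation)))


def suggested_raccuracy(total_probability):
    """Apply the one-digit-less rule to obtain a value for RAccuracy."""
    return max(0, accurate_digits(total_probability) - 1)


if __name__ == "__main__":
    # Total probability reported in the text for the half Hermite Ball case.
    print(accurate_digits(0.9996055))      # prints 3
    print(suggested_raccuracy(0.9996055))  # prints 2
```

For the half Hermite Ball example quoted earlier (total probability 0.9996055), this reading reproduces the stated 3 accurate digits in the total probability and 2 in the critical value.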