QXLA : Adding Upper Quantiles for the Studentized Range to Excel for Multiple Comparison Procedures

Microsoft Excel offers some basic statistical functionality; however, it lacks distribution functions for the studentized range (Q). The Excel add-in developed here introduces two new user-defined functions, QDISTG and QINVG, based on the studentized range Q-distribution, that expand the functionality of Excel for statistical analysis. A workbook example demonstrating the Tukey, S-N-K, and REGWQ tests has also been included. Compared with other available options, the method is fast with low error rates.


Introduction
Researchers in most fields of science are often faced with comparing means obtained with various treatments. The statistical analysis using multiple comparison procedures (MCPs) is often carried out by statistical software packages that are less suitable for data storage and manipulation. While Microsoft Excel offers excellent data storage and manipulation, it provides few built-in methods for statistical data analysis. Excel's Analysis ToolPak add-in provides basic analysis of variance (single-factor and two-factor) as well as t-tests for two sample means. The availability of Excel functions such as FDIST, FINV, TDIST, and TINV for the F- and t-distributions allows simple multiple comparison procedures, such as Fisher's least significant difference test, to be carried out. However, the lack of Excel support for the studentized range (Q) distribution prevents the use of the Tukey honest significant difference, Student-Newman-Keuls (S-N-K), and Ryan-Einot-Gabriel-Welsch Q (REGWQ) tests.

Brief overview of multiple comparison procedures
Comparisons of means of groups receiving different treatments often begin with a simple analysis of variance (ANOVA) followed by post-hoc analysis or an MCP to determine which means are statistically different. The methods by which these tests are performed differ depending on factors such as planned versus unplanned comparisons, comparisons between groups of the same or different sizes, comparisons of groups with or without equal variances, and single-step versus stepwise comparisons. However, a common denominator is the use of statistical distributions such as the F-, t-, or Q-distributions. A comprehensive review of different methods was performed by Day and Quinn (1989), who described, provided equations for, and evaluated many of the commonly used methods. It should be noted that Day and Quinn recommended MCP methods that use the Q-distribution, such as the Tukey, Ryan's Q (here expanded to REGWQ), and Games-Howell tests. As formulas for this distribution are not available in Excel, an add-in was constructed to provide this feature. As part of this manuscript, we also constructed an Excel workbook to demonstrate the use of the add-in with the Tukey, S-N-K, and REGWQ tests. Within this manuscript, we use the term α to represent the upper percentile of the studentized range distribution; α is typically in the range of 0.1 to 0.01 in comparisons. We use r to denote the total number of groups compared and v (or df in the VBA code) to denote the degrees of freedom within groups (available in the standard single-factor ANOVA table). It should be noted that many statistical tables of studentized range quantiles use one minus α (often denoted by p) to identify a table. However, as the upper percentile (α) is used in several other Excel functions (e.g., FINV and TINV), α was chosen as the parameter for the user-defined formulas.

Methods for obtaining Q values and probabilities
As with many statistical properties related to distributions, statistical tables based on studentized ranges have long been part of statistical textbooks (Walpole and Myers 1989; Snedecor and Cochran 1967), but they tend to be limited to a few probabilities (e.g., p = 0.95, α = 0.05). For an extensive set of Q tables, the reader is directed to Harter (1960). Lund and Lund (1983) developed a numerical integration algorithm (AS 190) that could be used to estimate Q values for α values of 0.10-0.01 and also included a rough algebraic estimation algorithm (AS 190.2) for α values of 0.20-0.05. Copenhaver and Holland (1988) developed an algorithm in Fortran using Gauss-Legendre quadrature that was later implemented in Pascal by Ferreira, Demetrio, Manly, and Machado (2007). It is the same algorithm used by the R environment for statistical computing and graphics (R Core Team 2018; the C source code is freely available) and by other statistical software packages. It is likely also the algorithm used by the freely available Excel add-in RealStats-2007 (Zaiontz 2016). This last algorithm is computationally intensive.
The method described within this manuscript takes a different approach. It calculates Q values using a method proposed by Gleason (1998; 1999) that builds on relationships between Student t quantiles and studentized range quantiles. Using this method, Gleason (1999) transformed traditional studentized range tables (Q tables) into a new set of tables, one for each probability. Each table lists four constants (a1, a2, a3, a4) for each degree of freedom (v). These constants are used in a fourth-order polynomial in the parameter r to calculate the value of Q(α, r, v). Eight such tables (corresponding to the eight probabilities 0.50, 0.75, 0.90, 0.95, 0.975, 0.99, 0.995, and 0.999) were created by Gleason (see example in Figure 1) and, while not all possible values of v and α were included, details were provided for interpolation mechanisms (Gleason 1999). Even accurate extrapolation was deemed possible, at least to α = 0.0001 and r = 200, and the approach was included in the statistical software program Stata (Gleason 1998). The method by Gleason has also been implemented in Python with an expanded set of polynomial constants (Lew 2011).

Figure 1: Table of fourth-order polynomial constants by Gleason (1999) for calculation of Q values. The example shown is for p = 0.95, α = 0.05. Entries for some values of v have been omitted for brevity. The last value of v is listed as 100,000,000; in a typical textbook table it is listed with an infinity symbol.

Briefly, the method used by Gleason and in this add-in to calculate Q values is:

1. If r = 2, set y = 1 and go to step 12.
2. Select two data lines from each of the eight Gleason tables, corresponding to the tabled v values just above and below the desired v (v_high and v_low). Each line carries four a constants.
3. If r = 3, calculate the small adjustment term given by Gleason (1999), to be applied in the next step.
4. Calculate y²_low and y²_high for each of the eight α values by evaluating the fourth-order polynomial in LOG(r − 1) with the appropriate four constants (a1, a2, a3, a4), including any adjustment due to r = 3. Note that LOG is a VBA function for the natural logarithm.
5. Interpolate y² from y²_low and y²_high for the desired v and take the square root to find y. This will yield eight y-values, one for each α.
6. Calculate x-values as x = NORMINV(1 − α, 0, 1). Note that NORMINV is an Excel function corresponding to the Z_p function by Gleason (1999). This will yield eight x-values, one for each α.
7. Calculate z-values as z = LOG(y + r/v). This will yield eight z-values, one for each α.
8. Assume that the variable z depends on x through a 2nd-order polynomial, z = m1 + m2·x + m3·x², or a 4th-order polynomial for higher accuracy if 0.001 < α < 0.5. Gleason (1999) used only the 2nd order.
9. Find the values of the constants (m1, m2, etc.) in the polynomial through matrix algebra.
10. Calculate the x-value for the desired α (see step 6) and use the polynomial (see step 8) to find its z-value.
11. From the z-value, calculate the y-value using the relationship in step 7.
12. Calculate the final Q value as Q = y·SQR(2)·TINV(α, v). Note that SQR is a VBA function (square root) and TINV is an Excel function.
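Steps 8 and 9 amount to an ordinary least-squares fit of z on x solved through the normal equations. The following is a minimal pure-Python sketch of that matrix-algebra step; it is an illustration, not the add-in's VBA code, and the sample (x, z) pairs in the test below are made up.

```python
def fit_quadratic(xs, zs):
    """Least-squares fit of z = m1 + m2*x + m3*x**2 (steps 8-9):
    build the 3x3 normal equations and solve them by Gaussian
    elimination with partial pivoting."""
    # Power sums for the normal-equation matrix A and right-hand side b.
    s = [sum(x ** k for x in xs) for k in range(5)]            # sum of x^0 .. x^4
    t = [sum(z * x ** k for x, z in zip(xs, zs)) for k in range(3)]
    A = [[s[0], s[1], s[2]],
         [s[1], s[2], s[3]],
         [s[2], s[3], s[4]]]
    b = t[:]
    # Forward elimination.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    m = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        m[r] = (b[r] - sum(A[r][c] * m[c] for c in range(r + 1, 3))) / A[r][r]
    return m  # [m1, m2, m3]
```

In the add-in, the eight (x, z) pairs produced by steps 6 and 7 would be the input; a 4th-order fit works the same way with a 5x5 system.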
The method of obtaining the probability (or α value) for a specific Q value with given r and v was not given by Gleason (1998; 1999). For this manuscript, an iterative process was developed based on a false-position bracketing interpolation technique chosen for fast, guaranteed convergence (Chapra and Canale 1985). Also, as Gleason reported that his method could be used to extrapolate to α values lower than 0.001, the calculations were extrapolated to include α = 0.0001, 0.00001, and 0.000001. This allowed for an expanded range when seeking probabilities for a given Q. The following procedure was used:

1. If r = 2, set α = TDIST(QVal/SQR(2), v, 2) and stop. Note that TDIST is an Excel function.
2. Follow steps 2-9 in the previous section, using only the 2nd-order polynomial in step 8.
10. Extend the eight α values with the three additional extrapolated values (0.0001, 0.00001, and 0.000001).
11. For each of the three additional α values, calculate a temporary x-value and use the polynomial (see steps 8 and 9) to find a temporary z-value.
12. From the z-value, calculate the y-value (equation in step 7) for each of the three additional α values.
13. Calculate w-values as w = y·SQR(2)·TINV(α, v). This will yield eleven w-values, one for each α. Note that the w-values are the same as Q values, one for each α.
14. Find the α values that bracket the sought α based on the desired Q value and w-values.
15. Use the false-position bracketing interpolation technique to find the α corresponding to the desired Q value, with an acceptable error of 0.1%.
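Steps 14 and 15 describe a standard false-position (regula falsi) search. The following is a minimal pure-Python sketch with a 0.1% stopping criterion; it is illustrative only (in the add-in, the function being solved would map a candidate α to its w-value minus the desired Q, and the bracket would come from step 14).

```python
def false_position(f, lo, hi, rel_tol=1e-3, max_iter=100):
    """Regula falsi: keep the root bracketed in [lo, hi] and take the
    x-intercept of the chord between (lo, f(lo)) and (hi, f(hi)) as the
    next estimate; stop when the relative step falls below rel_tol."""
    f_lo, f_hi = f(lo), f(hi)
    assert f_lo * f_hi < 0, "root must be bracketed"
    x = lo
    for _ in range(max_iter):
        x_new = hi - f_hi * (hi - lo) / (f_hi - f_lo)
        f_new = f(x_new)
        # Keep the sub-interval that still brackets the sign change.
        if f_lo * f_new < 0:
            hi, f_hi = x_new, f_new
        else:
            lo, f_lo = x_new, f_new
        if abs(x_new - x) <= rel_tol * abs(x_new):
            return x_new
        x = x_new
    return x
```

Unlike bisection, each estimate uses linear interpolation between the bracket endpoints, which converges quickly when the function is nearly linear over the bracket, as the w-versus-α relationship is between neighboring tabled values.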

Creating algorithms and an add-in in Excel
The Excel add-in was created using the built-in Visual Basic for Applications (VBA) editor of Microsoft Excel 2013. In addition to a small subroutine for loading the Gleason tables into string variables, the functions QINVG and QDISTG were created as user-defined functions. They were so named to differentiate them from the user-defined functions (QINV and QDIST) in the Excel add-in RealStats-2007 (Zaiontz 2016) and to recognize Gleason's contribution to this work (Gleason 1998, 1999). The code of the add-in is provided as part of this manuscript and has been annotated to correspond to the step-by-step methods listed above. After the code had been entered and tested, the spreadsheet was saved as an Excel 97-2003 add-in (*.xla).

User installation
The process of installation and activation of add-ins for Microsoft Excel is covered in Excel's help files with slight differences between Excel versions. Examples for the add-in installation process for Excel 2007 and earlier have also been published by Buttrey (2009). The add-in presented within this manuscript has been tested for Excel versions 2003 through 2013.
As shown in Figure 2, the QDISTG function was unable to calculate a value for α with Q = 10 when r was 5 or 10. This is because the estimated α was outside the limits of the add-in; currently, the lowest α value that can be calculated by QDISTG is 0.000001. The limitations of the QXLA add-in are as follows:

• α values used by QINVG should ideally be between 0.5 and 0.001 but can be used for α < 0.001 without major concerns (Gleason 1999). The code automatically allows for this.
• α values calculated by QDISTG must be between 0.5 and 0.000001. Between 0.5 and 0.001 the method by Gleason (1999) is used; this was extended by extrapolation down to 0.000001. Outside this range, a #VALUE! error will be displayed in the cell. It would be possible to extend the range by further extrapolation, but Gleason (1998) does not recommend extrapolating beyond the other end of the range, above α = 0.5.
• v used by QINVG and QDISTG can be an integer between 1 and 99,999,999. It would be possible to extend this range further in the code. Outside this range, a #VALUE! error will be displayed in the cell.
• r used by QINVG and QDISTG should ideally be an integer between 2 and 100. However, higher values will automatically be extrapolated and will not return an error value. According to Gleason (1998), extrapolation is feasible (and accurate) up to r = 200.

Figure 3: Spreadsheet example implementing the REGWQ test for four treatment groups using the user-defined function QINVG(α, r, v). The green sections indicate user input, containing the raw data (cells B16:G36) and the desired α (cell F2). Other sheets in the downloadable workbook demonstrate the S-N-K and Tukey tests.

Day and Quinn (1989), in their review of different MCPs, provide an example that is also used here. They presented four groups (A1, A2, NB, and S) representing different treatments, with five replicates in each group. Figure 3 depicts a spreadsheet that uses the add-in and allows the user to enter data in the highlighted green sections. The workbook automatically constructs the basic single-factor ANOVA table, performs the Tukey test and the stepwise REGWQ and S-N-K tests, and determines which mean belongs to which grouping (cells B15:G15). The Q values calculated by the add-in are in agreement with those listed by Day and Quinn (1989). The result of the REGWQ test shows that the means of groups A1 and A2 are significantly different from those of the other two groups but not from each other, and likewise for the means of the other two groups. The workbook is available for download and uses the QXLA add-in code. It should be noted that the workbook contains the VBA code for QINVG and QDISTG and can therefore function without installing the add-in.
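For reference, the stepwise REGWQ test compares means at adjusted significance levels that depend on the span p of each comparison (the number of means, ordered by magnitude, between and including the pair being compared). The short Python sketch below implements the standard textbook adjustment; it is not code taken from the workbook, and the grouping logic itself is omitted.

```python
def regwq_alpha(p, r, alpha):
    """Per-step significance level for the REGWQ test: a comparison
    spanning p of r ordered means is tested at
    alpha_p = 1 - (1 - alpha)**(p / r) for p < r - 1,
    and at the full alpha for the two largest spans (p = r - 1, r)."""
    if p >= r - 1:
        return alpha
    return 1.0 - (1.0 - alpha) ** (p / r)

# For r = 4 groups at alpha = 0.05 (as in the Day and Quinn example),
# only comparisons spanning p = 2 means get a reduced level.
levels = {p: regwq_alpha(p, 4, 0.05) for p in (2, 3, 4)}
```

Each adjusted level α_p is then passed to QINVG(α_p, p, v) to obtain the critical Q value for that step of the procedure.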

Conclusions
• A subroutine and user-defined functions, QINVG(α, r, v) and QDISTG(QVal, r, v), returning the upper quantiles and the upper percentiles of the studentized range, were successfully implemented in an Excel add-in using VBA programming.
• As indicated by Gleason (1999), the method "sacrifices some exactness for speed and simplicity." However, its easy programming, fast execution, and wide range of usability make it an attractive option in Excel.
• Its efficiency was demonstrated by Lew (2011), who reported that R needed 45 min to process a data set of 1,216,000 points with the qtukey function, which uses the algorithm by Copenhaver and Holland (1988), while a Python program using the Gleason method took 181 seconds.
• The user-functions allow for an expanded use of Excel in data analysis.