PoweR: A Reproducible Research Tool to Ease Monte Carlo Power Simulation Studies for Goodness-of-fit Tests in R

The PoweR package aims to help obtain or verify empirical power studies for goodness-of-fit tests for independent and identically distributed data. The current version of the package is only valid for simple null hypotheses, or for pivotal test statistics whose set of critical values does not depend on a particular choice of null distribution (or on nuisance parameters) in the non-simple null case. We also assume that the distribution of the test statistic is continuous. As a reproducible research computational tool, it can be viewed as helping to reproduce (or detect errors in) simulation results already published in the literature. Using our package also helps in designing new simulation studies. The empirical levels and powers of many test statistics under a wide variety of alternative distributions can be obtained quickly and accurately using a C/C++ and R environment. The parallel package can be used to parallelize computations when a multicore processor is available. The results can be displayed using LaTeX tables or specialized graphs, which can be directly incorporated into a report. This article gives an overview of the main design aims and principles of our package, as well as strategies for adaptation and extension. Hands-on illustrations are presented to help new users get started.


Introduction
Reproducible research, a philosophy of research first promoted by Claerbout and Karrenbach (1992), is a way of providing the reader of a document with the possibility of reproducing all of its graphical and numerical content. This implies access to all the data, to the code and software to run it, and to instructions for using it properly. Note that this also means that users can easily produce similar results from their own data. In the specific field of research on goodness-of-fit tests, most published papers contain a section with Monte Carlo simulation results showing the performance of several competing tests in terms of size and power. These results are usually presented in the form of voluminous tables. To obtain or reproduce such results is a fastidious and time-consuming operation, as one usually has to program in one's favorite language all the test statistics involved, devise a plan for the simulations (including all the chosen alternative distributions), run those simulations and integrate the results into a LaTeX table. These operations can take weeks (see, e.g., Romão, Delgado, and Costa (2010) for a comprehensive study of non-normality tests) and errors are difficult to avoid. To circumvent these problems, we propose the following guidelines for reproducible research:

1. always provide an explicit (mathematical) formula or procedure to compute the test statistic that has been developed;

2. apply the test to a small real (or simulated) data set and provide the data, the value of the test statistic and the p value;

3. if possible, give a pseudocode description of the algorithm used to compute the test statistic, its critical values and p values (see Kreher and Stinson (2005) for an appropriate LaTeX package);

4. give clear indications or references for the competing tests and alternative distributions used in the simulations, including the values of all the parameters involved;

5. assuming familiarity with the R and C/C++ languages, integrate the code of your functions (with comments) into our package PoweR and use it to perform the simulations, then give in your publication the set of instructions used.
Following these guidelines, we advocate the use of our new package PoweR. It is designed to help obtain or verify empirical power studies for (goodness-of-fit) testing of the null hypothesis that the (independent and identically distributed) simulated data come from some specified distribution. Its main characteristics are:

• generation of values from many probability distributions;
• computation of several goodness-of-fit test statistics;
• Monte Carlo computation of p values;
• functions to compute the empirical size and power of several hypothesis tests under various distributions;
• various plot functions, described later, to ease graphical comparisons between tests;
• C/C++ implementation of several parts of the code for faster computations;
• optimized management of the randomly generated sample values;
• the possibility of parallel computations, using the parallel package (R Core Team 2015), if your computer is equipped with several CPUs or with multicore processors;
• output of LaTeX tables that can be directly incorporated into your document;
• a graphical user interface (GUI).
However, we warn potential users that the current version of our package is only valid for simple null hypotheses, or for pivotal test statistics whose set of critical values does not depend on a particular choice of null distribution (or on nuisance parameters) in the non-simple null case (but see Remark 2.2).
In the next section, we recall some results on the computation of various statistical quantities using Monte Carlo simulations, as well as the terminology used when one performs a goodness-of-fit test simulation study. We also mention the probability distributions and the hypothesis tests that are already included in the package. In Section 3, we show how to use our package, via a script or via the GUI, to (re)produce a simulated comparison of powers for existing tests. In Section 4, we show how to extend the package by adding a new law or a newly created test. We then conclude the paper by outlining future avenues of development.
Note that from here on, we will suppose that our package PoweR has been loaded into the R session using the instruction library("PoweR"). All the computations presented in this paper have been performed on a laptop running the Linux Debian 8 operating system, equipped with a 64-bit eight-core Intel(R) Core(TM) i7 960 CPU at 3.20 GHz.

Monte Carlo simulations for goodness-of-fit tests

2.1. Theoretical background on hypothesis tests
Suppose that we have a sample X_1, . . . , X_n of random variables whose distribution is assumed to belong to some parametric family F = {P(θ); P ∈ D, θ ∈ Θ_P}, where D is some subset of all the probability distributions and Θ_P ⊂ R^{k_P}, k_P ∈ N. Goodness-of-fit tests are procedures designed to evaluate the following statistical hypotheses involving the true probability distribution L(X_i) of X_i:

H_0 : L(X_i) ∈ A   versus   H_1 : L(X_i) ∉ A,

where A = {P_0(θ); θ ∈ Θ_{P_0}} ⊂ F and P_0(θ) is called a null distribution. If the true distribution of the sample does not belong to A, we call it an alternative distribution. For example, one might be interested in testing non-normality, in which case P_0(θ) = N(µ, σ²), the Gaussian distribution with parameters θ = (µ, σ²). The truthfulness of hypothesis H_0 is questioned, at a pre-specified significance level α, using a test statistic T_n = T(X_1, . . . , X_n) built so as to reflect a discrepancy between the null hypothesis and the information brought by the data. Its observed value t_obs is usually compared with one (c_α) or two (c_L(α) and c_R(α)) critical values (also called percentage points) defining the so-called rejection (of H_0) region R_α. That is, we decide to reject H_0 (declare it false) if and only if T_n ∈ R_α. In this paper, we shall only consider alternatives of the form

H_1 : L(X_i) ∈ F_1 ∪ F_2,

where F_1 and F_2 are two nonempty disjoint families of distributions (e.g., sub-Gaussian and super-Gaussian distributions). If the alternative hypothesis is H_1 : L(X_i) ∈ F_1 (or H_1 : L(X_i) ∈ F_2), the test will be termed one-tailed or one-sided; otherwise, it will be called two-tailed or two-sided.
Two types of errors are commonly associated with any test procedure. A Type-I error occurs if we wrongly reject a true null hypothesis, and a Type-II error if we fail to reject a false null hypothesis. A test procedure is designed to control, at some threshold level (the significance level α), the probability of committing a Type-I error. One or two critical values are needed to define the rejection region, and they are chosen so that P_{H_0}[T_n ∈ R_α] ≤ α (but as close to α as possible). The probability of a Type-I error, P_{H_0}[T_n ∈ R_α], is called the size of the test, and it will be important to check whether it is really less than α. Note that an evaluation of this size is possible numerically using our package; the value obtained is called the empirical size (or sometimes empirical level) of the test. Note also that sometimes the critical values defining the rejection region are computed from the asymptotic distribution of T_n. In that case, if T_∞ denotes any random variable following this asymptotic distribution, we choose critical values such that P_{H_0}[T_∞ ∈ R_α] ≤ α. These asymptotic critical values will be denoted c^a_α, c^a_L(α) and c^a_R(α). Several non-normality tests are available in the nortest package (Gross and Ligges 2012). A question naturally arises for any user of this package: which test should one use? The answer depends on the power 1 − β(θ) of the test statistic T_n = T(X_1, . . . , X_n) considered, which is given by

1 − β(θ) = P_θ[T_n ∈ R_α],

where β(θ) is the probability of committing a Type-II error and R_α is the α-rejection region of the test procedure, determined beforehand as described above. For a given significance level, the test with the greatest power (for a given value of θ) should be used. At this point, one should note that given real data, the power of any test procedure is unknown because it depends on the unknown true distribution of the sample at hand.
This is why it is important to perform simulations to compare numerically several hypothesis tests (designed to test the same null hypothesis) under a wide variety of common alternative distributions. This is where our package PoweR comes in handy, as it offers a fast and automated way to perform such computations using Monte Carlo simulations.
Another important characteristic attached to the application of a hypothesis test to real or simulated data is the so-called p value, informally defined as the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming that the null hypothesis is true. Note that two-sided p values (those computed when R_α = {T_n < c_L(α)} ∪ {T_n > c_R(α)}) are well defined (as 2P_{H_0}[T_n > |t_obs|]) only when the test statistic considered has a symmetric distribution. For non-symmetric distributions, several proposals have been made (Kulinskaya 2008) but none of them has led to a consensus. In this paper, we shall only consider Fisher's doubled p value (Yates 1984, p. 444), despite the evident drawback of the doubling rule that it may produce a p value greater than 1. The rule doubles the one-sided p value coming from the tail corresponding to the observed direction, taken to be the lower one if t_obs < q_2 and the upper one otherwise, where q_2 is the median of T_n (or of T_∞). Fisher's motivation was an equal prior weight of departure in either direction. Future versions of our package may propose other approaches.
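The doubling rule can be sketched as follows; this is a language-agnostic illustration in Python, and the function name is ours, not part of PoweR.

```python
import statistics

def doubled_pvalue(t_obs, t_null):
    """Fisher's doubled p value, computed from a sample t_null drawn from
    the null distribution of the test statistic.  The one-sided p value is
    taken from the tail indicated by t_obs relative to the median, then
    doubled; the result may exceed 1, the known drawback of this rule."""
    q2 = statistics.median(t_null)
    m = len(t_null)
    if t_obs < q2:  # lower tail
        p_one = sum(t <= t_obs for t in t_null) / m
    else:           # upper tail
        p_one = sum(t >= t_obs for t in t_null) / m
    return 2 * p_one

# An extreme observation on either side of a symmetric null sample gives
# the same small doubled p value.
sample = [-2.0, -1.0, 0.0, 1.0, 2.0]
doubled_pvalue(2.0, sample)   # 2 * 1/5 = 0.4
```

A central observation, by contrast, can yield a doubled p value above 1, which is exactly the drawback mentioned above.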

Monte Carlo computations
All the characteristics defined so far (critical values, size and power of the test, p value) can be estimated numerically using Monte Carlo simulations. A typical algorithm is given below.

Require: n (Sample size)
Require: M (Number of Monte Carlo samples)
Require: P (A probability distribution from which to generate numbers)
Require: θ (Vector of parameters of P)
for m = 1, . . . , M do
    generate independently a sample x_m = (x_{1,m}, . . . , x_{n,m}) of size n drawn at random from the distribution P(θ) ∈ F, for some fixed value θ of the parameter vector
    compute the observed value of the test statistic t_{n,m} = T(x_{1,m}, . . . , x_{n,m})
end for
return t_{n,1}, . . . , t_{n,M}
We can then use t_{n,1}, . . . , t_{n,M} to estimate quantities of interest. Let us consider a two-sided test that rejects for either large or small values of the test statistic. If P = P_0 (we are under the null), we could compute (using our function many.crit()) the critical values

ĉ_L(α) = t_{n,(⌈Mα/2⌉)}   and   ĉ_R(α) = t_{n,(⌈M(1−α/2)⌉)},

which are respectively the (α/2)-th and (1 − α/2)-th empirical quantiles, where t_{n,(1)} ≤ · · · ≤ t_{n,(M)} denote the ordered statistic values and ⌈x⌉ denotes the smallest integer not less than x. Note that in the current version of our package, we have assumed a rejection of the null with equal probability for large or small values of the test statistic. The power of the test under some alternative distribution P(θ) can then be estimated by

(1/M) Σ_{m=1}^{M} ( 1{t_{n,m} < ĉ_L(α)} + 1{t_{n,m} > ĉ_R(α)} ),

where the t_{n,m} are here computed under the alternative distribution specified by θ. If P = P_0 and we are given values of c_L(α) and c_R(α), for example the (theoretical) asymptotic critical values c^a_L(α) and c^a_R(α), we can compute the empirical size of the test by

(1/M) Σ_{m=1}^{M} ( 1{t_{n,m} < c_L(α)} + 1{t_{n,m} > c_R(α)} ).

Suppose now that one has an observed value t_obs of some test statistic T_n used to evaluate some goodness-of-fit hypothesis H_0. The p value can be approximated by p̂ given in Equation 1:

p̂ = (2/M) Σ_{m=1}^{M} 1{t_{n,m} > t_obs} if t_obs ≥ q̂_{0.5}, and p̂ = (2/M) Σ_{m=1}^{M} 1{t_{n,m} < t_obs} otherwise,   (1)

where q̂_{0.5} is the empirical median of t_{n,1}, . . . , t_{n,M}. Presenting Monte Carlo results to show evidence of the finite-sample properties of hypothesis testing procedures usually relies on tables of the (empirical) size and power of these hypothesis tests. One can also use simple techniques for the graphical display of simulation evidence. Our package PoweR makes these tasks easy. The reader is encouraged to read Ehrenberg (1977), which contains some precepts for improving data presentations. Davidson and MacKinnon (1998) describe three types of figures, called p value plots, p value discrepancy plots and size-power curves, that convey much more information, in a more easily assimilated form, than tables; see also Koziol (1989), Lieber (1990), Schweder and Spjøtvoll (1982) and Wilk and Gnanadesikan (1968).
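The full pipeline implied by the algorithm above (simulate the statistic under the null, extract empirical critical values, then estimate size and power) can be sketched generically as follows. This is a toy illustration in Python, not the package's C/R code; the sample-mean "test statistic" and the normal/shifted-normal laws are our own choices.

```python
import math
import random

def simulate_stat(rng, gen, n, M, stat):
    """Draw M independent samples of size n using gen and return the
    M observed statistic values, sorted."""
    return sorted(stat([gen(rng) for _ in range(n)]) for _ in range(M))

def critical_values(t_null_sorted, alpha):
    """Empirical two-sided critical values: the ceil(M*alpha/2)-th and
    ceil(M*(1 - alpha/2))-th order statistics of the null draws."""
    M = len(t_null_sorted)
    lo = t_null_sorted[math.ceil(M * alpha / 2) - 1]
    hi = t_null_sorted[math.ceil(M * (1 - alpha / 2)) - 1]
    return lo, hi

def rejection_rate(t_values, lo, hi):
    """Proportion of statistic values in the rejection region: the
    empirical size under the null, the empirical power otherwise."""
    return sum(t < lo or t > hi for t in t_values) / len(t_values)

rng = random.Random(1)
stat = lambda xs: sum(xs) / len(xs)        # toy statistic: the sample mean
null_gen = lambda r: r.gauss(0.0, 1.0)     # null: standard normal data
alt_gen = lambda r: r.gauss(0.8, 1.0)      # alternative: shifted normal

t_null = simulate_stat(rng, null_gen, 20, 5000, stat)
lo, hi = critical_values(t_null, 0.05)
size = rejection_rate(simulate_stat(rng, null_gen, 20, 5000, stat), lo, hi)
power = rejection_rate(simulate_stat(rng, alt_gen, 20, 5000, stat), lo, hi)
```

Under the null the rejection rate should land close to α = 0.05, while the shifted alternative should be rejected far more often; this is exactly the size/power contrast the package tabulates.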
All of these graphs, presented below, are based on the empirical distribution function (EDF) of the p values of the test statistic considered. If we have a generated sample p_1, . . . , p_N of p values derived from observed values of a certain test statistic T_n (under a given distribution), then the associated EDF is computed using the following formula:

F̂(x) = (1/N) Σ_{i=1}^{N} 1{p_i ≤ x},   (2)

where 1{C} equals 1 if condition C is satisfied, and 0 otherwise. This EDF can be evaluated at J points x_j, j = 1, . . . , J, which should be chosen in advance so as to provide a reasonable snapshot of the (0, 1) interval, or of that part of it which is of interest. A quite parsimonious way to choose the x_j is to use a grid that is finer near 0 and 1 than in the middle of the interval (see Davidson and MacKinnon 1998). Extra points can be added near 0 and 1 to ensure that we do not miss any unusual behavior in the tails.
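Evaluating this EDF on a grid of points is straightforward; here is a minimal Python sketch, where the grid shown is an arbitrary example, not the package's default.

```python
import bisect

def edf(pvalues, grid):
    """Evaluate the EDF of the p values at each grid point:
    F_hat(x) = (1/N) * #{ p_i <= x }."""
    srt = sorted(pvalues)
    N = len(srt)
    return [bisect.bisect_right(srt, x) / N for x in grid]

# p values spread (almost) uniformly over (0, 1): the EDF should track
# the 45-degree line, i.e., F_hat(x) should be close to x.
pvals = [(i + 0.5) / 100 for i in range(100)]
grid = [0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]
Fhat = edf(pvals, grid)
```

Sorting once and using binary search keeps the evaluation cheap even for large N and fine grids.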
Two plots designed to evaluate (and compare) the size of one or several test statistics under the null hypothesis are: • The p value plot: this is a plot of F̂(x_j) against x_j, j = 1, . . . , J. If the distribution of T_n used to compute the p_i's is correct, each p_i should be a realization of the uniform distribution on (0, 1). Therefore, when F̂(x_j) is plotted against x_j, the resulting graph should be close to the 45° line. We can then easily distinguish at a glance between test statistics that systematically over-reject, under-reject, or reject roughly the right proportion of the time.
• The p value discrepancy plot: this is a plot of F̂(x_j) − x_j against x_j. This plot conveys much more information than a p value plot for test statistics that are well behaved. However, some of this information is spurious, simply reflecting experimental randomness; Davidson and MacKinnon (1998, Section 5) therefore discuss semi-parametric methods for smoothing such plots. Moreover, because there is no natural scale for the vertical axis, p value discrepancy plots can be harder to interpret than p value plots.
The nominal (i.e., approximated) size of a test is often different from the true size P H 0 [T n ∈ R α ], for example when it is computed using some approximation (e.g., asymptotic) of the (finite sample) null-distribution of T n . Because the previous graphs are used to evaluate the correctness of the size, it is important to note that the p values under the null should be calculated using this approximate distribution.
If we want to compare the power of competing tests, we can use the following graph, which has the advantage of being usable even if not all the tests have the correct size. In this case, the p values under the null can be calculated using a Monte Carlo approximation of the null distribution.
• The size-power curves: the points (F̂(x_j), F̂*(x_j)), j = 1, . . . , J, including the points (0, 0) and (1, 1), where F̂(x) and F̂*(x) are respectively the EDFs of the p values under the null and under the alternative distribution; see Wilk and Gnanadesikan (1968) for a description of such a plot and Bhattacharya and Habtzghi (2002) for a definition of the p value as a random variable.
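The size-power curve can be computed directly from two sets of simulated p values. The Python sketch below uses toy p value samples that mimic a well-behaved null and a powerful test under the alternative.

```python
import bisect

def edf_at(pvalues, x):
    """Proportion of p values less than or equal to x."""
    srt = sorted(pvalues)
    return bisect.bisect_right(srt, x) / len(srt)

def size_power_curve(p_null, p_alt, grid):
    """Points (F_hat(x_j), F_hat_star(x_j)) of the size-power curve,
    closed at (0, 0) and (1, 1): the x-coordinate is the empirical size
    and the y-coordinate the empirical power achieved at that size."""
    pts = [(edf_at(p_null, x), edf_at(p_alt, x)) for x in grid]
    return [(0.0, 0.0)] + pts + [(1.0, 1.0)]

# Roughly uniform p values under the null; p values pushed toward 0
# under the alternative, as for a powerful test.
p_null = [(i + 0.5) / 50 for i in range(50)]
p_alt = [((i + 0.5) / 50) ** 3 for i in range(50)]
curve = size_power_curve(p_null, p_alt, [0.05, 0.10, 0.50])
```

A powerful test yields a curve that rises steeply above the diagonal: for any attained size on the horizontal axis, the power on the vertical axis is much larger.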
We define three other quantities, given a table of powers p_{ik} (1 ≤ i ≤ I, 1 ≤ k ≤ K) of test statistics T_1, . . . , T_K against I alternative distributions. For a given sample size n and significance level α, the average power of test statistic T_k is defined as

(1/I) Σ_{i=1}^{I} p_{ik},

the average gap power is defined as

(1/I) Σ_{i=1}^{I} ( max_{1≤k'≤K} p_{ik'} − p_{ik} ),

and the worst gap power is defined as

max_{1≤i≤I} ( max_{1≤k'≤K} p_{ik'} − p_{ik} ).
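These summaries can be computed from a table of powers as follows; in this Python sketch we read the "gap" for alternative i as the shortfall of T_k relative to the best of the K tests on that alternative.

```python
def power_summaries(p):
    """p[i][k] = power of test statistic T_k against alternative i.
    For each test k, return its average power, its average gap to the
    best test per alternative, and its worst (largest) such gap."""
    I, K = len(p), len(p[0])
    best = [max(row) for row in p]   # best achievable power per alternative
    avg = [sum(p[i][k] for i in range(I)) / I for k in range(K)]
    gaps = [[best[i] - p[i][k] for i in range(I)] for k in range(K)]
    avg_gap = [sum(g) / I for g in gaps]
    worst_gap = [max(g) for g in gaps]
    return avg, avg_gap, worst_gap

# Two tests against three alternatives: test 1 wins twice, test 2 once.
table = [[0.50, 0.60],
         [0.70, 0.65],
         [0.90, 0.80]]
avg, avg_gap, worst_gap = power_summaries(table)
```

A test can have the best average power yet a large worst gap, which is why the three summaries are reported together.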

Preliminaries
The instruction help(package = "PoweR") returns a list of all the functions available in package PoweR. These are listed in Table 1 below.

Distributions in PoweR
The PoweR package contains a large set of probability distributions (39 at the time of writing) from which to generate random numbers. These are detailed in Section A of the Appendix. Note that this information is also available from within R, as illustrated below.
R> head(getindex()$mat.laws)

The columns Default1 to Default4 display the default values of the parameters of each corresponding probability distribution in the Law column (e.g., mu = 0 and b = 1 for Laplace(mu, b)). In this context, the useful function help.law() enables one to obtain various information about a given law. For example, try help.law(6) to display the documentation for the law whose index is 6, namely the Beta(a, b) distribution.
The function gensample() enables one to generate a sample from any distribution in the package. Try this to generate a random sample of size 10 from a N(µ, σ) distribution with µ = 1 and σ = 2:

R> gensample(law.index = 2, n = 10, law.pars = c(1, 2))

The following fragment of Table 1 lists some of the functions of the package:

…                   Retrieve the default number of parameters of some laws
getnbparstats()     Get numbers of parameters of test statistics
graph()             p value plots, p value discrepancy plots and size-power curves
help.law()          Open documentation for a given law using its index
help.stat()         Open documentation for a test using its index
law.cstr()          Gives information about a given law
many.crit()         Compute critical values for several test statistics, sample sizes and significance levels, for a given law
many.pval()         Computation of p values for several test statistics
plot.discrepancy()  p value discrepancy plots
plot.pvalue()       p value plots
plot.sizepower()    Size-power curves
powcomp.easy()      Computation of power and level tables for hypothesis tests ("slow" but easy to use version)
powcomp.fast()      Computation of power and level tables for hypothesis tests, for several sample sizes and significance levels ("fast" version)
power.gui()         Graphical user interfaces (GUI)
print.critvalues()  Transform the critical values given by function many.crit() into LaTeX code to build a table

Goodness-of-fit test statistics in PoweR
The PoweR package contains many functions to test non-normality, non-uniformity and non-laplacity. These are given in Tables 4, 6 and 5 of the Appendix, respectively.
The instruction getindex()$mat.stats returns a data.frame with four columns listing these test statistics. The first one (Index) contains the indices of all the test statistics available in PoweR. The second one (Stat) contains the corresponding names of these test statistics.
The third one (Alter) gives the type of test; when Alter = 0, Fisher's doubled p value will be computed. The fourth column (Nbparams) gives the number of parameters of the corresponding test statistic. This is illustrated below.

R> head(getindex()$mat.stats)
Index Stat Alter Nbparams

Note that the function getindex() can be used to obtain this information for a subset of all test statistics using its argument stat.indices. The function help.stat() displays some information about a given test statistic. Try for example help.stat(21) to display the documentation for the test whose index is 21, namely the Shapiro-Wilk test for non-normality.

Perform a test on an observed data sample
The function statcompute() enables one to perform a test on a given sample of observed (or simulated) data. This is illustrated by showing that it gives results identical to those of the shapiro.test() function from the stats package (R Core Team 2015).
R> set.seed(1)

The arguments of this function are:

• stat.index: this should be a single integer-valued index.
• data: sample from which to compute the test statistic.
• levels: vector giving the desired significance levels for the test.

• critvalL, critvalR: vectors of left and right critical values. The length of each vector of critical values should be the same as the length of the argument levels. A convenient way to fill these values is either to use the quantiles of the asymptotic null distribution (e.g., qchisq(0.95, 2) for a χ²₂ distribution), or to use the function compquant() to compute them through a Monte Carlo simulation with M repetitions:

R> crit <- compquant(n = 10, law.index = 2, stat.index = 21, M = 10^5)
R> crit$quant
     2.5%        5%       10%       90%       95%     97.5%
0.8186749 0.8443032 0.8702271 0.9714234 0.9782153 0.9830226

Note that if both critvalL and critvalR are NULL, then the decisions to reject H_0 (1 is the code for a rejection) are taken by comparing the p value to each element of levels. The p value is, most of the time, computed using the asymptotic (or tabulated) distribution of the test statistic under the null. If this computation is not possible, the value NA is returned. Nevertheless, the function pvalueMC(), described later on, can always be used.
• stat.pars: A vector of parameters. If NULL, the default parameter values for this test statistic will be used. See Tables 4, 6 and 5 in the Appendix for a list of these parameters.

Monte Carlo p values
The function pvalueMC() can be used to compute a Monte Carlo empirical p value as described in Equation 1. Its arguments are:

• data: sample from which to compute the p value.
• stat.index: this should be a single integer-valued index of the test statistic considered.
• null.law.index: index of the law under the null hypothesis.
• M: number of Monte Carlo repetitions; the default value is 10^5.

Perform a simulation study using a script
In this section, we briefly present the main functions of our package by showing how to obtain the simulation results from an article already published. Puig and Stephens (2000) present several simulation results for tests of the Laplace distribution. Using our package PoweR, we can easily retrieve the same simulation results with a few command lines.

Critical values
Let us start by reproducing their Table 1. If latex.output = TRUE, LaTeX code is returned. After compilation, with pdflatex say, this directly gives Table 2 below, which is virtually identical to Table 1 in Puig and Stephens (2000). Note that the function print.critvalues() automatically adapts its output (producing one or several tables) depending on the number of test statistics.
Note that the same procedure can be used to obtain in one shot the critical values of Tables 1-4 in Puig and Stephens (2000) for the Watson U 2 , Anderson-Darling A 2 , Kolmogorov √ nD and Kuiper V statistics.

Power
Having obtained these critical values, we can study the empirical power of these tests using the instructions below. The alternative distributions we have considered are the Normal, Cauchy, symmetric Logistic and Generalized Extreme Value distributions, whose indices (given in law.indices) are 2, 3, 4 and 28, respectively. As before, the value 1 for law.index stands for the Laplace distribution, and alter is a list defining the type of each test. Note that the default parameter values we used for the alternative distributions can be obtained using the function law.cstr(). The function print.power() works similarly to print.critvalues(). Table 3 below is obtained using the instruction print(table6, digits = 3, latex.output = TRUE). The results we have obtained are very similar to those presented in Table 6 of Puig and Stephens (2000). Note that our function adds the average power, the average gap power and the worst gap power to this table of powers (see Section 2.3 for a definition of these quantities).

The arguments of the function powcomp.fast() are:

1. law.indices: vector of law indices as displayed by function getindex().

12. Rlaws: when some law indices in law.indices are equal to 0, this means that you will be using some R random generators. In that case, you should provide the names of the random generation functions in the corresponding components of the list Rlaws; the other components should be set to NULL.
13. Rstats: a list of the same length as stat.indices. If a value of the vector stat.indices is set to 0, the corresponding component of the list Rstats should be an R function that outputs a list with components statistic (value of the test statistic), pvalue (p value of the test; if not computable, it should be set to 0), decision (vector of decisions for each value in levels; 1 if we reject the null, 0 otherwise), alter (see above), stat.pars (see above), pvalcomp (1L if the p value can be computed, 0L otherwise) and nbparstat (length of stat.pars). A user can thus provide their own R functions to perform statistical tests. Note: if a value of stat.indices is not 0, then the corresponding component of Rstats should be set to NULL. The input arguments of this R function should be data, levels, usecrit = 0 (which will be set to 1 upon call if critical values are to be used to take the decisions), critvalL = 0 and critvalR = 0. Depending on the value of alter, only critvalL, only critvalR, or both will be used.

Plots
Our package PoweR provides not only the LaTeX tables of critical values and powers, but also plots based on p values, as explained in Section 2.3. These are obtained (except for statistics 45 and 46, for which no general theory exists to find their asymptotic distribution) using the R instructions given below, and are shown in Figures 1, 2 and 3. Note that the results of Puig and Stephens (2000) are neither easy nor quick to obtain because the asymptotic distribution is an infinite sum of weighted chi-squared random variables. Unfortunately, the weights λ are not provided in their paper, and computing them is not an easy task, involving optimization procedures that can be time consuming. This illustrates the reproducibility problem rather well.

R> Fxnull <- calcFx(ptmpnull)
The call to the many.pval() function above, where stind contains the indices of the hypothesis tests and alter contains the type of each test, produces an N × 3 matrix whose k-th column is an N-dimensional vector of p values associated with the k-th test statistic (1 ≤ k ≤ 3). These p values are calculated via a direct method, namely using the true null distribution (if it is known), the asymptotic null distribution or any other theoretical approximation to it. Next, the function calcFx() is applied columnwise to this matrix of p values to compute F̂(x_j), using Equation 2 at the J grid points chosen as described in Section 2.3. Now we can produce the first two graphs (Figures 1 and 2):

R> plot.pvalue(Fxnull)
R> plot.discrepancy(Fxnull, legend.pos = "topleft")

To produce the size-power curves, we first need to use the following instructions, where a Monte Carlo approach is now employed to compute the p values.

Using user-defined densities and tests coded in R
It is possible to avoid the burden of a C++ implementation of your test or of the random generation procedure for a new density not yet included in the package. This is easily done through the Rstats and Rlaw (or Rlaws) arguments, as will be described below. Note however that this can lead to an increase in the computing time. Also, we think that one should consider implementing these in C++ (see Section 4) before publication in order to follow the guidelines presented in the Introduction (in particular step 5).
We present below a situation where we want to evaluate the power of the Shapiro-Wilk test for normality against the Benini alternative distribution. Random values from this distribution (not included in our package) can be obtained thanks to the rbenini() function in the VGAM package (Yee 2014). The following code computes critical values. Note that in order to use an R random generation function (rnorm() here), one has to set law.index to 0, specify the name of that R function via the Rlaw argument and specify the values of the parameters of the distribution through the law.pars argument. Similarly, to use an R function for your test (my.shapiro() here), you have to set the corresponding value of stat.indices to 0 and specify the name of that function via the Rstats argument. The other components of Rstats, corresponding to non-zero values of stat.indices, should be set to NULL.

The graphical user interface (GUI)
To aid in using the package, we have created a GUI. To start it, simply type power.gui() in the R console (see the 'Details' section in help(power.gui) concerning potential problems with Iwidgets). You should see the window shown in Figure 4.
This interface consists of five tabs: • Generate sample: generate random samples from any law included in the package.
• Compute statistic: compute value of the test statistic for a given test index.
• Critical values: compute critical values of several test statistics.
• Compute power: compute power for hypothesis tests.
• Examples: reproduce simulation results already published in the literature.
The lower portion of the window (a different one for each tab) is divided into two parts: the left and the right sides.
• Left side: this contains required fields (e.g., indices of laws, tests or parameters), a Reset button used to reset the fields to their original values, and a Submit button used to send commands to the text editor on the right side of the window. In addition, ? buttons explain how to fill in the fields, as depicted in Figure 5. The ? button at the bottom left leads directly to the documentation of the function corresponding to the current tab.

• Right side: this contains a text editor that receives commands from the left side. Commands are easy to edit; they can be saved with the Save button, loaded from a previously created file with the Load button, and executed via the Run button.
You will find in the Examples tab around 50 code examples that allow one to regenerate simulation results from many articles already published in the literature.

Extending PoweR
When researchers plan a new simulation study to evaluate the performance, in terms of power, of a newly developed procedure, they might need to proceed as described below. Note that, to help integrate new functions into the package, we employed a mechanism described in Temple Lang (2001) to register our C/C++ routines. Developers then only have to add their own law and/or hypothesis test files in the appropriate subfolders.

Adding a law
It is possible to add a new law to our package PoweR. These new laws will then be used to generate random values in order to offer other choices of alternative distributions for power comparisons. We assume that you have already downloaded our package source code from the CRAN website. Here are a few steps to guide you through the process of implementing a new law in the package: 1. Add the definition of the new law in the two C++ files def-laws-stats.cpp and register.cpp, following the instructions given in those files. These two files are located in the subfolder Power/src/laws-stats/.
2. Create a new C++ file lawj.cpp in the subfolder Power/src/laws-stats/laws/ that will contain the source code to generate observations from the new distribution, where j is an available index for the new law. Note that j should be taken as the value returned by the instruction nrow(getindex()$mat.laws) + 1. Look at the other files in that directory to see how to write your own file. All files (should) have the same pattern, which is succinctly described below. The function prototype (all arguments are pointers) is:

void lawj(int *xlen, double *x, char **name, int *getname, double *params, int *nbparams, int *setseed);

Here xlen[0] will receive from R the length of the data sample to be generated; x will be used to pass the generated data back to R; name will be a 50-length pointer to char pointers, each one of length 1, whose purpose is to output the name of the distribution; getname[0] will contain 1 or 0 depending on whether the distribution name should be retrieved or not (in the former case, no data will be generated); params (of length nbparams[0]) will receive the values of the parameters of the distribution; and finally the input parameter setseed[0] will contain 1 if one wants to read in .Random.seed and write it out after use, and must be set to 0 otherwise. After that, you can use the function gensample() to verify whether the newly added law behaves correctly. First of all, a sufficiently large number of random values generated from the new law allows us to compute their empirical expectation and variance. We can then compare these with the theoretical values. This is exemplified below for a Weibull(10, 1) distribution.
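A moment check of this kind can be sketched generically; the sketch below is in Python rather than R, and it assumes that Weibull(10, 1) means shape 10 and scale 1, which should be verified against the package's own parametrization.

```python
import math
import random

rng = random.Random(42)
shape, scale = 10.0, 1.0

# Python's random.weibullvariate takes (scale, shape), in that order
xs = [rng.weibullvariate(scale, shape) for _ in range(100_000)]

emp_mean = sum(xs) / len(xs)
emp_var = sum((x - emp_mean) ** 2 for x in xs) / (len(xs) - 1)

# Theoretical moments of a Weibull distribution with shape k and scale s:
# mean = s * Gamma(1 + 1/k), variance = s^2 * (Gamma(1 + 2/k) - Gamma(1 + 1/k)^2)
th_mean = scale * math.gamma(1 + 1 / shape)
th_var = scale ** 2 * (math.gamma(1 + 2 / shape) - math.gamma(1 + 1 / shape) ** 2)
```

With 100,000 draws, the empirical and theoretical values should agree to roughly two decimal places; a larger discrepancy usually signals a bug in the generator or a parametrization mismatch.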

Adding a test statistic
Once a new goodness-of-fit test has been developed, a researcher might want to compare it with other existing tests in the literature in terms of power. The idea is to add it to our package and use the function powcomp.fast() to perform this kind of comparison. We present in this section the steps to follow to implement a new test statistic in the package. We assume that you have already downloaded the source code of our package from the CRAN website.
1. Add the definition of the new test in the two C++ files def-laws-stats.cpp and register.cpp. These two files are located in subfolder PoweR/src/laws-stats/.

Adding an example in the GUI
To make simulation results from some published paper available in the 'Examples' tab of the GUI, one only has to create a file in the folder PoweR/inst/examples/, following the pattern of those already present. This will increase the reproducibility of simulation results published in the literature.

Conclusion
The current version of our package is restricted to the case of simple null hypotheses, or to pivotal test statistics for which the set of critical values does not depend on a particular choice of a null distribution (and on nuisance parameters) in the non-simple null case. When this is not the case, a classical Monte Carlo approach is no longer valid because of the presence of nuisance parameters. As such, the pvalueMC() and many.crit() functions are potentially dangerous. These problems could arise if users want to add new tests to the package. For example, if they want to test a distribution with an unknown shape parameter, then the Anderson-Darling test statistic would depend on that parameter. This problem can also be seen in the Algorithm on page 5. Indeed, when H0 is a family of distributions, each θ0 ∈ Θ0 would lead to a different (pair of) critical value(s). One should then compute critical values for all θ0 ∈ Θ0 and take the supremum (resp. infimum) of these right (resp. left) critical values in order to build a test procedure controlling the (maximal) type-I error rate. Similarly, to calculate p-values, one should take a supremum over the parameter space Θ0. The problem here is that the cardinality of Θ0 could be infinite. Several approaches can be considered to try to circumvent this problem: a) use a (finite) grid of points over Θ0, hoping that this will be sufficient; b) choose a value θ such that P0(θ) ∈ A (i.e., corresponds to a null distribution) and such that the Kullback-Leibler (KL) distance between the distribution P0(θ) and the alternative distribution (with density g(x), say) is minimal (see Remark 2.1). Note that, for a power computation, the alternative distribution is chosen by the user and should be entirely specified. For example, g(x) could be the U[0, 1] uniform density and P0(x, θ) the density of the N(µ, σ²) Gaussian distribution, where θ = (µ, σ²).
The objective would then be to find a θ such that KL(g, P0(θ)) = ∫ g(x) log[g(x)/P0(x, θ)] dx is minimal. This can be achieved by numerical integration and optimization. The intuition is that we want to be able to measure the capability (power) of some test to detect departures from a given set of distributions corresponding to the null hypothesis towards some entirely specified alternative. This will be more difficult if the chosen null distribution is close to the alternative, which is the choice made using this KL approach. Note that to compute a p-value for an observed sample, one could start by replacing g(x) by some density estimate based on the sample; c) add an option for parametric bootstrapping to generate observations under the null when the distribution of the test statistic depends on nuisance parameters of the null distribution. These nuisance parameters would be estimated using the sample simulated under the alternative (or the observed sample), the intuition being to choose a null distribution closest in some sense to the alternative (or true) distribution.
Some other future avenues of development include: • to treat the case of test statistics having a discrete distribution, for example using the empirical quantiles of Ma, Genton, and Parzen (2011); • to compute empirical critical values with a non-equal probability of rejection of the null in the tails (for bilateral tests); • to perform simulations for goodness-of-fit tests for the errors of parametric models; • to use a cluster of workstations; and • to provide other templates for the LaTeX output of critical values and powers, with an alias in files print.critvalues.Rd and print.power.Rd.
Other uses of this package could also emerge. For example, one could use it with students for pedagogic purposes to investigate the robustness of the Student t-test (test index: 83) to non-normality of the data.

14. Average uniform: AveUnif(k, a, b). Expectation: (a + b)/2. Variance: (b − a)²/(12k).