Testing Goodness-of-Fit with the Kernel Density Estimator: GoFKernel

To assess the goodness-of-fit of a sample to a continuous random distribution, the most popular approach has been based on measuring, using either L∞- or L2-norms, the distance between the null hypothesis cumulative distribution function and the empirical cumulative distribution function. Indeed, as far as I know, almost all the tests currently available in R related to this issue (ks.test in package stats, ad.test in package ADGofTest, and ad.test, ad2.test, ks.test, v.test and w2.test in package truncgof) use one of these two distances on cumulative distribution functions. This paper (i) proposes dgeometric.test, a new implementation of the test that measures the discrepancy between a sample kernel estimate of the density function and the null hypothesis density function on the L1-norm, (ii) introduces the GoFKernel package, and (iii) performs a large simulation exercise to assess the calibration and sensitivity of the above listed tests as well as Fan's test (Fan 1994), fan.test, also implemented in the GoFKernel package. In addition to dgeometric.test and fan.test, the GoFKernel package adds a couple of functions that R users might also find of interest: density.reflected extends density, allowing the computation of consistent kernel density estimates for bounded random variables, and random.function offers an ad-hoc and universal (although computationally expensive and potentially inaccurate for long-tailed distributions) sampling method.
In light of the simulation results, we can conclude that (i) the tests implemented in the truncgof package should not be used to assess goodness-of-fit (at least for non-truncated distributions), (ii) the test fan.test shows an over-tendency to not reject the null hypothesis, being visibly miscalibrated (at least in its default option, where the bandwidth parameter is estimated using dpik from package KernSmooth), (iii) the tests ks.test and ad.test show similar power, with ad.test being slightly preferable in large samples, and (iv) dgeometric.test represents a good alternative given its satisfactory calibration and its, in general, superior power in samples of medium and large sizes. As a counterpart it entails more computational burden when the random generator of the null hypothesis density function is not available in R and random.function must be used.


Introduction
In the literature there are a number of non-parametric tests to assess whether a sample of a continuous random variable comes from a specified distribution. In goodness-of-fit tests the usual statistics are based on measuring, in some way, the discrepancy between either the empirical cumulative distribution or density function and the corresponding theoretical function. The L∞-norm and the L2-norm have been the most popular distance measures employed. Indeed, the tests currently available in R (R Core Team 2015) for this issue (such as ks.test from package stats, ad.test from package ADGofTest (Bellosta 2011), kuiper.test from package circular, and the tests ad.test, ad2.test, ks.test, v.test and w2.test available in the package truncgof) use one of these two distances on cumulative distribution functions.
These are not, however, the only dissimilarity criteria suggested in the literature to deal with this problem. Some proposals can also be found using likelihood ratios, Kullback-Leibler divergence and Renyi distance via entropy measures, or other closeness measures (see, e.g., Fan and Gencay (1993), Zhang (2002), Mattheou and Karagrigoriou (2010), Gurevich (2010), or Mashhadi (2011)). Indeed, within these approaches, R offers the dbEmplikeGOF package, which provides a function, dbEmplikeGOF, for density based empirical likelihood goodness-of-fit tests based on sample entropy (Miecznikowski, Vexler, and Shepherd 2013). Unfortunately, the dbEmplikeGOF function currently only performs tests of normality and uniformity and, therefore, does not offer a general solution.
In this paper, two tests operating on the density function are made available to R users through the GoFKernel package. In particular, the GoFKernel package contains an implementation of Fan's test (Fan 1994), which is based on the L2-distance, and a practical approximation to compare, using the L1-norm, the discrepancy between a theoretical density function and a sample kernel estimate of the density function (Cao and Lugosi 2005). The p values of a test based on this distance can be easily computed by Monte Carlo simulation using statistical software. The R functions collected in the GoFKernel package are designed to run this test for (almost) any one-dimensional continuous random variable.
Although this paper deals exclusively with one-dimensional continuous random variables, the test could be easily generalized to discrete variables. Its generalization to the useful case of multivariate random variables entails the use of high dimensional density estimation and is less straightforward. Lindsay, Markatou, and Ray (2014) provide a discussion of the issues entailed and an optimal method for identifying the appropriate bandwidth for use in goodness-of-fit problems. Some proposals for goodness-of-fit tests for multivariate normal vectors, using L2-distance and entropy measures, can be found in Bowman and Foster (1993) and Anderson, Hall, and Titterington (1994). The relevance of these references also rests on the fact that both papers discuss bandwidth selection, an issue that, as we shall see when we analyze the outcomes for Fan's test, can be crucial in the performance of L2-distance based tests.
The paper is structured as follows. A brief review of the main goodness-of-fit tests based on comparing the distance between the empirical cumulative distribution function and the corresponding theoretical function is performed in Section 2. This review is focused on those tests currently programmed in R. Section 3 introduces theoretically the tests implemented in the R package GoFKernel to measure the discrepancies between the null hypothesis density function and an empirical kernel estimate. Section 4 describes and exemplifies the main functions available in GoFKernel. In Section 5, a large simulation exercise comparing the calibration (size), power (sensitivity), and speed of the different general goodness-of-fit tests available in R is carried out for several distributions and sample sizes. Section 6 investigates convergence of the dgeometric.test. Finally, Section 7 provides conclusions.

Tests based on the cumulative distribution function
The Kolmogorov-Smirnov test (KS) is probably the most widely known non-parametric goodness-of-fit test (ks.test from package stats, ks.test from package truncgof). The KS statistic, Equation 1, quantifies at the sampled values, x_i, and with the L∞-norm the maximum (supremum) distance in absolute value between the empirical cumulative distribution function (ECDF) of the sample, F_n(x_i), and the cumulative function of the reference distribution, F(x). As is well known, this distance converges to 0 if the sample comes from the reference distribution.
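As a small illustration of the description above, the following base R sketch computes the usual two-sided KS statistic, sup |F_n(x) − F(x)|, by hand and compares it with the value reported by ks.test from package stats. The sample, the seed and the variable names are illustrative assumptions and are not taken from the paper.

```r
# Minimal sketch: KS statistic by hand for a fully specified N(0, 1) null.
set.seed(1)
x <- rnorm(50)                       # illustrative sample
n <- length(x)
Fx <- pnorm(sort(x))                 # null CDF at the ordered sample values
i <- seq_len(n)

# The ECDF jumps at each ordered value, so the supremum is attained either
# just after a jump (ECDF = i/n) or just before it (ECDF = (i - 1)/n).
d_plus  <- max(i / n - Fx)
d_minus <- max(Fx - (i - 1) / n)
ks_by_hand <- max(d_plus, d_minus)

ks_builtin <- unname(ks.test(x, "pnorm")$statistic)
print(c(by_hand = ks_by_hand, builtin = ks_builtin))
```

Only the sampled values need to be inspected because the ECDF is a step function and F is monotone, which is why the statistic is evaluated "at the sampled values".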
The KS test is however not considered to have good power (e.g., Stephens 1974), requiring a relatively large number of data points to properly reject the null hypothesis (Frampton 2010), and is moreover considered more sensitive to the part of the cumulative distribution above the median (Johnson, Miller, and Freund 2011). The Kuiper test (KP; kuiper.test from package circular), Equation 2, is a closely related alternative. This test is equally sensitive to differences along the entire range of the distribution and, compared to the KS test, has the advantage of being invariant under cyclic transformations. It shares however the limitations of the KS test.
The Anderson-Darling tests are another alternative to the KS test. In the same way as the KP test, these tests provide equal sensitivity at both tails (although not maintaining the cyclic invariance). In its simplest version (ad.test from package truncgof), it is a variance weighted KS statistic based on the L∞-norm, Equation 3, and in its L2-norm variant (Anderson and Darling 1952), it is a generalization of the Cramer-von Mises test (w2.test from package truncgof). In particular, the Anderson-Darling L2-norm variant is based on a quadratic ECDF statistic measure, Equation 4, and has the advantage of taking into account the differences between the empirical and theoretical cumulative distributions at all the sampled points.
More specifically, the Cramer-von Mises test uses w(x) = 1 as weighting function, while the Anderson-Darling test employs w(x) = [F(x)(1 − F(x))]^(−1), a weighting function that places more weight on observations in the tails of the distribution.
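To make the role of the weighting function concrete, the sketch below evaluates both quadratic ECDF statistics, n ∫ (F_n(x) − F(x))^2 w(x) dF(x), through their standard computational formulas on the ordered sample. The Exp(1) sample and all variable names are assumptions chosen for illustration; the code does not reproduce any of the packages discussed here.

```r
# Minimal sketch: Cramer-von Mises (w = 1) and Anderson-Darling
# (w = 1 / (F (1 - F))) statistics for a fully specified Exp(1) null.
set.seed(2)
x <- rexp(80, rate = 1)              # illustrative sample
n <- length(x)
u <- pexp(sort(x), rate = 1)         # F(x_(i)) under the null
i <- seq_len(n)

# Cramer-von Mises statistic
w2 <- 1 / (12 * n) + sum((u - (2 * i - 1) / (2 * n))^2)

# Anderson-Darling statistic; rev(u)[i] is F(x_(n + 1 - i))
a2 <- -n - mean((2 * i - 1) * (log(u) + log(1 - rev(u))))

print(c(W2 = w2, A2 = a2))
```

Because the Anderson-Darling weight is never smaller than 4, the same discrepancies between F_n and F always contribute more to A2 than to W2, and the extra contribution is largest near F(x) = 0 and F(x) = 1.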
Other alternative tests also based on comparing empirical and theoretical cumulative functions have been proposed by Watson (1961), Stephens (1964), and Pearson and Stephens (1962), among others (see also Marsaglia and Marsaglia (2004)).

Density function based tests
A large number of goodness-of-fit tests have also been proposed using empirical approximations to the density function. Indeed, the idea of using non-parametric empirical (kernel) density estimators for goodness-of-fit tests goes back to Bickel and Rosenblatt (1973) and Rosenblatt (1975). More recent work includes Bowman (1992), Ahmad and Cerrito (1993), Fan (1994, 1998) and Fan and Ullah (1999), among those papers that base their tests on the L2-error, and Cao and Lugosi (2005) and Albers and Schaafsma (2008), among the approaches employing L1-distances. In particular, following Fan (1994) and Cao and Lugosi (2005), the statistical discrepancy measures used are based respectively on the integral of the squared difference between an empirical kernel function estimate (EKF) of the unknown density function and the null hypothesis density function to be tested, I^g_{n,h}, Equation 5, and the integral of the absolute value difference, T_{n,h}, Equation 6,
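Written out from the verbal description above, the two discrepancy measures take the following form. This is a reconstruction from the surrounding text only (squared difference for Equation 5, absolute difference for Equation 6); the exact typesetting of the original equations is an assumption.

```latex
\begin{align}
I^{g}_{n,h} &= \int \bigl( f_{n,h}(x) - f(x) \bigr)^{2} \, dx \tag{5} \\
T_{n,h}     &= \int \bigl| f_{n,h}(x) - f(x) \bigr| \, dx     \tag{6}
\end{align}
```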
where f_{n,h}(x) is a sample empirical kernel estimate with bandwidth h and f(x) is the density function under the null hypothesis. Note that the values of the statistics depend on the bandwidth selected.
After developing the sum of squares of Equation 5 and employing some empirical approximations, Fan (1994) proposed a bias-corrected test, later simplified in Li and Racine (2007, pp. 380-381), which follows an asymptotic Gaussian distribution under some standard smoothness assumptions and the condition that the kernel bandwidth tends to zero when its product with the sample size tends to infinity. According to Cao and Lugosi (2005, p. 600), however, L1 approaches are preferable to L2 ones since they allow one "to drop unnecessary assumptions as well as to obtain non-asymptotic performance bounds".
To compute the p value of the test based on Equation 6, we can use numerical integration and Monte Carlo simulation. After computing by numerical integration the area between the density function under the null hypothesis and a sample empirical kernel estimator (see Figure 1), we can obtain the p value of the test by simulation as follows: (i) drawing S samples from f(x) with the same size n as our actual sample; (ii) estimating the kernel density function f_{n,hs}(x) for each of these new samples; (iii) computing the area between the theoretical density and each of the estimates of (ii); and (iv) calculating the p value as the proportion of times the S simulated areas computed in (iii) exceed the value of T_{n,h} obtained from the observed sample.
This approach resembles the strategy implemented in the package truncgof and is quite close to the one suggested in Cao and Lugosi (2005) for computing the critical region of the T_{n,h} statistic, given a significance level. On the one hand, the tests implemented in the truncgof package compute the p value by resampling in a transformation of the (truncated) distribution (Chernobai et al. 2005, p. 5), which under the null hypothesis follows a uniform distribution. On the other hand, Cao and Lugosi (2005, p. 609) suggested a similar algorithm to compute the threshold of the critical region of the test, but with a fixed bandwidth properly chosen to minimize T_{n,h} under the null hypothesis. Indeed, Cao and Lugosi (2005, p. 599) consider that the choice of the bandwidth "plays a crucial role in the performance of the test".
Regarding the number of samples S needed to obtain a relatively accurate approximation of the p value, although it depends on the particular null hypothesis density considered, it seems that as a rule good results are obtained with just a hundred samples (see Section 6). More accurate approximations can be obtained by increasing the number of samples, at the expense of increased computational time.
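Steps (i) to (iv) can be sketched in a few lines of base R. The helpers l1_distance and mc_pvalue, the fixed integration grid and the trapezoidal rule are assumptions made for this illustration; they are not the code of dgeometric.test, which may integrate and select bandwidths differently.

```r
# L1 distance between a Gaussian-kernel estimate and a null density,
# approximated with the trapezoidal rule on a fixed grid (hypothetical helper).
l1_distance <- function(x, dens, grid) {
  est <- density(x, kernel = "gaussian", from = min(grid), to = max(grid),
                 n = length(grid))   # default bandwidth, re-estimated per sample
  gap <- abs(est$y - dens(est$x))
  sum((gap[-1] + gap[-length(gap)]) / 2 * diff(est$x))
}

# Monte Carlo p value following steps (i)-(iv) (hypothetical helper).
mc_pvalue <- function(x, dens, rgen, grid, S = 100) {
  t_obs <- l1_distance(x, dens, grid)                              # observed area
  t_sim <- replicate(S, l1_distance(rgen(length(x)), dens, grid))  # (i)-(iii)
  list(statistic = t_obs, p.value = mean(t_sim > t_obs))           # (iv)
}

set.seed(3)
grid <- seq(-6, 6, length.out = 512)
x_null <- rnorm(100)                 # sample that follows the N(0, 1) null
x_alt  <- rnorm(100, mean = 2)       # sample that clearly does not
res_null <- mc_pvalue(x_null, dnorm, rnorm, grid)
res_alt  <- mc_pvalue(x_alt,  dnorm, rnorm, grid)
print(c(null = res_null$p.value, shifted = res_alt$p.value))
```

With S simulated samples the p value can only take values on a grid of step 1/S, which is one reason why a larger S gives a finer approximation at a higher computational cost.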

Package GoFKernel
Both the algorithm described above to implement a test based on the T_{n,h} statistic and Fan's test (with the simplification suggested in Li and Racine 2007, pp. 380-381, which "should have better finite-sample properties" since it has an asymptotic zero center term) have been programmed in the GoFKernel package. These two tests are implemented in the two main functions of the package: fan.test and dgeometric.test. Section 4.1 explains the fan.test function and Section 4.2 describes how dgeometric.test works. Furthermore, Section 4.3 presents the rest of the functions of the package, a couple of which (density.reflected and random.function) could be of interest for R practitioners.

Function fan.test
The function fan.test performs Fan's test (Fan 1994) in the variant proposed in Li and Racine (2007, pp. 380-381). The programmed test is based on an asymptotic approximation of I^g_{n,h}, Equation 5, to a normal distribution and analyzes the goodness-of-fit of a sample, via a kernel density estimate, to a theoretical density function. In its default option, fan.test uses the function dpik included in the package KernSmooth (Wand 2013) to estimate the bandwidth. Hence it requires that package to be available to run correctly. The function is used as:

fan.test(x, fun.den, par = NULL, lower = -Inf, upper = Inf, kernel = "normal", bw = NULL)

The arguments of the function are described as follows:
x: a numeric vector of data values.
fun.den: an actual density distribution function, such as dnorm. Only continuous densities are valid.
par: a list of additional parameters of the distribution specified, default NULL.
lower: lower end point of the support of the variable characterized by fun.den, default -Inf.
upper: upper end point of the support of the variable characterized by fun.den, default Inf.
bw: a number indicating the bandwidth (parameter h in Equation 5) to be used in the empirical kernel estimate of the data, default NULL. In its default option, the bandwidth is estimated using the function dpik included in the package KernSmooth.
The output is an object of class 'htest', as for the Kolmogorov-Smirnov test, ks.test from package stats. The output is a list containing the standardized value of the I^g_{n,h} statistic (statistic), the p value of the test (p.value), the character string "Fan's test" (method), a character string giving the kernel used (kernel), and a character string giving the name of the data (data.name).
As an example, firstly, a test is carried out to see if the null hypothesis of uniform distribution can be accepted for a random sample of size 100 from a uniform distribution and, secondly, if the object risk76.1929 available in GoFKernel follows the particular density function defined by f (x) = 2 − 2x for 0 < x < 1.
Fan's test

data: risk76.1929
Ig = -3.8156, p-value = 0.9999

According to this output, Fan's test clearly points to not rejecting the null hypothesis for the risk76.1929 dataset, with a p value of 0.9999.

Function dgeometric.test
The function dgeometric.test performs the test described in the last paragraph of Section 3. The test is based on computing the area defined by the T_{n,h} statistic for the observed data. To obtain the p value, the value of the statistic is then compared to the same area for a simulation of samples, with the same size as the observed sample, drawn from the distribution stated in the null hypothesis. The function uses a Gaussian kernel to estimate the kernel density functions and, when available, employs the random generators programmed in R. It is used as:

dgeometric.test(x, fun.den, par = NULL, lower = -Inf, upper = Inf, n.sim = 101, bw = NULL)

The arguments of the function are described as follows:
x: a numeric vector of data values.
fun.den, par, lower, upper: as described for fan.test.
n.sim: the number of simulated samples used to calculate the p value, default 101.
bw: a number indicating the bandwidth to be used in the empirical kernel estimates, default NULL. In its default option, the bandwidth varies in each simulated dataset and is the one provided by the function density under the hypothesis of a Gaussian kernel.
The output is an object of class 'htest', as for the Kolmogorov-Smirnov test, ks.test from package stats. The output is a list containing the value of the T_{n,h} statistic (statistic), the p value of the test (p.value), the character string "Geometric test" (method), the number of simulations performed to calculate the p value (iterations), and a character string giving the name of the data (data.name).
As an example, firstly, a test is carried out to see if the object risk76.1929 available in GoFKernel follows the particular density function defined by f (x) = 2 − 2x for 0 < x < 1, and, secondly, for a random sample of size 200 from a logNormal distribution, a test is carried out to see if the null hypothesis of a Gamma distribution with sample MLE as parameters can be accepted. According to this output, at the usual significance level of 0.05, the null hypothesis of a Gamma density is rejected for this simulated data with an approximate p value of 0.016.

Other functions in GoFKernel
In addition to the functions fan.test and dgeometric.test, which perform respectively Fan's test and the L1 geometric test in the simulation form described at the end of Section 3, the GoFKernel package includes the functions inverse, support.facto, random.function, area.between and density.reflected. These five functions are (internal) functions necessary to implement the geometric test. More specifically, (i) inverse computes the inverse function of any given cumulative distribution function; (ii) support.facto determines, for a random variable with an infinite theoretical support, its numerical de facto support; (iii) random.function generates draws of any random variable given its density (or cumulative) function; (iv) area.between numerically calculates the area between a theoretical density function and an empirical kernel estimate (see Figure 1); and (v) density.reflected computes an empirical kernel estimate of a sample using, for bounded variables, reflection at the borders (see, e.g., Silverman (1986)). In addition, the risk76.1929 object contains a vector with the annual fraction of time exposed to the risk of death at age 76 for people born in 1929 who immigrated to Spain during 2006. Under the null hypothesis of uniform distribution of dates of birth and dates of immigration, this time exposed to risk has density function f(x) = 2 − 2x for 0 < x < 1 (Pavía, Morillas, and Lledó 2012).
Of the above functions, the two of most interest for R users are density.reflected and random.function. The function density.reflected is based on the function density and produces an output of the same class as density for bounded variables and the same output for unbounded variables (see footnote 7). For bounded variables, density.reflected avoids via reflection the inconsistencies that density shows at the boundaries of the support of the random variable. The random.function function offers an ad-hoc and universal (although computationally expensive and potentially inaccurate for long-tailed distributions) sampling method that allows the drawing of samples of (almost) any one-dimensional continuous random variable. This function makes it possible to implement the dgeometric.test even when the null hypothesis density function is not available in the R environment and must be directly provided by the user. They are used as follows:

density.reflected(x, lower = -Inf, upper = Inf, ...)

The arguments of the function density.reflected are described as follows:
x: a numeric vector of data values.
lower: lower end point of the support of the variable from which x is supposed to come, default -Inf.
upper: upper end point of the support of the variable from which x is supposed to come, default Inf.
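The idea of reflection at the borders can be illustrated with base R alone. The helper reflected_density below is a hypothetical sketch for a variable bounded on both sides; it is not the implementation of density.reflected, whose internals are not described here. It mirrors the sample at both end points, estimates the density of the augmented sample and rescales the estimate on the support.

```r
# Boundary reflection on [lower, upper] (hypothetical helper, finite bounds only).
reflected_density <- function(x, lower, upper, n = 512) {
  bw  <- bw.nrd0(x)                            # bandwidth from the original sample
  aug <- c(x, 2 * lower - x, 2 * upper - x)    # mirror the data at both borders
  est <- density(aug, bw = bw, from = lower, to = upper, n = n)
  est$y <- 3 * est$y                           # aug holds three times as many points
  est
}

# Trapezoidal rule, used to check how much probability mass lies on the support.
trapz <- function(x, y) sum((y[-1] + y[-length(y)]) / 2 * diff(x))

set.seed(4)
x <- runif(500)                                # bounded variable on [0, 1]
plain <- density(x, from = 0, to = 1, n = 512)
refl  <- reflected_density(x, 0, 1)
mass_plain <- trapz(plain$x, plain$y)
mass_refl  <- trapz(refl$x, refl$y)
print(c(plain = mass_plain, reflected = mass_refl))
```

The plain estimate leaks part of its mass outside [0, 1] and is therefore biased downwards near the borders, whereas the reflected estimate folds that mass back into the support.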
7 The latter is true except for two small changes. The density.reflected function (i) always omits NAs and (ii) significantly concentrates the kernel density around the unique observed value in degenerate samples.

random.function(n = 1, f, lower = -Inf, upper = Inf, kind = "density")

The arguments of the function random.function are described as follows:
kind: a character string identifying the kind of function used to identify the distribution, either "density" (default) or "cumulative".
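One generic way to draw from a distribution known only through its density is to invert the cumulative distribution function numerically. The sketch below does this with integrate and uniroot for a density with bounded support; the helper sample_from_density is hypothetical and is offered only as an illustration of the principle, not as the algorithm used by random.function.

```r
# Inverse-CDF sampling from a density on [lower, upper] (hypothetical helper).
sample_from_density <- function(n, dens, lower, upper) {
  cdf   <- function(q) integrate(dens, lower, q)$value
  total <- cdf(upper)                          # normalising constant
  vapply(runif(n), function(u) {
    # Solve F(q) = u for q on the support.
    uniroot(function(q) cdf(q) / total - u,
            lower = lower, upper = upper, tol = 1e-8)$root
  }, numeric(1))
}

set.seed(5)
dens  <- function(x) 2 - 2 * x                 # null density used for risk76.1929
draws <- sample_from_density(300, dens, 0, 1)
print(summary(draws))
```

Each draw requires a root search and several numerical integrations, which illustrates why such a universal sampler is computationally expensive compared with a dedicated random generator.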
To exemplify how these functions work and to show their usefulness, Figure 2 displays graphically (i) the different kernel estimates that are obtained using functions density and density.reflected for a bounded variable (left panel) and (ii) how a sample can be drawn using random.function for a random variable even though its random generator is not available in R (right panel). The code used to generate Figure 2 is also provided.

A comparison of the ECDF and EKF approaches
Since presenting only the proportion of times the p values are under 0.05 in each scenario can hide relevant issues, the actual p value distributions have been displayed for a sample of scenarios in order to provide further insights and to reinforce the comments that are derived from the chosen summary statistic. In particular, as a consequence of the specifications employed in the simulation exercises, within the calibration scenarios a test is considered to have a good size when its p value distribution is approximately uniform. On the other hand, in sensitivity scenarios the clustering of p values around zero will be an indicator of proper power of the corresponding test.

Calibration of the tests
To analyze the calibration of the tests, we study for some instances to what extent the size of the test (the actual proportion of times that a true null hypothesis is rejected) equals the nominal significance level of the test. This is a standard procedure that in the current study has been performed in scenarios where the samples come from (i) a uniform distribution on the interval [0, 1], U(0, 1), (ii) a standard normal distribution, N(0, 1), (iii) an exponential distribution of rate (and mean) equal to 1, Exp(1), and (iv) a logNormal distribution whose logarithm has mean equal to 1 and standard deviation equal to 1, logN(1, 1).
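A calibration check of this kind can be sketched for a single test and scenario as follows. The number of replications, the sample size and the seed are illustrative assumptions and are much smaller than in the simulation exercise reported here; only ks.test from package stats is used.

```r
# Empirical size of ks.test in an N(0, 1) scenario: the null hypothesis is true,
# so about 5% of the p values should fall below 0.05.
set.seed(6)
n_rep <- 500                         # illustrative number of simulated samples
n     <- 50                          # illustrative sample size
p_vals  <- replicate(n_rep, ks.test(rnorm(n), "pnorm")$p.value)
size_05 <- mean(p_vals < 0.05)
print(size_05)

# Under good calibration the p values are roughly uniform on (0, 1),
# so their deciles should lie close to 0.1, 0.2, ..., 0.9.
print(round(quantile(p_vals, probs = seq(0.1, 0.9, by = 0.1)), 2))
```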
Focusing on the outputs of Table 1 (see footnote 9), the first issue that comes to our attention is the bad behavior of the truncgof functions as a whole. Indeed, analyzing the p values attained for the U(0, 1) scenarios (see also Figure 3, where the default density.reflected kernel density estimates of p values for a group of U(0, 1) scenarios are presented), it seems that these functions are not useful for bounded variables. The truncgof package produces (almost) systematically p values equal to 1 for three of its functions (ks.test from package truncgof, v.test and w2.test) and heavily clustered distributions, around 0 and 1 respectively, for the ad.test and ad2.test functions from package truncgof. In the remaining scenarios the picture for the truncgof functions is also quite discouraging: their frequency in rejecting the null hypotheses is clearly above the expected figure of 0.05. The only function of this package that shows reasonable figures for several of the (unbounded) variables is ad.test from package truncgof, which registers acceptable rejection rates for large samples in N(0, 1) and logN(1, 1) scenarios and adequate rates in the Exp(1) scenarios. In Exp(1) scenarios, there is another function of the package, v.test, that also shows acceptable figures. The same conclusions can be drawn from Figures 4, 5 and 6, where the default density.reflected kernel estimates of the p value distributions have been presented for a sample of scenarios.
The above results for the truncgof package functions are really unexpected. Although the truncgof package (Wolter 2012) is intended to deal with the issue for left truncated distributions, looking at the theoretical paper that supports this package (Chernobai et al. 2005) there are no apparent reasons for this odd behavior. Indeed, except for a constant factor (which is a function of the number of observations, n, of the sample to be tested), the statistics that are (should be) implemented in truncgof collapse to the non-truncated ones introduced in Equations 1-4 when the threshold is kept at its default option (see footnote 10).
For the rest of the functions, the results highlight that the fan.test function clearly has an over-tendency to accept the null hypothesis, an unambiguous sign of miscalibration of the test (at least in its default options). This miscalibration is even exacerbated in the scenarios with right-asymmetric distributions (see Figures 3 to 6). Finally, as the results in Table 1 show and the pictures in Figures 3 to 6 confirm, the outcomes for the rest of the functions are adequate. As a rule, nevertheless, it could be said that the ks.test function from package stats and the ad.test function from package ADGofTest show a test size slightly smaller than expected and that dgeometric.test shows a size slightly greater.
9 The numbers in the table must be regarded as approximations to the actual sizes of the tests. Obviously, different figures would have been obtained with a different set of simulated samples. As a reference, and by comparison with a different set of simulations (not presented here), a variation as large as ±0.02 in the estimation of the size of a test might be considered as not unusual. Hence, those tests whose values in the tables vary within the range [.03, .07] could be regarded, in general, as well-calibrated.
10 In their default option, the truncgof tests set the threshold at -Inf. (Although this is not documented in the help file of the package, it can be easily observed by looking inside the functions.) This implies that the estimated cumulative distribution function is zero at the threshold and, consequently, the stated equivalence; see Tables 1 and 2 in Chernobai et al. (2005, pp. 20-21).

Sensitivity of the tests
The power of a statistical test is the probability that the test rejects the null hypothesis when the null hypothesis is indeed false. To study how sensitive the rejection rules of the tests are to false null hypotheses, we study some scenarios where their discriminant power is challenged. That is, we consider situations in which the random distributions stated in the null hypotheses are chosen to be close (in shape) to the actual distributions generating the data. More specifically, we consider pairs where the actual and null hypothesis densities are, respectively: (i) a Beta with both shape parameters equal to 1.3, Be(1.3, 1.3), and a [0, 1] uniform distribution, U(0, 1); (ii) a standard Cauchy, Ca(0, 1), and a N(0, σ), where σ is the standard deviation of the sample; (iii) a Gamma with both shape and rate equal to 0.9, Ga(0.9, 0.9), and an Exp(1); and (iv) a logN(1, 1) and a Ga(α̂, β̂), where α̂ and β̂ are respectively the shape and rate MLE.
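The sensitivity scenarios follow the same simulation pattern as the calibration ones, except that the data are drawn from a distribution different from the one stated in the null hypothesis. The sketch below estimates the rejection rate of ks.test from package stats in the Ca(0, 1) versus N(0, σ) setting for three sample sizes. The helper rejection_rate, the number of replications and the sample sizes are illustrative assumptions and do not reproduce the design of the simulation exercise.

```r
# Rejection rate of ks.test when the data are standard Cauchy and the null
# is a normal with mean 0 and the sample standard deviation (hypothetical helper).
rejection_rate <- function(n, n_rep = 200, alpha = 0.05) {
  mean(replicate(n_rep, {
    x <- rcauchy(n)
    ks.test(x, "pnorm", mean = 0, sd = sd(x))$p.value < alpha
  }))
}

set.seed(7)
sizes <- c(20, 50, 200)
rates <- sapply(sizes, rejection_rate)
names(rates) <- paste0("n=", sizes)
print(rates)
```

Note that plugging the sample standard deviation into the null hypothesis makes the reported p values approximate, since the reference distribution of the KS statistic assumes a fully specified null.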
The summary outcomes of the sensitivity analysis are presented in Table 2. Focusing firstly on the Be(1.3, 1.3) − U(0, 1) scenarios, the non-usefulness of truncgof tests for bounded random variables is confirmed (see also Figure 7). In line with previous results, they systematically produce extreme p values irrespective of the sample. Regarding the rest of the tests, we observe (both in Table 2 and in Figure 7) that Fan's test shows again the very conservative behavior demonstrated in the calibration analysis. This conservative behavior also occurs for small-size samples in the remaining tests, although at a lower level. These three tests are not equivalent, however. The ad.test function from package ADGofTest looks preferable to ks.test from package stats, and dgeometric.test is the function showing the highest power in this case, although its figures for really small sample sizes are indeed modest.
The Ca(0, 1) − N(0, σ) scenarios are the ones with the greatest differences between the actual and null hypothesis distributions. This is reflected in the rejection rates of all the tests, which show really high figures in all the cases, except for the smallest sample sizes. The ks.test from package stats and ad.test from package ADGofTest functions are the most reluctant to correctly reject the null hypothesis with small sizes, whereas the truncgof functions seem to have the greatest powers now (see Table 2 and Figure 8). This superiority of truncgof functions is, however, more apparent than real in view of the N(0, 1) outcomes in Table 1. It is also worth noting the number of simulations for which no solution is offered by some functions, mainly of the truncgof package. The functions ad.test and ad2.test from package truncgof produced an error 221, 546 and 635 times for, respectively, samples of sizes 100, 200 and 500. Likewise, the fan.test function generated a total of seven errors.
Focusing now on the Ga(0.9, 0.9) − Exp(1) scenarios, we observe that in general all the tests experience great difficulties in discriminating between the actual distribution generating the sample and the null hypothesis distribution. This is especially true if we take into account (and discount) the results obtained in Section 5.1 for Exp(1) scenarios (see Table 1 and Figure 5) for some of the truncgof functions (ad2.test, ks.test and w2.test), which show a clear over-tendency to reject the real Exp(1) null hypothesis. EKF tests do not offer this time an alternative to be considered. In this case the EKF tests are the ones showing the lowest powers.
In practice, fan.test is unable to discriminate between both models and dgeometric.test is, unlike the rest of the functions, not able to significantly improve its power in the largest [...]
The picture for the logN(1, 1) − Ga(α̂, β̂) scenarios serves to reinforce the conclusions reached after analyzing the Be(1.3, 1.3) − U(0, 1) and Ca(0, 1) − N(0, σ) scenarios. Discounting again for the over-tendency of truncgof functions to reject the true null hypothesis, massively present in logN(1, 1) scenarios (see Table 1), it becomes evident that the superior power that truncgof tests exhibit in small samples is illusory. The possibilities that the Gamma distribution offers to mimic logNormal models are reflected in the lower powers that all the tests show in small samples, which nevertheless grow significantly as sample sizes increase. They all need a relatively large number of observations to properly reject the null hypothesis. The worst power is registered by the fan.test function, without doubt a consequence of its over-tendency to accept null hypotheses. On the other hand, regarding the outcomes for ks.test from package stats, ad.test from package ADGofTest and dgeometric.test, we observe that the three functions show similar power, with dgeometric.test presenting nonetheless greater power in medium-size samples. Finally, in these scenarios, it is also worth noting that some of the functions (ad.test and ad2.test from package truncgof) experience an error in some of the simulations.

Speeds of the tests
Although the speed of the studied functions is not of concern when just a handful of tests must be performed, the computational burden could become an issue when we are interested in carrying out hundreds of thousands (as in our simulation exercise) or millions of tests. So, this section is devoted to analyzing the time spent by the different functions during the simulation exercise.
The first issue that clearly emerges in this analysis is that the analyzed functions, when used with their default options, can be clustered into four groups in terms of speed. In an analysis by scenarios (see Figure 11), the first result that stands out is the regularity of the speeds of the sup-tests and dgeometric.test functions, which do not vary significantly among scenarios and sample sizes 16; the speeds of the truncgof-tests and fan.test functions, on the contrary, vary across both scenarios and sample sizes. Focusing on the truncgof-tests and fan.test groups of functions, Figure 11 highlights the great similarities that their speed curves show between scenarios with the same null hypothesis. The only exception to this rule seems to be in the Ca (
To try to give an answer to this question, the actual p values corresponding to the samples generated in our simulation exercise have also been computed 17 and compared to the ones obtained with the default option. A summary of the comparisons made between approximate and actual p values is offered in Table 3. In particular, for each scenario, Table 3 reports (i) the proportion of times that the default option approximation leads to the correct decision at the usual significance levels (α = 0.1, α = 0.05 and α = 0.01), (ii) the correlation between actual and approximate p values and (iii) the mean and standard deviation of the absolute differences between actual and approximate p values.
Looking at the numbers in Table 3, it could be argued that the outcomes that the dgeometric.test function produces with default options can be considered adequate. The differences observed, both in absolute values and in terms of the proportion of times that the dgeometric.test function with default options does not reach the correct decision, are indeed within the limits of the uncertainty linked with the process. For example, with a significance level of α = 0.05, the percentage of decision coincidences in calibration scenarios is as large as 98 per cent, whereas, as we noted in Section 5.1, the volatility in the estimation of the p values is as large as ±0.02, even for well-established tests like the KS test (ks.test from package stats). Deviations like the ones observed are therefore reasonable.
As a final note of interest, it is worth mentioning that, as expected, when the null hypothesis is clearly rejected by the data (as happened in the Ca(0, 1) − N(0, σ) scenarios), the estimated p values are less sensitive to the number of simulated samples employed.
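The three kinds of summary measures reported in Table 3 can be computed from two vectors of p values along the following lines. This is only a sketch: the function name compare_p_values is hypothetical and the input values are made up for the example; they are not results from the simulation exercise.

```r
# Compare approximate p values with reference ("actual") p values.
compare_p_values <- function(p_approx, p_actual,
                             alphas = c(0.10, 0.05, 0.01)) {
  stopifnot(length(p_approx) == length(p_actual))
  # Proportion of samples for which both p values lead to the same decision
  agreement <- sapply(alphas, function(a) {
    mean((p_approx < a) == (p_actual < a))
  })
  names(agreement) <- paste0("alpha=", alphas)
  abs_diff <- abs(p_approx - p_actual)
  list(
    agreement   = agreement,
    correlation = cor(p_approx, p_actual),
    mean_diff   = mean(abs_diff),
    sd_diff     = sd(abs_diff)
  )
}

# Toy input: invented p values, used only to exercise the function
p_actual <- c(0.003, 0.040, 0.060, 0.200, 0.550, 0.900)
p_approx <- c(0.010, 0.030, 0.040, 0.220, 0.500, 0.880)
res <- compare_p_values(p_approx, p_actual)
print(res)
```

Decisions are compared with a strict inequality (p < α); a p value exactly equal to α is therefore counted as a non-rejection in both vectors.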

Conclusions
Table 3: Differences in performance of the dgeometric.test function with default options and with n.sim = 1000. Elaborated using R version 3.1.0 (R Core Team 2015), GoFKernel 2.0-3 (Pavía 2015) and MASS 7.3-31 (Venables and Ripley 2002).
This paper introduces a new solution for non-parametric goodness-of-fit testing, based on a test statistic that measures the distance in absolute value (i.e., the area) between a kernel density estimate computed from the observations and a null hypothesis (one-dimensional continuous) density function. This test, in addition to Fan's test (Fan 1994), has been implemented in the GoFKernel package, which is presented in this paper. A comparative analysis with the other general non-parametric goodness-of-fit tests currently available in R reveals that (i) the tests implemented in the truncgof package are not useful for bounded variables and, moreover, show a clear over-tendency to reject the null hypothesis for unbounded variables; and that, on the contrary, (ii) the fan.test function shows an over-tendency to accept the null hypothesis, at least in its default option (where the bandwidth is computed using the dpik function of the KernSmooth package). The analysis also shows that (iii) ks.test in package stats, ad.test in package ADGofTest and dgeometric.test represent the best alternatives, with the last function more frequently showing superior power in samples of medium and large sizes, although at the cost of a higher computational burden (see Section 5.3).
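To make the idea behind the statistic concrete, the following base R sketch approximates the area between a kernel density estimate and a null density, and attaches a Monte Carlo p value obtained by simulating from the null. It is not the implementation of dgeometric.test: the helper names l1_distance and l1_test are hypothetical, the bandwidth is the default of density, and the argument n_sim merely plays the role of the n.sim setting mentioned in the caption of Table 3. The shifted normal alternative is a toy case chosen for illustration, not one of the scenarios studied in this paper.

```r
# Area between a kernel density estimate of x and a null density f0,
# approximated by the trapezoidal rule on the grid returned by density().
l1_distance <- function(x, f0) {
  kde  <- density(x)                    # Gaussian kernel, default bandwidth
  gap  <- abs(kde$y - f0(kde$x))        # pointwise absolute difference
  step <- diff(kde$x)                   # grid spacing
  sum(step * (head(gap, -1) + tail(gap, -1)) / 2)
}

# Monte Carlo p value: the observed distance is compared with the distances
# obtained from n_sim samples drawn from the null (r0 is its sampler).
l1_test <- function(x, f0, r0, n_sim = 100) {
  observed  <- l1_distance(x, f0)
  simulated <- replicate(n_sim, l1_distance(r0(length(x)), f0))
  list(statistic = observed,
       p.value   = (1 + sum(simulated >= observed)) / (n_sim + 1))
}

set.seed(42)
x_null <- rnorm(100)             # sample that follows the N(0, 1) null
x_alt  <- rnorm(100, mean = 1)   # toy alternative: shifted normal
res_null <- l1_test(x_null, dnorm, rnorm)
res_alt  <- l1_test(x_alt,  dnorm, rnorm)
print(c(null = res_null$p.value, alternative = res_alt$p.value))
```

The sketch ignores boundary effects, so it is only suitable for unbounded variables; for bounded variables a boundary-corrected estimate, such as the one provided by density.reflected, would be needed. It also evaluates the area only over the range covered by the kernel estimate.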
In summary, leaving aside computational costs and combining the statistical conclusions of Sections 5.1 and 5.2, it seems that: (i) to test uniformity, the dgeometric.test function offers the preferable test for samples of small and medium size, with ks.test from package stats and ad.test from package ADGofTest being competitive as soon as the sample size grows; (ii) to test normality, all three recommended tests offer good performance, with the dgeometric.test function being superior in samples of small size; (iii) to test goodness-of-fit to an exponential distribution, no solution is adequate for small sample sizes, with both versions of the AD test (ad.test from package ADGofTest and ad.test from package truncgof) improving their numbers starting from samples of medium size; and (iv) to test goodness-of-fit to a logNormal distribution, the dgeometric.test function is slightly preferable to ks.test from package stats and ad.test from package ADGofTest, with no test offering good discriminant power between the logNormal distribution and the Gamma distribution in samples of small size.
The GoFKernel package is, of course, incomplete. As part of future work, it should grow by incorporating other EKF-based tests and/or improving the flexibility of its methods.