vsgoftest: An R Package for Goodness-of-Fit Testing Based on Kullback-Leibler Divergence

The R-package vsgoftest performs goodness-of-fit (GOF) tests, based on Shannon entropy and Kullback-Leibler divergence, developed by Vasicek (1976) and Song (2002), of various classical families of distributions. The theoretical framework of the so-called Vasicek-Song (VS) tests is summarized and followed by a detailed description of the different features of the package. The power and computational time performances of VS tests are studied through their comparison with other GOF tests. Application to real datasets illustrates the easy-to-use functionalities of the vsgoftest package.


Introduction
Goodness-of-fit (GOF) tests constitute a classical tool for assessing the compatibility of data with a theoretical (probability) distribution. The present work proposes a package for the R statistical computing environment (R Core Team 2017) performing GOF tests based on Shannon entropy and Kullback-Leibler divergence, together with a methodological guide and applications.
Precisely, we consider fitting numeric (real-valued) data either to a unique distribution, the so-called simple null hypothesis test

$$H_0 : P = P_0(\theta) \quad \text{against} \quad H_1 : P \neq P_0(\theta), \tag{1}$$

or to a parametric family, the so-called composite null hypothesis test

$$H_0 : P \in \mathcal{P}_0(\Theta) \quad \text{against} \quad H_1 : P \notin \mathcal{P}_0(\Theta). \tag{2}$$

The set $\mathcal{P}_0(\Theta) = \{P \in \mathcal{D} : P = P_0(\theta),\ \theta \in \Theta\}$, with $\Theta \subset \mathbb{R}^d$, is a parametric subfamily of the set $\mathcal{D}$ of all probability distributions absolutely continuous with respect to Lebesgue measure on $\mathbb{R}$, i.e., probability distributions with a density function. The decision is based on the observation $x_1^n = (x_1, \dots, x_n)$ of a sample $X_1^n = (X_1, \dots, X_n)$ of size $n$ of independent and identically distributed random variables drawn from $P \in \mathcal{D}$.
Classically, a GOF test procedure is derived by computing some distance-like functional between the observations and the null distribution, or family of distributions, the null hypothesis being rejected when the distance is larger than a critical value. Kolmogorov-Smirnov, Cramér-von Mises and Anderson-Darling tests constitute some of the most commonly used GOF tests.
Their test statistics measure the discrepancy between the empirical cumulative distribution function of the sample and the cumulative distribution function of the null distribution; these tests are referred to as EDF tests in the following; see Stephens (1974). The tests in this paper are based on the Kullback-Leibler (KL) divergence of the density of the sample with respect to the null density.
GOF tests based on KL divergence were introduced by Vasicek (1976) for testing normality. Vasicek's normality test relies on the maximum entropy property satisfied by the normal distribution: among all distributions with a density and a given finite variance, Shannon entropy is maximized by the normal distribution. Vasicek's test statistic is a monotone function of the entropy difference between the null normal distribution and the observed one. This has been subsequently extended to GOF tests of non-categorical data for numerous families of distributions satisfying a maximum entropy property, referred to as maximum entropy (ME) distributions; see Section 2 for references. GOF tests based on entropy differences are known to have higher power than classical GOF tests in numerous cases; see Section 4.1.

Song (2002) considers GOF tests based on KL divergence, for a large class of distributions including all classical distribution families. The test statistic is an estimate of the KL divergence between the sample and the null distributions. It is asymptotically normally distributed. When applied to ME distributions, it is equal to the difference between the entropies of the null distribution and the sample one, yielding the same decision rule as Vasicek's test.

This paper presents the implementation of Vasicek and Song tests (VS tests) for various families of distributions: uniform, normal, log-normal, exponential, gamma, Weibull, Pareto, Fisher, Laplace and beta distributions. For further details on the theoretical aspects of VS tests, see Girardin and Lequesne (2017), in which a unifying framework for tests based on entropy difference and KL divergence is provided.
Numerous R packages perform GOF tests for various families of distributions. The functions chisq.test, ks.test and shapiro.test of the stats package perform respectively the chi-squared test of adequacy to a discrete distribution, the Kolmogorov-Smirnov GOF test for any theoretical continuous distribution and the Shapiro-Wilk normality test. The packages goftest (Faraway, Marsaglia, Marsaglia, and Baddeley 2015) and goft (Gonzalez-Estrada and Villasenor-Alva 2016) perform respectively Cramér-von Mises and Anderson-Darling GOF tests, and tests based on the ratios of variance and other moment estimators. KScorrect (Novack-Gottshall and Wang 2016) performs the Lilliefors-corrected Kolmogorov-Smirnov GOF test. Numerous GOF tests of the exponential or two-parameter Weibull distributions are available in EWGoF (Krit 2015), while nortest and normtests are dedicated to testing normality. The dbEmpLikeGOF package (Miecznikowski, Vexler, and Shepherd 2013) proposes GOF normality and uniformity tests based on empirical likelihood ratios. These tests are closely related to VS tests; similarities and differences between them are highlighted in the following sections.
The test procedure implemented in vsgoftest uses either the asymptotic distribution of the test statistic or Monte-Carlo simulation, depending on the sample size or the user's choice. Optional arguments are included for handling particular situations such as samples with numerous ties. They also contribute to making the procedure flexible and fully parameterizable. Besides these practical aspects, the paper presents a comprehensive review of the literature dealing with power properties of VS tests. Monte-Carlo simulations are conducted to illustrate their performance when applied to discriminate between close distributions.
The paper is organized as follows. The theoretical framework of GOF tests based on Shannon entropy difference and KL divergence is briefly presented in Section 2. The functionalities of vsgoftest are presented in Section 3. The tests performed by vsgoftest are compared to other GOF tests in Section 4. More precisely, power comparisons to classical GOF tests are presented in Section 4.1; Section 4.2 focuses on the comparison of vsgoftest and dbEmpLikeGOF test procedures, which rely on very close theoretical frameworks but significantly differ in some of their features. Finally, applications to real data in Section 5 illustrate the usage of the proposed functionalities.

Entropy difference and KL divergence based GOF tests
The Shannon entropy of a distribution $P$ with density function $p$ on $\mathbb{R}$ has been defined in Shannon (1948) as

$$S(P) := -\int_{\mathbb{R}} p(x) \log p(x)\, dx. \tag{3}$$

Entropy measures the uncertainty or variability of a distribution. The maximum entropy principle under moment constraints, or ME method, favours distributions with highest entropy for their highest degree of uncertainty; see Shannon (1948) and Jaynes (1957). Among all distributions supported by a given finite length interval $I$ in $\mathbb{R}$, entropy is maximum and equals $\log |I|$ for the uniform distribution, where $|I|$ denotes the length of $I$. Hence, the entropy difference $\log |I| - S(P)$ can be thought of as a distance-like measure between $P$ and the uniform distribution.
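Similarly, among all continuous distributions supported within $\mathbb{R}$ with mean $\mu$ and variance $\sigma^2$, Shannon entropy is maximum for the normal distribution $\mathcal{N}(\mu, \sigma^2)$ and equals

$$S(\mathcal{N}(\mu, \sigma^2)) = \log\left(\sigma\sqrt{2\pi e}\right). \tag{4}$$

The entropy difference $S(\mathcal{N}(\mu, \sigma^2)) - S(P)$ is nonnegative and thus defines a distance-like measure between any distribution with mean $\mu$ and variance $\sigma^2$ and the $\mathcal{N}(\mu, \sigma^2)$ distribution. Based on this property, Vasicek (1976) derives a normality test, with a test statistic expressed in terms of entropy differences, defined as follows:

$$T_{mn} := \frac{\exp(V_{mn})}{\hat\sigma_n} = \sqrt{2\pi e}\, \exp\left[V_{mn} - S(\mathcal{N}(\hat\mu_n, \hat\sigma_n^2))\right],$$

where

$$\hat\mu_n := \frac{1}{n}\sum_{i=1}^n X_i \quad \text{and} \quad \hat\sigma_n^2 := \frac{1}{n}\sum_{i=1}^n (X_i - \hat\mu_n)^2$$

are the empirical estimators of respectively the mean and the variance of the sample $X_1^n := (X_1, \dots, X_n)$, $X_{(1)} \le \dots \le X_{(n)}$ denotes the order statistics associated to $X_1^n$ and

$$V_{mn} := \frac{1}{n}\sum_{i=1}^n \log\left(\frac{n}{2m}\left[X_{(i+m)} - X_{(i-m)}\right]\right) \tag{5}$$

is the non-parametric Vasicek estimator of $S(P)$ based on spacings, with $X_{(i)} = X_{(1)}$ if $i < 1$ and $X_{(i)} = X_{(n)}$ if $i > n$; the window size $m \in \mathbb{N}^*$ is smaller than $n/2$.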
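As a minimal base-R sketch of the spacing-based estimator (5), the helper `vasicek_entropy` below is hypothetical (it is not the package's own `entropy.estimate`); it clamps out-of-range order statistics to the sample extremes, as described above. The seed, sample size and window are arbitrary choices made for illustration.

```r
# Hypothetical helper: Vasicek's spacing-based entropy estimate V_mn.
vasicek_entropy <- function(x, m) {
  n <- length(x)
  stopifnot(is.numeric(x), m >= 1, m < n / 2)
  xs <- sort(x)
  i <- seq_len(n)
  upper <- xs[pmin(i + m, n)]  # X_(i+m), replaced by X_(n) when i + m > n
  lower <- xs[pmax(i - m, 1)]  # X_(i-m), replaced by X_(1) when i - m < 1
  mean(log(n / (2 * m) * (upper - lower)))
}

set.seed(1)
samp <- rnorm(500)                    # sample from N(0, 1)
v_hat <- vasicek_entropy(samp, m = 10)
s_norm <- log(sqrt(2 * pi * exp(1)))  # entropy of N(0, 1), about 1.4189
print(c(estimate = v_hat, theoretical = s_norm))
```

The estimate is expected to fall somewhat below the theoretical value, since the Vasicek estimator is negatively biased for finite window sizes.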
The test statistic has been extended to various families of ME distributions under moment constraints; see Dudewicz and Van Der Meulen (1981), Ebrahimi, Habibullah, and Soofi (1992), Choi and Kim (2006), Mergel (1999), Mudholkar and Tian (2002), among many others. A unifying framework for any exponential family of distributions is proposed in Girardin and Lequesne (2017), with asymptotic properties, consistency and application to biology; see also Lequesne (2013), Lequesne (2015b) and Lequesne (2015a) for power efficiencies, GOF tests of Pareto distributions and extension to generalized entropies.
The Kullback-Leibler (KL) divergence of a distribution $P$ with respect to another one $Q$ is defined as

$$K(P|Q) := \int_{\mathbb{R}} p(x) \log\frac{p(x)}{q(x)}\, dx \tag{6}$$

if $P$ is absolutely continuous with respect to $Q$, with respective densities $p$ and $q$, and as $+\infty$ if not; see Kullback and Leibler (1951). The KL divergence is linked to Shannon entropy through the relation

$$K(P|Q) = -S(P) - \int_{\mathbb{R}} p(x) \log q(x)\, dx. \tag{7}$$

The KL divergence is not a mathematical distance because it lacks both symmetry and the triangle inequality, but it satisfies $K(P|Q) \ge 0$, with $K(P|Q) = 0$ if and only if $P = Q$, and thus constitutes a natural measure of discrepancy for GOF tests. Song (2002) proposes GOF tests based on KL divergence for either simple (1) or composite (2) null hypothesis. Precisely, thanks to (7), the test statistic $I_{mn}$ is the estimator of $K(P|P_0(\theta))$, defined by

$$I_{mn} := -V_{mn} - \frac{1}{n}\sum_{i=1}^n \log p_0(X_i; \hat\theta_n), \tag{8}$$

where $V_{mn}$ is the Vasicek estimator (5) of $S(P)$, $p_0(\cdot\,; \theta)$ denotes the density of $P_0(\theta)$, and $\hat\theta_n$ is either the maximum likelihood estimator (MLE) of $\theta$, i.e., a maximizer over $\Theta$ of the log-likelihood $\sum_{i=1}^n \log p_0(X_i; \theta)$, or $\theta$ itself in case of a simple null hypothesis (1).
The KL divergence $K(P|Q)$ for a maximum entropy distribution $Q$ under moment constraints reduces to the entropy difference $S(Q) - S(P)$ for all $P$ satisfying the same moment constraints; see Csiszár (1975). This Pythagorean equality allowed Girardin and Lequesne (2017) to establish that entropy difference GOF tests for ME distributions coincide with Song tests; we will refer to these tests as Vasicek-Song tests and keep on denoting them by VS tests. In particular, Vasicek and Song normality test statistics are linked through the equality

$$I_{mn} = \log\sqrt{2\pi e} - \log T_{mn}, \quad \text{with } T_{mn} = \exp(V_{mn})/\hat\sigma_n,$$

yielding identical decision rules.
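The link between the two normality statistics can be checked numerically. The sketch below assumes that Vasicek's statistic takes the form $T_{mn} = \exp(V_{mn})/\hat\sigma_n$ and that the Song statistic is $-V_{mn}$ minus the mean log-likelihood at the MLE; the helper `vasicek_entropy`, the seed and the window size are illustrative assumptions, not part of the package.

```r
# Hypothetical helper: Vasicek's spacing-based entropy estimate V_mn.
vasicek_entropy <- function(x, m) {
  n <- length(x)
  xs <- sort(x)
  i <- seq_len(n)
  mean(log(n / (2 * m) * (xs[pmin(i + m, n)] - xs[pmax(i - m, 1)])))
}

set.seed(3)
x <- rnorm(60, mean = 2, sd = 3)
m <- 3

mu_hat <- mean(x)
sigma_hat <- sqrt(mean((x - mu_hat)^2))  # MLE of the standard deviation (divides by n)
v_mn <- vasicek_entropy(x, m)

# Song statistic: minus V_mn minus the mean log-likelihood at the MLE.
i_mn <- -v_mn - mean(dnorm(x, mean = mu_hat, sd = sigma_hat, log = TRUE))

# Vasicek statistic (assumed form) and the resulting identity.
t_mn <- exp(v_mn) / sigma_hat
print(all.equal(i_mn, log(sqrt(2 * pi * exp(1))) - log(t_mn)))
```

Since one statistic is a decreasing function of the other, rejecting for large $I_{mn}$ is equivalent to rejecting for small $T_{mn}$.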
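Based on the asymptotic properties of $V_{mn}$ proven by Dudewicz and Van Der Meulen (1981) for testing uniformity, Song (2002) establishes the asymptotic behavior of $I_{mn}$, independently of the null hypothesis: $I_{mn}$ is consistent and asymptotically normally distributed provided the null distribution belongs to the class

$$\mathcal{F} = \left\{ P \in \mathcal{D} : \sup_{x :\, 0 < F(x) < 1} F(x)\,(1 - F(x))\,\frac{|p'(x)|}{p(x)^2} \le \gamma \right\}, \tag{9}$$

where $F$ is the cumulative distribution function of $P$ and $p$ its density with derivative $p'$ (almost everywhere). The class $\mathcal{F}$ contains the most classical distributions such as uniform ($\gamma = 0$), normal, exponential and gamma ($\gamma = 1$), Fisher ($\gamma = (2 + \nu_2)/\nu_2$, where $\nu_2$ is the second degree of freedom), Pareto ($\gamma = (\mu + 1)/\mu$, where $\mu$ is the shape parameter), etc. For $P \in \mathcal{F}$, if

$$\frac{m}{\log n} \to \infty \quad \text{and} \quad \frac{m (\log n)^{2/3}}{n^{1/3}} \to 0 \quad \text{as } n \to \infty, \tag{10}$$

then

$$\sqrt{6mn}\,\left[I_{mn} - \log(2m) + \psi(2m)\right] \xrightarrow{d} \mathcal{N}(0, 1), \tag{11}$$

where $\psi$ is the digamma function. The asymptotic bias $\log(2m) - \psi(2m)$ of $I_{mn}$ is that of $-V_{mn}$. Song (2002) suggests a bias correction in the asymptotic distribution (11) for moderate sample sizes:

$$\sqrt{6mn}\,\left[I_{mn} - b_{mn}\right] \xrightarrow{d} \mathcal{N}(0, 1), \tag{12}$$

where

$$b_{mn} := \log(2m) - \log(n) - \psi(2m) + \psi(n+1) + \frac{2m}{n} R_{2m-1} - \frac{2}{n}\sum_{i=1}^{m} R_{i+m-2},$$

with $R_m := \sum_{j=1}^m 1/j$ (and $R_0 = 0$). From (12), an asymptotic p-value for the related VS test is given by

$$p = 1 - \Phi\left(\sqrt{6mn}\,\left[I_{mn}(x_1^n) - b_{mn}\right]\right), \tag{13}$$

where $I_{mn}(x_1^n)$ denotes the value of the statistic $I_{mn}$ for the observations $x_1^n = (x_1, \dots, x_n)$, and $\Phi$ denotes the cumulative distribution function of the standard normal distribution. According to Song (2002), the asymptotic p-value (13) provides accurate results for sample sizes $n$ larger than 80.

For small sample sizes, Monte-Carlo simulations should be preferred. A large number $N$ of replications of $X_1^n$ drawn from $P_0(\hat\theta_n)$ (or $P_0(\theta)$ in case of a simple null hypothesis) are generated. The test statistic $I_{mn}^i$ is computed for each replication $i$, $1 \le i \le N$. The p-value is then given by the empirical mean $\frac{1}{N}\sum_{i=1}^N \mathbb{1}\{I_{mn}^i > I_{mn}(x_1^n)\}$.

Song (2002) proposes to minimize $I_{mn}$, that is to maximize $V_{mn}$, with respect to $m$, yielding the most conservative test.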
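The bias-corrected asymptotic p-value can be sketched in base R as follows. The function names `harmonic`, `bias_mn` and `asymptotic_p_value` are hypothetical; the bias term is written as minus the exact expectation of the Vasicek estimator under the uniform distribution on [0, 1], expressed through the partial harmonic sums $R_m = \sum_{j=1}^m 1/j$, which is an assumption about the exact form of the correction.

```r
# Partial harmonic sum R_k = 1 + 1/2 + ... + 1/k, with R_0 = 0.
harmonic <- function(k) if (k < 1) 0 else sum(1 / seq_len(k))

# Bias term b_mn (assumed form, see the lead-in).
bias_mn <- function(m, n) {
  r_terms <- vapply(seq_len(m), function(i) harmonic(i + m - 2), numeric(1))
  log(2 * m) - log(n) - digamma(2 * m) + digamma(n + 1) +
    2 * m / n * harmonic(2 * m - 1) - 2 / n * sum(r_terms)
}

# Asymptotic p-value: upper tail of the standardized, bias-corrected statistic.
asymptotic_p_value <- function(i_mn, m, n) {
  1 - pnorm(sqrt(6 * m * n) * (i_mn - bias_mn(m, n)))
}

# For large n the correction approaches the asymptotic bias log(2m) - psi(2m).
print(c(corrected = bias_mn(3, 1e6), asymptotic = log(6) - digamma(6)))
# A statistic equal to its bias term gives a p-value of one half.
print(asymptotic_p_value(bias_mn(3, 200), m = 3, n = 200))
```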
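The KL divergence $K(P|P_0(\theta))$ being nonnegative, values of $m$ for which $I_{mn}$ is negative are excluded, leading to choose $m$ subject to the constraint

$$V_{mn} \le -\frac{1}{n}\sum_{i=1}^n \log p_0(X_i; \hat\theta_n). \tag{14}$$

Finally, the window size proposed by Song (2002), referred to as the optimal window size, is

$$\hat m := \min\left\{ m^\star \in \operatorname*{argmax}_{1 \le m < n^{1/3 - \delta}} \left\{ V_{mn} : V_{mn} \le -\frac{1}{n}\sum_{i=1}^n \log p_0(X_i; \hat\theta_n) \right\} \right\} \tag{15}$$

for some $\delta < 1/3$, and the VS test statistic is then

$$I_{\hat m n} = -V_{\hat m n} - \frac{1}{n}\sum_{i=1}^n \log p_0(X_i; \hat\theta_n). \tag{16}$$

The upper bound $n^{1/3 - \delta}$ for the window size $m$ is chosen so that conditions (10) are fulfilled and hence that asymptotic normality (11) holds. No optimal choice of $\delta$ exists; it depends on the family of distributions of the null hypothesis; see Section 3 for details.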
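The window selection described above can be sketched as follows, here for an exponential null hypothesis. The helpers `vasicek_entropy` and `select_window` are hypothetical and only mirror the description in the text (maximize $V_{mn}$ over $1 \le m < n^{1/3-\delta}$ among the windows satisfying (14), taking the smallest maximizer); they are not the package's internal code, and the seed and $\delta$ are illustrative.

```r
# Hypothetical helper: Vasicek's spacing-based entropy estimate V_mn.
vasicek_entropy <- function(x, m) {
  n <- length(x)
  xs <- sort(x)
  i <- seq_len(n)
  mean(log(n / (2 * m) * (xs[pmin(i + m, n)] - xs[pmax(i - m, 1)])))
}

# Hypothetical helper: optimal window and VS statistic, given the negative
# mean log-likelihood of the sample under the (fitted) null distribution.
select_window <- function(x, neg_mean_loglik, delta) {
  n <- length(x)
  candidates <- seq_len(ceiling(n^(1 / 3 - delta)) - 1)  # integers m < n^(1/3 - delta)
  if (length(candidates) == 0) stop("no admissible window size")
  v <- vapply(candidates, function(m) vasicek_entropy(x, m), numeric(1))
  ok <- v <= neg_mean_loglik                             # constraint (14)
  if (!any(ok)) stop("constraint (14) fails for every window size")
  best <- which.max(ifelse(ok, v, -Inf))                 # first (smallest) maximizer
  list(window = candidates[best], statistic = neg_mean_loglik - v[best])
}

set.seed(7)
x <- rexp(200, rate = 2)
rate_hat <- 1 / mean(x)                                  # closed-form MLE of the rate
neg_mean_loglik <- -mean(dexp(x, rate = rate_hat, log = TRUE))
res <- select_window(x, neg_mean_loglik, delta = 1 / 12)
print(res)
```

By construction the returned statistic is nonnegative, which is the purpose of constraint (14).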
The package vsgoftest presented below performs VS GOF tests for several parametric families of ME distributions: uniform, normal, log-normal, exponential, Pareto, Laplace, Weibull, Fisher, gamma and beta distributions. These families of distributions, all included in the class $\mathcal{F}$ given by (9), have been chosen so that the package covers a large variety of applications. Note that the package dbEmpLikeGOF performs uniformity and normality VS tests, with an alternative choice for the window size. Precisely, the test statistic is $nI_{mn} + 1/2$, and the window $m$ is chosen, between 1 and $n^{1/2}$, minimizing $nI_{mn}$. The constraint (14) is not considered. The asymptotic distribution of $nI_{mn}$ is not used, p-values being computed from a pre-calculated table for small sample sizes or via Monte-Carlo simulation; see Miecznikowski et al. (2013) and Vexler and Gurevich (2010). This alternative methodological approach leads to different decisions that may be less reliable, particularly when applied to heavy-tailed samples. Other differences in the coding structure make vsgoftest faster than dbEmpLikeGOF, especially when Monte-Carlo simulation is performed. These points will be detailed in Section 4.2.

The package vsgoftest
The vsgoftest package provides functions for estimating Shannon entropy of absolutely continuous distributions and testing the goodness-of-fit of some theoretical family of distributions to a vector of real numbers. It also provides functions for computing the density, cumulative distribution and quantile functions of Pareto and Laplace distributions, as well as for generating samples from these distributions.
The vsgoftest package is available on CRAN mirrors and can be installed by executing the command

install.packages('vsgoftest')
Alternatively, the latest (under development) version of the vsgoftest package is also available and can be installed in R from the github repository of the project as follows:

#Package devtools must be installed
devtools:

The package is structured around two functions, entropy.estimate and vs.test; the first one computes the spacing-based estimator (5) from a numeric sample, the second one performs the Vasicek-Song GOF test for usual parametric families of distributions based on the test statistic (16). A comprehensive presentation of their usage is proposed in Sections 3.1 and 3.2, with numerous examples. Section 3.3 provides further technical information about the structure of the package.

Function entropy.estimate for estimating Shannon entropy
The function entropy.estimate computes the spacing-based estimate (5) of Shannon entropy (3) from a numeric sample. Two arguments have to be provided:
• x: the numeric sample;
• window: an integer between 1 and half of the sample size, specifying the window size of the spacing-based estimator (5).
It returns the estimate of Shannon entropy of the sample. Here is an example for a sample drawn from a normal distribution with parameters $\mu = 0$ and $\sigma^2 = 1$.
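A minimal sketch of such a call is given below; it uses only the two arguments `x` and `window` described above. The seed, the sample size and the window size are illustrative assumptions, and the call is guarded so that the chunk also runs when vsgoftest is not installed.

```r
set.seed(2)
samp <- rnorm(n = 100, mean = 0, sd = 1)  # sample from N(0, 1)
theoretical <- log(sqrt(2 * pi * exp(1))) # entropy of N(0, 1), about 1.4189

if (requireNamespace("vsgoftest", quietly = TRUE)) {
  # window must lie between 1 and half of the sample size
  estimate <- vsgoftest::entropy.estimate(x = samp, window = 8)
  print(c(estimate = estimate, theoretical = theoretical))
}
```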

Function vs.test for testing GOF to a specified model
The function vs.test performs the VS test, as described in Section 2; setting two non-optional arguments is required:
• x: the numeric sample;
• densfun: a character string specifying the theoretical family of distributions of the null hypothesis. Available families of distributions are: uniform, normal, log-normal, exponential, gamma, Weibull, Pareto, Fisher and Laplace distributions. They are referred to by the symbolic name in R of their density function. For example, set densfun = 'dnorm' to test GOF of the family of normal distributions; see Table 1 for details.
It returns an object of class htest, i.e., a list whose main components are:
• statistic: the value of the VS test statistic (16) for the sample, with optimal window size defined by (15);
• parameter: the optimal window size;
• estimate: the maximum likelihood estimate of the parameters of the null distribution (for the test (2) with composite null hypothesis);
• p.value: the p-value associated to the sample.
By default, vs.test performs the composite VS test of the family of distributions densfun for the sample x. The p-value is estimated by means of Monte-Carlo simulation if the sample size is smaller than 80, or through the asymptotic distribution (11) of the VS test statistic otherwise.
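The Monte-Carlo route can be sketched in base R for the composite normality test. This is a simplified illustration, not the package's implementation: the window size is kept fixed instead of being optimized for each replicate, and the helpers `vasicek_entropy`, `vs_stat_normal` and `mc_p_value`, as well as the number of replicates, are hypothetical choices.

```r
# Hypothetical helper: Vasicek's spacing-based entropy estimate V_mn.
vasicek_entropy <- function(x, m) {
  n <- length(x)
  xs <- sort(x)
  i <- seq_len(n)
  mean(log(n / (2 * m) * (xs[pmin(i + m, n)] - xs[pmax(i - m, 1)])))
}

# VS statistic for normality: entropy of the fitted normal minus V_mn.
vs_stat_normal <- function(x, m) {
  sigma_hat <- sqrt(mean((x - mean(x))^2))
  log(sigma_hat * sqrt(2 * pi * exp(1))) - vasicek_entropy(x, m)
}

# Monte-Carlo p-value: proportion of replicates drawn from the fitted null
# distribution whose statistic exceeds the observed one.
mc_p_value <- function(x, m, n_rep = 2000) {
  n <- length(x)
  mu_hat <- mean(x)
  sigma_hat <- sqrt(mean((x - mu_hat)^2))
  observed <- vs_stat_normal(x, m)
  simulated <- replicate(n_rep, vs_stat_normal(rnorm(n, mu_hat, sigma_hat), m))
  mean(simulated > observed)
}

set.seed(11)
p_norm <- mc_p_value(rnorm(50, mean = 2, sd = 3), m = 2)  # null hypothesis true
p_exp <- mc_p_value(rexp(50, rate = 1), m = 2)            # null hypothesis false
print(c(normal_sample = p_norm, exponential_sample = p_exp))
```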
In the following example, a normally distributed sample is simulated. The VS test rejects the null hypothesis that this sample is drawn from a Laplace distribution, but does not reject the normality hypothesis (for a significance level set to 0.05).

set.seed(5)
samp <- rnorm(50, 2, 3)
vs.test(x = samp, densfun = 'dlaplace')

For performing a simple null hypothesis GOF test, the additional argument param has to be set to a numeric vector, consistent with the parameter requirements for the null distribution. In such a case, the MLE of the parameter(s) of the null distribution need not be computed and hence the component estimate in the results is not available.

set.seed(26)
vs.test(x = samp, densfun = 'dnorm', param = c(2,3))

Vasicek-Song GOF test for the normal distribution with Mean=2, St. dev.=3

data: samp
Test statistic = 0.22196, Optimal window = 2, p-value = 0.331

If param is not consistent with the specified distribution (e.g., the standard deviation for testing a normal distribution is missing or negative), the execution is stopped and an error message is returned.
set.seed(1)
samp <- rweibull(200, shape = 1.05, scale = 1)
set.seed(2)
vs.test(samp, densfun = 'dexp', simulate.p.value = TRUE, B = 10000)

Vasicek-Song GOF test for the exponential distribution

data: samp
Test statistic = 0.10907, Optimal window = 3, p-value = 0.3504
sample estimates:
Rate
1.15047

Vasicek's estimates $V_{mn}$ are computed for all $m$ from 1 to $n^{1/3-\delta}$, where $\delta < 1/3$; the test statistic is $I_{mn}$ for $m$ the optimal window size, as defined in (15). The choice of $\delta$ depends on the family of distributions of the null hypothesis. Precisely, for Weibull, Pareto, Fisher, Laplace and beta, $\delta$ is set by default to 2/15, while for uniform, normal, log-normal, exponential and gamma, it is set to 1/12. These default settings result from extensive experimentation. Still, the user can choose another value through the optional argument delta. Note that upper-bounding the window size by $n^{1/3-\delta}$ is only required when the asymptotic normality of $I_{mn}$ is used to compute asymptotic p-values from (11). When the p-values are computed by means of Monte-Carlo simulation, this upper bound can be extended to $n/2$ by adding extend = TRUE, which may lead to a more reliable test, as illustrated below.
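To make the effect of $\delta$ and of extend concrete, the following sketch lists the candidate window sizes for a few sample sizes. The helper `window_range` is hypothetical and assumes strict upper bounds ($m < n^{1/3-\delta}$, or $m < n/2$ when the range is extended); the package's own handling of boundary cases may differ.

```r
# Hypothetical helper: candidate window sizes for a sample of size n.
window_range <- function(n, delta, extend = FALSE) {
  upper <- if (extend) n / 2 else n^(1 / 3 - delta)
  seq_len(ceiling(upper) - 1)  # integers m with 1 <= m < upper
}

w_exp_200 <- window_range(200, delta = 1 / 12)     # exponential null, n = 200
w_par_200 <- window_range(200, delta = 2 / 15)     # Pareto null, n = 200
w_lnorm_30 <- window_range(30, delta = 1 / 12)     # log-normal null, n = 30
w_lnorm_30_ext <- window_range(30, delta = 1 / 12, extend = TRUE)

print(w_exp_200)       # 1 2 3
print(w_par_200)       # 1 2
print(w_lnorm_30)      # 1 2
print(w_lnorm_30_ext)  # 1 to 14
```

With $n = 200$ and $\delta = 1/12$ the bound is $200^{1/4} \approx 3.76$, so the largest candidate is 3, which matches the optimal window reported in the exponential example above.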
set.seed(8)
samp <- rexp(30, rate = 3)
vs.test(x = samp, densfun = "dlnorm")

Vasicek-Song GOF test for the log-normal distribution

data: samp
Test statistic = 0.30717, Optimal window = 2, p-value = 0.1206
sample estimates:
Location Scale
-2.162290 1.683868

vs.test(x = samp, densfun = "dlnorm", extend = TRUE)

Vasicek-Song GOF test for the log-normal distribution

data: samp
Test statistic = 0.3029, Optimal window = 3, p-value = 0.007
sample estimates:
Location Scale
-2.162290 1.683868

Enlarging the range of $m$ is also pertinent if ties are present in the sample. Indeed, the presence of ties is particularly problematic for performing VS tests, because some spacings $X_{(i+m)} - X_{(i-m)}$ can be zero, making their logarithm infinite. The window size $m$ thus has to be greater than the maximal number of ties in the sample. Hence, if the upper bound $n^{1/3-\delta}$ is less than the maximal number of ties, the test statistic cannot be computed. Setting extend to TRUE can avoid this behavior, as illustrated below.
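The effect of ties on the spacings can be checked directly in base R. In the sketch below, the toy sample and the helper `count_zero_spacings` are illustrative assumptions; the sample ends with three tied values, and the number of zero spacings is counted for each window size.

```r
# Hypothetical helper: number of zero spacings X_(i+m) - X_(i-m) for window m,
# with order statistics clamped to the sample extremes.
count_zero_spacings <- function(x, m) {
  n <- length(x)
  xs <- sort(x)
  i <- seq_len(n)
  sum(xs[pmin(i + m, n)] - xs[pmax(i - m, 1)] == 0)
}

x <- c(0.12, 0.25, 0.4, 0.7, 1.1, 4, 4, 4)  # three ties at the maximum
zero_counts <- vapply(1:3, function(m) count_zero_spacings(x, m), integer(1))
names(zero_counts) <- paste0("m=", 1:3)
print(zero_counts)  # 2, 1 and 0 zero spacings for m = 1, 2, 3
```

Any zero spacing makes the entropy estimate equal to minus infinity, so only the window size 3 is usable for this toy sample.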
samp <- c(samp, rep(4,3)) #add ties in the previous sample
vs.test(x = samp, densfun = "dexp")

Finally, Vasicek's estimate $V_{mn}$ may exceed the parametric estimate of the entropy of the null distribution for all $m$ between 1 and $n^{1/3-\delta}$. Then, no window size exists satisfying (15), as illustrated below.
Enlarging the possible window sizes by setting extend to TRUE may enable Vasicek estimates to be smaller than empirical entropy.
Note that when computing the p-value by Monte-Carlo simulation, the constraint (14) may not be satisfied for some replicates, whatever the window size. These replicates are then ignored and the p-value is computed from the remaining replicates. A warning message is added to the output, reporting the number of ignored replicates.

data(contaminants) #load data from package vsgoftest; see ?contaminants
set.seed(1)
vs.test(x = aluminium2, densfun = 'dpareto')

Warning in vs.test(x = aluminium2, densfun = "dpareto"): For 176 simulations (over 5000 ), entropy estimate is greater than empirical maximum entropy for all window sizes.

Vasicek-Song GOF test for the Pareto distribution

data: aluminium2
Test statistic = 1.3676, Optimal window = 2, p-value < 2.2e-16
sample estimates:
mu c
0.3288148 360.0000000

A large proportion of such ignored replicates may indicate that the original sample is too small or that the null distribution does not fit it.
The function vs.test also allows the constraint (14) to be bypassed when computing the optimal window size, by setting the optional argument relax to TRUE. This should however be used with special care, even when the p-value is computed by Monte-Carlo simulation, because it may lead to spurious conclusions. Some examples will be discussed in Section 5. This option makes it possible to recover the non-parametric likelihood ratio GOF test developed by Vexler and Gurevich (2010) and performed by dbEmpLikeGOF; see Section 4.2.

Technical information on the internal structure of the vsgoftest package
While entropy.estimate is a stand-alone function (depending only on the base and stats packages), vs.test is supported by a set of internal functions that are not available to users; the structure of the package and connections between functions are described in the organisational chart presented in Figure 1. Functions available to users are depicted by rectangles while internal functions are depicted by ellipses. An arrow connecting a function to another means that the first function (say master function) calls the second (slave) during execution. When such a call is optional (depending on arguments given in the master function), the arrow is dashed and annotated with the corresponding argument settings. The function fitdist depicted by a dashed rectangle is a function implemented in the fitdistrplus package (Delignette-Muller and Dutang 2015). The double-lined ellipse depicts a C++ encoded function that has been integrated via the package Rcpp (Eddelbuettel and Francois 2011).
Figure 1: Organisational chart of the structure of the package and connections between functions available for users (depicted by rectangles) and internal functions (depicted by ellipses).

The vsgoftest package is structured so as to:
• Allow easy access to the source code. In particular, the master function vs.test calls four slave functions corresponding to the following tasks (enumerated according to the organisational chart of Figure 1):
1. Computing the MLE of the parameters of the null distribution when param = NULL, through the internal function MLE.param.
2. Computing the Vasicek estimate $V_{mn}$ of Shannon entropy for the sample with the optimal window $m$ given by (15).
3. Computing the VS test statistic $I_{mn}$.
4. Computing the p-value associated to the sample. If the sample size is smaller than 80 or the optional argument simulate.p.value is TRUE, then the p-value is estimated by means of Monte-Carlo simulation performed by the internal function simulate.vs.dist.
• Limit dependence on other packages. To this end, density, cumulative distribution and quantile functions as well as random generators for Pareto and Laplace distributions have been encoded, even if they are available in other R packages such as VGAM (Yee 2010), POT (Ribatet and Dutang 2016) and smoothmest (Hennig 2012). The MLE $\hat\theta_n$ of the parameter of the null distribution is computed through the function fitdist of the fitdistrplus package only if no closed-form expression is known for it, i.e., for gamma, Weibull, beta and Fisher distributions. Otherwise, the closed-form expression is used.
• Optimize time and resources, especially for Monte-Carlo simulation. To this end, the most time-consuming part of the procedure, namely the computation of the Vasicek estimate for all possible window sizes, has been converted to C++ and integrated into the package via Rcpp, in the internal function vestimates.
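As an illustration of the closed-form estimators mentioned in the list above, the sketch below computes the MLEs for Pareto and Laplace samples in base R. It assumes the Pareto density $\mu c^\mu / x^{\mu+1}$ for $x \ge c$ (shape `mu`, scale `c`) and the Laplace density with location and scale parameters; the variable names and simulation settings are hypothetical and this is not the package's internal code.

```r
set.seed(4)
n <- 5000

# Pareto sample by inverse-transform sampling: X = c * U^(-1/mu).
mu <- 2
c0 <- 1.5
x_par <- c0 * runif(n)^(-1 / mu)
c_hat <- min(x_par)                        # MLE of the scale parameter
mu_hat <- n / sum(log(x_par / c_hat))      # MLE of the shape parameter

# Laplace sample: location + scale * (difference of two standard exponentials).
loc <- 1
b <- 2
x_lap <- loc + b * (rexp(n) - rexp(n))
loc_hat <- median(x_lap)                   # MLE of the location parameter
b_hat <- mean(abs(x_lap - loc_hat))        # MLE of the scale parameter

print(c(mu = mu_hat, c = c_hat, location = loc_hat, scale = b_hat))
```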

Performance of Vasicek-Song tests
First, a review of power studies of VS tests available in the literature is presented in Section 4.1.
Then, power comparisons of VS tests and classical GOF tests are proposed when applied to discriminate between close distributions, such as Pareto versus shifted log-normal and Exponential versus Weibull. Finally, the features of packages vsgoftest and dbEmpLikeGOF are compared in Section 4.2; the methodological differences are highlighted, the higher performance of vsgoftest both in terms of power and computational time is pointed out and illustrated.

Power computation
Comparisons of the power properties of VS tests are widely discussed in the literature. Various choices of null and alternative distribution families are considered. VS tests are shown to generally outperform classical GOF tests. A comprehensive list of these references is given in this section, with main conclusions summarized in Table 2. In particular, power properties of the VS test for normality have been discussed by Vasicek (1976), Arizono and Ohta (1989) and Gurevich and Davidson (2008) among many others. Compared with many tests, including Kolmogorov-Smirnov (KS), Cramér-von Mises (CvM), Anderson-Darling (AD) and Shapiro-Wilk (SW), the VS test exhibits higher power for most alternative distributions. When the null distribution is an exponential distribution, the VS test is also shown in Ebrahimi et al. (1992) to be more powerful than the Van-Soest and Finkelstein and Schafer tests, which are modified versions of respectively CvM and KS tests, for various alternative distributions such as Weibull, gamma and log-normal. Choi and Kim (2006) for Laplace and Lequesne (2015b) for Pareto show that the VS test is more powerful than EDF tests, for various alternative distributions. The uniform VS GOF test is shown to outperform many other tests for alternative distributions having most of their mass near 0.5, but remains less powerful than CvM and Watson tests for other alternative distributions.
On the basis of power computations in the literature, we choose to compare the power of the VS test to the KS, CvM and AD tests, for close null and alternative distributions. In particular, distinguishing a Pareto tail from that of a log-normal distribution is a known difficulty; see for example Malevergne, Pisarenko, and Sornette (2011). For illustration, we estimate through Monte-Carlo simulation the power of VS, KS, CvM and AD tests of Pareto distributions applied to samples drawn from a (shifted) log-normal distribution. We simulate 10000 replicates $x_1^n$ of a random sample $X_1^n$ drawn from a shifted log-normal distribution $LN(0, \sigma)$ with support $[1, \infty)$ and $\sigma = 1, 1.25$, for $n \in \{20, 30, 50, 100\}$. Then, we apply the tests for the simple null hypothesis $H_0 : P = \mathrm{Par}(1, \mu)$, for $\mu = 1$ when $\sigma = 1$ and $\mu = 0.8$ when $\sigma = 1.25$; the power is estimated by the proportion of rejections of the null hypothesis among the 10000 replicates. The following code chunk illustrates the procedure, for $\sigma = \mu = 1$ and $n = 20$, using the VS test. This procedure immediately adapts to other values of $\sigma$, $\mu$ and $n$ and to other tests. Results are presented in Table 3 (top).

N <- 10000
n <- 20
mu <- 1
set.seed(54)
res.pow <- replicate(n = N, expr = vs.test(x = 1 + rlnorm(n, meanlog = 0, sdlog = 1),

[...] distribution $E(1/a)$ when $b = 1$. The main aim is thus to determine which test better discriminates between these distributions when the shape parameter of the Weibull distribution is close to 1, precisely $b = 1.2$ and $b = 1.3$. Results are given in Table 3 (bottom), clearly showing that the VS test outperforms EDF tests.
Note that the above procedure for comparing the power of GOF tests adapts easily to other sets of null and alternative distributions.
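As a hedged illustration of how the same power estimation adapts to another test, the sketch below uses the Kolmogorov-Smirnov test from the stats package as a stand-in for the VS test, for the simple null hypothesis Par(1, 1) against shifted log-normal samples. The significance level of 0.05, the reduced number of replicates and the helper `ppareto1` are assumptions made for this example.

```r
set.seed(54)
n_rep <- 2000  # fewer replicates than in the study above, to keep the run short
n <- 20
mu <- 1

# Cumulative distribution function of the Pareto distribution Par(1, mu).
ppareto1 <- function(q) ifelse(q < 1, 0, 1 - q^(-mu))

# p-values of the KS test of Par(1, mu) on shifted log-normal samples.
p_values <- replicate(
  n_rep,
  ks.test(1 + rlnorm(n, meanlog = 0, sdlog = 1), ppareto1)$p.value
)

# Estimated power: proportion of rejections at level 0.05.
power_ks <- mean(p_values < 0.05)
print(power_ks)
```

Replacing the call to ks.test by a call to vs.test with densfun = 'dpareto' and a suitable param vector would give the corresponding estimate for the VS test.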

vsgoftest versus dbEmpLikeGOF for testing uniformity and normality
As mentioned in the introduction, the package dbEmpLikeGOF in Miecznikowski et al. (2013) performs uniformity and normality tests based on empirical likelihood ratios (ELR), referred to as ELR tests. These tests are strongly linked to VS tests. Precisely, for testing the normality of a sample $X_1, \dots, X_n$, the ELR test statistic is $\log V_n$, where $V_n = \min$ [...]. Mere algebra yields $\log V_n = nI_{mn} + \frac{1}{2}$, with $m \in \operatorname{argmax}_{1 \le m < n^{1/2}} V_{mn}$. Hence, ELR and VS tests differ only in the window size choice: the upper bound is $n^{1/2}$ for the ELR test while it is (by default) $n^{1/4}$ for the VS test, and the constraint (14) is not taken into account by the ELR test. Enlarging the upper bound from $n^{1/4}$ to $n^{1/2}$ may lead to a more powerful decision rule, as mentioned and illustrated in Section 3. [...] (12), depending on the sample size. In both cases, vs.test is approximately five times faster than dbEmpLikeGOF, as illustrated by Figure 2⁴.
2 The comparison procedure is available in the file vsgoftest performances.R, in the directory inst/doc of the package source file.
3 Some slight difference remains between the two computed values, due to numerical inaccuracy in computation procedures: the estimated entropy of the null distribution is computed from the closed-form expression (4) in dbEmpLikeGOF while it is computed as the empirical mean of the log-likelihood of the sample in vs.test.
4 Simulations have been performed on a Dell Latitude E5580 laptop, equipped with an Intel® Core™ i7-7600U CPU at 2.80GHz x 4, with 16GB RAM. R code for generating these figures is available in the file vsgoftest performances.R, in the directory doc of the package source file.

Application to real data
The vsgoftest package contains environmental data originating from a guidance report edited by the Technology Support Center of the United States Environmental Protection Agency; see Singh, Singh, and Engelhardt (1997). According to Singh et al. (1997), environmental scientists take remediation decisions at suspected sites based on organic and inorganic contaminant concentration measurements. These decisions usually derive from the computation of confidence upper bounds for contaminant concentrations. Testing the goodness-of-fit of specified models hence appears of primary interest. Singh et al. (1997) also point out that contaminant concentration data from sites often appear to follow a skewed probability distribution, making the log-normal family a frequently used model. The authors illustrate their point by applying the Shapiro-Wilk test to the log-transformed samples aluminium1, manganese, aluminium2 and toluene (stored in the present package); see the empirical skewness computed in the following chunk.
data(contaminants) #Load environmental data from package
#Package DescTools required for this chunk
unlist(lapply(X = list(aluminium1, manganese, aluminium2, toluene), FUN = DescTools::Skew))

[1] 2.323343 1.698686 1.996607 3.961129

The following code chunks illustrate the use and behavior of the function vs.test for these environmental data. The significance level is fixed to 0.1 as in Singh et al. (1997). Note that warning messages notifying that there are ties in the samples have been omitted from the outputs.
set.seed(1)
vs.test(x = aluminium2, densfun = 'dlnorm')

Vasicek-Song GOF test for the log-normal distribution

data: aluminium2
Test statistic = 0.48369, Optimal window = 2, p-value = 0.0256
sample estimates:
Location Scale
8.9273293 0.8264409

Due to numerous ties in toluene, vs.test cannot compute the Vasicek entropy estimate unless extend is set to TRUE. Still, vs.test notifies that the constraint (14) is violated for all window sizes, which suggests that data are not likely to be drawn from the log-normal distribution; see Section 2. Turning relax to TRUE yields the following result.
set.seed(1)
vs.test(x = toluene, densfun = 'dlnorm', extend = TRUE, relax = TRUE)

Vasicek-Song GOF test for the log-normal distribution

data: toluene
Test statistic = -2.4984, Optimal window = 11, p-value = 0.7308
sample estimates:
Location Scale
4.651002 3.579041

Again, this last result looks spurious because the test statistic is negative, which results from (14) no longer being enforced when relax = TRUE. An alternative is to test normality of the log-transformed sample as follows.
set.seed(1)
vs.test(x = log(toluene), densfun ='dnorm', extend = TRUE)

Vasicek-Song GOF test for the normal distribution

data: log(toluene)
Test statistic = 0.6536, Optimal window = 11, p-value = 2e-04
sample estimates:
Mean St. dev.
4.651002 3.579041

The log-normal hypothesis is not rejected for aluminium1 and manganese while it is rejected for aluminium2 and toluene. These results are consistent with those obtained by Singh et al. (1997). Further, the goodness-of-fit to the Pareto distributions is tested for aluminium2 and toluene. Log-normal and Pareto distributions are usually competing models, with closely related generating processes and tail properties that are hard to distinguish; see for example Malevergne et al. (2011). Goodness-of-fit of the Pareto distribution is rejected for aluminium2.
set.seed(1)
vs.test(x = aluminium2, densfun = 'dpareto')

Vasicek-Song GOF test for the Pareto distribution

data: aluminium2
Test statistic = 1.3676, Optimal window = 2, p-value < 2.2e-16
sample estimates:
mu c
0.3288148 360.0000000

Applying vs.test to toluene with default settings yields no result because of numerous ties and the violation of (14). Uniformity of the sample transformed by the cumulative distribution function of the Pareto distribution can be tested as follows. Goodness-of-fit of the Pareto distribution is not rejected for toluene.

Conclusion
Vasicek-Song tests constitute powerful GOF tests for classical parametric families of distributions, relying on an information-theoretical framework. They can be easily performed by using the vsgoftest package for R. Default and optional settings of the functions provided by the package make the procedure both intuitive and flexible. Its application to real datasets illustrates its practical usage.
The package allows for testing GOF of a substantial list of parametric models; this list could be extended in further releases. New entropy-based GOF tests could also be considered by using Rényi entropy and divergence (see Lequesne 2015a), thus further extending the class of possible distributions, e.g., to Student distributions.