npregfast: An R Package for Nonparametric Estimation and Inference in Life Sciences

We present the R npregfast package via some applications involved with the study of living organisms. The package implements nonparametric estimation procedures in regression models with or without factor-by-curve interactions. The main feature of the package is its ability to perform inference regarding these models. Namely, the implementation of different procedures to test features of the estimated regression curves: on the one hand, the comparisons between curves which may vary across groups defined by levels of a categorical variable or factor; on the other hand, the comparisons of some critical points of the curve (e.g., maxima, minima or inflection points), studying for this purpose the derivatives of the curve.


Introduction
Regression analysis plays a fundamental role in statistics. The purpose of this technique is to evaluate the influence of some explanatory variables on the mean of the response. In the case of nonparametric regression, the dependence between the response and the covariates is modeled without specifying in advance the function that links them. Development and implementation of different methods for estimation and inference regarding these models is the central focus of this work.
Nonparametric methods are now widely recognized as useful tools in regression analysis. However, they are much more computationally demanding than their parametric counterparts. In view of the high cost entailed we used Fortran (Gehrke 1995) as the programming language. To facilitate the use in practice of the methodologies proposed, a user-friendly R (R Core Team 2017) package is implemented containing the Fortran code.
A range of methods can be found in the npregfast package (Sestelo, Villanueva, and Roca-Pardiñas 2017) including estimation of the conditional mean and the derivatives with/without factor-by-curve interactions, bandwidth selection and computational acceleration. In addition, several procedures to test different features of the estimated regression curves have also been included. These developments have been applied to a couple of real data situations in a life science context.
The effect of a continuous covariate on the response may vary across groups defined by levels of a categorical variable. This means that the continuous covariate can behave in a different way in the absence/presence of a factor, producing the corresponding factor-by-curve effect. Following this, there exist many practical situations that call for comparisons of regression curves and their derivatives which may vary across groups defined by different experimental conditions. Thus, the interest might be focused on drawing inferences about some critical points of the curve (e.g., maxima, minima or inflection points), studying for this purpose the derivatives of the curve. In marine biology, for example, the growth of commercial species collected in different zones could be analyzed and compared with each other taking into account the environmental conditions prevailing in such areas. The location would be considered as the factor, and the length-weight relationship could be studied including the factor-by-curve interaction. Similarly, the first derivative of the regression curve could be calculated, thereby enabling the different stages of growth to be defined as the species increases in size. Furthermore, calculation of this derivative could have a direct application in the management of this species, possibly in estimating a size of capture (Sestelo and Roca-Pardiñas 2011;Bidegain, Sestelo, Roca-Pardiñas, and Juanes 2013;Bidegain, Guinda, Sestelo, Roca-Pardiñas, Puente, and Juanes 2015). This paper describes the R based npregfast package which is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=npregfast/). The package allows to estimate the regression curves and their derivatives, to compare them between levels (in the case of including a categorical variable), and even to compare their critical points. The main estimation procedure is based on local polynomial kernel smoothers (Wand and Jones 1995;Fan and Gijbels 1996). It is also possible, however, to estimate the models using a classical parametric model -the allometric model -one of the most frequently used models in fishery management. In addition, the package implements the following two tests: (i) a global test for the equality of M regression curves; and, (ii) a local test to draw inferences about critical points linked to the derivative curves. Inference with this package (confidence intervals and tests) is based on bootstrap resampling methods (Efron 1979;Wu 1986;Liu 1988;Efron and Tibshirani 1993;Härdle and Mammen 1993;Mammen 1993;Kauermann and Opsomer 2003). Accordingly, binning acceleration techniques are also implemented to ensure that the package is computationally efficient (Fan and Marron 1994).
There are a number of contributed packages available for R, for example on CRAN, which are devoted to nonparametric estimation procedures. Particularly, a brief review of software developments for carrying out kernel-based regression could start with the ksmooth function of the stats package, which allows to obtain estimates using the Nadaraya-Watson estimator. However, to study kernel-based nonparametric estimators in depth, the KernSmooth package (Wand 2015) affords more possibilities for R users. Using its main function, locpoly, a probability density function, a regression function or their derivatives can be estimated using local polynomials. Another option might be the use of the kerdiest package (del Río and Estévez-Pérez 2012), which has been designed for computing kernel estimators of the distribution function, or the lokern package (Herrmann and Maechler 2016), which features kernel regression smoothing with adaptive local or global plug-in bandwidth selection. In a specific context, such as item response theory or graduating mortality rates, these kernel smoothers could be applied by means of the packages KernSmoothIRT (Mazza, Punzo, and McGuire 2014) or DBKGrad , respectively. Finally, in a multivariate framework, the np package (Hayfield and Racine 2008;Racine and Hayfield 2017) provides a variety of nonparametric (and semiparametric) kernel methods that seamlessly handle a mix of continuous, unordered and ordered factor data types often encountered in applied settings.
With respect to testing procedures, it is worth noting that a vast literature exists about the comparison of regression functions. Relevant papers about this topic are Hall and Hart (1990); Härdle and Marron (1990); Delgado (1993); Kulasekera (1995); Young and Bowman (1995); Bowman and Young (1996); Dette and Neumeyer (2001); Neumeyer and Dette (2003); Pardo-Fernández, Van Keilegom, and González-Manteiga (2007); Park and Kang (2008); Srihera and Stute (2010), among others. For a detailed review see González-Manteiga and Crujeiras (2013). In addition, when the previous division into groups is governed by a discrete variable, tests for the significance of this discrete variable could also be considered. Related work includes Racine, Hart, and Li (2006); Lavergne (2001) and the references therein, for example. All the above references focus on the regression functions, however, to our knowledge, there are no references dealing with the comparison of derivatives. Furthermore, there exist procedures described in the literature that test the monotonicity of the regression function (e.g., Bowman, Jones, and Gijbels 1998) or techniques as the SiZer method (Chaudhuri and Marron 1997) which let us evaluate if the observed features are really significant. This latter technique was implemented in the SiZer package (Sonderegger 2012). However, to the best of our knowledge, this is the first contribution dealing with the topic of testing critical points between curves.
In this article we explain and illustrate how numerical and graphical output for all methods can be obtained using the npregfast package via life science applications. The applications are chosen to solve two real problems related to the management of an aquatic living resource, and to the spurt in growth for school-aged children and adolescents.
The remainder of the paper is structured as follows: Section 2 describes the estimation procedures, jointly with practical questions such as bandwidth selection and computational acceleration, and the inference procedures for the performance of different tests. Section 3 presents the implementation of the methods in package npregfast. The package capabilities using a couple of real data examples are illustrated in Section 4 and lastly, Section 5 concludes with some remarks.

Statistical methodology
In many practical situations, the response variable, Y , depends on a continuous covariate, X. In such a regression framework, consideration might well be given to the nonparametric regression model where m is a smooth unknown function and ε is the regression error with zero mean. The main advantage of using these type of models is the flexibility and the ease of interpretation of m.
A generalization of the "pure" model in (1) is the regression model with factor-by-curve interactions. In these type of models, the relationship between the response and the covariates can change depending on the levels of a categorical variable, F . The possibility of incorporating factor-by-curve interactions in nonparametric regression models has already been discussed by Hastie and Tibshirani (1990). Ruppert and Wand (1994) and Coull, Ruppert, and Wand (2001) also presented an algorithm based on penalized splines (P-splines), which would enable these types of interactions to be incorporated into these types of models. Recently, Cadarso-Suárez et al. (2006) and  have successfully applied these interactions to estimate neuron firing rates.
Based on this, to study the possible effect of F on the response, the following nonparametric regression model including factor-by-curve interactions is considered where ε 1 , . . . , ε M are the zero-mean errors for each level of the factor, f 0 represents the global effect of X on the response, and f l is the specific effect of X associated with the lth level of factor F . Note that under model (2), the regression curves m l ( In order to prevent different combinations of f 0 , f 1 , . . . , f M leading to the same model, the sum of the specific effects across the levels are assumed to be zero. That is to say, for each x, M l=1 f l (x) = 0 is enforced. Note that this identifiability condition does not put any constraints on our model because it can be modified to conform to this condition.
In addition, when a factor-by-curve interaction is detected in model (2), it might be of interest to draw inferences about some critical points of curves (such as minima, maxima or inflection points), studying for this purpose the derivatives. In general, the critical point x 0l referring to the l level will be obtained from the derivative curve m r l (x), for some r. Accordingly, we define this point, x 0l , for each l level, as The present section describes the estimation procedure for these types of models and for these critical points, based on the use of local polynomial kernel smoothers, and explains in detail the inference methods implemented in the package. It also shows the procedure used to select the bandwidth of the estimator and draws attention to the technique applied to speed up both the estimation and inference methods.

Estimation procedures
The factor-by-curve regression model in (2) is estimated using local polynomial kernel smoothers (Wand and Jones 1995;Fan and Gijbels 1996).
of n independent and identically distributed (i.i.d.) observations, and considering observations in all the levels of F , the estimate of f 0 at a point x is given byf 0 (x) =α 0 (x), beingα 0 (x) the first position of the vector (α 0 (x) ,α 1 (x) , . . . ,α R (x)) which is the minimizer of where K is a kernel function (normally, a symmetric density), h 0 is the smoothing parameter or bandwidth and R is the degree of the polynomial. Moreover, the estimated rth (r ≤ R) derivative of f 0 (x) is given byf r 0 (x) = r!α r (x). Once the estimation of f 0 is obtained, the estimate of f l at a point x is given byf l (x) =α 0l (x) (for l = 1, . . . , M ), beingα 0l the first position of the vector (α 0l (x) ,α 1l (x) , . . . ,α Rl (x)) which is the minimizer of where h l is the bandwidth used to obtainf l and I the indicator function. Analogously, the estimated rth (r ≤ R) derivative of f l (x) is given byf r l (x) = r!α rl (x). Note that the obtained estimates do not necessarily meet the imposed identifiability condition. To do so, the following procedure is used. For each x, calculate the mean of the specific effects of each level, Clearly, the estimated curves for each level at point x are given bym l (x) =f 0 (x) +f l (x), for l = 1, . . . , M , and the estimated rth derivative of m l (x) is given bym r l (x) =f r 0 (x) +f r l (x). Finally, a natural estimator of the critical point x 0l (3) can be obtained as the maximizer of m r l (k 1 ), . . . ,m r l (k N ) with k 1 , . . . , k N being a grid of N equidistant points in a range of X values.
Note that the proposed methodology makes sense when the support of X is the same for all the levels and it is also a closed and bounded interval.

Inference procedures
The procedures implemented in this package enable us to test two hypotheses. The first one is a global test which assumes the hypothesis of equality of the M regression functions (or derivatives) and the second one is a local test that enables us to test the hypothesis that, among the levels of a given factor, the critical points are equal.

Global test
Here we expose a procedure to test the following null hypothesis based on the model in (2): versus the general alternative It should be noted that the previous hypothesis is equivalent to f r 1 (·) = · · · = f r M (·) = 0, and therefore f l (x) = r−1 j=0 a jl x j will be a polynomial of degree r − 1 for l = 1, . . . , M . 1 By applying Taylor's theorem to the function h up to order r, and taking into account that the derivatives of h of order higher or equal than r are zero, we obtain which shows that h(x) is a polynomial of degree r − 1.
Accordingly, the null regression model is given by and the regression curves m l are given by m l (X) = f 0 (X) + r−1 j=0 a jl X j . Note that, in the expression (7), we have abused notation slightly. In fact, if r = 0 we are actually referring to the null model Y = f 0 (X) + ε.
To test H r 0 , we propose the use of the following test statistic based on direct nonparametric estimates of f r l curves considering the L 1 norm For a detailed simulation study comparing other test statistics see Sestelo (2013).
Note that if H r 0 holds, the value of T should be close to zero. The test rule based on T consists of rejecting the null hypothesis if T is larger than its (1 − α)-percentile obtained under H 0 . To approximate the distributions of the test statistic resampling methods such as the bootstrap introduced by Efron (1979) (see also Efron and Tibshirani 1993;Härdle and Mammen 1993;Kauermann and Opsomer 2003) can be applied instead. Here we use the wild bootstrap (Wu 1986;Liu 1988;Mammen 1993) because this method is valid also for heteroscedastic models where the variance of the error is a function of the covariate. The testing procedure used here involves the following steps: Step 1. Compute the value of the test statistic, T , in the sample as explained above.
Step 2. Estimate the null regression model in (7). For this purpose, estimate f 0 (X i ) as we mentioned in the estimation procedure in Section 2.1. Calculate Y l i = Y i −f 0 (X i ) and with that fit the polynomial using least squares for each level. Obtain the pilot estimates for i = 1, . . . , n, Step 3.

Local test
If the previous test is statistical significant and the equality of the m r l curves (l = 1, . . . , M ) is thus rejected, testing the null hypothesis of equality of critical points becomes of interest. Note that it is possible for these points to be equal, even if the curves and/or their derivatives are different. For instance, taking into account the maxima of the first derivatives, interest lies in testing the following null hypothesis versus the general alternative otherwise H 0 is false. It is important to highlight the fact that, in practice, the true x 0j are not known, and consequently neither is d, so an estimated =x 0j −x 0k is used, where, in general,x 0l are the estimates of x 0l based on the estimated curvesm l .
Needless to say, sinced is only an estimate of the true d, the sampling uncertainty of these estimates needs to be taking into account. Hence, a confidence interval may be created for d at a specific level of confidence. Based on this, the null hypothesis is rejected if zero is not contained in the interval.
The steps for construction of the bootstrap confidence interval for the true d are the following: Step 1. From the sample data {(X i , F i , Y i )} n i=1 , obtain the estimates for i = 1, . . . , n based on the general model in (2), obtain the estimates of x 0l based on (3) and then retrieve thed value.
Step 2. For b = 1, . . . , B, generate bootstrap samples as in Step 3 of the algorithm for the global test presented earlier, though, in this case, using the residuals of the general model in (2) Step 1.
Finally, the limits for the 100(1 − α)% percentile confidence interval of d are given by

More technical details
It is well known that the nonparametric estimatesm r l (X) greatly depend on the bandwidths h 0 , h 1 , . . . , h M used in the kernel-based algorithm for the estimation of the partial functions f 0 , f 1 , . . . , f M . Various methods for an optimal selection have been suggested, such as generalized cross-validation (GCV; Golub, Heath, and Wahba 1979) or plug-in methods (see e.g., Ruppert, Sheather, and Wand 1995). See Wand and Jones (1995) for a good overview of this topic. However, optimal bandwidth selection is still a challenging problem.
As a practical solution, in Equation 4 of the estimation algorithm, the bandwidth h 0 is automatically selected by minimizing the following cross-validation criterion: wheref (−i) 0 (X) indicates the fit at X, leaving out the ith data point based on the smoothing parameter h 0 . Likewise, the bandwidths h l (l = 1, . . . , M ) of Equation 5 are selected by minimizing wheref (−i) l (X) indicates the fit at X, leaving out the ith data point based on the smoothing parameter h l .
Bootstrap resampling techniques are time-consuming processes because it is necessary to estimate the model many times. Moreover, the use of the cross-validation technique for the choice of the bandwidths implies a high computational cost, because it is necessary to repeat the estimation operations several times to select the optimal bandwidths. Consequently, recourse to some computational acceleration technique is fundamental to ensure that the problem can be addressed adequately in practical situations. Thus, we use binning techniques to speed up the process. A detailed explanation of this technique can be found in Fan and Marron (1994).

Overview of the package npregfast
The npregfast package contains a set of functions for estimating nonparametric models, obtaining first and second derivatives, critical points, etc., as well as different tests for drawing inferences about several features of these models. In view of the high cost entailed in these methodologies and in order to maximize computational efficiency, the actual version of the package is carried out using compiled Fortran. The functions within npregfast are briefly described in Table 1.
The package is designed along lines similar to those of other R regression packages. Hence, the main function of the package is frfast which, by default, fits a nonparametric regression model based on local polynomial kernel smoothers. The arguments of this function are shown in Table 2. Note that through the argument formula users can decide to fit a model taking into account the interaction or not, and by means of the argument smooth it is possible to select the type of smoother: kernel or splines. Numerical and graphical summaries of the fitted object can be obtained by using the print, summary, plot and autplot methods implemented for 'frfast' objects (arguments of the latter function are shown in Table 3). Another of these methods is available for the predict function, which takes a fitted model of the 'frfast' class and, given a new data set of values of the covariate, produces predictions.
As mentioned above, this package can be used to fit models taking into account factor-by-curve interactions. In this framework, it will be necessary to ascertain if the factor produces an effect

Function
Description frfast Main function for fitting regression models and obtaining the different outputs (model estimates, first and second derivatives). summary Method of the generic summary function for 'frfast' objects. autoplot Visualization of 'frfast' objects with ggplot2 (Wickham 2009) graphics.
Provides the plots for model estimates and their 95% pointwise confidence intervals based on bootstrap techniques. Additionally, with the diffwith argument it is possible to draw the differences between two factor levels. plot Visualization of 'frfast' objects with base graphics. Provides the plots for model estimates and their 95% pointwise confidence intervals based on bootstrap techniques. Additionally, with the diffwith argument it is possible to draw the differences between two factor levels. predict Takes a 'frfast' object produced by frfast() and, given a new set of values for the model covariate, produces predictions. critical Provides a table with the value of the covariate x (with 95% confidence interval) that maximizes the initial estimation, that maximizes the first derivative and where the second derivative equals zero. criticaldiff Provides a table with the 95% confidence interval for the differences between the estimation of the critical function, for every two levels. globaltest Function for testing the equality of the curves specific to each level. localtest Function for testing the equality of the critical points estimated from the respective level-specific curves. allotest Function for testing the null hypothesis of an allometric model versus a general hypothesis where the effect of the covariate on the response is flexible and unknown. runExample Launch a Shiny app that shows a demo of what can be done with the package. on the response and thus, that there is an interaction or, in contrast, the estimated regression curves are equal. To this end, the package provides the globaltest function which answers this question through a bootstrap-based test. If the factor results to be statistically significant, then the use of the diffwith argument of the autoplot method for 'frfast' objects (or of its base graphics version, the plot method for 'frfast' objects) enables the user to obtain a graphical representation that shows the differences between the estimated curves (estimate, first or second derivative) for any set of two levels of the factor. Additionally, the function critical allows to obtain the value of the covariate that maximizes the estimate and first derivative of the function and the value of the covariate where the second derivative equals zero, for each of these levels. Again, to test if these estimated points are equal for all levels, the package provides the localtest function. Note that, to compare these points between any set of two levels, a confidence interval for the difference can be obtained by applying the criticaldiff function. It should be noted that both smoothing methods (kernel and splines) are available options to test the equality of the M curves specific to each level, or to test the equality of critical points (i.e., for the globaltest and localtest functions).

Arguments Description formula
An object of class 'formula'; a symbolic description of the model to be fitted. data A data frame or matrix containing the model response variable and covariates required by the formula. na.action A function which indicates what should happen when the data contains NAs.
The default is "na.omit". model Type of model used: model = "np" for the nonparametric regression model, model = "allo" for the allometric model. smooth Type of smoother used: smooth = "kernel" for kernel smoothers and smooth = "splines" for splines using the mgcv package (Wood 2017). h0 The kernel bandwidth smoothing parameter for the global effect. By default, cross-validation is used to obtain the bandwidth. h The kernel bandwidth smoothing parameter for the partial effects. nh Integer number of equally-spaced bandwidth in which the h is discretized, to speed up computation. weights Prior weights on the data. kernel A character string specifying the desired kernel. Defaults to kernel = "epanech", where the Epanechnikov density function kernel will be used. Also, several types of kernel functions can be used: triangular and Gaussian density function, with "triang" and "gaussian", respectively. p Polynomial degree to be used. Its value must be the value of derivative + 1. The default value is 3, returning the estimation, first and second derivative. kbin Number of binning nodes over which the function is to be estimated. nboot Number of bootstrap repeats. Defaults to 500 bootstrap repeats. The wild bootstrap is used when model = "np" and the simple bootstrap when model = "allo". rankl Number or vector specifying the minimum value for an interval at which to search for the x value that maximizes the estimate, and first or second derivative (for each level). The default is the minimum data value. ranku Number or vector specifying the maximum value for an interval at which to search for the x value that maximizes the estimate, and first or second derivative (for each level). The default is the maximum data value. seed Seed to be used in the bootstrap procedure. cluster A logical value. If TRUE (default), the bootstrap procedure is parallelized (only for smooth = "splines"). Note that there are cases (e.g., a low number of bootstrap repetitions) that R will gain in performance through serial computation. R takes time to distribute tasks across the processors and also it will need time for binding them all together later on. Therefore, if the time for distributing and gathering pieces together is greater than the time needed for single-thread computing, it is not worth to parallelize. ncores An integer value specifying the number of cores to be used in the parallelized procedure. If NULL (default), the number of cores to be used is equal to the number of cores of the machine -1. Factor level to be taken into account in the plot. By default, NULL. der Number which determines any inference process. By default, NULL. If this term is 0, the plot shows the initial estimate. If it is 1 or 2, it is designed for the first or second derivative, respectively. diffwith Factor level used for drawing the differences with respect to the level specified in the fac argument. By default, NULL. The differences are computed for the rth derivative specified in the der argument. points Draw the original data into the plot. By default, TRUE. xlab A title for the x axis. ylab A title for the y axis. ylim The y limits of the plot. main An overall title for the plot. col A specification for the default plotting color.

CIcol
A specification for the default confidence intervals plotting color (for the fill). CIlinecol A specification for the default confidence intervals plotting color (for the edge). pcol A specification for the point color. abline Draw an horizontal line into the plot of the second derivative of the model. ablinecol The color to be used for abline. lty The line type. Line types can either be specified as an integer (0 = blank, 1 = solid (default), 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash) or as one of the character strings "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash", where "blank" uses "invisible lines" (i.e., does not draw them). See details in ?par. CIlty The line type for confidence intervals. Line types can either be specified as an integer (0 = blank, 1 = solid (default), 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash) or as one of the character strings "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash", where "blank" uses "invisible lines" (i.e., does not draw them). See details in ?par. lwd The line width, a positive number, defaulting to 1. See details in ?par. CIlwd The line width for confidence intervals, a positive number, defaulting to 1. cex A numerical value giving the amount by which plotting symbols should be magnified relative to the default. See details in ?par. alpha Alpha transparency for overlapping elements expressed as a fraction between 0 (complete transparency) and 1 (complete opacity). ...
Other options.

npregfast in practice
It is now time to outline the implemented functions of our package in detail, and illustrate these with two real data sets related to the life sciences. The first one is connected with the biology and management of an aquatic living resource and the second one is related to medical data, particularly, the age and height measurements of children.

Relative growth curves for barnacles
The npregfast package includes a data set called barnacle with measurements of rostrocarinal length and dry weight of barnacles from the Atlantic coast of Galicia (Spain), split by a categorical variable indicating the site of harvest (Sestelo and Roca-Pardiñas 2011). The usage of the package is illustrated by constructing the relative growth curves for this species and determining the ideal size of capture of the stalked barnacle, Pollicipes pollicipes (Gmelin, 1789). The commercial interest of this crustacean resides in their muscular peduncle, the edible stalk of the barnacle, which commands high prices on the market (Goldberg 1984).
In Spain and Portugal, where harvesting of P. pollicipes is the highest, the phenomenon of overfishing has affected this species to differing degrees (Bernard 1988;Cardoso and Yule 1995;Cruz 2000;Molares and Freire 2003). Because of the economic importance of this barnacle in several countries, we appreciate to deepen our knowledge about it. Accordingly, the main goal of this data set is to illustrate the use of the R package to analyze the relationship between the gain in weight and length of barnacles.
Each line of the data set represents the information from one specimen under study. The DW variable denotes the dry weight of the individuals in grams, the RC variable is the rostrocarinal length in millimeters -the variable that best represents the growth of the species (Cruz 1993(Cruz , 2000 -and the categorical variable F indicates the site where the specimens were collected, Punta Lens (lens) and Punta de la Barca (barca). An excerpt of the data set is shown below: DW RC F 1 0.14 9.5 barca 2 0.00 2.4 barca 3 0.42 13.1 barca To estimate the length-weight relationship of this species we first consider a nonparametric regression model without interaction. The polynomial degree was fixed to 2 based on the idea of using later the first derivative.
The graphical representation of the fit is obtained using the autoplot function. Figure 1 (left panel) shows the estimated curve (solid line) for the overall study with the pointwise confidence interval (shaded area). This curve shows the way in which individuals' size increased as their weight increased. The length-weight relationship is seen to be an increasing function in almost the complete range of values; only the final section of the curve seems to stabilize to a horizontal line. The way in which the gain weight occurs can be obtained by means of the first derivative ( Figure 1, right panel). The speed of growth (the increase in weight per unit of RC), rather than constantly increasing, displayed a maximum at a specific size, after which it began to decrease. Both plots can be plotted jointly with the following input commands: R> library("gridExtra") R> der0 <-autoplot(mwo, der = 0) R> der1 <-autoplot(mwo, der = 1) R> grid.arrange(der0, der1, nrow = 1, ncol = 2) In biological studies, and specifically in population dynamics and stock assessment, it is relevant to ascertain whether this length-weight relationship remains constant across sites and was not altered by any possible local variability in the growth of this species. Therefore, we now intend to estimate the model including the factor-by-curve interaction. This can be obtained with the following code: R> mwi <-frfast(DW~RC:F, data = barnacle, p = 2, seed = 130853) R> summary(mwi) Call: frfast(formula = DW~RC:F, data = barnacle, p = 2, seed = 130853) ********************************************* Nonparametric Model ********************************************* Type of nonparametric smoother: kernel Kernel: Epanechnikov Bandwidth: 0.21 0.31 1.00 Polynomial degree: 2 Number of bootstrap repeats: 500 Number of binning nodes 100 The summary method returns a numerical summary of the fit where it is possible to observe the kernel used, the global and partial bandwidths obtained by cross-validation, the number of bootstrap repeats used for obtaining the confidence interval for the estimation, the number of binning nodes, the sample size by levels of the factor and a small summary for the response by levels.
The fit can be again visualized by means of the generic function autoplot. As in the previous case, it is possible to represent both the estimation and first derivative with the der argument. Additionally, the selection of the factor level is obtained by the fac argument. Figure 2 shows the estimated length-weight relationship for the barnacles of the two sites of Galicia, Punta de la Barca and Punta Lens.
R> library("ggplot2") R> der01 <-autoplot(mwi, der = 0, fac = "barca") R> der11 <-autoplot(mwi, der = 1, fac = "barca") + ggtitle("") R> der02 <-autoplot(mwi, der = 0, fac = "lens") R> der12 <-autoplot(mwi, der = 1, fac = "lens") + ggtitle("") R> grid.arrange(der01, der11, der02, der12, nrow = 2, ncol = 2) The question that now arises is if the estimated curves for each level are identical and thus indicate that there is no need to include the interaction in the model or, by contrast, the barnacles show a difference in relative growth and the factor really produces an effect on the response. To this end, the bootstrap-based test proposed before for testing the equality of the M regression functions (or derivatives) was applied using the globaltest function. Taking into account the results obtained, we can conclude that both the estimates and derivatives are not equal between levels. This is also observed from the graphical representations. The test concludes that the factor site produces an effect on the response.
Finally, an important issue related with any species that is subject to exploitation is the establishment of limits on size. The estimation of adequate catch sizes for commercial marine invertebrates includes biological aspects such as individual size at sexual maturation, growth rate and length-weight relationship but also the yield in weight from the fishery (Sparre and Venema 1997). According to this, the point that maximizes the first derivative of the regression curve must be determined. This critical point would ensure high commercial yield while simultaneously guaranteeing the regeneration and conservation of the population. To obtain these points for both sites of harvest, the critical function can be applied. The estimated critical points with their 95% confidence intervals seem to be similar between the two sites of study. To ascertain the truth of this affirmation, the localtest function for testing the equality of critical points was applied. In this case, one tests whether the points that maximize the first derivatives of the curves are equal.

R> critical(mwi, der = 1)
R> localtest(DW~RC:F, data = barnacle, p = 2, seed = 130853, der = 1) d Lwr Upr Decision 1 0.1955 -0.1521 1.1893 Accepted According to the obtained confidence interval it is possible to conclude that, although the effects of the size (RC) on the weight (DW) depend on the location (F) and consequently the curves of the relative growth and their derivatives are different for each level, there is not a statistically significant difference between the estimated sizes.

Spurt in child growth
We decided to also show the capabilities of package npregfast with another data set, which contains the age and height measurements of 2500 children aged 5 to 19 years, split by sex (1292 females and 1208 males). The usage of the package is illustrated by constructing growth curves for school-aged children and adolescents and also by analyzing possible differences in the growth of boys and girls. Other studies of this type can be obtained from http: //www.who.int/childgrowth/en/. Finally, note that we applied in this section the two smoothers implemented in the package (kernel and splines) in order to compare the possible differences in the results. Below is an excerpt from the data frame of the data set used: Each line represents the information from one individual under study. The categorical variable sex indicates the individual's gender (male or female), the age variable corresponds to age in years, and height is measured in centimeters. To estimate the growth of the children overall, we firstly consider a nonparametric model without interaction.

Number of Observations: 2500
Number of Bootstrap Repeats: 500

Type of Nonparametric Smoother: splines
The graphical representation of the fitted models can easily be obtained. Figure 3 plots the estimated curves obtained by means of the two smoothers with their 95% pointwise confidence intervals. As expected, in both cases, children's height rises with the increase in years of life until they reach a specific age; and thereafter their heights remain more or less constant. We can also observe that estimates obtained using the two smoothers are very similar. This plot can be obtained by using the following commands: R> der0k <-autoplot(mwo2k, der = 0) R> der0s <-autoplot(mwo2s, der = 0) R> grid.arrange(der0k, der0s, nrow = 1, ncol = 2) A common issue is to compare the growth between boys and girls. With this in mind, we fit a model taking into account the interaction. Again, we estimate the proposed model using both smoothers. It is worth mentioning that argument formula of the main function frfast must be specified in a different manner depending on the chosen smoother.
R> mwi2k <-frfast(height~age:sex, data = children, p = 2, seed = 130853) R> mwi2s <-frfast(height~s(age, by = sex), data = children, seed = 130853, + smooth = "splines") The estimated curves for each gender can again be obtained with the function autoplot. Figures 4 and 5 show the estimated curves for males and females together with their first derivatives using both smoothers. The obtained estimates are similar, though they seem to be slightly smoother when using spline smoothing.
R> derk1 <-lapply(0:1, function(x) autoplot(mwi2k, der = x, fac = "male")) R> derk2 <-lapply(0:1, function(x) autoplot(mwi2k, der = x, + fac = "female")) R> grid.arrange(grobs = c(derk1, derk2), nrow = 2, ncol = 2) R> ders1 <-lapply(0:1, function(x) autoplot(mwi2s, der = x, fac = "male")) R> ders2 <-lapply(0:1, function(x) autoplot(mwi2s, der = x, + fac = "female")) R> grid.arrange(grobs = c(ders1, ders2), nrow = 2, ncol = 2) It is now time to asses if the factor really produces an effect on the response. To this end, we apply the bootstrap-based test implemented in globaltest(). Judging by the function output, the results would appear to suggest that the factor, sex, produces a real influence on the children's growth. This can also be observed from the graphical representation. Similarly, it can be concluded that the derivatives of these curves are different between levels. Note that the selection of the smoother does not change the hypothesis test conclusion.  In addition, Figures 4 and 5 (right panels) seem to suggest that females experience a spurt in growth earlier than males do, with the two sexes achieving maximum rates of growth at ages close to 10 and 13 years, respectively. These ages are obtained by the critical function using the following command: R> critical(mwi2k, der = 1) Paying close attention to the obtained confidence intervals using kernel smoothers, it is possible to observe that the lower limit of both intervals coincides with the smaller binning node. This occurs due to the high variability of the estimates for ages lower than eight years, resulting in a wide bootstrap confidence interval in this area. According to this, the critical point of several bootstrap replicates is estimated as the first point of the distribution. Two arguments of the frfast function (rankl and ranku) have been proposed in order to address this situation. By specifying them, the user can set a range in which the critical point will be searched.

Critical
Lwr Upr Level male 13.64412 12.971581 13.88881 Level female 10.47682 9.022302 10.85744 Finally, in order to ensure that these differences are really significant, we apply the localtest function. It tests whether the points that maximize the first derivatives of the curves are equal. Judging by these results, the sex-related differences in growth seem to be evident (using kernel or splines smoothers).

Conclusion
This paper discussed the implementation of some methods developed for estimating regression models with or without factor-by-curve interactions in the R package npregfast. Among other things, the package also implements two bootstrap-based procedures designed to test different features of the estimated curves, particularly, to analyze whether the specific curves for each level are equal and to test the equality of the critical points estimated from the respective level-specific curves.
Finally, users may be interested in viewing a live interactive demo of the package in order to see part of its capabilities before installing it. This is possible at http://sestelo.shinyapps. io/npregfast.