NonpModelCheck: An R Package for Nonparametric Lack-of-Fit Testing and Variable Selection

We describe the R package NonpModelCheck for hypothesis testing and variable selection in nonparametric regression. This package implements functions to perform hypothesis testing for the significance of a predictor or a group of predictors in a fully nonparametric heteroscedastic regression model using high-dimensional one-way ANOVA. Based on the p values from the test of each covariate, three different algorithms allow the user to perform variable selection using false discovery rate corrections. A function for classical local polynomial regression is implemented for the multivariate context, where the degree of the polynomial can be as large as needed and bandwidth selection strategies are built in.


Introduction
In modern science, with the advances of computing and data storage, new challenges arise when building models that can predict the behavior of a response variable from massive sets of covariates and large numbers of data points. To address these issues, one fundamental procedure is model checking, which, besides testing parameters and assumptions, can also be viewed as testing hypotheses about the significance of a covariate or a group of covariates in the model. This is of great importance for model building because of the bias or lack of power that misspecification of covariates can cause. There are basically two main principles for model building: parsimony and sparseness. Parsimony is the idea of using few parameters, and sparseness is the concept of predicting with only a few variables. Connecting these two principles, variable selection methods have been proposed with the objective of finding a model that contains only significant covariates. Hence, model checking and variable selection have become extremely important tools for building predictive models, and are implemented in many software packages.
Recent research on variable selection and predictive models has contributed significantly to a wide variety of applications. However, most attention has been given to linear models, i.e., when the expected response at x, the observed vector of covariates, is assumed to be m(x) = x⊤β, where β is the parameter vector to be estimated. One of the reasons why such an analysis is preferred is the readily available statistical software. With the linear model, a new class of methodologies has recently been developed, which uses penalty functions to perform variable selection through the choice of a shrinkage parameter, such as LASSO (Tibshirani 1996), SCAD (Fan and Li 2001), least angle regression (Efron, Hastie, Johnstone, and Tibshirani 2004), adaptive LASSO (Zou 2006), and the Dantzig selector (Candes and Tao 2005). Some of these methods are already available in R (R Core Team 2017), such as packages ncvreg for SCAD (Breheny and Huang 2011; Breheny 2017), glmnet for LASSO and elastic net (Friedman, Hastie, and Tibshirani 2010), and lars for least angle regression (Hastie and Efron 2013).
Another approach explores the connection between model checking and variable selection. Let y_i denote the ith response observation and x_ij the ith observation of the jth predictor. Abramovich, Benjamini, Donoho, and Johnstone (2006) showed that the application of the false discovery rate (FDR) procedure of Benjamini and Hochberg (1995) to the p values resulting from testing H_0^j : β_j = 0 in the aforementioned linear model can be translated into minimizing a model selection criterion of the form

sum_{i=1}^n ( y_i − sum_{j∈S} x_ij β̂_j^S )² + λ σ² |S|,

where S is a subset of {1, 2, . . . , d} specifying the model, β̂_j^S denotes the least squares estimator from fitting model S, |S| is the cardinality of the subset S, and the penalty parameter λ depends both on d and |S|.
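The Benjamini-Hochberg step applied to per-coefficient p values from a linear fit can be sketched in a few lines of base R; `p.adjust` with `method = "BH"` implements the correction. The data-generating step and cutoff below are illustrative, not from the source.

```r
## Sketch: FDR-based selection of linear-model coefficients (illustrative only).
set.seed(1)
n <- 100; d <- 10
X <- matrix(rnorm(n * d), n, d)
y <- 2 * X[, 1] - 3 * X[, 2] + rnorm(n)      # only covariates 1 and 2 matter
fit <- lm(y ~ X)
pvals <- summary(fit)$coefficients[-1, 4]    # per-coefficient p values, intercept dropped
selected <- which(p.adjust(pvals, method = "BH") < 0.05)
selected                                     # typically covariates 1 and 2
```

Selecting via BH-adjusted p values at level 0.05 controls the expected proportion of falsely retained coefficients, which is the link to model selection exploited by Abramovich et al. (2006).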
The assumption of linearity of the expected response function may not always be correct. A model checking or variable selection procedure with such an assumption may not detect significant covariates that have nonlinear effects. Because of this, procedures for both model checking and variable selection have been developed under more general/flexible models; see, for example, Claeskens (2004), Li and Liang (2008), Wang and Xia (2009), Meinshausen, Meier, and Bühlmann (2009), Huang, Horowitz, and Wei (2010), Storlie, Bondell, Reich, and Zhang (2011), Antoniadis, Gijbels, and Lambert-Lacroix (2014), Gijbels, Verhasselt, and Vrinssen (2015) and references therein. In a completely nonparametric regression, Zambom and Akritas (2014) introduced a variable selection procedure using its conceptual connection with model checking. The idea is to perform backward elimination considering the Benjamini and Yekutieli (2001) method applied to the p values resulting from testing the significance of each covariate. The extension to testing a group of covariates and running group selection was explored in Zambom and Akritas (2015). This new methodology inspired the development of package NonpModelCheck (Zambom 2015), which provides functions for testing, in a nonparametric regression context, the significance of a covariate or a group of covariates, and functions to perform variable selection using the results of these tests.
The remainder of this paper is organized as follows. In Section 2 we summarize the methodology of hypothesis testing in nonparametric regression introduced by Zambom and Akritas (2014) and Zambom and Akritas (2015). The function for model checking in package NonpModelCheck is presented in Sections 2.1 and 2.2, along with some examples. Section 3 introduces the modified variable selection algorithm, based on backward elimination or forward selection and FDR corrections, followed by the description of the function npvarselec() and performance comparisons with methods in the literature. The package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=NonpModelCheck.

Nonparametric model checking
The effects of model misspecification in linear regression have been explored in several papers, such as Larson and Bancroft (1963), Ramsey (1969), and Deegan Jr. (1976), among others. Also, in the nonlinear context, it has been seen that inappropriate assumptions or misspecified models can have serious consequences; see, for example, White (1981) and Hartford and Davidian (2000).
One important question in checking the appropriateness of a model is whether the covariates under consideration are significantly related to the response variable. An example of such a situation is when a researcher is doubtful about a few of the measured predictors and wants to decide whether they are necessary to accurately predict the response. Hypothesis testing can be performed on these problems in order to test a specific covariate or a group of covariates at once. Moreover, since recent experiments are based on a rather large number of predictors, researchers often do not know in advance which of those are significant to the model. Hence, the aforementioned connection between hypothesis testing and variable selection is an important tool for model building, and will be explored in the next section. Next, we formalize the hypothesis test for the significance of one or several covariates in the mean regression function in nonparametric regression. Denote the response variable by Y and the set of d available covariates by U. The nonparametric regression model, allowing for heteroscedasticity, can be written as

Y = m(U) + σ(U)ε,   (1)

where σ(U) is the variance function and ε is the random error with zero mean and constant variance, independent of U. Writing the set of covariates as U = (X, Z), where X has dimension r and Z has dimension s (r + s = d), the concept of model checking as a hypothesis test for covariates can be written in the following way:

H_0 : m(x, z) = m_1(x) for all (x, z) and some function m_1, versus H_a : m(x, z) depends on z.   (2)

This null hypothesis is basically stating that the mean regression function depends only on some of the observed covariates, specifically, only on x. Note that if s = 1, we are only testing if a single covariate should be dropped from the model, but for s > 1 this is a multivariate test.
One class of procedures is based on the idea that the null hypothesis residuals, ξ = Y − m_1(X_1), satisfy E(ξ|X) = 0 under H_0, but under the alternative satisfy E(ξ|X) = m(X) − m_1(X_1). Hence, it is straightforward to see that E(ξE(ξ|X)|X) = (m(X) − m_1(X_1))² under the alternative and zero under the null. Using this idea, Fan and Li (1996) proposed a test statistic based on E[ξf_1(X_1)E(ξf_1(X_1)|X)f(X)], a weighted version of E(ξE(ξ|X)|X) that avoids the random denominator arising when estimating m_1(X_1), and which is positive under the alternative and zero under the null. They show that their test statistic is asymptotically normal under H_0. Lavergne and Vuong (2000) propose a test statistic based on a different estimator of the same quantity as Fan and Li (1996), which is also proven to be asymptotically normal under H_0. A related class of procedures is based on the direct estimation of E[(m(X) − m_1(X_1))²W(X)], for some weight function W; see, for example, Ait-Sahalia, Bickel, and Stoker (2001). The use of such test statistics is complicated by the need to correct for their bias. Other tests are also available, such as the bootstrap-based procedure of Delgado and González Manteiga (2001) (which is computationally intensive) and the generalized likelihood ratio tests of Fan, Zhang, and Zhang (2001) and Fan and Jiang (2005).
The R package NonpModelCheck provides a function for hypothesis testing of covariate(s) using the test statistic proposed by Zambom and Akritas (2014), which, in simulations, outperformed the procedures described above. The key idea of the test statistic proposed by Zambom and Akritas (2014) is to construct a high-dimensional one-way ANOVA, with factor levels being the values of the covariates Z_i, i = 1, . . . , n, and ξ̂_i = Y_i − m̂_1(x_i) being the observations from factor level Z_i. With this setup, testing for no factor effects is an intuitive test for H_0. In order to estimate the residuals ξ̂_i, one needs to estimate m_1(x), which is done by local polynomial regression because of its smoothness and asymptotic properties.
Due to the fact that, in high-dimensional ANOVA, theoretical assumptions require more than one observation per cell, factor levels here are augmented by including residuals from nearby covariate values. To be specific, consider the simple situation when s = 1, that is, when Z is univariate. In this case, each cell is augmented by including (p − 1) additional ξ̂'s, which correspond to the (p − 1)/2 Z values that are nearest to Z_i on either side. More precisely, the (ξ̂_i, Z_i), i = 1, . . . , n, are considered to be arranged so that Z_{i1} < Z_{i2} whenever i1 < i2, and for each Z_i, (p − 1)/2 < i ≤ n − (p − 1)/2, the nearest neighbor window W_i is defined as

W_i = { j : |F̂_Z(Z_j) − F̂_Z(Z_i)| ≤ (p − 1)/(2n) },   (3)

where F̂_Z is the empirical distribution function of Z. Hence, the augmented cell corresponding to Z_i is defined by the values of ξ̂ with indices in W_i, and the vector of (n − p + 1)p constructed "observations" in the augmented one-way ANOVA design is

ξ̂ = ( ξ̂_j : j ∈ W_i, i = (p + 1)/2, . . . , n − (p − 1)/2 ).

Let MST = MST(ξ̂) and MSE = MSE(ξ̂) denote the balanced one-way ANOVA mean squares due to treatment and error, respectively, computed on the data ξ̂. The test statistic is

n^{1/2}(MST − MSE),

which has an asymptotic normal distribution with mean 0 and variance (2p(2p − 1)/(3(p − 1))) τ², for a scale parameter τ² determined by the conditional variance function; Zambom and Akritas (2014) suggest a consistent estimator τ̂² of τ² computed from the residuals. For the case when Z is multidimensional, the task of ordering the factor levels according to Z_i becomes a challenge. This is overcome by replacing Z_i with a nonlinear version of the first supervised principal component (SPC) of Bair, Hastie, Paul, and Tibshirani (2006), P_i^λ = Z_i C_λ, defined in the following way. Consider the p values p_j obtained by applying the univariate tests H_0^j : m(X, Z_j) = m(X), j = 1, . . . , s. Let Z^λ keep only those coordinates of Z whose corresponding p values p_j are smaller than a threshold parameter λ. Then, P_i^λ is defined to be the first principal component of Z^λ.
Therefore, when Z has dimension larger than 1, the windows W_i are formed using P_i^λ instead of Z_i, and the computation of the test statistic n^{1/2}(MST − MSE) and of τ̂ also depends on P_i^λ. It is important to note that, computationally, the test is carried out as a one-sided test, comparing with the standard normal distribution, since the test statistic tends to be larger as we depart from the null hypothesis.
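The construction above can be sketched in a few lines of base R for univariate Z: order the residuals by Z, form overlapping windows of size p as cells, and compare the ANOVA mean squares. This is a minimal sketch, assuming the residuals from a fit under H_0 are already available; it is not the package's implementation and omits the studentization by τ̂.

```r
## Sketch of the ANOVA-type statistic of Zambom and Akritas (2014) for
## univariate Z. xi are residuals from a fit under H0; p is the window size.
anova_stat <- function(xi, Z, p = 7) {
  n <- length(xi)
  xi <- xi[order(Z)]                      # arrange cells by increasing Z
  half <- (p - 1) / 2
  idx <- (half + 1):(n - half)            # cell centers with a full window
  cells <- t(sapply(idx, function(i) xi[(i - half):(i + half)]))
  cell.means <- rowMeans(cells)
  grand.mean <- mean(cells)
  MST <- p * sum((cell.means - grand.mean)^2) / (length(idx) - 1)
  MSE <- sum((cells - cell.means)^2) / (length(idx) * (p - 1))
  sqrt(n) * (MST - MSE)                   # fluctuates around 0 under H0
}

set.seed(10)
s0 <- anova_stat(rnorm(300), runif(300))  # H0 residuals: no Z effect
```

Under the null, repeated draws of `s0` center at zero; under an alternative, the cell means vary systematically with Z, MST exceeds MSE, and the statistic drifts to the right, which is why the test is one-sided.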

Hypothesis testing for a single covariate with NonpModelCheck
In this section we give specific examples of how to use the function npmodelcheck() for testing hypotheses as in (2). To use this function, the user needs to provide a few parameters/attributes, which can basically be divided into three groups: (1) the data and the index of the covariate(s) to be tested; (2) parameters for the local polynomial fit; and (3) parameters for dimension reduction. The input data consist of a vector or matrix of observed covariates and a vector of observed responses, and the user is also required to input the index of the covariate(s) to be tested. For the parameters of the local polynomial fit, see Appendix A.
Note that, under the assumption of a sparse model, the estimation of m_1 in a nonparametric context will suffer from the curse of dimensionality. To moderate this issue, two techniques are available within the function to reduce the dimension of X before applying the local polynomial fit. The first technique is the nonlinear SPC, as described previously, with fixed λ = 0.3. The second technique available to the user is the sliced inverse regression (SIR) method proposed by Li (1991). In SIR, it is assumed that a small number of linear combinations of the predictors contains all the information about Y. More specifically, it is assumed that y = m(x⊤β_1, . . . , x⊤β_K, ε), where the β's are unknown row vectors and ε is independent of x. This is an extension of the so-called multiple index model. The estimation of these directions is based on the inverse regression of x on y, coordinate-wise. A brief description of this procedure is as follows. Let x_i, i = 1, . . . , n, be the d-dimensional vector of observation i; standardize it to obtain x̃_i = Σ̂^{−1/2}(x_i − x̄); divide the range of y into H slices and let p̂_h be the proportion of the y_i that fall in slice h = 1, . . . , H; calculate m̂_h = (1/(np̂_h)) Σ_{y_i in slice h} x̃_i; perform a weighted principal components analysis by computing the eigenvectors and eigenvalues of V̂ = Σ_{h=1}^H p̂_h m̂_h m̂_h⊤; the result is β̂_k = η̂_k⊤Σ̂^{−1/2} (k = 1, . . . , K), where the η̂_k are the K largest eigenvectors of V̂. The number of eigenvectors K is chosen by a sequential test of whether the first K eigenvalues of V̂ are significantly different from zero (see Li 1991 for details).
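The SIR steps above can be sketched directly in base R. This is a minimal illustration of the algorithm as described (whitening, slicing, weighted PCA), not the package's internal code; the function name and the synthetic data are ours.

```r
## Sketch of sliced inverse regression (SIR) following the steps above.
## Returns the first K estimated directions; H is the number of slices.
sir_directions <- function(x, y, H = 10, K = 2) {
  n <- nrow(x)
  Sig.isqrt <- solve(chol(cov(x)))               # inverse Cholesky factor: whitens x
  xs <- scale(x, center = TRUE, scale = FALSE) %*% Sig.isqrt
  slice <- cut(rank(y, ties.method = "first"), H, labels = FALSE)
  ph <- tabulate(slice, H) / n                   # slice proportions p_h
  mh <- t(sapply(1:H, function(h) colMeans(xs[slice == h, , drop = FALSE])))
  V <- crossprod(sqrt(ph) * mh)                  # sum_h p_h m_h m_h'
  eta <- eigen(V, symmetric = TRUE)$vectors[, 1:K, drop = FALSE]
  Sig.isqrt %*% eta                              # back to the original scale
}

## Illustration on synthetic data with a single linear direction
set.seed(3)
x <- matrix(rnorm(400 * 4), 400, 4)
y <- x[, 1] + x[, 2] + 0.2 * rnorm(400)
B1 <- sir_directions(x, y, H = 10, K = 1)   # first column roughly proportional to (1, 1, 0, 0)
```

Because only the span of the directions is identified, the estimate is meaningful up to scale and sign.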
The call of the function, as described in the package manual, is:

npmodelcheck(X, Y, ind_test, p = 7, degree.pol = 0, kernel.type = "epanech",
  bandwidth = "CV", gridsize = 30, dim.red = c(1, 10))

The function npmodelcheck() returns an object of class 'npmodelcheck', containing a list with three items: the bandwidth used for the local polynomial fit, the predicted/smoothed values of the local polynomial fit, and the p value of the hypothesis test. When simply called, it only prints the p value of the test, but a summary function is available which, given an object of class 'npmodelcheck', displays all items.
To illustrate the function npmodelcheck(), we will consider the following situation. Suppose we have 100 observations of a set U of 12 covariates and a response variable Y. Assume that the true, but unknown, regression function m(U) depends on the covariates U_j, j = 1, . . . , 12, with a known constant θ controlling the contribution of U_2. In this example, we explore the test implemented by this function when testing H_0 : m(U) = m_1(U_{−(2)}), where U_{−(2)} represents all covariates except that of index 2. We will use θ = 0, 1 and 2: for θ = 0 we are under the null hypothesis, and for θ > 0 we are under the alternative and expect the p value to become smaller as θ increases. The following code runs the test on simulated data.

R> set.seed(1)
R> runs <- 1000
R> p <- 9
R> n <- 100
R> d <- 12
R> result <- c(0, 0, 0)
R> for (index in 1:runs) {
+    for (theta in 0:2) {
+      U <- matrix(0, n, d)

The vector result contains the rejection rates, after 1000 simulation runs, of the test for θ = 0, 1, 2 in its positions 1, 2 and 3, respectively. We see that the entry in the first position of result, which corresponds to the type I error, is 0.051, suggesting that the level of the test is very close to the nominal level 0.05. On the other hand, the rejection rates for θ = 1 and 2 are higher, indicating an increase in the power of the test under alternatives. For comparison purposes, the generalized likelihood ratio test (GLRT) of Fan and Jiang (2005) was also evaluated with the same settings, as it is one of the most recent tests that can deal with nonparametric hypothesis testing. The rejection rates of their test are 0.052, 0.33 and 0.90 for θ = 0, 1 and 2, respectively. These results suggest that the test implemented in npmodelcheck() not only maintains a good level, but also has higher power than its competitors.
To demonstrate the importance of the dimension reduction techniques, and the improvement in the power of the test they can provide, we ran a simulation similar to the one above with the option npmodelcheck(U, Y, 2, dim.red = 0). In this case, no dimension reduction is performed and the algorithm computes a nonparametric estimate of m_1(X_{−(2)}), where X_{−(2)} represents all X covariates except that of index 2, with local polynomial regression, hence suffering from the curse of dimensionality. The output of result/runs shows rejection rates that are completely different from those obtained with dimension reduction: the power for detecting departures from the null hypothesis is far from optimal.
The function npmodelcheck() also allows the user to input the size p of the window W_i, as defined in (3). This choice is crucial: the larger p is, the stronger the dependence between neighboring cells in the one-way ANOVA, causing a decrease in power. On the other hand, we do not advise using too small a p, due to the instability of the test. For example, the code below tests the significance of the third covariate in U using p = 3, 7 and 15. Note that for p = 15 the test fails to detect the significance of the third covariate.
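The snippet referred to above is not reproduced in this copy; a hedged reconstruction, assuming the matrix U and response Y from the previous simulation are still in the workspace, would look like:

```r
## Hedged reconstruction of the missing snippet: test covariate 3 for
## several window sizes p (assumes U, Y exist and the package is installed).
library("NonpModelCheck")
for (p in c(3, 7, 15)) {
  print(npmodelcheck(U, Y, 3, p = p))   # prints the p value of each test
}
```

Comparing the three printed p values illustrates the trade-off discussed above: very small p destabilizes the test, while very large p inflates the dependence between cells.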

Hypothesis testing for multiple covariates with NonpModelCheck
It is not unusual to find applications where covariates act together as groups. Model checking in this case can be seen as testing for the significance of all the covariates in a group at once, i.e, testing if the group should be dropped from the model. An interesting example, which has been explored in several papers recently, is DNA microarray data. Such data usually consists of thousands of predictors (genes) but only a few hundred (or less) observations. Studies show that genes can be clustered together in their association with the response, avoiding collinearity among predictors and hence improving model adequacy.
The function npmodelcheck() allows the user to test a group of covariates in a single test, i.e., hypothesis tests as in (2). For that purpose, the user only needs to input a vector of indices corresponding to the covariates to be tested in the ind_test argument of npmodelcheck().
As an example, consider the situation where six covariates are observed but they are clustered in the following groups: group 1: (X_2, X_4, X_5); group 2: (X_1, X_3); group 3: (X_6). Assume that the covariates X = (X_1, . . . , X_6) come from a normal distribution with mean µ = (0, . . . , 0)⊤ and covariance matrix Σ. Suppose that the response variable Y is associated with X through a regression function with additive errors ε iid∼ N(0, 1). The code below tests the significance of group 1.
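The exact covariance matrix and regression equation are not reproduced in this copy, so the data-generating step below is purely illustrative; only the final call, passing the indices of group 1 to ind_test, reflects the package interface described above.

```r
## Hedged sketch: testing group 1 = (X2, X4, X5) jointly with npmodelcheck().
## The covariance and response used here are illustrative stand-ins.
library("NonpModelCheck")
library("MASS")                                       # for mvrnorm
set.seed(1)
n <- 100
Sigma <- diag(6)
Sigma[2, 4] <- Sigma[4, 2] <- 0.5                     # illustrative correlation
X <- mvrnorm(n, mu = rep(0, 6), Sigma = Sigma)
Y <- X[, 2] + X[, 4]^2 + rnorm(n)                     # illustrative response
npmodelcheck(X, Y, c(2, 4, 5))                        # test group 1 in a single test
```

Since two of the three covariates in group 1 enter the illustrative response, the printed p value should be small, leading to rejection of H_0 for the group.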

Nonparametric variable/group selection
Variable selection has become an important topic in regression analysis. It uses the assumption of sparseness to build parsimonious models. Generally, when the initial regression model is introduced, we are presented with a large number of predictors. While using many predictors decreases bias, prediction may be poor if the model is built with insignificant ones. In order to have a more accurate model, with enhanced prediction power, procedures of variable selection such as backward elimination or forward selection aim at selecting only those predictors that contribute significantly to the regression model.
The function npvarselec() in package NonpModelCheck uses either backward elimination or forward selection to build a model, exploiting the connection between model checking and variable selection. Such a procedure was introduced by Zambom and Akritas (2014), where, in backward elimination for example, the algorithm eliminates, at each step, the least significant covariate, as long as its p value is larger than the cutoff point of the false discovery rate correction for dependent tests (Benjamini and Yekutieli 2001). Next we briefly describe the algorithms for backward elimination and forward selection. Assume that the model with d covariates is sparse, in the sense that there exists a subset of indices I_0 = {j_1, . . . , j_{d_0}} ⊂ {1, . . . , d} such that only the covariates X_j with j ∈ I_0 influence the regression function. Moreover, assume the dimension reduction model of Li (1991), i.e., m(x) = g(Bx), for a K-dimensional function g, where B is a K × d matrix. The backward elimination algorithm, based on the Benjamini and Yekutieli (2001) method for controlling the false discovery rate, is as follows:

1. Obtain p values from testing each of the hypotheses

H_0^j : m(x) = m_1(x_{(−j)}), j = 1, . . . , d,

where x_{(−j)} = (x_1, ..., x_{j−1}, x_{j+1}, ..., x_d), in the following way: (a) Compute the test statistic using residuals formed by a local polynomial regression on the variables B̂x_{(−j)}, where B̂ is the K × (d − 1) estimated matrix obtained by applying SIR; (b) compute the p value for H_0^j as π_j = 1 − Φ(z_j); (c) proceed to the next step.

2. Compute

k = max{ j ∈ {1, . . . , d} : π_(j) ≤ jα / (d Σ_{i=1}^d i^{-1}) },

for a choice of level α, where π_(1) ≤ . . . ≤ π_(d) are the ordered p values. If k = d, stop and retain all covariates; otherwise, remove the covariate with the largest p value, update x by eliminating it, and set d = d − 1.
3. Repeat Steps 1 and 2, with the updated x and B.
Note that if the user runs the backward elimination on a very large number of covariates, the computational cost may be high. Simulations show that a screening procedure can be applied before running the algorithm without affecting its power: eliminating those covariates whose p values (obtained with univariate npmodelcheck() tests) are very high, say greater than 0.7 or 0.8, does not degrade the algorithm's performance, but reduces computational time.
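The screening step can be sketched as below. The helper pval() is hypothetical: it stands for extracting the p value from an 'npmodelcheck' object (the actual component name is in the package manual), and the argument name `method` for npvarselec() is also an assumption; this is a sketch of the workflow, not verified package syntax.

```r
## Hedged sketch of screening before backward elimination: drop covariates
## whose univariate p values exceed 0.8, then run the selection algorithm.
library("NonpModelCheck")
pvals <- sapply(1:ncol(X), function(j) pval(npmodelcheck(X[, j], Y, 1)))
keep <- which(pvals <= 0.8)            # screen out clearly irrelevant covariates
npvarselec(X[, keep], Y, method = "backward")
```

Since a covariate with a univariate p value near 1 is essentially never retained by the FDR step, discarding it up front changes nothing but the running time.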
With the same assumptions of sparseness and the dimension reduction model, a forward algorithm for variable selection with FDR corrections can be described as:

1. Obtain p values from testing each of the hypotheses

H_0^j : m(x_j) = E(Y), j = 1, . . . , d,

in the following way: (a) Compute the test statistic using (Y − Ȳ) instead of residuals in the high-dimensional one-way ANOVA.
(b) Compute the p value for H_0^j as π_j = 1 − Φ(z_j).

2. Include in the model the covariate with the smallest p value: let x* = (x_(1)), where x_(1) is the covariate corresponding to the smallest p value π_(1), and set q = 1; update x by eliminating this covariate and set d = d − 1.
3. Find the smallest p value to include: obtain p values from testing each of the hypotheses

H_0^j : m(x*, x_j) = m_1(x*), j = 1, . . . , d,

in the following way: (a) Compute the test statistic using residuals formed by a local polynomial regression on the variables B̂x*, where B̂ is the K × q estimated matrix obtained by applying SIR.
(b) Compute the p value for H_0^j as π_j = 1 − Φ(z_j).
4. Temporarily include in the model the covariate with the smallest p value: (a) let x* = (x*, x_(1)), where x_(1) is the covariate corresponding to π_(1), and let q = q + 1; (b) update x by eliminating the covariate corresponding to π_(1); (c) update d to d − 1; (d) and proceed to the next step.
5. Check if all the covariates in the present model are significant according to FDR: (a) Obtain p values from testing each of the hypotheses

H_0^j : m(x*) = m_1(x*_{(−j)}), j = 1, . . . , q,

where x*_{(−j)} = (x*_1, ..., x*_{j−1}, x*_{j+1}, ..., x*_q), in the following way: i. Compute the test statistic using residuals formed by a local polynomial regression on the variables B̂x*_{(−j)}, where B̂ is the K × (q − 1) estimated matrix obtained by applying SIR. ii. Compute the p value for H_0^j as π_j = 1 − Φ(z_j).
(b) Compute

k = max{ j ∈ {1, . . . , q} : π_(j) ≤ jα / (q Σ_{i=1}^q i^{-1}) },

for a choice of level α, where π_(1) ≤ . . . ≤ π_(q) are the ordered p values. If k = q = d, stop and retain all covariates. If k = q < d, go to Step 3. If k < q: i. remove from x* the covariate temporarily added in Step 4, and let q = q − 1; ii. stop and retain x*.

This forward procedure, at each step, first finds a candidate covariate to enter the model. The candidate is the covariate that yields the smallest p value from the npmodelcheck() test, taking into consideration all covariates that are already in the model. Then, the candidate covariate is only retained if the new model, with the candidate covariate included, is significant, i.e., if the test of each covariate in the new model produces a significant p value according to the FDR corrections.
Note that there may be a covariate which, when added to the model, would result in a new significant model. However, when tested as candidate, it is not the one that yields the smallest p value. Hence, a different forward procedure can be defined, adding to the model only the covariate that would result in the "most significant" new model.

1. Obtain p values from testing each of the hypotheses

H_0^j : m(x_j) = E(Y), j = 1, . . . , d,

in the following way: (a) Compute the test statistic using (Y − Ȳ) instead of residuals in the high-dimensional one-way ANOVA. (b) Compute the p value for H_0^j as π_j = 1 − Φ(z_j).

2. Include in the model the covariate with the smallest p value: let x* = (x_(1)) and q = 1; update x by eliminating this covariate and set d = d − 1.
3. Add the covariate that produces the "most significant" new model: (a) For k in 1, . . . , d: i. Let u* = (x*, x_k). Obtain p values from testing each of the hypotheses

H_0^j : m(u*) = m_1(u*_{(−j)}),

where u*_{(−j)} denotes the vector u* without covariate u*_j, in the following way: A. Compute the test statistic using residuals formed by a local polynomial regression on the variables B̂u*_{(−j)}, where B̂ is the K × q estimated matrix obtained by applying SIR.

B. Compute the p value for H_0^j as π_j = 1 − Φ(z_j).
ii. Compute

k' = max{ j : π_(j) ≤ jα / (q Σ_{i=1}^q i^{-1}) },

for a choice of level α, where π_(1), . . . , π_(q) are the ordered p values. If k' = q and π_(q) is less than the π_(q) produced by the other candidates, set covariate k as the new candidate.
(b) If no covariate was set as candidate, stop. Otherwise: i. include the candidate covariate in x* and set q = q + 1; ii. update x by eliminating the candidate covariate and set d = d − 1; iii. return to Step 3a.
When it is believed that covariates come in groups, as for example clusters of genes in microarrays, one fundamental goal is to select those groups (clusters) that have a significant effect on the response variable Y . It is possible to use algorithms similar to those described above in order to run group selection, the only difference would be to use the test of the whole group in each step (see Zambom and Akritas 2015 for details).

Nonparametric variable selection with NonpModelCheck
The procedures for variable selection described in the previous section are implemented in function npvarselec() in package NonpModelCheck. The arguments for this function are basically the same as those needed for npmodelcheck(), except that the user is required to specify the type of selection. The options are "backward", "forward" and "forward2", corresponding to the three algorithms described in Section 3, in the order they appear.
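A minimal usage sketch of the three algorithms follows. The argument name `method` for the selection type is an assumption (the exact name and defaults are in the package manual), and the simulated data are illustrative.

```r
## Hedged usage sketch of npvarselec() with the three selection algorithms.
library("NonpModelCheck")
set.seed(1)
n <- 100; d <- 8
X <- matrix(rnorm(n * d), n, d)
Y <- 2 * X[, 1] + X[, 3]^2 + rnorm(n)        # covariates 1 and 3 are relevant
npvarselec(X, Y, method = "backward")
npvarselec(X, Y, method = "forward")
npvarselec(X, Y, method = "forward2")
```

Because covariate 3 enters nonlinearly, this is exactly the setting in which the nonparametric tests are expected to outperform linear selection methods.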
In order to assess the performance of the variable selection procedure using the three different algorithms, we simulated the linear model above 100 times and computed the average number of correctly selected predictors and the average number of incorrectly selected predictors. We also simulated this linear model with dependence between the covariates, that is, with Σ_{i,j} = 0.5^{|i−j|}. The performance of these algorithms was compared to the recent nonlinear variable selection methods COSSO (R package cosso; Zhang and Lin 2013) and CISE (MATLAB package cise; accompanying material for Chen, Zou, and Cook 2010, available at http://www.stat.nus.edu.sg/~stacx/cise.zip). The results are shown in Table 1; those for CISE were obtained using Octave (Eaton, Bateman, Hauberg, and Wehbring 2015). For Table 2, 100 data sets of size n = 40 were generated from the models Y = g_ℓ(X) + ε, ℓ = 1, 2, where ε ∼ N(0, 0.3²) and the dimension of X is d = 8. In both the linear and nonlinear models, the three proposed algorithms from npvarselec() had similar results. For the linear models, although CISE selected, on average, fewer incorrect predictors, it missed, on average, one more correct predictor than the proposed algorithms. On the other hand, all three algorithms substantially outperformed CISE and COSSO in the case of nonlinear models. COSSO results in the simulated linear and nonlinear models were inferior to those of CISE and the proposed algorithms.
A demonstration of the nonparametric group selection algorithms using npvarselec() can be found in Section 4.

Group selection applied to a real data set
Microarray experiments generate large datasets which generally contain just a few samples but expression values for thousands of genes. In such studies, data analysts often consider studying clusters (or groups) of genes that act together, since their collective expression may have a stronger association with the response. Thus, it is crucial not only to find these clusters, but also to select those which significantly influence the outcome variable. Function group.npvarselec() in package NonpModelCheck performs group selection with algorithms similar to those in Section 3. All three types of algorithm are available: backward, forward and forward2; the difference is that, at each step, all the covariates of a group are tested at the same time. Basically, the arguments of the function are the same as those of npvarselec(), except for two extra ones. First, the user needs to specify the indices of the variables in each group in an argument of type 'list'. Second, when testing group i, instead of computing the residuals based on m̂_1(X_{(−I_i)}), where I_i is the set of indices of the covariates in group i, the user may prefer to fit m̂_1(S_1, . . . , S_{i−1}, S_{i+1}, . . . , S_d), where S_j, j = 1, . . . , d, is the first supervised principal component of group j, with d being the total number of groups. If the argument fitSPC is set to TRUE, the latter is fit; otherwise, the former computation is done.
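A usage sketch, reusing the six-covariate grouping from the earlier example, is given below. The positional argument order and the name `method` are assumptions to be checked against the package manual; only the 'list' of group indices and the fitSPC flag are taken from the description above.

```r
## Hedged sketch: group selection with group.npvarselec(). Group memberships
## are passed as a list of index vectors (groups 1-3 from the earlier example).
library("NonpModelCheck")
groups <- list(c(2, 4, 5), c(1, 3), 6)
group.npvarselec(X, Y, groups, method = "backward", fitSPC = TRUE)
```

With fitSPC = TRUE, each group is summarized by its first supervised principal component when fitting the null model, which keeps the local polynomial fit low-dimensional even when groups are large.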
A demonstration of the use of this function will be done using the prostate cancer dataset of Singh et al. (2002), available at http://icos.cs.nott.ac.uk/datasets/microarray.html. The data set is composed of gene expressions of 52 prostate tumor and 50 non-tumor prostate samples, obtained with the Affymetrix technology. Before applying the group selection algorithm, the clusters need to be formed. To do so, the procedure wilma() from package supclust was used. The following code performs a forward algorithm for group variable selection on the 60 clusters formed by wilma() (note that applying wilma() to a fixed dataset twice may result in different output clusters due to the randomness built into its algorithm).

R> library("supclust")
R> data <- read.

A. A new function for local polynomial fitting in R
Local polynomial regression is one of the most powerful nonparametric curve fitting approaches. It was introduced by Stone (1977) and has since been used in countless applications. Its advantages over the simple Nadaraya-Watson estimator have made it prominent in modern research for estimating surfaces without parametric assumptions. For example, it does not suffer from boundary effects, reduces bias, adapts to the design, and is uniformly consistent over compact subsets (see Fan 1992; Fan 1993; Fan and Gijbels 1992; Ruppert and Wand 1994; Masry 1996). Due to its fundamental role in nonparametric estimation of regression curves, it is found in most mathematical/statistical software. Specifically in R, local polynomial estimation is implemented in functions loess() (stats), locpoly() (KernSmooth; Wand 2015), locfit() (locfit; Loader 2013), locpol() (locpol; Ojeda Cabrera 2012) and sm.regression() (sm; Bowman and Azzalini 2014). However, not all the advantages of this procedure are explored in these functions: locpoly() and locpol() only admit the univariate case, sm.regression() works with one or two covariates, loess() fits up to four predictors and up to degree 2, and locfit() is complicated to use: it requires the user to input the predictors one by one and does not have a built-in procedure for automatically selecting the smoothing parameter.
Package NonpModelCheck provides a new function for local polynomial regression fitting, namely localpoly.reg(), which is easy to use and implements the classical approach directly. It automatically identifies the number of predictors in the input matrix, provides several kernel types for the weights and different built-in procedures for bandwidth selection. Moreover, it estimates not only the regression function but also its derivatives. The call of the function is

R> localpoly.reg(X, Y, points = NULL, bandwidth = "CV", gridsize = 30,
+    degree.pol = 0, kernel.type = "epanech", deriv = 0)
It is well known that the goodness of fit depends strongly on the kernel weight and the degree of polynomial used. localpoly.reg() offers a variety of kernel options, including "box", "trun.normal", "gaussian", "epanech", "biweight", "triweight" and "triangular". In the univariate case (a single predictor), the degree of the polynomial is unrestricted, providing further bias reduction when desired. In the multivariate case, the polynomial can be chosen up to the second degree. In a simulated example, we generated a uniform predictor X ~ U(−4, 4) i.i.d. and set the response Y = sin(πX)X² + ε, where ε ~ N(0, 1) i.i.d., and fitted a nonparametric regression using localpoly.reg() and loess() with a degree 2 polynomial. In a second example, we generated X ~ U(−3, 10) i.i.d. and set Y = |X(7 − X)| cos(πX/6) + ε, where ε ~ N(0, 0.2), fitted two different kernel types, and compared the results to the fit of sm.regression(). The comparative plots are shown in Figure 1.
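The first simulation above can be sketched as follows. This is an illustrative reconstruction, not the original code: the seed, sample size and plotting details are our own choices, and the snippet assumes package NonpModelCheck is installed.

```r
# Sketch of the first simulation: Y = sin(pi*X)*X^2 + N(0,1) noise,
# fitted with localpoly.reg() (degree 2, CV bandwidth) and loess() (degree 2).
library("NonpModelCheck")
set.seed(1)
X <- runif(200, -4, 4)
Y <- sin(pi * X) * X^2 + rnorm(200)
grid <- seq(-4, 4, length.out = 200)
fit <- localpoly.reg(X, Y, points = grid, bandwidth = "CV", degree.pol = 2)
plot(X, Y, col = "gray")
lines(fit$points, fit$predicted, col = 2)       # localpoly.reg() fit (red)
lo <- loess(Y ~ X, degree = 2)
lines(grid, predict(lo, data.frame(X = grid)), col = 4)  # loess() fit (blue)
```

The second simulation follows the same pattern, varying only the kernel.type argument.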
A fundamental advantage of localpoly.reg() over the other functions is its built-in bandwidth selection procedures, which allow the user to choose a method according to the situation, or even to compare different methods. In the univariate case, the available methods are leave-one-out cross validation, generalized cross validation and adaptive/variable (binned) bandwidth selection. The adaptive/variable bandwidth selection is a modified version of the procedure proposed by Fan and Gijbels (1995): first, the interval of estimation (the range of X) is divided into [1.5n/(10 log(n))] subintervals; then, a leave-one-out cross validation is performed using only the observations in each subinterval, selecting its corresponding bandwidth; finally, the resulting bandwidth step function is smoothed using a local polynomial fit, with leave-one-out cross validation used to select the pilot bandwidth.
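The first two steps of that binned procedure can be sketched in base R. This is a simplified illustration only, using a local constant (Nadaraya-Watson) smoother in place of the package's local polynomial machinery, and it omits the final smoothing of the bandwidth step function; localpoly.reg(..., bandwidth = "Adp") implements the full procedure.

```r
# Leave-one-out CV criterion for a Nadaraya-Watson smoother with bandwidth h
nw <- function(x0, X, Y, h) {
  w <- dnorm((X - x0) / h)
  sum(w * Y) / sum(w)
}
loocv <- function(X, Y, h) {
  mean(sapply(seq_along(X), function(i)
    (Y[i] - nw(X[i], X[-i], Y[-i], h))^2))
}

# Binned adaptive bandwidths: split range(X) into [1.5n/(10 log n)]
# subintervals and pick a CV-optimal bandwidth within each bin.
adaptive.bw <- function(X, Y, h.grid) {
  n <- length(X)
  k <- max(1, floor(1.5 * n / (10 * log(n))))
  breaks <- seq(min(X), max(X), length.out = k + 1)
  bin <- cut(X, breaks, include.lowest = TRUE)
  sapply(levels(bin), function(b) {
    idx <- which(bin == b)
    if (length(idx) < 3) return(NA)           # too few points to cross-validate
    h.grid[which.min(sapply(h.grid, function(h) loocv(X[idx], Y[idx], h)))]
  })
}
```

In the full procedure, the step function returned here would itself be smoothed by a local polynomial fit before being used.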
The options for bandwidth selection are all condensed in the bandwidth argument of function localpoly.reg(): set it to "CV" to request leave-one-out cross validation, to "GCV" for generalized cross validation, or to "Adp" for adaptive bandwidth selection. The following code generates three random simulations, demonstrating the use of the different bandwidth selection methods in localpoly.reg(). Figures 2 and 3 show the output, where the localpoly.reg() fits can be compared to those of locpoly(), locfit(), locpol() and sm.regression().

Figure 3: Estimated curves using localpoly.reg() with adaptive bandwidth selection (red), locfit() using 0.05 for the nearest neighbor argument (black), and sm.regression() using cross validation to select the bandwidth (blue).
R> fit3 <- locfit(Y ~ lp(X, deg = 1, nn = .05))
R> plot(fit3, get.data = TRUE)
R> lines(fit2$eval.points, fit2$estimate, col = 4)
R> lines(fit$points, fit$predicted, col = 2)

In Figure 2 on the left, the default settings of function locfit() were used, and the resulting fit is clearly oversmoothed. locpoly() can match the behavior of localpoly.reg() only if the user obtains a useful bandwidth elsewhere, since the bandwidth must be entered manually. Figure 2 on the right presents the curves estimated by locpol() and localpoly.reg(), where the latter is slightly smoother. In Figure 3, the simulated data exhibit different patterns of relationship depending on the predictor x, hence requiring a variable bandwidth. Clearly sm.regression() does not capture the first sine trend, and locfit() does so only when its nearest neighbor argument is manually set very small (0.05), while the adaptive algorithm built into localpoly.reg() captures all trends in the data.
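For reference, the three built-in selectors differ only in the value of the bandwidth argument. A minimal sketch, assuming X and Y are as in the simulations above and NonpModelCheck is installed:

```r
# Same data, three bandwidth selection strategies:
fit.cv  <- localpoly.reg(X, Y, bandwidth = "CV")   # leave-one-out cross validation
fit.gcv <- localpoly.reg(X, Y, bandwidth = "GCV")  # generalized cross validation
fit.adp <- localpoly.reg(X, Y, bandwidth = "Adp")  # adaptive (binned) selection
```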
An important feature, useful when a particular amount of smoothing is desired, is that fixed bandwidths can be passed to the function directly. When the observed data X is univariate, a vector of bandwidths of the same length as X may be supplied, one bandwidth per data point. When X is multivariate, either a vector or a matrix of bandwidths can be used: a vector must have one entry per column of X, so that each entry is the bandwidth for the corresponding covariate; a matrix must have the same dimensions as X, with each entry the bandwidth for the corresponding data point and covariate.
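The three input shapes can be illustrated as follows. The bandwidth values and simulated data here are purely illustrative, and the snippet assumes NonpModelCheck is installed.

```r
# Passing fixed bandwidths directly to localpoly.reg()
library("NonpModelCheck")
set.seed(3)
n <- 100

# Univariate X: one bandwidth per observation (vector of length n)
x <- runif(n, 0, 1)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)
fit1 <- localpoly.reg(x, y, bandwidth = rep(0.08, n))

# Multivariate X: one bandwidth per covariate (vector of length ncol(X)) ...
X <- cbind(runif(n), runif(n))
Y <- sin(pi * X[, 1]) + X[, 2]^2 + rnorm(n, sd = 0.2)
fit2 <- localpoly.reg(X, Y, bandwidth = c(0.10, 0.15))

# ... or one per data point and covariate (matrix with the dimensions of X)
fit3 <- localpoly.reg(X, Y, bandwidth = matrix(0.12, n, 2))
```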
Modern research deals with an increasing amount of data, and scientists often face situations where linear models are not adequate. With a large number of observations available, it is possible to fit nonparametric models with multivariate predictors. Several models can be used in this case, such as additive models, varying coefficient models, single index models, etc., with R packages such as gam (Hastie 2016) and mgcv (Wood 2006, 2017). However, for a completely nonparametric model, the options in R are restricted to a few functions, namely loess(), locfit() and sm.regression(). While sm.regression() deals with at most 2 predictors, loess() and locfit() work with up to 4 covariates. For function localpoly.reg() in package NonpModelCheck, the user can input a matrix X with no limit on the number of predictors (columns) or observations (rows). Obviously, caution must be taken with the curse of dimensionality, since estimation in several dimensions with a small sample size may be very poor. The call of the function is the same as described previously: if the argument X is a matrix, the function will identify it and run a multivariate local polynomial regression.
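A short sketch of the multivariate case, with simulated data of our own choosing (the package must be installed):

```r
# A matrix X triggers the multivariate fit automatically
library("NonpModelCheck")
set.seed(4)
X <- matrix(runif(300 * 3), 300, 3)   # 3 predictors; no built-in limit
Y <- 4 * sin(pi * X[, 1]) + X[, 2]^2 - X[, 3] + rnorm(300, sd = 0.5)
fit <- localpoly.reg(X, Y, bandwidth = "CV", degree.pol = 1)
```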
In the bivariate case, the function plot3d.localpoly.reg(), available in package NonpModelCheck, creates a 3-dimensional plot of the surface estimated by function localpoly.reg(), using function persp() from the base package graphics. The call for this function is

R> plot3d.localpoly.reg(X, Y, bandwidth = "CV", gridsize = 30,
+    degree.pol = 0, kernel.type = "epanech", gridsurface = 30,
+    xlab = expression(X_1), ylab = expression(X_2), zlab = expression(Y),
+    theta = 30, phi = 30, expand = 0.5, col = "lightblue", ltheta = 120,
+    shade = 0.75, ticktype = "detailed", pch = 16, ...)

The user can pass parameters both for function localpoly.reg() and for persp(). The only extra parameter is gridsurface, which corresponds to the number of points on each axis at which to evaluate the local polynomial surface. Figure 4 shows an example of an estimated bivariate surface produced by plot3d.localpoly.reg(). The code for such a plot is as follows.

R> par(mfrow = c(1, 2))
R> plot3d.localpoly.reg(X, Y, bandwidth = "CV2", degree.pol = 0,
+    gridsurface = 30, main = "Estimated function")
R> f <- function(x1, x2) { 4 * sin(pi * x1) + 4 * sin(pi * x2) }
R> persp(

Note that the choice of bandwidth was "CV2". This requests leave-one-out cross validation for each dimension, searched as a tensor product. If bandwidth is set to "GCV2", the algorithm runs the tensor product search with generalized cross validation. The values "CV" and "GCV" used in the univariate case can also be used with multivariate predictors: the bandwidth for each dimension is then chosen in an individual search, within the regression of each predictor on the response. In conclusion, R package NonpModelCheck is user friendly and provides a combination of tools for local polynomial estimation not yet found in any other existing R package.
It performs the classical method, with unrestricted degree of polynomial in the univariate case and up to second degree in the multivariate case. Moreover, three different bandwidth selection methods are available: cross validation, generalized cross validation and adaptive (binned) bandwidth selection. Hence, NonpModelCheck provides resources not found in other packages that, combined, serve the user seeking to perform nonparametric local polynomial estimation.