Nonparametric Regression via StatLSSVM

We present a new MATLAB toolbox, available under Windows and Linux, for nonparametric regression estimation based on the statistical library for least squares support vector machines (StatLSSVM). The StatLSSVM toolbox is written so that only a few lines of code are necessary to perform standard nonparametric regression, regression with correlated errors and robust regression. In addition, the construction of additive models and of pointwise or uniform confidence intervals is supported. A number of tuning criteria are available, such as classical cross-validation, robust cross-validation and cross-validation for correlated errors. Minimization of these criteria is carried out automatically, without any user interaction.


Introduction
Nonparametric regression is a very popular tool for data analysis because it imposes few assumptions about the shape of the mean function. It is therefore a flexible tool for modeling nonlinear relationships between a dependent variable and regressors. Nonparametric and semiparametric regression techniques continue to be an area of active research. In recent decades, methods have been developed for robust regression (Jurečková and Picek 2006; Maronna, Martin, and Yohai 2006; De Brabanter et al. 2009), regression with correlated errors (time series errors) (Chu and Marron 1991; Hart 1991; Hall, Lahiri, and Polzehl 1995; Opsomer, Wang, and Yang 2001; De Brabanter, De Brabanter, Suykens, and De Moor 2011b), regression in which the predictor or response variables are curves (Ferraty and Vieu 2006), images, graphs, or other complex data objects, regression methods accommodating various types of missing data (Hastie, Tibshirani, and Friedman 2009; Marley and Wand 2010), nonparametric regression (Györfi, Kohler, Krzyzak, and Walk 2002) and Bayesian methods.


Least squares support vector machines for regression

A key property of kernel methods is that the feature map ϕ and the feature space never have to be known in explicit form, i.e., one only needs to replace the inner product in the feature space, \varphi(X_k)^\top \varphi(X_l), for all k, l = 1, \ldots, n, with the corresponding kernel K(X_k, X_l). This result is known as Mercer's condition (Mercer 1909). As a consequence, to fulfill Mercer's condition one requires a positive (semi-)definite kernel function K. LS-SVMs for regression (Suykens, Van Gestel, De Brabanter, De Moor, and Vandewalle 2002b) are related to SVMs (Vapnik 1999), with the inequality constraints replaced by equality constraints and a squared loss employed. Let D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}, where X \in \mathbb{R}^d and Y \in \mathbb{R}, be a given training data set, consider the model class F_{n,\Psi} defined in (1) and let \gamma > 0 be a regularization parameter. Then, LS-SVM for regression is formulated as follows:

\min_{w, b, e} \; \frac{1}{2} w^\top w + \frac{\gamma}{2} \sum_{i=1}^{n} e_i^2 \quad \text{subject to} \quad Y_i = w^\top \varphi(X_i) + b + e_i, \quad i = 1, \ldots, n. \qquad (2)

The squared loss in (2) can be replaced by any other empirical loss.
By using an L2 loss function (and equality constraints) in LS-SVM, the solution is obtained from a linear system instead of the quadratic programming problem required by SVM, which speeds up computations. The drawback is that LS-SVM lacks sparseness and robustness. For specialized literature on other loss functions and their properties, consistency and robustness, we refer the reader to Christmann and Steinwart (2007), Steinwart and Christmann (2008) and Steinwart and Christmann (2011). Suykens et al. (2002b) provide a benchmarking study on LS-SVM.
From Equation 2 it is clear that this model is linear in the feature space. This principle is illustrated in Figure 1. Consider a nonlinear relationship in the input space (Figure 1, left panel). The inputs X are mapped into a high-dimensional space by means of ϕ (Figure 1, right panel), and in this space a linear model is fitted to the transformed data. Since ϕ is in general unknown, problem (2) is solved by using Lagrange multipliers. The Lagrangian is given by

\mathcal{L}(w, b, e; \alpha) = \frac{1}{2} w^\top w + \frac{\gamma}{2} \sum_{i=1}^{n} e_i^2 - \sum_{i=1}^{n} \alpha_i \left( w^\top \varphi(X_i) + b + e_i - Y_i \right),

where \alpha_i \in \mathbb{R} are Lagrange multipliers. Then, the conditions for optimality are given by

\frac{\partial \mathcal{L}}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{n} \alpha_i \varphi(X_i), \quad
\frac{\partial \mathcal{L}}{\partial b} = 0 \Rightarrow \sum_{i=1}^{n} \alpha_i = 0, \quad
\frac{\partial \mathcal{L}}{\partial e_i} = 0 \Rightarrow \alpha_i = \gamma e_i, \quad
\frac{\partial \mathcal{L}}{\partial \alpha_i} = 0 \Rightarrow Y_i = w^\top \varphi(X_i) + b + e_i.

After elimination of w and e, the parameters b and \alpha are estimated from the following linear system:

\begin{pmatrix} 0 & 1_n^\top \\ 1_n & \Omega + \frac{1}{\gamma} I_n \end{pmatrix} \begin{pmatrix} b \\ \alpha \end{pmatrix} = \begin{pmatrix} 0 \\ Y \end{pmatrix}, \qquad (3)

with Y = (Y_1, \ldots, Y_n)^\top, 1_n = (1, \ldots, 1)^\top and \alpha = (\alpha_1, \ldots, \alpha_n)^\top. By using Mercer's condition, \Omega is a positive (semi-)definite matrix whose kl-th element is given by \Omega_{kl} = K(X_k, X_l) = \varphi(X_k)^\top \varphi(X_l). Hence, the kernel function K is a symmetric, continuous positive definite function. Popular choices are the linear, polynomial and radial basis function (RBF) kernels. In this paper we take K(X_i, X_j) = (2\pi)^{-d/2} \exp(-\|X_i - X_j\|_2^2 / 2h^2). The resulting LS-SVM model is given by

\hat{m}(x) = \sum_{i=1}^{n} \hat{\alpha}_i K(X_i, x) + \hat{b}.

Model selection
In practical situations it is often preferable to have a data-driven method to estimate learning parameters. For this selection process, many data-driven procedures have been discussed in the literature. Commonly used are those based on the cross-validation criterion (Burman 1989) (leave-one-out and v-fold), the generalized cross-validation criterion (Craven and Wahba 1979), the Akaike information criterion (Akaike 1973), etc. Several of these criteria are implemented in the toolbox (see the user's manual and the next sections).
Although these model selection criteria assist the user in finding suitable tuning or smoothing parameters (the bandwidth h of the kernel and the regularization parameter γ), finding the minimum of these cost functions tends to be tedious. This is because the cost functions are often non-smooth and may contain multiple local minima. The latter has been theoretically confirmed by Hall and Marron (1991).
A typical method to estimate the smoothing parameters would define a grid over the parameters of interest and apply a model selection method at each of the grid values. However, this approach has three disadvantages (Bennett, Hu, Ji, Kunapuli, and Pang 2006; Kunapuli, Bennett, Hu, and Pang 2008). First, grid search limits the number of tuning parameters that can reasonably be included in a model, due to the combinatorial explosion of grid points. Second, it is practically inefficient: it cannot assure the overall quality of the produced solution. Third, the discretization fails to take into account that the tuning parameters are continuous.
In order to overcome these drawbacks, we have equipped the toolbox with a powerful global optimizer, called coupled simulated annealing (CSA) (de Souza, Suykens, Vandewalle, and Bollé 2010), and a derivative-free simplex search (Nelder and Mead 1965; Lagarias, Reeds, Wright, and Wright 1998). The optimization process is twofold: first, determine good initial starting values by means of CSA and, second, perform a fine-tuning derivative-free simplex search using those results as starting values. In contrast with other global optimization techniques, CSA is fast and can easily escape from local minima. The CSA algorithm, based on coupled multiple starters, is more effective than multi-start gradient descent optimization algorithms. Another advantage of CSA is that it uses the acceptance temperature to control the variance of the acceptance probabilities, with a control scheme that can be applied to an ensemble of optimizers. This improves optimization efficiency because it reduces the sensitivity of the algorithm to the initialization parameters while guiding the optimization process towards quasi-optimal runs. Because of the effectiveness of the combined methods, only a small number of iterations is needed to reach an optimal set of smoothing parameters (the bandwidth h of the kernel and the regularization parameter γ).
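The two-stage strategy can be sketched as follows. This is a conceptual illustration only, not the toolbox's internal code: the cross-validation cost function crossval_cost is a hypothetical placeholder, and the random multi-start loop merely stands in for the coupled annealing scheme of CSA.

```matlab
% Stage 1: global exploration of the smoothing parameters (CSA stand-in:
% random multi-starts). Stage 2: Nelder-Mead simplex refinement.
% Parameters are searched on a log scale, theta = [log(gam); log(h)].
cv_cost = @(theta) crossval_cost(exp(theta));  % hypothetical CV cost function

best = Inf;
for s = 1:10                        % multiple random starters
    theta0 = randn(2, 1);
    f0 = cv_cost(theta0);
    if f0 < best, best = f0; theta_init = theta0; end
end
theta_opt = fminsearch(cv_cost, theta_init);   % derivative-free fine-tuning
gam = exp(theta_opt(1)); h = exp(theta_opt(2));
```

Working on the log scale keeps both parameters positive without explicit constraints, which suits the unconstrained simplex search.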

Standard nonparametric regression
In this section we illustrate how to perform a nonparametric regression analysis with StatLSSVM in MATLAB on the LIDAR data (Holst, Hössjer, Björklund, Ragnarson, and Edner 1996) and on a two-dimensional toy example.
Step-by-step instructions will be given on how to obtain the results. All the data sets used in this paper are included in StatLSSVM.

Univariate smoothing
First, load the LIDAR data into the MATLAB workspace using load('lidar.mat'). After loading the data, one should always start by creating a model structure using the initlssvm command. This model structure contains all the necessary information: the given data (xtrain and ytrain), the data size (nb_data), the dimensionality of the data (x_dim and y_dim) and the chosen kernel function (kernel_type). StatLSSVM currently supports five positive (semi-)definite kernels, i.e., the Gaussian kernel ('gauss_kernel'), the RBF kernel ('RBF_kernel'), the Gaussian additive kernel ('gaussadd_kernel'), a fourth-order kernel based on the Gaussian kernel ('gauss4_kernel') (Jones and Foster 1993) and the linear kernel ('lin_kernel'). Note that we have not yet specified any value for the smoothing parameters, i.e., the bandwidth of the kernel (bandwidth) and the regularization parameter (gam), in the initlssvm command; we initialized these two parameters to the empty field [] in MATLAB. The status element of this structure indicates whether the model has been trained with the current set of smoothing parameters (i.e., whether Equation 3 has been solved); once the model is trained, this field changes from 'changed' to 'trained'. The last element, weights, specifies the weights used in robust regression (see Section 5).
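A minimal initialization sketch is given below; the exact argument order of initlssvm is an assumption here and should be checked against the user's manual.

```matlab
>> load('lidar.mat')     % provides xtrain and ytrain
>> model = initlssvm(xtrain, ytrain, [], [], 'gauss_kernel');
```

The two empty fields [] correspond to the not-yet-specified regularization parameter and kernel bandwidth.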
Any field in the structure can be accessed as model.field_name. For example, to access the regularization parameter in the structure model, one simply uses model.gam. The next step is to tune the smoothing parameters. This is done by invoking tunelssvm; StatLSSVM supports several model selection criteria for standard nonparametric regression, such as leave-one-out cross-validation ('leaveoneout'), generalized cross-validation ('gcrossval') and v-fold cross-validation ('crossval'). We illustrate the code for v-fold cross-validation. By default, 'crossval' uses 10-fold cross-validation and the L2 residual loss function. We do not show the complete output of the optimization process but only the resulting model structure. The fields gam and bandwidth are no longer empty but contain their tuned values according to the 10-fold cross-validation criterion. Note that the field status has changed from 'changed' to 'trained'. The Lagrange multipliers alpha and the bias term b have also been added to the model structure. The last line in the structure denotes the time needed to solve the system of equations (3).
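The tuning and plotting steps for this example might look as follows; this is a sketch, and the call syntax of tunelssvm, trainlssvm and plotlssvm is assumed here rather than quoted from the manual.

```matlab
>> model = tunelssvm(model, 'crossval');   % 10-fold CV with L2 loss by default
>> model = trainlssvm(model);              % solves the linear system (3)
>> plotlssvm(model);                       % plot data and fitted curve
```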

Bivariate smoothing
In this example of bivariate smoothing, the NBA data set (Simonoff 1996) is used (available in StatLSSVM as nba.mat). Since the workflow is exactly the same as in the previous example, we only give the input script and visualize the results. In this example x ∈ R^2 and y ∈ R. The fitted regression surface is an estimate of the mean points scored per minute, conditional on the number of minutes played per game and height in centimeters, for 96 NBA players who played the guard position during the 1992-1993 season. As model selection method we choose leave-one-out cross-validation. The relevant MATLAB commands are
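A sketch of such a script follows; the variable names X and Y provided by nba.mat are assumptions and should be checked against the data file.

```matlab
>> load('nba.mat')                          % assumed to provide X (n x 2) and Y
>> model = initlssvm(X, Y, [], [], 'gauss_kernel');
>> model = tunelssvm(model, 'leaveoneout'); % leave-one-out cross-validation
>> plotlssvm(model);                        % surface plot of the fitted model
```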

Robust nonparametric regression
Regression analysis is an important statistical tool routinely applied in most sciences. However, when least squares techniques are used, one should be aware of the dangers posed by outliers in the data. Not only the response variable can be outlying, but also the explanatory variables, leading to leverage points. Both types of outliers may completely spoil an ordinary least squares analysis. We refer to the books of Hampel, Ronchetti, Rousseeuw, and Stahel (1986), Rousseeuw and Leroy (2003), Jurečková and Picek (2006) and Maronna et al. (2006) for a thorough survey of robustness aspects.
A possible way to robustify (2) is to use an L1 loss function. However, this would lead to a quadratic programming problem, which is more difficult to solve than a linear system. Therefore, we opt for a simple but effective alternative: iterative reweighting (De Brabanter et al. 2009; Debruyne, Christmann, Hubert, and Suykens 2010). This approach solves a weighted least squares problem in each iteration until a stopping criterion is satisfied. StatLSSVM supports four weight functions: Huber, Hampel, logistic and Myriad weights. Table 1 lists these four weight functions.
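The reweighting iteration can be sketched as follows. This is a conceptual illustration, not the toolbox's implementation: weighted_lssvm is a hypothetical solver of the weighted linear system, K denotes the precomputed kernel matrix, and beta is a Huber cutoff constant; only the Huber-type weight is shown.

```matlab
% Iteratively reweighted LS-SVM (conceptual sketch).
v = ones(n, 1);                          % initial weights: ordinary LS-SVM fit
alpha = zeros(n, 1);
for it = 1:maxit
    alpha_old = alpha;
    [alpha, b] = weighted_lssvm(K, Y, gam, v);  % hypothetical weighted solver
    r = Y - (K * alpha + b);             % residuals of the current fit
    s = 1.483 * mad(r, 1);               % robust scale via median absolute deviation
    v = min(1, beta * s ./ abs(r));      % Huber-type weights: downweight large residuals
    if norm(alpha - alpha_old) < 1e-6 * norm(alpha), break; end  % convergence check
end
```

Each pass downweights observations with large standardized residuals, so outliers progressively lose influence on the fit.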
A robust version of (2) is then formulated as follows (Suykens, De Brabanter, Lukas, and Vandewalle 2002a):

\min_{w, b, e} \; \frac{1}{2} w^\top w + \frac{\gamma}{2} \sum_{i=1}^{n} v_i e_i^2 \quad \text{subject to} \quad Y_i = w^\top \varphi(X_i) + b + e_i, \quad i = 1, \ldots, n,

where v_i denotes the weight of the i-th residual. The weights are assigned according to the chosen weight function in Table 1. Again, by using Lagrange multipliers, the solution is given by

\begin{pmatrix} 0 & 1_n^\top \\ 1_n & \Omega + D_\gamma \end{pmatrix} \begin{pmatrix} b \\ \alpha \end{pmatrix} = \begin{pmatrix} 0 \\ Y \end{pmatrix},

with Y = (Y_1, \ldots, Y_n)^\top, 1_n = (1, \ldots, 1)^\top, \alpha = (\alpha_1, \ldots, \alpha_n)^\top and D_\gamma = \mathrm{diag}\{\frac{1}{\gamma v_1}, \ldots, \frac{1}{\gamma v_n}\}.

Suppose we observe the data D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}, but the Y_i are subject to occasional outlying values. An appropriate model is

Y_i = m(X_i) + \varepsilon_i,

for a smooth function m, where the \varepsilon_i come from the gross-error model (Huber 1964) with symmetric contamination. The gross-error model or \epsilon-contamination model is defined as

U(F_0, \epsilon) = \{F : F = (1 - \epsilon) F_0 + \epsilon G\},

where F_0 is some given distribution (the ideal nominal model), G is an arbitrary continuous symmetric distribution and \epsilon is the contamination parameter. This contamination model describes the case where, with large probability 1 − ϵ, the data occur with distribution F_0 and, with small probability ϵ, outliers occur according to distribution G. In our toy example we generate a data set containing 35% outliers, where the distribution F_0 is taken to be the normal distribution with variance 0.01 and G is the standard Cauchy distribution. In order to obtain a fully robust solution, one must also use a robust model selection method (Leung 2005). Therefore, StatLSSVM supports a robust v-fold cross-validation procedure ('rcrossval') based on a robust LS-SVM smoother and a robust loss function, i.e., the L1 loss ('mae') or Huber's loss ('huber') instead of the L2 loss ('mse'). Figure 4 provides an illustration of StatLSSVM fitting via the next script for simulated data with n = 250, ϵ = 0.35, m(X) = sinc(X) and X ∼ U[−5, 5]. Note that this example requires the MATLAB Statistics Toolbox (generation of t-distributed random numbers).
The relevant MATLAB commands are

>> X = -5 + 10 * rand(250, 1);
>> epsilon = 0.35;
>> sel = rand(length(X), 1) > epsilon;
>> Y = sinc(X) + sel .* normrnd(0, 0.1, length(X), 1) + ...
       (1 - sel) .* trnd(1, length(X), 1);

The other weight functions (Table 1) can be called as 'whuber', 'whampel' or 'wlogistic' as the last argument of the command tunelssvm. More information about all functions can be found in the supplements of this paper or in the MATLAB command window via the help function, e.g., help robustlssvm. Further details on the weight functions in Table 1 and the complete robust tuning procedure can be found in De Brabanter (2011).
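The fitting part of the script is not reproduced above; a sketch of how it might look follows. The option syntax of tunelssvm and the choice of the Myriad weights ('wmyriad') are assumptions here.

```matlab
>> model = initlssvm(X, Y, [], [], 'gauss_kernel');
>> model = tunelssvm(model, 'rcrossval', {10, 'mae'}, 'wmyriad');  % robust 10-fold CV, L1 loss
>> model = robustlssvm(model);      % iteratively reweighted LS-SVM fit
>> plotlssvm(model);
```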

Confidence intervals
In this section we consider the following nonparametric regression setting:

Y = m(X) + \sigma(X) \varepsilon,

where m is a smooth function, E[ε|X] = 0, VAR[ε|X] = 1 and X and ε are independent. Two possible situations can occur: (i) σ²(X) = σ² (homoscedastic regression model) and (ii) the variance is a function of the explanatory variable X (heteroscedastic regression model). We do not discuss the case where the variance function is a function of the regression function. Our goal is to determine confidence intervals for m.

Pointwise confidence intervals
Under certain regularity conditions, it can be shown (De Brabanter 2011) that asymptotically

\frac{\hat{m}(x) - m(x) - b(x)}{\sqrt{V(x)}} \xrightarrow{d} N(0, 1),

where b(x) and V(x) are respectively the bias and the variance of \hat{m}(x). With the estimated bias and variance given in De Brabanter et al. (2011a), an approximate 100(1 − α)% pointwise confidence interval for m(x) is

\hat{m}(x) - \hat{b}(x) \pm z_{1-\alpha/2} \sqrt{\hat{V}(x)},

where z_{1−α/2} denotes the (1 − α/2)-th quantile of the standard Gaussian distribution. This approximate confidence interval is valid if the bias and variance estimates are consistent. This in turn requires a different bandwidth to be used in assessing the bias and the variance (Fan and Gijbels 1996), which is done automatically in the StatLSSVM toolbox.

Uniform confidence intervals
In order to make simultaneous (or uniform) statements, we have to widen the intervals to obtain simultaneous confidence intervals (see also multiple comparison theory). Mathematically speaking, given a confidence level α ∈ (0, 1), we are searching for the width of the bands c such that

P\left( \hat{m}(x) - c \sqrt{\hat{V}(x)} \le m(x) \le \hat{m}(x) + c \sqrt{\hat{V}(x)}, \; \forall x \right) \ge 1 - \alpha

for some suitable class of smooth functions F_n. In StatLSSVM the width of the bands c is determined by the volume-of-tube formula (Sun and Loader 1994). We illustrate the concept of simultaneous confidence intervals with two examples: first, a heteroscedastic regression example on simulated data and, second, the LIDAR data set. The resulting confidence intervals are shown in Figure 6.
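A sketch of how confidence intervals might be requested in StatLSSVM is given below. The function name cilssvm, its argument order and the option strings 'pointwise'/'uniform' are assumptions based on the toolbox's naming conventions; consult the user's manual for the exact call.

```matlab
>> load('lidar.mat')
>> model = initlssvm(xtrain, ytrain, [], [], 'gauss_kernel');
>> model = tunelssvm(model, 'crossval');
>> ci = cilssvm(model, 0.05, 'uniform');   % 95% simultaneous intervals (assumed API)
```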

Additive LS-SVM models
Suppose a sample of observations (X_i, Y_i), with X_i ∈ R^d and Y_i ∈ R, is generated from an additive model

Y_i = \sum_{j=1}^{d} m_j(X_i^{(j)}) + \varepsilon_i,

where the error term \varepsilon_i is independent of the X_i^{(j)}, E[\varepsilon_i | X_i] = 0, VAR[\varepsilon_i | X_i] = \sigma^2 < \infty and m_j is a smooth function of the regressor X_i^{(j)}. We consider the model class of functions of the form m(X) = \sum_{j=1}^{d} w_j^\top \varphi_j(X^{(j)}) + b. The optimization problem (2) can be rewritten w.r.t. this new model class as follows (Pelckmans, Goethals, De Brabanter, Suykens, and De Moor 2005):

\min_{w_j, b, e} \; \frac{1}{2} \sum_{j=1}^{d} w_j^\top w_j + \frac{\gamma}{2} \sum_{i=1}^{n} e_i^2 \quad \text{subject to} \quad Y_i = \sum_{j=1}^{d} w_j^\top \varphi_j(X_i^{(j)}) + b + e_i, \quad i = 1, \ldots, n.

As before, by using Lagrange multipliers, the solution is given by the linear system (3), where now \Omega = \sum_{j=1}^{d} \Omega^{(j)} and \Omega^{(j)}_{kl} = K_j(X_k^{(j)}, X_l^{(j)}) for all k, l = 1, \ldots, n (a sum of univariate kernels). The resulting additive LS-SVM model is given by

\hat{m}(x) = \sum_{j=1}^{d} \sum_{i=1}^{n} \hat{\alpha}_i K_j(X_i^{(j)}, x^{(j)}) + \hat{b}.

We illustrate additive LS-SVM models on two examples. First, we construct a classical example as in Hastie and Tibshirani (1990). The data are generated from the following nonlinear regression model with 12 variables:

Y = 10 \sin(\pi X^{(1)}) + 20 (X^{(2)} - 0.5)^2 + 10 X^{(3)} - 5 X^{(4)} + \varepsilon,

where the 12 variables are uniformly distributed on the interval [0, 1], ε ∼ N(0, 1) and n = 300; the remaining eight variables thus have no effect on the response. By using the option 'multiple', StatLSSVM tunes the bandwidth of the kernel for each estimated function; by setting this option to 'single', one bandwidth is found for all estimated functions. The relevant MATLAB commands are

>> X = rand(300, 12);
>> Y = 10 * sin(pi * X(:, 1)) + 20 * (X(:, 2) - 0.5).^2 + ...
       10 * X(:, 3) - 5 * X(:, 4) + randn(300, 1);

Figure 7 shows the fitted functions for the additive LS-SVM model applied to our simulated data set. In general, the scales on the vertical axes are only meaningful in a relative sense; they have no absolute interpretation. Since we have the freedom to choose the vertical positionings, we should try to make them meaningful in the absolute sense. A reasonable solution is to plot, for each predictor, the profile of the response surface with each of the other predictors set at its average (see also Ruppert et al. 2003). This is done automatically by the plotlssvmadd command.
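The model construction and tuning commands for the additive fit are not reproduced above; a sketch follows. The placement of the 'multiple' option in the tunelssvm call is an assumption here and should be checked against the user's manual.

```matlab
>> model = initlssvm(X, Y, [], [], 'gaussadd_kernel');  % Gaussian additive kernel
>> model = tunelssvm(model, 'crossval', [], 'multiple'); % one bandwidth per component (assumed syntax)
>> plotlssvmadd(model);        % one panel per additive component, vertically aligned
```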
As a last example we consider the diabetes data set, also discussed in Hastie and Tibshirani (1990). The data come from a study (Sockett, Daneman, Clarson, and Ehrich 1987) of the factors affecting patterns of insulin-dependent diabetes mellitus in children. The objective is to investigate the dependence of the level of serum C-peptide on various other factors in order to understand the patterns of residual insulin secretion. The response measurement is the logarithm of C-peptide concentration (pmol/ml) at diagnosis, and the predictor measurements are age and base deficit (a measure of acidity). The result is shown in Figure 8, using the vertical alignment procedure discussed above. It can be seen that both effects appear to be nonlinear: the variable age has an increasing effect that levels off, and the variable basedef appears quadratic.

Regression with correlated errors
In this section we consider the nonparametric regression model

Y_i = m(X_i) + \varepsilon_i,

where E[ε|X] = 0, VAR[ε|X] = σ² < ∞, the error term ε_i is a covariance-stationary process with E[\varepsilon_i \varepsilon_{i+k}] = \gamma_k, \gamma_k \sim k^{-a} for some a > 2, and m is a smooth function. The presence of correlation between the errors, if ignored, causes a breakdown of commonly used automatic tuning parameter selection methods such as cross-validation or plug-in (Opsomer et al. 2001; De Brabanter et al. 2011b). Data-driven bandwidth selectors tend to be "fooled" by the autocorrelation, interpreting it as part of the regression relationship and variance function. The cyclical pattern produced by positively correlated errors is viewed as a high-frequency regression relationship with small variance, so the bandwidth is set small enough to track the cycles, resulting in an undersmoothed fitted regression curve. The alternating pattern above and below the true underlying function produced by negatively correlated errors is interpreted as high variance, so the bandwidth is set large enough to smooth over the variability, producing an oversmoothed fitted regression curve.
The model selection method is based on leave-(2l + 1)-out cross-validation (Chu and Marron 1991). To tune the parameter l, a two-step procedure is used. First, a Nadaraya-Watson smoother with a bimodal kernel is used to fit the data. De Brabanter et al. (2011b) have shown that a bimodal kernel satisfying K(0) = 0 automatically removes correlation structure without requiring any prior knowledge about its structure. Hence, the obtained residuals are good estimates of the errors. Second, the k-th lag sample autocorrelation can be used to find a suitable value for l. More theoretical background about this method can be found in De Brabanter et al. (2011b).
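The idea behind leave-(2l + 1)-out cross-validation can be sketched as follows. This is a conceptual illustration with a hypothetical helper fit_without (returning a prediction function trained on the given indices) and fixed smoothing parameters theta; it is not the toolbox's implementation.

```matlab
% Leave-(2l+1)-out CV score: when predicting at point i, the 2l+1
% observations i-l, ..., i+l are removed from the training set, so that
% correlated neighbours cannot artificially improve the fit at X(i).
cv = 0;
for i = 1:n
    idx = setdiff(1:n, max(1, i-l):min(n, i+l));   % training indices
    mhat = fit_without(X, Y, idx, theta);          % hypothetical LS-SVM fit
    cv = cv + (Y(i) - mhat(X(i)))^2;               % squared prediction error
end
cv = cv / n;
```

With l = 0 this reduces to ordinary leave-one-out cross-validation; larger l widens the gap around each held-out point to neutralize short-range error correlation.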
Consider the beluga and US birth rate data sets (Simonoff 1996). We compare the leave-(2l + 1)-out cross-validation method with classical leave-one-out cross-validation (see Figure 9). It is clear from both results that the existence of autocorrelation can seriously affect the regression fit. Ignoring the effects of correlation causes the nonparametric regression smoother to interpolate the data. This is especially visible in the US birth rate data set: without accounting for autocorrelation, there is no clear trend visible in the regression fit. Using the model selection method described above, the regression fit shows a clear pattern: the birth rate dropped after the US entered the second world war following the attack on Pearl Harbor (December 1941), kept decreasing during the course of the war, and increased again after the war in Europe and the Pacific ended (mid-September 1945).
The relevant MATLAB commands for the beluga data set are given below. Model selection accounting for correlation is performed first, using 'crossval2lp1'; classical leave-one-out cross-validation second, using 'leaveoneout'. In Figure 9 (beluga data set in the left panel, US birth rate data set in the right panel), the green line represents the estimate with tuning parameters determined by classical leave-one-out cross-validation and the red line the estimate based on the procedure described above.
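A sketch of the beluga script follows; the variable names provided by beluga.mat are assumptions and should be checked against the data file.

```matlab
>> load('beluga.mat')                        % assumed to provide X (time) and Y
>> model = initlssvm(X, Y, [], [], 'gauss_kernel');
>> model = tunelssvm(model, 'crossval2lp1'); % leave-(2l+1)-out CV; l chosen automatically
>> model = trainlssvm(model);
>> plotlssvm(model);
```

Rerunning the tuning step with 'leaveoneout' instead of 'crossval2lp1' reproduces the classical, correlation-ignoring fit for comparison.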

Conclusions
We have demonstrated that several nonparametric regression problems can be handled with StatLSSVM. This MATLAB-based toolbox can manage standard nonparametric regression, regression with autocorrelated errors, robust regression, pointwise/uniform confidence intervals and additive models with a few simple lines of code. Currently the toolbox is supported for MATLAB R2009b or higher.