Kernel-Based Regularized Least Squares in R (KRLS) and Stata (krls)

The Stata package krls as well as the R package KRLS implement kernel-based regularized least squares (KRLS), a machine learning method described in Hainmueller and Hazlett (2014) that allows users to tackle regression and classification problems without strong functional form assumptions or a specification search. The flexible KRLS estimator learns the functional form from the data, thereby protecting inferences against misspecification bias. Yet it nevertheless allows for interpretability and inference in ways similar to ordinary regression models. In particular, KRLS provides closed-form estimates for the predicted values, variances, and the pointwise partial derivatives that characterize the marginal effects of each independent variable at each data point in the covariate space. The method is thus a convenient and powerful alternative to ordinary least squares and other generalized linear models for regression-based analyses.


Overview
Generalized linear models (GLMs) remain the workhorse modeling technology for most regression and classification problems in social science research. GLMs are relatively easy to use and interpret, and allow a variety of outcome variable types with different assumed conditional distributions. However, by using the data in a linear way within the appropriate link function, all GLMs impose stringent functional form assumptions that are often inaccurate for social science data. For example, linear regression typically requires that the marginal effect of each covariate is constant across the covariate space. Similarly, logistic regression assumes that the log-odds (that the outcome equals one) are linear in the covariates. Such constant marginal effect assumptions can be dubious in the social world, where marginal effects are often expected to be heterogeneous across units and levels of other covariates.
It is well-known that misspecification of models leads not only to an invalid estimate of how well the covariates explain the outcome variable, but may also lead to incorrect inferences about the effects of each covariate (see e.g., Larson and Bancroft 1963; Ramsey 1969; White 1981; Härdle 1994; Sekhon 2009). In fact, for parametric models, leaving out an important function of an observed covariate can result in the same type of omitted variable bias as failing to include an important unobserved confounding variable. The conventional approach to dealing with this risk is for the user to attempt to add additional terms (e.g., a squared term, interaction, etc.) that can account for specific forms of interactions and nonlinearities. However, "guessing" the correct functional form is often difficult. Moreover, including these higher-order terms can actually worsen the problem and lead investigators to make incorrect inferences due to misspecification (see Hainmueller and Hazlett 2014). In addition, results may be highly model dependent, with slight modifications to the functional form changing estimates radically (e.g., King and Zeng 2006; Ho, Imai, King, and Stuart 2007).
Presumably, social scientists are aware of these problems but commonly resort to GLMs because they lack convenient alternatives that would allow them to easily relax the functional form assumptions while maintaining a high degree of interpretability. While more flexible methods, such as neural networks (e.g., Beck, King, and Zeng 2000a) or generalized additive models (GAMs, e.g., Hastie and Tibshirani 1990; Beck and Jackman 1998; Wood 2004) have occasionally been proposed, they have not received widespread usage by social scientists, most likely because they lack the ease of use and interpretation that GLMs afford. This paper introduces a Stata (StataCorp. 2015) package called krls which implements kernel-based regularized least squares (KRLS), a machine learning method described in Hainmueller and Hazlett (2014) that allows users to tackle regression and classification problems without manual specification search and strong functional form assumptions. To our knowledge, Stata currently offers no packaged routines to implement machine learning methods like KRLS.¹ One important contribution of this article therefore is to close this gap by providing Stata users with a routine to implement the KRLS method and thus to benefit from advances in machine learning. In addition, we also provide a package called KRLS that implements the same methods in R (R Core Team 2017). While the focus of this article is on the Stata package, below we also briefly discuss the R version and provide companion replication code that implements all examples in both Stata and R.
KRLS was designed to allow investigators to move beyond GLMs for classification and regression problems, while retaining their ease-of-use and interpretability. The KRLS estimator operates in a much larger space of possible functions based on the idea that observations with similar covariate values are expected to have similar outcomes on average.² Furthermore, KRLS employs regularization which amounts to a prior preference for smoother functions over erratic ones. This allows KRLS to minimize over-fitting, reducing the variance and fragility of estimates, and diminishing the influence of "bad leverage" points. As explained in Hainmueller and Hazlett (2014), the regularization also helps to recover efficiency so that KRLS is typically not much less efficient than ordinary least squares (OLS) even if the data are truly linear. KRLS applies most naturally to continuous outcomes, but also works well with binary outcomes. The method has been shown to have comparable or superior performance to many other machine learning approaches for both (continuous) regression and (binary) classification tasks, such as k-nearest neighbors, support vector machines, neural networks, and generalized additive models (Rifkin, Yeo, and Poggio 2003; Zhang and Peng 2004; Hainmueller and Hazlett 2014).
¹ One exception is the gam command by Royston and Ambler (1998), which provides a Stata interface to a version of the Fortran program gamfit for the GAM model written by Trevor Hastie and Robert Tibshirani (Hastie and Tibshirani 1990).
² This notion that similar observations should have similar outcomes is also a motivation for methods such as smoothers and k-nearest neighbors models. However, while those other methods are "local" and thus susceptible to the curse of dimensionality, KRLS retains the characteristics of a "global" estimator, i.e., the estimate at a given point may depend to some degree on any other observation in the dataset. Accordingly, it is more resistant to the curse of dimensionality and can be used in data with hundreds or even thousands of dimensions.
Central to its usability, the KRLS approach produces interpretable results similar to the traditional output of GLMs, while allowing richer interpretations if desired. In addition, it allows closed-form solutions for many quantities of interest. Finally, as shown in Hainmueller and Hazlett (2014), the KRLS estimator has desirable statistical properties, including unbiasedness, consistency, and asymptotic normality under mild regularity conditions. Given its combination of flexibility and interpretability, KRLS can be used for a wide variety of modeling tasks. It is suitable for modeling problems whenever the correct functional form is not known, including exploratory analysis, model-based causal inference, prediction problems, propensity score estimation, or other regression or classification problems.
The krls package is distributed through the Statistical Software Components (SSC) archive provided at http://ideas.repec.org/c/boc/bocode/s457704.html. The key command in the krls package is krls, which functions much like Stata's reg command and fits a KRLS model where the outcome variable is regressed on a set of covariates. Following this model fit, a second function, predict, can be used to predict fitted values, residuals, and other quantities just like with other Stata estimation commands. We illustrate the use of this function with example data originally used in Beck, Levine, and Loayza (2000b). This data file, growthdata.dta, "ships" with the krls package.

Understanding kernel-based regularized least squares
The approach underlying KRLS has been well established in machine learning since the 1990s under a host of names including regularized least squares (e.g., Rifkin et al. 2003), regularization networks (e.g., Evgeniou, Pontil, and Poggio 2000), and kernel ridge regression (e.g., Saunders, Gammerman, and Vovk 1998; Cawley and Talbot 2002). Hainmueller and Hazlett (2014) provide a detailed explanation of the KRLS methodology and establish its statistical properties together with simulations and real-data examples. Here we focus on how users can implement this approach through the krls package. We thus provide only a brief review of the theoretical background.
We first set notation and key definitions. Assume that we draw i.i.d. data of the form $(y_i, x_i)$, where $i = 1, \dots, N$ indexes the observations, $y_i \in \mathbb{R}$ is the outcome of interest, and $x_i \in \mathbb{R}^D$ is a $1 \times D$ real-valued vector, taken to be our vector of covariate values. For our purposes, a kernel is defined as a (symmetric and positive semi-definite) function of two input patterns, $k(x_i, x_j)$, mapping onto a real-valued output. For our purpose, kernel functions can be treated as providing a measure of similarity between the covariate vectors of two observations. Here we use the Gaussian kernel, defined as
$$k(x_j, x_i) = \exp\left(-\frac{\|x_j - x_i\|^2}{\sigma^2}\right), \tag{1}$$
where $\|x_j - x_i\|$ is the Euclidean distance between the covariate vectors $x_j$ and $x_i$ and $\sigma^2 \in \mathbb{R}^+$ is the bandwidth of the kernel function. This kernel function evaluates to its maximum value of one only when the covariate vectors $x_j$ and $x_i$ are identical, and approaches zero as $x_j$ and $x_i$ grow far apart.
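As a small worked illustration, the Gaussian kernel $k(x_j, x_i) = \exp(-\|x_j - x_i\|^2 / \sigma^2)$ can be evaluated by hand. The two covariate vectors and the bandwidth below are made-up values chosen only for the arithmetic; they are not taken from the paper or its data.

```latex
\begin{align*}
x_i &= (0, 0), \qquad x_j = (1, 1), \qquad \sigma^2 = 2, \\
\|x_j - x_i\|^2 &= (1 - 0)^2 + (1 - 0)^2 = 2, \\
k(x_j, x_i) &= \exp\left(-\tfrac{2}{2}\right) = e^{-1} \approx 0.368, \\
k(x_i, x_i) &= \exp(0) = 1 .
\end{align*}
```

Identical vectors thus receive the maximal similarity of one, while vectors one unit apart in each of two dimensions receive a similarity of roughly 0.37 under this bandwidth.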
As examined in Hainmueller and Hazlett (2014), KRLS can be understood through several perspectives. Here we limit discussion to the viewpoint we believe is most valuable for those without prior experience in kernel methods, the "similarity-based view" in which the KRLS method can be thought of in two stages. First, it fits functions using kernels, based on the presumption that there is useful information embedded in how similar a given observation is to other observations in the dataset. Second, it utilizes regularization, which gives preference to simpler functions. We describe both stages below.

Fitting with kernels
We begin by assuming that the target function $y = f(x)$ can be well approximated by some function in the space of functions represented by
$$f(x) = \sum_{i=1}^{N} c_i \, k(x, x_i), \tag{2}$$
where $k(x, x_i)$ measures the similarity between our point of interest ($x$) and one of $N$ covariate vectors $x_i$, and $c_i$ is a weight for each covariate vector. Functions of this type leverage information about the similarity between observations. Imagine we have some test point $x^*$ at which we would like to evaluate the function value, and suppose that the covariate vectors $x_i$ and their weights $c_i$ have all been fixed. For such a test point, the predicted value is given by
$$y^* = f(x^*) = c_1 k(x^*, x_1) + c_2 k(x^*, x_2) + \dots + c_N k(x^*, x_N).$$
Since $k(x^*, x_j)$ is a measure of the similarity between $x^*$ and $x_j$, we see that the value of $k(x^*, x_j)$ will grow larger as we move the test point $x^*$ closer to $x_j$. In other words, the predicted outcome at the test point is given by a weighted sum of how similar the test point is to each observation in the (training) dataset. The equation can thus be thought of as
$$y^* = c_1 (\text{similarity of } x^* \text{ to } x_1) + c_2 (\text{similarity of } x^* \text{ to } x_2) + \dots + c_N (\text{similarity of } x^* \text{ to } x_N).$$
Introducing a matrix notation helps to illustrate the underlying operations. Let matrix $K$ be the $N \times N$ symmetric kernel matrix whose $(j, i)$th entry is $k(x_j, x_i)$; it measures the pairwise similarities between each of the $N$ covariate vectors $x_i$. Let $c = [c_1, \dots, c_N]^\top$ be the $N \times 1$ vector of choice coefficients and $y = [y_1, \dots, y_N]^\top$ be the $N \times 1$ vector of outcome values. Equation 2 as applied to each observed $x$ in the observed data or training set can then be rewritten in vector form as
$$y = Kc = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_N) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_N, x_1) & k(x_N, x_2) & \cdots & k(x_N, x_N) \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_N \end{bmatrix}.$$
In this form we see KRLS as a linear system in which we estimate $y^*$ for any $x^*$ as a linear combination of basis functions, each of which is a measure of $x^*$'s similarity to other observations in the (training) dataset.

Regularization
While this approach reexpresses the data in terms of new basis functions, it effectively solves for $N$ parameters using $N$ observations. A perfect fit could be sought by choosing $\hat{c} = K^{-1} y$, but even when $K$ is invertible, such a fit would be highly unstable and lacking in generalizability. To make use of the information in the columns of $K$, we impose an additional assumption: that we prefer smoother, less complicated functions. We thus employ Tikhonov regularization (Tychonoff 1963), solving an optimization problem over both empirical fit and model complexity by choosing
$$\underset{f \in H}{\operatorname{argmin}} \; \sum_{i=1}^{N} V(y_i, f(x_i)) + \lambda R(f),$$
where $V(y_i, f(x_i))$ is a loss function that computes how "wrong" the function is at each observation, $H$ is a hypothesis space of possible functions, $R$ is a "regularizer" measuring the "complexity" of function $f$, and $\lambda \in \mathbb{R}^+$ is a parameter that determines the tradeoff between model fit and complexity. Larger values of $\lambda$ result in a larger penalty for the complexity of the function, thus placing a higher premium on model parsimony; lower values of $\lambda$ will have the opposite effect of placing a higher premium on model fit.
For KRLS, we choose $V$ to be squared loss, and we choose the regularizer $R$ to be the square of the $L_2$ norm, $\langle f, f \rangle_H = \|f\|_K^2$. For the Gaussian kernel, this choice of norm imposes an increasingly high penalty on "wiggly" or higher-frequency components of $f$. Moreover, this norm can be computed as
$$\|f\|_K^2 = \sum_{i=1}^{N} \sum_{j=1}^{N} c_i c_j k(x_i, x_j) = c^\top K c$$
(Schölkopf and Smola 2002). Finally, the hypothesis space $H$ is the space of functions described above, $y = Kc$. The resulting Tikhonov problem is
$$c^* = \underset{c \in \mathbb{R}^N}{\operatorname{argmin}} \; (y - Kc)^\top (y - Kc) + \lambda c^\top K c. \tag{5}$$
Accordingly, $y^* = K c^*$ provides the best fitting approximation. For a fixed choice of $\lambda$, since this fit is a least-squares fit, it can be interpreted as providing the best approximation to the conditional expectation function, $E[y \mid X, \lambda]$. Notice that this minimization is almost equivalent to a ridge regression in a new set of features, one which measures the similarity of a covariate vector to each of the other covariate vectors. Finally, we can solve for the solution by differentiating the objective function with respect to the choice coefficients $c$ and solving the resulting first order conditions, arriving at the closed-form solution
$$c^* = (K + \lambda I)^{-1} y. \tag{6}$$
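The step from the penalized least-squares objective to the closed-form solution for the choice coefficients can be spelled out as follows. This is a sketch of the standard argument rather than a derivation quoted from the paper; the name $J(c)$ for the objective is introduced here only for illustration, and the argument uses the symmetry of $K$.

```latex
\begin{align*}
J(c) &= (y - Kc)^\top (y - Kc) + \lambda\, c^\top K c, \\
\frac{\partial J}{\partial c} &= -2 K^\top (y - Kc) + 2 \lambda K c
  = -2 K \left[\, y - (K + \lambda I)\, c \,\right] = 0, \\
&\text{which is satisfied by } (K + \lambda I)\, c = y, \\
c^{*} &= (K + \lambda I)^{-1} y .
\end{align*}
```

Because $K$ is positive semi-definite, $K + \lambda I$ is positive definite for any $\lambda > 0$, so the inverse in the last line exists.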

Numerical implementation
One key advantage of KRLS is that we have a closed-form solution for the estimator of the choice coefficients that provides the solution to the Tikhonov regularization problem within our flexible space of functions. This estimator, as described in Equation 6, is numerically attractive. We need to build the kernel matrix $K$ by computing all pairwise distances and then add $\lambda$ to the diagonal. The resulting matrix is symmetric, positive definite, and well-conditioned (for large enough $\lambda$), so inverting it is straightforward. The only caveat here is that creating the $N \times N$ kernel matrix can be memory intensive in very large datasets.

Data processing and choice of parameters
Before examining the choice of $\lambda$ and $\sigma^2$, it is important to note that krls always standardizes variables prior to analysis by subtracting off the sample means and dividing by the sample standard deviations.
First, we must choose the regularization parameter $\lambda$. The default in the krls function is to use a standard cross-validation technique, choosing the value of $\lambda$ that minimizes the sum of the squared leave-one-out errors. In other words, we find the $\lambda$ that optimizes how well a model that is fitted on all but one observation predicts the left-out observation. For any choice of $\lambda$, $N$ different leave-one-out predictions can be made. The sum of squared errors over these gives the leave-one-out error (LOOE). One nice numerical feature of this approach is that the $N$ leave-one-out errors do not require refitting the model $N$ times: for any valid choice of $\lambda$ they are given by the elementwise division $c / \operatorname{diag}(G^{-1})$, where $G = K + \lambda I$ (see Rifkin and Lippert 2007), which takes only $O(N)$ time once $c$ and $G^{-1}$ are available. Notice that the krls function also provides the lambda() option, which users can use to supply a desired value of $\lambda$; this feature can be used to implement more complicated approaches if needed.
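The leave-one-out shortcut can be written out per observation. The block below states the standard identity for regularized least squares discussed in Rifkin and Lippert (2007); the notation $\hat{f}^{(-i)}$ for the fit obtained without observation $i$ is introduced here for illustration and does not appear in the paper.

```latex
\begin{align*}
G &= K + \lambda I, \qquad c = G^{-1} y, \\
y_i - \hat{f}^{(-i)}(x_i) &= \frac{c_i}{\left(G^{-1}\right)_{ii}}, \qquad i = 1, \dots, N, \\
\text{sum of squared leave-one-out errors} &= \sum_{i=1}^{N} \left( \frac{c_i}{\left(G^{-1}\right)_{ii}} \right)^{2}.
\end{align*}
```

A search over $\lambda$ then only needs $c$ and the diagonal of $G^{-1}$ for each candidate value, rather than $N$ separate model fits.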
Second, we also must choose the kernel bandwidth $\sigma^2$. In the context of KRLS this is principally a measurement decision incorporated into the kernel definition that governs how distant two covariate vectors $x_i$ and $x_j$ can be from each other and still be considered relatively similar.¹⁰ Accordingly, for KRLS our objective is to choose $\sigma^2$ such that the columns of $K$ extract useful information from $X$. A reasonable requirement for social science data is that at least some observations can be considered similar to each other, some are different from each other, and many fall in-between. As explained in Hainmueller and Hazlett (2014), a reliable choice to satisfy this prior is to set $\sigma^2 = D$, where $D = \dim(X)$. A theoretical justification for this default choice is that for standardized data the average squared (Euclidean) distance between two observations that enters into the kernel calculation, $E[\|x_j - x_i\|^2]$, is equal to $2D$. The choice of $\sigma^2 = D$ typically produces a reasonable empirical distribution of the values in $K$. The krls command also provides a sigma() option that allows users to supply their own value for $\sigma^2$ if needed.
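The value $2D$ for the expected squared distance follows from a short calculation. The sketch below assumes that observations $i \neq j$ are independent draws and that every covariate has been standardized to mean zero and variance one; the final line simply plugs the average squared distance into the Gaussian kernel and is not the expected value of the kernel itself.

```latex
\begin{align*}
E\left[ \left( x_j^{(d)} - x_i^{(d)} \right)^2 \right]
  &= \operatorname{Var}\left(x_j^{(d)}\right) + \operatorname{Var}\left(x_i^{(d)}\right) = 1 + 1 = 2, \\
E\left[ \| x_j - x_i \|^2 \right]
  &= \sum_{d=1}^{D} E\left[ \left( x_j^{(d)} - x_i^{(d)} \right)^2 \right] = 2D, \\
\exp\left( -\frac{2D}{\sigma^2} \right) \Big|_{\sigma^2 = D}
  &= e^{-2} \approx 0.135 .
\end{align*}
```

With $\sigma^2 = D$, a pair of observations at the average squared distance therefore has a kernel value well inside the interval between zero and one, regardless of the number of covariates.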

Interpretation and quantities of interest
One important benefit of KRLS over many other flexible modeling approaches is that the fitted KRLS model lends itself to a range of interpretational tools. Below we briefly discuss the quantities of interest that users may wish to extract and make inferences about from fitted models.

Estimating E[y|X] and first differences
KRLS provides an estimate of the conditional expectation function that describes how the average of $y$ varies across levels of $X = x$. This allows the routine to produce fitted values or out-of-sample predictions. Other quantities of interest such as first differences can also be computed. For example, to estimate the average treatment effect of a binary variable, $W$, we can simply create two datasets that are identical to the original $X$, but in the first set $W$ to one for all observations and in the second set $W$ to zero. We can then compute the first difference using
$$\frac{1}{N} \sum_{i=1}^{N} \left[ \hat{y}_i \mid W = 1, X_i \right] - \frac{1}{N} \sum_{i=1}^{N} \left[ \hat{y}_i \mid W = 0, X_i \right]$$
as our estimate of the average marginal effect. Of course, the covariates can be set to other values such as the sample means, medians, etc. The krls command automatically computes and reports average first differences of this type when covariates are binary, with closed-form estimates of standard errors.

Partial derivatives
KRLS also provides a closed-form estimator for the pointwise partial derivatives of $y$ with respect to any particular covariate. Let $x^{(d)}$ be a particular variable, such that $X = [x^{(1)} \dots x^{(d)} \dots x^{(D)}]$. Then for a single observation, $j$, the partial derivative of $y$ with respect to variable $d$ can be estimated by
$$\widehat{\frac{\partial y}{\partial x_j^{(d)}}} = \frac{-2}{\sigma^2} \sum_{i=1}^{N} c_i \exp\left( -\frac{\|x_i - x_j\|^2}{\sigma^2} \right) \left( x_j^{(d)} - x_i^{(d)} \right).$$
Estimating the partial derivatives allows researchers to explore the pointwise marginal effects of each covariate and to summarize them as desired. By default, krls computes the sample-average partial derivative of $y$ with respect to $x^{(d)}$ at each point in the observed dataset,
$$E_N\left[ \widehat{\frac{\partial y}{\partial x_j^{(d)}}} \right] = \frac{-2}{\sigma^2 N} \sum_{j=1}^{N} \sum_{i=1}^{N} c_i \exp\left( -\frac{\|x_i - x_j\|^2}{\sigma^2} \right) \left( x_j^{(d)} - x_i^{(d)} \right).$$
These average marginal effects are reported in an output table that may be interpreted in a manner similar to a regression table produced by reg or other GLM commands. These are convenient to examine as they are somewhat analogous to the β coefficients in a linear model. However, it is important to remember that the underlying KRLS model now also captures non-linear relationships, and the sample average pointwise marginal effects provide only a summary. For example, a covariate could have a positive marginal effect in one area of the covariate space and a negative effect in another, while the average marginal effect is near zero. To this end, KRLS allows for interpretation beyond these average values. In particular, krls provides users with the means to directly assess marginal effect heterogeneity and interpret interactions, as we explain in the empirical illustrations below.
¹⁰ Note that this differs from the role of the kernel bandwidth in traditional kernel regression or kernel density estimation where the bandwidth is typically the only smoothing parameter used for fitting. In KRLS the kernel is simply used to form $K$ and then fitting occurs through the choice of $c$ and a complexity penalty that is governed by $\lambda$. The resulting fit is thus expected to be less dependent on the exact choice of $\sigma^2$ than for those kernel methods where the bandwidth is the only parameter. Moreover, since there is a tradeoff between $\sigma^2$ and $\lambda$ (increasing either can increase smoothness), a range of $\sigma^2$ values is typically acceptable and leads to similar fits after optimizing over $\lambda$.
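The closed form of the pointwise derivative follows from applying the chain rule to the fitted function with a Gaussian kernel. The block below is a sketch of that step, using the kernel $k(x, x_i) = \exp(-\|x - x_i\|^2 / \sigma^2)$ and the fitted function $\hat{f}(x) = \sum_i c_i k(x, x_i)$; it is not quoted from the paper.

```latex
\begin{align*}
\hat{f}(x) &= \sum_{i=1}^{N} c_i \exp\left( -\frac{\|x - x_i\|^2}{\sigma^2} \right), \\
\frac{\partial}{\partial x^{(d)}} \|x - x_i\|^2 &= 2 \left( x^{(d)} - x_i^{(d)} \right), \\
\frac{\partial \hat{f}(x)}{\partial x^{(d)}}
  &= \frac{-2}{\sigma^2} \sum_{i=1}^{N} c_i \exp\left( -\frac{\|x - x_i\|^2}{\sigma^2} \right) \left( x^{(d)} - x_i^{(d)} \right).
\end{align*}
```

Evaluating the last line at $x = x_j$ gives the pointwise marginal effect for observation $j$, and averaging over $j = 1, \dots, N$ gives the sample-average marginal effect.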

Implementing kernel-based regularized least squares
In this section we describe how users can utilize kernel-based regularized least squares with the krls package.

Installation
krls can be installed from the Statistical Software Components (SSC) archive by typing ssc install krls, all replace on the Stata command line.A dataset associated with the package, growthdata.dta, will be downloaded to the default Stata folder when the option all is specified.
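The installation step can be followed by a quick check that the example data loaded correctly. This is a minimal sketch that assumes growthdata.dta ended up in the current working directory; if it was placed elsewhere, the path in the use command needs to be adjusted.

```stata
* Install krls together with its ancillary files (the "all" option)
ssc install krls, all replace

* Load the example dataset (assumed to be in the current working directory)
use growthdata.dta, clear

* Inspect the variables used in the illustrations
describe
summarize growth rgdp60 tradeshare yearsschool assassinations
```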

Basic syntax
The main command in the package is the krls command that fits the KRLS model. The basic syntax of the krls command follows the standard Stata command form
krls depvar indepvars [if] [in] [, options]
A dependent variable and at least one independent variable are required. Both the dependent and independent variables may be either continuous or binary. The if and in qualifiers can be used to restrict the estimation sample to subsets of the full dataset in memory.
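A few calls illustrate this command form with the example data. This is a hedged sketch: the sample restriction and the numeric values passed to lambda() and sigma() are arbitrary choices made here for illustration, not recommended settings.

```stata
use growthdata.dta, clear

* Default fit: lambda chosen by leave-one-out cross-validation,
* sigma set to the number of covariates
krls growth yearsschool assassinations

* Restrict the estimation sample with an if qualifier (arbitrary cutoff)
krls growth yearsschool assassinations if yearsschool > 1

* Supply user-chosen values for lambda and sigma (arbitrary values)
krls growth yearsschool assassinations, lambda(0.5) sigma(2)
```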

Data
We illustrate the use of krls with the growthdata.dta dataset (Beck et al. 2000b) that contains average GDP growth rates over 1960-1995 for 65 countries and various other covariates that are potentially related to growth. For each country the dataset measures the following variables:
• country_name: Name of the country.
• growth: Average annual percentage growth of real gross domestic product (GDP) from 1960 to 1995.
• rgdp60: The value of GDP per capita in 1960 (converted to 1960 US dollars).
• tradeshare: The average share of trade in the economy from 1960 to 1995, measured as the sum of exports plus imports, divided by GDP.
• yearsschool: Average number of years of schooling of adult residents in that country in 1960.
• assassinations: Average annual number of political assassinations in that country from 1960 to 1995 (per million population).
In comparison to the OLS results, the KRLS results also suggest a statistically significant relationship between growth rates and schooling, but the average marginal effect estimate is somewhat bigger and suggests that a one year increase in schooling is associated with a 0.34 percentage point increase in growth rates on average. Moreover, we find that the R² from KRLS is about three times higher and schooling now accounts for about 32% of the variation in growth rates.
Further investigation reveals that this improved model fit results because the relationship between growth and schooling is not well characterized by a simple linear relationship as implied by the OLS model above. Instead, the relationship is highly non-linear and the KRLS fit accurately learns the shape of this conditional expectation function from the data.
To observe this we can use the predict function to obtain fitted values from the KRLS model. The predict function works much as the predict function for post-model estimation in Stata, producing fitted values by default. Other options include se and residuals to calculate standard errors of predicted values or residuals respectively.
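The three uses of predict mentioned above can be sketched as follows. The variable names ending in _demo are hypothetical and chosen so that they do not clash with the variable names used elsewhere in the text.

```stata
use growthdata.dta, clear
krls growth yearsschool

* Fitted values (the default)
predict fit_demo

* Standard errors of the fitted values
predict se_demo, se

* Residuals
predict res_demo, residuals

summarize fit_demo se_demo res_demo
```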
predict Yhat_KRLS
Now we plot the fitted values to compare the model fits from the regression and the KRLS model. We also add to the plot the fitted values from a more flexible OLS model, Yhat_OLS2, that includes as predictors a third order polynomial of schooling.
twoway (scatter growth yearsschool, sort) ///
    (line Yhat_KRLS yearsschool, sort) ///
    (line Yhat_OLS yearsschool, sort) ///
    (line Yhat_OLS2 yearsschool, sort lpattern(dash)), ///
    ytitle("GDP growth rate (%)") ///
    legend(order(2 "KRLS fitted values" 3 "OLS fitted values" ///
    4 "OLS polynomial fitted values"))
Figure 1 reveals the results. The simple OLS fit (green solid line) fails to capture the nonlinear relationship; it over-estimates the growth rate at low and high values of schooling and underestimates the growth rate at medium values of schooling. In contrast, the KRLS model (solid red line) accurately learns the non-linear relationship from the data and attains an improved model fit that is very similar to the flexible OLS model with the third order polynomial (red dashed line). In fact, in the flexible OLS model the three polynomial coefficients are highly jointly significant (p value < 0.0001) and the new R², at 0.31, is close to that of the KRLS model (0.32).
Notice that in this simple bivariate example, the misspecification can be easily corrected by making the regression model more flexible with a third-order polynomial. However, applying such diagnostics and finding the correct functional form by trial and error becomes inconvenient, if not infeasible, as more covariates are included in the model. KRLS eliminates the need for such a specification search.

Pointwise partial derivatives
An additional advantage of KRLS is that it provides closed-form estimates of the pointwise derivatives that characterize the marginal effect of each covariate at each data point in the covariate space. To illustrate this with multivariate data, we fit a slightly more complex regression in which growth rates are regressed on schooling and the average number of political assassinations in a country.

Linear regression
With this OLS model we find that one additional year of schooling is associated with a 0.24 increase in the growth rate. However, this model assumes that this marginal effect of schooling is constant across the covariate space. To probe this assumption, we can generate a component-plus-residual (CR) plot to visualize the relationship between growth and schooling, controlling for the linear component of the assassinations variable. The results are shown in Figure 2. As in the first example, the regression is clearly misspecified; as indicated by the lowess line, the conditional relationship is nonlinear.

cprplot yearsschool, lowess
In contrast to OLS, KRLS does not impose a constant marginal effect assumption. Instead, it directly obtains estimates of the response surface that characterizes how average growth varies with schooling and assassinations, along with closed-form estimates of the pointwise marginal derivatives that characterize the marginal effects of each covariate at each data point.
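The pointwise derivatives examined next have to be stored as variables when the model is fit. The sketch below assumes that krls offers a deriv() option that saves the derivatives under a user-supplied prefix; this is inferred from the variable name d_yearsschool used in the text and should be checked against help krls before use.

```stata
use growthdata.dta, clear

* Assumption: deriv(d) stores the pointwise partial derivatives as new
* variables named d_<covariate>, e.g., d_yearsschool
krls growth yearsschool assassinations, deriv(d)

* Summarize the distribution of the pointwise marginal effects of schooling
summarize d_yearsschool, detail
```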

hist d_yearsschool
Going further, we can also ask how and why the marginal effects of schooling vary. To do so we can plot the marginal effects against levels of schooling. The results are displayed in Figure 4. Here we can see how the marginal effect estimates from KRLS accurately track the derivative of the nonlinear conditional relationship revealed in the CR plot in Figure 2 above. We see that the marginal effect is positive at low levels of schooling, shrinks towards zero at medium levels of schooling, and turns slightly negative at high levels of schooling. This is consistent with the idea that a country's human capital investments exhibit decreasing marginal returns.

lowess d_yearsschool yearsschool
This simple multivariate example illustrates the interpretability offered by KRLS. It accurately fits smooth functions without requiring a specification search, while enabling simple interpretations akin to the coefficient estimates from GLMs. Moreover, it also allows for much richer interpretations regarding effect heterogeneity through the examination of pointwise marginal effects. As seen in this example, examining the distribution of the marginal effects can lead to interesting insights about non-constant marginal effects. In some cases we might find that a covariate has fairly uniform marginal effects, while in other cases the effects might be highly heterogeneous (e.g., the effects are negative in some and positive in other parts of the covariate space).

The full model
Having demonstrated the interpretive benefits of KRLS, in this section we fit a full model and compare the results obtained by OLS and KRLS in detail. As will be shown, KRLS is able to provide a flexible fit, improving both in- and out-of-sample accuracy.

Linear regression
A strong nonlinearity is also visible when plotting the marginal effect (vertical axis) against levels of trade share in Figure 5. If the relationship between trade share and economic growth were linear, we would expect to observe a similar marginal effect across each level (a horizontal line). However, as is evident from the figure, the marginal effect on growth is much larger at higher levels of trade share.

lowess d_tradeshare tradeshare
The interaction between the trade shares and assassinations is also visible when plotting the pointwise marginal effect of trade shares against the number of assassinations:
Finally, we consider the out-of-sample performance. Given the very small sample size (N = 65), one might expect that a far more flexible model such as KRLS would suffer in terms of out-of-sample performance owing to the usual bias-variance tradeoff. However, using leave-one-out forecasts to test model performance, we find that KRLS and the original OLS models have similar performance (MSE of 2.97 for KRLS and 2.75 for OLS), with slightly over half (34 out of 65) of observations having smaller prediction errors under KRLS than under OLS. The KRLS model is also far more stable than the "comparable" OLS model augmented to have additional flexibility as above, which produces very high-variance estimates, for an MSE of 17.6 on leave-one-out forecasts.
In summary, this section illustrates how in this still fairly low dimensional example with only four covariates, linear regression is susceptible to misspecification bias, failing to capture nonlinearities and interactions in the data. By contrast, non-linear, non-additive functions are captured by the KRLS model without necessitating a specification search that is, at best, tedious and error-prone.
The example also illustrates the rich interpretations that can be gleaned from examining the pointwise partial derivatives provided by KRLS. In this case, the effect heterogeneities revealed by KRLS could be confirmed by building an augmented OLS model, illustrating the potential use of KRLS as a robustness-checking procedure. In practice, rebuilding an OLS model in this way would be unnecessary in low-dimensional problems, and often infeasible in high-dimensional problems, while KRLS directly provides an accurate fit together with pointwise marginal effect estimates for interpretation.

Binary predictors
As explained in Hainmueller and Hazlett (2014), KRLS works well with binary independent variables. However, their effects should be interpreted using first differences (rather than the pointwise partial derivatives) to accurately capture the expected difference in the outcome when moving from the low to the high value of the predictor. The krls command automatically detects binary covariates and reports first differences rather than average marginal effects in the output table and pointwise derivatives. Such variables are also marked with an asterisk as binary variables in the output table. To briefly illustrate this we code a binary variable for countries where the years of schooling is 3 years or higher and add this binary regressor.
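The recoding described above could be sketched as follows. The variable name yearsschool3 and the exact set of covariates in the model are assumptions made for illustration; the text only states that a binary indicator for three or more years of schooling is added as a regressor.

```stata
use growthdata.dta, clear

* Binary indicator: 1 if average schooling is 3 years or higher, 0 otherwise
* (left missing where yearsschool is missing)
generate byte yearsschool3 = (yearsschool >= 3) if !missing(yearsschool)

* krls is expected to detect the binary covariate and report a first
* difference for it in the output table
krls growth rgdp60 tradeshare yearsschool3 assassinations
```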

Choosing the smoothing parameter by cross-validation
The krls command returns the number of iterations used to converge on a value for λ in the upper left panel of the function output. By default, the tolerance for the choice of λ is set such that a solution is reached when further changes in λ improve the proportion of variance explained (in a leave-one-out sense) by less than 0.01%. This sensitivity level can be adjusted using the ltolerance() option. Decreasing the sensitivity may improve execution time but may result in the selection of a suboptimal value for λ.
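A call that changes the tolerance might look as follows. The value passed to ltolerance() is an arbitrary placeholder; the units and default of this option are not stated here and should be taken from help krls.

```stata
use growthdata.dta, clear

* Fit with a user-supplied tolerance for the lambda search
* (0.1 is an arbitrary illustrative value, not a recommendation)
krls growth rgdp60 tradeshare yearsschool assassinations, ltolerance(0.1)
```

Comparing the reported number of iterations and the selected λ against a default fit is a simple way to see how much the tolerance matters for a given dataset.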

Further options for predictions
If the user is interested only in predictions, they can specify the suppress option to instruct krls not to calculate derivatives, first differences, and the output table. This significantly decreases execution time, especially in higher-dimensional examples.
In some cases the user might also be interested in obtaining uncertainty estimates for the predicted values. These can be obtained in KRLS because the method provides a closed-form estimator of the full variance-covariance matrix for fitted and predicted values. Following the model fit, users can simply use predict, se to generate a variable that contains the standard errors for the predicted values.
The variance-covariance matrix of the coefficients is stored by default in e(Vcov_c). Users may also wish to obtain the full variance-covariance matrix for the fitted values for further computations. To save execution time this matrix is not saved by default, but it can be requested using the vcov option of the krls command. If the model is fit with this option specified, the variance-covariance matrix of the fitted values is returned in e(Vcov_y). Alternatively, the svcov(filename) option can be used to save this variance-covariance matrix to an external dataset.
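To make the two variance-covariance matrices concrete, the following base R sketch computes them on simulated data. It assumes homoskedastic errors and a fixed λ, estimates the error variance by the mean squared residual, and works on the standardized scale of the outcome. The object names vcov_c and vcov_y are chosen here only to echo e(Vcov_c) and e(Vcov_y); the code is an illustration of the closed-form expressions, not the internals of krls.

```r
# Gaussian kernel matrix for the rows of X: exp(-||x_i - x_j||^2 / sigma2)
gauss_kernel <- function(X, sigma2) exp(-as.matrix(dist(X))^2 / sigma2)

set.seed(7)
n <- 80
X <- scale(matrix(rnorm(n * 2), n, 2))
y <- as.numeric(scale(X[, 1] * X[, 2] + rnorm(n, sd = 0.5)))
lambda <- 0.3   # fixed for illustration; krls selects it by cross-validation

K <- gauss_kernel(X, sigma2 = ncol(X))
Ginv <- solve(K + lambda * diag(n))      # (K + lambda I)^{-1}
coefs <- drop(Ginv %*% y)                # choice coefficients c
fitted <- drop(K %*% coefs)              # fitted values K c

# Error variance estimated by the mean squared residual (assumption:
# homoskedastic errors)
sigma2_eps <- mean((y - fitted)^2)

# Var(c | X) = sigma2_eps * (K + lambda I)^{-2}
vcov_c <- sigma2_eps * Ginv %*% Ginv

# Var(fitted | X) = K Var(c | X) K, using the symmetry of K
vcov_y <- K %*% vcov_c %*% K

# Pointwise standard errors of the fitted values
se_fit <- sqrt(diag(vcov_y))
head(cbind(fitted, se_fit))
```

The full n by n matrix vcov_y grows quadratically with the sample size, which is consistent with the choice not to store it unless the vcov option is requested.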

Further options for extracting results
By default, krls returns the output table of pointwise derivatives and first differences in matrix form in e(Output). Alternatively, the keep(filename) option can be used to store the output table in a new dataset specified by filename.dta. The sderiv(filename) option can similarly be used to save the derivatives in a new dataset.

Kernel-based regularized least squares in R
For R users we have developed the KRLS package (Hainmueller and Hazlett 2017), which implements the same methods as the Stata package described above. The KRLS package is available for download on the Comprehensive R Archive Network (CRAN, https://CRAN.R-project.org/package=KRLS). We also provide a companion script that replicates all the examples described above with the R version of the package.
Overall, the R and the Stata versions produce the same results and we see no significant advantage in using one or the other (except that R is available as free software under the terms of the Free Software Foundation's GNU General Public License). In particular, the numerical implementation of the KRLS estimator is nearly identical across the two versions, with comparable run times and memory requirements.
The command structure is also broadly similar in both packages, although the commands in the R version more closely follow the typical structure of R estimation commands. In particular, the main function in the R package is krls(), which fits the KRLS model once the user has specified, at a minimum, the dependent and independent variables. In addition, the convenience functions summary(), plot(), and predict() are provided to summarize or plot the results from the fitted KRLS model object and to generate predicted values (with standard errors) for in-sample and out-of-sample predictions. For example, we can replicate the full model described above using the following code:
R> library("foreign")
R> library("KRLS")
R> growth <- read.dta("growthdata.dta")
R> covars <- c("rgdp60", "tradeshare", "yearsschool", "assassinations")
R> k.out <- krls(y = growth$growth, X = growth

Conclusion
In this article we have described how to implement kernel regularized least squares using the krls package for Stata. We also provided an implementation in R through the KRLS package (Hainmueller and Hazlett 2017).
The KRLS method allows researchers to overcome the rigid assumptions in widely used models such as GLMs. KRLS fits a flexible, minimum-complexity regression surface to the data, accommodating a wide range of smooth non-linear, non-additive functions of the covariates. Because it produces closed-form estimates for both the fitted values and partial derivatives at every observation, the approach lends itself to easy interpretation. In future releases, we hope to improve upon the krls function by improving its speed (the current implementation begins to get slow with several thousand observations), by allowing for weights, and by providing options for heteroskedasticity-robust and cluster-robust standard errors.
We illustrate the use of the krls function by analyzing GDP growth rates over 1960-1995 for 65 countries (Beck et al. 2000b). Compared to OLS implemented through reg, krls reveals non-linearities and interactions that substantially alter both the quality of fit and the inferences drawn from the data. In this case, an OLS model could be rebuilt using insights from the krls model. In general, however, use of krls obviates the need for a tedious specification search which may still leave some important non-linearities and interactions undetected.

Figure 2: Conditional relationship between growth and schooling (controlling for assassinations).

Figure 3: Distribution of pointwise marginal effect of schooling on growth.

Figure 4: Pointwise marginal effect of schooling and level of schooling.

Figure 6: Pointwise marginal effect of trade share and number of assassinations.