lmSubsets : Exact Variable-Subset Selection in Linear Regression for R

An R package for computing the all-subsets regression problem is presented. The proposed algorithms are based on computational strategies recently developed. A novel algorithm for the best-subset regression problem selects subset models based on a predetermined criterion. The package user can choose from exact and from approximation algorithms. The core of the package is written in C++ and provides an efficient implementation of all the underlying numerical computations. A case study and benchmark results illustrate the usage and the computational efficiency of the package.


Introduction
An important problem in statistical modeling is that of subset selection regression or, equivalently, of finding the best regression equation (Clarke 1981; Hastie et al. 2001). Given a set of possible variables to be included in the regression, the problem consists in selecting a subset that optimizes some statistical criterion. The evaluation of the criterion function typically involves the estimation of the corresponding submodel (Miller 2002). Consider the standard regression model

y = Xβ + ǫ ,   (1)

where y ∈ R^M is the output variable, X ∈ R^(M×N) is the regressor matrix of full column rank, β ∈ R^N is the coefficient vector, and ǫ ∈ R^M is the noise vector. The ordinary least squares (OLS) estimator of β is the solution of

β_OLS = argmin_β RSS(β) ,   (2)

where the residual sum of squares (RSS) of β is given by

RSS(β) = ‖y − Xβ‖²₂ .   (3)

That is, β_OLS minimizes the norm of the residual vector. The regression coefficients β do not need to be explicitly computed in order to determine the RSS, which can be obtained through numerically stable orthogonal matrix decomposition methods (Golub and Van Loan 1996).
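This last point can be illustrated with a small base-R sketch (independent of the package; all object names below are ad hoc): after a QR decomposition of X, the RSS is the squared norm of the trailing M − N entries of Qᵀy, so β never needs to be formed.

```r
## RSS from an orthogonal decomposition, without computing beta explicitly.
set.seed(1)
M <- 100; N <- 5
X <- cbind(1, matrix(rnorm(M * (N - 1)), M, N - 1))  # full column rank
y <- drop(X %*% rnorm(N) + rnorm(M))

qrX <- qr(X)                    # QR decomposition of the regressor matrix
z <- qr.qty(qrX, y)             # z = Q'y
rss_qr <- sum(z[(N + 1):M]^2)   # RSS = squared norm of the trailing part

## agrees with the RSS of the explicit OLS residuals
rss_ols <- sum(qr.resid(qrX, y)^2)
```

The same idea underlies the downdating steps used by the algorithms of Section 2.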
Let V = {1, . . ., N} denote the set of all independent variables. A subset model (or submodel) is denoted by S, S ⊆ V. Given a criterion function f, the best-subset selection problem consists in solving

S* = argmin_{S ⊆ V} f(S) .   (4)

Here, the value f(S) = F(n, ρ) is seen as a function of n = |S| and ρ = RSS(S), the number of selected variables and the RSS of the OLS estimator for S, respectively. Furthermore, it is assumed that f(S) is monotonic with respect to RSS(S) for fixed n, that is,

RSS(S_1) ≤ RSS(S_2) ⟹ f(S_1) ≤ f(S_2) , when |S_1| = |S_2| .   (5)

Common information criteria (IC) exhibit this property, such as those belonging to the AIC family and defined by the formula

f(S) = M log(RSS(S)/M) + k · |S| ,   (6)

where the scalar k represents a penalty per parameter (k > 0). The usual AIC and BIC are obtained for k = 2 and k = log M, respectively (Miller 2002). It follows that (4) can be solved by first determining, for each submodel size n, the subset with the smallest RSS:

S*_n = argmin_{|S| = n} RSS(S) , n = 1, . . ., N .   (7)

Finding the solution to (7) is called the all-subsets selection problem. Thus, solving (4) can be seen as an indirect, two-stage procedure:

Stage 1 For each size n, find the subset S*_n (|S*_n| = n) with the smallest RSS.
Stage 2 Compute f(S*_n) for all n, and determine ν such that f(S*_ν) is minimal.
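For a handful of variables, the two-stage procedure can be spelled out by brute force in base R (an illustrative sketch only; the package replaces Stage 1 with the far more efficient algorithms described below, and the criterion here is of the AIC family with a BIC-type penalty k):

```r
## Two-stage best-subset selection by exhaustive enumeration (sketch).
set.seed(42)
M <- 50; N <- 4
X <- matrix(rnorm(M * N), M, N)
y <- drop(X[, 1] - X[, 3] + rnorm(M))

rss <- function(S) sum(qr.resid(qr(cbind(1, X[, S, drop = FALSE])), y)^2)

## Stage 1: for each size n, the subset S*_n with the smallest RSS
subsets <- unlist(lapply(1:N, function(n) combn(N, n, simplify = FALSE)),
                  recursive = FALSE)
sizes <- lengths(subsets)
rssval <- vapply(subsets, rss, numeric(1))
stage1 <- lapply(1:N, function(n) {
  cand <- subsets[sizes == n]
  cand[[which.min(rssval[sizes == n])]]
})

## Stage 2: evaluate the criterion F(n, rho) for each S*_n, keep the minimum
k <- log(M)                                 # BIC-type penalty
Fcrit <- function(S) M * log(rss(S) / M) + k * length(S)
best <- stage1[[which.min(vapply(stage1, Fcrit, numeric(1)))]]
```

Once Stage 1 is complete, rerunning Stage 2 with a different penalty (e.g., k = 2 for AIC) costs almost nothing, which is the point made below.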
By explicitly solving the all-subsets regression problem (7) once and for all (Stage 1), the list of all N submodels is made readily available for further exploration: evaluating multiple criterion functions (e.g., AIC and BIC), or conducting a more elaborate statistical inference, can be performed at a negligible cost (Stage 2). Thus, it may be advisable to adopt a two-stage approach within the scope of a broader and more thorough statistical investigation.
On the other hand, precursory knowledge of the search function and of its characteristics opens up the possibility for a custom-tailored computational strategy to solve the best-subset selection problem (4) in one go; by exploiting more information about the problem at hand, the solution strategy will be rendered more efficient.
Brute-force (or exhaustive) search procedures that enumerate all possible subsets are often intractable even for a modest number of variables. Exact algorithms must employ techniques to reduce the size of the search space, i.e., the number of enumerated subsets, in order to tackle larger problems. Heuristic algorithms renounce optimality in order to decrease execution times: they are designed for solving a problem more quickly, but make no guarantees on the quality of the solution produced; genetic algorithms and simulated annealing count among the well-known heuristic algorithms (Goldberg 1989; Otten and van Ginneken 1989).
The solution returned by an approximation algorithm, on the other hand, can be proven to lie within well specified bounds of the optimum.
Several packages that deal with variable subset selection are available on the R platform.
Package leaps (Lumley and Miller 2017) implements exact, exhaustive and non-exhaustive algorithms for subset selection in linear models (Miller 2002); it has been extended to generalized linear models by package bestglm (McLeod and Xu 2017).

Here, the lmSubsets package (Hofmann et al. 2018) for exact variable-subset regression is presented. It offers methods for solving both the best-subset (4) and the all-subsets (7) selection problems. It implements the algorithms presented by Gatu and Kontoghiorghes (2006) and Hofmann et al. (2007). A branch-and-bound strategy is employed to reduce the size of the search space. A similar approach has been employed for exact least-trimmed-squares regression (Hofmann et al. 2010). The package further proposes approximation methods that compute non-exact solutions very quickly: the exigencies toward solution quality are relaxed by means of a tolerance parameter that steers the permitted degree of error. The core of the package is written in C++. The package is available for the R system for statistical computing (R Core Team 2017) from the Comprehensive R Archive Network at https://CRAN.R-project.org/package=lmSubsets. Section 2 reviews the theoretical background and the underlying algorithms. The package's R interface is presented in Section 3. A usage example is given in Section 4, while benchmark results are illustrated in Section 5.

Computational strategies
The linear regression model (1) has 2^N possible subset models, which can be efficiently organized in a regression tree. A dropping column algorithm (DCA) was devised as a straightforward approach to solve the all-subsets selection problem (7). The DCA evaluates all possible variable subsets by traversing a regression tree consisting of 2^(N−1) nodes (Gatu and Kontoghiorghes 2003; Gatu et al. 2007; Smith and Bremner 1989).
Numerically, this is equivalent to downdating an orthogonal matrix decomposition after a column has been deleted (Golub and Van Loan 1996; Kontoghiorghes 2000; Smith and Bremner 1989). Givens rotations are employed to efficiently move from one node to another. The DCA maintains a subset table r with N entries, where entry r_n contains the RSS of the current-best submodel of size n (Gatu and Kontoghiorghes 2006; Hofmann et al. 2007). Figure 1 illustrates a regression tree for N = 5 variables. The index k is symbolized by a bullet (•).
The subleading models are listed in each node.
The DCA is computationally demanding, with a theoretical time complexity of O(2^N). A branch-and-bound algorithm (BBA) has been devised to reduce the number of generated nodes by cutting subtrees which do not contribute to the current-best solution. It relies on the fundamental property that the RSS increases when variables are deleted from a regression model, that is,

S_1 ⊆ S_2 ⟹ RSS(S_1) ≥ RSS(S_2) .

A cutting test is employed to determine which parts of the DCA tree are redundant: a new node drop(S, j) is generated only if RSS(S) < r_j (j = k + 1, . . ., n − 1). The quantity RSS(S) is called the bound of the subtree rooted in (S, k): no subset model extracted from the subtree can have a smaller RSS (Gatu and Kontoghiorghes 2006). Note that the BBA is an exact algorithm, i.e., it computes the optimal solution of the all-subsets regression problem (7).
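The recursion and the bound property can be sketched in base R (a simplified, illustrative re-implementation of the BBA idea, not the package's C++ code; the cutting test used here prunes slightly more conservatively than the one above, so the result remains exact):

```r
## Branch-and-bound all-subsets regression (illustrative sketch).
bba_subsets <- function(X, y) {
  N <- ncol(X)
  rss <- function(S) sum(qr.resid(qr(X[, S, drop = FALSE]), y)^2)
  r <- rep(Inf, N)                 # subset table: current-best RSS per size
  which_best <- vector("list", N)
  visit <- function(S, k) {        # node (S, k): first k variables are fixed
    n <- length(S)
    b <- rss(S)                    # bound: no submodel in this subtree beats b
    if (b < r[n]) { r[n] <<- b; which_best[[n]] <<- S }
    for (j in seq_len(n)) {
      if (j <= k) next
      child <- S[-j]
      nc <- length(child)
      ## cutting test: descend only if some table entry can still improve
      if (nc >= 1 && any(b < r[seq_len(nc)])) visit(child, j - 1)
    }
  }
  visit(seq_len(N), 0)
  list(rss = r, which = which_best)
}

set.seed(1)
M <- 40; N <- 5
X <- matrix(rnorm(M * N), M, N)
y <- drop(X[, 2] + rnorm(M))
res <- bba_subsets(X, y)
```

Because the bound b is a lower limit on the RSS of every submodel in the subtree, skipping a subtree whose bound cannot improve any table entry never discards an optimum.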
To further reduce the computational cost, the all-subsets regression problem can be restricted to a range of submodel sizes (Hofmann et al. 2007). In this case, the problem (7) is reformulated as

S*_n = argmin_{|S| = n} RSS(S) , n = n_min, . . ., n_max ,   (8)

where n_min and n_max are the subrange limits (1 ≤ n_min ≤ n_max ≤ N). The search will span only a part of the DCA regression tree: nodes (S, k) that cannot contribute a submodel with a size inside the range are not computed. The size of subtrees rooted in the same level decreases exponentially from left to right. In order to encourage the pruning of large subtrees by the BBA cutting test, the variables in a given node can be ordered such that a child node will always have a larger RSS (i.e., bound) than its right siblings (Gatu and Kontoghiorghes 2006). This strategy can be applied in nodes of arbitrary depth. However, computing the variable bounds incurs a computational overhead; thus, it is not advisable to indiscriminately preorder variables. A parameter, the preordering radius p, has been introduced to control the degree of preordering (Hofmann et al. 2007). It accepts a value between p = 0 (no preordering) and p = N (preordering in all nodes); when p = 1, preordering is performed in the root node only. Typically, p = ⌊N/3⌋ produces good results in terms of execution time.
The computational efficiency of the BBA is improved by allowing the algorithm to prune non-redundant branches of the regression tree. The approximation branch-and-bound algorithm (ABBA) relaxes the cutting test by employing a set of tolerance parameters τ_n ≥ 0 (n = 1, . . ., N), one for every submodel size. A node drop(S, j) is generated only if there exists at least one i = j, . . ., n − 1 such that

RSS(S) < r_i + τ_i · RSS_full ,   (9)

where RSS_full = RSS(V) is the RSS of the full model. The algorithm is non-exact if τ_n > 0 for any n, meaning that the computed solution is not guaranteed to be optimal. The greater the value of τ_n, the more aggressively the regression tree is pruned, thus decreasing the computational load. The advantage of the ABBA over heuristic algorithms is that the relative error of the solution is bounded by the tolerance parameter (Gatu and Kontoghiorghes 2006; Hofmann et al. 2007), thus giving the user control over the tradeoff between solution quality and speed of execution.
The DCA and its derivatives report the N subset models with the lowest RSS, one for each subset size. The user can then analyze the list of returned subsets to determine the "best" subset, for example by evaluating some criterion function. This approach is practical but not necessarily the most efficient to solve the best-subset selection problem (4). Let f be a criterion function such that f(S) = F(n, ρ), where n = |S| and ρ = RSS(S), satisfying the monotonicity property (5). The f-BBA specializes the standard cutting test for f under the additional condition that F is non-decreasing in n. Specifically, a node drop(S, j) is generated if and only if

F(j, RSS(S)) < r_f ,

where r_f is the value of the single current-best solution. This results in a more "informed" cutting test, and in a smaller number of generated nodes.
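A best-subset variant of the earlier recursion can be sketched as follows (again an illustrative base-R re-implementation, not the package's algorithm; F is taken from the AIC family, and the bound F(1, RSS(S)) is valid simply because F is non-decreasing in both arguments over the subtree):

```r
## Criterion-aware branch-and-bound for best-subset selection (sketch).
fbba_select <- function(X, y, k = log(nrow(X))) {
  M <- nrow(X); N <- ncol(X)
  rss <- function(S) sum(qr.resid(qr(X[, S, drop = FALSE]), y)^2)
  Fcrit <- function(n, rho) M * log(rho / M) + k * n   # AIC family
  rf <- Inf; bestS <- NULL                # single current-best solution
  visit <- function(S, fixed) {
    n <- length(S)
    b <- rss(S)
    v <- Fcrit(n, b)
    if (v < rf) { rf <<- v; bestS <<- S }
    for (j in seq_len(n)) {
      if (j <= fixed) next
      child <- S[-j]
      ## F(1, b) lower-bounds every submodel in the child's subtree,
      ## since sizes there are >= 1 and RSS values are >= b
      if (length(child) >= 1 && Fcrit(1, b) < rf) visit(child, j - 1)
    }
  }
  visit(seq_len(N), 0)
  list(value = rf, subset = bestS)
}

set.seed(7)
M <- 60; N <- 6
X <- matrix(rnorm(M * N), M, N)
y <- drop(X[, 1] + X[, 4] + rnorm(M))
sel <- fbba_select(X, y)
```

Maintaining a single current-best value rf, instead of one table entry per size, is what makes the cutting test more aggressive.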

Implementation in R
The R package lmSubsets provides a library of methods for variable-subset selection in linear regression. Two S3 classes are defined, namely "lmSubsets" and "lmSelect", that address all-subsets (7) and best-subset (4) selection, respectively. The package offers R's standard formula interface: linear models can be specified by means of a symbolic formula, and possibly a data frame. The model specification is resolved into a regressor matrix and a response vector, which are forwarded to low-level functions for actual processing, together with optional arguments which further specify the selection problem. A routine to extract the best submodels from an all-subsets regression solution (i.e., to convert an "lmSubsets" to an "lmSelect" object) is also provided. An overview of the package structure is given in Table 1.

Specifying the selection problem
The default methods are closely modeled after R's standard lm() function: they can be called with any entity that can be coerced to a formula object (Chambers and Hastie 1992). The formula object declares the dependent and independent variables, which are typically taken from a data.frame specified by the user. For example, the call

lmSubsets(mortality ~ precipitation + temperature1 + temperature7 + age + household + education + housing + population + noncauc + whitecollar + income + hydrocarbon + nox + so2 + humidity, data = AirPollution)

specifies a response variable (mortality) and fifteen predictor variables, all taken from the AirPollution dataset (Miller 2002). It is common to shorten the call by employing R's practical "dot-notation",

lmSubsets(mortality ~ ., data = AirPollution) ,

where the dot (.) stands for "all variables not mentioned in the left-hand side of the formula". By default, an intercept term is included in the model; that is, the call in the previous example is equivalent to lmSubsets(mortality ~ . + 1, data = AirPollution).
Submodels can be rejected based on the presence or absence of certain independent variables. The parameter include specifies that all submodels must contain one or several variables. In the following example, only submodels containing the variable noncauc are retained: lmSubsets(mortality ~ ., include = "noncauc", data = AirPollution).
Conversely, the exclude parameter can be employed to discard a specific set of variables, as in the following example: lmSubsets(mortality ~ ., exclude = "whitecollar", data = AirPollution).
The same effect can be achieved by rewriting the formula as follows: lmSubsets(mortality ~ . - whitecollar, data = AirPollution).
The include and exclude parameters may be used in combination, and both may specify more than one variable (e.g., include = c("noncauc", "whitecollar")).
The criterion used for best-subset selection is specified via the penalty argument. For a numeric penalty, the criterion is evaluated following the AIC-family expression (6), where penalty is the penalty k per model parameter. Alternatively, the user may supply a criterion function of the form f(size, rss), where size is the number of regressors, and rss the residual sum of squares of the corresponding submodel. The user-specified function must be non-decreasing in both parameters.
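A user-supplied criterion might look as follows (a hypothetical sketch: the closure fixes the number of observations M and the penalty k, and the resulting function is non-decreasing in both size and rss, as required):

```r
## A hypothetical user-supplied criterion f(size, rss) of the AIC family.
make_crit <- function(M, k) {
  function(size, rss) M * log(rss / M) + k * size
}
crit <- make_crit(M = 100, k = 2)   # k = 2 gives an AIC-type criterion
```

Such a function could then, hypothetically, be passed in place of a numeric penalty (e.g., penalty = crit) when calling the selection methods.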

Core functions
The high-level interface methods process the model specification before dispatching the call to one of two low-level core functions, passing along a regressor matrix x and a response vector y, together with problem-specific arguments.The parameters are summarized in Table 2.
The weights and offset parameters correspond to the homonymous parameters of the lm() function.The include and exclude parameters allow the user to specify variables that are to be included in, or excluded from all candidate models.They are either logical vectors-with each entry corresponding to one variable-or automatically expanded if given in the form of an integer vector (i.e., set of variable indices) or character vector (i.e., set of variable names).
For a large number of variables (see Section 5), execution times may become prohibitive. In order to speed up the execution, either the search space can be reduced, or one may settle for a non-exact solution. In the first approach, the user may specify values for the nmin and nmax parameters as defined in (8), in which case submodels with fewer than nmin or more than nmax variables are discarded. Well-defined regions of the regression tree can be ignored by the selection algorithm, and the search space is thus reduced.
In the second approach, expectations with respect to the solution quality are lowered, i.e., non-optimal solutions are tolerated. The numeric value, typically between 0 and 1, passed as the tolerance argument indicates the degree of "over-pruning" performed by the ABBA cutting test (9). The solution S returned by the ABBA is guaranteed to lie within tolerance-dependent bounds of the optimum: the error f(S) − f(S*) is bounded in terms of the tolerance and the cost of the full model V, where S* is the optimal (theoretical) solution, and f the cost of a submodel (e.g., deviance, AIC). The lmSubsets_fit() function accepts a vector of tolerances, with one entry for each subset size.
The nbest parameter controls how many submodels (per subset size) are retained.In the case of lmSubsets_fit(), a two-dimensional result set is constructed with nbest submodels for each subset size, while in the case of lmSelect_fit(), a one-dimensional sequence of nbest submodels is handed back to the user.
The pradius parameter serves to specify the desired preordering radius.The algorithm employs a default value of ⌊nvar/3⌋.The need to set this parameter directly should rarely arise; please refer to Section 2 for further information.

Extracting submodels
The user is handed back a result object that encapsulates the solution to an all-subsets (class "lmSubsets") or best-subset (class "lmSelect") selection problem. An object of class "lmSubsets" represents a two-dimensional nvar × nbest set of submodels; an object of class "lmSelect", a linear sequence of nbest submodels. Problem-specific information is stored alongside the selected submodels. Table 3 summarizes the components of the result objects.
A wide range of standard methods to visualize, summarize, and extract information are provided (see Table 4).The print(), plot(), and summary() methods give the user a compact overview-either textual or graphical-of the information gathered on the selected submodels in order to help identify "good" candidates.The remaining extractor functions can be used to extract variable names, coefficients, covariance matrices, fitted values, etc.
In order to designate a submodel, "lmSubsets" methods provide two parameters to specify the number of regressors and the ranking of the desired submodel, namely size and best, respectively. For "lmSelect" methods, the size parameter has no meaning and is not defined. Some methods, i.e., variable.names(), deviance(), sigma(), logLik(), AIC(), BIC(), and coef(), can extract more than one submodel at a time if passed a numeric vector as an argument to either size (e.g., size = 5:10) or best (e.g., best = 1:3).

Case study: Variable selection in weather forecasting
Advances in numerical weather prediction (NWP) have played an important role in the increase of weather forecast skill over the past decades (Bauer et al. 2015). Numerical models simulate physical systems that operate at a large, typically global, scale. The horizontal (spatial) resolution is limited by the computational power available today. Starting with Glahn and Lowry (1972), NWP outputs have been post-processed to correct for local and unresolved effects in order to obtain forecasts for specific locations (see Wilks 2011, Chapter 7, for an overview). So-called model output statistics (MOS) develops a regression relationship based on past meteorological observations of the variable to be predicted and forecasted NWP quantities at a certain lead time. Variable-subset selection is often employed to determine which NWP outputs should be included in the regression model for a specific location.
In the following, package lmSubsets is used to build a MOS regression model predicting temperature at Innsbruck Airport, Austria, based on data from the Global Ensemble Forecast System (Hamill et al. 2013).

Table 5: Estimated regression coefficients (along with standard errors) and summary statistics for models MOS0, MOS1, and MOS2.
First, the dataset is loaded and the few missing values are omitted for simplicity.
R> data("IbkTemperature", package = "lmSubsets")
R> IbkTemperature <- na.omit(IbkTemperature)

A simple output model for the observed temperature (temp) is constructed, which will serve as the reference model. It consists of the 2-meter temperature NWP forecast (t2m), a linear trend component (time), as well as seasonal components with annual (sin, cos) and bi-annual (sin2, cos2) harmonic patterns.
R> MOS0 <- lm(temp ~ t2m + time + sin + cos + sin2 + cos2,
+    data = IbkTemperature)

The estimated coefficients (and standard errors) are shown in Table 5. It can be observed that despite the inclusion of the NWP variable t2m, the coefficients for the deterministic components remain significant, which indicates that the seasonal temperature fluctuations are not fully resolved by the numerical model.
Next, the reference model is extended with selected regressors taken from the remaining 35 NWP variables.
Finally, an all-subsets regression is conducted on all 41 variables without any restrictions.

R> MOS2_all <- lmSubsets(temp ~ ., data = IbkTemperature)
R> MOS2 <- refit(lmSelect(MOS2_all, penalty = "BIC"))

The results are illustrated in Figures 2b and 3b. Here, all-subsets regression is employed, instead of the cheaper best-subset regression, in order to give insights into possible variable-selection patterns over a range of submodel sizes. The "lm" object for the submodel with the lowest BIC is extracted (MOS2). See Table 5 for MOS2 summary statistics.
The best-BIC models MOS1 and MOS2 both have 13 regressors. The deterministic trend and all but one of the harmonic seasonal components are retained in MOS2. In addition, MOS1 and MOS2 share eight NWP outputs relating to temperature (tmax2m, st, t2pvu), pressure (mslp, p2pvu), hydrology (vsmc, wr), and heat flux (sshnf). However, and most remarkably, MOS1 does not include the direct 2-meter temperature output from the NWP model (t2m). In fact, t2m is not included in any of the 20 submodels (sizes 8 to 27) shown in Figure 2b, whereas the temperature quantities tmax2m, st, and t2pvu are included in all of them. The summary statistics reveal that both MOS1 and MOS2 significantly improve over the simple reference model MOS0, with MOS2 being slightly better than MOS1.
In summary, this case study illustrates how lmSubsets can be used to easily identify relevant variables beyond the direct model output for MOS regressions, yielding substantial improvements in forecasting skill. A full meteorological application would require further testing using cross-validation or other out-of-sample assessments. Recently, there has been increasing interest in MOS models beyond least-squares linear regression, e.g., to take into account the effects of heteroscedasticity, censoring, and truncation. In this context, other selection techniques, such as boosting (Messner et al. 2016, 2017), can be considered.

Benchmark tests
Comparative tests are conducted to evaluate the computational efficiency of the proposed methods for exact all-subsets and exact best-subset regression.The regsubsets() method from package leaps, and the bestglm() method from package bestglm serve as benchmarks, respectively.
Datasets which contain a "true" model are simulated, with nobs observations and nvar independent variables. The dependent variable y is constructed from a linear combination of ntrue randomly selected independent variables, a noise vector e, and an intercept:

y = 1 + X 𝟙_true + e , where e ∼ N(0, sigma²) ,

where X is a nobs × nvar matrix of random data, and 𝟙_true a (random) indicator vector whose entries equal 1 if the corresponding column of X belongs to the "true" model, and 0 otherwise. All tests were conducted on a Dell XPS 15 laptop with 8 GB (7.4 GiB) of memory and an Intel Core i7-6700HQ CPU @ 2.60 GHz × 8 processor, running a 64-bit Ubuntu operating system.

Benchmark 1 concerns itself with all-subsets selection. The lmSubsets() method is compared to regsubsets(). The complexity mainly depends on the number of variables (nvar): the algorithms employ the QRD to compress the data into a square nvar × nvar matrix; the initial cost of constructing the QRD is negligible. Data configurations with varying sizes (nvar = 20, 25, 30, 35, 40) and degrees of noise (sigma = 0.05, 0.10, 0.50, 1.00, 5.00) are considered; in all cases, nobs = 1000 and ntrue = ⌊nvar/2⌋. For each configuration, five random datasets are generated, giving rise to five runs per method over which average execution times are determined. The performance of regsubsets() can be improved by "manually" preordering the dataset in advance (Hofmann et al. 2007). The average running times are summarized in Table 6, along with the relative performance (speedup) of lmSubsets(). The same setup is used in Benchmark 2, where methods for best-subset selection are compared, namely bestglm() and lmSelect(). As in the previous benchmark, average execution times are determined for bestglm() with and without preordering. The results are illustrated in Table 7.
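The simulation scheme just described can be sketched as follows (an illustrative reconstruction; unit coefficients for the "true" variables are an assumption):

```r
## Simulate a benchmark dataset containing a "true" submodel.
simulate_data <- function(nobs, nvar, ntrue, sigma) {
  X <- matrix(rnorm(nobs * nvar), nobs, nvar)
  true <- sort(sample(nvar, ntrue))            # randomly selected "true" variables
  ind <- as.numeric(seq_len(nvar) %in% true)   # indicator of the true model
  y <- 1 + drop(X %*% ind) + rnorm(nobs, sd = sigma)
  list(x = X, y = y, true = true)
}

set.seed(1)
d <- simulate_data(nobs = 1000, nvar = 20, ntrue = 10, sigma = 0.05)
```

One such dataset per run is generated for each (nvar, sigma) configuration.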
It is not surprising that bestglm() is very close to regsubsets() in terms of execution time, as the former post-processes the results returned by the latter; in fact, bestglm() implements the two-stage approach to solving the best-subset selection problem, where Stage 1 is tackled by regsubsets() (see Section 1 for further details). Manually preordering the variables improves the performance of regsubsets() (and hence, of bestglm()) by a factor of approximately 2; for nvar = 40 and a high level of noise (sigma = 5.00), by a factor of almost 4. In the tests conducted here, lmSubsets() is two orders of magnitude faster than regsubsets(), even with preordering; lmSelect() is three orders of magnitude faster than bestglm().
Benchmark 3 pits all-subsets and best-subset selection, exact and approximation algorithms against one another. The average execution times of lmSubsets() and lmSelect(), for tolerance = 0.0 and 0.1, are illustrated in Table 8. Note that for large datasets (nvar = 80), subsets computed by lmSubsets() are restricted to sizes between nmin = 30 and nmax = 50 variables; the restriction does not apply to lmSelect().

Shrinkage methods
Genetic algorithms for model selection have been considered for comparative study. However, pertinent R packages have been found to impose restrictions on the class of problems that can be addressed: limited problem size (glmulti), fixed submodel size (kofnGA), or no immediate support for IC-based search (subselect). This hampers efforts to conduct a meaningful comparison.
LASSO (Tibshirani 1996) can be seen as an alternative to exact variable selection methods, of which package glmnet brings an efficient implementation to R. The function glmnet() computes an entire regularization path and returns a sequence of sparse estimators.The method is not IC-based; rather, it employs a modified objective function that induces sparsity by penalizing the regression coefficients.
The return value of glmnet() can be post-processed for comparison with lmSelect().For each (sparse) estimator contained in the sequence returned by glmnet(), the subset model corresponding to the variables with non-zero coefficients is identified; the submodel is fitted (in the OLS sense), and the BIC extracted.The list of submodels thus obtained is sorted in order of increasing BIC, after removal of duplicates.
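This post-processing step can be sketched in base R (for self-containment, a small hard-coded coefficient matrix stands in for the output of coef() applied to a glmnet fit; data and object names are illustrative):

```r
## Turn a regularization path into a BIC-ordered list of subset models.
set.seed(1)
M <- 60; N <- 6
X <- matrix(rnorm(M * N), M, N)
y <- drop(X[, 2] + 0.5 * X[, 5] + rnorm(M))

## stand-in for the path of sparse coefficient vectors (one column per lambda)
path <- cbind(c(0, 1, 0, 0, 0.0, 0),
              c(0, 1, 0, 0, 0.4, 0),
              c(0.1, 1, 0, 0, 0.5, 0))

bic <- function(S) {
  rss <- sum(qr.resid(qr(cbind(1, X[, S, drop = FALSE])), y)^2)
  M * log(rss / M) + log(M) * (length(S) + 1)
}

## identify the subset of non-zero coefficients for each estimator,
## drop duplicates, and sort by increasing BIC
subsets <- unique(lapply(seq_len(ncol(path)), function(j) which(path[, j] != 0)))
ranked <- subsets[order(vapply(subsets, bic, numeric(1)))]
```

Each submodel is refitted in the OLS sense before its BIC is evaluated, so the comparison with lmSelect() is on equal footing.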

Figure 1 :
Figure 1: All-subsets regression tree for N = 5 variables. Nodes are shown together with their subleading models.

Figure 2 :
Figure 2: Variables selected in MOS1_best and MOS2_all. Submodels MOS1 and MOS2 are highlighted in red.
An active set algorithm for solving the best-subset selection problem in generalized linear models is proposed by package BeSS (Wen et al. 2018). Package subselect (Orestes Cerdeira et al. 2017) proposes simulated annealing and genetic algorithms that search for subsets of variables which are optimal under various criteria. Package glmulti (Calcagno 2013) provides IC-based automated model selection methods for generalized linear models in the form of exhaustive and genetic algorithms.
Package kofnGA (Wolters 2015) uses a genetic algorithm to choose a fixed-size subset under a user-supplied objective function. Procedures for regularized estimation of generalized linear models with elastic-net penalties are implemented in package glmnet (Friedman et al. 2010).

Table 2 :
Core parameters for the low-level functions lmSubsets_fit() and lmSelect_fit(), which act as wrappers around the C++ library.
The shape of the return value can be controlled with the drop parameter: a numeric or character vector (in some cases, a logical or numeric matrix) is returned if drop is TRUE; otherwise, a data.frame object is handed back.
The data frame IbkTemperature contains 1824 daily cases for 42 variables: the temperature at Innsbruck Airport (observed), 36 NWP outputs (forecasted), and 5 deterministic time trend/season patterns. The NWP variables include quantities pertaining to temperature (e.g., 2 meters above ground, minimum, maximum, soil), precipitation, wind, and fluxes, among others. See ?IbkTemperature for more details.