CovSel : An R Package for Covariate Selection When Estimating Average Causal Eﬀects

We describe the R package CovSel , which reduces the dimension of the covariate vector for the purpose of estimating an average causal eﬀect under the unconfoundedness assumption. Covariate selection algorithms developed in De Luna, Waernbaum, and Richardson (2011) are implemented using model-free backward elimination. We show how to use the package to select minimal sets of covariates. The package can be used with continuous and discrete covariates and the user can choose between marginal co-ordinate hypothesis tests and kernel-based smoothing as model-free dimension reduction techniques.


Introduction
The theory and practice of causal inference from observational studies is an active research field within the statistical sciences (including econometrics and epidemiology). Typical observational studies have as purpose to evaluate the effect of a causal variable (often called treatment) on an outcome of interest. Since in observational studies pre-treatment variables, henceforth covariates, are not expected to have a distribution balanced between treatment groups, as opposed to a randomized study, one needs to control for confounding covariates. Under unconfoundedness, i.e., the potential outcomes are independent of the treatment assignment given a vector of covariates, an average causal effect may be identified.
The starting point for covariate selection should be subject matter knowledge, which in practice often gives only partial guidance, and this might result in an unnecessarily highdimensional covariate vector. When estimating an average causal effect non-parametrically, controlling for too many covariates may result in poor performance of the estimator (e.g., Rubin 1997;Hahn 2004;De Luna et al. 2011) emphasizing the importance of avoiding conditioning on redundant covariates. A common practice has been to control for all covariates affecting the treatment assignment without considering whether they are also related to outcome. The lack of theoretical results in the literature on the issue of covariate selection was pointed out in Imbens and Wooldridge (2009), and new results have recently appeared; see De Luna et al. (2011), Vansteelandt, Bekaert, andClaeskens (2012), Laan and Gruber (2010); see also earlier work by Robins and Rotnitzky (1995) and Hahn (2004).
In this article we introduce the R (R Core Team 2015) package CovSel (Häggström and Persson 2015) aiming at reducing the covariate set when the purpose of the analysis is to estimate an average causal effect non-parametrically. The package is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=CovSel and implements the general algorithms for covariate selection proposed by De Luna et al. (2011) and Persson, Häggström, Waernbaum, and De Luna (2013). The former paper provides the theoretical foundation for the covariate selection algorithms and the latter studies a data-driven implementation of the algorithms showing, e.g., how dimension reduction of the covariate vector can yield important mean squared error (MSE) decrease for commonly used non-parametric estimators. The data-driven covariate selection builds on marginal coordinate hypothesis tests (Cook 2004;Li, Cook, and Nachtsheim 2005) for continuous-valued covariate vectors, and on kernel smoothing with smoothing parameter thresholding when discrete and continuous-valued covariates are available (Li, Racine, and Wooldridge 2009).
In Section 2 we give a brief description of the theoretical framework. We also introduce a classic dataset, the LaLonde data, which is used to illustrate the purpose of the package. Then follows a description of the CovSel package in Section 3. In Section 4 we demonstrate the use of the package CovSel by performing covariate selection for the LaLonde data as well as in a number of different situations using simulated data.

Model
Assume that we have a random sample of individuals from a large population and want to estimate the average causal effect of a binary treatment T on an outcome Y . The Neyman- Rubin model (Splawa-Neyman 1990, which is the translated version of a paper published in 1923; Rubin 1974) is commonly used as a framework for causal reasoning and inference. For each individual in the study, it defines two potential outcomes Y (1), outcome if treated, and Y (0) outcome if not treated. Then, the individual causal effect is defined as (Y (1) − Y (0)). Because only one of the two potential outcomes can be observed for a given individual, individual causal effects are not identified. Instead, population parameters are targeted, e.g., the average causal effect Let X be a set of covariates observed before treatment on all individuals in the study. Then, the identification of the parameter β, and other population parameters, can be obtained under the following assumptions.

Algorithm 1
Step 1 Step 2 Here A ⊥ ⊥ B|C means A "is independent of" B given C (Dawid 1979). The unconfoundedness assumption holds when X consists of all the covariates affecting both the causal agent T and the potential outcomes Y (1), Y (0). An example where the unconfoundedness assumption holds trivially are randomized experiments, where we have T ⊥ ⊥ (Y (1), Y (0)). In observational studies, the unconfoundedness assumption may be realistic in situations where many covariates X are available to condition on.

Algorithms and minimal sets of covariates
The covariates one wishes to identify are those which affect treatment as well as the potential outcomes and if Assumptions 1 and 2 hold it is possible to define the minimal sets of covariates such that the treatment and the potential outcomes are independent given these sets, see De Luna et al. (2011). Algorithms for covariate selection, denoted Algorithm 1 and Algorithm 2, are described in Figures 1 and 2, where the sets X T , Q t , X t and Z t , t = 0, 1, are defined as follows: X T is the minimal set of the complete covariate vector X rendering T and the covariates not included in X T conditionally independent, T ⊥ ⊥ X \ X T |X T . Similarly, Q t is the minimal subset of X T rendering Y t and the covariates not included in Q t conditionally independent, Y t ⊥ ⊥ X T \ Q t |Q t . X t is the minimal set of the complete covariate vector X rendering Y t and the covariates not included in X t conditionally independent, Y t ⊥ ⊥ X \ X t |X t , and Z t is the minimal subset of X t rendering T and the covariates not included in X t conditionally independent, T ⊥ ⊥ X t \ Z t |Z t . For existence and uniqueness of the sets defined above see De Luna et al. (2011).

The LaLonde data
The LaLonde data was first analyzed in LaLonde (1986) and has since then been used numerous times (e.g., Dehejia and Wahba 1999;Smith and Todd 2005;Abadie and Imbens 2011). The data used in this paper is available on Dehejia's web page at http: //www.nber.org/~rdehejia/data/nswdata2.html and consists of 297 treated units from a randomized evaluation of a labor training program, the national supported work (NSW) demonstration, and 314 non-experimental comparison units drawn from survey datasets. Below, using the text files from Dehejia's web page, we construct the data frame lalonde which will be used later on to demonstrate the selection of covariates for estimation of the average causal effect of participation in NSW on post-intervention earnings. The lalonde data frame is included in the CovSel package and can be accessed with the data command. Similar but not identical datasets are included in other R packages, e.g., arm (Gelman and Su 2015), Matching (Sekhon 2011), MatchIt (Ho, Imai, King, and Stuart 2011) and cem (Iacus, King, and Porro 2009). The following code is used to create the lalonde data frame included in package CovSel: R> treated <-read. R> colnames(lalonde) <-c("treat", "age", "educ", "black", "hisp", + "married", "nodegr", "re74", "re75", "u74", "u75", "re78") The data frame lalonde consists of 614 observations on 12 variables. The first ten rows are shown below. The first variable, treat, is the treatment status indicator variable, with 1 indicating participation in NSW. The next two variables are age in years (age) and schooling in years (educ). Next, black, hisp, married and nodegr are indicator variables (1 = yes, 0 = no) for black, hispanic, marital status and high school diploma, respectively. Real earnings during the years 1974, 1975 and 1978 are given in re74, re75, re78, respectively, and u74, u75 are indicator variables for earnings in 1974, 1975 being zero.

The R package CovSel
The R package CovSel contains the functions: • cov.sel is the main function called by the user for selecting covariate sets.
• cov.sel.np is called by cov.sel if kernel smoothing should be used.
• summary method for 'cov.sel' objects produces a summary of the results returned by cov.sel.
The package also contains the simulated data sets datc, datf and datfc which are described and analyzed in the examples in Section 4.

Function cov.sel
The function cov.sel can be used for reducing the dimension of the covariate vector in situations where we want to estimate an average causal effect and the unconfoundedness assumption holds. It is used as: cov.sel(T, Y, X, type = c("dr", "np"), alg = 3, scope = NULL, alpha = 0.1, thru = 0.5, thro = 0.25, thrc = 100, ...) and takes the following arguments: • T is a binary vector indicating the treatment status.
• Y is a numeric vector of observed outcomes.
• X is a matrix or data frame containing columns of covariates. The covariates may be a mix of continuous, unordered discrete (to be specified in the data frame using factor), and ordered discrete (to be specified in the data frame using ordered).
• type is the type of method used, "dr" for marginal co-ordinate hypothesis tests and "np" for kernel-based smoothing. Marginal co-ordinate hypothesis tests are suitable in situations with only continuous covariates while kernel-based smoothing can be used if discrete covariates are also present.
• alg is used to specify which algorithm to use. alg = 1 indicates Algorithm 1, alg = 2 indicates Algorithm 2 and alg = 3 runs them both. alg = 3 is default.
• scope is a character string giving the name of one (or several) covariate(s) that must not be removed.
• alpha is a stopping criterion for the marginal co-ordinate hypothesis tests, i.e., the algorithm will stop removing covariates when the p value for the next covariate to be removed is less then alpha. The default is alpha = 0.1.
• thru is the bandwidth threshold used for unordered discrete covariates if type = "np". Values in [0, 1] are valid. thru = 0 removes all unordered discrete covariates and thru = 1 removes none of them. Default is thru = 0.5.
• thro is the bandwidth threshold used for ordered discrete covariates if type = "np". Values in [0, 1] are valid. thro = 0 removes all ordered discrete covariates and thro = 1 removes none of them. Default is thro = 0.25.
• thrc is the bandwidth threshold used for continuous covariates if type = "np". Nonnegative values are valid. Default is thr = 100.
• ... are additional arguments passed on to dr or npregbw. If type = "dr", method can be set to "sir" or "save", the first being the default. trace = 0 suppresses the output generated by dr.step. If type = "np", regtype can be set to "lc" or "ll", the first being the default and bwtype can be set to "fixed", "generalized_nn" or "adaptive_nn", where default is "fixed". See dr and npregbw for usage of na.action.
If type = "dr", marginal co-ordinate hypothesis tests are performed in each step of the algorithm using the function dr from package dr (Weisberg 2002). If on the other hand type = "np" then non-parametric kernel regression is repeatedly performed, using the function npregbw from package np (Hayfield and Racine 2008), to determine which covariates can be removed from the full covariate set.
With dr one can choose between sliced inverse regression, "sir", or sliced average variance estimation, "save". Both are methods based on studying an inverse regression problem, where the former considers dependencies only through the first moment (the mean) while the latter looks at the second moment (Cook and Weisberg 1991). Though "save" is more general than "sir", it may miss first moment information, e.g., linear trends. Thus, one may want to use both to see if they result in different choices of covariate sets. For kernel-based smoothing the regression type can be set to using a local constant or local linear kernel and the bandwidth type can be set to fixed, generalized nearest neighbors or adaptive nearest neighbors. See dr and npregbw for details.
In kernel-based smoothing the bandwidth range for an unordered discrete covariate x is 0 to 1/length(levels(x)), while for ordered discrete covariates, no matter how many levels, the range is 0 to 1. For continuous covariates the bandwidth ranges from 0 to infinity. Ordered discrete and continuous covariates are removed if their bandwidths exceed their respective thresholds (thro and thrc). Unordered discrete covariates are removed if their bandwidths are larger than thru times the maximum bandwidth.
Since cross-validation is used to select bandwidths for the kernel smoothing this is computationally intensive and a doubling of the sample size will increase the run time of npregbw by a factor of four. This means that cov.sel with type = "np" will be slow for large sample sizes since npregbw is called multiple times. The computation time can be reduced by setting some of the arguments in npregbw to non-default values. For more on this subject the reader is referred to Hayfield and Racine (2008) and to the frequently asked questions document on Jeffrey S. Racine's website (http://socserv.mcmaster.ca/racine/np_faq.pdf).
The default values for the thresholds thru, thro, thrc and alpha are arbitrary. On the other hand, we know that the bandwidth for relevant covariates will tend to zero as sample size increases (Li et al. 2009). The thresholds used should therefore be smaller the larger the sample sizes in order to avoid selecting irrelevant covariates. This said, the default values for the thresholds have shown to yield good results in simulations studies reported in Persson et al. (2013), for a wide range of situations and sample sizes of 500 and 1000. In applications, we recommend the user to investigate the sensitivity of the results to small changes in the threshold values.

Examples
We illustrate the use of the main function cov.sel for different data situations. We begin by selecting covariates in the simulated data sets and then move on to the LaLonde data. Three simulated data sets, datc, datfc and datf, are included in package CovSel and can be accessed with the data command, see below.

Continuous-valued covariates
Let us first install and load package CovSel in R.

Mixed valued covariates
The data in datfc contains four normally distributed covariates, four binary covariates as well as the potential outcomes, the treatment indicator T and the response Y . It is generated by: R> set.seed(9327529) R> n<-500 R> x1 <-rnorm(n, mean = 0, sd = 1) R> x2 <-rbinom(n, 1, prob = 0.5) As in Section 4.3, we use kernel smoothing in the selection algorithms as this data contains categorical covariates.

Real data: LaLonde
For the LaLonde data we set the arguments in cov.sel as follows: T is set equal to treat and Y to re78, columns 1 and 12 in lalonde, respectively. The rest of the variables in the data frame are given as covariates, X. Since both continuous (age, educ) and categorical covariates are included we set type equal to "np". We begin with running Algorithm 1, alg = 1, with the default threshold values for the bandwidths, and store the result in cs.

Summary
We have described the R package CovSel which implements the covariate selection algorithms developed in De Luna et al. (2011) and Persson et al. (2013) using model-free backward elimination. The use of the package has been illustrated using a classic real world dataset as well as simulated data. Controlling for confounding is an essential part in all evaluation studies of effects of non-randomized treatments. Observational studies with a rich reservoir of covariates are nowadays common, for instance with record linkage databases. Package CovSel has the potential to help empirical scientists performing such evaluation studies by identifying relevant covariate sets for estimating average causal effects.
Further insights on how to use package CovSel can be found in Persson et al. (2013), where a large set of simulations experiments are reported. There, different selected covariate sets are evaluated by comparing empirical bias, variance and MSE in the estimation of the average causal effect. In particular, this Monte Carlo study indicates that Algorithm 2 is to be preferred to Algorithm 1, the former yielding often lower MSE. Moreover, basing the estimation on the covariate set X 0 ∪ X 1 yields lower variance but not necessarily lower bias, and either X 0 ∪ X 1 or Z 0 ∪ Z 1 may yield lower MSE depending on the data generating mechanism. These finite sample properties are in line with the theoretical results derived earlier in De Luna et al. (2011).