LARF : Instrumental Variable Estimation of Causal Eﬀects through Local Average Response Functions

LARF is an R package that provides instrumental variable estimation of treatment ef-fects when both the endogenous treatment and its instrument (i.e., the treatment inducement) are binary. The method (Abadie 2003) involves two steps. First, pseudo-weights are constructed from the probability of receiving the treatment inducement. By default LARF estimates the probability by a probit regression. It also provides semiparametric power series estimation of the probability and allows users to employ other external methods to estimate the probability. Second, the pseudo-weights are used to estimate the local average response function conditional on treatment and covariates. LARF provides both least squares and maximum likelihood estimates of the conditional treatment eﬀects.


Introduction
We created the package LARF (An and Wang 2015) in R (R Core Team 2016) to implement the method proposed by Abadie (2003) that provides instrumental variable (IV) estimation of causal effects when both the endogenous treatment and its instrument are binary. The main strength of the method is that it can address treatment selection without the imposition of parametric identifying assumptions.
The method is based on the potential outcomes framework (Rubin 1974(Rubin , 1977. Suppose we are interested in estimating the average effect of a binary treatment D on outcome Y , namely, E[Y 1 − Y 0 ]. The term Y 1 represents the potential outcome if an individual receives the treatment while Y 0 the potential outcome if the individual does not receive the treatment. The average treatment effect (ATE) can also be viewed as a weighted average: πE[Y 1 − Y 0 |D = 1] + (1 − π)E[Y 1 − Y 0 |D = 0], where π and 1 − π are the proportions of the treated and the control units in the population, respectively. E[Y 1 − Y 0 |D = 1] represents the average treatment effect on the treated (ATT) and E[Y 1 − Y 0 |D = 0] the average treatment effect on the control (ATC). In a naive way, we may use the difference between the average outcomes of the treated group and the control group (i.e., E[Y 1 |D = 1] − E[Y 0 |D = 0]) to estimate the ATE. However, it is shown that this naive estimator is biased (Morgan and Winship 2014).
The bias can arise from two sources: outcome differences across the treatment groups in the absence of treatment and differential treatment effects for the treated and the control. The bias tends to be present if units which expect to benefit from the treatment select to take the treatment while units which expect not to benefit or benefit less reject the treatment. Similarly, we can show that this estimator is biased for estimating either ATT or ATC.
When there is a randomized treatment inducer Z, which is imperfectly correlated with the treatment intake D, Imbens and Angrist (1994) and Angrist, Imbens, and Rubin (1996) show that under mild assumptions, local average treatment effect (LATE) for compliers (i.e., individuals for whom D = 1 if Z = 1, and D = 0 if Z = 0) can be identified by instrumenting the treatment intake D by the treatment inducer Z. Abadie (2003) extended the previous methods by allowing the treatment inducer to be randomized conditionally on the covariates and by allowing the outcome to depend on the covariates besides the treatment intake. Abadie (2003) also provided semiparametric estimations of the probability of receiving the treatment inducement, which helps to identify the treatment effects in a more robust way.
Specifically, when the treatment inducer Z is as good as randomized after conditioning on covariates X, Abadie (2003) proposed a two-stage procedure to estimate treatment effects. In the first step, it estimates the probability of receiving the treatment inducement P(Z = 1|X) in order to provide a set of pseudo-weights. Second, the pseudo-weights are used to estimate the local average response function (LARF) of the outcome conditional on the treatment and covariates. The estimated coefficient for the treatment intake D reflects the conditional treatment effect.
The main function larf in the LARF package implements the method proposed in Abadie (2003). The package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=LARF. This paper proceeds as follows. In Section 2, we sketch the statistical theories behind LARF.
In Section 3, we demonstrate the usage of LARF with both simulated and empirical examples.
In Section 4, we compare our package with similar programs.

Identification of treatment effect for compliers
Suppose there is a binary treatment inducer Z, which is as good as randomized after conditioning on the covariates. Using D z to indicate the potential treatment intake when Z = z, we can divide the units into four groups. The first group contains compliers, i.e., those who will comply with their treatment inducement. If they are induced to take the treatment (i.e., z = 1), they will take the treatment (i.e., D 1 = 1). If they are induced into the control Treatment intake D Treatment inducement Z 0 1 0 (1) Compliers, Never Takers (2) Always Takers, Defiers 1 (3) Never Takers, Defiers (4) Compliers, Always Takers   Table 1: Dividing units into four groups by treatment inducement and intake.
condition (i.e., z = 0), they will not take the treatment (i.e., D 0 = 0). In other words, for this group, D 1 > D 0 . The second group is always-takers who will take the treatment regardless of the treatment inducement, i.e., D 1 = D 0 = 1. The third group is never-takers who will not take the treatment whatsoever, i.e., D 1 = D 0 = 0. The last group is defiers who will do the opposite to their treatment inducement. If they are induced to take the treatment, they will opt out. But if they are induced not to take the treatment, they will opt in. Thus for this group, D 1 < D 0 . Table 1 shows the breakdown of the four groups.
Based on Angrist et al. (1996), Abadie (2003) listed a set of assumptions for the proposed IV method to identify treatment effects. With slight adaption we present them below.
i) Conditional ignorability: (Y dz , D z )⊥Z|X, namely, conditioning on X, the potential outcomes Y and the potential treatment intake D are independent of Z. In other words, Z is as good as randomly assigned conditional on covariates X.
ii) Exclusion of the instrument: . This means that Z has no direct effects on Y once actual treatment intake and covariates are controlled for. The first two assumptions ensure the exogeneity of Z.
iii) Relevance of the instrument: COV(D, Z) = 0, namely, Z is predictive of D. The first three assumptions ensure that Z can serve as a valid instrument for D.
iv) Common support: 0 < P(Z = 1|X) < 1, namely, not all units with certain covariates are induced to take or not to take the treatment. Given the monotonicity assumption, we can identify the proportion of compliers, always-takers and never-takers, respectively.
π always-takers : π never-takers : Let g be a measurable real function of (Y, D, X) such that E|g(Y, D, X)| < ∞. We can show that the expectation of g is a weighted sum of the expectations of g in the three groups.
Rearranging the terms in (4) and plugging in the group proportions, Abadie (2003) shows Integrating out X (see page 258 of Abadie 2003), we get The results shows that, the joint distribution of (Y, D, X) is identifiable for compliers if we use the κ as a set of pseudo weights. Since for compliers there is no treatment selection problem, we can also identify the treatment effect of D on Y .

Estimation of the pseudo weights
In order to estimate the pseudo weights κ, it is necessary to estimate P(Z = 1|X), i.e., the probability of receiving the treatment inducement. Abadie (2003) proposed two methods to estimate P(Z = 1|X). One is based on a probit model, namely, P(Z = 1|X) = Φ(X γ).
The predicted probability along with D are then used to calculate κ. The probability P(Z = 1|X) can also be estimated by a semiparametric power series method (Abadie 2003, p. 243;Mercatanti and Li 2014). In brief, the semiparametric power series method uses the power series of the covariates to predict the probability of receiving the treatment inducement. The optimal order of the power series can be determined by cross-validations. Relatively speaking, the power series method is more robust to potential model specification errors. In the package, we implement both the probit and the power series methods. See Section 3 for examples.

Estimation of the local average response function
Define E[Y |X, D, D 1 > D 0 ] as the local average response function for the compliers (LARF). By construction, the results for g shown above are also applicable to the LARF. Abadie (2003) proposed two methods to estimate the LARF: One is based on the weighted least squares method (LS) and the other based on maximum likelihood (ML) estimation. (D, X; θ). Given the results in Section 2.1 and the availability of κ, the parameter vector θ 0 in the LS estimator is Let A = (D, X) and θ = (α; β). In LARF, for continuous outcomes h = A θ = (αD + X β).
LARF also provides ML estimates. For continuous outcomes, Y − (αD + X β) is assumed to follow a Normal distribution with a variance σ 2 . Then For binary outcomes, Y is approximated by a probit function. Then we have The variances of the coefficients in LARF are estimated according to Theorem 4.2 in Abadie (2003) if the probability of receiving treatment inducement is estimated by a probit model. The variances are estimated according to Theorem 4.5 in Abadie (2003) if the probability of receiving treatment inducement is estimated by the power series method.
• formula: specification of the outcome model in the form like either y~x1 + x2 or y~X where X is a matrix containing all the covariates excluding the treatment. Also support multi-part formulas (Zeileis and Croissant 2010). For example, y + d~x1 + x2 | z, where d represents the treatment and z the instrument.
• treatment: a vector containing the binary treatment.
• instrument: a vector containing the binary instrument for the endogenous treatment.
• data: an optional data frame. If unspecified, the data will be taken from the working environment.
• method: the estimation method to be used. The default is "LS", standing for least squares. "ML", standing for maximum likelihood, is an alternative.
• AME: whether average marginal effects (AME) should be reported. The default is FALSE, in which case marginal effects at the means (MEM) are reported.
• zProb: a vector containing the probability of receiving the treatment inducement that has been estimated externally.
The larf function returns the estimated coefficients for the treatment and covariates and their standard errors. For binary outcomes, it also returns the marginal effects at the means (MEM) or the average marginal effects (AME) if requested. 2 larf is the high-level interface to the work-horse function larf.fit. A set of standard methods (including print, summary, vcov, and predict) can be used to extract the corresponding information from a 'larf' object. For more detail, please see ?larf.fit.

Simulations
We conduct simulations to show the properties of the LARF method. The equations we used to generate the data are as follows.
The covariates include X 1 and X 2 and are generated according to The treatment inducer Z is generated according to The error terms are correlated across equations and are generated according to We then generate D * i = −0.1+0.7Z i +0.3X 1i −0.5X 2i +v i . We view D * i as the latent probability of receiving the treatment. Corresponding to a probit model, the observed treatment intake D i is then set to one if D * i is larger than zero and to zero otherwise. 3 We generate the latent outcome Y * i = −0.2 + 0.8D + 0.4X 1i − 0.6X 2i + e i . Similarly, we set the observed outcome Y i to one if Y * i is larger than zero and to zero otherwise. The correlation in the error terms leads the treatment D to be correlated with e. Thus a simple regression of the outcome on D and the covariates will lead to biased estimates of the treatment effects. Since Z is randomly assigned and is uncorrelated with the outcome except via its correlation with the treatment, it can be used as an instrument for D to estimate the treatment effect.
2 Marginal effect at the means for a continuous variable is the partial derivative of the outcome with respect to the variable evaluated at the means of other variables, i.e., φ(Ā θ )θ. For a binary variable C, it is the change in the probability when the variable is changed from zero to one while holding other variables at their means, i.e., Φ(Ā θ |C = 1) − Φ(Ā θ |C = 0). Average marginal effect is the average of individual marginal effects, i.e., E[φ(A θ )]θ. Average marginal effect for a binary variable is the average change in the probability when the variable is changed from zero to one, i.e., E[Φ(A θ |C = 1) − Φ(A θ |C = 0)]. These marginal effects are widely used in social sciences. But it is worth noting that the term "marginal effects" has been used differently, for example, to refer to the treatment effects estimated by weighting regressions with the inverse probability of treatment (Bang and Robins 2005;Lunceford and Davidian 2004).
3 This is the latent variable approach to motivating the probit model, where P(D * i > 0) = P(Di = 1).  Table 2: Results of the simulations. Note: Rows 1 and 2 show the simulation results for continuous outcomes and binary outcomes, respectively. Bias is the 10% trimmed mean of (estimates − 0.8). Namely, 10% of very large or very small estimates are removed before calculating the mean deviation in order to reduce the influence of outliers. MSE is the meansquare-error. Columns (1) and (2) show the results when the outcomes are directly predicted from the treatment and covariates. Columns (3) and (4) show the results of the LARF estimates. Columns (5) and (6) show the results of the 2SLS estimates.
We conduct two sets of simulations. First, we treat the latent outcome Y * as a observed continuous outcome. Second, we treat Y as the observed binary outcome. For comparison, we also provide naive estimates based on ordinary least squares (OLS) for the continuous outcome and probit for the binary outcome and corresponding estimates based on two-stage least squares (2SLS). We set the sample size at 2,000 and conduct 500 rounds of simulations for each type of outcome. The results are shown in Table 2.
When the outcome is continuous, the naive estimation has both much larger bias and meansquared-error (MSE) than the LARF and the 2SLS estimations. Between LARF and 2SLS estimations, there are no discernible differences. When the outcome is binary, the naive estimation also has much larger bias and MSE than LARF and 2SLS. There is also little evidence that one is qualitatively better than the other between LARF and 2SLS. This happens maybe because we have a constant treatment effect. If there is effect heterogeneity, LARF may perform better than 2SLS. In addition, LARF allows covariates to have nonlinear effects on the outcome, which is an attractive feature in many contexts. Although the above results are based on a specific simulation, the patterns of the results should hold generally because the simulation has a general structure.

An empirical example
We show an empirical example using the data on eligibility for and participation in 401(k) from the Wooldridge data sets (Wooldridge 2010). 4 The basic research questions are to what extent participation in 401(k) -usually an employer-sponsored retirement saving plan, affects individuals' net family financial assets and participation in other retirement saving plans (e.g., Individual Retirement Accounts (IRA)). Because individuals may selectively participate in 401(k) due to their different unobserved preferences for savings, the naive estimate of the treatment effect by comparing the average outcomes between the participants and nonparticipants of 401(k) may be biased.
The data includes both a continuous outcome on net family financial assets (nettfa) and a binary outcome on participation in IRA (pira). The treatment is participation in 401(k) (p401k). Eligibility for 401(k), e401k, is used as an instrument for participation in 401(k).  Table 3: Summary statistics of the data. Note: The data is obtained from the Wooldridge data sets (Wooldridge 2010), available at http://www.stata.com/texts/eacsap/. The type of data is shown after each variable's name. The first four variables are binary. Thus their means can be interpreted as proportions. For example, the mean of "Married" is 0.63, which indicates 63% of the subjects are married. Its standard deviation equals 0.48, i.e., the square root of 0.37 × (1 − 0.37). Participants are those who have participated in 401(k) while nonparticipants are those who have not enrolled in 401(k). The last column shows the p values of the t-tests that test whether the means of the covariates between the participants and non-participants are equal.
Since an individual has to be eligible for 401(k) to participate in it, there are neither defiers nor always-takers in this case. Because 401(k) eligibility is determined by employers, it is likely independent of employees' unobserved preferences for savings. To what extent 401(k) eligibility can legitimately serve as an instrument for participation in 401(k), however, is debatable (Engen, Gale, and Scholz 1994;Poterba et al. 1995). But for illustrative purposes, following Abadie (2003), we assume that it can be used as an instrument for participation in 401(k). Table 3 presents the summary statistics of the data. 5 To demonstrate the usage of larf, we load the package LARF and the data c401k.
R> est1 <-larf(nettfa~inc + age + agesq + marr + fsize, treatment = p401k, + instrument = e401k, data = c401k) R> summary(est1) Call: larf(formula = nettfa~inc + age + agesq + marr + fsize, treatment = p401k, The results show that holding everything else constant, participation in 401(k) increases family net financial assets by about 9.49 thousand dollars on average (SE = 2.16 $1,000, p < 0.001). By default the above calculation estimates the probability of being eligible for 401(k) using a probit model. To check the robustness of the results, we also estimate the probability using a semiparametric power series method (Abadie 2003;Mercatanti and Li 2014). We set the maximal order of the covariates power series to be 5. After running npse we find the optimal order of power series is 2 based on cross-validations.
R> ps <-npse(e401k~inc + age + agesq + marr + fsize, order = 5) R> ps$Lambda [1] 2 Then we use the optimal order of covariates power series to predict the probability of being eligible for 401(k) and use the predict probability to estimate the treatment effect in larf.
R> est3 <-larf(pira~inc + age + agesq + marr + fsize, p401k, e401k, + data = c401k) R> summary(est3) Call: larf(formula = pira~inc + age + agesq + marr + fsize, treatment = p401k, instrument = e401k, data = c401k, AME = FALSE) The first two columns show the estimated coefficients and standard errors. Columns (4) and (5) show the marginal effects at the means and their standard errors. The results suggest that participation in 401(k) increases the probability of participation in IRA by 0.025 (SE = 0.016). But the effect is not statistically significant at the 5% level. 6 We also estimate the average marginal effects. Again, there is no significant evidence that participation in 401(k) crowds out IRA participation.
results are shown in Table 4. For net family financial assets, both the 2SLS and the LARF estimates are much smaller than the OLS estimate, suggesting that the latter may have been biased upwardly by treatment selection. For IRA participation, the LARF estimate is the only one that is not statistically significant at the 5% level, i.e., the only one that is not in conflict with the theoretical prediction. These estimates are only for illustrative purposes, as models with different covariates will render maybe somewhat different results.

Comparison with similar programs
LARF fills an important gap in IV estimations. Programs like systemfit Hamann 2015, 2007) in R and ivregress (StataCorp. 2015) and ivregress2 (Wada 2012) in Stata have been widely used for IV estimations. They fit linear regressions and work well if both the outcome and the endogenous treatment are continuous. But if the dependent variable is binary, the predictions can be inappropriately outside of (0, 1). 7 ivprobit (StataCorp. 2015) in Stata works for binary outcomes, but still requires a continuous treatment. The bivariate probit model (e.g., biprobit (StataCorp. 2015) in Stata and binom2.rho of VGAM (Yee 2015a,b) in R) can estimate causal effects with binary endogenous treatment, but it requires the outcome to be binary and also (sometimes unrealistically) assumes that the joint distribution for the outcome and the treatment is correctly known. In short, when both the endogenous treatment and its instrument are binary, LARF can provide more appropriate estimations of treatment effects for both continuous and binary outcomes.