Synth : An R Package for Synthetic Control Methods in Comparative Case Studies

The R package Synth implements synthetic control methods for comparative case studies designed to estimate the causal eﬀects of policy interventions and other events of interest (Abadie and Gardeazabal 2003; Abadie, Diamond, and Hainmueller 2010). These techniques are particularly well-suited to investigate events occurring at an aggregate level (i.e., countries, cities, regions, etc.) and aﬀecting a relatively small number of units. Beneﬁts and features of the Synth package are illustrated using data from Abadie and Gardeazabal (2003), which examined the economic impact of the terrorist conﬂict in the Basque Country.


Introduction
Much of social science is concerned with causal questions about the effects of historical events and policy interventions on aggregate units, such as cities, regions, and countries. A classic method of answering such questions is the comparative case study, in which investigators compare outcomes for the unit(s) affected by an event or intervention (the treated group) to outcomes for one or more unaffected units (the control group). The rationale behind this method is to use the control group's outcome to approximate the outcome that would have been observed for the treated group in the absence of treatment. Traditional comparative case study methods leave the choice of control units to the analyst, prompting questions about the arbitrariness of selection and the degree to which control units can credibly proxy for treated units' counterfactual outcomes. Synthetic control methods, introduced by Abadie and Gardeazabal (2003) and Abadie et al. (2010), address these shortcomings by proposing a data-driven control-group selection procedure, a framework for assessing the suitability of the chosen control group, and a means of producing quantitative inference. Abadie and Gardeazabal (2003) and Abadie et al. (2010) define a synthetic control unit as a weighted average of available control units that approximates the most relevant characteristics of the treated unit prior to the treatment. Synthetic control methods make explicit the relative contribution of each available control unit and the degree of similarity prior to treatment between a treated unit and its synthetic counterpart. An attractive feature of the synthetic control method is that it guards against extrapolation outside the convex hull of the data because weights from all control units can be chosen to be positive and sum to one. Abadie et al. (2010) motivate the synthetic control method with a model that generalizes the difference-in-differences (fixed-effects) model commonly applied in the empirical social science literature by allowing the effect of unobserved confounding characteristics to vary over time.
The aim of this paper is to present the Synth package which implements synthetic control methods in R (R Development Core Team 2011) and is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=Synth. 1 The central function in the package is synth(), which constructs the synthetic control unit by solving an optimization problem to identify a set of weights that are assigned to potential control units. Another important function is dataprep() which allows the user to easily organize the data in a format needed to run synth(). Other functions such as synth.tables(), path.plot(), and gaps.plot() produce tables and figures that summarize and illustrate the results. Our data example is from Abadie and Gardeazabal (2003), which introduced synthetic control methods to investigate the effects of the terrorist conflict in the Basque Country on the Basque economy using other Spanish regions as potential control units.
The rest of the paper is organized as follows. Section 2 briefly reviews the synthetic control method, restating the key elements of Abadie and Gardeazabal (2003) and Abadie et al. (2010). In section 3 we demonstrate the use of the main functions and methods of Synth with an example. Section 4 concludes by describing future extensions to the Synth package.

Synthetic Control Methods
Synthetic control methods involve the construction of synthetic control units as convex combinations of multiple control units. The weights that define the synthetic control unit are chosen such that the synthetic control unit best approximates the relevant characteristics of the treated unit during the pretreatment period. The post-intervention outcomes for the synthetic control unit are then used to estimate the outcomes that would have been observed for the treated unit in the absence of the intervention. Abadie et al. (2010) provide a formal discussion of the theoretical properties of the synthetic control method. In particular, they derive the synthetic control estimator using an econometric model that generalizes the usual difference-in-differences (fixed-effects) model commonly applied in the empirical literature. Here we focus on how the Synth package can be used to implement the synthetic control method, and thus provide only a very brief review of the material in Abadie and Gardeazabal (2003) and Abadie et al. (2010).
Suppose that we observe units j = 1, ..., J + 1 for time periods t = 1, . . . , T . Without loss of generality, we assume that only the first unit is exposed to the intervention so we have J remaining control units that can contribute to the synthetic control. 2 The set of control units is termed the donor pool. In the context of comparative case studies units are usually aggregate entities such as schools, regions, or countries, and the interventions or treatments are events such as economic shocks, the passages of laws, etc. The intervention occurs at time period T 0 + 1 so that 1, 2, ..., T 0 are the pre-intervention periods and T 0 + 1, T 0 + 2, ..., T are the post-intervention periods. 3 We define two potential outcomes: Y N it refers to the outcome that would be observed for unit i at time t if unit i is not exposed to the intervention, and Y I it refers to the outcome that would be observed if unit i is exposed to the intervention. Our goal is to estimate the effect of the intervention on the outcome for the treated unit in the post-intervention period. This effect is formally defined as the difference between the two potential outcomes for periods T 0 + 1, T 0 + 2, . . . , T . Notice that Y N it is unobserved for the treated unit in the post-intervention period. The goal of the synthetic control method is to construct a synthetic control group that yields a reasonable estimate for this missing potential outcome.
Ideally, we would like to construct a synthetic control that resembles the treated unit in all relevant pre-intervention characteristics. Formalizing this idea we define U i as a (r × 1) vector of observed covariates for each unit. These variables will commonly consist of a set of predictors of the outcome variable. 4 Moreover, we define a (T 0 × 1) vector K = (k 1 , . . . , k T 0 ) that denotes some linear combination of pre-intervention outcomes: Linear combinations of pre-intervention outcomes can be used to control for unobserved common factors whose effects vary over time. 5 The user can choose to include as many as M (linearly independent) combinations of pre-intervention outcomes (with M ≤ T 0 ) to control for such unobserved common factors. 6 To construct our synthetic control unit we define a (J×1) vector of weights W = (w 2 , . . . , w J+1 ) such that w j ≥ 0 for j = 2, . . . , J + 1 and w 2 + · · · + w J+1 = 1. Each W then represents one particular weighted average of control units and therefore one potential synthetic control unit. Abadie and Gardeazabal (2003) and Abadie et al. (2010) propose to chose the weights W * such that the resulting synthetic control unit best approximates the unit exposed to the intervention with respect to the outcome predictors U i and M linear combinations of pre-intervention outcomesȲ K 1 i , . . . , and J+1 j=2 w * j U j = U 1 hold (or hold approximately). Then yields an estimator of α 1t in periods T 0 +1, T 0 +2, . . . , T . A formal discussion of the properties 2 In cases where there are multiple units exposed to the intervention the user can first aggregate the data from the regions exposed to the intervention.
3 For notational convenience and without loss of generality we assume that the treated unit is uninterruptedly exposed to the intervention in the post-treatment period. 4 The set of covariates is usually restricted to variables that are measured before the intervention occurs, but the user could include post-intervention characteristics as long as they are unaffected by the intervention. 5 See Abadie et al. (2010) for details. 6 In the example of section III below, we use only one of such linear combinations of pre-intervention outcomes: the average of the outcome variable for ten pre-intervention periods.
of the synthetic control estimator is provided in Abadie et al. (2010).
In empirical applications it is often the case that there exists no sets of weights such that In these cases the weights are chosen so that the identity conditions hold approximately. Due to the transparency of the method, the user can easily check how similar a particular synthetic control unit is to the treated unit.
To implement the synthetic control estimator numerically, we need to define a distance between the synthetic controls unit and the treated unit. To do that, we combine the characteristics of the exposed unit in the (k × 1) matrix ) and the values of the same characteristics of the control units in the To create the most similar synthetic control unit, the synth() function chooses the vector W * to minimize a distance, X 1 − X 0 W , between X 1 and X 0 W , subject to the weight constraints. In particular, in the synth() function we solve for a W * that minimizes where V is defined as some (k × k) symmetric and positive semidefinite matrix. The V matrix is introduced to allow different weights to the variables in X 0 and X 1 depending on their predictive power on the outcome. An optimal choice of V assigns weights that minimize the mean square error of the synthetic control estimator, that is the expectation of The synth() function allows for flexibility in the choice of V . Sometimes the researcher has a good subjective assessment of the predictive power of the variables in X 1 and X 0 . In this case the user could supply his own weights via the custom.V option. These weighs populate the diagonal of the V matrix (with the non-diagonal elements equal to zero) and synth() simply minimizes equation (1) conditional on the user supplied V .
Both Abadie and Gardeazabal (2003) and Abadie et al. (2010) propose a data-driven procedure to choose V , which is implemented by default in synth() (if no custom.V is specified). In this procedure a V * is chosen among all positive definite and diagonal matrices such that the mean squared prediction error (MSPE) of the outcome variable is minimized over some set of pre-intervention periods. 7 In other words, let Z 1 be the (T P × 1) vector with the values of the outcome variable for the treated unit for some set of the pre-intervention periods and Z 0 be the (T P × J) analogous matrix for the control units, where where V is the set of all positive definite and diagonal matrices and the weights for the synthetic control are given by W * . synth() solves a nested optimization problem that minimizes equation (2), for W * (V ) given by equation (1). Abadie et al. (2010) describe how synthetic control methods facilitate inferential techniques akin to permutation tests that are well-suited to comparative case studies in which the number of units in the comparison group and the number of periods in the sample are relatively small. They propose inferential techniques for the synthetic control method that proceed by conducting so-called placebo studies. The basic principle is to iteratively apply the synthetic control method by randomly reassigning the intervention in time (i.e., pre-intervention dates) or across units (i.e., to control units where the intervention did not occur) to produce a set of placebo effects. Subsequently, we can compare the set of placebo effects to the effect that was estimated for the time and unit where the intervention actually occurred. This comparison is informative about the rarity of the magnitude of the treatment effect that was observed for the exposed unit. We can assess whether the effect estimated by the synthetic control method for the actual intervention is large relative to the effect estimated for a unit or date chosen at random. By construction, this exercise produces exact inference regardless of the number of available comparison units, time periods, and whether the data are individual or aggregate. However, as described in more detail in Abadie et al. (2010), the quality of some of the inferential exercises increases with the number of available comparison units. 9 The underlying idea of the placebo tests is thus akin to permutation inference (see, for example, Lehmann (1997)), where a test statistic is iteratively computed under random permutations of the assignment vector that determines whether a unit is in the treatment or the control group.
In section III we illustrate the placebo test proposed in Abadie et al. (2010) applying the synthetic control method to units that were not exposed to the treatment. Examples of placebo studies using the longitudinal dimension of the data are found in Appendix B of Abadie et al. (2010) and Bertrand, Duflo, and Mullainathan (2004).

Implementing Synth
We demonstrate the synthetic control method using data from Abadie and Gardeazabal (2003), which studied the economic effects of conflict, using the terrorist conflict in the Basque Country as a case study. Abadie and Gardeazabal (2003) used a combination of other Spanish regions to construct a synthetic Basque Country resembling many relevant economic characteristics of the Basque Country before the onset of political terrorism in the 1970s. The basque data contains information from 1955-1997 on 17 Spanish regions (excluding the small autonomous towns of Ceuta and Melilla on the coast of Africa), including per-capita GDP (the outcome variable), as well as population density, sectoral production, investment, and human capital (the predictor variables). Missing data are denoted by NA.

R> library("Synth") R> data("basque")
This dataset is organized in standard (long) panel-data format, with variables extending across the columns and the rows sorted first by region and then by time-period. 10 A name (character-string) and number is provided for each region. 11 At least one of these two types of unit-identifiers is required for Synth to implement the analysis below. In Abadie and Gardeazabal (2003) the 13 predictor variables, for each region, were: 1964-1969 averages for gross total investment/GDP (invest).
1964-1969 averages for the share of the working-age population that was illiterate (school.illit), the share with up to primary school education (school.prim), the with some high school (school.med), the share with high school (school.high), and the share with more than high school (school.post.high). 12 1961-1969 averages for six industrial-sector shares as a percentage of total production (these variables are named with a sec. prefix).
1969 population density measured in persons per square kilometer (popdens).

Using dataprep()
The first step is to reorganize the panel dataset into an appropriate format that is suitable for the main estimator function synth(). At a minimum, synth() requires as inputs the four data matrices X 1 , X 0 , Z 1 , and Z 0 that are needed to construct a synthetic control unit. In our example, these four data matrices are as follows: X 1 is the (13 × 1) vector of Basque region predictors and X 0 is the (13 × 16) matrix of values of the same variables for the 16 control regions. 13 Z 1 is a (10 × 1) vector and Z 0 is a (10 × 16) matrix which contain the values for the outcome variable for the Basque country and the control units for the 10 pre-intervention periods over which we want to minimize the MSPE.
10 The panel dataset does not have to be sorted in this standard form. If the time-periods are out of order and/or units are interspersed down the rows, dataprep() works correctly just the same. 11 The first unit in this dataset refers to the data aggregated for the whole country of Spain. 12 Notice that in the basque data these highest educational attainment variables are provided as the total number of people in each category (in thousands). They are transformed to percentage shares below.
13 Note that all but one of these predictors is an average value over some range of the pre-treatment period and the precise date-range varies across predictor variables.
While the user can choose to provide preprocessed data matrices and load them into synth(), our package provides a convenience function called dataprep() that the user can run first to properly organize the data. We strongly recommend using dataprep(), because it allows to conveniently extract and package all the necessary inputs for synth() in a single list object that can be passed to synth() without further arguments. The list returned by dataprep() is also used by other convenience functions such as synth.tables(), path.plot(), and gaps.plot() to produce tables and figures that summarize and illustrate the results. dataprep() also implements a number of checks that will alert the user to missing data and inconsistencies in the extracted objects.

Running synth()
The synth() command searches for the W * vector of weights that identifies the synthetic control for the Basque region by solving the nested optimization problem described in equations (1) and (2) above.
For any V synth() finds a W * (V ) by minimizing equation (1) using a constrained quadratic optimization function from R's kernlab package (Karatzoglou, Smola, Hornik, and Zeileis 2004). synth() solves for the diagonal matrix V * that minimizes equation (2) and thus the MSPE for the pre-intervention period. To solve the optimization given by equation (2) above, we run optim() (R's general-purpose optimization function). 16 As shown below, the synth() command knows how to extract its required arguments (Z1, Z0, X1, X0) from the data.prep list output. 17 No additional arguments are necessary, though one may pass arguments to optim(), ipop(), or genoud() if desired. For example, below we set optim() to use the BFGS quasi-Newton algorithm. After synth() finishes, the values of V * 's diagonal and W * are shown, as are the corresponding values of equation (2) and (1): R> synth.out <-synth(data.prep.obj = dataprep.out, method = "BFGS") X1, X0, Z1, Z0 all come directly from dataprep object.

LOSS (W): 0.2501998
The LOSS (V) output corresponds to the loss associated with equation (2). LOSS (W) is the loss associated with equation (1).

Obtaining Results: Tables, Figures, and Estimates
There are various ways to obtain and summarize results. synth() returns a list object that allows the user to easily access the output from the optimization. For example, the (J × 1) matrix of W * weights is stored in synth.out$solution.w. The results from synth() can easily be combined with the output from dataprep() to compute other quantities of interest. For example, the annual discrepancies in the GPD trend between the Basque region and its synthetic counterpart may be calculated by typing default synth() always runs the optimization twice using two sets of starting values (equal V weights and a specialized regression based method to pick V ) and returns the run that obtains lower loss. Second, the user may choose to rely on one of non-derivative based algorithms offered by optim() (e.g., SANN). Finally, synth() offers an additional argument called genoud(). If genoud() is set to TRUE then synth() will embark on a two-step optimization procedure. A first optimization is conducted using the genoud() optimizer from the rgenoud package that combines evolutionary algorithm methods with a derivative-based (quasi-Newton) method to solve difficult optimization problems (see Mebane, Jr. and Sekhon 2011 for details). Solutions from genoud() are then passed to optim() in the second step. This option will require more computing time, but could be used in difficult cases if one were concerned about local minima.
Thus the user has the flexibility to create customized summary tables and figures. To facilitate the presentation of the results and to assess the quality of the synthetic control unit, the Synth package also contains three convenience functions. Tables are produced by using the synth.tab() function: R> synth.tables <-synth.tab(dataprep.res = dataprep.out, + synth.res = synth.out) This function produces four different types of tables: R> names(synth.tables) [1] "tab.pred" "tab.v" "tab.w" "tab.loss" The first object (synth.tables$tab.pred) is a We see that Cataluna contributes 85 percent and Madrid contributes 15 percent to the synthetic Basque country; a zero weight is assigned to all the other control regions. The path.plot() function produces Figure 1, showing the trajectories of the treated unit and the synthetic control unit. To make a convincing case for a large treatment effect, we would like to see the two trajectories of the outcome variable for the treated unit and its synthetic  control unit to be quite similar prior to the intervention and to diverge sharply when the intervention occurs. This is what we see for the Basque country example where the terrorism claimed its first victim in 1968, but large scale terrorist activity only began in the mid-70s. 18 Notice that the path.plot() function allows the user to pass many arguments to the plot() command to customize items like axes labels and titles.
Here we see that the GDP trajectory for the Basque country is very similar to that of the synthetic Basque country for almost the entire pre-terrorism period. Once large scale terrorist activity arises in the mid 70s, however, the GDP trajectory in the Basque country grows at a much lower rate than in the synthetic Basque country suggesting a large negative effect of terrorism on Basque GDP. R> gaps.plot(synth.res = synth.out, dataprep.res = dataprep.out, + Ylab = "gap in real per-capita GDP (1986 USD, thousand)", Xlab = "year", + Ylim = c(-1.5, 1.5), Main = NA)

Placebo Tests
One benefit of synthetic control methods is that they lend themselves to placebo tests. These tests involve applying the synthetic control method after reassigning the intervention in the data to units and periods where the intervention did not occur. Abadie and Gardeazabal (2003) introduced the placebo test for the synthetic control method by demonstrating that when the synthetic control method is applied to Catalonia, a region similar to the Basque country in terms of the variables is X 1 and X 0 , and path.plot() is run, there is no identifiable treatment effect. As shown in Figure 3, the outcome trajectories for Catalonia and its synthetic version are very similar. To produce Figure 3, one must re-run dataprep(), this time setting the treatment.identifier to 10 and the controls.identifier to c(2:9,11:16,18), since Catalonia is region number 10 in the dataset. 19 This test is only one of several different types of placebo tests that users can run with this package. For example, one can run placebos-in-time, by using dataprep() to assign a treatment before the true treatment occurred and checking to ensure that the trajectories of the synthetic control and the treated unit follow the same path beyond that arbitrary point in time. One may also wish to run placebos with outcome variables that should be unaffected by the treatment.
As described in Section II above, one can perform exact inferential techniques akin to permutation tests by applying the synthetic control method to every control unit in the sample and collecting information on the gaps associated with each iteration. Then, as in Abadie et al. (2010), the user can plot these gaps and visually determine whether the line associated with the true synthetic control unit (e.g., the synthetic Basque region) conspicuously differentiates itself from the rest with small gaps prior to treatment and large gaps afterward. The approach is easily implemented by running a for loop to implement placebo tests across all control units in the sample and collecting information on the gaps. Figure 4 shows the results obtained when this inferential technique is applied to our data example. As recommended in Abadie et al. (2010) we exclude regions with a poor fit for the pre-treatment period (i.e., regions with a MSPE that is five time higher than for the Basque country). Placebo studies for these regions do not provide information to measure the relative rarity of the post-treatment gap obtained for the Basque country which was well-fitted prior to treatment. 20 The resulting figure demonstrates that when we reassign exposure to terrorism to other regions there is a very low probability of obtaining a gap as large as the one obtained for the Basque region.

Conclusion
We have described synthetic control methods and the way they may be implemented to estimate causal effects and perform exact inferential techniques using the Synth package for R. Several extensions to Synth are currently under active development. We are testing a regression-based method for populating the entire V matrix (not just the diagonal) and a version of Synth that selects the W weights that best fit multiple outcome variables simultaneously. We are also working on a version that accommodates multiple treatments phased-in over time.