Economic Analysis Working Paper Series

Panel Data Toolbox is a new package for MATLAB that includes functions to estimate the main econometric methods of balanced and unbalanced panel data analysis. The package includes code for the standard fixed, between and random effects estimation methods, as well as for the existing instrumental panels and a wide array of spatial panels. A full set of relevant tests is also included. This paper describes the methodology and implementation of the functions and illustrates their use with well-known examples. We perform numerical checks against other popular commercial and free software to show the validity of the results.


Introduction
Panel data econometrics has grown in importance over the past decades owing to the increasing availability of data on units that are observed over long periods of time. Panel data econometric methods are available in Stata and R, but MATLAB, by The MathWorks, Inc. (2015), lacks a full set of functions.
The Panel Data Toolbox introduces such a set of functions, including estimation methods for the standard fixed, between and random effects models, both balanced and unbalanced, as well as instrumental panel data models, including the error components model by Baltagi (1981), and, finally, the recently introduced spatial panels of Kapoor, Kelejian, and Prucha (2007) and Baltagi and Liu (2011). Numerical checks against Stata and R using well-known classical examples show that the estimated coefficients and t statistics reported by those programs agree with those obtained with the new MATLAB toolbox.
A full set of corresponding tests is included for poolability of the data, individual effects, fixed and random effects, serial correlation, and cross-sectional dependence. An overidentification test is also available for instrumental panels, as well as tests for spatial autocorrelation.
Spatial econometric models can be estimated in MATLAB using the LeSage and Pace (2009) Econometrics Toolbox, which uses maximum likelihood and Bayesian methods, and the routines of Elhorst (2014a), which use maximum likelihood methods. In this new Panel Data Toolbox we use a generalized spatial two stage least squares (GS2SLS) estimator for spatial panels following Kapoor et al. (2007) and Baltagi and Liu (2011).

Model estimation
The starting formulation is the panel data model with specific individual effects:

$$y_{it} = \alpha + X_{it}\beta + \mu_i + v_{it}, \qquad i = 1, \ldots, n, \quad t = 1, \ldots, T_i, \tag{1}$$

where $\mu_i$ represents the i-th time-invariant individual effect and $v_{it}$ the disturbance, with $v_{it} \sim \text{i.i.d.}(0, \sigma_v^2)$, $E(v_i) = 0$, $E(v_i v_i') = \sigma_v^2 I_T$ and $E(v_i v_j') = 0$ for $i \neq j$, $I_T$ being the $T \times T$ identity matrix.
Panel data models are estimated using the panel(id, time, y, X, method, options) function, where id and time are vectors of unit and time indexes, y is the vector of the dependent variable, X is the matrix of explanatory variables, and method is a string that specifies the panel data estimation method to be used among the following:
po: for a pooling estimation.
fe: for a fixed effects (within) estimation.
be: for a between estimation.
re: for a random effects GLS estimation.
These estimation methods are explained in the following sections. options is an optional list of parameter-value pairs that specify advanced estimation options.

Fixed effects
Under typical specifications, individual effects are correlated with the explanatory variables, $COV(X_{it}, \mu_i) \neq 0$, which motivates the use of the fixed effects (within) estimation, so as to capture unobserved heterogeneity, Baltagi (2008).
In this context, leaving the individual effects in the error component while performing OLS (ordinary least squares) results in a biased estimation. In order to remove these effects, the within estimator of the parameters is computed using OLS on the transformed variables:

$$\hat{\beta}_{fe} = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{y},$$

where $\tilde{y} = y - \bar{y}$ and $\tilde{X} = X - \bar{X}$ are the transformed variables in deviations from the group means, $\bar{y}$ and $\bar{X}$. It is called the "within" estimator because it takes into account the variation within each group. This estimator is unbiased, and consistent for $n \to \infty$. Statistical inference is generally based on the asymptotic variance-covariance matrix:

$$VAR(\hat{\beta}_{fe}) = S^2(\tilde{X}'\tilde{X})^{-1},$$

where $S^2$ denotes the residual variance, $S^2 = (e'e)/(N - n - k)$, with residuals $e = \tilde{y} - \tilde{X}\hat{\beta}_{fe}$. Finally, inference can be performed using the standard t and F tests.
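The within transformation described above can be sketched in a few lines. This is only an illustration for a single regressor, written in plain Python rather than MATLAB; the function name `within_estimator` and the toy data are hypothetical and are not part of the toolbox.

```python
from collections import defaultdict


def within_estimator(ids, y, x):
    """Fixed effects (within) slope and standard error, single regressor.

    Each unit's mean is subtracted from y and x, and OLS without an
    intercept is run on the demeaned data. Unbalanced panels are allowed.
    """
    groups = defaultdict(list)
    for pos, unit in enumerate(ids):
        groups[unit].append(pos)

    y_dm = [0.0] * len(y)
    x_dm = [0.0] * len(x)
    for rows in groups.values():
        y_bar = sum(y[r] for r in rows) / len(rows)
        x_bar = sum(x[r] for r in rows) / len(rows)
        for r in rows:
            y_dm[r] = y[r] - y_bar
            x_dm[r] = x[r] - x_bar

    sxx = sum(v * v for v in x_dm)
    sxy = sum(a * b for a, b in zip(x_dm, y_dm))
    beta = sxy / sxx

    # Residual variance with N - n - k degrees of freedom (k = 1 here).
    n_obs, n_units, k = len(y), len(groups), 1
    resid = [yd - beta * xd for yd, xd in zip(y_dm, x_dm)]
    s2 = sum(e * e for e in resid) / (n_obs - n_units - k)
    se = (s2 / sxx) ** 0.5
    return beta, se


# Toy data (made up): two units with very different intercepts.
ids = ["a", "a", "a", "b", "b", "b"]
x = [1.0, 2.0, 3.0, 2.0, 4.0, 6.0]
y = [12.1, 13.9, 16.0, -1.1, 3.2, 6.9]
beta_fe, se_fe = within_estimator(ids, y, x)
```

Because the unit means are removed before the regression, the large difference in levels between the two units does not affect the estimated slope.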

Between estimation
The between estimation is performed by applying OLS to the transformed variables:

$$\hat{\beta}_{be} = (\bar{X}'\bar{X})^{-1}\bar{X}'\bar{y},$$

where $\bar{y}$ and $\bar{X}$ are the group means of the variables. It is called the "between" estimator because it takes into account the variation between groups. Again, statistical inference is based on the asymptotic variance-covariance matrix:

$$VAR(\hat{\beta}_{be}) = S^2(\bar{X}'\bar{X})^{-1},$$

where $S^2$ denotes the residual variance, $S^2 = (e'e)/(n - k)$, with residuals $e = \bar{y} - \bar{X}\hat{\beta}_{be}$.
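A minimal sketch of the between estimation follows, again for a single regressor and in plain Python. It assumes that a constant is included in the regression on group means; the function name `between_estimator` and the data are hypothetical.

```python
def between_estimator(ids, y, x):
    """Between estimator: OLS of unit means of y on unit means of x."""
    sums = {}
    for unit, yi, xi in zip(ids, y, x):
        acc = sums.setdefault(unit, [0.0, 0.0, 0])
        acc[0] += yi
        acc[1] += xi
        acc[2] += 1

    # One observation per unit: the group means.
    y_bar = [acc[0] / acc[2] for acc in sums.values()]
    x_bar = [acc[1] / acc[2] for acc in sums.values()]

    n = len(y_bar)
    my = sum(y_bar) / n
    mx = sum(x_bar) / n
    sxx = sum((v - mx) ** 2 for v in x_bar)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x_bar, y_bar))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


# Unbalanced toy panel (made up): unit "c" is observed only once.
ids = ["a", "a", "b", "b", "c"]
x = [0.0, 2.0, 1.0, 3.0, 3.0]
y = [2.0, 4.0, 6.0, 4.0, 7.0]
alpha_be, beta_be = between_estimator(ids, y, x)
```

Only the three group means enter the regression, so the variation over time within each unit is discarded.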

Random effects model
In the panel data model (1) the loss of degrees of freedom can be avoided if the individual effects can be assumed random, where the error component $u_{it} = \mu_i + v_{it}$ includes the i-th time-invariant individual effect $\mu_i$ and the disturbance $v_{it}$.
The individual effect $\mu_i$ is assumed independent of the disturbance $v_{it}$. In addition, individual effects and disturbances are independent of the explanatory variables; i.e., $COV(X_{it}, \mu_i) = 0$ and $COV(X_{it}, v_{it}) = 0$ for all i and t. For this reason, the random effects model is an appropriate specification in the analysis of n individuals randomly drawn from a large population. In this context, n is usually large and a fixed effects model would lead to a loss of degrees of freedom.
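Following the formalization of Wallace and Hussain (1969), as stated in Baltagi (2008), the composed error component has the following properties:

$$E(\mu_i) = E(v_{it}) = 0, \qquad E(\mu_i v_{jt}) = 0, \qquad E(\mu_i \mu_j) = \begin{cases} \sigma_\mu^2 & i = j \\ 0 & i \neq j \end{cases}, \qquad E(v_{it} v_{js}) = \begin{cases} \sigma_v^2 & i = j,\ t = s \\ 0 & \text{otherwise} \end{cases}$$

This results in a block-diagonal covariance matrix with serial correlation over time, only between disturbances of the same individual and zero otherwise:

$$COV(u_{it}, u_{js}) = \begin{cases} \sigma_\mu^2 + \sigma_v^2 & i = j,\ t = s \\ \sigma_\mu^2 & i = j,\ t \neq s \\ 0 & i \neq j \end{cases}$$

This implies the following correlation coefficient between disturbances of the same individual in different periods:

$$\rho = \frac{\sigma_\mu^2}{\sigma_\mu^2 + \sigma_v^2}, \qquad i = j,\ t \neq s.$$

Therefore, the covariance matrix can be computed as follows:

$$\Omega = E(uu') = \sigma_\mu^2 (I_n \otimes J_T) + \sigma_v^2 (I_n \otimes I_T),$$

where $J_T$ is a matrix of ones of size T and the homoscedastic variance is $VAR(u_{it}) = \sigma_\mu^2 + \sigma_v^2$ for all i and t. In this case, the GLS (generalized least squares) method yields an efficient estimator of the parameters,

$$\hat{\beta}_{re} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y,$$

with $\Omega^{-1} = \frac{1}{\sigma_1^2}P + \frac{1}{\sigma_v^2}Q$ and $\sigma_1^2 = T\sigma_\mu^2 + \sigma_v^2$, where P and Q are the matrices that compute the group means and the differences with respect to the group means, respectively.
In order to obtain the GLS estimator of the regression coefficients, it is necessary to estimate the $\Omega^{-1}$ matrix of dimension $nT \times nT$. Fuller and Battese (1973, 1974) suggest premultiplying the model by $\sigma_v \Omega^{-1/2}$, which is equivalent to computing a quasi-time demeaning of the variables, $\tilde{y}_{it} = y_{it} - \theta_i \bar{y}_i$ and $\tilde{X}_{it} = X_{it} - \theta_i \bar{X}_i$, where

$$\theta_i = 1 - \frac{\sigma_v}{\sqrt{T_i \sigma_\mu^2 + \sigma_v^2}}.$$

Then, the random effects GLS estimation is computed as $\hat{\beta}_{re} = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{y}$.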
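The quasi-time demeaning can be illustrated with a short sketch. It assumes the usual Fuller and Battese weight, one minus the ratio of the disturbance standard deviation to the square root of T_i times the individual effect variance plus the disturbance variance, and takes the two variance components as given (in practice they must be estimated first). The function names are hypothetical.

```python
import math


def theta(t_i, sigma_v2, sigma_mu2):
    """Quasi-demeaning weight for a unit observed t_i times.

    sigma_v2 and sigma_mu2 are the (assumed known) variances of the
    disturbance and of the individual effect.
    """
    return 1.0 - math.sqrt(sigma_v2 / (t_i * sigma_mu2 + sigma_v2))


def quasi_demean(values, sigma_v2, sigma_mu2):
    """Subtract theta times the unit mean from one unit's observations."""
    th = theta(len(values), sigma_v2, sigma_mu2)
    mean = sum(values) / len(values)
    return [v - th * mean for v in values]


# With no individual effect variance the weight is zero (pooled OLS).
theta_pooled = theta(5, 1.0, 0.0)
# As the number of periods grows the weight approaches one (within).
theta_long = theta(10_000, 1.0, 2.0)
# Intermediate case: T_i = 4, sigma_v^2 = 1, sigma_mu^2 = 2.
theta_mid = theta(4, 1.0, 2.0)
y_tilde = quasi_demean([1.0, 2.0, 3.0, 6.0], 1.0, 2.0)
```

The two limiting cases show why random effects GLS lies between pooled OLS and the within estimator.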

Confidence intervals
Confidence intervals at the desired significance level can be computed with the estci function, and appropriately displayed with the estcidisp function. Both functions take as input an estimation output structure estout and the desired significance level, which defaults to 0.05 if not specified.
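As a rough illustration of what such an interval contains, the sketch below builds a two-sided interval from a coefficient and its standard error using normal critical values. This is an assumption made for simplicity: the source does not state which reference distribution estci uses, and the function name `normal_ci` is hypothetical.

```python
from statistics import NormalDist


def normal_ci(beta, se, alpha=0.05):
    """Two-sided (1 - alpha) interval using normal critical values."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return beta - z * se, beta + z * se


# Made-up coefficient and standard error.
lower, upper = normal_ci(1.0, 0.5)
lower_90, upper_90 = normal_ci(1.0, 0.5, alpha=0.10)
```

A larger significance level gives a narrower interval, as the second call shows.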

Robust standard errors
If we suspect that there is heteroskedasticity in the residuals, we can compute robust standard errors for the fixed and random effects models. Liang and Zeger (1986) and Arellano (1987) propose an extension of the White (1980) sandwich estimator for panel data models, whose asymptotic properties are studied by Hansen (2007) and Stock and Watson (2008). The correct standard errors should be computed as cluster-robust standard errors, using the observation groups as the different clusters:

$$VAR(\hat{\beta}) = (\tilde{X}'\tilde{X})^{-1} \left( \sum_{i=1}^{n} \tilde{X}_i' e_i e_i' \tilde{X}_i \right) (\tilde{X}'\tilde{X})^{-1},$$

where, in the fixed effects estimation, $\tilde{X}$ is the within transformation of the explanatory variables, e are the residuals from the within regression, and the degrees of freedom correction $n/(n - 1) \times N/(N - k)$ is usually applied. In a random effects estimation, $\tilde{X}$ is the quasi-time demeaning transformation of the explanatory variables, e are the residuals from the random effects regression, and the degrees of freedom correction is $n/(n - 1) \times (N - 1)/(N - k)$.
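The clustered sandwich formula can be sketched for a single, already transformed regressor. The degrees of freedom correction is left as a multiplier supplied by the caller; the function name `cluster_robust_variance` and the numbers are hypothetical.

```python
from collections import defaultdict


def cluster_robust_variance(ids, x_t, resid, correction=1.0):
    """Clustered sandwich variance of a single slope coefficient.

    x_t:   transformed regressor (within or quasi-demeaned values)
    resid: residuals from the corresponding regression
    """
    # Sum of x * e within each cluster (unit).
    scores = defaultdict(float)
    for unit, xi, ei in zip(ids, x_t, resid):
        scores[unit] += xi * ei

    meat = sum(s * s for s in scores.values())
    bread = sum(xi * xi for xi in x_t)
    return correction * meat / bread ** 2


# Made-up transformed regressor and residuals for two units.
ids = ["a", "a", "b", "b"]
x_t = [1.0, -1.0, 2.0, -2.0]
resid = [0.5, -0.5, -0.25, 0.25]
v_plain = cluster_robust_variance(ids, x_t, resid)

# Example correction n/(n-1) * N/(N-k) with n = 2, N = 4, k = 1.
v_corrected = cluster_robust_variance(ids, x_t, resid,
                                      correction=(2 / 1) * (4 / 3))
```

Summing the products of regressor and residual within each unit before squaring is what allows arbitrary correlation among the observations of the same unit.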

Instrumental panels
The basic panel data methods shown so far remain valid under the assumption of strict exogeneity of the independent variables, X; that is, when they are uncorrelated with the disturbance, $E(X_{it}, v_{it}) = 0$. However, there are many applications in which this assumption is untenable. In this case, when some of the regressors are endogenous, the fixed effects, between, and random effects estimators are no longer consistent or unbiased. Consequently, we can apply an instrumental variables (IV) two stage estimation to the fixed effects, between, and random effects models, Wooldridge (2010).
To apply this estimation method, we need a set of variables that are strictly exogenous, uncorrelated with the disturbance in all time periods, and relevant; i.e., correlated with the endogenous independent variables.These variables constitute the set of instrumental variables (IV).
For an application of instrumental panel data, we follow Baltagi and Levin (1992) and Baltagi, Griffin, and Xiong (2000), who estimate the demand for cigarettes using data from 46 U.S. states over the period 1963-1992. We estimate the consumption, log(c), measured as per capita sales, which depends on the price per pack, log(price), per capita disposable income, log(ndi), and the minimum price in neighbor states, log(pimin). We believe that log(price) is potentially endogenous, and use as instrumental variables the lag of the disposable income, log(ndi_1), and the lag of the minimum price, log(pimin_1).
Instrumental panel models are estimated using the ivpanel(id, time, y, X, Z, method, options) function, where Z is the matrix of instruments, excluding the exogenous variables in X, which are instruments of themselves and are automatically added by the function. A vector of indexes corresponding to the endogenous variables must be set in the endog option.
method is a string that specifies the choice of instrumental panel data estimation method, among the following:
po: for a pooling estimation.
fe: for a fixed effects (within) estimation.
be: for a between effects estimation.
re: for a random effects estimation.

Two Stage Least Squares
Instrumental panel data models are estimated by Two Stage Least Squares (2SLS). The first stage of the 2SLS estimation consists of predicting the independent variables, $\tilde{X}$, by an OLS regression of $\tilde{X}$ on $\tilde{H} = [\tilde{X}^*, \tilde{Z}]$, where $\tilde{X}^*$ are the exogenous variables in $\tilde{X}$, which are instruments of themselves, and $\tilde{Z}$ is the matrix of new instruments:

$$\hat{X} = \tilde{H}(\tilde{H}'\tilde{H})^{-1}\tilde{H}'\tilde{X}.$$

For simplification, the tilde over the variables denotes the corresponding within, between or quasi-time demeaning transformation.
The second stage consists of estimating the coefficients, $\beta$, using the predicted $\hat{X}$:

$$\hat{\beta}_{2SLS} = (\hat{X}'\hat{X})^{-1}\hat{X}'\tilde{y}.$$

Depending on whether $\tilde{X}$ and $\tilde{H}$ correspond to the within, between, or quasi-time demeaning transformation of the variables, we are computing the corresponding Fixed Effects 2SLS (FE2SLS), Between 2SLS (BE2SLS), or Random Effects 2SLS (RE2SLS).
Regarding statistical inference, the statistic of individual significance is normally distributed, while the statistic of joint significance is distributed as a $\chi^2$ with the corresponding degrees of freedom.
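The two stages can be sketched for the simplest case of one endogenous regressor and one instrument, both already transformed (within, between or quasi-demeaned) so that no intercept is needed. The function name `two_stage_least_squares` and the data are hypothetical.

```python
def two_stage_least_squares(y_t, x_t, z_t):
    """2SLS slope with one endogenous regressor and one instrument.

    All inputs are assumed to be already transformed, so both stages
    are regressions through the origin.
    """
    # First stage: regress x on z and keep the fitted values.
    pi_hat = sum(z * x for z, x in zip(z_t, x_t)) / sum(z * z for z in z_t)
    x_hat = [pi_hat * z for z in z_t]

    # Second stage: regress y on the fitted values.
    num = sum(xh * y for xh, y in zip(x_hat, y_t))
    den = sum(xh * xh for xh in x_hat)
    return num / den


# Made-up data in which y is exactly 2 * x.
z_t = [1.0, -1.0, 2.0, -2.0]
x_t = [1.5, -0.5, 2.0, -3.0]
y_t = [3.0, -1.0, 4.0, -6.0]
beta_2sls = two_stage_least_squares(y_t, x_t, z_t)

# In this just-identified case 2SLS equals the simple IV ratio z'y / z'x.
beta_iv = (sum(z * y for z, y in zip(z_t, y_t))
           / sum(z * x for z, x in zip(z_t, x_t)))
```

With more instruments than endogenous regressors the first stage becomes a multiple regression, but the two-step logic is unchanged.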

Spatial panels
In recent years the econometrics literature has grown with topics related to the analysis of spatial relations using panel data models. The main reason is the availability of more complete data sets in which units characterized by spatial features are observed over time. In general, a spatial panel data set contains more information and less multicollinearity among the variables than a cross-section spatial counterpart; see Anselin (1988, 2010), Elhorst (2014b) and Arbia (2014) for an introduction to this literature.
In the context of cross-sectional models, Kelejian and Prucha (1998) introduced a generalized spatial two-stage least squares estimator, Kelejian and Prucha (1999) proposed a generalized moments (GM) estimation method that is feasible for large n, while Anselin (1988) provided the ML (maximum likelihood) estimator. Drukker, Egger, and Prucha (2013) extended the model allowing for endogenous regressors. Most recently, Elhorst (2003, 2010) and Lee and Yu (2010) presented the ML estimators of the spatial lag model as well as the error model extended to include fixed and random effects, solving the computational problems when the number of cross-sectional units n is large. Kapoor et al. (2007), Mutl and Pfaffermayr (2011), and Piras (2013) generalized the GM procedure from cross-section to panel data and derived its properties.
In order to compute different estimators in spatial panel models, we consider the general spatial panel model:

$$y_{it} = \alpha + \lambda W y_{it} + X_{it}\beta + W X_{it}\beta_\lambda + \mu_i + \varepsilon_{it}, \qquad \varepsilon_{it} = \rho W \varepsilon_{it} + v_{it}.$$

A spatial panel data model can include a spatial lag of the dependent variable, $W y_{it}$, a spatial lag in the error structure, $W \varepsilon_{it}$, and a spatial lag in the explanatory variables, $W X_{it}$, whose coefficients are $\lambda$, $\rho$, and $\beta_\lambda$, respectively. Depending on the spatial lags included, the model receives a different name.
Procedures for estimating spatial panel data models in MATLAB are already available in LeSage and Pace (2009), using Bayesian methods, and in Elhorst (2014a), by maximum likelihood. In this toolbox, we implement the GM procedure for spatial panels, which allows the inclusion of additional endogenous covariates and is integrated with the rest of the toolbox, regarding both estimation and testing functions.
In the case where only the spatial lag of the dependent variable is included, this spatial lag is endogenous and the estimation of the spatial model is performed as an instrumental variables estimation using the instruments suggested by Kelejian and Prucha (1998). If the model contains a spatial lag of the error structure, the estimation method is a GM estimation, and we refer the reader to Kapoor et al. (2007), Mutl and Pfaffermayr (2011), and Piras (2013) for a full explanation of the estimation methods and the corresponding moment conditions.
The application is based on Munnell (1990) and Baltagi (2008).
Spatial panel data models are estimated using the spanel(id, time, y, X, W, method, options) function, where W is the n × n spatial weight matrix. method can be one of the following:
fe: for a spatial fixed effects (within) estimation.
re: for a spatial random effects estimation.
ec: for the Baltagi and Liu (2011) spatial error components estimation of the model with a spatial lag of the dependent variable.
The different spatial lags can be included by setting the following options:
slagy: if set to 1 includes a spatial lag of the dependent variable.
slagerror: if set to 1 includes a spatial lag of the error structure.
slagX: a vector of indexes specifying the explanatory variables for which a spatial lag should be added.
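To make the notion of a spatial lag concrete, the sketch below computes W y for one cross-section of three units. Row normalization of W is an assumption made here for illustration (the source does not state whether spanel expects a normalized matrix), and the function names are hypothetical.

```python
def row_normalize(w):
    """Scale each row of a weight matrix so that it sums to one."""
    out = []
    for row in w:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def spatial_lag(w, values):
    """Spatial lag W * values: weighted average of neighbours' values."""
    return [sum(wij * vj for wij, vj in zip(row, values)) for row in w]


# Contiguity matrix: unit 0 borders units 1 and 2, which border only unit 0.
w_raw = [[0, 1, 1],
         [1, 0, 0],
         [1, 0, 0]]
w = row_normalize(w_raw)

# Made-up values of the dependent variable in one time period.
y_t = [1.0, 2.0, 4.0]
wy_t = spatial_lag(w, y_t)
```

In a panel the same n × n matrix is applied to each time period separately, so the lag of the stacked vector is obtained by repeating this computation for every t.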

Tests
In this section we describe the implementation of several canonical tests for the panel data regression models presented previously. Specification tests in panel data involve testing for poolability and individual effects, and the Hausman test to select between the fixed and random effects estimators. In addition, we provide a suite of serial correlation and cross-sectional dependence tests. Finally, as the usual diagnostic checks, we consider an overidentification test for the validity of instruments in instrumental panels and tests for spatial autocorrelation in spatial panels. Appropriate corrections for heteroskedasticity and unbalanced panels are applied to these tests when available.
All test functions require as input an estimation output structure, estout, from a panel estimation and return a testout structure, described in Section 2, that can be displayed in a suitable way using the testdisp function.
H0: All mu_i = 0
F(47,764) = 75.820406
p-value = 0.0000
bpretest implements the Baltagi and Li (1990) version of the Lagrange multiplier test of individual effects proposed by Breusch and Pagan (1980). This test checks for the existence of individual effects through their variance, which is equal to zero under the null hypothesis of no individual effects; the LM statistic is distributed as a $\chi^2_1$.
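The idea behind the Lagrange multiplier test can be sketched with the original Breusch and Pagan (1980) statistic for a balanced panel, computed from pooled OLS residuals. This is not the Baltagi and Li (1990) version that bpretest implements, which also covers unbalanced panels; the function name and the residuals are hypothetical.

```python
import math


def breusch_pagan_lm(resid_by_unit):
    """LM statistic for H0: the variance of the individual effects is zero.

    resid_by_unit: one list of pooled OLS residuals per unit, all of the
    same length T (balanced panel only).
    """
    n = len(resid_by_unit)
    t = len(resid_by_unit[0])
    if any(len(r) != t for r in resid_by_unit):
        raise ValueError("this sketch only covers balanced panels")

    sum_sq_of_sums = sum(sum(r) ** 2 for r in resid_by_unit)
    sum_of_squares = sum(e * e for r in resid_by_unit for e in r)
    lm = (n * t / (2.0 * (t - 1))
          * (sum_sq_of_sums / sum_of_squares - 1.0) ** 2)

    # Upper tail probability of a chi-square with one degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value


# Made-up residuals: they share the same sign within each unit, which is
# the pattern that omitted individual effects would produce.
lm_stat, lm_p = breusch_pagan_lm([[1.0, 1.0], [-1.0, -1.0]])
```

When residuals of the same unit tend to share a sign, the sum of squared unit totals is large relative to the total sum of squares and the statistic grows.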

Testing fixed vs. random effects
In order to determine the correct specification of the model, fixed versus random effects, it is necessary to check the correlation between the individual effects and the regressors. When the individual effects and the explanatory variables are correlated, $COV(X_{it}, \mu_i) \neq 0$, the fixed effects model provides an unbiased estimator; otherwise, a feasible GLS in a random effects model is an efficient estimator.
hausmantest computes the Hausman test (Hausman 1978), which compares the GLS estimator of the random effects model, $\hat{\beta}_{re}$, and the within estimator in the fixed effects model, $\hat{\beta}_{fe}$, both of which are consistent under the null hypothesis. Under the alternative, only the within estimator of the fixed effects model is consistent. Therefore, the statistic is based on the difference between both estimators, $H_0: \hat{\beta}_{fe} - \hat{\beta}_{re} = 0$, and it is computed as:

$$H = (\hat{\beta}_{fe} - \hat{\beta}_{re})' \left[ VAR(\hat{\beta}_{fe} - \hat{\beta}_{re}) \right]^{-1} (\hat{\beta}_{fe} - \hat{\beta}_{re}),$$

where, under the assumption of homoskedasticity:

$$VAR(\hat{\beta}_{fe} - \hat{\beta}_{re}) = VAR(\hat{\beta}_{fe}) - VAR(\hat{\beta}_{re}).$$

For n fixed and T large, both estimators tend to similar values, with their difference converging to zero, and Hausman's test is unnecessary. However, in applications where n is relatively large with respect to T, it can be used to choose between estimators.
The input of the hausmantest function requires the output structures of the two estimations to be compared.
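For a single coefficient the Hausman statistic reduces to a ratio of scalars, which the sketch below computes together with its p-value from a chi-square distribution with one degree of freedom. The function name `hausman_scalar` and the numbers are hypothetical and do not come from the toolbox or from the examples in this paper.

```python
import math


def hausman_scalar(beta_fe, var_fe, beta_re, var_re):
    """Hausman statistic and p-value for one coefficient.

    Under homoskedasticity the variance of the difference is taken as
    var_fe - var_re, which must be positive for the statistic to exist.
    """
    var_diff = var_fe - var_re
    if var_diff <= 0:
        raise ValueError("variance difference is not positive")
    h = (beta_fe - beta_re) ** 2 / var_diff

    # Upper tail probability of a chi-square with one degree of freedom.
    p_value = math.erfc(math.sqrt(h / 2.0))
    return h, p_value


# Made-up estimates from a fixed effects and a random effects regression.
h_stat, h_p = hausman_scalar(beta_fe=1.0, var_fe=0.02,
                             beta_re=0.8, var_re=0.01)
```

A small p-value, as in this made-up case, would lead to rejecting the null hypothesis and favouring the fixed effects specification.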

Numerical checks
Numerical checks against other commercial and free software are performed by comparing the panel data estimation results from this Panel Data Toolbox in MATLAB with the results reported by Stata and R.
Results for the basic panel data models (fixed, between and random effects estimations) using the MATLAB panel function, the Stata xtreg function, and the plm function of the R package plm by Croissant and Millo (2008) are reported in Table 1. Results show that there are no differences in the estimated coefficients and t statistics between the three programs. Numerical checks for the instrumental variables panel data models of fixed effects, random effects, and Baltagi's error components, using the MATLAB ivpanel function, the Stata xtivreg function, and the plm function of the R package plm, are reported in Table 2. Again, results are equal regardless of the software, although there is a slight difference in the last decimal between Stata and the other two.
Spatial panel estimations using the MATLAB function spanel are checked against the R package splm by Millo and Piras (2012), using the spgm function, which performs a GM implementation. Since a large variety of models can be computed for spatial panels depending on the spatial lags we assume, we perform the numerical checks on a spatial SARAR model, which includes a spatial lag of the dependent variable and a spatial lag of the error structure, both with fixed and random effects. Although different interpretations of the literature and different choices of techniques when implementing spatial econometrics can lead to some differences in the results (Bivand and Piras 2015), the results in Table 3 reveal no differences in the estimated coefficients and t statistics between MATLAB and R.

Conclusions
The new Panel Data Toolbox covers a wide variety of balanced and unbalanced panel data models in an organized environment for MATLAB. Estimation methods include fixed, between and random effects, as well as instrumental and spatial panels, and the full set of relevant tests for testing poolability, individual effects, serial correlation, cross-sectional dependence, overidentification and spatial autocorrelation.
Numerical checks show the consistency of the results, as the estimated coefficients and t statistics are equal to those reported by Stata and R for panel, instrumental panels and spatial panel data methods.This positions the new toolbox as a valid self-contained package for panel data econometrics in MATLAB.
Since the code is freely available in an open source repository on GitHub, under the GNU General Public License version 3, users will benefit from the review, collaboration and contributions from the community, and can check the syntax to learn how the theoretical formulas of econometrics can be translated into code.

Table 1: Comparison of estimated coefficients and t statistics for panel data against Stata and R.

Table 2: Comparison of estimated coefficients and t statistics for instrumental panel data against Stata and R.

Table 3: Comparison of estimated coefficients and t statistics for spatial panel data against R.