Stochastic Frontier Analysis Using SFAMB for Ox

SFAMB is a flexible econometric tool for the estimation of stochastic frontier models. Ox is a matrix programming language used in different modules, with a console version freely available to academic users. This article provides a brief introduction to the field of stochastic frontier analysis, with examples of code (input and output) as well as technical documentation of member functions. SFAMB provides frontier models for both cross-sectional data and panel data (focusing on fixed effects models). Member functions can be extended depending on the needs of the user.


Introduction
SFAMB (stochastic frontier analysis using 'Modelbase') is a package for estimating stochastic frontier production (as well as cost, distance, and profit) functions. It provides specifications for both cross-sectional data and panel data. SFAMB is written in Ox (Doornik 2009) and is operated by writing small programs (scripts).
The console versions of Ox are free for research and educational purposes. Ox Console uses OxEdit to run programs, while the commercial version of the programming language, Ox Professional, uses the graphical user environment OxMetrics instead.
The structure of the paper is as follows. In the next section, we briefly introduce the econometric foundations and related literature. The first part focuses on the theory of models for cross-sectional and panel data. A subsection presents the specifications that are available in SFAMB and mentions other related software. Section 3 explains the usage of the package, which includes data structure, model formulation, and model output. Furthermore, it provides a detailed list of class member functions. For illustration, we present practical examples using real-world data (Section 4), which are distributed with the package. We mention some possible extensions of SFAMB in Section 5. Finally, a technical appendix provides a brief overview of some underlying workings (Appendix A).

Cross-sectional data
Basic approach - POOLED model

This section provides a brief introduction to stochastic frontier (SF) techniques. A more detailed introduction can be found in Coelli, Rao, O'Donnell, and Battese (2005) or Bogetoft and Otto (2010). More advanced material is covered in Kumbhakar and Lovell (2000). The basic problem in efficiency analysis lies in the estimation of an unobservable frontier (production, distance, or cost) function from observable input and output data, together with price data when necessary. Standard estimation techniques like OLS are inappropriate in this setting as they aim to describe average relationships, which are not the focus of an efficiency model.
The basic approach was developed simultaneously by Aigner, Lovell, and Schmidt (1977) (ALS) and Meeusen and Van den Broeck (1977). The following example of a production frontier highlights its most important characteristics. The basic production frontier model is given by:

y_i = x_i'β + ε_i,  ε_i = v_i − u_i.  (1)

On the left-hand side, y_i is the output (or some transformation of the output) of observation i (i = 1, 2, . . ., N). On the right-hand side, x_i is a K × 1 vector of inputs that produces output y_i, and the vector β represents technology parameters to be estimated. The most commonly used transformation of the variables is the natural logarithm. The crucial part of this formulation is the composed error term ε_i = v_i − u_i, where v_i represents statistical noise and u_i represents inefficiency. Both error components are assumed to be independent of each other. Estimation is possible by means of maximum likelihood estimation (MLE), where distributional assumptions on the error components are required. The noise component is a conventional two-sided error, distributed as v_i ~ iid N(0, σ_v²). The inefficiency component is a non-negative disturbance that can be modeled using several distributions. However, the truncated normal and half-normal distributions are most frequently used and implemented in SFAMB (see Table 2). In case of the normal-truncated normal SF model, the random variable u_i is distributed as u_i ~ iid N⁺(μ, σ_u²). If μ is set to zero, the model becomes the normal-half normal SF model.
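MLE maximizes the sum of log-density contributions of the composed error. As an illustration only (SFAMB itself is written in Ox, and the function name below is ours), the following Python sketch evaluates the ALS log-likelihood contribution for the normal-half normal case, with σ² = σ_u² + σ_v² and λ = σ_u/σ_v:

```python
import math

def loglik_half_normal(e, sigma_u, sigma_v):
    """Log-likelihood contribution of one composed residual e = v - u
    in the normal-half normal SF model (ALS 1977):
    ln f(e) = ln(2/sigma) + ln phi(e/sigma) + ln Phi(-lambda * e/sigma)."""
    sigma = math.sqrt(sigma_u**2 + sigma_v**2)
    lam = sigma_u / sigma_v
    z = e / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # phi(z)
    cdf = 0.5 * (1.0 + math.erf(-lam * z / math.sqrt(2.0)))   # Phi(-lam*z)
    return math.log(2.0 / sigma) + math.log(pdf) + math.log(cdf)
```

As λ → 0 the contribution approaches the normal log-density, while for λ > 0 negative residuals (more inefficiency) receive more probability mass than positive ones.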
Extensions of the basic SF approach allow us to model the location and scale of the inefficiency distribution in a more flexible way. The corresponding covariates are often labeled as z-variables. Alvarez, Amsler, Orea, and Schmidt (2006) offer a comprehensive discussion of this topic and the so-called "scaling property".
Another useful overview is given by Lai and Huang (2010), who summarize and categorize several well-known models. The so-called KGMHLBC model parameterizes μ and originally assumes the inefficiency distribution u_i ~ N⁺(μ_i = z_i'δ, σ_u²). If the scale is modeled using an exponential form instead, it becomes the RSCFG model, where σ_u,i = exp(z_i'δ). The combination of both models leads to the form u_i ~ N⁺(μ_i, σ_u,i²) that, according to Lai and Huang (2010), could be labeled as a generalized linear mean (GLM) model. Jondrow, Lovell, Materov, and Schmidt (1982) present a point estimator of inefficiency, given by E(u_i | ε_i). Battese and Coelli (1988) show that if the dependent variable is in logarithms, a more appropriate estimator is the point estimator of technical efficiency, given by TE_i = E(exp(−u_i) | ε_i).
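For the normal-half normal case, both point estimators follow from the conditional distribution of u_i given ε_i, a normal truncated at zero with mean μ*_i = −ε_i σ_u²/σ² and standard deviation σ* = σ_u σ_v/σ. A minimal Python sketch (illustrative only; the function names are ours, and the Battese-Coelli formula used here is the standard textbook form):

```python
import math

def phi(x):   # standard normal pdf
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):   # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def jlms(e, sigma_u, sigma_v):
    """Jondrow et al. (1982) point estimator E(u | eps)."""
    s2 = sigma_u**2 + sigma_v**2
    mu_star = -e * sigma_u**2 / s2
    sig_star = sigma_u * sigma_v / math.sqrt(s2)
    r = mu_star / sig_star
    return mu_star + sig_star * phi(r) / Phi(r)

def te_bc(e, sigma_u, sigma_v):
    """Battese-Coelli (1988) efficiency estimator E(exp(-u) | eps)."""
    s2 = sigma_u**2 + sigma_v**2
    mu_star = -e * sigma_u**2 / s2
    sig_star = sigma_u * sigma_v / math.sqrt(s2)
    r = mu_star / sig_star
    return math.exp(-mu_star + 0.5 * sig_star**2) * Phi(r - sig_star) / Phi(r)
```

A more negative residual signals more inefficiency, so jlms decreases and te_bc increases in e.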

Unobserved heterogeneity -LSDV model
Panel data provide additional information because each individual is observed over a certain time period, where periods are indexed with t (t = 1, 2, . . ., T). The respective production function model, estimated by OLS, can be written as:

y_it = α_i + x_it'β + v_it.  (2)

This formulation includes the time dimension, a conventional two-sided error v_it ~ iid N(0, σ_v²), and an individual intercept α_i. The model's virtue originates from the identification of these N time-invariant parameters. These "fixed" effects absorb unmeasured time-invariant individual attributes, and hence, the model accounts for unobserved heterogeneity. One commonly used name of this approach is "least squares with dummy variables" (LSDV).
Instead of estimating the dummy variables, the conventional strategy is to apply within-transformation to the dependent and independent variable(s), e.g., for an independent variable:

x̃_it = x_it − x̄_i.

The observation on x_it is transformed by subtracting the respective individual mean x̄_i. The resulting variable x̃_it is a deviation from the mean. This procedure eliminates the individual effects because α̃_i = α_i − α_i = 0. The transformed variables are used for model estimation. After estimation, the individual effects are calculated as: α̂_i = ȳ_i − β̂'x̄_i. Schmidt and Sickles (1984) use the model in a frontier context. They interpret the individual with the highest intercept as 100% technically efficient and determine the inefficiency of the remaining individuals as û_i = max(α̂) − α̂_i. Accordingly, efficiency scores are time-invariant and are given by TE_i = E(exp(−u_i)).

Notes: The RSCFG model goes back to Reifschneider and Stevenson (1991) and Caudill, Ford, and Gropper (1995). Lai and Huang (2010) actually label models that include the KGMHLBC form as generalized exponential mean (GEM) models, because they refer to the exponential form of the model proposed by Alvarez et al. (2006); following their categorization, the model implemented in SFAMB is a GLM model. Furthermore, note that in SFAMB the respective scale parameter in the POOLED model is (the natural logarithm of) σ_u,i, and not σ²_u,i. While σ²_u is often used, the original formulation of CFG involved σ_u.
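The within estimator and the Schmidt-Sickles scores can be sketched in a few lines. This Python toy (single regressor, our own function name; SFAMB's LSDV routine is the actual implementation) runs the transformation, recovers α̂_i, and converts the intercept gaps into efficiency scores:

```python
import math

def schmidt_sickles_te(y, x, ids):
    """Within (LSDV) estimation with one regressor, then Schmidt-Sickles
    (1984) scores: u_i = max(alpha) - alpha_i, TE_i = exp(-u_i)."""
    groups = sorted(set(ids))
    n_i = {g: ids.count(g) for g in groups}
    ybar = {g: sum(v for v, gi in zip(y, ids) if gi == g) / n_i[g] for g in groups}
    xbar = {g: sum(v for v, gi in zip(x, ids) if gi == g) / n_i[g] for g in groups}
    yt = [v - ybar[g] for v, g in zip(y, ids)]   # within-transformed y
    xt = [v - xbar[g] for v, g in zip(x, ids)]   # within-transformed x
    beta = sum(a * b for a, b in zip(xt, yt)) / sum(a * a for a in xt)
    alpha = {g: ybar[g] - beta * xbar[g] for g in groups}  # individual effects
    amax = max(alpha.values())
    te = {g: math.exp(-(amax - alpha[g])) for g in groups}
    return beta, te
```

With noise-free data y_it = α_i + x_it (α_1 = 1, α_2 = 0.5), the sketch recovers β = 1, TE_1 = 1, and TE_2 = exp(−0.5).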
Whether (in)efficiency should be modeled as time-invariant or not depends on the objective and empirical application (see Greene 2008 for a comprehensive discussion). Nevertheless, in a longer panel, the assumption of time-varying inefficiency will usually be attractive. This fact has motivated extensions of the above model as well as the development of other approaches.
One famous example is the model of Battese and Coelli (1992) (BC92) that has been applied in many empirical studies. This model specifies u_it = exp(−η(t − T)) × u_i, and nests the case of persistent inefficiency (if η = 0). SFAMB does not support this model, but other packages do (see Table 3).

Unobserved heterogeneity in SFA: Dummy variables -TFE model
The approach of Schmidt and Sickles (1984) is a reinterpretation of the well-known panel data model. However, there is no differentiation between heterogeneity and inefficiency. A complete panel SF model takes both components into account:

y_it = α_i + x_it'β + v_it − u_it.  (3)

This model is proposed by Greene (2005, p. 277), who labels it as the "true fixed effects [TFE] stochastic frontier model". Estimation of this model requires the inclusion of all N dummy variables, i.e., the number of intercepts to be estimated corresponds to the number of individuals. With fixed T, the estimate of the error variance is inconsistent (incidental parameters problem). Furthermore, it is likely to be biased, as pointed out by Chen, Schmidt, and Wang (2014). This is a relevant issue since this estimate is required for the assessment of inefficiency.
Elimination of dummies - WT model

Wang and Ho (2010) propose an extension to overcome the incidental parameters problem.
Their model is based on deviations from means (within-transformation; WT):

ỹ_it = x̃_it'β + ṽ_it − ũ_it.  (4)

This represents either a normal-truncated normal or a normal-half normal SF model where the noise component is distributed as v_it ~ iid N(0, σ_v²). Let the vector of transformed v_it be denoted by ṽ_i = (ṽ_i1, . . ., ṽ_iT)'. This vector has a multivariate normal distribution, i.e., ṽ_i ~ MN(0, Π), where Π is a T × T covariance matrix. The specification of time-varying inefficiency (u_it) is more involved. Here, the ("basic") inefficiency component is assumed to be producer-specific, but time-invariant, i.e., u_i* ~ N⁺(μ, σ_u²), where μ is equal to zero in the case of a half-normal distribution. Inefficiency varies over time by means of a scaling function:

u_it = h_it × u_i*,  h_it = f(z_it'δ),

where z_it is a vector of time-varying, producer-specific covariates. The transformed inefficiency component results from the transformation of the scaling function: ũ_it = h̃_it × u_i*. Wang and Ho (2010) present the conditional expectation of u_it in their Equation 30; efficiency estimates are given by TE_it = E(exp(−u_it) | ε̃_it). The individual effects are calculated as: α̂_i = ȳ_i − β̂'x̄_i + û̄_i.
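The mechanics of the scaling function can be checked numerically. In this Python toy (our own names; an exponential scaling function with a single covariate is assumed), the basic draw u_i* is time-invariant, so within-transforming u_it only operates on h_it:

```python
import math

def wang_ho_u(z, delta, u_star):
    """u_it = h_it * u_i* with scaling function h_it = exp(delta * z_it).
    Returns the inefficiency path and its within-transformed counterpart,
    u_tilde_it = (h_it - h_bar_i) * u_i*."""
    h = [math.exp(delta * zit) for zit in z]
    u = [hit * u_star for hit in h]
    hbar = sum(h) / len(h)
    u_tilde = [(hit - hbar) * u_star for hit in h]   # = u_it - u_bar_i
    return u, u_tilde
```

Because u_i* factors out, the transformed inefficiency is exactly the deviation of u_it from its individual mean, and it sums to zero over t.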

Consistent estimation with time-varying inefficiency -CFE model
Consistent estimation of the fixed effects SF model given in Equation 3 is demonstrated by Chen et al. (2014). Their approach is also based on deviations from means (Equation 4), but the CFE (consistent fixed effects) model allows inefficiency to vary over individuals and time, without an auxiliary function.
The approach is characterized by two features, within-transformation and the use of the first T − 1 deviations, as well as by a more general distributional theory. Firstly, within-transformation removes the incidental parameters. Secondly, the model's likelihood function is derived from the first T − 1 deviations, i.e., from the vector ε̃_i* = (ε̃_i1, . . ., ε̃_i,T−1)'. This strategy achieves an implicit correction of the error variance. The approach is based on the closed skew normal (CSN) distribution. The composed error, ε = v − u, has a skewed distribution (to the left) due to the non-negativeness of u. Accordingly, the standard (half-normal) SF model has a skew normal distribution, with skewness parameter λ = σ_u/σ_v and density:

f(ε) = (2/σ) φ(ε/σ) Φ(−λ ε/σ),  σ² = σ_u² + σ_v².

While the skew normal distribution is a generalization of the normal distribution, it can be generalized itself by using the CSN distribution. The composed error has a CSN distribution, which is expressed by:

ε ~ CSN_1,1(0, σ², −λ/σ, 0, 1).

The density of a CSN_p,q-distribution includes a p-dimensional pdf and a q-dimensional cdf of a normal distribution. The five associated parameters describe location, scale, and skewness, as well as the mean vector and covariance matrix in the cdf. With panel data, the T-dimensional vector ε_i = (ε_i1, . . ., ε_iT)' is distributed as:

ε_i ~ CSN_T,T(0, σ²I, −(λ/σ)I, 0, I),

where I is the identity matrix. Chen et al. (2014, p. 67) make use of the fact that the CSN distribution is "closed under linear combination", and partition the vector ε_i into its mean ε̄_i and its first T − 1 deviations ε̃_i*. The model's likelihood function is derived from ε̃_i*. Therefore, it is free of incidental parameters, and the parameters to be estimated are β, λ, and σ², as in the basic SF model. ε̄_i and ε̃_i* are not independent, unless λ = 0. If λ = 0, the model becomes the fixed effects model with normal error (LSDV model).
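The skew normal density above is easy to verify numerically. This Python sketch (illustrative function name) evaluates it and confirms that it integrates to one and is skewed to the left:

```python
import math

def sf_density(e, sigma, lam):
    """Density of the composed error in the half-normal SF model:
    f(e) = (2/sigma) * phi(e/sigma) * Phi(-lam * e/sigma)."""
    z = e / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(-lam * z / math.sqrt(2.0)))
    return 2.0 / sigma * pdf * cdf

# crude check that the density integrates to one (midpoint rule on [-10, 10])
total = sum(sf_density(-10.0 + 0.01 * (k + 0.5), 1.5, 2.0) * 0.01
            for k in range(2000))
```

For λ = 0 the density collapses to the N(0, σ²) density, which is the λ = 0 (LSDV) limit mentioned above.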
In order to assess the inefficiency, the composed error term must be recovered: ε̂_it = y_it − α̂_i − x_it'β̂. There are two ways to calculate α̂_i. The one used here is labeled as the mean-adjusted estimate by Chen et al. (2014):

α̂_i = ȳ_i − β̂'x̄_i + √(2/π) σ̂_u.  (5)
There are several other software packages that incorporate (some of) these estimators. Tables 2 and 3 provide an overview of the different specifications that are available in SFAMB and other software. LIMDEP (Econometric Software Inc. 2014) and Stata (StataCorp LP 2015) are comprehensive commercial packages that implement frontier techniques in their standard distributions. In case of Stata, there are additional third-party add-ons such as those of Wang (2012) or Belotti, Daidone, Ilardi, and Atella (2013). Hughes (2008) has written two free packages, sfa hetmod and sfa mod, that can be used with gretl (Cottrell and Lucchetti 2014) and include variations of the standard model.
The recent package spfrontier (Pavlyuk 2016) is concerned with (the specific family of) spatial SF models.It is implemented in R (R Core Team 2017) and allows for various specifications.
The first program to implement frontier techniques was Frontier (Coelli 1996).Later, the original code was transferred to R in the package frontier by Coelli and Henningsen (2017).
This package provides the standard model (ALS), the extension of KGMHLBC as well as the model of BC92.Its functionality is augmented by some additional options (e.g., for calculating marginal effects).
Table 2: Available model specifications for cross-sectional data (columns: SFAMB, frontier, LIMDEP, sfa hetmod, sfa mod, Stata). *In SFAMB, the respective scale parameter is σ_u,i, and not σ²_u,i.
Table 3: Available model specifications for panel data.
Similarly, SFAMB offers specific member functions that can be extended by the user.To date, it is the only package including the CFE model.

Data organization
Ox supports different data file formats (.xls, .dta, . . . ) that can be directly read into an SFAMB object. Details can be found in the printed documentation (Doornik and Ooms 2007; Doornik 2009), or online at http://www.doornik.com/ox/.
The data have to be organized in columns, i.e., one column contains one variable. The first row indicates the name. Ox interprets missing values as .NaN (Not a Number). In case of panel data, the data have to be stacked by individual (i = 1, 2, . . ., N), and within individuals by time period (t = 1, 2, . . ., T). The data set must include two variables that indicate the individual and the time period (the panel may be unbalanced). An example is given in Table 4.
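If raw rows arrive unsorted, they can be restacked before loading. A Python sketch (hypothetical helper, not part of SFAMB) of the individual-major, period-minor ordering described above:

```python
def stack_panel(rows):
    """Sort rows of the form (id, period, ...) into stacked panel order:
    by individual first, then by time period within each individual."""
    return sorted(rows, key=lambda r: (r[0], r[1]))
```

For example, stack_panel([(2, 1), (1, 2), (1, 1)]) places both rows of individual 1 first, ordered by period, before individual 2.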

Model formulation
The sequence of model formulation is sketched out in Figure 1. To formulate the frontier function:

Use Select(Y_VAR, {"name", 0, 0}) to select the dependent variable.
Use Select(X_VAR, {"name", 0, 0}) to select the regressors (inputs).

To include variables that affect the one-sided inefficiency distribution (for the WT model choose only one of these groups):

Use Select(U_VAR, {"name", 0, 0}) to select variables that shift the individual location parameter of the distribution, μ_i.
Use Select(Z_VAR, {"name", 0, 0}) to select variables that parameterize the scale of the distribution, σ_u,i.
SetTranslog can be used to choose the functional form of the frontier function. In case of the translog specification, we recommend normalizing the variables by the respective sample means. Estimation of the model is executed via Estimate. For more details, see the documentation of member functions in Section 3.4.

Model output
Besides standard results, some model-specific numbers are printed after estimation.

Cross-sectional data
Specific numbers for model POOLED: gamma was defined by Battese and Corra (1977) and is given by

γ = σ_u² / (σ_u² + σ_v²).

VAR(u)/VAR(total) describes the variance decomposition of the composed error term. The share of the variance of u in the total variance of the composed error is given by Greene (2008, p. 118) as

VAR(u)/VAR(total) = [(π − 2)/π] σ_u² / ([(π − 2)/π] σ_u² + σ_v²).

Test of one-sided err provides a likelihood ratio test statistic for the presence of inefficiency, i.e., for the null hypothesis H_0: γ = 0. The critical value cannot be taken from a conventional χ²-table, see Kodde and Palm (1986).
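Both quantities are simple functions of the two variance parameters. A Python sketch for the half-normal case (illustrative helper names, not SFAMB code):

```python
import math

def gamma_bc(sigma_u, sigma_v):
    """Battese-Corra (1977) parameter: gamma = sigma_u^2 / (sigma_u^2 + sigma_v^2)."""
    return sigma_u**2 / (sigma_u**2 + sigma_v**2)

def var_share_u(sigma_u, sigma_v):
    """Share of VAR(u) in VAR(v - u) when u is half-normal,
    using VAR(u) = sigma_u^2 * (pi - 2) / pi."""
    var_u = sigma_u**2 * (math.pi - 2.0) / math.pi
    return var_u / (var_u + sigma_v**2)
```

Since (π − 2)/π < 1, the variance share is always smaller than γ for the same parameter values, which is why the two numbers in the output differ.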

Panel data
Specific numbers for model LSDV: sigma_e describes σ̂_v, the square root of the corrected error variance,

σ̂_v² = SSR / (NT − N − K).

This estimate is also used to compute the standard errors.
AIC2 uses a different formula for the criterion, AIC2 = ln(SSR/(NT)) + 2(K + N)/(NT), that does not require the likelihood function and considers the number of individuals in the penalty term.
Specific number for models TFE, WT, and CFE: lambda, given by λ = σ_u/σ_v.
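The AIC2 criterion is cheap to compute by hand. A Python sketch (illustrative function name):

```python
import math

def aic2(ssr, n, t, k):
    """AIC2 = ln(SSR / (N*T)) + 2 * (K + N) / (N*T): a likelihood-free
    criterion whose penalty term counts the N individual intercepts."""
    nt = n * t
    return math.log(ssr / nt) + 2.0 * (k + n) / nt
```

Because N enters the penalty, adding individuals (rather than periods) raises the penalty term for a given average fit.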

Class member functions
These functions (user interface) together with the data members build up the 'SFAMB' class. Some internal functions are not listed here; the interested user may consult the package's header file and source code file. Note that the class derives from the Ox 'Modelbase' class, and hence, all underlying functions may be used.

Elast Elast(const sXname);

Return value
Returns the calculated output elasticities and the respective t-values for the specified (single) input variable (for all sample observations).
Description -Use with SetTranslog()- Only if a translog functional form is used; NT × 2 matrix. The elasticity is calculated for each observation of the sample: the output elasticity of input k is given by ∂ln y_i/∂ln x_ki. The t-values are calculated using the delta method (Greene 2012), extracting the required values from the data and covariance matrix of the parameter estimates.
sXname is the name of the corresponding input variable (string).
GetResults GetResults(const ampar, const ameff, const avfct, const amv);
No return values
Description -Only POOLED model- This function can be used to store the results of the estimation procedure for further use. All four arguments should be addresses of variables.
mpar consists of a Npar × 3 matrix, where Npar is the number of parameters in the model.The first column contains the coefficient estimates, the second column the standard errors, and the third column the appropriate probabilities.
eff consists of a Nobs × 3 matrix, where Nobs is the number of total observations.The first column holds the point estimate for technical efficiency, the second and third columns contain the upper and lower bound of the (1 − α) confidence interval.
fct holds some likelihood function values (OLS and ML), as well as some information on the correct variance decomposition of the composed error term.

Ident Ident(const vID, const vPer);
No return values Description -Only panel data -Identifies the structure of the panel.
vID is an N T × 1 vector holding the identifier (integer) of the individual.
vPer is an N T × 1 vector holding the identifier (integer) of the time period.
If only (K) technology parameters are given, the grid search is also applied.

SetTranslog SetTranslog(const iTl);
No return values Description This function expects an integer to control the construction of additional regressors from the selected X-variables.
A value of zero indicates no further terms to be added, e.g., for a log-linear model, this corresponds to the Cobb-Douglas form.
A value of one indicates that all square and cross terms of all independent variables should be constructed, e.g., for a log-linear model, this corresponds to the full translog form.
An integer value of k > 1 indicates that the square and cross terms should be constructed for only the first k independent variables (useful when the regressor matrix contains dummy variables).
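The construction rule can be mimicked in a few lines of Python (illustrative helper, not SFAMB code; whether SFAMB scales square terms by 1/2 is not documented here, so the 0.5 factor below follows the usual translog notation and is an assumption):

```python
def translog_terms(x, k):
    """Build square and cross terms for one observation x, mimicking the
    SetTranslog(k) convention: k = 0 adds nothing, k = 1 uses all
    regressors, k > 1 restricts to the first k regressors."""
    if k == 0:
        return []
    m = len(x) if k == 1 else k
    terms = []
    for i in range(m):
        terms.append(0.5 * x[i] * x[i])      # square term (1/2) ln^2 x_i
        for j in range(i + 1, m):
            terms.append(x[i] * x[j])        # cross term ln x_i * ln x_j
    return terms
```

For three regressors, k = 1 yields 3 square and 3 cross terms, while k = 2 yields 2 squares and 1 cross term, so dummy variables placed after the first k columns stay untouched.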

TE TE();
Return value Returns point estimates of technical efficiency, NT × 1 vector. Description These predictions are given by the conditional expectation of exp(−u), see Section 2 for details.

TEint TEint(const dAlpha);

Return value
Returns point estimates of technical efficiency as well as lower and upper bounds.
Description -Only POOLED model- This function expects a double constant, indicating the error probability for the construction of confidence bounds (default 0.05); for details see Horrace and Schmidt (1996), for an application Brümmer (2001). It returns an NT × 3 matrix structured as (point estimate - lower bound - upper bound).

TestGraphicAnalysis TestGraphicAnalysis();
No return values Description Only useful in conjunction with the free Ox package GnuDraw (Bos 2014), which is an Ox interface to gnuplot (gnuplot Team 2015). This function draws two (or three) graphs: a histogram of the efficiency point estimates and a respective boxplot. It displays an additional graph in case of the POOLED model, depicting the interval estimates of technical efficiency at the specified significance level.

Example: hbest1.ox
The first example is a generalized linear mean (GLM) model, where u_i ~ N⁺(μ, σ_u,i = exp(δ_0 + δ'z_i)). The original data are in levels and are transformed using member function PrepData to accommodate the translog functional form. The data are a subset of FAO/USDA data prepared by Fuglie (2012), including Sub-Saharan African countries and South Africa.
General usage and details of the Ox language are explained in Doornik and Ooms (2007). The sample file hbest1.ox is presented below. At the beginning of each program some header files are linked in:

#include <oxstd.h>
#include <packages/gnudraw/gnudraw.h>
#import <packages/sfamb/sfamb>

The first file, the so-called standard header, ensures that all standard library functions can be used. The second line includes the header file of GnuDraw (Bos 2014), an Ox interface to gnuplot (gnuplot Team 2015). If it is not installed or you do not want to use this package, delete this line. However, graphics output will then be disabled in the free Ox Console version (in the commercial OxMetrics version, graphics would still be available). Alternatively, you can comment it out via //:

// #include <packages/gnudraw/gnudraw.h>

The third line imports the (compiled) source code of the package (you may also use #include <packages/sfamb/sfamb.ox>). Each Ox program is executed by the main() function that contains the main loop of Ox.
main()
{
    ...
}

The next steps outlined follow the structure of Figure 1. A new object of class 'Sfa' has to be declared.
The data are loaded with a call to the member function Load. The argument of SetMethod chooses the respective estimator (see Table 1). Here, the model for cross-sectional data is specified. The function SetConstant creates a constant (intercept).
Normalization of inputs (and output) is carried out with PrepData, a member function of this package (see Section 3.4). Both of the other functions are member functions of the 'Database' class (see Doornik and Ooms 2007).
fob.Select(U_VAR, { "Constant", 0, 0 });

Likewise, covariates used to model the scale of the distribution are selected in the group Z_VAR, i.e., these variables parameterize σ_u,i (in case of the WT model, it is σ²_u,i).

fob.SetTranslog(1); fob.Estimate();
A number of results can be obtained after estimation. In the SF context, the efficiency scores (TE_i) are of particular interest. Here, the point estimates are extracted, together with the lower and upper bounds of a 95% confidence band. The respective function is TEint. The function Ineff extracts the point estimates of inefficiency, E(u_i | ε_i). These results are labeled and appended to the object using Renew. The original database together with the transformed variables and results is saved to file via Save.
fob.Renew(fob.TEint(0.05), {"TE", "lower", "upper"});
fob.Renew(fob.Ineff(), {"jlms"});
fob.Save("out.xls");

There is a graphical functionality involving the package GnuDraw that allows for a visual assessment of the efficiency scores. The function TestGraphicAnalysis displays the graphics presented in Figure 2. The confidence band can be changed with SetConfidenceLevel, where an error probability of 0.05 is the default.

fob.SetConfidenceLevel(0.05);
fob.TestGraphicAnalysis();

The output of this program appears as follows (omitting information on the maximization procedure). Some general information: The transformed variables facilitate the interpretation of the estimated coefficients of the translog functional form. Thus, the first-order coefficients listed below can be interpreted as output elasticities at the sample mean. These estimates are positive and meet the requirement of monotonicity, except for the machinery input, whose (insignificant) estimate violates the regularity condition. The parameter associated with trend indicates the estimated average rate of technical change per year. Since ln(σ_u) is parameterized using covariates, there are several related estimates. The order of coefficients corresponds to the specification ln(σ_u) = δ_0 + Σ_{l=1}^{4} δ_l z_l, where l = 1 (labor), 2 (land), 3 (machinery), 4 (fertilizer); the z's are in logarithms. Higher use of z_l is associated with a lower level of inefficiency (or higher technical efficiency) if the estimated parameter has a negative sign. Here, the inefficiency distribution is supposed to have a non-zero mean, u_i ~ N⁺(μ = μ_0, σ²_u,i), i.e., the location parameter is a constant (μ_0) common to all individuals. Additional covariates could be introduced. The omission of U_VAR in the model specification leads to μ = 0, and hence, results in the normal-half normal model. Note that this estimate (here, the third Constant) is always the last Constant term in the list (if a truncated-normal is specified).

CFE is the estimator selected. Here, the function SetConstant does not create a constant because it is not required. However, this line can be kept for convenience. The function Ident identifies the panel structure of the data. The required information includes the variable names of the individuals ("ID") and the period ("time").
After estimation, the member function Elast can be used to calculate the output elasticity (ε_ji) of each input j for each observation i: ε_ji = β_j + Σ_l β_jl ln x_li.
The following example illustrates one possible way the function may be used. Here, results are plotted as histograms (see Figure 4). Note that indexing starts at 0 in Ox (Elast returns an NT × 2 matrix, but only the first column is considered here). The first three arguments of DrawDensity are the most important here: area (panel) index, variable, label. See the documentation of Ox or GnuDraw for a full description.
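The per-observation elasticity formula is easy to reproduce outside Ox. A Python sketch (our own function name; β_jl is taken to be the symmetric matrix of second-order translog coefficients):

```python
def output_elasticity(j, beta1, beta2, lnx):
    """d ln y / d ln x_j = beta_j + sum_l beta_jl * ln x_l for one observation.
    beta1: first-order coefficients; beta2: symmetric second-order matrix;
    lnx: (normalized) log inputs of the observation."""
    return beta1[j] + sum(b * v for b, v in zip(beta2[j], lnx))
```

At the sample mean of mean-normalized data, lnx is a vector of zeros and the elasticity collapses to beta1[j], consistent with interpreting the first-order coefficients as elasticities at the sample mean.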

Future developments
The basic version of SFAMB dates back to the mid-1990s, when the capability was restricted to cross-sectional data. As the package now allows for panel data, and the literature on SF methods is considerably broader and still growing, there is scope for potential extensions. Some related possibilities are mentioned here.
In the model framework of Chen et al. (2014) there are two ways to calculate the individual effects.As an alternative to Equation 5, the individual "between estimator of α i " can be used.It could be implemented as an optional function, involving a second maximization.Its availability would allow us to compare results and check the consequences for TE scores.
While the current focus of panel methods is on fixed effects estimation, a more comprehensive supplement might involve random effects models. The most recent SF approach using the CSN distribution is presented by Colombi, Kumbhakar, Martini, and Vittadini (2014). Its specification is similar to Equation 3, but the time-invariant part is further decomposed into two residuals (persistent inefficiency and time-invariant unobserved heterogeneity). Filippini and Greene (2016) introduce computational simplifications and label the model as the "Generalized True Random Effects SF model".
AiHat AiHat();
Return value Returns the calculated individual effects α̂_i, N × 1 vector.
Description -Only panel data- These values can be obtained after estimation, see Section 2 for the respective formulas.

Different functions to extract data:
Return value Different vectors or matrices.
Description These functions can be used with convenient ('Database') functions such as Save, Renew or savemat.
IDandPer(); is an NT × 2 matrix with the ID of the individual (e.g., 1, 1, 1, 2, . . ., N, N) in the first column and the individual panel length T_i in the second column. -Only panel data-
GetLLFi(); returns the individual log-likelihood values. It is an NT × 1 vector for both POOLED model and LSDV model, but an N × 1 vector for the other models.
GetResiduals(); returns the (composed) residual of the respective observation, NT × 1 vector.
GetTldata(); returns the corresponding vectors of Y, X, square, and cross terms of X. -Use with SetTranslog()-
GetMeans(); returns the means of Y- and X-variables, N × (K + 1) matrix. -Only panel data-
GetWithins(); returns the within-transformed Y- and X-variables, NT × (K + 1) matrix. -Only panel data-

DropGroupIf DropGroupIf(const mifr);
No return values
Description -Only panel data- Allows the exclusion of a whole individual from the sample if the condition in one (single) period is met. Call after function Ident. mifr is the condition that specifies the observations to be dropped; see the documentation of selectifr.

Figure 3: TE scores of the CFE model.

Table 1: Available estimators, with name of estimation method and sample files.
After the technology parameters, the estimates of σ_v and σ_u are listed in the form of their natural logarithms. The next line refers to the noise component.
The CFE model of Chen et al. (2014) is specified using, again, the data set USDAafrica.xls and a translog functional form. You can immediately switch to the LSDV or TFE model, respectively, by changing the argument of SetMethod. A large part of this example corresponds to the code of the previous subsection. However, as panel data are involved here, some things are different.
For this model, there is no calculation of confidence bounds involved. The efficiency scores can be extracted as point estimates using function TE.