npbr: A Package for Nonparametric Boundary Regression in R

The package npbr is the first free specialized software for data edge and frontier analysis in the statistical literature. It provides a variety of functions for the best known and most innovative approaches to nonparametric boundary estimation. The selected methods are concerned with empirical, smoothed, unrestricted as well as constrained fits under both single and multiple shape constraints. They also cover data envelopment techniques as well as robust approaches to outliers. The routines included in npbr are user friendly and afford a large degree of flexibility in the estimation specifications. They provide smoothing parameter selection for the modern local linear and polynomial spline methods as well as for some promising extreme value techniques. Also, they seamlessly allow for Monte Carlo comparisons among the implemented estimation procedures. This package will be very useful for statisticians and applied researchers interested in employing nonparametric boundary regression models. Its use is illustrated with a number of empirical applications and simulated examples.


Introduction
In the standard regression model where the data (x_i, y_i) are observed, a variety of programs specializing in nonparametric and semiparametric estimation have recently appeared. Prominent among these routines is the popular np package (Hayfield and Racine 2008), which allows R (R Core Team 2017) users to conduct, for instance, nonparametric mean and quantile regression. In the non-standard boundary regression model, in contrast to classical theory, the regression errors (ε_i) are not assumed to be centred, but to have a one-sided support (−∞, 0], and the regression function ϕ describes some boundary curve. The present npbr package (Daouia, Laurent, and Noh 2017) is a collection of functions that perform a variety of nonparametric estimation techniques of the frontier function ϕ in the statistical software environment R. Specifically, suppose that we have n pairs of observations (x_i, y_i), i = 1, ..., n, from a bivariate distribution with a density f(x, y) whose support is of the form {(x, y) : y ≤ ϕ(x)}, so that the graph of ϕ is the upper boundary of the support. The density is assumed to behave like a power β_x > −1 of the distance from this boundary as y ↑ ϕ(x); depending on the value of β_x, the density decays to zero (β_x > 0), jumps (β_x = 0) or rises to infinity (β_x < 0) at the frontier. Nonparametric boundary regression is clearly a problem involving extreme-value theory. Already in the case of production econometrics, Hendricks and Koenker (1992) stated: "In the econometric literature on the estimation of production technologies, there has been considerable interest in estimating so called frontier production models that correspond closely to models for extreme quantiles of a stochastic production surface". Chernozhukov (2005) and Daouia, Gardes, and Girard (2013) may be viewed as the first attempts to actually implement the idea of Hendricks and Koenker, in a linear regression model and in a general nonparametric model, respectively. However, their approaches aim to estimate an extreme regression quantile curve instead of the true full frontier ϕ. The use of high regression quantiles might therefore be viewed as an exploratory tool, rather than as a method for final frontier analysis. To this end, one may employ the R packages cobs (Ng and Maechler 2007), quantreg (Koenker 2017) and splines (R Core Team 2017), to name a few.
Extreme-value based methods, such as those of Daouia, Florens, and Simar (2010, 2012) discussed in Section 3.4, build robust frontier estimators by using the ideas from Dekkers, Einmahl, and de Haan (1989). Moreover, they provide different useful asymptotic confidence bands for the boundary function under the monotonicity constraint in the case of a general β_x. However, such techniques are not without their disadvantages: as is often the case in extreme-value theory, they require a large sample size to ensure acceptable results.
The overall objective of the present package is to provide a large variety of functions for the best known approaches to nonparametric boundary regression, including the vast class of methods employed in the Monte Carlo comparisons of both Daouia et al. (2016) and Noh (2014), as well as other promising nonparametric devices, namely the extreme-value techniques of Gijbels and Peng (2000), Daouia et al. (2010) and Daouia et al. (2012). The various functions in the npbr package are summarized in Table 1. We are not aware of any other existing set of statistical routines better adapted to data envelope fitting and robust frontier estimation. Only the classical nonsmooth FDH and DEA methods can be found in some available packages dedicated to the economic literature on the measurement of the production performance of enterprises, such as the R package Benchmarking (Bogetoft and Otto 2011). Other contributions to the econometric literature on frontier analysis by Parmeter and Racine (2013) can be found at http://socserv.mcmaster.ca/racinej/Gallery/Home.html. The package npbr is actually the first free specialized software for the statistical literature on nonparametric frontier analysis. The routines included in npbr are user friendly and highly flexible in terms of estimation specifications. They allow the user to filter out noise in edge data by making use of both empirical and smooth fits, as well as (un)constrained estimates under separate and simultaneous multiple shape constraints. They also provide smoothing parameter selection for the innovative methods based on local linear techniques, polynomial splines, extreme values and kernel smoothing, though the proposed selection procedures can be computationally demanding. To solve the different optimization problems involved, we mainly use the Rglpk package (Theussl and Hornik 2017, version >= 0.6-2) based on the C library GLPK (Makhorin 2017), version 4.61.
In addition, the package will be very useful for researchers and practitioners interested in employing nonparametric boundary regression methods. On the one hand, such methods are very appealing because they rely on very few assumptions and benefit from their modeling flexibility, their function approximation power and their ability to detect the boundary structure of data without recourse to any a priori parametric restrictions on the shape of the frontier and/or the distribution of noise. On the other hand, the package offers R users and statisticians in this active area of research simple functions to compute the empirical mean integrated squared error, the empirical integrated squared bias and the empirical integrated variance of various frontier estimators. This seamlessly allows the interested researcher to reproduce the Monte Carlo estimates obtained in the original articles and, perhaps most importantly, to easily compare the quality of any new proposal with the existing competitive methods. The package npbr is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=npbr. Section 2 briefly presents five different motivating data examples concerned with annual sport records, master curve prediction in the reliability programs of nuclear reactors, and optimal cost/production assessment in applied econometrics. Section 3 describes in detail the implemented functions of the package and provides practical guidelines for carrying out the necessary computations. In Section 4, we provide some computational tips that facilitate Monte-Carlo comparisons among frontier estimation methods, in a similar way to the simulation studies undertaken by Daouia et al. (2016) and Noh (2014).

Empirical applications
In this section, we illustrate the use of the npbr package via five different empirical applications taken from the recent literature. Each dataset is chosen to highlight the specifics of a class of estimation methods:

• The dataset records is concerned with the yearly best men's outdoor 1500-metre run times starting from 1966. These annual records, depicted in Figure 1 (a), display some interesting features. Following Jirak, Meister, and Reiss (2014), the lower boundary can be interpreted as the best possible time for a given year. This boundary steadily decreases from 1970 until around the year 2000, followed by a sudden increase. This event leaves room for speculation given that, until the year 2000, it was very difficult to distinguish between biological and synthetic EPO. Here, the boundary is not believed to be shape constrained and can be estimated by the polynomial, local linear, spline or kernel smoothing methods described in Sections 3.1-3.3.
• The dataset nuclear from the US Electric Power Research Institute (EPRI) consists of 254 toughness results obtained from non-irradiated representative steels. For each steel i, fracture toughness y_i and temperature x_i were measured. The scatterplot is given in Figure 1 (b). The objective is to estimate the lower and upper limits of fracture toughness for the reactor pressure vessel materials as a function of the temperature. Given that the nuclear reactors' data are measured accurately, it is natural and more realistic for practitioners to rely on the data envelopment estimation techniques that we regroup in Sections 3.1-3.3. Here, the lower support boundary is believed to be both increasing and convex, while the upper extremity is only known to be monotone nondecreasing (see Daouia et al. 2014, 2016).

• The dataset air is concerned with the assessment of the efficiency of 37 European Air Controllers. The performance of each controller can be measured by its "distance" from the upper support boundary, or equivalently, from the set of the most efficient controllers. This dataset is taken from Mouchart and Simar (2002). The scatterplot of the controllers in the year 2000 is given in Figure 1 (c), where their activity is described by one input (an aggregate factor of different kinds of labor) and one output (an aggregate factor of the activity produced, based on the number of controlled air movements, the number of controlled flight hours, etc.). Given the very small sample size and the sparsity of the data, only the class of polynomial, piecewise polynomial and spline approximations seems to provide satisfactory fits in this applied setting. This class includes the families of empirical and smooth estimation methods described in Section 3.1. Note also that the efficient frontier here is monotone and can in addition be assumed to be concave (see Daouia, Florens, and Simar 2008; Daouia et al. 2016).
• The dataset post about the cost of the delivery activity of the postal services in France was first analyzed by Cazals, Florens, and Simar (2002) and then by Aragon, Daouia, and Thomas-Agnan (2005) and Daouia, Florens, and Simar (2010) among others. There are 4000 post offices observed in 1994. For each post office i, the input x i is the labor cost measured by the quantity of labor, which accounts for more than 80% of the total cost of the delivery activity. The output y i is defined as the volume of delivered mail (in number of objects). As can be seen from the scatterplot in Figure 1 (d), some observations look so isolated in the output direction that they seem hardly related to the other observations. As a matter of fact, this dataset is known to contain outliers and it would then look awkward for practitioners to rely on estimation techniques based on data envelopment ideas (see Daouia and Gijbels 2011). This motivated the quest for robust frontier estimation methods in Section 3.4. It should be clear that only these methods allow one to construct valid asymptotic confidence intervals for the unknown support boundary.
• The dataset green consists of 123 American electric utility companies. As in the set-up of Gijbels, Mammen, Park, and Simar (1999), we used the measurements of the variables y i = log(q i ) and x i = log(c i ), where q i is the production output of the company i and c i is the total cost involved in the production. A detailed description and analysis of these data can be found in Christensen and Greene (1976). The scatterplot is given in Figure 1 (e). Here, the assumption of both monotonicity and concavity constraints is well accepted and any restricted data envelopment technique such as, for instance, kernel smoothing in Section 3.3 can be applied. Also, in the absence of information on whether these data are recorded accurately, one may favor robust frontier estimation. We caution the user that the robust methods based on extreme-value ideas may require a large sample size of the order of thousands to achieve acceptable fits and confidence intervals.
To help users navigate the methods in the npbr package, we describe in Table 2 the type of estimation and shape constraints allowed by each method.  For our illustration purposes, each of the five datasets contains only two variables: one input and one output.

Main functions
This section describes in detail the main functions of the npbr package. The first two arguments of these functions correspond to the observed inputs x_1, ..., x_n and the observed outputs y_1, ..., y_n. The third argument is a numeric vector of evaluation points at which the estimator is to be computed. Typically, the user can generate a regular sequence of size 100, or any finer grid of points, from the minimum value of the inputs x_i to their maximum value. The other arguments of the functions depend on the underlying statistical methods.
We do not presume that the user is familiar with nonparametric frontier modeling, so we briefly describe the underlying estimation methodology and the tuning parameter selection for each method. Section 3.1 is concerned with piecewise polynomial fitting, Section 3.2 with localized boundary regression, Section 3.3 with kernel smoothing techniques, and Section 3.4 with robust regularization approaches.

Piecewise polynomial fitting
We commence with the traditional empirical DEA, FDH and Linearized FDH estimators. We then proceed to polynomial boundary estimators (Hall, Park, and Stern 1998), and finally to constrained spline estimators (Daouia, Noh, and Park 2016).

DEA, FDH and LFDH frontiers
The function dea_est implements the empirical FDH, LFDH and DEA frontier estimators programmed earlier in the Benchmarking package (Bogetoft and Otto 2011). There are two popular methods for preserving monotonicity in the frontier setting: the free disposal hull (FDH) introduced by Deprins et al. (1984) and the data envelopment analysis (DEA) proposed by Farrell (1957). The FDH boundary is the lowest "stair-case" monotone curve covering all the data points:

$$\hat\varphi_{fdh}(x) = \max\{y_i : i \text{ such that } x_i \le x\}.$$

An improved version of this estimator, referred to as the linearized FDH (LFDH), is obtained by drawing the polygonal line smoothing the staircase FDH curve. It has been considered in Hall and Park (2002) and Jeong and Simar (2006). When the joint support of the data is in addition convex, the DEA estimator is defined as the least concave majorant of the FDH frontier. Formally, the DEA estimator of the joint support Ψ is defined by

$$\hat\Psi = \Big\{(x, y) : y \le \sum_{i=1}^n \gamma_i y_i,\; x \ge \sum_{i=1}^n \gamma_i x_i \ \text{ for some } \gamma_1, \ldots, \gamma_n \ge 0 \ \text{ with } \sum_{i=1}^n \gamma_i = 1 \Big\}.$$

Then the DEA estimator of the frontier function ϕ at x is defined by

$$\hat\varphi_{dea}(x) = \sup\{y : (x, y) \in \hat\Psi\}.$$

Note that the FDH, LFDH and DEA estimators are well defined whenever there exists an x_i such that x_i ≤ x. To illustrate the difference between these three empirical frontiers, we consider the air and green data. First, we generate a vector of evaluation points.

R> x.air <- seq(min(air$xtab), max(air$xtab), length.out = 101)
R> x.green <- seq(min(log(green$COST)), max(log(green$COST)),
+    length.out = 101)

Then, we compute the DEA, FDH and LFDH estimates.
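For instance, assuming the estimation method is passed to dea_est through a type argument taking the values "dea", "fdh" and "lfdh", a minimal sketch of this step reads:

R> y.dea.air <- dea_est(air$xtab, air$ytab, x.air, type = "dea")
R> y.fdh.air <- dea_est(air$xtab, air$ytab, x.air, type = "fdh")
R> y.lfdh.air <- dea_est(air$xtab, air$ytab, x.air, type = "lfdh")
R> y.dea.green <- dea_est(log(green$COST), log(green$OUTPUT), x.green,
+    type = "dea")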

Polynomial estimators
The function poly_est is an implementation of the unconstrained polynomial-type estimators of Hall, Park, and Stern (1998) for support frontiers and boundaries.
Here, the data edge is modeled by a single polynomial $\varphi_\theta(x) = \theta_0 + \theta_1 x + \cdots + \theta_p x^p$ of known degree p that envelopes the full data and minimizes the area under its graph for x ∈ [a, b], with a and b being respectively the lower and upper endpoints of the design points x_1, ..., x_n. The function returns the estimate $\hat\varphi_{n,\mathcal P}(x) = \hat\theta_0 + \hat\theta_1 x + \cdots + \hat\theta_p x^p$ of ϕ(x), where $\hat\theta = (\hat\theta_0, \hat\theta_1, \ldots, \hat\theta_p)^\top$ minimizes $\int_a^b \varphi_\theta(x)\,dx$ over $\theta \in \mathbb{R}^{p+1}$ subject to the envelopment constraints $\varphi_\theta(x_i) \ge y_i$, i = 1, ..., n. The polynomial degree p has to be fixed by the user in the 4th argument of the function.
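For instance, a quadratic envelope of the green data on the grid x.green built above could be fitted as follows; this is a sketch in which the degree is passed through an argument deg, and p = 2 is an arbitrary choice:

R> y.poly.green <- poly_est(log(green$COST), log(green$OUTPUT), x.green,
+    deg = 2)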

Selection of the polynomial degree
As the degree p determines the dimensionality of the approximating function, we may view the problem of choosing p as model selection by calling the function poly_degree. By analogy to the information criteria proposed by Daouia et al. (2016) in the boundary regression context, the optimal polynomial degree is obtained by minimizing

$$AIC(p) = \log\Big(\sum_{i=1}^n \big(\hat\varphi_{n,\mathcal P}(x_i) - y_i\big)\Big) + (p+1)/n,$$
$$BIC(p) = \log\Big(\sum_{i=1}^n \big(\hat\varphi_{n,\mathcal P}(x_i) - y_i\big)\Big) + \log n \,(p+1)/(2n).$$

The first criterion (option type = "AIC") is similar to the famous Akaike information criterion (Akaike 1973) and the second one (option type = "BIC") to the Bayesian information criterion (Schwarz 1978). They aim to balance the fidelity to the data against the complexity of the fit in the boundary regression context. There are several ways to motivate the use of the total absolute residuals in these criteria instead of the standard residual sum of squares. For instance, they can be derived directly assuming exponential errors, as motivated by Daouia et al. (2016) in their Section 2.1.
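In practice, one may first select the degree and then plug it into poly_est. A minimal sketch on the green data, assuming poly_degree takes the data and the criterion through type and returns the optimal degree:

R> (p.opt.green <- poly_degree(log(green$COST), log(green$OUTPUT),
+    type = "BIC"))
R> y.poly.green.opt <- poly_est(log(green$COST), log(green$OUTPUT), x.green,
+    deg = p.opt.green)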

Quadratic spline smoothers
The function quad_spline_est is an implementation of the (un)constrained quadratic spline estimates proposed by Daouia et al. (2016).

Unconstrained quadratic fit
Let a and b be, respectively, the minimum and maximum of the design points x_1, ..., x_n. Denote a partition of [a, b] by a = t_0 < t_1 < ... < t_{k_n} = b (see below for the selection process of k_n and {t_j}). Let N = k_n + 2 and $\pi(x) = (\pi_0(x), \ldots, \pi_{N-1}(x))^\top$ be the vector of normalized B-splines of order 3 based on the knot mesh {t_j} (see, e.g., Schumaker 2007). The unconstrained (option method = "u") quadratic spline estimate of the frontier function ϕ(x) is defined as $\tilde\varphi_n(x) = \pi(x)^\top \tilde\alpha$, where $\tilde\alpha$ minimizes $\int_a^b \pi(x)^\top \alpha \, dx$ over $\alpha \in \mathbb{R}^N$ subject to the envelopment constraints $\pi(x_i)^\top \alpha \ge y_i$, i = 1, ..., n. A simple way of choosing the knot mesh in this unconstrained setting is by considering the j/k_n th quantiles $t_j = x_{[jn/k_n]}$ of the distinct values of x_i for j = 1, ..., k_n − 1. Then, the choice of the number of inter-knot segments k_n is viewed as model selection by making use of the function quad_spline_kn (option method = "u") described in a separate paragraph below.

Monotonicity constraint
When the true frontier ϕ(x) is known or required to be monotone nondecreasing (option method = "m"), its constrained quadratic spline estimate is defined by $\hat\varphi_n(x) = \pi(x)^\top \hat\alpha$, where $\hat\alpha$ minimizes the same objective function as $\tilde\alpha$ subject to the same envelopment constraints and the additional monotonicity constraints $\pi'(t_j)^\top \alpha \ge 0$, j = 0, 1, ..., k_n, with $\pi'$ being the derivative of $\pi$. Considering the special connection of the spline smoother $\hat\varphi_n$ with the traditional FDH frontier $\hat\varphi_{fdh}$ (see the function dea_est), Daouia et al. (2016) propose a refined way of choosing the knot mesh. Let $(\mathcal X_1, \mathcal Y_1), \ldots, (\mathcal X_{\mathcal N}, \mathcal Y_{\mathcal N})$ be the observations (x_i, y_i) lying on the FDH boundary (i.e., $y_i = \hat\varphi_{fdh}(x_i)$). The basic idea is to pick out a set of knots equally spaced in percentile ranks among these $\mathcal N$ FDH points by taking $t_j = \mathcal X_{[j\mathcal N/k_n]}$, the j/k_n th quantile of the values of $\mathcal X_\ell$ for j = 1, ..., k_n − 1. The optimal number k_n is then obtained by using the function quad_spline_kn (option method = "m").

Concavity constraint
When the monotone boundary ϕ(x) is also believed to be concave (option method = "mc"), its constrained fit is defined as $\hat\varphi^\star_n(x) = \pi(x)^\top \hat\alpha^\star$, where $\hat\alpha^\star \in \mathbb{R}^N$ minimizes the same objective function as $\tilde\alpha$ subject to the same envelopment and monotonicity constraints and the additional concavity constraints $\pi''(t^*_j)^\top \alpha \le 0$, j = 1, ..., k_n, where $\pi''$ is the constant second derivative of $\pi$ on each inter-knot interval and $t^*_j$ is the midpoint of $(t_{j-1}, t_j]$. Regarding the choice of knots, the same scheme as for $\hat\varphi_n$ is applied by replacing the FDH points $(\mathcal X_\ell, \mathcal Y_\ell)$ with the DEA points $(\mathcal X^*_\ell, \mathcal Y^*_\ell)$, that is, the observations $(x_i, y_i = \hat\varphi_{dea}(x_i))$ lying on the piecewise linear DEA frontier (see the function dea_est). Alternatively, the strategy of just using all the DEA points as knots also works quite well for datasets of modest size, as shown in Daouia et al. (2016). In this case, the user has to choose the option all.dea = TRUE.

Optimal number of inter-knot segments
The function quad_spline_kn computes the optimal number k_n for the quadratic spline fits proposed by Daouia et al. (2016). For the implementation of the unconstrained quadratic spline smoother $\tilde\varphi_n$, based on the knot mesh $\{t_j = x_{[jn/k_n]} : j = 1, \ldots, k_n - 1\}$, the user has to employ the option method = "u". Since the number k_n determines the complexity of the spline approximation, its choice may be viewed as model selection via the minimization of the following Akaike (option type = "AIC") or Bayesian (option type = "BIC") information criteria:

$$A\tilde{I}C(k) = \log\Big(\sum_{i=1}^n \big(\tilde\varphi_n(x_i) - y_i\big)\Big) + (k+2)/n,$$
$$B\tilde{I}C(k) = \log\Big(\sum_{i=1}^n \big(\tilde\varphi_n(x_i) - y_i\big)\Big) + \log n\,(k+2)/(2n).$$

For the implementation of the monotone (option method = "m") quadratic spline smoother $\hat\varphi_n$, the authors first suggest using the set of knots $\{t_j = \mathcal X_{[j\mathcal N/k_n]},\ j = 1, \ldots, k_n - 1\}$ among the FDH points $(\mathcal X_\ell, \mathcal Y_\ell)$, $\ell = 1, \ldots, \mathcal N$, as described above. Then, they propose to choose k_n by minimizing the following AIC (option type = "AIC") or BIC (option type = "BIC") information criteria:

$$A\hat{I}C(k) = \log\Big(\sum_{i=1}^n \big(\hat\varphi_n(x_i) - y_i\big)\Big) + (k+2)/n,$$
$$B\hat{I}C(k) = \log\Big(\sum_{i=1}^n \big(\hat\varphi_n(x_i) - y_i\big)\Big) + \log n\,(k+2)/(2n).$$

A small number of knots is typically needed, as elucidated by the asymptotic theory.
For the implementation of the monotone and concave (option method = "mc") spline estimator $\hat\varphi^\star_n$, just apply the same scheme as above by replacing the FDH points $(\mathcal X_\ell, \mathcal Y_\ell)$ with the DEA points $(\mathcal X^*_\ell, \mathcal Y^*_\ell)$.

Practical guidelines
We describe here how to carry out the necessary computations of the (un)constrained quadratic spline fits under both separate and simultaneous shape constraints. By way of example, we consider the air and green data. To conduct the unconstrained estimation, we first determine the optimal number of inter-knot segments via the BIC criterion, as sketched below.
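A sketch for the air data, assuming quad_spline_kn returns the selected number of inter-knot segments and quad_spline_est receives it through an argument kn (the green data are handled analogously with the log-transformed variables):

R> (kn.air.u <- quad_spline_kn(air$xtab, air$ytab, method = "u",
+    type = "BIC"))
R> y.quad.air.u <- quad_spline_est(air$xtab, air$ytab, x.air, kn = kn.air.u,
+    method = "u")
R> # monotone and monotone-concave fits follow the same pattern
R> (kn.air.m <- quad_spline_kn(air$xtab, air$ytab, method = "m",
+    type = "BIC"))
R> y.quad.air.m <- quad_spline_est(air$xtab, air$ytab, x.air, kn = kn.air.m,
+    method = "m")
R> (kn.air.mc <- quad_spline_kn(air$xtab, air$ytab, method = "mc",
+    type = "BIC"))
R> y.quad.air.mc <- quad_spline_est(air$xtab, air$ytab, x.air,
+    kn = kn.air.mc, method = "mc")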

Cubic spline frontiers
The function cub_spline_est is an implementation of the (un)constrained cubic spline estimates proposed by Daouia et al. (2016).
As in the quadratic spline setting, let a and b be, respectively, the minimum and maximum of the design points x_1, ..., x_n, and denote a partition of [a, b] by a = t_0 < t_1 < ... < t_{k_n} = b.
Here, N = k_n + 3 and $\pi(x) = (\pi_0(x), \ldots, \pi_{N-1}(x))^\top$ is the vector of normalized B-splines of order 4 based on the knot mesh {t_j}. The unconstrained (option method = "u") cubic spline estimate of the frontier ϕ(x) is then defined in the same way as the envelopment quadratic spline $\tilde\varphi_n(x)$, with the same knot selection process, that is, $t_j = x_{[jn/k_n]}$ is the j/k_n th quantile of the distinct values of x_i for j = 1, ..., k_n − 1. The number of inter-knot segments k_n is obtained by calling the function cub_spline_kn (option method = "u"), which consists in minimizing the information criterion $A\tilde{I}C(k)$ (option type = "AIC") or $B\tilde{I}C(k)$ (option type = "BIC").
Regarding the monotonicity constraint, it cannot be formulated into linear constraints at the knots since, as opposed to quadratic splines, the first derivative of a cubic spline is a quadratic spline. Daouia et al. (2016) have been able to come up with an alternative formulation of monotonicity in terms of standard second-order cone constraints, but in our R package, for computational convenience, we use the following sufficient condition to ensure monotonicity:

$$\alpha_0 \le \alpha_1 \le \cdots \le \alpha_{N-1}.$$

This condition was previously used in Lu, Zhang, and Huang (2007) and Pya and Wood (2014). Note that since the condition corresponds to linear constraints on α, the estimator satisfying the monotonicity constraint can be obtained via linear programming.
When the estimate is required to be both monotone and concave, we use the function cub_spline_est with the option method = "mc". The estimate is obtained as the cubic spline function which minimizes the same linear objective function as the unconstrained estimate subject to the same linear envelopment constraints, the monotonicity constraint above and the additional linear concavity constraints $\pi''(t_j)^\top \alpha \le 0$, j = 0, 1, ..., k_n, where the second derivative $\pi''$ is a linear spline. Regarding the choice of knots, we just apply the same scheme as for the unconstrained cubic spline estimate.
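A sketch of the corresponding computations on the green data, under the same interface assumptions as for the quadratic splines; the knot scheme of the unconstrained fit is reused for the constrained fits, as described above:

R> (kn.green.u <- cub_spline_kn(log(green$COST), log(green$OUTPUT),
+    method = "u", type = "BIC"))
R> y.cub.green.u <- cub_spline_est(log(green$COST), log(green$OUTPUT),
+    x.green, kn = kn.green.u, method = "u")
R> y.cub.green.m <- cub_spline_est(log(green$COST), log(green$OUTPUT),
+    x.green, kn = kn.green.u, method = "m")
R> y.cub.green.mc <- cub_spline_est(log(green$COST), log(green$OUTPUT),
+    x.green, kn = kn.green.u, method = "mc")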

Localized boundary regression
This section is concerned with localizing the frontier estimation and considers local linear fitting (Hall et al. 1998; Hall and Park 2004), as well as local maximum and extreme-value smoothing (Gijbels and Peng 2000).

Local linear fitting
The function loc_est computes the local linear smoothing frontier estimators of Hall, Park, and Stern (1998) and Hall and Park (2004). In the unconstrained case (option method = "u"), the implemented estimator of ϕ(x) is defined by

$$\hat\varphi_{n,LL}(x) = \min\big\{ z : \text{there exists } \theta \text{ such that } y_i \le z + \theta(x_i - x) \text{ for all } i \text{ with } x_i \in (x - h, x + h) \big\},$$

where the bandwidth h has to be fixed by the user in the 4th argument of the function. This estimator may lack smoothness in the case of small samples and has no guarantee of being monotone even if the true frontier is so. Following the curvature of the monotone frontier ϕ, the unconstrained estimator $\hat\varphi_{n,LL}$ is likely to exhibit substantial bias, especially at the sample boundaries. A simple way to remedy this drawback is to impose the extra condition θ ≥ 0 in the definition of $\hat\varphi_{n,LL}(x)$, giving

$$\tilde\varphi_{n,LL}(x) = \min\big\{ z : \text{there exists } \theta \ge 0 \text{ such that } y_i \le z + \theta(x_i - x) \text{ for all } i \text{ with } x_i \in (x - h, x + h) \big\}.$$

As shown in Daouia et al. (2016), this version reduces the vexing bias and border defects of the original estimator when the true frontier is monotone. The option method = "m" indicates that the improved fit $\tilde\varphi_{n,LL}$ should be utilized in place of $\hat\varphi_{n,LL}$.
Optimal bandwidth choice

Hall and Park (2004) proposed a bootstrap procedure for selecting the optimal bandwidth h in $\hat\varphi_{n,LL}$ and $\tilde\varphi_{n,LL}$. The function loc_est_bw computes this optimal bootstrap bandwidth. To initiate Hall and Park's bootstrap device, one needs to set a pilot bandwidth, which seems to be quite critical to the quality of $\hat\varphi_{n,LL}$ and $\tilde\varphi_{n,LL}$.

Practical guidelines
To see how the local linear unconstrained estimate $\hat\varphi_{n,LL}$ and its improved version $\tilde\varphi_{n,LL}$ perform, we consider the records, air and nuclear data. We first compute the optimal bandwidths over 100 bootstrap replications by using, for instance, the values 2, 2 and 40 as pilot bandwidths, as sketched below.
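A sketch for the air data; the pilot bandwidth argument hini and the number of bootstrap replications B are assumed names for the interface of loc_est_bw:

R> h.air <- loc_est_bw(air$xtab, air$ytab, x.air, hini = 2, B = 100,
+    method = "u")
R> y.loc.air.u <- loc_est(air$xtab, air$ytab, x.air, h = h.air,
+    method = "u")
R> y.loc.air.m <- loc_est(air$xtab, air$ytab, x.air, h = h.air,
+    method = "m")
R> # the records and nuclear data are handled analogously,
R> # with pilot bandwidths 2 and 40, respectively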

Local maximum estimation
The function loc_max implements the local maximum estimates of ϕ(x) proposed by Gijbels and Peng (2000): a local constant estimator at first (option type = "one-stage") and subsequently a local DEA estimator (option type = "two-stage").
The methodology of Gijbels and Peng consists of considering a strip around x of width 2h, where h = h_n → 0 with nh_n → ∞ as n → ∞, and then focusing on the y_i values of observations falling into this strip. More precisely, they consider the transformed variables $z^{xh}_i = y_i \mathbb{1}\{|x_i - x| \le h\}$, i = 1, ..., n, and the corresponding order statistics $z^{xh}_{(1)} \le \cdots \le z^{xh}_{(n)}$. The simple maximum $z^{xh}_{(n)} = \max_{i=1,\ldots,n} z^{xh}_i$ then defines the local constant estimator (option type = "one-stage") of the frontier point ϕ(x). This opens the way to a two-stage estimation procedure as follows. In a first stage, Gijbels and Peng calculate the maximum $z^{xh}_{(n)}$. Then, they suggest replacing each observation y_i in the strip of width 2h around x by this maximum, leaving all observations outside the strip unchanged. More specifically, they define $\tilde y_i = z^{xh}_{(n)}$ if $|x_i - x| \le h$ and $\tilde y_i = y_i$ otherwise. Then, they apply the DEA estimator (see the function dea_est) to these transformed data $(x_i, \tilde y_i)$, giving the local DEA estimator (option type = "two-stage").
The bandwidth h has to be fixed by the user in the 4th argument of the function. By way of example, in the case of the green data, the value h = 0.5 allows us to reproduce in Figure 7 (left) the estimates obtained by Gijbels and Peng (2000).
R> loc_max_1stage <- loc_max(log(green$COST), log(green$OUTPUT), x.green,
+    h = 0.5, type = "one-stage")
R> loc_max_2stage <- loc_max(log(green$COST), log(green$OUTPUT), x.green,
+    h = 0.5, type = "two-stage")

A data-driven rule for selecting h

Note that the frontier point ϕ(x) is identical to the right endpoint of the conditional cumulative distribution function F(·|x) of Y given X = x, and that the local constant estimate $z^{xh}_{(n)}$ coincides with the right endpoint of its kernel estimator with K(·) being the uniform kernel. When the interest is in the estimation of the conditional distribution function, one way to select the bandwidth h is by making use of the following commands:

R> require("np")
R> bw <- npcdistbw(log(OUTPUT) ~ log(COST), data = green,
+    cykertype = "uniform", bwtype = "fixed")$xbw
R> (h.opt <- max(bw, max(diff(sort(log(green$COST))))/2))

[1] 0.4152283

The first command returns the bandwidth bw computed via the least squares cross-validation method (see Li, Lin, and Racine 2013, for details). As the resulting bandwidth can be smaller than half the maximum spacing due to sparsity in the data, the second command selects the maximum of the two values. One may then use this value to compute the estimates of the conditional endpoint ϕ(x) itself. This is an ad hoc choice, but it works quite well in practice; it might be viewed as an exploratory tool, rather than as a method for final analysis. The corresponding local maximum frontier estimates are graphed in Figure 7 (right).

R> loc_max_1stage.opt <- loc_max(log(green$COST), log(green$OUTPUT), x.green,
+    h = h.opt, type = "one-stage")
R> loc_max_2stage.opt <- loc_max(log(green$COST), log(green$OUTPUT), x.green,
+    h = h.opt, type = "two-stage")

The following code will generate Figure 7.
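What follows is a minimal sketch of such plotting code, assuming loc_max returns the vector of estimates at the evaluation points; the graphical parameters and legend are our own choices:

R> plot(log(OUTPUT) ~ log(COST), data = green, xlab = "log(COST)",
+    ylab = "log(OUTPUT)")
R> lines(x.green, loc_max_1stage, lty = 1)
R> lines(x.green, loc_max_2stage, lty = 2)
R> legend("topleft", legend = c("one-stage", "two-stage"), lty = c(1, 2))

The right panel is obtained analogously with loc_max_1stage.opt and loc_max_2stage.opt.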

Local extreme-value estimation
The function pick_est computes the local Pickands type of estimator introduced by Gijbels and Peng (2000). The implemented estimator of ϕ(x), obtained by applying a well-known extreme-value approach in conjunction with the transformed sample $(z^{xh}_1, \ldots, z^{xh}_n)$ described above in Section 3.2.2, is defined as

$$\hat\varphi_{pick}(x) = z^{xh}_{(n-k)} + \frac{z^{xh}_{(n-k)} - z^{xh}_{(n-2k)}}{\big(z^{xh}_{(n-2k)} - z^{xh}_{(n-4k)}\big)\big/\big(z^{xh}_{(n-k)} - z^{xh}_{(n-2k)}\big) - 1}.$$

It is based on the three upper order statistics $z^{xh}_{(n-k)}$, $z^{xh}_{(n-2k)}$, $z^{xh}_{(n-4k)}$, and depends on the bandwidth h as well as on an intermediate sequence k = k(n) → ∞ with k/n → 0 as n → ∞. The two smoothing parameters h and k have to be fixed by the user in the 4th and 5th arguments of the function. Also, as for the two-stage local frontier estimator presented above, one can transform the data in the strip accordingly and then apply the DEA estimator to the transformed data $(x_i, \tilde y_i)$, giving the local DEA estimator (option type = "two-stage").
Regarding the choice of the smoothing parameters, it should be clear that any automatic data-driven method has to pick h and k simultaneously, which is a daunting problem. Doubtless, further work on defining a concept for selecting appropriate values of h and k will yield new refinements. An illustrative call with arbitrary parameter values is sketched below.
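A hedged usage sketch; the argument names h and k are assumptions (the text above only fixes their positions as the 4th and 5th arguments), and the chosen values are arbitrary:

R> y.pick.green <- pick_est(log(green$COST), log(green$OUTPUT), x.green,
+    h = 0.5, k = 5, type = "one-stage")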

Kernel smoothing
Recently, kernel smoothing methods have been developed for estimating smooth frontier functions. The function kern_smooth implements two recent approaches in this direction.

Definition of the estimator
To estimate the frontier function, Parmeter and Racine (2013) considered the following generalization of linear regression smoothers:

$$\hat\varphi(x|p) = \sum_{i=1}^n p_i A_i(x) y_i,$$

where $A_i(x)$ is the kernel weight function of x for the i-th datum, depending on the x_i's and on the sort of linear smoother used. For example, the Nadaraya-Watson kernel weights are $A_i(x) = K_h(x - x_i) / \sum_{j=1}^n K_h(x - x_j)$, where $K_h(u) = h^{-1} K(u/h)$, with the kernel function K being a bounded and symmetric probability density, and h a bandwidth. Then, the weight vector $p = (p_1, \ldots, p_n)^\top$ is chosen to minimize the distance $D(p) = (p - p_u)^\top (p - p_u)$ subject to the envelopment constraints and the chosen shape constraints, where $p_u$ is an n-dimensional vector with all elements equal to one. The envelopment and shape constraints are

$$\hat\varphi(x_i|p) \ge y_i, \quad i = 1, \ldots, n \quad \text{(envelopment)},$$
$$\hat\varphi^{(1)}(x|p) \ge 0, \quad x \in \mathcal M \quad \text{(monotonicity)},$$
$$\hat\varphi^{(2)}(x|p) \le 0, \quad x \in \mathcal C \quad \text{(concavity)},$$

where $\hat\varphi^{(s)}(x|p) = \sum_{i=1}^n p_i A_i^{(s)}(x) y_i$ is the s-th derivative of $\hat\varphi(x|p)$, with $\mathcal M$ and $\mathcal C$ being the collections of points where monotonicity and concavity are imposed, respectively. In our implementation of the estimator, we simply take the entire dataset {(x_i, y_i), i = 1, ..., n} to be $\mathcal M$ and $\mathcal C$ and, in the case of small samples, we augment the sample points by an equispaced grid of length 201 over the observed support [min_i x_i, max_i x_i] of X. For the weights $A_i(x)$, we use the Nadaraya-Watson weights.

Optimal bandwidth
Bandwidth selection is crucial to the good performance of the frontier estimator, as with other kernel smoothing estimators. Parmeter and Racine (2013)'s recommendation is to adapt the optimal bandwidth for mean regression curve estimation, chosen by least squares cross-validation, to the boundary regression context. This is implemented with the option bw_method = "cv" in the function kern_smooth_bw. We also refer to Parmeter and Racine's original code, based on the np (Hayfield and Racine 2008) and quadprog (Turlach and Weingessel 2013) packages, which can be found at http://socserv.mcmaster.ca/racinej/Gallery/Home.html. Noh (2014) considered the same generalization of linear smoothers $\hat\varphi(x|p)$ for frontier estimation, but with a different method for choosing the weight vector p. This is implemented in the function kern_smooth with the option technique = "noh".

Definition of the estimator
In contrast with Parmeter and Racine (2013), along with the same envelopment and shape constraints, the weight vector p is chosen to minimize the area under the estimator $\hat\varphi(x|p)$, namely $\int_a^b \hat\varphi(x|p)\,dx$, where [a, b] is the true support of X. In practice, we integrate over the observed support [min_i x_i, max_i x_i], since the true one is unknown. For the kernel weights $A_i(x)$, we use the Priestley-Chao weights

$$A_i(x) = (x_i - x_{i-1})\, K_h(x - x_i),$$

where it is assumed that the pairs (x_i, y_i) have been ordered so that x_1 ≤ ... ≤ x_n. The choice of such weights is motivated by their convenience for the evaluation of the integral $\int_a^b \hat\varphi(x|p)\,dx$.

Optimal bandwidth
Following Parmeter and Racine (2013)'s recommendation, we may use the resulting bandwidth from cross-validation for Noh (2014)'s estimator. Another option proposed by Noh (2014) is to select the bandwidth which minimizes a BIC-type criterion developed for frontier estimation:

$$BIC(h) = \log\Big(\sum_{i=1}^n \big(\hat\varphi(x_i|\hat p(h)) - y_i\big)\Big) + \frac{\log n \cdot \mathrm{tr}(S(h))}{2n},$$

where $\hat p(h)$ is the chosen weight vector given the bandwidth h, and $\mathrm{tr}(S(h))$ is the trace of the smoothing matrix $S(h) = \big(\hat p_j(h) A_j(x_i)\big)_{1 \le i, j \le n}$. We refer to Noh (2014) for a thorough discussion of the rationale behind this BIC-type criterion. The function kern_smooth_bw computes the optimal bandwidth from this criterion with the option bw_method = "bic".

Comparison between the two estimators
To illustrate the use of kern_smooth and to compare the two estimators, we consider the green data and compute each estimator under the monotonicity constraint (option method = "m"). First, using the function kern_smooth_bw, we compute the optimal bandwidth for each estimator, as sketched below.
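A sketch of these computations; the technique label "pr" (for Parmeter and Racine) and the bandwidth argument h are assumptions about the interface, while "noh", "cv" and "bic" are the options quoted above:

R> h.pr.green <- kern_smooth_bw(log(green$COST), log(green$OUTPUT),
+    method = "m", technique = "pr", bw_method = "cv")
R> h.noh.green <- kern_smooth_bw(log(green$COST), log(green$OUTPUT),
+    method = "m", technique = "noh", bw_method = "bic")
R> y.pr.green <- kern_smooth(log(green$COST), log(green$OUTPUT), x.green,
+    h = h.pr.green, method = "m", technique = "pr")
R> y.noh.green <- kern_smooth(log(green$COST), log(green$OUTPUT), x.green,
+    h = h.noh.green, method = "m", technique = "noh")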

Robust regularization approaches
In applied settings where outlying observations are omnipresent, as is the case for instance in production data, it is prudent to seek a "robustification" strategy. To achieve this objective, we describe in this section three extreme-value based regularization methods (Daouia, Florens, and Simar 2010, 2012). All of these methods are based on the assumption that the frontier function ϕ is monotone nondecreasing.

Moment frontier estimator
The function dfs_momt is an implementation of the moment-type estimator and the corresponding confidence interval developed by Daouia et al. (2010) under the monotonicity constraint. Combining the ideas of Dekkers, Einmahl, and de Haan (1989) with the dimensionless transformation $\{z^x_i := y_i \mathbb{1}\{x_i \le x\},\ i = 1, \ldots, n\}$ of the observed sample {(x_i, y_i), i = 1, ..., n}, they estimate the conditional endpoint ϕ(x) by

$$\tilde\varphi_{momt}(x) = z^x_{(n-k)} + z^x_{(n-k)}\, M^{(1)}_n\, (1 + \rho_x), \qquad M^{(1)}_n = \frac{1}{k}\sum_{i=0}^{k-1}\big(\log z^x_{(n-i)} - \log z^x_{(n-k)}\big),$$

where $z^x_{(1)} \le \cdots \le z^x_{(n)}$ are the ascending order statistics of the transformed sample $\{z^x_i,\ i = 1, \ldots, n\}$, and ρ_x > 0 is referred to as the extreme-value index and has the following interpretation: when ρ_x > 2, the joint density of the data decays smoothly to zero at a speed of power ρ_x − 2 of the distance from the frontier; when ρ_x = 2, the density has sudden jumps at the frontier; when ρ_x < 2, the density increases toward infinity at a speed of power ρ_x − 2 of the distance from the frontier. As a matter of fact, we have ρ_x = β_x + 2, where β_x is the shape parameter of the joint density introduced in Section 1. Most of the contributions to the econometric literature on frontier analysis assume that the joint density is strictly positive at its support boundary, or equivalently, that ρ_x = 2 for all x.

Estimation strategy when ρ x is unknown
In this case, Daouia et al. (2010) suggest using the following two-step estimator: first, estimate ρ_x by the moment estimator $\tilde\rho_x$ implemented in the function rho_momt_pick with the option method = "moment", or by the Pickands estimator $\hat\rho_x$ with the option method = "pickands" (see the paragraph Moment and Pickands estimates of the tail-index ρ_x below for a detailed description of the function rho_momt_pick); second, use the estimator $\tilde\varphi_{momt}(x)$, as if ρ_x were known, by substituting the estimated value $\tilde\rho_x$ or $\hat\rho_x$ in place of ρ_x.

Confidence interval
The 95% confidence interval of ϕ(x) derived from the asymptotic normality of $\tilde\varphi_{momt}(x)$ is also returned by dfs_momt; we refer to Daouia et al. (2010) for its explicit expression.

Selection of the sequence k
The number k = k_n(x) plays here the role of the smoothing parameter and varies between 1 and $N_x = \sum_{i=1}^n \mathbb{1}\{x_i \le x\}$, the number of observations (x_i, y_i) with x_i ≤ x. The question of selecting the optimal value of k_n(x) is still an open issue. Daouia et al. (2010) have only suggested an empirical rule, implemented in the function kopt_momt_pick (option method = "moment"), that turns out to give reasonable values of the sequence k_n(x) for estimating the frontier ϕ(x) [see the paragraph Threshold selection for moment and Pickands frontiers below for a detailed description of the function kopt_momt_pick]. However, as is common in extreme-value theory, good results require a large sample size N_x of the order of several hundreds. If the resulting pointwise frontier estimates and confidence intervals exhibit severe instabilities, the user should call the function kopt_momt_pick and tune the parameter wind.coef in the interval (0, 1] until more stable curves are obtained (default option wind.coef = 0.1). See help(kopt_momt_pick) for further details.

Practical guidelines
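Throughout this paragraph, the frontier estimators are evaluated on a grid x.post of points spanning the observed inputs of the post data. A minimal sketch of its construction, mirroring the evaluation grids built earlier (the column names xinput and yprod are those used in the subsequent calls):

R> x.post <- seq(min(post$xinput), max(post$xinput), length.out = 100)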
For our illustration purposes using the large dataset post, we consider the following three possible scenarios: either ρ_x is known (typically equal to 2 if the assumption of a jump at the frontier is reasonable), ρ_x is unknown and estimated by the moment estimator $\tilde\rho_x$, or ρ_x is unknown but independent of x and estimated by the (trimmed) mean of the $\tilde\rho_x$. First, we select the points x.post at which we want to evaluate the frontier estimator, as constructed above. In the case where the extreme-value index ρ_x is known and equal to 2, we set

R> rho <- 2

Then, we determine the sequence k = k_n(x) in $\tilde\varphi_{momt}(x)$.
R> best_kn.1 <- kopt_momt_pick(post$xinput, post$yprod, x.post, rho = rho)

When ρ_x is unknown and dependent of x, its estimate $\tilde\rho_x$ is computed via the command

R> rho_momt <- rho_momt_pick(post$xinput, post$yprod, x.post,
+    method = "moment")

To determine the number k in the two-step estimator $\tilde\varphi_{momt}(x)$, we use

R> best_kn.2 <- kopt_momt_pick(post$xinput, post$yprod, x.post,
+    rho = rho_momt)

Here, for the post data, we used the default value wind.coef = 0.1 in the function kopt_momt_pick to avoid numerical instabilities. When employing another large dataset, the user should tune this coefficient until the resulting pointwise frontier estimates and confidence intervals exhibit stable curves (see the function kopt_momt_pick for details).
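With these ingredients, the moment frontier estimates (and the associated 95% confidence intervals, as described above) can be computed; a sketch assuming dfs_momt receives ρ_x through rho and the selected sequence through k:

R> res.momt.1 <- dfs_momt(post$xinput, post$yprod, x.post, rho = rho,
+    k = best_kn.1)
R> res.momt.2 <- dfs_momt(post$xinput, post$yprod, x.post, rho = rho_momt,
+    k = best_kn.2)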
When ρ_x is unknown but independent of x, which is a more realistic setting in practice, a robust estimation strategy is obtained by using the (trimmed) mean of the moment estimates $\tilde\rho_x$ over the evaluation points, as sketched below.
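A hedged sketch of this third scenario; we assume that rho_momt computed above can be treated as the vector of pointwise estimates, and the 5% trimming proportion is our arbitrary choice:

R> rho.const <- mean(rho_momt, trim = 0.05)
R> best_kn.3 <- kopt_momt_pick(post$xinput, post$yprod, x.post,
+    rho = rho.const)
R> res.momt.3 <- dfs_momt(post$xinput, post$yprod, x.post, rho = rho.const,
+    k = best_kn.3)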

Pickands frontier estimator
The function dfs_pick computes the Pickands type of estimator and its associated confidence interval introduced by Daouia et al. (2010) under the monotonicity constraint.
Built on extreme-value ideas similar to those underlying dfs_momt, Daouia et al. (2010) proposed to estimate the frontier point ϕ(x) by

$$\hat\varphi_{pick}(x) = z^x_{(n-k)} + \frac{z^x_{(n-k)} - z^x_{(n-2k)}}{2^{1/\rho_x} - 1}$$

from the transformed data $\{z^x_i := y_i \mathbb{1}\{x_i \le x\},\ i = 1, \ldots, n\}$, where ρ_x > 0 is the same tail-index as in dfs_momt.
If ρ_x is known (typically equal to 2 if the joint density of the data is believed to have sudden jumps at the frontier), then one can use the estimator $\hat\varphi_{pick}(x)$ in conjunction with the data-driven method for selecting the threshold k as described below.
In contrast, if ρ_x is unknown, one could consider the following two-step estimator: first, estimate ρ_x by the Pickands estimator $\hat\rho_x$ implemented in the function rho_momt_pick with the option method = "pickands", or by the moment estimator $\tilde\rho_x$ with the option method = "moment" [a detailed description of the function rho_momt_pick is provided below in a separate paragraph]; second, use the estimator $\hat\varphi_{pick}(x)$, as if ρ_x were known, by substituting the estimated value $\hat\rho_x$ or $\tilde\rho_x$ in place of ρ_x.
Finally, to select the threshold k = k n (x), one could use the automatic data-driven method of Daouia et al. (2010) implemented in the function kopt_momt_pick (option method = "pickands") as described below in the last paragraph.

Practical guidelines
For our illustration purposes, we use again the large dataset post and consider the same three scenarios as before: either ρ_x is known (typically equal to 2 if the joint density has sudden jumps at the frontier), ρ_x is unknown and estimated by the Pickands estimator $\hat\rho_x$, or ρ_x is unknown but independent of x and estimated by the (trimmed) mean of the $\hat\rho_x$. When ρ_x is known and equal to 2, we proceed as sketched below.
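A sketch of these computations, assuming dfs_pick mirrors the interface of dfs_momt described above:

R> rho <- 2
R> best_kn.pick.1 <- kopt_momt_pick(post$xinput, post$yprod, x.post,
+    rho = rho, method = "pickands")
R> res.pick.1 <- dfs_pick(post$xinput, post$yprod, x.post, rho = rho,
+    k = best_kn.pick.1)
R> # when rho_x is unknown, plug in its Pickands estimate
R> rho_pick <- rho_momt_pick(post$xinput, post$yprod, x.post,
+    method = "pickands")
R> best_kn.pick.2 <- kopt_momt_pick(post$xinput, post$yprod, x.post,
+    rho = rho_pick, method = "pickands")
R> res.pick.2 <- dfs_pick(post$xinput, post$yprod, x.post, rho = rho_pick,
+    k = best_kn.pick.2)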
The user can also appreciably improve the estimation of ρ x and ϕ(x) itself by tuning the choice of the lower limit (default option lrho = 1) and upper limit (default option urho = Inf).

Threshold selection for moment and Pickands frontiers
The function kopt_momt_pick is an implementation of an experimental method by Daouia et al. (2010) for the automated threshold selection (choice of k = k_n(x)) for the moment frontier estimator $\tilde\varphi_{momt}(x)$ [see dfs_momt] in the case where method = "moment", and for the Pickands frontier estimator $\hat\varphi_{pick}(x)$ [see dfs_pick] in the case where method = "pickands". The idea is to select first (for each x) a grid of values for the number k_n(x), given by k = 1, ..., [√N_x], where [·] stands for the integer part and $N_x = \sum_{i=1}^n \mathbb{1}\{x_i \le x\}$, and then to select the k where the variation of the results is the smallest. To achieve this, Daouia et al. (2010) compute the standard deviations of $\tilde\varphi_{momt}(x)$ [option method = "moment"] or $\hat\varphi_{pick}(x)$ [option method = "pickands"] over a "window" of size max(3, [wind.coef × √N_x / 2]), where the coefficient wind.coef should be selected in the interval (0, 1] in such a way as to avoid numerical instabilities. The default option wind.coef = 0.1 corresponds to a window large enough to cover around 10% of the possible values of k in the selected range of values for k_n(x). The value of k where the standard deviation is minimal defines the desired number k_n(x). See the note in help(kopt_momt_pick) for further details.

Probability-weighted moment frontier estimator
The function dfs_pwm computes the regularized frontier estimator introduced by Daouia et al. (2012). It is based on an unregularized probability-weighted moment (PWM) estimator of ϕ(x) built from the empirical conditional distribution function $\hat F(y|x) = \sum_{i=1}^n \mathbb{1}\{x_i \le x, y_i \le y\} / \sum_{i=1}^n \mathbb{1}\{x_i \le x\}$, where the trimming order m ≥ 1 is an integer such that m = m_n → ∞ as n → ∞. The implemented estimator $\tilde\varphi_{pwm}(x)$ is a bias-corrected version of this PWM estimator, with a ≥ 2 being a fixed integer and $\bar\rho_x$ an estimate of the same tail-index ρ_x = β_x + 2 as in dfs_momt and dfs_pick; we refer to Daouia et al. (2012) for its explicit expression. If the true value of ρ_x is known, we set $\bar\rho_x = \rho_x$. In contrast, if ρ_x is unknown, its estimate $\bar\rho_x$ can be obtained separately in an optimal way by calling the function rho_pwm described below in the last paragraph. In both cases, we use the frontier estimator $\tilde\varphi_{pwm}(x)$ as if $\bar\rho_x$ were known, by plugging in its value. As pointed out by Daouia et al. (2012), it is most efficient to conduct the tail-index estimation and the frontier estimation separately. Then, knowing the value $\bar\rho_x$, it remains to fix the two smoothing parameters a and m in order to calculate the frontier estimator $\tilde\varphi_{pwm}(x)$. A practical choice of these parameters that Daouia et al. (2012) have employed is the simple rule of thumb a = 2 [default option in the 5th argument of the function] and $m = coefm \times N_x^{1/3}$, where $N_x = \sum_{i=1}^n \mathbb{1}\{x_i \le x\}$ and the integer coefm is to be tuned by the user in the 4th argument of the function. Daouia et al. (2012) have suggested in their numerical illustrations to use, for instance, the value coefm = 1. An automatic data-driven rule for choosing the optimal tuning parameter coefm is implemented in the function mopt_pwm described below.
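A usage sketch on the post data; the argument names coefm, a and rho follow the descriptions above, and we assume that the return value of rho_pwm can be reused directly as the rho argument:

R> rho_pwm.post <- rho_pwm(post$xinput, post$yprod, x.post)
R> res.pwm <- dfs_pwm(post$xinput, post$yprod, x.post, coefm = 1, a = 2,
+    rho = rho_pwm.post)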

Confidence interval
The pointwise 95% confidence interval of ϕ(x) derived from the asymptotic normality of $\tilde\varphi_{pwm}(x)$ is also provided by dfs_pwm. Note that the standard deviation $\sigma(m, x)/\sqrt{n}$ of the bias-corrected estimator $\tilde\varphi_{pwm}(x)$ is adjusted by a bootstrap estimator in the numerical illustrations of Daouia et al. (2012), whereas the exact estimate $\hat\sigma(m, x)/\sqrt{n}$ is utilized in our implemented function.

PWM estimate of the tail-index ρ x
The function rho_pwm computes the probability-weighted moment (PWM) estimator $\bar\rho_x$ utilized in the frontier estimate $\tilde\varphi_{pwm}(x)$ [see dfs_pwm]. This estimator depends on the smoothing parameters a and m. A simple selection rule of thumb that Daouia et al. (2012) have employed is a = 2 [default option in the 4th argument of the function] and $m = coefm \times N_x^{1/3}$, where $N_x = \sum_{i=1}^n \mathbb{1}\{x_i \le x\}$ and the integer coefm is to be tuned by the user. To choose this parameter in an optimal way for each x, we adapt the automated threshold selection method of Daouia et al. (2010) as follows: we first evaluate the estimator $\bar\rho_x$ over a grid of values of coefm given by c = 1, ..., 150, and then select the c where the variation of the results is the smallest. This is achieved by computing the standard deviation of the estimates $\bar\rho_x$ over a "window" of max([√150], 3) successive values of c. The value of c where this standard deviation is minimal defines the value of coefm.
The user can also appreciably improve the estimation of ρ x and ϕ(x) itself by tuning the choice of the lower limit (default option lrho = 1) and upper limit (default option urho = Inf).

Numerical illustrations
Comparisons among most of the selected estimation methods described above have been undertaken by Daouia et al. (2016) and more recently by Noh (2014) via simulation experiments. To encourage others to explore these methods and easily compare the quality of any new proposal with the existing competitive methods, we provide some guidelines that facilitate comparison based on Monte-Carlo simulations, in a similar way to the devices of Daouia et al. (2016) and Noh (2014).

Comparison criteria
After estimating the true frontier function ϕ(x) from N independent samples of size n, Daouia et al. (2016) and Noh (2014) considered the empirical mean integrated squared error (MISE), the empirical integrated squared bias (IBIAS2) and the empirical integrated variance (IVAR), which are given by

$$MISE = \frac{1}{N}\sum_{j=1}^{N}\Big[\frac{1}{I}\sum_{i=1}^{I}\big(\hat\varphi^{(j)}(z_i) - \varphi(z_i)\big)^2\Big],$$

$$IBIAS2 = \frac{1}{I}\sum_{i=1}^{I}\big(\bar\varphi(z_i) - \varphi(z_i)\big)^2, \qquad IVAR = \frac{1}{I}\sum_{i=1}^{I}\Big[\frac{1}{N}\sum_{j=1}^{N}\big(\hat\varphi^{(j)}(z_i) - \bar\varphi(z_i)\big)^2\Big],$$

where {z_i, i = 0, ..., I} is an equispaced grid of width 1/I over [a, b] (the true support of the input variable), with I = 1000, $\hat\varphi^{(j)}(\cdot)$ is the estimated frontier function from the j-th data sample, and $\bar\varphi(z_i) = N^{-1}\sum_{j=1}^N \hat\varphi^{(j)}(z_i)$. Although the definition of these comparison criteria is quite straightforward, some caution should be taken when calculating them. The reason is that the estimate $\hat\varphi^{(j)}(z_i)$ is available only when z_i lies between the minimum and maximum of the inputs $x^{(j)}_1, \ldots, x^{(j)}_n$ of the j-th sample. In our package, when storing the estimates $\hat\varphi^{(j)}(z_i)$, we assign the value zero to $\hat\varphi^{(j)}(z_i)$, for distinction, whenever the estimation is not possible. The function evaluation automatically computes the comparison criteria using only the nonzero estimates at every grid point z_i. The first argument of this function is the matrix where the estimation results are stored, the second argument is the evaluation grid vector, and the third argument is the vector of values of the true frontier function at the grid points.

Some Monte Carlo evidence
By way of example, to evaluate the finite-sample performance of the empirical LFDH and DEA frontier estimators in comparison with the polynomial, spline and kernel smoothed estimators, we have undertaken some simulation experiments following the study of Daouia et al. (2016). The experiments all employ the model y_i = ϕ(x_i) v_i, where x_i is uniform on [0, 1] and v_i, independent of x_i, is Beta(β, β) with β = 0.5, 1 and 3 [corresponding, respectively, to a joint density of the (x_i, y_i)'s increasing toward infinity, having a jump, or decreasing to zero as it approaches the support boundary]. Tables 3 and 4 report the obtained Monte Carlo estimates for ϕ(x) = x^{1/2} and ϕ(x) = exp(−5 + 10x)/(1 + exp(−5 + 10x)), respectively. All the experiments were performed over N = 5000 independent samples of size n = 25, 50, 100 and 200.
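To make this workflow concrete on a much smaller scale, the following hedged sketch runs N = 100 replications of the DEA estimator under the model above with β = 1 and ϕ(x) = x^{1/2}, stores the estimates row by row with zeros where estimation is impossible, and then calls the function evaluation with the argument order described in the previous subsection:

R> library("npbr")
R> set.seed(1)                          # arbitrary seed, for reproducibility
R> N <- 100; n <- 100; I <- 1000
R> z <- seq(0, 1, length.out = I + 1)   # equispaced grid {z_i} over [0, 1]
R> phi.true <- sqrt(z)                  # true frontier phi(x) = x^(1/2)
R> res <- matrix(0, nrow = N, ncol = I + 1)
R> for (j in 1:N) {
+    xtab <- runif(n)
+    ytab <- sqrt(xtab) * rbeta(n, 1, 1)   # y = phi(x) v with v ~ Beta(1, 1)
+    ok <- z >= min(xtab) & z <= max(xtab) # estimation possible only here
+    res[j, ok] <- dea_est(xtab, ytab, z[ok], type = "dea")
+ }
R> evaluation(res, z, phi.true)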