A Greedy Algorithm for Unimodal Kernel Density Estimation by Data Sharpening

We consider the problem of nonparametric density estimation where estimates are constrained to be unimodal. Though several methods have been proposed to achieve this end, each of them has its own drawbacks and none of them have readily-available computer codes. The approach of Braun and Hall (2001), where a kernel density estimator is modified by data sharpening, is one of the most promising options, but optimization difficulties make it hard to use in practice. This paper presents a new algorithm and MATLAB code for finding good unimodal density estimates under the Braun and Hall scheme. The algorithm uses a greedy, feasibility-preserving strategy to ensure that it always returns a unimodal solution. Compared to the incumbent method of optimization, the greedy method is easier to use, runs faster, and produces solutions of comparable quality. It can also be extended to the bivariate case.


Introduction
Nonparametric methods for smoothing, regression, and density estimation produce estimators with great shape flexibility. Although this flexibility is an advantage, the practical value of nonparametric methods would be increased if qualitative constraints (natural-language shape restrictions) could also be imposed on the estimator. In density estimation, the most common such constraints are monotonicity (the density must be nondecreasing or nonincreasing) and unimodality (the density must have only one peak).
The work presented here takes unimodal kernel density estimation as a representative problem in constrained nonparametric estimation. The method proposed for handling the constraint is data sharpening. The following section provides a brief review of the unimodal kernel density estimation problem and the data sharpening approach to solving it.
Afterwards, the goals and scope of the current work are discussed in more detail.

Unimodal Kernel Density Estimation
Let x be a set of n independent observations from a distribution with pdf f. The kernel density estimator (KDE), denoted by f_x, is

f_x(t) = (1/n) Σ_{i=1}^n K_h(t − x_i),   (1.1)

where K_h is a Gaussian density component having variance h^2. If h is sufficiently small, the estimate f_x will have n modes; if it is sufficiently large, there will be only one mode. This is illustrated in the accompanying figure. The problem of interest here is to obtain a kernel density estimate that is forced to be unimodal. A few possibilities exist. The weight on the i-th point (i = 1, 2, . . . , n) could be changed from 1/n to some probability p_i that ensures a single mode (Hall and Heckman 2002). Alternatively, a different bandwidth could be used at each point; the resulting bandwidth vector h could be constrained to enforce unimodality.1 The monotone rearrangement method of Fougeres (1997) could be used; it starts with a standard KDE and uses transformations to produce a unimodal density. There are, in addition, several other non-kernel-based methods for unimodal or monotone density estimation and the related problem of isotonic regression. See Racine and Parmeter (2008), Silvapulle and Sen (2005), and Cheng et al. (1999) for further information and references.
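To make the bandwidth-mode relationship concrete, the following Python sketch builds a Gaussian KDE as in (1.1) and counts its modes on a grid. The report's own code is in MATLAB, so all names here are illustrative; a small h yields many modes, a sufficiently large h only one.

```python
import numpy as np

def kde(t, x, h):
    """Gaussian KDE as in (1.1): average of normal densities centred at the data."""
    t = np.asarray(t, dtype=float)[:, None]
    z = (t - np.asarray(x, dtype=float)) / h
    return np.exp(-0.5 * z * z).sum(axis=1) / (len(x) * h * np.sqrt(2.0 * np.pi))

def count_modes(x, h, grid):
    """Count local maxima of the KDE on a grid via sign changes of successive differences."""
    s = np.sign(np.diff(kde(grid, x, h)))
    s = s[s != 0]  # drop exact ties so plateaus do not produce spurious counts
    return int(np.sum((s[:-1] > 0) & (s[1:] < 0)))
```

With clustered data, a bandwidth much smaller than the cluster gaps produces one peak per cluster (or per point), while a bandwidth comparable to the data range smooths the estimate to a single mode.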

Data Sharpening
Data sharpening (Braun and Hall 2001; Hall and Kang 2005) is the method of constraint handling to be pursued here. This method modifies the data, rather than the estimator, to accommodate the constraint. A new set of "sharpened" data, y, is formed by finding a minimal perturbation of x that results in a unimodal KDE. The sharpened estimate of f is then just f_y: estimator (1.1) is used without modification on the sharpened data y. In simple cases the best way to shift the points is obvious, but in real problems it will be unclear which perturbation of the data yields the best unimodal KDE. The present research deals with the problem of designing an automatic procedure to make the data perturbations in a sensible way.
The data sharpening concept does not depend on the statistical method being used. As such, sharpening is potentially applicable to a wide range of nonparametric estimators and constraint types. Extension to higher-dimensional data is also conceptually straightforward.
This potential generality of data sharpening is its major advantage over other constraint-handling approaches.
1 It appears that this method of adaptive kernels has not been applied to handling constraints.

Goals and Scope
The difficulty in achieving the potential of data sharpening is in actually finding y, the sharpened data set. The difficulty is twofold: first, how does one define which y is best; and second, how does one actually find this solution? The remainder of this report constitutes a first step toward answering these questions. The unimodal KDE problem is used as a vehicle for development, but the ultimate goal is to develop procedures that can be used more generally. For this reason, every effort will be made to avoid dependence on the characteristics of kernel density estimators or the unimodality constraint.
This report has the following three goals:
1. Introduce the unimodal KDE data sharpening problem, and thoroughly discuss the issues involved in data sharpening optimization. Justify the use of heuristic optimization methods for the general data sharpening problem.
2. Describe in detail the improve function that has been developed to perform data sharpening. This function uses a greedy optimization heuristic. While it may be useful in its own right, it is primarily envisioned as a building block for more sophisticated algorithms in the future.
3. Lay the foundation for future developments, such as sharpening with other estimators, other constraints, higher dimensions, and so on.
It should be noted that while the generality of data sharpening has been stressed, the work presented here will be narrow in focus. Only kernel density estimation will be considered; the only constraint tested will be unimodality; and the development will only consider scalar data. The problem of bandwidth selection will not be addressed, nor will the question of the statistical properties of sharpened estimators. All such extensions and related topics are deferred to future work to allow a focus on the core optimization problem.

The Sharpening Optimization Problem
The goal of data sharpening is to find the "best" solution that satisfies a chosen qualitative constraint-in this case, unimodality of a KDE. The sharpening optimization problem will now be defined and explored in detail. Particular emphasis will be placed on understanding the sources of difficulty in setting up and solving the problem.

Problem Definition
As in the introduction, let x be the observed vector of n data points, and y be any proposed sharpened data vector. The kernel density estimates based on x and y are f_x and f_y, respectively. Let δ(y, x) be a smaller-the-better objective function defining which y are best, and let y* be the minimizer of this function.

Known Mode
If the estimator is constrained to have its mode at a known location m, the optimization problem is a search for y*, where

minimize δ(y, x) subject to f_y'(t) ≥ 0 for t ≤ m and f_y'(t) ≤ 0 for t ≥ m,   (2.1)

which makes it clear that unimodality is a derivative constraint on the density estimator.
In a practical implementation, the constraints of (2.1) will be enforced only at a grid of points g = (g_1, . . . , g_G) covering the support of the estimate.1 If m falls between gridpoints k and k + 1, the problem may be formulated as

minimize δ(y, x) subject to f_y'(g_i) ≥ 0 for i = 1, . . . , k and f_y'(g_i) ≤ 0 for i = k + 1, . . . , G.   (2.2)

This formulation converts the unimodality constraints into a set of G explicit inequalities involving y.
Expressed as (2.2), the sharpening problem looks like a standard problem in constrained nonlinear optimization, for which there is extensive theory and numerous algorithms available. The algorithm of primary interest here is sequential quadratic programming (SQP), which has been used by others (Braun and Hall 2001;Hall and Huang 2002;Hall and Kang 2005;Racine and Parmeter 2008) to perform nonparametric inference subject to constraints. Nocedal and Wright (1999) provide a thorough discussion of SQP and related methods.

Unknown Mode
In the more realistic case where the mode is unknown, the problem is more complicated because m (or k) depends on y. One way to proceed is to treat m as fixed, solve the sharpening problem, and then repeat the process according to a 1-D optimization scheme to find the best value of m. This allows the constraints to be handled explicitly as in (2.2), but with a corresponding computational cost.
It is desirable (in terms of generality, and possibly performance and speed as well) to design an algorithm capable of performing data sharpening directly. The algorithm would simply search for the best solution satisfying the constraint. The optimization problem is then the search for

y* = argmin_{y ∈ C} δ(y, x),   (2.3)

where C is the set of all feasible solutions (those giving unimodal KDEs):

C = {y : f_y has exactly one mode}.   (2.4)

For computer implementation one would use a discretized version of (2.4) in terms of g_i and k, as in (2.2).
In this general formulation, it is not possible to express the unimodality constraint as a set of inequalities in y. Though this makes it impossible to use standard deterministic algorithms such as SQP, there are many heuristic search techniques that are potentially well suited to this problem. These heuristics require only a feasibility check or test function that evaluates whether or not a candidate solution y satisfies the constraint.
A practical way to test for feasibility is to get exact or approximate derivative values of f y at the grid points g, and then count the number of modes based on the number of sign changes. An algorithm that only requires this feasibility check could simultaneously choose the mode and perform data sharpening.
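This feasibility test is straightforward to implement. The Python sketch below (illustrative names; the report's implementation is in MATLAB) uses the exact derivative of a Gaussian KDE and counts sign changes over the grid; a candidate y is feasible when the derivative changes sign exactly once.

```python
import numpy as np

def kde_deriv(t, y, h):
    """Exact derivative of a Gaussian KDE f_y, evaluated at the points t."""
    t = np.asarray(t, dtype=float)[:, None]
    z = (t - np.asarray(y, dtype=float)) / h
    return (-z * np.exp(-0.5 * z * z)).sum(axis=1) / (len(y) * h**2 * np.sqrt(2.0 * np.pi))

def is_unimodal(y, h, grid):
    """Feasibility check: f_y' changes sign exactly once over the grid."""
    s = np.sign(kde_deriv(grid, y, h))
    s = s[s != 0]  # ignore exact zeros (e.g. at the mode itself)
    return int(np.sum(s[:-1] != s[1:])) == 1
```

An algorithm built on this test needs no explicit mode location: the single sign change identifies the mode wherever it falls.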

Possible Objective Functions
The choice of objective function will influence both the nature of the optimization problem and the qualitative behaviour of the resulting density estimates. Several natural possibilities for δ(y, x) are discussed below.

Objectives Based on Data Vectors
Because the sharpened data are considered perturbations of the original data, it is natural to take the objective to be a measure of distance between vectors. One option is to base δ(y, x) on a norm of the difference x − y, defining

L_α(y, x) = Σ_{i=1}^n |y_i − x_i|^α.   (2.5)

The parameter α in (2.5) can be interpreted as controlling the tendency to sharpen by moving single points or groups of points. Setting α = 2 discourages movement of single points through large distances, while setting α = 1 makes the optimizer indifferent to the number of points moved. The value of α can particularly affect behaviour in the tails of the distribution, where there are few data points.
The L_α distance was used by Braun and Hall (2001) and Hall and Kang (2005), with SQP as the optimizer. Those studies found that α = 1 gave better mean integrated squared error (MISE) performance in test problems, but caused problems with numerical stability, occasionally leading to non-convergence. Failure to converge was attributed to differentiability: the L_1 summand |y_i − x_i| is not differentiable at y_i = x_i. To improve the numerical stability of optimization, Hall and Kang (2005) proposed a metric Ψ_tan(y, x) whose summand is the integral of a scaled arctangent function of d_i = y_i − x_i. The reason for using this function was to mimic the linear behaviour of L_1 away from d_i = 0 while maintaining differentiability at zero. In the present work, the following rounded-corners objective will be used instead:

RC(y, x) = Σ_{i=1}^n ( sqrt((y_i − x_i)^2 + c^2) − c ),   (2.6)

for a small constant c > 0. The summand of this function is a hyperbola in y_i − x_i. Figure 2.1 compares the summands of L_1, L_2, Ψ_tan, and RC. The RC objective achieves the same aims as Ψ_tan, but without the need for integration.2
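A minimal sketch of the RC objective follows. The value of the constant c and the subtraction of c (which makes RC(x, x) = 0) are assumptions of this sketch rather than details taken from the report.

```python
import math

def rc(y, x, c=0.1):
    """Rounded-corners objective: each summand sqrt(d^2 + c^2) - c is a
    hyperbola in d = y_i - x_i, nearly linear away from zero (like L_1)
    but smooth at zero. The constant c is an assumed tuning parameter."""
    return sum(math.sqrt((a - b) ** 2 + c * c) - c for a, b in zip(y, x))
```

For perturbations much larger than c the summand is close to |y_i − x_i|, so RC inherits the L_1-like indifference to how many points are moved while remaining differentiable everywhere.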
The goal of the data sharpening problem is to find a good density estimate that satisfies the constraint, but the optimization problem is posed in terms of data vectors, not density estimates. The set of solutions {y} has a many-to-one mapping onto the set of density estimates {f_y}, because the KDE is invariant to permutations of the sharpened data points while the L_α and RC objectives are not. If two solution vectors y_1 and y_2 are permutations of each other then they are practically equivalent; nonetheless, numerical optimization routines using objective functions (2.5) or (2.6) will consider them to be different because δ(y_1, x) ≠ δ(y_2, x) in general for those objectives.
The objective function can be modified to include a matching step to enforce permutation invariance of the solutions. The simplest way to match points in 1-D problems is to start with x sorted in ascending order and then sort any proposed y before calculating the objective function. A sorted L_α objective can then be defined as

L_α^s(y, x) = Σ_{i=1}^n |y_(i) − x_(i)|^α,   (2.7)

where y_(i) represents the i-th largest point in y. The sorted version of RC can be similarly defined and denoted RC^s(y, x).
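The matching step amounts to sorting both vectors before computing the distance, as this sketch shows (names are illustrative):

```python
def l_alpha(y, x, alpha):
    """L_alpha objective (2.5): sum of |y_i - x_i|**alpha over matched pairs."""
    return sum(abs(a - b) ** alpha for a, b in zip(y, x))

def l_alpha_sorted(y, x, alpha):
    """Sorted L_alpha: match order statistics before measuring distance,
    making the objective invariant to permutations of y."""
    return l_alpha(sorted(y), sorted(x), alpha)
```

A permutation of x scores zero under the sorted objective but strictly positive under the unsorted one, which is exactly the invariance the matching step is meant to provide.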
An optimization algorithm using only the un-sorted objective function will have no way of knowing whether a solution's objective value could be improved by re-matching its points to x. The algorithm may fail to use promising solution paths, or may converge to sub-optimal solutions when points "cross over" each other into an un-matched state. Performing matching before evaluating the solution might improve performance and reliability of solution methods.

Objectives Based on Density Estimates
Another natural approach to choosing an objective function is to use a metric based on the sharpened and unsharpened density estimates, f_y and f_x. There are a number of suitable distance or discrepancy measures available, including integrated squared error (ISE), Kullback-Leibler divergence (KL), and total variation (TV), respectively defined as

ISE(y, x) = ∫ (f_y(t) − f_x(t))^2 dt,   (2.8)

KL(y, x) = ∫ f_x(t) log( f_x(t) / f_y(t) ) dt,   (2.9)

TV(y, x) = (1/2) ∫ | f_y(t) − f_x(t) | dt.   (2.10)

ISE is an integrated L_2 distance between estimates, while TV is an integrated L_1 distance.
Note that KL (also known as relative entropy) is not a true distance, because KL(y, x) ≠ KL(x, y) in general.
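Grid-based approximations of the three density-based objectives can be sketched as follows. The kernel, the Riemann-sum quadrature, and the small constant guarding the logarithm are implementation choices of this sketch, not details from the report.

```python
import numpy as np

def kde(t, x, h):
    """Gaussian KDE as in (1.1)."""
    t = np.asarray(t, dtype=float)[:, None]
    z = (t - np.asarray(x, dtype=float)) / h
    return np.exp(-0.5 * z * z).sum(axis=1) / (len(x) * h * np.sqrt(2.0 * np.pi))

def density_objectives(y, x, h, grid):
    """Riemann-sum approximations of ISE, KL, and TV between f_y and f_x."""
    fy, fx = kde(grid, y, h), kde(grid, x, h)
    dt = grid[1] - grid[0]
    eps = 1e-300  # guard against log(0) in deep tails
    ise = np.sum((fy - fx) ** 2) * dt
    kl = np.sum(fx * np.log((fx + eps) / (fy + eps))) * dt
    tv = 0.5 * np.sum(np.abs(fy - fx)) * dt
    return ise, kl, tv
```

All three vanish when y = x, and the TV approximation stays between 0 and 1, consistent with its probability interpretation.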
Devroye and Lugosi (2001) examined different distance measures in a density estimation context and concluded that the TV distance has several theoretical advantages. In particular, it admits a probability interpretation: TV(y, x) is the supremum over Borel sets B of |P_y(B) − P_x(B)|, the maximum possible difference attainable when the probability of the same event is calculated with the two density estimates f_y and f_x.
Objectives based directly on the estimates have the advantage of being insensitive to the ordering of y and x. On the other hand, the density-based objectives are specific to the density-estimation context, and would not apply if, for example, the sharpening methods were used in a monotone regression problem.

A Likelihood Objective
A final objective function to be considered is based on the negative log-likelihood of the original data x under the sharpened density f_y. That is, define

LIK(y, x) = −Σ_{i=1}^n log f_y(x_i).   (2.11)

The LIK measure, like KL, is not symmetric in its arguments. It is justified on the basis of the principle of maximum likelihood. Using this objective, the goal of sharpening is to find the unimodal KDE that assigns greatest likelihood to the observed data.
The motivation for using likelihood as an objective is to avoid the awkward situation where the sharpened estimator puts negligible density over an observed data point. This could happen if there is an outlying observation and the estimator (using, say, an L 1 objective) pulls this outlier into the main group of points. The likelihood objective will strongly penalize such solutions 3 .
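The sketch below illustrates this penalty: pulling an outlier into the main group leaves negligible density over the outlier's observed position, and the negative log-likelihood grows accordingly. Names, data, and bandwidth are illustrative.

```python
import numpy as np

def kde(t, x, h):
    """Gaussian KDE as in (1.1)."""
    t = np.asarray(t, dtype=float)[:, None]
    z = (t - np.asarray(x, dtype=float)) / h
    return np.exp(-0.5 * z * z).sum(axis=1) / (len(x) * h * np.sqrt(2.0 * np.pi))

def lik(y, x, h):
    """Negative log-likelihood (2.11) of the observed data x under f_y."""
    return -np.sum(np.log(kde(np.asarray(x, dtype=float), y, h)))

x = [0.0, 0.1, 0.2, 5.0]            # one outlying observation at 5.0
pulled_in = [0.0, 0.1, 0.2, 0.3]    # sharpening that drags the outlier inward
```

Because f_{pulled_in}(5.0) is essentially zero, lik(pulled_in, x, h) is enormous compared with lik(x, x, h): the likelihood objective strongly disfavours abandoning the outlier.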
One potential drawback of LIK as an objective is that its unconstrained minimum value (the maximum likelihood) might be achieved at some point other than the unsharpened data x. In this case the objective of minimizing LIK(y, x) will conflict with the goal of moving y toward x, and data sharpening becomes more like fitting a normal mixture model than like performing a modification to a KDE 4 .

Visualizing the Problem
The most important aspects of the sharpening optimization problem will now be demonstrated with a small example. In this example, only the first and last of eight data points are allowed to move; the remaining points are fixed at their observed values. That is, the sharpening is to be performed by moving only the first and last points, and the problem has solutions of the form (y_1, y_8).
The unsharpened estimate for this example is symmetric with three modes, as shown in Figure 2.2. The L 1 -optimal solution is also shown on the figure. It is found by shifting each of the two moveable points slightly away from the center of the distribution. Seven of the objective functions have their minimum value at the observed data. LIK is the exception, reaching its minimum when both y 1 and y 8 are shifted slightly inward from their corresponding x values.
The L_1, L_2, and RC objectives are all convex. The sorted objective L_2^s and the four density-based objectives (ISE, TV, KL, and LIK) are not: they exhibit many local optima, ridges, and plateaus. All of the objectives are symmetric around the line y_1 = y_8, reflecting the symmetry of the original problem. From an optimization standpoint, the convexity of the first three functions is attractive, since common deterministic optimization routines require a convex objective function in order to reliably find a global optimum. Nevertheless, it would be better if the choice of objective were not motivated by optimization convenience, and in any given situation it may be desirable to use one of the non-convex objectives for theoretical or practical reasons.
Whether or not the chosen objective function is convex, the most significant difficulties in sharpening optimization will be due to the shape of the feasible set, C. In the present example, C can be considered a region in the solution space. Figure 2.4 shows the constraint region superimposed on the L_1 and L_1^s contours. The constraint region for this case is not only non-convex, it is not even contiguous. Finding the best solution on the constraint boundary is a very difficult problem, even with only two points to optimize.
The plots above each set of contours show the unsharpened KDE and the best sharpened estimates for the two objectives (the best estimates are the same for both objectives). Note that when the sorted objective is used, the objective function presents a more truthful picture of the solution space. Both of the minima on the L_1^s surface are global minima, and they both correspond to the same density estimate.
The figure also shows the performance of a sequential quadratic programming algorithm as it searches for an optimum from different starting points, which are indicated on the figure. Even so, SQP had difficulty dealing with the complex constraint region, and was not able to find the true optimum in any of the cases. In general, SQP was not successful unless the initial value was very close to the optimum.

Example 2: Multiple Solutions
A second example will show that the sharpening problem can have a multiplicity of solutions in two different senses. First, there may be more than one solution with optimal or near-optimal objective values. Second, the solution can change considerably when different objective functions are used.
The new example again has two free points, here y_1 and y_4, with the remaining points fixed. The optimal solution is not unique for either objective function. In both cases there are four optimal (y_1, y_4) pairs, corresponding to two best estimates that are mirror images of each other. The best estimates also differ between the two objective functions, suggesting the need for a principled way to decide which objective function makes the best overall choice.
All of the optimal solutions in this example lie on the constraint boundary. In larger problems, optimal solutions will often involve many points sitting just on the boundary.
This adds complexity to the optimization, since the high-dimensional constraint boundary is generally only discoverable through trial and error.

Summary
The classical formulation of constrained optimization is

minimize δ(y) subject to u_i(y) = 0, i = 1, . . . , p, and v_j(y) ≥ 0, j = 1, . . . , q.   (2.12)

The functions u_i(y) and v_j(y) define equality and inequality constraints that must be satisfied by the optimal solution.
Problems of the form (2.12) are classified according to the characteristics of the objective and constraint functions. So-called linear programming problems arise when both the objective function and the constraint are linear in the unknown variables; nonlinear programming refers to all problems where the objective function is not linear. Among nonlinear programming problems, those with quadratic objectives and linear constraints are called quadratic programming problems, while those with general convex objectives and general convex constraints are termed convex programming problems (Hillier and Lieberman 1980, ch. 19). Well-established global optimization algorithms exist for each of these problem classes.
The unimodal KDE data sharpening problem only fits into the standard form (2.12) when the mode is pre-specified, as in (2.2). Even in that case, the problem does not fall into any of the aforementioned standard categories. It has non-convex constraints and, depending on the choice of δ(y, x), it may have a non-convex objective as well. The high dimensionality of the problem (dimension n) and the potential for multiple optima only exacerbate the optimization difficulties. There are no general algorithms that can guarantee achievement of a global optimum for problems of this type. The best that can be hoped for is convergence to a local optimum.
Recent research on constrained nonparametric estimation (Hall and Kang 2005;Racine and Parmeter 2008) has focused on convex objective functions and used SQP to find solutions when the constraints are not convex. But SQP is not totally suitable for problems with non-convex constraints: it may find the global minimum, it could return a poorer local optimum, or it could even fail to converge. The outcome depends on the quality of the initial guess and the complexity of the constraint.
When a problem is not suitable for convergent deterministic optimization, it is preferable to use one of the various heuristic methods available for such problems. Heuristic approaches are usually more flexible, and therefore more capable of getting good approximate solutions across a variety of difficult problems. The remainder of this report presents a greedy heuristic that could form a part of data sharpening algorithms with greater reliability and generality than SQP.

Chapter 3 A Greedy Data Sharpening Algorithm
The term heuristic is used to describe optimization methods that are defined primarily through their search algorithms. They are usually applied when exhaustive search or standard function optimization approaches do not work.
The simplest way to develop an optimization heuristic is often the greedy approach.
Greedy algorithms break the overall optimization problem, which may be complicated, into a series of simple steps. At each step, the solution is improved as much as possible. The advantage of greedy algorithms is their simplicity, but greed in early decisions often forces the search into poor moves later on, resulting in sub-optimal solutions.
Below, a greedy approach is presented for data sharpening in the unimodal KDE problem. While the limitations of greedy methods are acknowledged, the new algorithm has one major advantage: it will always provide a solution satisfying the unimodality constraint.
One of the most challenging and problem-dependent aspects of algorithm design is constraint handling, so a simple method with guaranteed constraint-satisfaction could be a very useful component of more sophisticated heuristics in the future 1 .
The greedy method for unimodal kernel density estimation will now be described, first as a general approach, and then in two specific implementations.
1 There are many broad categories of more sophisticated heuristics. See Michalewicz and Fogel (2004) for a good review of evolutionary algorithms and many strategies for constraint handling; see Engelbrecht (2005) for particle swarm algorithms. Other approaches include simulated annealing and tabu search.

Structure of the Algorithm
The premise of the proposed algorithm is to start with an initial guess solution that is feasible, and to improve this guess by moving its points closer to the original data. Points are moved one at a time, and the algorithm may cycle through the points repeatedly as long as the solution keeps improving. A critical requirement is that feasibility must be maintained throughout. No point may be moved in a way that causes the density estimate to gain a second mode.
The sharpened solution vector, to be modified throughout the search, is denoted y. The initial solution is v, and the original ("unsharpened") data is x. The vectors y and x are assumed to be matched elementwise, and each x i , i = 1, . . . , n is called the target or home position for the corresponding y i . This terminology reflects the goal of sharpening, which is to achieve unimodality while keeping the y i as close as possible to the x i .
The solution is improved during the algorithm by moving each y i toward home. When the unimodality constraint prevents a point from moving closer to home, the point is said to be pinned. A point is considered moveable if it is neither pinned nor at home.
The flow chart in Figure 3.1 shows the structure of the algorithm. The central step is a sweep or pass through all moveable points in y. In each pass, every moveable point is moved closer to its target position. Starting with the initial guess, the algorithm sweeps through the data repeatedly. Between sweeps, the sharpened data are re-ordered to match the ordering of the original data, and the sequence in which the points will next be processed is determined. The search terminates when there are no moveable points, that is, when all sharpened points are either at home or are pinned by the constraint boundary.
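The sweep structure just described can be sketched in Python as follows. Here `move_point` stands in for the single-point line search of section 3.4, and all names are illustrative rather than taken from the improve code; x is assumed sorted in ascending order.

```python
def improve_sketch(v, x, move_point, tol=1e-6, max_passes=100):
    """Structural sketch of the greedy sweep. `move_point(y, i, target)` is a
    caller-supplied placeholder returning a new, feasibility-preserving
    coordinate for y[i] that is no farther from its target."""
    y = sorted(v)
    for _ in range(max_passes):
        # process moveable points in descending distance-from-home order
        order = sorted(range(len(y)), key=lambda i: -abs(y[i] - x[i]))
        moved = False
        for i in order:
            if abs(y[i] - x[i]) <= tol:
                continue                      # point is already home
            new = move_point(y, i, x[i])
            if abs(new - y[i]) > tol:
                y[i] = new
                moved = True
        if not moved:
            break                             # every point is home or pinned
        y = sorted(y)                         # re-match between passes
    return y
```

With an unconstrained `move_point` that jumps straight to the target, the sketch simply returns the (sorted) original data; the interesting behaviour arises when feasibility pins points short of home.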

Implementation Issues
Figure 3.1 also indicates six implementation issues that must be addressed to produce working code. These issues are discussed below, starting from the last and proceeding to the first.

Choice of Tolerance
Issue. Any numerical optimization routine will rely on a tolerance specification to determine convergence. The tolerance is an adjustable parameter in the algorithm. Its setting will influence the computation time and, more importantly, could also affect the quality of solutions found.

Resolution. A single tolerance τ will be used where necessary to control the precision of calculations. The tolerance helps to determine when points have reached home: a point y_i is considered home if

|y_i − x_i| ≤ τ.   (3.1)

It is also used to determine when the constraint is active. A point is considered pinned if shifting it toward home by a distance τ causes a constraint violation:

y_i is pinned if |y_i − x_i| > τ and shifting y_i a distance τ toward x_i makes f_y multimodal.   (3.2)

In section 3.4, two versions of the algorithm will be introduced, based on two different line search tactics: bisection and grid search. Both line searches will use the tolerance τ to assess convergence. It will be shown that insensitivity to τ is one of the main advantages of the grid search version of the algorithm.

Objective Conflict
Issue. Depending on the objective function used, it is possible that the goal of moving y closer to x may conflict with the overall goal of minimizing δ(y, x). This cannot happen with objectives of the L_α type, but it can be a problem with objective functions based on the density estimate rather than on the solution vector y.

Resolution.
A solution to the problem is to include a reduction in the objective function as part of the constraint. If v is the initial solution and y′ is the intermediate solution formed by moving y_[i], the notion of feasibility is slightly modified:

y′ is feasible if f_y′ is unimodal and δ(y′, x) ≤ δ(v, x).   (3.3)

That is, each move is accepted as long as the solution is unimodal and the objective is no greater than the starting objective value. With this restriction the algorithm will always return a solution vector that is no further from x than the initial guess and a density estimate that is no worse, in terms of the objective function, than the initial estimate. It should be reiterated that definition (3.3) results in no change when the objective is based on the L_α or RC metrics.
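The modified acceptance test is a one-line check. In this sketch, `unimodal` and `delta` are caller-supplied stand-ins for the feasibility test and the chosen objective:

```python
def accept(y_new, v, x, unimodal, delta):
    """Modified feasibility (3.3): a move is kept only if the estimate stays
    unimodal and the objective does not exceed its value at the initial
    solution v."""
    return unimodal(y_new) and delta(y_new, x) <= delta(v, x)
```

With an L_α- or RC-type `delta`, moving a point toward home can only decrease the objective, so the second condition is automatically satisfied; it only bites for density-based objectives.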

Rules for Moving a Point
Issue. The proposed algorithm is greedy in the sense that each time a single point is moved, its movement is optimized without considering other points or the future progress of the search. In principle the algorithm will work with any movement rule that causes points to be moved closer to home at each step. The particular choice of movement rule will, however, have considerable impact on the performance of the algorithm and on the details of its implementation. Consider the movement of a sharpened point from its original location y_i to some new value y′_i. For simplicity, assume that the point is moving to the right, which means y_i < x_i.

Resolution. First note that not every point is moveable. To be moveable, a point must be neither pinned nor at home. Considering equations 3.1 and 3.2, this means

y_i is moveable if |y_i − x_i| > τ and y_i can be shifted a distance τ toward x_i without violating the constraint.   (3.4)

Since a 1-D optimization problem needs to be solved for each moveable point in each pass through the data, it is important to have an efficient routine for finding good solutions. The bisection and grid search versions of the greedy algorithm resolve this problem in different ways. They will be discussed in sections 3.4.1 and 3.4.2, respectively.
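Under assumptions consistent with equations 3.1 and 3.2, the home, pinned, and moveable tests can be sketched as follows (`feasible` is a caller-supplied stand-in for the unimodality check):

```python
def at_home(yi, xi, tol):
    """(3.1): the point is within tolerance of its target position."""
    return abs(yi - xi) <= tol

def is_pinned(y, i, xi, tol, feasible):
    """(3.2): a tolerance-sized step toward home causes a constraint violation."""
    if at_home(y[i], xi, tol):
        return False          # home points are simply done, not pinned
    trial = list(y)
    trial[i] += tol if xi > y[i] else -tol
    return not feasible(trial)

def is_moveable(y, i, xi, tol, feasible):
    """(3.4): moveable means neither at home nor pinned."""
    return not at_home(y[i], xi, tol) and not is_pinned(y, i, xi, tol, feasible)
```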

Sweep Order
Issue. The order in which the points are to be moved must be determined before each pass through the data. The choice of sweep order will influence the performance of the algorithm, because moving any one point can influence the ability of other points to move. It is not sensible to sweep through the points in sequence from y_1 to y_n, as this could introduce a positional bias in the results. Another rule for sweeping must be devised.
Several possible rules could be constructed. For example, the points could be moved in random order, or starting in the tails, or starting in the center of the distribution. The rule that was found to work most reliably, however, is based on the distances between the points and their home positions.
Resolution. The standard rule is to move the points in descending order of their distance from home: the points with the greatest value of |y i − x i | are moved first. If all points start in the central part of the distribution, this rule has the effect of moving points toward the tails first, and then moving interior points that are closer to home.
When discussing the movement of individual points during a sweep, the notation y_[i] will be used to denote the moveable point i-th farthest from home; x_[i] will denote its matching element of x. This notation was also used in Figure 3.1.

Matching of Sharpened and Unsharpened Data
Issue. The idea of matching elements of y to those in x was introduced in section 2.2 as a way to make the search invariant to permutations of y. In the context of algorithm development, matching is also important to ensure that the elements of y are being moved toward sensible targets. After a single pass through the data, some points may have crossed over others, meaning that the point-to-target distances can be reduced by reordering y. For this reason a re-matching step between passes can improve performance.

Resolution.
A natural way to match points in one-dimensional problems is to arrange y so that its j-th largest element is in the same position as the j-th largest element of x; that is, match y_(j) to x_(j). If the original data starts out in sorted order, this amounts to simply sorting y between passes.2
Adding a sorting step to the algorithm introduces the possibility that the search could take backward steps or even cycle endlessly by revisiting previous arrangements of y. To prevent this, the search is limited to accepting only new rearrangements of the points: the matching step may never result in an ordering of y that was previously used.3 Implementing this restriction requires the algorithm to store a record of previous matching arrangements.

Dependence on Initial Guess
Issue. The only requirement on the starting solution v is that it must produce a unimodal kernel density estimate. The algorithm will produce a solution from any feasible starting point, but the solution will depend on the particular choice of v. The best choice of v is likely to be problem-specific, but it is desirable to have a simple rule for generating a starting solution regardless of the particular data at hand.

Resolution. A simple rule is to start all of the points at the location of the highest mode of the unsharpened estimate, which can be easily approximated by evaluating the unsharpened kernel density estimator at a grid of points. Experience has shown that this simple method works well and prevents points from getting pinned too early in the search.4 Note that throughout the algorithm, the sharpened solution maintains feasibility. A point can nonetheless become pinned by a feature such as a plateau beside the mode: moving the point would introduce a second mode by causing the plateau to become a trough.

The improve Function
The ideas of the preceding sections have been put together in a function called improve.
The function is so named because when y = improve(v, x) is called, the result y will have L α (y, x) ≤ L α (v, x). Any unimodal starting solution will be transformed into another unimodal solution with an improved objective function value 6 .
The main detail not specified in the preceding sections was the form of the line search involved in the movement of each point. In the following three sections, two versions of improve are discussed and then compared. The first version uses bisection to do the line search, and the second uses a type of grid search.

Initial Approach: Bisection
The point-moving problem depicted in Figure 3.2 can be solved using bisection. Starting with an interval known to contain the solution, the interval is repeatedly divided in half.
At each division the half not containing the solution is discarded, and the process continues until the interval is sufficiently small.
Assuming the point is moving to the right as in the figure, the search interval is [y i , x i ].
The solution to be found is a location where the point y ′ i shifts from being feasible to being infeasible. The point y i is moveable, so by definition (see equation 3.4) there must be at least one such point in the interval. The midpoint (x i +y i )/2 can be evaluated for feasibility.
If it is feasible, search focuses on the right half of the interval; if not, search focuses on the left half. This process continues until the search interval has width less than τ . Note that, because the feasible region does not need to be contiguous, there is no guarantee that the best solution will be found.
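The bisection step just described can be sketched in Python as follows. The unimodality (feasibility) check here is a numerical count of modes on an evaluation grid, which is an assumption about a detail not specified above; the sketch also handles only the rightward-moving case of the figure.

```python
import numpy as np

def unimodal(y, h, grid):
    """Feasibility check: the Gaussian KDE of y has exactly one mode
    on the evaluation grid."""
    f = np.exp(-0.5 * ((grid[:, None] - y[None, :]) / h) ** 2).sum(axis=1)
    d = np.diff(f)
    return int(np.sum((d[:-1] > 0) & (d[1:] <= 0))) == 1

def bisect_move(y, i, x_i, h, grid, tau=1e-4):
    """Move y[i] rightward toward its home x_i by bisection on
    [y[i], x_i], returning the last feasible position found."""
    lo, hi = y[i], x_i
    trial = y.copy()
    trial[i] = hi
    if unimodal(trial, h, grid):   # the point can go all the way home
        return x_i
    while hi - lo > tau:
        mid = 0.5 * (lo + hi)
        trial[i] = mid
        if unimodal(trial, h, grid):
            lo = mid               # feasible: keep searching to the right
        else:
            hi = mid               # infeasible: search to the left
    return lo
```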
A short pseudocode describing the bisection version of improve is given in Table 3.1. A more detailed pseudocode, including details of the bisection step, is shown in Table A.1 in the Appendix.
A discussion of the performance of the bisection approach is deferred to section 3.4.3, where the bisection and grid-search approaches are compared. In that section, it will be shown that using bisection gives the improve function both poor performance and high sensitivity to the tolerance τ .

Improved Approach: Shrinking Grid
Grid search is a line search method even more basic than bisection. A simple grid search could be used for the point-moving problem of Figure 3.2 as follows: test solutions y ′ i at a sequence of locations in the interval [y i , x i ], and then choose the largest y ′ i that gives a feasible density estimate.
This simplistic grid search is computationally intensive and thus not desirable for the improve function. Instead, the search is modified in two ways:
1. The grid search will proceed from y i toward x i and will terminate as soon as the first infeasible y ′ i is found. That is, the search will "step out" from the sharpened point toward home, and not all grid points will be tested.
2. The number of grid steps, S, will be modified globally throughout the search. The algorithm will start with S = 1, meaning that the search tries to move points directly to home. After a complete pass through the data, if no points were able to move, the number of steps will be doubled. This process continues, with the number of grid steps successively doubling whenever no points can be moved.
These two modifications greatly improve the efficiency of grid search. They also allow data sharpening to start with large moves first, and then to move on to smaller and smaller refinements of the estimate as the search continues. This helps to overcome the tolerance sensitivity and early-pinning problems of the bisection approach.
The nature of this shrinking-grid search is illustrated in Figure 3.4 by looking at a single point. For each of six passes through the data, the interval [y i , x i ] is shown with a grid of S steps superimposed, along with the regions of feasibility. When the point is moveable but cannot step out along the grid, the grid is made finer by doubling S, and the process repeats on subsequent passes as long as the point remains moveable. An important fact that is not properly depicted in Figure 3.4 is that the grid-doubling step happens only when none of the moveable points are able to step out at the current value of S. The algorithm tries to move points as much as possible at the current coarse grid setting, and only shrinks the grid when no more improvement can be made. This is shown more clearly in the pseudocode of Table 3.2 (and in the more detailed pseudocode found in Table A.2). Comparison with the bisection code (Table 3.1) shows that the two algorithms are structurally very similar. As long as points are moving, the grid-search algorithm proceeds in exactly the same way as the bisection algorithm; but once points stop moving, the process is interrupted to double the resolution of the grid. The ultimate termination of the algorithm does not depend on the value of S; it occurs when no more points are considered moveable based on the tolerance τ.
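A runnable sketch of the step-out search with a globally doubling S is given below, again in Python, with a grid-based unimodality count standing in for the actual constraint evaluation. The stopping rule (step size below tau) and other details are simplifying assumptions, not the exact logic of Table 3.2.

```python
import numpy as np

def unimodal(y, h, grid):
    # feasibility: the Gaussian KDE of y has exactly one mode on the grid
    f = np.exp(-0.5 * ((grid[:, None] - y[None, :]) / h) ** 2).sum(axis=1)
    d = np.diff(f)
    return int(np.sum((d[:-1] > 0) & (d[1:] <= 0))) == 1

def improve_grid(x, v, h, grid, tau=1e-3):
    """Shrinking-grid improve: step each point out from y[i] toward its
    home x[i] along an S-step grid, keeping the last feasible position.
    S doubles whenever a full pass moves no points; the search stops
    once the largest remaining step falls below tau."""
    y, S = v.copy(), 1
    while np.max(np.abs(x - y)) / S > tau:
        moved = 0
        for i in range(len(y)):
            step = (x[i] - y[i]) / S
            if step == 0.0:
                continue                 # already home
            trial, best = y.copy(), y[i]
            for s in range(1, S + 1):
                trial[i] = y[i] + s * step
                if not unimodal(trial, h, grid):
                    break                # first infeasible grid point: stop
                best = trial[i]
            if best != y[i]:
                y[i] = best
                moved += 1
        if moved == 0:
            S *= 2                       # no point could move: refine grid
    return y
```

Note that, as in the text, the grid is refined only after a complete pass in which no point moves.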

Comparing the Two Approaches
Performance of the two versions of improve was evaluated using a test case from Braun and Hall (2001). The true density in this example is a mixture of N(−1, 0.6²), N(1, 2.5²), and N(5, 1.5²) densities, with mixing proportions 0.35, 0.5, and 0.15, respectively. This density was constructed to be unimodal, but with a long, flat shoulder so that kernel density estimates will usually be multimodal. The density is shown in Figure 3.5. Braun and Hall (2001) performed data sharpening on this example, with a sample size of n = 25, using an L 1.375 (x, y) objective function and a Gaussian kernel density estimator.
They used an off-the-shelf sequential quadratic programming routine to find the solution (with a separate optimization step to determine the best mode location), and found that a bandwidth of approximately h = 0.6 was optimal in terms of mean integrated squared error (MISE) between the sharpened estimate and the truth. The MISE at this optimal bandwidth was about 0.0155.
To permit comparison with these earlier results, the bandwidth was fixed at 0.6 throughout the present simulations. One thousand samples were drawn from the test distribution, and both versions of the improve algorithm were run on each sample at seven different tolerance levels: τ = 10^−k, k = 2, 3, . . . , 8. The tolerance sensitivity of bisection visible in Figure 3.6 is actually flattering to the bisection method because of averaging across the 1000 simulation runs. Inspection of individual runs shows that the bisection method does not actually converge, in the sense that the resulting density estimates do not stop changing as the tolerance is decreased.
Table 3.3: Convergence of solutions.
The convergence problem is further illustrated in Table 3.3. The shrinking-grid approach avoids the problem of over-searching by moving all points at a coarse level first. The search only works hard to get close to the constraint boundary later in the search, when the points are already close to their final positions. In this sense the grid search is less greedy than the bisection search.
Based on these results, the shrinking-grid version of improve will be taken as the default function in subsequent discussion and future work.
Figure 3.7: Variation among solutions at different tolerance levels for a typical simulation run. Top: using bisection, there is high variability in the density estimates as the tolerance is changed. Bottom: using grid search, the effect of tolerance is barely discernible. In both plots, the points show the unsharpened data.

Performance Evaluations
The relative merits of sequential quadratic programming and the new improve function were explored through simulation, using two challenging unimodal KDE problems.

Study Design
The premise of the study was to generate multiple data vectors from test cases and then to perform data sharpening on each data set using different combinations of bandwidth, optimization method, and objective function. All estimation was done using the Gaussian KDE.
A test case is defined by a true density and a sample size. Due to time constraints, only two densities and a single sample size were used.
True density. Density 1 is the t distribution with three degrees of freedom, and density 2 is the three-component normal mixture introduced in section 3.4.3 (Figure 3.5) 1 .
Sample size. The sample size was n = 25. The original intent was to include sample size 100 in the design, but as discussed in the following section, optimization by SQP was prohibitively slow at the larger sample size.
Two hundred random samples of size 25 were generated from each of the true distributions. To avoid trivial cases where the unsharpened estimate is already unimodal, a rejection step was included when generating samples. Any sample producing a unimodal estimate (using the normal-scale bandwidth selector) was replaced with a new sample until a multimodal estimate was obtained.
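The rejection step can be sketched as follows (Python for illustration; the 1.06·s·n^(−1/5) form of the normal-scale bandwidth and the grid-based mode count are assumptions about details not specified above):

```python
import numpy as np

rng = np.random.default_rng(20250101)

def n_modes(y, h, grid):
    # number of local maxima of the Gaussian KDE on the grid
    f = np.exp(-0.5 * ((grid[:, None] - y[None, :]) / h) ** 2).sum(axis=1)
    d = np.diff(f)
    return int(np.sum((d[:-1] > 0) & (d[1:] <= 0)))

def normal_scale_h(y):
    # normal-scale (rule-of-thumb) bandwidth
    return 1.06 * y.std(ddof=1) * len(y) ** (-0.2)

def draw_nontrivial_sample(n=25):
    """Draw t_3 samples, rejecting any whose unsharpened KDE at the
    normal-scale bandwidth is already unimodal."""
    while True:
        s = rng.standard_t(3, size=n)
        grid = np.linspace(s.min() - 3.0, s.max() + 3.0, 800)
        if n_modes(s, normal_scale_h(s), grid) > 1:
            return s
```

Every sample returned by this procedure is guaranteed, by construction, to give a multimodal unsharpened estimate.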
For each of the data vectors, sharpening was performed using two different bandwidths, three different optimization methods, and up to four different objective functions.
Bandwidth. Let h N be the bandwidth selected by the normal-scale rule for a particular data set. Optimization was performed using bandwidths h N and 0.5h N .
Optimization method. The three methods were: Greedy. The greedy algorithm, using the improve function.
SQP. Sequential quadratic programming using the unsharpened data x as a starting value, with outliers corrected 2 .
Combined. SQP using the greedy solution as its starting point.
Objective function. The RC, RC s , TV, and LIK objectives were used.
For each x, sharpening was done for the 18 combinations of these three factors shown in Table 4.1, and the sharpened data set was recorded along with the run time for later analysis.
The greedy method was not run with all four objective functions, since the objective function will play no role in the search when improve is started from a naïve guess.
The simulation consisted of 7200 optimizations (200 samples from each of two distributions, with 18 optimizations per sample). In the analysis to follow, appropriate subsets of the simulation runs will be used to answer questions of interest.
Remark 1. Sequential quadratic programming was implemented using the fmincon function in version 3.1.1 of the MATLAB optimization toolbox. SQP requires the mode to be pre-specified, so the optimizer was run at 20 candidate mode points evenly spaced between the first and third quartiles of the data. The solution with the best objective function value was returned. The constraint was enforced at 100 evenly-spaced points covering the range of the data.
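A rough analogue of the setup in Remark 1 for a single candidate mode location, using SciPy's SLSQP in place of fmincon's SQP algorithm and a squared-distance objective standing in for the study's objectives, might look like the sketch below. All names and choices here are illustrative assumptions, not the actual MATLAB implementation.

```python
import numpy as np
from scipy.optimize import minimize

def kde_deriv(t, y, h):
    # derivative (up to a positive constant) of the Gaussian KDE at t
    z = (t[:, None] - y[None, :]) / h
    return (-z * np.exp(-0.5 * z * z)).sum(axis=1)

def sharpen_sqp(x, h, mode, n_con=100):
    """One constrained run at a pre-specified mode location: minimize
    the squared distance to x subject to the KDE being nondecreasing
    left of the mode and nonincreasing right of it, with the
    constraint enforced at n_con evenly-spaced points."""
    t = np.linspace(x.min(), x.max(), n_con)
    sign = np.where(t <= mode, 1.0, -1.0)
    cons = {"type": "ineq", "fun": lambda y: sign * kde_deriv(t, y, h)}
    res = minimize(lambda y: np.sum((y - x) ** 2), x,
                   method="SLSQP", constraints=[cons],
                   options={"maxiter": 200})
    return res.x
```

The full procedure would repeat such a run at 20 candidate mode points between the quartiles and keep the solution with the best objective value.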
Remark 2. It was originally intended to include an annealing algorithm such as those in Braun and Hall (2001) and Hall and Kang (2005) as a third optimizer, but attempts to produce a sufficiently automatic version of the algorithm were not successful 3 .

Convergence, Ease of Use, and Run time
The greedy algorithm, by design, has guaranteed convergence 4 , and requires minimal tuning regardless of problem type or starting point. In contrast, getting SQP to work reliably across simulation runs proved to be problematic. It is worthwhile to discuss SQP implementation challenges before reviewing the simulation results proper.
To be useful and fair in a simulation, the SQP code should perform as well as possible, run quickly, and require no case-specific user intervention. These goals naturally conflict to some extent. Reconciling them required judicious choices of starting solutions and iteration limits.
Performance of SQP was very sensitive to the starting solution. The algorithm would often fail to converge when using the unsharpened data as the initial solution. Other attempted starting points (all points at the unsharpened mode, evenly-spaced points, etc.) also performed poorly.
Eventually it was observed that convergence problems were often associated with outliers. The best way to promote convergence was to use x as the starting solution, but with outlying observations shifted toward the main body of points. This was achieved either by simply shifting the outliers inward, or by using the greedy solution as a starting point.
When SQP did converge, run times were usually reasonable (a few minutes or less); conversely, when it failed to converge, run times could be unacceptably long (hours). It was ultimately discovered that this problem was due to an inability to find feasible solutions.
As its name implies, SQP involves the solution of a series of quadratic programming (QP) subproblems. At each subproblem, there is a search for a feasible solution. When no such solution can be found, the search will run the maximum allowed time, only to fail at the end.
To alleviate this problem, the QP subroutine was modified to reduce the maximum time allowed for the feasibility search. With these adjustments in place, the main findings on convergence and run time were as follows:
• The greedy method always gives a feasible solution, thus its convergence proportion is always one.
• Greedy search also runs much more quickly than SQP in all cases. Median run times were less than 0.5 seconds for improve, and ranged from several seconds to about one and a half minutes for SQP.
• In terms of convergence, the t 3 problem is harder than the mixture density problem, and problems with smaller bandwidth are harder than those with larger bandwidth. This is related to the prevalence of outliers and the difficulty of finding feasible solutions. For the t 3 problem with bandwidth 0.5h N , the majority of SQP attempts failed to converge.
• As measured by run time, problem difficulty for a chosen objective function is dominated by bandwidth. Run time differences between different objective functions are hard to compare because the objectives themselves have different computational costs (TV and LIK being much more expensive to evaluate than RC).
• The proportion of converged runs is almost the same for SQP and for the combined method. Using the greedy solution as the starting point in SQP does not seem to improve the chance of converging.
• Supplying the greedy solution as the start value in SQP does reduce run time, however.
Average run time was lower for the combined method in almost all cases.
• SQP was most likely to converge when using the TV objective. It is not clear why this is true. Total variation runs also took the longest time, because of the computational cost involved with calculating TV(y, x).

Solution Quality
In the preceding results, the greedy method shows several advantages over SQP. It always gives a useable result, it does so quickly, and it is insensitive to the details of the problem or the data. Not surprisingly, these advantages come with the trade-off that the solutions are usually inferior to the results of a converged SQP run. A comparison restricted to the runs where SQP converged shows both the median RC(y, x) value and the proportion of times each method found the better solution. As with run time, the median is shown rather than the average because the distributions of objective values were right-skewed.
When the unsharpened estimate is nearly unimodal (cases 3, 5, and 8 in the figure), the two methods agree closely. If the unsharpened data has more strongly separated modes or more outliers (cases 1, 2, 6, 7, and 9), the SQP estimate tends to reach somewhat farther into the tails, while the greedy solution may retain higher fidelity near the unsharpened mode. Case 4 shows a result very similar to the small-scale example of Figure 2.5: the unsharpened estimate is bimodal and nearly symmetric, and the two competing sharpened estimates are nearly mirror images of one another.
While it does not outperform SQP on its own, the greedy method can be used to further improve SQP performance. Table 4.5 compares the performance of the combined method with that of SQP alone. In all but two cases, the median objective value was either the same or lower for the combined method than for SQP with the default starting value. In all but one case, the combined method achieved a better solution than SQP in the majority of replications.

Comparison of Different Objective Functions
A final topic of interest in the simulation is comparing the characteristics of the different objective functions (using SQP as the optimizer). It has already been seen that the objectives differ in terms of computation time, and that TV has an advantage in terms of promoting convergence, but it would be beneficial to have a comparison of the statistical properties of the estimators arising from each objective. Inspection of estimates from individual simulation runs suggests no major or systematic difference among them. Another means of comparison is a plot of MSE versus position for each problem/bandwidth combination. These plots (not shown here) did not reveal large or consistent differences between objectives. In particular, the MSE differences in the tails were not large enough to have practical significance.
The preliminary investigation of different objectives here is not deep enough to be conclusive. Improved study of the different objectives would be easier after the optimization step has been sufficiently improved. At that point it would be more appropriate to design a systematic study of the effect of these objective functions.

Discussion
Data sharpening has the potential to become a general approach to constraint handling in nonparametric estimation. One of the major barriers to the realization of this potential is optimization. The optimization difficulties have been explained at length in the preceding chapters, using unimodal kernel density estimation as a representative problem.
The greedy algorithm developed in Chapter 3, implemented in the improve function, is a tool to help improve data sharpening optimization. The following sections review the characteristics of the greedy method, discuss how it might be used, and describe the next stages in this line of research.

Advantages and Disadvantages of the Greedy Algorithm
The greedy method has several desirable properties. It is fast; it is guaranteed to produce a feasible solution; its only adjustable parameter is a tolerance, to which the end result is not sensitive; and the algorithm's design is not specific to a particular problem or constraint.
As a result of these properties, the improve function is reliable and easy to use.
When applied to unimodal kernel density estimation with its default starting solution, improve tends to return solutions that match the unsharpened estimate at points away from constraint violations. Modifications to the data are typically restricted to the troughs and tails of the unsharpened density estimate (a desirable property, as discussed in Hall and Kang 2005).
The limitations of the greedy method follow directly from its design, and are typical of greedy algorithms in general. While a feasible solution is guaranteed, the solution will normally be sub-optimal; if other solution methods (like SQP) are able to converge, they will often perform better. In addition, the greedy method does not directly consider the chosen objective function; it simply moves the sharpened points toward their targets at each pass through the data. This movement rule hard-wires the algorithm to distance measures like L α and RC, for which the objective always decreases as y moves closer to x. There is no natural way to adapt improve to objectives not having this property, such as TV and LIK.

Applications
The performance limitations of improve limit its applicability as a stand-alone data sharpening method. Nevertheless, its advantages suggest it could still be useful in at least three ways.
1. As a starting value for SQP. The simulations of the previous chapter suggest that using the greedy solution as the starting point for SQP improves both run time and solution quality. Since the greedy algorithm is much faster than SQP, this strategy provides an essentially free gain for SQP. The fact that such a gain is possible, however, reflects the inadequacy of SQP as an optimizer in this context. Furthermore, the improved starting point does not improve the likelihood of convergence, so SQP remains severely limited in sufficiently challenging sharpening problems.
2. As a repair method in a more advanced heuristic. One general strategy for constraint handling in heuristic optimization is to use a "repair" operator to convert infeasible solutions into feasible ones. A promising application of improve is as such an operator in a more sophisticated algorithm. If y F and y I are feasible and infeasible solutions, respectively, then y R = improve(y F , y I ) will return a feasible repaired solution. The repair is performed by treating y I as the unsharpened target, and moving the initial solution y F toward it.
3. As an optimizer in other constrained estimation scenarios. The greedy method (either by itself or as part of a larger heuristic) could ultimately be applied to problems beyond the unimodal KDE setting. The first such expansion could be to density estimation with other shape constraints, like k-modality, symmetry, or bell shape 1 . Another level of expansion could be to higher dimensions, starting with unimodal bivariate kernel density estimation. Finally, the methods begun here could be moved to different nonparametric estimators altogether, to tackle problems like shape-constrained regression.
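As an illustration of the repair idea in item 2, a stripped-down operator in the spirit of y R = improve(y F , y I ) might look like the following Python sketch. The grid-based unimodality check and the fixed step count S are simplifications of the full shrinking-grid search, assumed here for brevity.

```python
import numpy as np

def unimodal(y, h, grid):
    # feasibility: the Gaussian KDE of y has exactly one mode on the grid
    f = np.exp(-0.5 * ((grid[:, None] - y[None, :]) / h) ** 2).sum(axis=1)
    d = np.diff(f)
    return int(np.sum((d[:-1] > 0) & (d[1:] <= 0))) == 1

def repair(y_feas, y_infeas, h, grid, S=64):
    """Move a feasible solution toward an infeasible candidate,
    keeping the last feasible position of each coordinate.  The
    infeasible candidate plays the role of the unsharpened target."""
    y = y_feas.copy()
    for i in range(len(y)):
        step = (y_infeas[i] - y[i]) / S
        trial = y.copy()
        for s in range(1, S + 1):
            trial[i] = y_feas[i] + s * step
            if not unimodal(trial, h, grid):
                break            # crossed the feasibility boundary
            y[i] = trial[i]      # accept the last feasible position
    return y
```

Because each coordinate is only updated when the trial remains feasible, the returned solution is feasible whenever the starting solution is.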

Future Work
The possibilities for further research can be divided into three mutually-supporting categories: work toward applications, theoretical work, and work on the statistical properties of sharpened estimators.

Work Toward Applications
This branch of work is focused on further improvements in optimization, using the improve function as a part of a larger optimization heuristic. Existing classes of heuristic that could be suitable include genetic algorithms, particle swarm approaches, and simulated annealing 2 .
As mentioned above, the greedy algorithm could be incorporated in one of these standard methods as a way to handle constraints. Given the importance of the constraints in data sharpening, however, it is possible that best results will be achieved by a built-to-suit hybrid of ideas from various heuristic approaches. Preliminary work to this end suggests that a promising approach is to maintain a population of solutions while repeatedly perturbing them, repairing them, and discarding the worst candidates.
Many application possibilities open up if a robust optimizer with good performance can be developed around the greedy algorithm. For example, the optimizer could be used to try unimodal estimation using adaptive kernels (optimize the bandwidths individually per point) or by adjusting points' weights (as in Racine and Parmeter 2008). This could potentially be done by setting the target to a uniform vector of h in the first case or 1/n in the second. Beyond this, the optimizer could be applied to a variety of constraints or estimators, as mentioned in the previous subsection.
Adaptation of the greedy method to higher dimensions is reasonably straightforward.
The movement rules have natural multi-dimensional analogs. The primary sources of difficulty in a higher-dimensional greedy algorithm will be in the matching step and in constraint checking. Defining how unsharpened points should be matched to sharpened ones is much harder in higher dimensions than it is on the line. The problem of constraint checking is more computational: as dimension increases, constraint-checking over a grid of points becomes increasingly burdensome.

Theoretical Work
The heuristic nature of the present developments makes it difficult to establish properties of the method theoretically. Still, some theoretical considerations would be useful to develop further in the future.
The computational complexity of the greedy algorithm (or its subsequent heuristics) is one area where it should be possible to make progress. It would be beneficial to be able to make statements about how the algorithm will scale with sample size, for example, without having to rely wholly on simulation studies.
Another area where theoretical insights would be welcome is the choice of objective function. Without input from theory, only extensive simulations can guide the choice of objective. For example, does the likelihood objective-which amounts to fitting a normal mixture to the data-have any advantageous properties? Or can any of the objectives be shown to encourage desirable tail behaviour in terms of MISE or MSE?

Statistical Properties
One very important area for future work is in the development of confidence bands for sharpened estimators. It is felt that resampling methods should be suitable to generate bands, but how exactly to do this has not been explored.
Another development that would be very welcome is a test of the validity of the constraint. This would be particularly valuable if it could be developed in a way that matched the generality of the data sharpening approach. The combination of data sharpening and a test for constraint validity would be a very powerful data analysis tool.
There are, in addition to intervals and tests, other statistical considerations commonly addressed in kernel density estimation. These include bandwidth selection, asymptotic performance, and boundary issues. Bandwidth selection in particular can always be done by an empirical method such as cross-validation. As for the other considerations, the hope thus far is that by using standard estimators and only perturbing the data, the favourable properties of the estimator will be retained.