CEoptim: Cross-Entropy R Package for Optimization

The cross-entropy (CE) method is simple and versatile technique for optimization, based on Kullback-Leibler (or cross-entropy) minimization. The method can be applied to a wide range of optimization tasks, including continuous, discrete, mixed and constrained optimization problems. The new package CEoptim provides the R implementation of the CE method for optimization. We describe the general CE methodology for optimization and well as some useful modifications. The usage and efficacy of CEoptim is demonstrated through a variety of optimization examples, including model fitting, combinatorial optimization, and maximum likelihood estimation.


Introduction
The cross-entropy (CE) method originates from an adaptive variance minimization algorithm in Rubinstein (1997) for the estimation rare event probabilities in stochastic networks. It was realized in Rubinstein (1999) that many optimization problems could be converted into a rare-event estimation problems, providing a rare-event based approach to optimization, where a sequence of probability densities is generated that converges to a degenerate density that concentrates its mass close to the optimizer.
Generally, the CE method involves two iterative phases: 1. Generation of a set of random samples (vectors, trajectories, etc.) according to a specified parameterized model.
2. Updating of the model parameters, based on the best samples generated in the previous step. This is done by Kullback-Leibler (also called cross-entropy) minimization.
Since the appearance of the CE monograph (Rubinstein and Kroese 2004) and the tutorial (De Boer et al. 2005), the CE method has continued to develop and has been successfully applied to a great variety of difficult optimization problems, including motion planning in robotic systems (Kobilarov 2012), electricity network generation, (Kothari and Kroese 2009), control of infectious diseases (Sani and Kroese 2008), buffer allocation (Alon et al. 2005), Laguerre tessellation (Duan et al. 2014), and network reliability (Kroese et al. 2007). An extensive list of recent work can be found in (Botev et al. 2013). Websites that provide MATLAB code include www.cemethod.org and www.montecarlohandbook.org. Since R has become an essential tool for statistical computation, it is useful to provide an accessible implementation of the CE method for R users, similar to R packages for simulated annealing (Xiang et al. 2013), evolutionary methods (Mullen et al. 2011), and particle swarm optimization methods (Bendtsen 2012).
Some advantages of the CE method are: • The CE method is a global optimization method which is particularly useful when the objective function has many local optima.
• The CE method can be used to solve continuous, discrete, and mixed optimization problems, which may also include constraints.
• The CE code is extremely compact and is readily written in native R, making further development and modifications easy to implement.
• The CE method is based on rigorous mathematical and statistical principles.
Our aim is not to replace the standard optimization solvers such as optim and nlm but to provide a viable alternative in cases where standard gradient or simplex-based solvers are not applicable (e.g., when the optimization problem contains both discrete and continuous variables) or are expected to do poorly (e.g., when there are many local optima).
The rest of this paper is organized as follows. In Section 2, we sketch the general theory behind the CE method, which leads to the basic CE algorithm. In Section 3, we describe a variety of optimization scenarios, including continuous, discrete and constrained mixed problems, to which CE can be applied effectively. The description and usage of the CEoptim package are given in Section 4. Section 5 demonstrates the capability of the package through a range of numerical examples. In the final section we make concluding remarks for CEoptim.

CE method for optimization
Let X be an arbitrary set of states and let S be a real-valued performance function on X . Suppose the goal is to find the minimum of S over X , and the corresponding minimizer x * (assuming, for simplicity, that there is only one). Denote the minimum by γ * , so that The CE methodology for optimization is adapted from the CE methodology for rare event estimation in the following way. Associate with the above problem (1) the estimation of the probability = P(S(X) γ), where X has some probability density f (x; u) on X (for example corresponding to the uniform distribution on X ) depending on a parameter u and a level γ. Thus, for optimization problems randomness is purposely introduced in order to make the model stochastic. If γ is chosen close to the unknown γ * , then is typically a rare-event probability. One of the most effective ways to estimate rare-event probabilities is to use importance sampling. In particular, to estimate = P(S(X) γ) one can use the importance sampling estimator where X 1 , . . . , X N are iid samples from a well-chosen importance sampling density g. The optimal importance sampling density is in this case g * (x) = f (x)I{S(x) γ}/ , which gives a zero-variance estimator, but depends on the unknown quantity . The main idea behind the CE method for estimation is to adaptively determine an importance sampling pdf f (x; v * )hence within the same family as the original distribution -that is close to g * in Kullback-Leibler sense. Specifically, a parameter v * is sought that minimizes the cross-entropy distance This is equivalent to maximizing, with respect to v, which in turn can be estimated by maximizing the sample average where X 1 , . . . , X N is an iid sample from f (x; u). This is, in essence, maximum likelihood estimation. In particular, (2) gives the maximum likelihood estimator of v based on only the samples X 1 , . . . , X N that have a function value less than or equal to γ. These are the so-called elite samples.
The relevance to optimization is that when γ is close to the (usually unknown) minimum γ * , then the importance sampling density g * concentrates most of its mass in the vicinity of the minimizer x * . Sampling from such a distribution thus produces optimal or near-optimal states. The CE method for optimization produces a sequence of levels (γ t ) and reference parameters (v t ) determined from (2) such that the former tends to the optimal γ * and the latter to the optimal reference vector v * , where f (x; v * ) corresponds to the point mass at x * ; see, e.g., (Rubinstein and Kroese 2008, Page 251).
The generic steps for CE optimization are specified in Algorithm 1.

Algorithm 1 Generic CE algorithm
Input: Initial parameter vector v 0 . Sample size N . Rarity parameter . Output: Sequence of levels (γ t ) T t=1 and parameters (v t ) T t=1 . 1: Let N e = N (number of elite samples) and set t = 1 (level counter). 2: while the sampling distribution is not degenerate do 3: Generate X 1 , . . . , X N ∼ iid f (·; v t−1 ). Calculate the performances S(X i ) for all i, and order them from smallest to largest: S (1) . . . S (N ) . Let γ t be the sample -quantile of performances; that is, γ t = S (N e ) .

4:
Use the same sample X 1 , . . . , X N and solve the stochastic program Denote the solution by v t . Increase t by 1.

5: end while
To run the algorithm, one needs to provide the class of sampling densities {f (·; v)}, the initial vector v 0 , the sample size N , the rarity parameter , and the stopping criterion. It is prudent to keep track of the overall best function value and corresponding state, and report these at the end of the algorithm as the optimal value and optimizer, respectively. The progression of level parameter γ t gives an indication how well the algorithm converges.
As (3) is simply a maximum likelihood estimation step involving only the elite samples, it is possible to derive easy parameter updates for standard sampling distributions. The following two special cases are of particular importance.
1. Multivariate normal distribution. Suppose each X is sampled from an n-dimensional multivariate normal distribution with independent components. The parameter vector v in the CE algorithm can be taken as the 2n-dimensional vector of means and standard deviations. In each iteration these means and standard deviations are updated according to the sample mean and sample standard deviation of the elite samples.

Multivariate Bernoulli distribution.
Suppose each X is sampled from an n-dimensional Bernoulli distribution with independent components. The parameter vector v in the CE algorithm can be taken as the n-dimensional vector of success probabilities. In each iteration the ith success probability is updated according to the mean number of successes (1s) at the ith position of the elite samples.

Remark 1 (Parameter Smoothing)
Various modifications of the basic CE algorithm have been proposed in recent years. One such is modification is parameter smoothing, where at the tth iteration the sampling parameter is updated via where v t is the solution to (3) and 0 α 1 is a fixed smoothing parameter.
Smoothed updating can prevent the sampling distribution from converging too quickly to a sub-optimal degenerate distribution. This is especially relevant for the multivariate Bernoulli case where, once a success probability reaches 0 or 1, it can no longer change.
It is also possible to use different smoothing parameters for different components of the parameter vector (e.g., the means and the variances).
Remark 2 (Choice of sampling densities) Although sampling distributions with independent components are the most convenient to use in a CE implementation, it is sometimes advantageous consider more complex sampling models, such as mixture models. In this case the updating of parameters (maximum likelihood estimation) may no longer be trivial, but one can instead employ fast methods such as the EM algorithm to determine the parameter updates.
Remark 3 (Choice of the CE parameters) The CE method is fairly robust with respect to the choice of the parameters. The rarity parameter is typically chosen between 0.01 and 0.1. The number of elite samples N e = N should be large enough to obtain a reliable parameter update in (3). For example, if the dimension of v is d, the number of elites should be in the order of 10 d or higher.

Optimization scenarios
In this section we consider a number optimization scenarios to which CEoptim could be applied.

Continuous optimization
Consider a continuous optimization problem with state space X = R n . The sampling distribution on R n can be quite arbitrary and does not need to be related to the objective function S. Usually, the random vector X = (X 1 , . . . , X n ) ∈ R n is generated from a Gaussian distribution with independent components, characterized by a vector µ of means and a vector σ of standard deviations. At each iteration of the CE method, these vectors of parameters are updated as the means and standard deviation of the elite samples. During the course of the algorithm a sequence of (µ t ) and (σ t ) are generated, such that µ t tends to the optimizer x * , while the vector of standard deviations tends to the zero vector. At the end of the algorithm one should obtain a degenerated probability density with mean µ T approximately equal to the optimizer x * and all standard deviations close to 0. A possible stopping criterion is to stop when all components in σ T are smaller than some ε. This scheme is referred to as normal updating.
CEoptim implements the normal updating scheme for continuous optimization.

Discrete optimization
If the state space X is finite, the optimization problem is often referred to as a discrete or combinatorial optimization problem, where X could be the space of combinatorial objects, such as binary vectors, trees, graphs, etc. To apply the CE method to a discrete optimization problem, one needs a convenient parameterized random mechanism to generate samples.
For discrete optimization CEoptim implements sampling from state spaces X of the form {0, 1, . . . , c 1 − 1} × · · · × {0, 1, . . . , c n − 1}, where the {c i } are strictly positive integers. The components of the random vector X = (X 1 , . . . , X n ) ∈ X are taken to be independent, so that its distribution is determined by a sequence of probability vectors p 1 , . . . , p n , with the jth component of p i corresponding to p ij = P(X i = j). For a given elite sample set E of size N e , the CE updating formulas for these probabilities are where I denotes the indicator function. Hence, at each iteration, probability p ij is updated simply as the average number of times that the ith component of the elite vectors is equal to j. A possible stopping rule for a discrete optimization problem is to stop when the overall best objective value does not change over a number of iterations. Alternatively, one could stop when the sampling distribution has degenerated sufficiently; for example, when all {p ij } are no further than ε away from either 0 or 1.

Constrained optimization
The general optimization problem (1) also covers constrained optimization, where the search space X could, for example, be defined by a system of inequalities: One way to deal with constraints is to use acceptance-rejection: generate a random vector X on a simple search space that contains X , and accept or reject it based on whether the sample falls in X or not. Alternatively, one could try to sample directly from a truncated distribution on X , e.g., using Gibbs sampling.
CEoptim implements linear constraints for continuous optimization of the form Ax b, where A is a matrix and b a vector. The program will use either acceptance-rejection or Gibbs sampling to sample from the multivariate normal distribution truncated to the constraint set.
A second approach to handle constraints is to introduce a penalty function. For example, for the constraints (6), the objective function could be modified to where H i < 0 measures the importance of the ith penalty. To use the penalty approach with CEoptim the user simply needs to modify the objective function according to (7). The choice of the penalty constants {H i } is problem specific and may need to be determined by trial and error.

CEoptim description
In this section we describe how to use CEoptim.
The CEoptim function is the main function of the package CEoptim. It can be used to solve continuous and discrete optimization problems as well as mixtures thereof.

Usage
CEoptim ( continuous List of arguments for the continuous optimization part, consisting of: mean Vector of initial means.
sd Vector of initial standard deviations. -smoothMean Smoothing parameter for the vector of means. Default value 1 (no smoothing). -smoothSd Smoothing parameter for the standard deviations. Default value 1 (no smoothing).
-sdThr Positive numeric convergence threshold. Check whether the maximum standard deviation is smaller than sdThr. Default value 0.001.
-conVec Value vector of linear constraint linear constraint conMat x conVec.
discrete List of arguments for the discrete optimization part, consisting of: categories Integer vector which defines the allowed values of the categorical variables.
The ith categorical variable takes values in the set {0, 1, . . . , categories(i) − 1}. probs List of initial probabilities for the categorical variables. Defaults to equal (uniform) probabilities.
-smoothProb Smoothing parameter for the probabilities of the categorical sampling distribution. Default value 1 (no smoothing).
-probThr Positive numeric convergence threshold. Check whether all probabilities in the categorical sampling distributions deviate less than probThr from either 0 or 1. Default value 0.001.

N
Integer representing the CE sample size.
rho Value between 0 and 1 representing the elite proportion.
iterThr Termination threshold on the largest number of iterations.
noImproveThr Termination threshold on the largest number of iterations during which no improvement of the best function value is found.
verbose Logical value set for CE progress output.

Value
CEoptim returns a list with the following components. optimum Optimal value of f.
optimizer List of the location of optimal value, consisting of: continuous Continuous part of the optimizer.
discrete Discrete part of the optimizer.
termination List of termination information consisting of: niter Total number of iterations upon termination.
convergence One of the following termination statements: • Not converged, if the number of iterations reaches iterThr; • The optimum did not change for noImproveThr iterations, if the best value has not improved for noImproveThr iterations; • Variances converged, otherwise.

states
List of intermediate results computed at each iteration. It consists of the iteration number (iter), the best overall value (optimum) and the worst value of the elite samples, (gammat). The means (mean) and maximum standard deviation (maxSd) of the elite set are also included for continuous cases, and the maximum deviations (maxProbs) of the sampling probabilities to either 0 or 1 are included for discrete cases.

states.probs
List of categorical sampling probabilities computed at each iteration. Will only be returned for discrete and mixed cases.

Note
• Although partial parameter passing is allowed outside lists, it is recommended that parameters names are specified in full. Parameters inside lists have to specified completely.
• Because CEoptim is a random function it is useful to (1) set the seed for the random number generator (for testing purposes), and (2) investigate the quality of the results by repeating the optimization a number of times.

Numerical examples
The following examples illustrate the use, flexibility, and efficacy of the CEoptim function from the package CEoptim.

Maximizing the peaks function
Suppose we wish to maximize MATLAB's well-known peaks function, given by To solve the problem with CEoptim, using normal updating, we must specify the vector of initial means µ 0 and standard deviations σ 0 of the 2-dimensional Gaussian sampling distribution. The initial sampling distribution should cover, roughly, the region where the maximizer is thought to lie. As an example we take µ 0 = (−3, −3) and σ 0 = (10, 10  (1234) # for verification purpose only R> mu0 <-c(-3,-3); sigma0 <-c(10,10) R> res <-CEoptim(fun, maximize=T, continuous=list(mean=mu0,sd=sigma0)) R> res The output of this implementation is as below: Optimizer for continuous part: -0.009390034 1.581405 Optimum: 8.106214 Number of iterations: 7 Convergence: Variance converged The reader may check that optim applied to the minimization of −f can easily find the wrong optimizer, e.g., when the starting value is (0, 0).

Non-linear regression
We next consider a more complicated optimization task, involving data generated from the well-known FitzHugh-Nagumo differential equations: which model the behavior of certain types of neurons (Nagumo et al. 1962). Ramsay et al. (2007) consider estimating the parameters a, b, and c from noisy observations of (V t ) by using a generalized smoothing approach. The simulated data in Figure 2 (saved as data(FitzHugh)) correspond to the values of V t obtained from (9) at times 0, 0.05, . . . , 20.0, adding Gaussian noise with standard deviation 0.5. That is, we use the non-linear regression model where the {ε i } are iid with a N(0, σ 2 ) distribution, V 0.05i (x) is the solution to (9) for time t = 0.05i, and x = (a, b, c, V 0 , R 0 ) is the vector of parameters. The true parameter values are here a = 0.2, b = 0.2, and c = 3. The initial conditions are V 0 = −1 and R 0 = 1.
Estimation of the parameters via the CE method can be established by minimizing the leastsquares performance where the {y i } are the simulated data from the model (10). Note that we assume that also the initial conditions are unknown.
We use the deSolve package to numerically solve the FitzHugh-Nagumo differential equations (9). Hereto, we first define the function FN.
To illustrate how the sampling distributions change during the CE process, we have plotted in Figure 3 the evolution of the sampling pdf for the first parameter a, from the 15th to the final iteration. As can be seen from the figure, the sampling distribution converges to a point distribution around the optimal value for a.

Max-cut problem
The max-cut problem in graph theory can be formulated as follows. Given a weighted graph (V, E) with node set V = {1, . . . , n} and edge set E, partition the nodes of the graph into two subsets V 1 and V 2 such that the sum of the (nonnegative) weights of the edges going from one subset to the other is maximized. Let C = (C ij ) be the matrix of weights. The objective is to maximize over all cuts {V 1 , V 2 }. Such a cut can be conveniently represented by a binary cut vector x = (x 1 , x 2 , . . . , x n ), where x i = 1 indicates that i ∈ V 1 . Let X be the set of cut vectors and let S(x) be the value of the cut represented by x, as given in (12).
To maximize S via the CE method one can generate the random cut vectors by drawing each component (except the first one, which is set to 1) independently from a Bernoulli distribution, that is, X = (X 1 , X 2 , . . . , X n ) ∼ Ber(p), where p = (1, p 2 , . . . , p n ). In this case the updated success probability for the ith component is the mean of the i-th components of the vectors in the elite set.
As an example, consider the network from Knuth (1993) describing the coappearances of 77 characters from Victor Hugo's novel Les Miserables. Each node of the network represents a selected character and edges connect any pair of characters that coappear. The weights of the edges are the number of such coappearances. Using CEoptim, the data can be loaded via the command data(lesmis). The network is displayed in Figure 4, using the graph analysis package sna.
R> library(sna) R> library(CEoptim) R> data(lesmis) R> gplot(lesmis,gmode="graph") For any fixed cost matrix costs and cut vector x, the objective function of the max-cut problem can be written as: To optimize this function with the CEoptim package, we specify the following arguments: discrete$probs={(0,1); (0.5.0.5);...;(0.5,0.5)}, sample size N=3000 and optimization type: maximize=T. To see the output we set verbose=TRUE. The other arguments are taken as default. Note that users only need to specify either categories or probs, if both of them are specified, then categories will be overridden. Note that character 1 (Myriel) is always in group1. The initial probabilities for the other characters are 0.5. With states.probs, we can plot the evolution of the probabilities that each character belongs to group1; see Figure 5.
We have run the program for 1000 times randomly. In 312 cases the optimal solution (535) was found. The frequency of the results of CEoptim is given in Figure 6.

Constrained minimization of the griewank function
To illustrate constrained optimization with CEoptim, we consider the minimization of the griewank function, which is widely used to test the convergence of optimization algorithms. The griewank function of order n is defined as where x = (x 1 , . . . , x n ) takes values in some subset of R n . The function has many local minima with (in the unconstrained case) a global minimum at x * = (0, . . . , 0) of S(x * ) = 0.
R> direct minimizer = 3.139669 3.991955 R> direct minimum = 0.05685487 It is also possible to use a penalty approach for this problem. Here we take the penalty function which can be implemented in the following way.
We will use CE method to obtain the maximum likelihood estimate by direct maximization of the log-likelihood for the Dirichlet distribution given the data.
For a particular example, a data size of n = 100 points are sampled from the Dirichlet(1, 2, 3, 4, 5) distribution with the assistance of the function rdirichlet in the CEoptim package.

Lasso regression
Suppose that we observed some data from the following model: where x i = (x i1 , . . . , x ip ) is the p-vector of explanatory variables, β = (β 1 , . . . , β p ) is the pvector of regression coefficients, and the {ε i } are the noise terms with E[ε i ] = 0, Var[ε i ] = σ 2 , for all i and Cov(ε i , ε j ) = 0 (∀i = j). Consider a Lasso regression approach to estimate the regression vector β: where Y = (Y 1 , . . . , Y n ) and X = (x 1 , . . . , x n ) is the (n × p) design matrix. The tuning parameter λ controls the amount of regularization.
For a given value of λ, we will use CE method to obtain the Lasso regression coefficient and compared our results with those obtained by the function glmnet from the package glmnet presented by Friedman et al. (2008).
We generate data of size n = 150, with p = 60 explanatory variables independently generated from a standard normal distribution. The true coefficients from β are chosen such that 10 are large (between 0.5 and 1) and 50 are exactly 0. The variance of the noise is equal to 1.
R> (RSS.penalized(beta.CEoptim,X=X,Y=Y,lambda=lambda.10)) [1] 1.990268 R> (RSS.penalized(beta.glmnet,X=X,Y=Y,lambda=lambda.10)) [1] 1.993622 Further, we compare the results obtained by CEoptim with the ones given by glmnet for the sequence of tuning parameter λ used by default in the glmnet function. Results given by CEoptim are slightly better than glmnet optimizer (see Figure 5.6). In more than 90% of the cases (over the 73 values of λ investigated) CEoptim gives a lower value for the objective function than glmnet. However, the coordinate descent algorithm (Friedman et al. 2008) used in glmnet is computationally less demanding than the CE approach.

AR(1) model with regime switching
As a final illustration of the use of CEoptim, we consider a model fitting problem involving both continuous and discrete variables.
Let Y t be the added value of a stock at time t, at day t = 1, 2, . . . , 300; that is, the increase (which may be negative) in stock price relative to the price at time t = 0. Let X t be the increment at day t. Hence, X i , t = 1, . . . , 300.
We assume that the {X i } satisfy a zero-mean AR(1) model with three possibly different regimes. Specifically, we assume θ (1) , i = 1, . . . , r 1 θ (2) , i = r 1 + 1, . . . , r 2 θ (3) , i = r 2 + 1, . . . , 300 , 1 r 1 < r 2 < 300, |θ i | 1, i = 1, 2, 3, and the error terms {ε i } are iid and normally distributed with standard deviation σ. The model thus has two discrete and three continuous parameters, as well as a nuisance parameter σ. Define θ = (θ (1) , θ (2) , θ (3) ) , r = (r 1 , r 2 ) , and let x 1 , . . . , x 300 be the observed increments. We put x 0 = 0. We fit the parameters by minimizing the least squares function where x i is the fitted value θ i x i−1 , and θ i is determined by θ and r via (14). The vector of fitted values, say x, can be written in matrix notation as x = X θ, where X is a 300 × 3 matrix where the elements in rows 1, . . . , r 1 in the first column are equal to x 0 , . . . , x r 1 −1 ; the elements in rows r 1 + 1, . . . , r 2 in the second column are equal to x r 1 , . . . , x r 2 −1 ; the elements in rows r 2 + 1, . . . , 300 in the third column are equal to x r 2 , . . . , x 299 ; and all other elements are 0. The implementation of the least squares function is given below. Note that the function requires input r − 1 rather than r, because each categorical variable used in CEoptim takes value in a set {0, . . . , c} for some c.

Normal Q−Q Plot
Theoretical Quantiles residuals Figure 11: Diagnostic residuals of the model: scatterplot of the residuals (left) and quantile quantile normal plot(right).

Concluding remarks
CEoptim provides the R implementation of the cross-entropy method for optimization. The versatility and effectiveness of this new package have been illustrated through a variety of optimization example, involving continuous, discrete, mixed and constrained optimization problems. We have demonstrated how this simple algorithm can be of benefit in statistical inference, including model fitting, regression, maximum likelihood, and lasso methods. CEoptim is available from the Comprehensive R Archive Network (CRAN) at http: //cran.r-project.org/.