DiceKriging, DiceOptim: Two R Packages for the Analysis of Computer Experiments by Kriging-Based Metamodeling and Optimization

We present two recently released R packages, DiceKriging and DiceOptim, for the approximation and the optimization of expensive-to-evaluate deterministic functions. Following a self-contained mini tutorial on Kriging-based approximation and optimization, the functionalities of both packages are detailed and demonstrated in two distinct sections. In particular, the versatility of DiceKriging with respect to trend and noise specifications, covariance parameter estimation, as well as conditional and unconditional simulations, is illustrated on the basis of several reproducible numerical experiments. The presentation of DiceOptim then puts to the fore the implementation of sequential and parallel optimization strategies relying on the expected improvement criterion. An appendix is dedicated to complementary mathematical and computational details.


Introduction
Numerical simulation has become a standard tool in natural sciences and engineering. Used as a cheaper and faster complement to physical experiments, simulations are sometimes a necessary substitute to them, e.g., for investigating the long-term behavior of mechanical structures, or the extreme risks associated with geological storage (e.g., CO2 sequestration or nuclear waste deposit). A first step in such quantitative investigations is to proceed to a fine mathematical modeling of the phenomenon under study, and to express the function of interest as the solution to a set of equations (generally partial differential equations). Modern simulation techniques such as finite-element solvers and Monte Carlo methods can then be used to derive approximate solutions to the latter equations. The price to pay in order to derive accurate simulation results is computation time. Conception studies based on an exhaustive exploration of the input space (say on a fine grid) are thus generally impossible under realistic industrial time constraints. Parsimonious evaluation strategies are hence required, all the more crucially as the computation times and the dimensionality of inputs grow. Mathematical approximations of the input/output relation, also called surrogate models or metamodels, are increasingly used as a tool to guide simulator evaluations more efficiently. Once a class of surrogate models has been chosen according to some prior knowledge, they can be built upon available observations, and evolve when new data is assimilated. Typical classes of surrogate models for computer experiments include linear regression, splines, neural nets, and Kriging. Here we essentially focus on several variants of the Kriging metamodel, and on their use in prediction and optimization of costly computer experiments.
Originally coming from geosciences (Krige 1951) and having become the starting point of geostatistics (Matheron 1963), Kriging is basically a spatial interpolation method. Assuming that an unknown function y : D ⊂ R^d → R is one sample of a real-valued random field (Y(x))_{x∈D} with given or partially estimated probability distribution, Kriging consists in making predictions of unknown values y(x^(0)) (x^(0) ∈ D) based on the conditional distribution of Y(x^(0)) knowing observations of Y at a design of experiments X = {x^(1), ..., x^(n)} (n ∈ N). Note that in such a framework, the uncertainty does not refer to an actual random phenomenon, but to a partially observed deterministic phenomenon. One often refers to the latter kind of uncertainty as epistemic, whereas the former one is called aleatory (see for instance Matheron 1989, for a rich discussion on the subject). Coming back to Kriging, let us mention the remarkable fact that the Kriging predictor interpolates the observations, provided that they are assumed noise-free. It is undoubtedly one of the reasons why this metamodel has been imported from its geostatistical cradle (d = 2 or 3) to the high-dimensional framework of computer experiments (d ∈ N). Indeed, following the impulse given by the seminal paper of Sacks, Welch, Mitchell, and Wynn (1989), many works dedicated to Kriging as a surrogate to computer experiments have been published, including Welch, Buck, Sacks, Wynn, Mitchell, and Morris (1992) about screening, Koehler and Owen (1996) with an emphasis on experimental design, or O'Hagan (2006) and numerous achievements in uncertainty assessment by the same author since the early nineties. To get a global overview of state-of-the-art results and developments in Kriging for computer experiments, we refer to contemporary books of reference such as Santner, Williams, and Notz (2003), Fang, Li, and Sudjianto (2006), and Rasmussen and Williams (2006a).
Since the goal of computer experiments is often to tune some control variables in order to optimize a given output, it is obviously tempting to replace a costly simulator by a Kriging metamodel in order to speed up optimization procedures. Numerical experiments presented in Jones (2001) show that directly optimizing the Kriging predictor is however generally not efficient, and potentially leads to artifactual basins of optimum in case of iterated optimizations with metamodel update. Fortunately, efficient criteria like the expected improvement (EI) have been proposed for sequential Kriging-based optimization, as discussed and compared with other criteria in Schonlau (1997) and Sasena, Papalambros, and Goovaerts (2002). Already proposed in previous works (e.g., Mockus 1988), EI has become an increasingly popular criterion since the publication of the efficient global optimization (EGO) algorithm in Jones, Schonlau, and Welch (1998). Recent advances concern the extension of EI to a multipoints criterion, as well as Kriging-based parallel optimization strategies such as proposed in Ginsbourger, Le Riche, and Carraro (2010). More detail and some additional perspectives can be found in the recent tutorial by Brochu, Cora, and de Freitas (2009).
Back to the implementation of Kriging, let us give a short overview of existing codes, and motivate the emergence of new open source packages for computer experiments. Statistical software dedicated to Kriging has flourished since the 1990s, first in the framework of low-dimensional spatial statistics, and later on in computer experiments and optimization.
Several R (R Development Core Team 2012) packages like spatial (Venables and Ripley 2002), geoR (Ribeiro and Diggle 2001), gstat (Pebesma 2004), and RandomFields (Schlather 2001) indeed propose a wide choice of functionalities related to classical 2- and 3-dimensional geostatistics. These packages are unfortunately not suitable for applications in higher dimensions, for which similar Kriging equations but specific parameter estimation techniques have to be used. Alternatively, MATLAB toolboxes emanating from the computer experiments community have become popular among practitioners, like MPerK (Santner et al. 2003) and DACE (Lophaven, Nielsen, and Sondergaard 2002), or GPML (Rasmussen and Nickisch 2011; Rasmussen and Williams 2006b) in the context of Gaussian process regression and classification for machine learning. Several R programs in this vein have also emerged, like the bundle BACCO (Hankin 2005) and the packages mlegp (Dancik 2011) and tgp (Gramacy 2007; Gramacy and Taddy 2010). BACCO contains the packages calibrator and approximator, which offer an implementation of the calibration and multi-objective models introduced by O'Hagan (2000, 2001), as well as a first R package implementing universal Kriging (UK) in a Bayesian framework, emulator. This package considers one choice of priors that provides analytical results, and is limited to the Gaussian correlation function. Furthermore, it especially concerns prediction and parameter estimation, which is performed in a basic way. In mlegp, more effort has been paid to parameter estimation and the scope is extended to the applications of Kriging for sensitivity analysis and for stochastic filtering. However, it is restricted to Kriging with Gaussian correlation function and first-degree polynomial trend, and does not offer any Kriging-based optimization algorithm.
In other respects, tgp focuses on treed Gaussian process (TGP) models, with a Bayesian treatment of covariance parameters, and includes a variant of EGO relying on TGP models. Such highly sophisticated metamodels, relying on Markov chain Monte Carlo techniques, are quite calculation intensive.
Here we consider regular UK metamodels, and aim at providing the user with a wide range of possibilities together with user-friendliness. Compared to the existing R packages, we propose several covariance kernels, corresponding to different levels of smoothness for the underlying random field models. We also implemented the trend structure with R's convenient formula syntax, and pushed the limit of applicability of UK in higher dimensions thanks to a careful implementation of likelihood maximization routines relying on the analytical gradient. Furthermore, we have used the object-oriented S4 formalism, for more user-friendliness and genericity. In particular, the effort paid to code in an object-oriented way is meant to facilitate the forthcoming implementation of novel kinds of covariance kernels, and to preserve the same syntax for calling methods such as predict, plot, show, etc. in the next versions. Finally, we specifically aim at providing efficient Kriging-based optimizers in the spirit of EGO, with optional features for parallel computing.
DiceKriging and DiceOptim have been produced in the frame of the DICE (standing for Deep Inside Computer Experiments) consortium. During the years 2006-2009, DICE joined major French companies and public institutions having a high research and development interest in computer experiments with academic participants. The main aim was to put in common industrial problems and academic know-how in order to foster research and transfer in the field of design and analysis of computer experiments. Four R packages summarizing a substantial part of the conducted research have been released on the Comprehensive R Archive Network (CRAN, http://CRAN.R-project.org/) at the end of the consortium. The four packages DiceDesign (Franco, Dupuy, and Roustant 2011), DiceKriging, DiceEval (Dupuy and Helbert 2011), and DiceOptim should be seen as a small software suite, tailored for different but complementary needs of computer experimenters. For instance, DiceDesign might be a good option to generate some original space-filling designs at an initial stage of metamodel fitting with DiceKriging, while DiceEval might be used to assess the coherence and the fidelity of the obtained metamodel to the observed data. DiceOptim is more a complement to DiceKriging dedicated to expected improvement functions, offering sequential and parallel Kriging-based optimization routines. DiceDesign and DiceEval will be presented by their authors in a forthcoming paper (Dupuy, Helbert, and Franco 2010). They will not be used nor detailed here.
The main aim of this article is to give a practical introduction to DiceKriging and DiceOptim, together with an overview of the underlying statistical and mathematical background. In order to produce a self-contained document, we chose to recall basic assumptions in the body of the document. This is precisely the aim of Section 2. Section 3 then gives an overview of the main functionalities of the two packages. Sections 4 and 5 provide illustrative examples with code chunks of DiceKriging and DiceOptim, respectively. Finally, the four appendices focus on different aspects of the implementation. In particular we give a table of computational cost and memory size of the main procedures (Appendix C.3), some comments about speed (Appendix C.4), and two tests of trustworthiness for the covariance estimation algorithms (Appendix D).

Statistical background
Prior to presenting the DiceKriging and DiceOptim packages from a user perspective, let us recall some statistical basics of Kriging metamodeling and Kriging-based optimization.

From simple to universal Kriging for deterministic simulators
In this section, the simulator response y is assumed to be a deterministic real-valued function of the d-dimensional variable x = (x_1, ..., x_d) ∈ D ⊂ R^d. y is assumed to be one realization of a square-integrable random field (Y(x))_{x∈D} with first and second moments known up to some parameters.
Let us recall that X = {x^(1), ..., x^(n)} denotes the set of points where y has already been evaluated, and denote by y = (y(x^(1)), ..., y(x^(n)))' the vector of corresponding outputs. For any x ∈ D, the aim of Kriging is to optimally predict Y(x) by a linear combination of the observations y. Simple Kriging and universal Kriging constitute the best linear predictors in two particular cases concerning the field (or "process") Y and what is known about it.
Simple Kriging: From spatial linear interpolation to Gaussian process conditioning

In simple Kriging (SK), Y is assumed to be the sum of a known deterministic trend function µ : x ∈ D → µ(x) ∈ R and of a centered square-integrable process Z:

Y(x) = µ(x) + Z(x),

where Z's covariance kernel C : (u, v) ∈ D² → C(u, v) ∈ R is known. Most of the time, Z is assumed second-order stationary, so that C(u, v) = σ²R(u − v; ψ), where the so-called correlation function R is a function of positive type with parameters ψ, and σ² is a scale parameter called the process variance. More detail concerning these parameters can be found in a forthcoming subsection. Concerning Y's trend, note that it is very easy to come back to the centered process Y(x) − µ(x) since µ is known. Without loss of generality, we will hence first assume that Y is centered. Now, let us recall that the best linear unbiased predictor of Y(x) based on the observations Y(X) is obtained by finding λ*(x) ∈ R^n minimizing the mean squared error MSE(x; λ(x)) := E[(Y(x) − λ(x)'Y(X))²]. If C is invertible, the strict convexity of the MSE ensures both the existence and the uniqueness of λ*(x), and the SK weights are given by λ*(x) = C⁻¹c(x), where C = (C(x^(i), x^(j)))_{1≤i,j≤n} is the covariance matrix of Y(X), and c(x) = (C(x, x^(i)))_{1≤i≤n} is the vector of covariances between Y(x) and Y(X). Substituting both the random vector Y(X) by its realization y and λ(x) by λ*(x) in the expression λ(x)'Y(X), we get the so-called SK mean prediction at x: m_SK(x) := c(x)'C⁻¹y. Similarly, by plugging the optimal λ*(x) into the expression of the MSE, one gets the so-called SK variance at x: s²_SK(x) := C(x, x) − c(x)'C⁻¹c(x). Generalizing to the case of a non-centered process Y with known trend function µ(·), we get the SK equations:

m_SK(x) = µ(x) + c(x)'C⁻¹(y − µ),    s²_SK(x) = C(x, x) − c(x)'C⁻¹c(x),

where µ = µ(X) is the vector of trend values at the experimental design points. Classical properties include the fact that m_SK interpolates the data (X, y), and that s²_SK is non-negative and zero at the experimental design points.
Furthermore, the SK variance is independent of y (homoscedasticity in the observations). Note that these properties hold whatever the chosen kernel C, without any condition of compatibility with the data. In addition, let us remark that in the typical case of a stationary kernel C(x, x′) = σ²R(x − x′; ψ), the simple Kriging equations simplify to m_SK(x) = µ(x) + r(x)'R⁻¹(y − µ) and s²_SK(x) = σ²(1 − r(x)'R⁻¹r(x)), where r(x) and R respectively stand for the analogues of c(x) and C in terms of correlation. In particular, m_SK hence does not depend on σ², while s²_SK is proportional to it.
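As an illustration of the equations above, the simple Kriging predictor can be written in a few lines of base R (a sketch, not DiceKriging code; the 1-d Gaussian kernel, the length-scale θ = 0.4, and the zero trend are arbitrary illustrative choices):

```r
## Minimal simple Kriging sketch in base R: stationary Gaussian kernel,
## zero trend, process variance sigma2 (all values illustrative).
sk_predict <- function(x, X, y, sigma2 = 1, theta = 0.4) {
  r  <- function(h) exp(-(h / theta)^2 / 2)    # 1-d Gaussian correlation
  R  <- outer(X, X, function(u, v) r(u - v))   # correlation matrix of Y(X)
  rx <- r(x - X)                               # correlations with Y(x)
  w  <- solve(R, rx)                           # R^{-1} r(x): the SK weights
  list(mean = sum(w * y),                      # m_SK(x) = r(x)' R^{-1} y
       var  = sigma2 * (1 - sum(rx * w)))      # s2_SK(x)
}

X <- c(0, 0.5, 1); y <- c(0.1, 0.8, 0.3)
p <- sk_predict(0.5, X, y)
## at a design point the predictor interpolates: p$mean = 0.8, p$var = 0
```

At a design point, r(x) is a column of R, so the weight vector reduces to the corresponding unit vector: the predictor returns the observation with zero variance, as stated above.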
One major fact concerning the SK equations is that they coincide with classical conditioning results in the case where the process Z is assumed Gaussian. Indeed, the orthogonality of Y(x) − λ*(x)'Y(X) and Y(X) ensures independence in the latter case, so that m_SK(x) coincides with the conditional expectation E[Y(x) | Y(X) = y]. Similarly, s²_SK then coincides with the conditional variance VAR[Y(x) | Y(X) = y], so that the conditional law of Y(x) can finally be written in terms of SK quantities:

Y(x) | Y(X) = y ∼ N(m_SK(x), s²_SK(x)).

More generally, the law of the whole random process Y conditional on Y(X) = y is Gaussian with trend m_SK, and with a conditional covariance structure which can be analytically derived in the same fashion as s²_SK (see e.g., Ginsbourger et al. 2010, for details). The latter is the key to conditional simulations of Y knowing the observations, as we will illustrate in Section 4. We now turn to the case where the trend µ is known up to a set of linear trend coefficients.
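Conditional simulations as mentioned here can be sketched directly from the conditional mean and covariance (base R, centered process with a Gaussian kernel; the small jitter added before the Cholesky factorization is a purely numerical safeguard):

```r
## Sketch of conditional simulation of a centered Gaussian process in base R
## (not the package implementation; kernel and length-scale are illustrative).
cond_sim <- function(xnew, X, y, theta = 0.4, nsim = 3) {
  k  <- function(u, v) exp(-((u - v) / theta)^2 / 2)
  C  <- outer(X, X, k)
  cX <- outer(xnew, X, k)                        # cross-covariances
  Ci <- solve(C)
  m  <- cX %*% Ci %*% y                          # conditional (SK) mean
  S  <- outer(xnew, xnew, k) - cX %*% Ci %*% t(cX)  # conditional covariance
  L  <- chol(S + 1e-10 * diag(length(xnew)))     # jitter for stability
  m[, 1] + t(L) %*% matrix(rnorm(length(xnew) * nsim), ncol = nsim)
}

X <- c(0, 0.5, 1); y <- c(0.1, 0.8, 0.3)
set.seed(42)
s <- cond_sim(c(0.25, 0.5, 0.75), X, y)   # 3 conditional sample paths
```

Each column of the result is one conditional sample path; at design points the conditional covariance vanishes, so all paths pass through the observations.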
When some linear trend coefficients are unknown: Ordinary and universal Kriging

Let us focus on the case where the trend is of the form µ(x) = Σ_{j=1}^p β_j f_j(x), where the f_j's are fixed basis functions, and the β_j's are unknown real coefficients. Universal Kriging (UK) consists in deriving best linear predictions of Y based on the observations Y(X) while estimating the vector β := (β_1, ..., β_p)' on the fly. Note that in the specific case where the basis functions reduce to a unique constant function, UK is referred to as ordinary Kriging (OK). The UK equations are given by:

m_UK(x) = f(x)'β̂ + c(x)'C⁻¹(y − Fβ̂),
s²_UK(x) = s²_SK(x) + (f(x) − F'C⁻¹c(x))'(F'C⁻¹F)⁻¹(f(x) − F'C⁻¹c(x)),

where f(x) is the vector of trend function values at x, F = (f(x^(1)), ..., f(x^(n)))' is the n × p so-called experimental matrix, and the best linear estimator of β under correlated residuals is given by the usual formula β̂ := (F'C⁻¹F)⁻¹F'C⁻¹y. Basic properties of UK include a similar interpolating behavior as SK, with a variance vanishing at the design of experiments. Furthermore, m_UK(x) tends to the best linear fit f(x)'β̂ whenever the covariances c(x) vanish, which may typically happen when x is far away from the design X for some norm ‖·‖, in the case of a stationary covariance kernel decreasing with ‖u − v‖. Note also the inflation of the Kriging variance term, which reflects the additional uncertainty due to the estimation of β.
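The UK formulas, including the GLS estimate β̂ and the inflated variance, translate line by line into base R (a sketch with a constant-plus-linear trend in one dimension; not the package implementation, and the Gaussian kernel with θ = 0.4 is illustrative):

```r
## Universal Kriging sketch in base R: GLS trend estimate, UK mean and
## inflated UK variance for a 1-d input with trend (1, x).
uk_predict <- function(x, X, y, theta = 0.4) {
  k  <- function(u, v) exp(-((u - v) / theta)^2 / 2)
  C  <- outer(X, X, k); cx <- k(x, X)
  F  <- cbind(1, X)                          # experimental matrix (p = 2)
  fx <- c(1, x)                              # trend functions at x
  Ci <- solve(C)
  beta <- solve(t(F) %*% Ci %*% F, t(F) %*% Ci %*% y)    # GLS estimate
  m  <- sum(fx * beta) + cx %*% Ci %*% (y - F %*% beta)  # m_UK(x)
  u  <- fx - t(F) %*% Ci %*% cx
  s2 <- k(x, x) - cx %*% Ci %*% cx +
        t(u) %*% solve(t(F) %*% Ci %*% F, u)             # inflated variance
  c(mean = as.numeric(m), sd = sqrt(max(as.numeric(s2), 0)))
}

X <- c(0, 0.4, 1); y <- c(1, 2, 0)
p <- uk_predict(0.4, X, y)   # at a design point: mean = observation, sd = 0
```

At a design point both correction terms vanish, which recovers the interpolation property mentioned above.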
As in the case of SK, it is possible to interpret UK in terms of random process conditioning, at the price of some specific assumptions. Indeed, working in a Bayesian framework with an improper uniform prior over R^p for β (see Helbert, Dupuy, and Carraro 2009) leads to a Gaussian posterior distribution for the process Y conditional on the observations. Again, m_UK(x) and s²_UK(x) appear respectively as conditional mean and variance, and the analytically tractable conditional covariance kernel enables the use of conditional simulations at any set of new design points. This model is a first step towards Bayesian Kriging, where more general prior distributions can be chosen, not only for β but also for all kernel parameters such as σ² or ψ. We will not develop this approach any further here, since a generic version of Bayesian Kriging is not proposed in the present version of DiceKriging, but refer the interested reader to the seminal article by Omre (1987) and the works of O'Hagan (see http://www.tonyohagan.co.uk/academic/pub.html).

Filtering heterogeneously noisy observations with Kriging
In many practical situations, it is not possible to get exact evaluations of the deterministic function y at the design of experiments, but rather pointwise noisy measurements. This is the case for instance for partial differential equation solvers relying on Monte Carlo methods (e.g., in nuclear safety), or in partially converged simulations based on finite elements (e.g., in fluid dynamics). In such cases, for a given x ∈ D, the user does not have access to y(x), but to an approximate response y(x) + ε. When it is reasonable to assume that ε is one realization of a "noise" random variable, it is still possible to derive Kriging approximations. Here we assume that the probability distribution of the noise may depend on x and other variables, and that its realizations may differ for different measurements of y at the same x. So instead of referring to the measurements of y in terms of x's, we will denote by ỹ_i = y(x^(i)) + ε_i the sequence of noisy measurements, where the x^(i)'s are not necessarily all distinct, and by τ²_i the corresponding noise variances. Following the convenient particular case of Monte Carlo simulations, we finally make the assumption that the ε_i ∼ N(0, τ²_i) (1 ≤ i ≤ n) are independent. Note that although not explicitly addressed in this paper, the case of a multi-Gaussian vector of ε_i's with prescribed covariance matrix is a straightforward generalization of the model considered here. We now recall the Kriging equations for heterogeneously noisy observations, in the Gaussian process framework.
If we suppose that y is a realization of a Gaussian process following the simple Kriging assumptions above, the ỹ_i's can now be seen as realizations of the random variables Ỹ_i := Y(x^(i)) + ε_i, so that Kriging amounts to conditioning Y on the heterogeneously noisy observations Ỹ_i (1 ≤ i ≤ n). Indeed, provided that the process Y and the Gaussian measurement errors ε_i are stochastically independent, the process Y is still Gaussian conditionally on the noisy observations Ỹ_i, and its conditional mean and variance functions are given by the following slightly modified Kriging equations:

m_SK(x) = µ(x) + c(x)'(C + ∆)⁻¹(ỹ − µ),    s²_SK(x) = C(x, x) − c(x)'(C + ∆)⁻¹c(x),

where ỹ = (ỹ_1, ..., ỹ_n)', and ∆ is the diagonal matrix with diagonal terms τ²_1, ..., τ²_n. The only difference compared to the noiseless SK equations is the replacement of C by C + ∆ at every occurrence. Specific properties of this variant of the SK metamodel include the fact that m_SK(·) does not interpolate the noisy observations (except where an observation has been made with τ_i = 0), and that s²_SK(·) does not vanish at those points, being globally inflated compared to the noiseless case. Note that although s²_SK(·) now depends on both the design X and the noise variances τ² := {τ²_1, ..., τ²_n}, it still does not depend on the observations, as in the noiseless case. Note finally that the same filtering effect applies in the case of universal Kriging, where the equations are similarly obtained by replacing C⁻¹ by (C + ∆)⁻¹.
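A base-R sketch of these modified equations, with zero trend for brevity: the only change with respect to the noiseless case is the diagonal matrix ∆ added to C (kernel and parameter values are illustrative):

```r
## Filtering heterogeneous noise: SK formulas with C replaced by C + Delta
## (base R sketch, zero trend; tau2 are the known noise variances).
sk_noisy <- function(x, X, ytilde, tau2, sigma2 = 1, theta = 0.4) {
  k  <- function(u, v) sigma2 * exp(-((u - v) / theta)^2 / 2)
  Cd <- outer(X, X, k) + diag(tau2)        # C + Delta
  cx <- k(x, X)
  w  <- solve(Cd, cx)
  list(mean = sum(w * ytilde),             # no longer interpolates ytilde
       var  = k(x, x) - sum(cx * w))       # strictly positive at design points
}

X <- c(0, 0.5, 1); yt <- c(0.1, 0.8, 0.3)
p <- sk_noisy(0.5, X, yt, tau2 = rep(0.01, 3))
```

Since ∆ is positive definite here, the quadratic form c(x)'(C + ∆)⁻¹c(x) is strictly smaller than in the noiseless case, so the prediction variance stays positive even at the measurement locations.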

Covariance kernels and related parameter estimation
The choice of the covariance kernel C has crucial consequences on the obtained Kriging metamodel, all the more so when the trend is known or assumed constant. In order to be admissible, C has to be chosen in the set of positive definite kernels. Checking that positive-definiteness holds for a given C is however not an easy task, and non-parametric estimation of it seems unrealistic. So what one typically does in Kriging is to select beforehand a parametric family of kernels known to be positive definite, and to estimate the corresponding parameters based on available data, for instance by maximizing a likelihood function or minimizing the average cross-validation error. A usual restriction is to consider kernels depending only on the increment u − v, called stationary kernels. Admissible stationary kernels coincide with the functions of positive type, which are characterized by Bochner's theorem (see e.g., Rasmussen and Williams 2006a) as Fourier transforms of positive measures. Some of the most popular 1-dimensional stationary kernels include the Gaussian kernel, Fourier transform of the Gaussian density, as well as the Matérn kernel, Fourier transform of the Student density (Stein 1999). One convenient way of getting admissible covariance kernels in higher dimensions is to take tensor products of 1-d admissible kernels. Such kernels, called separable, are the most widely used in the computer experiments literature. The covariance kernels available in the current version of DiceKriging are built upon this model, up to a multiplicative constant σ² > 0:

C(u, v) = σ² ∏_{j=1}^d g(h_j; θ_j),

where h = (h_1, ..., h_d) := u − v, and g(·; θ_j) is a 1-dimensional covariance kernel with parameter θ_j. Although this could make sense in some contexts, the package does not allow mixing different covariance kernels by dimensions. The parameters θ_j are chosen to be physically interpretable in the same unit as the corresponding variables. They are called characteristic length-scales by Rasmussen and Williams (2006a).
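For instance, a separable d-dimensional covariance can be assembled from the standard 1-d Matérn 5/2 correlation as follows (base R sketch, not package code; the parameter values are illustrative):

```r
## Tensor-product (separable) covariance built from a 1-d Matern 5/2
## correlation, as in the display above (base R sketch).
matern52 <- function(h, theta) {                 # 1-d Matern 5/2 correlation
  a <- sqrt(5) * abs(h) / theta
  (1 + a + a^2 / 3) * exp(-a)
}
sep_cov <- function(u, v, theta, sigma2 = 1) {   # u, v, theta: length-d vectors
  h <- u - v
  sigma2 * prod(matern52(h, theta))              # product over the d dimensions
}

sep_cov(c(0, 0), c(0, 0), theta = c(1, 2))   # = sigma2 (here 1) at h = 0
```

Each coordinate contributes one 1-d correlation factor with its own length-scale θ_j, which is what makes the parameters interpretable dimension by dimension.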
The analytic formulas for g are taken from this book and are given in Table 1, where θ > 0 and 0 < p ≤ 2.
The above covariances result in different levels of smoothness for the associated random processes. With the Gaussian covariance, the sample paths of the associated centered Gaussian process have derivatives at all orders and are thus very smooth (they are even analytic). With the Matérn covariance with parameter ν, the process is (mean square) differentiable at order k if and only if ν > k. Thus with ν = 5/2, the process is twice differentiable; with ν = 3/2, only once. With ν = 1/2, equivalent to the exponential covariance, the process is only continuous. This is also the case for the power-exponential covariance when the power parameter p is strictly less than 2. The general Matérn covariance depends on the modified Bessel function, and has not been implemented yet. However, when ν = k + 1/2 with k a non-negative integer, the covariance has a simple closed-form expression. The two provided cases correspond to commonly needed levels of smoothness (ν = 3/2 and ν = 5/2).
Despite the offered flexibility in terms of differentiability level, all the covariance kernels above correspond to Gaussian processes with continuous paths. Now, in applications, the assumption of continuity is sometimes untenable for simulator outputs, although deterministic, due in particular to numerical instabilities. Hence, even if several evaluations at the same point deliver the same response, a slight change in the input vector may sometimes result in a jump in the response. Such discontinuities are classically handled in geostatistics using a so-called nugget effect, consisting in adding a constant term τ² (similar to what is called a jitter in machine learning and often used for numerical purposes) to c(0), the covariance at lag h = 0:

c_nugget(h) = c(h) + τ² δ_{h=0},

where δ denotes the Kronecker delta. The consequences of such a modification of the covariance kernel on Kriging are fairly similar to the case of noisy observations. Up to a rescaling of the Kriging variance due to the fact that the process variance changes from σ² to σ² + τ², predicting with a nugget or with noisy observations coincide when considering points outside of the design of experiments: A diagonal with constant term is added to the covariance matrix of observations, smoothing out the noiseless Kriging interpolator and inflating the variance. A major difference, however, is that Kriging with nugget effect conserves the interpolation property: Since τ² this time appears in the covariance vector too, the Kriging mean predictor systematically delivers the original observation, by virtue of the covariance vector being a column of the covariance matrix, as in the deterministic case without nugget and contrary to the noisy case.
In DiceKriging, the covariance parameters can be either given by the user or estimated. The first situation is useful for academic research or for Bayesian computations. The second one is needed for non-Bayesian approaches, and for Kriging-based optimization. At the present time, two estimation methods are proposed: maximum likelihood estimation (MLE) or penalized MLE (PMLE). The latter is based on the smoothly clipped absolute deviation (SCAD) penalty defined in Fan (1997) but is still in beta version. In Appendix A, we give the expressions of the likelihoods, concentrated likelihoods and analytical derivatives involved. We also refer to Section 3.3 for details on the optimization algorithms used.
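To fix ideas, the concentrated log-likelihood that such routines maximize can be sketched in base R for a constant-trend, Gaussian-kernel model (illustrative only; DiceKriging's actual implementation also exploits the analytical gradients of Appendix A):

```r
## Negative concentrated log-likelihood sketch for a constant-trend model:
## beta and sigma2 are profiled out, leaving a function of theta alone.
neg_conc_loglik <- function(theta, X, y) {
  n  <- length(y)
  R  <- outer(X, X, function(u, v) exp(-((u - v) / theta)^2 / 2))
  Ri <- solve(R)
  beta   <- sum(Ri %*% y) / sum(Ri)               # GLS constant trend
  z      <- y - beta
  sigma2 <- as.numeric(t(z) %*% Ri %*% z) / n     # plug-in process variance
  0.5 * (n * log(sigma2) + as.numeric(determinant(R)$modulus))
}

X <- c(0, 0.3, 0.7, 1); y <- c(0.2, 0.5, 0.9, 0.1)
v <- neg_conc_loglik(0.3, X, y)
```

The remaining 1-d parameter θ can then be estimated with optimize() over a search interval, or optim() in the multidimensional case.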

Kriging-based optimization: Expected improvement and EGO
Optimization (say minimization) when the objective function is evaluated through costly simulations creates a need for specific strategies. In most cases indeed, the non-availability of derivatives prevents one from using gradient-based techniques. Similarly, the use of metaheuristics (e.g., genetic algorithms) is compromised by severely limited evaluation budgets.
Kriging metamodels have been successfully used for the global optimization of costly deterministic functions since the nineties (Jones et al. 1998). A detailed review of global optimization methods relying on metamodels can be found in Jones (2001). The latter illustrates why directly minimizing a deterministic metamodel (like a spline, a polynomial, or the Kriging mean) is not efficient. Kriging-based sequential optimization strategies address the issue of converging to non-optimal points by taking the Kriging variance term into account, hence leading the algorithms to be more explorative. Such algorithms produce at each iteration one point that maximizes a figure of merit based upon the law of Y(x) | Y(X) = y. Several infill sampling criteria are available that balance Kriging mean prediction and uncertainty. The expected improvement criterion has become one of the most popular of them, probably thanks to both its well-suited properties and its analytical tractability.

The expected improvement criterion
The basic idea underlying EI is that sampling at a new point x will bring an improvement of min(y(X)) − y(x) if y(x) is below the current minimum, and 0 otherwise. Of course, this quantity cannot be known in advance since y(x) is unknown. However, the Gaussian process model and the available information Y(X) = y make it possible to define and derive:

EI(x) := E[(min(y(X)) − Y(x))⁺ | Y(X) = y],

for which an integration by parts yields the analytical expression (see Jones et al. 1998):

EI(x) = (min(y) − m(x)) Φ((min(y) − m(x)) / s(x)) + s(x) φ((min(y) − m(x)) / s(x)),

where Φ and φ are respectively the cdf and the pdf of the standard Gaussian distribution, and m(·) and s(·) denote the Kriging mean and standard deviation. The latter analytical expression is very convenient since it allows fast evaluations of EI, and even analytical calculation of its gradient and higher order derivatives. This is used in particular in DiceOptim for speeding up EI maximization (see Appendix C.1). This criterion has important properties for sequential exploration: It is null at the already visited sites, and non-negative everywhere else with a magnitude that increases with s(·) and decreases with m(·).
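The analytical expression translates directly into R (this is the standard formula; in DiceOptim the corresponding function is EI, which also handles the Kriging model object):

```r
## One-point expected improvement from the Kriging mean m, standard
## deviation s, and current minimum ymin (standard analytical formula).
ei <- function(m, s, ymin) {
  if (s < 1e-12) return(0)           # EI vanishes where the model is certain
  u <- (ymin - m) / s
  (ymin - m) * pnorm(u) + s * dnorm(u)
}

ei(m = 0, s = 1, ymin = 0)   # = dnorm(0), about 0.3989: pure exploration term
ei(m = 5, s = 0, ymin = 0)   # = 0 at an already visited, worse point
```

The two example calls illustrate the properties quoted above: EI grows with the prediction uncertainty s and shrinks as the predicted mean m worsens.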
The "efficient global optimization" algorithm EGO (see Jones et al. 1998) relies on the EI criterion. Starting with an initial design X (typically a Latin hypercube), EGO sequentially visits a current global maximizer of EI and updates the Kriging metamodel at each iteration, including hyperparameter re-estimation:

1. Evaluate y at X, set y = y(X), and estimate the covariance parameters of Y by MLE.
2. While the stopping criterion is not met:
   (a) Compute x_new = argmax_{x∈D} EI(x) and set X = X ∪ {x_new}.
   (b) Evaluate y(x_new) and set y = y ∪ {y(x_new)}.
   (c) Re-estimate the covariance parameters by MLE and update the Kriging metamodel.
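Spelled out with DiceKriging and DiceOptim functions, the loop looks as follows (a sketch assuming max_EI returns the EI maximizer in its $par component; the toy objective f is a hypothetical stand-in for a costly simulator, and EGO.nsteps, presented later, packages this same loop):

```r
## One EGO run spelled out with km and max_EI (sketch; assumptions above).
library(DiceKriging)
library(DiceOptim)

f <- function(x) sum((x - 0.3)^2)                 # hypothetical toy objective
X <- data.frame(x1 = runif(5), x2 = runif(5))     # initial design
y <- apply(X, 1, f)

for (i in 1:3) {
  model <- km(~1, design = X, response = y, covtype = "matern5_2")  # step 1 / (c)
  xnew  <- max_EI(model, lower = c(0, 0), upper = c(1, 1))$par      # step (a)
  X <- rbind(X, setNames(as.data.frame(xnew), names(X)))            # augment design
  y <- c(y, f(as.numeric(xnew)))                                    # step (b)
}
```

Re-fitting km at each iteration implements step (c), the hyperparameter re-estimation.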
EGO and related EI algorithms have become commonplace in computer experiments, and are nowadays considered as reference global optimization methods in dimension d ≤ 10 in cases where the number of objective function evaluations is drastically limited (see e.g., Jones 2001 or Brochu et al. 2009). One major drawback of EGO is that it does not allow parallel evaluations of y, which is desirable for costly simulators (e.g., a crash-test simulation run typically lasts 24 hours). This was already pointed out in Schonlau (1997), where the multipoints EI was defined but not further developed. This work was continued in Ginsbourger et al. (2010), which made the latter multipoints EI (q-EI) explicit and proposed two classes of heuristic strategies meant to approximately optimize the q-EI, and hence simultaneously deliver an arbitrary number of points without intermediate evaluations of y.

Adaptations of EI and EGO for synchronous parallel optimization
Considering q ≥ 2 candidate design points X_new := {x^(n+1), ..., x^(n+q)}, the q-points EI is defined as the conditional expectation of the joint improvement brought by the new points:

EI(X_new) := E[(min(y(X)) − min Y(X_new))⁺ | Y(X) = y].

Unlike in the 1-point situation, q-EI is not known in closed form (see Ginsbourger et al. 2010, for a formula in the case q = 2). However, it is possible to estimate it by a standard Monte Carlo technique relying on Gaussian vector simulation (see Table 2).
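The Monte Carlo estimation reads naturally in base R, assuming the conditional mean m and covariance matrix S of Y(X_new) given the observations have been computed beforehand (the jitter before the Cholesky factorization is a numerical safeguard; all values are illustrative):

```r
## Monte Carlo estimation of q-EI: simulate the conditional Gaussian vector
## at X_new and average the joint improvements (base R sketch).
qei_mc <- function(m, S, ymin, nsim = 1e4) {
  L    <- chol(S + 1e-10 * diag(length(m)))
  sims <- m + t(L) %*% matrix(rnorm(length(m) * nsim), ncol = nsim)
  mean(pmax(ymin - apply(sims, 2, min), 0))   # E[(min y - min Y(X_new))^+]
}

## two independent standard normal predictions, current minimum 0:
qei_mc(m = c(0, 0), S = diag(2), ymin = 0)    # about 0.68 (Monte Carlo)
</imports>
```

Each column of sims is one simulated Gaussian vector at X_new; the improvement of a simulation is driven by its smallest component, which is exactly the joint criterion defined above.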
q-EI can potentially be used to deliver an additional design of experiments in one step through the resolution of the optimization problem

X_new* = argmax_{X_new ∈ D^q} EI(X_new).    (14)

However, the optimization problem defined by Equation 14 is of dimension d × q, and has a noisy and derivative-free objective function in the case where the criterion is estimated by Monte Carlo.

Table 2: Monte Carlo estimation of the q-points expected improvement.
1: function q-EI(X, y, X_new)
2:   for i ← 1, ..., n_sim do
3:     simulate a Gaussian vector Y^(i) at X_new conditionally on Y(X) = y
4:     I_i ← (min(y) − min(Y^(i)))⁺            [improvement at X_new]
5:   end for
6:   return (1/n_sim) Σ_{i=1}^{n_sim} I_i      [q-points EI estimate]
7: end function

Pseudo-sequential greedy strategies have been proposed that approach the solution of Equation 14 while avoiding its numerical cost, hence circumventing the curse of dimensionality. In particular, the constant liar (CL) is a sequential strategy in which the metamodel is updated (still without hyperparameter re-estimation) at each iteration with a value L ∈ R exogenously fixed by the user, here called a "lie" (see Table 3). L should logically be determined on the basis of the values taken by y at X. Three values, min{y}, mean{y}, and max{y}, were considered in Ginsbourger et al. (2010). In the present version of DiceOptim, the CL function has a tunable L, whereas the parallel version of EGO relying on CL (called CL.nsteps) has a lie L fixed to min{y}. More detail is given in Section 5, dedicated to numerical examples produced with the main functions of DiceOptim.
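Schematically, the CL strategy can be written as follows (base R pseudo-implementation; maximize_ei and refit are hypothetical helpers standing for the 1-point EI maximization and for the model update without hyperparameter re-estimation, not DiceOptim functions):

```r
## Schematic constant liar strategy (pseudo-implementation; helper functions
## maximize_ei and refit are hypothetical placeholders).
constant_liar <- function(X, y, q, L = min(y), maximize_ei, refit) {
  Xnew <- NULL
  for (j in 1:q) {
    model <- refit(X, y)            # Kriging fit, no hyperparameter update
    xj    <- maximize_ei(model)     # current 1-point EI maximizer
    Xnew  <- rbind(Xnew, xj)
    X <- rbind(X, xj)               # pretend xj has been evaluated ...
    y <- c(y, L)                    # ... and that the lie L was observed
  }
  Xnew                              # q points, no true evaluation of y
}
```

Feeding the lie back into the model flattens EI around the points already chosen, which is what steers the q points away from each other without ever calling the true simulator.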

The main functions
DiceKriging performs estimation, simulation, prediction and validation for various Kriging models. The first step is to define a Kriging model, which is the goal of the km function. It is suited both to SK and UK, and to noise-free and noisy observations, and allows some flexibility in estimating all or only some of the parameters. Its functioning is detailed in Table 4. Simulation, prediction, and validation are implemented as simulate, predict and plot methods that apply directly to km objects. For prediction, the kind of Kriging must be indicated in the argument type ("SK" or "UK"). The plot method corresponds to leave-one-out validation. A k-fold validation can be found in DiceEval.

Kriging model                                    Arguments to be specified in km
Unknown trend and cov. kernel parameters (UK)    (default) nothing to specify
Known trend and cov. kernel parameters (SK)      coef.trend, coef.cov, coef.var
Known trend, unknown cov. kernel parameters      coef.trend
Optional nugget effect, known                    nugget (homogeneous to a variance)
Optional nugget effect, unknown                  nugget.estim = TRUE
Noisy observations                               noise.var (homogeneous to a variance; incompatible with a nugget effect in this package)

Table 4: Possibilities of the function km for the definition of Kriging models. km estimates the unknown parameters and creates the resulting km object. The default is universal Kriging (noise-free observations, no nugget effect). The other Kriging models are obtained by specifying arguments in km as indicated above.
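The specifications of Table 4 can be illustrated on a small synthetic data set. The sketch below assumes DiceKriging is installed and loaded; the toy data and parameter values are illustrative only.

```r
library(DiceKriging)
# Toy 1D data to illustrate the model specifications of Table 4
set.seed(1)
X <- data.frame(x = seq(0, 1, length = 10))
y <- sin(4 * pi * X$x)

# (1) Default: universal Kriging, all parameters estimated
m.uk <- km(design = X, response = y)

# (2) Simple Kriging: trend and covariance parameters all known
m.sk <- km(design = X, response = y, coef.trend = 0,
           coef.cov = 0.2, coef.var = 1)

# (3) Unknown nugget effect, estimated along with the other parameters
m.nug <- km(design = X, response = y, nugget.estim = TRUE)

# (4) Noisy observations with known (here homogeneous) noise variances
m.noise <- km(design = X, response = y, noise.var = rep(0.01, 10))
```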

R function    Description
EI            One-point noise-free EI criterion
qEI           q-points noise-free EI criterion (estimated by Monte Carlo)
EGO.nsteps    Sequential EI algorithm, with model updates including re-estimation of covariance parameters, for a fixed number of iterations (nsteps)
max_EI        One-point noise-free EI maximization. No call to the objective function
max_qEI.CL    (Sub-)maximization of the q-points EI, based on the constant liar heuristic. No call to the objective function
CL.nsteps     Parallel EI algorithm, with model updates including re-estimation of covariance parameters, for a fixed number of iterations (nsteps)

DiceOptim performs sequential and parallel Kriging-based optimization, based on the 1-point and multipoints EI criteria. The main functions are described in Table 5. Note that in the version presented here, DiceOptim is limited to noise-free observations.

Trend definition and data frames
A convenient way to specify linear trends in R is to use the class formula that provides compact symbolic descriptions. Among advantages of formulas, arbitrary functions of the inputs can be considered, and updating models is very easy. For instance, the linear model arguments of the functions lm or glm (package stats) are given as objects of class formula. The same is possible in DiceKriging, through the km function, with some specificities that we describe now.
First, the design X must be provided as a data.frame in the argument design of km, which implies that all columns of X are named. Then the trend can be specified in the argument formula, using these names. The formula mechanism mainly works as in the lm function from the stats package, as shown in Table 6; in particular, notice that the inhibition function I is sometimes required, especially for polynomial terms. Note however that the left-hand term is not needed (and will not be used if provided): this is because formula only defines a trend (and not a transformation of the response y).
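As a concrete illustration of the formula mechanism described above, the following trend specifications are all valid (the design columns are assumed to be named x1 and x2; the examples mirror the conventions of lm):

```r
# Trend specifications via formula (design columns assumed named x1, x2)
f1 <- ~1                      # constant trend (the default)
f2 <- ~.                      # first order polynomial in all inputs
f3 <- ~x1 + x2                # the same, written explicitly
f4 <- ~x1 + I(x2^2)           # I() is required for polynomial terms
f5 <- ~x1 * x2                # main effects plus the interaction x1:x2
f6 <- ~sin(x1) + log(1 + x2)  # arbitrary functions of the inputs
```

Any of these objects can then be passed to the formula argument of km.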
Once the trend is specified, one can estimate or build a km object (see the last section). For prediction (or simulation), the new location(s) X_new have to be specified in the argument newdata of the predict (or simulate) method. In practice, the user may not want to spend time converting matrices to data.frames; thus, X_new can be provided simply as a matrix (or even a vector in the case of a single new point). However, with this practical syntax, no variable name is stored for X_new, and it is always assumed that the columns of X_new are provided in the same order as those of X. For evident reasons, the user must not depart from this rule.

Comments about the optimization algorithms
Optimization algorithms are used on two different occasions: likelihood maximization and EI maximization. These problems are quite hard, due to the cost of the objective functions, numerical instabilities, multimodality and dimensionality issues (see e.g., Santner et al. 2003). In DiceKriging and DiceOptim, we have addressed them by following four leading ideas: first, rely on trustworthy state-of-the-art algorithms; second, choose at least one algorithm capable of overcoming multimodality; third, to improve speed and accuracy, supply the analytical gradients; fourth, enable tunability of the key parameters by the user. For these reasons, we have selected a quasi-Newton Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm (function optim) and the hybrid algorithm genoud from package rgenoud (Mebane and Sekhon 2011), which takes advantage of both the global search provided by a genetic algorithm and the local search based on gradients. Let us now turn to some specificities.

Likelihood maximization
The implementation is based on the efficient algorithm proposed by Park and Baek (2001), and extended to the Kriging models addressed here (see the implementation details in the Appendix A). On the one hand, the results obtained with BFGS are the quickest, but may be variable. To reduce this variability, BFGS is run from the best point among an initial population drawn at random. The size of this population can be tuned by changing pop.size in the argument control. On the other hand, rgenoud usually gives more stable results, at the cost of a higher computational time. Again, the most influential parameters can be tuned by the user. Some results concerning the performance of the optimizers are given in the examples section.

EI maximization
The branch-and-bound algorithm proposed by Jones et al. (1998) was not chosen here, since the boundaries depend on the covariance function and may be hard to find in general. Due to the multimodal form of EI (which is equal to 0 at every visited point), the genoud algorithm is used. The analytical gradient of EI is derived in Appendix B.

Examples with DiceKriging
4.1. An introductory 1D example with known parameters.
First, let us use km to build a Kriging model with the following characteristics: second order polynomial trend 11x + 2x^2, Matérn 5/2 covariance structure with σ = 5 and θ = 0.4. Recall that the trend form is interpreted with an object of type formula, in the same way as for linear models (see lm{stats}). As this formula is based on the variable names, the argument design of km must be a data.frame. Thus, in the following example, the vector inputs is converted to a data.frame with name x, which must also be used in the argument formula.
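A sketch of the corresponding km call is given below. The vectors `inputs` and `outputs` are placeholders for the example's data; the trend 11x + 2x^2 is passed through coef.trend = c(0, 11, 2), and coef.var is the process variance σ^2 = 25.

```r
library(DiceKriging)
# Illustrative 1D data; 'inputs' and 'outputs' stand in for the example's data
inputs <- c(-1, -0.5, 0, 0.5, 1)
outputs <- c(-9, -5, -1, 9, 11)
# Trend 11x + 2x^2, Matern 5/2 kernel with sigma = 5 (variance 25), theta = 0.4
model <- km(formula = ~x + I(x^2), design = data.frame(x = inputs),
            response = outputs, covtype = "matern5_2",
            coef.trend = c(0, 11, 2), coef.cov = 0.4, coef.var = 25)
```

With all parameters fixed, no estimation takes place: km directly returns the km object used for Simple Kriging prediction below.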
R> t <- seq(from = -2, to = 2, length = 200)
R> p <- predict(model, newdata = data.frame(x = t), type = "SK")

Note that the command

R> p <- predict(model, newdata = t, type = "SK")

gives a warning: since t is not named, there is no guarantee that the values contained in it correspond to the same variable as x. Hence, newdata should be provided as a data.frame with the same variable names as the design.
Finally plot the results: SK mean and 95% confidence intervals (see Figure 1).

Influence of the range parameters
The range parameters in DiceKriging (i.e., the θ's) are length scales: A small value is analogous to a high frequency, and a large one to a low frequency. To illustrate this, we ran the code above with three values of θ: 0.05, 0.4 and 1. The result is represented in Figure 2.

Influence of the trend
Let us visualize the influence of the trend by comparing constant, affine and sine trends. The sine trend is chosen instead of the quadratic trend to show that the trends proposed in DiceKriging are not limited to polynomials. The only modifications in the R code are in the arguments formula and trend. For a constant trend, formula reduces to ~1 (the default), with the constant given in trend. For the affine trend:

R> formula <- ~x
R> trend <- c(0, 10)

The first coefficient in trend is the intercept, and the second one the slope: thus, mathematically, the affine trend is 10x. Note that in this case, another possible choice for formula is:

R> formula <- ~.

Simulations of Gaussian processes underlying Kriging models
Two kinds of simulations are used in the context of Kriging modeling: Unconditional simulations, which consists in simulating sample paths of a Gaussian process with given (unconditional) mean and covariance functions, and conditional simulations, where the mean and covariance functions are conditional on fixed values of the Gaussian process at some design points. Conditional and unconditional simulations are implemented as the so-called simulate method, applying to km objects. For instance, simulate can be applied to the object model defined in the previous section (Section 4.1). By default, (unconditional) simulations are performed at the design points. To simulate at new points, the new locations must be specified in the argument newdata. The argument nsim specifies the number of simulations required.
R> t <- seq(from = -2, to = 2, length = 200)
R> y <- simulate(model, nsim = 5, newdata = data.frame(x = t))

Note that, as for the predict method, the command

R> y <- simulate(model, nsim = 5, newdata = t)

gives a warning: since t is not named, there is no guarantee that the values contained in it correspond to the same variable as x. Hence, newdata should be provided as a data.frame with the same column names as the design.
Formally, y is a matrix where each row contains one simulation. All simulations are represented in Figure 4. The trend is the second order polynomial previously considered.

Influence of covariance functions
The smoothness of the sample functions of stationary Gaussian processes depends on the properties (at 0) of their covariance functions. To illustrate this, we redo the simulations above with different covariance functions, namely "gauss", "matern3_2" and "exp". As expected, when chosen in this order, the corresponding simulated paths are more and more irregular, as can be seen in Figures 5, 6 and 7. To sum up, the four covariance functions are represented in Figure 8, with one simulated path for each (to save space, the code is not inserted here, but can be found in the documentation of simulate.km). Finally, the same kind of experiment can be done with the power-exponential covariance function ("powexp"), as can be seen from the results presented in summarized form in Figure 9.
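The comparison can be reproduced along the following lines. This is a sketch assuming DiceKriging is loaded; the design, response, and covariance parameter values are illustrative placeholders, and the simulations are unconditional (the default of the simulate method).

```r
library(DiceKriging)
t <- seq(from = -2, to = 2, length = 200)
par(mfrow = c(2, 2))
# Kernels ordered from smoothest to roughest sample paths
for (kernel in c("gauss", "matern5_2", "matern3_2", "exp")) {
  m <- km(design = data.frame(x = c(-1, 0, 1)), response = c(0, 0, 0),
          covtype = kernel, coef.trend = 0, coef.cov = 0.4, coef.var = 25)
  y <- simulate(m, nsim = 1, newdata = data.frame(x = t))
  plot(t, y[1, ], type = "l", main = kernel, ylab = "simulated path")
}
```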

Conditional simulations
Still following the first example of Section 4.1, conditional simulations can be performed, simply by turning the argument cond to TRUE:  R> y <-simulate(model, nsim = 5, newdata = data.frame(x = t), cond = TRUE) The conditional simulations are represented in Figure 10. This procedure will be used to do two trustworthiness tests. Firstly, the conditional simulations can be compared with the simple Kriging mean and variance. Secondly, it will be used to evaluate the accuracy of the maximum likelihood estimators of the Gaussian process covariance parameters. More details are given in Appendix D.

Estimation and validation of Kriging models
A 2-dimensional case study. To illustrate the functionalities of DiceKriging on non-simulated data, we start with the famous 2-dimensional Branin-Hoo function (Jones et al. 1998). This function is a usual case study in global optimization, with the particularity of having three global minimizers. Thus, it will be used in several DiceOptim examples. For now, let us estimate a Kriging model. In this first example, we consider a very naive design, namely a 4 × 4 grid.

R> X <- expand.grid(x1 = seq(0, 1, length = 4), x2 = seq(0, 1, length = 4))
R> y <- apply(X, 1, branin)

For the trend, we use a first order polynomial, but this choice does not influence the results much here. Finally, assuming a priori that the Branin-Hoo function is smooth, we use a Gaussian covariance function. Then, the construction of the Kriging model (including estimation of the trend and covariance parameters) is performed using km:

R> m <- km(~., design = X, response = y, covtype = "gauss")

The trace printed during estimation can be suppressed with

R> m <- km(~., design = X, response = y, covtype = "gauss",
+    control = list(trace = FALSE))

but there are several advantages to keeping it, at least here, to see what is tunable in estimation.
Indeed, many default values are proposed by km: the maximization of the likelihood is performed with the BFGS optimizer, using the analytical gradient and an initial random search based on 20 initial points. Also, the domain over which the parameters are estimated depends on the ranges of the design X in each direction. But in some cases, it can be wise to change some of these default values. For instance, in order to limit the variability of the optimization results, one may increase the size of the initial random search:

R> m <- km(~., design = X, response = y, covtype = "gauss",
+    control = list(pop.size = 50, trace = FALSE))

or prefer to BFGS the genetic algorithm rgenoud, which is a global optimizer, at the cost of a longer running time. This can be done by means of the argument optim.method:

R> m <- km(~., design = X, response = y, covtype = "gauss",
+    optim.method = "gen", control = list(trace = FALSE))

As for BFGS, the main optimization parameters, such as the size of the initial population, can be tuned in control. More detail about estimation options can be found in the documentation of km. Coming back to the default values, we now print the estimation results:

R> m
Call: km(formula = ~., design = X, response = y, covtype = "gauss")

We can see a clear anisotropy, with a longer range in the x2 direction. The estimated value of θ2 reached the boundary 2. Since the concentrated likelihood depends only on the two ranges θ1 and θ2, it can be plotted. In the following, we compute logLikFun over a 30 × 30 grid:

R> n.grid <- 30
R> x.grid <- seq(0.01, 2, length = n.grid)
R> X.grid <- expand.grid(x.grid, x.grid)
R> logLik.grid <- apply(X.grid, 1, logLikFun, m)

The result can then be drawn, and the optimum added by extracting it from the km object m. In Figure 11, we see that the optimum is correctly found by BFGS. This is because the likelihood surface is pretty simple here. Of course, this may not always be the case, especially in higher dimensions; turning the argument optim.method to "gen" may then be useful. Now, we can draw the Kriging mean and visualize the prediction accuracy.
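The drawing step can be sketched as follows. The estimated ranges are assumed to be stored in the range.val slot of the covariance structure of m (slot names as used in DiceKriging; inspect str(m) to check on a given version):

```r
# Contour plot of the concentrated log-likelihood over the (theta1, theta2) grid
contour(x.grid, x.grid, matrix(logLik.grid, n.grid, n.grid), 40,
        xlab = expression(theta[1]), ylab = expression(theta[2]))
# Estimated ranges, extracted from the km object (assumed slot names)
opt <- m@covariance@range.val
points(opt[1], opt[2], pch = 19, col = "red")
```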
Finally, a leave-one-out validation is implemented as a plot method (see results in Figure 13):

R> plot(m)
A more complete validation procedure is available in package DiceEval (not presented here), including k-fold cross validation.

A 6-dimensional approximation example
Let us now consider the 6-dimensional, negative valued, Hartman function (see Jones et al. 1998). It is a standard test function in optimization, and is implemented in DiceKriging. Following Jones et al. (1998), the transformation −log(−(·)) is first applied to the response. For estimation, an 80-point optimal Latin hypercube (LH) design is chosen, using the function optimumLHS from package lhs (Carnell 2012). The leave-one-out diagnostic is represented in Figure 14, and shows no strong departures from the model assumptions.
R> n <- 80
R> d <- 6
R> set.seed(0)
R> X <- optimumLHS(n, d)
R> X <- data.frame(X)
R> y <- apply(X, 1, hartman6)
R> mlog <- km(design = X, response = -log(-y))
R> plot(mlog)

To go further, since the 6-dimensional Hartman function is cheap to evaluate, we study the performance of the model on a 250-point test LH sample generated at random. In practice, of course, such a validation scheme would be intractable, and k-fold cross-validation would be a sensible substitute for it. Coming back to out-of-sample validation, we draw the predicted values versus the true ones (Figure 15). We have also added the results that would be obtained with a trend (first order polynomial + interactions) or without the log transformation of the response. In this case, it seems clear that the log transformation improves prediction.

R> n.test <- 250
R> set.seed(0)
R> X.test <- randomLHS(n.test, d)
R> colnames(X.test) <- names(X)
R> y.test <- apply(X.test, 1, hartman6)
R> ylog.pred <- predict(mlog, newdata = X.test, type = "UK")$mean

The case of noisy observations
In the case where the observations are assumed to be noisy, whether they stem from stochastic simulation (Fernex, Heulers, Jacquet, Miss, and Richet 2005), from partially converged deterministic simulations (Forrester, Bressloff, and Keane 2006), or from real experiments, it is crucial to quantify the corresponding noise variance values and to take them into account within Kriging metamodeling.
Here we rely on a basic one-dimensional example to cover Kriging with heterogeneous noise, Kriging with homogeneous noise, and Kriging with a nugget effect, in that order. Note that the results shown are directly generalizable (i) to multivariate cases and (ii) to simple as well as universal Kriging models.
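The heterogeneous-noise case can be sketched as below: each observation comes with its own noise variance, passed through the noise.var argument of km. The data and parameter values are hypothetical placeholders.

```r
library(DiceKriging)
set.seed(1)
# Hypothetical noisy 1D observations, one noise variance per point
x <- seq(0, 1, length = 15)
noise.var <- runif(15, 0.01, 0.09)
y <- sin(2 * pi * x) + rnorm(15, sd = sqrt(noise.var))
# Simple Kriging with known trend/covariance and heterogeneous noise
model <- km(design = data.frame(x = x), response = y, coef.trend = 0,
            coef.cov = 0.3, coef.var = 1, noise.var = noise.var)
t <- seq(0, 1, length = 200)
p <- predict(model, newdata = data.frame(x = t), type = "SK")
```

Setting all entries of noise.var to a common value recovers the homogeneous-noise case illustrated next.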
Finally, the same data is used in Figure 18 to illustrate what kind of Kriging model would have been obtained with a nugget instead of a homogeneous noise variance.
R> model <- km(design = data.frame(x = x), response = y,
+    coef.trend = 0, coef.cov = theta, coef.var = 1, nugget = 4/100)
R> p <- predict.km(model, newdata = data.frame(x = sort(t)), type = "SK")

Figure 17: Kriging with homogeneously noisy observations. This is a variant of Figure 16 in the case where all observations have the same noise variance.

Figure 18: Kriging with a nugget effect. This is a variant of Figure 16: the Kriging mean is now interpolating the observations (stars), and the Kriging prediction intervals (filled area) vanish at the design points.

As announced in the Statistical Background section, the Kriging mean predictor with nugget effect is quite similar to the one with homogeneously noisy observations, up to its behavior at the design points. Indeed, one can see that the Kriging predictor interpolates the observations. As in the deterministic case, this goes with a zero Kriging variance at the corresponding points. Outside of the design points, however, the Kriging variance has the same shape as in the analogous Kriging model with homogeneous noise, but with values translated by a factor of τ². This is explained by the fact that the process Y does not have the same variance depending on whether τ² stands for a nugget coefficient or for a noise variance.

Sensitivity analysis -Wrapper to package sensitivity
Sensitivity analysis (SA) is not implemented yet in this version of DiceKriging. However, connecting DiceKriging to packages devoted to SA like sensitivity (Pujol 2008) is easy, and we suggest an efficient way of doing this in the present section.
Let us first recall the elements of SA that are useful for this section. SA aims at "studying how uncertainty in the output of a model (numerical or otherwise) can be apportioned to different sources of uncertainty in the model input" (Saltelli et al. 2008). It relies on the so-called Sobol (or FANOVA) decomposition. Let us assume that the input variables X1, . . . , Xd are independent random variables, and denote by ν the probability measure of X = (X1, . . . , Xd).
Then any function f ∈ L²(ν) can be decomposed in the following way:

f(X) = Σ_{I ⊆ {1,...,d}} µ_I(X_I),

where, for every subset I, the terms µ_I(X_I) are centered and satisfy E(µ_I(X_I) | X_J) = 0 for all strict subsets J ⊊ I. With these assumptions, the decomposition is unique and its terms are mutually orthogonal. In particular, the variance of the output is also decomposed as:

var(f(X)) = Σ_I D_I, with D_I = var(µ_I(X_I)).

Therefore, the global influence of X_I on f(X) can be measured by the ratios, called Sobol indices:

S_I = D_I / var(f(X)).

For a given input variable X_i, two Sobol indices are of special interest: the first order index S_i, which quantifies the influence of X_i alone, and the total effect S_{T_i} = Σ_{J ⊇ {i}} S_J, which takes into account all the terms involving X_i. We refer to Sobol (1993) and Saltelli, Chan, and Scott (2000) for more details.
Let us now illustrate a possible coupling between DiceKriging and the package sensitivity.

A toy example
Before considering an example suited to SA, first consider the Branin function. To perform SA with Branin, we first need to adapt its implementation to matrix-typed arguments.

R> branin.mat <-function(X) apply(X, 1, branin)
Then, assuming independent uniform distributions over [0, 1] for the inputs, the Sobol indices can be computed using the function fast99 (from sensitivity), by typing (see the help file of fast99 for other examples):

R> SA.branin <- fast99(model = branin.mat, factors = 2, n = 1000,
+    q = "qunif", q.arg = list(min = 0, max = 1))

Now, let us compute the SA of the Kriging metamodel estimated from a few runs of the Branin function. After constructing the model (with a 16-point factorial design, as above),

R> m.branin <- km(design = X, response = y)

we create a connection function based on predict.km, which deactivates name checking (useless in this particular context) and returns only the Kriging mean. The results can be printed, or drawn with a plot method taken from the package sensitivity.
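The connection function can be written as below. This sketch assumes sensitivity is loaded and that m.branin has been built as above; the wrapper name `kriging.mean` is illustrative. Disabling the standard error computation (se.compute = FALSE) and the name checking (checkNames = FALSE) keeps the repeated metamodel calls fast.

```r
# Connection between a km object and the sensitivity package:
# a function of the design matrix only, returning the Kriging mean
kriging.mean <- function(Xnew, m) {
  predict.km(m, newdata = Xnew, type = "UK",
             se.compute = FALSE, checkNames = FALSE)$mean
}
# SA of the metamodel; the km object is passed through fast99's "..." argument
SA.metamodel <- fast99(model = kriging.mean, factors = 2, n = 1000,
                       q = "qunif", q.arg = list(min = 0, max = 1),
                       m = m.branin)
```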
For the Branin function, the metamodel is precise, so the Sobol indices (main and total effects) calculated with the metamodel are very close to the true ones ( Figure 19).

A standard SA 8-dimensional example
Let us now take the 8-dimensional Sobol function implemented in sensitivity. A Kriging metamodel is estimated with an 80-point optimal LH design (generated by optimumLHS from lhs).

Known problems and their solutions
Despite the care taken in implementation, some problems can be encountered. Some of them are purely numerical, and can be solved easily, but others are intrinsic to Kriging and may be due to an inappropriate choice of design points or covariance kernels.

Non invertibility of covariance matrices
Interpolation is sometimes hard to achieve numerically. Roughly speaking, this happens when the design points are too close relative to the spatial correlation length, which results in nearly non-invertible covariance matrices. Among the kernels provided, the worst case in this respect is the Gaussian kernel, which adds a strong smoothness constraint on the sample paths by implying the existence of derivatives at any order. On the other hand, the Matérn kernel with ν = 5/2 (for which the derivatives exist up to order 2) gives much better conditioned covariance matrices. Thus, the first recommendation is to avoid the Gaussian kernel and prefer the Matérn kernel with ν = 5/2; such a choice is strongly advocated by Stein (1999). This kernel is the default choice in DiceKriging.
Another possibility, which can be combined with the former one, is to add a nugget effect, or jitter; this method is sometimes referred to as diagonal inflation. However, the sample paths are then discontinuous at the design points. Therefore, the nugget value should not be too large, in order to avoid an excessive departure from continuity, but not too small, to prevent numerical problems.
To illustrate these two solutions, let us consider again the example of the last section, but increase the size of the design of experiments.

Error in chol.default(R): the leading minor of order 47 is not positive definite
This error message indicates that the covariance matrix could not be inverted. To overcome the difficulty, one can choose the diagonal inflation method:

R> km(design = X, response = y, covtype = "gauss", nugget = 1e-8 * var(y))

or replace the Gaussian covariance kernel by the (default) Matérn kernel (ν = 5/2):

R> km(design = X, response = y)

Identifiability issues caused by large design interdistances
A dual difficulty is encountered when the design points are not close enough relative to the spatial correlation length. In such situations, the estimation of Kriging models may give either misleading results or flat predictions, corresponding to range parameters θ estimated to zero. Analogous issues are faced in signal theory, where recovering a periodic signal is not possible if the sampling frequency is too small. A solution is penalization. For instance, Li and Sudjianto (2005) have shown some promising results obtained by penalized MLE with the SCAD penalty (Fan 1997). This method has been implemented in DiceKriging, but should be considered as a beta version at this stage, since the estimation results for the tuning parameter are not convincing. Another possibility is simply to add a constraint θ ≥ θ_min in MLE. One then faces the analogous problem of choosing the lower threshold θ_min.
To illustrate the potential virtues of penalizing, consider the sine function proposed by Li and Sudjianto (2005), and compare the estimation results obtained with (only) 6 design points by three procedures: MLE, PMLE with SCAD penalty function, and MLE constrained by θ ≥ θ min (Figure 21). The Gaussian covariance kernel is used, with a small nugget effect. The usual MLE gives an unrealistic value of θ, estimated to 0.15 approximately, which seems much too small in comparison to the distances between design points. On the other hand, both modified estimation procedures give realistic estimation results. However, a difficulty still remains in proposing a general method for choosing either the tuning parameter λ in PMLE or θ min . In the present case, λ has been estimated by cross validation, and θ min fixed to the mean distance between design points.
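The constrained estimation θ ≥ θ_min can be sketched through the lower argument of km, which bounds the range parameters during likelihood maximization. The vectors `x` and `y` below stand in for the 6 design points and responses of the sine example; the choice of θ_min follows the rule mentioned above (mean distance between design points).

```r
library(DiceKriging)
# Illustrative stand-ins for the 6-point sine example
x <- seq(0, 1, length = 6)
y <- sin(12 * x)
# theta_min fixed to the mean distance between (sorted) design points
theta.min <- mean(diff(sort(x)))
# MLE constrained by theta >= theta.min, Gaussian kernel with a small nugget
model.constrained <- km(design = data.frame(x = x), response = y,
                        covtype = "gauss", nugget = 1e-8 * var(y),
                        lower = theta.min)
```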
We can observe in Figure 22 that EI behaves as described in the Statistical Background section: it is multimodal, null at the already sampled locations, and positive everywhere else, with a magnitude that increases as the Kriging mean decreases and as the Kriging variance increases. A call to EI at one of the design points indeed results in a null value:

R> EI(x[3], model, type = "UK")
[1] 0

The maximum of EI is reached here at a unique point between the two first design points, where the uncertainty is the highest. A numerical maximization of the EI function can be obtained by using the dedicated function max_EI, with tunable rectangular search domain, starting point, and selected control parameters for the underlying rgenoud algorithm:

R> x_star <- max_EI(model, lower = 0, upper = 1, parinit = 0.5,
+    control = list(pop.size = 10, max.generations = 10,
+    wait.generations = 5, BFGSburnin = 10))

max_EI returns the optimal candidate point and the corresponding EI value:

R> print(x_star)
Let us now consider the 2-dimensional Branin function. This time, EI depends on a Kriging model that needs to be estimated. In the sequel, the design is an optimal 15-point LH, generated with the function optimumLHS of the lhs package. The Kriging model is obtained with km, using the default values.

R> d <- 2
R> n <- 15
R> set.seed(0)
R> design <- optimumLHS(n, d)
R> design <- data.frame(design)
R> names(design) <- c("x1", "x2")
R> response.branin <- apply(design, 1, branin)
R> fitted.model1 <- km(design = design, response = response.branin)

The corresponding EI is then computed over a grid:

R> x.grid <- y.grid <- seq(0, 1, length = n.grid <- 25)
R> design.grid <- expand.grid(x.grid, y.grid)
R> EI.grid <- apply(design.grid, 1, EI, fitted.model1)

The resulting EI is represented in Figure 23 for our optimum LH design, and also for two additional random LH designs (obtained with seeds 0 and 100). In most cases, EI detects interesting regions when starting from 15 points, but the nature of the result may deeply differ depending on the initial design drawn. However, we will see in the next section that the final optimization results are not very sensitive to the initial design choice, provided that enough points are sequentially added within the EGO algorithm.
Finally, as in the 1-dimensional example, we observe that EI is multimodal; thus a genetic algorithm is recommended for its optimization. To improve efficiency, the analytical gradient is implemented for the standard case (constant trend). An example of the gradient field is represented in Figure 24, obtained with a 3 × 3 factorial design (see the help file of EI.grad).

EGO illustrated on the Branin example
Now, let us apply the EGO algorithm to the Branin function. For the LH design used in last section, we run 10 steps of EGO by means of the EGO.nsteps function.
R> nsteps <- 10
R> lower <- rep(0, d)
R> upper <- rep(1, d)
R> oEGO <- EGO.nsteps(model = fitted.model1, fun = branin, nsteps = nsteps,
+    lower, upper, control = list(pop.size = 20, BFGSburnin = 2))

The obtained sequence is shown in Figure 25 (left), as well as the EI contours corresponding to the last model (right). We observe that the 3 basins of global optima have been explored. Furthermore, the remaining EI after 10 steps is focused on these areas, showing that there is not much interest in exploring outside at this stage.
R> par(mfrow = c(1, 2))
R> response.grid <- apply(design.grid, 1, branin)
R> z.grid <- matrix(response.grid, n.grid, n.grid)

Now, as pointed out before, the whole EGO sequence depends on the initial LH design, and it is important to study the sensitivity of the results to the design choice. Thus, we have performed 10 steps of EGO with 100 random LH designs of size 15. Figure 26 represents the 3 points which are the closest to the 3 global optimizers of the Branin function, for all 100 drawings. One may observe that the final result is nearly always satisfactory, since all 3 regions are visited and the optimum found is close to the true one.

Applications of EGO to the 6-dimensional Hartman function
We now come back to the 6-dimensional Hartman function previously considered, and build an optimization example upon it. Since it has been shown earlier that a logarithmic transformation of y was necessary to obtain a well-behaved Kriging metamodel, we choose to work at first with suitably transformed data. The initial design chosen here is a 50-point design obtained by uniform sampling over [0, 1]⁶. For the purpose of reproducibility, and since the variability is greater here than in 2 dimensions, one simulated design has been saved as mydata (the one of Ginsbourger 2009), and is used throughout the section.
Starting from a 10-point design. One of the crucial questions when using metamodel-based optimization algorithms with a severely restricted budget is the trade-off between initial design points and additional points obtained during the algorithm itself. Answering this question in a general way is of course out of scope here, but the following complementary experiment might contribute to motivate and illustrate the potential computational benefits of addressing this trade-off well. Figure 28 shows the slower convergence of EGO when starting with a 10-point design. It is very interesting, however, to notice that the global expense in terms of total number of points is about twice smaller in the second case. Investigating the right trade-off in a generic framework seems a valuable research perspective.

Comparison to the results obtained without a transformation of y. To finish with examples concerning the regular EGO algorithm, let us try to apply the algorithm to the Hartman function, without any transformation of y. Figure 29 represents the results obtained with both previously considered initial designs.
Rather surprisingly, the obtained results are not worse than with an adapted transform, and even a bit better in the case of the 10-point initial design. This illustrates the robustness of EGO to some kinds of model misspecification, such as non-Gaussianity. To our knowledge, however, such algorithms may be very sensitive to misspecifications of the covariance kernel, in particular when it comes to correlation length parameters.

Parallelizations of EI and EGO
Distributing metamodel-based algorithms over several processors has become a contemporary industrial challenge, since taking advantage of existing parallel facilities is key to increasing optimization performance when the time budget is severely restricted.

q-points EI
DiceOptim includes a qEI function dedicated to the Monte Carlo estimation of the multipoints EI criterion. Here we come back to the first 1-dimensional example considered in the present section, and illustrate the estimation of both the 1-point and 2-points EI criteria. The results shown in Figure 30 (estimated EI against the candidate design point) illustrate the adequacy between the Monte Carlo estimation of EI by means of qEI and the analytical formula implemented in EI.
The 2-points EI is represented in Figure 31, where it appears that sampling at the two highest bumps is likely to offer the best joint performance, i.e., the best multipoints expected improvement. More on the two-points EI criterion can be found in Ginsbourger et al. (2010).
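The Monte Carlo principle behind such an estimation is simple enough to be sketched in a few lines. The following Python sketch is an illustration of ours, not the package's R code: it draws joint Gaussian samples at the q candidate points, given an assumed Kriging posterior mean and covariance there, and averages the resulting improvements. For q = 1 it can be checked against the closed-form EI.

```python
import math
import numpy as np

def ei_analytical(m, s, a):
    """Closed-form 1-point expected improvement (noise-free Kriging)."""
    z = (a - m) / s
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (a - m) * cdf + s * pdf

def qei_monte_carlo(mean, cov, current_min, n_sim=20000, seed=0):
    """Monte Carlo estimate of the q-points EI.

    mean (length q) and cov (q x q) are the Kriging posterior mean and
    covariance at the q candidate points; current_min is the best
    objective value observed so far.
    """
    rng = np.random.default_rng(seed)
    # Joint posterior samples at the q candidate points.
    samples = rng.multivariate_normal(mean, cov, size=n_sim)
    # Improvement brought by the best of the q simulated responses.
    improvement = np.maximum(current_min - samples.min(axis=1), 0.0)
    return float(improvement.mean())
```

For q = 1 the Monte Carlo value closely matches ei_analytical; for q = 2, two distinct candidates yield a larger joint EI than either single point, which is the effect visible in Figure 31.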

Iterated constant liar test with Branin
Let us finish this section with an illustration of the constant liar algorithm, an approximate multipoints EI maximizer, applied to the Branin function previously used in several examples. Starting from the initial LH design drawn in the previous Branin example, 3 iterations of the constant liar with 8 points each are applied sequentially by means of the CL.nsteps function (see results in Figure 32). Basically, each iteration of CL.nsteps consists in creating an 8-point design relying on the constant liar strategy, evaluating the objective function at these points, and updating the Kriging model with the actual response values.
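The logic of one constant liar iteration can be sketched as follows. This is an illustrative Python transcription (the package works in R; the zero-mean simple Kriging surrogate, the Gaussian kernel with fixed parameters, the candidate-grid maximization and all function names are simplifying assumptions of ours): each of the q batch points is obtained by maximizing the 1-point EI, after which the model is updated with a "lie" (here, the current minimum) in place of the not-yet-available response.

```python
import math
import numpy as np

def gauss_cov(X1, X2, theta=0.3, sigma2=1.0):
    """Gaussian covariance matrix between two sets of points."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return sigma2 * np.exp(-0.5 * d2 / theta ** 2)

def sk_predict(X, y, Xnew, theta=0.3, sigma2=1.0, nugget=1e-8):
    """Zero-mean simple Kriging mean and standard deviation at Xnew."""
    C = gauss_cov(X, X, theta, sigma2) + nugget * np.eye(len(X))
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    c = gauss_cov(X, Xnew, theta, sigma2)
    m = c.T @ alpha
    v = np.linalg.solve(L, c)
    s2 = np.maximum(sigma2 - (v * v).sum(axis=0), 0.0)
    return m, np.sqrt(s2)

def ei_vec(m, s, a):
    """Vectorized 1-point expected improvement."""
    s_safe = np.maximum(s, 1e-12)
    z = (a - m) / s_safe
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    return np.where(s > 1e-9, (a - m) * cdf + s * pdf, 0.0)

def constant_liar(X, y, candidates, q):
    """Build a q-point batch: maximize EI, then 'lie' with the current min."""
    X, y = X.copy(), y.copy()
    lie = y.min()
    batch = []
    for _ in range(q):
        m, s = sk_predict(X, y, candidates)
        k = int(np.argmax(ei_vec(m, s, y.min())))
        batch.append(candidates[k])
        X = np.vstack([X, candidates[k]])   # update the design...
        y = np.append(y, lie)               # ...with the lied response
    return np.array(batch)
```

In DiceOptim, the EI maximization is of course performed by genoud rather than over a fixed candidate grid, and the Kriging model is a full km object rather than this fixed-parameter surrogate.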

Conclusion
We have presented two packages, DiceKriging and DiceOptim, for Kriging-based design and analysis of computer experiments. Based on valuable user feedback, the presently available versions of DiceKriging (≤ 1.3) seem to be mainly appreciated for the versatility of covariance and trend specifications, for the fast and efficient estimation of trend and covariance parameters relying on a gradient-informed global optimizer such as the genoud algorithm of the package rgenoud, and for the effort made to recycle intermediate calculations whenever possible and avoid computing the same quantities twice. In forthcoming versions, we plan to enable connecting the likelihood maximization routines of km with virtually any optimizer, so that expert users can keep more control over the whole workflow, including the invested computing time, the stopping criteria, and, more ambitiously, the way the computations may be distributed over clusters or clouds.
Future improvements of DiceKriging include the development of new covariance classes, for which the preparatory work done in terms of S4 programming will greatly facilitate integration into the existing code (as was already done for the isotropic option). Covariance classes currently in preparation include various classes of kernels with input space deformation (a first version with separable transformations, in the spirit of Xiong, Chen, Apley, and Ding 2007, is already available in version 1.3), kernels taking algebraic invariances into account such as in Ginsbourger, Bay, Roustant, and Carraro (2012), additive kernels as proposed in Durrande, Ginsbourger, and Roustant (2012), and other kinds of graph-structured kernels for high-dimensional modeling as in Muehlenstaedt, Roustant, Carraro, and Kuhnt (2012).
The DiceOptim package is a complement to DiceKriging, dedicated to Kriging-based global optimization criteria and algorithms. Similar performances as for the likelihood maximization are achieved for the EI maximization, also relying on the gradient-informed global optimizer genoud. In the current (1.2) version of DiceOptim, partial freedom is left to the user concerning the parametrization of the EI optimizer (through the control argument, which allows tuning selected parameters of genoud or optim). This trade-off is certainly convenient for a typical user, who would like to have some influence on the algorithm without being overwhelmed by too many parameters. However, it might be frustrating for expert users not to control everything, nor to be able to plug in their own routines for particular tasks. We will try to improve the modularity of our code in the next versions, while keeping the basic usage reasonably simple for non-expert users. This may take the form of a set of very modular functions complemented by easier user-oriented interfaces.
Future improvements of DiceOptim include Kriging-based algorithms for "noisy" simulators (Picheny, Wagner, and Ginsbourger 2012), as well as constrained optimization algorithms in the vein of EGO. Further parallel strategies may be investigated in all cases (noiseless, noisy, constrained), in both synchronous and asynchronous frameworks. To conclude, let us remark that a newly released package dedicated to Kriging-based inversion, KrigInv (Chevalier, Picheny, and Ginsbourger 2012), recently appeared on CRAN, and that the forthcoming developments of DiceOptim and KrigInv will also strongly rely on the next functionalities of DiceKriging, whether in terms of model update or of sequential parameter estimation methods.
Acknowledgments

of the ReDice consortium (http://www.redice-project.org/). The authors wish to thank Laurent Carraro, Delphine Dupuy and Céline Helbert for fruitful discussions about the structure of the code. They also thank Gregory Six and Gilles Pujol for their advice on practical implementation issues, Khouloud Ghorbel for tests on sensitivity analysis, as well as the DICE members for useful feedback. Finally, they gratefully acknowledge Yann Richet (IRSN) for numerous discussions concerning the user-friendliness of these packages.

A. Expressions of likelihoods and analytical gradients
The computations of likelihoods, concentrated likelihoods and analytical gradients are mostly based on Park and Baek (2001). They have been adapted to the Kriging models dealing with noisy observations. Beware that the formulas below must not be used directly for implementation: for numerical stability and speed, one should use the Cholesky decomposition of the covariance matrix and, more generally, avoid inverting matrices directly.
In DiceKriging, we have used the efficient algorithm presented in Park and Baek (2001). See also the implementation details given in Rasmussen and Williams (2006a).
The vector of observations y is normal with mean Fβ and covariance C. Thus, with the notations of the Statistical Background section, the likelihood is:

L(y; β, σ², θ) = (2π)^(−n/2) |C|^(−1/2) exp( −(1/2) (y − Fβ)ᵀ C⁻¹ (y − Fβ) )

The matrix C depends at least on the covariance parameters θ and σ².
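To make the numerical-stability advice above concrete, here is a minimal Python sketch (an illustration of ours, not package code) of the log-likelihood computed through the Cholesky factorization of C, so that neither |C| nor C⁻¹ is formed explicitly.

```python
import numpy as np

def gp_log_likelihood(y, F, beta, C):
    """Gaussian log-likelihood of y ~ N(F beta, C) via Cholesky.

    Avoids det(C) and inv(C): the log-determinant comes from the
    diagonal of the Cholesky factor, and the quadratic form from a
    triangular solve.
    """
    n = len(y)
    L = np.linalg.cholesky(C)      # C = L L^T
    r = y - F @ beta
    w = np.linalg.solve(L, r)      # w^T w = r^T C^{-1} r
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (n * np.log(2.0 * np.pi) + log_det + w @ w)
```

The same two byproducts of the factorization (triangular solves and the diagonal of the factor) also serve for the concentrated likelihoods and their gradients.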

A.4. Case of known trend
When the vector β of trend parameters is known, one can check that all the aforementioned formulas for the (concentrated) likelihoods and their derivatives still hold by replacing the estimate β̂ by the known β.

B. Analytical gradient of expected improvement
The aim of this section is to present the efficient algorithm developed for package DiceOptim to compute the analytical gradient of the expected improvement (EI) in the most common case: noise-free observations with constant mean (ordinary Kriging). The method adapts to a general linear trend but requires the implementation of the derivatives of all trend basis functions f_k. First recall the EI expression:

EI(x) = (a − m(x)) Φ(z(x)) + s(x) φ(z(x)),

where a is the current function minimum value, m(x) = m_UK(x) and s²(x) = s²_UK(x) are the prediction mean and variance for universal Kriging, and z(x) = (a − m(x))/s(x). By using the relations Φ′ = φ and φ′(t) = −t φ(t), the gradient of EI(x) reduces to:

∇EI(x) = −∇m(x) Φ(z(x)) + ∇s(x) φ(z(x))    (16)

Denote μ̂ = (1ᵀ C⁻¹ y)/(1ᵀ C⁻¹ 1). Then for a constant trend we have:

m(x) = μ̂ + c(x)ᵀ C⁻¹ (y − 1μ̂)
s²(x) = σ² − c(x)ᵀ C⁻¹ c(x) + (1 − 1ᵀ C⁻¹ c(x))² / (1ᵀ C⁻¹ 1)

from which, using ∇s²(x) = 2 s(x) ∇s(x), we deduce:

∇m(x) = ∇c(x)ᵀ C⁻¹ (y − 1μ̂)
∇s(x) = −(1/s(x)) [ ∇c(x)ᵀ C⁻¹ c(x) + (1 − 1ᵀ C⁻¹ c(x)) ∇c(x)ᵀ C⁻¹ 1 / (1ᵀ C⁻¹ 1) ]

To compute these expressions efficiently, we use the Cholesky decomposition of the covariance matrix and the resulting auxiliary variables z, M, u defined in Appendix C.1. As F = 1, M is an n × 1 vector, and is renamed u in this section. In the same way, we introduce the n × d matrix W = (Tᵀ)⁻¹ (∇c(x))ᵀ. Then ∇m(x) and ∇s(x) can be rewritten concisely in terms of these auxiliary variables, and ∇EI(x) is obtained with formula (16).
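The constant-trend formulas above are easy to verify numerically. The following Python sketch is an illustration of ours (toy 1-D design, Gaussian kernel and fixed parameters are arbitrary choices; it uses an explicit inverse on a 3 × 3 toy matrix for brevity, whereas DiceOptim works through the Cholesky-based auxiliary variables): it evaluates EI and its analytical gradient, which can then be compared to a finite-difference derivative.

```python
import math
import numpy as np

# Toy 1-D ordinary Kriging model (design, responses and kernel
# parameters are illustrative assumptions).
X = np.array([0.1, 0.4, 0.9])
y = np.array([0.8, 0.1, 0.5])
sigma2, theta = 1.0, 0.3

def c_vec(x):
    """Covariance vector c(x) for the Gaussian kernel."""
    return sigma2 * np.exp(-0.5 * (x - X) ** 2 / theta ** 2)

def dc_vec(x):
    """Derivative of c(x) with respect to x."""
    return c_vec(x) * (-(x - X) / theta ** 2)

C = sigma2 * np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / theta ** 2)
Ci = np.linalg.inv(C)                     # acceptable for a 3 x 3 toy
one = np.ones(len(X))
mu = (one @ Ci @ y) / (one @ Ci @ one)    # hat(mu)

def m_s(x):
    c = c_vec(x)
    m = mu + c @ Ci @ (y - one * mu)
    s2 = sigma2 - c @ Ci @ c + (1.0 - one @ Ci @ c) ** 2 / (one @ Ci @ one)
    return m, math.sqrt(max(s2, 1e-12))

def ei(x, a):
    m, s = m_s(x)
    z = (a - m) / s
    return (a - m) * 0.5 * (1 + math.erf(z / math.sqrt(2))) \
        + s * math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def grad_ei(x, a):
    m, s = m_s(x)
    c, dc = c_vec(x), dc_vec(x)
    z = (a - m) / s
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    dm = dc @ Ci @ (y - one * mu)
    ds = -(dc @ Ci @ c
           + (1.0 - one @ Ci @ c) * (dc @ Ci @ one) / (one @ Ci @ one)) / s
    return -dm * cdf + ds * pdf             # formula (16)
```

At any point where s(x) > 0, grad_ei agrees with a central finite difference of ei to high accuracy, which confirms the derivation step by step.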

Computational cost
We now indicate the order of the marginal computational cost, knowing the results of past computations. In particular, we assume that a km object was first created, so the auxiliary variables T, z and u have already been computed. In addition, ∇EI(x) is supposed to be computed after EI(x) has been evaluated, as is the case during the optimization procedure. Finally, we ignore the terms of order n (assuming that n ≫ d and n ≫ p). Then the complexity of EI(x) is given by the Kriging mean computation step, equal to n² (see Appendix C.2). During this step, s(x), Φ(z(x)) and φ(z(x)) are stored. Therefore, the complexity of ∇EI(x) is due to the computation of W, which is done by solving d upper triangular systems of linear equations, and to some linear algebra. This requires dn² operations.

C. Implementation notes

C.1. Auxiliary variables
To prevent numerical instabilities and avoid re-computations, four auxiliary variables are used in DiceKriging and DiceOptim, three of which were proposed by Park and Baek (2001).

Procedure                 Complexity                    Memory    Limiting step
Log-likelihood            (1/3)n³ + ((d+p)/2 + 2)n²     n²        Cholesky dec. of C
Log-likelihood gradient   (1/3)n³ + (6d+1)n²            n²        Inverting C from T
Kriging mean (SK, UK)     mn²                           (n+m)n    (Tᵀ)⁻¹ c(x)
Kriging variance (SK)     2mn                           (n+m)n
Kriging variance (UK)     2m(p+1)n + mp³/3 + p²         (n+m)n
EI                        n²                            n²        Kriging mean
EI gradient               dn²                           n²        (Tᵀ)⁻¹ (∇c(x))ᵀ

Table 7: Computational cost (complexity) and memory size of the most important procedures implemented in DiceKriging and DiceOptim. Recall that n is the number of design points, d the problem dimension, p the number of trend basis functions and m the number of new data points where to predict.
For instance, the log-likelihood complexity is approximately (1/3)n³: multiplying the number of experiments by 2 multiplies the computational time by 8, and the size of the covariance matrix involved by 4.

Comments
1. The complexity related to the log-likelihoods is from Park and Baek (2001). The factors ((d+p)/2 + 2)n² and (6d+1)n² hold for the Gaussian covariance: they can be larger for more complex covariances, but this does not modify the order of magnitude. For the other procedures, see Appendices B and C.2.

2. The memory size is no less than n², which corresponds to the storage of the n × n covariance matrix. When computing gradients, only the final derivatives are stored, and not the auxiliary matrix derivatives. The memory size for computing the Kriging mean and variance can be large: this is because we have vectorized the computation of the formulas described in Appendix C.2, to improve speed.

C.4. Speed
Some functions have been implemented in the C language. There was no need to do so for the linear algebra routines, since the R ones themselves call Fortran or C routines. On the other hand, the computation of covariance matrices is very slow in pure R, due to the presence of triple loops; it has therefore been implemented in C (functions covMatrix, covMat1Mat2).

D. Trustworthiness
Producing a trustworthy result is the main concern, or "prime directive" as named by Chambers (2008), of software providers. To investigate the trustworthiness of our packages, we have implemented several tests. The first one is a consistency test between the prediction formulas of simple Kriging and simulation results. In limit cases, prediction with universal Kriging is also successfully compared to the linear regression confidence bounds computed with the function lm from package stats. It is included in the help page of simulate.km. We now detail the second one, whose code can be found in the help page of km. It is a performance study of maximum likelihood estimators, achieved by monitoring Kriging parameter estimates based on simulated paths of a Gaussian process whose law is actually known. The idea is to simulate Gaussian process paths corresponding to a Kriging model with known parameters and, given observations of the sampled paths at a given design, to estimate the model parameters using km. Based on the obtained results (say, for 100 cycles of simulation and estimation per Kriging model), we can then analyze the empirical distribution of these estimates and compare them to the true values (especially in terms of bias and variance).
In this paragraph, we show the results obtained in 3 and 10 dimensions. Following Loeppky, Sacks, and Welch (2009), we fix the number of runs n proportionally to the problem dimension d. However, we have chosen n = 15d instead of the informal rule n = 10d, since, without any penalization of the likelihood, this seems to give more stable estimation results. We then used a maximin LH design, obtained with package lhs. Note that with an LH design with so few runs, only large enough values of the θ's can be estimated. Thus the θ's are chosen equally spaced from 0.3 to 0.7 in the 3-dimensional case, and equally spaced from 0.5 to 1.5 in the 10-dimensional one (in both cases, the domain is [0, 1]^d). The arguments of km are the default values, except in 10 dimensions where the upper bound was fixed to 3. In particular, the optimization method is BFGS. Better results can be obtained with the genetic algorithm, but they depend on its parameters and are thus not shown here.
The results are shown in Figures 33 and 34.
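The mechanics of such a simulation-estimation study can be sketched compactly. The following Python sketch is an illustration of ours (km, lhs and penalized likelihoods are not involved; we use a 1-D uniform design, a zero-mean unit-variance Gaussian kernel, and a plain grid search over θ instead of BFGS): it simulates paths from a model with a known range parameter and re-estimates that parameter by maximum likelihood on each path.

```python
import numpy as np

rng = np.random.default_rng(42)
n, theta_true = 40, 0.2
Xd = np.sort(rng.uniform(0.0, 1.0, n))   # 1-D design (uniform, not LH)

def cov(theta):
    """Zero-mean, unit-variance Gaussian covariance on the design."""
    d2 = (Xd[:, None] - Xd[None, :]) ** 2
    return np.exp(-0.5 * d2 / theta ** 2) + 1e-6 * np.eye(n)

def log_lik(y_sim, theta):
    """Log-likelihood up to an additive constant, via Cholesky."""
    L = np.linalg.cholesky(cov(theta))
    w = np.linalg.solve(L, y_sim)
    return -np.sum(np.log(np.diag(L))) - 0.5 * (w @ w)

# Simulate paths from the true model and re-estimate theta on each.
grid = np.linspace(0.05, 0.6, 56)
L_true = np.linalg.cholesky(cov(theta_true))
estimates = []
for _ in range(20):
    y_sim = L_true @ rng.standard_normal(n)
    estimates.append(grid[np.argmax([log_lik(y_sim, t) for t in grid])])
estimates = np.array(estimates)
# The empirical distribution of `estimates` can then be inspected for
# bias and variance around theta_true, as done with km in Figures 33-34.
```

With enough replications, the spread of the estimates around the true θ gives exactly the kind of bias and variance diagnostics reported in Figures 33 and 34.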