Introducing BACCO, an R Bundle for Bayesian Analysis of Computer Code Output

This paper introduces the BACCO bundle of R routines for carrying out Bayesian analysis of computer code output. The bundle comprises packages emulator and calibrator, computerized implementations of the ideas of Oakley and O’Hagan (2002) and Kennedy and O’Hagan (2001a) respectively. The bundle is self-contained and fully documented R code, and includes a toy dataset that furnishes a working example of the functions. Package emulator carries out Bayesian emulation of computer code output; package calibrator allows the incorporation of observational data into model calibration using Bayesian techniques. The package is then applied to a dataset taken from climate science.


Introduction
In this paper I introduce BACCO, a new R (R Development Core Team 2005) bundle for Bayesian analysis of computer code output. The bundle comprises packages emulator and calibrator, which implement the ideas of Oakley and O'Hagan (2002) and Kennedy and O'Hagan (2001a) respectively. The BACCO bundle can be obtained from the Comprehensive R Archive Network, CRAN, at http://CRAN.R-project.org/.
Many computer models, including climate prediction models such as C-GOLDSTEIN (Marsh, Edwards, and Shepherd 2002), take many hours, or even weeks, to execute. This type of model can have tens to hundreds of free (adjustable) parameters, each of which is only approximately known. Consider a scenario in which a particular model has been run in the past with different sets of input parameters. The code output (here attention is confined to a single scalar value, such as global average temperature) is desired at a new set of parameters, at which the code has not been run. Under the Bayesian view (Currin, Mitchell, Morris, and Ylvisaker 1991), the true value of the code output is a random variable, drawn from a distribution that is conditioned by our prior knowledge, and in this case by the previous code runs; the computer code is thus viewed as a random function. Package emulator gives statistical inferences about this random function, and may be used to furnish computationally cheap-yet statistically rigorous-estimates of the computer code output. Although deterministic-in the sense that running the model twice with identical inputs gives identical outputs-the Bayesian paradigm is to treat the code output as a random variable. Formally, the code output y is represented as a function η(x; θ) of the input parameter vector x and parameter vector θ; if no confusion can arise, y = η(x) is written. Although η(·, ·) is known in principle, in practice this is not the case. C-GOLDSTEIN, for example, comprises over 10 4 lines of computer code, and the fact that the output is knowable in principle appears to be unhelpful in practice.
It is perhaps easier to understand output from a deterministic code as a random variable if one imagines oneself to be at a computer terminal, waiting for a run to finish. The fact that both the code itself, and its input parameters, are perfectly prescribed (thus the output is in principle predetermined), does not reduce one's subjective uncertainty in the output; the Bayesian paradigm treats this as indistinguishable from uncertainty engendered by a conventional random variable. Of course, if the code has been run before at slightly different point in parameter space, one may be able to assert with some confidence that this particular run output shouldn't be too different from the last run's (and of course if the code is run at exactly the same point in parameter space, we can be certain that the code output will be identical 1 ). Such considerations are formalized in the Gaussian process model, discussed in detail in Section 1.2.
A case often encountered in practice is that the values of one or more components of x are uncertain, typically because of practical difficulties of measurement (cloud albedo is a commonly-cited example). It is therefore appropriate to consider X, the true input configuration, to be a random variable with distribution G(x). Thus the output Y = η(X) is also a random variable and it is the distribution of Y -the 'uncertainty distribution'-that is of interest.
In the case of C-GOLDSTEIN, direct Monte-Carlo simulation of Y is so computationally intensive as to become impractical: a single run typically takes 24 hours, and the parameter space is 27 dimensional.
Package emulator allows one to emulate the code: the unknown function η(·) is assumed to be a Gaussian process, and previous code runs constitute observations. I will show in this paper that emulation can be orders of magnitude faster than running the code itself; and, at least in the examples considered here, the emulator provides an estimate that is reasonably close to the value that code itself would produce.
In this paper, the object of inference is the random function that is evaluated by the computer code. Although one may question whether this object is actually of interest given that the model is unlikely to predict reality correctly, Kennedy and O'Hagan (2001a) point out that even a good model may be rendered ineffective by uncertainty in the input; and that 1 Many computer models are chaotic in the sense that running the model twice with closely separated but non-identical parameter values will result in very different outputs. Such systems are generally not amenable to the approach outlined here because the standard parameterization for the correlation function c(x, x ) discussed in the next Section breaks down. Such systems may be modelled using a device known as a nugget which breaks correlation between infinitesimally different points in parameter space. However, the climate models used here as test cases are known to be non-chaotic. quantification of the uncertainty distribution is the first step towards reducing uncertainty in the predictions.

Bayesian calibration of computer models
Notwithstanding that computer models are interesting and important entities in their own right, the ultimate object of inference is reality, not the model. Package calibrator implements a statistically rigorous method of incorporating observational data into uncertainty analyses, that of Kennedy and O'Hagan (2001a) and Kennedy and O'Hagan (2001b) (hereafter KOHa and KOHb respectively).
When preparing a computer model for making predictions, workers often calibrate the model by using observational data. In rough terms, the idea is to alter the model's parameters to make it fit the data.
A statistically unsound (Currin et al. 1991) methodology would be to minimize, over the allowable parameter space, some badness-of-fit measure over the physical domain of applicability of the code.
To fix ideas, consider C-GOLDSTEIN predicting sea surface temperature (SST) as a function of position x (here styled "latitude" and "longitude"). The naïve approach discussed above would be to minimize, over all parameter values θ in a set of allowable parameters P, the squared discrepancies between observations z(x) and model predictions η(x, θ) summed over observational space X : >From a computational perspective, minimizing the object function above necessitates minimization over a large dimensional parameter space; optimization techniques over such spaces are notorious in requiring large numbers of function evaluations, which is not be feasible in the current context. Also note that the method requires code evaluations at the same points ("x") as the observations z(·), which may not be available.
Other problems with this approach include the implicit assumption that observation errors (and model evaluations) are independent of one another. This assumption is likely to be misleading for closely separated observation points and for pairs of code evaluations that use parameter sets that differ by only a small amount.
Bayesian methods are particularly suitable for cases in which the above problems are present, such as the climate model predictions considered here. KOHa present a methodology by which physical observations of the system are used to learn about the unknown parameters of a computer model using the Bayesian paradigm.

The Gaussian process
The notion of Gaussian process underlies both emulator and calibrator packages. Consider a function f : X −→ R with X ⊆ R p . If f (·) is regarded as an unknown function, in the Bayesian framework it becomes a random function. Formally, f (·) is a Gaussian process if, for every n ∈ N, the joint distribution of f (x 1 ), . . . , f (x n ) is multivariate normal provided x i ∈ X .
In particular, f (x) is normally distributed for all x ∈ X .
The distribution is characterized by its mean function m(·), where m(x) = E {f (x)}, and its covariance function c(·, ·), where c( It is usual to require that f (x) and f (x ) be closely correlated if x and x are themselves close; here, it is assumed that c(·, ·) = σ 2 r(x − x ) with r(d) = exp(−d Ωd), Ω being a symmetric positive definite matrix (which is usually unknown and has to be estimated).

Emulation: Computer output as a Gaussian process
For any random function η(·) and set of points {x 1 , . . . , x n } in its domain, the random vector {η(x 1 ), . . . , η(x n )} is assumed to be a multivariate normal distribution with mean conditional on the (unknown) vector of coefficients β and h(·), the q known regressor functions of x = (x 1 , . . . , x p ) ; a common choice is h(x) = (1, x 1 , . . . , x p ) , but one is free to choose any function of x. The covariance is given by where c(x, x ) is a correlation function that measures the correlation between x and x ; here the form will be used, where B is a positive semidefinite matrix. This form has the advantage that η(·) has derivatives of all orders; other forms for the covariance function do not have this desirable property. Oakley and O'Hagan use the weak conjugate prior for β and σ 2 : and discuss methods for expert elicitation of a, r, z and V . The random function η(·) may be conditioned on observations at specific points in parameter space (ie code runs). This set of points is known as the experimental design, or the design matrix. Although it is possible to tailor design matrices to a particular problem, here a Latin hypercube system is used. With the specified prior, and an experimental design D = {x 1 , . . . , x n } on which η(·) has been evaluated giving d = η(D), it can be shown (Oakley 1999;Oakley and O'Hagan 2002) that Thus m * (x) is a computationally cheap approximation to η(x); uncertainties are given byσ c * (x, x). These equations are consistent in that the estimated value for points actually in the design matrix is in fact the observations (with zero error). Writing X = (x 1 , . . . , x n ) , we have where we use the fact that h(X) = H and t(X) = A, and expandβ directly. It may similarly be shown that c * (x, x) = 0 for x ∈ D, as expected: the emulator should return zero error when evaluated at a point where the code output is known.

Package emulator in use: Toy problem
The central function of package emulator is interpolant(), which calculates the terms of Equation 3. Here, I use the package to investigate a simple toy world in which observations are a Gaussian process on six parameters named a to f: the mean is 0*a + 1*b + 2*c + 3*d + 4*e + 5*e + 6*f, corresponding to β = 0:6; the residuals are correlated multivariate Gaussian, with unit variance and all scales unity 2 . The overall approach is to use this scheme to generate data, then use the data to estimate the parameters, and finally, compare the estimates with the true values.
I first consider a synthetic dataset consisting of observations of the predefined Gaussian process on a design matrix generated by emulator's function latin.hypercube(n,d). This function takes two arguments: the first, n, is the number of observations to take; the second, d is the number of dimensions of the design space. The output of the function latin.hypercube() is a matrix with each row corresponding to one point in parameter space at which the Gaussian process is observed. Taking 20 observations in a 6 dimensional design space appears to give a small yet workable toy problem: 2 Here, "scales" means the e-folding lengthscales for correlations between datapoints. The concept is a special case of Equation 1 in which B is diagonal, with elements Bi, say; the relevant lengthscales Si will be Bi = 1/S 2 i . Then the correlation between two points x and x is given by Sometimes it is more convenient to consider the elements of matrix B, and sometimes it is better to think in terms of the lengthscales Si (they have the same dimensions as the parameters). It is now possible to specify the expectation of the Gaussian process as a linear product, with coefficients 0:6 acting on regression function H1.toy(), which prepends a 1 to its argument (viz regressor.basis(x)=c(1,x)). In R idiom: where regressor.multi() is a vectorized version of regressor.basis(). Recall that we suppose f() to be an unknown function; one object of package emulator is to infer its coefficients. Next, matrix A is calculated, with scales all equal to 1: > toy.scales <-rep(1, 6) > toy.sigma <-0.4 > A <-toy.sigma * corr.matrix(xold = toy, scales = toy.scales) Thus, for example, A[1,2] gives the covariance between observation 1 and observation 2. We can simulate observational data by sampling from the appropriate Gaussian process directly, using the mvtnorm package (Hothorn, Bretz, and Genz 2001): > d <-as.vector(rmvnorm(n = 1, mean = expectation, sigma = A)) Vector d is the vector of observations, with variance matrix A. Now function interpolant() of package emulator can used to estimate the unknown function f(), at a point not in the training set: > x.unknown <-rep(0.5, 6) > jj <-interpolant(x.unknown, d, toy, scales = toy.scales, g = TRUE) > print(drop(jj$mstar.star)) [1] 9.810046 The estimated value (element mstar.star) is reasonably close to the correct value (which we know to be 0.5*sum(0:6)=10.5). Also, the estimated value for the regression line, given by element betahat, is close to the correct value of 0:6: c d e f -0.4661564 1.5226936 2.8875000 2.7562766 3.6881656 5.0433217 5.0511526 The package will also give an estimate of the error onβ: showing in this case that the true value of each component of β lies within its marginal 95% confidence limits.

Estimation of scales
In the above analysis, the scales were set to unity. In real applications, it is often necessary to estimate them ab initio and a short discussion of this estimation problem is presented here.
The scales are estimated by maximizing, over allowable scales, the posterior likelihood. This is implemented in the package by optimal.scales(), which maximizes the a-postiori probability for a given set of scale parameters. The likelihood is which is evaluated in the package by scales.likelihood(). It is more convenient to work with the logarithms of the scales, as they must be strictly positive. Function optimal.scales() is essentially a wrapper for multidimensional optimizing routine optim(), and is used as follows: > scales.optim <-optimal.scales(val = toy, scales.start = rep(1, + 6), d = d, give = FALSE) > print(scales.optim) [1] 1.541431e+01 2.992507e-06 1.072606e+01 3.786117e+00 1.689909e-02 The estimated scales are not particularly close to the (known) real values, all of which are unity; the scales are known to be "difficult to estimate" (Kennedy and O'Hagan 2000). One interpretation might be that there is not a large amount of information residing in the residuals once β has been estimated.
However, using the estimated scales does not seem to make much difference in practice: > interpolant(x.unknown, d, toy, scales = toy.scales, g = FALSE) [1] 9.810046 > interpolant(x.unknown, d, toy, scales = scales.optim, g = FALSE) [1] 9.738393 2.2. Emulation applied to a dataset from climate science The methods above are now used for a real dataset obtained from C-GOLDSTEIN , an Earth systems model of intermediate complexity 3 .
The results.table dataset is a dataframe consisting of 100 rows, corresponding to 100 runs of C-GOLDSTEIN. Columns 1-19 are input values and columns 20-27 are code outputs. Here, the column corresponding to global average surface air temperature (SAT) is used. One interpretation of the accurate prediction of SAT would be that this output variable is well-described by the Gaussian process model.
However, variables which exhibit violently nonlinear dynamics might be less well modelled: examples would include diagnostics of the thermohaline circulation (THC), which is often considered to be a bistable system (Vellinga and Wood 2002), exhibiting catastrophic breakdown near some critical hypersurface in parameter space. Such behaviour is not easily incorporated into the Gaussian process paradigm and one would expect emulator techniques to be less effective in such cases.

Computational efficiency
C-GOLDSTEIN takes 12 to 24 hours to complete a single run, depending on the parameters. The emulator software presented here takes about 0.1 seconds to calculate the expected value and estimated error-a saving of over five orders of magnitude.
One computational bottleneck is the inversion of matrix A, but this only has to be done once per emulator.

Bayesian calibration: Package calibrator
Computer climate models generally require two distinct groups of inputs. One group is the (unknown) parameter set about which we wish to make inferences; these are the calibration inputs. It is assumed that there is a true but unknown parameter set θ = (θ 1 , . . . , θ q 2 ) with the property that if the model were run with these values, it would reproduce the observations plus a model inadequacy term plus noise. For any single model run, it is convenient to denote the calibration parameters by t where t = (t 1 , . . . , t q 2 ).
The other group comprises all the other model inputs whose value might change when the calibration set is fixed, conventionally denoted by x = (x 1 , . . . , x q 1 ). In the current context, this group is the latitude and longitude at which mean surface temperature is desired.
A model run is essentially an evaluation-or statistical observation-of the unknown random function η(·, ·) at a specified point in parameter and location space η(x, t); a field observation is a statistical observation of another unknown random function ξ(·).
This system has the notational advantage of distinguishing the true but unknown value θ of the calibration inputs from the value t that were set as inputs when running the model.
The calibration data comprise the n field observations z = (z 1 , . . . , z n ) (ie observations of ξ(x i ), subject to error), and the outputs y = (y 1 , . . . , y N ) from N runs of the computer code evaluated at points on a design matrix D 1 where D 1 = {(x * 1 , t 1 ), . . . , (x * N , t N )}. Thus y = η(D 1 ), or y i = η(x * i , t i ); we write d = (y , z ) for the full set of calibration data.

Model
KOHa suggest that is an adequate representation of the relationship between observation and computer model output. Here δ(·) is a model inadequacy term that is independent of model output η(·, ·). It is then assumed that the observations z i = ξ(·) + N (0, λ). The true value of neither λ nor ρ is known and has to be inferred from observations and the calibration dataset.
This statistical model suggests a non-linear regression in which the computer code itself defines the regression function through the term ρ · η(x i , θ) with parameters ρ and θ. KOHa consider the meaning of model fitting in this context and conclude that "the true value for θ has the same meaning and validity as the true values of regression parameters [in a standard regression model]. The 'true' θ is a best-fitting θ in the sense of representing the data faithfully according to the error structure specified for the residuals".

Prior knowledge
The Bayesian paradigm allows the incorporation of a priori knowledge (beliefs) about the parameters of any model being used. Consider a model in which climate sensitivity λ is a free parameter 4 . In principle, we are free to 'tune' λ to give the best fit to observational data. However, most workers would consider 0.4 − 1.2 K · W −1 · m 2 to be a reasonable range for this physical constant, in the sense that finding an optimal value for λ outside this range would be unacceptable.
It is possible to incorporate such 'prior knowledge' into analysis by ascribing a prior distribution to λ. One might interpret the upper and lower ends of the range as 5th and 95th percentiles of a normal, or lognormal, distribution; if the bounds are "hard", then a uniform PDF might be used.
The prior PDF for the vector of parameters θ is denoted p(·). KOHb assume that θ is multivariate Gaussian with θ ∼ N (m θ , V θ ).

Hyperparameters
KOHb define hyperparameters ρ, λ, ψ 1 , ψ 2 that specify the correlations between the observations in the dataset, and the relationship between the computer model and observations.
The variance matrix of d has N +n rows and columns; its (j, j )-th element is c 1 ((x * j , t j ), (x * j , t j )). KOHa derive results for the case where where Ω x and Ω t are arbitrary functions of hyperparameter ψ 1 . A similar definition gives the variance matrix of x in terms of a second hyperparameter ψ 2 .
For the purposes of implementing KOHa, it is convenient to use a single R object that contains not only the hyperparameters above, but also other information such as σ 2 1 and ρ. The values of these variables have to be inferred in any case and it is logical to consider them alongside the formal hyperparameters.
In the R implementation presented here, the hyperparameter object phi also contains extra information such as the Cholesky decomposition of the square matrices, and the prior mean and variance of θ. This device greatly simplifies many functions in the package; an example is given in the discussion of ht.fun() below.

Design of package calibrator
The overarching design philosophy of package calibrator is to be a direct and transparent implementation of KOHa and KOHb. Each equation appearing in the papers has a corresponding R function, and the notation is changed as little as possible. One reason for this approach is debugging: with such a structured programming style, it is possible to identify differences between alternative implementations. Speed penalties of this approach appear to be slight.

Notation
In calibrator, most notation is an ascii version of that used by KOHb. Function names are descriptive where possible, or (as in p.eqn8.supp()), named for the relevant equation number in the supplement KOHb.

Toy dataset
Partly to facilitate code testing, and partly as a didactic aid in the man pages, package calibrator includes a simple "toy" dataset. Toy objects have a .toy suffix, as in theta.toy. Each function's documentation includes a working example in which toy data is processed appropriately.
For any real application (such as the climate analysis discussed below), each member of the toy dataset must be replaced with the corresponding real dataset.
Toy datasets are loaded by data(toys) and ?toys gives a complete list. The toy dataset is designed to be simple and transparent, thus offering a clear test of the package's methods. The dataset comprises the following objects: • Design matrix. Two elements: -D1.toy, matrix of 8 rows of code run points, with five columns. The first two columns ('x' and 'y') represent the input space of ζ (nonparameters styled "latitude and longitude"); the next three are parameter values.
-D2.toy, matrix of 5 rows of field observations on two variables 'x' and 'y'.
• Observational data. Three elements: y.toy, a vector of length 8. Each element corresponds to the code output at points corresponding to the rows of design matrix D1.toy.
z.toy, a vector of length five. Each element corresponds to a measurement at each of the rows of D2.toy.
d.toy, a data vector consisting of length 13: elements 1-8 are y.toy and elements 9-13 are z.toy.
• Functions. Ten elements: create.new.toy.datasets(N,n), a function to create new toy datasets with N observations and n code runs; error distributions generated by directly simulating the appropriate Gaussian Process on a design matrix that is a Latin hypercube generated by latin.hypercube().
-E.theta.toy(), a function that returns expectation of H(D) with respect to theta -Edash.theta.toy() as above, but returns expectation with respect to the "dashed" normal distribution detailed on page 4 of KOHb.
extractor.toy(), a function that extracts x * and t from D2; a toy example is needed because the extraction method depends on the nature of D1.
hpp.fun.toy(), a function to create hyperparameter object phi in a form suitable for passing to the other functions in the library.
hpp.change.toy(), as above but modifies hyperparameter object phi computer.model(), function that represents the unknown computer program. In the toy case, it is a simple Gaussian process, with mean h.toy(x,t) %*% REAL.COEFFS and variance matrix REAL.SIGMA1SQUARED * diag(REAL.SCALES). The capitalized variable names are the true but unknown parameters of the computer model. They appear in the R source code but are not modifiable in the context of package usage, because they represent the "real world"). Estimates of these quantities should be close to the actual values.
reality(), a function to simulate observational data. This toy function is a simple Gaussian process, with internal (unknown) parameters. The mean is given by computer.model(), plus a Gaussian process with appropriate capitalized parameters. As above, these parameters are stand for unobservable properties of the system under examination, whose values must be inferred from observation.
The objects in the toy dataset are chosen to be small enough for the functions of the package to operate quickly. However, for the purposes of testing the optimization strategy, longer datasets are needed. Such are given by function create.new.toy.datasets(), which automagically creates a dataset of the same form as the toy dataset discussed above but with the number of observations supplied by the user. The observations are drawn directly from the appropriate multivariate Gaussian distribution.

Example function
In order to show the design philosophy of calibrator, and to illustrate some of the programming issues that occur when implementing a conceptually complex methodology, I discuss in some detail a typical function of the package. The function chosen is ht.fun(), which is one of many needed when calculating the conditional covariance matrix of z. Function ht.fun() calculates the matrix where h 1 (·, ·) is the basis function for calculating expectation of y, and t(·, ·) is the set of distances to a point with k-th element With these definitions, KOHa show that the k-th column of Equation 7, and thus ht.fun(...) [,k], is where E Θ is discussed below. Equation 9 is one of several similar equations appearing in KOHa. In the implementation, function ht.fun() takes ten arguments: > args(ht.fun) function (x.i, x.j, D1, extractor, Edash.theta, H1, fast.but.opaque = TRUE, x.star = NULL, t.vec = NULL, phi) NULL Apart from the arguments already discussed, this function requires several further arguments: • extractor() is a function that splits D 1 into x * and t. It is necessary to pass this function explicitly if the split between code input space and parameter space is to be arbitrary. In the toy example, with the code input space being R 2 and the parameter space being R 3 , the working toy example extractor.toy() (supplied as part of the toys dataset in the package) returns a list with elements x[1:2] and x[3:5].
• Edash.theta() is a function that returns expectation with respect to the normal dis- • fast.but.opaque is a Boolean variable, with default TRUE meaning to take advantage of Cholesky decomposition for evaluating quadratic forms (the hyperparameter object phi includes the upper and lower triangular decompositions by default). However, this requires the calling routine to supply x * and t explicitly.
Setting fast.but.opaque to FALSE should give numerically identical results in a slower, but more conceptually clear, manner. This option is used mainly for debugging purposes.
• phi. The hyperparameter object. In KOHb, "hyperparameters" refers to ψ 1 and ψ 2 , but in this computer implementation is it convenient to augment φ with other objects that do not change from run to run (such as the a priori PDF for θ and Cholesky decompositions for the various covariance matrices). This way, it is possible to pass a single argument to each function in the package and update it in a consistent-and object-oriented-manner.
The package includes methods for setting and changing the components of phi (the toy examples provided are hpp.fun.toy() to create a hyperparameter object and hpp.change.toy() to change elements of the hyperparameter object).
This example illustrates several features of the KOHa approach. Firstly, the system is algebraically complex, requiring a multi-level hierarchy of concepts. The design philosophy of emulator is to code, explicitly, each function appearing in KOHa; and to include working examples of these functions in the online help pages. This structured approach aids debugging and code verification.

Optimization of parameters and hyperparameters
Kennedy and O'Hagan estimate the hyperparameters first, then the parameters. In this paper, I use 'hyperparameters' to mean ψ 1 and ψ 2 together with ρ, λ, σ 2 1 , σ 2 2 ; all have to be estimated.
The package provides functions stage1() and stage2() which use numerical optimization techniques to estimate hyperparameters ψ 1 and ψ 2 in two stages, as per KOHb. These functions maximize the posterior likelihood of observations y.
Function stage3() optimizes the model parameters by maximizing the posterior PDF of θ; but note that KOHb explicitly state that point estimation of θ is not generally desirable, and suggest that calibrated prediction is usually a better approach.

Package calibrator in use
Two case studies of calibrator are now presented. First, a simple"toy"dataset will be analysed. The object of this is to show how the software behaves when the dataset has only a simple structure that is known in advance.
The package is then applied to a real climate problem in which model parameters are assessed using model output and observational data.

The toy problem
In this section, calibrator is used to estimate the hyperparameters of the toy dataset, in the two-stage process outlined above and in more detail in KOHb; all executable R code is taken directly from the stage1() help page. The parameters of the model are then estimated, using the estimated hyperparameters.
One advantage of considering a simple toy example is that the covariance structure may be specified exactly. Thus one can generate observations and code runs directly from the appropriate multivariate Gaussian distribution using rmvnorm(n=1, mean, sigma), where rmvnorm(), from package mvtnorm returns samples from a multivariate Gaussian distribution with specified mean and variance.
If this is done, one knows that the conditions and assumptions specified by KOHb are met exactly, with the additional advantage that the basis functions, scales, regression coefficients, and model parameters, are all known and adjustable.
KOHb state explicitly that exact determination of the hyperparameters tends to be difficult in practice, and indeed even in the toy example the numerically determined values for phi differ somewhat from the exact values, although they are correct to within half an order of magnitude. Note that KOHb consider only the case of h 1 (x, t) = h 2 (x) = 1, corresponding to a constant prior with β 1 and β 2 being scalars; but in the toy example I consider the more complex case of h 1 (x, t) = (1, x, t) and h 2 (x) = (H(0.5 − x[1]), H(x[2] − 0.5)) where H is the Heaviside function 5 . In the climatic case considered below, Legendre functions are needed to specify a prior for global temperatures as a function of latitude and longitude.

Results from toy problem
In this section, various parameters and hyperparameters of the toy problem are estimated using standard numerical optimization techniques. The correct values are viewable by examining the source code, or by using computer.model() with the export argument set to TRUE. Because these operations are not possible in real applications (parameters are unobservable), accessing them causes the package to give a warning, reminding the user that he is carrying out an operation that requires omniscience 6 , an attribute not thought to be possessed by the typical user of calibrator.
In the first stage, ψ 1 is estimated by maximizing p(ψ 1 |y), given by Equation 4 of KOHa. This is carried out by stage1() of the package. In stage 2, the remaining hyperparameters ψ 2 are estimated by maximizing the posterior probability p(ρ, λ, ψ 2 |d, ψ 1 ) given by the (unnumbered) equation on page 4, evaluated by R function p.page4().
Taking the example given in the help page for stage1(), but with 74 code runs and 78 observation points, chosen as a compromise between informative dataset and reasonable execution time, yields comparing reasonably well with the real values of 2 and 3 for the scales, 0.2 for lambda, and 0.3 for σ 2 1 , hard-coded in to reality(). Again, the estimated values are close to the exact values but further improvements could be obtained by taking more observations or using more code runs; but too many observations can invite numerical problems as several large, almost singular, matrices have to be inverted.
Stage 3 finds a maximum likelihood estimate for the model parameters, by maximizing the apriori PDF for θ, given by p.eqn8.supp() in this implementation.  Table 1: Estimates for the parameter set using three different values for the hyperparameters φ comparing reasonably well with the real values of c(0.9, 0.1, 0.3) hard coded in to reality(). Note that the three stages above operate sequentially in the sense that each one estimates parameters, which are subsequently held fixed.

Effect of inaccurate hyperparameters
As shown above, even with a system specifically tailored for use with KOHb, the estimated values for the hyperparameters may differ from the true values by a substantial amount.
To gain some insight into the effect of hyperparameter misprediction, Table 1 presents some predictions conditional on the true hyperparameters, the estimated hyperparameters, and several incorrect values. Observe how the true value of the hyperparameters yields the most accurate values for θ, although in this case the difference is slight. It is important to realize that the best that can possibly be expected is for the predicted value being drawn from the same distribution as generated the observation. In this case, both observations and computer predictions are Gaussian processes.

Conclusions from toy problem analysis
The toy problem analyzed above shows that the implementation is satisfactory in the sense that the true values of the parameters and hyperparameters may be estimated with reasonable accuracy if their true value is known and the assumptions of KOHb are explicitly satisfied.
However, certain difficulties are apparent, even in this highly simple system, which was specifically created in order to show the methods of KOHb in a favourable light.
The primary difficulty is execution time for the optimization process to find the best set of hyperparameters. Stage 1 is reasonably fast, but the objective function minimized in stage 2 takes about 5 minutes to evaluate with the setup described above: execution time goes as the fourth power of number of observations. In the toy case considered here, any number of observations could be taken (the generating function is very cheap), but too many renders the methods unworkably slow; and too few makes the estimates unreliable.
Note that the situation is ameliorated somewhat by stage 2 requiring only four-dimensional optimization.

Bayesian calibration applied to climate science
The calibrator package is now applied to a problem of greater complexity, namely climate science output from the C-GOLDSTEIN computer model. The techniques presented here are complementary to the Kalman filtering techniques of Annan, Hargreaves, Edwards, and Marsh (2005).
The problem considered here is slightly different from that presented in part 1. Here, KOHa is used to calibrate temperature data generated by C-GOLDSTEIN using observational data taken from the SGEN dataset.
The procedure is as follows: 1. Pick a few representative points on the Earth's surface whose temperature is of interest. This set might include, for example, the North and South Poles, and a point in Western Europe. Seven points appears to be a good compromise between coverage and acceptable runtime.
2. Choose a reasonable prior distribution for the 16 adjustable C-GOLDSTEIN parameters, using expert judgement 3. Generate an ensemble of calibration points in parameter space by sampling from the prior PDF. For each member of this ensemble, run C-GOLDSTEIN at that point in parameter space and record the predicted temperature at each of the seven representative points on Earth's surface. For this application, a calibration ensemble of about 100 code runs appears to represent a reasonable compromise between coverage and acceptable runtime.
4. Determine the hyperparameters for the dataset exactly as for the toy problem above 5. From the prior PDF, sample an ensemble of say 10000 points and, using the hyperparameters estimated in step 4 above, use the emulator package to estimate the European temperature conditional on the field observations z and code runs y.

From Equation 8
, estimate the probability of each of the 10000 points in parameter space 7. Construct a weighted CDF for the temperature predictions.
Such a procedure will specify a CDF that incorporates observed temperature data from the NCEP dataset. In essence, parameter sets that give predicted temperature distributions closely matching observed data are assigned a high probability; if the match is poor, a low probability is assigned.
Although it might be possible to maximize the posterior PDF for parameter space (and thus determine a maximum likelihood estimate for the "true" parameters), such an approach would preclude techniques that require averaging over parameter space.

Results
The results section splits into two: first, numerical estimates of the scales and other hyperparameters, and second, results conditional on those hyperparameters.

Estimation of the hyperparameters
Optimization is always difficult in high dimensional spaces. Here, location of the true global minimum is an acknowledged difficulty; the emphasis in this paper is on the considerably easier problem of locating points in hyperparameter space that are significantly better than the starting point of the optimization routine. It is also the case that the true global minimum, even if it could be found, is a statistic of the observations in the sense that it is a partition of the sample space-and thus would be different from trial to trial (if repeated sampling were admitted).
Use of the simulated annealing process helps to reduce the undesirable possibility of being "trapped" in a local minimum, but does not eliminate it. For the present, it is suggested that the difficulties of multidimensional optimization are conceptually distinct from the main thrust of this paper (and are far better dealt with in the specialist optimization literature).
In any case, as shown in the toy example section above, using incorrect scales is unlikely to lead to serious mispredictions.
The values for the hyperparameters used in the remainder of this paper were the result of a weekend's optimization time on a dual 2GHz P5.

Modification of the prior
Results are now presented which show distributions for temperatures in Northern Europe, Note how the median temperature, for example, falls by about one degree centigrade when the observational dataset is included.

Conclusions from climate problem analysis
The primary conclusion from the above analysis is that it is possible to apply the methods of KOHa to a real problem in climate dynamics, in a computationally efficient manner.
This is, as far as I am aware, the first time that a posterior PDF has been rigorously derived for the parameter space of a climate model.
The fact that the estimated PDF changes significantly on incorporation of observational data suggests that the prior is uninformative. Such conclusions can be useful in climate science, by suggesting where new observational data can be most effective.

Conclusions
Viewing a deterministic computer code as a Gaussian process with unknown coefficients and applying Bayesian analysis to estimate them appears to be a useful and powerful technique. It furnishes a fast, statistically rigorous approximation to the output of any computer program.
This paper has presented emulator, an R package that interprets computer code output as a Gaussian process, and uses the ideas of Oakley and O'Hagan (2002) to construct an emulator: that is, a statistical approximation to the actual code.
The package is applied successfully to a toy dataset, and a problem taken from climate science in which globally averaged surface air temperature (SAT) is predicted as a function of 16 unknown input parameters.
Average SAT is predicted well by the emulator, as this output variable is well-described by the Gaussian process model. This paper has shown how Bayesian methods may be used interpret climate model output in a statistically rigorous way. The BACCO bundle of R packages, incorporating packages emulator and calibrator, implements the formulae of Oakley and O'Hagan (2002) and Kennedy and O'Hagan (2001a) respectively in a transparent and maintainable manner.
Both packages were demonstrated using the built in "toy" dataset, and a dataset taken from climate science.
The emulator package produced an approximation to C-GOLDSTEIN output that ran five orders of magnitude faster, and was accurate to within about 0.1 • C.
The calibrator package applied formal Bayesian methods to show that predictions for temperature over Northern Europe were about 1 • C cooler when observational data is taken into account.