Introducing multivator : A Multivariate Emulator

A multivariate generalization of the emulator technique described by Hankin (2005) is presented in which random multivariate functions may be assessed. In the standard univariate case (Oakley 1999), a Gaussian process, a ﬁnite number of observations is made; here, observations of diﬀerent types are considered. The technique has the property that marginal analysis (that is, considering only a single observation type) reduces exactly to the univariate theory. The associated software is used to analyze datasets from the ﬁeld of climate change. This vignette is based on Hankin (2012).


Introduction
Many scientific disciplines require the use of complex computer models. Such models, also known as "simulators", are valid objects of inference and are often assumed to be random functions and assessed using the Bayesian statistical paradigm (Currin, Mitchell, Morris, and Ylvisaker 1991); in particular, computer models are often assumed to be Gaussian Processes (Oakley and O'Hagan 2002). Although deterministic-in the sense that running the simulator twice with identical inputs gives identical outputs-the Bayesian paradigm is to treat the code output as a random variable because, before the computational task is finished, one has subjective uncertainty about the outcome; de Finetti (1974) discusses the philosophy of this approach. Hankin (2005) discusses this issue from a computational perspective.
Given that the simulator is a random function, uncertainty about its behaviour is reducible to an arbitrarily low level by running the simulator sufficiently many times. However, because many modern simulators require large amounts of computer time to run, this is not possible; in practice one is typically presented with a fixed number of simulator runs as data.
One tool used to make inferences about simulators under these circumstances is the emulator (Oakley 1999), and the BACCO suite of R packages (Hankin 2005). The emulator is an established technique that has been used in many fields including Earth systems science (Mc-Neall 2008), oceanography (Challenor, Hankin, and Marsh 2006), and climate science (Warren et al. 2008). However, BACCO is limited to univariate random functions. In this paper, I present a generalization of the Gaussian Process which allows the technique to be used for multivariate simulator output.
The current work is a generalization of that of Conti and O'Hagan (2010), who presented a separable covariance structure. Here, I present a generalization of that work in which the roughness lengths of the components of the multivariate process are allowed to differ.

Review of theory for the univariate emulator
This section presents a very brief review of the univariate emulator. Much of the material is taken directly from Oakley (1999) with slight changes of notation.
For any random univariate function η: R d → R and set of points {x 1 , . . . , x n } with x i ∈ R d , the random vector y = (η(x 1 ), . . . , η(x n )) ⊤ is assumed to be multivariate normal: where H = (h(x 1 ), . . . , h(x n )) ⊤ is the matrix of (known) regressor functions h: R d → R q so the regressor matrix H is n by q, denoted H [n×q] ; it is sometimes convenient to write H = H (X) where X [n×d] is the design matrix. Equation 1 is conditional on the (unknown) vector of coefficients β [q] and the variance matrix Σ [n×n] . A common choice for h(·) is h(x) = (1, x 1 , . . . , x d ) ⊤ [thus q = d + 1], but one is in principle free to choose any function of x.
The variance matrix is, explicitly: (2) (Oakley writes σ 2 A for Σ, where A [n×n] is a matrix of correlations and σ 2 is an overall variance). It can be shown that where If an improper flat prior for β is used, its posterior conditional distribution can be shown to be is the posterior mean. It is possible to integrate out β to obtain where The two superscript stars mean that the results have been integrated with respect to the posterior distribution of β. What these equations mean is that Or, in words, that m * * (x) is a quick approximation for the η(x) in the sense that its posterior distribution is Gaussian with mean and variance given by the right hand side of Equation 9 and 10 respectively. It is usual to refer to Equation 11 as the emulator; observe that the entire posterior distribution is specified.

Positive definiteness
The covariance matrix, Equation 2, is required to be positive definite for any choice of design matrix. This can be guaranteed by appropriate choice of covariance function.
, then Bochner's theorem (Feller 1966) shows that Σ is positive definite for any x 1 , . . . , x n if and only if c(t) is the characteristic function of a symmetric probability Borel measure: One standard choice (Hankin 2005) is a standard multivariate Gaussian distribution 1 with mean zero and variance S [d×d] . This gives In practice one writes B = S −1 /2 and absorbs the normalization constant into a σ 2 term leaving: giving Then Σ in equation 2 is guaranteed to be positive-definite.
1 A number of different choices for f (·) have been used in the literature. Stein (1999), for example, advocates a Student t-distribution, but the corresponding generalization of Equation 13 is the subject of "controversy and difficulties" (Dreier and Kotz 2002), possessing no closed form solution (Sutradhar 1986), and further work would be needed to implement it in the context of BACCO.

Earlier multivariate work
A natural generalization of Equation 3 is to consider η: R d → R p with a separable covariance function. Conti and O'Hagan (2010), for example, generalized equation 8 to a matrix Gaussian with a column covariance matrix given by Equation 15, and a row covariance matrix Λ [p×p] which they treated as an additional hyperparameter. Rougier (2008), considering the common problem of multidimensional model output that is indexed by a Cartesian grid, presented a computationally advantageous method; and Higdon (2008) considered the principal components of multivariate experimental results.
However, all these approaches suffer from the disadvantage that the separability of the covariance matrix implies that the roughness lengths of each of the components are identical. This assumption is often not justified: For example, in climatology, although rainfall and temperature are correlated, orographic effects mean that spatial correlation lengths are smaller for rainfall than temperature. The simple example given in Section 5.1 uses terminology inspired by this motivating example.

Non-separable covariance structures
To accommodate differing roughness lengths, it is necessary to use non-separable covariance structures. Examples include that of Majumdar and Gelfand (2007), who observed that the convolution of two positive-definite covariance functions is again positive definite. However, Majumdar and Gelfand noted that in practice the convolution will have no closed form, a drawback not affecting the present work.
Recent unpublished work by Fricker, Oakley, and Urban (2010) also uses convolution techniques and presents a nonseparable covariance structure of which the present work is shown to be a generalization.
Related work might also include Kennedy and O'Hagan (2000) 2 , who presented methods to analyze a hierarchy of levels of a model. The present work, however, does not make the Markov assumption (their Equation 1), and does not have the nested design restriction (D t ⊆ D t−1 in their notation).

Dimension reduction and Bayesian estimation
Highly multivariate output (such as a temperature field over a 3D Cartesian lattice) is difficult to deal with and many workers have sought methods to reduce such output to a more manageable format. The techniques discussed above are a special case of dimension reduction but other techniques have been presented in the context of Bayesian inference.
Principal Component Analysis is one frequently used tool. Higdon (2008), for example, considers high dimensional data from a series of experiments involving high explosive and applies the methods of Kennedy and O'Hagan (2001), although the principal components are assumed to be independent, an assumption not necessary in the present approach.
Other techniques for dimension reduction exist. Bayarri, Berger, Cafeo, Garcia-Donato, Liu, Palomo, Parthasarathy, Paulo, Sacks, and Walsh (2007), for example, use wavelet decomposition and use a thresholding procedure to produce a manageable number of coefficients. The techniques outlined in the present paper are applicable in principle to wavelet decompositions, but further work would be needed.

The multivariate case
In this section, I outline a scheme by which the emulator of Section 1.1 may be generalized to the multivariate case in a computationally tractable manner, with exact expressions for the (conditional) covariance matrix 2. The presentation uses a generalization of Bochner's theorem in such a way as to precisely delineate the space of admissible covariance functions.
In the multivariate case, there are p different types of observation, say η r (·) for r = 1, . . . , p. Each type of observation is a Gaussian process (hence susceptible to analysis by the emulator package), but here we admit covariances between the observation types, so that COV(η r (x), η s (x ′ )) = 0 for r = s. Here η r (x) is the value of an observation of type r at point x.
We suppose that observations of type r are made at points X (r) = X (r) 1 , . . . , X (r) nr ⊤ for 1 r p. Thus observations of type r are made at points on a design matrix X (r) [nr×d] . It is straightforward to specify the expectation. This is just where h r (·) are the basis functions for the observation types r with 1 r p; thus is a generalized regressor matrix. See how the overall coefficient vector β = β 1 , . . . , β p ⊤ may be partitioned into its several components. It is not necessary for all the β r to be of the same length.
The overall variance matrix will be where Σ (rs) refers to the covariance between observations of type r and s, specifically Σ (rr) are the restricted univariate variance matrices for observation type r = 1, . . . , p and the off-diagonal entries represent covariances.
Generalization to the multivariate case is subtle. We seek a method of determining Σ of Equation 17 in such as way that Σ (rr) may be specified using standard techniques (typically from a univariate analysis; the Σ (rr) being determined on the basis of different B r in 15 in general), and the Σ (rs) , r = s represent covariances between observations of type r and s in a reasonable way. It is necessary to guarantee that Σ in Equation 17 is positive definite.
Formally, we seek functions C rs (·, ·) with C rs (x, x ′ ) = COV (η r (x) , η s (x ′ )). In the notation of Equation 17, we would have Σ (rs) = C rs X (r) , X (s) as the matrix of covariances between observations of type r at X (r) and observations of type s at X (s) , that is, between η r X (r) and η s X (s) . These functions are required to be positive-definite in the sense that Σ of Equation 17 must be positive definite for any set of points X (1) , . . . , X (p) .
A matrix generalization of 12 was presented by Cramer (1940) which will be used here: C rs (t) are positive-definite if and only if they are of the form for some positive definite F ij (ω). If attention is restricted to absolutely integrable functions (a condition which will be dropped subsequently), this becomes If we write ||f (ω)|| for the matrix with (r, s) entry f rs (ω), then we require ||f (ω)|| [p×p] to be positive definite for all ω.
Considering functions of the form discussed in Equation 13, one approach would be to specify the off-diagonal elements to be zero. Here p = 3 is used for illustration; the general case follows directly: where the S i are positive-definite matrices corresponding to the (marginal) univariate covariance matrices of Equation 15. This matrix is positive definite for all ω. This approach would only be appropriate if the covariances between observation types were zero.
One way to account for nonzero covariance between observation types is suggested by the fact that, given positive numbers is positive semidefinite for all ω provided only that the S i are positive definite; observe that the diagonal elements of Equations 20 and 21 match. Observe that, with fixed diagonal entries, offdiagonal elements cannot exceed those given in Equation 21 while retaining positive definiteness. The matrix thus corresponds to "maximal correlation" in this sense and the general terms are then: where we follow standard convention (Oakley 1999) and write B r = S −1 r /2. Similar expressions occur in the study of nonstationary covariance functions (Paciorek and Schervish 2006;Higdon 2002); a special case (diagonal matrices) is given by Fricker et al. (2010). These authors construct the covariance matrix using process convolutions, observing that the convolution theorem for Fourier transforms ensures positive definiteness (Higdon 2002(Higdon , 2008.

Equation 22
gives a positive-semidefinite variance matrix for any design matrix. Noting that the Schur (elementwise) product of a positive-semidefinite matrix and a positive definite matrix is positive definite, the relation is a positive definite function. Here M [p×p] is a positive-definite matrix that accounts for covariance between observation types.

Other forms for the covariance matrix
It is possible to use covariance functions other than the Gaussian form used in Equation 21.
The probability measures are required to be symmetric, and the geometric mean of two measures is required to have a characteristic function in closed form.
Measures that are proportional to an indicator function, that is where C is the normalization constant and A ⊂ | d is a support set, are a natural choice.
In this case element (i, j) would be I A i ∩A j (x); one could consider support sets that are hyperspheres or, more interestingly, orthotopes.
One other natural choice would be the multivariate t-distribution, but further work would be necessary to assess its suitability in this context.

Summary
The univariate emulator is generalized to the p-variate case. Univariate expectation Hβ is generalized to the multivariate form given in Equation 16, and the univariate variance matrix of Equation 2 is generalized to the multivariate form 17 with arising from the positive definite function C ′ (·, ·) of Equation 23. The matrices Σ (rr) correspond to univariate variance matrices and each is obtained from a matrix B i of roughnesses in the same way as in the univariate case. The univariate variance σ 2 generalizes to a variance matrix M whose diagonal elements correspond to the univariate variances σ 2 i , 1 i p.

Discussion
The above analysis suggests a method by which a covariance matrix may be determined for multivariate observations. Here I discuss some implications of Equation 24.
Consider the case where B i are known. Then consider the correlation c(·, ·) between the types of observations at the same point, ie t = 0 in Equation 22. This is just |c(r, s)| M rr M ss . The second inequality follows from the concavity of log |B| (Cover and Thomas 1988) and is thus sharp if and only if B r = B s . In the case of 1 × 1 matrices (ie scalars), the matrices commute and the maximum correlation Figure 1 shows an example of two maximally correlated Gaussian processes with different roughness lengths).
Thus B r = B s imposes an active upper bound on c(r, s): two Gaussian processes with different roughness coefficients cannot be perfectly correlated.
It is also evident that for any matrices M, B r , B s .

Estimation of hyperparameters
The multivator package requires a generalized set of hyperparameters compared with the emulator package. The emulator package needs a single positive-definite matrix B that expressed the roughness length of the response function; multivator requires matrices B 1 , . . . , B p : One matrix per type of observation. Each matrix represents the marginal roughness characteristics of each observation type. Oakley (1999), and many subsequent authors, assumed that the overall variance matrix Σ was given by Σ = σ 2 A, where σ 2 is a scalar and A a matrix of correlations. Oakley (1999) proceeded to integrate out σ 2 (using a weak prior distribution) to obtain an expression for the posterior distribution of the process in terms ofσ 2 , the estimated value for σ 2 .
The approach advocated here, by contrast, generalizes the scalar variance σ 2 to M , a p × p positive-definite matrix which expresses the overall variances and covariances of the p different types of observation; subsequent analysis is conditional on the values of the B i and M .
The procedure used in the package is a three step process: 1. Estimate the roughness parameters for each observation type separately, using techniques of the emulator package, 2. Calculate the marginal variance terms, using an analytical expression for the posterior mode, following Oakley (1999); these are the diagonal elements of M , 3. Estimate the off-diagonal elements of M by numerical determination of the posterior mode. To ensure positive-definiteness, an improper flat prior with nonzero support extending over the positive-definite matrices may be used.
This multi-stage procedure is reminiscent of the two-stage process outlined in Kennedy and O'Hagan (2001). It seems to work reasonably well in practice. The process is not perfect: One might wish to calculate the joint likelihood of M and the B i simultaneously; the relevant likelihood is given by Oakley's Equation 2.36, which in our notation is and optimize that, but such an approach seems impractical, even for the toy example considered here.

The package in use
The multivator package of R (R Development Core Team 2011) routines is now demonstrated using three examples: A toy dataset in which the underlying assumptions are known to be true; evaluates of a simple function, following Oakley (1999); and a larger dataset drawn from the discipline of physical oceanography. A brief discussion of the package as applied to modular systems such as CIAS (Warren et al. 2008) is also given.

Toy example
Although the toy dataset and associated R objects are simple, they represent the most general form of the package's functionality and furnish a comprehensive suite of tests of the package functionality.
Toy dataset toy_mm is a simple design matrix on three levels: temp, rain, and humidity. Thus toy_point corresponds to measuring all three levels (temp, rain, humidity) at a single point in parameter space. It is straightforward to use the package to provide an estimate for the process at this point, using multem(): R> (e <-multem(toy_point, toy_expt, toy_mhp, toy_LoF, give = TRUE)) [Object toy_expt is an S4 object with slots for the design matrix and observations, produced by experiment()]. The return value of multem() is a two-element list with the first being a vector whose elements are the posterior mean for each row of the multivariate design matrix toy_mm, and the second is the conditional variance matrix. Thus we see that, at this point in parameter space, temperature and rainfall are negatively correlated. The diagonal of the matrix gives the (conditional) marginal variances for the three levels (temp, rain, humidity).

Estimation of the hyperparameters in the package
In this section, the hyperparameters for the synthetic dataset considered above are estimated using the package, following the scheme suggested above.
In common with the emulator and calibrator packages, the multivator package includes functionality to create datasets with values drawn from the appropriate distribution.
R> mm <-toy_mm_maker(81,82,83) R> d <-obs_maker(mm, toy_mhp, toy_LoF, toy_beta) R> jj_expt <-experiment(mm,d) Here mm is a multivariate design matrix, created using a latin hypercube; the three arguments specify the number of points in parameter space at which each observation type is made. Function obs_maker() creates observations drawn from the appropriate distribution.
Here, toy_mhp is a hyperparameter object (a matrix M [3×3] of covariances, and three B [4×4] roughness matrices, one per observation type); toy_LoF is a list of regressor functions, and toy_beta is a vector of regression coefficients. The function optimal_scales() first estimates the B i matrices and then, conditional on this, estimates the overall covariance matrix M , conditional on the B i , using Equation 27: R> mhp_opt <-optimal_params(jj_expt, toy_LoF, option="b") Specifying option="b" restricts the B i to diagonal matrices. The optimized value for M , the matrix of covariances is then given by Compare the true value:

Validation
It is possible to validate the above approach by the technique of using half the dataset for fitting the emulator (as above), then the remaining half for validation. The appropriate R expression would be R> est2 toy_expt,toy_mhp,toy_LoF) where toy_mm and toy_mm2 are components of the same multivariate observation taken from the distribution specified in Equation 1. Figure 2 shows such an exercise, exhibiting reasonable agreement between observed and predicated values.

Simple functional analysis
In this section, a simple function f : R 2 → R 2 is considered, and univariate inference is compared with the multivariate techniques introduced above.
From a computational perspective, an analysis using the multivator package is presented "from scratch"; standard R objects are coerced to the appropriate S4 objects.
The functions considered are f a (x, y) = sin(5 · (x + y)) and f b (x, y) = 7 sin(5 · (x + y)) + sin(20 · (x − y)). These functions correspond to observations of type 'a' and 'b' respectively, and are chosen so that they are correlated, but f a might be expected to have a smoother response than f b . An experimental design is then needed for each function, which in this case is a simple latin hypercube: Thus xa and xb are standard R matrices. It is now possible to evaluate f a and f b over their experimental designs: R> a_obs <-apply(xa,1,fa) R> b_obs <-apply(xb,1,fb) Thus there are two design matrices xa and xb, and two corresponding sets of observations, here a_obs and b_obs, all in the form of standard R objects (matrices and vectors respectively).
It is now straightforward to apply the multivator package methods.

Data analysis using multivator
The package is now used to analyze climate change data obtained from the Genie-Goldstein model, a computationally efficient Earth-Systems model designed to assess climate change from an oceanographical perspective on a timescale of centuries to millennia (Edwards and Marsh 2005).
McNeall (2008) considered Genie-Goldstein output and used Principal Component Analysis as an analytic technique. Here, I consider the first four principal components using the multivator package. Although principal components are mutually orthogonal, they are not necessarily independent with respect to any given regressors. I now show how data provided by McNeall may be analyzed using the multivator package.
R> data("mcneall") R> dim(mcneall_temps) [1] 2048 92 The mcneall_temps matrix has 92 columns, one for each of 92 runs of Genie-Goldstein. Each column, of 2048 numbers, corresponds to a map of global temperature; an example is given in Figure 4 in which the showmap() function is used to reshape the vector to a form suitable for display.
Dataset mcneall_pc has 92 rows, one per run, and 20 columns. The first 16 columns show the design matrix of independent variables 3 . The last four columns are the first four principal components of the output; an interpretation is given in Figure 5.
Note that these correlations are conditional upon the form of the regressor functions (here the default set default_LoF).

Modular systems
Multivariate emulation appears to be a useful technique in the context of modular systems such as CIAS (Warren et al. 2008) in which a model comprises various component "modules".
In the case of CIAS, the modules address various aspects of the global climate system and examples include E3MG which models the global economy, MAGICC which models the physical global climate, and ICLIPS which models the impacts of climate change. The modules exchange information at runtime using the BFG protocol.
One feature of CIAS is that it is possible to replace any module with another functionally equivalent one. Multivariate emulation is useful when considering the behaviour of CIAS used in this way. If one has p different interchangeable modules, then the output of CIAS is a p-variate random variable that may be analyzed using the multivator package.
In an associated vignette, visible from within an R session by typing vignette("cias"), a short analysis of a synthetic dataset is presented.

Discussion
A generalization of the emulator to multivariate datasets is proposed and the multivator package has been introduced. The package is used to analyze datasets drawn from the fields of oceanography and climate change. The variance structure proposed appears to have pleasing and useful properties. Further work might include extension of the ideas presented here to complex functions.