Multivariate emulation of computer simulators: model selection and diagnostics with application to a humanitarian relief model

Summary We present a common framework for Bayesian emulation methodologies for multivariate output simulators, or computer models, that employ either parametric linear models or non‐parametric Gaussian processes. Novel diagnostics suitable for multivariate covariance separable emulators are developed and techniques to improve the adequacy of an emulator are discussed and implemented. A variety of emulators are compared for a humanitarian relief simulator, modelling aid missions to Sicily after a volcanic eruption and earthquake, and a sensitivity analysis is conducted to determine the sensitivity of the simulator output to changes in the input variables. The results from parametric and non‐parametric emulators are compared in terms of prediction accuracy, uncertainty quantification and scientific interpretability.


Introduction
There are many systems in the physical, social and engineering sciences for which physical experimentation is infeasible or unaffordable.Some examples include investigations on ecosystems, infectious diseases, climate change, and galaxy formation (see Kennedy et al., 2006, for a number of case studies).In such situations, it is now common for the scientist or engineer to develop a simulator, or computer model, that provides an approximation of the observed response from the physical system.In essence, the simulator is a deterministic or stochastic mathematical function that maps the inputs of a system to a prediction of its outputs.
A simulator that has been successfully calibrated and validated, perhaps using physical data, can be employed for a number of tasks including prediction, optimisation, and sensitivity and uncertainty analyses (Kennedy and O'Hagan, 2001).However, both calibrating and exploiting the simulator typically requires very many simulator evaluations.For complex problems, the computational expense of the simulator means brute-force approaches to these problems are infeasible, taking many hours, days or even weeks.Therefore, a fundamental step in understanding and using simulators is often the construction of a statistical emulator, or meta-model, through a computer experiment (Sacks et al., 1989).Here, the simulator is run at a carefully selected collection of combinations of the input variables and the resulting evaluations are treated as data to which a statistical model, the emulator, is fitted.The emulator can then be used to produce fast predictions of the output of the simulator for any values of the input variables, along with an associated measure of the prediction uncertainty.The emulator can then replace and supplement the simulator in both statistical calibration and scientific investigation.For more on computer experiments, see Santner et al. (2003), Fang et al. (2006), and Levy and Steinberg (2010).
A Bayesian approach is very natural when constructing statistical emulators (O' Hagan, 2006) with the chosen statistical model treated as a prior distribution on the simulator outputs and prediction, with associated uncertainty quantification, via the posterior predictive distribution (see Section 1.2).Typically, a nonparametric Gaussian process (GP) regression model (Rasmussen and Williams, 2006) is employed; its advantages include flexibly adapting to the simulator evaluations and, for deterministic simulators, interpolating between data points.However, for some simulators, these advantages may be more than offset by the computational expense of estimating the GP model, and simpler and more computationally efficient models, such as multivariate linear regression, may be effective and more interpretable.Whatever statistical approach is taken to constructing the emulator, an important step is assessing its adequacy through formal statistical diagnostics (Bastos and O'Hagan, 2009).
Frequently, each run of a simulator outputs a multivariate response, perhaps as a result of a time series or other dynamic process.The purpose of this paper is to present a Bayesian framework for covariance-separable emulation of multivariate simulators using parametric and nonparametric models and to develop novel model diagnostic procedures appropriate for such emulators.As part of our presentation, we unify the multivariate Gaussian process emulator of Conti and O'Hagan (2010) and the lightweight emulator of Rougier (2007).Through an application to a simulator of a humanitarian relief mission, we demonstrate effective emulation, model selection and model checking for multivariate problems with a mixture of continuous and categorical input variables.

A humanitarian relief simulator with multivariate dynamic output
Simulators have a long history of use in military and civilian emergency planning (see, for example, Ingber et al., 1991).DIAMOND (DIplomatic And Military Operations in a Non-warfighting Domain; Taylor and Lane, 2004) is an emergency planning simulator for modelling peace support operations such as humanitarian relief and peace keeping.DIA-MOND is mission-based, with high-level operational plans deconstructed into missions for individual units.It is able to model the actions and interactions between a wide range of agents, including military forces in non-warfighting roles, non-governmental organisations (NGOs), indigenous forces and civilians.A range of environmental and infrastructure features can also be varied.
Our application of DIAMOND provides a deterministic model of a humanitarian relief mission to Sicily after an earthquake and subsequent eruption of Mount Etna.Etna is an active stratovolcano on the east coast of Sicily near the cities of Catania and Giarre (see  Figure 1).It has been designated a "Decade Volcano" by the International Association of Volcanology and Chemistry of the Earth's Interior and the United Nations due to its history of large eruptions and proximity to populated areas.Historically, more fatalities have been caused by earthquakes in the region, such as in 1693 when an earthquake of estimated magnitude 7.4 on the moment magnitude scale devastated the area and caused about 12,000 deaths in Catania (∼ 63% of the population at the time; Guidoboni et al., 2007).
The simulator models damage to the food supply, hospitals and housing (shelter) in Giarre and Catania resulting from the earthquake and eruption.An NGO launches a humanitarian relief operation which has two missions:

Food Aid Mission
To supply food to Catania and Giarre by helicopter from the NGO base.

Repair Mission
To transport engineers from the NGO base to Giarre and Catania, where they repair the food supply infrastructure and/or the shelter.
We consider a scenario designed by the UK Defence Science and Technology Laboratory for the explicit and sole aim of model-testing; the scenario is not intended to support any real world decisions.Here, the NGO has four helicopter teams, two engineering teams and a single food depot.Two helicopter teams are assigned to the food aid mission and the others to transporting the engineers for the repair mission.
The simulator has p = 13 input variables, which represent the scale of the disaster and features of the humanitarian relief operation (see Table 1).Eleven of these variables are continuous, with the other two being categorical with each having two levels.Input variables x 1 -x 6 determine the impact of the earthquake and eruption on the population of Giarre and Catania by specifying the capacity of hospitals, shelter and food supply immediately following the disaster.The specification of these input variables creates a shortfall between population and shelter and/or food supply, leading to casualties.
The remaining input variables (five continuous, two categorical) control certain features of the humanitarian relief mission.The continuous input variables are self-explanatory with the exception of x 7 : the weighting of the engineer toolbox.This variable controls the relative importance given to repairing shelter and the food supply by the two engineering teams; x 7 = 0 (x 7 = 1) corresponds to engineers only repairing the shelter (food supply).
The two levels for categorical variable x 12 correspond to, respectively, supplying food aid to both Giarre and Catania or to Catania alone.Although the second option is perhaps morally and politically unappealing, it may be practically relevant as there can be a much greater shortfall between the available and required food in Catania.Simulation modelling allows investigation of the impact of potentially unattractive options.For x 13 , the two levels correspond to the NGO base being (i) in continental Europe or (ii) part of a military task force located on a fleet of ships in in the Strait of Messina between Italy and Sicily (see Figure 1).
Each run of the simulator is defined by a setting for x 1 -x 13 .The output from each simulator run is the number of civilian casualties that have occurred on each of days 2,3,4,5 and 6 following the disaster.Therefore, the output for each run is a five dimensional vector.

Bayesian emulators
A Bayesian approach will be taken to constructing an emulator for the DIAMOND simulator.Let x = (x 1 , ..., x p ) T ∈ X ⊂ R p denote the vector of p input variables, with X the p-dimensional input space.The simulator is assumed to be a black-box function, is the k × 1 output vector from the simulator at input combination x.An emulator for f (•) is a prediction equation that provides a surrogate for f (x 0 ), where x 0 is an input combination at which the simulator has not previously been evaluated.
For a collection of input combinations ζ = {x 1 , ..., x n }, with x i = (x i1 , ..., x ip ) T , the simulator outputs are collated into an n × k output matrix A priori, we assume that Y is a realisation from a probability distribution, specified up to a d × 1 vector of unknown parameters θ ∈ Θ, with Θ ⊂ R d the parameter space.After running the simulator for the input combinations in ζ, the emulator is constructed as the posterior predictive distribution (see, for example, O'Hagan and Forster 2004, p. 89) of y 0 = f (x 0 ), given by Here, π(θ|Y ) is the posterior density function for θ, found using Bayes theorem, and π(y 0 |θ, Y ) is the conditional posterior predictive density for y 0 .
In the remainder of this article, methodology for multivariate Bayesian emulation is developed and applied.In Section 2, the detailed methodology used to obtain the posterior predictive distribution is described for both multivariate Gaussian processes and linear models.In Section 3, model selection and diagnostics for multivariate emulators are developed and discussed.In Section 4, results are presented from applying the methodology to emulating the DIAMOND simulator.Section 5 gives a brief discussion.
Code to fit the emulators described in this paper and the training and test datasets are provided as supplementary material.
2 Multivariate emulation via the posterior predictive distribution In this section, the posterior predictive distribution is derived for a general class of multivariate linear models that includes Gaussian process (GP) models and linear regression models.As such, the multivariate GP emulator of Conti and O'Hagan (2010) and lightweight emulator of Rougier (2007) are special cases.We also demonstrate how the multivariate GP emulator can include categorical input variables using the distance metrics of Qian et al. (2008).
Our basic modelling assumption is that any finite set of multivariate responses has a joint matrix normal distribution (Dawid, 1981) with mean function a linear combination of unknown model parameters and a separable covariance structure with, potentially, correlations between outputs from the same run and also between different runs of the simulator.That is, for where HB is the n × k mean matrix and Σ and A are, respectively, k × k and n × n positive definite column and row scale matrices.Note that is a multivariate normal distribution, where vec(•) denotes the vectorisation function that stacks columns of a matrix and ⊗ denotes the Kronecker product.
In (2), the matrix H is the n × m model matrix with ith row given by h(x i ) T , where h : X → H ⊂ R m is a known function of the simulator inputs (i = 1, . . ., n).For example, if h(x) = (1, x 1 ), then the model contains an intercept and a linear term in x 1 .If some input variables are categorical, then we define the appropriate elements of h(x i ) through the usual constraints, for example corner-point or sum-to-zero.The matrix B is an m × k matrix of unknown regression parameters.
The separability of the covariance structure implied by this matrix normal distribution results in a common scale matrix Σ for the k multivariate responses at each of the n simulator runs.An emulator with a separable covariance structure is both easier to implement and interpret.If diagnostic measures (see Section 3.1) suggest inadequacy of the separable emulator, alternative methodologies could be employed (see, for example, Fricker et al., 2013, and references therein).
If homogeneity of variance across the simulator runs is assumed, that is Var {f (x i )} = Σ for all i = 1, . . ., n, then A can be specified as a correlation matrix.For the multivariate GP emulator, we define A through a stationary correlation function, and set ijth entry equal to a ij = c(|x i − x j |; r), i.e. the correlation between any two rows of Y depends only on the distance between x i and x j (i, j = 1, . . ., n) and a vector of unknown correlation parameters r.The lightweight emulator is defined as a special case with Thus we can replace conditioning on A in (2) by conditioning on r.
We use the conditionally conjugate (given r) matrix-normal-inverse-Wishart (MNIW) prior distribution for B and Σ, denoted MNIW m,k (M, Ω, S, δ), where Here, IW k denotes the inverse-Wishart distribution for k × k positive-definite matrices, M , Ω and S are the m × k, m × m and k × k matrices of hyperparameters, respectively, and δ > 0 is the prior degrees of freedom.The corresponding probability density function is given in Section 1 of the Supplementary Material, up to a normalising constant; see also Rougier (2007).
Using this prior distribution the conditional posterior distribution, given r, is see Section 2 of the Supplementary Material, where To predict the simulator output where H 0 is the n 0 × m matrix with uth row h(x 0u ) T , A 0 is the n 0 × n 0 matrix with uvth element given by c(x 0u , x 0v ; r), and T is the n × n 0 matrix with iuth element given by c(x i , It can be shown (see Section 3 of the Supplementary Material) that the conditional distribution of Y 0 is From ( 5) and ( 6), we can see the fundamental difference between the GP and lightweight emulators; for the lightweight emulator, the output from different simulator runs is assumed independent given {B, Σ} and hence the matrix, T , of correlations between the observed and unobserved simulator runs will be a zero matrix.Hence, conditional on B and Σ, the distribution of Y 0 does not depend on Y .For the GP emulator, with non-zero correlations between simulator runs, the dependence between Y 0 and Y remains even after conditioning on B and Σ.
To obtain the posterior predictive distribution of Y 0 , given r, we integrate (6) with respect to the posterior distribution of B and Σ (see Section 4 of the Supplementary Material): where and MT n 0 ,k (Q, Ŝ, R, δ) denotes the matrix t-distribution (Javier and Gupta, 1985) with location matrix Q, column scale matrix Ŝ, row scale matrix R and degrees of freedom δ.Marginal posterior predictive distributions for the uth simulator run, y 0u = f (x 0u ), and the sth output, y 0,us = f s (x 0u ) are multivariate and univariate t distributions, respectively: Here, q u is the uth row of Q and q us is the usth element of Q, R uu is the uth diagonal element of R and Ŝss is the sth diagonal element of Ŝ.
For the lightweight emulator, where A = I n , an n × n identity matrix, (7) provides closedform posterior predictive distributions.For the multivariate GP emulator, and the most commonly used correlation functions c(•, •; r), there does not exist a prior distribution for r such that a closed-form expression can be obtained for the marginal posterior predictive distribution of Y 0 .Typically, one of two approaches is taken: (i) r is replaced by a "plugin" estimate r, a representative value with respect to the marginal posterior distribution of r; or (ii) Markov Chain Monte Carlo (MCMC) methods are used to sample from the marginal posterior distribution of r and then for each sampled value of r, a value is drawn from the conditional posterior predictive distribution (7).
The plug-in approach is less computationally expensive than the fully Bayesian approach and provides a closed-form emulator.We adopt the plug-in approach for prediction using the marginal posterior mode of r, obtained by maximising the unnormalised marginal posterior density where π r (r) is the prior probability density function for r.
The final step in building the multivariate GP emulator is choice of the correlation function c(•, •; r).The most commonly used function is the power exponential function, which was extended by Qian et al. (2008) to incorporate both quantitative and qualitative variables.
Assuming without loss of generality that the variables are ordered, so that the first p 1 variables in x are quantitative and the next p − p 1 are qualitative variables, a correlation function that is exchangeable in the levels of the qualitative variables has the form Qian et al. ( 2008) suggested a number of correlation functions for qualitative variables, each reducing to the common form ( 9) for two-level qualitative variables.Throughout this paper, we fix g l = 2 for all l.

Emulator diagnostics and improvement
In this section, we address diagnostics for emulator adequacy and methods for improving emulator performance, including variable selection and the addition of a nugget term for the multivariate Gaussian process.

Emulator diagnostics
We start by developing generalisations to multivariate emulators of the diagnostics provided by Bastos and O'Hagan (2009) for univariate Gaussian process emulators.These diagnostics assess the assumption underlying (2), that the responses conditionally follow a matrix normal distribution with specified mean and correlation functions.Their evaluation requires an additional validation set of simulator runs, ζ 0 and Y 0 , to be available.

Individual prediction errors
As suggested by Bastos and O'Hagan (2009), standardised prediction errors can be explored graphically or used to construct nominal-level predictive probability intervals.If the emulator is an adequate model of the simulator, from (8), the standardised individual prediction error has a standard t-distribution, conditional on Y with δ degrees of freedom (u = 1, . . ., n 0 ; s = 1, . . ., k).A large number of outlying standardised prediction errors, with respect to the reference distribution, indicates serious inadequacy of the emulator.Bastos and O'Hagan (2009) suggested various graphical methods for identifying patterns in outliers and, subsequently, causes for emulator inadequacy; for example, plots of the individual prediction errors against each input variable or the predictive mean.
Individual (1 − α) × 100% predictive probability intervals for each element of Y 0 can be constructed as where c α is the (1 − α/2)th quantile of the standard t-distribution with δ degrees of freedom.The obtained coverage of these intervals can be compared against 1 − α, with low coverage suggesting the emulator is underestimating the prediction uncertainty.

Omnibus diagnostic
We now develop a summary statistic for overall emulator adequacy, analogous to the Mahalanobis distance diagnostic of Bastos and O'Hagan (2009).Define E as the n 0 × k matrix of standardised predictions Javier and Gupta (1985), for an adequate emulator, the conditional posterior distribution of E is We now define the diagnostic with extreme (large or small) values of U , relative to the reference distribution, indicating emulator inadequacy.Following Dickey (1967), the reference distribution for U is a U k,n 0 ,k+ δ−1 distribution (conditional on Y and r).Anderson (2003, p. 307) showed that the U k,n 0 ,k+ δ−1 distribution has the same distribution as a product of k independent Beta random variables, i.e.
where X s ∼ Beta (k + δ − s)/2, n 0 /2 .Summaries of this distribution can be calculated by simulation.
The matrices G R and G S are not unique and depend on the chosen decomposition of R and Ŝ, respectively; for example, the eigen or Cholesky decomposition.However, and therefore the value of the diagnostic U is invariant to the choice of decomposition.
Assuming (2), also note that and hence the elements of δ 1 2 E form an uncorrelated sample from the t-distribution with δ degrees of freedom.Quantile-quantile (QQ) plots of these elements can be used as an additional check on emulator adequacy.The elements of E are dependent on the decomposition used to obtain G R and G S .However as noted by Bastos and O'Hagan (2009), any choice of decomposition method is appropriate for use in a QQ-plot, and we use the Cholesky decomposition.
For univariate simulator output (k = 1), the omnibus statistic reverts to the Mahalanobis distance suggested by Bastos and O'Hagan (2009).Now, E is an n 0 × 1 vector following a t n 0 (0, (1/ δ)I n 0 , δ) distribution, E T E is scalar and 1 The quantity E T E/( δ − 2) is the Mahalanobis distance and F(a, b) denotes an F distribution with a and b degrees of freedom.

Emulator improvement
The diagnostics in Section 3.1 can be used to suggest improvements to a multivariate emulator.For example, graphical assessment of standardised errors may suggest different mean functions h(x), transformations of inputs, or regions of X where new simulator runs should be performed; see Bastos and O'Hagan (2009).We focus on selection of an appropriate mean function and improvement of GP emulators via the addition of a nugget.

Mean function selection via model comparison
It is common in the application of GP emulators to usually assume a simple form for the mean function such as h(x) = 1 or h(x) = c(1, x) (see, for example, Bayarri et al., 2007).
Clearly, for the lightweight emulator, with uncorrelated errors, such a simple assumption will usually be inappropriate.We demonstrate in Section 4 that using an overly complex mean function (i.e.overfitting) can also be detrimental to the accuracy of the emulator on an independent test data set, as with the more usual applications of the linear model.This motivates the use of Bayesian model comparison as a vehicle for the selection of an appropriate mean function.
Let each unique choice of h(x) be indexed by v, i.e. we label mean functions as h v (x), with v ∈ V and V denoting the set of possible models.Then, following equations ( 2) and ( 7), where and δ v are hyperparameters for the vth model, r v holds the correlation parameters for the vth model, and H v,0 , H v , A v , A v,0 , T v and B v for model v are analogous to matrices defined in Section 2.
A fully Bayesian approach would average (11) with respect to the posterior distribution of the correlation parameters, r v , and the posterior model probabilities to provide a model-averaged posterior predictive distribution.Alternatively, Bayesian model comparison can be used to identify a model v, based on the posterior model probabilities, and Y 0 |Y, rv , v can be employed as an emulator.The obvious choice for v is the posterior modal model with highest posterior model probability.We adopt this latter approach, both for computational convenience and also to provide interpretable emulators that aid scientific understanding of the simulator.
The posterior model probability for model v is given by | Ŝv | ( δv+k−1)/2 , and Γ k (•) is the multivariate gamma function (Javier and Gupta, 1985) The term π(Y |r v , v)π(r v |v)dr v which features in the posterior model probability is known as the marginal likelihood.For the GP emulator, the integration required to evaluate the marginal likelihood will not be analytically tractable.For the lightweight emulator, where A v = I n and does not depend on r v , the marginal likelihood is available in closed form.However, if the number of models, |V|, is large then calculating the marginal likelihood for every model will be computationally infeasible.Instead we generate a sample from the posterior distribution of the model index, v, using MCMC methods.For a GP emulator, each iteration of the MCMC method has two phases.
Phase 1 uses the MCMC model composition algorithm (Raftery et al., 1997) to update the model index conditional on the current value of the correlation parameters.Suppose the current model is v and a move to a model w is proposed with probability ρ(v, w) where the correlation parameters remain unchanged, i.e. r w = r v .The move is accepted with probability Phase 2 updates the correlation parameters, r v , conditional on the current model v using a suitable MCMC method.We employ a random walk Metropolis-Hastings algorithm.
For the lightweight emulator, phase 2 is not required.After a large number of iterations, when the chain has reached a stationary distribution, the proportion of iterations that visit model v provides an approximation to π(v|Y ).We choose ρ(v, w) such that (i) proposed models can only add or remove a single term from the current model, adhering to marginality, and (ii) all possible models that obey these conditions are equally likely to be proposed.

Non-zero nugget
Gramacy and Lee (2012) discussed improving the adequacy of univariate GP emulators via the inclusion of a non-zero nugget parameter, principally to mitigate the effects of incorrect model assumptions.Use of a nugget changes the (i, j)th element of A, where η ≥ 0 is the nugget parameter and I(i = j) is the indicator function.For prediction, we again adopt a plug-in approach for the nugget parameter, and replace η by a representative value η (the posterior mode).For model selection, the value of the nugget is sampled in phase 2 of the MCMC algorithm.The prior for η used in this paper is given by π(η) = (1 + η 2 ) −1 , previously used by Conti and O'Hagan (2010) for correlation parameters.
4 Application to the DIAMOND simulator In this section, the methodology from Sections 2 and 3 is employed to construct and check multivariate GP and lightweight emulators for the DIAMOND simulator.Recall that the scenario under investigation has been solely designed for model testing purposes.Hence, when, for example, we refer to the importance of specific input variables, we do so only in that context.In particular, we do not intend these observations to be applied to other situations.For the construction of each emulator, we scale the continuous input variables to [0, 1] and denote the levels of the categorical variables as {0, 1}.

Prior information
When constructing individual GP and lightweight emulators, we assume weak prior information for the model parameters B, Σ and r, following Conti and O'Hagan (2010): The correlation parameters r are assumed independent, with prior distributions specified using the approach of Linkletter et al. (2006).We rewrite c(x 1 , x 2 ; r), from (9), as , where ρ l = exp(−r l ) ∈ (0, 1) for r l > 0 (l = 1, . . ., p).We assume a uniform prior distribution for ρ l , leading to the induced prior for r l being an exponential distribution with E(r l ) = 1.
When performing model comparison for the selection of the mean function with only weak prior information available for the parameters of each model, we adopt prior hyperparameters S v = 0 k×k and δ v = −k + 1 for Σ v , which is present in all models, and unit information prior distributions for B v , with M v = 0 p×p and as proposed by Kass and Wasserman (1995).The use of proper prior distributions for B v avoids Lindley's paradox (see Bernardo and Smith, 1994, pg 394) which states that the posterior model probabilities are sensitive to the scale of the prior variance (see also O'Hagan and Forster, 2004, pp. 322-324, Raftery et al., 1997and Fernandez et al., 2001).We assume the same exponential prior, see above, for each element of r v for each model, i.e. π(r v |v) = π(r v ).A uniform prior over the model space is chosen, i.e. π(v) = |V| −1 , where V is the set of all sub-models of the maximal model that respect marginality.
The maximal model has a mean function consisting of the intercept, all linear, two-way interaction and, for the continuous inputs, quadratic terms.The resulting model matrix, H, has m = 103 columns.
For this weak prior information, α from ( 12) reduces to

Design of the computer experiment
We employed a space-filling design that would enable the estimation of both the Gaussian process and lightweight emulators.The most common design used for computer experiments is the Latin Hypercube (McKay et al., 1979) and its extensions (see, for example, Tang, 1993, andMorris andMitchell, 1995).Such designs provide low-dimensional uniformity in the input variables, hence achieving good projection properties, and allow the estimation of nonparametric regression models.They are also an attractive choice for lightweight emulation, as the exact form of the emulator will be unknown in advance of the data collection and a flexible design that allows the fitting of many different parametric models may be required (see Section 3.2).
The design, ζ = {x 1 , ..., x n }, for this study needed to combine both continuous and categorical input variables.We used a sliced space-filling design as proposed by Qian and Wu (2009) with n = 120 runs.Such a design, constructed from an orthogonal array, has not only good space-filling properties overall but also for the projection into the continuous variables for each combination of values of the categorical input variables.

Construction of adequate emulators
We constructed both lightweight and multivariate GP emulators for the DIAMOND simulator using the n = 120 simulator runs, each outputting k = 5 responses, from the sliced space-filling design as training data.For model validation and diagnostics, we use a second design ζ 0 = {x 01 , ..., x 0n 0 }, with associated n 0 × k simulator output matrix Y 0 .This design is also a sliced space-filling design with n 0 = 120 runs and was constructed using a different orthogonal array to that used to construct ζ.
We chose, assessed and compared emulators using the diagnostics from Section 3. We calculated the root mean squared error (RMSE) for Y 0 , , where Y 0,us is the simulator output from the uth validation run for response s.We also calculated the root relative mean squared error (RRMSE), where the point estimate γ us = E Y −1 0,us |Y, r /E Y −2 0,us |Y, r minimises the relative squared error loss function.

Lightweight emulators
Our first lightweight emulator was the maximal model.The value of the omnibus test statistic, U , and coverage of the 95% predictive probability intervals are given in Table 2.Note that the reference distribution for U has expected value of 0.030, and 2.5% and 97.5% quantiles of 0.019 and 0.044, respectively.The diagnostics indicate there is a discrepancy between the simulator and this emulator, with the observed value of U and the achieved coverage both being low.Further evidence of this discrepancy is the QQ-plot of the uncorrelated errors against a reference t-distribution, Figure 2(a); the points form a line with slope greater than one, indicating that the variance associated with the emulator predictions has been underestimated.
To attempt to alleviate the obvious inadequacy of this emulator, alternative mean functions h(x) were compared using Bayesian model comparison (Section 3.2).The posterior modal model was found from 10 5 iterations of the MCMC algorithm (discarding the first 10% of iterations as burn in).The algorithm took 2.5 minutes on a computer with a 3.20Ghz processor and 8Gb RAM, and the average acceptance rate for the proposed moves in Phase 1 was 4.7%, reflecting the concentration of the posterior model probabilities on a small number of models.Table 3 displays the terms in the posterior modal model, and gives the associated posterior marginal inclusion probabilities (i.e. the proportion of visited models that included that term).The model matrix, H, for the posterior modal model has m = 11 columns.The value of U and the coverage for the emulator with this alternative mean function are shown in Table 2.These values suggest there is no evidence of a discrepancy between the simulator and the emulator.This conclusion is supported by the QQ-plot of the uncorrelated errors in Figure 2(b).Also shown in Table 2 are the RMSE and the RRMSE of the maximal and modal model emulators.Note how the simpler form of emulator has smaller values for RMSE and RRMSE, indicating the modal model has significantly improved predictive accuracy.We construct GP emulators with four different forms for the mean function, h(x): (i) intercept only (m = 1); (ii) linear terms only (m = 8); (iii) the modal model found by the model comparison procedure (m = 7; see Table 3); and (iv) the maximal model (m = 103).We initially fix the nugget at zero.As a comparison with Section 4.3.1, the model comparison procedure took 30 minutes and had an acceptance rate of 0.5%.
Table 2 shows the values of U and the coverage for these four GP emulators.Figure 3(ad) show QQ-plots of the uncorrelated errors for these emulators.Clearly, the values in Table 2 and the QQ-plots show that there exist serious discrepancies between all four emulators and the simulator.Similar to the maximal lightweight emulator, the QQ-plot shows that the variances associated with the GP emulator predictions are underestimated.
To remedy these inadequacies, we included a non-zero nugget in emulators using all four forms of the mean function.The model comparison algorithm took 30 minutes and had an acceptance rate of 1.6%.The modal mean function for both types of GP emulator (with and without nugget) are identical (see Table 3).The values of U and the coverage for the four non-zero nugget GP emulators are also shown in Table 2.The corresponding QQ-plots are shown in Figure 3(e-h).There still exist discrepancies between the emulator and simulator for the maximal and linear forms of the mean function.However, for the intercept and modal forms, the values in Table 2 and the QQ-plots provide no evidence of inadequacy, with the diagnostics being highly plausible under their reference distributions.The values of RMSE and RRMSE for all eight GP emulators are also given in Table 2.
Note the high values of these errors under the maximal models.The intercept and modal GP emulators (with non-zero nugget) have significantly higher predictive accuracy than the lightweight emulators.There appears to be little difference between the intercept and modal model for the GP emulators (with non-zero nugget) in terms of predictive accuracy.

Sensitivity analysis
An important application of statistical emulators are sensitivity analyses to identify important input variables and their impact on the responses.For the lightweight emulator, the model comparison algorithm in Section 3.2.1 has the advantage of automatically identifying the most important input variables.When product terms are included in the mean function, it can also identify important interactions.For the DIAMOND simulator, there are interactions between the food capacity at Catania and both the location of the NGO base and the recipient of the food aid.There are also interactions between the food capacity at Giarre and the recipient of food aid and location of NGO base and recipient of food aid.There is evidence that planning time has a non-linear effect.
For the multivariate GP emulator, input variables can impact the response through both the mean function and correlation structure.Hence, the model selection algorithm in Section 3.2.1 may not identify all the important variables.For an intercept-only GP, the relative importance of the input variables is only determined by the relative magnitude of the corresponding correlation parameters, r.In general, the output is more sensitive to those input variables with large correlation parameters.As calibrating the size of correlation parameters can be difficult, Linkletter et al. (2006) proposed a more formal variable selection method for univariate GPs, reference distribution variable selection  (RDVS).Values of an inert input variable, x * , are randomly generated from the input space X .An MCMC sample is generated from the marginal posterior distribution of r and r * , where r * is the correlation parameter of the inert input variable.The above procedure is repeated B times with different randomly generated values of inert input variables.The posterior median of each element of r, approximated from the union of the MCMC samples from all randomly generated sets of inert input variables, is compared to the null reference distribution of the posterior medians of r * (obtained from the B sets of values for the inert input variable).For more details see Linkletter et al. (2006).
Application of RDVS to multivariate GP emulators is straightforward.Our simulator has both continuous and categorical input variables, and hence we adapt RDVS by at each iteration randomly generating values for two inert input variables, x * 1 and x * 2 , where x * 1 ∈ [0, 1] and x * 2 ∈ {0, 1}, with {0, 1} indicating the two levels for a categorical variable.The posterior median of the elements of r corresponding to continuous input variables is then compared to the null reference distribution of the posterior medians of r * 1 , and similarly for the categorical input variables and r * 2 .We applied RDVS with the GP emulator (with non-zero nugget and the intercept mean function), using B = 1000.Figure 4 displays the null reference distributions for the correlation parameters (on the log scale) of the (a) continuous and (b) categorical inert input variables, i.e. the 1000 posterior medians of the correlation parameters, r * 1 and r * 2 , from the MCMC samples.Also indicated in Figure 4 are the posterior medians of the actual input variables as vertical lines.Clearly the most important continuous input variables are the food capacities at both Giarre and Catania, planning time and helicopter speed.Both of the categorical input variables are deemed to be important.This agrees with the conclusions from the modal lightweight emulator, except for the inclusion of helicopter speed.
RDVS with a GP emulator having mean function including only an intercept is unable to explicitly identify interactions.A probabilistic sensitivity analysis (see, for example, Santner et al., 2003, ch. 7) can be used to understand and visualise the functional form of the individual and joint effects of the variables.
The variation in the simulator output induced by variation in the input variables can be decomposed into main effects and interactions.Assume interest is in the total number of casualties across days two to six of the disaster, given by g(x) = k i=1 f i (x).Letting E denote expectation with respect to an assumed joint distribution for the input variables x, we can then define the following main effects and first-order interactions: where g 0 = E [g(x)].Corresponding partial variances are given by Following Oakley and O'Hagan (2004), these variances can be estimated by their expectation, denoted E * , with respect to the posterior predictive distribution of g(x), a non-standard t distribution; see Section 5 of the Supplementary Material.Hence, the following estimated sensitivity indices can be defined: (second-order) where V = Var [g(x)] with respect to the distribution of the input variables.Explicit formulae for E * (V ), E * (V i ) and E * (V ij ) can be derived in terms of the expectation with respect to the distribution of the input variables, and are given in Section 6 of the Supplementary Material.
We assume that the input variables are independent, that the continuous variables are uniformly distributed over their corresponding ranges and the categorical input variables have probability 0.5 on each of their two levels.We compute the estimated sensitivity indices under both the multivariate GP emulator (intercept mean function and nonzero nugget) and, for comparison, the lightweight emulator (modal mean function).For the lightweight emulator, the estimated sensitivity indices are available in closed-form (Rougier, 2007) and can only be non-zero for those main effects and interactions featuring in the, selected, modal model.Under the multivariate GP emulator, the expectations with respect to the distribution of the input variables require approximation, achieved here using Monte Carlo integration.
Table 4 shows the estimated sensitivity indices under both emulators.For the multivariate GP, we present first-order estimated sensitivity indices for each of the variables identified by the RDVS method.We also present the seven largest second-order sensitivity indices; four of the corresponding interactions were selected in the modal lightweight emulator.The dominance of input variable x 6 , controlling the food capacity at Catania, is clear; variation in x 6 induces nearly 90% of the total output variation for both emulators.However, this input variable, in common with x 1 -x 5 is essentially a noise variable and clearly could not be controlled in a real disaster.Hence, of particular interest are the interactions between x 6 and the control variables x 7 -x 13 .To graphically investigate these effects for the GP emulator, in Figure 5 we display the expected conditional main effects for i = 8, 9, 12, 13 (as identified by RDVS) and l = 0, 1/3, 2/3, 1.For x 6 = 1, there are strong negative conditional effects for both categorical variables x 12 and x 13 , with lower casualties resulting from providing food aid only to Catania and, especially, locating the NGO base with the task force.However, for x 6 = 1, variable x 13 no longer has a substantive effect and x 12 now has a positive effect (lower casualties result from providing food aid to both cities).Planning time (x 8 ) always has a positive effect, although the degree of nonlinearity changes with the value of x 6

Discussion
Statistical emulation of multivariate simulators is an important problem in a number of application areas and presents challenging methodological issues.We have presented a unified Bayesian approach to the construction of both parametric (lightweight, linear model) and nonparametric (Gaussian process) emulators, including model selection, diagnostics and sensitivity analyses.Our application, emulating a humanitarian relief simulator applied to an artificial scenario involving an earthquake and volcanic eruption in Sicily, demonstrated the utility and versatility of the methodology.We were able to identify the most important input variables, and their interactions, using the lightweight emulator.While the GP emulator was more accurate, the lightweight emulator was more scientifically intuitive and informative.The technology in this paper provides the capacity for our collaborators to efficiently explore "what-if" questions and to make faster "in-theatre" decisions.
Extensions of the methodology to allow the construction and model-checking of different emulators are possible.In Section 4, only weakly informative prior distributions were assumed.If more informative prior information was available, this could be incorporated into both lightweight and GP emulators, for example via the prior distribution for the regression parameters B|Σ.It is likely that the use of such information would lead to a smaller difference in predictive accuracy between the two emulators, provided there was not a conflict between the prior information and the simulator.
Diagnostics for multivariate emulators were also employed by Fricker et al. (2013) in a number of case studies using models with a general class of non-separable covariance structure.These diagnostics were similar in spirit to those of Bastos and O'Hagan (2009) but, for example, the non-separability prevents analytic marginalisation across any of the scale parameters when calculating the equivalent to the omnibus statistic ( 10).An alternative non-separable model may be constructed as the full posterior distribution under model uncertainty, see Section 3.2.1.The model-averaged posterior predictive distribution is then a mixture of matrix t-distributions; see also Rougier (2007), who proposed a mixture of matrix normal inverse Wishart joint prior distributions for B and Σ.The diagnostics described in Section 3.1 are straightforward to extend to mixture distributions by averaging over the components of the mixture using simulation.

Figure 1 :
Figure 1: Map of Sicily showing the locations of Mount Etna, Giarre, Catania, a possible humanitarian task-force base, and the capital city Palermo

Figure 2 :
Figure 2: QQ-plots of the uncorrelated errors against a reference t-distribution for lightweight emulators: (a) the maximal model; (b) the modal model

Figure 3 :
Figure 3: QQ-plots of the uncorrelated errors against a reference t-distribution for the zero nugget GP emulator with (a) the intercept model, (b) the linear model, (c) the modal model, (d) the maximal model, and for the non-zero nugget GP emulator with (e) the intercept model, (f) the linear model, (g) the modal model, (h) the maximal model

Figure 4 :
Figure 4: Histograms of the null reference distributions from the RDVS method for the correlation parameters of the (a) continuous and (b) categorical inert input variables.The posterior medians of the input variables are shown as vertical lines

Table 1 :
Input variables for the humanitarian relief mission simulator.The units of measurement for helicopter cargo capacity are specific to this simulator.Note that the initial populations in the simulator of Giarre and Catania are 27,000 and 300,000, respectively.Under normal circumstances, the simulator only expects 1% of the population per day to require hospital treatment.

Table 2 :
Observed values (to 3 decimal places) of the omnibus diagnostic U , coverage of the 95% predictive probability intervals, RMSE and RRMSE for the various emulators considered.The reference distribution for U has expected value of 0.030, and 2.5% and 97.

Table 3 :
Marginal posterior probabilities (up to 3 decimal places) of the terms in the modal

Table 4 :
Estimated first-and second-order sensitivity indices (multiplied by 1000 and displayed up to 3 decimal places) of the input variables under the lightweight and multivariate Gaussian process emulators.