A Bayesian spatiotemporal model to estimate long‐term exposure to outdoor air pollution at coarser administrative geographies in England and Wales

Estimation of long‐term exposure to air pollution levels over a large spatial domain, such as the mainland UK, entails a challenging modelling task since exposure data are often only observed by a network of sparse monitoring sites with variable amounts of missing data. The paper develops and compares several flexible non‐stationary hierarchical Bayesian models for the four most harmful air pollutants, nitrogen dioxide and ozone, and PM10 and PM2.5 particulate matter, in England and Wales during the 5‐year period 2007–2011. The models make use of observed data from the UK's automatic urban and rural network as well as output of an atmospheric air quality dispersion model developed recently especially for the UK. Land use information, incorporated as a predictor in the model, further enhances the accuracy of the model. Using daily data for all four pollutants over the 5‐year period we obtain empirically verified maps which are the most accurate among the competition. Monte Carlo integration methods for spatial aggregation are developed and these enable us to obtain predictions, and their uncertainties, at the level of a given administrative geography. These estimates for local authority areas can readily be used for many purposes such as modelling of aggregated health outcome data and are made publicly available alongside this paper.


Introduction
Long-term exposure to outdoor air pollution has been associated with many adverse effects on human health, such as respiratory and cardiovascular diseases. The literature establishing this linkage is rapidly growing (Boezen et al., 1999;Dockery and Pope, 1994;Samet et al., 2000;Kassomenos et al., 1995;Bell et al., 2004). Such efforts require accurate measurements of air quality and, since these are not available everywhere in a large spatial study domain, methods for spatial interpolation are in high demand. Typically, air pollution concentrations are monitored at a handful of sites which are then associated with large surrounding geographical regions. However, the outdoor air that we breathe is a toxic mix of airborne particles which travel hundreds and thousands of kilometres-for example see news headlines that Saharan dust and pollution from continental Europe are triggering high air pollution levels in the UK (http://www.bbc.com/news/science-environment-26848489; http://www. bbc.co.uk/news/uk-32233922). The absence of physical boundary walls between nations and smaller administrative geographies, and the predominant weather patterns at the time, will always allow movement of air pollution both spatially and temporally. Estimation of the spatial and temporal variation in air pollution levels must then be performed by appropriate spatiotemporal modelling which also directly allows estimation and then reporting of the associated uncertainties.
Recently, there has been much interest in developing statistical space-time models for air pollution levels observed over large spatial domains, e.g. the continental USA, Europe and other parts of the world such as Asia and South America where air pollution is a very prominent health hazard. Methodological developments include downscaler models (Berrocal et al., 2010;Alkuwari et al., 2013), data fusion models (Sahu et al., 2010), land use regression (Jerrett et al., 2005;Hoek et al., 2008;Shaddick et al., 2013;Venegas et al., 2014;Morrison et al., 2016), hierarchical space-time auto-regressive models (Sahu et al., 2007). Model-based methods for realtime forecasting of air pollution have also been developed (Huerta et al., 2004;McMillan et al., 2005;Sahu et al., 2009).
Statistical modelling of UK air pollution data for estimating long-term exposure poses a unique set of challenges because of the sparsity of the monitoring sites (about 144 sites covering England and Wales, which are the study region in this paper), irregular monitoring leading to a large percentage of missing data and the proximity to continental Europe with which the UK exchanges emission and pollution depending on the prevailing weather. Key references discussing Bayesian modelling for UK data include Lee and Shaddick (2010), Shaddick and Wakefield (2002) and Pirani et al. (2014). The spatial domain that was used in these three references is only the Greater London area with a limited number of monitoring sites. In particular, Lee and Shaddick (2010) modelled daily data from 30 monitoring sites between 2003 and 2005. They performed a simulation study for four pollutants: carbon monoxide, ozone (O 3 ), nitrogen dioxide (NO 2 ) and particulate matter PM 10 . However, they did not validate the air pollution modelling for the real life data from Greater London. Pirani et al. (2014) modelled daily PM 10 exposure data from 45 sites for 728 days during 2002-2003. Lastly, Shaddick and Wakefield (2002 considered spatiotemporal modelling of four pollutants measured daily at eight monitoring sites in London over the 4-year period 1994-1997. There have been numerous other air pollution modelling and validation efforts which are based on non-Bayesian methods. For example,  provided a comparative assessment of methods to predict mean annual PM 10 -concentrations across 52 monitoring sites in London. They did not, however, model temporal variation of the pollutants.  discussed geographical information system based air pollution dispersion models for citywide exposure assessment at daily temporal resolution. In a similar vein, the atmospheric dispersion model proposed by Carruthers et al. (2000) provides citywide exposure maps. Various other references, e.g. Atkinson et al. (2012), Pirani et al. (2014Pirani et al. ( , 2015 and Rushworth et al. (2014), discuss and use air pollution estimates to estimate health effects. However, their main purpose is not exposure modelling and validation, and hence these references do not report the accuracy of their air pollution estimates.
There has been much less interest in modelling UK-wide air pollution data. Recently, Lee et al. (2016) developed space-time models for air pollution data from England and Wales but only at the monthly temporal resolution to enable linkage with aggregated monthly health outcome data, which was their main purpose. High resolution spatial maps of UK-wide annual air pollution exposure levels for NO 2 and PM 10 for the year 2001 only are available from http://www.envhealthatlas.co.uk/ which has been prepared by the Small Area Health Statistics Unit of Imperial College London. They use land use regression methods for predicting concentrations at a 100 m × 100 m spatial resolution. However, the Web site does not report the accuracy of the estimates; nor does it provide estimates of air pollution at the daily temporal resolution.
Lack of statistical modelling for obtaining UK-wide air pollution estimates does not imply a lack of air pollution dispersion modelling using physical and chemical transport models. For example, Savage et al. (2013) developed the air quality unified model (AQUM) for the whole of the UK. The AQUM is a three-dimensional weather and chemistry transport model used by the Met Office to deliver the UK national air quality forecast for the Department for Environment, Food and Rural Affairs and for scientific studies of atmospheric composition and air quality. The model has been run in hindcast mode to recreate hourly varying, UK air pollution concentrations for the period 2007-2011. However, the raw outputs of the AQUM are biased (Savage et al., 2013) and there are no associated uncertainties for the air pollution estimates. This drawback excludes the direct use of the AQUM estimates, or their deterministic adjusted values, in rigorous scientific health effects studies where the uncertainties of the air pollution estimates must be taken into account; otherwise the health effect estimates may be inflated (Lee et al., 2016). To overcome the biases in the AQUM, we model monitoring data which are more accurate. The use of the AQUM output frees us from having to use emission data, as in Pirani et al. (2014), and relevant meteorological data as in Sahu et al. (2007). These additional variables, already included as inputs in the AQUM, do not remain significant in the Bayesian model when outputs from a computer simulation model, such as the AQUM, are already present (Sahu and Bakar, 2012).
The primary motivation for this paper is to develop space-time models for daily air pollution levels for five years, 2007-2011, in England and Wales. We use Bayesian model selection methods to validate and select the best model for each of the four pollutants NO 2 , O 3 , PM 10 and PM 2:5 . Using the model selected we obtain air pollution estimates at a 1-km square grid covering the whole study region. Our modelling at the point level, described by a latitude and a longitude pair, and at the daily temporal scale, enables us to develop air pollution estimates at any coarser level of spatial resolution, e.g. local authority levels and at any aggregated temporal levels, e.g. quarterly and annual. The main advantage of the Bayesian methods, implemented by using Markov chain Monte Carlo (MCMC) sampling, lies in the ability to estimate the uncertainties that are associated with the aggregated pollution levels, as we demonstrate in detail. The resulting aggregated pollution estimates are then available to be used to integrate modelling of health outcome data which are often recorded for administrative geographies, such as local authority areas; see for example Lee et al. (2016).
The remainder of the paper is organized as follows. Section 2 describes the data summaries and explores relationships between the pollutant levels and the AQUM values. In Section 3 we discuss the methods including the models and how to obtain the predictions at point level, and at coarser geographies. Data analysis and model validation results are presented in Section 4. Finally, a few summary remarks are placed in Section 5.
The data that are analysed in the paper and the programs that were used to analyse them can be obtained from http://wileyonlinelibrary.com/journal/rss-datasets

Data description
We model daily air pollution data collected from the 144 active automatic urban and rural network (AURN) stations in England and Wales for the five years 2007-2011. The AURN is the UK's largest automatic monitoring network and is the main network used for compliance reporting against the Ambient Air Quality Directives; see for example http://uk-air. defra.gov.uk/networks. The stations (see Fig. 1 for their locations) measure oxides of nitrogen, NO x , sulphur dioxide, O 3 , carbon monoxide and particles (PM 10 and PM 2:5 ). Data from these stations are publicly available from a wide range of electronic, media and Web platforms. The pollutants are measured at a height between 1.5 m and 4 m above ground level (https://uk-air.defra.gov.uk/networks/site-info?site id=LB) depending on the type of pollutant. The data quality is verified and ratified on an on-going basis as detailed at https://uk-air.defra.gov.uk/networks/site-info?site id=LB.
We obtain the daily concentration data for the four pollutants NO 2 , O 3 and particles less than 10 μm (PM 10 ) and 2.5 μm (PM 2:5 ) in size since these are the most harmful pollutants in the UK which may have health effects; see for example Lee et al. (2016). All pollutants are measured in micrograms per cubic metre. In our modelling we have used the daily maximum for NO 2 , the daily maximum 8-h running mean for O 3 and the daily mean for both PM 10 and PM 2:5 following the 2008 European Union directive on air pollution; see European Commission (2008).
For each pollutant, the total number of possible daily observations is 262944 (= 144 × 1826). However, in our data set there are 159 534 (60.67%), 98 628 (37.51%), 79 227 (30.13%) and 58 177 (22.12%) daily observations respectively for the four pollutants NO 2 , O 3 , PM 10 and PM 2:5 . The remaining observations are missing for several reasons including instrument malfunction, discontinuation of some sites and then introduction of new replacement sites during the study period, or the fact that not all sites monitor all pollutants. Table 1 shows a  yearwise breakdown of the missing observations. Most notably, more than 90% of the PM 2:5observations are missing for 2007 and 2008 because this pollutant was not monitored before 2009 since this was not considered to be one of the criteria pollutants (those which are regulated) until the publication of the 2008 European Union air pollution directive. In spite of the missing values the spatiotemporal models in Section 3 are estimated by using the large number of available observations as noted above.
The location of each of the 144 sites is classified into one of three site types: 'rural', 'urban' (which also includes suburban), and 'road and kerbside'. Summary statistics, as reported in Table 2, show significant variability in all four pollutants across these three site types. As  Table 2 shows that the rural sites are less polluted than the urban and the road-andkerbside sites for NO 2 , PM 10 and PM 2:5 whereas the converse is true for O 3 . O 3 concentrations are generally lower in urban areas than in the surrounding rural areas because of reaction with NO x (mostly NO) emissions, which are greatest in urban areas; see for example Grewe et al. (2012) and also page 3 of the report 'Ozone in the United Kingdom', written by the Air Quality Expert Group and available from http://www.defra.gov.uk/environment/airqua lity/aqeg. Boxplots of the distributions of each pollutant by site type are displayed in Fig. 2. These plots also show differences in variances for each pollutant by site type, with road-and-kerbside sites having a larger spread for NO 2 , PM 10 and PM 2:5 because of many extreme observations. Finally, boxplots of the pollution concentrations by year (which have been omitted for brevity) showed little variability in the median from year to year, although there is considerable variation in the distribution of the extreme values. The methods that are proposed in this paper address the sparsity of the observed AURN data by including the hourly hindcasts of the AQUM, which was developed by Savage et al. (2013) especially for the UK. The AQUM is a chemistry transport model based on weather and emission inventory data for which full details are provided in Savage et al. (2013). However, the AQUM does not use the observed AURN data but it produces output on a 12 km × 12 km square grid covering the whole UK. We use bilinear interpolation methods, which interpolate once horizontally and then vertically (see for example the Wikipedia entry https://en.wikipedia.org/wiki/Bilinear interpolation), to predict the hourly pollution concentration values at the corners of a 1 km × 1 km square grid that we use for our prediction. These hourly values are subsequently aggregated to daily values for our modelling. The corners of these 1-km grid cells do not coincide with the locations of the 144 AURN monitoring sites and to resolve this misalignment we use bilinear interpolation of the four grid corners of each site to estimate the daily AQUM output for the 144 observation sites. These outputs are moderately correlated with the corresponding observations with correlations of 0.44 for NO 2 , 0.68 for O 3 , 0.59 for PM 10 and 0.61 for PM 2:5 . We also obtained scatter plots showing these correlations but those have been omitted for brevity. These moderate values of correlations help us in regression modelling of the observations that we consider in the next section.
Inclusion of the site type classifications (rural, urban and road and kerbside) poses a problem for predictions at locations for which site types are unknown. We tackle the rural-urban classification problem by using map data from the moderate resolution imaging spectroradiometer satellite MODIS; see Schneider et al. (2009). These maps, based on satellite data from 2001-2002, provide land use information (urban-rural) at a 500-m spatial resolution and have an overall accuracy of 93%. Thus the rural-urban classification for each corner of a 1-km predictive grid location is accurately obtained from these maps. However, these data do not provide the roadand-kerbside site type classification for the 1-km predictive grid. To obtain these classifications, we use the detailed open data roads network product from the Ordnance Survey (https:// www.ordnancesurvey.co.uk/opendatadownload/products.html). We calculate the distance of each of the 1-km grid locations to the nearest road. The grid locations which are within 4 m of the nearest road are classified as road and kerbside.

Model specification and prediction
In our study region of England and Wales, we have daily data from n = 144 sites for T = 1826 days. As is common practice in modelling air pollution concentration data we model on the square-root scale to stabilize the variance (Sahu et al., 2007;Berrocal et al., 2010). However, for ease of interpretation, the accuracies of all predictions from the pollution model are assessed on the original scale. Let z .p/ .s i , t/ and x .p/ .s i , t/ respectively denote the measured and modelled AQUM pollution concentration on the square-root scale at location s i during day t for pollutant p. In what follows we suppress the subscript p for notational clarity.

Hierarchical model
As part of the Bayesian modelling hierarchy we proceed with the top level specification .s i , t/ ∼ N.0, σ 2 /, .1/ for i = 1, : : : , n and t = 1, : : : , T , where Y.s i , t/ is the true process and .s i , t/ is the independent nugget effect absorbing microscale variability. (Henceforth, values of the subscripts i and t will always be in the range that is mentioned here.) At the next stage of the hierarchy we specify where μ.s i , t/ denotes the mean surface and η.s i , t/ is a space-time process, which we specify later. The mean surface is modelled as where we propose a site-type-specific regression on the modelled square-root AQUM concentrations x.s i , t/. Here r = 3, corresponding to the three site types (rural, urban, road and kerbside), and the rural site type corresponds to j = 1 and is the baseline level. Thus .γ 0 , γ 1 / are respectively the slope and intercept terms for the rural sites, whereas .γ 0j , γ 1j / are the incremental adjustments for site type j, j = 2, 3. Finally, δ j .s i / is an indicator function, equalling 1 if site s i is of the jth site type and 0 otherwise.

Specification of the spatiotemporal process
We consider four modelling possibilities for the spatiotemporal process η.s i , t/, which differ in their level of sophistication. The first is the simplest that assumes η.s i , t/ = 0 for all sites s i and times t, which renders model (1) a simple regression model that is used for comparison with the other two models. The second model for η.s i , t/ is an independent-over-time Gaussian process (GP) with zero mean and a Matérn covariance function given by for φ > 0 and ν > 0, where Γ.ν/ is the standard gamma function, K ν is the modified Bessel function of the second kind with order ν and ||s − s || is the distance between sites s and s . The parameter φ controls the rate of decay of the correlation as the distance ||s i − s j || increases and the parameter ν controls smoothness of the random field (Banerjee et al., 2004;Cressie, 1993), i.e., for each time t, η t = .η.s 1 , t/, : : : , η.
where H η .φ, ν/ ij = C.||s i − s j ||; φ, ν/, j = 1, : : : , n, which is assumed to be the Matérn correlation function with decay and smoothness parameters φ and ν respectively. The special case of the Matérn correlation function when ν = 0:5 is called the exponential correlation function given by corr{η.s, t/, η.s , t/} = exp.−||s − s ||φ/: In this case φ alone determines the rate of decay of the spatial correlation as the distance d = ||s − s || between two sites increases. A related quantity, which is called the effective range, best describes the limit of the correlation decay to zero in practical situations. However, exp.−dφ/ = 0 for any finite value of d and φ. Hence the effective range, for a given value of φ, is defined to be the value of the distance d for which exp.−dφ/ = 0:05, which implies that d = 3=φ. The third model introduces non-stationary covariance structure following Sahu and Mukhopadhyay (2015). This is achieved by first assuming a set of knot locations, S Å m = .s Å 1 , : : : , s Å m /, which are specified below, for a value of m which will be chosen by out-of-sample prediction performance. Given S Å m , we assume that η Å t = .η.s Å 1 , t/, : : : , η.s Å m , t// T is a zero-mean GP with the Matérn covariance function (4). The non-stationary modelling proposal is to replace η.s i , t/ in equation (2) byη . 6/ The .n + m/-dimensional vector .η t , η Å t / is assumed to be a realization of the same underlying GP as in equation (4) independently for each t. By writingη t = .η.s 1 , t/, : : : ,η.s n , t// T and using multivariate Gaussian theory we havẽ where C Å .φ, ν/ is the n × m cross-correlation matrix between η t and η Å t , i.e. .C Å / ij = C.|s i − s Å j |; φ, ν/ for i = 1, : : : , n and j = 1, : : which shows non-stationarity of theη t -process. Theη t -surface is now based on linear functions of the m-dimensional η Å t instead of the ndimensional η t . This leads to a reduction in the computational burden when m is much smaller than n. However, n = 144 is not very large in this paper and hence dimension reduction by the Gaussian predictive process approximation (Banerjee et al., 2008) is not required here. We still use equation (7) to benefit from having a flexible non-stationary model for the spatial random effects. Sahu and Mukhopadhyay (2015) showed that this flexibility in modelling, even when m > n, leads to more accurate predictive models, which are also considered as candidate models in this paper.
Finally, we remove the temporal independence assumption in the previous model by introducing an auto-regressive model for η Å t . Here we assume that with η Å 0 = 0 as the initial condition and as the unknown auto-regressive parameter for which we specify a prior distribution. Thus, given m and S Å m , η Å t is determined from the above autoregressive process, and thenη t is obtained by using equation (7) for all t = 1, : : : , T .

Specifying the knot locations
Now we return to specifying the knot locations S Å m for a given m where m is to be chosen by cross-validation. Sahu and Mukhopadhyay (2015) showed that a random selection of S Å m is preferable to a space filling design that distributes the m locations evenly within the study region. In our implementation, we discretize by M points, which are the corners of a set of 1-km grid squares covering the study region. To add flexibility we assume a probability surface p.s Å j /, j = 1, : : : , M, such that Σ M j=1 p.s Å j / = 1 and p.s Å j / 0 for j = 1, : : : , M. The p.s Å j / can be a normalized population density surface that will guarantee that knots are placed at high density areas. A random sample of m locations is proposed to be used as the knot locations S Å m . In effect, this implies a discrete prior distribution for S Å m and MCMC model fitting is easily accomplished by using a Metropolis-Hastings step for S Å m where the proposal samples are drawn from the prior itself. Throughout, we choose m = 25 which was chosen by an out-of-sample root-mean-square prediction error criterion (see Section 3.6) among the possible values of 16, 25, 36, 49 and 100. A complete square value, e.g. 25, is put forward as a candidate so that an equal number of points is chosen in the two co-ordinate directions.

Specification of the prior distributions
Our proposed Bayesian model is completed by assigning vague but proper prior distributions for the remaining model parameters. These include zero-mean Gaussian priors for the regression parameters γ in equation (3) with large variances of 10 4 , and an inverse gamma prior distribution for the variance parameters .σ 2 , σ 2 η / with hyperparameters .2, 1/ parameterized to have mean 1 and infinite variance. Informative (and thus proper) prior distributions must be specified for the two parameters ν and φ describing the Matérn correlation function, since these are weakly identified in the likelihood; see for example Zhang (2004), who showed that consistent estimation is not possible for these parameters. In addition, sparsely observed spatial data are not very informative for the smoothness parameter ν; see for example Stein (1999). Hence we shall adopt the exponential correlation function corresponding to ν = 0:5 following many researchers, e.g. Berrocal et al. (2010) and Sahu et al. (2007).
Now it remains to specify a proper prior distribution for φ. Here, besides fixing φ at several plausible effective ranges we entertain two proper prior distributions: (a) a uniform prior distribution in the interval .0:001, 0:01/ corresponding to having an effective range between 300 and 3000 km and (b) a gamma prior distribution with parameters 2 and 1, parameterized to have mean 2 but with infinite variance.
The implied effective range between 300 and 3000 km corresponding to the uniform prior distribution provides opportunities for the models to have adequate spatial correlation but avoiding singularity corresponding to an infinite range. However, the gamma prior distribution does not bound φ within any finite range unlike the uniform prior distribution. In Section 4, suitable prediction validation criteria will be used to choose between these contrasting specifications.

Prediction details
The hierarchical space-time model allows us to predict pollutant concentrations at any location s and at any time point t, 1 t T . We first consider prediction of the large percentages of the missing data as noted in Section 2. Suppose that Z.s i , t/ is missing for particular values of s i and t. By virtue of the Bayesian model, implemented by using MCMC sampling, we have a value of the μ.s i , t/ .l/ at each iteration l for l = 1, : : : , L of the implemented MCMC algorithm.

Evaluating the predictive performance
The predictive performance of each pollution model is assessed by a cross-validation exercise, where data at sites .s 1 , : : : , s n 0 / are used to fit the model whereas data at sites .s n 0 +1 , : : : , s n ) are held out to assess predictive performance. The root-mean-square prediction error (RMSPE) and mean absolute prediction error (MAPE) are used to quantify prediction accuracy, which are given by whereẑ.s j , t/ is the posterior median from the predictive distribution. Here N v is the total number of available (i.e. non-missing) observations from these validation sites over the T = 1826 days. These criteria are used to select the best model. Once the best model has been chosen, we use all the available data from n = 144 sites for inference.

Aggregating predictions to administrative geographies
The proposed geostatistical models can predict at any unmonitored location s by using the methodology that was described above. In this section we consider spatial aggregation to any coarser level administrative geography, e.g. local authorities or electoral wards, which may be required to align with aggregated health outcome data. For the kth administrative region, denoted by A k , for k = 1, : : : , K, where K is the total number of such regions, we define the average pollution concentration at time t by where |A k | is the area of the region A k . We approximate equation (10)  where .s k1 , : : : , s kn k / forms a fine grid of prediction locations within the kth region. As mentioned in Section 1, we perform the predictions at a 1-km square grid covering the study region. Hence, the n k locations .s k1 , : : : , s kn k / are simply taken as the corners of the 1-km square grid which fall within A k .
Under the Bayesian inference setting,Z kt will have a posterior predictive distribution given the observed data z since each Z.s kj , t/ also has such a distribution. This posterior predictive distribution can be summarized by using MCMC samples drawn from it using composition sampling. Corresponding to each draw from the joint posterior distribution we draw a sample z .l/ .s kj , t/ from the posterior predictive distribution (9). At each MCMC iteration l, we form the averageZ .l/ kt = 1 n k n k j=1 z .l/ .s kj , t/: These MCMC iterates,Z .l/ kt , l = 1, : : : , L, are summarized to estimate Z kt . Uncertainties in these estimates are also easily estimated by using the MCMC iterates. Note that temporal aggregation can easily be performed by using the MCMC iterates. For example, if it is required to estimatē Z k = .1=T /Σ T t=1Z kt , then we simply obtain z .l/ .s kj , t/ .12/ for each l = 1, : : : , L and summaries, e.g. the mean and median, of these MCMC iterates are used. Moreover, uncertainties in these estimates expressed through, for example, credible intervals and standard deviations are also obtained by using the MCMC iterates. Prediction details for z .l/ .s kj , t/ for any arbitrary site s kj have been noted in the previous subsection. Lastly, we also note that temporal aggregation for a subperiod (e.g. annual) of the whole time domain can be performed similarly by simply taking the average similar to that in equation (12) but only using the simulations z .l/ .s kj , t/ for all the individual time points t which fall in that subperiod. For example, we can find annual averages from modelling 5 years' data.

Implementation details and models set-up
All model fitting and model choice results are based on L = 5000 MCMC iterations, after discarding first 5000 iterations at which point convergence was assessed. We first consider the issue of model choice and validation for each pollutant separately. These tasks are performed by fitting the models with data from 90 .62:5%/ randomly selected sites and then validating the available observations from the remaining 54 sites. Thus model fitting is done using a maximum of 164 340 .= 90 × 1826/ observations and validation is done using a maximum of 98 604 .= 54 × 1826/ observations. For individual pollutants the actual numbers of fitting and validation observations will be less than these maximums because of missing data. The 90 fitting sites contained 50 urban, 11 rural and 29 road-and-kerbside sites whereas the 54 validations sites contained 29 urban, seven rural and 18 road-and-kerbside sites. Thus all three site types are adequately represented in both the fitting and the validation set of sites. For model comparison we consider the RMSPE, MAPE, bias and actual coverage of the 95% prediction intervals, since spatial prediction is the main objective here. For each pollutant, in addition to the AQUM, we entertain nine modelling possibilities as follows. The first is simple kriging, performed independently for each time point (day here), using the well-known fields package (Furrer et al., 2013) in the R programming language. The second model is a simple linear regression model, implemented by using MCMC sampling in the Bayesian framework, that does not take care of the spatiotemporal dependence in the data, i.e. models (1)-(3) with η.s i , t/ = 0 for all i and t. To introduce spatial dependence only we consider the independent-in-time GP model described as the second model in Section 3. These three models are chosen solely for benchmarking since these off-the-shelf methods are often used in practice and also these are simple stationary GP versions of our proposed models.
The remaining seven Bayesian models are all based on the most complex non-stationary spatiotemporal models described in Section 3. The first five of the seven models are obtained for five different fixed values of φ, the decay parameter. The five values, which are denoted by φ i , i = 1, : : : , 5, correspond to effective ranges of 3500 (φ 1 ), 3000 (φ 2 ), 600 (φ 3 ), 300 (φ 4 ) and 100 (φ 5 ) km, and these choices are guided by the need to include large-to-moderate amounts of spatial correlation in the model. In the final two modelling attempts we specify the uniform and gamma prior distributions for the decay parameter φ as mentioned in Section 3.1.

Model validation and comparison results
Tables 3 and 4 report the validation results for the AQUM and all nine modelling scenarios noted above. They show that the raw AQUM outputs are far worse than all nine modelling strategies including kriging. This is expected as the AQUM does not use the actual observations and those outputted are heavily biased as noted previously. All six Bayesian models, based on the non-stationary spatiotemporal models, perform much better than the three benchmarking strategies: kriging, linear and GP. The differences between the performances of the six Bayesian models are not very large, which is because the dominant linear model part is the same for all six models. Slight differences in performance are observed by varying the spatial correlation structure. For NO 2 the model with φ fixed at φ 2 is chosen to be the best; the model with the  0.85 †SS, sample size, which is the number of daily observations. R 2 denotes the sample correlation coefficient between the predictions and actual observations. uniform prior distribution is best for O 3 and the model with the gamma prior distribution is best for PM 10 and PM 2:5 . Each of the four best models reports an R 2 -value, which is the value of the sample correlation coefficient between the observations and their predictions, between 80% and 90%, which shows very good agreement between the model-based predictions and the held-out observations. Model adequacy can be checked by the achieved coverages of the 95% prediction intervals, which are seen to be more than 90% for NO 2 , O 3 and PM 10 . However, the achieved coverages are between 80% and 84.6% for PM 2:5 . This is most likely to be because of the smaller sample sizes that are available both for fitting and for validation for this pollutant. However, the RMSPE of the best model is 4.30, which is less than half of the standard deviation of 9.52 for the full data (see Table 2) and hence the model does perform quite well in reducing prediction variability.
Next we investigate whether the best chosen model for each of the four pollutants performs similarly in validating the three site types: rural, urban, and road and kerbside. Recall that there were 31 urban, seven rural and 18 road-and-kerbside sites among the 54 validation sites. Table  5 shows the RMSPEs aggregated by the site types. It shows that the RMSPE is slightly smaller for the urban sites for NO 2 , PM 10 and PM 2:5 . This is expected since the urban sites outnumber the other site types among both the fitting and the validation sites, which leads to more accurate estimation and more stable values of the RMSPE. For O 3 , the rural sites have a smaller RMSPE value, which shows that the higher pollution values are estimated with on average more accuracy.
So that we can compare the accuracy of the proposed models with that of the previous modelling efforts of Pirani et al. (2014) we consider fitting our models to data from monitoring  Table 3 and Table 4. sites in Greater London only. We perform this experiment only for PM 10 because comparable model validation results for daily data are available from Pirani et al. (2014) for this pollutant only. Table 6 reports the results for their best model and the best performing model in this paper. Our model has a much lower RMSPE with a better R 2 -value than the Pirani et al. (2014) modelling effort, although this is not an exact comparison since the data that were used in Pirani et al. (2014) and in this paper are different. However, some justification for the comparison comes from the fact that the error rates are for daily data observed in the same spatial domain of Greater London.

Parameter estimates
Parameter estimates for the best Bayesian models are presented in Table 7. These parameter estimates have been obtained by fitting the models with all the available data from 144 sites; see the last column of Table 2. The models rightly find significantly elevated levels of NO 2 concentration in the urban and roadside sites compared with the rural sites since all the incremental slopes and intercepts are positively significant in Table 7. However, this behaviour reverses for O 3 , which is known to be lower in urban and roadside areas because of its negative correlation with NO 2 . The estimates for the same parameters for models fitted to the two species of particulate matter show a mixed behaviour regarding the three site types. The estimates show moderately significant temporal correlation, but spatial correlation seems to be more dominant since the estimates of φ,φ, show large effective ranges (3=φ).
The spatial variance σ 2 η is estimated to be higher than the nugget effect σ 2 except for the case of modelling NO 2 . This is plausible since the variability of NO 2 is much higher than that of the other three pollutants (see Table 2) and the model absorbs the extra variability by using the

Spatiotemporal aggregation results
Aggregation of the point level predictions to any given administrative geographies is performed by using block averaging at each MCMC iteration as detailed in Section 3.7. We also perform annual aggregation as detailed in Section 3.7. Here we illustrate annual aggregation for the 346 local authorities in England and Wales for 2011, boundaries of which have been shown in Fig. 1. Within each local authority, k, where k = 1, : : : , 346, we evaluate the Bayesian predictions z .l/ .s kj , t/ for each grid point s kj of a 1-km square grid. At each MCMC iteration l, these predictions are spatially and temporally aggregated to produce average local-authority-specific pollution estimates. These are then summarized to produce the predictive maps and their standard deviations are shown in Figs 3-6.
The annual pollution maps plotted, along with their uncertainties, show considerable spatial variation. As expected, the NO 2 levels are higher in London and other urban areas than in the rural areas. However, the reverse happens for O 3 . The particulate matter levels are consistently higher in urban areas. However, these values are lower in a few local authorities in central London according to Figs 5 and 6. This is very plausible as particulate matter pollution is linked to wood burning and such activities are not very common in central London; see Fuller et al. (2014). As expected, predictive uncertainties (see the maps of the standard deviation) are generally higher for the local authorities where there are not many monitoring sites. Also higher pollutant levels are generally associated with higher prediction standard deviations. This is expected since the boxplots in Fig. 2 reveal much more variability for the suspected extreme observations than the observations lying in the central box. The square-root transformation adopted reduces such a mean-variance relationship but does not completely eliminate it in the empirical modelling (Sahu et al., 2007). Further illustrations of the predictive maps are provided in the accompanying on-line supplemental material.

Discussion
This paper has developed statistical modelling and prediction methodologies for long-term exposure to outdoor air pollution levels and illustrated these for England and Wales with daily data from the 5-year period 2007-2011. We have considered the four most important pollutants: NO 2 , O 3 , PM 10 and PM 2:5 . Novelty in the statistical model has been introduced through spacetime random effects by means of a temporally correlated GP that is allowed to be anchored at a random set of locations within the study region. The random effect at any arbitrary location is obtained as a kriged value of the realizations of the GP at the anchoring points known as knots. By also varying the number of knots we obtain a highly flexible non-stationary random-effects surface that can match the non-stationarity in the daily air pollution surface. Further local information has been injected into the model through a site classification variable that describes the location of the monitoring sites. We have found that the site classifier taking three possible values best contrasts the pollution data according to land use. A site-type-specific regression model with site type varying intercept and slope has been put forward at the heart of the Bayesian space-time model. Empirical results show significance of the site-specific regression model for all four pollutants. It is also possible to incorporate a spatially varying coefficient model for the AQUM outputs. However, our preliminary results by using those models did not show any gain in predictive ability and, as a result, we have not considered those in this paper.
The models proposed have been empirically verified with hold-out data and have achieved the least, as far as we are aware, root-mean-square error rates for daily PM 10 -data over the 5year period. Comparable error rates for daily data for the other pollutants are not available for modelling data from England and Wales. However, many researchers model annual average data, and error rates from such modelling are available. For example, Shaddick et al. (2013) modelled the 2001 annual average NO 2 data from 934 sites in the then 15 European Union countries and their best model has a root-mean-squared error of 8. This is not directly comparable with our model RMSPE for the daily data as provided in Table 3 since daily data and annual average data have different variabilities. However, their root-mean-squared error of 8 is relative to the standard deviation of 12 for the annual average as reported in their Table 1. A similar comparison for our best model for NO 2 shows the RMSPE value of 17.65 which is relative to the overall data standard deviation of 37.19 as reported in Table 2 here. Thus our model has much lower relative root-mean-square error, compared with the standard deviation of the full data, than what was reported in Shaddick et al. (2013). For PM 10 ,  modelled annual mean data from 52 monitoring sites in London and their best land use regression model, PMLUR, reported an R 2 -value of 0.58. This is much lower than the gamma model R 2 of 0.85 in Table 6, although the R 2 -values are not fully comparable since our models are at the daily temporal resolution and the model of  is at the annual temporal resolution. We have also performed, though not reported here, other cross-validation studies with hold-out data from one site at a time and those also provide favourable verdicts for the models proposed. Thus using variants of the models we obtain the most accurate empirically verified maps of daily air pollution levels for England and Wales over the 5-year period.
The Bayesian prediction methodology developed has been applied to a 1-km square grid covering England and Wales. This enables us to aggregate the predictions to any administrative geographies that are coarser than the 1-km square grid, e.g. to electoral wards or local authority areas. These predictions are on a daily timescale and hence can be aggregated to monthly, seasonal and annual scales. For each aggregation task, one needs to average the MCMC iterates of the predictions. These aggregated MCMC iterates enable us to provide uncertainty estimates in the aggregated predictions. We have illustrated the local-authority-wise prediction maps for the year 2011. All the predictions and data sets are published on line alongside this paper from the Web site of the 'Medical and environmental data mash-up infrastructure' project (https://www.data-mashup.org.uk/research-projects/statistical-downs caling-of-gridded-air-quality-data/). The accompanying on-line supplement contains the descriptions of the files and further illustrative maps.
The published predictive values and maps can be used in many different types of scientific studies. For example, health effects estimation studies, such as Lee et al. (2016), can benefit from accurate air pollution estimates at any spatial and temporal resolution for any spatial domain (such as Greater London) within England and Wales. Compliance with air pollution regulation can also be judged at any unmonitored site by performing spatial prediction at that site of interest. Compliance summaries, such as the number of days exceeding a threshold level of any pollutant in a year, can be estimated, along with the associated uncertainties. For example, the European Union regulations, see European Commission (2008), state that the 24-h average PM 10 , the quantity modelled here, should not exceed the upper assessment threshold 35 μg m −3 for more than 35 days in a year. For O 3 , the regulation states that the modelled O 3 metric should not exceed 120 μg m −3 for more than 25 days per calendar year averaged over 3 years. The Bayesian modelling framework that is developed here can estimate the number of days for any particular site, monitored or unmonitored, for which such a threshold is exceeded and can also estimate maps of probability of non-compliance, as has been demonstrated in Sahu et al. (2007). Such assessments for England and Wales for all four pollutants and for all the five years require further investigation and will be considered elsewhere.
This paper is concerned only with exposure to outdoor pollution. Assessing the true personal exposure, including indoor air pollution, is a much more demanding task since people move between different environments, travel and relocate over a long study period of 5 years. There are many references discussing such issues; see for example Shaddick et al. (2008) and Zidek et al. (2005).
Finally, the methodology that is presented in this paper can be improved in several ways. For example, to account for correlation in the pollutants one can use multivariate spatiotemporal models. However, multivariate modelling will only be required if it is desired to study the correlations between the pollutants. Further methodological development is also required to produce air pollution estimates, and their associated uncertainties, for user-defined geographies and temporal windows so that air pollution estimates are available on demand for the spatial domain at the desired temporal resolution as required by the user.