Emulation of long-term changes in global climate: application to the late Pliocene and future. the Past

. Multi-millennial transient simulations of climate changes have a range of important applications, such as for investigating key geologic events and transitions for which high-resolution palaeoenvironmental proxy data are available, or for projecting the long-term impacts of future climate evolution on the performance of geological repositories for the disposal of radioactive wastes. However, due to the high computational requirements of current fully coupled general


Introduction
Palaeoclimate natural archives reveal how the Earth's past climate has fluctuated between warmer and cooler intervals. Glacial periods, such as the Last Glacial Maximum (e.g. , exhibit relatively lower temperatures associated with extensive ice sheets at high northern latitudes (Herbert et al., 2010;Jouzel et al., 2007;Lisiecki and Raymo, 2005), whilst interglacials are characterized by much milder temperatures in global mean. Even warmer and sometimes transient ("hyperthermal") intervals, such as occurred during the Palaeocene-Eocene Thermal Maximum (e.g. Kennett and Stott, 1991), are characterized by even higher global mean temperatures. Assuming that on glacial-interglacial timescales and across transient warmings and climatic transitions tectonic effects can be neglected, the timing and rate of climatic change is at least partly controlled by the three main orbital parameters -precession, obliquity and eccentricity -which have cycle durations of approximately 23, 41, and both 96 and ∼ 400 kyr (thousand years), respectively (Berger, 1978;Hays et al., 1976;Kawamura et al., 2007;Lisiecki and Raymo, 2007;Milankovitch, 1941). Further key drivers of past climate dynamics include changes in atmospheric CO 2 concentration and in respect of the glacial-interglacial cycles, changes in the extent and thickness of ice sheets.
In order to investigate the dynamics, impacts and feedbacks associated with the response of the climate system to orbital forcing and CO 2 , long-term (> 10 3 yr -years) projections of changing climate are required. Transient simulations such as these are useful for investigating key past episodes of extended duration for which detailed palaeoenvironmental proxy data are available, such as through the Quaternary and Pliocene, allowing data-model comparisons. Simulations of long-term future climate change also have a number of applications, such as in assessments of the safety of geological disposal of radioactive wastes. Due to the long halflives of potentially harmful radionuclides in these wastes, geological disposal facilities must remain functional for up to 100 kyr in the case of low-and intermediate-level wastes (e.g. Low Level Waste Repository, UK; LLWR, 2011), and up to 1 Myr in the case of high-level wastes and spent nuclear fuel (e.g. proposed KBS-3 facility, Sweden; SKB, 2011). Projections of possible long-term future climate evolution are therefore required in order for the impact of potential climatic changes on the performance and safety of a repository to be assessed (NDA, 2010;Texier et al., 2003). Indeed, while the glacial-interglacial cycles are expected to continue into the future, the timing of onset of the next glacial episode is currently uncertain and will be fundamentally impacted by the increased radiative forcing from anthropogenic CO 2 emissions (Archer and Ganopolski, 2005;Ganopolski et al., 2016;Loutre and Berger, 2000b).
Making spatially resolved past or future projections of changes in surface climate generally involves the use of fully coupled general circulation models (GCMs). However, a consequence of their high spatial and temporal resolution and structural complexity (and attendant computational resources) is that it is not usually practical to run them for simulations of more than a few millennia, and invariably, rather less than a single precessional cycle. Even when run for several thousand years, only a limited number of runs can be performed. Previously, therefore, lower-complexity models such as Earth system Models of Intermediate Complexity (EMICs) have been used to simulate long-term transient past (e.g. Loutre and Berger, 2000a;Stap et al., 2014) and future (e.g. Archer and Ganopolski, 2005;Eby et al., 2009;Ganopolski et al., 2016;Loutre and Berger, 2000b) climate development. Where GCMs have been employed, generally only a relatively small number of snapshot simulations of particular climate states or time slices of interest have been modelled (e.g. Braconnot et al., 2007;Haywood et al., 2013;Marzocchi et al., 2015;Masson-Delmotte et al., 2011;Prescott et al., 2014).
In this study, we present long-term continuous projections of climate evolution based on the output from a GCM, via the use of a statistical emulator. Emulators have been utilised in previous studies for a range of applications, including sensitivity analyses of climate to orbital, atmospheric CO 2 and ice sheet configurations (Araya-Melo et al., 2015;Bounceur et al., 2015) and model parameterisations (Holden et al., 2010). However, to the best of our knowledge, this is the first time that an emulator has been trained on data from a GCM and then used to simulate long-term future transient climate change. It should be noted that, whilst other research communities may use different terms, we refer to the groups of climate model experiments as "ensembles", and we refer directly to the GCM when discussing calibration of the emulator, rather than using the term "simulator" as has been used in a number of previous studies.
We calibrated an emulator using surface air temperature (SAT) data produced using the HadCM3 GCM (Gordon et al., 2000). Two ensembles of simulations were run, with varying orbital configurations and atmospheric CO 2 concentrations. Each ensemble was run twice, once with modernday continental ice sheets and once (for a reduced number of members) with reduced-extent ice sheets. We adopted this approach because in at least two of the intended uses for the emulator (Pliocene, and long-term future climate for application to performance assessments for potential radioactive waste repositories), it is thought that the Greenland and West Antarctic ice sheets (GrIS, WAIS) could be reduced relative to their current size. The implications and uncertainties asso-ciated with this approach are discussed in Sect. 7. The ensembles thus cover a range of possible future conditions, including the high atmospheric CO 2 concentrations expected in the near-term due to anthropogenic CO 2 emissions, and the gradual reduction of this CO 2 perturbation over timescales of hundreds of thousands of years by the long-term carbon cycle (Lord et al., 2015(Lord et al., , 2016. We go on to illustrate a number of different ways in which the emulator can be applied to investigate long-term climate evolution over hundreds of thousands to millions of years. Firstly, the emulator is used to simulate SAT changes for the late Pliocene for the period 3300-2800 kyr BP (before present) for a range of CO 2 concentrations. This interval occurs in the middle part of the Piacenzian age, and was previously referred to as the "mid-Pliocene" (e.g. Dowsett and Robinson, 2009). During this time, global temperatures were warmer than pre-industrial (e.g. Dowsett et al., 2011;Haywood and Valdes, 2004;Lunt et al., 2010), before the transition to the intensified glacial-interglacial cycles that are associated with Quaternary climate . We then apply the emulator to future climate, simulating temperature and precipitation data for the next 200 kyr AP (after present) for a range of anthropogenic CO 2 emissions scenarios. Regional changes in climate at a number of European sites (grid boxes) are presented, selected either because they have been identified as adopted or proposed locations for the geological disposal of solid radioactive wastes, as in the cases of Forsmark, Sweden, and El Cabril, Spain, or simply as reference locations where a suitable site has not yet been identified, as in the cases of Switzerland and the UK.
The paper is structured such that the theoretical basis of the emulator is described in Sect. 2, the GCM model description and simulations are presented in Sect. 3 and an account of how the emulator is trained and evaluated is given in Sect. 4. Section 5 presents illustrative examples of a number of potential applications of the emulator for the late Pliocene. Further examples of the application of the emulator to the next 200 kyr are described in Sect. 6. Section 7 includes a description and discussion of uncertainties associated with the methodology and tools, and the conclusions of this study are presented in Sect. 8.

Theoretical basis of the emulator
The emulator is a statistical representation of a more complex model, in this case a GCM. It works on the principle that a relatively small number of experiments, which fill the entire multidimensional input space (in our case, four dimensions consisting of three orbital dimensions and a CO 2 dimension), albeit rather sparsely, are carried out using the GCM. The statistical model is calibrated on these experiments, with the aim of being able to interpolate the GCM results such that it can provide a prediction of the output that the GCM would produce if it were run using any particular input configura-tion (i.e. any set of orbital and CO 2 conditions). If successful (as can be tested by comparing emulator results with additional GCM results not included in the calibration), no further experiments are required using the GCM; the emulator can then be used to produce results for any set of conditions or sequence of sets of conditions within the range of conditions on which it has been calibrated. It should not, of course, be used to extrapolate to conditions outside that range.
In this study, we use a principal component analysis (PCA) Gaussian process (GP) emulator based on Sacks et al. (1989), with the subsequent Bayesian treatment of Kennedy and O'Hagan (2000) and Oakley and O'Hagan (2002), and a PCA approach associated with . All code for the GP package is available online at https://github.com/ mcrucifix/GP. This principal component (PC) emulator is based on climate data for the entire global grid, as opposed to calibrating separate emulators based on data for individual grid boxes. This approach is taken because, for past climate, the global response overall is of interest, rather than just the response at specific locations individually. It also means that the results are consistent across all locations. For future climate, and in particular for application to nuclear waste, recommendations and results should be consistent across all sites, which would be especially relevant to a large country such as the US. Alternatively, for some countries and locations, it may be more appropriate to emulate specific grid boxes. The theoretical basis for the emulator and its calibration is as follows.
Let D represent the design matrix of input data with n rows, where n is the total number of experiments performed with the GCM, here 60 (sum of the two ensembles). The number of columns, p, is defined by the number of dimensions in input parameter space. In this case, p = 4 representing the three orbital parameters and atmospheric CO 2 concentration. A more detailed explanation of the orbital input parameters is included in Sect. 3; however, briefly, they are longitude of perihelion ( ), obliquity (ε) and eccentricity (e), with longitude of perihelion and eccentricity being combined under the form e sin and e cos . For a set of i = 1,n simulations, each simulation represents a point in input space, and is characterized by the input vector x i , i.e. a row of D.
The corresponding GCM climate data output is denoted f (x i ), where the function f represents the GCM model. This output for all n experiments is contained in the matrix Y. The raw output from the GCM is in the form of gridded data covering the Earth's surface, with 96 longitude by 73 latitude grid boxes. We perform a principal component analysis to reduce the dimension of the output data before it is used to calibrate the emulator. Each column of Y contains the results for one experiment, i.e. Y = [y(x 1 ), . . . , y(x n )]. Furthermore, the centred matrix Y * can be defined as Y − Y mean , where Y mean is a matrix in which each row comprises a set of identical elements that are the row averages where S is the diagonal matrix containing the corresponding eigenvalues of V, V is a matrix of the right singular vectors of Y, and U is a matrix of the left singular vectors. U and V are orthonormal, and V T * denotes the conjugate transpose of the unitary matrix V. The columns of US represent the principal components, and the columns of V the principal directions or axes. Each column of U represents an eigenvector, u k , and VS provides the projection coefficients β k . Specifically, for experiment i, a k (x i ) = k V ik S kk gives the projection coefficient for the kth eigenvector. The eigenvectors are ordered by decreasing eigenvalue, and in practice only a relatively small number of the eigenvectors will be retained (n ), typically selected on the basis of the largest values of a k (x). Thus, We calibrate the emulator using the reduced dimension output data rather than the raw spatial climate data. However, for simplicity, we will first consider a simple GP emulator.
For this, the model output f (x) for the input conditions x is modelled as a stochastic quantity that is defined by a GP. Its distribution is fully specified by its mean function, m(x), and its covariance function, V (x, x ), which may be written The mean and covariance functions take the form where h(x) is a vector of known regression functions of the inputs, β is a column vector of regression coefficients corresponding to the mean function, c(x, x ) is the GP correlation function and σ 2 is a scaling value for the covariance function. h(x) and β both have q components and, as before, T denotes the transpose operation. A range of options are available for the regression functions h(x) and the GP correlation function c, the most suitable of which depends on the application of the emulator. Any existing knowledge that the user may have about the expected response of the GCM to the input parameters can be used to inform their function choices. However, if the emulator performs poorly, an alternative function can be selected, which may prove to be more suitable.
We assume a linear model, h(x) T = (1, x T ), with any non-linearities in the GCM response being absorbed by the stochastic component of the GP. The correlation function is exponential decay with a nugget, a detailed discussion of which can be found in Andrianakis and Challenor (2012). Hence, for the input parameters a = 1,p, the correlation function can be written as where δ is the correlation length hyperparameter for each input, ν is the nugget term and I is an operator which is equal to 1 when x = x , and 0 otherwise. The nugget term has a number of functions in this application, including accounting for any non-linearity in the output response to the inputs and for non-explicitly specified inactive inputs, such as initial conditions and experiment, and averaging length. It also represents the effects of lower-order PCs that are excluded from the emulator. Now consider run i, which has inputs characterized by x i and outputs by y i . Let H be the design matrix relating to the GCM output, where row i represents the regressors h(x i ), making H an n by q matrix. The adopted modelling approach states that the prior distribution of y is Gaussian, characterized by y ∼ N(Hβ, Following the specification of the prior model above, a Bayesian approach is now used to update the prior distribution. The posterior estimate of the GCM output is described by where and We follow the suggestion of Berger et al. (2001) and assume a vague prior (β, σ 2 ) that is proportional to σ 2 , an approach that has been adopted by several other studies, including Oakley and O'Hagan (2002), Bastos andO'Hagan (2009), Araya-Melo et al. (2015), and . The posterior distribution of the GCM output is a Student's t distribution with n − q degrees of freedom, but is sufficiently close to being Gaussian for this application. Now, taking the output from the PCA performed earlier, we apply the GP model to each basis vector (a k (x)), which has been updated according to Eqs. (7) and (8), in turn. Thus, where mean and covariance functions take the form The values of the hyperparameters are chosen by maximising the likelihood of the emulator, following Kennedy and O'Hagan (2000), and based on the following expression from Andrianakis and Challenor (2012): where K is an unspecified constant. On the recommendation of Andrianakis and Challenor (2012), a penalised likelihood is used, which limits the amplitude of the nugget: where M(ν, δ) is the mean squared error between the GCM's output data and the emulator's posterior mean at the design points, defined by M(ν, δ) = ν 2 /n(y − Hβ) T A −2 (y − Hβ). M(∞) is its asymptotic value at δ i → ∞, given by M(∞) = 1/n(y − Hβ) T (y − Hβ). is assigned a value of 1.
To summarise, in this study D is a 60 × 4 matrix (n × p) of input data, consisting of 60 GCM simulations and four input factors (ε, e sin , e cos and CO 2 ). The matrix Y contains the output data from the GCM, with dimensions of 96 × 73 × 60 (longitude × latitude × n). A PC analysis is performed on this output data, which is then used to calibrate the emulator. Four hyperparameters (δ) are used, due to there being four input factors, along with a nugget term (ν). The optimal values for these hyperparameters and the number of PCs retained are calculated during calibration and evaluation of the emulator, discussed in Sect. 4. The GCM data used in this study are mean annual SAT and mean annual precipitation, although these are each emulated separately using different emulators.

Model description
To run the GCM simulations, we used the HadCM3 climate model (Gordon et al., 2000;Pope et al., 2000) -a coupled atmosphere-ocean general circulation model (AOGCM) developed by the UK Met Office. Although HadCM3 can no longer be considered as state-of-the-art when compared with the latest generation of GCMs, such as those used in the most recent IPCC Fifth Assessment Report (IPCC, 2013), its relative computational efficiency makes it ideal for running experiments for comparatively long periods of time (of several centuries) and for running large ensembles of simulations, as performed in this study. As a result, this model is still widely used in climate research, both in palaeoclimatic studies (e.g. Prescott et al., 2014) and in projections of future climate (Armstrong et al., 2016). In addition, it has previously been employed in research into climate sensitivity using a statistical emulator (Araya-Melo et al., 2015). The horizontal resolution of the atmosphere component is 2.5 • latitude by 3.75 • longitude with 19 vertical levels, whilst the ocean has a resolution of 1.25 • by 1.25 • and 20 vertical levels.
HadCM3 is coupled to the land surface scheme MOSES2.1 (Met Office Surface Exchange Scheme), which was developed from MOSES1 (Cox et al., 1999). It has been used in a wide range of studies (Cox et al., 2000;Crucifix et al., 2005), and a comparison to MOSES1 and to observations is provided by Valdes et al. (2017). MOSES2.1 in turn is coupled to the dynamic vegetation model TRIFFID (Top-down Representation of Interactive Foliage and Flora Including Dynamics) (Cox et al., 2002). TRIFFID calculates the global distribution of vegetation based on five plant functional types: broadleaf trees, needleleaf trees, C3 grasses, C4 grasses and shrubs. Further details of the overall model set-up, denoted HadCM3B-M2.1aE, can be found in Valdes et al. (2017).

Experimental design
In our simulations, four input parameters are varied: atmospheric CO 2 concentration and the three main orbital parameters of longitude of perihelion ( ), obliquity (ε) and eccentricity (e). The extents of the GrIS and WAIS are also modified, although only between two modes -their present-day configurations and their reduced-extent Pliocene configurations . The extent and thickness of the East Antarctic Ice Sheet (EAIS) was not modified. A more detailed description of the continental ice sheet configurations is provided in Sect. 3.5.
We combined eccentricity and longitude of perihelion under the forms e sin and e cos given that, in general at any point in the year, insolation can be approximated as a linear combination of these two terms and obliquity (ε) (Loutre, 1993). The ranges of orbital and CO 2 values considered are appropriate for the next 1 Myr and a range of anthropogenic emissions scenarios. For the astronomical parameters, calculated using the Laskar et al. (2004) solution, this essentially equates to their full ranges of −0.055 to 0.055 for e sin and e cos , and 22.2 to 24.4 • for ε.
For CO 2 , an emissions scenario is selected from Lord et al. (2016) in which atmospheric CO 2 follows observed historical concentrations from 1750 CE (Common Era) to 2010 CE (Meinshausen et al., 2011), after which emissions follow a logistic trajectory, resulting in cumulative total emissions of 10 000 Pg C by year ∼ 3200 CE. This experiment was run for 1 Myr using the cGENIE Earth system model and aims to represent a maximum total future CO 2 re-Figure 1. Time series of atmospheric CO 2 concentration (ppmv) for the next 200 kyr following logistic CO 2 emissions of 10 000 Pg C, run using the cGENIE model (Lord et al., 2016). Also shown are the upper and lower CO 2 limits of the highCO 2 (red dashed lines) and lowCO 2 (green dashed lines) ensembles. The pre-industrial CO 2 concentration of 280 ppmv (horizontal grey dotted line) and the 110 kyr AP cut-off for the high-CO 2 ensemble (vertical grey dotted line) are included for reference.
lease. To put this into perspective: current estimates of fossil fuel reserves are approximately 1000 Pg C, with an estimated ∼ 4000 Pg C in fossil fuel resources that may be extractable in the future (McGlade and Ekins, 2015), and up to 20-25 000 Pg C in nonconventional resources such as methane clathrates (Rogner, 1997). The evolution of atmospheric CO 2 concentration over the next 200 kyr for this emissions scenario is shown in Fig. 1. Although in the cGENIE simulation, atmospheric CO 2 reaches a maximum of 3900 ppmv (parts per million) within the first few hundred years, this concentration is not at equilibrium and only lasts for a couple of decades before decreasing. As a result, the concentration at 500 years into the experiment, 3600 ppmv, is chosen as the upper CO 2 limit, which means that the climatic effects of emissions of more than 10 000 Pg C cannot be estimated with the emulator.
By the end of the 1 Myr emissions scenario, atmospheric CO 2 concentrations have nearly declined to pre-industrial levels, reaching 285 ppmv. However, this experiment does not account for natural variations in the carbon cycle, which result in periodic fluctuations in CO 2 . For example, during the Holocene (11 kyr BP to ∼ 1750 CE) atmospheric CO 2 varied between 260 and 280 ppmv (Monnin et al., 2004). A value of 250 ppmv is therefore deemed to be appropriate to account for these natural variations in an unglaciated world, in addition to possible uncertainties in the model, and hence is assumed as the value of the lower CO 2 limit in the ensemble.
The orbital and CO 2 parameter ranges that have been selected are also applicable to unglaciated periods during the late Pliocene, when atmospheric CO 2 was estimated to be higher than pre-industrial values (Martinez-Boti et al., 2015;Raymo et al., 1996). In this study, we do not consider or attempt to simulate past or future glacial episodes, which may be accompanied by larger continental ice sheets (see Sect. 7 for more discussion), although the conditions required to initiate the next glaciation, and extending the ensemble of GCM simulations to represent glacial states, are being investigated in a forthcoming study. The underlying assumption of our ensemble is that it is suitable for simulating periods for which the CO 2 concentration is high enough to prevent entry into a glacial state.
Two ensembles were generated, each made up of 40 simulations, meeting the recommended 10 experiments per input parameter (Loeppky et al., 2009). One ensemble includes orbital values suitable for the next 1 Myr and a relatively small range of lower CO 2 values, whereas the other ensemble represents the shorter-term future with a reduced range of orbital values and a larger range of higher CO 2 concentrations. This approach was adopted because various studies have shown that on geological timescales of thousands to hundreds of thousands of years an emission of anthropogenic CO 2 to the atmosphere is taken up by natural carbon cycle processes over different timescales (Archer et al., 1997;Lord et al., 2016). A relatively large fraction of the CO 2 perturbation is neutralised on shorter timescales of 10 3 -10 4 years, but it takes 10 5 -10 6 years for atmospheric CO 2 concentrations to very slowly return to pre-industrial levels (Colbourn et al., 2015;Lenton and Britton, 2006;Lord et al., 2016), if the effects of glacial-interglacial cycles and other natural variations, such as those due to imbalances between volcanic outgassing and weathering, are excluded. Hence, only a relatively short portion of the full million years has very high CO 2 concentrations under any emissions scenario, with the major part of the time having a CO 2 concentration no more than several hundred parts per million by volume above preindustrial, as demonstrated in Fig. 1.
The parameter ranges for the two ensembles, which are referred to as "highCO 2 " and "lowCO 2 ", are given in Table 1. The cut-off point for the highCO 2 ensemble is set at 110 kyr AP, as after this time eccentricity, which remained relatively low prior to this time, starts to increase more rapidly, and variability in e sin and e cos increases. This first ensemble therefore has CO 2 sampled up to 3600 ppmv, and the orbital parameters are sampled within the reduced range of values that will occur over the next 110 kyr. The lowCO 2 ensemble samples the full range of orbital values and the upper CO 2 limit is set to 560 ppmv. This upper limit also covers the range of CO 2 concentrations that have been estimated for the late Pliocene (e.g. Martinez-Boti et al., 2015; Seki et al., 2010). At 110 kyr AP in the 10 000 Pg C emissions scenario, the atmospheric CO 2 concentration is 542 ppmv, which is rounded up to twice the pre-industrial atmospheric CO 2 concentration (560 ppmv = 2 × 280 ppmv), a common scenario used in future climate-change modelling studies.
The benefits of the approach of having separate ensembles for high and low CO 2 mean that both parameter ranges have sufficient sampling density, whilst also reducing the chance of unrealistic sets of parameters, in particular for the period of the next 110 kyr. During this time, CO 2 is likely to be comparatively high, while eccentricity remains relatively low, and e sin and e cos exhibit relatively low variability. Having a separate ensemble in which CO 2 and the orbital parameters are only sampled within the ranges experienced within the next 110 kyr avoids wasting computing time on parameter combinations that are highly unlikely to occur, such as very high CO 2 and very high eccentricity. This methodology also provides the additional benefit of the low-CO 2 emulator being applicable to palaeo-modelling studies, as the ensemble encompasses an appropriate range of CO 2 and orbital values for many past periods of interest, such as the Pliocene.

Generation of experiment ensembles
We used the Latin hypercube sampling function from the MATLAB Statistics and Machine Learning Toolbox (LHC; MATLAB, 2012) to generate the two ensembles, thereby efficiently sampling the four-dimensional input parameter space (Mckay et al., 1979). Briefly, this method divides the parameter space within the prescribed ranges into n equally probable intervals, n being the number of experiments required, which in this case is 40 per ensemble. n points are then selected for each input variable, one from each interval, without replacement. The sample points for the four variables are then randomly combined. The LHC sampling function also includes an option to maximise the minimum distance between all pairs of points (the maxi-min criterion), which is utilised here to ensure the set of experiments is optimally space filling.
For each ensemble, 3000 sample sets were created, with each set consisting of an n by p matrix, X, containing the four sampled input parameter values for each of the 40 experiments, and then the optimal sample set was selected as the final ensemble based on a number of criteria. Following Joseph and Hung (2008), we seek, in addition to the maximin criterion, to maximise det(X T X). Here, we will term this determinant the "orthogonality" because the columns of the design matrix will approach orthogonality as this determinant is maximised (assuming that input factors are normalised). However, a limitation of the method of sampling the parameters e sin and e cos , rather than eccentricity and longitude of perihelion directly, is that due to the nature of the e sin and e cos parameter space, the sampling process favours higher values of eccentricity over lower ones. This is not an issue for the longitude of perihelion because when eccentricity is low the value of this parameter has little effect on insolation. However, the value of obliquity selected for a given eccentricity value could have a significant impact on climate, meaning that it is desirable to have a relatively large range of obliquity values for low (< 0.01) and high (> 0.05) eccentricity values, in order to sample the boundaries sufficiently. It was observed that the sample sets with the highest orthogonality had comparatively few, if any, values of low eccentricity, also meaning that a very limited number of obliquity values were sampled for low eccentricity. We therefore adopted the approach whereby all sample sets that demonstrated normalised orthogonality values that were more than 1 SD (standard deviation) above the mean orthogonality were selected. From these, the single sample set with the greatest range of obliquity values for low eccentricity, hence with maximal sampling coverage of the low-eccentricity boundary, was selected as the final ensemble design. The input parameter values for the highCO 2 and lowCO 2 ensembles are given in Table 2, and the distributions in parameter space are illustrated in Fig. 2.

AOGCM simulations
The two CO 2 ensembles were initially run with constant modern-day GrIS and WAIS configurations (modice). All experiments were initiated from a pre-industrial spin-up experiment, with an atmospheric CO 2 concentration of 280 ppmv, and pre-industrial ice sheet extents and orbital conditions. Atmospheric CO 2 and the orbital parameters were kept constant throughout each simulation, and each experiment was run for a total of 500 model years. This simulation length allows the experiments with lower CO 2 to reach nearequilibrium at the surface. Experiments with higher CO 2 have not yet equilibrated by the end of this period; the significance of this is addressed in Sect. 3.6. A number of the veryhigh-CO 2 experiments caused the model to become unstable and the interpretation of these experiments is discussed in Sect. 3.4 under "Very-high-CO 2 simulations". A control simulation was also run for 500 years, with the atmospheric CO 2 concentration and the orbital parameters set at pre-industrial values. All climate variable results for the model, unless specified, are an average of the final 50 years of the simulation. Anomalies compared with the pre-industrial control (i.e. emulated minus pre-industrial) are discussed and used in the emulator, rather than absolute values, to account for biases in the control climate of the model.

Very-high-CO 2 simulations
As mentioned previously, experiments in the highCO 2 ensemble with CO 2 concentrations of greater than 3100 ppmv become unstable. These experiments exhibit accelerating warming trends several hundred years into the simulation, which eventually cause the model to crash before completion. This is the result of a runaway positive feedback in the GCM caused, at least in part, by the vertical distribution of ozone in the model being prescribed for modern-day climate conditions. Consequently, the ozone distribution is not able to respond to changes in climate, meaning that when increased mean global temperatures result in an increase in altitude of the tropopause and hence an extension of the troposphere, relatively high concentrations of ozone, which were previously located in the stratosphere, enter the troposphere, resulting in runaway warming.
All other experiments ran for the full 500 years. However, those with a CO 2 concentration of 2000 ppmv or higher also exhibited accelerating warming trends before the end of the simulation. Consequently, only simulations with CO 2 concentrations of less than 2000 ppmv (equivalent to a total CO 2 release of up to 6000 Pg C) are included in the rest of this study, meaning the methodology is not appropriate for CO 2 values greater than this. This equates to 20 experiments in total from the highCO 2 ensemble, with CO 2 concentrations ranging from 303 to 1901 ppmv. All 40 of the lowCO 2 experiments were used.

Sensitivity to ice sheets
In addition to running the two ensembles with modern-day GrIS and WAIS configurations, we also investigated the climatic impact of reducing the sizes of the ice sheets. Many of the CO 2 values sampled, particularly in the highCO 2 ensemble, are significantly higher than pre-industrial levels, and if the resulting climate were to persist for a long period of time it could result in significant melting of the continental ice sheets over timescales of 10 3 -10 4 years (Charbit et al., 2008;Stone et al., 2010;Winkelmann et al., 2015).
We therefore set up the highCO 2 and lowCO 2 ensembles with reduced GrIS and WAIS extents (lowice), using the PRISM4 Pliocene reconstruction of the ice sheets . In this reconstruction, the GrIS is limited to high elevations in the Eastern Greenland Mountains, and no ice is present over Western Antarctica. Similar patterns of ice retreat have been simulated in response to future warming scenarios for the GrIS (Greve, 2000; Huybrechts and Table 2. Experiment set-up: orbital parameters (obliquity, eccentricity and longitude of perihelion) and atmospheric CO 2 concentration for simulations in the highCO 2 and lowCO 2 ensembles. All experiments in both ensembles were run with modern ice sheet (modice) configurations. Experiments shown in bold were also run with reduced ice sheet (lowice) configurations. The experiment number is given, and the experiment name is constructed using the ice sheet configuration, the ensemble name and the experiment number, for example, modice_lowCO2_1.  (Huybrechts and de Wolde, 1999;Winkelmann et al., 2015), equivalent to ∼ 7 m (Ridley et al., 2005) and ∼ 3 m (Bamber et al., 2009;Feldmann and Levermann, 2015) of global sea level rise, respectively. Large regions of the EAIS show minimal changes or slightly increased surface elevation, although there is substantial loss of ice in the Wilkes and Aurora subglacial basins .
The same CO 2 and orbital parameter sample sets were used for both ice configuration ensembles to allow the impact of varying the ice sheet extents on climate to be directly compared. Only the Greenland and Antarctic grid boxes were modified; the boundary conditions for all other grid boxes, as well as the land-sea mask, were the same as in the modernday ice sheet simulations. For Greenland and Antarctica, the extent and orography of the ice sheets was updated with the PRISM4 data, as well as the orography of any grid boxes that are projected to be ice-free. Soil properties, land surface type and snow cover were also updated for these grid boxes. Figure 3 compares the orography for the modice and lowice ensembles, clearly showing the reduced extents for the ice sheets.

Pattern scaling of reduced ice simulations
It was expected that reducing the size of the continental ice sheets would have a relatively localised impact on climate (Lunt et al., 2004) and that the effect would be of a linear nature. Therefore, a subset of five simulations from the two ensembles were selected as reduced ice sheet simulations (lowCO 2 -experiments 8, 19 and 29; highCO 2 -experiments 21, and 34; see Table 2), covering a range of orbital and CO 2 values.
A comparison of the mean annual SAT anomaly for the five experiments showed that the largest temperature changes occur over Greenland and Antarctica, particularly in regions where there is ice in the modice ensemble but that are ice free in lowice. The spatial pattern of the change is also fairly similar across the simulations, suggesting that the response of climate to the extents of the ice sheets is largely independent of orbital variations or CO 2 concentration. The SAT anomaly for the five lowice experiments compared with their modice equivalents was calculated and then averaged across the experiments, shown in Fig. 4a. The largest SAT anomalies occur locally to the GrIS and Antarctic ice sheet (AIS), accompanied by smaller anomalies in some of the surrounding ocean regions (e.g. Barents and Ross seas), with no significant changes in SAT elsewhere, in line with the results of Lunt et al. (2004), Toniazzo et al. (2004 and Ridley et al. (2005). This SAT anomaly, caused by the reduced extents of the GrIS and WAIS, was then applied (added) to the mean annual SAT anomaly data for all other highCO 2 and lowCO 2 modice experiments to generate the SAT data for two lowice ensembles.
Also shown in Fig. 4, for comparison, are mean annual SAT anomalies produced by the other forcings, including a doubling of CO 2 , the difference between maximum and minimum obliquity and the difference between "warm" orbital conditions and "cold" orbital conditions. The warming caused by increased CO 2 is more widespread (Fig. 4b), with the largest warming occurring at high latitudes and for land regions, in agreement with typical future-climate simulations (IPCC, 2013, p. 1059). The temperature change due to obliquity and warm versus cold orbital conditions is less than that for either reduced ice (compared to pre-industrial) or increased CO 2 . Changes in obliquity have the largest impact on temperatures in high-latitude regions since the exposure of these regions to the sun's radiation is most affected by changes in obliquity. Smaller temperature anomalies are observed over northern Africa and India and, since an increase in obliquity is indeed known to boost monsoon dynamics (e.g. Araya-Melo et al., 2015;Bosmans et al., 2015), changes in soil latent heat exchanges are therefore expected to contribute negatively to the temperature response. The comparison of warm versus cold orbital conditions, which highlights (annual mean) temperature changes primarily caused by precession, generally shows a warming trend, with the largest temperature changes occurring in monsoonal regions. Lower temperatures are observed in the Northern Hemisphere over northern Africa, India and East Asia, whilst warmer temperatures occur in the Southern Hemisphere over South America, southern Africa and Australia. Figure 4 demonstrates that the temperature forcing caused by CO 2 affects mean annual temperatures on a global scale, whilst the forcing due to ice sheet and orbital changes affects mean annual temperatures in specific regions, having a limited impact on global mean temperatures. This is supported by the relatively high global mean SAT anomaly for the 2 × CO 2 scenario of 4.2 • C, compared with the lower SAT anomalies that result from the obliquity and precession forcing of 0.4 • C each (see caption of Fig. 4).

Calculation of equilibrated climate
Given the high values of CO 2 concentration in many of the experiments, particularly in the highCO 2 ensemble, even by the end of the 500-year running period the climate has not yet reached steady state. We therefore estimated the fully equilibrated climate response using the methods described below.

Gregory plots
In order to estimate the equilibrated response, we applied the method of Gregory et al. (2004) to the model results, regressing the net radiative flux at the top of the atmosphere (TOA) against the global average SAT change, as displayed in figures termed Gregory plots Gregory et al., 2015). In this method, for an experiment that has a constant forcing applied (i.e. with no inter-annual variation) it can be assumed that where N is the change in the global mean net TOA radiative flux (W m −2 ), F is the effective radiative forcing (W m −2 , positive downwards), α is the climate feedback parameter (W m −2 • C −1 ) and T is the global mean annual SAT change compared with the control simulation ( • C). This method works on the assumption that if F and α are constant, N is an approximately linear function of T . By linearly regressing T against N, both F (intercept of the line at T = 0) and −α (slope of the line) can be diagnosed. The intercept of the line at N = 0 provides an estimate of the equilibrium SAT change (relative to the pre-industrial SAT) for the experiment, denoted T g eq to indicate it was calculated from the Gregory plots, and is equal to F /α. This is in contrast to the SAT change calculated directly from the GCM model data by averaging the final 50 years of the experiment ( T 500 ).
The Gregory plots for two modice experiments, modice_lowCO2_13 (CO 2 555.6 ppmv) and modice_highCO2_17 (CO 2 1151.6 ppmv), are shown in Fig. 5. These experiments were selected as they have CO 2 values nearest to the 2 × and 4 × pre-industrial CO 2 scenarios that are commonly used in idealised future climate experiments. For each experiment, mean annual data are plotted for years 1-20 of the simulation, and mean decadal 1550 N. S. Lord et al.: Emulation of long-term changes in global climate Table 3. Parameter values estimated from Gregory plots for the 2 × and 4 × pre-industrial CO 2 simulations. Shown are the effective radiative forcing (F , W m −2 ) and the climate feedback parameter (α, W m −2 • C −1 ) for years 1-20 and 21-100. The uncertainties are the standard error from the linear regression.
yr 1-20 yr 21-100 yr 1-20 yr 21-100 for approximate 2 × CO 2 (modice_lowCO2_13, circles) and 4 × CO 2 (modice_highCO2_17, triangles) experiments. Lines show regression fits to the global mean annual data points for years 1-20 (blue) and years 21-500 (red). Data points are mean annual data for years 1-20 (blue) and mean decadal data for years 21-500 (red). The T intercepts (N = 0) of the red lines give the estimated equilibrated SAT ( T g eq ) for the two experiments. The T intercepts of the dashed blue lines represent the equilibrium that the experiment would have reached if the feedback strengths in the first 20 years had been maintained. SAT is shown as an anomaly compared with the pre-industrial control simulation. data for years 21-500. The regression fits are to mean annual data in each case, and years 1-20 and 21-500 were fitted separately. The values for F and α estimated from  Table 3. These values are slightly lower than those identified in other studies using the same method. For example, Gregory et al. (2004) used HadCM3 to run experiments with 2 × and 4 × CO 2 , obtaining values for years 1-90 of 3.9 ± 0.2 and 7.5 ± 0.3 W m −2 for F , and −1.26 ± 0.09 and −1.19 ± 0.07 W m −2 • C −1 for α, respectively. Andrews et al. (2015) calculated F to be 7.73 ± 0.26 W m −2 and α to be −1.25 W m −2 • C −1 for years 1-20 and −0.74 W m −2 • C −1 for years 21-100 for 4 × CO 2 simulations using HadCM3. The differences between our results and theirs may be due to the fact that we used MOSES2.1 and the TRIFFID vegetation model, whereas they used MOSES1, which is a different land-surface scheme and does not account for vegetation feedbacks.
The decrease in the climate response parameter (α) as the experiment progresses suggests that the strength of the climate feedbacks changes as the climate evolves over time. Consequently, the T intercept (N = 0) for the first 20 years of the simulation underestimates the actual warming of the model. Over longer timescales, the slope of the regression line becomes less negative, implying that the sensitivity of the climate system to the forcing increases Gregory et al., 2004;Knutti and Rugenstein, 2015). This non-linearity has been found to be particularly apparent in cloud feedback parameters, in particular shortwave cloud feedback processes . A number of studies have attributed this strengthening of the feedbacks to changes in the pattern of surface warming (Williams et al., 2008), mainly in the eastern tropical Pacific where an intensification of warming can occur after a few decades, but also in other regions such as the Southern Ocean . The impact of variations in ocean heat uptake has also been suggested to be a contributing factor (Geoffroy et al., 2013;Held et al., 2010;Winton et al., 2010).
We take the T intercept (N = 0) for years 21-500 to give the equilibrium temperature change ( T g eq ) for the experiments, equating to values of 4.3 and 8.9 • C for the 2 × and 4 × CO 2 scenarios in Fig. 5. A limitation of this approach is that it assumes that the response of climate to a forcing is linear after the first 20 years, which has been shown to be unlikely in longer simulations of several decades or centuries Armour et al., 2013;. However, a comparison of the difference in temperature response to upper-and deep-ocean heat uptake and its contribution to the relationship between net radiative flux change (N) and global temperature change ( T ) in Geoffroy et al. (2013) indicated that the method of  of fitting two separate linear models to the early and subsequent (N, T ) data gives a good approximation of T g eq , F and α as they have been calculated here. A study by Li et al. (2013) also found that, using the Gregory plot methodology, T g eq was estimated to within 10 % of its ac- Figure 6. Equilibrated global mean annual change in SAT ( T g eq , • C) estimated using the methodology of Gregory et al. (2004) against global mean annual change in SAT ( T 500 , • C) at year 500 (average of final 50 years) for the lowCO 2 (circles) and highCO 2 (triangles) modice ensembles. The colours of the points indicate the CO 2 concentration of the experiment, from low (blue) to high (yellow). The 1 : 1 line (dashed) is included for reference. SAT is shown as an anomaly compared with the pre-industrial control simulation. tual value, obtained by running the simulation very close to equilibrium (∼ 6000 years). However, this was using the ECHAM5/MPIOM model, meaning that it is not necessarily also true for HadCM3.
Given that the slope of the 21-500 years regression line appears to become shallower with time, the estimates of T g eq should be taken as a lower limit of the actual equilibrated SAT anomaly. However, this tendency to flatten, particularly as the CO 2 concentration is increased, further justifies our use of the Gregory methodology; by the end of 500 years the high-CO 2 experiments have not yet reached steady state, and even in the lower-CO 2 experiments SAT increases very slowly and will thus likely take a long time to reach equilibrium. It would therefore not be feasible to run most of these experiments to steady state using a GCM, due to the associated computational and time requirements. Furthermore, on longer timescales the boundary conditions (orbital characteristics and, more importantly, atmospheric CO 2 concentrations) would have changed, such that, in reality, equilibrium would never be attained.

Equilibrated climate
The final estimates of T g eq for the lowCO 2 and highCO 2 modice ensembles range from a minimum of −0.4 • C (CO 2 264.5 ppmv) to a maximum of 12.5 • C (CO 2 1900.9 ppmv). Figure 6 illustrates the difference between global mean annual SAT anomaly calculated from the GCM model Figure 7. Equilibrated global mean annual change in SAT ( T eq , • C, blue), estimated by applying the T g eq / T 500 ratio identified using the Gregory methodology to the GCM data, against atmospheric CO 2 (ppmv) for the lowCO 2 (circles) and highCO 2 (triangles) modice ensembles. Also shown is T 500 (green), along with the idealised relationship between ln(CO 2 ) and T (red lines) for a climate sensitivity of 3 • C (solid), 1.5 • C (lower dashed) and 4.5 • C (upper dashed) (IPCC, 2013). SAT is shown as an anomaly compared with the pre-industrial control simulation. data ( T 500 ) and calculated using the Gregory plot ( T g eq ). Experiments with CO 2 below or near pre-industrial levels tended to reach equilibrium by the end of the 500 years, making a Gregory plot unnecessary; hence, T g eq is taken to be the same as T 500 in these cases. As CO 2 increases, the data points in Fig. 6 deviate further from the 1 : 1 line. This is the result of the ratio between T g eq and T 500 increasing, as the experiments grow increasingly far from equilibrium by the end of the GCM run with increasing CO 2 .
We next calculated the ratio between T g eq and T 500 for each experiment ( T g eq / T 500 ), which represents the fractional increase in climate change still due to occur after the end of the 500-year model run in order for steady state to be reached. To estimate the fully equilibrated climate anomaly, the spatial distribution of mean annual SAT anomaly was multiplied by the T g eq / T 500 ratio. The ratio identified for each experiment is assumed to be equally applicable to all grid boxes. The same scaling ratio was also applied to the precipitation anomaly data to estimate the equilibrated mean annual precipitation.
The equilibrated global mean annual SAT anomaly ( T eq ) for the highCO 2 and lowCO 2 modice ensembles is plotted against ln(CO 2 ) in Fig. 7, along with T 500 for reference. Also plotted in Fig. 7 are a number of lines illustrating idealised relationships between T eq and CO 2 based on a range of climate sensitivities. The most recent IPCC report suggested that the likely range for equilibrium climate sensitiv-ity is 1.5 to 4.5 • C (IPCC, 2013), hence sensitivities of 1.5, 3 and 4.5 • C have been plotted. The size of the correction required to calculate T eq from T 500 increases with increasing CO 2 and brings the final temperature estimates in line with the expected response (red lines), further increasing our confidence. The T eq estimated for the experiments generally follows the upper line, equivalent to an equilibrium climate sensitivity of 4.5 • C, which is higher than a previous estimate of 3.3 • C for HadCM3 (Williams et al., 2001). This difference may be due to our simulations being "fully equilibrated" following the application of the Gregory plot methodology. In addition, Williams et al. (2001) used an older version of HadCM3 and prescribed vegetation (MOSES1), whilst in this study interactive vegetation is used (MOSES2.1 with TRIFFID).

Calibration and evaluation of the emulator
By considering different contributions of modern and low ice, high and low CO 2 , different number of PCs, and different values for the correlation length hyperparameters, we generated an ensemble of emulators, in order to test their relative performance. The modice and lowice ensembles were treated as independent data sets that were used separately when calibrating the emulator since ice extent is not defined explicitly as an input parameter in the emulator code. This approach was adopted, rather than including the ice sheet extent as an active input parameter to the emulator, because only two ice sheet configurations have been simulated, which are not sufficient for an interpolation. One of the main benefits of including ice sheet extent as an active input parameter would be to emulate changing ice sheets over time, but this was beyond the scope of this study. ln(CO 2 ) was used as one of the four input parameters, along with obliquity, e sin and e cos . The performance of each emulator was assessed using a leave-one-out cross-validation approach, in which a series of emulators is constructed and used to predict one left-out experiment each time. For example, for the lowCO 2 modice ensemble (40 experiments), 40 emulators were calibrated with one experiment left out of each. This left-out experiment was then reproduced using the corresponding emulator and the results were compared with the actual experiment results. The number of grid boxes for each experiment calculated to lie within different standard deviation bands, and the RMSEs averaged across all the emulators were used as performance indicators to compare the different input configurations and hyperparameter value selections. The results in this section are applicable to the modice emulator, unless otherwise specified; however, the calibration and evaluation for the lowice emulator yielded similar trends and results.

Sensitivity to input data
We investigated the impact on performance of calibrating the emulator on the highCO 2 and lowCO 2 modice ensem-bles separately and combined. The lowCO 2 modice emulator generally performs slightly better in the leave-one-out crossvalidation exercise than the highCO 2 modice version, with a lower RMSE and fewer grid boxes with an error of more than 2 SD. Combining the two ensembles into one emulator results in a similar RMSE to the lowCO 2 -only modice emulator but decreases the RMSE compared with the highCO 2 -only modice emulator. As a consequence, we took the approach of calibrating the emulator on the combined ensembles for the rest of the study. This has the advantage that continuous simulations of climate with CO 2 levels that cross the boundary between the high-and low-CO 2 ensembles (∼ 560 ppmv), such as may be appropriate for emulation of future climate, can be performed using one emulator, rather than having to calibrate separate emulators for different time periods based on CO 2 concentration. There is also no loss of performance in the emulator for either set of CO 2 ranges, but rather a slight improvement for the highCO 2 ensemble.

Optimisation of hyperparameters
We calibrated two separate emulators, the first using the modice data and the second using the lowice data, both with 60 experiments each (combined highCO 2 and lowCO 2 ). The input factors (ε, e sin , e cos and ln(CO 2 )) were standardised prior to the calibration being performed; each was centred in relation to its column mean, and then scaled based on the standard deviation of the column. We tested different emulator configurations by varying the number of PCs retained, ranging from 5 to 20, and for each emulator configuration, the correlation length scales δ and nugget ν were optimized by maximisation of the penalised likelihood. This optimisation was carried out in log space, ensuring that the optimized hyperparameters would be positive. A leave-one-out validation was performed each time, and the modice and lowice configurations that performed best were selected as the final two optimized emulators. We found that a modice emulator retaining 13 PCs has the lowest RMSE and a relatively low percentage of grid boxes with errors of more than 2 SD. The scales δ for the modice emulator are 7.509 (ε), 3.361 (e sin ), 3.799 (e cos ), and 0.881 (CO 2 ) and the nugget is 0.0631. In contrast, a lowice emulator using 15 principal components exhibits the best performance, with length scales δ of 5.597 (ε), 2.887 (e sin ), 3.273 (e cos ), and 0.846 (CO 2 ) and a nugget of 0.0925. In both cases, the scales for the three orbital parameters are larger than the range associated with the input factors, indicating that the response is relatively linear with respect to these terms.
The modice emulator was evaluated using the leave-oneout methodology and results are shown in Fig. 8. The results suggest that the emulator performs well. Figure 8a shows the percentage of grid boxes for each left-out experiment estimated by the corresponding emulator within different standard deviation bands, along with the RMSE. The mean percentage of grid boxes within 1 and 2 SD is 80 and 97 %, Figure 8. Evaluation of emulator performance. (a) Bars give the percentage of grid boxes for which the emulator predicts the SAT of the left-out experiment to within 1, 2, 3 and more than 3 SD (standard deviations). Also shown is the RMSE for the experiments (black circles). Red lines indicate 68 and 95 %. (b) Global mean annual SAT index ( • C) calculated by the emulator and the GCM for the lowCO 2 (circles) and highCO 2 (triangles) modice ensembles. The 1 : 1 line (dashed) is included for reference. Note: this is the mean value for the GCM output data grid assuming all grid boxes are of equal size, hence not taking into account variations in grid box area: we therefore refer to it as a SAT index. SAT is shown as an anomaly compared with the pre-industrial control simulation.
which roughly corresponds to the 68 and 95 % ratios expected for a normal distribution, suggesting that the uncertainty in the prediction is being correctly captured.
Several of the experiments performed considerably worse than others, exhibiting below the expected number of grid boxes with errors within 1 SD (for reference, the mean value for 1 SD across the left-out experiments is 0.3 • C), and/or higher than the expected number of grid boxes with errors of greater than 2 SD, which is generally accompanied by a higher RMSE. However, the input conditions for these experiments are not particularly similar or unique. Experiments modice_highCO2_43, modice_highCO2_45 and modice_highCO2_46 all have a fairly low eccentricity and obliquity and a CO 2 concentration of ∼ 1000 ppmv, but there are multiple experiments with similar values that have lower RMSE values. A spatial map of the errors (not shown) indicates that the grid boxes with errors of 3 or more standard deviations are at high northern latitudes in these experiments. However, the signs of the anomalies are not the same across these experiments, as the emulator overestimates the Arctic SAT anomaly in modice_highCO2_43 and underestimates it in modice_highCO2_45 and modice_highCO2_46. This suggests that the emulator is perhaps not quite capturing the full model behaviour in high northern latitudes, particularly for low eccentricity values, but this is certainly not true for all experiments. The errors in the experiments are generally less than ±4 • C, and for most of the Arctic much lower than that. Note that the Arctic is a region in the model with high interannual variability; thus, one factor may be that the model simulations that are used to calibrate the emulator are not representative of the true stationary mean. There does not appear to be any obvious systematic error associated with the input parameters, suggesting that errors are less likely to be an issue resulting from the design of the emulator and more likely to arise from run-to-run variability in the behaviour of the underlying GCM. Figure 8b compares the mean annual "SAT index" for each left-out experiment calculated by the GCM and the corresponding emulator (note: this is the mean value for the GCM output data grid assuming all grid boxes are of equal size, hence not taking into account grid box area). There are no obvious outliers, and the emulated means are relatively close to their modelled equivalents. There also does not appear to be any significant loss of performance at very low or very high temperature, and therefore at very low or very high CO 2 .
In summary, our calibration and evaluation shows that the emulator is able to reproduce the left-out ensemble simulations reasonably well, with no obvious systematic errors in its predictions. Using the emulator, calibrated on the full set of 60 simulations (modice or lowice), we are able to simulate global climate development over long periods of time (several hundred thousand years or longer). Provided that the atmospheric CO 2 levels for the period are known and are within the limits of those used to calibrate the emulator, ice sheets do not change outside of the two configurations considered in the two ensembles, and the topography and landsea mask are unchanged.
In the next two sections, we present illustrative examples of a number of potential applications of the emulator by applying it to the late Pliocene in Sect. 5 and the next 200 kyr in Sect. 6.

Application of the emulator to the late Pliocene
In addition to being able to rapidly project long-term climate evolution, the emulator also allows climatic changes to be examined and analysed using a range of different methods that may not be possible using other modelling approaches. To illustrate this, we applied the lowice emulator to the late Pliocene and compared the results to palaeo-proxy data for the period. The lowice emulator was used because the ice sheets in this configuration are the PRISM4 Pliocene ice sheets . It should be noted, however, that this approach is only appropriate for periods of the Pliocene with equivalent or less-than-modern ice sheet extents (i.e. not glacial conditions), and that palaeogeographic changes for the Pliocene are not included here (see Sect. 7 for further discussion). We also tested the modice emulator, which, in agreement with the findings in Sect. 4, had a limited impact on the long-term evolution of global sea surface temperatures (SSTs) outside the immediate region of the ice sheets themselves. Potential applications of the emulator for palaeoclimate are described below.

Time series data
One application of the emulator is to produce a time series of the continuous evolution of climate for a particular time period, as is illustrated here where climate is simulated at 1 kyr intervals over the period 3300-2800 kyr BP. This period of the late Pliocene was selected because it has been extensively studied as part of a number of projects (e.g. PRISM; Dowsett et al., 2016;Dowsett, 2007, PlioMIP;Haywood et al., 2010Haywood et al., , 2016, represents the warm phase of climate (interglacial conditions) and does not include major glaciations -though the M2 cooling event may persist to the very start of the simulation at 3300 kyr BP, and the simulated period does include periods of likely glaciation, such as KM2 (∼ 3100 kyr BP) and G20 (∼ 3000 kyr BP). The emulator would not be appropriate for periods of extensive glaciation and may not be well-matched to the periods of lesser glaciation included within the simulated interval. Orbital data for each 1 kyr (Laskar et al., 2004) were provided as input to the calibrated emulator, along with three representative CO 2 concentrations. Three CO 2 reference scenarios were initially emulated, with constant concentrations of 280, 350 and 400 ppmv (although note that in reality, CO 2 varied on orbital timescales during this period; Martinez-Boti et al., 2015).
To illustrate the comparison of the emulator results to palaeo-proxy data, SST data for various locations were compared with the emulated SAT for the equivalent grid box.  Table 4. These Pliocene datasets were selected because they are all of sufficiently high resolution (≤ 4 kyr) for the impacts of individual orbital cycles on climate to be captured, whilst covering a range of locations and climatic conditions. Alkenone data are shown converted to SST using two commonly applied calibrations: Prahl et al. (1988) and Muller et al. (1998). All temperatures are presented as an anomaly compared with pre-industrial. The emulator results are compared with the SAT for the relevant grid box in the pre-industrial control experiment, whilst the proxy data are compared with SST observations for the relevant location taken from the HadISST data set (Rayner et al., 2003). Observations are annual means and are averaged over the period 1870-1900 CE. Table 4 presents the mean SAT anomaly (compared with pre-industrial) for the modelled period as estimated by the emulator for the 280 ppmv scenario for each of the four grid boxes. The mean increases with increasing CO 2 , by ∼ 1 • C at low latitudes to 2-3 • C at high latitudes for atmospheric CO 2 of 400 ppmv. Figure 10 illustrates the evolution of annual mean temperature variations through the late Pliocene as calculated using the various methods. For the equatorial and Arabian Sea sites (662 and 722), the SAT and SST estimates are relatively similar to each other in terms of the general estimated temperature, particularly for the higher-CO 2 scenarios of 350 and 400 ppmv. However, the comparison of timings and variations between the SAT and SST data is fairly poor, and there was not found to be a significant correlation between the emulated and proxy data tempera- tures at these sites when correlation coefficients were calculated. In fact, Site 982 was the only location for which significant (negative) correlations were found for a confidence interval of 95 %, although the correlation coefficient is still relatively low. These correlation coefficients were −0.22 (p value 0.004) for the Prahl et al. (1988) proxy SST data compared with the emulated SAT for the 280 ppmv scenario and −0.2 (p value 0.007) for the same SST data compared with the emulated SAT for the 350 ppmv scenario. The Muller et al. (1998) SST data demonstrated correlation coefficients that were essentially identical to those above when compared with the same emulated SATs.
At the higher latitudes, the simulated SAT estimate is generally lower than the proxy data SST. This is a common issue in GCM simulations of the late Pliocene, in which temperatures at high latitudes under increased CO 2 -induced radiative forcing are often underestimated (Haywood et al., 2013).
It could also be that the alkenones are not recording mean annual temperature and instead are being produced during peak warmth (e.g. during the summer months), especially at higher latitudes (Lawrence et al., 2009). This issue could be explored further by extending the methodology presented here to other variables. This seasonal bias could explain the large offset in temperature at the northernmost site (982), which exhibits a maximum difference in mean temperature anomaly for the period of 5.1 • C between data sets, and possibly also at Site U1313. The emulated uncertainty in SAT (defined as 1 SD of the emulated grid box posterior variance) is also shown in Fig. 10, and average values for the period are given in Table 4. This is slightly higher at the northernmost North Atlantic site (982) compared to the lower-latitude sites, but overall the uncertainty is relatively small when compared with the effects of variations in the orbital parameters and atmospheric CO 2 concentration.

Orbital variability and spectral analysis
The emulator can also be used to identify the influence of orbital variations on long-term climate change. One approach is to assess the spatial distribution of orbital timescale variability by plotting the standard deviation for a climate variable for each grid box, as illustrated for SAT in Fig. 9 for the 400 ppmv CO 2 scenario (blue lines in Fig. 10). Figure 9a shows mean annual SAT (compared with pre-industrial) produced by the emulator under modern-day orbital conditions. Anomalies over the majority of the Earth's surface are positive due to the relatively high atmospheric CO 2 concentration of 400 ppmv. Warming is larger at high latitudes, primarily due to a number of positive feedbacks operating in these regions (known as polar amplification). The greatest warming is centred over parts of the GrIS and WAIS, showing a similar spatial pattern to that in Fig. 4, and is a result of the reduced ice sheet extents in the emulated experiments compared with the pre-industrial simulation. Figure 9b shows the standard deviation of mean annual SAT for the late Pliocene, with spatial variations primarily illustrating differences in the impact of orbital forcing on climate. For example, the relatively higher values at high latitudes compared with low latitudes in Fig. 9b suggest that changes in the orbital parameters have a relatively large impact on SAT in these regions. This is consistent with astronomical theory, as changes in both obliquity and precession affect the distribution of insolation in space and time, with this effect being particularly significant at high latitudes. Monsoonal regions also demonstrate relatively large variations (Fig. 9b), including Africa, India and South America, in agreement with previous studies that suggest a link between orbital changes and monsoon variability (Caley et al., 2011;Prell and Kutzbach, 1987;Tuenter et al., 2003).
In order to visualise the effects of orbital forcing over time, a spectral wavelet analysis was performed on the SAT time series data produced by the emulator for the scenario with  (Laskar et al., 2004), showing eccentricity (black) and precession (radians, blue) on the left axis and obliquity (degrees, red) on the right axis. (b)-(e) Time series of emulated grid box mean annual SAT ( • C, plain lines), modelled every 1 kyr, for three constant CO 2 scenarios: 280 ppmv (black), 350 ppmv (red) and 400 ppmv (blue). Modelled using the lowice emulator. Error bands represent the emulated grid box posterior variance (1 SD). Also shown are SST proxy data ( • C, dotted lines) calibrated using the method of Prahl et al. (1988) (maroon) and the method of Muller et al. (1998) (green). SSTs for four ODP/IODP sites are compared: Site 982 (North Atlantic; Lawrence et al., 2009Naafs et al., 2010), Site 722 (Arabian Sea; Herbert et al., 2010) and Site 662 (tropical Atlantic; Herbert et al., 2010). SAT is shown as an anomaly compared with the pre-industrial control simulation, SST is shown as an anomaly compared with SST observations for the period 1870-1900 taken from the HadISST data set (Rayner et al., 2003). Note the different vertical axis scales.
constant CO 2 at 400 ppmv, shown in Fig. 10 (blue line). We used the standard MATLAB wavelet software of Torrence and Compo (1998) (available online at http://atoc.colorado. edu/research/wavelets). The wavelet power spectra for the four ODP/IODP sites are presented in Fig. 11, from which the dominant orbital frequencies influencing climate can be identified. For the late Pliocene up to ∼ 2900 kyr BP, Fig. 11 suggests that changes in emulated SAT are paced by a combination of precession (longitude of perihelion) and eccentricity, with periodicities of approximately 21 and 96 kyr, re- Figure 11. The wavelet power spectrum for 3300-2800 kyr BP (late Pliocene). Wavelet analysis was performed on emulated grid box mean annual SAT ( • C), modelled every 1 kyr using the lowice emulator, for a constant CO 2 level of 400 ppmv (blue line in Fig. 10b-e). The data are normalised by the mean variance for the analysed SAT data (σ 2 = 0.14 • C). Four ODP/IODP sites are compared: spectively. The influence of precession is also supported by Fig. 4c, which demonstrates precessional forcing in the regions where the sites are located, as well as by the frequency of the SAT oscillations for this period shown in Fig. 10 and the observation that it appears to have a larger impact on SAT at higher latitudes (Figs. 10 and 11). After ∼ 2900 kyr BP, obliquity appears to have an increased impact at the highlatitude Site 982, superimposing the precession-driven tem-perature variations with a periodicity of ∼ 41 kyr (Figs. 10  and 11). This signal is also apparent to a lesser extent at Site 722, but not at Site U1313. Spectral analysis of palaeoproxy data and June insolation at 65 • N also finds a reduction in the influence of precession and an increase in 41 kyr obliquity forcing around this time (Herbert et al., 2010;Lawrence et al., 2009). SAT changes at the lower-latitude sites generally continue to be dominated by variations in precession and eccentricity, although the relatively low eccentricity during this period is likely to reduce the impact that precession has on climate. It also significantly reduces the variability in temperature, which is also observed during the period of low eccentricity between approximately 3240 and 3200 kyr BP in both Figs. 10 and 11. The slightly higher amplitudes of the peaks in temperature around 3150, 3050 and 2950 kyr BP in Fig. 10 coincide with periods of high eccentricity, when its impact on climate is increased (Fig. 11). It is more difficult to identify orbital trends in the proxy data, particularly in sections with lower resolution. This is due to there being significantly more variation, both on shorter timescales of several tens of thousands of years and longer timescales of hundreds of thousands of years, likely caused in part by changes in atmospheric CO 2 . However, the amplitude of variations in the palaeo data at all four sites is generally, though not always, lower during periods of low eccentricity, particularly for the period ∼ 3225-3200 kyr BP.

Calculation of atmospheric CO 2
We also illustrate the use of the emulator for calculating a simple estimate of atmospheric CO 2 concentration during the late Pliocene and its comparison to published palaeo CO 2 records obtained from proxy data. CO 2 is estimated from the four alkenone SST records presented in Table 4  . Individual records of SST, rather than stacked benthic oxygen isotope data, were used because the GCM experiments that the emulator is calibrated on were only run for 500 years, meaning that deep-ocean conditions would not yet have spun-up sufficiently, particularly in the experiments with high CO 2 . Thus, it would not be appropriate to compare deep-ocean temperatures from the experiments with those from the proxy data.
A linear regression is performed on the emulated grid box mean annual SAT data versus prescribed atmospheric CO 2 concentration, for constant CO 2 scenarios ranging from 260 ppmv up to 800 ppmv. The CO 2 concentration is then estimated from the palaeo SST data based on this linear relationship and is presented in Fig. 12, along with the uncertainty. A number of CO 2 proxy records are also compared, derived from alkenone data at ODP Site 1241 in the east tropical Pacific (Seki et al., 2010) and Site 999 in the Caribbean (Badger et al., 2013;Seki et al., 2010), and from boron (δ 11 B) data at Site 662 (Martinez-Boti et al., 2015) and Site 999 (Bartoli et al., 2011;Martinez-Boti et al., 2015;Seki et al., 2010).
Our model-based CO 2 estimates suggest a mean atmospheric CO 2 concentration for the period of between approximately 350 ± 15 and 629 ± 37 ppmv (error represents the uncertainty taking into account the emulated grid box posterior variance for SAT), indicated at Sites 722 and 982, respectively. Our CO 2 estimates are generally higher than the CO 2 proxy records, particularly for the two North Atlantic sites (982 and U1313), where palaeo SST anomalies were also estimated to be high, compared with tropical anomalies, by the proxy data (Fig. 10). However, CO 2 concentrations derived from SST data calibrated using the approach of Prahl et al. (1988) at the tropical sites of 722 and 662 show greater similarity to the CO 2 proxy data, both in terms of mean concentration and variance (not shown). It is difficult to identify temporal similarities between our CO 2 estimates and the palaeo records. This is partly due to the high level of variability in our CO 2 time series, resulting from the variability in the SST records that they were derived from. In addition, the CO 2 proxy records have comparatively low resolutions, generally with intervals of 10 kyr or greater, and there is also considerable variation between them.
There is substantial variation between our CO 2 estimates at different sites, and this may be attributed to a number of causes. It could be that there are errors in the GCM model used, in particular in its representation of the response of climate to CO 2 and/or orbital forcing. There could be inaccuracies associated with the SST data at one or more locations as, if the model was assumed to be correct, the estimated CO 2 should be similar across the four locations. The fact that they are not may indicate that the temperature records are not consistent with each other, which may not have been obvious by just comparing the records visually. This is one of the potential advantages to using individual temperature records rather than stacked records. It may also be that there is an issue with the dating of some of the proxy records; the data may be correct but there may be uncertainties/inaccuracies in the age models. Alternatively, the emulator may be wrong; for example, there may be non-linearities in the climate response simulated by the GCM that it is not capturing. Finally, there may be errors related to the modelled representation of the ice sheets, which are fixed at a constant configuration. In reality, of the possible sources of error that have been identified, the variations are less likely to be the result of errors in the emulator's estimates of the GCM output because validation diagnostics did not seem to suggest systematic failures. They are also less likely to be due to unrepresented changes in climate due to the ice sheets. Whilst some of the variation at the high-latitude sites (982 and U1313) may be attributed to some regional climate processes not fully accounted for, e.g. involving the ice sheets and sea ice, two of the sites (722 and 662) are in tropical regions. Thus, SSTs at these sites would not be expected to be affected by changes in the ice sheets, and yet they show significantly different variations. Therefore, the inconsistencies are likely to be due to a combination of errors in the GCM model and inaccuracies in the SST data.  Muller et al. (1998) (light green). CO 2 is calculated based on a linear relationship between emulated grid box mean annual SAT (modelled using the lowice emulator) and CO 2 for three constant CO 2 scenarios of 280, 350 and 400 ppmv. Error bands represent estimated atmospheric CO 2 concentration taking into account the emulated grid box posterior variance (1 SD). Where the error appears to be very low, this is generally an artefact of the way that the data have been plotted. The pre-industrial CO 2 concentration of 280 ppmv (grey dotted line) is included for reference. In addition to using the emulator to model past climates, it can also be applied to future climate, and in particular on the long timescales (> 10 3 yr) that are of interest for the disposal of solid radioactive wastes. Previous modelling of long-term future climate has involved the use of lower-complexity models such as EMICs for transient simulations (Archer and Ganopolski, 2005;Eby et al., 2009;Ganopolski et al., 2016;Loutre and Berger, 2000b), or of GCMs to model a relatively small number of snapshot simulations of particular reference climate states of interest. The BIOCLIM (Modelling Sequential Biosphere Systems under Climate Change for Radioactive Waste Disposal) research programme (BIOCLIM, 2001(BIOCLIM, , 2003, for example, utilised both of these approaches to investigate climatic and vegetation changes for the next 200 kyr, for use in performance assessments for radioactive waste disposal facilities. Here, for the first time, a GCM has been used to project future long-term transient climate evolution, via use of the emulator. We provide illustrations of two possible applications of the emulator, including production of a time series of climatic data and assessing the impact of orbital variations on climate. This work has input to the International Atomic Energy Agency (IAEA) Modelling and Data for Radiological Impact Assessments (MODARIA) collaborative research programme (http://www-ns.iaea.org/projects/modaria).

Time series data
Similarly to the late Pliocene, snapshots of SAT and precipitation at 1 kyr intervals were produced using the modice emulator for the next 200 kyr, assuming modern day ice sheet configurations. The projected evolution of climate is a result of future variations in the orbital parameters and atmospheric CO 2 concentrations, which were provided as input data to the emulator (again, at 1 kyr intervals). Four CO 2 emissions scenarios were modelled, with the response of atmospheric CO 2 concentration to emissions and its long-term evolution calculated using the impulse response function of Lord et al. (2016). The scenarios adopted logistic CO 2 emissions of 500, 1000, 2000 and 5000 Pg C released over the first few hundred years, followed by a gradual reduction of atmospheric CO 2 concentrations by the long-term carbon cycle. These four scenarios cover the range of emissions that might occur given currently economic and potentially economic fossil fuel reserves, but not including other potentially exploitable reserves, such as clathrates.
Four single grid boxes are selected, shown in Fig. 13 is presented in Fig. 14, along with the emulated uncertainty (1 SD of the emulated grid box posterior variance). In the 500 Pg C scenario, the largest SAT increase of 4.1 ± 0.2 • C occurs at the Switzerland grid box, whilst the Spain grid box exhibits the largest warming in the 5000 Pg C scenario of 12.3 ± 0.3 • C. For comparison, when the lowice emulator is utilised, these values are reduced slightly to 3.9 ± 0.3 • C (Spain grid box) and 12.2 ± 0.3 • C (Spain grid box) in the 500 and 5000 Pg C scenarios, respectively. This peak in temperature occurs up to the first thousand years, when atmospheric CO 2 is at its highest following the emissions period, after which it decreases relatively rapidly with declining atmospheric CO 2 until around 20 kyr AP. By 200 kyr AP, SAT at all sites is within 2.6 • C (2.2 • C using the lowice emulator) of pre-industrial values, calculated by averaging the final 10 kyr of the 5000 Pg C scenarios. The emulated uncertainty for the next 200 kyr is of a similar magnitude to that for the late Pliocene and, similarly, is relatively small when compared with the fluctuations in SAT that result from orbital variations and changing atmospheric CO 2 concentration.
Until ∼ 20 kyr AP, the behaviour of the climate is primarily driven by the high levels of CO 2 in the atmosphere as a result of anthropogenic CO 2 emissions from a range of sources, including combustion of fossil fuels, land-use change and cement production. However, after this time, changes in orbital conditions begin to exert a relatively greater influence on climate, as the periodic fluctuations in SAT at all locations appear to be paced by the orbital cycles, which are shown in Fig. 14a.
The timing and relative amplitudes of the oscillations in future SAT are in good agreement with a number of previous studies. Paillard (2006) applied the conceptual model of Paillard and Parrenin (2004), previously mentioned in Sect. 5, to  (Laskar et al., 2004), showing eccentricity (black) and precession (radians, blue) on the left axis and obliquity ( • , red) on the right axis. (b)-(e) Time series of emulated grid box mean annual SAT ( • C), modelled every 1 kyr, for four CO 2 emissions scenarios: 500 Pg C (black), 1000 Pg C (green), 2000 Pg C (red) and 5000 Pg C (blue). Modelled using the modice emulator. Error bands represent the emulated grid box posterior variance (1 SD). Four sites are presented, representing grid boxes in Sweden, central England, Switzerland and Spain. SAT is shown as an anomaly compared with the pre-industrial control simulation. the next 1 Myr. The development of atmospheric CO 2 over the next 200 kyr, simulated by the model following emissions of 450 to 5000 Pg C and accounting for natural variations, shows a similar pattern of response to that of SAT presented here. Estimates of global mean temperature in Archer and Ganopolski (2005), derived by scaling changes in modelled ice volume to temperature, before applying anthropogenic CO 2 temperature forcing for a number of emissions scenar-ios, also demonstrate fluctuations in global mean annual SAT (not shown) of a similar timing and relative scale. The influence of declining CO 2 is still evident after 20 kyr, particularly for the higher-emissions scenarios, in the slightly negative gradient of the general evolution of SAT. This is due to the long atmospheric lifetime of CO 2 emissions , and is also demonstrated in other studies (Archer and Ganopolski, 2005;Archer et al., 2009;Lord et al., 2016;Pail-lard, 2006). The impact of excess atmospheric CO 2 on the long-term evolution of SAT appears to be fairly linear, with only minor differences between the scenarios and sites, discounting the overall offset of SAT for different total emissions.
One of the key uncertainties associated with future climate change, which is of particular relevance to radioactive waste repositories located at high northern latitudes, is the timing of the next glacial inception. This is expected to occur during a period of relatively low incoming solar radiation at high northern latitudes, which, for the next 100 kyr, occurs at 0, 54 and 100 kyr AP. A number of studies have investigated the possible timing of the next glaciation under pre-industrial atmospheric CO 2 concentrations (280 ppmv), finding that it is unlikely to occur until after 50 kyr AP (Archer and Ganopolski, 2005;Berger and Loutre, 2002;Paillard, 2001).
When anthropogenic CO 2 emissions are taken into account, the current interglacial is likely to last significantly longer, until ∼ 130 kyr AP following emissions of 1000 Pg C and beyond 500 kyr AP for emissions of 5000 Pg C (Archer and Ganopolski, 2005). A recent study by Ganopolski et al. (2016) using the CLIMBER-2 model found that emissions of 1000 Pg C significantly reduced the probability of a glaciation in the next 100 kyr, and that a glacial inception within the next 100 kyr is very unlikely for CO 2 emissions of 1500 Pg C or higher.
Our CO 2 emissions scenarios, modelled using the response function of Lord et al. (2016), suggest that atmospheric CO 2 will not have returned to pre-industrial levels by 100 kyr AP, equalling 298 and 400 ppmv for the 500 and 5000 Pg C emissions scenarios, respectively. We calculated the critical summer insolation threshold at 65 • N using the logarithmic relationship identified between maximum summer insolation at 65 • N and atmospheric CO 2 by Ganopolski et al. (2016). The evolution of atmospheric CO 2 concentration over the course of our emissions scenarios suggests that, for emissions of 1000 Pg C or less, Northern Hemisphere summer insolation will next fall below the critical insolation threshold in approximately 50 kyr, and in ∼ 100 kyr for emissions of 2000 Pg C. For the highest emissions scenario of 5000 Pg C, the threshold is not passed for considerably longer, until ∼ 160 kyr AP. However, the uncertainty of the critical insolation value is ±4 W m −2 (1 SD), and often the difference between summer insolation at 65 • N and the insolation threshold is less than this, potentially impacting whether the threshold has in fact been passed and therefore whether glacial inception is likely. For example, for the 1000 Pg C scenario, whilst insolation first falls below the critical threshold at ∼ 50 kyr AP, it does not fall below by more than the uncertainty value until ∼ 130 kyr AP.
A limitation of our study relates to the continental ice sheets in HadCM3 being prescribed rather than responsive to changes in climate. A consequence of this is that an increase in the extent or thickness of the ice sheets, and hence the onset of glaciation, cannot be explicitly projected, but this also means that a regime shift of the ice sheets to one of negative mass balance, which may be expected to occur under high-CO 2 -emissions scenarios (Ridley et al., 2005;Stone et al., 2010;Swingedouw et al., 2008;Winkelmann et al., 2015), cannot be modelled. However, the results of the sensitivity analysis to ice sheets described in Sect. 3.5., for which a number of simulations were run again with reduced GrIS and WAIS extents, suggest that the reduction in continental ice results in relatively localised increases in SAT in regions that are ice free, in addition to some regional cooling at high latitudes. Consequently, this does not act as a significant restriction on the glaciation timings put forward in this study considering their radioactive waste disposal application; given that the earliest timing of the next glaciation is of significant interest, smaller continental ice sheets and therefore higher local SATs would likely inhibit the build-up of snow and ice, delaying glacial inception further. As such, the estimates presented here should be viewed as conservative. As will be discussed in Sect. 7, however, the emulator was not designed and calibrated to predict changes in ice sheets. This is a limitation that should be addressed when modelling future climate on timescales of tens of thousands of years or more (depending on the CO 2 scenario(s) being modelled). Another caveat is that the carbon cycle in the emulator is also essentially prescribed, and thus not interactive. This means that the atmospheric CO 2 trajectory follows a smooth decline, as was projected using an impulse response function based on experiments using the cGENIE model (Lord et al., 2016), with long-term future climate being modelled as a series of snapshot simulations with the emulator. This smooth decline in CO 2 assumes that no non-linear or unexpected behaviour will be demonstrated by the long-term carbon cycle, and that the silicate weathering mechanism, which is associated with a substantial degree of uncertainty, is correct.
The emulator can also be used to project the evolution of a range of other climate variables, providing that they were modelled as part of the initial GCM ensembles. Figure 15 illustrates the development of mean annual precipitation and emulated uncertainty over the next 200 kyr at the four sites. The maximum increase in precipitation is between 0.3 ± 0.1 mm day −1 (Switzerland grid box) and 0.6 ± 0.1 mm day −1 (Sweden grid box) in the 500 and 5000 Pg C scenarios, respectively. Precipitation increases with increasing atmospheric CO 2 at all sites apart from the Spain grid box, where it decreases by up to 0.9 ± 0.1 mm day −1 . Regional differences in the sign of changes in precipitation, including an increase at high latitudes and a decrease in the Mediterranean, are consistent with modelling results included in the International Panel on Climate Change (IPCC) Fifth Assessment Report, for simulations forced with the Representative Concentration Pathway (RCP) 8.5 scenario (Collins et al., 2013). In contrast to SAT, precipitation appears to be more closely influenced by precession, illustrated by its periodicity of slightly less than 25 kyr. There appears to be an increase in the intensity  (Laskar et al., 2004), showing eccentricity (black) and precession (radians, blue) on the left axis and obliquity ( • , red) on the right axis. (b)-(e) Time series of emulated grid box mean annual precipitation (mm day −1 ), modelled every 1 kyr, for four CO 2 emissions scenarios: 500 Pg C (black), 1000 Pg C (green), 2000 Pg C (red) and 5000 Pg C (blue). Modelled using the modice emulator. Error bands represent the emulated grid box posterior variance (1 SD). Four sites are presented, representing grid boxes in Sweden, central England, Switzerland and Spain. Precipitation is shown as an anomaly compared with the pre-industrial control simulation. Note the different vertical axis scales. of precipitation fluctuations from approximately 140 kyr AP onwards, implying that the modulation of precession by eccentricity also has an impact.

Orbital variability and spectral analysis
The impact of orbital forcing was assessed by performing a spectral wavelet analysis on the SAT and precipitation time series data produced by the emulator for the central Eng-land grid box for the 5000 Pg C emissions scenario, represented by blue lines in Figs. 14c and 15c, respectively. As for the late Pliocene, the wavelet software of Torrence and Compo (1998) was utilised. The analysis was performed on the data for 20-200 kyr AP because the climate response up until ∼ 20 kyr AP is dominated by the impact of elevated atmospheric CO 2 concentrations, which masks the orbital signal and affects the results of the wavelet analysis.  Fig. 14c) and (b) emulated grid box mean annual precipitation (mm day −1 , blue line in Fig. 15c). Both variables were modelled every 1 kyr using the modice emulator for the 5000 Pg C emissions scenario. The data are normalised separately by (a) the mean variance for the analysed SAT data (σ 2 = 0.14 • C) and (b) the variance for the analysed precipitation data (σ 2 = 0.003 • C).
For future SAT, Fig. 16a suggests that, up until ∼ 160 kyr AP, the obliquity cycle acts as the dominant influence, resulting in temperature oscillations with a periodicity of approximately 41 kyr. This is confirmed by Fig. 14c, which shows that the major peaks in SAT generally coincide with periods of high obliquity. Over this period, precession has a far more limited influence, likely due to eccentricity being relatively low until ∼ 110 kyr AP (Fig. 14a). However, from ∼ 120 kyr AP onwards, concurrently with increasing eccentricity, precession becomes a more significant forcing on climate, resulting in SAT peaks approximately every 21 kyr. In contrast, precession appears to be the dominant forcing on precipitation for the central England grid box for the entire 20-200 kyr AP period (Figs. 15c and 16b). This signal is particularly strong after ∼ 120 kyr AP, due to higher eccentricity.

Limitations
There are a number of limitations associated with the methodology outlined above, particularly relating to the assumptions that it is based on and its application to different periods of time. Although these have mostly been discussed briefly in the preceding sections, here we summarise them together.
As noted previously, the carbon cycle in the emulator is not coupled to the climate since the atmospheric CO 2 concentration is prescribed. The methodology thus assumes that there will be no unexpected non-linearities in the carbon cycle and that changes in climate that are different from those in cGENIE do not feed back to the carbon cycle. This may be of particular importance when simulating future climates, when the natural carbon cycle is expected to be significantly perturbed due to ongoing anthropogenic emissions of CO 2 , in a way that may not be fully represented in cGENIE. There is also uncertainty surrounding the dynamics of the carbon cycle over long periods of time, such as the role of the silicate weathering mechanism, although the observation that different carbon cycle models generally produce fairly similar results increases our confidence .
The ice sheets in the emulator are also fixed, at either modern-day or reduced extents, although expanding the range of ice sheets that can be modelled is currently being undertaken in ongoing research. This means that care needs to be taken when simulating very long periods of time. For example, neither Quaternary nor future glacial-interglacial cycles can be simulated using the current version of the emulator. Furthermore, even during the Pliocene, it is likely that the extent of ice sheets in the Northern Hemisphere varied beyond the range simulated in this study (Willeit et al., 2015), and the emulator in its current form cannot represent this.
In the context of the Pliocene, the land-sea mask and orography used in the simulation of Pliocene climate are fixed and appropriate to modern-day conditions, whereas the PRISM4 reconstruction of palaeogeography suggests that there may have been considerable differences in some regions, for example, the region of the Hudson Bay is thought to have been land in the Pliocene .
Due to both the carbon cycle and ice sheets being prescribed, interactions between these components of the climate system can also not be simulated. These include natural changes in CO 2 that have been found to accompany past glacial-interglacial cycles, with glacial periods over the last 800 kyr exhibiting CO 2 concentrations of approximately 180 to 200 ppmv (Petit et al., 1999), whereas interglacial periods demonstrated concentrations of 240 to 290 ppmv (Luthi et al., 2008). Changes in the ice sheets in response to atmospheric CO 2 , such as the likely future melting of the GrIS and AIS in response to anthropogenic CO 2 emissions, can also not be modelled. Various studies have modelled the response of the ice sheets to future climate warming, finding that the ice sheets may experience significantly increased melt. In fact, for scenarios with high CO 2 emissions (> ∼ 5000 Pg C), it has been suggested that the GrIS and AIS may be almost entirely melted within the next few thousand years (e.g. De-Conto and Pollard, 2016;Huybrechts et al., 2011;Winkelmann et al., 2015), which would cause significant changes in deep-ocean circulation and ocean stratification. These ocean changes cannot be captured by the current version of the methodology and, whilst their impacts on global and regional climate are uncertain, they are expected to be long-term. The melting of the ice sheets would also cause significant increases in global sea level, of approximately 70 m if both the GrIS and AIS melted, which would strongly affect the global land-sea mask and regional climates, and which cannot be represented using the current methodology. This sea level rise would also have serious implications for radioactive waste repositories located in relatively low-lying coastal regions that are vulnerable to sea level rise, such as in northern Europe.
Since the emulator models climate via a series of snapshots rather than a truly transient simulation, it is not able to capture deviations from a stationary response. As a consequence, the methodology becomes inappropriate if such transient changes in the deep ocean are found to be important for controlling surface climate evolution.
The emulator presented in this study is only suitable for modelling transient climate changes on timescales of several millennia or longer, as a number of shorter-term processes in the climate and carbon cycle are not represented. These include internal variability in the climate system, such as interannual variability, North Atlantic Oscillation (NAO), and El Niño-Southern Oscillation (ENSO), radiative forcing occurring on shorter timescales (e.g. volcanic activity) and terrestrial carbon cycle processes. On these timescales, tran-sient simulations run using complex models such as GCMs or EMICS are most appropriate.
As a consequence of these limitations, care needs to be taken when applying the emulator to ensure that its application is appropriate. For example, when considering future climate, the way in which future carbon dioxide concentrations and ice sheet configurations have been modelled means that this methodology is only applicable on timescales up until the next glacial inception. After this, atmospheric CO 2 would be expected to change in response to the initiation of glacial conditions, accompanied by the expansion of the ice sheets, decreasing sea level and the climatic changes that would result from these changes. A number of studies have modelled the possible timing of the next glacial inception, finding that for CO 2 scenarios with medium emissions the current interglacial period may end in approximately 130 kyr (Archer and Ganopolski, 2005;Ganopolski et al., 2016). However, for high emissions of 5000 Pg C, glacial inception may be delayed for more than 500 kyr (Archer and Ganopolski, 2005). A study by Brandefelt et al. (2013) estimated that for permafrost development to occur at Forsmark, Sweden, during the insolation minima at 17 and 54 kyr AP, atmospheric CO 2 concentrations of ∼ 210 ppmv or less and ∼ 250 ppmv or less would be required, respectively. In light of the long atmospheric lifetime of CO 2 emissions that has been discussed, low concentrations such as these are unlikely in the next few tens of thousands of years; however, they cannot be entirely excluded. In order to account for this limitation, the emulator could be extended to include glacial states, meaning that it could be applied to future climate on a longer timescale, as well as to the Quaternary, if the evolution of CO 2 and ice volume were known (e.g. from a transient EMIC or conceptual model simulation). Thus, when emulating long-term climate, careful consideration should be given to what assumptions are being made and whether the methodology is appropriate for the conditions being modelled.
Bearing in mind these limitations, the methodology described in this paper is a useful and powerful tool for simulating long-term past and future climatic changes, as well as for exploring the dynamics and sensitivities of the climate system.

Summary and conclusions
In this study, we present long-term continuous projections of future climate evolution at the spatial resolution of a GCM, via the use of a statistical emulator. The emulator was calibrated on two ensembles of simulations with varied orbital and atmospheric CO 2 conditions and modern-day continental ice sheet extents, produced using the HadCM3 climate model. The method presented by Gregory et al. (2004) to calculate the steady-state global temperature change for a simulation, by regressing the net radiative flux at the top of the atmosphere against the change in global SAT, was utilised to calculate the equilibrated SAT data for these ensembles, as it was not feasible to run the experiments to equilibrium due to the associated time and computer resources needed. A number of simulations testing the sensitivity of SAT to the extent of the GrIS and WAIS suggest that the response of SAT is fairly linear regardless of orbit and that the largest changes are generally local to regions that are ice free. The mean SAT anomaly identified across these experiments was then applied to the equilibrated SAT results of the modern-day ice sheet extent ensembles to generate two equivalent ensembles with reduced ice sheets.
Output data from the modern-day and reduced ice sheet ensembles were then used to calibrate separate emulators, which were optimized and then validated using a leave-oneout approach, resulting in satisfactory performance results. We discuss a number of useful applications of the emulator, which may not be possible using other modelling approaches at the same temporal and spatial resolution. Firstly, a particular benefit of the emulator is that it can be used to produce time series of climatic variables that cover long periods of time (i.e. several thousand years or more) at a GCM resolution, accompanied by an estimation of the uncertainty in the form of the posterior variance. This would not be feasible using GCMs due to the significant time and computational requirements involved. The global grid coverage of the data also means that the evolution of a climate variable at a particular grid box can be examined, allowing for comparisons to data on a regional or local scale, such as palaeoproxy data, or for the evolution of climate at a specific site to be studied. However, further downscaling of the data may also be necessary or beneficial, via further modelling such as proxy modelling, impact models or regional climate models, or via statistical downscaling techniques. Secondly, the influence of orbital forcing on climate can be assessed. This effect may be visualised with a continuous wavelet analysis on the time series data for a particular CO 2 emissions scenario, which will identify the orbital frequencies dominating at different times. The spatial distribution of orbital timescale variability can also be simulated, by plotting the standard deviation for a climate variable for each grid box, taking into account the emulator posterior variance. Finally, the emulator can be used to back-calculate past atmospheric CO 2 concentrations based on proxy climate data. Through an inversion, atmospheric CO 2 concentrations can be estimated using SST proxy data, based on a linear relationship between emulated grid box mean annual SAT and prescribed CO 2 concentration. Estimated CO 2 can then be compared with palaeo CO 2 concentration proxy records.
To illustrate these potential applications, we applied the emulator at 1 kyr intervals to the late Pliocene (3300-2800 kyr BP) for atmospheric CO 2 concentrations of 280, 350 and 400 ppmv and compared the emulated SATs at specific grid boxes to SSTs determined from proxy data from a number of ODP/IODP sites. The wavelet power spectrum for SAT at each site was also produced, and the dominant orbital frequency was assessed. In addition, we used the SST proxy data to estimate atmospheric CO 2 concentrations, based on a linear relationship between emulated grid box mean annual SAT and prescribed CO 2 concentration. We find the following: -Temperature estimates from the emulator and proxy data show greater similarity at the equatorial sites than at the high-latitude sites. Discrepancies may be the result of seasonal biases in the proxy data, unknown changes in the climate and/or carbon cycle, issues with the tuning of parts of the record, biases in the GCM, or errors in the emulator.
-The response of emulated SAT appears to be dominated by a combination of precessional and eccentricity forcing from 3300 kyr BP to approximately 2900 kyr BP, after which obliquity begins to have an increased influence.
-Regions with a particularly large response to orbital forcing include the high latitudes and monsoon regions (Fig. 9b).
-Our CO 2 concentrations derived from tropical ODP/IODP sites show relatively similar concentrations to CO 2 proxy records for the same period, although the concentrations derived from higherlatitude sites are generally significantly higher than the proxy data.
The emulator was also applied to the next 200 kyr, as longterm future simulations such as these have relevance to the geological disposal of solid radioactive wastes. The continuous evolution of mean annual SAT and precipitation at a number of sites in Europe are presented for four scenarios with anthropogenic CO 2 emissions of 500, 1000, 2000 and 5000 Pg C. A spectral wavelet analysis was also performed on the SAT and precipitation data for the central England grid box. The data suggests the following: -SAT and, to a lesser extent, precipitation exhibit a relatively rapid decline back towards pre-industrial values over the next 20 kyr, as excess atmospheric CO 2 is removed by the long-term carbon cycle.
-Following this, SAT fluctuates due to orbital forcing on an approximately 41 kyr obliquity timescale until ∼ 160 kyr AP, before the influence of precession increases with increasing eccentricity from ∼ 120 kyr AP.
-Conversely, precipitation variations over the entire 200 kyr period demonstrate a strong precessional signal.
Overall, we find that the emulator provides a useful and powerful tool for rapidly simulating the long-term evolution of climate, both past and future, due to its relatively high spatial resolution and relatively low computational cost. We have presented illustrative examples of a number of different possible applications, which we believe make it suitable for tackling a wide range of climate questions.