A multi-model assessment of last interglacial temperatures

The last interglaciation ( ∼ 130 to 116 ka) is a time period with a strong astronomically induced seasonal forcing of insolation compared to the present. Proxy records indicate a significantly different climate to that of the modern, in particular Arctic summer warming and higher eustatic sea level. Because the forcings are relatively well constrained, it provides an opportunity to test numerical models which are used for future climate prediction. In this paper we compile a set of climate model simulations of the early last interglaciation (130 to 125 ka), encompassing a range of model complexities. We compare the simulations to each other and to a recently published compilation of last interglacial temperature estimates. We show that the annual mean response of the models is rather small, with no clear signal in many regions. However, the seasonal response is more robust, and there is significant agreement amongst models as to the regions of warming vs cooling. However, the quantitative agreement of the model simulations with data is poor, with the models in general underestimating the magnitude of response seen in the proxies. Taking possible seasonal biases in the proxies into account improves the agreement, but only marginally. However, a lack of uncertainty estimates in the data does not allow us to draw firm conclusions. Instead, this paper points to several ways in which both modelling and data could be improved, to allow a more robust model–data comparison.

March; as such, the anomalies in October in the Southern Hemisphere and September in the Northern Hemisphere are largely an artefact (Joussaume and Braconnot, 1997).
characterised by a maximum in δD in Antarctic ice cores (EPICA community members, 2004) and a minimum in benthic δ 18 O in marine sediment cores (Lisiecki and Raymo, 2005), which qualitatively indicate a relatively warm climate and/or reduced terrestrial ice volume. Palaeo-data archives indicate that the climate of the LIG differed from that of the modern. A compilation of terrestrial and marine records (Turney and Jones, 2010) indicates a global mean warming relative to pre-industrial of about 2 • C. A compilation of SST records (McKay et al., 2011) indicates a global mean SST warming relative to the late Holocene of 0.7 ± 0.6 • C. The maximum annual mean warming occurred in mid-and high Northern Hemisphere latitudes, reducing the pole-to-equator temperature difference by about 1.5 • C relative to pre-industrial (Turney and Jones, 2010). This was associated with changes in vegetation patterns, notably a northwards shift of boreal forest across the Arctic (e.g. in Scandinavia -Saarnisto et al., 1999;Alaska -Edwards et al., 2003;and Siberia -Lozhkin et al., 2007). Palaeo-archives can also give an indication of seasonal changes in temperature; for example, records have been interpreted as representing Arctic summer temperatures about 5 • C warmer than present, with an associated decrease in summer sea ice (CAPE-Last Interglacial Project Members, 2006). Ocean circulation also varied through the LIG, with North Atlantic δ 13 C and 231 Pa/ 230 Th records indicating increasing AMOC (Atlantic meridional overturning circulation) strength in the early LIG, and maximum overturning in the middle of the LIG (Sanchez Goni et al., 2012). A compilation of global sea level records (Kopp et al., 2009) indicates a LIG highstand of at least 6.6 m (95 % probability), and likely in excess of 8.0 m (67 % probability). Such records have been interpreted as representing contributions from reduced volume of both Greenland and West Antarctic ice sheets . A substantial contribution from the Greenland ice sheet at the LIG is supported by modelling evidence Stone et al., 2013), which indicates a contribution from Greenland of 0.3 to 3.6 m (80 % probability). A contribution from Antarctica is supported by benthic δ 18 O and modelling evidence (Duplessy et al., 2007).
The principal driver of climatic differences between LIG and modern climate is the astronomical configuration of Earth. The early LIG is characterised by relatively high obliquity and eccentricity compared with modern, and a precessional component with boreal summer coinciding with perihelion (Laskar et al., 2004;Yin and Berger, 2010). This results in an insolation anomaly relative to modern consisting of a maximum in boreal summer and minimum in austral summer (Fig. 1). A secondary driver is natural variations in greenhouse gases (Siegenthaler et al., 2005;Loulergue et al., 2008;Spahni et al., 2005), which were fairly constant through the LIG, but with a maximum in all three gases (CO 2 , CH 4 and N 2 O) between 129 and 128 ka (Fig. 2).
Because of the very different principal forcing mechanisms (seasonal astronomical variations compared with greenhouse gas changes), the LIG should not be considered an "analogue" for future climate change. However, because of its relative warmth and high sea level, the LIG could be considered as an appropriate test bed for climate models developed for future climate prediction. Furthermore, modelling studies suggest that over Greenland, the summer warming is amplified by similar albedo and water feedbacks to those found in future climate simulations (Masson-Delmotte et al., 2011). As such, the LIG has begun to receive more attention from the modelling community, and the Paleoclimate Model Intercomparison Project (now in its third phase, PMIP3, http://pmip3.lsce.ipsl.fr) has recently extended its focus from the Last Glacial Maximum (LGM, 21 ka) and mid-Holocene (6 ka) to include the LIG (as well as another warm period, the Pliocene, 3 Ma).
This paper describes an ensemble of climate model simulations of the LIG, many of which have been carried out using guidelines developed by PMIP. The simulations are "snapshots", that is, each one is designed to represent equilibrium conditions during a ∼ 1 ka "window" during the LIG. There  Atmospheric concentrations of (a) CO 2 , (b) CH 4 and (c) N 2 O through the last interglaciation. Vertical lines show the PMIP-defined snapshots of 125, 128 , and 130 ka. Small black crosses show the raw gas concentrations from the Dome C ice core: Lüthi et al. (2008) for CO 2 (although note that this is a composite record), Loulergue et al. (2008) for CH 4 and Spahni et al. (2005) for N 2 O. Blue line shows this raw data interpolated onto a 100-yr resolution. Large blue crosses show the PMIP3 gas concentrations at the time of the snapshots. Large black crosses show the greenhouse gas concentrations used by those groups which did not use the PMIP3 guidelines. are a number of snapshots covering the period 125 to 130 ka, and they have been carried out using a range of climate models, representing a range of model complexity.
The aims of the paper are twofold: -Firstly, to catalogue the differences between the model simulations, determining which features are robust, and where there is uncertainty, and to provide some firstorder hypotheses for the mechanisms behind the largescale features.
-Secondly, to compare the simulations with the latest data compilations, determining to what extent the model simulations and data are consistent.
The focus of this paper is on temperature because there are more proxy records for temperature than any other variable, and it is generally one of the more robustly modelled variables. We consider the terrestrial and marine realm for our model-data comparisons, and investigate the seasonality of the model simulations and proxy records.

Model simulation descriptions
As part of the third phase of PMIP, a set of four last interglacial snapshot simulations were proposed -at 130, 128, 125, and 115 ka. Here we focus on the first three of these, which encompass the time of maximum anomaly in insolation in Northern Hemisphere summer; the fourth was designed to look at glacial inception processes at the very end of the LIG. PMIP laid out a set of boundary conditions for these snapshots. These consisted of astronomical and greenhouse gas parameters as it was decided to leave possible smaller forcings, such as vegetation, ice sheet, sea level and aerosol changes, to subsequent sensitivity studies. The PMIP3 LIG astronomical and greenhouse gas boundary conditions are illustrated in Figs. 1 and 2 (and also can be read off Table 2). The astronomical constants were obtained from Berger and Loutre (1991). The greenhouse gas concentrations were derived from Antarctic ice core records: Lüthi et al. (2008) for CO 2 (although note that this is a composite record), Loulergue et al. (2008) for CH 4 and Spahni et al. (2005) for N 2 O. The raw greenhouse gas data was interpolated onto a 100-yr timestep, and the values for each snapshot taken from the appropriate time in this interpolated record.
The simulations used in this paper are all those which were submitted to a call for model contributions to this intercomparison, following a PMIP meeting in Crewe, UK, in May 2012. Table 1 gives some details of the models included in this intercomparison, and Table 2 gives some key aspects of their experimental design, including boundary conditions. The models cover a wide range of complexity, from state-ofthe-art GCMs used in the fifth assessment report of the IPCC (e.g. COSMOS, MIROC), through GCMs which featured in the fourth assessment report (e.g. CCSM3, HadCM3), to models of intermediate complexity ("EMICs", e.g. LOVE-CLIM, CLIMBER).
Not all simulations described in this paper follow the PMIP3 guidelines. Indeed, some were carried out before the guidelines were developed. As such, this is an "ensemble of opportunity" in that there is not complete consistency across all the model simulations. However, most of the model simulations from any one organisation are self-consistent; e.g. the simulations are all carried out with the same model version. A minor exception is CCSM3 NCAR, where the LIG Table 1. Summary of models in this intercomparison. "Type" refers to the atmospheric component of the model: GCM (General Circulation Model) or EMIC (Earth system Model of Intermediate Complexity). "RMS" gives the RMS "error" of the pre-industrial simulation surface air temperature ( • C) relative to the NCEP climatology (see Fig. 4). Note that the RMS error is not area weighted. simulations have a slightly greater solar constant than the pre-industrial simulation (see Table 2). All groups used identical land-sea masks and terrestrial ice sheets in their LIG simulations as compared with their controls. As such, greenhouse gases and/or astronomical configuration were the main external forcings imposed in the LIG simulations compared with the controls. Although groups may have used slightly different astronomical solutions, these differences are minimal (e.g. Berger and Loutre (1991) give insolation values which differ from those of Laskar et al. (2004) by less than 0.1 % for these time slices). Therefore, different greenhouse gas concentrations were the main inconsistency in experimental design between different groups. The various greenhouse gas concentrations applied by the different groups are illustrated in Fig. 2.
Simulations carried out using HadCM3 Bris, CCSM3 Bremen, COSMOS AWI, LOVECLIM Ams, CLIMBER LSCE, CSIRO UNSW and NORESM BCCR were all carried out using the greenhouse gas boundary conditions specified by PMIP3. Simulations carried out by KCM Kiel, COSMOS MPI and IPSL LSCE chose to keep the LIG greenhouse gases fixed at the control values, and as such just included astronomical variations. The other models developed greenhouse gas changes independently. Most are relatively consistent, but CCSM3 NCAR at 130 ka does have higher values of CO 2 , CH 4 and N 2 O (but note that the CCSM3 NCAR pre-industrial greenhouse gas levels are also relatively high; see Table 2). Some of the models are similar to each other -the most obvious being three versions of CCSM3, the two versions of LOVECLIM, and the two versions of COSMOS. In the case of CCSM3, the model versions are different -CCSM3 NCAR runs at a higher resolution (T85) than the other two (T31), and CCSM3 Bremen includes dynamic vegetation. In the case of LOVECLIM, the two groups have contributed different snapshots (125 and 130 ka from LOVE-CLIM Ams, and 127 ka from LOVECLIM LLN). In the case of COSMOS, COSMOS MPI uses dynamic vegetation in all simulations, whereas for COSMOS AWI the LIG simulation (130 ka) is forced by a fixed pre-industrial vegetation that has been taken from the equilibrated control simulation, which itself is spun up using a dynamic vegetation scheme (Stepanek and Lohmann, 2012). KCM Kiel uses the ECHAM5 atmosphere model (Roeckner et al., 2003), an atmospheric component also used in COSMOS, and the NEMO ocean-sea ice model (Madec, 2008) -an ocean component also used in IPSL LSCE. NORESM BCCR is a hybrid of an updated version of the atmospheric component of CCSM3 (CAM4 compared with CAM3), and an independent ocean model (MICOM).
The temperature data from all of these simulations and for the ensemble mean is provided in electronic format (netcdf) in the Supplement. Table 2. Summary of simulations in this intercomparison. For the greenhouse gas concentrations, a " * " indicates that the value is that specified by PMIP3. CO 2 is in units of ppmv, CH 4 and N 2 O are in units of ppbv. The LIG skill score, σ , is relative to the terrestrial data of Turney and Jones (2010), and is defined in Eq. (1). Note that CO 2 is the only greenhouse gas considered by CLIMBER.

Last interglacial SST and land temperature dataset
For the model-data comparison in Sect. 4.2, we make use of the terrestrial and ocean annual mean temperature reconstruction of Turney and Jones (2010). This consists of 262 sites, made up of 100 terrestrial temperatures and 162 SSTs (see Fig. 3). The data are derived from a diverse range of proxies, including: Sr-Ca, U k 37 , Mg/Ca and diatom and radiolarian assemblage transfer functions for SSTs, pollen and macrofossils for terrestrial temperatures, and δ 18 O for ice  sheet temperatures. Sites are only included in the compilations if they have 4 or more data points through the LIG; the reconstruction consists of the average temperature of the period of plateaued δ 18 O for marine sequences, and maximum warmth for terrestrial sequences. The data are presented as anomalies relative to modern (averaged over the years 1961-1990). Turney and Jones (2010) noted a pattern of early warming off the southern African coastline and Indian Ocean, that they interpreted as evidence for leakage from the Indian Ocean via an enhanced Agulhas current, consistent with southward migration of the Southern Ocean westerlies. Here we consider all sites as contemporaneous, although in reality they represent average conditions over a time window which varies from site to site. However, as we shall see, the modelled variability across the time window of interest is relatively small compared to other uncertainties.
Unfortunately, Turney and Jones (2010) give no indication of the uncertainties in their SST or terrestrial reconstructions. It is possible that some of the LIG sites may be more representative of a seasonal change as opposed to an annual mean change (e.g. see discussion in Schneider et al., 2010 in the context of the Holocene). This is because the calibration of many of the proxies used is based on modern analogues, which are by definition all under modern astronomical conditions, and because the astronomical configuration of the LIG is significantly different, this could result in a seasonal shift being interpreted as an annual mean change.

Results and model-data comparison
Before turning to the simulations of the LIG, it is worthwhile putting these into context, by examining potential biases in the pre-industrial control simulations. These are illustrated in Fig. 4, which shows the simulated pre-industrial annual mean temperatures from each model relative to those from the NCEP reanalysis product (Kalnay et al., 1996). We choose NCEP as opposed to any other reanalyses product purely for pragmatic reasons in that we had it readily available. It should be noted that the NCEP reanalyses themselves are not perfect. In particular, in regions of sparse observational input, such as over Antarctica, the model "error" should be treated with caution. Furthermore, the observations represent a 40yr average which starts in 1948, whereas the model control simulations represent a "pre-industrial" time, and assume a range of greenhouse gas concentrations (see Table 2).
Every model has at least one gridbox where the "error" is at least 10 • C. The models with the smallest RMS error are HadCM3 Bris and CCSM3 NCAR, both with 2.4 • C, and the model with the largest RMS error is CLIMBER LSCE, with 4.9 • C. However, note that because the differences are calculated after interpolating all simulations and observations to a resolution of 96 × 73 gridboxes (the resolution of HadCM3 Bris), this penalises those models, like CLIMBER, with relatively low resolution. Also note that the RMS score is not area weighted, so has a bias towards errors in the high latitudes. The CSIRO UNSW model uses flux adjustment for all simulations, so the control has relatively low errors over the ocean. As expected, similar models show similar anomalies; for example, all CCSM3-type models have a cold bias in the North Atlantic, and all models with ECHAM5 atmospheric components have a cold bias in the central Sahara. Because the control model simulations have been run for very different lengths of time (see Table 2), any small cooling or warming trends could also potentially contribute to the differences between model results. Figure 4o shows the model ensemble mean. This has a lower RMS error than any individual model, 2.2 • C, and also has a relatively low error in the global mean, having a mean error of −0.73 • C (a fraction of which is likely related to the difference between modern and pre-industrial temperatures due to recent warming). The strong relative performance of the ensemble mean has been observed in many other model ensembles, and Annan and Hargraves (2011) show that this is consistent with the Fig. 4. "Error" in the pre-industrial control simulation of each model, relative to NCEP reanalyses (Kalnay et al., 1996), for surface air temperature.  , whereas the models are designed to represent pre-industrial. All data are interpolated onto a 96 × 73 resolution before calculating the difference, model minus data. The RMS values for each model simulation are given in Table 1. model simulations and observations being considered as being drawn from the same statistical distribution. Figure 5 shows the annual mean surface air temperature (at ∼ 1.5-m height) change, LIG minus pre-industrial control, for each snapshot carried out by each model (although note that for NORESM BCCR, Fig. 5v and w, the surface temperature is shown, as the surface air temperature was not available). Also see Fig. S1 in the Supplement, which shows these figures in tabulated form. There are several points worth noting here. Firstly, for nearly all models and snapshots, the maximum warming occurs in the mid-to high latitudes of the Northern Hemisphere. The spread in predicted temperature change as a function of snapshot for any particular model is less than the spread in predicted temperature as a function of model for any particular snapshot. In other words, which model is used has more of an influence on the predicted LIG climate than which snapshot is used (in the range 130 to 125 ka     Fig. 5j, which share a common atmospheric component, ECHAM5). However, there are also strong similarities between HadCM3 Bris and COSMOS MPI at 125 ka ( Fig. 5a and i), and between MIROC Tokyo and CCSM3 NCAR at 125 ka ( Fig. 5f and n). Perhaps surprisingly, CCSM3 NCAR and CCSM3 Bremen at 125 ka are not very similar ( Fig. 5d and f). This is probably related to the higher resolution of CCSM3 NCAR (T85 compared with T31), and the use of dynamic vegetation in CCSM3 Bremen (see Table 1). CCSM3 LLN (Fig. 5e) appears to be more similar to CCSM3 Bremen (Fig. 5d) than to CCSM3 NCAR ( Fig. 5f and g). CCSM3 LLN has the same T31 resolution as CCSM3 Bremen, but similar to CCSM3 NCAR does not include dynamic vegetation, implying that in CCSM3 the resolution has more of an effect on  the climate than the inclusion of dynamic vegetation. The LOVECLIM EMIC has a different response to many of the GCMs, with a greater Arctic warming (especially at 127 ka, Fig. 5m), and reduced cooling in the Sahel. However, it is interesting to note that although this cooling is absent in the surface air temperature response, it is present in the surface temperature response (not shown). CLIMBER LSCE also exhibits different behaviour (Fig. 5o-q), with a lack of geographical structure. Amongst the GCMs the IPSL CM4 model (Fig. 5r) is an outlier in that it does not exhibit cooling in the Sahel at 126 ka. Possible reasons for these differences are discussed later in the context of the DJF and JJA changes. One point to note is that the length of the different LIG simulations could be playing a role; for example, Herold et al. (2012) show that the Nordic Sea cooling in CCSM3 LLN (Fig. 5e) is only manifested after 800 yr of simulation.

Ensemble mean response
It is also instructive to examine the ensemble mean response. In order to include variations between different models, and temporal variability through the LIG, we construct the ensemble mean as a straightforward average of all the simulations presented in Fig. 5. This will weight higher those models which have more than one simulation, and treat different versions of models as independent.
The model ensemble mean annual mean temperature change, LIG minus pre-industrial (Fig. 6a), is characterised by maximum warming at high latitudes, especially in the Arctic. However, there is disagreement amongst the models as to the sign of the change in the Southern Ocean and Antarctica. There is little temperature change in the tropics except for in the Indian and African monsoon regions, where there is a cooling.
The ensemble mean temperature change in DJF (Fig. 6b) is more consistent across models. There is a warming in the Arctic Ocean, and a cooling over most of the rest of the globe, with maximum cooling occurring in the tropical regions. The models generally agree about the sign of the change, except in the region between warming and cooling in the Northern Hemisphere mid-latitudes, and in the Southern Ocean. The large winter warming of the Arctic in response to insolation forcing was highlighted by Yin and Berger (2012) in the context of the LOVECLIM LLN model, who related it to the "summer remnant effect". Their analysis of the surface heat balance components shows that the excess of solar radiation over the Arctic during summer is transferred directly into downward ocean heat flux, and it enhances the melting of sea ice and increases the warming of the upper ocean preventing any important warming of the model surface atmospheric layer. The additional heat received by the upper ocean delays the formation of sea ice and reduces its thickness in winter. This reduction of the sea ice thermal insulation allows the ocean to release heat which finally leads to a significant warming of the surface atmospheric layer in winter. Otto-Bliesner et al. (2013) also attribute the DJF Arctic warmth in the CCSM3 NCAR model to seasonal lags in the system associated with sea ice; this region is still feeling the effects of the preceding summer warming. This warming is not likely due to local insolation forcing (Fig. 1) because the DJF Arctic signal is weak, owing to this being polar night in both LIG and modern, and the CO 2 contribution is relatively small.
The cooler LIG temperatures at other latitudes can be related to the insolation forcing, which is negative relative to pre-industrial in DJF at all latitudes south of 65 • N. The maximum cooling occurs in the ensemble mean in monsoon regions; however, the cause of this is different to cooling in JJA in these regions because in DJF there is also a decrease in precipitation compared with pre-industrial. Little work has focused on this DJF monsoon region cooling, but it is consistent with an increase in north-easterly winds in the Sahara seen in HadCM3 Bris (not shown), advecting relatively cold air from the Eurasian continental interior, and associated with a modelled increase in DJF sea level pressure across much of North Africa. This is also consistent with the fact that this maximum in cooling is not as strong in the CLIMBER model (not shown) -the relatively simple CLIMBER atmosphere is unlikely to capture these dynamical changes in the tropics.
The ensemble mean temperature change in JJA (Fig. 6c) exhibits warming in most regions, apart from the subtropical Southern Hemisphere oceans and the monsoon regions. There is also good agreement amongst the models in most regions of warming. The maximum warming occurs in the Northern Hemisphere mid-latitude continental regions, especially in central Eurasia. The general warming is consistent with the seasonal insolation signal, including the fact that in the Arctic the signal is slightly weaker due to negative forcing in August (Fig. 1). The maximum warming over continents as opposed to over oceans is consistent with the lower heat capacity of the terrestrial surface, and reduced potential for latent cooling. Many models exhibit JJA cooling in the monsoon regions. Previous studies (e.g. Braconnot et al., 2007) have attributed this to enhanced monsoon circulation, driven by greater land-sea contrasts, leading to enhanced precipitation, cloud cover and evapotranspiration. The models which do not simulate cooling in JJA are CLIMBER, LOVECLIM, and IPSL CM4. For CLIMBER the signal is large enough that it should be visible even at the low model resolution, which indicates that the relatively simple atmosphere may be responsible. For LOVECLIM, clouds are prescribed in all LIG simulations to be the same as modern , and so the summer monsoon cooling feedback is weaker (but still present to an extent due to increased precipitation, Berger and Yin (2011)). For IPSL CM4 this is due to a more limited response of monsoon precipitation in this model (P. Braconnot, personal communication, July 2012).
It can be seen that the lack of clear signal in the annual mean response over the Southern Ocean and Antarctica is due to the balancing of seasonal positive and negative forcings. The annual mean cooling in the tropics is due to dominant DJF cooling, the annual mean warming in Northern Hemisphere high latitudes is due to dominant JJA warming, and the annual mean Arctic warming is due to year-round warming.
The warm-month mean (WMM, the temperature in the warmest month, at any one gridcell) temperature change (Fig. 6d) exhibits warming in the Northern Hemisphere, and cooling in the Southern Hemisphere. This is effectively an amalgam of the DJF signal in the Southern Hemisphere, and a JJA signal in the Northern Hemisphere. In this case the only major region of equivocal sign is in the tropics.

Model-data comparison
The terrestrial model-data comparison as a function of latitude for the annual mean surface air temperature is shown in Fig. 7a. Although the very fundamental pattern of maximum warming at mid-and high latitudes is present in both model simulations and Turney and Jones (2010) data, it is clear that the ensemble mean fails to capture the same magnitude of change as in the data. In particular, the data indicate warming of up to 15 • C in Eurasia at the LIG, but the ensemble mean is only about 2 • C. Also, in Antarctica the data is interpreted as indicating warmth of up to 5 • C, whereas the model simulations are less than 1 • C. The agreement is actually worse than this considering that the data represent anomalies relative to modern ), whereas the model simulations are relative to the (cooler) pre-industrial. This mismatch is highlighted in Fig. 7b, which shows a point-by-point comparison of the ensemble mean and the data (see Fig. S2 in Supplement for this equivalent figure for each individual simulation from each model). It can be informative to quantify the degree of model-data agreement by defining a "skill score", σ . In this case we use a very simple measure of skill, σ , equal to the RMS difference between the proxy values (T p ) and the modelled values (T m ) at the same location, so that where N is the number of data points (N = 100 in the case of terrestrial data, and N = 162 in the case of SSTs). The skill score is not ideal due to uneven data coverage, including some regions with no data. As such, the metric gives high weighting to model errors in the Mediterranean region, where there is the greatest density of data. However, it does give a first-order estimate of the models' ability to replicate the data. For the ensemble mean, σ = 3.6 • C. This lies approximately at the centre of the distribution of all the models of σ -the lowest ("best", but note caveats above) being MIROC Tokyo at 125 k, with σ = 3.0 • C, and the highest being NORESM BCCR at 130 k, with σ = 4.6 • C. It is interesting to note that for three of the models (CCSM3 Bremen, CCSM3 LLN and NORESM BCCR), the LIG σ is actually worse (higher) than the equivalent σ obtained by assuming that the LIG climate is identical to that of pre-industrial (σ = 4.0 • C).
It is possible that some of the proxies used in the compilation of Turney and Jones (2010) may be more indicative of changes in seasonal temperature, as opposed to annual mean temperature. If this were the case, then better agreement may be achieved by comparing the proxy temperatures with seasonal modelled changes. In particular, it is possible that some proxies may be biased towards warm growth season changes. The equivalent plots as for Fig. 7 are shown for DJF, JJA, and the warm-month mean (WMM) in Fig. 8. The JJA and WMM simulations are "better" in the sense that they have a wider range of anomalies (i.e. the greatest warming is larger for the WMM than for the annual mean), which is closer to the range of the data, but they are "worse" in that they all have a higher value of σ . As such, considering possible seasonal biases in the proxies does not substantially improve the model-data agreement. Turney and Jones (2010) also provide a compilation of LIG SSTs. The SST data are less geographically biased than the terrestrial data, but there is still an oversampling of data in the Atlantic, coastal, and upwelling regions. We compare these with the modelled SSTs (as opposed to surface air temperatures in the previous sections) in Fig. 9. Note that because the CLIMBER model has a 2-D ocean, for that model we use the global surface air temperatures in place of SST. Many of the findings from the analysis of surface air temperature are supported by the SST analysis. Namely, that the model ensemble does not exhibit the same range of warming as the proxy data, and that this is also the case for each individual model within the ensemble. In particular, the model simulations do not show as much warming as the data in the North Atlantic, and on the northward margins of the Antarctic Circumpolar Current. The σ for the ensemble mean SST is 2.6 • C. In a similar way as for surface air temperatures, looking at the JJA or WMM temperature does improve the range of modelled warming, but does not have a substantial effect on the σ values.

Discussion
There are several ways in which the model simulations, and the ensemble, presented in this paper could be improved.
Firstly, an attempt could be made to use more realistic boundary conditions. In particular, evidence for relatively high LIG sea level (e.g. Kopp et al., 2009) suggests that a reduced Greenland and/or West Antarctic ice sheet would be more realistic than the unchanged-from-modern ice sheets used here, and could result in an improved model-data agreement in the North Atlantic SSTs. Evidence for shifts in Arctic treelines (see Sect. 1) suggests that a modified vegetation could be imposed in the models, or more widespread use made of dynamic vegetation models. The combination of vegetation with ocean and sea ice feedbacks could transform the seasonal insolation forcing into a stronger annual mean warming (Wohlfahrt et al., 2004). MIROC Tokyo has a particularly strong JJA response in terrestrial Northern Hemisphere high latitudes compared with many other models, which may be related to its use of dynamic vegetation; however, other models with dynamic vegetation (CCSM3 Bremen, COSMOS MPI, and LOVECLIM LLN) do not have this same response (Fig. 5). As such, and without a set of comparable simulations from a single model, with and without dynamic vegetation, it is currently not possible to assess the impact of LIG vegetation feedbacks. Secondly, many of the models included in this intercomparison are not "state-of-the-art". It is possible that higher resolution, improved atmospheric and ocean dynamics, more complex parameterisations, and additional "Earth system" processes, could lead to better simulations of the LIG. Such simulations would be computationally challenging, but the LIG has the advantage over some other time periods, such as the LGM and Pliocene, in that the boundary conditions are very easy to implement (if modern ice sheets are assumed, as has been done for all the simulations in this paper).
An indication of the impact of increasing complexity can be obtained by comparing the response from the EMICs in the ensemble (LOVECLIM and CLIMBER) with that of the GCMs (see Fig. 10). This shows a more complex response from the GCMs, even considering the difference in resolution, and the cooling in the African monsoon region is not seen as strongly in the EMICs as in the GCMs (at least partly related to the simplified representation of clouds in the EMICs). However, the warming in the Arctic is stronger in the EMICs than in the GCMs. Previous comparisons of EMICs with GCMs (Stouffer et al., 2006) have not reported    such large differences, and it is possible that our results are biased by the relatively small number of EMICs in this study. Thirdly, in order to examine more closely the range of climates across the interglaciation, and to make the most of the many sites which have well-dated time-varying proxy records, it is desirable to carry out transient simulations across the LIG. As computational power increases, such simulations become more feasible, although not necessarily with the very latest models. Some such simulations exist already, mostly with models of intermediate complexity, low resolution GCMs, or with accelerated boundary conditions. A companion paper to this one, Bakker et al. (2013) is carrying out an initial review of existing LIG transient simulations. Evaluation of these simulations with transient proxy records is an exciting and challenging prospect.
There are also ways in which the data synthesis could be modified, in the context of making model-data comparison more robust.
Because the LIG climate signal is driven primarily by a seasonal forcing, the annual mean response of the models is relatively small, and model dependent, as shown in Fig. 6a. However, the seasonal response is large. As such, a synthesis of seasonal, or WMM/CMM proxy indicators would be much more useful than annual mean indicators for evaluating models.
On a similar note, proxy indicators are perhaps most useful when they show a large signal, as the signal-to-noise ratio will likely be higher. Figure 6b-d show clearly the regions of large modelled seasonal signals. Although there are some data located in northern Eurasia in the Turney and Jones (2010) compilation, there are none in central North America, or the Africa and Eurasian monsoon regions, where there are strong summer-and winter-modelled signals, respectively. This is a similar approach to that suggested by Lunt et al. (2008) in the context of the Miocene.
Probably the most important improvement would be an assessment of the uncertainties in the various proxy estimates. A single value from a proxy, without an error estimate, is almost meaningless in the context of model-data comparison. For example, a model-data disagreement of 5 • C on a proxy with an uncertainty estimate of 5 • C, has a very different implication to a model-data disagreement of 2 • C on a proxy with an uncertainty estimate of 0.5 • C. One way in which proxy uncertainty can be tested is to aim for multiproxy assessments at all sites. Such an approach can radically change the interpretation of proxy data, such as that which was found by the MARGO group for the LGM (MARGO Project Members, 2009), and by the PRISM group for the Pliocene (Dowsett et al., 2012).
The LIG clearly has potential as a test bed of climate models due to its large seasonal signal and relative abundance of proxies with sufficient age control. However, this paper has shown that there is still some way to go before its potential can be realised, both in the development of a robust proxy dataset, and in the use of state-of-the art models.
Future work should also look at other aspects of these and other model simulations, such as the hydrological cycle and ocean circulation. In addition, it would be very interesting to look at the response of the Greenland and West Antarctic ice sheets to a range of modelled climates; previous work in this field (e.g. Otto-Bliesner et al., 2006;Stone et al., 2013) has focused on a single model and so ignored this potentially important aspect of uncertainty. The simulations here have implied that the CO 2 and other greenhouse gas contribution to LIG warmth is small compared to the seasonal astronomical signal, but this could be confirmed by carrying out sensitivity studies.
Finally, this work indicates that although other interglacials, such as MIS 7 to MIS 11, could also be potentially useful targets for models (e.g. Yin and Berger, 2012), in terms of model-data comparison more benefit would probably be gained by improving aspects of the LIG data compilations first.

Conclusions
In this paper, we have assembled a set of climate model simulations of the last interglacial, spanning 14 models of varying complexity, and 5 time slices. We have compared the temperature anomalies predicted by the models with those reconstructed by Turney and Jones (2010).
The main findings are the following: -The annual mean signal from the ensemble is small, with robust changes largely limited to warming in the Arctic and cooling in the African and Indian monsoon regions.
-The seasonal signal is stronger and more robust, with clear JJA warming across the mid-high latitudes of the Northern Hemisphere, and DJF cooling globally except for warming in the Arctic, and equivocal signal in the Southern Ocean.
-There appears to be a difference in signal from the models of intermediate complexity compared with the GCMs (see Fig. 10), which cannot just be explained by resolution, but this should be confirmed with further analysis.
-The model simulations and data for all individual models and for the ensemble do not show good agreement.
In particular, the large LIG annual mean temperature anomalies in the data are not replicated by the models.
-The range of seasonal warming in the model simulations is closer to that of the data, but there is still very little skill in the seasonal model predictions, with, in some cases, a better model-data agreement being obtained if it is assumed that the LIG were identical to modern.
-This study points the way to several improvements in both the modelling and data strategy, which could be employed to provide a more robust model-data comparison. On the data side this includes the incorporation of error bars in the proxy datasets, and inclusion of seasonal proxies in order to capture the largest signals. On the model side this includes more studies on the role of vegetation, and ice sheet change and associated fresh water forcing.