Estimated external radiative forcings, model results, and proxy-based climate reconstructions have been used over the past several decades to improve our understanding of the mechanisms underlying observed climate variability and change over the past millennium. Here, the recent set of temperature reconstructions at the continental-scale generated by the PAGES 2k project and a collection of state-of-the-art model simulations driven by realistic external forcings are jointly analysed. The first aim is to estimate the consistency between model results and reconstructions for each continental-scale region over the time and frequency domains. Secondly, the links between regions are investigated to determine whether reconstructed global-scale covariability patterns are similar to those identified in model simulations. The third aim is to assess the role of external forcings in the observed temperature variations. From a large set of analyses, we conclude that models are in relatively good agreement with temperature reconstructions for Northern Hemisphere regions, particularly in the Arctic. This is likely due to the relatively large amplitude of the externally forced response across northern and high-latitude regions, which results in a clearly detectable signature in both reconstructions and simulations. Conversely, models disagree strongly with the reconstructions in the Southern Hemisphere. Furthermore, the simulations are more regionally coherent than the reconstructions, perhaps due to an underestimation of the magnitude of internal variability in models or to an overestimation of the response to the external forcing in the Southern Hemisphere. Part of the disagreement might also reflect large uncertainties in the reconstructions, specifically in some Southern Hemisphere regions, which are based on fewer palaeoclimate records than in the Northern Hemisphere.
The past millennium is an important period for testing our understanding of the mechanisms that give rise to climate system variability (e.g. Masson-Delmotte et al., 2013). Constraints on, and uncertainties in, external radiative forcings that drive climate change have been extensively documented (e.g. Schmidt et al., 2011, 2012). Such radiative forcing data sets can be used to drive climate simulations using the same model versions that are applied to simulate future climate changes. This allows an evaluation of the relative importance of the various forcings over time, while comparisons of past and future climate simulations place 20th century climate variability within a longer context (e.g. Schmidt et al., 2014a; Cook et al., 2015). Additionally, the availability of high-quality palaeoclimatic observations for the last 1000 years permits the reconstruction of regional-, hemispheric-, and global-scale climate variability (e.g. Mann et al., 1999, 2009; Cook et al., 1999, 2004, 2010; Jones et al., 2009; PAGES 2k Consortium, 2013, 2015; Masson-Delmotte et al., 2013; Neukom et al., 2014). As a result, the past millennium has become a useful test case for evaluating climate and Earth system models used within the Intergovernmental Panel on Climate Change (IPCC) Fifth Assessment Report (Flato et al., 2013; Bindoff et al., 2013).
Palaeoclimate reconstructions provide opportunities to test the fidelity of modelled processes and their role in explaining past climatic variations. Reconstructions and simulations can also be used jointly to evaluate estimates of climate sensitivity to external radiative forcing (e.g. Hegerl et al., 2006; Braconnot et al., 2012; Masson-Delmotte et al., 2013). Comparisons across many realizations of simulated climate are used to assess the extent to which characteristic climate statistics are accurately simulated, as well as to disentangle unforced and forced patterns (e.g. Hargreaves et al., 2013; Bothe et al., 2013a, b; Neukom et al., 2014; Coats et al., 2015a, b). Estimates of the unforced variability in the climate system may be made from unforced simulations, or from the residual obtained when the forced signal is removed from climate reconstructions, using realistically forced model experiments (Schurer et al., 2013).
Furthermore, simulations can provide the basis for the design of observing network arrays (Comboul et al., 2015). Simulation results also provide a test bed for palaeoclimatic reconstruction algorithms within so-called pseudo-proxy experiments (e.g. Zorita et al., 2003; Hegerl et al., 2007; Smerdon, 2012; Lehner et al., 2012; Tingley et al., 2012; Wang et al., 2014; Smerdon et al., 2015b). All of these purposes, which are also pursued within the historical period and with comparison to direct climate observations (Bindoff et al., 2013; Ding et al., 2014), are potentially extended by the longer time interval made possible by analysis over the past millennium.
However, obtaining unequivocal conclusions from the comparison between reconstructions and simulation results over the past millennium remains difficult due to uncertainties in climate and forcing reconstructions, the simplified world represented by climate models, and the relatively weak forced signal in the pre-industrial part of the past millennium compared to internal climate variability (e.g. Moberg, 2013). Reconstructions and simulations are two different representations of the behaviour of the actual climate system, and this creates multiple uncertainties in the task of intercomparison. Simulations have uncertain forcings (Schmidt et al., 2011, 2012), and models contain parameterized or uncertain representation of the physics, chemistry, biology, and interactions within the climate system (Flato et al., 2013). Furthermore, computational constraints impose a limited spatial resolution or a deliberate omission of some known processes in order to perform simulations at global scale that cover several centuries (e.g. Goosse et al., 2005; Schurer et al., 2013; Phipps et al., 2013)
The uncertainty in palaeoclimatic reconstructions is not always well understood either, and estimating its magnitude is challenging. For regional- to large-scale temperature reconstructions, uncertainty can be caused by random or systematic error in the proxy measurement, inadequate understanding of the proxy system response to environmental variation, differences in fields derived from instrumental records selected to calibrate the records, changes in the spatiotemporal and data type availability across the observational network, and reconstruction methods (e.g. Jones et al., 2009; Smerdon et al., 2010; Smerdon, 2012; Emile-Geay et al., 2013; Evans et al., 2013; Wang et al., 2014; Comboul et al., 2015).
The non-climatic noise in reconstructions has a significant influence on model–data comparison. This may first have an impact on the variance of the reconstructed climatic signal itself, although this is dependent on the actual choice of calibration method (e.g. Hegerl et al., 2007; Christiansen et al., 2009; Mann et al., 2009; Smerdon et al., 2010; Smerdon, 2012). Furthermore, the non-climatic noise can mask real relationships between climate variations in different regions, or obscure the responses to forcing, which are clearer in models because of the absence of this noise.
Acknowledging the considerable uncertainty in palaeoclimatic reconstructions, the earliest comparisons of past millennium simulations and reconstructions focused on hemispheric- and global-scale changes, using a single, often simple, climate model driven by globally uniform external radiative forcing estimates (e.g. Crowley, 2000; Bertrand et al., 2002). Later, simulations with more comprehensive models (e.g. Gonzalez-Rouco et al., 2006; Amman et al., 2007; Tett et al., 2007) refined the conclusions reached previously and enabled regional- and continental-scale analyses. They underscored the potential role of the spatial distribution of some forcings, such as land use and of the dynamic response of the atmospheric circulation (e.g. Luterbacher et al., 2004; Raible et al., 2006; Goosse et al., 2006; Hegerl et al., 2011). Changes in the latter may be driven by the forcings (e.g. Shindell et al., 2001; Mann et al., 2009) or be a signature of internal variability in the climate system (e.g. Wunsch, 1999; Raible et al., 2005).
State-of-the-art climate models reasonably simulate properties of internal variability, such as teleconnection patterns or the probability of a particular event (e.g. Flato et al., 2013). However, they are not expected to reproduce the part of the observed time trajectory that is not directly constrained by external forcing because of the non-linear, chaotic nature of the system (Lorenz, 1963). This makes model–data comparison a complex issue when using a single simulation, because differences between model results and reconstructions may be due to a model or reconstruction bias, but may also simply reflect a different sample of internal variability (defined here as the fraction of climatic variability that is not due to changes in external forcings).
Indeed, comprehensive climate models have their own internal climate variability and, if a model represents the real world in a satisfactory way, the observed trajectory would just be one among all potential model realizations. The issue may be addressed by analysing an ensemble of simulations, which provides information on the range that can be simulated by one single model (e.g. Goosse et al., 2005; Yoshimori et al., 2005; Jungclaus et al., 2010; Moberg et al., 2015) or a set of models (e.g. Jansen et al., 2007; Lehner et al., 2012; Fernández-Donado et al., 2013, Bothe et al., 2013b). The reconstruction then needs to be compatible with this range, at least when considering all the uncertainties, to claim consistency between simulations and reconstructions, whereby such a compatibility can be defined in various ways, as discussed below.
Fernández-Donado et al. (2013) reviewed results from 26 climate simulations with 8 atmosphere–ocean general circulation models (AOGCMs), reflecting the state of modelling before the CMIP5/PMIP3 (Coupled Model Intercomparison Project Phase 5/Paleoclimate Modelling Intercomparison Project Phase 3). These pre-CMIP5/PMIP3 simulations were driven by a relatively wide range of choices for boundary conditions and forcing agents. For the Northern Hemisphere surface temperature variations, Fernández-Donado et al. (2013) found an overall agreement within the temporal evolution but still noted discrepancies between simulations and hemispheric and global temperature reconstructions. For example, the period between around 850 and 1250 CE is warmer in the reconstructions than in the simulations (see also Jungclaus et al., 2010; Goosse et al., 2012b; Shi et al., 2013).
Additionally, a comparison of the simulated changes in the temperature fields from this warm period and the colder period around 1450–1850 showed little resemblance to the field reconstruction by Mann et al. (2009), but the spatial reconstructions themselves have significant uncertainties (e.g. Wang et al., 2015). These two relatively warm and cold periods are often referred to as the Medieval Climate Anomaly (MCA), and the Little Ice Age (LIA), respectively, although their exact timing has been debated and the adequacy of their names has been questioned (e.g. Jones and Mann, 2004; PAGES 2k Consortium, 2013).
The assessment of information from palaeoclimate archives (Masson-Delmotte et al., 2013) in the IPCC Fifth Assessment Report partly followed the approach applied by Fernández-Donado et al. (2013). Masson-Delmotte et al. (2013) included a preliminary analysis of the more recent CMIP5/PMIP3 “past1000” simulations, which were coordinated more closely than previous experiments, particularly in regard to the choices of forcings (Schmidt et al., 2011, 2012). They came to similar conclusions as Fernández-Donado et al. (2013): the reconstructed MCA warming is greater than simulated but not inconsistent within the large uncertainties.
Agreement between palaeoclimate reconstructions and simulations has also been assessed by compositing the response to individual forcing events (e.g. Hegerl et al., 2003, 2011; Luterbacher et al., 2004; Stenchikov et al., 2006; Masson-Delmotte et al., 2013). The reconstructed and simulated response to volcanic forcing agrees in magnitude on multi-decadal timescales. Detailed comparisons of observations around the 1815 Tambora eruption indicate that the simulated cooling is larger than in instrumental observations or in reconstructions (Brohan et al., 2012), but a significant part of the discrepancy might be due to forcing uncertainties.
For the solar forcing, direct comparisons between simulations and reconstructions are inconclusive regarding whether simulations that use either moderate or weak variations of total solar irradiance provide generally better agreement with reconstructions (Masson-Delmotte et al.; 2013; Fernández-Donado et al., 2013). This has been confirmed at hemispheric and regional scales by Hind and Moberg (2013) and Moberg et al. (2015), using appropriately designed statistical tests of temporal correlation and quadratic distance between reconstructions and simulations (Sundberg et al., 2012).
The cause of past climate change in the Northern Hemisphere, specifically the
contribution by individual forcings to a climatic event, can be estimated
using detection and attribution techniques. These techniques allow for the
possibility that the reconstructions contain forced signals of larger or
smaller magnitude than simulated (e.g. due to forcing uncertainty,
uncertainty in a models transient response, or uncertainty in calibration of
reconstructions). The results show that the response to volcanic eruptions
can be clearly detected in reconstructions, consistent with epoch analysis
results, and also confirm that the signal is generally larger in magnitude in
the simulations (Hegerl et al., 2003, 2007; Schurer et al., 2013), although
the discrepancy may be within the range of volcanic forcing uncertainty. The
response to solar forcing cannot be reliably separated from internal
variability, but very high solar forcing such as that reconstructed by
Shapiro et al. (2011) needs to be significantly scaled down to match
reconstructions even given large reconstruction uncertainties (Schurer et
al., 2014). Within the LIA, detection and attribution methods show that
volcanic forcing is critical for explaining the anomalous cold conditions
(Hegerl et al., 2007; Miller et al., 2012; Lehner et al., 2013; McGregor et
al., 2015) and that there is also weak evidence for a contribution from a
small but long-lived decrease in CO
The studies mentioned above mainly focused on the Northern Hemisphere, because a larger number of palaeoclimatic observations and reconstructions are available there. However, several recent studies assessed differences in inter-hemispheric connections (Goosse et al., 2004; Neukom et al., 2014), Southern Hemisphere climate variability (Phipps et al., 2013), regional temperature variability (Luterbacher et al., 2004; Hegerl et al., 2011; Goosse et al., 2012a; Gergis et al., 2015; Shi et al., 2015), and Southern Hemisphere circulation features (Wilmes et al., 2012; Abram et al., 2014; Tierney et al., 2015).
In particular, the recent consolidation of Southern Hemisphere palaeoclimate data (Neukom and Gergis, 2012) led to the comparison of a hemispheric temperature reconstruction with a suite of 24 climate model simulations spanning the past millennium (Neukom et al., 2014). This study reported considerable differences in the 1000-year temperature reconstruction ensembles from the Northern and Southern Hemisphere. An extended cold period (1590–1670s CE) was observed in both hemispheres, while the current (post-1974) warm phase is found to be the only period of the past millennium where both hemispheres experienced simultaneous warm anomalies (Neukom et al., 2014). Their analyses also suggested that the simulations underestimate the influence of internal variability in the ocean-dominated Southern Hemisphere (Neukom et al., 2014).
While several studies have provided valuable advances in our understanding
of hemispheric-scale climate dynamics, this brief overview indicates that
observed and simulated palaeoclimate variations at regional and continental
scales have not been thoroughly compared up to now. This was the goal of a
workshop joining the PAGES 2k and PMIP3 communities in Madrid (Spain) in
November 2013, using a recent set of continental-scale temperature
reconstructions (PAGES 2k Consortium, 2013) and a collection of
state-of-the-art model simulations driven by realistic external forcings
(Schmidt et al., 2011, 2012). On the basis of the discussions held during
this workshop, the aim of this study is to systematically estimate the
consistency between the simulated and reconstructed temperature variations
at the continental scale and evaluate the origin of observed and simulated
variations. This study is motivated by the following key science questions:
Are the statistical properties of surface temperature data for each
individual continent-scale region consistent between simulations and
reconstructions? Are the cross-regional relations of temperature variations similar in
reconstructions and models? Can the signal of the response to external forcing be detected on
continental scale and, if so, how large are these signals?
Section 2 first presents a brief overview of the PAGES 2k reconstructions and
simulations analysed here. In addition to a selection of PMIP3 simulations,
some numerical experiments that did not follow the PMIP3 protocol were also
analysed, mainly to include model runs with larger solar forcing amplitude.
We use several statistical methods to achieve robust results in answering the
key science questions above. They are listed at the end of Sect. 2. Each
methodology is briefly described when it is applied while some specific
implementation information is provided in Supplement Sect. S2. In Sect. 3,
each continental-scale region is studied separately to determine whether the
reconstructed and simulated time series have similar characteristics, in
terms of the magnitude and timing of the observed changes as well as the
spectral distribution of the variance. Section 4 investigates whether the
inter-regional patterns of temperature variability are similar in the
reconstructions and simulations. The role of the external forcings in
producing the observed variations is presented in Sect. 5. Section 6
provides a discussion of our results, their limitations, and how our
conclusions compare to previous studies. Finally, Sect. 7 summarizes the main
findings and provides perspectives for future developments. Several
additional analyses are provided as a supplement for completeness and further
reference.
The PAGES 2k Consortium (2013) generated temperature reconstructions for
seven continental-scale regions (Fig. 1). The proxy climate records found to
be best suited for reconstructing annual or warm-season temperature
variability within each continental-scale region were identified. Expert
criteria for the adequacy of proxies were a priori specified (PAGES 2k
Consortium, 2013). The resulting PAGES 2k data set includes 511 time series
from different archives including tree rings, pollen, corals, lake and marine
sediment, glacier ice, speleothems, and historical documents. These data
record changes in biological or physical processes and are used to
reconstruct temperature variations (all data are archived at
Series of
simulated temperatures and PAGES 2k reconstructions for the seven
continent-scale regions. The reconstructions are shown at their original
resolution and after a smoothing using a 23-year Hamming filter, except for
the North American reconstructions. Only the smoothed series are shown for
models. Grey shading denotes each reconstruction's original uncertainty
estimates. Segments on the right indicate the unforced variability in the
23-year Hamming filtered times series in the respective control simulations
(standard deviation of the time series, colours as in the caption). The
anomalies are computed compared to the mean of the time series over the full
length of temporal overlap between simulations and reconstruction. Note the
different scales in the
The PAGES 2k reconstructions have annual resolution in all regions except North America, which has one 780-year-long tree-ring-based reconstruction (back to 1200 CE with 10-year resolution) and one 1400-year-long pollen-based reconstruction (back to 480 CE with 30-year resolution). These latter two reconstructions therefore are smoothed differently and they are either excluded from the analysis or treated in slightly different ways in some comparisons. The reconstruction for the Arctic region used in this study is based on a revised version (v1.1) of the PAGES 2k data set (McKay and Kaufman, 2014).
Each regional group tailored its own procedures to their local proxy records and regional calibration targets (PAGES 2k Consortium, 2013). Thus, each continental-scale temperature reconstruction was derived using different statistical methods. In short, most groups used either a scaling approach to adjust the mean and variance of a predictor composite to an instrumental target, or a regression-based technique to extract a common signal from the predictors using principal components or distance weighting. Thus, some of the observed region-to-region differences between simulations and reconstructions might be due to the differences in reconstruction methods. Nevertheless, alternative reconstructions for all regions based on exactly the same statistical procedures were also produced and were found to be similar to the PAGES 2k temperature reconstructions provided by each group (PAGES 2k Consortium, 2013). Each regional group also used individually selected approaches to assess the uncertainty of their temperature reconstructions, designed to quantify different aspects of the uncertainty. For example, some regions primarily quantified uncertainties associated with the set of records used in the reconstruction and their agreement through time, which can reflect within-region variability as well as uncertainty (Arctic, North American tree rings). Other regions focused on uncertainties associated with how closely the proxy resembles temperatures (Asia, Antarctica, Europe, North American pollen), and some regions incorporated both of these types of uncertainties (Australasia, South America). Uncertainty estimates in all of the regions except for Antarctica vary through time depending on the set of records available for any given interval and their agreement. All uncertainty estimates that assess how well the proxy data reproduce observed temperatures are based on the assumption that the modern proxy-temperature relation is stationary into the past, and that the agreement between proxy data and temperature on short timescales can be used to infer uncertainty at lower frequencies.
The climate model simulations used in this study are listed in Table 1, summarizing model specifications such as resolution, forcing applied to the transient simulations, and length of pre-industrial control simulations (piControl). These simulations include contributions to the third Palaeoclimate and the fifth Coupled Modelling Intercomparison Projects (PMIP3: Braconnot et al., 2012; CMIP5: Taylor et al., 2012) from six models (CCSM4, CSIRO-Mk3L-1-2, GISS-E2-R, HadCM3, IPSL-CM5A-LR, MPI-ESM-P), as well as a more recent simulation with CESM1, and the COSMOS pre-PMIP3 ensemble with ECHAM5/MPIOM (see also Table S1 in the Supplement).
Description of the model simulations.
Forcings: S, V, G, A, L and O stand for
solar, volcanic, greenhouse gas, aerosols,
land use, and orbital forcing, respectively, derived from the following
references:
10: Vieira et al. (2011) spliced to Wang et
al. (2005);
11: as 10, but scaled to double the Maunder Minimum–present day
amplitude;
12: Steinhilber et al. (2009) spliced to Wang et al. (2005);
13: Krivova et al. (2007);
14: Bard et al. (2000);
20: Gao et al. (2008);
21: Crowley and Unterman (2013);
22: Ammann et al. (2007);
30: Flückiger et al. (1999, 2002), Machida et al. (1995);
31: Hansen and Sato (2004);
32: MacFarling Meure et al. (2006);
33: Johns et al. (2003);
34: CO
The experiments were selected among available pre-PMIP3 and PMIP3 simulations on the basis of specific criteria: the conditions were that (i) they run continuously from 850 to 2000 CE; (ii) they include at least solar, volcanic aerosol, and greenhouse gas forcing (S, V, G in Table 1); (iii) they use a plausible solar forcing reconstruction with an amplitude within the range that is consistent with recent understanding; and (iv) they do not display a large unphysical drift over the simulated period.
PMIP3 simulations all comply with criteria (ii) and (iii) as they follow the
recommendation of Schmidt et al. (2011) by using an increase in total solar
irradiance (TSI) from the late Maunder Minimum period to the present day of
Most non-PMIP3 simulations did not comply with at least one the criteria
above. Nevertheless, experiments performed with two models (ECHAM5/MPIOM and
CESM1) follow all of them. They include simulations with a stronger solar
forcing than in the PMIP3 ensemble. A three-member ensemble with ECHAM5/MPIOM
uses a TSI reconstruction with an increase of
Consequently, the group of simulations analysed here is not strictly based on the PMIP3 ensemble. Nevertheless, as we use a majority of PMIP3 simulations and additional simulations that follow an experimental design similar to PMIP3, we will keep the reference to PMIP3 for simplicity.
The variable extracted from the simulation outputs is the monthly mean surface air temperature (labelled “tas” in the Climate Model Output Rewriter framework of CMIP5). These temperature fields were then used to create area-averaged time series that matched the domain and seasonal window of each of the PAGES 2k regional reconstructions (see Supplement Sect. S1).
Several climate model–palaeoclimate data comparison and analysis methods are used in this study to verify the robustness of the results generated by each method and to provide a comprehensive guide for future work. Model–data comparisons need to account for uncertainties in climate reconstructions, in forcing reconstructions, and in the response to forcings in model simulations. These approaches also must recognize that the real climate, and hence the reconstructions, and individual climate model simulations include their own individual realizations of internally generated variability. Therefore, perfect agreement between model simulations and data can never be expected when directly comparing time series.
The first group of methods is focused on the first question raised in the introduction. The goal is to assess whether temperature reconstructions have similar statistical properties compared to simulations. This is initially done by simple analysis of the time series, such as estimates of the variance (Sect. 3.1). The spectral properties are then analysed (Sect. 3.2) before the probabilistic and climatological consistency (Sect. 3.3) and the skill of the various simulations (Sect. 3.4). The second question dealing with the cross-regional variations in temperatures is addressed by discussing the correlation between regions (Sects. 4.1 and 4.3) and through a principal component analysis (Sect. 4.2). Finally, the third question about the role of the forcing is studied by means of a superposed epoch analysis (Sect. 5.1) by applying a statistical framework involving correlation and distance metrics (Sect. 5.2) and detection and attribution techniques (Sect. 5.3). For more details on those methods, see Sect. S2.
In the majority of the analyses presented in this manuscript, anomalies compared to the mean over the whole period covered are used and the time series are smoothed or temporally averaged, using either a 23-point Hamming filter or non-overlapping 15-year averages, depending on the requirements of the various techniques (both methods give a similar degree of low-pass filtering). This is motivated by the relatively weaker skill of some reconstructions to replicate observed records on interannual timescales (Cook et al., 2004; Esper et al., 2005; D'Arrigo et al., 2006) and by the fact that the main focus here on decadal to centennial timescales. The full period analysed is 850–2005 CE, although different periods are chosen for some analyses because of data availability, the choice of the temporal filtering, other technical restrictions, or to analyse sub-periods.
To begin, the agreement between simulations and reconstructions for individual regions is described qualitatively, using a simple visual comparison of the time series, and then quantitatively by calculating spectra, consistency, and skill metrics. The correlations between the time series are presented in the Supplement (Fig. S1 and Supplement Sect. S3). Overall, the analyses in this section illustrate the potential of identifying common signals in both data sets. The different diagnostics are presented here separately, whereas the conclusions derived from the results of the different analyses are compared and discussed in more detail in Sect. 6.
Figure 1 shows the regional time series in the forced simulations with each regional temperature reconstruction. To the right of each time series graph, the magnitude of variability in unforced simulated temperatures is illustrated by calculating the standard deviation of pre-industrial control simulations in each model. The unforced variability is generally similar in all models in all the regions, with weaker amplitudes in Australasia and Asia. Note that some regions cover only land areas, while others have an oceanic fraction (see Supplement Sect. S1), with a potential impact on the magnitude of the estimated variability.
Most reconstructions show a tendency of a gradual cooling over the millennium, followed by recent warming. Notable common features among regions on decadal timescales are the pronounced negative anomalies related to large tropical volcanic eruptions in the simulations. This is most obvious for the eruptions in the 1250s, 1450s, and 1810s. Among the temperature time series, a larger response to volcanic eruptions is noticeable in the CESM, MPI, and CCSM4 simulations. The regional temperature reconstructions rarely capture the first two of these anomalies or only register them at smaller amplitudes. Only the early 19th century eruptions are clearly reflected in many regions, and are most pronounced in the Northern Hemisphere reconstructions. The reconstruction for Europe also shows a negative anomaly coinciding with the effect of the 1450s eruption, with an amplitude comparable to that seen in some of the simulations.
Figure 1 suggests that the temperature reconstructions show slightly more centennial to multi-centennial variability than the models over the full period with stronger long-term trends, while several model results indicate a stronger recent warming compared to some of the reconstructions. The reconstruction uncertainty bands provided with the original PAGES 2k reconstructions encompass the simulated series with few exceptions, in particular the Arctic and North America during the 1250s. The published uncertainty estimates have been calculated using different methods for the various continental-scale regions, as detailed in the Supplement of PAGES 2k Consortium (2013). Furthermore, those uncertainties are only valid at the original temporal resolution, which is annual in all cases except for North America. It is expected that the reconstruction uncertainty decreases at lower resolution, or after smoothing as in our case. This is consistent with the lower uncertainty ranges for the low-resolution pollen-based reconstruction.
However, estimating the reduction of the uncertainty due to smoothing is not straightforward (e.g. Moberg and Brattström, 2011; Franke et al., 2013) as the resulting uncertainty magnitude is also dependent on autocorrelation of the non-climatic noise in proxy data. The extreme hypothesis, considering that the error is constant in time and that the errors are uncorrelated, would lead to a decrease proportional to 1 over the square root of the number of samples included in the average. For a smoothing similar to 15-year averaging, as performed herein, the approximation that likely leads to an underestimation of the uncertainties would correspond to a decrease by a factor of about 4 compared to the original error estimate. This suggests very small errors for most reconstructions. In this case, the major discrepancies between the reconstructions and model results would occur at the same time as mentioned above; however, periods when the models are out of the range of the reconstruction uncertainty bands would be more common at the decadal scale.
For North America, the long-term multi-centennial trend appears to be
similar between the pollen based reconstruction and simulations, except for
the last
Simulated and reconstructed temperatures show only weak long-term trends in Asia, but decadal variability appears to be larger in the reconstruction. Simulations generally differ from the reconstruction in the last 200 years and show either much weaker or much stronger trends. In Australasia, the weak forced variability common to all simulations may be due to the large spatial extent of the domain, which includes large oceanic areas that may dampen the forced high-frequency variability. For the recent warming, the trends in CESM, CCSM4, IPSL, and the COSMOS simulations are considerably stronger than the Australasian temperature reconstruction. The temperature reconstruction for South America is often near the upper or lower limit of the simulation ensemble range and displays more centennial-scale variability than the simulations. In Antarctica, the reconstruction has a clear long-term negative trend and only a modest warming in the 20th century, while the simulations show nearly no long-term cooling but agree on the warming onset in the beginning of the 20th century.
Next, we consider the agreement between simulated and reconstructed temperature data in terms of their spectral densities, which show how temperature variances are distributed over frequency (Fig. 2; see also Fig. S2). Spectra were computed using the multi-taper method (Thomson, 1982; Percival and Walden, 1993), with its so-called time–bandwidth product being set to 4. Consequently, each calculated spectrum is an average of seven statistically independent spectrum estimates. Spectra for the reconstructions are illustrated with their 95 % confidence intervals, while model spectra are plotted with single lines. The analysis is made at the original time resolution using all existing data points in the time frame 850–2005.
Spectral densities for simulations and reconstructions for PAGES 2k
regions, calculated using all existing data in the period 850–2005 CE.
Reconstruction spectra are illustrated with their 95 % confidence intervals
in coloured bands, while model spectra are shown with single coloured lines.
Dashed vertical lines denote the limit for frequencies and periods of
relevance (to the left of the line) for analyses made at the 15-year
resolution, or with a 23-point Hamming window, as in many other analyses in
this study. The multi-taper method (Thomson, 1982; Percival and Walden, 1993)
was used, with the time–bandwidth product set to 4 and with long-term
averages subtracted before estimating the spectra. Units are temperature
variance (
The degree of agreement between model and reconstruction spectra differ substantially between regions, with the Arctic showing the best agreement at all frequencies and South America showing the worst. In the latter, most model spectra lie in the reconstruction confidence interval only in a narrow frequency band corresponding to about 100- to 150-year periods. The agreement is generally good for the Arctic, Europe, and Asia at multi-decadal timescales (20–50 years) for many regions. Nevertheless, many models have systematically less variance in the 50- to 100-year band and most models have more variance than the reconstructions at higher frequencies.
Climatological consistency (first two columns): residual quantile–quantile plots for the full period; probabilistic consistency (last two columns): rank counts for the full period. The top row is for the Arctic, and the bottom row is for Antarctica. For both the climatological and probabilistic consistency, the computations are obtained by neglecting the uncertainties (left plot) and using the uncertainties provided with the original reconstructions (right plot). For the climatological assessment, positive and negative slopes or large differences from 0 emphasize lack of consistency. For the probabilistic measure, U- or dome-shaped features highlight lack of consistency.
Pronounced differences of high-frequency variance is seen for all Southern Hemisphere regions. In particular, the pre-PMIP3 COSMOS simulations show significantly too much variance at timescales of 3 to 5 years for Australasia and to a lesser degree for South America and Antarctica. This property has previously been related in regions with strong influence from tropical Pacific variability to this model's ENSO variability (Jungclaus et al., 2006; Fernández-Donado et al., 2013). Most model spectra for North America lie within the confidence interval of the tree-ring-based reconstruction spectrum, although several models have somewhat less variance than this reconstruction at periods longer than 50 years. The North America pollen-based reconstruction behaves as a roughly 150-year low-pass-filtered series and has significantly less variance than the corresponding tree-ring-based record at all frequencies for which both spectra are defined.
The probabilistic and climatological consistency of PMIP3 simulations and PAGES 2k reconstructions was assessed following the framework of Annan and Hargreaves (2010; and references therein; Hargreaves et al., 2011, 2013) and Marzban et al. (2011), respectively. The current application is based on Bothe et al. (2013a, b). The underlying null hypothesis follows the paradigm of a statistically indistinguishable ensemble (Annan and Hargreaves, 2010; Rougier et al., 2013), i.e. the validation target, represented here by the temperature reconstructions, and the model simulations are samples from a common distribution and are therefore exchangeable.
Climatological consistency refers to the similarity of the climatological probability distributions of reconstructions and of simulations over a selected period, either the whole millennium or sliding sub-periods. We analyse climatological consistency by comparing individual simulated series with the target (i.e. the reconstructions) to identify deviations in climatological variance and possible biases between them. To achieve this goal, Marzban et al. (2011) proposed the use of residual quantile–quantile (r-q-q) plots that should be approximately flat for consistent series (Sect. S2.1).
Probabilistic consistency refers to the position of the reconstruction in the range spanned by the ensemble of simulations. Histograms of the ranks should be flat under exchangeability (Sect. S2.1) – i.e. estimated frequencies of the verification target and the ensemble agree if the simulation ensemble is probabilistically consistent with the temperature reconstructions (Murphy, 1973).
As there are large uncertainties in palaeoclimate reconstructions, it is necessary to take into account these uncertainties in the evaluation of the consistency of the ensemble of climate model simulations (Anderson, 1996). This is achieved by inflating the model simulations results by adding noise with amplitudes that are proportional to published uncertainty estimates from the original temperature reconstructions.
We assess probabilistic and climatological consistency based on non-overlapping 15-year averages centred on the full period considered, except for the North American temperature reconstruction, where non-overlapping 30-year averages are used for the pollen-based reconstruction, and 10-year averages for the tree-ring-based reconstruction. The results are presented in Figs. 3, S3, and S4 for all regions.
The regions selected for Fig. 3 are chosen to provide a contrasting
example. Two estimates of the uncertainties are used. First, the
uncertainties provided with the original reconstruction are applied, which
is an overestimation for the smoothed time series. Second, at the other
extreme, the uncertainties are assumed to be equal to zero and are thus
known to be underestimated. A third estimate of the uncertainty is provided
in the Supplement figures, using an uncertainty measure equal to the one
provided in the original publication divided by a factor of
The simulations in most cases lack climatological consistency with the reconstructions (Figs. 3 and S3). The simulated quantiles can deviate strongly from the reconstructed quantiles. Specifically, the simulated distributions are generally over-dispersive when using the original estimates of uncertainties. The differences are much smaller when uncertainties in reconstructions are neglected, although extremes often remain overestimated. The Arctic and the North American tree-ring-based reconstruction are exceptions as some simulations are climatologically consistent with the reconstruction and display only small differences between simulated and reconstructed quantiles for all estimates of the uncertainty. Consistency is reduced for those simulations that show larger variability (recall Fig. 1) as is the case of the CCSM4 and CESM models.
In agreement with the climatological assessment, the simulated results generally lack probabilistic consistency with the reconstructions when the original uncertainty is considered (Figs. 3 and S4). The target data are too often in the central ranks, indicating that the probabilistic distribution of the ensemble is too wide and shows significantly over-dispersive spread deviations. The only exception is the North American region using the tree-ring-based reconstruction. The most prominent differences are found in the Antarctic region, where the simulation ensemble spread deviates considerably from reconstructed temperatures (Fig. S4), but strong ensemble spread deviations relative to the pollen reconstruction for North America are also evident.
This assessment of the probabilistic consistency strongly depends on the estimate of the uncertainty of the reconstruction. If we do not add noise to the model time series to reflect error in reconstructions before the ranking and thereby neglect reconstruction uncertainty, or if we assume a strong reduction of the error in reconstruction at the decadal time scale because of the smoothing, the ensemble appears to be consistent with a number of regions or even under-dispersive for others. However, ignoring the uncertainty in such a manner may lead to an overconfident assessment of consistency between simulation ensemble and reconstruction. Nevertheless, because the uncertainties are not well known, over-dispersion does not necessarily weaken the reliability of the ensemble relative to the target, but instead may highlight insufficiently constrained uncertainties in the reconstruction.
The skill of the simulations is assessed using a metric introduced by
Hargreaves et al. (2013). The idea of skill stems from weather forecasting
and refers to the ability of a simulation to represent a target better than
some simple reference values. For instance, in weather forecasting, a
standard reference is to assume no change compared to initial conditions
(i.e. persistence). A forecast has a positive skill if it is closer to the
observed changes than this simple reference. The skill
Figure 4 presents the skill for the Arctic and Antarctica, as an example,
with the other PAGES 2k regions displayed in Fig. S5. In this estimate, we
use a no-change reference forecast (i.e. the reference is the climatology)
as there is no clear a priori evidence that the climate at one
particular time during the past millennium is warmer or colder than the mean.
Positive values suggest that the simulations is in better agreement with
(i.e. closer than) the regional reconstructions than this reference. Results
are presented for dates when no data are missing in four periods: 850 to
1350, 1350 to 1850, 850 to 1850, and the full period 850 to 2000. As in
Sect. 3.3, we compute the skill in Fig. 4 using the uncertainties provided
with the original reconstruction, as well as a case that assumes the
uncertainties are negligible (i.e. assuming
The most notable result is that the skill measure is generally undefined when using the uncertainties provided with the original reconstruction: either the reference or the simulated data are closer to the reconstruction than uncertainty allows, leading to the square root of a negative number in Eq. (1). This confirms that uncertainties in the reconstructions are potentially an overestimation for smoothed time series. When ignoring uncertainties, the 15-year non-overlapping means of the simulations rarely display skill. Simulation skill appears to be most likely for the European and Arctic regions, while positive skill is nearly absent for the Southern Hemisphere regions and North America in all the models.
Skill metric for the individual models for all periods (from left to right: 850–1350, 1350–1850, 850–1850, 850–2000). Top row for the Arctic, bottom for Antarctica. The computations assume no uncertainties (left plot) and uncertainties provided with the original reconstructions (right plot). When the skill is undefined (as for Antarctica when using the original error estimates) no bar is shown. Positive values indicate skill in this simple evaluation.
The structure of the spatial variability, i.e. the spatial covariance of temperature changes, contains contributions from forced signals and from teleconnections in the internal climate variability. The PAGES 2k temperature reconstructions help to investigate the consistency between simulations and reconstructions with respect to this covariance structure. In the following sections, this is evaluated using spatial correlations, principal components (PCs) and empirical orthogonal functions (EOFs), and correlations over sliding temporal windows.
The spatial correlation matrix of simulated temperature for the PAGES 2k regions is compared to the correlation matrix of the PAGES 2k reconstructions (Figs. 5 and S6). Correlations are calculated for detrended continental mean time series filtered with a 23-year Hamming window and based on the continents for which these are available, which excludes North America. We use the longest common period for forced simulations and reconstructions, which for the filtered data is 1012–1978 CE (1000–1990 CE for annual data). To disentangle the contributions from forcings and from internal variability, we analysed forced simulations for the entire analysis period, forced simulations for the pre-industrial period (before 1850 CE), and unforced control simulations.
Correlations among the PAGES 2k regions for detrended simulated and reconstructed time series filtered using a 23-year Hamming filter. Left-hand panel: forced simulation with MPI-ESM (upper triangle) PAGES 2k reconstructions (lower triangle) for 1012–1978 CE. Right-hand panel: forced simulation with MPI-ESM for the pre-industrial period 1012–1850 CE (upper triangle) and unforced control simulation with MPI-ESM (lower triangle).
MPI-ESM-P is used to illustrate our main findings in Figure 5 (see Fig. S6 for the other models). Correlations in the forced MPI-ESM-P simulation for the whole period are higher than 0.6 between nearly all regions. In contrast, the correlations for the PAGES 2k temperature reconstructions are rather low, which indicates a substantial inconsistency between the correlation structure in the models and in the PAGES 2k temperature reconstructions. The potential causes of this discrepancy will be discussed in Sect. 6, but we must reiterate here that, in contrast to other analyses presented above, the evaluation of the spatial correlation does not take into account any uncertainty in the reconstruction. Any non-climatic noise related to the characteristics of the proxy records selected or differences in the reconstruction method between regions would decrease the correlation, contributing to lower values than for the model results.
The correlations in the simulations are lower if only the pre-industrial period is considered, and close to zero in the control simulations. The simulated high correlations for the last century are likely to be a consequence of the rather homogeneous and strong anthropogenic warming in the simulations. The high correlations for the pre-industrial forced runs show that the response to volcanic forcing, solar forcing land use, and/or orbital forcing also substantially contributes to the correlations at the timescales considered. Low values obtained for the control simulations indicate that teleconnections between continents are weak for simulated internal variability.
Although these general characteristics are present in many of the models evaluated here, there are some differences among them. In particular, some of the models that show higher correlations during pre-industrial times (e.g. CESM) also display a large response to volcanic forcing compared to the other members of the ensemble (Lehner et al., 2015). Additionally, the specific characteristics of some regions may differ substantially. For instance, the correlation between Antarctic temperatures and other regions is very low in MPI-ESM-P or IPSL-CM5A-LR for pre-industrial conditions, while it is much larger in CCSM4 and CESM. This can be attributed to a different ratio of forced versus unforced variability, and in particular to discrepancies in the magnitude of the response to external forcing in the selected models.
Figure 6a shows the loadings of the first EOF on each region for the PMIP3 forced simulations and the PAGES 2k reconstructions (with corresponding results for the GISS and COSMOS ensembles presented in Sect. S4 and Fig. S7). Most models show similarities in the loadings, which indicates that the different regions covary similarly in the different models. All loadings are positive, and thus the first principal component (PC) is only a weighted mean of all continental temperature series.
Consequently, the time series of the first PC of the PMIP3 simulations and PAGES 2k temperature reconstructions (Fig. 6b) reflect the main features of the individual original series (particularly for Northern Hemisphere regions); namely a temperature decline after around 1200 CE, which lasts until the early 1800s, followed by the sustained warming within the 19th and 20th century. Additionally, the influence of volcanic eruptions on reconstructed temperatures is visible during some periods, especially during the mid-13th century (although not in the reconstructions), the mid-15th century, and the beginning of the 19th century.
In most models, the first EOF explains about 80–90 % of the total variance, whereas the leading EOF in the PAGES 2k temperature reconstructions accounts for only 55 % of the total variance. This shows that the covariance structure is less complex in the simulations. This is consistent with the larger correlations between regions found in Sect. 4.1, which means that the leading mode of homogeneous warming or cooling dominates the covariance structure in model results. In a few simulations (HadCM3, COSMOS), however, the variance explained by the first EOF is about the same as in the reconstructions.
The largest values for the loadings are found for the Arctic region, due to the high temperature variability in the last 1200 years in this region. This expression of the classical Arctic amplification is reflected in most models and in the reconstructions. The ocean-dominated regions of the Southern Hemisphere show less pronounced variability relative to the Northern Hemisphere, consistent with the results of Neukom et al. (2014).
If the analysis is performed over the pre-industrial period only (Fig. S8), similar conclusions are reached but the loadings are smaller, especially over the Arctic, and the amount of variance represented in the leading EOF generally decreases, indicating a larger heterogeneity in the pre-industrial period.
Next, the stationarity of the correlation structure between the different regions, in the models and the reconstructions, is assessed using a running correlation analysis, (Figs. 7, S9). For the simulations, the multi-model mean shows generally high inter-regional correlations, as the common contribution of the forcing is enhanced because of the averaging procedure. Periods with small variations in external forcing are, however, characterized by weaker coherence between the regions. This occurs during the 11th and 12th century and in shorter periods around 1500 and 1750. High coherence occurs in periods with strong variations in external forcing, highlighting in particular that volcanic eruptions can cause simultaneous temperature variations in most regions.
One-hundred-year moving Tukey window correlations between selected PAGES 2k
regions for the PAGES 2k reconstructions (blue) and PMIP3 models (eight models in
orange, multi-model mean in red) and observations from HadCRUT4 (Morice et
al., 2012, black). Each 100-year segment is linearly detrended beforehand.
Grey shading illustrates correlations that are not significant at the 5 %
level.
Superposed epoch analysis of the impact of the volcanic activity in the reconstructed and simulated temperatures. Superposed composites of temperature responses during selected periods when peak negative forcings in the Crowley and Unterman (2012) volcanic reconstruction are aligned. The composite is produced by selecting the 12 strongest volcanic events, starting 5 years before the date of the peak eruption and ending 10 years after the event. Each panel indicates the reconstructed (red lines) and simulated (black) composites of the temperature response for each PAGES 2k region. The grey shading indicates the complete range of simulated temperature responses.
The inter-regional correlations in the individual model simulations vary considerably. The model range includes the correlations derived from the reconstructions for some regions, as for Europe vs. Arctic (Fig. 7a), but values for models are very often higher than for reconstructions (see also Sect. 4.1). The difference is particularly large for the coherence between Australasia and South America (Fig. 7b), which is substantially larger in model simulations compared to reconstructions and instrumental observations (Morice et al., 2012) (Fig. 7b). This could indicate that some regions are less connected by modes of variability (such as ENSO) in reality than suggested by models, that the models have poor representation of modes of internal variability that influence the ocean-dominated Southern Hemisphere (see Neukom et al., 2014; see also Supplement Sect. S5 and Fig. S10), or that there is more non-temperature noise in the proxy data from those regions.
Some aspects of the response to external forcing have been briefly discussed
in the previous sections. It is now formally addressed here by a superposed
epoch analysis, by applying the
The response of the PAGES 2k reconstructions and the various model simulations to external forcing from solar and volcanic activity is evaluated here using a superposed epoch analysis approach, following Masson-Delmotte et al. (2013). This analysis was conducted for two different timescales: interannual and multi-decadal. For interannual timescales, this is done by generating composites of reconstructed and simulated temperature sequences corresponding to the timing of the 12 strongest volcanic events (see Sect. S2.2). For the multi-decadal composites, the five strongest events are selected and the means from 40 years before to 40 years after the eruption are calculated from time series smoothed with a 40-year low-pass filter using least-squares coefficients (Bloomfield, 1976). We also calculate composites corresponding to the timing of intervals of weaker solar forcing at decadal timescales. The intensity of the average model response to the selected forcing events is then compared to the corresponding response found in the reconstructions.
The regional temperature response for six PAGES 2k regions (North America is not analysed here; see Sect. 2.1) to the major volcanic events in the Crowley and Unterman (2012) reconstruction are shown in Fig. 8. The temperature perturbation typically lasts longer than the forcing itself, with a recovery to pre-eruption temperatures after 3 to 10 years in the simulations and in the reconstructions.
The responses vary considerably in the simulations and in the reconstructions
among regions. Nevertheless, the composite averages are always larger in
model results with values of up to
At multi-decadal timescales, the simulated and reconstructed temperature responses are in better agreement, in particular when using the Crowley and Unterman (2013) reconstruction (Fig. S12) rather than the Gao et al. (2008) estimations (Fig. S13), with temperature decreases on the order of a few tenths of a degree in most regions. The one exception is South America, where, in contrast to simulations, the reconstructions do not show any multi-decadal changes associated with volcanic forcing.
The multi-decadal impact of solar forcing in the reconstructions is strongest
in Europe, the Arctic, and Asia (Fig. S14), with mean changes ranging from
0.15 to 0.25
A statistical framework for evaluating simulated temperature sequences
against reconstructed past temperature variations was developed by Sundberg
et al. (2012), Hind et al. (2012), and Moberg et al. (2015). It includes two
essential metrics, which both serve as statistical tests of a null
hypothesis. First, a correlation metric,
Second, a weighted square-distance metric,
Prior to the analysis, all records were recalibrated against their instrumental target temperature time series (see Sect. S2.3) following the procedure of Sundberg et al. (2012) and Moberg et al. (2015) to ensure that each regional reconstruction, after calibration, approximately satisfies the assumption that the true temperature component, upon which the non-climatic noise component is added, is correctly scaled (see Sundberg et al., 2012, and Moberg et al., 2015).
Figure 9 shows the model evaluation statistics
Correlation (
Because of the imprint of the forcing response, all forced simulations show
significant (
In contrast, simulated and reconstructed pre-industrial temperature
histories for South America show little common variation, as revealed by
mostly insignificant
Results for the distance statistic (
Thus, overall, the comparison between simulation results and reconstructions
performs notably better for the Northern Hemisphere than for the Southern
Hemisphere. In particular, nearly all simulations have significant negative
Results for both
Extending the
analysis to the full 861–2000 period yields higher
Ensemble means for COSMOS and GISS ensembles give more significant
This result is confirmed by a formal test where
Detection and attribution aims to identify the forced response in the regional temperature reconstructions by evaluating whether observed changes could be entirely caused by variability created within the climate system (internal variability), whether external forcing is necessary to explain them, and what magnitude of external forcing response is consistent with reconstructions (see Bindoff et al., 2013; Hegerl and Zwiers, 2011). Here, we focus on all forcings together and not on the response to each forcing individually, as simulations with individual forcings are needed to analyse the latter. Attribution is achieved by estimating the response to the external forcing in the reconstruction using a total least-squares regression techniques (following Schurer et al., 2013; see also Allen and Stott, 2003). The outcome are scaling factors that determine the amplitude of the fingerprints of the forcing response in the reconstructions. A forcing is detected if a scaling value of zero can be rejected at the 5 % significance level, indicating that it is unlikely that climate variability alone is responsible for the similarity between forced response and reconstruction. If the 5–95 % range of scaling factors encompasses 1, then the magnitude of the response to forcing is found to be consistent in simulations and in the reconstruction (see Sect. S2.4).
Figure 10 shows the results of the detection and attribution analysis using the multi-model ensemble mean, which is calculated as the mean of all model simulations described in Sect. 2.2, except for the high-solar COSMOS and CESM1 simulations as they include a clearly different forcing. All reconstructions and models used were first filtered using a 23-year Hamming window.
Detection and attribution results for PAGES 2k regions. Vertical bars indicate 5–95 % scaling factor ranges, with a cross indicating the best fit. Scaling factors that are significantly offset from “0” indicate that the response to forcing is detected, and those that encompass “1” indicate that the magnitude of the forced response agrees with simulations. For each region scaling ranges are shown for four different time periods (colours). For the Northern Hemisphere (NH), Southern Hemisphere (SH), and globally (ALL), the regressions were carried out on the combined data from all the applicable regions. An asterisk indicates that the detection analysis has been successful, namely the forced response is significantly greater than zero and that the residuals are consistent with model-based samples of internal variability.
The response to external forcing is detectable (
The only exceptions are the earliest time period (864–1350) for North America (tree-ring-based reconstruction), where only 150 years of data were available and the early European period, which fails the residual consistency check, indicating that the residual that is attributed to internal variability is larger than expected from model simulations, possibly due to non-climatic noise in reconstructions. The results for the latter case suggest that the basic hypotheses underlying the methodology are violated because the model–reconstruction discrepancy cannot be explained by internal variability alone. External forcing is also detectable when the model and reconstruction data from all Northern Hemisphere regions are combined.
External forcing is not detectable in South America (no scaling ranges significantly larger than 0) and only for certain time periods for Antarctica and Australasia (with fits for Australasia failing the residual consistency check). External forcing is also not detectable in the combined Southern Hemisphere reconstruction. As well as being undetectable, despite accounting for uncertainty in simulated signals due to variability, the estimated signals are also significantly smaller than simulated. Consequently, the models appear to simulate a magnitude and pattern of external forcing in the Southern Hemisphere significantly different from that derived from the PAGES 2k reconstructions. This could be due to strong noise in reconstructions swamping the forced response, calibration uncertainty in reconstructions misestimating the magnitude of the forced response, or errors in climate models as discussed below.
In the light of the results presented in the Sects. 3 to 5, we discuss below each of the three questions raised in the introduction.
The analyses herein show that the answer to this question depends on the specific feature assessed. The simulation results and reconstructions agree at regional scale for some metrics, but disagree in many cases. The consistency between simulations and observations is still generally more robust at hemispheric and global scales, and the fit to reconstructions is improved for ensemble mean of simulations compared to individual members.
Overall, smoothed simulated temperature anomalies from the long-term average lie within the range of the originally published uncertainty estimates of the reconstructions. However, these uncertainty ranges are, for all regions except North America, defined for data at annual resolution and therefore are very likely larger than uncertainty ranges adapted for the smoothed versions of the data (see Sect. 3.1). Thus, the published uncertainties for the reconstructions are in most cases too large to provide strong constraints on the ensemble of simulations, as different forcing amplitudes and responses are nevertheless consistent within the range of the reconstructed values. Some common signals between model results and reconstructions can be identified visually as isolated events, such as the cooling during the early 19th century in many regions, but they are relatively rare.
The time series for forced simulations are nevertheless significantly correlated with temperature reconstructions, for many regions, when the entire series are considered (Supplement Sect. S3). Models also have some skill compared to a simple a priori estimate assuming no temperature change over the past millennium (Sect. 3.4). Despite using a very simple reference method as a benchmark, however, such skill is achieved nearly exclusively for Northern Hemisphere regions, specifically for the Arctic and in some models for Europe and Asia. This is in agreement with the conclusions derived from the application of the Sundberg et al. (2012) evaluation framework (Sect. 5.2) that forced simulations are significantly closer to the reconstructions than unforced simulations in Northern Hemisphere regions. In particular, the Arctic region shows a robust agreement, as do Europe and Asia to a lesser degree. In contrast, for all the regions of the Southern Hemisphere, the models have nearly no skill compared to a constant climate reference and individual forced simulations are in most cases not significantly closer to reconstructions than an unforced reference.
The diagnostics mentioned above addressed whether simulated time series of surface temperature at continental scale have temporal similarities with reconstructed ones. The climatological or probabilistic consistency is complementary as it focuses on the distribution of temperature data, independent of the particular trajectory over time. For most regions, no consistency is found between the distribution of model results and reconstructed temperatures when using the original reconstruction uncertainty estimates (Sect. 3.3), which are annually resolved in all cases except North America.
It should be noted, however, that these results depend strongly on the uncertainty estimates considered (Bothe et al., 2013a, b): the greater the assumed reconstruction uncertainty, the weaker the consistency with model simulations as the models tend to appear over-dispersive. When reducing the uncertainties, to adapt them to the smoothing or temporal averaging applied here, the consistency improves in many regions. Such reduction of the uncertainties may, however, lead to overconfident conclusions if the original uncertainty estimates at the annual resolution did not account for all existing sources of uncertainty.
A visual comparison suggests that the temperature reconstructions show slightly more centennial to multi-centennial variability over the full period with stronger long-term trends, while model results indicate a stronger recent warming compared to some of the reconstructions (Sect. 3.1). Comparison of the series spectra (Sect. 3.2) reveals marked differences between regions in how well the simulations agree with the reconstructions. The best overall agreement is seen for the Arctic, where the model spectra mostly lie within a 95 % confidence interval for the reconstruction spectrum. For all other regions, the model spectra often lie outside the confidence interval for some frequency ranges. The mismatch is most pronounced for South America, but there are other examples with both lower and higher variance at different frequencies in model results compared to reconstructions.
The disagreements can have various origins, in either reconstructions or simulations or both. For instance, the total variance of reconstructions is dependent on how they were calibrated to instrumental observations (e.g. Kutzbach et al., 2011), but the shape and slopes of their spectra are determined by spectra of both the true climate and the non-climatic proxy-data noise and by the signal-to-noise ratio (Moberg et al., 2008). Some studies have suggested that reconstruction methodologies may alone underestimate low-frequency variability, in addition to any frequency biases inherent to the proxy data (e.g. Smerdon et al., 2010; Esper et al., 2012; Smerdon et al., 2015a). The amplitude of the reconstructed past forcing changes, which affect the model spectra, is still uncertain (Schmidt et al., 2011, 2012). The modelled transient climate response and the amplitude of internal variability at the regional scale vary considerably and thus deficiencies in applied forcings or internal model physics can lead to errors in the modelled spectra. Nevertheless, no major, systematic model underestimation of low-frequency variability can be deduced at the continental scale from the analyses performed herein, in contrast to some recent studies devoted to the ocean surface temperature (Laepple and Huybers, 2014a, b).
Discrepancies in the inter-regional relations between reconstructions and model results are clearer than for each individual region. While the strong correlations between the temperature variations in regions from the Northern Hemisphere in model simulations have some similarities to the ones in the reconstructions (Sect. 4.1), the correlation between the hemispheres and between the Southern Hemisphere regions are much stronger in models than in reconstructions, as previously reported by Neukom et al. (2014) at hemispheric scale.
This result is robust as it is also reflected in the larger variance explained by the first EOF mode in models than in the temperature reconstructions (Sect. 4.2), and this is valid for most of the past millennium (Sect. 4.3). These differences may be due to a stronger response to forcings in the models, to unrealistic internal variability in the models, or to non-climatic noise in the proxies or due to a combination of these factors, as discussed in more detail below. Additionally, there are large differences between the various models in the Southern Hemisphere. For instance, Antarctic temperature is strongly related to other regional temperatures in some simulations and not in others, suggesting that specific model dynamics may account for some of the discrepancies with the reconstructions.
The agreements or disagreements between model results and reconstructions can be partly explained by the model response to forcing. The contribution of the forcing derived from simulated results can be detected in the reconstructions for all regions of the Northern Hemisphere (Sect. 5.3). The forcings used in the PMIP3 model experiment result in simulated temperature histories that, on the whole, explain a significant fraction of the past regional temperatures in the pre-industrial climate.
This strongly contributes to the model skill for the Northern Hemisphere, as
unforced internal stochastic variability is unlikely to agree between model
results and observations. This is confirmed by the significant correlation
coefficients (Fig. S1) and correlation statistics (
On interannual timescales, the model response to volcanic forcing appears larger than represented in the reconstructions (Sect. 5.1). There is some debate on the potential underestimation or overestimation of the cooling due to volcanic eruptions in reconstructions (e.g. Mann et al., 2012; Anchukaitis et al., 2012; Tingley et al., 2014; Büntgen et al., 2015). Nevertheless, this model overestimation was also found when compared to instrumental data and at hemispheric scale, suggesting a robust phenomenon (Brohan et al., 2012; Fernández-Donado et al., 2013; Masson-Delmotte et al., 2013; Schurer et al., 2013). Both model results and reconstructions also show that volcanic activity impacts temperature at multi-decadal timescales, with a similar magnitude of the temperature response in models and reconstructions over most of the regions in the Northern Hemisphere. This is consistent with the detection and attribution analysis (Sect. 5.3), which indicates that the magnitude of the simulated response to forcing in the Northern Hemisphere has the correct amplitude for smoothed time series.
The role of solar forcing is less clear and none of the pre-PMIP3 COSMOS simulations with either a moderate or a weak solar forcing gives a systematically better agreement with the reconstructions than the other, although the ensemble with low solar forcing yielded a better fit during the first 500 years (Fig. S18). This confirms earlier results obtained at the hemispheric scale (Masson-Delmotte et al., 2013; Schurer et al., 2014).
In the Southern Hemisphere, the influence of external forcing is often not
detected (Sect. 5.3). This is consistent with the lower correlation
coefficients (Fig. S1) and weaker correlation statistics (
The analysis performed herein, however, cannot reveal the origin of the mismatch between simulation results and reconstructions. These differences may be due to biases in the dynamics of the climate models or to errors in the implemented forcing, in particular in their spatial distribution. Land-use changes, which are not included in some models (Table 1), tend to reduce the spatial correlation between regions as deforestation did not occur at the same time over all continents (Pongratz et al., 2008; Kaplan et al., 2011).
The spatial distribution of volcanic aerosols may also contribute to pronounced regional differences. Volcanic forcing is usually not implemented as a direct simulation of changes in stratospheric sulfate concentrations due to individual eruptions but rather as a mean change in the optical depth for different latitudinal bands. This can have an impact on the overestimation of the response in individual simulations or to individual eruptions. Additionally, if the latitudinal distribution of volcanic aerosols is too homogeneous, thereby inducing unrealistically symmetric forcing between hemispheres, it would also overestimate the global signature of the induced cooling.
Any non-climatic noise in the reconstruction would tend to reduce the covariance in reconstructions compared to model results, which would lead to an underestimation of the relative contribution of the forced signal. Despite the large progress made over the last few years, this may still be a critical problem in the Southern Hemisphere, where fewer long palaeoclimate records are available compared to the Northern Hemisphere, explaining some of the model–data mismatch there.
The role of internal variability in driving temperature variations may also be underestimated in model simulations, particularly in the ocean-dominated Southern Hemisphere, as suggested by Neukom et al. (2014). Simulated internal variability may, however, be overestimated, as reported here in at least one model and elsewhere for ENSO-type variability (Jungclaus et al., 2006) or for the Southern Ocean ice extent (Zunz et al., 2013). This would imply a ratio between internal and forced variability that is incorrect, which would lead to biased correlations between the different regions.
Another potential explanation for the differences in the spatial covariance structure between models and observations relates to the relatively coarse resolution of the climate models. Using models with higher spatial resolution will increase the number of spatial degrees of freedom and potentially improve the covariance structure of the climate models compared to reconstructions. Nevertheless, the expense required for both high spatial and temporal resolution, as well as the necessary ensemble approach, could be prohibitive.
The analysis of model simulations and PAGES 2k temperature reconstructions has allowed us to extend some of the conclusions previously articulated at hemispheric scale. For all the continental-scale regions in the Northern Hemisphere, the models are able to simulate a forced response with a magnitude similar to the one derived from reconstructions. Despite higher levels of variability on continental scales (relative to full hemispheres), the role of forcing is found to be important. This leads to reasonable agreement between models and temperature reconstructions.
Nevertheless, a deeper assessment of the consistency between simulated results and reconstruction is limited because of the large uncertainties in the reconstructions and the weak constraints on the estimates of this uncertainty. Notably, the agreement between simulation results and reconstructions is poor for the Southern Hemisphere regions. Our results indicate that models have a much clearer response to forcing than deduced from the reconstructions, leading to a greater consistency across the Southern Hemisphere regions and between hemispheres in model results than in the reconstructions.
It is not possible to precisely assess which part of those disagreements comes from the biases in model dynamics, the forcing, or the reconstructions. As suggested in many previous studies, substantial progress will only be possible with better uncertainty quantification and reduction (spatially and temporally) in the reconstructions and the forcing, as well as through model improvements.
Nevertheless, on the basis of our results we highlight four specific points that may lead to significant advances in the coming years.
The first is the insights that can be gained through studying the discrepancies between reconstructions and simulations relative to direct observations over the most recent decades. A quantitative comparison between simulations, reconstructions, and instrumental data would provide useful information on the origin of those disagreements, allow an estimate of the non-climatic noise in reconstructions, and would elucidate how mismatches over the last 150 years are related to disagreements over the last several millennia (e.g. Ding et al., 2014).
Secondly, large uncertainties are associated with the behaviour of the ocean over the past millennium. The discrepancies in the low-frequency variability between model results and reconstructions at the continental scale seem less systematic than for some oceanic data (Laepple and Huybers, 2014a, b), but clearly assessing this would require additional analyses. As new palaeoclimate data compilations are now available for the global ocean (Tierney et al., 2015; McGregor et al., 2015), model–data comparison for oceanic regions should be encouraged, and the compatibility between ocean and land temperature reconstructions tested. This would allow us to assess the multi-decadal internal and forced variability in the ocean and to determine whether it is the origin of the disagreement between model simulations and Southern Hemisphere reconstructions (e.g. Neukom et al., 2014). Internal ocean variability can also have a significant influence on Northern Hemisphere climate as seen in several studies investigating the circulation in the Atlantic at multi-decadal timescales (e.g. Delworth and Mann, 2000; Knight et al., 2005; Lohmann et al., 2014). These are the timescales for which most models tend to display less variability than reconstructions.
Third, our comparison of continental-scale temperature reconstructions with simulated temperatures only uses a small fraction of the information provided by models and palaeoclimate records. As discussed in Phipps et al. (2013), other approaches can provide analyses complementary to classical model–data comparison, through a better handling of the various sources of uncertainty. Promising examples are proxy forward models, which directly simulate the proxy records from climate model outputs (e.g. Evans et al., 2013) and data assimilation methods (e.g. Widmann et al., 2010; Goosse et al., 2012b; Steiger et al., 2014). These approaches combine model results and observations to obtain the best estimates of past change and may be most effective at detecting inconsistencies between model and palaeoclimate estimates.
Finally, one could also question the selection of the continental scale as the basis of a comparison, as regional changes are strongly affected by modes of variability such as ENSO, the Southern Annular Mode, the North Atlantic Oscillation, or the Pacific North America pattern. These modes could imprint temperature patterns that are masked by averaging over the continents. On the other hand, model–data comparison made at smaller spatial scales has revealed highly variable and even contradictory results at nearby regions (Moberg et al., 2015), suggesting that a large number of local proxy-data sites are needed for obtaining robust results. Ideally, a sub-regional selection from key teleconnection regions should be used to assess the stability of climate modes (Raible et al., 2014) or enable reliable reconstruction of modes of variability (Lehner et al., 2012; Zanchettin et al., 2015; Ortega et al., 2015), although this requires strong reconstruction skill to be successful (e.g. Russon et al., 2015). Additionally, spatially resolved reconstructions should be targeted because they offer useful potential for dynamic interpretation (e.g. Luterbacher et al., 2004; Mann et al., 2009; Steiger et al., 2014, PAGES 2k Consortium, 2014; Shi et al., 2015).
In summary, our results for the Northern Hemisphere suggest a convergence of our understanding of climate variability over the past 1000 years, but there remain many open questions for the Southern Hemisphere. Progress may be expected from comparing simulations, reconstructions, and observations in the instrumental period, from a better knowledge of internal and forced variability in the ocean, from efforts to understand climate variability via proxy forward modelling and data assimilation, and from a clearer view of the influence of climate modes on temperature variability.
O. Bothe (Max Planck Institute for Meteorology, Hamburg, Germany), M. Evans (Department of Geology & ESSIC, University of Maryland, USA), L. Fernández Donado (Institute of Geosciences (UCM-CSIC), Department Astrophysics and Atmospheric Sciences, University Complutense Madrid, Spain, Renewable Energy Unit, CIEMAT, Madrid, Spain), E. Garcia Bustamante (Department of Physics, University of Murcia, Spain), J. Gergis (School of Earth Sciences, University of Melbourne, Australia), J. F. Gonzalez-Rouco (Institute of Geoscience (UCM-CSIC), Department Astrophysics and Atmospheric Sciences, University Complutense Madrid, Spain), H. Goosse (Earth and Life Institute (ELI)/Georges Lemaître Centre for Earth and Climate Research (TECLIM), Université Catholique de Louvain, Belgium), G. Hegerl (School of GeoSciences, University of Edinburgh, Edinburgh, UK), A. Hind (Department of Physical Geography, Stockholm University, Sweden, Department of Mathematics, Stockholm University, Sweden), J. Jungclaus (Max Planck Institute for Meteorology, Hamburg, Germany), D. Kaufman (School of Earth Sciences & Environmental Sustainability Northern Arizona University, USA), F. Lehner (National Center for Atmospheric Research, Boulder, USA; Climate and Environmental Physics and Oeschger Centre for Climate Change Research, University of Bern, Bern, Switzerland), N. McKay (School of Earth Sciences & Environmental Sustainability Northern Arizona University, USA), A. Moberg (Department of Physical Geography, Stockholm University, Sweden), C. C. Raible (Climate and Environmental Physics and Oeschger Centre for Climate Change Research, University of Bern, Bern, Switzerland), A. Schurer (School of GeoSciences, University of Edinburgh, Edinburgh, UK), F. Shi (Key Laboratory of Cenozoic Geology and Environment, Institute of Geology and Geophysics, Chinese Academy of Sciences, Beijing, China), J. E. Smerdon (Lamont-Doherty Earth Observatory of Columbia University, Palisades, New York, USA), L. von Gunten (PAGES International Project Office, Falkenplatz 16, 3012 Bern, Switzerland), S. Wagner (Institute for Coastal Research, Helmholtz Zentrum Geesthacht, Germany), E. Warren (School of Geography, Earth and Environmental Sciences, University of Birmingham, UK), M. Widmann (School of Geography, Earth and Environmental Sciences, University of Birmingham, UK), P. Yiou (Laboratoire des Sciences du Climat et de l'Environnement, LSCE/IPSL, CEA-CNRS-UVSQ, Université Paris-Saclay, 91191 Gif-sur-Yvette, France), and E. Zorita (Institute for Coastal Research, Helmholtz Zentrum Geesthacht, Germany).
L. Fernández Donado, J. F. Gonzalez-Rouco, E. Garcia Bustamante, H. Goosse, and J. Jungclaus organized the workshop at the origin of this paper. H. Goosse led the synthesis. O. Bothe, H. Goosse, G. Hegerl, A. Moberg, C. C. Raible, A. Schurer, S. Wagner, and E. Zorita coordinated the writing. L. Fernández Donado, E. Garcia Bustamante, A. Hind, F. Lehner, and N. McKay prepared the data sets and made them available to the whole group. O. Bothe, L. Fernández Donado, E. Garcia Bustamante, J. F. Gonzalez-Rouco, A. Hind, F. Lehner, N. McKay, A. Moberg, A. Schurer, S. Wagner, E. Warren, M. Widmann, and E. Zorita created the figures and performed their initial analysis. All authors contributed to the writing of the various sections and reviewed the manuscript.
This study is based on discussions held during the joint workshop of the PAGES 2k network and PAST2k–PMIP Integrated analyses of reconstructions and multi-model simulations for the past two millennia, Madrid, Spain, 4–6 November 2013. PAGES and FECYT (FCT-13-6276) are greatly thanked for supporting this workshop. We acknowledge the World Climate Research Programme's Working Group on Coupled Modelling, which is responsible for CMIP. The US Department of Energy's Program for Climate Model Diagnosis and Intercomparison provides coordinating support for CMIP and led the development of software infrastructure in partnership with the Global Organization for Earth System Science Portals. H. Goosse is Research Director with the Fonds National de la Recherche Scientifique (F.R.S.-FNRS, Belgium). This work is supported by the F.R.S.-FNRS and by the Belgian Federal Science Policy Office (Research Programme on Science for a Sustainable Development). C. C. Raible and F. Lehner are supported by the Swiss National Science foundation. P. Yiou is supported by the MILEX project of the Swedish Research Council. J. Gergis is funded by Australian Research Council project DE130100668. O. Bothe was supported by LOCHMES (Leibniz Society), PRIME-II (within DFG INTERDYNAMIK), and CliSAP. L. Fernández Donado was funded by a FPU grant: AP2009-4061. A. Moberg and A. Hind are supported by the Swedish Research Council grants B0334901 and C0592401. G. Hegerl and A. Schurer are supported by the ERC advanced grant TITAN (320691). L. Fernandez Donado, E. Garcia Bustamente, and J. F. Gonzalez-Rouco were supported by grants CGL2011-29677-CO2-02 and CGL2014-59644-R. Gabi Hegerl was also supported by a Royal Society Wolfson Research Merit Award and by NCAS. This is LDEO contribution number 7956. N. McKay was funded through the US National Science Foundation (ARC-1107869). The article processing charges for this open-access publication were covered by the Max Planck Society. Edited by: P. Braconnot