Continental-scale temperature variability in PMIP3 simulations and PAGES 2k regional temperature reconstructions over the past millennium

Estimated external radiative forcings, model results, and proxy-based climate reconstructions have been used over the past several decades to improve our understanding of the mechanisms underlying ...

PAGES 2k-PMIP3 group: Continental-scale temperature variability also be used jointly to evaluate estimates of climate sensitivity to external radiative forcing (e.g. Hegerl et al., 2006;Braconnot et al., 2012;Masson-Delmotte et al., 2013). Comparisons across many realizations of simulated climate are used to assess the extent to which characteristic climate statistics are accurately simulated, as well as to disentangle unforced and forced patterns (e.g. Hargreaves et al., 2013;Bothe et al., 2013a, b;Neukom et al., 2014;Coats et al., 2015a, b). Estimates of the unforced variability in the climate system may be made from unforced simulations, or from the residual obtained when the forced signal is removed from climate reconstructions, using realistically forced model experiments (Schurer et al., 2013).
Furthermore, simulations can provide the basis for the design of observing network arrays (Comboul et al., 2015). Simulation results also provide a test bed for palaeoclimatic reconstruction algorithms within so-called pseudo-proxy experiments (e.g. Zorita et al., 2003;Hegerl et al., 2007;Smerdon, 2012;Lehner et al., 2012;Tingley et al., 2012;Wang et al., 2014;Smerdon et al., 2015b). All of these purposes, which are also pursued within the historical period and with comparison to direct climate observations (Bindoff et al., 2013;Ding et al., 2014), are potentially extended by the longer time interval made possible by analysis over the past millennium.
However, obtaining unequivocal conclusions from the comparison between reconstructions and simulation results over the past millennium remains difficult due to uncertainties in climate and forcing reconstructions, the simplified world represented by climate models, and the relatively weak forced signal in the pre-industrial part of the past millennium compared to internal climate variability (e.g. Moberg, 2013). Reconstructions and simulations are two different representations of the behaviour of the actual climate system, and this creates multiple uncertainties in the task of intercomparison. Simulations have uncertain forcings (Schmidt et al., 2011(Schmidt et al., , 2012, and models contain parameterized or uncertain representation of the physics, chemistry, biology, and interactions within the climate system (Flato et al., 2013). Furthermore, computational constraints impose a limited spatial resolution or a deliberate omission of some known processes in order to perform simulations at global scale that cover several centuries (e.g. Goosse et al., 2005;Schurer et al., 2013;Phipps et al., 2013) The uncertainty in palaeoclimatic reconstructions is not always well understood either, and estimating its magnitude is challenging. For regional-to large-scale temperature reconstructions, uncertainty can be caused by random or systematic error in the proxy measurement, inadequate understanding of the proxy system response to environmental variation, differences in fields derived from instrumental records selected to calibrate the records, changes in the spatiotemporal and data type availability across the observational network, and reconstruction methods (e.g. Jones et al., 2009;Smerdon et al., 2010;Smerdon, 2012;Emile-Geay et al., 2013;Evans et al., 2013;Wang et al., 2014;Comboul et al., 2015).
The non-climatic noise in reconstructions has a significant influence on model-data comparison. This may first have an impact on the variance of the reconstructed climatic signal itself, although this is dependent on the actual choice of calibration method (e.g. Hegerl et al., 2007;Christiansen et al., 2009;Mann et al., 2009;Smerdon et al., 2010;Smerdon, 2012). Furthermore, the non-climatic noise can mask real relationships between climate variations in different regions, or obscure the responses to forcing, which are clearer in models because of the absence of this noise.
Acknowledging the considerable uncertainty in palaeoclimatic reconstructions, the earliest comparisons of past millennium simulations and reconstructions focused on hemispheric-and global-scale changes, using a single, often simple, climate model driven by globally uniform external radiative forcing estimates (e.g. Crowley, 2000;Bertrand et al., 2002). Later, simulations with more comprehensive models (e.g. Gonzalez-Rouco et al., 2006;Amman et al., 2007;Tett et al., 2007) refined the conclusions reached previously and enabled regional-and continental-scale analyses. They underscored the potential role of the spatial distribution of some forcings, such as land use and of the dynamic response of the atmospheric circulation (e.g. Luterbacher et al., 2004;Raible et al., 2006;Goosse et al., 2006;. Changes in the latter may be driven by the forcings (e.g. Shindell et al., 2001;Mann et al., 2009) or be a signature of internal variability in the climate system (e.g. Wunsch, 1999;Raible et al., 2005).
State-of-the-art climate models reasonably simulate properties of internal variability, such as teleconnection patterns or the probability of a particular event (e.g. Flato et al., 2013). However, they are not expected to reproduce the part of the observed time trajectory that is not directly constrained by external forcing because of the non-linear, chaotic nature of the system (Lorenz, 1963). This makes model-data comparison a complex issue when using a single simulation, because differences between model results and reconstructions may be due to a model or reconstruction bias, but may also simply reflect a different sample of internal variability (defined here as the fraction of climatic variability that is not due to changes in external forcings).
Indeed, comprehensive climate models have their own internal climate variability and, if a model represents the real world in a satisfactory way, the observed trajectory would just be one among all potential model realizations. The issue may be addressed by analysing an ensemble of simulations, which provides information on the range that can be simulated by one single model (e.g. Goosse et al., 2005;Yoshimori et al., 2005;Jungclaus et al., 2010;Moberg et al., 2015) or a set of models (e.g. Jansen et al., 2007;Lehner et al., 2012;Fernández-Donado et al., 2013, Bothe et al., 2013b. The reconstruction then needs to be compatible with this range, at least when considering all the uncertainties, to Clim. Past, 11, 1673-1699 www.clim-past.net/11/1673/2015/ claim consistency between simulations and reconstructions, whereby such a compatibility can be defined in various ways, as discussed below. Fernández-Donado et al. (2013) reviewed results from 26 climate simulations with 8 atmosphere-ocean general circulation models (AOGCMs), reflecting the state of modelling before the CMIP5/PMIP3 (Coupled Model Intercomparison Project Phase 5/Paleoclimate Modelling Intercomparison Project Phase 3). These pre-CMIP5/PMIP3 simulations were driven by a relatively wide range of choices for boundary conditions and forcing agents. For the Northern Hemisphere surface temperature variations, Fernández-Donado et al. (2013) found an overall agreement within the temporal evolution but still noted discrepancies between simulations and hemispheric and global temperature reconstructions. For example, the period between around 850 and 1250 CE is warmer in the reconstructions than in the simulations (see also Jungclaus et al., 2010;Goosse et al., 2012b;Shi et al., 2013).
Additionally, a comparison of the simulated changes in the temperature fields from this warm period and the colder period around 1450-1850 showed little resemblance to the field reconstruction by Mann et al. (2009), but the spatial reconstructions themselves have significant uncertainties (e.g. Wang et al., 2015). These two relatively warm and cold periods are often referred to as the Medieval Climate Anomaly (MCA), and the Little Ice Age (LIA), respectively, although their exact timing has been debated and the adequacy of their names has been questioned (e.g. Jones and Mann, 2004;PAGES 2k Consortium, 2013).
The assessment of information from palaeoclimate archives (Masson-Delmotte et al., 2013) in the IPCC Fifth Assessment Report partly followed the approach applied by Fernández-Donado et al. (2013). Masson-Delmotte et al. (2013) included a preliminary analysis of the more recent CMIP5/PMIP3 "past1000" simulations, which were coordinated more closely than previous experiments, particularly in regard to the choices of forcings (Schmidt et al., 2011(Schmidt et al., , 2012. They came to similar conclusions as Fernández- Donado et al. (2013): the reconstructed MCA warming is greater than simulated but not inconsistent within the large uncertainties.
Agreement between palaeoclimate reconstructions and simulations has also been assessed by compositing the response to individual forcing events (e.g. Hegerl et al., 2003Luterbacher et al., 2004;Stenchikov et al., 2006;Masson-Delmotte et al., 2013). The reconstructed and simulated response to volcanic forcing agrees in magnitude on multi-decadal timescales. Detailed comparisons of observations around the 1815 Tambora eruption indicate that the simulated cooling is larger than in instrumental observations or in reconstructions (Brohan et al., 2012), but a significant part of the discrepancy might be due to forcing uncertainties.
For the solar forcing, direct comparisons between simulations and reconstructions are inconclusive regarding whether simulations that use either moderate or weak variations of to-tal solar irradiance provide generally better agreement with reconstructions 2013;Fernández-Donado et al., 2013). This has been confirmed at hemispheric and regional scales by Hind and Moberg (2013) and Moberg et al. (2015), using appropriately designed statistical tests of temporal correlation and quadratic distance between reconstructions and simulations .
The cause of past climate change in the Northern Hemisphere, specifically the contribution by individual forcings to a climatic event, can be estimated using detection and attribution techniques. These techniques allow for the possibility that the reconstructions contain forced signals of larger or smaller magnitude than simulated (e.g. due to forcing uncertainty, uncertainty in a models transient response, or uncertainty in calibration of reconstructions). The results show that the response to volcanic eruptions can be clearly detected in reconstructions, consistent with epoch analysis results, and also confirm that the signal is generally larger in magnitude in the simulations (Hegerl et al., 2003(Hegerl et al., , 2007Schurer et al., 2013), although the discrepancy may be within the range of volcanic forcing uncertainty. The response to solar forcing cannot be reliably separated from internal variability, but very high solar forcing such as that reconstructed by Shapiro et al. (2011) needs to be significantly scaled down to match reconstructions even given large reconstruction uncertainties (Schurer et al., 2014). Within the LIA, detection and attribution methods show that volcanic forcing is critical for explaining the anomalous cold conditions (Hegerl et al., 2007;Miller et al., 2012;Lehner et al., 2013;McGregor et al., 2015) and that there is also weak evidence for a contribution from a small but long-lived decrease in CO 2 concentration (e.g. MacFarling Meure et al., 2006;Schurer et al., 2014).
In particular, the recent consolidation of Southern Hemisphere palaeoclimate data (Neukom and Gergis, 2012) led to the comparison of a hemispheric temperature reconstruction with a suite of 24 climate model simulations spanning the past millennium (Neukom et al., 2014). This study reported considerable differences in the 1000-year temperature reconstruction ensembles from the Northern and Southern Hemisphere. An extended cold period (1590-1670s CE) was observed in both hemispheres, while the current (post-1974) warm phase is found to be the only period of the past millennium where both hemispheres experienced simultaneous www.clim-past.net/11/1673/2015/ Clim. Past, 11, 1673-1699, 2015 PAGES 2k-PMIP3 group: Continental-scale temperature variability warm anomalies (Neukom et al., 2014). Their analyses also suggested that the simulations underestimate the influence of internal variability in the ocean-dominated Southern Hemisphere (Neukom et al., 2014).
While several studies have provided valuable advances in our understanding of hemispheric-scale climate dynamics, this brief overview indicates that observed and simulated palaeoclimate variations at regional and continental scales have not been thoroughly compared up to now. This was the goal of a workshop joining the PAGES 2k and PMIP3 communities in Madrid (Spain) in November 2013, using a recent set of continental-scale temperature reconstructions (PAGES 2k Consortium, 2013) and a collection of state-ofthe-art model simulations driven by realistic external forcings (Schmidt et al., 2011(Schmidt et al., , 2012. On the basis of the discussions held during this workshop, the aim of this study is to systematically estimate the consistency between the simulated and reconstructed temperature variations at the continental scale and evaluate the origin of observed and simulated variations. This study is motivated by the following key science questions: 1. Are the statistical properties of surface temperature data for each individual continent-scale region consistent between simulations and reconstructions?
2. Are the cross-regional relations of temperature variations similar in reconstructions and models?
3. Can the signal of the response to external forcing be detected on continental scale and, if so, how large are these signals?
Section 2 first presents a brief overview of the PAGES 2k reconstructions and simulations analysed here. In addition to a selection of PMIP3 simulations, some numerical experiments that did not follow the PMIP3 protocol were also analysed, mainly to include model runs with larger solar forcing amplitude. We use several statistical methods to achieve robust results in answering the key science questions above. They are listed at the end of Sect. 2. Each methodology is briefly described when it is applied while some specific implementation information is provided in Supplement Sect. S2. In Sect. 3, each continental-scale region is studied separately to determine whether the reconstructed and simulated time series have similar characteristics, in terms of the magnitude and timing of the observed changes as well as the spectral distribution of the variance. Section 4 investigates whether the inter-regional patterns of temperature variability are similar in the reconstructions and simulations. The role of the external forcings in producing the observed variations is presented in Sect. 5. Section 6 provides a discussion of our results, their limitations, and how our conclusions compare to previous studies. Finally, Sect. 7 summarizes the main findings and provides perspectives for future developments. Several additional analyses are provided as a supplement for completeness and further reference.

PAGES 2k reconstructions
The PAGES 2k Consortium (2013) generated temperature reconstructions for seven continental-scale regions (Fig. 1). The proxy climate records found to be best suited for reconstructing annual or warm-season temperature variability within each continental-scale region were identified. Expert criteria for the adequacy of proxies were a priori specified (PAGES 2k Consortium, 2013). The resulting PAGES 2k data set includes 511 time series from different archives including tree rings, pollen, corals, lake and marine sediment, glacier ice, speleothems, and historical documents. These data record changes in biological or physical processes and are used to reconstruct temperature variations (all data are archived at https://www.ncdc.noaa.gov/cdo/f?p=519:2: 0::::P1_study_id:12621). The PAGES 2k reconstructions have annual resolution in all regions except North America, which has one 780-yearlong tree-ring-based reconstruction (back to 1200 CE with 10-year resolution) and one 1400-year-long pollen-based reconstruction (back to 480 CE with 30-year resolution). These latter two reconstructions therefore are smoothed differently and they are either excluded from the analysis or treated in slightly different ways in some comparisons. The reconstruction for the Arctic region used in this study is based on a revised version (v1.1) of the PAGES 2k data set (McKay and Kaufman, 2014).
Each regional group tailored its own procedures to their local proxy records and regional calibration targets (PAGES 2k Consortium, 2013). Thus, each continental-scale temperature reconstruction was derived using different statistical methods. In short, most groups used either a scaling approach to adjust the mean and variance of a predictor composite to an instrumental target, or a regression-based technique to extract a common signal from the predictors using principal components or distance weighting. Thus, some of the observed region-to-region differences between simulations and reconstructions might be due to the differences in reconstruction methods. Nevertheless, alternative reconstructions for all regions based on exactly the same statistical procedures were also produced and were found to be similar to the PAGES 2k temperature reconstructions provided by each group (PAGES 2k Consortium, 2013). Each regional group also used individually selected approaches to assess the uncertainty of their temperature reconstructions, designed to quantify different aspects of the uncertainty. For example, some regions primarily quantified uncertainties associated with the set of records used in the reconstruction and their agreement through time, which can reflect within-region variability as well as uncertainty (Arctic, North American tree rings). Other regions focused on uncertainties associated with how closely the proxy resembles temperatures (Asia, Antarctica, Europe, North American pollen), and some regions incorpo-  rated both of these types of uncertainties (Australasia, South America). Uncertainty estimates in all of the regions except for Antarctica vary through time depending on the set of records available for any given interval and their agreement. All uncertainty estimates that assess how well the proxy data reproduce observed temperatures are based on the assumption that the modern proxy-temperature relation is stationary into the past, and that the agreement between proxy data and temperature on short timescales can be used to infer uncertainty at lower frequencies.

Climate model simulations
The climate model simulations used in this study are listed in Table 1, summarizing model specifications such as resolution, forcing applied to the transient simulations, and length of pre-industrial control simulations (piControl). These simulations include contributions to the third Palaeoclimate and the fifth Coupled Modelling Intercomparison Projects (PMIP3: Braconnot et al., 2012;CMIP5: Taylor et al., 2012) from six models (CCSM4, CSIRO-Mk3L-1-2, GISS-E2-R, HadCM3, IPSL-CM5A-LR, MPI-ESM-P), as well as a more recent simulation with CESM1, and the COSMOS pre-PMIP3 ensemble with ECHAM5/MPIOM (see also Table S1 in the Supplement). The experiments were selected among available pre-PMIP3 and PMIP3 simulations on the basis of specific criteria: the conditions were that (i) they run continuously from 850 to 2000 CE; (ii) they include at least solar, volcanic aerosol, and greenhouse gas forcing (S, V, G in Table 1); (iii) they use a plausible solar forcing reconstruction with an amplitude within the range that is consistent with recent understanding; and (iv) they do not display a large unphysical drift over the simulated period.
PMIP3 simulations all comply with criteria (ii) and (iii) as they follow the recommendation of Schmidt et al. (2011) by using an increase in total solar irradiance (TSI) from the late Maunder Minimum period to the present day of ∼ 0.10 %. Nevertheless, some PMIP3 simulations were excluded from the analysis, as the simulations presented clear incompatibilities with the rest of the ensemble. For instance, the MIROC simulation displays a trend in the global annual mean temperature over the whole millennium that is not compatible with the present understanding of the past millennium climate. It has been considered here as a likely model artefact that could also affect regional and seasonal temperatures in unknown ways. Contrary to the GISS model, this drift is not clearly understood and no control run is available to statistically correct it. The simulation with bcc-csm-1 was discarded because of potentially unphysical large anomalies in some regions. FGOALS-gl was not used due to the unavailability of a continuous run from 850 to 2000, as the so-called "past1000" simulation covers only the years 850-1850 under the PMIP protocol.
Most non-PMIP3 simulations did not comply with at least one the criteria above. Nevertheless, experiments performed with two models (ECHAM5/MPIOM and CESM1) follow all of them. They include simulations with a stronger solar forcing than in the PMIP3 ensemble. A three-member ensemble with ECHAM5/MPIOM uses a TSI reconstruction with an increase of ∼ 0.24 % (COSMOS E2), while CESM1 uses a TSI reconstruction with an increase of ∼ 0.20 %. No simulation used in this study incorporates the much larger increase of ∼ 0.44 %, suggested by Shapiro et al. (2011), which results in simulations that are inconsistent with reconstructed large-scale temperatures (Feulner, 2011 Berger (1978); 61: Bretagnon andFrancou (1988). et al., 2014). The COSMOS simulations deviate from the PMIP3 protocol because they included an interactive carbon cycle with CO 2 concentration as prognostic variable. While simulated and reconstructed CO 2 evolution diverge during some periods, the differences have only a marginal effect on simulated temperatures (Jungclaus et al., 2010).
Consequently, the group of simulations analysed here is not strictly based on the PMIP3 ensemble. Nevertheless, as we use a majority of PMIP3 simulations and additional simulations that follow an experimental design similar to PMIP3, we will keep the reference to PMIP3 for simplicity.
The variable extracted from the simulation outputs is the monthly mean surface air temperature (labelled "tas" in the Climate Model Output Rewriter framework of CMIP5). These temperature fields were then used to create areaaveraged time series that matched the domain and seasonal window of each of the PAGES 2k regional reconstructions (see Supplement Sect. S1).

Statistical methods
Several climate model-palaeoclimate data comparison and analysis methods are used in this study to verify the robustness of the results generated by each method and to provide a comprehensive guide for future work. Model-data comparisons need to account for uncertainties in climate reconstructions, in forcing reconstructions, and in the response to forcings in model simulations. These approaches also must recognize that the real climate, and hence the reconstructions, and individual climate model simulations include their own individual realizations of internally generated variability. Therefore, perfect agreement between model simulations and data can never be expected when directly comparing time series.
The first group of methods is focused on the first question raised in the introduction. The goal is to assess whether temperature reconstructions have similar statistical properties compared to simulations. This is initially done by simple analysis of the time series, such as estimates of the variance (Sect. 3.1). The spectral properties are then analysed (Sect. 3.2) before the probabilistic and climatological consistency (Sect. 3.3) and the skill of the various simulations (Sect. 3.4). The second question dealing with the cross-regional variations in temperatures is addressed by discussing the correlation between regions (Sects. 4.1 and 4.3) and through a principal component analysis (Sect. 4.2). Finally, the third question about the role of the forcing is studied by means of a superposed epoch analysis (Sect. 5.1) by applying a statistical framework involving correlation and distance metrics (Sect. 5.2) and detection and attribution techniques (Sect. 5.3). For more details on those methods, see Sect. S2.
In the majority of the analyses presented in this manuscript, anomalies compared to the mean over the whole period covered are used and the time series are smoothed or Clim. Past, 11, 1673-1699 www.clim-past.net/11/1673/2015/ temporally averaged, using either a 23-point Hamming filter or non-overlapping 15-year averages, depending on the requirements of the various techniques (both methods give a similar degree of low-pass filtering). This is motivated by the relatively weaker skill of some reconstructions to replicate observed records on interannual timescales (Cook et al., 2004;Esper et al., 2005;D'Arrigo et al., 2006) and by the fact that the main focus here on decadal to centennial timescales. The full period analysed is 850-2005 CE, although different periods are chosen for some analyses because of data availability, the choice of the temporal filtering, other technical restrictions, or to analyse sub-periods.

Regional analysis
To begin, the agreement between simulations and reconstructions for individual regions is described qualitatively, using a simple visual comparison of the time series, and then quantitatively by calculating spectra, consistency, and skill metrics. The correlations between the time series are presented in the Supplement ( Fig. S1 and Supplement Sect. S3). Overall, the analyses in this section illustrate the potential of identifying common signals in both data sets. The different diagnostics are presented here separately, whereas the conclusions derived from the results of the different analyses are compared and discussed in more detail in Sect. 6.

Observed and simulated time series
Figure 1 shows the regional time series in the forced simulations with each regional temperature reconstruction. To the right of each time series graph, the magnitude of variability in unforced simulated temperatures is illustrated by calculating the standard deviation of pre-industrial control simulations in each model. The unforced variability is generally similar in all models in all the regions, with weaker amplitudes in Australasia and Asia. Note that some regions cover only land areas, while others have an oceanic fraction (see Supplement Sect. S1), with a potential impact on the magnitude of the estimated variability. Most reconstructions show a tendency of a gradual cooling over the millennium, followed by recent warming. Notable common features among regions on decadal timescales are the pronounced negative anomalies related to large tropical volcanic eruptions in the simulations. This is most obvious for the eruptions in the 1250s, 1450s, and 1810s. Among the temperature time series, a larger response to volcanic eruptions is noticeable in the CESM, MPI, and CCSM4 simulations. The regional temperature reconstructions rarely capture the first two of these anomalies or only register them at smaller amplitudes. Only the early 19th century eruptions are clearly reflected in many regions, and are most pronounced in the Northern Hemisphere reconstructions. The reconstruction for Europe also shows a negative anomaly coinciding with the effect of the 1450s eruption, with an amplitude comparable to that seen in some of the simulations. Figure 1 suggests that the temperature reconstructions show slightly more centennial to multi-centennial variability than the models over the full period with stronger longterm trends, while several model results indicate a stronger recent warming compared to some of the reconstructions. The reconstruction uncertainty bands provided with the original PAGES 2k reconstructions encompass the simulated series with few exceptions, in particular the Arctic and North America during the 1250s. The published uncertainty estimates have been calculated using different methods for the various continental-scale regions, as detailed in the Supplement of PAGES 2k Consortium (2013). Furthermore, those uncertainties are only valid at the original temporal resolution, which is annual in all cases except for North America. It is expected that the reconstruction uncertainty decreases at lower resolution, or after smoothing as in our case. This is consistent with the lower uncertainty ranges for the lowresolution pollen-based reconstruction.
However, estimating the reduction of the uncertainty due to smoothing is not straightforward (e.g. Moberg and Brattström, 2011;Franke et al., 2013) as the resulting uncertainty magnitude is also dependent on autocorrelation of the non-climatic noise in proxy data. The extreme hypothesis, considering that the error is constant in time and that the errors are uncorrelated, would lead to a decrease proportional to 1 over the square root of the number of samples included in the average. For a smoothing similar to 15-year averaging, as performed herein, the approximation that likely leads to an underestimation of the uncertainties would correspond to a decrease by a factor of about 4 compared to the original error estimate. This suggests very small errors for most reconstructions. In this case, the major discrepancies between the reconstructions and model results would occur at the same time as mentioned above; however, periods when the models are out of the range of the reconstruction uncertainty bands would be more common at the decadal scale.
For North America, the long-term multi-centennial trend appears to be similar between the pollen based reconstruction and simulations, except for the last ∼ 200 years, when some simulations show much stronger warming than is present in the reconstruction. This warming feature is somewhat stronger in the tree-ring-based reconstruction than in the pollen-based reconstruction but is nevertheless weaker than in some simulations. The COSMOS simulations appear to be collectively colder than this reconstruction in the late 20th century. Although the European temperature reconstruction and simulated series disagree substantially in some parts of the 12th century and for the last ∼ 200 years, there are otherwise strong similarities, particularly during periods of large volcanic eruptions. Simulated and reconstructed Arctic series show large decadal to centennial variability, but the timing of these variations does not agree well. Therefore, simulations are often outside the reconstruction's uncertainty range. Simulated and reconstructed temperatures show only weak long-term trends in Asia, but decadal variability appears to be larger in the reconstruction. Simulations generally differ from the reconstruction in the last 200 years and show either much weaker or much stronger trends. In Australasia, the weak forced variability common to all simulations may be due to the large spatial extent of the domain, which includes large oceanic areas that may dampen the forced highfrequency variability. For the recent warming, the trends in CESM, CCSM4, IPSL, and the COSMOS simulations are considerably stronger than the Australasian temperature reconstruction. The temperature reconstruction for South America is often near the upper or lower limit of the simulation ensemble range and displays more centennial-scale variability than the simulations. In Antarctica, the reconstruction has a clear long-term negative trend and only a modest warming in the 20th century, while the simulations show nearly no long-term cooling but agree on the warming onset in the beginning of the 20th century.

Spectral analysis
Next, we consider the agreement between simulated and reconstructed temperature data in terms of their spectral densities, which show how temperature variances are distributed over frequency ( Fig. 2; see also Fig. S2). Spectra were computed using the multi-taper method (Thomson, 1982;Percival and Walden, 1993), with its so-called time-bandwidth product being set to 4. Consequently, each calculated spectrum is an average of seven statistically independent spectrum estimates. Spectra for the reconstructions are illustrated with their 95 % confidence intervals, while model spectra are plotted with single lines. The analysis is made at the original time resolution using all existing data points in the time frame 850-2005. The degree of agreement between model and reconstruction spectra differ substantially between regions, with the Arctic showing the best agreement at all frequencies and South America showing the worst. In the latter, most model spectra lie in the reconstruction confidence interval only in a narrow frequency band corresponding to about 100-to 150year periods. The agreement is generally good for the Arctic, Europe, and Asia at multi-decadal timescales (20-50 years) for many regions. Nevertheless, many models have systematically less variance in the 50-to 100-year band and most models have more variance than the reconstructions at higher frequencies.
Pronounced differences of high-frequency variance is seen for all Southern Hemisphere regions. In particular, the pre-PMIP3 COSMOS simulations show significantly too much variance at timescales of 3 to 5 years for Australasia and to Hamming window, as in many other analyses in this study. The multi-taper method (Thomson, 1982;Percival and Walden, 1993) was used, with the time-bandwidth product set to 4 and with longterm averages subtracted before estimating the spectra. Units are temperature variance ( • C 2 or K 2 ) per frequency (year −1 ). a lesser degree for South America and Antarctica. This property has previously been related in regions with strong influence from tropical Pacific variability to this model's ENSO variability (Jungclaus et al., 2006;Fernández-Donado et al., 2013). Most model spectra for North America lie within the confidence interval of the tree-ring-based reconstruction spectrum, although several models have somewhat less variance than this reconstruction at periods longer than 50 years. The North America pollen-based reconstruction behaves as a roughly 150-year low-pass-filtered series and has signifi- cantly less variance than the corresponding tree-ring-based record at all frequencies for which both spectra are defined.

Consistency estimate
The probabilistic and climatological consistency of PMIP3 simulations and PAGES 2k reconstructions was assessed following the framework of Annan and Hargreaves (2010;and references therein;Hargreaves et al., 2011Hargreaves et al., , 2013 and Marzban et al. (2011), respectively. The current application is based on Bothe et al. (2013a, b). The underlying null hypothesis follows the paradigm of a statistically indistinguishable ensemble (Annan and Hargreaves, 2010;Rougier et al., 2013), i.e. the validation target, represented here by the temperature reconstructions, and the model simulations are samples from a common distribution and are therefore exchangeable.
Climatological consistency refers to the similarity of the climatological probability distributions of reconstructions and of simulations over a selected period, either the whole millennium or sliding sub-periods. We analyse climatological consistency by comparing individual simulated series with the target (i.e. the reconstructions) to identify deviations in climatological variance and possible biases between them. To achieve this goal, Marzban et al. (2011) proposed the use of residual quantile-quantile (r-q-q) plots that should be approximately flat for consistent series (Sect. S2.1).
Probabilistic consistency refers to the position of the reconstruction in the range spanned by the ensemble of simulations. Histograms of the ranks should be flat under exchangeability (Sect. S2.1) -i.e. estimated frequencies of the verification target and the ensemble agree if the simulation ensemble is probabilistically consistent with the temperature reconstructions (Murphy, 1973).
As there are large uncertainties in palaeoclimate reconstructions, it is necessary to take into account these uncertainties in the evaluation of the consistency of the ensemble of climate model simulations (Anderson, 1996). This is achieved by inflating the model simulations results by adding noise with amplitudes that are proportional to published uncertainty estimates from the original temperature reconstructions.
We assess probabilistic and climatological consistency based on non-overlapping 15-year averages centred on the full period considered, except for the North American temperature reconstruction, where non-overlapping 30-year averages are used for the pollen-based reconstruction, and 10year averages for the tree-ring-based reconstruction. The results are presented in Figs. 3, S3, and S4 for all regions.
The regions selected for Fig. 3 are chosen to provide a contrasting example. Two estimates of the uncertainties are used. First, the uncertainties provided with the original reconstruction are applied, which is an overestimation for the smoothed time series. Second, at the other extreme, the uncertainties www.clim-past.net/11/1673/2015/ Clim. Past, 11, 1673-1699 are assumed to be equal to zero and are thus known to be underestimated. A third estimate of the uncertainty is provided in the Supplement figures, using an uncertainty measure equal to the one provided in the original publication divided by a factor of √ 15 to account for the smoothing (see Sect. 3.1). This leads to results that are generally very similar to the case where uncertainty is assumed to be zero.
The simulations in most cases lack climatological consistency with the reconstructions (Figs. 3 and S3). The simulated quantiles can deviate strongly from the reconstructed quantiles. Specifically, the simulated distributions are generally over-dispersive when using the original estimates of uncertainties. The differences are much smaller when uncertainties in reconstructions are neglected, although extremes often remain overestimated. The Arctic and the North American tree-ring-based reconstruction are exceptions as some simulations are climatologically consistent with the reconstruction and display only small differences between simulated and reconstructed quantiles for all estimates of the uncertainty. Consistency is reduced for those simulations that show larger variability (recall Fig. 1) as is the case of the CCSM4 and CESM models.
In agreement with the climatological assessment, the simulated results generally lack probabilistic consistency with the reconstructions when the original uncertainty is considered (Figs. 3 and S4). The target data are too often in the central ranks, indicating that the probabilistic distribution of the ensemble is too wide and shows significantly overdispersive spread deviations. The only exception is the North American region using the tree-ring-based reconstruction. The most prominent differences are found in the Antarctic region, where the simulation ensemble spread deviates considerably from reconstructed temperatures (Fig. S4), but strong ensemble spread deviations relative to the pollen reconstruction for North America are also evident.
This assessment of the probabilistic consistency strongly depends on the estimate of the uncertainty of the reconstruction. If we do not add noise to the model time series to reflect error in reconstructions before the ranking and thereby neglect reconstruction uncertainty, or if we assume a strong reduction of the error in reconstruction at the decadal time scale because of the smoothing, the ensemble appears to be consistent with a number of regions or even under-dispersive for others. However, ignoring the uncertainty in such a manner may lead to an overconfident assessment of consistency between simulation ensemble and reconstruction. Nevertheless, because the uncertainties are not well known, over-dispersion does not necessarily weaken the reliability of the ensemble relative to the target, but instead may highlight insufficiently constrained uncertainties in the reconstruction.

Skill estimate
The skill of the simulations is assessed using a metric introduced by Hargreaves et al. (2013). The idea of skill stems from weather forecasting and refers to the ability of a simulation to represent a target better than some simple reference values. For instance, in weather forecasting, a standard reference is to assume no change compared to initial conditions (i.e. persistence). A forecast has a positive skill if it is closer to the observed changes than this simple reference. The skill S, as in Hargreaves et al. (2013), is then where F i is the simulation result at each data point, O i is the reconstruction data, R i is the reference (for instance a constant climate here), and e i is uncertainty of the target. The square-root expression becomes undefined when either the actual simulation or the reference is better than the upper possible agreement level indicated by the errors. Uncertainty estimates are derived from the originally reported uncertainties in regional temperature reconstructions given by PAGES 2k Consortium (2013). If reconstructed error estimates are realistic, we do not expect the simulations to fit the target better than these uncertainty estimates. As for the consistency analyses, the skill analysis is calculated using temperature anomalies from the long-term averages within each analysis period. Figure 4 presents the skill for the Arctic and Antarctica, as an example, with the other PAGES 2k regions displayed in Fig. S5. In this estimate, we use a no-change reference forecast (i.e. the reference is the climatology) as there is no clear a priori evidence that the climate at one particular time during the past millennium is warmer or colder than the mean. Positive values suggest that the simulations is in better agreement with (i.e. closer than) the regional reconstructions than this reference. Results are presented for dates when no data are missing in four periods: 850 to 1350, 1350 to 1850, 850 to 1850, and the full period 850 to 2000. As in Sect. 3.3, we compute the skill in Fig. 4 using the uncertainties provided with the original reconstruction, as well as a case that assumes the uncertainties are negligible (i.e. assuming e 2 i = 0 in Eq. (1) of Sect. 3.4). Additionally, the skill is computed assuming a reduction by a factor of √ 15 in the Supplement figures.
The most notable result is that the skill measure is generally undefined when using the uncertainties provided with the original reconstruction: either the reference or the simulated data are closer to the reconstruction than uncertainty allows, leading to the square root of a negative number in Eq. (1). This confirms that uncertainties in the reconstructions are potentially an overestimation for smoothed time series. When ignoring uncertainties, the 15-year non-overlapping means of the simulations rarely display skill. Simulation skill appears to be most likely for the European and Arctic regions, while positive skill is nearly absent for the Southern Hemisphere regions and North America in all the models.

Links between the different regions
The structure of the spatial variability, i.e. the spatial covariance of temperature changes, contains contributions from forced signals and from teleconnections in the internal climate variability. The PAGES 2k temperature reconstructions help to investigate the consistency between simulations and reconstructions with respect to this covariance structure. In the following sections, this is evaluated using spatial correlations, principal components (PCs) and empirical orthogonal functions (EOFs), and correlations over sliding temporal windows.

Spatial correlation
The spatial correlation matrix of simulated temperature for the PAGES 2k regions is compared to the correlation matrix of the PAGES 2k reconstructions (Figs. 5 and S6). Correlations are calculated for detrended continental mean time series filtered with a 23-year Hamming window and based on the continents for which these are available, which excludes North America. We use the longest common period for forced simulations and reconstructions, which for the filtered data is 1012-1978 CE (1000-1990 CE for annual data). To disentangle the contributions from forcings and from internal variability, we analysed forced simulations for the en- tire analysis period, forced simulations for the pre-industrial period (before 1850 CE), and unforced control simulations.
MPI-ESM-P is used to illustrate our main findings in Figure 5 (see Fig. S6 for the other models). Correlations in the forced MPI-ESM-P simulation for the whole period are higher than 0.6 between nearly all regions. In contrast, the correlations for the PAGES 2k temperature reconstructions are rather low, which indicates a substantial inconsistency between the correlation structure in the models and in the www.clim-past.net/11/1673/2015/ Clim. Past, 11, 1673-1699, 2015 PAGES 2k temperature reconstructions. The potential causes of this discrepancy will be discussed in Sect. 6, but we must reiterate here that, in contrast to other analyses presented above, the evaluation of the spatial correlation does not take into account any uncertainty in the reconstruction. Any non-climatic noise related to the characteristics of the proxy records selected or differences in the reconstruction method between regions would decrease the correlation, contributing to lower values than for the model results.
The correlations in the simulations are lower if only the pre-industrial period is considered, and close to zero in the control simulations. The simulated high correlations for the last century are likely to be a consequence of the rather homogeneous and strong anthropogenic warming in the simulations. The high correlations for the pre-industrial forced runs show that the response to volcanic forcing, solar forcing land use, and/or orbital forcing also substantially contributes to the correlations at the timescales considered. Low values obtained for the control simulations indicate that teleconnections between continents are weak for simulated internal variability.
Although these general characteristics are present in many of the models evaluated here, there are some differences among them. In particular, some of the models that show higher correlations during pre-industrial times (e.g. CESM) also display a large response to volcanic forcing compared to the other members of the ensemble . Additionally, the specific characteristics of some regions may differ substantially. For instance, the correlation between Antarctic temperatures and other regions is very low in MPI-ESM-P or IPSL-CM5A-LR for pre-industrial conditions, while it is much larger in CCSM4 and CESM. This can be attributed to a different ratio of forced versus unforced variability, and in particular to discrepancies in the magnitude of the response to external forcing in the selected models. Figure 6a shows the loadings of the first EOF on each region for the PMIP3 forced simulations and the PAGES 2k reconstructions (with corresponding results for the GISS and COS-MOS ensembles presented in Sect. S4 and Fig. S7). Most models show similarities in the loadings, which indicates that the different regions covary similarly in the different models. All loadings are positive, and thus the first principal component (PC) is only a weighted mean of all continental temperature series.

Principal component analysis
Consequently, the time series of the first PC of the PMIP3 simulations and PAGES 2k temperature reconstructions (Fig. 6b) reflect the main features of the individual original series (particularly for Northern Hemisphere regions); namely a temperature decline after around 1200 CE, which lasts until the early 1800s, followed by the sustained warming within the 19th and 20th century. Additionally, the influ- ence of volcanic eruptions on reconstructed temperatures is visible during some periods, especially during the mid-13th century (although not in the reconstructions), the mid-15th century, and the beginning of the 19th century.
In most models, the first EOF explains about 80-90 % of the total variance, whereas the leading EOF in the PAGES 2k temperature reconstructions accounts for only 55 % of the total variance. This shows that the covariance structure is less complex in the simulations. This is consistent with the larger correlations between regions found in Sect. 4.1, which means that the leading mode of homogeneous warming or cooling dominates the covariance structure in model results. In a few simulations (HadCM3, COSMOS), however, the vari-Clim. Past, 11, 1673-1699, 2015 www.clim-past.net/11/1673/2015/ ance explained by the first EOF is about the same as in the reconstructions.
The largest values for the loadings are found for the Arctic region, due to the high temperature variability in the last 1200 years in this region. This expression of the classical Arctic amplification is reflected in most models and in the reconstructions. The ocean-dominated regions of the Southern Hemisphere show less pronounced variability relative to the Northern Hemisphere, consistent with the results of Neukom et al. (2014).
If the analysis is performed over the pre-industrial period only (Fig. S8), similar conclusions are reached but the loadings are smaller, especially over the Arctic, and the amount of variance represented in the leading EOF generally decreases, indicating a larger heterogeneity in the pre-industrial period.

Inter-regional and inter-hemispheric coherence of past temperature variability
Next, the stationarity of the correlation structure between the different regions, in the models and the reconstructions, is assessed using a running correlation analysis, (Figs. 7, S9). For the simulations, the multi-model mean shows generally high inter-regional correlations, as the common contribution of the forcing is enhanced because of the averaging procedure. Periods with small variations in external forcing are, however, characterized by weaker coherence between the regions. This occurs during the 11th and 12th century and in shorter periods around 1500 and 1750. High coherence occurs in periods with strong variations in external forcing, highlighting in particular that volcanic eruptions can cause simultaneous temperature variations in most regions. The inter-regional correlations in the individual model simulations vary considerably. The model range includes the correlations derived from the reconstructions for some regions, as for Europe vs. Arctic (Fig. 7a), but values for models are very often higher than for reconstructions (see also Sect. 4.1). The difference is particularly large for the coherence between Australasia and South America (Fig. 7b), which is substantially larger in model simulations compared to reconstructions and instrumental observations (Morice et al., 2012) (Fig. 7b). This could indicate that some regions are less connected by modes of variability (such as ENSO) in reality than suggested by models, that the models have poor representation of modes of internal variability that influence the ocean-dominated Southern Hemisphere (see Neukom et al., 2014; see also Supplement Sect. S5 and Fig. S10), or that there is more non-temperature noise in the proxy data from those regions.

Role of forcing
Some aspects of the response to external forcing have been briefly discussed in the previous sections. It is now formally addressed here by a superposed epoch analysis, by applying the U R and U T (correlation-and distance-based) model evaluation statistics and by detection and attribution techniques.

Superposed epoch analysis
The response of the PAGES 2k reconstructions and the various model simulations to external forcing from solar and volcanic activity is evaluated here using a superposed epoch analysis approach, following Masson-Delmotte et al. (2013). This analysis was conducted for two different timescales: interannual and multi-decadal. For interannual timescales, this is done by generating composites of reconstructed and simulated temperature sequences corresponding to the timing of the 12 strongest volcanic events (see Sect. S2.2). For the multi-decadal composites, the five strongest events are selected and the means from 40 years before to 40 years afwww.clim-past.net/11/1673/2015/ Clim. Past, 11, 1673-1699, 2015 ter the eruption are calculated from time series smoothed with a 40-year low-pass filter using least-squares coefficients (Bloomfield, 1976). We also calculate composites corresponding to the timing of intervals of weaker solar forcing at decadal timescales. The intensity of the average model response to the selected forcing events is then compared to the corresponding response found in the reconstructions. The regional temperature response for six PAGES 2k regions (North America is not analysed here; see Sect. 2.1) to the major volcanic events in the Crowley and Unterman (2012) reconstruction are shown in Fig. 8. The temperature perturbation typically lasts longer than the forcing itself, with a recovery to pre-eruption temperatures after 3 to 10 years in the simulations and in the reconstructions.
The responses vary considerably in the simulations and in the reconstructions among regions. Nevertheless, the composite averages are always larger in model results with values of up to −1 • C compared to about −0.25 • C in reconstructions. The largest responses in simulated and reconstructed temperatures are found in Europe and Asia. The Arctic and South America show smaller simulated temperature changes compared to Europe and Asia (around −0.5 • C) and the average responses in the reconstructions are even smaller but stay at levels of −0.1 to −0.2 • C during several years. For the Antarctic region, both the simulated and reconstructed temperature responses are negligible. This is also the case for the reconstructed response in Australasia. Similar results were obtained using the Gao et al. (2008) forcing (see Sect. S6 and Fig. S11).
At multi-decadal timescales, the simulated and reconstructed temperature responses are in better agreement, in particular when using the Crowley and Unterman (2013) reconstruction (Fig. S12) rather than the Gao et al. (2008) estimations (Fig. S13), with temperature decreases on the order of a few tenths of a degree in most regions. The one exception is South America, where, in contrast to simulations, the reconstructions do not show any multi-decadal changes associated with volcanic forcing.
The multi-decadal impact of solar forcing in the reconstructions is strongest in Europe, the Arctic, and Asia (Fig. S14), with mean changes ranging from 0.15 to 0.25 • C. Changes in model simulations are smaller, lying between 0.05 and 0.1 • C in all regions except for Antarctica, where no changes are perceptible. The reconstructed changes thus appear larger than the simulated ones in Europe and the Arctic. This interpretation of the results should be approached cautiously, however, as the solar variability is not independent of the volcanic forcing analysis. Volcanic eruptions tend to occur more often in periods of low solar forcing in the reconstructed forcing records, and solar forcing itself is characterized by significant uncertainties (e.g. Schmidt et al., 2011 Sundberg et al. (2012), , and Moberg et al. (2015). It includes two essential metrics, which both serve as statistical tests of a null hypothesis. First, a correlation metric, U R , is used to test whether a significant positive correlation exists between simulated and observed (or reconstructed) temperature variations, indicating that they share a common response to changes in external forcings. Second, a weighted square-distance metric, U T , is used to test whether temperature variations in a forced simulation are significantly closer to the observed temperature variations than an unforced control simulation. If this is the case, a negative U T is obtained, whereas a positive U T indicates that the simulated response to forcings is larger than those in the observations, provided a significant positive U R is found. Both metrics are approximately distributed as N(0,1) under the null hypothesis that forced simulations are equivalent to unforced control simulations. Thus, it is easy to see whether a U R or U T value is significantly different from zero. For example, a one-sided test value numerically larger than 1.65 is significant at the 5 % level.
Prior to the analysis, all records were recalibrated against their instrumental target temperature time series (see Sect. S2.3) following the procedure of Sundberg et al. (2012) and Moberg et al. (2015) to ensure that each regional reconstruction, after calibration, approximately satisfies the assumption that the true temperature component, upon which the non-climatic noise component is added, is correctly scaled (see Sundberg et al., 2012, andMoberg et al., 2015). Figure 9 shows the model evaluation statistics U R and U T , calculated for the 861-1850 preindustrial period. In general, all forced simulations and the reconstructions share a common forcing signal and, overall, the forced simulations match the reconstructions significantly better than the unforced control simulations. However, these overall positive results are essentially due to a good match between simulations and reconstructions in the Northern Hemisphere, while the agreement is poorer in the Southern Hemisphere.
Because of the imprint of the forcing response, all forced simulations show significant (p < 0.01) positive correlation statistics (U R ) when data from all seven regions are combined, although notable differences are seen between regions. In the Arctic, Europe, and Asia, all simulations have significant positive U R values. Nearly all simulations for Australasia and most for Antarctica also have significant positive U R .
In contrast, simulated and reconstructed pre-industrial temperature histories for South America show little common variation, as revealed by mostly insignificant U R (some are even negative) in that region. U R statistics for North America (tree-ring-based reconstruction) are only slightly better, . Correlation (U R ) and distance (U T ) statistics for PAGES 2k regions, with hemispheric and global combinations of all regional data, in the period 861-1850 CE. Positive U R indicates that simulations and reconstructions have a positive correlation and that they share an effect of temporal changes in external forcings. Negative U T indicates that a forced simulation is closer to the observed temperature variations than its own control simulation. The analysis reveals a notably better general agreement between simulations and reconstructions for the Northern Hemisphere as compared to the Southern Hemisphere. Coloured dots: individual simulations. Diamonds: ensemble-mean results for COSMOS and GISS models. Dashed lines show one-sided 5 and 1 % significance levels. Note the reversed vertical axis in the U T graphs.
but note that this reconstruction only starts in 1201. Moreover, the original temporal resolution of 10 years in the North American reconstruction leads to some loss of information in this analysis, which was performed at a 15-year resolution. Results for the distance statistic (U T ) show that nearly all forced simulations are significantly closer (p < 0.05) to the observed temperature variations than their respective control simulations when all regions are combined -i.e. their U T statistics are negative and statistically significant. The Arctic shows the overall best performance in the sense of having the largest number of negative significant U T values. Most simulations also show negative U T for Europe, Asia, and Antarctica, but many of them are insignificant. For Australasia and South America, nearly all U T values are insignificant and many are even positive.
Thus, overall, the comparison between simulation results and reconstructions performs notably better for the Northern Hemisphere than for the Southern Hemisphere. In particular, nearly all simulations have significant negative U T values for the combined Northern Hemisphere data (p < 0.05) but no significant negative values are found for the Southern Hemisphere, where most of the U T values are positive. This suggests that the simulated effect of forcings in the Northern Hemisphere agrees well in amplitude with the corresponding effect in the Northern Hemisphere reconstructions, whereas www.clim-past.net/11/1673/2015/ Clim. Past, 11, 1673-1699, 2015 the simulated Southern Hemisphere effect of forcings often appears to be larger than in the Southern Hemisphere reconstructions.
Results for both U R and U T suggest that the most robust agreements are for the largest spatial scales and for ensemble mean results (Fig. 9). The most significant U R and U T are for ensemble means and global comparisons, followed by ensemble means and NH comparisons. Splitting the analysis period into two halves (856-1350(856- and 1356(856- -1850. S15-S16) shows that the more recent period has better U R statistics. There are, however, not many significant negative U T in this period, although North America in particular shows several significant values.
Extending the analysis to the full 861-2000 period yields higher U R values for most regions (Fig. S17). The exception is Antarctica, where lower U R values indicate a divergence of the simulations and reconstruction for this region during the industrial period. Notably, U T values for the full analysis period are mostly weaker than for the pre-industrial period. Consequently, the overall performance of the simulation results-reconstruction comparison degrades in terms of the distance measure when recent data are included. This is likely because the simulated signal itself often has a larger amplitude in the industrial period than many of the regional temperature reconstructions (see Fig. 1).
Ensemble means for COSMOS and GISS ensembles give more significant U R and U T than individual simulations from these ensembles, demonstrating once more the value of averaging for isolating the forced signal. The intra-ensemble spread of test statistics illustrates the degree of randomness in U R and U T statistics for individual simulations, highlighting the danger of judging one model as being better than another. In particular, it is difficult to judge whether the high (E2) or low (E1) solar forcing amplitude of the COSMOS simulations provides a better fit to the reconstructions, as their ranges of U T values for individual simulations mostly overlap. For the early period analysis (856-1350), however, the low solar COSMOS simulations provide a better fit than the high solar simulations, as seen by their respective U T values being of different sign and having entirely non-overlapping ranges when all seven regions or when the Northern Hemisphere regions are combined (Fig. S15). This result is confirmed by a formal test where U T is calculated in a different way to directly compare the two COS-MOS ensembles, using the method described in Moberg et al. (2015). This test reveals that, despite a significantly better fit of the low solar simulations in the earliest period, neither of the two solar forcing alternatives gives a significantly better fit to the reconstructed temperature history when the more recent data are included (Fig. S18).

Detection and attribution
Detection and attribution aims to identify the forced response in the regional temperature reconstructions by evaluating Vertical bars indicate 5-95 % scaling factor ranges, with a cross indicating the best fit. Scaling factors that are significantly offset from "0" indicate that the response to forcing is detected, and those that encompass "1" indicate that the magnitude of the forced response agrees with simulations. For each region scaling ranges are shown for four different time periods (colours). For the Northern Hemisphere (NH), Southern Hemisphere (SH), and globally (ALL), the regressions were carried out on the combined data from all the applicable regions. An asterisk indicates that the detection analysis has been successful, namely the forced response is significantly greater than zero and that the residuals are consistent with model-based samples of internal variability.
whether observed changes could be entirely caused by variability created within the climate system (internal variability), whether external forcing is necessary to explain them, and what magnitude of external forcing response is consistent with reconstructions (see Bindoff et al., 2013;Hegerl and Zwiers, 2011). Here, we focus on all forcings together and not on the response to each forcing individually, as simulations with individual forcings are needed to analyse the latter. Attribution is achieved by estimating the response to the external forcing in the reconstruction using a total least-squares regression techniques (following Schurer et al., 2013; see also Allen and Stott, 2003). The outcome are scaling factors that determine the amplitude of the fingerprints of the forcing response in the reconstructions. A forcing is detected if a scaling value of zero can be rejected at the 5 % significance level, indicating that it is unlikely that climate variability alone is responsible for the similarity between forced response and reconstruction. If the 5-95 % range of scaling factors encompasses 1, then the magnitude of the response to forcing is found to be consistent in simulations and in the reconstruction (see Sect. S2.4). Figure 10 shows the results of the detection and attribution analysis using the multi-model ensemble mean, which is calculated as the mean of all model simulations described in Sect. 2.2, except for the high-solar COSMOS and CESM1 simulations as they include a clearly different forcing. All reconstructions and models used were first filtered using a 23-year Hamming window.
The response to external forcing is detectable (p < 0.05) in all four reconstructions from the Northern Hemisphere and during all time periods (scaling range always greater than zero, indicating that the level of agreement between the Clim. Past, 11, 1673-1699 www.clim-past.net/11/1673/2015/ multi-model mean and the reconstructions exceeds that from random control samples significantly). The scaling ranges always encompass the scaling factor 1, which shows that the model results are consistent with the reconstructions because they do not need to be scaled up or down. The only exceptions are the earliest time period (864-1350) for North America (tree-ring-based reconstruction), where only 150 years of data were available and the early European period, which fails the residual consistency check, indicating that the residual that is attributed to internal variability is larger than expected from model simulations, possibly due to non-climatic noise in reconstructions. The results for the latter case suggest that the basic hypotheses underlying the methodology are violated because the modelreconstruction discrepancy cannot be explained by internal variability alone. External forcing is also detectable when the model and reconstruction data from all Northern Hemisphere regions are combined.
External forcing is not detectable in South America (no scaling ranges significantly larger than 0) and only for certain time periods for Antarctica and Australasia (with fits for Australasia failing the residual consistency check). External forcing is also not detectable in the combined Southern Hemisphere reconstruction. As well as being undetectable, despite accounting for uncertainty in simulated signals due to variability, the estimated signals are also significantly smaller than simulated. Consequently, the models appear to simulate a magnitude and pattern of external forcing in the Southern Hemisphere significantly different from that derived from the PAGES 2k reconstructions. This could be due to strong noise in reconstructions swamping the forced response, calibration uncertainty in reconstructions misestimating the magnitude of the forced response, or errors in climate models as discussed below.

Discussion
In the light of the results presented in the Sects. 3 to 5, we discuss below each of the three questions raised in the introduction.
6.1 Are the statistical properties of surface temperature data for each individual continent-scale region consistent between simulations and reconstructions?
The analyses herein show that the answer to this question depends on the specific feature assessed. The simulation results and reconstructions agree at regional scale for some metrics, but disagree in many cases. The consistency between simulations and observations is still generally more robust at hemispheric and global scales, and the fit to reconstructions is improved for ensemble mean of simulations compared to individual members.
Overall, smoothed simulated temperature anomalies from the long-term average lie within the range of the originally published uncertainty estimates of the reconstructions. However, these uncertainty ranges are, for all regions except North America, defined for data at annual resolution and therefore are very likely larger than uncertainty ranges adapted for the smoothed versions of the data (see Sect. 3.1). Thus, the published uncertainties for the reconstructions are in most cases too large to provide strong constraints on the ensemble of simulations, as different forcing amplitudes and responses are nevertheless consistent within the range of the reconstructed values. Some common signals between model results and reconstructions can be identified visually as isolated events, such as the cooling during the early 19th century in many regions, but they are relatively rare.
The time series for forced simulations are nevertheless significantly correlated with temperature reconstructions, for many regions, when the entire series are considered (Supplement Sect. S3). Models also have some skill compared to a simple a priori estimate assuming no temperature change over the past millennium (Sect. 3.4). Despite using a very simple reference method as a benchmark, however, such skill is achieved nearly exclusively for Northern Hemisphere regions, specifically for the Arctic and in some models for Europe and Asia. This is in agreement with the conclusions derived from the application of the Sundberg et al. (2012) evaluation framework (Sect. 5.2) that forced simulations are significantly closer to the reconstructions than unforced simulations in Northern Hemisphere regions. In particular, the Arctic region shows a robust agreement, as do Europe and Asia to a lesser degree. In contrast, for all the regions of the Southern Hemisphere, the models have nearly no skill compared to a constant climate reference and individual forced simulations are in most cases not significantly closer to reconstructions than an unforced reference.
The diagnostics mentioned above addressed whether simulated time series of surface temperature at continental scale have temporal similarities with reconstructed ones. The climatological or probabilistic consistency is complementary as it focuses on the distribution of temperature data, independent of the particular trajectory over time. For most regions, no consistency is found between the distribution of model results and reconstructed temperatures when using the original reconstruction uncertainty estimates (Sect. 3.3), which are annually resolved in all cases except North America.
It should be noted, however, that these results depend strongly on the uncertainty estimates considered (Bothe et al., 2013a, b): the greater the assumed reconstruction uncertainty, the weaker the consistency with model simulations as the models tend to appear over-dispersive. When reducing the uncertainties, to adapt them to the smoothing or temporal averaging applied here, the consistency improves in many regions. Such reduction of the uncertainties may, however, lead to overconfident conclusions if the original uncertainty estiwww.clim-past.net/11/1673/2015/ Clim. Past, 11, 1673-1699 mates at the annual resolution did not account for all existing sources of uncertainty. A visual comparison suggests that the temperature reconstructions show slightly more centennial to multi-centennial variability over the full period with stronger long-term trends, while model results indicate a stronger recent warming compared to some of the reconstructions (Sect. 3.1). Comparison of the series spectra (Sect. 3.2) reveals marked differences between regions in how well the simulations agree with the reconstructions. The best overall agreement is seen for the Arctic, where the model spectra mostly lie within a 95 % confidence interval for the reconstruction spectrum. For all other regions, the model spectra often lie outside the confidence interval for some frequency ranges. The mismatch is most pronounced for South America, but there are other examples with both lower and higher variance at different frequencies in model results compared to reconstructions.
The disagreements can have various origins, in either reconstructions or simulations or both. For instance, the total variance of reconstructions is dependent on how they were calibrated to instrumental observations (e.g. Kutzbach et al., 2011), but the shape and slopes of their spectra are determined by spectra of both the true climate and the nonclimatic proxy-data noise and by the signal-to-noise ratio (Moberg et al., 2008). Some studies have suggested that reconstruction methodologies may alone underestimate lowfrequency variability, in addition to any frequency biases inherent to the proxy data (e.g. Smerdon et al., 2010;Smerdon et al., 2015a). The amplitude of the reconstructed past forcing changes, which affect the model spectra, is still uncertain (Schmidt et al., 2011(Schmidt et al., , 2012. The modelled transient climate response and the amplitude of internal variability at the regional scale vary considerably and thus deficiencies in applied forcings or internal model physics can lead to errors in the modelled spectra. Nevertheless, no major, systematic model underestimation of low-frequency variability can be deduced at the continental scale from the analyses performed herein, in contrast to some recent studies devoted to the ocean surface temperature (Laepple and Huybers, 2014a, b).
6.2 Are the cross-regional relations of temperature variations similar in reconstructions and models?
Discrepancies in the inter-regional relations between reconstructions and model results are clearer than for each individual region. While the strong correlations between the temperature variations in regions from the Northern Hemisphere in model simulations have some similarities to the ones in the reconstructions (Sect. 4.1), the correlation between the hemispheres and between the Southern Hemisphere regions are much stronger in models than in reconstructions, as previously reported by Neukom et al. (2014) at hemispheric scale. This result is robust as it is also reflected in the larger variance explained by the first EOF mode in models than in the temperature reconstructions (Sect. 4.2), and this is valid for most of the past millennium (Sect. 4.3). These differences may be due to a stronger response to forcings in the models, to unrealistic internal variability in the models, or to non-climatic noise in the proxies or due to a combination of these factors, as discussed in more detail below. Additionally, there are large differences between the various models in the Southern Hemisphere. For instance, Antarctic temperature is strongly related to other regional temperatures in some simulations and not in others, suggesting that specific model dynamics may account for some of the discrepancies with the reconstructions.
6.3 Can the signal of the response to external forcing be detected on continental scale and, if so, how large are these signals?
The agreements or disagreements between model results and reconstructions can be partly explained by the model response to forcing. The contribution of the forcing derived from simulated results can be detected in the reconstructions for all regions of the Northern Hemisphere (Sect. 5.3). The forcings used in the PMIP3 model experiment result in simulated temperature histories that, on the whole, explain a significant fraction of the past regional temperatures in the preindustrial climate. This strongly contributes to the model skill for the Northern Hemisphere, as unforced internal stochastic variability is unlikely to agree between model results and observations. This is confirmed by the significant correlation coefficients (Fig. S1) and correlation statistics (U R ) (Sect. 5.2) that indicate common external forcing variations. Furthermore, the correlations increase for the ensemble average relative to the available single-model simulations due to the fact that the contributions from internal variability are reduced by averaging.
On interannual timescales, the model response to volcanic forcing appears larger than represented in the reconstructions (Sect. 5.1). There is some debate on the potential underestimation or overestimation of the cooling due to volcanic eruptions in reconstructions (e.g. Mann et al., 2012;Anchukaitis et al., 2012;Tingley et al., 2014;Büntgen et al., 2015). Nevertheless, this model overestimation was also found when compared to instrumental data and at hemispheric scale, suggesting a robust phenomenon (Brohan et al., 2012;Fernández-Donado et al., 2013;Masson-Delmotte et al., 2013;Schurer et al., 2013). Both model results and reconstructions also show that volcanic activity impacts temperature at multi-decadal timescales, with a similar magnitude of the temperature response in models and reconstructions over most of the regions in the Northern Hemisphere. This is consistent with the detection and attribution analysis (Sect. 5.3), which indicates that the magnitude of the simulated response to forcing in the Northern Hemisphere has the correct amplitude for smoothed time series.
The role of solar forcing is less clear and none of the pre-PMIP3 COSMOS simulations with either a moderate or a weak solar forcing gives a systematically better agreement with the reconstructions than the other, although the ensemble with low solar forcing yielded a better fit during the first 500 years (Fig. S18). This confirms earlier results obtained at the hemispheric scale (Masson-Delmotte et al., 2013;Schurer et al., 2014).
In the Southern Hemisphere, the influence of external forcing is often not detected (Sect. 5.3). This is consistent with the lower correlation coefficients (Fig. S1) and weaker correlation statistics (U R ) there (Sect. 5.2). The models also seem to overestimate the response compared to the signal recorded in the Southern Hemisphere reconstructions (Sect. 5.2-5.3). This finding is likely related to the larger covariability seen within Southern Hemisphere regions in models compared to reconstructions. Moreover, control simulations display low correlations between the Northern and Southern Hemisphere regions.
The analysis performed herein, however, cannot reveal the origin of the mismatch between simulation results and reconstructions. These differences may be due to biases in the dynamics of the climate models or to errors in the implemented forcing, in particular in their spatial distribution. Land-use changes, which are not included in some models (Table 1), tend to reduce the spatial correlation between regions as deforestation did not occur at the same time over all continents (Pongratz et al., 2008;Kaplan et al., 2011).
The spatial distribution of volcanic aerosols may also contribute to pronounced regional differences. Volcanic forcing is usually not implemented as a direct simulation of changes in stratospheric sulfate concentrations due to individual eruptions but rather as a mean change in the optical depth for different latitudinal bands. This can have an impact on the overestimation of the response in individual simulations or to individual eruptions. Additionally, if the latitudinal distribution of volcanic aerosols is too homogeneous, thereby inducing unrealistically symmetric forcing between hemispheres, it would also overestimate the global signature of the induced cooling.
Any non-climatic noise in the reconstruction would tend to reduce the covariance in reconstructions compared to model results, which would lead to an underestimation of the relative contribution of the forced signal. Despite the large progress made over the last few years, this may still be a critical problem in the Southern Hemisphere, where fewer long palaeoclimate records are available compared to the Northern Hemisphere, explaining some of the model-data mismatch there.
The role of internal variability in driving temperature variations may also be underestimated in model simulations, particularly in the ocean-dominated Southern Hemisphere, as suggested by Neukom et al. (2014). Simulated internal variability may, however, be overestimated, as reported here in at least one model and elsewhere for ENSO-type variability (Jungclaus et al., 2006) or for the Southern Ocean ice extent (Zunz et al., 2013). This would imply a ratio between internal and forced variability that is incorrect, which would lead to biased correlations between the different regions.
Another potential explanation for the differences in the spatial covariance structure between models and observations relates to the relatively coarse resolution of the climate models. Using models with higher spatial resolution will increase the number of spatial degrees of freedom and potentially improve the covariance structure of the climate models compared to reconstructions. Nevertheless, the expense required for both high spatial and temporal resolution, as well as the necessary ensemble approach, could be prohibitive.

Conclusions and perspectives
The analysis of model simulations and PAGES 2k temperature reconstructions has allowed us to extend some of the conclusions previously articulated at hemispheric scale. For all the continental-scale regions in the Northern Hemisphere, the models are able to simulate a forced response with a magnitude similar to the one derived from reconstructions. Despite higher levels of variability on continental scales (relative to full hemispheres), the role of forcing is found to be important. This leads to reasonable agreement between models and temperature reconstructions.
Nevertheless, a deeper assessment of the consistency between simulated results and reconstruction is limited because of the large uncertainties in the reconstructions and the weak constraints on the estimates of this uncertainty. Notably, the agreement between simulation results and reconstructions is poor for the Southern Hemisphere regions. Our results indicate that models have a much clearer response to forcing than deduced from the reconstructions, leading to a greater consistency across the Southern Hemisphere regions and between hemispheres in model results than in the reconstructions.
It is not possible to precisely assess which part of those disagreements comes from the biases in model dynamics, the forcing, or the reconstructions. As suggested in many previous studies, substantial progress will only be possible with better uncertainty quantification and reduction (spatially and temporally) in the reconstructions and the forcing, as well as through model improvements.
Nevertheless, on the basis of our results we highlight four specific points that may lead to significant advances in the coming years.
The first is the insights that can be gained through studying the discrepancies between reconstructions and simulations relative to direct observations over the most recent decades. A quantitative comparison between simulations, reconstructions, and instrumental data would provide useful information on the origin of those disagreements, allow an estimate of the non-climatic noise in reconstructions, and would elucidate how mismatches over the last 150 years are related to www.clim-past.net/11/1673/2015/ Clim. Past, 11, 1673-1699, 2015 PAGES 2k-PMIP3 group: Continental-scale temperature variability disagreements over the last several millennia (e.g. Ding et al., 2014). Secondly, large uncertainties are associated with the behaviour of the ocean over the past millennium. The discrepancies in the low-frequency variability between model results and reconstructions at the continental scale seem less systematic than for some oceanic data (Laepple and Huybers, 2014a, b), but clearly assessing this would require additional analyses. As new palaeoclimate data compilations are now available for the global ocean (Tierney et al., 2015;McGregor et al., 2015), model-data comparison for oceanic regions should be encouraged, and the compatibility between ocean and land temperature reconstructions tested. This would allow us to assess the multi-decadal internal and forced variability in the ocean and to determine whether it is the origin of the disagreement between model simulations and Southern Hemisphere reconstructions (e.g. Neukom et al., 2014). Internal ocean variability can also have a significant influence on Northern Hemisphere climate as seen in several studies investigating the circulation in the Atlantic at multi-decadal timescales (e.g. Delworth and Mann, 2000;Knight et al., 2005;Lohmann et al., 2014). These are the timescales for which most models tend to display less variability than reconstructions.
Third, our comparison of continental-scale temperature reconstructions with simulated temperatures only uses a small fraction of the information provided by models and palaeoclimate records. As discussed in Phipps et al. (2013), other approaches can provide analyses complementary to classical model-data comparison, through a better handling of the various sources of uncertainty. Promising examples are proxy forward models, which directly simulate the proxy records from climate model outputs (e.g. Evans et al., 2013) and data assimilation methods (e.g. Widmann et al., 2010;Goosse et al., 2012b;Steiger et al., 2014). These approaches combine model results and observations to obtain the best estimates of past change and may be most effective at detecting inconsistencies between model and palaeoclimate estimates.
Finally, one could also question the selection of the continental scale as the basis of a comparison, as regional changes are strongly affected by modes of variability such as ENSO, the Southern Annular Mode, the North Atlantic Oscillation, or the Pacific North America pattern. These modes could imprint temperature patterns that are masked by averaging over the continents. On the other hand, model-data comparison made at smaller spatial scales has revealed highly variable and even contradictory results at nearby regions (Moberg et al., 2015), suggesting that a large number of local proxy-data sites are needed for obtaining robust results. Ideally, a subregional selection from key teleconnection regions should be used to assess the stability of climate modes  or enable reliable reconstruction of modes of variability (Lehner et al., 2012;Zanchettin et al., 2015;Ortega et al., 2015), although this requires strong reconstruction skill to be successful (e.g. Russon et al., 2015). Additionally, spatially resolved reconstructions should be targeted because they offer useful potential for dynamic interpretation (e.g. Luterbacher et al., 2004;Mann et al., 2009;Steiger et al., 2014, PAGES 2k Consortium, 2014Shi et al., 2015).
In summary, our results for the Northern Hemisphere suggest a convergence of our understanding of climate variability over the past 1000 years, but there remain many open questions for the Southern Hemisphere. Progress may be expected from comparing simulations, reconstructions, and observations in the instrumental period, from a better knowledge of internal and forced variability in the ocean, from efforts to understand climate variability via proxy forward modelling and data assimilation, and from a clearer view of the influence of climate modes on temperature variability.