Statistical framework for evaluation of climate model simulations by use of climate proxy data from the last millennium – Part 2: A pseudo-proxy study addressing the amplitude of solar forcing

The statistical framework of Part 1 (Sundberg et al., 2012), for comparing ensemble simulation surface temperature output with temperature proxy and instrumental records, is implemented in a pseudo-proxy experiment. A set of previously published millennial forced simulations (Max Planck Institute – COSMOS), including both "low" and "high" solar radiative forcing histories together with other important forcings, was used to define "true" target temperatures as well as pseudo-proxy and pseudo-instrumental series. In a global land-only experiment, using annual mean temperatures at a 30-yr time resolution with realistic proxy noise levels, it was found that the low and high solar full-forcing simulations could be distinguished. In an additional experiment, where pseudo-proxies were created to reflect a current set of proxy locations and noise levels, the low and high solar forcing simulations could only be distinguished when the latter served as targets. To improve the detectability of the low solar simulations, increasing the signal-to-noise ratio in local temperature proxies was more efficient than increasing the spatial coverage of the proxy network. The experience gained here will provide guidance when these methods are applied to real proxy and instrumental data, for example when the aim is to determine which of several alternative solar forcing histories is most compatible with the observed/reconstructed climate.


Introduction
Variations of solar irradiance on long time scales have a potential influence on global climate. Instrumental satellite-based measurements of total solar irradiance (TSI) are, however, available only back to the mid-1970s. Within this period, TSI monitors show an 11-yr cycle with an amplitude of about 0.07 %, in phase with the sunspot number cycle. To estimate TSI further back in time, several investigators have relied on observed correlations between various indices of solar activity in combination with assumptions of how these indices are related to variations in TSI (see Gray et al., 2010, for a thorough review).
One of the most highly debated questions concerns whether there exists a centennial-scale variation in the background level of TSI. Different estimates of the background amplitude of TSI are often characterized by their hypothesized decrease in TSI values within the Maunder Minimum (MM) period of low solar activity, 1645-1715 AD, compared to the recent satellite-based measurements. Estimates made in the 1990s suggested rather large values, between 0.24 % and as much as 1 % (Reid, 1991; Hoyt and Schatten, 1993; Lean et al., 1995; Zhang et al., 1994; Reid, 1997; Cliver et al., 1998; Bard et al., 2000). Continued research in the 2000s (Wang et al., 2005; Krivova et al., 2007; Tapping et al., 2009; Steinhilber et al., 2009; Gray et al., 2010) did not support these results; the most widely accepted view now is that the background variations are between 0.04 % and 0.1 %, which is the range adopted by the Paleoclimate Model Intercomparison Project Phase III (PMIP3; Schmidt et al., 2011). To put these different estimates into context, a change in TSI of 0.1 % corresponds to a radiative forcing that is about one-tenth of the current anthropogenic forcing from greenhouse gases (Lockwood, 2011). The debate, however, is not yet over. Very recently, two author teams have challenged the currently held view: one team (Shapiro et al., 2011) hypothesized that the decrease at the MM could be more than 0.4 %, while the other (Schrijver et al., 2011) argued that there could possibly be no change at all.

Published by Copernicus Publications on behalf of the European Geosciences Union.

A. Hind et al.: Statistical framework for evaluation of climate model simulations – Part 2
One way to attempt to constrain the long-term amplitude of solar forcing is to use alternative TSI histories to drive climate model simulations, and then see which forcing history provides simulated temperatures that are most compatible with observed past temperatures and reconstructed past temperatures derived from proxy data (Ammann et al., 2007; Jungclaus et al., 2010; Feulner, 2011; Schmidt et al., 2011). This approach, however, is associated with difficulties because of the ever-present noise in climate proxy data (Jones et al., 2009) in combination with the stochasticity of the internal (unforced) variability of the climate system (Yoshimori et al., 2005). Another complicating factor is the uncertainty regarding the Earth's climate sensitivity to radiation changes and the varying climate sensitivity among different climate models (Knutti and Hegerl, 2008). These difficulties motivate the experiment we undertake here, which is designed such that "true" temperatures are defined from simulations with a single climate model, so that we know with certainty what the amplitude of solar forcing has been and the climate sensitivity issue can be ignored. Moreover, we know precisely how much noise there is in our proxy data, because they are constructed from simulated "true" temperatures with known noise added. We then ask the following: given knowledge of the true solar forcing, the true past temperatures, and the level of proxy noise, is it possible to determine whether a forced simulation with a climate model, which includes the correct solar forcing amplitude, gives a smaller distance to the reconstructed temperatures than expected from a control simulation with constant forcings? And, if so, can we correctly rank simulations driven by the correct TSI amplitude, such that they are deemed better than other simulations that include an alternative, incorrect amplitude?
A study of this kind is a variant of a now common approach in paleoclimatology, known as a pseudo-proxy experiment, in which output from climate model simulations is used to test the performance of different methods for reconstructing past climates (see Smerdon, 2012, for a review). In our pseudo-proxy study, we use the newly developed statistical framework of our companion paper (Sundberg et al., 2012; henceforth referred to as Part 1) to rank or distinguish between model simulations using two different solar forcings, either as single forcings or in tandem with other important forcings. Note that we do not attempt to address the question of whether a higher or lower solar variability imposed on simulations is closer to reality. We merely state that the issue is of great importance and choose it as a focal subject for testing the sensitivity of our framework. Ultimately, this will allow a better judgement of how feasible it is, in future comparisons with real proxy and instrumental data, to identify which simulation best reproduces observed temperatures. As our pseudo-proxy experiment test-bed, we use the set of simulations from the Community Earth System Modeling (COSMOS) Millennium Activity of the Max Planck Institute (Jungclaus et al., 2010).

The COSMOS Millennium Activity: model description and experimental design
The COSMOS Millennium Activity simulation experiments were conducted with the Max Planck Institute Earth System Model (MPI-ESM), which comprises the atmospheric model ECHAM5 (Roeckner et al., 2003), the ocean model MPIOM (Marsland et al., 2003) and models for land vegetation (JSBACH) and ocean biogeochemistry (HAMOCC). The model resolution is T31 (approximately 3.75°) for ECHAM5, and MPIOM applies a conformal grid with a horizontal resolution ranging from 22 km to 350 km (Jungclaus et al., 2010). The ocean and atmosphere are coupled daily without flux correction. The Millennium Activity involved the creation of a 3000-yr unforced control (CTRL) simulation, after a multi-century spin-up phase in which the carbon cycle was brought into equilibrium. The CTRL simulation experienced 800 AD orbital conditions and pre-industrial greenhouse gas concentrations (Jungclaus et al., 2010). In our experiment, it was separated into three 1000-yr-long CTRL simulations to be used in the comparison with the forced simulations. The globally averaged land-only annual temperature anomalies (30-yr means) of the three CTRL simulations are shown in Fig. 1a. To account for some of the previously discussed uncertainty in the magnitude of solar forcing, the Millennium Activity conducted experiments using both "low" and "high" estimated TSI forcing series. The "low" forcing exhibits a total TSI reduction of 0.1 % at the Maunder Minimum compared to the present (the Krivova et al., 2007 reconstruction, in agreement with the largest amplitude used in PMIP3), whereas the "high" forcing exhibits a reduction of 0.25 % (the Bard et al., 2000 reconstruction, representative of a common late-1990s view). Other forcings known to be principal drivers of climate were also included in the experiments: orbital, volcanic and non-volcanic aerosols, greenhouse gases (CO₂, CH₄, N₂O), as well as land-use changes (see Jungclaus et al., 2010, for details).
Two full-forcing ensembles, representing the last 12 centuries, were generated by starting simulations from different ocean initial conditions; they are separated by their respective "low" (E1, five members) and "high" (E2, three members) solar forcing histories, as well as by any solar-induced CO₂ concentration changes (which are possible through the model's interactive carbon cycle). A representation of the forcings is shown in Fig. 2. Note that these single time series representations of the global forcings are shown in terms of their annual mean radiative forcing at the top of the atmosphere. In addition to the two full-forcing simulation ensembles, the model was also driven by each forcing individually to create several single-forcing simulations (Fig. 1c). There is a pronounced simulated warming in the 20th century associated with the enhanced greenhouse gas radiative forcing in both full-forcing ensembles (Fig. 1b and d), whereas the single-forcing simulations do not show this 20th-century warming, as they do not contain the greenhouse gas radiative forcing.

Model versus pseudo-proxy data comparison setup
A pseudo-proxy series can be defined as an instrumental or climate model data series that has been purposefully distorted through the addition of noise (Jones et al., 2009; Smerdon, 2012). This ensures that the pseudo-proxies account for only a fraction of the variance of a temperature series, as is the case for a real proxy reconstruction of temperature. A key advantage of this approach is that the distortion and the reconstruction targets are both prescribed and hence fully known. Here, the pseudo-proxy setup is described in relation to the statistical framework; further details can be found in Part 1.
In the present pseudo-proxy analysis, the true temperature τ_i is defined explicitly by a particular simulation, chosen from either the E1 or E2 full-forcing ensemble, where the regions used in the comparison are specified. The proxy series z_i and instrumental series y_i can then be constructed as τ_i plus added noise at specified levels. An additional advantage of the pseudo-proxy approach using model output is that the number of locations can be varied from a single grid box to any number of locations. We also consider an average single time series for the entire globe. Given a realistic amount of noise in the pseudo-proxies, it is hoped, first, that the correlation-based test statistic U_R will indicate that a forced simulation from either the E1 or E2 ensemble is able to explain some of the simulated variability in another simulation from E1 or E2, when a single member of one of those forced ensembles is used as the "truth". Then, if this happens, it is hoped that the distance-based performance metric U_T will distinguish the E1 and E2 ensemble simulations from CTRL simulations, and also correctly rank them against each other, again when a single member of one of the two forced ensembles is used as the "truth". If this is not the case, then the method cannot be expected to help better constrain a suitable past millennial solar forcing amplitude when applied to real proxy and instrumental data. In our experiment, we also compare simulated temperatures from the single-forcing simulations with pseudo-proxy temperatures created from either the E1 or E2 ensemble, to learn more about the detectability of the effect of single forcings and their influence on temperatures in a full-forced "noisy proxy world".
In all cases, the climate model simulation time sequences x_i are 2-m (surface) temperatures from the COSMOS simulations (land points only), where the forced component αξ_i is the response either to a single forcing (land-use changes, solar or volcanic) or to the combined forcings in the E1/E2 ensembles. Note that α = 0 in the case of the unforced CTRL simulations (see Statistical Models 1 and 2 in Part 1).
We undertook our analysis using 30-yr non-overlapping means of simulated temperatures from the COSMOS simulations. A motivation for this choice is given later in this section. The instrumental measurements y_i are defined as the target simulation (i.e. one member from E1 or E2) for a given location over the period 1850-2000, with added white noise (θ_i) representing 10 % of the total variance of y. This added noise approximately corresponds to a doubling of recent single-thermometer measurement error estimates (Folland et al., 2001; Brohan et al., 2006), but is chosen here on an ad hoc basis to provide a level of noise that is not negligible yet notably smaller than in most real proxy data. The proxy series z_i are defined similarly, though over the period 1000-2000, and feature added white noise (ε_i) with two-thirds of the total variance of z. This corresponds to a signal-to-noise ratio SNR = 0.71 (see Smerdon, 2012) and a correlation r = 0.58 between z and τ, which is not untypical for high-quality real proxy records (Christiansen and Ljungqvist, 2011, 2012). To represent both better and worse real proxies, considerably higher and lower noise percentages (always defined for the 30-yr time unit) were also investigated (see Supplement).
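The noise construction just described can be sketched numerically (a minimal sketch of our own, assuming Gaussian white noise; `make_pseudo_proxy` and the variable names are illustrative, not from the COSMOS setup). If the noise is to carry a fraction f of the total variance, then Var(ε) = Var(τ)·f/(1−f), which gives SNR = √((1−f)/f) and r = √(1−f); for f = 2/3 this yields SNR ≈ 0.71 and r ≈ 0.58, the values quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pseudo_proxy(tau, noise_frac, rng):
    """Add white noise carrying `noise_frac` of the total variance of the
    result: Var(eps) = noise_frac * Var(z), hence
    Var(eps) = Var(tau) * noise_frac / (1 - noise_frac)."""
    var_eps = np.var(tau) * noise_frac / (1.0 - noise_frac)
    return tau + rng.normal(0.0, np.sqrt(var_eps), size=tau.shape)

# White-noise stand-in for a "true" 30-yr mean temperature series
tau = rng.normal(size=100_000)

z = make_pseudo_proxy(tau, noise_frac=2 / 3, rng=rng)  # proxy-like series
y = make_pseudo_proxy(tau, noise_frac=0.10, rng=rng)   # instrumental-like series

snr = np.std(tau) / np.std(z - tau)  # ~ sqrt(1/2) ~ 0.71
r = np.corrcoef(tau, z)[0, 1]        # ~ sqrt(1/3) ~ 0.58
```

The same recipe with f = 0.10 reproduces the 10 % noise-variance level used for the pseudo-instrumental series.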
The analysis included data for the period 1000-2000 AD, even though the forced simulations begin at 850 AD. The computation of the test statistics U_T (Eq. 18; Part 1) and U_R (Eq. 23; Part 1), however, was restricted to the period 1000-1850 to avoid the influence of anthropogenic greenhouse gas increases. It should also be noted that data after 1850 were used for the calibration of z_i against y_i and for estimating the total variance of y. The statistical framework of Part 1 allows for uncertainty in both the instrumental and proxy series, specified through a time-dependent weighting w_i (Eqs. 9-12 in Part 1). In our experiment, however, the precision of z_i does not vary with time. The variance of the "true" unforced temperature, s²_η, was estimated using detrended pseudo-instrumental data, whilst the sample variance of internal unforced variability, s²_δ, was estimated from the CTRL simulations (see Sect. 5 in Part 1).
Fig. 3. Box plots of the U_R (panels a, b) and U_T (panels c, d) test statistics. The left panels are for E1 ("low" solar) as target, the right panels for E2 ("high" solar) as target. The 5 % two-sided significance levels are shown with dashed lines. Each box covers the 50 % interval between the lower and upper quartiles, with the median as a thick black line between. The simulations are: 1 = land-use changes, 2 = low solar, 3 = high solar, 4 = volcanoes, 5-9 = E1, 10-12 = E2, 13 = average E1, 14 = average E2. The CTRL simulation results (numbers 15-17) are shown for the U_R analysis but not for U_T, since they are there used as internal references. Note that the y-axis for U_T is flipped to simplify comparisons with the U_R box plots.

As described in Sect. 2 of Part 1, the unforced simulated temperature δ_i is assumed to be white noise. It is of course quite possible that white noise is not a good representation of the internal variability of the true climate, and the distance measure D² does not require white noise. However, the null hypothesis of the statistical tests is that forced simulations are equivalent to CTRL simulations, so for the described tests to have the prescribed type I error level, the unforced simulations should be well represented by white noise. We investigated the seriousness of this problem by calculating the lag-1 autocorrelation for the full 3000-yr CTRL simulation, both in terms of the proportion of global area with significant autocorrelations for various time resolutions and in terms of the lag-1 autocorrelation of the global land-only series (see Supplement for further details). It was found that beyond a 20-yr time resolution, δ_i can be considered white noise, in keeping with the statistical assumptions of Sect. 2 of Part 1. Hence, a non-overlapping 30-yr mean resolution, as used in the present analysis, should keep the type I error of the tests under reasonable control in the model.
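The whitening effect of block averaging can be illustrated with a small sketch of our own (an AR(1) process with an arbitrary coefficient stands in for an annual-resolution control run; this is not the COSMOS CTRL output): the annual series is clearly autocorrelated, while its non-overlapping 30-yr means are close to white.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a 1-D series."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

def block_means(x, width):
    """Non-overlapping block means (trailing remainder dropped)."""
    n = (len(x) // width) * width
    return np.asarray(x[:n], dtype=float).reshape(-1, width).mean(axis=1)

# AR(1) stand-in for an annual control-run temperature series
rng = np.random.default_rng(1)
phi, n = 0.5, 30_000
annual = np.empty(n)
annual[0] = rng.normal()
for t in range(1, n):
    annual[t] = phi * annual[t - 1] + rng.normal()

r_annual = lag1_autocorr(annual)                  # close to phi = 0.5
r_30yr = lag1_autocorr(block_means(annual, 30))   # near zero at 30-yr resolution
```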

Global analysis
We first conducted a study on globally averaged (area-weighted) time series using only land points (i.e. the data shown in Fig. 1); the results are shown in Fig. 3 (Fig. 3a and b show the U_R correlation analysis results, whilst Fig. 3c and d show the U_T distance measure results).
To clarify, a positive U_R represents a positive correlation between a simulation and its target, whilst a negative U_T indicates a better performance of a forced simulation compared with unforced simulations. The global mean was investigated first simply because this series will likely exhibit a stronger signal-to-noise ratio of the forced component than the individual grid-point scale, where internal temperature variability is more dominant (Servonnat et al., 2010). Hence, we use a single set of τ_i, y_i and z_i sequences in this globally averaged analysis (i.e. the summation in the definitions of U_T and U_R is made over a single term, and no covariance computations are needed). Both the E1 and E2 simulations were used separately as targets in this experiment, and to use as many target "true" climates or "truths" as possible, each ensemble member was used as the target in turn. For each type of "truth", approximately 100 noise realizations were generated to produce y_i and z_i, with a rotation through the five E1 target simulations (20 noise realizations for each simulation, 5 × 20 = 100; Fig. 3a and c) and through the three E2 target simulations (33 noise realizations for each simulation, 3 × 33 = 99; Fig. 3b and d). Iteratively treating the E1 or E2 ensemble members as targets could cause the distributions to be hierarchical, in that the error distribution associated with different noise realizations could potentially be small in comparison with the difference between ensemble members (internal climate variability in the model). Hence, an identical analysis was conducted but with zero proxy noise added to the target temperatures, which revealed the E1 and E2 ensemble simulations to give results with little qualitative spread (not shown). This satisfied us that the spread of the distributions in Fig. 3 predominantly represents the uncertainty due to the pseudo-proxy noise realizations.
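The bookkeeping of this target rotation can be sketched as follows (illustrative code of our own; `rotate_targets` is hypothetical, and plain Pearson correlation stands in for the U_R statistic of Part 1): each ensemble member serves as the "truth" in turn, with the noise realizations split evenly across members (5 × 20 = 100 for E1, 3 × 33 = 99 for E2).

```python
import numpy as np

rng = np.random.default_rng(2)

def rotate_targets(ensemble, n_total, statistic, noise_frac=2 / 3):
    """Use each ensemble member as the 'true' target in turn, add a fresh
    pseudo-proxy noise realization each time, and collect the statistic."""
    per_member = n_total // len(ensemble)
    values = []
    for truth in ensemble:
        for _ in range(per_member):
            var_eps = np.var(truth) * noise_frac / (1.0 - noise_frac)
            pseudo = truth + rng.normal(0.0, np.sqrt(var_eps), truth.shape)
            values.append(statistic(pseudo, truth))
    return np.asarray(values)

# Toy stand-in: Pearson correlation instead of the Part 1 U_R statistic
corr = lambda a, b: np.corrcoef(a, b)[0, 1]

# Five synthetic "ensemble members", each 28 30-yr means (1000-1850)
e1 = [rng.normal(size=28) for _ in range(5)]
vals = rotate_targets(e1, 100, corr)  # 5 members x 20 realizations
```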
To further explain the U_T and U_R box plot distributions shown in Fig. 3, the first four box plots represent the single-forcing simulations, namely land-use changes (green), low solar (light orange), high solar (yellow) and volcanoes (red), compared with either the E1 (left panels) or the E2 (right panels) simulations as target. Analogously, the next five box plots (numbers 5-9) represent the E1 simulations, all coloured dark blue, with their corresponding ensemble average U_R/U_T value in blue (number 13). The three E2 simulations are coloured dark red (numbers 10-12), with their corresponding ensemble average in red (number 14). Note that, when an E1 (or E2) simulation is used as the target, this target simulation is excluded from the E1 (or E2) ensemble being analysed. Additionally, for comparison, Fig. 3a and b feature an analysis of the three CTRL simulation segments (numbers 15-17), as these are not required in the calculation of U_R. From Fig. 3a and b, the U_R correlation analysis, it is clear that individual E1 and E2 ensemble members are significantly correlated with both E1 and E2 targets. The E2 simulations, however, are the most highly correlated, whichever is the target. This can be expected in so far as the E2 simulations feature the strongest solar forcing and the largest variability (Fig. 1). However, the significant correlations between the E1 and E2 ensembles may not be reflected in a distance-based measure. U_T is expected to be more effective in distinguishing between the simulations and, in some instances, capable of ranking them. The principal reason is that the correlation analysis does not consider the variance of the two compared series (target and simulation), whereas this is explicitly considered in the distance measure. This can be seen from the fact that, when E1 serves as target (Fig. 3c), the E1 simulations are generally significantly closer to the target than the CTRL simulations, whilst the E2 simulations are not. The E1 and E2 simulations are also correctly distinguished when E2 serves as target. In this case, however, both are closer to the target than the CTRL simulations (Fig. 3d).
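The correlation-versus-distance distinction can be made concrete with a toy example of our own (plain mean-squared distance stands in for the D²-based U_T, and the series are synthetic): a simulation that is perfectly correlated with the target but has the wrong variance scores highly on correlation yet poorly on distance.

```python
import numpy as np

rng = np.random.default_rng(3)
target = rng.normal(size=500)

# Two hypothetical "simulations": one tracks the target noisily but with the
# right variance; the other is perfectly correlated yet three times too strong.
right_var = 0.8 * target + np.sqrt(1 - 0.8 ** 2) * rng.normal(size=500)
too_strong = 3.0 * target

r1 = np.corrcoef(target, right_var)[0, 1]   # < 1
r2 = np.corrcoef(target, too_strong)[0, 1]  # exactly 1
d1 = np.mean((target - right_var) ** 2)     # modest
d2 = np.mean((target - too_strong) ** 2)    # large: 4 * Var(target)
```

Correlation ranks the amplified series first, while the distance ranks it last, which is why U_T, not U_R, is expected to separate the differently scaled forced responses.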
The low solar single-forcing simulation (number 2) is not significantly correlated with, or close to, the E1 targets (Fig. 3a and c). In contrast, the high solar simulation (number 3) is significantly correlated with, and close to, the E2 targets (Fig. 3b and d). This implies that the low solar forcing is too weak to produce any detectable effect at the 30-yr time unit, whilst the high solar forcing is strong enough. A related conclusion was reached by Ammann et al. (2007): the greater the solar forcing amplitude applied to their model, the weaker the detectable response to other natural forcings. As regards the CTRL simulations, their U_R values are mostly insignificant, as should be expected given the construction of the experiment and the null hypothesis being tested.

Local analysis
At global or hemispheric scales, the temperature can be expected to respond to large-scale external forcings (such as solar or greenhouse gases), whereas at local or regional scales the internal climate dynamics can account for a larger proportion of the temperature variability (Goosse et al., 2005). Hence, on small spatial scales, it may not be possible to distinguish between simulations that use low and high solar forcing, and consequently to rank them. A current set of proxy locations from Juckes et al. (2007) was used to generate pseudo-proxies in order to investigate whether the low and high solar simulations can still be distinguished (Fig. 4). Though this set of locations is clearly a sparse representation of the global surface, 20 to 40 or so proxy locations is typical of the number of high-quality millennial proxy records found in current analyses (Christiansen and Ljungqvist, 2011).
The same type of experiment conducted in the global analysis (Fig. 3) was also conducted for the combined Juckes et al. (2007) locations (Fig. 5). Specifically, we compute local correlation and distance measures for each proxy location before they are combined to obtain a single U_R and U_T value for each simulation (Sects. 7 and 8; Part 1).

Fig. 5. As Fig. 3, but using the local proxy locations from Juckes et al. (2007).

The correlation analysis U_R for the Juckes et al. (2007) proxy locations (Fig. 5a and b) gives results similar to the global time-series analysis, though surprisingly the correlations are not less significant, and are sometimes even more significant. This could not have been expected, given the increased influence of internal (unforced) variability at the regional scale in combination with the reduced area coverage. However, in contrast to the global analysis, when E1 serves as target, U_T is unable to distinguish the E1 simulations from the CTRL simulations (Fig. 5c), whereas the E2 simulations are again significantly closer to the target than the CTRL simulations when E2 serves as target (Fig. 5d). Concerning the single-forcing simulations, only the high solar simulation (number 3) is significantly closer to the targets than the CTRL simulations, when E2 is the target (Fig. 5d).

Using a realistic set of proxy locations such as the Juckes et al. (2007) set, it seems difficult to rank simulations unless the forcing is large and multi-decadal in nature (as is the case for the high solar forcing used here). Note that U_R is more sensitive than U_T for testing whether a model forcing has any correspondence with the true climate, but it answers a different question than U_T. This higher sensitivity is seen when we compare panels a and b with c and d, respectively, in both Figs. 3 and 5. Specifically, if U_R is not significant, neither is U_T. Comparisons between the Juckes et al. (2007) and global land-only average results naturally lead to the question of how the possibility to rank simulations depends on the spatial coverage of the pseudo-proxy data.

Varying coverage
In practice, there are relatively few locations for which high-quality proxy data are available, or where there is at present the potential to acquire more data. A pseudo-proxy experiment, however, has the advantage of allowing any number of locations to serve as proxy or instrumental series. Hence, an analysis is conducted of how varying degrees of surface-area coverage (in %) affect the sensitivity of the correlation and distance measures to distinguish between simulations with low or high solar forcing.
The specified global surface-area coverages are 0.1, 0.25, 0.5, 1, 2, 3, 4 and 5 %, using only land grid points, which is equivalent to 3, 10, 22, 44, 90, 137, 183 and 230 proxy locations, respectively. Calculation of the covariance matrices Cov(T_j1, T_j2) (Sect. 7; Part 1) and Cov(R_j1, R_j2) (Sect. 8; Part 1) becomes computationally intensive for large coverages; hence, they were only calculated up to 5 %. Note that, although a principal component truncation could in principle be considered here to reduce the dimensionality of the climate variability represented by the proxy series, we chose not to pursue such an approach, owing to the heterogeneous coverage distributions, the arbitrary nature of the choice of retained principal components, and the varying seasonal representation and time periods covered by real proxies. The set of proxy locations was selected as a stratified random sample from the available land points in the COSMOS simulations, with specified proportions for three strata (the latitudinal bands 0-30°, 30-60° and 60-90°). The stratification was chosen to better control the coverage and to account for the changing area of the grid points with latitude in the simulations.

Figure 6 shows the correlation U_R (top panel) and distance U_T (bottom panel) measures for the low (light orange) and high (yellow) solar single-forcing simulations for different coverages, again with both E1 and E2 simulations serving as targets. For each coverage level, approximately 100 noise realizations were generated, of which the median values are represented by solid lines and the upper and lower quartiles by dashed lines. For comparison, results for the volcanic (red) forcing simulation are also shown. The target and test statistic panels are arranged as in Figs. 3 and 5.
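The stratified draw can be sketched as follows (our illustrative code; the equal stratum shares and the synthetic latitudes are assumptions, since the paper's actual proportions and the COSMOS land mask are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(4)

def stratified_sample(lats, n_total, strata=((0, 30), (30, 60), (60, 90)),
                      shares=(1 / 3, 1 / 3, 1 / 3)):
    """Draw grid-point indices with a specified share of the sample taken
    from each absolute-latitude band (the three strata in the text)."""
    chosen = []
    for (lo, hi), share in zip(strata, shares):
        band = np.flatnonzero((np.abs(lats) >= lo) & (np.abs(lats) < hi))
        k = min(int(round(share * n_total)), len(band))
        chosen.extend(rng.choice(band, size=k, replace=False).tolist())
    return np.asarray(chosen)

# Hypothetical latitudes of available land grid points
lats = rng.uniform(-90, 90, size=2000)
idx = stratified_sample(lats, n_total=44)  # e.g. the ~1 % coverage case
```

Sampling without replacement within each band keeps the locations distinct, while the fixed shares keep the coverage under control across latitudes.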
The high solar simulation is significantly correlated even for the lowest coverages when E2 serves as target (Fig. 6b), whilst also achieving significant U_R values for coverages upwards of 1 % when E1 serves as target. By contrast, the high solar simulation U_T values are significantly better than those of the CTRL simulations for all coverages when E2 serves as target (Fig. 6d), but for no coverage when E1 serves as target (Fig. 6c). The low solar simulation shows no significant correlations for either target ensemble and can therefore be expected to be indistinguishable from the CTRL simulations using the U_T measure. The volcanic simulation is mostly significantly correlated with both E1 and E2 targets (Fig. 6a and b), but its U_T values are generally only significant for coverages upwards of 1 % for both targets (Fig. 6c and d).

Figure 7 is arranged as Fig. 6 but shows the E1 (blue) and E2 (red) ensemble average results. Both ensembles are significantly correlated with all targets, even for the lowest data coverages. The results for U_T are much the same as for the global analysis in Sect. 4.1, where the E1 and E2 ensembles can be correctly ranked with their respective targets. For coverages lower than 1 %, it becomes difficult to distinguish E1 from the CTRL simulations or to separate the E1 and E2 simulations when E1 serves as target (Fig. 7c). Additionally, the experiments of Figs. 6 and 7 were conducted for cases with SNR = 0.25 and with negligible noise, the results of which are briefly discussed in the conclusions and shown in the Supplement. An important feature of Figs. 6 and 7 is how flat the U_R and U_T measures are with changing coverage after a certain coverage is reached. In fact, there is little gain in increasing the sample size from 40 or so proxy series to several hundred. Above all else, this suggests a substantial degree of spatial correlation in simulated temperatures, given the 30-yr time resolution used in this analysis (Jones et al., 1997; Franke et al., 2011).
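This flattening is what one expects when averaging spatially correlated records: for n series with common variance σ² and pairwise correlation ρ, the variance of their mean is σ²(1 + (n − 1)ρ)/n = σ²(ρ + (1 − ρ)/n), which plateaus at σ²ρ rather than decaying to zero. A quick check with an illustrative ρ (our choice, not estimated from the COSMOS fields):

```python
def var_of_mean(n, rho, sigma2=1.0):
    """Variance of the mean of n equally correlated series with pairwise
    correlation rho: sigma2 * (1 + (n - 1) * rho) / n."""
    return sigma2 * (1.0 + (n - 1) * rho) / n

rho = 0.3  # illustrative inter-site correlation
floors = {n: var_of_mean(n, rho) for n in (3, 10, 40, 230)}
# Going from 3 to 40 sites cuts the variance by ~0.22, but going from
# 40 to 230 sites gains only ~0.015: a few dozen sites realize most of
# the benefit, consistent with the flat curves in Figs. 6 and 7.
```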
Finally, we should mention that two variants of U_T for ensemble means were defined in Part 1. In the main variant, averaging of the distance measure D² is undertaken over the different individual simulations before calculating the T and U_T statistics. This variant has been used in all analyses here.
In the alternative variant, defined in Appendix A of Part 1, the averaging is instead undertaken on the simulated temperature series before calculating D². Results for varying coverage with this alternative approach are shown in Appendix A of this paper, where Fig. A1 should be compared with Fig. 7c and d.

Conclusions
We apply a new statistical framework (Sundberg et al., 2012) designed for comparing ensemble model simulation surface temperatures from one or more locations with proxy and instrumental data. This framework derives a unified correlation-based statistic (U_R), providing an initial test of whether a set of simulation time series from different locations (and/or seasons) correlates with a set of target series for the corresponding real locations (seasons), and a distance-based measure (U_T) that can be used to assess the goodness-of-fit of a given forced simulation in comparison with unforced ones. The ultimate goal was to rank the simulations according to their closeness to the target data. A pseudo-proxy experiment was designed for this task, based on the MPI-COSMOS Earth system model simulations (Jungclaus et al., 2010). Here, the "true" climate and the proxy noise are known; hence, if no difference between two forced simulations containing different solar forcing evolutions can be detected with these methods at realistic proxy noise levels, then no significant conclusions could be drawn from comparing the same model output with real proxy data. Firstly, an analysis was conducted on globally averaged land-only data, where a single series was calculated for each simulation and compared with every member of the full-forced E1 (low solar) and E2 (high solar) ensembles in turn, with added noise. Regardless of whether E1 or E2 simulations are used as target, it was found that both simulation types are strongly correlated (significant positive U_R) with each other. Knowing that the shared forcing information gives significantly correlated temperature evolutions between the low and high solar simulations, U_T was found capable of ranking these simulations correctly.
Given that this statistical framework has been developed in view of using real proxy information to assess the goodness-of-fit of model simulations, a pseudo-proxy evaluation was also conducted for a representative set of about 30 proxy locations (taken from Juckes et al., 2007). The results of this multiple-site local comparison were similar to the global land-only results; however, the U_T values of the E1 ensemble could not be said to be significantly different from those of the CTRL simulations when E1 serves as target. This motivated an analysis of how differing coverage levels change the significance of the U_T and U_R statistics (Figs. 6 and 7). The results suggest that, for a global coverage of, say, 40 or more proxy locations, if high-quality individual proxy series are obtained with low noise levels (SNR of at least 0.71 for white noise defined at the analysed 30-yr time unit), it can be possible to distinguish the E1 and E2 ensembles when E1 serves as target. If E2 serves as target, very few proxy series are needed. Additionally, the same type of analysis was conducted for a higher noise level (SNR = 0.25), where it was found that the E1 and E2 simulations are indistinguishable even if the global surface-area coverage is 5 % (approximately 230 proxy locations). Although these results, in quantitative terms, are conditional upon the actual climate model simulations used to define the pseudo-proxy world, they have an important implication: it is more important to improve the quality of individual local proxy series in terms of SNR than to increase the quantity of available proxy locations. Even a limited spatial coverage is sufficient to distinguish forced multi-decadal temperature signals, provided the temperature proxies are of sufficient quality and represent areas that can be directly compared with model output.

Fig. A1. As Fig. 7, with E1 (blue) and E2 (red) ensemble averages. The thick lines denote the use of inside averaging, whilst the thin lines denote outside averaging (as presented in Fig. 7). Note that the y-axis is extended here to accommodate the inside-averaging lines.

Appendix A: Averaging inside D²
From Fig. A1, if the alternative "inside" averaging (defined in Appendix A of Part 1; thick lines) is used instead of "outside" averaging (thin lines) in calculating the E1 and E2 ensemble averages, the U_T results change little when E1 serves as target, whereas there is a substantial increase in the significance of U_T when E2 serves as target. This likely reflects the fact that, if there is a stronger common signal amongst the ensemble members (as with the high solar E2 ensemble), the inside-averaging approach will enhance the SNR of the series, whilst, if the common signal is weaker (as with the low solar E1 ensemble), there will not be a large difference between the approaches. Hence, inside averaging can be more effective than outside averaging.
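The mechanism can be illustrated with a toy sketch of our own (plain mean-squared distance stands in for the D² measure of Part 1, and the "ensemble" is synthetic): averaging the member series first ("inside") suppresses the members' independent noise before the distance is taken, so when a common signal is present the inside distance to the target is smaller than the average of the individual distances ("outside").

```python
import numpy as np

rng = np.random.default_rng(5)

def msd(x, y):
    """Mean squared distance: a simple stand-in for the D^2 measure."""
    return float(np.mean((x - y) ** 2))

signal = rng.normal(size=300)                         # shared forced signal
members = [signal + rng.normal(size=300) for _ in range(5)]
target = signal + rng.normal(size=300)                # noisy "true" climate

outside = np.mean([msd(m, target) for m in members])  # average the distances
inside = msd(np.mean(members, axis=0), target)        # distance of the average
```

With no common signal among the members, the two approaches would differ far less, matching the weak-signal E1 behaviour described above.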