Statistical framework for evaluation of climate model simulations by use of climate proxy data from the last millennium - Part 1: Theory

A statistical framework for comparing the output of ensemble simulations from global climate models with networks of climate proxy and instrumental records has been developed, focusing on near-surface temperatures for the last millennium. This framework includes the formulation of a joint statistical model for proxy data, instrumental data and simulation data, which is used to optimize a quadratic distance measure for ranking climate model simulations. An essential underlying assumption is that the simulations and the proxy/instrumental series have a shared component of variability that is due to temporal changes in external forcing, such as volcanic aerosol load, solar irradiance or greenhouse gas concentrations. Two statistical tests have been formulated. Firstly, a preliminary test establishes whether a significant temporal correlation exists between instrumental/proxy and simulation data. Secondly, the distance measure is expressed in the form of a test statistic of whether a forced simulation is closer to the instrumental/proxy series than unforced simulations. The proposed framework allows any number of proxy locations to be used jointly, with different seasons, record lengths and statistical precision. The goal is to objectively rank several competing climate model simulations (e.g. with alternative model parameterizations or alternative forcing histories) by means of their goodness of fit to the unobservable true past climate variations, as estimated from noisy proxy data and instrumental observations.


1 Introduction
Studies that compare climate reconstructions for the last millennium with climate model simulations have contributed significantly to our understanding of natural and anthropogenic climate change. Based upon results from such investigations, the Intergovernmental Panel on Climate Change concluded in its fourth assessment report that volcanic and solar forcings have very likely affected NH mean temperature over the past millennium, that external influences explain a substantial fraction of inter-decadal temperature variability in the past, and that the climate response to greenhouse gas increases can be detected in a range of multi-proxy reconstructions during recent decades (Hegerl et al., 2007b). More recently, detection of temperature changes and their attribution to external influences, such as the concentration of stratospheric aerosols and possibly changes in total solar irradiance, has been made at a regional (European) scale for the last five centuries (Hegerl et al., 2011).
The growing size of climate model simulation ensembles for the last millennium (e.g. Jungclaus et al., 2010) and the constantly increasing number of local/regional climate reconstructions from proxy data (Jones et al., 2009) will make it possible to undertake a more systematic evaluation of model simulations against proxy data. However, the growing amount of information also calls for new statistical tools for evaluating the models against the reconstructions. Statistical measures of model performance in terms of mean square errors have long been used within weather prediction to compare different forecast systems and to track forecast improvements over time (Murphy, 1988; Murphy and Epstein, 1989; Krishnamurti et al., 1999). These ideas have developed into methods for the detection and attribution of climate change signals using the instrumental record (Allen and Tett, 1999) and paleoclimate reconstruction data (Hegerl et al., 2007a), as well as techniques for data assimilation of climate proxy data in model simulations (Goosse et al., 2006; Widmann et al., 2010). Statistical measures of climate model performance can use spatial and temporal correlations found in internal climate variability (Rowlands et al., 2012) and also combine information from several climate field variables (Mu et al., 2004). However, explicit treatment of the model and observational data error terms in the formulation of performance metrics becomes a great challenge when dealing with climate proxy data, because they are typically associated with substantial uncertainties, including mixed seasonal signals and time scale-dependent, temporally unstable climate-proxy relationships. Moreover, the available proxy data are irregularly distributed in space, vary in seasonal representativeness and can reflect different climate variables (Jones et al., 2009).
Our aim is to address some of these problems and formulate a statistical framework for evaluation of climate model simulations against a diverse set of climate proxy series. We will assume that evaluation of the models against modern instrumental gridded data sets has already been made and that the models to be tested have been judged to simulate the current climate conditions reasonably well. For example, we assume that previous model evaluations have shown that the climate models have acceptable biases and that the models, when driven by historical external forcings, simulate climate trends that are consistent with the instrumental observations. Hence we focus on problems connected with how to use climate proxy data for model evaluation back into the pre-instrumental period. We demand that the proxies have sufficiently high temporal resolution and dating precision to allow direct calibration against instrumental climate time series. In practice, this requirement excludes many types of proxy data and also time periods far beyond the last millennium. Tree-ring data and historical documentary proxies are annually resolved and have exact dating, which make them suitable. Some proxies with lower resolution, but still with a great deal of precision in their dating, may also be considered, provided that their sampling resolution is high enough to allow meaningful calibration against overlapping instrumental series.
Our goal here is to develop a method that can be used to objectively rank several "competing" simulations by means of their goodness of fit to the unobservable true past climate variations, as estimated from noisy proxy data and instrumental observations. The competing simulations can be, for example, simulations driven with alternative plausible amplitudes of past radiative forcings (Jungclaus et al., 2010), simulations from models with different climate sensitivities due to different parametrization of unresolved physical processes (Murphy et al., 2004), or simulations in which a combination of alternative forcings and model parameters is used (Rowlands et al., 2012). We also consider the ranking of entire competing ensembles of simulations, where the members of one and the same ensemble are assumed to differ only in the initial conditions; i.e. their forcings and model physics are the same. Our performance metric will also serve as a test statistic of the null hypothesis that the climate model simulation under consideration is equivalent to an unforced model (control simulation). In addition, we will suggest another test statistic to test the null hypothesis that a climate model simulation does not explain any of the temporal variation in the instrumental or proxy data. To investigate the performance of our framework, under conditions where the results can be evaluated against a perfectly known climate, we undertake a pseudo-proxy experiment in a companion paper (henceforth Part 2).
We start by formulating a statistical model with near-surface temperatures in mind, from which a climate model evaluation framework is developed. Note that other climate variables, such as precipitation or a drought index, are probably more difficult to model and may require substantial modification of the theory presented below.

2 A statistical model
We assume the climate characteristic of interest, to be called τ, is a temperature time series representing a particular region during some time period, divided into a number of time units yielding a sequence of values τ_i, i = 1, ..., n, where the subscript i represents time. Typically, this region consists of a single model grid box, but averages over several grid boxes can also be considered. The time unit can be single years or, say, averages over a 10-yr or 30-yr period. To begin with, we only consider temperatures for a single region and a particular season, but later (in Sect. 7) we will investigate how to combine data from different regions and seasons. The following notations will be used:
x - a simulated temperature value for the region and time period of interest, generated by a climate model.
τ - the true temperature, corresponding to x as a spatial and temporal average over the same region and time unit. The true temperature is an unobserved (or latent) variable, except in those cases where we set τ = y (see below).
y - a measured temperature, intended to represent τ, being also some average over space and time, and available for some period of time. This measured value y can differ from τ because of measurement errors, but also because y and τ are somewhat different spatial and temporal averages (typically, y is an average taken over a finite set of irregularly spread observing stations and for a set of possibly time-varying observation hours). Sometimes we will assume that this observed temperature approximates the true temperature τ well enough that we can neglect measurement-type errors in these observations. However, in practice we often expect some non-negligible errors to exist.
z - a surrogate for the true temperature τ, derived from climate proxy data. When observed temperatures y are not available, proxy measurements z will be used. Here we ignore all practical problems connected with how to construct temperature proxy series from raw proxy data (e.g. tree-ring width or density measurements). Hence, we think of a proxy series as a final product for use in climate reconstruction (e.g. a tree-ring chronology), constructed in the best possible way.
The following statistical model explicitly allows climatic forcing effects jointly in the climate model simulations (x) and in the actual temperature (i.e. all of τ, y and z). This is crucial, since inclusion of temporally varying external forcings in the climate model simulation is the only reason to expect any temporal correlation (covariation) between simulations and actual temperature. The forcing effects can, for example, be the temperature response to radiative forcing from stratospheric aerosols ejected from large volcanic eruptions or the response to variations in solar radiation. Note that, in general, a particular type of forcing imposed on the climate model is not a true reflection of reality, because the forcing history is incompletely known regarding its temporal evolution, its amplitude and its spatial distribution pattern. Moreover, it is typically only crudely represented in the simulations. Its effect on temperature need not be the same in reality as in the model, because these worlds may have different sensitivities to the forcing and possibly also different spatial response patterns.
For simplicity, we will assume that the latent relationship between the true response to a real forcing and the simulated response to a reconstructed forcing of the same type, when imposed upon a climate model, is (approximately) proportional, when measured as deviations from the mean values of τ and x. This does not prevent additional uncorrelated random forcing effects in the climate model, due to causes discussed above. Note that we make no assumption about how the response to the forcing is related to the forcing itself, but only that the real and simulated responses are linearly related. This way of thinking about the response to climate forcings is similar to that used in detection and attribution studies (e.g. IDAG, 2005). We will discuss, in Sect. 9, some relationships between our framework and that used in detection and attribution.
Statistical Model 1: Climate model simulation sequence {x_i}, true climate sequence {τ_i}, instrumental measurement sequence {y_i}, and proxy sequence {z_i} are mutually related through the following model, explained below:

x_i = μ_x + α ξ_i + δ_i
τ_i = μ_τ + ξ_i + η_i
y_i = τ_i + θ_i
z_i = μ_z + β (τ_i − μ_τ) + ε_i

Here, Greek letters are used for latent variables, random variables, unobserved errors and unknown coefficients, to indicate their unobservability. In contrast, x, y and z are observed or measured. Terms μ_x, μ_τ and μ_z are the mean values over time, around which x, τ (and y), and z naturally vary. Quantities δ, η, θ and ε are regarded as random variables, with mean values zero and variances σ²_δ, σ²_η, σ²_θ and σ²_ε, whereas α and β are unknown coefficients.
ξ denotes the true effect of a specific type of forcing that has influenced the true temperature τ. Since both the causes behind the forcings and the actual effects are uncontrolled, we regard this variation as only partially random. The forcing can be either of a single type (e.g. only volcanic forcing) or a combination of several forcings (e.g. volcanic and solar forcing). Note that ξ is not the forcing itself, but rather its temperature response.
α ξ represents the unknown variability in x that can be linearly explained by the true forcing effect. For simplicity, we have assumed an (approximately) proportional relationship to the true effect on τ . A correct representation of the forcing effect in the climate model corresponds to α = 1, whereas an unforced climate model has α = 0.
η denotes the (residual) variation in true temperature that cannot be statistically explained by the particular forcing under consideration. This is then uncorrelated with ξ .
δ represents internal noise variability in the simulations and any variability in the simulations unrelated (uncorrelated) with the true forcing effects. It will thus incorporate the uncorrelated part of nonlinear climate model effects corresponding to true climate forcing effects.
θ denotes the measurement error in the temperature variable y, making y differ randomly from the true temperature τ .
β(τ_i − μ_τ) is the regression of the proxy z on the true temperature τ. The observed proxy value z will be correlated with the measured temperature y, due to the τ they have in common, and we will use that correlation to calibrate the proxy variable.
ε represents the residual variation in z, uncorrelated with τ.
It is judged reasonable that all random variables ξ, η, θ, δ and ε should be mutually uncorrelated, and this is also assumed below. Under this assumption, a positive correlation between x and τ (or y or z) implies that they share the term ξ. In other words, the effect of the forcing in x corresponds positively with that in the true temperature τ (or y or z).
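As a concrete illustration, Statistical Model 1 can be simulated directly. The following Python sketch is not from the paper; all means, coefficients and standard deviations are hypothetical values chosen for illustration. It generates the four sequences under the stated assumption of mutually uncorrelated components:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical means, coefficients and standard deviations (illustrative only).
mu_x, mu_tau, mu_z = 0.2, 0.0, 0.0
alpha, beta = 1.0, 0.8
sd_xi, sd_eta, sd_delta, sd_theta, sd_eps = 0.3, 0.2, 0.2, 0.05, 0.4

xi = sd_xi * rng.standard_normal(n)        # true forcing effect on temperature
eta = sd_eta * rng.standard_normal(n)      # unforced (residual) true variability
delta = sd_delta * rng.standard_normal(n)  # internal variability of the simulation
theta = sd_theta * rng.standard_normal(n)  # instrumental measurement error
eps = sd_eps * rng.standard_normal(n)      # proxy noise

x = mu_x + alpha * xi + delta              # climate model simulation
tau = mu_tau + xi + eta                    # true temperature (latent)
y = tau + theta                            # instrumental measurement
z = mu_z + beta * (tau - mu_tau) + eps     # (uncalibrated) proxy

# Because x and tau share the forcing effect xi, a forced simulation
# (alpha != 0) is correlated with the true temperature.
```

With alpha = 0 the simulated x would share no component with τ, and the sample correlation between x and τ would be near zero.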
The ξ and η (i.e. the components of τ ) sequences will certainly show autocorrelation on various time scales, and our theory allows this. Sometimes it could be of interest to consider the more complicated case of multiple forcings (i.e. the joint effect of several individual but possibly interacting forcings, not equally represented in the simulations), represented by a vector ξ instead of a scalar ξ . Although climate model simulations driven by multiple forcings are used in the experimental companion paper (Part 2), the theoretical aspects of multiple forcings will be investigated further in a future analysis.
The internal variability sequence δ (but neither θ nor ε) will be assumed to be temporally uncorrelated, i.e. white noise, whenever a specification is needed (essentially only for properties of the test statistics of Sects. 6 and 8). This is a slight limitation of the present version of the theory, because the real simulation processes will certainly show some degree of autocorrelation, at least on rather short time scales. In the pseudo-proxy experiment in Part 2, white noise δ is assumed, motivated by finding and selecting a time unit for which the autocorrelation is negligible.
A forced simulation cannot be expected to explain any of the variability in the real temperature unless the two are statistically correlated. Thus, in practice, if we want to test some forced simulations of different types, to rank them according to how well they are able to explain the observed temperatures, it is natural to first test by correlation tests whether they can explain any of the observed temperature variations. Only forcings that provide statistically significant correlations between simulated and observed temperatures are worth studying for determination of the optimal forcing magnitude and for use in calculations of a distance measure. Although a correlation test should therefore be carried out before any distance measure is calculated, we start the description of our statistical framework by developing a distance-based performance metric (in Sects. 3-7) before we formulate a correlation-based test (in Sect. 8).
3 The distance measure, D²(x, z)

The problem is to identify, among several forced climate model simulations, a simulation that is able to predict the actual temperature better than the others - and in particular better than unforced model simulations. For comparison of different forced simulations, to find out whose x-sequence of temperatures is best at capturing the real variation in temperatures (τ), we need a criterion. Performance metrics for climate model simulations are typically expressed as some kind of squared difference measure (Mu et al., 2004; Goosse et al., 2005, 2006), and we choose a criterion of this kind.
We postpone the problem with proxy data and assume first that we have the true temperatures τ available. We define the simple distance measure

D²(x, τ) = (1/n) Σ_{i=1}^{n} (x_i − τ_i)².   (1)

A statistical motivation for this criterion is obtained by considering D² as a mean squared error of prediction (MSEP) of τ.
It may be argued that such an MSEP criterion function should be formulated as dependent on possible autocorrelation among the simulation errors δ_i in x_i (Mu et al., 2004). If we assume δ_i to be white noise, this argument disappears. Even without this assumption, we can choose criterion (1), primarily for its simplicity. We will return to motivations in Sect. 5. The requirement on δ_i to be white noise will reappear in the calculation of standard errors for the test statistics, but even there, this assumption is not crucial so long as we use these statistics more as criteria for ranking than for stating significances.
The better the climate model represents the forcing effects that underlie the true temperature, the smaller the expected distance between simulations and true temperatures. However, any systematic bias in x will also contribute to D². If one has good reason to assume that systematic biases can be neglected for a particular study, then they can be removed by subtracting the mean values of x and τ over a common time period. Doing so, however, obviously makes the criterion unsuitable for evaluating systematic model biases; rather, it then solely focuses on comparing the temporal evolution of climate model simulations with the true temperature evolution.
Since the true τ_i is not available, we have to replace it by the measured y_i for the period when y is observed, and otherwise by a suitable proxy z_i. For notational convenience, we suppress y and write D²(x, z), where z_i is assumed to be replaced by y_i when y_i is available:

D²(x, z) = (1/n) Σ_{i=1}^{n} (x_i − z_i)².   (2)

Leaving aside for the moment how z should be chosen, it is enough that z satisfies Statistical Model 1. There is motivation to modify D² by giving different weights to different terms of D²(x, z), depending on how good the available data are. However, this discussion will be postponed to Sect. 5. We will first (in Sect. 4) compare D²(x, z) with the ideal D²(x, τ). We do not want D²(x, z) to yield a systematically different ranking of a set of different x than that given by D²(x, τ), and we will see under what circumstances it does not. The criterion for this will yield a procedure for the calibration of the proxy series z, for use in D²(x, z). Later, we will discuss the statistical significance and precision of D²(x, z) (Sect. 6) and how to combine information from several regions or seasons into a unified model performance metric for each model simulation (Sect. 7).
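The distance measure itself is trivial to compute. A minimal Python sketch follows, with synthetic series standing in for the simulation, the true temperatures and an unforced control (all numbers are illustrative, not from the paper):

```python
import numpy as np

def distance_measure(x, target):
    """D^2: mean squared difference between a simulated temperature series
    and the target series (true, instrumental or calibrated proxy values)."""
    x = np.asarray(x, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.mean((x - target) ** 2)

# Synthetic illustration: a simulation sharing a forced signal with the
# target should come out closer (smaller D^2) than an unforced control.
rng = np.random.default_rng(1)
xi = rng.standard_normal(200)                  # shared forcing effect
tau = xi + 0.3 * rng.standard_normal(200)      # "true" temperatures
forced = xi + 0.3 * rng.standard_normal(200)   # simulation with alpha = 1
unforced = 0.3 * rng.standard_normal(200)      # control simulation, alpha = 0
```

Here `distance_measure(forced, tau)` comes out smaller than `distance_measure(unforced, tau)`, illustrating the intended ranking behaviour.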

4 Proxy calibration to avoid ranking bias in D²(x, z)
We assume here that we want to rank different climate model simulations according to their ideal distance measure D²(x, τ). However, we only have the surrogate measure D²(x, z), using the observed temperature variable y (when available) or a proxy measurement z instead of the true temperature τ, and we do not want this to change the ranking in any systematic way. More precisely, we demand the mean value of D²(x, z) − D²(x, τ), given x and τ, to be free of x and τ, in particular free of x − τ. We first conclude that replacement of τ by y in D²(x, τ) does not introduce any ranking bias. This is seen from the relation

(x_i − y_i)² − (x_i − τ_i)² = −2 (x_i − τ_i) θ_i + θ_i².   (3)

Averaging over the noise term θ, for a given x and τ, we obtain zero for the first term and a constant σ²_θ = E(θ²) for the second term on the right-hand side. This means that the noise term θ of y does not introduce any ranking bias.
For the proxy data, following the statistical model, the formula corresponding to Eq. (3) is

(x_i − z_i)² − (x_i − τ_i)² = −2 (x_i − τ_i)(z_i − τ_i) + (z_i − τ_i)²,   (4)

where, by Statistical Model 1,

z_i − τ_i = (μ_z − μ_τ) + (β − 1)(τ_i − μ_τ) + ε_i.   (5)

In the mean value of Eq. (4), the contribution of the noise term ε will only be the constant σ²_ε, corresponding to σ²_θ from Eq. (3). For the mean value of Eq. (4), given x and τ, to be totally free of x − τ, however, it is seen from Eq. (5) that we must have both μ_z = μ_τ (= μ_y) and β = 1. This tells how raw proxy data z must be calibrated so that they do not introduce any systematic deviations from the ideal ranking.
For the calibration of the proxy data z, we assume that we have available a period of overlapping proxy and temperature measurements. This allows the estimation of the relationship between z and y, which is the basis for the calibration. The first requirement, that z should have the same mean value as y, is naturally achieved, except for the unavoidable calibration uncertainty, by adjusting z by an additive constant so that its average value z̄ over the calibration period satisfies z̄ = ȳ.
The second requirement is that z should have a regression on the latent variable τ with regression coefficient β = 1. Thus, given an uncalibrated proxy z_0 with regression coefficient β_0 on τ, z_0 should be rescaled by division with β_0 to form a calibrated z as z = μ_y + (z_0 − μ_{z_0})/β_0, except for calibration uncertainty. In the case when the error in y is negligible, so that y = τ, this corresponds precisely to the so-called classical calibration procedure (Osborne, 1991; Brown, 1993), in which z_0 is regressed on y and this relationship is inverted to yield a predictor/estimator for y.
Next, allow the error in y to be non-negligible. Then we have a statistical relationship between z_0 and y of the structural relationship type (an errors-in-variables model). Provided that we can estimate or otherwise judge the size of the error variance in y (i.e. σ²_θ), we can obtain an approximately unbiased estimator of β_0 by

β̂_0 = s_{yz_0} / (s²_y − σ²_θ),   (6)

where s_{yz_0} and s²_y are the empirical covariance and variance, respectively (see Fuller, 1987; Cheng and van Ness, 1999). This is the quantity by which to normalize z_0 to obtain the desired z sequence: z = μ_y + (z_0 − μ_{z_0})/β̂_0. Setting σ²_θ = 0 brings us back to the previous situation.
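Both calibration variants can be sketched in a few lines of Python. The function below is an illustration with synthetic data: `var_theta` is the assumed instrumental error variance, and `var_theta = 0` recovers the classical calibration procedure.

```python
import numpy as np

def calibrate_proxy(z0, y, var_theta=0.0):
    """Calibrate a raw proxy series z0 against an overlapping instrumental
    series y, so that the calibrated z has the same mean as y and an
    (estimated) regression coefficient of 1 on the true temperature.
    var_theta is the assumed instrumental error variance; var_theta = 0
    gives the classical calibration procedure."""
    z0 = np.asarray(z0, dtype=float)
    y = np.asarray(y, dtype=float)
    s_yz0 = np.cov(y, z0, ddof=1)[0, 1]      # empirical covariance s_yz0
    s2_y = np.var(y, ddof=1)                 # empirical variance s^2_y
    beta0_hat = s_yz0 / (s2_y - var_theta)   # errors-in-variables estimate
    return y.mean() + (z0 - z0.mean()) / beta0_hat

# Synthetic illustration: a proxy recording half the temperature signal.
rng = np.random.default_rng(2)
tau = rng.standard_normal(500)                          # "true" temperatures
y = tau + 0.1 * rng.standard_normal(500)                # noisy instrumental series
z0 = 2.0 + 0.5 * tau + 0.2 * rng.standard_normal(500)   # raw proxy
z = calibrate_proxy(z0, y, var_theta=0.01)
# After calibration, z has the same mean as y and its regression
# coefficient on tau is close to 1.
```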

Conclusion.
To avoid systematic ranking error in the squared distance D²(x, z) relative to the ideal D²(x, τ), the proxy z should be mean-adjusted and normalized such that the estimated regression coefficient of z on τ is 1. This corresponds to use of the so-called classical calibration procedure for calibrating z against y, when errors in y are negligible. To allow for errors in y, the modified Eq. (6) should be used.
Note that, in comparison with the observed temperature y, the amplitude of variation in the proxy, Var(z), is exaggerated after classical calibration or when using Eq. (6). The reason is that the full amplitude of the true temperature signal is retained and that the proxy noise variance is superimposed on the temperature signal variance. We will see in the next section how this is compensated for by an optimal weighting of the different observed values according to their variance components.
Remark. A calibrated proxy z, obtained by classical calibration or by using Eq. (6), may be called an estimator or predictor of τ in the sense that the true temperature component embedded within the noisy proxy series is estimated with its correct variance. However, the weaker the correlation between z and τ (or y), the larger the total variance in z, because the variance of the noise term becomes increasingly dominant. The z calibrated in this way is therefore not an optimal predictor of the true temperatures at each individual time point. For a single time-point prediction, direct regression of τ (or y) on z_0 would provide a more appropriate predictor/estimator, where the prediction error variance is minimized. This alternative way of calibrating climate proxy data has often been used in palaeoclimate studies (NRC, 2006). For the climate reconstruction problem in general, though, the seemingly desirable property of (in theory) minimized prediction error is not necessarily an advantage, because (in practice) it leads to a systematic bias of the mean reconstructed (i.e. predicted) past climate in periods whose mean value differs from that of the calibration period. As this is an undesired property, e.g. when judgements are made on how the recent climate differs from previous climates, it has led to vigorous discussions in the climate literature on how proxy data should be calibrated. von Storch (1999) and several more recent studies have discussed the importance of retrieving the full variance of the temperature signal, including discussions on the use of errors-in-variables models. It must be stressed, however, that in the present context, the way proxy data should be calibrated comes out as a corollary of the statistical model formulated, in combination with the explicit desire to obtain an unbiased ranking of forced simulations against the true past temperatures.
If more than one proxy series is available for the region and season of interest, they should be combined into a single z_0 sequence in order to increase statistical precision and thus yield the smallest possible randomness in D²(x, z). In theory, this is achieved by multiple regression of y on the set of available proxy series to obtain z_0. In practice, however, a number of complications must be dealt with, e.g. different time periods for different proxies, and there are several reasons (e.g. collinearity among the proxies, or that the relationships from the calibration period do not hold outside this period) why another way to combine the proxies may be preferred. We will not attempt to deal with these more practical problems here, but primarily conclude that, whatever method is chosen, the goal should be to optimize the correlation between z_0 and τ. The preferred z_0 is then rescaled using classical calibration or Eq. (6). In cases when different proxy data are available in different pre-instrumental periods, they must be calibrated separately, and when the calibrated proxy series is known to have different precision in different time periods, this must be adjusted for in the weighting (see Sect. 5 below).
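In the idealized case, the multiple-regression combination of several proxies can be sketched as follows. The data are synthetic, and the sketch deliberately ignores the practical complications just mentioned (differing record lengths, collinearity, unstable relationships):

```python
import numpy as np

# Sketch: combine several proxy series into a single z0 by multiple
# regression of the instrumental series y on the proxies over the
# calibration period (synthetic, illustrative data).
rng = np.random.default_rng(3)
n = 300
tau = rng.standard_normal(n)                 # "true" temperatures
y = tau + 0.1 * rng.standard_normal(n)       # instrumental series
proxies = np.column_stack([
    0.7 * tau + 0.5 * rng.standard_normal(n),   # proxy 1
    0.4 * tau + 0.8 * rng.standard_normal(n),   # proxy 2
])

# Least-squares fit y ~ intercept + proxies.
A = np.column_stack([np.ones(n), proxies])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
z0 = A @ coef   # combined proxy, still to be rescaled via classical
                # calibration or Eq. (6)
```

By construction, the fitted combination correlates with y (and hence with τ) at least as strongly, in sample, as any single proxy does.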
In practice, it is necessary to decide on a time unit to use for the calibration. For annually resolved proxy data, the calibration will have its highest precision if it is made using the full annual resolution. However, if the model evaluation is made at a lower resolution (e.g. 10-yr or 30-yr means) and if there is reason to assume that the proxy/temperature regression relationship is time scale-dependent, then it may be better to use a lower resolution for the calibration, although this will of course decrease the statistical precision. The instrumental noise variance to be used in Eq. (6) can be difficult to estimate in practice, but efforts to estimate errors in gridded temperature data have been made (Brohan et al., 2006). Moreover, Moberg and Brattström (2011, Sect. 6.1) discuss a procedure to estimate the error variance in the mean of a set of neighbouring temperature station records.

5 Weighting in D²(x, z)
Direct temperature measurements y and proxies z have different precision. Moreover, the precision (particularly in z) can vary with time due to the quality and quantity of the raw data. This motivates giving different weights to different terms (time points) in D²(x, z). In order to understand how we should introduce this weighting in D², we first reconsider Statistical Model 1, assuming both α = 1 (correct amplitude of the forced component ξ) and β = 1 (calibrated z), so that the forcing effect ξ vanishes from x − z and x − y. We also assume μ_x = μ_y = μ_z, so there is no bias in x − y or x − z.
If the climate model is perfect in this sense, and if we first assume a Gaussian distribution with constant variance for the variability of x_i − z_i, the resulting Gaussian probability density for the whole observed series x − z is proportional to

exp( − Σ_{i=1}^{n} (x_i − z_i)² / (2 (σ²_δ + σ²_η + σ²_ε)) ),   (7)

where σ²_δ, σ²_η and σ²_ε are the variances of the corresponding components of the statistical model, and Var(x_i − z_i) = σ²_δ + σ²_η + σ²_ε. When y_i is available and replaces z_i, σ²_ε should be replaced by σ²_θ, but for simplicity of notation we leave that alternative aside for the moment. If there is a bias in x and/or a true forcing effect that does not have a linearly correct representation in the climate model (i.e. α ≠ 1, including the case α = 0), its D²-value will tend to be higher and the probability (7) to observe this vector x − z will tend to be exponentially smaller. Thus, the D² measure is proportionally equivalent to a Gaussian log-density.
The denominator σ²_δ + σ²_η + σ²_ε in the exponent of Eq. (7) was assumed constant. However, when the variances in this denominator vary with i, in particular the proxy noise term σ²_ε(i), the interpretation of D² as a Gaussian log-density tells us how the different terms should be (ideally) weighted in D², forming a weighted version D²_w:

D²_w(x, z) = (1/n) Σ_{i=1}^{n} (x_i − z_i)² / (σ²_δ + σ²_η + σ²_ε(i)).   (8)

An alternative formulation is to introduce the constant factor σ²_δ + σ²_η, corresponding to use of the density for x − τ instead of x − z in the numerator of the exponent of Eq. (7). We will use that version as our definition for w_i:

w_i = (σ²_δ + σ²_η) / (σ²_δ + σ²_η + σ²_ε(i)).   (9)

For times i when a precise y = τ is available (i.e. with σ²_ε(i) = σ²_θ = 0), the normalized weight (Eq. 9) equals 1, whereas w_i < 1 when a noisy proxy z is used, or an imprecise instrumental y.
Since the denominator of Eq. (8) is the expected value of the numerator of the same term, an alternative interpretation of the proposed weights is that they are chosen to make all terms of D²_w be of the same magnitude. The weight factor introduced in Eq. (9) is an ideal weight (under the assumptions made), for which we can at best give an estimate. Thus, we must insert estimates for each of the three variance components σ²_δ, σ²_η and σ²_ε(i). We assume that the first two components are constant over time, but we have reason to allow σ²_ε(i), and thus also the weight w_i, to vary over time depending on the precision of the available instrumental or proxy measurement.
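A direct transcription of this weighting idea into Python might look as follows. This is a sketch under the stated assumptions: the variance components are taken as given here, while their estimation is discussed next.

```python
import numpy as np

def weighted_distance(x, z, var_delta, var_eta, var_eps):
    """Weighted distance D^2_w: each term is weighted by
    w_i = (var_delta + var_eta) / (var_delta + var_eta + var_eps_i),
    so noisier observations contribute less. var_eps may be a scalar
    or a per-time-point array of noise variances."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    var_eps = np.broadcast_to(np.asarray(var_eps, dtype=float), x.shape)
    w = (var_delta + var_eta) / (var_delta + var_eta + var_eps)
    return np.mean(w * (x - z) ** 2)

# Time points with a precise instrumental value (var_eps = 0) get weight 1,
# so with var_eps = 0 everywhere D^2_w reduces to the unweighted D^2.
```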
To estimate σ²_δ, we propose to use the (time-)sample variance s²_δ, pooled from simulations of an unforced model (control simulation). The more simulations available, the better the estimate will be. The main reason to avoid using forced models here is that their simulations contain an unknown forcing-effect source of variation, contributing to the sample variance of the x series. A second reason is that the weights should not differ between the climate models to be ranked.
The variance σ²_η is arguably more difficult to estimate. It represents the unforced real temperature variance, which cannot be estimated directly from instrumental observations (y) because they will always include some forced variance. In particular, the anthropogenic greenhouse gas forcing is likely to be represented as a trend-like component in y, which acts to increase the estimated variance of y. Therefore, we propose to detrend the observed y before using it to estimate σ²_η. Fortunately, σ²_η (as well as σ²_δ) occurs in both the numerator and denominator of Eq. (9), so reasonably small errors in its estimate have little influence on the ratio.
Next, we need an estimate of the (possibly) time-varying σ²_ε(i). Although this quantity is needed for time points i outside the calibration period, we estimate it by using information from the calibration period when both y and z are available. Assume first that y = τ, i.e. σ²_θ = 0. We can use the calibration period to estimate the correlation ρ(y, z). The model formula z = y + ε implies ρ² = Var(y)/Var(z), from which we obtain the relationship σ²_ε = Var(y)(1 − ρ²)/ρ² (knowing that the regression coefficient of z on y is 1).
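This relationship translates directly into an estimator of the proxy noise variance from a calibration-period sample (a sketch, assuming y is noise-free over the calibration period; the function name is ours):

```python
import numpy as np

def proxy_noise_variance(y, z):
    """Estimate sigma^2_eps from calibration data, using z = y + eps,
    which gives rho^2 = Var(y)/Var(z) and hence
    sigma^2_eps = Var(y) (1 - rho^2) / rho^2."""
    y, z = np.asarray(y, float), np.asarray(z, float)
    rho2 = np.corrcoef(y, z)[0, 1] ** 2
    return float(np.var(y, ddof=1) * (1.0 - rho2) / rho2)
```

A perfect proxy (ρ² = 1) gives zero noise variance, and the estimate grows without bound as the calibration correlation weakens.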
Note that this estimate of σ²_ε is determined by the empirical correlation between the proxy and the instrumental data and therefore by the estimated statistical precision of the proxy. In cases when the proxy series z_i is known to have different precision in different time periods (and hence different calibrations have been made), a unique weight should be used for each such period, where each weight should be determined by using the corresponding calibration ρ². In this way, we can allow σ²_ε(i) in Eq. (9) to vary with time.
Let s²_y be the empirical variance of the (detrended) y and follow the procedure described above. This yields the weight formula

w_i = (s²_δ + s²_y) / (s²_δ + s²_y/ρ²)   (10)

for i in the proxy period. Note that for ρ² = 1 the formula yields w_i = 1, as it should do when we use y = τ. As ρ² approaches zero, so does w_i. The higher the ratio s²_δ/s²_y, the slower the approach to zero.
Let us now allow noise in y, with noise variance σ²_θ. If the ratio q = σ²_θ/s²_y > 0 is known, the weight formula for the period when only instrumental data y are used becomes

w_i = (s²_δ + s²_y(1 − q)) / (s²_δ + s²_y).   (11)

In this case, the weight is somewhat smaller than 1, depending on the size of q.
For the period when the proxy z is used, the weight formula becomes

w_i = (s²_δ + s²_y(1 − q)) / (s²_δ + s²_y(1 − q)²/ρ²).   (12)

A minor drawback (in practice) of Eq. (12) is that it might generate weights w_i > 1. This occurs when ρ² > 1 − q for the estimated values of ρ and q (which is not possible for the true values). Should that happen, we advise that the estimation procedures for q and ρ be checked, along with the assumptions of uncorrelated noise in, and between, θ and ε. If this does not help, w_i could, as a last resort, be redefined by using Eq. (10) or (11), bearing in mind that the resulting weights are not optimal.
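Putting the instrumental and proxy cases together, the weights of Eqs. (10)-(12) can be sketched as follows (our own sketch; the guard implements the ρ² > 1 − q check discussed above):

```python
def weight_instrumental(s2_delta, s2_y, q=0.0):
    """Eq. (11): weight for times with (possibly noisy) instrumental data;
    q = sigma^2_theta / s^2_y, and q = 0 gives w = 1."""
    return (s2_delta + s2_y * (1.0 - q)) / (s2_delta + s2_y)

def weight_proxy(s2_delta, s2_y, rho2, q=0.0):
    """Eq. (12), reducing to Eq. (10) when q = 0: weight for times with
    proxy data, based on the calibration correlation rho^2."""
    if rho2 > 1.0 - q:
        # Estimated rho^2 > 1 - q would give w > 1; the text advises
        # re-checking the estimation of rho and q in this case.
        raise ValueError("estimated rho^2 exceeds 1 - q")
    num = s2_delta + s2_y * (1.0 - q)
    den = s2_delta + s2_y * (1.0 - q) ** 2 / rho2
    return num / den
```

Note that a perfect proxy calibration (ρ² = 1, q = 0) returns a weight of exactly 1, matching the precise-instrumental case.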

6 Statistical significance and statistical precision of D²(x, z)
When a D² value is calculated for a forced climate model simulation, for a region and season corresponding to a true temperature series τ, it is relevant to first ask whether this D² is better (smaller) than a corresponding D² value for an unforced model. To make it possible to answer this question, we construct a statistical test of a null hypothesis expressing that the forced model is not better than an unforced model:

H0: the forced simulation behaves as an unforced one, i.e. x_i = µ_x + δ_i (Statistical Model 2),

where δ_i (but still neither η_i, θ_i nor ε_i) is regarded as white noise, for the variance formula below.
Note that the previous forced component of τ, i.e. ξ in Statistical Model 1, is now included in η, because there is no longer a point in expressing it explicitly. This means the true temperature τ can include forcing effects also in Model 2, albeit only implicitly. Except for that change of interpretation of η, the observed climate part of the model is the same as in Model 1. For the simulation variance σ²_δ, there is no clear general answer. If the incorporation of a forcing effect in the climate model does not increase the overall variance in the simulations, the residual variance σ²_δ must shrink. Another possibility is that the forcing effect is simply added to the variation, so that σ²_δ is not affected. However, to the extent that the forcing effects in the climate model are not precisely proportional to those in the true climate, they will contribute to an increased σ²_δ, so this is also a possible scenario. We will not deal further with this problem here, but, when necessary, simply assume that σ²_δ is the same both with and without forcings. A somewhat related approach to the problem of comparing climate models with the same types of forcings, but with different magnitudes, would be to try estimating α. Again, this will be a topic for future study.
The unforced climate model is assumed to have been run a number of times, and for each such "replicate" run (differing in initial conditions, and hence also in the actual trajectories of simulated climate variables), we calculate a D² value. Let K denote this number of simulations, and let k denote the number of simulations with a forced model (also differing in initial conditions), where all simulations share the same forcing history. Before we calculate the difference in D² between forced and unforced simulations, we average D² over all replicates in each of the two terms, respectively. This procedure yields the test statistic

T(x_f, x_u, z) = (1/k) Σ_{j=1}^{k} D²_w(x_f^(j), z) − (1/K) Σ_{l=1}^{K} D²_w(x_u^(l), z),   (13)

where x_f and x_u represent data from the forced and unforced models, respectively. An alternative averaging procedure would be to take averages over the x series inside each D², i.e. to use the average time series x̄_f and x̄_u and compute the difference T(x̄_f, x̄_u, z) = D²_w(x̄_f, z) − D²_w(x̄_u, z). This alternative procedure would be even more efficient, but it is not used here because it would also introduce a bias in the comparison unless k = K. However, we provide the details necessary to use this alternative in Appendix A, and both variants are used in our experiments in Part 2.
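The replicate averaging in Eq. (13) can be sketched as follows (our own sketch; each ensemble is given as an array with one replicate per row, and the weights are those of the previous section):

```python
import numpy as np

def mean_d2(ensemble, z, w):
    """Average of D^2_w(x, z) over the replicate rows of `ensemble`."""
    ensemble = np.asarray(ensemble, float)
    z, w = np.asarray(z, float), np.asarray(w, float)
    return float(np.mean(w * (ensemble - z) ** 2))

def t_statistic(x_forced, x_unforced, z, w):
    """Eq. (13): mean D^2_w over the k forced replicates minus the mean
    over the K unforced replicates; negative values favour the forced model."""
    return mean_d2(x_forced, z, w) - mean_d2(x_unforced, z, w)
```

Because each replicate has the same length, averaging the per-replicate D² values equals the overall mean of the weighted squared differences, which is what `np.mean` computes here.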
We show below that an approximate distribution under H 0 for the test statistic in Eq. (13) can be obtained with the help of an analytical formula for its standard error. In doing this, we will regard the z (and y) series as fixed and given. It means that we do not need any distributional assumptions about the z series. This is possible because z is common to both terms of Eq. (13).
Since we are more interested in variation than in mean values, we assume that all x_u and x_f series are mean value adjusted to a common value, which will be denoted µ_x. The test statistic value can be rewritten as

T(x_f, x_u, z) = mean[w (x_f − µ_x)²] − mean[w (x_u − µ_x)²] − (2/n) Σ_i w_i (z_i − µ_x)(x̄_f,i − x̄_u,i),   (14)

where mean[·] in the first two terms represents averaging over both replicates and the time index i. Here the factor (z − µ_x) has the role of a weight factor, multiplying with w.
It is natural to adjust the x_u and x_f series additively so that the z series also has the same mean value, z̄ = µ_x. Then we can write z_i − z̄ in the last term.
The distribution of T is presumably well approximated by a normal distribution, since all terms of the representation Eq. (14) are sums of a large number of terms (referring to the central limit theorem of probability). Under H0, the expected value of T(x_f, x_u, z) is zero, since the forced climate model is equivalent to the unforced model. Assuming normality not only of T but already of x_f and x_u, the variance of T can be expressed as

Var(T) = (2/n²)(1/k + 1/K) Σ_i w_i² (σ⁴_δ + 2σ²_δ (z_i − µ_x)²).   (15)

An approximately N(0, 1)-distributed test statistic is obtained by normalizing the T-value in question by its standard error, i.e. by the square root of Eq. (15) after insertion of the average z̄ for µ_x and of the estimate s²_δ for σ²_δ. It is of some importance to make sure that the estimate s²_δ is not too imprecise. As in Sect. 5, we propose to obtain this estimate by calculating the sample variance from all available control simulations.
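The normalization step can be sketched as follows. Note that the variance expression in the code is our reconstruction of Eq. (15) under the stated white-noise and normality assumptions, not a verified transcription, and the function names are ours:

```python
import numpy as np

def var_t(s2_delta, w, z, mu_x, k, K):
    """Variance of T under H0, reconstructed for white-noise, normally
    distributed delta:
    Var(T) = (2/n^2)(1/k + 1/K) sum_i w_i^2 (s_delta^4 + 2 s_delta^2 (z_i - mu_x)^2)."""
    w, z = np.asarray(w, float), np.asarray(z, float)
    n = len(w)
    terms = w ** 2 * (s2_delta ** 2 + 2.0 * s2_delta * (z - mu_x) ** 2)
    return (2.0 / n ** 2) * (1.0 / k + 1.0 / K) * float(np.sum(terms))

def normalized_t(t_value, s2_delta, w, z, mu_x, k, K):
    """Approximately N(0, 1) under H0; reject H0 for values below -1.65
    at the one-sided 5 % significance level."""
    return t_value / var_t(s2_delta, w, z, mu_x, k, K) ** 0.5
```

In practice, s²_δ would be inserted for σ²_δ and the average z̄ for µ_x, as described above.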
The test should reject H 0 if the resulting value is too negative, e.g. to the left of −1.65 at the 5 % significance level. It should be kept in mind that, if many mutually independent climate models are tested against the unforced model, but none of them has an (appreciable) correlation with the real data y and z, we must nevertheless expect 5 % false positives from tests at the 5 % level, and analogously for the 1 % level. Thus, it is not enough to find one or a few models showing statistical significance, but the whole sequence of model tests must be considered. As a final comment, we note that an alternative way to perform the significance test would be to use a simulated/randomized resampling procedure to empirically determine the distribution of T instead of using the analytical variance formula in Eq. (15). This is not discussed further here (but see Appendix B for details).

7 Combination of data from different seasons and regions
Evaluation of palaeo-simulations from climate models should preferably be made using proxy records from as many regions as possible. Data from different regions and/or seasons could then be combined in a single test, but it is not obvious how this should be done. Proxy records from different regions may represent different seasons and may also be of different lengths. In this section, we define a unified performance metric for each model, based on a normalized sum of the test statistics T for all regions/seasons with available proxy data. This sum of test statistics can be a simple or a weighted one. Weighting could be implemented if we want a balanced spatial average but the regions are of different size or have a different density of proxy values, or if we want a balanced annual average for a region with quite different numbers of summer and winter values. Different quality (statistical precision) of the proxies does not necessitate weighting, because such effects are treated in the precision of the individual T-values (through the weights w_i used in D²).
For simplicity of notation, we first consider only a simple sum of T-values, Σ_j T_j, where the index j identifies the different regions and/or seasons used. We need the standard error of this sum, and we then use the standard formula for the variance of a sum of correlated terms:

Var(Σ_j T_j) = Σ_j Var(T_j) + Σ_{j1 ≠ j2} Cov(T_{j1}, T_{j2}).

Thus, what we need, together with the variances discussed in the previous section, are the covariances. Consequently, we need to supplement the variance Eq. (15) by the corresponding formula for covariances. This is obtained for Cov(T_{j1}, T_{j2}) from Eq. (15) by the following three operations:

- Change σ²_δ to Cov(δ(j1), δ(j2)), and σ⁴_δ to the covariance squared.
- Change w_i² to w_i(j1) w_i(j2).

- Change (z_i − µ_x)² to (z_i(j1) − µ_x(j1))(z_i(j2) − µ_x(j2)).
Here Cov(δ_i(j1), δ_i(j2)) = ρ(j1, j2) σ_δ(j1) σ_δ(j2), where ρ is the correlation coefficient. We have here assumed that not only the variances σ²_δ are constant over time, as in Eq. (15), but also the corresponding covariances. Note that the first term in the sum contains a covariance squared, corresponding to s⁴_δ in the variance Eq. (15). We assume that the covariances and the mean values µ_x(j) are estimated as with the variance σ²_δ and the µ_x in Eq. (15).

Now let non-equal weights be allowed, in the form Σ_j c_j T_j, where the c_j are fixed coefficients which need not sum to 1. To express the corresponding calculations in this case, we arrange the variances and covariances of the T_j in the covariance matrix V(T) for the vector T with components T_j. Let c be the corresponding column vector with components c_j. Then the variance for Σ_j c_j T_j is obtained as the scalar

Var(Σ_j c_j T_j) = c′ V(T) c.

We now have all requisites to calculate a unified performance metric, U_T, for each climate model under consideration:

U_T = (Σ_j c_j T_j) / √(c′ V(T) c).

Thus, our final model score is a sum of (possibly weighted) individual T-values for all available regions/seasons with proxy data, normalized by its standard error. This means that we can interpret U_T as a unified normalized test statistic of the null hypothesis, H0, in the same way as for the individual T-values in the previous section. Hence, U_T can have a double usage: (i) to test if a forced climate model is better than unforced models, and (ii) as a rank value to compare different forced models; the more negative the U_T-value is, the better (note that a forced model with U_T > 0 performs worse than an unforced model).
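In matrix form, the unified metric is a short computation (a sketch; `t_values` holds the regional/seasonal T-values and `V` their covariance matrix, estimated as described above):

```python
import numpy as np

def unified_metric(t_values, V, c=None):
    """U_T: the (optionally c-weighted) sum of T-values, normalized by
    its standard error sqrt(c' V c); more negative is better."""
    t = np.asarray(t_values, float)
    V = np.asarray(V, float)
    c = np.ones_like(t) if c is None else np.asarray(c, float)
    return float(c @ t) / float(np.sqrt(c @ V @ c))
```

With c omitted, all regions/seasons receive equal weight, which is the choice made in the pseudo-proxy experiment of Part 2.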
At this point, some practical issues are considered. In reality, the different proxy series may be of different lengths. This gives us reason to think about what n represents; recall that n is used in the calculation of individual D² values, and in the Var(T) and Cov(T) values. How should we choose n in the different parts of the calculations when the proxy records are of different lengths? We suggest letting the longest record determine n in all calculations. For a particular shorter proxy series, we simply set w_i = 0 for all time points i where we have no measurement. A consequence of this is that more weight will be given to regional/seasonal data with long proxy series than to those with short series, which seems reasonable. Note, as an example, that a mean value µ_x in Eq. (15) for each region should be interpreted as representing only the time period when proxy data are available for that region, so the corresponding average z̄ can be a natural estimate of µ_x.
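The zero-weight convention for shorter records amounts to embedding each weight series in the common time axis (a small sketch; `start`, the offset of the record within the longest one, is our own illustrative parameter):

```python
import numpy as np

def pad_weights(w_record, start, n):
    """Place the weights of a shorter proxy record on the common time axis
    of length n (set by the longest record); w_i = 0 where no measurement
    exists."""
    w = np.zeros(n)
    w[start:start + len(w_record)] = w_record
    return w
```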
Note that, in the period when all proxies are available, the weighting will be made both according to the proxy quality (through their respective w i ) and according to the variances and covariances of the T-values (which include information from the behaviour of the simulated climate in the unforced models). If the additional weights c j in the sum of T are used, then this will give further weighting to the data. We will, however, not discuss here how to construct such additional weights, because we think this has to be determined uniquely for each particular set of available proxy data by external considerations, and no simple general rule seems plausible. In our pseudo-proxy experiment in Part 2, we will simply use the same weight for all regions.

R. Sundberg et al.: Statistical framework for evaluation of climate model simulations - Part 1

8 Correlation as test statistic
As pointed out in Sect. 2, before any distance-based performance metric is computed, one should first test whether a forced climate model simulation is able to explain, with statistical significance, some part of the variation in instrumental and proxy data. If a forced climate model is unable to explain any variation in the instrumental and proxy data, then the D² and U_T measures provide little interpretable information. Here we suggest a test statistic, U_R, based on the correlation between a climate model simulation and the observations. The x and z series are uncorrelated under H0 (defined below), but (positively) correlated when forcing effects appear jointly in model simulations and real climate data. The stronger the forcing effect in the model, the higher the expected correlation coefficient. We first consider a local test for a single grid box (season) and then extend it to a combination of data from several regions (and/or seasons). We will again use the notation z for the instrumental/proxy series, and the number of possible time units will be denoted n. For a particular grid box (season), data may be available only during a shorter period of time, but, with a weight factor that is zero when data are missing (as before), we can let n be the same for all grid boxes (seasons).
The null hypothesis to be tested is:

H0: the forced climate model under consideration does not explain any of the temporal variation in the actual instrumental/proxy data.
Under H 0 , we should of course not expect any significant correlation or covariance between x and z. However, unforced model simulations are important in providing a check that the test works reasonably under H 0 .
We propose the following regression-type statistic:

R(x, z) = (1/n) Σ_i w_i x̄_i (z_i − µ_z)

for a given z series. We allow k replicates of the same type of forced model, and we use their mean (x̄_i) above. If only one replicate is available (or if only one replicate is tested), then x̄_i represents a single simulation. When R(x, z) is normalized (divided) by its standard error, i.e. the square root of its variance given z,

Var(R | z) = (σ²_δ/k) (1/n²) Σ_i w_i² (z_i − µ_z)²,

we get the correlation coefficient in a semi-empirical form, which is our test statistic for a single grid box (season). As before, k is the number of replicates used to form x̄_i, and the variance factor σ²_δ is again estimated from all available control runs, which we know satisfy the hypothesis H0. The mean value µ_z is naturally estimated by the weighted average z̄ = Σ w_i z_i / Σ w_i. The weight factor w_i here, however, is not the same weight factor w_i as used with D² and T, because now only properties of the z series influence the weight. The principle is that the statistics (z_i − µ_z) should be weighted such that they get the same variance for all time units i. The weights should then be the following:

1. If y = τ (in periods where instrumental data with no, or negligible, noise are used): w = 1.

2. If the instrumental data y are noisy, with q = σ²_θ/s²_y: w = √(1 − q).

3. If a proxy z with calibration correlation ρ is used: w = ρ.
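Under the weighting principle just described, the local statistic and its standard error can be sketched as follows. This is our reconstruction of the semi-empirical correlation statistic, not a verified transcription of the published form, and the exact normalization constants should be treated as assumptions:

```python
import numpy as np

def correlation_statistic(x_ensemble, z, w, s2_delta):
    """Semi-empirical correlation test statistic for one grid box:
    R = (1/n) sum_i w_i xbar_i (z_i - mu_z), divided by its standard
    error given z, where mu_z is the weighted mean of z and xbar the
    mean over the k (mean-adjusted) forced replicates."""
    x = np.asarray(x_ensemble, float)          # shape (k, n)
    z, w = np.asarray(z, float), np.asarray(w, float)
    k, n = x.shape
    mu_z = np.sum(w * z) / np.sum(w)
    xbar = x.mean(axis=0)
    r = np.sum(w * xbar * (z - mu_z)) / n
    se = np.sqrt(s2_delta / k * np.sum(w ** 2 * (z - mu_z) ** 2)) / n
    return float(r / se)
```

A large positive value indicates that the forced simulation covaries with the instrumental/proxy series, as expected when the modelled forcing effects are adequate.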
As before, the presence of short-term autocorrelation in the unforced x series is avoided by the use of a sufficiently long time unit. Short- or long-term autocorrelation in the x series due to modelled forcings will not be present under H0 and therefore does not affect the validity of the test. On the contrary, we can expect that adequate forcing effects in the simulations will covary with the actual variation in the z sequence and therefore contribute to a significant test outcome.
With a number of grid boxes (seasons), we assume, as for the test statistic T, that we form Σ_j c_j R_j for some suitable coefficients c_j. To this end, we need the variance of Σ_j c_j R_j. The variance of the local statistic R_j was given above, but we will also need the covariance between two such statistics. Given z, the covariance between R_{j1} and R_{j2} is given by the following formula:

Cov(R_{j1}, R_{j2} | z) = (ρ(j1, j2) σ_δ(j1) σ_δ(j2)/k) (1/n²) Σ_i w_i(j1) w_i(j2) (z_i(j1) − µ_z(j1))(z_i(j2) − µ_z(j2)).

Here ρ is the coefficient of correlation between two joint x-sequences from a single simulation, to be estimated from a set of (unforced) simulations. Finally, we arrange the variances and covariances of the R_j in the covariance matrix V(R) for the corresponding vector R. Let c be the corresponding column vector with components c_j. Then the variance of Σ_j c_j R_j is obtained as the scalar c′ V(R) c. Thus, our unified correlation-based test statistic becomes

U_R = (Σ_j c_j R_j) / √(c′ V(R) c).

A significant positive value of U_R implies that the model is able to explain some of the observed temperature variation. As with U_T, we can use the normal distribution to test for significance. Thus, for example, if U_R > 1.65, then H0 can be rejected at the one-sided 5 % significance level. The larger the positive U_R values are, the stronger the correlation between the model and the observations. A high negative U_R would imply a negative correlation between model and observations. If such values are found, this may be a warning sign of possibly erroneous calculations or of problems with the climate model or proxy data. As with the T statistic, one may also consider a randomization-based significance test of R (see Appendix B).
9 Discussion - relationships between our framework and optimal fingerprinting used in detection and attribution studies

As has been pointed out, the statistical framework developed here has similarities to the ideas underlying the optimal fingerprinting method used in detection and attribution (DA) studies, which have been important for our understanding of the relative roles of man-made and natural influences on recent climate change (e.g. IDAG, 2005; Hegerl et al., 2007b). Given the importance of the optimal fingerprinting and DA framework, it seems worthwhile discussing some differences and similarities to our framework. Our framework is specifically developed to obtain an unbiased ranking of several competing forced simulations, where a variety of climate proxy data, representing different regions, covering different time periods and having different precision, are used to represent the real climate. The optimal fingerprinting method, although it has indeed been used for comparisons of simulated climate with climate reconstructed from proxy data (e.g. Hegerl et al., 2007a), has largely been designed for DA studies using spatially more complete and homogeneous gridded instrumental climate data sets (e.g. Stott et al., 2003). Initially, in our work, we considered the option of modifying an existing empirical-orthogonal-function-based quadratic-form type metric of climate model performance (Mu et al., 2004; Rowlands et al., 2012). This incorporated essential elements from optimal fingerprinting, but it was not clear how to modify it in order to reach our goals, for example allowing noise in simulations, instrumental observations and proxies - in particular, allowing proxy series of different lengths and with temporally variable statistical precision. Therefore, we preferred to start from scratch and develop a model-based statistical framework specifically designed for our purposes.
A central assumption behind the use of optimal fingerprinting in DA studies is that the observed climate record, which is influenced by multiple sources of external forcings, can be expressed as the sum of internal unforced climate variations plus a linear combination of simulated response patterns to each of the individual forcings as determined from climate model simulations. Thus, the method relies on the existence of an underlying linear relationship between the real response to a real forcing and the expected simulated response to the same type of forcing within a climate model.
Mathematically, but not statistically, the same idea underlies the specification of the simulated (x) and real (τ ) temperatures in our Statistical Model 1. Here, the real forced temperature response is represented by the term ξ and the corresponding simulated response is represented by αξ , where α can be interpreted as a linear scaling factor, in a theoretical linear regression of the climate model output on the corresponding true climate forcing effect. Thus, α plays an inverse type of role to the regression coefficients used in optimal fingerprinting to scale the simulated response patterns (signals) such that they best fit the observed climate. In both cases, i.e. in our framework and optimal fingerprinting, a perfectly simulated amplitude of the response to a particular forcing means that the scaling factor should be equal to one. A difference between the two approaches, so far, is that optimal fingerprinting allows a vector of scaling factors to deal with jointly linear effects of different forcings, whereas our framework has been deliberately restricted to a single factor and a single type of forcing (although this may consist of a combination of several individual forcings). However, as mentioned in Sect. 2, we intend to investigate the implications of an extension of our framework to allow explicit treatment of several individual forcings. Under additional assumptions, our framework can in principle be extended to include a regression-type estimation of α, which would correspond to the estimation part of optimal fingerprinting.
The first requirement in any DA study is to determine whether an observed climate change can be detected beyond the level of internal unforced variability (IDAG, 2005). This occurs when the estimated signal pattern scaling factors are significantly different from zero. In our framework, this corresponds to determining whether a forced model is able to explain any of the observed temperature variations. This is achieved by our correlation test statistic U R defined in Sect. 8. The second DA requirement is to assess the consistency between the observed and simulated response to forcing (IDAG, 2005). This is the same as evaluating the null hypothesis that all fingerprint regression coefficients are equal to one. This part of the DA framework has, so far, no counterpart in our framework. However, after a future modification to allow an estimation of α (also in vector form), our framework could handle this aspect too. Our main test statistic U T , defined in Sect. 7, has no direct counterpart in optimal fingerprinting, but plays a similar role as a performance metric, namely as the "cost function" defined in Eq. (1) of Mu et al. (2004), or the goodness-of-fit statistic used in Rowlands et al. (2012). Our U T , however, is not merely a distance metric, but is in fact a test statistic derived from a set of distance metrics calculated for different regions and seasons.

Both our framework and optimal fingerprinting require some assumptions about the spatiotemporal character of the unforced internal climate variability. Optimal fingerprinting typically relies on a spatiotemporal covariance matrix estimated from long control simulations (Allen and Tett, 1999). This matrix can be very large, which complicates its inversion - something that is needed in optimal fingerprinting and may require non-trivial "pre-whitening" operations (Allen and Tett, 1999). Our framework does not require any inversion of covariance matrices and is therefore easier to use. The price paid for avoiding the need to model temporal correlation is the simplifying assumption of white noise in the specification of the temporal character of the unforced simulated climate (δ in Statistical Models 1 and 2). This limitation places a restriction on the time unit (time resolution) that can be used; namely, this unit must be long enough that the unforced simulated climate can be approximated as white noise. Optimal fingerprinting is in theory not hampered by such a restriction, but in practice it may not be possible to use a much smaller time unit, because that would significantly increase the size of the covariance matrix to be inverted. Moreover, to obtain information about the full spatiotemporal covariance on all timescales relevant for studies of the climate of an entire millennium, several very long control simulations would be needed. Thus, it seems questionable whether it is possible, in practice, to take full advantage of the inclusion of the covariance matrix in the optimal fingerprinting framework for multivariate model evaluations that consider the entire last millennium, or longer. It is also debatable whether or not the spatial correlation should enter the criterion in an automatic way (via EOFs) when evaluating climate models. It should, however, certainly enter the evaluation of the properties of the criterion.
Note also that - although it is not an explicit part of our statistical framework - we suggest the white-noise assumption for the δ of the control simulations can be satisfied to a sufficient degree (i.e. not severely violated) by a suitable choice of time unit (see Part 2, where this is investigated). Even better would be to have a physically more realistic character of δ in the model and to find the corresponding adjustments required in the theoretical results. Future work could be aimed at this problem. Concerning spatial variation, note that explicit information from control simulations is already included in our framework, for the estimation of the variances and covariances needed in the test statistics U_T and U_R. This part of our framework does not put any theoretical restriction on the character of the spatial covariances. We also remind the reader that no assumptions at all are needed for the temporal character of the real unforced temperature (η) or the forced temperature component (ξ), nor of the measurement error components θ and ε.

10 Conclusions and recommendations
We have presented statistical models for observed and latent variables that play a role when comparing forced climate model simulations with climate proxy data or instrumental records. Based on these models, a distance measure between simulations and proxies has been developed that ranks the simulations in the same order as if the distance to the true, unknown, climate were used -of course with an unavoidable stochasticity due to the noise in the data. This distance measure can be used with a set of multiple proxies that represent different regions and seasons, and includes weights that depend on the statistical precision of these proxies, which is allowed to vary in time. A significance test is then developed, to test if a forced simulation performs better (i.e. has a smaller distance to the observations) than unforced simulations. Another significance test is formulated for the correlation between a forced simulation and the proxies. Although distance measures are a standard concept, this is -to our knowledge -the first time the specific form of the distance measure and the calibration of proxies are jointly developed based directly on a statistical model for comparing simulated and observed past climate records, rather than being ad hoc.
The new framework may be used to rank a set of alternative simulations, where the models are driven with different amplitudes of past external forcings, e.g. solar and volcanic forcing. This may help to better understand how large these past forcings have been, something that is not yet fully understood, by assuming that those reconstructed forcings that provide the best fit of simulated temperatures to the observed ones are more likely to better represent the true past forcings. In our companion study (Part 2) we will investigate, in a pseudo-proxy experiment, the possibility to distinguish between multiple-forced simulations that include either a small or a large amplitude of past solar forcing.
Alternatively, our framework could be used to rank different simulations that have been driven with the same past forcing history, but where the climate models include different parametrizations of various non-resolved physical processes, i.e. a perturbed physics ensemble simulation experiment. Different parametrizations may cause the models to have different climate sensitivities, resulting in different amplitudes of the response to external forcings. The models that provide the best fit to the observations would likely include the more appropriate parametrizations.
Note that our framework is designed for application to fully coupled general circulation models (GCMs). Thus, we do not advise ranking simulations with a simple energy balance model (EBM), or an Earth system model of intermediate complexity (EMIC), against simulations with a GCM. Such rankings would be rather meaningless. For example, an EBM might provide a smaller distance to the observations than a GCM simply because its unforced variability is virtually zero. Also, note that our correlation-based test statistic should not be used to rank simulations; its sole purpose is to determine whether a forced model is able to explain (in a statistical sense) some part of the observed temperature variations. A ranking using the distance-based statistic is meaningful only when this is the case.
The framework presented here is not yet complete, and several aspects could be improved in future versions. It would be desirable to have a statistical precision measure attached to the D² differences when comparing differently forced climate models, not only, as now, when forced simulations are compared with unforced controls. The assumption that the simulated unforced variability can be modelled as white noise should be replaced by a physically more realistic representation, making it possible to work with shorter time units. Also, further modification would be required for use with climate variables other than temperature, or with a combination of several climate variables. Future developments should aim at improving these aspects, as well as at allowing estimation of the forced amplitude of simulated climate variability.