Reconstructing past climate using proxy data and a linear climate model

We examine the skill of a new approach to climate field reconstructions (CFRs) using an online paleoclimate data assimilation (PDA) method. Several recent studies have forgone climate model forecasts during assimilation because of the computational expense of running coupled global climate models (CGCMs) and the relatively low skill of these forecasts on longer timescales. Here we greatly reduce the computational cost by employing an empirical forecast model (a linear inverse model; LIM), which has been shown to have forecast skill comparable to CGCMs. We reconstruct annual-average 2 m air temperature over the instrumental period (1850-2000) using proxy records from the PAGES 2k Consortium phase 1 database; proxy system models for estimating proxy observations are calibrated on the GISTEMP surface temperature analysis. We compare results for LIMs calibrated on observational (Berkeley Earth), reanalysis (20th Century Reanalysis), and CMIP5 climate model (CCSM4 and MPI) data relative to a control offline reconstruction method. In general, we find that using LIM forecasts for online PDA increases reconstruction agreement with the instrumental record for both spatial fields and global mean temperature (GMT). Specifically, the coefficient of efficiency (CE) skill metric for detrended GMT increases by an average of 57% over the offline benchmark. LIM experiments display a common pattern of skill improvement in the spatial fields over Northern Hemisphere land areas and in the high-latitude North Atlantic to Barents Sea corridor. Experiments with non-CGCM-calibrated LIMs reveal region-specific reductions in spatial skill compared to the offline control, likely due to aspects of the LIM calibration process. Overall, the CGCM-calibrated LIMs perform best when considering both spatial fields and GMT. A comparison with the persistence forecast experiment suggests that improvements are associated with the dynamical evolution of the state, and not simply the persistence of temperature anomalies.

In this work, we propose a computationally efficient "online" data assimilation approach for use in paleoclimate field reconstructions. The primary goal is to investigate whether the addition of dynamical constraints from a forecast can increase reconstruction skill relative to the offline EnKF method. We perform a series of reconstruction experiments using annual forecasts from a cost-efficient LIM. Global-average and spatial results of the online reconstructions are compared to both a persistence forecast method and the offline method of Hakim et al. (2016). In Sect. 2 we discuss the basics of the EnKF method and define the use of LIM forecasts in reconstruction experiments. Section 3 details the datasets used and the general experimental configuration. Section 4 discusses and compares results between the online and offline reconstructions, followed by conclusions in Sect. 5.

Online PDA
The Last Millennium Reanalysis (LMR) framework (Hakim et al., 2016) provides a setting for running many computationally efficient realizations of an offline climate reconstruction. Here we use it as the basis for our implementation and investigation of online PDA. Central to the LMR framework is the ensemble Kalman filter (EnKF; Kalnay, 2003). The EnKF update equation describes the calculation of a posterior (analysis) state vector x_a through the optimal update of a prior (background) state vector x_b using proxy information,

x_a = x_b + K [y - H(x_b)],   (1)

where y is the vector of proxy observations, H is the observation operator mapping the state to proxy estimates, and K is the Kalman gain,

K = B H^T (H B H^T + R)^(-1),   (2)

with B the prior error covariance and R the observation error covariance. In the online method, the prior for a given year is a forecast initialized from the previous year's analysis,

x_b(t + 1) = G_1 x_a(t).   (3)

The term G_1 is a mapping operator calculated from the calibration of a LIM, which maps the current state to a forecast state one year later. Details of the EnKF reconstruction algorithm can be found in Appendix A. The formulation of the LIM used here is described in the following section.
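The EnKF update described above can be sketched in a few lines of numpy for a single scalar proxy observation. This is an illustrative, generic implementation (using the square-root perturbation treatment of Whitaker and Hamill, 2002); the function name and array shapes are our own, not taken from the LMR codebase:

```python
import numpy as np

def enkf_update(xb, ye, y, r):
    """One EnKF update for a single scalar observation.

    xb : (m, n) prior ensemble (m state variables, n members)
    ye : (n,)   ensemble estimate of the observation, H(xb)
    y  : scalar observed (proxy-derived) value
    r  : scalar observation error variance
    """
    n = xb.shape[1]
    xb_mean = xb.mean(axis=1, keepdims=True)
    xb_pert = xb - xb_mean
    ye_mean = ye.mean()
    ye_pert = ye - ye_mean

    cov_xy = xb_pert @ ye_pert / (n - 1)   # cov(state, obs estimate)
    var_y = ye_pert @ ye_pert / (n - 1)    # var(obs estimate)
    k = cov_xy / (var_y + r)               # Kalman gain

    # Posterior mean, and perturbations damped by the square-root factor
    xa_mean = xb_mean[:, 0] + k * (y - ye_mean)
    alpha = 1.0 / (1.0 + np.sqrt(r / (var_y + r)))
    xa_pert = xb_pert - np.outer(alpha * k, ye_pert)
    return xa_mean[:, None] + xa_pert
```

With a tight observation, the posterior ensemble mean moves toward the observed value and the ensemble spread contracts, as expected from the update equation.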

Linear inverse model formulation
The linear inverse model (LIM; e.g. Penland and Sardeshmukh, 1995) used in this study closely follows the implementation described in Newman (2013). The basic equation describes a linearized dynamical system as the tendency of an anomaly state vector x, given by a linear dynamical operator L (linearized about a mean state) plus random white noise ξ:

dx/dt = L x + ξ.   (4)

The dynamical operator L is assumed to be constant in time. Integrating Eq. (4) over one year yields a mapping of the state x at time t (in years) to the state at time t + 1,

x(t + 1) = G_1 x(t) + σ(t),   (5)

where G_1 is equivalent to e^L and σ(t) is the integrated noise. As in Newman (2013), we choose to empirically estimate G_1 rather than L, due to sampling deficiencies of a few highly damped eigenmodes of L on an annual timescale. See Appendix B for a summary of the G_1 calculation.
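The empirical estimation of G_1 from calibration data follows the standard lag-covariance construction, G_1 = C(τ) C(0)^(-1). A minimal numpy sketch (generic; in practice the state would first be truncated to EOF coefficients, and the function name is ours):

```python
import numpy as np

def estimate_g1(x, lag=1):
    """Empirically estimate the lag-`lag` propagator G1 from data.

    x : (T, m) array of annual anomaly states (e.g. EOF coefficients).
    Returns the least-squares estimate G1 = C(lag) @ inv(C(0)).
    """
    x0 = x[:-lag]                 # states at time t
    x1 = x[lag:]                  # states at time t + lag
    c0 = x0.T @ x0 / len(x0)      # contemporaneous covariance C(0)
    ct = x1.T @ x0 / len(x0)      # lag covariance C(lag)
    return ct @ np.linalg.inv(c0)
```

Applied to a long series generated by a known stable propagator, the estimate converges to that propagator, which is the property the reconstruction relies on.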
While the simplicity of a LIM makes it well suited to the current application, it also has issues to be considered. First, LIM forecasts are performed in EOF space, based on patterns derived from the calibration data. As such, the LIM assumes stationarity of the EOFs and the encoded dynamics over the entire reconstruction period. Furthermore, projecting data into EOF space truncates spatial anomaly information and reduces ensemble variance. The next section discusses how we handle variance reductions when using a LIM in the LMR framework.

In any ensemble forecast setting, a basic assumption is that the sample of ensemble members gives a good approximation to the statistics of the full system (Murphy, 1988). Sampling error often results in too-small variance, which can cause "filter divergence", in which observational information is underweighted relative to the forecast prior and the ensemble variance collapses toward zero. The online PDA technique presented here is especially vulnerable to filter divergence because all eigenmodes are damped (the eigenvalues of L have negative real parts). Moreover, the conversion of the analysis (x_a) into EOF space at each timestep removes any spatial information that does not project onto the retained modes of a given LIM. Consequently, LIM forecasts lose ensemble variance over time.

Ensemble calibration
There are a variety of well-tested methods available to address information loss in the forecast ensemble. Here we use an adaptation of the hybrid ensemble Kalman filter-3D variational scheme (Hamill and Snyder, 2000) to prevent filter divergence and to facilitate comparison with the offline PDA technique. This technique handles the loss of ensemble variance in the forecast ensemble (x_b^f) by blending it with a static source (x_b^s), which is the same climatological prior that would be used independently for each year in the offline method. As a result, the update equations use a blended prior state x̃_b^f,

x̃_b^f = a x_b^f + (1 - a) x_b^s,   (6)

and a blended Kalman gain K̃ computed from the correspondingly blended prior covariance,

K̃ = B̃ H^T (H B̃ H^T + R)^(-1),  where  B̃ = a B_f + (1 - a) B_s.   (7)

Appendix A provides details on how this blending is incorporated into the reconstruction algorithm.
In these hybrid DA equations, the parameter a controls the relative weighting of the static and forecast information sources. When a = 0.0, reconstructions are identical to the offline case, wherein the prior x̃_b^f is reset to the static prior every year with no blending. In the opposite case, a = 1.0, only forecast information is used, with no contribution from static information.
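One simple way to realize this blending with equal-size ensembles is to mix the ensemble means with weight a and scale the perturbations by square-root weights, so that (for uncorrelated forecast and static perturbations) the implied covariance is a*P_f + (1-a)*P_s. This is an illustrative sketch of the weighting idea, not the paper's exact implementation:

```python
import numpy as np

def blend_prior(xf, xs, a):
    """Blend forecast and static prior ensembles with weight `a` on the forecast.

    xf, xs : (m, n) forecast and static prior ensembles.
    """
    mf = xf.mean(axis=1, keepdims=True)
    ms = xs.mean(axis=1, keepdims=True)
    pf = xf - mf
    ps = xs - ms
    mean_b = a * mf + (1.0 - a) * ms
    # sqrt weights: if pf and ps are uncorrelated, the blended
    # covariance is a*Pf + (1-a)*Ps (the Hamill-Snyder hybrid idea).
    pert_b = np.sqrt(a) * pf + np.sqrt(1.0 - a) * ps
    return mean_b + pert_b
```

The limiting cases behave as described in the text: a = 0 recovers the static (offline) prior exactly, and a = 1 recovers the pure forecast ensemble.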

Data and experimental configuration
The relative forecast skill of a LIM depends on the data used to empirically derive the mapping term G_1. For this reason, we explore LIMs calibrated on four different data sets. CGCM calibration data are taken from two last-millennium climate simulations in the Coupled Model Intercomparison Project phase 5 (CMIP5; Taylor et al., 2012): the Community Climate System Model v4 (CCSM4; Landrum et al., 2013) and the Max Planck Institute Earth System Model paleo-mode (MPI). These simulations cover a 1000-year pre-industrial period (850-1850 C.E.) and include volcanic aerosol forcing, greenhouse gases, solar variability, and human-related land cover changes. The 20th Century Reanalysis (20CR; Compo et al., 2011), a DA synthesis of observations and a weather forecast model, provides over 150 years of reanalysis data spanning the instrumental record (1850-2012). Finally, we use the Berkeley Earth surface temperature dataset (BE; Rohde et al., 2013) as an observational calibration; BE provides a 60-year sample with nearly complete global coverage. The LIM calibration datasets used here thus span linear modes of predictability derived from model space to those of observations.

The basic configuration for all experiments, including the offline control, involves a choice of data to sample as the prior, an instrumental data source to calibrate proxy observation models, and a proxy record dataset. For the prior, we use annually averaged 2 m air temperature anomalies from the CCSM4 last-millennium simulation. The linear observation models for proxy data are calibrated against the NASA Goddard Institute for Space Studies surface temperature analysis (GISTEMP; Hansen et al., 2010). All experiments use annually resolved proxy records from the PAGES 2k Consortium (2013) database. These proxies have been ascertained to covary regionally with temperature and include tree rings, ice cores, corals, sediment cores, and speleothems.
Only proxies with a minimum of 10 years overlapping the observation model calibration data, and a minimum calibration-fit correlation of 0.2, are used. The correlation threshold is not strictly necessary; Hakim et al. (2016) found that it did not quantitatively affect the reconstruction results. Here, restricting the proxies to those with more information helps reduce computational costs, allowing a larger number of reconstruction experiments.
We reconstruct annual-mean 2 m air temperature anomalies for the period 1850-2000 C.E., as in Hakim et al. (2016). Using the four LIM calibrations, we search the parameter space 0 ≤ a ≤ 1 for an optimal weighting between static and forecast information sources in our hybrid PDA framework. Additionally, we perform a persistence experiment for comparison against LIM-based performance, in which the posterior for year n is used as the prior for year n + 1. The persistence forecast uses the same hybrid PDA blending scheme as the LIM forecast experiments to mitigate reductions in ensemble variance from the assimilation process. We account for sensitivity to the proxy data used in a CFR through random resampling of the available proxy data and the static prior ensemble. In a single realization, a random sample of 75% of the usable proxy records, and a 100-member sample of anomaly states from the prior source, are selected. A total of 100 realizations are performed for each LIM calibration and blending coefficient. To make the realizations consistent between experiments using different blending coefficients, we ensure the same sequences of random samples are drawn by seeding the random number generator identically for each a-value. In total, this gives 10^4 reconstructions of the climate state for each experiment, which are then averaged to give the final analysis.
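The seeded resampling strategy described above can be sketched as follows. This is a generic illustration of reproducible sampling (the function name, argument names, and defaults are ours, not the LMR configuration):

```python
import numpy as np

def draw_realization(n_proxies, n_prior, seed, frac=0.75, ens_size=100):
    """Reproducibly sample proxies and prior members for one realization.

    Using the same seed across experiments with different blending
    coefficients guarantees identical sequences of random samples.
    """
    rng = np.random.default_rng(seed)
    proxy_idx = rng.choice(n_proxies, size=int(frac * n_proxies), replace=False)
    prior_idx = rng.choice(n_prior, size=ens_size, replace=False)
    return proxy_idx, prior_idx
```

Calling the function twice with the same seed returns identical proxy and prior samples, which is what makes the realizations comparable across blending coefficients.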

The primary skill metrics are correlation and the coefficient of efficiency (CE; Eqs. 8 and 9; Nash and Sutcliffe, 1970). Correlation gives an overall sense of signal timing (phase), while CE is a stricter metric that is sensitive to signal timing, amplitude, and bias. Using these metrics, we compare the reconstructed ensemble-mean values x (the average over all ensemble members and all realizations) against GISTEMP verification values v:

r = Σ_{t=1..τ} (x_t - x̄)(v_t - v̄) / (τ σ_x σ_v),   (8)

CE = 1 - Σ_{t=1..τ} (v_t - x_t)^2 / Σ_{t=1..τ} (v_t - v̄)^2.   (9)

The value τ is the number of verification times available (here the period 1880-2000), an overbar (e.g. v̄) denotes a temporal average, and σ_x and σ_v are the standard deviations of the respective time series. Skill scores are compared for the reconstructed global mean temperature (GMT) and spatial grid points.
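The two metrics can be written directly from their definitions. A minimal numpy sketch (generic implementations of Pearson correlation and the Nash-Sutcliffe CE; function names are ours):

```python
import numpy as np

def correlation(x, v):
    """Pearson correlation between reconstruction x and verification v."""
    xa, va = x - x.mean(), v - v.mean()
    return (xa @ va) / (np.sqrt(xa @ xa) * np.sqrt(va @ va))

def coefficient_of_efficiency(x, v):
    """Nash-Sutcliffe CE: 1 minus error variance over verification variance."""
    return 1.0 - np.sum((v - x) ** 2) / np.sum((v - v.mean()) ** 2)
```

This makes the difference between the metrics concrete: adding a constant bias to a reconstruction leaves the correlation at 1 but lowers the CE, since CE penalizes amplitude and bias errors while correlation only measures phase.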
We also use the continuous ranked probability score (CRPS; Gneiting and Raftery, 2007) as a complement to the CE skill metric. The CRPS is a "proper" scoring rule, which prevents manipulations of the data from overstating reconstruction skill; the measure reflects both the mean absolute error and the narrowness of the ensemble distribution.
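The ensemble form of the CRPS can be sketched in a few lines of numpy. This is a generic implementation of the Gneiting and Raftery ensemble expression (mean absolute error minus half the mean pairwise ensemble spread), not the paper's code:

```python
import numpy as np

def crps_ensemble(ens, v):
    """Ensemble CRPS averaged over time.

    ens : (K, T) ensemble of reconstructed values (K members, T times)
    v   : (T,)   verification series
    """
    mae = np.abs(ens - v).mean(axis=0)                        # E|X - v| per time
    spread = np.abs(ens[:, None, :] - ens[None, :, :]).mean(axis=(0, 1))
    return (mae - 0.5 * spread).mean()                        # lower is better
```

A perfect, zero-spread ensemble scores 0, and a zero-spread ensemble with a constant 1 K bias scores 1, illustrating that the score rewards both accuracy and appropriately narrow ensembles.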
We use the CRPS as defined in Tipton et al. (2016) to calculate the CRPS of the reconstructed GMT for each realization,

CRPS = (1/τ) Σ_{t=1..τ} [ (1/K) Σ_{k=1..K} |x_{k,t} - v_t| - (1/(2K^2)) Σ_{j=1..K} Σ_{k=1..K} |x_{j,t} - x_{k,t}| ],   (10)

where k denotes a single member of a K-member ensemble. We take the score as the average CRPS over all realizations and use a Kolmogorov-Smirnov test on the resulting distribution to determine whether it is significantly different from the offline case at 95% confidence.

Results and discussion

The CCSM4 and 20CR LIMs display the best overall CE performance, with a 9% improvement over the offline method. These two experiments also display a 2% increase in correlation, though their correlations remain slightly smaller than that of the persistence forecast experiment (Table 1). Correlations for the 20CR, CCSM4, and MPI experiments are significantly different (with 95% confidence) from the offline reference, based on bootstrap estimates of the skill metric error bounds. The CE changes for all LIM experiments are also significant.
CRPS values for the GMT results (Fig. 2) are generally consistent with those for the CE skill metric (note that the best CRPS is the minimum value rather than the maximum). Specifically, GMT verification with CRPS shows that all LIM forecasting experiments outperform the offline method, with the CCSM4 and 20CR LIMs having the best performance (18% better than the offline case). All LIM experiments' best CRPS scores are significantly better than the offline case at 95% confidence. There are slight differences in results for the MPI and BE LIM cases when comparing CE and CRPS: the blending coefficient achieving the best score shifts to the next-highest a-value in both cases, and the BE LIM outperforms the MPI LIM when considering CRPS (Table 1).

Both the CE and CRPS measures for different blending coefficients are affected by the degree of fit to the warming trend in the GISTEMP reference. The trends for all experiments are shown in Fig. 3. The trend of the offline case (a = 0) is 0.62 K/100 yr, about 0.07 K/100 yr above the GISTEMP trend. However, for the MPI and CCSM4 LIM experiments, as well as the persistence forecast experiment, the reconstructed trend increases as the blending parameter a increases. This increase in trend away from the GISTEMP trend for the MPI and CCSM4 experiments is reflected in the lowered CE (Fig. 1) and CRPS (Fig. 2) for a-values from approximately 0.0 to 0.6. The reconstructed trend from the two CGCM-based LIM experiments begins to decrease around a = 0.6, where CE also shows a significant increase toward maximum values. The persistence forecast trend has the largest disagreement, increasing to approximately 0.72 K/100 yr at a = 0.9, which results in the CE and CRPS never surpassing the offline benchmark in this case. The 20CR LIM increases the reconstructed trend only slightly over the GISTEMP trend for middle a-values. The BE experiment has a decreasing trend with increasing a, dropping to a very low trend of 0.38 K/100 yr at a = 0.9.
Interestingly, though the trends for the MPI, CCSM4, and 20CR experiments are below the GISTEMP trend at a = 0.95, their skill still outperforms the offline case in all three metrics. Despite the mismatch in the overall trend, these online forecasting methods still produce better matches of the phase and amplitude of GMT variability for the reconstructed anomalies compared to the offline case.

Clim. Past Discuss., doi:10.5194/cp-2016-129, 2016. Manuscript under review for journal Clim. Past. Published: 6 December 2016. © Author(s) 2016. CC-BY 3.0 License.
The trend results also illustrate the relative proxy data utilization among the different experiments, given that every experiment uses the same prescribed list of seeds to generate proxy record samples. The persistence forecast displays an interesting disparity between the skill metrics: overall, it performs best in correlation but worst in CE and CRPS. Having the largest reconstructed trend suggests that the persistence case gives the highest weighting to proxy data. With a persistence forecast there is no damping of reconstructed spatial anomalies and no truncation of the ensemble variance from projection into EOF space. The resulting higher proxy weighting may explain why the persistence-case correlation is better than that of the other forecasting methods. The linear observation models used to estimate observations in each case are calibrated against GISTEMP. Proxies with a better calibration fit (higher correlation) with GISTEMP have less error variance and therefore more influence on the posterior analysis. The persistence case allows more information from the influential (well-correlated) proxies into the analysis because its ensemble variance is larger. However, from the CE and CRPS values, which are sensitive to more than signal phase matching, it is clear that the general trend mismatch degrades the quality of the persistence forecast reconstruction relative to the offline benchmark.
Removing the linear trend from each case allows an examination of how well the reconstructions capture variability not associated with the warming trend (i.e. interannual and decadal variability; Fig. 4). Generally, the performance increases over the offline case are much larger for the detrended data than for the full time series. Compared to verification with the full time series, the correlation of the detrended offline case drops from 0.9 to 0.67, and the CE drops from 0.77 to 0.29 (Table 2). With respect to CE, all experiments (including the persistence method) improve upon the offline benchmark.
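The detrending used for this comparison is a least-squares linear fit removal. A minimal sketch (a generic implementation; the paper does not specify its exact routine):

```python
import numpy as np

def detrend(series):
    """Remove the least-squares linear trend from a 1-D time series."""
    t = np.arange(len(series), dtype=float)
    slope, intercept = np.polyfit(t, series, 1)   # degree-1 fit
    return series - (slope * t + intercept)
```

Applied to a purely linear series, the residual is (numerically) zero; applied to a reconstruction, it isolates interannual and decadal variability from the warming trend.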
The 20CR LIM achieves the best improvement over the offline case (a 72% increase), while the persistence case shows the least improvement (a 35% increase). Except in the BE LIM experiment, the detrended correlation metrics again increase slightly over the offline case. The BE LIM hovers around the benchmark correlation for 0.0 < a < 0.7 and then drops below it. A CE improvement with no change in correlation implies that the BE LIM improves the detrended anomaly amplitudes and bias, but not the signal timing, compared to the offline case.
The CE and correlation skill metrics of the offline experiment both decrease when calculated on the detrended GMT. In contrast, the CRPS improves by 7% (decreasing from 12.5 to 11.6). CRPS rewards a reduction in mean absolute error and the 'narrowness' of the forecast ensemble, whereas CE and correlation depend more on the variance properties of the reconstructed time series. In removing the linear warming trend, we remove a large fraction of the time series variance and subsequently lose the associated skill in CE and correlation. In the case of CRPS, the linear trend is only a source of mean error when it does not closely match the reference trend. Since detrending does not affect the ensemble spread, the mean errors decrease and the CRPS metric improves. Figure 5 shows CRPS for all blending coefficients with detrended data. The behavior is quite similar to the full-GMT CRPS, but, as the detrended CE also reflects, even the persistence forecast improves over the offline method. One aspect that stands out in the detrended CRPS is that the persistence forecast achieves its best value at a = 1.0. A cursory examination of the detrended persistence GMT time series against the detrended GISTEMP GMT time series (not shown) reveals that it captures decadal variability over the instrumental record but virtually none of the interannual variability; i.e., the a = 1.0 persistence reconstruction gives a smoothed representation of the GMT.

This again highlights a difference between the CE and CRPS metrics. The CRPS metric, which generalizes to the mean absolute error of the ensemble summed over time, does not penalize the smoothed GMT signal approximately bisecting the interannual signal at a = 1.0. The CE metric, which sums the squared errors of the ensemble mean and normalizes by the climatological variance, does penalize this behavior.

Verification of spatial fields

Here we examine the skill of the spatial fields against the offline case using correlation and CE; CRPS is omitted because the full spatial field ensembles are too large to store. The offline case shows positive skill over most of the globe except in the Southern Hemisphere oceans and in the high-latitude North Atlantic to Barents Sea corridor (Fig. 6). All LIM-forecasting cases show improvements in CE, most notably in the same North Atlantic to Barents Sea area and across northern Europe into Asia; there are also smaller skill increases across western North America. The CCSM4, MPI, and BE LIMs generally show large CE increases in the high-latitude southern ocean. In contrast, the reconstruction with the 20CR LIM does not improve the southern ocean at all and has large deficiencies over many ocean areas. The persistence case generally shows decreases in CE across large areas of the globe. In terms of the global mean CE of the spatial grid, the CCSM4 LIM performs best, increasing the global mean CE by 0.09, followed by the MPI LIM with an improvement of 0.06. The 20CR and persistence cases show decreases in average spatial skill across the grid, with the 20CR being the worst. Changes in spatial correlation are comparatively small, indicating that improvements are not related to signal phasing; however, some of the large decreases in CE for the 20CR, BE, and persistence experiments do coincide with areas of correlation decreases.
In the spatial results, there is a clear distinction between LIMs calibrated on data from the shorter instrumental era and those calibrated on millennium-scale climate simulation data. Compared to the offline spatial skill, large areas of skill degradation are apparent for both the 20CR and BE LIM reconstructions. It is surprising that the 20CR LIM has the worst spatial skill given that it has the best GMT time series CE skill. The leading EOF for each LIM calibration reveals differences in the spatial structure that forms part of the forecast basis (Fig. 7). The leading EOF of the 20CR experiment lacks the ENSO/PDO-like pattern of the other LIMs and instead emphasizes variability structures rooted in the southern ocean, where relatively few pressure observations are available for assimilation; many of the large decreases in CE occur in these same regions. The BE LIM displays a leading mode more similar to the CGCM-based LIMs, but still has skill problems over large ocean areas. One reason the skill of the instrumental-era LIMs may differ is that they are based on shorter records that coincide with variability related to anthropogenic forcing. Separating the global warming trend from the LIM by means of linear detrending is bound to leave residual signals that affect LIM forecast modes. A LIM based on a shorter record may not have a large enough sample to properly characterize representative modes of variability over a longer time span. While the BE LIM is based on only 60 years of data, it produces much less spatial skill degradation than the 150 years of 20CR data. This suggests there may be inconsistencies in variability caused by observational coverage and the data assimilation method used in creating the 20CR dataset.

Conclusions
We have outlined and tested a new method for performing online paleoclimate data assimilation (PDA) for climate field reconstructions (CFRs) using linear inverse models (LIMs). We tested four different LIMs empirically derived from surface temperature data from the following data sets: Berkeley Earth (BE), the 20th Century Reanalysis (20CR), and two last-millennium climate simulations (CCSM4 and MPI) from the Coupled Model Intercomparison Project phase 5 (CMIP5). We also performed a persistence forecast experiment for comparison. In general, we find that LIM-enabled online assimilation improves upon the offline results for both the global average and the spatial field of 2 m air temperature.
With respect to GMT verification, LIM experiments show large improvements in skill metrics calculated on the detrended GMT. The coefficient of efficiency (CE) values show an average increase of around 57%, while correlations increase by around 4%.

The continuous ranked probability score (CRPS) metrics improve by an average of 15% across all LIM experiments. Skill metrics tend to maximize for blending coefficients with a higher weighting of flow-dependent forecast information (0.7 < a < 0.95). The spatial skill reveals that the addition of LIM forecasting provides information in regions where the offline method performs poorly, including Northern Hemisphere land areas and the North Atlantic to Barents Sea corridor. The two LIMs calibrated on instrumental-era data (20CR and BE) display large ocean regions where the skill degrades compared to the offline benchmark. The CGCM-based LIMs have the best overall performance, with the CCSM4 LIM slightly outperforming the MPI LIM. The persistence forecast fails to improve the more stringent GMT skill metrics (CE and CRPS) as well as the general spatial skill, though it does well in GMT time series correlation. This suggests that the improvements of online reconstructions using a LIM are due to forecast information and not simply the addition of temporal persistence.
Though we are reconstructing instrumental-era surface temperatures, it is an interesting result that CGCM-calibrated LIMs based on last-millennium (850-1850) simulations have the best overall performance. This could mean that having long samples of variability that do not contend with major sources of variability related to anthropogenic forcing is beneficial for reconstruction purposes. The LIMs used in these experiments were all calibrated on data with the least-squares linear trend removed. For LIMs based on observational data sources to achieve similar results, it may be necessary to employ a more sophisticated method of filtering out the global warming signal. However, one benefit of using CGCM-based LIMs is that they enable forecasts for a much larger set of climate-related quantities than are available from observations alone.
In this work, we have shown that we can improve both GMT and spatial field skill over the offline EnKF PDA method through the inclusion of a simple forecast model. A previous comparison of offline and online PDA using a CGCM as a forecast model found no discernible difference in reconstruction skill (Matsikaris et al., 2015), and earlier studies of the EnKF PDA method forwent the use of forward models, citing insufficient model skill to justify the expense (Bhend et al., 2012; Steiger et al., 2014). Our results show that an online method can increase reconstruction fidelity, and, more importantly, that it can do so using an empirical forecasting method that is nearly as computationally efficient as the offline approach. As such, this method provides a useful foundation for further investigation of incorporating the dynamical constraints of a forecast model into climate field reconstructions.
Appendix A: Online assimilation algorithm

This section details the data assimilation equations used to perform paleoclimate field reconstructions and the algorithm steps for a single realization of an online climate reconstruction.

A1 Ensemble square root filter (EnSRF)
The EnSRF approach (Whitaker and Hamill, 2002) uses an ensemble sampling approach to solve the Kalman filter equations by separating the ensemble mean (z̄_b) and the ensemble perturbations about that mean (z'_b = z_b - z̄_b). Note that z_b represents an augmented state vector, z_b = [x_b; y_e], combining the prior state (x_b) and the estimated observations (y_e). The EnSRF method allows the serial assimilation of proxy observations using, for each proxy i = 1, ..., p, the update equations

z̄_a = z̄_b + K (y_i - ȳ_{e_i}),   (A1)
z'_a = z'_b - K̃ y'_{e_i}.   (A2)

The mean state z̄_b is an m × 1 column vector, the ensemble perturbation matrix z'_b is m × n, the mean estimated observation for proxy i, ȳ_{e_i}, is a scalar, and the perturbations about the mean estimated observation, y'_{e_i}, form a 1 × n row vector. When observations are serially assimilated, the Kalman gain (Eq. 2) simplifies to

K = [z'_b (y'_{e_i})^T / (n - 1)] [y'_{e_i} (y'_{e_i})^T / (n - 1) + σ_i^2]^(-1),   (A3)

where σ_i^2 is the observational error variance for proxy y_i and the denominator is now a scalar. The perturbation gain K̃ is given by

K̃ = K [1 + ( σ_i^2 / (y'_{e_i} (y'_{e_i})^T / (n - 1) + σ_i^2) )^(1/2)]^(-1).   (A4)

Finally, to adapt the hybrid assimilation scheme to the EnSRF method, we incorporate the data source blending as shown in Eq. (7). At this point, the blended state (ẑ_b^f) contains both flow-dependent and static (climatological) information. After incorporation, the Kalman gain (Eq. A3) is computed from the correspondingly blended perturbation covariances, with weight a on the forecast perturbations and (1 - a) on the static perturbations; the perturbation Kalman gain (Eq. A4) and the mean and perturbation updates of Eqs. (A1) and (A2) are modified in the same way using the blended quantities.

A2 Assimilation algorithm

1. Choose a static ensemble prior (x_b^s) of n members and a group of p proxies (y) to assimilate.
2. Calibrate observation models for each proxy record by applying a univariate linear fit against co-located instrumental data.

vi. Reassemble z_a and z_a^s by adding the ensemble mean back into the perturbations; these are used for the next proxy assimilated as ẑ_b^f and z_b^s.

(c) After all proxies have been assimilated, extract the climate field x_a from the augmented analysis state z_a.

Starred points indicate a statistically significant (95% confidence) difference between the offline benchmark and the online experiment.
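The serial loop described by Eqs. (A1)-(A4) can be sketched compactly in numpy. This is a generic implementation of serial EnSRF assimilation on the augmented state (variable names and structure are ours, not the LMR code, and the hybrid blending step is omitted for brevity):

```python
import numpy as np

def serial_ensrf(xb, Ye, y_obs, r_obs):
    """Serially assimilate p proxies with the EnSRF.

    xb    : (m, n) prior ensemble
    Ye    : (p, n) ensemble-estimated observations for each proxy
    y_obs : (p,)   observed proxy values
    r_obs : (p,)   observation error variances
    """
    m = xb.shape[0]
    z = np.vstack([xb, Ye])                 # augmented state [x_b; y_e]
    n = z.shape[1]
    for i in range(len(y_obs)):
        zm = z.mean(axis=1, keepdims=True)
        zp = z - zm
        yp = zp[m + i]                      # perturbations of proxy estimate i
        ym = zm[m + i, 0]                   # mean proxy estimate (scalar)
        var_y = yp @ yp / (n - 1)
        k = (zp @ yp) / (n - 1) / (var_y + r_obs[i])              # Eq. (A3)
        kt = k / (1.0 + np.sqrt(r_obs[i] / (var_y + r_obs[i])))   # Eq. (A4)
        zm_a = zm[:, 0] + k * (y_obs[i] - ym)                     # Eq. (A1)
        zp = zp - np.outer(kt, yp)                                # Eq. (A2)
        z = zm_a[:, None] + zp              # reassemble mean + perturbations
    return z[:m]                            # extract climate field x_a
```

Because the proxy estimates live inside the augmented state, each update also revises the remaining proxy estimates, which is what makes serial assimilation equivalent to a simultaneous update for uncorrelated observation errors.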

Figure 6. Spatial maps displaying the difference in the coefficient of efficiency (CE) from the offline case. Difference maps are displayed for each forecasting experiment using the blending coefficient that achieves the highest full-GMT CE skill, except for the persistence case, where a = 0.9 was used (see Table 1 for the corresponding blending coefficients). The reference CE of the offline case is shown in the upper left and uses the same color scale as the difference maps. Area-weighted global average differences are given in the title of each panel. All global mean differences except in the BE LIM case are significantly different from the offline benchmark with 95% confidence.

Table 1. Best values of the coefficient of efficiency (CE), correlation (r), and continuous ranked probability score (CRPS) verification metrics for global mean 2 m air temperature. For each experiment, the best value is given with the corresponding blending coefficient (a) that achieved it and the percentage change compared to the offline case. A (*) indicates the experiment that achieved the best performance for a given metric. Offline verification metrics are given for reference.