Reconstructing paleoclimate fields using online data assimilation with a linear inverse model

We examine the skill of a new approach to climate field reconstructions (CFRs) using an online paleoclimate data assimilation (PDA) method. Several recent studies have foregone climate model forecasts during assimilation due to the computational expense of running coupled global climate models (CGCMs) and the relatively low skill of these forecasts on longer timescales. Here we greatly diminish the computational cost by employing an empirical forecast model (a linear inverse model, LIM), which has been shown to have skill comparable to CGCMs for forecasting annual-to-decadal surface temperature anomalies. We reconstruct annual-average 2 m air temperature over the instrumental period (1850-2000) using proxy records from the PAGES 2k Consortium phase 1 database; proxy models for estimating proxy observations are calibrated on GISTEMP surface temperature analyses. We compare results for LIMs calibrated on observational (Berkeley Earth), reanalysis (20th Century Reanalysis), and CMIP5 climate model (CCSM4 and MPI) data relative to a control offline reconstruction method. Generally, we find that the use of LIM forecasts for online PDA increases reconstruction agreement with the instrumental record for both spatial fields and global mean temperature (GMT). Specifically, the coefficient of efficiency (CE) skill metric for detrended GMT increases by an average of 57 % over the offline benchmark. LIM experiments display a common pattern of skill improvement in the spatial fields over Northern Hemisphere land areas and in the high-latitude North Atlantic - Barents Sea corridor. Experiments for non-CGCM-calibrated LIMs reveal region-specific reductions in spatial skill compared to the offline control, likely due to aspects of the LIM calibration process. Overall, the CGCM-calibrated LIMs have the best performance when considering both spatial fields and GMT.
A comparison with the persistence forecast experiment suggests that improvements are associated with the linear dynamical constraints of the forecast, and not simply persistence of temperature anomalies.


Introduction
Climate field reconstructions (CFRs) aim to provide essential information on climate variability beyond the instrumental record. These methods take noisy and sparse proxies (e.g., tree rings, ice cores, isotope ratio measurements) and use them to infer a spatial estimate of relevant climate variables. A common approach to CFR uses a statistical regression model calibrated on the instrumental record to project as far into the past as the data will allow (e.g., Mann et al., 1998, 2009; Smerdon et al., 2011b; see Smerdon et al., 2011a, for a discussion and comparison of methods). These techniques provide a useful estimate of past spatial patterns (Wahl and Smerdon, 2012) but also have inherent limitations. For example, regression-based CFRs assume the climate state to be a function of the proxy data, which can lead to an underestimation of past climate anomaly amplitudes (Smerdon et al., 2011b; Wahl and Smerdon, 2012). Furthermore, because regression methods produce past spatial fields through combinations of primary variability modes (i.e., empirical orthogonal functions, EOFs), the resulting field is not guaranteed to be a physically consistent solution.
An alternate method of performing CFRs, known as paleoclimate data assimilation (PDA), can circumvent some of the limitations inherent in regression-based methods. PDA broadly characterizes a set of techniques in which observational information from proxy data is combined with dynamical information from climate models. Recently, the ensemble Kalman filter (EnKF) was adapted for use with time-averaged observations like those used in CFRs (Dirren and Hakim, 2005; Huntley and Hakim, 2010). Studies using the EnKF method and idealized pseudo-proxy experiments show that it operates well under sparse data availability (Bhend et al., 2012) and outperforms modern statistical CFR methods (Steiger et al., 2014). More recently, EnKF PDA was tested with real proxy data in the Last Millennium Reanalysis project (LMR; Hakim et al., 2016) and shows promising skill in reconstructing robust spatial fields in a computationally efficient manner. Due to the expense of performing coupled global climate model (CGCM) simulations and relatively low forecast skill, the initial EnKF adaptation for PDA does not use a forecast and instead reconstructs each time period independently using climatological data. This is known as an offline approach. The EnKF method is traditionally accompanied by forward-model forecasts to translate information between analysis time periods (e.g., reanalysis products of the instrumental era). Dynamical constraints from these forecasts can increase physical consistency and reconstruction skill, provided that the model has sufficient predictability on proxy timescales (e.g., Pendergrass et al., 2012). For CFR applications, predictability on seasonal and longer timescales is required. Ocean memory can be leveraged for interannual (e.g., El Niño-Southern Oscillation, ENSO) to potentially decadal predictability (Branstator et al., 2012). However, at this timescale coupled climate models only seem to capture linearly predictable dynamics (Newman, 2013).
Online assimilation has been attempted using other PDA techniques. Crespin et al. (2009) use forecasts from an Earth system model of intermediate complexity (EMIC) in conjunction with the ensemble selection PDA method (see Goosse et al., 2006, 2010) to reconstruct surface temperatures but do not investigate a comparison with an offline method. Annan and Hargreaves (2012) perform a pseudo-proxy experiment using a weighted ensemble selection method and a persistence forecast to reconstruct surface temperatures but find no benefit compared to their offline experiments. Matsikaris et al. (2015) take a similar approach to Crespin et al. (2009) but use an ensemble of decadal forecasts from a coarse-resolution CGCM instead of an EMIC. The authors find that the use of CGCM forecasts has skill, but it is not discernibly superior to the offline method. Possible reasons for the lack of improvement include low skill for regional decadal forecasts of temperature and issues related to ocean initialization for each decadal interval.
These results suggest that neither the simple persistence forecast nor a small ensemble of decadal CGCM forecasts adds significant information to CFRs. In order to test the viability of a more traditional EnKF method, we require the ability to perform annual forecasts over longer time spans (the past millennium) and with large ensembles (~100 members). These requirements rule out the use of a CGCM. Instead, we explore a simple, empirically based forecast from a linear inverse model (LIM; Penland and Sardeshmukh, 1995).
A LIM encodes the linear dynamical properties of a system and produces forecasts that are subject to the constraints of its derived linear modes. The forecast skill of LIMs is such that they are currently used for operational ENSO forecasts (Newman et al., 2009). Moreover, recent studies show LIM skill to be comparable to that of CGCMs in annual-to-decadal hindcast experiments over the instrumental era (Newman, 2013; Huddart et al., 2016).
Here, we propose a computationally efficient online data assimilation approach for use in paleoclimate field reconstructions. The primary goal is to investigate whether the addition of dynamical constraints from a forecast in the online case can increase reconstruction skill relative to the offline EnKF method, which has no forecasting. We perform a series of reconstruction experiments using annual forecasts from a cost-efficient LIM. Global average and spatial reconstructions from the online experiments are compared to results from both a persistence forecast method and the offline method of Hakim et al. (2016). In Sect. 2 we discuss the basics of the EnKF method and define the use of LIM forecasts in reconstruction experiments. Section 3 details the datasets used and the general experimental configuration. Section 4 discusses and compares results between the online and offline reconstructions, followed by conclusions in Sect. 5.

Online PDA
The LMR framework (Hakim et al., 2016) provides a setting to run many computationally efficient realizations of an offline climate reconstruction. Here we begin with it as the basis for our implementation and investigation of online PDA. Central to the LMR framework is the use of the EnKF (Kalnay, 2003), which assumes Gaussian-distributed errors. The EnKF update equation (Eq. 1) describes the calculation of a posterior (analysis) state vector x_a through the optimal update to a prior (background) state vector x_b using proxy information,

x_a = x_b + K[y - H(x_b)].    (1)

The innovation, [y - H(x_b)], characterizes new information content as a difference between proxy observations in vector y and observations estimated from the prior by H(x_b) (hereafter denoted as y_e). H() is a potentially nonlinear operator that maps the prior state into observation space. The Kalman gain matrix K, defined by Eq. (2), spreads information into the analysis weighted by the prior covariance and the observational error covariance matrix R:

K = cov(x_b, y_e)[cov(y_e, y_e) + R]^(-1),    (2)

where cov(a, b) represents a covariance expectation. The LMR framework uses a variant of the EnKF update known as an ensemble square root filter (EnSRF; Whitaker and Hamill, 2002). This process updates the ensemble mean and the perturbations from the mean separately, allowing for the serial assimilation of proxy data and a simplification of the update calculations. Typical implementations of the EnKF method include a model forecast between analysis times. As stated earlier, the computational expense and low skill of CGCM forecasts prompted the use of the offline method, where each year is reconstructed independently without forecasting. Here, instead of using a static prior (x_b) at the beginning of each reconstruction year, the current year's posterior analysis is forecast forward by 1 year with a LIM defined by

x_(t+1) = G_1 x_t.    (3)

The term G_1 is a mapping term, calculated from the calibration of a LIM, that maps the current state to a forecasted state 1 year later. Details of the EnKF reconstruction algorithm can be found in Appendix A. The formulation of the LIM used here is described in the following section.
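As an illustration, the serial EnSRF update for a single proxy observation can be sketched as follows. This is a minimal NumPy sketch, not the LMR code itself; the function name and array layout are our own, and details such as covariance localization and the time-averaged observation operator are omitted.

```python
import numpy as np

def enrsf_update_single(Xb, ye, y, r):
    """Serial EnSRF update for one proxy observation (sketch).

    Xb : (n_state, K) prior ensemble
    ye : (K,) prior-estimated observations H(x_b), one per member
    y  : scalar proxy value; r : scalar observation error variance
    """
    K = Xb.shape[1]
    xb_mean = Xb.mean(axis=1)
    Xb_pert = Xb - xb_mean[:, None]
    ye_mean = ye.mean()
    ye_pert = ye - ye_mean

    var_ye = ye_pert @ ye_pert / (K - 1)
    cov_xy = Xb_pert @ ye_pert / (K - 1)       # cov(x_b, y_e), Eq. (2)
    kalman_gain = cov_xy / (var_ye + r)        # scalar-observation gain

    # Mean and perturbations are updated separately (Whitaker & Hamill 2002)
    xa_mean = xb_mean + kalman_gain * (y - ye_mean)
    beta = 1.0 / (1.0 + np.sqrt(r / (var_ye + r)))
    Xa_pert = Xb_pert - np.outer(beta * kalman_gain, ye_pert)
    return xa_mean[:, None] + Xa_pert
```

Serial assimilation then loops this update over all proxies for a given year, feeding each posterior back in as the next prior.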

Linear inverse model formulation
The LIM (e.g., Penland and Sardeshmukh, 1995) used in this study closely follows the implementation described in Newman (2013). The basic equation describes a linearized dynamical system as the tendency of an anomaly state vector x, given by a dynamical operator L, which is linearized about a mean state, plus random white noise ξ:

dx/dt = Lx + ξ.    (4)

The dynamical operator L is assumed to be constant in time. After integrating Eq. (4) in time, the solution is a mapping of x at time t (in years) to a state at time t + 1:

x(t + 1) = G_1 x(t),    (5)

where G_1 is equivalent to exp(L). As in Newman (2013), we choose to empirically estimate G_1 rather than L due to sampling deficiencies of a few highly damped eigenmodes of L on an annual timescale. Each LIM is constructed using an EOF basis, retaining the leading eight modes of variability. See Appendix B for a summary of the details associated with the calculation of G_1.
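The standard empirical estimate of G_1 comes from the lag-1 and lag-0 covariance matrices of the calibration data in the truncated EOF space. The sketch below assumes that approach; the function name and array shapes are illustrative, not taken from the paper's code.

```python
import numpy as np

def estimate_G1(pcs):
    """Estimate the annual propagator G_1 = C(1) @ inv(C(0)) (sketch).

    pcs : (n_years, n_modes) principal-component time series of the
          calibration data in the retained EOF basis (anomalies).
    """
    x0 = pcs[:-1].T                         # states at year t
    x1 = pcs[1:].T                          # states at year t + 1
    n = x0.shape[1]
    C0 = x0 @ x0.T / (n - 1)                # lag-0 covariance
    C1 = x1 @ x0.T / (n - 1)                # lag-1 covariance
    return C1 @ np.linalg.inv(C0)

# One-year forecast of an anomaly state x_t (Eq. 5): x_next = G1 @ x_t
```

For noise-free data generated exactly by x(t+1) = G x(t), this estimator recovers G, since C(1) = G C(0).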
While the simplicity of a LIM makes it well suited for the current application, it also has issues to be considered. First, LIM forecasts are performed using an EOF basis derived from the calibration data. The EOF basis is used to maximize the variance captured by the fewest degrees of freedom. Although the dominant modes of variability can change in time during the reconstruction period, the space spanned by the variability cannot. This means that variability falling outside the span of the EOFs will not be resolved. A second issue is that the LIMs tend to be damped (modes of L decay in time), which reduces the ensemble variance; we elaborate on this issue in the next section.

Ensemble calibration
In any ensemble forecast setting, a basic assumption is that the sample of ensemble members gives a good approximation to the statistics of the full system (Murphy, 1988). Sampling error often results in too-small variance, which can cause filter divergence, where observational information is under-weighted relative to the forecast prior and the ensemble variance collapses toward zero. The online PDA technique presented here is especially vulnerable to filter divergence because all eigenmodes of G_1 are damped (the eigenvalues of L have negative real parts). Moreover, the conversion of the analysis (x_a) into EOF space at each time step removes any spatial information that does not project onto the retained modes of a given LIM. Consequently, LIM forecasts lose ensemble variance in time.
There is a variety of well-tested methods available to address information loss in the forecast ensemble. Here we use an adaptation of the hybrid ensemble Kalman filter and 3-D variational scheme (Hamill and Snyder, 2000) to prevent filter divergence and to facilitate comparison with the offline PDA technique. This technique handles the loss of ensemble variance in the forecast ensemble (x_b^f) by blending it with a static source (x_b^s), which is the same climatological prior that is used independently for each year in the offline method. As a result, the update equations use a blended prior state x̃_b (Eq. 6) and a blended Kalman gain term K̃ (Eq. 7):

x̃_b = a x_b^f + (1 - a) x_b^s,    (6)

K̃ = [a cov(x_b^f, y_e^f) + (1 - a) cov(x_b^s, y_e^s)][a cov(y_e^f, y_e^f) + (1 - a) cov(y_e^s, y_e^s) + R]^(-1).    (7)

Appendix A provides details on how this is incorporated into the reconstruction algorithm. In these hybrid data assimilation (DA) equations, the parameter a controls the relative weighting between static and forecast information sources. When a = 0.0, reconstructions are identical to the offline case, wherein the prior x̃_b is reset to the static prior for every year with no blending. When a = 1.0, only forecast information is used, with no contribution from static information.
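One plausible way to implement the blending is sketched below: the ensemble means are blended linearly with weight a, while the perturbations are scaled by √a and √(1 - a), so that the expected blended covariance is approximately a·P_f + (1 - a)·P_s (the forecast/static cross term averages to roughly zero for independent draws). This is an illustrative approximation under those assumptions, not the paper's exact formulation.

```python
import numpy as np

def blend_prior(Xf, Xs, a):
    """Sketch of a hybrid (blended) prior ensemble.

    Xf : (n_state, K) forecast ensemble; Xs : (n_state, K) static
    (climatological) ensemble; a in [0, 1] is the blending weight.
    a = 0 reproduces the offline (static) prior; a = 1 is pure forecast.
    """
    mean = a * Xf.mean(axis=1) + (1 - a) * Xs.mean(axis=1)
    pert = (np.sqrt(a) * (Xf - Xf.mean(axis=1, keepdims=True))
            + np.sqrt(1 - a) * (Xs - Xs.mean(axis=1, keepdims=True)))
    return mean[:, None] + pert
```

Note that the two limiting cases recover the pure static and pure forecast ensembles exactly, matching the behavior described for a = 0.0 and a = 1.0.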

Data and experimental configuration
The relative forecast skill of a LIM depends on the data used to empirically derive the mapping term G_1. For this reason, we explore LIMs calibrated on four different datasets. CGCM calibration data are taken from two last-millennium climate simulations in the Coupled Model Intercomparison Project phase 5 (CMIP5; Taylor et al., 2012): the Community Climate System Model v4 (CCSM4; Landrum et al., 2013) and the Max Planck Institute Earth system model paleo-mode (MPI). These simulations cover a 1000-year preindustrial (850-1850 CE) time period, including volcanic aerosol forcing events, greenhouse gases, and solar variability. The remaining two LIMs are calibrated on instrumental-era data: the Twentieth Century Reanalysis (20CR) and the Berkeley Earth (BE) surface temperature dataset.

The basic configuration we use for all experiments, including the offline control, involves a choice of data to sample as the static prior ensemble, an instrumental data source to calibrate proxy observation models, and a proxy record dataset. For the static prior, we use annually averaged 2 m air temperature anomalies from the CCSM4 last-millennium simulation. The observation models for proxy data are calibrated against the NASA Goddard Institute for Space Studies surface temperature analysis dataset (GISTEMP; Hansen et al., 2010) by linearly regressing the proxy time series against the nearest grid point in the calibration data (for details, see Hakim et al., 2016). All experiments use annually resolved proxy records from the PAGES 2k Consortium (2013) database. These proxies have been ascertained to covary regionally with temperature and include tree rings, ice cores, corals, sediment cores, and speleothems. Only proxies with a minimum of 10 years overlapping with the observation model calibration data and a minimum calibration-fit correlation of 0.2 are used. The correlation threshold is not strictly necessary; Hakim et al. (2016) found that it did not quantitatively affect the reconstruction results. Here, the reduction to proxies carrying more information helps reduce computational costs, allowing a larger number of reconstruction experiments.
We reconstruct annual-mean 2 m air temperature anomalies for the period 1850-2000 CE, as in Hakim et al. (2016). Using the four LIM calibrations, we search the parameter space 0 ≤ a ≤ 1 for a weighting between static and forecast information sources that improves the reconstruction compared to the offline method. We judge improvements by means of our chosen skill metrics (coefficient of efficiency, CE; correlation; and continuous ranked probability score, CRPS) for both GMT and spatial fields. Additionally, we perform a persistence experiment for comparison against LIM-based performance, where the posterior for year t is used as the prior for year t + 1. The persistence forecast uses the same hybrid PDA blending scheme as the LIM forecast experiments to mitigate the effects of reductions in ensemble variance from the assimilation process. We account for the sensitivity to the proxy data used in a CFR through random resampling of the available proxy data and the static prior ensemble. A single realization uses a random sample of 75 % of the usable proxy records and a 100-member sample of anomaly states from the prior. A total of 100 realizations are performed for each LIM calibration and blending coefficient. In order to make the realizations consistent between the experiments using different blending coefficients, we ensure that the same sequences of random samples are taken by seeding the random number generator for each a value. In total, this gives 10^4 reconstructions of the climate state for each experiment. These reconstructions are then averaged to give the final analysis. See Sect. S1 in the Supplement for a brief discussion of the computational costs associated with these experiments.
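The seeded resampling described above can be sketched as follows. The function name, dataset sizes, and use of NumPy's `Generator` are illustrative assumptions; the point is that re-seeding with the same value for every a reproduces identical proxy and prior samples across experiments.

```python
import numpy as np

def draw_realization(n_proxies, n_prior_states, seed):
    """Reproducible sampling for one reconstruction realization (sketch).

    Using the same seed across experiments guarantees identical
    proxy/prior samples for every blending coefficient a, so skill
    differences are attributable to the forecasts alone.
    """
    rng = np.random.default_rng(seed)
    # 75 % of the usable proxy records, drawn without replacement
    proxy_idx = rng.choice(n_proxies, size=int(0.75 * n_proxies),
                           replace=False)
    # 100-member sample of anomaly states for the static prior
    prior_idx = rng.choice(n_prior_states, size=100, replace=False)
    return proxy_idx, prior_idx
```

Looping this over 100 seeds, one per realization, yields the ensemble of reconstructions that is averaged into the final analysis.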

Skill metrics
The primary skill metrics used are correlation and the CE (Eqs. 8 and 9; Nash and Sutcliffe, 1970). Correlation gives an overall sense of signal timing (phase), while CE is a stricter metric that is sensitive to signal timing, amplitude, and bias:

r = Σ_t (x_t - x̄)(v_t - v̄) / (τ σ_x σ_v),    (8)

CE = 1 - Σ_t (x_t - v_t)^2 / Σ_t (v_t - v̄)^2.    (9)

Using these metrics, we compare the reconstructed ensemble-mean^1 values x against GISTEMP validation values v. The value τ represents the number of validation times available (in this case spanning the GISTEMP period of 1880-2000), an overbar (e.g., v̄) denotes a temporal average, and σ_x and σ_v are the standard deviations of the respective time series. Skill scores are compared for the reconstructed global mean temperature (GMT) and spatial grid points.
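The two metrics can be computed directly from their definitions; the function names below are our own.

```python
import numpy as np

def correlation(x, v):
    """Pearson correlation between reconstruction x and validation v."""
    xa, va = x - x.mean(), v - v.mean()
    return (xa * va).sum() / np.sqrt((xa ** 2).sum() * (va ** 2).sum())

def coefficient_of_efficiency(x, v):
    """CE = 1 - SSE / variance of v about its own temporal mean.

    CE = 1 is a perfect match; CE <= 0 means the reconstruction does
    no better than the validation-period climatological mean.
    """
    return 1.0 - ((x - v) ** 2).sum() / ((v - v.mean()) ** 2).sum()
```

A constant reconstruction equal to the validation mean scores CE = 0, which is why CE penalizes amplitude and bias errors that correlation ignores.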
We also use the CRPS (Gneiting and Raftery, 2007) as a complement to the CE skill metric. The CRPS is a proper scoring rule, which prevents manipulations of the data from overestimating the reconstruction skill; the measure reflects both the mean absolute error and the narrowness of the ensemble distribution. We use the CRPS as defined in Tipton et al. (2016) to calculate the CRPS of the reconstructed GMT for each realization,

CRPS = (1/τ) Σ_t [ (1/K) Σ_i |x_t^(i) - v_t| - (1/(2K^2)) Σ_i Σ_j |x_t^(i) - x_t^(j)| ],    (10)

where x_t^(i) denotes a single member of a K-member ensemble. We take the score as the average CRPS over all realizations and use a Kolmogorov-Smirnov test on the resulting distribution to determine whether it is significantly different from the offline case with 95 % confidence. As defined here, lower values of CRPS indicate better performance, with the limiting case of CRPS = 0 being a perfect ensemble fit (no error and no ensemble spread).
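The sample CRPS can be computed directly from the standard Gneiting and Raftery form: a mean-absolute-error term minus an ensemble-spread term. The function name and array layout below are our own.

```python
import numpy as np

def crps_ensemble(ens, v):
    """Sample CRPS of a K-member ensemble time series (sketch).

    ens : (tau, K) ensemble members per time; v : (tau,) verification.
    Lower is better; 0 only for a zero-spread, zero-error ensemble.
    """
    K = ens.shape[1]
    # mean absolute error of the members against the verification
    mae_term = np.abs(ens - v[:, None]).mean(axis=1)
    # mean absolute pairwise spread of the members
    spread_term = (np.abs(ens[:, :, None] - ens[:, None, :])
                   .sum(axis=(1, 2)) / (2 * K ** 2))
    return (mae_term - spread_term).mean()
```

The spread term is why CRPS rewards narrow, well-centered ensembles rather than just a small ensemble-mean error.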
^1 Ensemble mean represents the average taken over all ensemble members and all realizations.

Figure 2. Reconstructed global mean 2 m air temperature compared to GISTEMP (solid black) and the offline experiment (dotted black) for each online experiment. The top row shows the three forecasting experiments with the highest GMT CE score, while the bottom row shows the other two experiments. The GMT plotted for each forecasting experiment uses the blending coefficient that achieves the highest GMT CE skill (CCSM4: a = 0.9; MPI: a = 0.9; 20CR: a = 0.9; BE: a = 0.7), except for the persistence case, where a = 0.9 was used.

Validation of global mean temperature
Figure 1 displays 2 m GMT results validated against GISTEMP for all tested values of the blending parameter a. Every case except the persistence forecast method yields CE values greater than the offline case. Correlations are higher than the offline benchmark for all experiments, including the persistence forecast. The best skill is achieved for a values between 0.7 and 0.9, with a steep drop in validation skill as a approaches unity (a pure LIM forecast) due to filter divergence (the ensemble variance decreases with time, decreasing the weight on the proxies, so that the reconstructed states diverge from reality). Results for the CCSM4 and 20CR LIMs display the best overall CE performance, with a 9 % improvement over the offline method. These two experiments also display a 2 % increase in correlation and have slightly smaller correlation than the persistence forecast experiment (Table 1).
Figure 2 shows the reconstructed GMT from each experiment at the best blending coefficient (with respect to CE) compared to GISTEMP and the offline case. As evidenced by the high skill scores, the reconstructions capture the variability in the global temperature signal of GISTEMP quite well. Compared to the offline experiment, the forecasting experiments tend to decrease the amplitude of the interannual variability, which is at times largely overestimated by the offline experiment, and they tend to change the overall warming trend. There is no apparent systematic bias in the reconstructed GMTs; thus, skill changes for the different experiments are likely controlled by these two factors. Figure 4 shows the GMT validation using CRPS, which behaves much like the CE metric (albeit mirrored in the vertical, since lower CRPS indicates better performance). Specifically, GMT validation with CRPS shows that all LIM experiments outperform the offline method, with the CCSM4 and 20CR LIMs again having the best performance (18 % better than the offline case). The best CRPS scores of all LIM experiments are also significantly better than the offline case with 95 % confidence. Overall, CRPS gives nearly equivalent information to CE about the skill of the GMT reconstruction. However, as CRPS is a different metric, there are slight differences in results for the MPI and BE LIM cases when comparing CE and CRPS: the blending coefficient achieving the best score shifts to the next highest a value in both cases, and the BE LIM outperforms the MPI LIM when considering CRPS (Table 1). Both the CE and CRPS measures for different blending coefficients are affected by the degree of fit to the warming trend in the GISTEMP reference. The trends for all experiments are shown in Fig. 5. The trend of the offline case (a = 0) is 0.62 K/100 yr, about 0.07 K/100 yr above the GISTEMP trend and near the edge of the GISTEMP trend 95 % confidence interval (Fig. 3). However, for the MPI and CCSM4 LIM experiments, as well as the persistence forecast experiment, the reconstructed trend increases as the blending parameter a increases. This increase in trend away from the GISTEMP trend for the MPI and CCSM4 experiments is reflected in the lowered CE (Fig. 1) and CRPS (Fig. 4) for a values from approximately 0.0 to 0.5; the trend then begins to decrease around a = 0.6, where CE also shows a significant increase towards its maximum values. The persistence forecast trend has the largest disagreement, increasing to approximately 0.72 K/100 yr for a = 0.9, which results in the CE and CRPS never surpassing the offline benchmark in this case. The 20CR LIM only increases the reconstructed trend slightly over the GISTEMP trend for middle a values. The BE experiment has a decreasing trend with increasing a and drops to a very low trend of 0.38 K/100 yr at a = 0.9. Interestingly, though the trends for the MPI, CCSM4, and 20CR experiments are below that of the GISTEMP trend for a = 0.95, their skill still outperforms the offline case in all three metrics. Despite the mismatch in the overall trend, these online forecasting methods still produce better matches of the phase and amplitude of GMT variability for the reconstructed anomalies compared to the offline case.
The trend results also illustrate the relative amount of proxy data utilization between these different experiments. Given that every experiment uses the same prescribed list of seeds to generate proxy record samples, the differences in reconstructed trend can only arise from differences in the weighting of the proxies or the LIM forecasts. Since LIMs are calibrated on detrended data and their forecast modes are damped, the forecast contribution to a long-term global mean trend is likely small; the trend is instead governed by the utilization of proxy information. For the EnKF PDA method, the weighting of information is controlled by the prior ensemble variance and the proxy error variance. The proxy error variance is fixed for all experiments we perform; thus, the changes in the reconstructed trend are a result of how the LIM forecasts affect the ensemble variance. In all forecast experiments, skill and the reconstructed trends drop off severely as a approaches 1.0. When using only forecast information (a = 1.0), the ensemble variance collapses due to the damped properties of the LIMs, which results in filter divergence. The BE LIM case reaches its maximum CRPS and CE values at smaller a and also has the lowest reconstructed trends of the LIM experiments. This suggests that the BE forecast produces less ensemble variance than the other LIMs, possibly due to forecast mode damping or a poor projection of the posterior analysis into the LIM EOF space. The eigenvalues of the BE LIM's leading two forecast modes have e-folding times of 5.4 and 1.5 years. This is in the same range as the leading forecast modes of the CGCM-calibrated LIMs (e.g., 3.7- and 1.2-year e-folding times for the MPI LIM). Consequently, a poor projection of the analysis ensemble onto the forecast modes of the BE LIM is likely the cause of the reduced ensemble variance. The persistence forecast displays an interesting disparity between the skill metrics; overall, it performs the best in correlation but the worst in CE and CRPS. Having the largest reconstructed trend suggests that the persistence case has the highest weighting of proxy data. With a persistence forecast there is no damping of reconstructed spatial anomalies and no truncation of the ensemble variance from projection into EOF space. The resulting higher proxy weighting may explain why the persistence case correlation is better than that of the other forecasting methods. The linear observation models in each case are based on a calibration against GISTEMP; thus, proxies with a better calibration fit (higher correlation) with GISTEMP have less error variance and therefore more influence on the posterior analysis.
The persistence case allows more information from the influential (well-correlated) proxies into the analysis because the prior variance is larger. However, from the CE and CRPS values, which are sensitive to more than just signal phase matching, it is clear that the general trend mismatch degrades the quality of the persistence reconstruction compared to the offline benchmark.
Removing the linear trend from each case allows for an examination of how well the reconstructions capture variability not associated with the warming trend (i.e., interannual and decadal variability; evident in Fig. 6). Generally, the performance increases for the detrended data over the offline case are much larger than for the full time series. Compared to validation with the full time series, the correlation of the detrended offline case drops from 0.9 to 0.67, and the CE drops from 0.77 to 0.29 (Table 2). With respect to CE, all experiments (including the persistence method) improve upon the offline benchmark. The 20CR LIM achieves the best improvement over the offline case (a 72 % increase), while the persistence case shows the least improvement (a 35 % increase). Except in the BE LIM experiment, detrended correlation metrics again increase slightly over the offline case. The BE LIM remains near the benchmark correlation for 0.0 < a < 0.7 and then drops below it. A CE improvement with no change in correlation implies that the BE LIM improves the detrended anomaly amplitudes and bias but does not improve the signal timing compared to the offline case.
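Detrended validation amounts to removing the least-squares linear fit from each time series before recomputing the metrics; a small sketch (function name is our own):

```python
import numpy as np

def detrend(series):
    """Remove the least-squares linear trend from a 1-D time series."""
    t = np.arange(series.size)
    slope, intercept = np.polyfit(t, series, 1)
    return series - (slope * t + intercept)

# e.g., detrended correlation between reconstruction x and validation v:
#   r_detrended = np.corrcoef(detrend(x), detrend(v))[0, 1]
```

Because the fitted trend is subtracted, a detrended series has zero linear trend and (by least squares) zero mean, so the remaining skill reflects interannual-to-decadal variability only.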
Both the CE and correlation skill metrics of the offline experiment decrease when calculated on the detrended GMT. In contrast, the CRPS improves by 7 % (decreasing from 12.5 to 11.6). CRPS rewards a reduction in mean absolute error and narrowness of the forecast ensemble, whereas CE and correlation depend more on the variance properties of the reconstructed time series. In removing the linear warming trend, we remove a large degree of the time series' variance and subsequently lose the associated skill in CE and correlation. In the case of CRPS, the linear trend is only a source of mean error when it does not closely match the reference trend. When we remove the trend, the mean errors decrease (the ensemble spread is unaffected) and the CRPS metric improves.
Figure 7 shows CRPS for all blending coefficients with detrended data. The behavior is quite similar to the full GMT CRPS, and as the detrended CE reflects, even the persistence forecast shows improvement over the offline method. One aspect that stands out with detrended CRPS is that the persistence forecast achieves the best value when a = 1.0. A cursory examination of the detrended GMT time series of the persistence case compared to the detrended GISTEMP GMT time series (see Supplement, Fig. S1) reveals that it captures some decadal variability over the instrumental record but none of the interannual variability; i.e., the a = 1.0 persistence reconstruction gives a smoothed representation of the GMT. This again highlights a difference between the two metrics of CE and CRPS. The CRPS metric, which generalizes to the mean absolute error of the ensemble summed over time, does not penalize the smoothed GMT signal approximately bisecting the interannual signal for a = 1.0. The CE metric, which sums the squared errors of the ensemble mean and then normalizes by the climatological variance, does penalize this behavior.

Validation of spatial fields
Here we compare the skill of the spatial fields with the offline case using correlation and CE; CRPS is omitted because the full spatial field ensembles are too large to store. For ease of visualization, we provide CE difference maps to highlight changes relative to the offline case; full spatial skill maps can be found in the Supplement (Figs. S2 and S3). Over the globe, skill is mostly positive, but there are a few regions with highly negative skill departures, such as the Southern Hemisphere oceans and the high-latitude North Atlantic-Barents Sea corridor (Fig. 8). In the high-latitude Atlantic region, the proximity to the sea ice edge seems to negatively affect the ability to constrain the temperature field when using only proxies. All LIM-forecasting cases show improvements to CE in the same North Atlantic-Barents Sea area and across northern Europe into Asia; there are also smaller skill increases across western North America. Studies of LIM predictability have shown that the North Atlantic-Barents Sea region can have forecast skill on annual and longer timescales (e.g., Hawkins and Sutton, 2009; Newman, 2013). This may be one reason why skill increases are common in this area for all LIM experiments. An inspection of grid point temperature values northeast of Iceland (see Supplement, Fig. S5) shows that the reconstructed temperature variance is largely overestimated in the offline case and that there is a slight trend in the reconstruction that is not present in the GISTEMP data.
Figure 8. Spatial maps displaying the difference in coefficient of efficiency (CE) from the offline case. Difference maps are displayed for each forecasting experiment using the blending coefficient that achieves the highest full GMT CE skill (CCSM4: a = 0.9; MPI: a = 0.9; 20CR: a = 0.9; BE: a = 0.7), except for the persistence case, where a = 0.9 was used. The reference CE of the offline case is shown in the upper left and uses the same color scale as the difference maps. Area-weighted global average differences are given in the title of each panel. All global mean differences except in the BE LIM case are significantly different from the offline benchmark with 95 % confidence.
The CCSM4 LIM experiment shows the largest increase in area-weighted average CE (0.09), followed by the MPI LIM with an improvement of 0.06. The 20CR and persistence cases show decreases in average spatial skill across the grid, with 20CR being worst, with a global mean CE change of −0.18. The BE LIM, while showing improvements over Northern Hemisphere land areas, has compensating decreases in skill over the ocean that make the global mean CE nearly equivalent to the offline case. All global mean CE values, except in the BE LIM case, are significantly different (at 95 % confidence) from the offline case when comparing grid-point skill distributions using a Student's t test. Changes in spatial correlation (see Supplement, Fig. S4) are generally small in regions where CE increases, which suggests that improvements are not related to signal phasing. However, some of the large decreases in CE for the 20CR, BE, and persistence experiments do coincide with areas of correlation decreases.
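The significance test referenced above can be illustrated schematically. The sketch below is our own construction (the paper does not publish this exact code, and we assume a paired formulation over collocated grid-point CE values): it computes a Student's t statistic for the mean grid-point CE difference between an online experiment and the offline benchmark.

```python
import numpy as np

def paired_t_statistic(ce_online, ce_offline):
    """Paired Student's t statistic for grid-point CE differences.

    Tests whether the mean CE difference across grid points differs
    from zero; compare |t| against the t distribution with n - 1
    degrees of freedom for a 95 % confidence decision.
    """
    d = np.asarray(ce_online, dtype=float) - np.asarray(ce_offline, dtype=float)
    n = d.size
    se = d.std(ddof=1) / np.sqrt(n)  # standard error of the mean difference
    return d.mean() / se
```

In practice, grid points would also need area weighting, and spatial autocorrelation reduces the effective sample size; this sketch omits both considerations.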
In the spatial results, there is a clear distinction between LIMs calibrated using data from the shorter instrumental era (20CR and BE) and those calibrated on millennium-scale climate simulation data (CCSM4 and MPI). Compared to the offline spatial skill, large areas of CE skill degradation are apparent for both the 20CR and BE LIM reconstructions. The 20CR LIM CE skill degradations are large in amplitude (ΔCE < −1) and occur mostly over ocean regions. It is surprising that the 20CR LIM has the worst spatial skill given that it has the best GMT time series CE skill. However, previous CFR studies show similar results (Annan and Hargreaves, 2012; Wang et al., 2014), where spatial averaging over a poorly reconstructed field can boost the signal-to-noise ratio for large-scale indices enough to result in positive index skill. The situation is somewhat different in this study. In the offline case, spatial skill is reasonably positive, but when adding 20CR LIM forecasts, the spatial CE skill decreases while the GMT CE skill increases. A possible interpretation of this behavior is that the degradation in spatial field fidelity compensates for aspects of the GMT that are overestimated in other experiments. For example, the offline reconstructed GMT overestimates interannual variability during certain time periods when compared to GISTEMP, but the usage of LIM forecasts mitigates this effect. Compared to the CCSM4 LIM experiment, the reconstructed GMT for the 20CR LIM experiment reduces the sum of squared errors (the numerator term in CE) by approximately 5 %. The bias accounts for only about 1 % of the total sum of squared errors, which leaves trend and anomaly amplitude agreement as the primary candidates for the error reduction. As shown earlier, the 20CR GMT trends tend to track lower than those of the two CGCM LIM experiments and closer to the GISTEMP trend (Fig. 5). The trend reduction in the 20CR experiment appears to be caused by large areas of negative trends in the Southern Hemisphere (see Supplement, Fig. S6).
The poor spatial performance of the 20CR LIM appears to be linked to its EOF basis (Fig. 9). The leading EOF of the 20CR experiment lacks the ENSO- and Pacific Decadal Oscillation-like pattern of the other LIMs and instead emphasizes variability structures in the Southern Ocean. This is a region of high variability due to its proximity to the storm track, but there are also fewer pressure observations available for assimilation by the 20CR (see Fig. 3 in Woodruff et al., 2011), especially during the early portion of the record. Many of the large decreases in CE are located in these same regions, which leads us to speculate that features of the 20CR LIM may be influenced by artifacts in the 20CR dataset. Another consideration for the lower performance of both instrumental-era LIMs is that they are based on shorter records that coincide with variability related to anthropogenic forcing. Separating the global warming trend from the LIM by means of linear detrending is bound to leave residual signals that affect the LIM forecast modes, and a LIM based on a shorter record may not sample enough variability to properly characterize representative modes over a longer time span. The BE LIM displays a leading EOF more similar to those of the CGCM-based LIMs but still has skill problems over large ocean areas. Notably, although the BE LIM is based on only 65 years of data, it produces much less spatial skill degradation than the 20CR LIM with its 150 years of data. This suggests that confounding factors, including record length, linear detrending, and the assimilation method used to create the 20CR data, influence the skill of the 20CR experiment.
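The EOF bases discussed here come from a decomposition of the detrended calibration anomalies; one standard route is a singular value decomposition of the (time, space) anomaly matrix. A minimal sketch (our own; the matrix orientation and truncation are assumptions, though the singular-value scaling matches the Fig. 9 caption):

```python
import numpy as np

def leading_eofs(anom, n_modes=3):
    """Leading EOFs of a (time, space) anomaly matrix via SVD.

    Returns the leading spatial patterns scaled by their singular
    values and the fraction of total variance each mode explains.
    """
    anom = anom - anom.mean(axis=0)            # ensure zero time mean
    _, s, vt = np.linalg.svd(anom, full_matrices=False)
    var_frac = s**2 / np.sum(s**2)             # variance fraction per mode
    eofs = vt[:n_modes] * s[:n_modes, None]    # scale patterns by singular value
    return eofs, var_frac[:n_modes]
```

For a LIM calibrated in EOF space, the retained principal-component time series (the corresponding columns of U times S) would form the reduced state vector.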

Conclusions
We have outlined and tested a new method for performing online paleoclimate data assimilation (PDA) for climate field reconstructions (CFRs) using linear inverse models (LIMs). We tested four different LIMs empirically derived from surface temperature data from the following datasets: Berkeley Earth (BE), the 20th Century Reanalysis (20CR), and two last-millennium climate simulations (CCSM4 and MPI) from the Coupled Model Intercomparison Project phase 5 (CMIP5). We also performed a persistence forecast experiment for comparison. In general, we find that LIM-enabled online assimilation improves upon the offline results for both the global average and spatial field of 2 m air temperature.
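The core of the online step is the empirically derived LIM. In the usual formulation, the lag-τ propagator is estimated from the lag and contemporaneous covariances of the detrended, EOF-truncated calibration data, G = C(τ) C(0)⁻¹, and the deterministic forecast is x(t+τ) = G x(t). A minimal sketch under those standard assumptions (not the paper's released implementation):

```python
import numpy as np

def calibrate_lim(states, lag=1):
    """Estimate the LIM propagator G from a (time, nstate) series.

    G = C(lag) @ inv(C(0)), where C(0) is the contemporaneous
    covariance and C(lag) is the lag covariance of the anomalies.
    """
    x = np.asarray(states, dtype=float)
    x = x - x.mean(axis=0)               # work with anomalies
    x0, xlag = x[:-lag], x[lag:]
    c0 = x0.T @ x0 / len(x0)             # C(0)
    ctau = xlag.T @ x0 / len(x0)         # C(lag)
    return ctau @ np.linalg.inv(c0)

def lim_forecast(G, state):
    """Deterministic one-lag forecast: x(t + lag) = G @ x(t)."""
    return G @ state
```

In an online reconstruction, each assimilation cycle would propagate the analysis ensemble forward with `lim_forecast` to form the next prior, at a cost negligible compared to a CGCM integration.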
Broadly speaking, the LIM experiments show good ability to reconstruct many aspects of the GISTEMP GMT data, including interannual and low-frequency variability. The largest skill improvements occur for skill metrics calculated on the detrended GMT, which suggests that the LIM forecasts add useful information at interannual timescales. The coefficient of efficiency (CE) for detrended metrics shows an average increase of around 57 %, while correlations increase by around 4 %. The continuous ranked probability score (CRPS) metrics improve by an average of 15 % across all LIM experiments. Skill metrics tend to maximize for blending coefficients with a higher weighting on LIM forecast information (0.7 < a < 0.95). Spatial skill reveals that the addition of LIM forecasting provides spatial information in regions where the offline method performs poorly, including Northern Hemisphere land areas and the North Atlantic-Barents Sea corridor. Large skill improvements in the North Atlantic through the Barents Sea region are primarily a result of better constraining the temperature variance at these locations. The two LIMs calibrated using instrumental-era data (20CR and BE) display large regions over the ocean where the skill degrades compared to the offline case. Even with large areas of improvement, the 20CR LIM decreases the area-weighted average CE (−0.18), and the BE LIM area-weighted average breaks even. In contrast, the two CGCM-based LIM experiments show area-weighted average CE increases of 0.09 (CCSM4) and 0.06 (MPI). When considering both GMT and spatial skill results, the CGCM-based LIMs have the best overall performance, with the CCSM4 LIM slightly outperforming the MPI LIM. The persistence forecast fails to improve the more stringent GMT skill metrics (CE and CRPS) as well as general spatial skill, although it does well in GMT time series correlation. This suggests that the improvements of online reconstructions when using a LIM are due to forecast information and not simply the addition of temporal persistence.
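The CRPS referenced above is a probabilistic score that rewards both accuracy and well-calibrated ensemble spread. For an ensemble forecast it can be computed with the standard kernel form (this sketch is our own and is not tied to the paper's code):

```python
import numpy as np

def crps_ensemble(ens, obs):
    """CRPS of a scalar ensemble forecast (kernel form).

    CRPS = E|X - y| - 0.5 * E|X - X'|, with X, X' drawn from the
    ensemble and y the verifying observation. Lower is better; for a
    one-member (deterministic) forecast it reduces to |X - y|.
    """
    ens = np.asarray(ens, dtype=float)
    term1 = np.mean(np.abs(ens - obs))                          # accuracy term
    term2 = 0.5 * np.mean(np.abs(ens[:, None] - ens[None, :]))  # spread term
    return term1 - term2
```

Averaging this score over all verification years gives a single GMT CRPS value per experiment, comparable across blending coefficients.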
Though we are reconstructing instrumental-era surface temperatures, it is an interesting result that the CGCM-calibrated LIMs based on last-millennium (850-1850) simulations have the best overall performance. This could mean that long samples of variability that do not contend with influences from anthropogenic forcing are beneficial for reconstruction purposes. The LIMs used in these experiments were all calibrated on data with the least-squares linear fit trend removed. In order for LIMs based on observational data sources to achieve similar results, it may be necessary to employ a more sophisticated method of filtering out the global warming signal. However, one benefit of using CGCM-based LIMs is that they enable forecasts for a much larger set of climate-related quantities than are available from observations alone.
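The detrending step mentioned above amounts to removing a least-squares linear fit from the calibration data. A minimal illustration (the paper does not specify its exact implementation; this assumes an ordinary least-squares fit in time applied to each calibration series):

```python
import numpy as np

def detrend_linear(time, series):
    """Remove the least-squares linear fit from a time series."""
    slope, intercept = np.polyfit(time, series, 1)
    return series - (slope * time + intercept)
```

As the text notes, such a linear fit is a blunt instrument for the anthropogenic signal; a forced-response estimate removed instead of a straight line is one candidate for the "more sophisticated" filtering suggested for observation-based LIMs.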
In this work, we have shown that we can improve both GMT and spatial field skill over the offline EnKF PDA method through the inclusion of a simple forecast model. A previous comparison of offline and online PDA using a CGCM as a forecast model found no discernible difference in reconstruction skill (Matsikaris et al., 2015), and earlier studies of the EnKF PDA method forewent the usage of forward models, citing insufficient model skill to justify the expense (Bhend et al., 2012; Steiger et al., 2014). Our results show that an online method can increase reconstruction fidelity, and, more importantly, we show that it can be done using an empirical forecasting method that is nearly as computationally efficient as the offline approach. As such, this method provides a useful foundation for further investigation of incorporating the dynamical constraints of a forecast model into climate field reconstructions.
Data availability. GMT and selected spatial reconstruction output (for blending coefficients 0.7 ≤ a ≤ 1.0), along with code capable of running the experiments and the required preprocessed input data (from CCSM4, MPI, 20CR, BE, GISTEMP, and PAGES 2k sources), are published under doi:10.6069/H5VD6WC0 (Perkins and Hakim, 2017).
The Supplement related to this article is available online at doi:10.5194/cp-13-421-2017-supplement.

Figure 1. Comparison of global mean 2 m air temperature coefficient of efficiency (CE; a) and correlation (b) metrics for different blending coefficients. The colored lines represent the different LIM calibration experiments using data from the Community Climate System Model v4 (CCSM4), NOAA 20th Century Reanalysis v2 (20CR), Max Planck Institute Earth system model (MPI), and Berkeley Earth surface temperatures (BE), or the persistence forecast case (Persist). The offline benchmark is depicted as the horizontal dashed black line.

Figure 4. Comparison of global mean 2 m air temperature continuous ranked probability score (CRPS) for different blending coefficients. The colored lines represent the different forecasting experiments, while the offline benchmark is depicted as the horizontal dashed black line. Starred points indicate a statistically significant (95 % confidence) difference between the offline benchmark and the online experiment.

Figure 5. Calculated trends from a least-squares fit to the reconstructed global mean 2 m air temperature. Colored lines depict the calculated trends for each LIM experiment across a range of blending coefficients, while the black lines represent the benchmark trends calculated from the offline reconstruction (dashed) and GISTEMP data (solid).

Figure 6. Comparison of detrended global mean 2 m air temperature CE (a) and correlation (b) metrics across different blending coefficients for all experiments. The colored lines represent the different forecasting experiments, while the offline benchmark is depicted as the horizontal dashed black line.

Figure 7. Comparison of detrended global mean 2 m air temperature continuous ranked probability score (CRPS) for different blending coefficients. The colored lines represent the different forecasting experiments, while the offline benchmark is depicted as the horizontal dashed black line. Starred points indicate a statistically significant (95 % confidence) difference between the offline benchmark and the online experiment.

Figure 9. Leading mode of the empirical orthogonal function (EOF) basis for each LIM calibration. The total fraction of variance explained is given in the title of each panel. All EOFs have been multiplied by their corresponding singular values.
The 20th Century Reanalysis (20CR; Compo et al., 2011), a DA synthesis of observations and a weather forecast model, provides over 150 years of reanalysis data spanning the instrumental record (1850-2012). Finally, we use the Berkeley Earth surface temperature dataset (BE; Rohde et al., 2013) for observational calibration. BE provides a 65-year sample (1960-2014) with nearly complete global coverage. The different LIM calibration datasets used here span linear modes of predictability derived from model space to those of observations.

Table 1. Best values of the coefficient of efficiency (CE), correlation (r), and continuous ranked probability score (CRPS) validation metrics for global mean 2 m air temperature. For each experiment, the best values are given with the corresponding blending coefficient (a) that achieved them and the percentage change compared to the offline case. An asterisk (*) indicates which experiment achieved the best performance in a given metric. Offline validation metrics are given for reference. Columns (full GMT): max CE, % change in CE, CE a value; max r, % change in r, r a value; min CRPS, % change in CRPS, CRPS a value.

Table 2. Best values of the CE, correlation, and CRPS validation metrics for detrended global mean 2 m air temperature. For each experiment, the best values are given with the corresponding blending coefficient (a) that achieved them and the percentage change compared to the offline case. An asterisk (*) indicates which experiment achieved the best performance in a given metric. Offline validation metrics are given for reference. Columns (detrended GMT): max CE, % change in CE, CE a value; max r, % change in r, r a value; min CRPS, % change in CRPS, CRPS a value.