Benchmarking monthly homogenization algorithms


normally distributed breakpoint sizes. To approximate real world conditions, breaks were introduced that occur simultaneously in multiple station series within a simulated network of station data. The simulated time series also contained outliers, missing data periods and local station trends. Further, a stochastic nonlinear global (network-wide) trend was added.

Participants provided 25 separate homogenized contributions as part of the blind study as well as 22 additional solutions submitted after the details of the imposed inhomogeneities were revealed. These homogenized datasets were assessed by a number of performance metrics including (i) the centered root mean square error relative to the true homogeneous value at various averaging scales, (ii) the error in linear trend es-

Introduction
Monitoring and analysis of our climate have received more and more attention following assessments that most of the temperature change observed over the last fifty years can be attributed to anthropogenic forcings (IPCC, 2007). To study climate change and variability, many long instrumental climate records are available at the surface. These datasets are essential since they are the basis for assessing century-scale trends, for the validation of climate models, as well as for the detection and attribution of climate change at a regional scale. The value of these datasets, however, strongly depends on the homogeneity of the underlying time series.
In essence, a homogeneous climate time series is defined as one where variations are caused only by variations in weather and climate. Long instrumental records are rarely if ever homogeneous. Results from the homogenization of instrumental western climate records indicate that detected inhomogeneities in mean temperature series occur roughly once every 15 to 20 yr. Moreover, the typical size of the breaks is often of the same order as the climatic change signal during the 20th century (Auer [...] changes is the relative homogenization approach, which assumes that nearby stations are exposed to almost the same climate signal and that the differences between nearby stations can thus be utilized to detect inhomogeneities (Conrad and Pollack, 1950). In relative homogeneity testing, a candidate time series is compared to multiple surrounding stations, either in a pairwise fashion or to a single composite reference time series computed for multiple nearby stations.
Homogenization has a long tradition. In the early instrumental period, documented change-points were removed with the help of parallel measurements. For example, biases due to changes in observing times were adjusted using multi-annual 24 h measurements (Kreil, 1854a, b). In the early 20th century Conrad (1925) [...] segments in the candidate. The annual dataset generated by Menne and Williams (2005) was more realistic than the previously mentioned studies. They also inserted breaks in the reference time series and did not enforce an artificial minimum period between breaks. Moreover, by studying the sizes of breaks known from metadata, they showed that these sizes follow a normal distribution; such breaks were thus implemented in their dataset. The consequence of such a distribution is that the dataset contains many small breaks that are hardly detectable; see also Domonkos and Štepánek (2009).
However, these small breaks are important for the detection of the climatologically more important detectable ones (Domonkos, 2011a) and likely for the correction as well (Easterling and Peterson, 1995). A recent validation study by Domonkos (2011a) directly generated artificial difference time series to compare eight different objective detection methods. The inserted inhomogeneities range from simple one-break cases to cases with a very complete and realistic description of the inhomogeneities, including platform-like inhomogeneities in which the first break is soon followed by a second break in the opposite direction.
The large number of different monthly homogenization methods and the need for a realistic comparative study were the reason to start a coordinated European initiative, the COST Action HOME ES0601: Advances in Homogenization Methods of Climate Series: an integrated approach (HOME). Its main objective was to review and improve common homogenization methods, and to assess their impact on climate time series (HOME, 2011). As part of the Action a dataset was generated that serves as a benchmark (Sim et al., 2003) for comparing homogenization algorithms. This study analyses the results of this exercise. Based upon a survey among homogenization experts, the Action has chosen to focus on networks with monthly values for temperature and precipitation. Temperature and precipitation were selected because most participants consider these elements the most relevant. Furthermore, these elements represent two important types of statistical models (additive and multiplicative). For climate data aggregated to monthly scales, there is a large selection of possible homogenization algorithms. However, so far intercomparison studies have been based on annual data. Consequently, an intercomparison study is most needed for monthly data.
All studies before Domonkos (2008) have assessed the skill of homogenization algorithms based on the accuracy of the detection of breaks, which is a basic metric for a developer of homogenization algorithms. However, a climatologist may want to know to what degree decadal variability and trends in homogenized data may be due to remaining small inhomogeneities. Answering such questions requires an evaluation of the output of full homogenization methods in terms of other statistical metrics, for instance the remaining error in linear trend estimates and the mean square error between the true time series and the homogenized ones (Domonkos, 2008; Domonkos et al., 2011).
For these errors to be applicable to real datasets and to be able to perform a benchmarking of homogenization algorithms, the structure of the artificial data and its inserted inhomogeneities should be realistic.
Realistic climate data are generated with the surrogate data approach (Venema et al., 2006a), which is able to reproduce the cross-correlation structure of existing homogenized networks, as well as the auto-correlation functions of the stations and their difference time series. For comparison, Gaussian white noise is also generated for the so-called synthetic data section of the benchmark dataset. In the homogeneous artificial datasets, known inhomogeneities are randomly inserted. Break inhomogeneities are modeled as an independent Poisson process and their sizes are normally distributed. Additionally, breaks are introduced that occur simultaneously in multiple stations. Furthermore, outliers, missing data and local trends are inserted and a random global (network-wide) trend is added.
To be able to study how realistic the inserted inhomogeneities are, a third section of the benchmark contains real inhomogeneous data. This allows for a comparison of the [...] returned. Among the papers studying multiple algorithms, this study can be considered the most comprehensive one, with 25 contributions based on 13 algorithms being returned by the participants, including contributions based on manual methods. For the well-known algorithms MASH, PRODIGE and SNHT, multiple contributions have been returned; see Sect. 4. This allows the study of the importance of the implementation of an algorithm or of the operator of the software.
This paper focuses on the properties of the benchmark dataset and provides a first analysis of the accuracy of the algorithms. It is intended as a reference for follow-up studies analyzing the results in more detail. In Sect. 2, the data and the methods are presented that are used to generate the three data sections (real, surrogate and synthetic data) of the benchmark. The surrogate and synthetic data are treated as real homogeneous climate data, to which inhomogeneities are added. Section 3 explains how the inhomogeneities are introduced into the artificial dataset. Further details on the datasets and the types of breaks added can be found in the report by Venema et al. (2011). Section 4 provides a discussion of the homogenization principles and algorithms employed. The metrics used in the assessment are explained in Sect. 5. A general analysis of the submitted results is provided in Sect. 6. Some discussion and conclusions are offered in Sect. 7.

Data for benchmark dataset
The benchmark contains three data sections, one with observed, unhomogenized climate data (see Sect. 2.1) and two with artificial data. The main features of the real inhomogeneous dataset and the generation of the homogeneous artificial data are summarized below.
While the general statistical properties of the artificial data and the inhomogeneities required to simulate real world observing networks were discussed and approved within the COST Action HOME management team, the dataset was generated solely by the first author. The true underlying homogeneous artificial data was therefore not known to other participants until after the deadline for submitting homogenized results. After the deadline, the truth and all homogenized contributions were made available to all contributors for analysis and are now freely available via HOME (2011).
The main type of artificial data, which most contributors homogenized, is the so-called surrogate data section; see Sect. 2.2. Surrogate data reproduce the distribution, power spectrum and cross spectra of a real homogenized dataset. The power spectrum is equivalent to the correlation function, thus the auto- and cross-correlation functions of the input data are also replicated.
For every surrogate network, a so-called synthetic network was also generated. The difference (or ratio) time series of the synthetic dataset is temporally uncorrelated Gaussian white noise. To generate pairs of surrogate and synthetic networks with a similar configuration, the cross-correlation matrix, mean and standard deviation of the synthetic networks mimic those of a corresponding surrogate network; see Sect. 2.3.
While the surrogate data is most realistic, the statistical properties of the synthetic data are those assumed by most statistical tests used for homogenization. A comparison of the results between these two types of artificial data can thus be used to study the influence of violations of these conditions. The benchmark dataset contained 20 surrogate and 20 synthetic networks for both temperature and precipitation. During the analysis it was found that some of the input data was not homogenized well enough. Consequently, only the best 15 surrogate networks were used in the analysis. Se[...]
Networks with 100 yr of data (1900 to 1999) with 5, 9 or 15 stations were generated. The statistical properties of the surrogate data are based on homogenized complete (or with estimated values for missing data) temperature datasets from Austria, France (Brittany), and the Catalonian region, as well as such precipitation datasets from Austria and France (Bourgogne). These precipitation datasets were demeaned, detrended and cropped to one century. The temperature records were deseasonalised and detrended. After generating the surrogate, the means of the precipitation stations and the seasonal cycles of the temperature stations were added again. Some temperature datasets were shorter than 100 yr and were extended by mirroring them as often as needed and then cropping the dataset to 100 yr. To generate networks with different network configurations and a range of spatial correlations, a different subset of stations was selected for each surrogate network.
The surrogate data was generated using the Iterative Amplitude Adjusted Fourier Transform Algorithm (IAAFT), developed by Schreiber and Schmitz (1996), with a small modification of the second iterative step as described in Venema et al. (2006b). The IAAFT algorithm tends to generate time series that are not very intermittent in the sense of the variance of the (small-scale) variance (Venema et al., 2006a). Thus, if the input data contains inhomogeneities, its large-scale variability will be reproduced in the surrogate (difference) time series and the intense small-scale variability of the jump will be spread over the full period.
To produce a new time series each time, the iterative IAAFT algorithm starts with white noise. The first iterative step adjusts the Fourier coefficients. The second step adjusts the (temperature or precipitation sum) distribution. The latter changes the Fourier spectrum somewhat, which necessitates several iterations. These Fourier spectra and distributions stem from an example homogenized dataset
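The two alternating steps described above can be illustrated with a minimal sketch for a single time series. It assumes the standard univariate IAAFT iteration and NumPy; it does not reproduce the cross spectra of a network or the modified second step of Venema et al. (2006b), and the function name `iaaft` and its parameters are hypothetical.

```python
import numpy as np

def iaaft(x, n_iter=100, seed=0):
    """Univariate IAAFT sketch: same values as x, similar power spectrum."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    target_amp = np.abs(np.fft.rfft(x))  # Fourier amplitudes to reproduce
    sorted_x = np.sort(x)                # value distribution to reproduce
    s = rng.standard_normal(x.size)      # start from white noise
    for _ in range(n_iter):
        # Step 1: impose the target amplitudes, keep the current phases.
        phases = np.angle(np.fft.rfft(s))
        s = np.fft.irfft(target_amp * np.exp(1j * phases), n=x.size)
        # Step 2: impose the target distribution by rank ordering.
        ranks = np.argsort(np.argsort(s))
        s = sorted_x[ranks]
    return s
```

Because the loop ends with the distribution step, the surrogate is an exact permutation of the input values, while its spectrum only approximates the target; this is why several iterations are needed.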

Synthetic data section
Every surrogate network has a corresponding synthetic network. The generation of the synthetic data begins with computing a time series of the network mean precipitation or temperature. The difference (temperature) or ratio (precipitation) of each station series to this mean is then computed. This relative time series is converted to Gaussian white noise, which has the same mean, standard deviation and a similar spatial cross-correlation matrix, and is added to (or multiplied by) the network mean time series as described in Venema et al. (2011).

After the transformation to a Gaussian distribution, negative precipitation totals may occur; these values are explicitly set to zero. The cross-correlation matrix of the ratio time series of the synthetic data is close to that of the surrogate data, but after multiplying the ratio time series by the network mean time series the cross-correlations are perturbed. For this reason, the cross-correlations between the precipitation stations within a network are biased by several percentage points towards low correlations.

Inserted inhomogeneities
The artificial surrogate and synthetic data represent homogeneous climate data. To create the benchmarks, known inhomogeneities and other data disturbances are added: two types of break-type inhomogeneities and local trends, as well as outliers. Furthermore, two types of missing data are simulated and a global trend is added.
The two types of step-type breaks are random and clustered. Random breakpoints are inserted into the data series at an average rate of five per hundred years. To vary the quality of the data on a station by station basis, this frequency is drawn from a uniform distribution between 2 and 8 %. The break events are independent of each other (Poisson process). Breaks are thus also inserted in missing data periods, in close succession or near the beginning or end of the series. The size of the breaks is based on a Gaussian distribution with a standard deviation of 0.8 °C for temperature and 15 % for rain. These mean break sizes have a seasonal cycle with standard deviation 0.4 °C and 7.5 %. The breaks are inserted by multiplying the precipitation with monthly factors or adding monthly constants to temperature.
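A minimal sketch of the random temperature breaks described above, using only the Python standard library. Several details are assumptions made for illustration and are not taken from the benchmark code: the Poisson process is approximated by a Bernoulli trial per month, the seasonal cycle is modeled as a sinusoid with a Gaussian amplitude and random phase, and each shift is applied from the break to the end of the series. The function name `insert_random_breaks` is hypothetical.

```python
import math
import random

def insert_random_breaks(series, rng, sd_size=0.8, sd_cycle=0.4):
    """Add step-type breaks to a monthly temperature series (list of floats).

    Returns the perturbed series and the break positions (month indices).
    """
    rate_per_year = rng.uniform(0.02, 0.08)  # station-specific break frequency
    p_month = rate_per_year / 12.0           # Bernoulli approximation (assumption)
    out = list(series)
    breaks = []
    for t in range(1, len(series)):
        if rng.random() < p_month:
            mean_shift = rng.gauss(0.0, sd_size)
            amplitude = rng.gauss(0.0, sd_cycle)     # assumed sinusoidal cycle
            phase = rng.uniform(0.0, 2.0 * math.pi)
            monthly = [
                mean_shift + amplitude * math.sin(2.0 * math.pi * m / 12.0 + phase)
                for m in range(12)
            ]
            # Apply the monthly constants from the break onwards (assumption).
            for i in range(t, len(out)):
                out[i] += monthly[i % 12]
            breaks.append(t)
    return out, breaks
```

For example, `insert_random_breaks([0.0] * 1200, random.Random(1))` perturbs a century of monthly anomalies; for precipitation the monthly constants would be replaced by multiplicative factors.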
To simulate network-wide changes, clustered breaks are also added in 30 % of the networks. In the affected networks, 30 % of the stations have a break point at the same time. The random numbers for the mean size and seasonal cycle of these breaks are drawn from the same distributions and have the same properties as the random breaks. However, in this case the random numbers are not only drawn for every station, but additionally once for all breaks. The random numbers are then averaged with a weight of 80 % for the random number for all breaks and a weight of 20 % for the station-specific break.

In 10 % of the temperature stations a local linear trend is introduced. The station and beginning date of the trend were selected at random. The length of the trend has a uniform distribution between 30 and 60 yr. The beginning and the trend length were reselected as often as necessary to ensure that the local trend ended before the year 2000. The size of the trend at the end is randomly selected from a Gaussian distribution with a standard deviation of 0.8 °C. In half of these cases the perturbation due to the local trend persists after the end of the trend, e.g. to simulate urbanization; in the other half the station returns to its original value, e.g. to simulate a growing bush or tree that is cut at the end.
A small number of outliers was inserted to study the influence of imperfect quality control. The outliers are generated with a frequency of 1 per 100 yr per station. The outliers are added to the anomaly time series, i.e. without the annual cycle for temperature. The value of the outliers is determined at random by a value from the tails of the distribution.
Two types of missing data are added. The earliest data is removed to simulate a gradual increase in the availability of data, which is common in real datasets. This is done by forcing a linear increase in the number of stations from a total of three with data in 1900 to all stations having data in 1925. In addition, a large part of the network is set to missing during the years covered by World War II, which is typical for European datasets. In this case, there is a 50 % chance that the data is missing in 1945. Going backward from 1944 to 1940, the stations with missing data have a probability of 50 % that the data for the previous year is also missing.
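The World War II gap rule can be written as a short backward recursion. The following is a minimal sketch for a single station, assuming the gap always ends in 1945 and cannot extend before 1940; the function name `wwii_missing_years` is hypothetical.

```python
import random

def wwii_missing_years(rng, p=0.5):
    """Return the set of years (1940 to 1945) set to missing for one station."""
    missing = set()
    if rng.random() < p:  # 50 % chance that 1945 is missing
        missing.add(1945)
        for year in range(1944, 1939, -1):
            if rng.random() < p:  # gap extends one more year back
                missing.add(year)
            else:
                break
    return missing
```

Under these assumptions half of the stations have no gap at all, and each additional missing year is half as likely as the previous one.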

Finally, a global trend is added to every station in a network to simulate climate change. This trend is nonlinear given that homogenization should be independent of preconceived ideas about climate change. Furthermore, a different trend is stochastically modeled for every network because a known trend would allow for an improper validation of the results. The trend is generated as very smooth fractal Fourier "noise" with a power law power spectrum with an exponent of −4; only part of the signal is used to avoid the Fourier periodicity. This noise is normalized to a minimum of zero and a maximum of unity and then multiplied by a random Gaussian number. The width of this distribution is 1 °C or 10 %.
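A minimal sketch of such a trend generator, assuming NumPy, yearly resolution, random Fourier phases, and a generated signal four times longer than the part that is kept; the factor of four, the function name `global_trend` and its parameters are assumptions for illustration.

```python
import numpy as np

def global_trend(n_years=100, exponent=-4.0, width=1.0, seed=0):
    """Smooth random trend from power-law Fourier noise (sketch)."""
    rng = np.random.default_rng(seed)
    n = 4 * n_years                      # longer signal; factor 4 is an assumption
    freqs = np.fft.rfftfreq(n)
    amp = np.zeros_like(freqs)
    # Power ~ f**exponent, hence amplitude ~ f**(exponent / 2); no mean term.
    amp[1:] = freqs[1:] ** (exponent / 2.0)
    phases = rng.uniform(0.0, 2.0 * np.pi, freqs.size)
    signal = np.fft.irfft(amp * np.exp(1j * phases), n=n)
    part = signal[:n_years]              # keep only part to avoid periodicity
    part = (part - part.min()) / (part.max() - part.min())  # scale to [0, 1]
    return part * rng.normal(0.0, width) # random Gaussian amplitude
```

With `width=1.0` the result would be interpreted in °C for temperature; for precipitation a width of 10 % would be used instead.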

Homogenization algorithms
This section describes the main characteristics of the homogenization methods. This paper only lists features used to homogenize the benchmark; many tools have additional possibilities. Most of the algorithms test for relative homogeneity, which implies that a candidate series is compared to some estimation of the regional climate ("comparison phase"). Comparison may be performed using one composite reference series assumed homogeneous (e.g. SNHT), several ones, not assumed homogeneous (MASH), or via direct pairwise comparison (USHCN, PRODIGE); see [...] may be annual, seasonal or monthly. All four seasonal or twelve monthly time series may be analyzed independently in parallel or serially as one time series. When several comparisons are performed because multiple references are utilized or monthly data are analyzed in parallel, a synthesis phase is necessary, which may be automatic, semi-automatic, or manual.

The comparison series are tested for changes. Detection implies a statistical criterion to assess the significance of changes, which may be based on a statistical test (Student, Fisher, Maximum Likelihood Ratio (MLR) test, etc.) or on criteria derived from information theory (penalized likelihood). Detection requires an optimization scheme to find the most probable positions of the changes among all possibilities. Such a searching scheme may be exhaustive (MASH), based on semi-hierarchic binary splitting (HBS), stepwise, or based on moving windows (AnClim), or may use dynamic programming (DP).
The homogenization corrections (see Table 2) may be estimated directly on the comparison series (SNHT). When several references or pairwise estimates are available, a combination of those estimates is used, e.g. a mean or median. PRODIGE employs a decomposition of the signal into three parts: a common signal for all stations, a station-dependent step function to model the inhomogeneities, and random white noise. In some methods, raw monthly estimates are smoothed according to a seasonal variation.
Once a first correction has been performed, most methods perform a review; see Table 2. If inhomogeneities are still detected, corrections with additional breaks are implemented in the raw series ("examination; raw data"), except in MASH, where the corrected series receive additional corrections until no break is found (called "examination; cumulative" in Table 2). The 25 submitted contributions, their operators and main purposes are listed in Table 3, where contributions denoted by "main" are the ones where the developer of the algorithm deployed it himself with typical settings. Additional details on the contributions can be found in the report Venema et al.

Error metrics
A true benchmark would produce one or two numbers for every contribution for a ranking and this error metric would be fixed in advance. In the case of homogenization this is not possible: different users have different requirements for the homogenized data and the ranking of the contributions depends on the chosen error metric. For this study the focus is on a number of error metrics related to the expectations of the users of homogenized data.
As the main aim of homogenization is not to improve the absolute values, but rather the temporal consistency, the time series are centered by subtracting their mean values before computing the RMSE. The centered root mean square error (centered RMSE, CRMSE) of the time series themselves is thus used as a basic accuracy metric of the data at the highest available resolution (Sect. 6.1.1). This metric is similar to the standard deviation of the time series of the difference between the homogenized data and the truth. It is computed on single station data directly (station CRMSE), as well as on the average climate signal of all stations in one network (network CRMSE). When one or more of the stations is missing for a particular month, the network mean is not computed.
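The centering step can be made explicit with a minimal sketch in plain Python. Representing missing months as `None` and skipping them is an assumption for illustration; the function name `crmse` is hypothetical.

```python
import math

def crmse(homogenized, truth):
    """Centered RMSE: each series is centered on its own mean first."""
    pairs = [
        (h, t) for h, t in zip(homogenized, truth)
        if h is not None and t is not None  # skip missing values (assumption)
    ]
    mean_h = sum(h for h, _ in pairs) / len(pairs)
    mean_t = sum(t for _, t in pairs) / len(pairs)
    squared = [((h - mean_h) - (t - mean_t)) ** 2 for h, t in pairs]
    return math.sqrt(sum(squared) / len(pairs))
```

A constant offset between the homogenized series and the truth therefore yields a CRMSE of zero, whereas an uncorrected step change does not.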
This metric is aggregated over all networks of each benchmark section in three different ways. The most direct way, and the most important one for a user, is the arithmetic mean. However, because not all contributions homogenized all networks and some networks may be easier than others, the arithmetic mean may lead to a distorted judgment for the smaller contributions. Therefore, the mean of the CRMSE anomalies is also computed, where the anomalies are computed by subtracting the mean station or network CRMSE of a number of complete reference contributions (MASH main, PRODIGE monthly, USHCN main, ACMANT and PMTred). This anomaly is the best metric to compare (incomplete) contributions. Furthermore, to show the improvements after homogenization, the ratio of the mean CRMSE over all homogenized data to the mean CRMSE of the inhomogeneous data of the same cases is computed. The same metrics are computed on yearly averages and results are presented in Sect. 6.1.2.
To assess the reproduction of decadal variability after homogenization, the yearly time series are first smoothed, after which the CRMSE is computed (Sect. 6.1.3). These smoothed time series or nonlinear trends are computed by a nonparametric regression method called locally weighted regression (LOESS; Cleveland and Devlin, 1998). For every year, the smoothed value is estimated by fitting a quadratic function using weighted regression on the nearest 25 % of the data points. The standard local weighting function described in Cleveland and Devlin (1998) is utilized. The effective smoothing period is about six years. An advantage of this method is that small-scale variability is strongly reduced. Furthermore, the method is robust to distortions at the edges of the time series. Nevertheless, the first and last five years were excluded from the computation of the CRMSE.
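A minimal sketch of such a local quadratic smoother, assuming NumPy, equally spaced yearly data, and tricube weights as the local weighting function. The function name `loess_quadratic` and the lower bound of four neighbours are assumptions for illustration.

```python
import numpy as np

def loess_quadratic(y, frac=0.25):
    """Local quadratic regression with tricube weights (sketch)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    x = np.arange(n, dtype=float)
    k = max(int(np.ceil(frac * n)), 4)  # neighbours used in each local fit
    smoothed = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        idx = np.argsort(dist, kind="stable")[:k]  # nearest k points
        h = dist[idx].max()
        w = (1.0 - (dist[idx] / h) ** 3) ** 3      # tricube weights
        # Weighted least squares via rows scaled by sqrt(w).
        design = np.vander(x[idx] - x[i], 3)       # columns: dx**2, dx, 1
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(design * sw[:, None], y[idx] * sw, rcond=None)
        smoothed[i] = coef[-1]                     # fitted value at dx = 0
    return smoothed
```

Because each local fit is quadratic, an exactly quadratic input is returned unchanged, including at the edges of the series.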
To study the remaining error in trend estimates after homogenization, the difference in the linear regression coefficient between the original data and the homogenized data is computed (for results see Sect. 6.1.4). The linear trend is estimated on the yearly time series using least squares regression, and the standard RMSE of the trend coefficients over all stations (or networks) is computed as the aggregated trend error metric.
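The trend error metric can be sketched in plain Python as follows, assuming complete yearly series of equal length; the function names `ols_slope` and `trend_rmse` are hypothetical.

```python
import math

def ols_slope(y):
    """Least squares slope of y against the index 0, 1, ..., n - 1."""
    n = len(y)
    x_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    sxy = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(y))
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    return sxy / sxx

def trend_rmse(homogenized_series, true_series):
    """RMSE of the trend differences over all stations (or networks)."""
    diffs = [
        ols_slope(h) - ols_slope(t)
        for h, t in zip(homogenized_series, true_series)
    ]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```

The slope is expressed per time step, so yearly input gives a trend error per year.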
Since some methods do not perform reconstitution of missing data, or do not handle outliers, data corresponding to missing data or outliers are not taken into account in the above computations. Thus, while there is an influence of the outliers on the results of the homogenization algorithm, the outliers do not influence the error metrics themselves.
In Sect. 6.1.5 the accuracy of break detection will be investigated. An algorithm that ranks high on detection, but is less good with respect to CRMSE or trends, may need to work on its correction methods. Thus, even if in many (iterative) algorithms detection and correction cannot be fully separated, such a comparison does give qualitatively important information for the developer.
A comparison of detection scores among the contributions is impaired by the use of different methodologies. Most contributions aim at estimating the exact date a break physically happened, while others (PRODIGE main, C3SNHT) associate the break with the beginning or the ending of a year. Alternatively, all MASH contributions report the breaks in the monthly time series, but do not synthesize these breaks to one date; one true break may thus lead to up to 12 detected breaks. To mitigate this difference the data was analyzed at yearly resolution, i.e. every year containing a break is considered as a break point, in both the tested contribution and the original time series.
Nevertheless, the MASH contributions should be compared to the other contributions with care.
Four cases can be distinguished: true positives (hits, a), false positives (false alarms, b), false negatives (misses, c) and true negatives (no breaks present, nor predicted, d). Periods with missing data or with a local trend are ignored in this computation. Using this notation, the most basic skill scores are the probability of detection, POD, and the probability of false detection, POFD, defined as:
$$\mathrm{POD} = \frac{a}{a+c}, \qquad \mathrm{POFD} = \frac{b}{b+d}. \tag{1}$$
The Peirce Skill Score (or true skill score) is defined as POD minus POFD. In addition, the standard Heidke Skill Score (HSS) can be computed as:
$$\mathrm{HSS} = \frac{a + d - r_\mathrm{std}}{a + b + c + d - r_\mathrm{std}}, \tag{2}$$
where
$$r_\mathrm{std} = \frac{(a+b)(a+c) + (c+d)(b+d)}{a+b+c+d} \tag{3}$$
is the number of correct predictions expected by chance for a given number of predicted breaks. It does not depend on whether this number of predicted breaks is actually realistic, i.e. whether it is comparable to the number of true breaks.
As an alternative, a special Heidke skill score, $\mathrm{HSS}_\mathrm{spc}$, is considered, where the $r_\mathrm{std}$ of Eq. (3) is substituted by $r_\mathrm{spc}$, given by: [...] with $f$, the mean frequency of true breaks, as reference for the proportion of predicted positives and $(1-f)$ as the frequency for the predicted negatives. The special HSS becomes zero if the correct number of breaks is predicted and if this number were randomly inserted. Given that breaks are rarer than negatives, in essence this skill score mainly punishes false alarms more strongly.
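The contingency scores can be computed from the four counts a, b, c and d with a few lines of plain Python. The POD, POFD, Peirce and standard Heidke scores follow their usual textbook definitions. The expression used for the special reference, `r_spc = f * (a + c) + (1 - f) * (b + d)`, is only one reading consistent with the verbal description above and should be treated as an assumption; the function name `skill_scores` is hypothetical.

```python
def skill_scores(a, b, c, d, f=None):
    """Contingency scores from hits a, false alarms b, misses c, true negatives d."""
    n = a + b + c + d
    pod = a / (a + c)    # probability of detection
    pofd = b / (b + d)   # probability of false detection
    # Correct predictions expected by chance for the given number of predictions.
    r_std = ((a + b) * (a + c) + (c + d) * (b + d)) / n
    scores = {
        "POD": pod,
        "POFD": pofd,
        "PSS": pod - pofd,
        "HSS": (a + d - r_std) / (n - r_std),
    }
    if f is not None:
        # Assumed form: predicted proportions replaced by f and (1 - f).
        r_spc = f * (a + c) + (1 - f) * (b + d)
        scores["HSS_spc"] = (a + d - r_spc) / (n - r_spc)
    return scores
```

Under this reading, randomly inserting the correct number of breaks (for instance a=4, b=16, c=16, d=64 with f=0.2) gives a special HSS of zero, in line with the property stated in the text.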

Results
This section starts with an analysis of the quality of the homogenized data for all blind contributions in Sect. 6.1. This analysis is mainly based on the surrogate data because these networks were homogenized most often by the participants and are more realistic than the synthetic data. Furthermore, the focus is more on temperature than on precipitation because more contributions were submitted for this climatic element. The latter may be because homogenization of temperature is less challenging and because there is more interest in the homogeneity of temperature records. Section 6.2 discusses some interesting contributions submitted after the deadline. In Sect. 6.3, the realism of the benchmark dataset is studied by comparing results obtained for surrogate and synthetic data, as well as by comparing the detected inhomogeneities of the artificial dataset with those of the real raw data section of the benchmark. This information is needed for the interpretation of the results in the discussion in Sect. 7.


Results for blind contributions
This section assesses the homogenized data based on a range of different error metrics. The analysis follows the temporal scale of the data: Sect. 6.1.1 discusses errors on monthly scales, Sect. 6.1.2 on yearly scales, Sect. 6.1.3 on decadal scales and Sect. 6.1.4 treats the errors in secular trends after homogenization. Finally, in Sect. 6.1.5, contingency scores are computed to investigate the accuracy of the detection of break inhomogeneities.
Figure 1 shows scatterplots of the centered RMSE before and after homogenization for monthly surrogate temperature data by six comprehensive contributions. Good results can be achieved either by improving the homogeneity on average or by never increasing the inhomogeneity of any station. PRODIGE seems to follow the former route, USHCN the latter, with the others making a compromise. The USHCN contribution is unique in that it has almost no stations with a higher error after homogenization; the contribution also has many values exactly on the bisect (no changes performed) and it made only small changes to the network without any inserted breaks (values on the ordinate). It should be noted that the same plots for yearly mean temperature show many fewer data points above the bisect for all contributions. The exception is absolute homogenization (PMFred abs), which typically decreases the homogeneity of the data for both monthly and yearly mean values.
For a more quantitative analysis of the monthly CRMSE, Fig. 2 shows boxplots for the complete blind contributions and Table 4 lists aggregated error metrics for all blind contributions for both temperature and precipitation. The boxplots show that the best contributions, with respect to the mean CRMSE of the temperature station data, are PRODIGE, ACMANT, MASH main and USHCN 52x; the CRMSE anomalies in the [...] inhomogeneous, i.e. had an improvement quotient over the inhomogeneous data above one.
If all station series in a network are averaged to one network series representing the regional climate, the errors tend to become much smaller and results can be very different; see the last four columns in Table 4. For the network CRMSE the USHCN 52x performs best, followed by the best versions of iCraddock, MASH and PRODIGE. Interestingly, ACMANT, one of the best for the station CRMSE, performs much less well for the network CRMSE. Six contributions made the network average data more inhomogeneous.

Errors on monthly scale
For precipitation many fewer contributions were submitted. The best contribution regarding the monthly CRMSE anomaly of the station data is PRODIGE main, where monthly values are adjusted using a coefficient estimated on annual values. In contrast, PRODIGE monthly made the data more inhomogeneous. The partial contribution MASH Marinova achieved the smallest CRMSE, but the larger mean CRMSE anomaly suggests that relatively easy networks were homogenized and that the contribution is actually second best. Over half of the contributions did not improve the CRMSE of the station data and none of the algorithms improved the network CRMSE meaningfully.

Errors on yearly scale
The errors in the inhomogeneous yearly data are smaller than in the monthly data; see Table 5. The station temperature error of the inhomogeneous monthly data is 0.57 °C, whereas at yearly scale the error is reduced to 0.47 °C. Notably, the reduction in error for the homogenized temperature data is typically much stronger; the average reduction factor over all contributions for monthly data is 77 %, whereas for yearly data it is 53 %. With some exceptions, the contributions with an improvement factor for monthly data of around 1.0 perform similarly for yearly data, whereas the better contributions for monthly data achieve an even better improvement factor for yearly data. For instance, where the best contributions improve the homogeneity of the monthly station data by about a factor of 0.6, the improvement ratio of these contributions for the yearly data is around 0.3. As mentioned above, scatterplots of the CRMSE show that at yearly scales most contributions improve nearly all stations and networks individually.
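The improvement ratios quoted above can be illustrated with a minimal sketch. It assumes that the CRMSE is the root mean square error computed after subtracting each series' own mean, and that the improvement ratio is the CRMSE after homogenization divided by the CRMSE before, so that values below one indicate an improvement. The function names `crmse` and `improvement_ratio` are hypothetical, and the treatment of missing data used in the benchmark analysis is not reproduced.

```python
import math


def crmse(series, truth):
    """Centered RMSE: RMSE after removing each series' own mean."""
    if len(series) != len(truth) or not series:
        raise ValueError("series must be non-empty and of equal length")
    mean_s = sum(series) / len(series)
    mean_t = sum(truth) / len(truth)
    squared = [((s - mean_s) - (t - mean_t)) ** 2 for s, t in zip(series, truth)]
    return math.sqrt(sum(squared) / len(squared))


def improvement_ratio(homogenized, inhomogeneous, truth):
    """CRMSE after homogenization divided by CRMSE before (assumed definition)."""
    return crmse(homogenized, truth) / crmse(inhomogeneous, truth)


# Toy example: a 1.0 degree break in the second half, of which half is removed.
truth = [0.0, 0.0, 0.0, 0.0]
inhomogeneous = [0.0, 0.0, 1.0, 1.0]
homogenized = [0.0, 0.0, 0.5, 0.5]
ratio = improvement_ratio(homogenized, inhomogeneous, truth)
```

Because both series are centered first, a constant offset between a homogenized series and the truth does not contribute to the error; only differences in temporal variability do.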
For precipitation the yearly station-based results are more encouraging than the monthly results: only absolute homogenization increases the yearly CRMSE significantly. For the yearly CRMSE of precipitation MASH main is the most accurate algorithm. Network average precipitation data is not clearly improved by homogenization.

Errors on decadal scale
The errors in the inhomogeneous decadal data are again smaller than in the yearly data; see Table 6. Still, the intercomparison between the contributions is very similar for the CRMSE of yearly and decadal station data. The explained variance of a linear fit of the CRMSE at these two scales is 98 % (97 %) for temperature (precipitation). Therefore, only boxplots for the decadal CRMSE are shown in Fig. 3. Compared to the monthly data, the range of the results is larger because the errors of the best contributions decrease much more than for contributions that did not perform as well.

At this scale ACMANT performs less well than the other contributions that were good with respect to the monthly CRMSE.
For the network mean signal there is a strong difference between yearly and decadal data as shown in Tables 5 and 6. The most evident difference is the typically much smaller error. In contrast to the yearly network CRMSE of precipitation, the decadal CRMSE is improved by homogenization. For the network mean precipitation there is almost no correlation between the yearly and decadal values. While in both cases MASH main is one of the best and absolute homogenization increases the inhomogeneity of the data, the ranking of most other contributions changes considerably. A clear feature of this figure is, furthermore, the u-shape of especially the yearly and decadal data. This is a natural consequence of using the centered time series to compute the errors in case of systematic deviations such as differences in slope.
The period with missing data during WWII seems to be important. This is where the error often starts to grow more rapidly or even jumps higher. Another important period is the first quarter of the century, where many stations do not yet have data.
Therefore, the CRMSE of selected contributions are shown in Table 7 for the first and second quarter, as well as for the last half of the century. The table shows that the error of the homogenized data in the first quarter is always higher than or equal to that in the other two periods. For some contributions the errors in the second quarter are higher than for the last half of the century; this points to problems with the missing data in the middle of the time series after the Second World War. An exceptional contribution is Climatol, which has the lowest monthly temperature errors around 1900, which grow slowly towards 2000; not shown. This is consistent with Climatol starting the correction of the breaks at the beginning of the series.

More accurate trend estimation is a primary motivation to homogenize climate data. Figure 5 shows scatterplots of the station trends before and after homogenization for six selected contributions. Vertical lines start at the trend in the inhomogeneous data and end with a symbol at the trend estimate for the homogenized data. The figure illustrates the improvement of the temperature trend estimates and indicates that trend improvement was smaller for precipitation. Because all stations in one network have the same symbol, the figure also shows that all stations within one network tend to have a bias in the same direction, whereas for the networks overall there is no bias. Climatol is an exception in that it greatly decreases the magnitude of any trend in temperature.

Figure 6 gives an overview of the differences between the trends in the homogenized station data and the original data for all complete contributions; the smaller the spread, the better the contribution. MASH main performs best for precipitation. For this selection PRODIGE monthly performs best for temperature. Table 8 summarizes all contributions and metrics for both station and network trends.
Overall, the incomplete iCraddock and MASH Marinova contributions performed even better for temperature station trends. With respect to station or network precipitation trends MASH Marinova is the most accurate contribution.

Linear trends
The correlation between the scores for the station-based and the network-based trends is again modest. A considerable number of contributions do not decrease the uncertainty of the trends of the network. For network averaged precipitation only three contributions improve the trends: MASH Marinova, C3SNHT and AnClim main. Absolute homogenization (PMFred) increases the uncertainty of the trends in the raw data by about a factor of two for all four metrics in Table 8.
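As an illustration of a trend metric of this kind, the sketch below fits an ordinary least-squares linear trend to yearly values and computes the RMSE of the trend differences between a set of series and their true counterparts. It is a sketch under stated assumptions, not the analysis code of the benchmark: the function names `linear_trend` and `trend_rmse` are hypothetical, the series are assumed to be complete yearly values, and trends are expressed per 100 yr.

```python
import math


def linear_trend(values, per=100.0):
    """Least-squares slope of evenly spaced values, scaled to `per` time steps."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed to fit a trend")
    t_mean = (n - 1) / 2.0
    v_mean = sum(values) / n
    numerator = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    denominator = sum((t - t_mean) ** 2 for t in range(n))
    return per * numerator / denominator


def trend_rmse(series_list, truth_list):
    """RMSE over stations of (trend in series) minus (trend in truth)."""
    diffs = [linear_trend(s) - linear_trend(t) for s, t in zip(series_list, truth_list)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Toy example: one station with a spurious trend of 1.0 per 100 yr, one without.
truth = [0.0] * 100
biased = [0.01 * year for year in range(100)]
example_rmse = trend_rmse([biased, truth], [truth, truth])
```

Applying the same function to the network-average series instead of the individual stations gives the corresponding network trend error.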

Detection scores
A scatterplot with the probability of detection, POD, against the probability of false detection, POFD, for all complete contributions is presented in Fig. 7. As the Peirce Skill Score, PSS, is defined as POD minus POFD, the isolines of PSS can be indicated by slanted lines in Fig. 7. Table 9 shows all contributions and more detection skill scores. Because these skill scores are computed on all networks simultaneously, anomalies could not be computed as before. Therefore comparisons with incomplete contributions have to be made with care.
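The relation between these scores can be made concrete with a short sketch based on a 2x2 contingency table. The counts (hits, false alarms, misses, correct negatives) are assumed to be given; how detected breaks are matched to inserted breaks in the benchmark is not reproduced here. The Heidke score below is the standard textbook form; the Heidke special score used in Table 9 is not defined in this section and is therefore not included. The function name `detection_scores` is hypothetical.

```python
def detection_scores(hits, false_alarms, misses, correct_negatives):
    """Return POD, POFD, Peirce skill score and standard Heidke skill score."""
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    pod = a / (a + c)    # fraction of true breaks that were detected
    pofd = b / (b + d)   # fraction of break-free cases flagged as breaks
    pss = pod - pofd     # Peirce skill score, as defined in the text
    hss = 2.0 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))
    return {"POD": pod, "POFD": pofd, "PSS": pss, "HSS": hss}


# Toy counts: 50 true breaks of which 30 are found, plus 10 false alarms.
scores = detection_scores(hits=30, false_alarms=10, misses=20, correct_negatives=940)
```

In a POD against POFD diagram, all points with the same PSS lie on a straight line with unit slope, which is why the isolines appear as slanted lines.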
The scatterplot shows that MASH is an outlier with respect to both detection scores. Because MASH reports breaks for multiple monthly time series, it naturally has more breaks than the other algorithms, which combine monthly results to one date per break. Because of the noise in the detected date, the larger number of detected monthly breaks for MASH still leads to an artificially larger number of annual breaks and thus false alarms. Thus intercomparisons of MASH with the other contributions remain difficult, especially for the POD and POFD. For both temperature and precipitation MASH main performs best according to the Peirce skill score, while it has the lowest Heidke special score.
Most remarkable is that most other algorithms have a probability of false detection well below the target 5 % level. C3SNHT, PMTred rel and AnClim main are close to this target level. The USHCN contributions have the lowest POFD. With respect to the POD and the Heidke skill scores the incomplete iCraddock contributions stand out and the three USHCN contributions perform very well. ACMANT, PMTred rel, and Climatol perform well, especially in contrast to the previous error metrics; Climatol is even the best precipitation contribution with respect to the Heidke special score. All SNHT and AnClim contributions as well as PMFred abs are characterized by relatively low skill scores, mostly due to low probabilities of detection. The correlations between the various probability of detection and skill scores are modest, even between the two Heidke scores.

Figure 8 shows the temporal behavior of the number of true and predicted breaks (top panel), as well as the POD and the POFD (bottom) averaged over all complete surrogate temperature contributions. In the middle of the period, between about 1925 and 1975, a high correlation between true and predicted data is found in the top panel. However, there is a surplus of predicted breaks of 1 to 2 percentage points in this period.
The POD and POFD are reduced markedly at the edges of the time series, especially in the beginning of the century. The reason for this is a decrease in the total number of predicted breaks. This is presumably due to a combination of a large uncertainty in the means needed to find a break and the smaller number of stations in the beginning. PMFred abs and PMTred rel are designed to compensate for the former problem. PMFred abs shows a reasonably constant POFD around the 2 percent level. On the other hand, PMTred rel shows a strong decline in POFD from 8 % in 1925 to 1 % in 1900, likely in response to the missing data.

Late contributions
This section describes contributions submitted after the deadline at which the truth was revealed to the participants. Some of these contributions aim to mend problems discovered by the results for the blind contributions. While the results found for these late contributions are interesting, their performance should be interpreted with care as these updated contributions are by definition benefiting from knowing the truth.

ACMANT late
ACMANT late has been generated with an improved version of ACMANT (Domonkos, 2011b [...] stations is 0.34 °C and of the network average data is 0.16 °C. The linear trend estimate shows an error of 0.26 °C/100 yr (station) or 0.21 °C/100 yr (network).
Notable is that the CRMSE is almost constant as a function of time. Craddock late is consequently more accurate in the first half of the century, but less accurate than iCraddock Vertacnik or Klanar in the second half. This may be due to four strategies.

Firstly, the most relevant pairs of stations are selected not only based on correlation, but also on climatological similarity, e.g. exposure. Secondly, often only a part of the homogeneous subperiod is used for correction. Thirdly, breaks that are not clearly evident are also corrected. Finally, depending on the strength of the seasonal cycle of the break, the operator selects annual or monthly corrections.

Climatol2.1a
Climatol's blind contribution showed good results for detection, but strongly reduced the trends. After the deadline a new Climatol2.1a contribution was submitted. The important changes are as follows. The main change is in the normalization of the series by the mean. As series are often incomplete, the means of the whole period are unknown, and therefore the normalization must be computed iteratively until stable values are obtained. The new stopping criterion for the iterations is stricter. Furthermore, the test of the squared relative mean difference was replaced by the SNHT test.
The late contribution shows a clear improvement over the blind contribution. With respect to all CRMSE metrics Climatol2.1a is the most accurate SNHT version, except for precipitation on decadal scales for which C3SNHT is more accurate. More importantly, Climatol2.1a no longer shows the reduction in the linear trends and the RMSE of the station temperature trends decreased from 0.

PRODIGE automatic
This late contribution is similar to PRODIGE main, but the synthesis of the change points is performed automatically. It computes a weighted mean number of breaks per year, based on the cross-correlations between the stations. The decision to accept a break depends on thresholds, which were found by training on the first two precipitation networks.
For monthly precipitation, this automatic version is more accurate than PRODIGE main, whereas on larger averaging scales the error is larger. For linear trends in the precipitation, the RMSE of PRODIGE automatic for station (network) data is 9.9 mm (12.52 mm). Because this contribution was trained on a part of the benchmark dataset, these errors may not be representative.

RhTestV3
After the deadline 16 surrogate temperature contributions similar to PMTred rel and PMFred abs were produced, but with the detection and correction functions from the new software package RhTestV3. After the deadline the outliers were known. Consequently in half of these late contributions the outliers could be removed to study their influence. Furthermore, half of the contributions corrected monthly and the other half yearly values; half did so correcting the mean values, half with quantile matching.
Comparing the contributions with and without outliers did not show a clear influence of outliers on the CRMSE at different averaging scales and periods, nor on the RMSE of the linear trends. All contributions corrected using quantile matching or absolute homogenization made the station data more inhomogeneous. All contributions made the network data more inhomogeneous. The results for the comparable late contributions are similar to the blind ones.

To answer the question whether there are differences between the surrogate and the synthetic data, an additional large dataset with 200 networks for each data section was generated. This dataset was homogenized with a newer version of ACMANT; see also Sect. 6.2.1. The analysis of the homogenized data showed that the remaining error after homogenization, in terms of the monthly CRMSE, is 7 % smaller for the synthetic data. The standard deviation of the trend differences is 15 % smaller for the synthetic data compared to the surrogate data. All differences between surrogate and synthetic data are statistically highly significant. Thus synthetic data is easier to homogenize than the more realistic surrogate data.

Artificial inhomogeneities
To investigate how realistic the inserted inhomogeneities are, the detected breaks in the artificial data (surrogate and synthetic) are compared to those of the real data section of the benchmark. Only USHCN, Climatol, ACMANT, and AnClim main have homogenized all real temperature networks. Of the three USHCN contributions, USHCN main was selected to obtain independent data. Climatol was omitted as it showed problems with temperature trends. For precipitation, only AnClim main is available for analysis.
In the comparison below between the real and artificial networks of the properties of the detected breaks, the power of detection should also be taken into account and is analyzed first. The length of the record of the artificial data is set at 100 yr, whereas the real temperature (precipitation) data has a lower average record length of 87 yr (95 yr). The real temperature data has more missing data (on average about 20 yr) and it is more interspersed than in the artificial data, which on average has only 10 yr of missing data. The precipitation in all data sections contains about 90 yr of data. The [...] data (94 %) than for the artificial data (90 %). For precipitation these cross-correlations are 86, 81, 72 percent for real, surrogate and synthetic data, respectively.

The average annual break size in all data sections is not statistically different from zero. The magnitude of the artificial temperature breaks is larger: the average standard deviation of the annual detected break size distribution is 0.94 °C in the artificial data, whereas in the real data it is only 0.72 °C. For comparison, the average magnitude of all inserted breaks was 0.8 °C. The artificial annual precipitation break sizes are larger than the real ones: the standard deviation of the detected real breaks is 9.5 mm (10 %), but of the artificial breaks 15 mm (19 %). For comparison: the size of the inserted breaks is 15 %. Partially the smaller mean break size may be due to the stronger spatial correlations in the real precipitation dataset, which allows for the detection of smaller breaks.

The frequency of the artificial temperature breaks is lower: average frequency of detected breaks is 4.0 % and 4.7 % in the artificial and real data, respectively. More breaks are detected in the artificial precipitation data: 2.3 %, against 1.0 % in the real data.
Taken together the statistical properties of the networks and the nature of the breaks discovered do not differ greatly among the three data sections. Thus the differences discussed below are probably due to real differences in the statistical properties of the inhomogeneities and not due to differences in the accuracy of homogenization.

If the perturbations applied at a break were independent, the perturbation time series would be a random walk. In the benchmark the perturbations are modeled as random noise, as a deviation from a baseline signal, which means that after a large break up (down) the probability of a break down (up) is increased. Defining a platform as a pair of breaks with opposite sign, this means that modeling the breaks as a random noise produces more than 50 % platform pairs. The percentage of platforms in the real temperature data section is 59 (n = 742), in the surrogate data 64 (n = 1360), and in the synthetic data 62 (n = 1267). The artificial temperature data thus contains more platforms; the real data is more like a random walk. This percentage of platforms and the difference between real and artificial data become larger if only pairs of breaks with a minimum magnitude are considered. In the precipitation data, the percentage of platforms is also above 50 %, but the values for the real and artificial data are similar. The perturbations in precipitation may thus be modeled as random noise, but more data and algorithms would be needed for firm conclusions.
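The difference between the two perturbation models can be checked with a small simulation. The sketch below is an idealization rather than the benchmark generator: it assumes independent, identically distributed level deviations for the noise model and independent jumps for the random walk, and it counts consecutive break pairs with opposite sign as platforms. Under these assumptions the expected platform fraction is 1/2 for the random walk and 2/3 for the noise model, because two consecutive breaks have opposite sign exactly when the middle level is the largest or smallest of three independent levels. The names `platform_fraction` and `simulate_breaks` are hypothetical.

```python
import random


def platform_fraction(break_sizes):
    """Fraction of consecutive break pairs that have opposite sign."""
    pairs = list(zip(break_sizes, break_sizes[1:]))
    opposite = sum(1 for first, second in pairs if first * second < 0)
    return opposite / len(pairs)


def simulate_breaks(n_breaks, model, rng):
    """Simulate break sizes under the 'random_walk' or 'noise' model."""
    draws = [rng.gauss(0.0, 1.0) for _ in range(n_breaks + 1)]
    if model == "random_walk":
        # Each break is an independent jump relative to the previous level.
        return draws[:n_breaks]
    if model == "noise":
        # Each homogeneous segment deviates independently from a baseline;
        # a break is the difference between two consecutive segment levels.
        return [later - earlier for earlier, later in zip(draws, draws[1:])]
    raise ValueError("model must be 'random_walk' or 'noise'")


rng = random.Random(42)
walk_fraction = platform_fraction(simulate_breaks(200_000, "random_walk", rng))
noise_fraction = platform_fraction(simulate_breaks(200_000, "noise", rng))
```

The percentages reported above for detected breaks lie between these two idealized limits, with the artificial data closer to the noise value.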

Another important parameter is the seasonal cycle of the inhomogeneities. First the monthly anomalies are computed by subtracting the yearly means. Subsequently, the homogenization perturbations are computed from these anomalies. The size of the seasonal cycle of a break is operationalized as the change in the standard deviation of these perturbations before and after a break. The distribution of the break sizes of this seasonal cycle has a standard deviation of 0.19 (0.23) °C in the real (artificial) data. The seasonal cycle of the breaks in the artificial data is thus larger than in the real data and the homogenization algorithms underestimate the size of the seasonal cycle of the breaks (the seasonal cycle of the breaks inserted into the benchmark is 0.4 °C). USHCN does not introduce a seasonal cycle and was omitted. ACMANT found stronger seasonal cycles in the breaks than AnClim main, but the difference between real and artificial data is about the same. In the precipitation data, the seasonal cycle of the breaks is 12 % in the real data and 19 % in the artificial data.
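The operationalization described above can be sketched as follows. This is one possible reading under stated assumptions: the series are complete monthly series starting in January, the perturbation is the inhomogeneous minus the homogenized anomaly, the population standard deviation is used, and the change is taken as the value after the break minus the value before. The names `yearly_anomalies` and `seasonal_break_size` are hypothetical.

```python
import statistics


def yearly_anomalies(monthly):
    """Subtract each calendar year's mean from its twelve monthly values."""
    if len(monthly) % 12 != 0:
        raise ValueError("expected complete years of monthly data")
    anomalies = []
    for start in range(0, len(monthly), 12):
        year = monthly[start:start + 12]
        year_mean = sum(year) / 12.0
        anomalies.extend(value - year_mean for value in year)
    return anomalies


def seasonal_break_size(inhomogeneous, homogenized, break_index):
    """Change in the standard deviation of the perturbations across a break."""
    perturbation = [
        raw - hom
        for raw, hom in zip(yearly_anomalies(inhomogeneous), yearly_anomalies(homogenized))
    ]
    before = statistics.pstdev(perturbation[:break_index])
    after = statistics.pstdev(perturbation[break_index:])
    return after - before


# Toy example: the break adds +1 in six months and -1 in the other six.
homogenized = [0.0] * 24
seasonal = [0.0] * 12 + [1.0] * 6 + [-1.0] * 6
mean_shift = [0.0] * 12 + [1.0] * 12
size_seasonal = seasonal_break_size(seasonal, homogenized, break_index=12)
size_shift = seasonal_break_size(mean_shift, homogenized, break_index=12)
```

A break that only shifts the annual mean yields a size of zero, since the yearly means are removed before the perturbations are computed.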

Global biases and inhomogeneities
If inhomogeneities have a tendency to be in one direction during a certain period, they may have an influence on the network average signal, even for large networks. This could happen in case new technologies or measurement procedures are introduced. This effect can be studied in the cross-correlations between stations of the homogenization adjustments implemented and can be best seen in smoothed data.
Therefore, the perturbations were computed by comparing the inhomogeneous with the homogenized data and smoothing these perturbations in the same way as for the computation for the decadal CRMSE (Sect. 5). Subsequently, the average cross-correlation between all pairs of stations in a network was computed, after which this correlation was averaged over all networks in one of the three data sections of the benchmark. The same contributions were analyzed as in Sect. 6.3.2. For the real, surrogate and synthetic data the cross-correlations are 9.1, −4.3 and 3.5 percent, respectively. Surprisingly, the cross-correlation for the surrogate data is negative. For the real and surrogate data these correlations are significant and they are also significantly different from each other. The values depend strongly on the homogenization method. Therefore only complete contributions have been used. However, when additionally including incomplete contributions the above inferences stay the same.
For precipitation, only AnClim main is available for analysis. The same inferences as for temperature may be made, but the difference between real and surrogate data is only significant at the p = 7 % level.
The raw datasets studied here are relatively recent. Records from the early instrumental period typically show artificial trends in all stations, because all stations made similar measurement errors. The bias effect studied here may thus be stronger in older data.
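The network statistic used in this section, the average cross-correlation between the perturbations of all station pairs, can be sketched as below. The sketch assumes that the perturbation series (inhomogeneous minus homogenized data) have already been smoothed and are complete and of equal length; the smoothing of Sect. 5 is not reproduced. The names `pearson` and `mean_pairwise_correlation` are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def mean_pairwise_correlation(perturbations):
    """Average correlation over all pairs of station perturbation series."""
    pairs = list(combinations(perturbations, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Toy network: two stations perturbed in the same direction, one in the opposite.
network = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
network_correlation = mean_pairwise_correlation(network)
```

A positive average indicates perturbations that tend to act in the same direction across stations and therefore do not cancel in the network mean.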

Discussion
The discussion is divided into two parts. The lessons learned about homogenization of climate records will be discussed in Sect. 7.1, while Sect. 7.2 will deal with the benchmarking itself.

Before discussing the performance of the algorithms it should be stated that the results for individual contributions should not be compared in too much detail for three reasons. First of all, the errors are non-Gaussian and dependent within one network. Especially in case of networks with multiple breaks that happen in multiple stations simultaneously, basically neutral changes in the algorithms can make the difference between solving a combinatorial problem or not. Therefore, the number of networks is still quite limited and especially results for partial contributions should be interpreted with care. Secondly, there are uncertainties due to the limited realism of the benchmark data. While Sect. 6.3 showed that the average properties of the breaks in the temperature stations are reasonable in general, some deviations were found. The annual cycle of the breaks is somewhat exaggerated, which unfairly benefits the detection of breaks by ACMANT. Moreover the perturbations due to inhomogeneities in the stations are more strongly cross-correlated in real data, which leads to larger perturbations in the network mean signal. As a consequence, the errors in the network mean signals of the benchmark are small and harder to improve than in reality. See Sect. 7.2 for more details. Thirdly, results depend on the error metric analyzed, not only between the CRMSE of the time series, the RMSE of the linear trends and the detection scores, but also for the different averaging scale at which the CRMSE is computed and the period under consideration. Moreover, different treatments of the data, particularly with respect to the missing data and the annual cycle, which are all reasonable, lead to differences in the errors found. In this context it should be noted that while many contributed to the analysis, the final pre-processing and analysis was performed by authors who did not submit homogenized data to avoid unfair biases.
The overall best blind contributions are homogenized by Craddock, MASH, and PRODIGE. The blind ACMANT contribution had some problems with the network mean signal and trends, but the updated ACMANT late contribution suggests that ACMANT is currently the most accurate method available. USHCN, while less proficient than the four best ones, is nonetheless the best for the monthly network mean CRMSE and achieves its performance with a very low false alarm rate and without correcting the seasonal cycle.

All of these best methods have been designed to work with an inhomogeneous reference, either by using pairs or testing multiple reference time series for their suitability. Algorithms that circumvent the inhomogeneous-reference problem by first detecting the largest breaks are clearly less accurate. In practice, the choice of a homogenization algorithm will also depend on the degree of automation desired or needed, which is related to the size of the network, and the access to expertise. Expertise and training are important; contributions using good algorithms by first-time users often produced sub-optimal results.

Some contributions result in data that is more inhomogeneous. In case of relative homogenization of temperature data, these cases could mostly be traced back to operating or programming errors. The latter are often related to the way iterations are performed. Algorithms using iterations have to be validated with extra care. Implicitly, this is connected to the advice "to always start homogenization from the beginning, assuming all series contain potential breaks and ignoring any previous homogeneity work undertaken for any of the series" (Auer et al., 2005). Unfortunately only one contribution utilized absolute homogenization. This contribution produced much more inhomogeneous data, both for temperature as well as for precipitation. Absolute homogenization should thus be used with care and always accompanied by metadata.
A more detailed study using multiple absolute homogenization methods (Reeves et al., 2007) would be worthwhile. The performance of absolute homogenization may have been reduced by the sometimes strong nonlinear global trends added to the data; see Sect. 3.
Precipitation data is expected to be more difficult to homogenize due to lower cross-correlations. The lower correlations should, however, only lead to less improvement of the data. The increases in inhomogeneity, found especially for the network average signals, are worrisome and warrant more research into the homogenization of precipitation. Given that the break detection scores were positive, the problem probably lies in the noisy correction of precipitation data, especially for monthly correction. This is also suggested by the considerable difference for precipitation between PRODIGE monthly, which experimentally performed monthly corrections, and PRODIGE main, which applied more stable yearly corrections and was more accurate. Annual corrections are thus currently recommended for homogenization of typical precipitation networks. The improvements achieved in CRMSE were much larger for yearly and decadal data than for monthly data. This is mainly related to the much smaller signal to noise ratio in the ratio time series due to the high spatial variability of precipitation, but may also be related to the fact that previous validation studies were limited to annual data. Monthly correction methods warrant more study.

The correlations between the error metrics based on the time series themselves and break detection scores are modest (Sherwood et al., 2009), as well as for the detection scores amongst each other. The use of detection scores as sole performance criterion should thus be discouraged.
Most, but not all, contributions showed much larger errors in the beginning quarter or half of the century. This may point to possibilities for developers of homogenization algorithms to improve the handling of missing data and of networks with few stations.
Some contributions applied algorithms that did not remove outliers themselves. The late surrogate temperature contributions applying the tests PMTred and PMFred did not show an influence of outliers on the results. Probably the results for the other temperature contributions without outlier removal are thus representative.

The contribution PRODIGE trendy that corrected local trends did not perform better than the versions that only corrected breaks, but trends were also only implemented in ten series. It should be studied whether improvements are more evident in those stations where local trends were present.

The synthetic data is apparently easier to homogenize than surrogate data. Especially the about 15 % smaller error in the linear trend estimation is climatologically relevant when interpreting results based on homogenized data. As many validation studies did take into account the lag-one auto-correlation, it would be interesting to study in more detail whether this aspect of the surrogate data made it harder to homogenize. Alter- [...]

In software engineering it has been observed that a benchmark can help a field of science to mature, both due to social as well as technical factors (Sim et al., 2003). Also in the COST Action, the definition of the properties of the benchmark and the joint work on the same dataset helped to bring scientists closer together. The benchmarking also led to technical improvements, ranging from finding bugs, to improved understanding, and to an upcoming open-source state-of-the-art homogenization package.

Sim et al. (2003) state that benchmarking is more than providing a problem, but that it should also be announced in advance how the solutions will be judged. In this respect, the homogenization effort did not constitute a true benchmark. In case of homogenization, it is difficult, and may even be impossible, to boil down the results to one or two accuracy metrics. The contributions have been judged with respect to how well they reconstruct the temporal climatic variability, which is the most common reason to homogenize data. The data could also have been judged on how well the cross-correlations are reproduced or even the absolute values of the measured elements. With such an aim, another benchmark should have been produced, one in which observations performed at different locations are not merged to one long record. With the current experience, it is possible to communicate how the contributions will be judged in more detail for a future benchmarking exercise.

CPD
It is planned to redo the exercise every few years to monitor improvements in homogenization. As is typical for a benchmarking project, this benchmark will likely evolve as well. Updates will be implemented to avoid tuning and based on lessons from this study, see Sect. 5.3. Correlations in the perturbation applied to stations are important to increase the perturbations in network average data to realistic values. The best contributions and especially ACMANT late perform very well. A future benchmark dataset should thus be more challenging, for instance by reducing the density of the networks.

The participants were requested to focus on homogenizing the surrogate data section. In retrospect more emphasis on the importance of the real data section should have been given and the real and surrogate data should be based on similar datasets for better comparison. While the surrogate data provides an estimate of the accuracy of the homogenization algorithms, the comparison of the results for the surrogate and the real data is needed to interpret the differences between the contributions. Furthermore, this comparison is important for the development of more realistic future benchmarks.

General conclusions and recommendations
The main research impetus for the last two decades has been the development of homogenization algorithms that also function with an inhomogeneous reference time series. This effort has paid off. There is a clear split in performance on the benchmark data between these direct algorithms and those that evade the inhomogeneous-reference problem using older concepts such as stepwise or semi-hierarchical splitting, as well as detection on moving windows. With mathematical argumentation, climatological reasoning and the benchmark metrics all pointing in the same direction, we thus strongly recommend the use of direct homogenization algorithms. Such participating algorithms are: ACMANT, Craddock, MASH, PRODIGE and USHCN. ACMANT, MASH and PRODIGE also tackle the multiple break-point problem directly, which is also important for their performance.

Almost all relative homogenization algorithms improved the homogeneity of the temperature data. The exceptions could mostly be explained by inexperienced users or be traced back to algorithms (or parts thereof) newly written for this exercise. The results illustrate that statistical absolute homogenization has the potential to make the data even more inhomogeneous. Some contributions created with the best algorithms were much less accurate than the contributions by the developers. This indicates that training of the operator is very important and that developers should invest more effort in making their software easy to use and in issuing relevant warnings.
We feel that this blind test of homogenization algorithms has benefited the homogenization community, see Sect. 7.2, and advocate repeating the exercise in the future.
One follow-up is the surface temperature initiative, which is working on a global homogenized surface temperature dataset and has started a benchmarking initiative (Thorne et al., 2011). Due to its sheer size, such a benchmark would only be of interest to automatic homogenization algorithms. There may thus be room for additional initiatives studying other climate variables and utilizing smaller networks for comparison with manual methods.
Benchmarking is not only useful to study the performance of the homogenization algorithms. The definition of the properties of the benchmark, the work on the same dataset and the joint analysis of the results have strengthened the community. The benchmarking has also led to technical improvements, ranging from finding bugs, to improved understanding, and to the recommendations for an upcoming open-source state-of-the-art homogenization package.
Benchmarking formally requires agreeing on the error metrics in advance. For homogenization, however, there is not one clearly preferred metric. With the current experience, it should nevertheless be possible to define the initial analysis in more detail for a future benchmark. The results showed only modest correlations between the break detection scores, which developers of homogenization methods tend to focus on, and the other error metrics, which are close to the needs of climatologists. It is thus recommended to use both types of error metrics in future validation studies.
In retrospect, too little emphasis was given to the homogenization of the real data section, which provides a validation of the statistical properties of the inserted inhomogeneities. For future benchmarking exercises, more studies on the statistical characteristics of inhomogeneities for various climate elements would be important. The size distribution of temperature inhomogeneities in Western countries has been studied reasonably well, but for other regions and climatic variables more information would be valuable. Cross-correlations of the breaks between stations have been studied and quantified too little, see Sect. 6.3.2. In particular, periods in which breaks are biased in one direction lead to a much stronger perturbation of the regional climate signal (the average over multiple stations) than the random breaks used in this study, and should be included in any future benchmark dataset. Furthermore, the breaks in the benchmark are modeled as deviations from the baseline values, i.e. as random noise. An alternative way to model breaks would be relative to the previous values, i.e. as a random walk. The random noise model was found to be reasonable, but for the temperature records a mixed model with a small random-walk component may be even more realistic.
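The difference between the two break models can be made concrete with a small simulation. This is only a sketch under stated assumptions: the function names, the unit break standard deviation, the number of segments and the number of simulated series are all hypothetical and are not the settings used to generate the benchmark.

```python
import random
import statistics


def segment_levels(n_segments, sigma, model, rng):
    """Return the offset of each homogeneous segment from the true signal.

    model="noise": each level is an independent deviation from the baseline.
    model="walk":  each level is a deviation from the previous level.
    """
    levels = []
    previous = 0.0
    for _ in range(n_segments):
        draw = rng.gauss(0.0, sigma)
        if model == "noise":
            level = draw
        elif model == "walk":
            level = previous + draw
        else:
            raise ValueError(f"unknown model: {model}")
        levels.append(level)
        previous = level
    return levels


def last_level_spread(model, n_segments=10, sigma=1.0, n_series=2000, seed=1):
    """Standard deviation, over many series, of the level of the last segment."""
    rng = random.Random(seed)
    last = [segment_levels(n_segments, sigma, model, rng)[-1]
            for _ in range(n_series)]
    return statistics.pstdev(last)


noise_spread = last_level_spread("noise")
walk_spread = last_level_spread("walk")
print(f"noise model: {noise_spread:.2f}, random walk: {walk_spread:.2f}")
```

In the random noise model the spread of the segment offsets stays near the break standard deviation regardless of how many breaks precede a segment, whereas in the random-walk model it grows roughly with the square root of the number of breaks. A mixed model of the kind suggested in the text would sit between these two cases.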
Irrespective of the above-mentioned advantages of benchmarking and the reliability of the blind results, benchmarking also has disadvantages, and alternative validation methodologies should be used as well. An important disadvantage is that a blind test does not allow for the correction of problems discovered during the analysis. Consequently, not all methods could deliver their optimal performance. The interpretation is furthermore hampered by differences in the experience and effort of the participants. Finally, because of the competitive character of a benchmark, it is paramount that the statistical properties of the data and the inhomogeneities are realistic. Otherwise it would be possible to tailor the algorithms to the benchmark and perform better on the benchmark than on real data. Therefore, benchmarking does not allow for systematic studies aimed at understanding the algorithms, for instance by systematically testing variants of an algorithm, nor for testing the limits of the methods with unrealistically easy or difficult data. The latter is the strength of standard intercomparison studies and mathematical analysis. Another valuable validation strategy is testing the methods on real data with good metadata, as such data are the most realistic.
The use of metadata and reconstructions of past observation methodologies is preferred over statistical homogenization, especially if sufficiently long parallel series are available, and to determine the dates of the breaks precisely. To find additional undocumented breaks, statistical homogenization should always be used as well. In the future, more homogenization algorithms should implement the automatic use of metadata, so that a future benchmark can also include simulated metadata. National Meteorological Services should intensify their work on the digitization of metadata (Brunet and Jones, 2011) and on the formulation of a standard machine-readable format for metadata.
The intelligent use of metadata is an advantage of manual methods over automatic ones, and automatic methods may tempt people to rely less on metadata. A further advantage of manual methods is the climatological knowledge of the operator on how much variability is allowed in the difference time series, which allows for an intelligent selection of similar stations. Furthermore, humans are good at solving combinatorial problems, which explains the quality of the Craddock and PRODIGE contributions. Strengths of automatic methods are their objectivity and reproducibility. Furthermore, automatic methods can be easily applied to large datasets and thus also lend themselves better to validation and benchmarking, which aids their refinement. This study showed that automatic algorithms can currently perform as well as manual ones.
A considerable difference in the improvement of the data by homogenization was found between annual and monthly data. Furthermore, the break detection scores are only modestly related to the remaining centered root mean square error. Both findings suggest that more work on the correction algorithms could be fruitful. The benchmark dataset could be used to study the performance of various correction methods.
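For readers who wish to evaluate correction methods on the benchmark, a centered root mean square error can be computed as sketched below. This uses a common definition (both series are reduced to anomalies about their own means before the error is taken); the function name and the toy series are hypothetical, and details of the study's own implementation, such as the treatment of missing data and averaging scales, are not reproduced here.

```python
import math


def centered_rmse(homogenized, truth):
    """RMSE between two series after removing each series' own mean.

    A constant offset between the series therefore does not count as an error.
    """
    if len(homogenized) != len(truth) or not truth:
        raise ValueError("series must be non-empty and of equal length")
    mean_h = sum(homogenized) / len(homogenized)
    mean_t = sum(truth) / len(truth)
    squared = [((h - mean_h) - (t - mean_t)) ** 2
               for h, t in zip(homogenized, truth)]
    return math.sqrt(sum(squared) / len(squared))


truth = [0.0, 1.0, 2.0, 3.0]
shifted = [v + 5.0 for v in truth]   # constant offset only
stepped = [0.0, 1.0, 3.0, 4.0]       # residual break of size 1 halfway

print(centered_rmse(shifted, truth))  # 0.0
print(centered_rmse(stepped, truth))  # 0.5
```

The first case shows why the metric is centered: a homogenized series that differs from the truth only by a constant has no remaining inhomogeneity. The second case shows that an uncorrected break leaves a nonzero error even if its position were detected perfectly, which is consistent with detection scores and the remaining error being only modestly related.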
The results for precipitation were not as good as for temperature. This may well be due to the more difficult estimation of the correction factors, as suggested by the good performance for detection and by the higher accuracy of the PRODIGE contribution with annual factors compared with the version with monthly factors. The operators also have more experience with temperature and the algorithms are better validated for temperature. It should also be noted that the properties of the benchmark data may have been less realistic for precipitation, as less is known about the statistical properties of breaks in precipitation and too little homogenized real data was available for a stringent validation of the benchmark. Given these results and the importance of precipitation for climate impact research, the homogenization of precipitation should be given priority. It may be worthwhile to generate a dedicated benchmark for precipitation.
Many evidently interesting questions are not yet answered and will hopefully be studied in subsequent articles. For instance, the network without inserted inhomogeneities should be studied separately. This analysis was mainly based on statistical metrics of interest to many users of the homogenized data. With the benchmark dataset being available, any climatologist can now study the influence of remaining inhomogeneities on a specific analysis. Users may for instance be interested in the annual cycle, the cross-correlations between stations, as well as secular trends for individual months and long-range dependence (Rust et al., 2008). Based upon the results on the benchmark and theoretical considerations, the Action is currently working on providing a free software package with recommended homogenization tools, which will be published on the HOME homepage (HOME, 2011).


[Table header: Temperature; columns Monthly, Yearly and Decadal, each for the periods 1900-1925, 1925-1950 and 1950-2000; first row: Inhomogeneous. Values not recovered.]

Fig. 1. Scatterplot of the centered RMSE before and after homogenization for selected contributions. The squares display the errors of the stations; the dots show the errors of the network mean (regional climate) time series. Points on the bisect indicate no change, above the bisect the data is made more inhomogeneous, while below the bisect homogenization improved the homogeneity of the data.

[Figure caption, beginning not recovered.] The top row shows trends for selected temperature contributions, the bottom row for precipitation. The open symbols denote the trends of homogenized stations, the closed black symbols the trend of the homogenized regional network averaged trend; every network has its own symbol, which shows that station trend errors are correlated. The vertical grey lines run from the trend in the inhomogeneous data to the trend in the homogenized data.