Inferences on weather extremes and weather-related disasters : a review of statistical methods



Introduction
The study of weather extremes, and impacts thereof, plays an important role in climate-change research. Due to the great societal consequences of extremes (historically, now and in the future), the peer-reviewed literature on this theme has been growing enormously since the important findings of Mearns et al. (1984) and Wigley (1985). These authors showed that small shifts in the mean and variance of a weather or climate variable might lead to a strong nonlinear shift in the frequency of extreme values of that variable. Examples of recent publications on extremes are Trenberth and Jones (2007, Sect. 3.8), Gamble et al. (2008), Karl et al. (2008) and IPCC-SREX (2011). Furthermore, the literature shows that inferences on extremes can be based on all types of meteorological/climatological information: documentary evidence and paleo-climatological proxies (Battipaglia et al., 2010; Stoffel et al., 2010; Büntgen et al., 2011), instrumental data (Alexander et al., 2006), disaster statistics (Pielke, 2006; Bouwer, 2011; Guha-Sapir et al., 2011) and model-generated climate data.
Published by Copernicus Publications on behalf of the European Geosciences Union.
In scanning the peer-reviewed literature on weather extremes and impacts, we noticed that many different methods are used to make inferences on extremes. However, discussions on methods are rare. We name Katz et al. (2002) in the field of hydrology and Katz (2010) in the field of climate change research. Zhang et al. (2004) study the detection of three types of trends in extreme values, based on Monte Carlo simulations. A third example is that of Wigley (2009) and Cooley (2009), where the use of linear trends and normal distributions (Wigley) is opposed to the use of extreme value theory with time-varying parameters (Cooley). Clearly, the calculation of a return period of, say, once in 500 yr, based on a normal distribution will deviate from that based on a generalized extreme value (GEV) distribution. In other words, the specific choice of methods (here the shape of probability density functions, or PDFs for short) might influence the inferences made on these extremes. Another example is the particular choice of a trend model to highlight temporal patterns in extreme-weather indicators. Conclusions based on an OLS straight line might differ from those based on more flexible trends. The inclusion or exclusion of uncertainty information may also influence inferences: a rising trend or a change in return periods is not necessarily statistically significant.
In this article, we will review the statistical methods used in the peer-reviewed literature. First, we will give a concise overview of methods applied. These methods deal with the computation of return periods of extremes, chances of crossing a pre-defined high (or low) threshold, the estimation of a trend in weather indicators (number of warm and cold days, annual maximum of 1-day/5-day precipitation, global number of floods, etc.) or the comparison of PDFs over different periods in time.
Next to this overview we will discuss a number of methodological aspects. We will discuss (i) the assumption of a stationary climate when making inferences on extremes, (ii) the choice of (extreme value) probability distributions for the data at hand, (iii) the availability of uncertainty information and (iv) the coupling of weather or disaster statistics to climate change. As for point (iv) we will pay attention to methods in the peer-reviewed literature and to the way these results are assessed by the Intergovernmental Panel on Climate Change (IPCC).
There are two aspects of weather extremes and their impacts (disasters) which will not be dealt with in this methods review. The first aspect concerns the quality of the data, and more specifically, methods for testing the quality of data and correcting them, if necessary. For homogeneity issues the reader is referred to Aguilar et al. (2003) and Klein Tank et al. (2009). For the reliability of disaster statistics please refer to Gall et al. (2009).
The second methodological aspect not dealt with is that of methods for detecting anthropogenic influences in climate or disaster data. For detection studies in relation to extremes please refer to Zwiers et al. (2011) and Min et al. (2011). For a review on detecting climate change influences in disaster trends, the reader is referred to Höppe and Pielke (2006) and Bouwer (2011). We further note that we will use the term "climate change" in the general sense, that is, climate change due to both natural and anthropogenic influences (unless denoted otherwise).
The contents of this article are as follows. In Sect. 2, we will give a concise description of how inferences on extremes are made in the peer-reviewed literature. Then, we will discuss these methods in Sects. 3 through 6 with respect to four aspects: the assumption of a stationary climate (Sect. 3), assumptions on probability distributions (Sect. 4), the use of uncertainty information (Sect. 5) and the coupling of extremes to climate change (Sect. 6). Conclusions are given in Sect. 7. A number of statements throughout this article will be illustrated by an analysis of annual maxima of daily maximum temperatures for station De Bilt in the Netherlands (TXX_t; Figs. 1, 4 and 6).

Preliminaries
There is a diverse use of terminology in the fields of climate change research and disaster risk management. Terms used in the literature comprise weather or climate extremes, weather or climate extreme events, weather or climate indicators, weather or climate extreme indicators and indices of extremes. As for disasters, any type of weather-related disaster can be analysed (floods, droughts, heat waves, hurricanes, etc.). Mostly, three types of disaster burden are presented in the literature: economic losses, the number of people killed and the number of people affected. For details see Guha-Sapir et al. (2011). The general term used throughout this article is "extreme indicator". Extreme indicators can be constructed from underlying data (mostly daily data) by computing block extremes or threshold extremes. Block extremes are obtained by taking the highest (or lowest) values in a block of observations. In most cases seasonal or annual blocks are taken. Examples are the annual maximum value of daily maximum temperatures (TXX_t), the annual maximum value of one- or five-day precipitation totals (RX1D_t, RX5D_t) or the annual maximum of river discharges. Another block value is obtained by taking the r-th largest value. For example, one can choose the 7th largest value from annual daily data (or, in other words, take the 98th percentile). As for disaster burden indicators, block indicators are generally chosen to be block sums (e.g. the annual number of flood disasters or annual global economic losses due to weather-related disasters). An overview of extreme indicators, as well as their definition and notation, can be found in Alexander et al. (2006) or the ECA website http://eca.knmi.nl/indicesextremes/indicesdictionary.php.
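The construction of block indicators can be sketched in a few lines of Python. This is only an illustration on made-up numbers: the function name `block_indicators`, the toy data and the choice r = 3 are assumptions for the example and do not come from any of the cited studies.

```python
from collections import defaultdict

def block_indicators(records, r=7):
    """Group (year, value) pairs into annual blocks and return, per year,
    the block maximum, the r-th largest value and the block sum."""
    by_year = defaultdict(list)
    for year, value in records:
        by_year[year].append(value)
    out = {}
    for year, values in sorted(by_year.items()):
        ranked = sorted(values, reverse=True)
        # r-th largest value; None if the block holds fewer than r values
        r_th = ranked[r - 1] if len(ranked) >= r else None
        out[year] = (ranked[0], r_th, sum(values))
    return out

# Hypothetical toy data: two "years" of ten daily values each.
daily = [(2001, v) for v in [21, 25, 30, 28, 19, 33, 27, 24, 31, 22]] + \
        [(2002, v) for v in [20, 26, 29, 35, 18, 32, 23, 34, 30, 21]]

result = block_indicators(daily, r=3)
for year, (block_max, r_largest, block_sum) in result.items():
    print(year, block_max, r_largest, block_sum)
```

The block maximum corresponds to indicators such as TXX_t, while the block sum is the natural choice for count or loss data such as the annual number of disasters.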
Block extremes are often modelled by applying the generalized extreme value (GEV) distribution. Normal or log-normal distributions can also be chosen. For a description of these methods please see Coles (2001, Chapters 3 and 6) and Katz et al. (2002), and for a description in the context of Bayesian statistics see Renard et al. (2006).
Threshold extremes, also denoted as peaks over threshold (POT), are obtained by taking exceedances of a predefined threshold. Here, one can be interested in the number of exceedances of that threshold, as in the number of summer days or tropical nights, or in the positive differences between data within a block and the threshold chosen (the excesses). Generally, excess variables are modelled by applying the generalized Pareto distribution (GPD). For a description of these methods see Coles (2001, Chapters 4 and 5), Katz et al. (2002) and Coelho et al. (2008). For a description in the context of Bayesian statistics please see Renard et al. (2006). The frequency of exceedances may be analysed by a nonhomogeneous Poisson process (Caires et al., 2006) or by a Poisson regression model (Villarini et al., 2011).
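The two quantities distinguished here, the number of exceedances and the excesses, can be illustrated as follows. The sketch uses hypothetical temperature values and a 25 °C threshold chosen for the example; it does no declustering and fits no GPD, both of which a real POT analysis would normally require.

```python
def peaks_over_threshold(values, threshold):
    """Return the number of exceedances and the list of excesses
    (value minus threshold) for one block of observations."""
    excesses = [v - threshold for v in values if v > threshold]
    return len(excesses), excesses

# Hypothetical daily maximum temperatures (deg C) within one block.
tmax = [24.1, 26.3, 25.0, 27.8, 30.2, 22.5, 25.1, 28.4]

n_exceed, excess = peaks_over_threshold(tmax, threshold=25.0)
mean_excess = sum(excess) / n_exceed

print("number of exceedances:", n_exceed)
print("mean excess (deg C):", round(mean_excess, 2))
```

The count would feed a Poisson-type model for the frequency of exceedances, whereas the excesses are the values to which a GPD would be fitted.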

Stationarity and trend methods
At the basis of any analysis of extreme indicators lies the estimation of trends¹. Trends play a key role in judging if the data at hand are stationary, i.e. if the data follow a stochastic process for which the PDF does not change when shifted in time (or space). Consequently, parameters such as the mean and variance do not change over time (or position) for a stationary process. Loosely formulated, stationarity means that "the data" are stable over time: no trends, breaks, shocks, ramps or changes in variance over time. Methods for assessing stationarity are given by Diermanse et al. (2010) and Villarini et al. (2011) and references therein.
Once a choice for stationarity has been made, trends can be estimated as such or as part of a specific non-stationary time-series approach. Examples of the latter approach are (i) making the location parameter of a GEV distribution time-varying in a certain pre-defined way (e.g. Katz et al., 2002), or (ii) making the threshold in a GPD analysis time-varying (e.g. Coelho et al., 2008).
¹ Harvey (2006) gives two definitions for "trend". In much of the statistical literature a trend is conceived of as that part of a series which changes relatively slowly (smoothly) over time. Viewed in terms of prediction, a trend is that part of the series which, when extrapolated, gives the clearest indication of the future long-term movements in the series. In many situations these definitions will overlap, but not in all. In the case of data following a random walk, the latter trend definition does not lead to a smooth curve. Typical examples of the first definition are splines, LOWESS smoothers and binomial filters. A typical example of the second definition is the IRW trend model in combination with the Kalman filter (examples shown in Figs. 1, 4 and 6).
The choice of a specific trend model is not a trivial one. If we scan the climate literature on trend methods, an enormous number of models arises. We found the following trend models or groups of models (without being complete): low pass filters (various binomial weights; with or without end point estimates), ARIMA models and variations (SARIMA, GARMA, ARFIMA), linear trend with OLS, kernel smoothers, splines, the resistant (RES) method, Restricted Maximum Likelihood AR(1) based linear trends, trends in rare events by logistic regression, Bayesian trend models, simple Moving Averages, neural networks, Structural Time-series Models (STMs), smooth transition models, Multiple Regression models with higher order polynomials, exponential smoothing, Mann-Kendall tests for monotonic trends (with or without correction for serial correlations), trend tests against long-memory time series, robust regression trend lines (MM or LTS regression), Seidel-Lanzante trends incorporating abrupt changes, wavelets, Singular Spectrum Analysis (SSA), LOESS and LOWESS smoothing, Shiyatov corridor methods, Holmes double-detrending methods, piecewise linear fitting, Student's t-test on sub-periods in time, extreme value theory with a time-varying location parameter and, last but not least, some form of expert judgment (drawing a trend "by hand"). See Mills (2010) and references therein for a discussion.
This long list of trend approaches holds for trends in climate data in general. However, the number of trend models applied to extreme indicators appears to be much more limited. The trend model almost exclusively applied is the OLS straight line. This model has the advantage of being simple and of generating uncertainty information for any trend difference [$\mu_t - \mu_s$] (indices "t" and "s" are arbitrary time points within the sample period)². Examples of OLS trend fitting are given by Brown et al. (2010). They estimate trends in 17 temperature and 10 precipitation indices (all for extremes) at 40 stations. Their sample period is 1870-2005. Furthermore, Brown et al. (2010) analyse the sensitivity of their results with respect to the linearity assumption. To do so, they split the sample period into two parts of equal length and estimated the OLS trends on these two sub-periods.
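How an OLS straight line yields uncertainty for a trend difference between two time points can be sketched as follows. The data are invented for illustration and the function names are assumptions of this example. The standard error used here relies on the usual OLS assumptions (uncorrelated residuals with constant variance), which often do not hold for climate series.

```python
import math

def ols_trend(y):
    """Fit y_t = a + b*t + e_t for t = 0..n-1 by ordinary least squares.
    Returns intercept a, slope b and the estimated variance of b."""
    n = len(y)
    t = list(range(n))
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    b = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y)) / sxx
    a = y_bar - b * t_bar
    rss = sum((yi - a - b * ti) ** 2 for ti, yi in zip(t, y))
    var_b = rss / (n - 2) / sxx  # residual variance divided by Sxx
    return a, b, var_b

def trend_difference(b, var_b, t, s):
    """Trend difference mu_t - mu_s = b*(t-s) and its standard error,
    using var(mu_t - mu_s) = (t-s)^2 * var(b)."""
    return b * (t - s), abs(t - s) * math.sqrt(var_b)

# Hypothetical annual indicator values (e.g. deg C) for ten years.
y = [30.1, 31.4, 29.8, 32.0, 31.1, 32.9, 31.7, 33.2, 32.4, 33.8]
a, b, var_b = ols_trend(y)
diff, se = trend_difference(b, var_b, t=9, s=0)
print(f"slope = {b:.3f} per yr, change over 9 yr = {diff:.2f} +/- {se:.2f} (1-sigma)")
```

Because the trend is a straight line, the uncertainty of the difference grows linearly with the distance between the two time points.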
Other examples of OLS linear trend fits can be found in Klein  and Alexander et al. (2006), albeit that the significance of the trend slope is estimated differently. Klein  apply the Student's t-test, while Alexander et al. (2006) apply Kendall's tau-based slope estimator along with a correction for serial correlation according to a study of Wang and Swail (2001). Karl et al. (2008, Appendix A) choose linear trend estimation in combination with ARIMA models for the residuals. This is another way of correcting for serial correlation.
² The OLS regression model reads $y_t = \mu_t + \varepsilon_t = a + b\,t + \varepsilon_t$, with $a$ the intercept, $b$ the slope of the regression line and $\varepsilon_t$ a noise process. The variance of any trend differential [$\mu_t - \mu_s$] follows from $\mathrm{var}(\mu_t - \mu_s) = \mathrm{var}(b\,(t-s)) = (t-s)^2\,\mathrm{var}(b)$. Note 1: this variance estimate is only unbiased if the residuals are normally distributed and not serially correlated. Note 2: some authors estimate $b$ using Sen's estimator. This estimator is more robust against outliers.
H. Visser and A. C. Petersen: Inferences on weather extremes and weather-related disasters
In the field of disaster studies OLS trends are the dominant method, although the original data are log-transformed in most cases. See Pielke (2006, Figs. 2 and 3) or Munich Re (2011, p. 47) for examples. Another trend method in this field is the moving average trend model, where the flexibility is determined by the length of the averaging window chosen. See Pielke (2006, Fig. 5) for an example. We found only one example where the GPD with time-varying parameters was applied, namely to economic loss data due to floods in the USA (Katz, 2002).
Occasionally, other trend approaches for extreme indicators are reported. Frei and Schär (2001) apply logistic regression to time series of very rare precipitation events in the Alpine region of Switzerland. They include a quantification of the potential/limitation to discriminate a trend from the stochastic fluctuations in these records. Visser (2005) applies sub-models from the class of STMs, in combination with the Kalman filter, to estimate trends and uncertainty in weather indicators where trends may be flexible. The measure of flexibility is estimated by ML optimization. Klein  use the LOWESS smoother to highlight trend patterns in extreme weather indicators (their Figs. 3, 4, 6 and 7). Tebaldi et al. (2006) do not apply any specific trend model but show increases or decreases over two distant 20-yr periods: indicator differences between 2080-2099 and 1980-1999, and between 1980-1999 and 1900-1919. Hu et al. (2012) apply Mann-Kendall tests with correction for serial correlation (no actual trend is estimated in this approach).
Finally, some authors acknowledge that the use of a specific trend model, along with uncertainty analysis, may lead to deviating inferences on (significant) trend changes. Therefore, they chose to evaluate trends using more than one trend model. For example, Moberg and Jones apply two different trend models to the same data: the OLS trend model and the resistant (RES) model. Subsequently, they evaluate all their results with respect to these two trend models. Even more methods are evaluated by Young et al. (2011). They fit five different trend models to 23-yr wind speed and wave height data and evaluate uncertainty information for each model (their supporting material). We note that the application of more than one trend model to the same data has been published more often (not specifically for the evaluation of extremes). The reader is referred to Harvey and Mills (2003) and to Mills (2010) with references therein.

Return periods
If a particular analysis deals with extreme indicators, based on block extremes, return periods or the chance for crossing a pre-defined threshold can be calculated from the specific PDF chosen. These chances $p_t$, with t some time point within the sample period, follow directly from the PDF. Average return periods $R_t$ follow simply by taking the inverse of $p_t$. An example of return periods is given in Fig. 4 of IPCC-SREX (2011).
A variant is the so-called x-year return period, with x some fixed number (often 20 in the literature). If we denote an extreme indicator by $I_t$, a 20-yr return period, denoted as $I_t^{20}$, stands for an indicator value in year t which is crossed once in 20 yr, on average. In fact, $I_t^{20}$ stands for the 95th percentile of the PDF at hand. Confidence limits for such extreme percentiles can be computed by standard theory (e.g. Serinaldi, 2009).
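The relation between exceedance chances, return periods and x-year return values can be made concrete with a small sketch. It assumes a normal PDF with a hypothetical trend value and standard deviation; the numbers are illustrative and are not the estimates behind the De Bilt example.

```python
from statistics import NormalDist

def exceedance_probability(mu, sigma, threshold):
    """Chance p_t that a normally distributed indicator exceeds the threshold."""
    return 1.0 - NormalDist(mu, sigma).cdf(threshold)

def return_period(p):
    """Average return period R_t = 1 / p_t (in years for annual indicators)."""
    return 1.0 / p

def return_level(mu, sigma, years):
    """Value exceeded once in `years` years on average,
    i.e. the (1 - 1/years) quantile of the PDF."""
    return NormalDist(mu, sigma).inv_cdf(1.0 - 1.0 / years)

# Hypothetical trend value mu_t and residual standard deviation (deg C).
mu_t, sigma = 31.5, 1.6

p35 = exceedance_probability(mu_t, sigma, 35.0)
r35 = return_period(p35)
i20 = return_level(mu_t, sigma, 20)          # 95th percentile

# A warmer year: the same threshold is crossed more often.
r35_warm = return_period(exceedance_probability(mu_t + 1.0, sigma, 35.0))

print(f"p35 = {p35:.4f}, R35 = {r35:.0f} yr, 20-yr return value = {i20:.1f}")
print(f"R35 after a 1 degree shift in the mean = {r35_warm:.0f} yr")
```

The last two lines illustrate the point made by Mearns et al. (1984) and Wigley (1985): a modest shift in the mean changes the return period of a fixed threshold strongly and nonlinearly.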
An example illustrates the calculation of return periods. Suppose we are interested in the following extreme indicator: annual extreme temperatures TXX_t, with t in years. For the Netherlands we constructed a time series for this indicator over the period 1901-2010 (station De Bilt). Homogeneity tests showed a large discontinuity in 1950, the year in which the type of temperature screen changed. Therefore, we decided to limit analyses to the period 1951-2010. Other homogeneity tests were satisfactory (Visser, 2007). The TXX_t series is shown in Fig. 1.
The upper panel shows the data along with an Integrated Random Walk (IRW) trend model and 95 % confidence limits (Visser, 2004; Visser and Petersen, 2009). Tests showed the residuals (or in Kalman filter terms: innovations or one-step-ahead prediction errors) to be normally distributed. [...] yielding return periods $R_t^{35}$ of once in 420, 62 and 5.6 yr, respectively. For the calculation of annual 20-yr return periods ($\mathrm{TXX}_t^{20}$) we choose the yellow area such that it covers 5 % of the right-hand tail of the normal distributions, for all times t. We find the temperature thresholds 32.8, 34.1 and 36.4 °C, respectively (cf. Fig. 6).

Comparing PDFs
Next to trends and return periods one may make inferences on extreme indicators by computing PDFs for distinct periods of time. These PDFs can be derived from historic data, or from GCM calculations (historic or future periods). Differences between PDFs can be discussed qualitatively, as in Alexander et al. (2006) and Ballester et al. (2010), or by applying statistical tests (t-test for means, F-test for variances, Kolmogorov-Smirnov test for any difference in PDF shape, etc.). Non-parametric techniques have been summarized by Ferro et al. (2005).
[Figure caption, beginning missing] ... Mearns et al. (1984) and Wigley (1985). A return period is calculated as the inverse of these chances. For each of the three normal distributions one could calculate the temperature which is exceeded once in x years, the x-year return periods. For statistical details see Von Storch and Zwiers (1999, Ch. 2). Chances and return periods are further illustrated in Fig. 6.

Software
Standard statistical techniques mentioned in this section are available in software packages such as SPSS, S-PLUS, SAS or STATA. These packages also contain a wide range of trend models (OLS straight lines and polynomials, ARIMA models, robust trend models (MM or LTS)) and a range of smoothing filters (splines, kernel smoothers, LOESS smoothers, supersmoothers). For the estimation of GEV models and POT-GPD distributions (stationary or non-stationary) we refer to Stephenson and Gilleland (2006) and Gilleland and Katz (2011). On their website a wide range of software is given, based on extreme value theory (EVT): http://www.ral.ucar.edu/~ericg/softextreme.php. For a software package based on the book of Coles (2001) the reader is referred to the extRemes software, written in R: http://cran.r-project.org/web/packages/extRemes/extRemes.pdf.
The software for estimating structural time series models (STMs), as applied in this article, is freely available from the first author (H. Visser). Other software on STMs is the package STAMP. For information please see http://stamp-software.com/.

Methods in the literature
In this section, we will give a concise overview of the recent literature on extremes and disasters. In doing so, we have categorized the literature by the stationarity assumptions that researchers have made. Besides stationarity and non-stationarity we will give examples of block-stationarity, that is, a period or "block" of a certain length, typically between 20 and 30 yr, within which climate is assumed to be stationary.

Non-statistical approaches
Extreme events or disasters can be analysed without assuming statistical properties. The method employed is simply to enumerate a number of record-breaking values. These records can be discussed with respect to their spread over time. If x of the highest values occurred in the past decade, this might give an indication of a shifting climate. The method of enumeration is often applied in communication to the media. An example is the annually recurring discussion on the extremity of global mean temperatures. For example, see the NOAA and NASA GISS websites http://www.noaanews.noaa.gov/stories2011/20110112_globalstats.html and http://www.giss.nasa.gov/research/news/20110113/, discussing the extremity of the 2010 value.
In the peer-reviewed literature enumeration is found only occasionally. For instance, Prior and Kendon (2011) studied the UK winter of 2009/2010 in relation to the severity of winters over the last 100 yr. They give an overview of coldness rankings for monthly and seasonal average temperatures, as well as rankings for the number of days with snow. Furthermore, Battipaglia et al. (2010) study temperature extremes in Central Europe reconstructed from tree-ring density measurements and documentary evidence. Their tables and graphs show a list of warm and cold summers over the past five centuries.
In the grey literature (reports) many examples of enumeration can be found. Buisman (2011) gives a detailed description of weather extremes and disasters, for a large part based on documentary information in the area of the Netherlands. His enumeration covers the period from the Middle Ages up to the present. Enumerations of disasters in recent decades are found in, e.g. WHO (2011) and Munich Re (2011). Zorita et al. (2008) consider the likelihood that the observed recent clustering of warm record-breaking mean temperatures at global, regional and local scales may occur by chance in a stationary climate. They conclude this probability to be very low (under two different hypotheses).
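The question whether a clustering of record values in recent years could occur by chance can be illustrated with a simple calculation. The sketch below assumes that, in a stationary climate without serial correlation, every ordering of the values is equally likely, so that the number of top-k values falling in a given window follows a hypergeometric distribution. This is a deliberately simplified null hypothesis and not the procedure of Zorita et al. (2008); the function name and the numbers are illustrative.

```python
from math import comb

def prob_top_k_in_window(n, k, w, x):
    """P(at least x of the k largest values of an n-year series fall in a
    fixed window of w years), assuming all orderings are equally likely."""
    total = comb(n, w)
    return sum(comb(k, j) * comb(n - k, w - j)
               for j in range(x, min(k, w) + 1)) / total

# Hypothetical case: 5 of the 10 warmest years of a 100-yr record
# occur in the most recent decade.
p = prob_top_k_in_window(n=100, k=10, w=10, x=5)
print(f"chance under the exchangeability assumption: {p:.5f}")
```

Serial correlation in real climate series makes such clustering more likely than this calculation suggests, which is why more elaborate null hypotheses are needed in practice.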

Assuming GEV distributions
Wehner (2010) fits GEV distributions to pre-industrial control runs from 15 climate models in the CMIP3 dataset. These control runs are assumed to be stationary; 20-yr return periods are estimated for annual maximum daily mean surface air temperatures along with uncertainties in these return periods. Min et al. (2011) also estimate the GEV distribution. They analyse 49-yr time series of the largest one-day and five-day precipitation accumulations annually (RX1D_t and RX5D_t). Afterwards, these distributions are used to transform precipitation data to a "probability-based index" (PI), yielding a new 49-yr time series with values between 0 and 1. Time-dependent behaviour of the PI_t series is shown by estimating trends (their Fig. 1).
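Both uses of a fitted GEV distribution mentioned here, return values and a transformation to a 0-1 index, follow from its cumulative distribution function. The sketch below assumes that the three GEV parameters have already been estimated; the parameter values and precipitation amounts are hypothetical, and the shape parameter follows the sign convention of Coles (2001). Treating the index simply as the GEV CDF value is our reading of the description above, not a verified reproduction of Min et al. (2011).

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """GEV cumulative distribution function (shape xi as in Coles, 2001)."""
    if abs(xi) < 1e-12:                       # Gumbel limit for xi -> 0
        return math.exp(-math.exp(-(x - mu) / sigma))
    z = 1.0 + xi * (x - mu) / sigma
    if z <= 0.0:                              # x lies outside the support
        return 0.0 if xi > 0 else 1.0
    return math.exp(-z ** (-1.0 / xi))

def gev_return_level(years, mu, sigma, xi):
    """Value exceeded once in `years` years on average."""
    y = -math.log(1.0 - 1.0 / years)
    if abs(xi) < 1e-12:
        return mu - sigma * math.log(y)
    return mu - sigma / xi * (1.0 - y ** (-xi))

# Hypothetical GEV parameters for annual maximum 1-day precipitation (mm).
mu, sigma, xi = 40.0, 10.0, 0.1

rl20 = gev_return_level(20, mu, sigma, xi)

# Hypothetical annual maxima transformed to an index between 0 and 1.
rx1d = [35.2, 48.9, 41.0, 62.3, 38.7]
pi_series = [gev_cdf(x, mu, sigma, xi) for x in rx1d]

print(f"20-yr return value: {rl20:.1f} mm")
print("index values:", [round(v, 3) for v in pi_series])
```

By construction the 20-yr return value maps to an index of 0.95, which links the two quantities.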

Assuming GPD distributions (POT approach)
Della-Marta et al. (2009) apply the POT approach in combination with the generalized Pareto distribution (GPD) and declustering. They apply this approach to extreme wind speed indices (EWIs). The GPD parameters are regarded as time-independent.

Assuming a block-stationary climate
Assuming no specific PDF shape
Alexander et al. (2006) analyse changes in PDF shapes without specifying the shape itself. In their Fig. 8 the sample period is split up into three block periods and PDF shapes are discussed in a qualitative way. In their Figs. 9, 10 and 11 two block periods have been chosen. Brown et al. (2010) analyse temporal changes in PDFs in their Figs. 5 and 6. Data are seasonal minimum and maximum temperatures over the period 1893-2005, taken from northeastern US stations. The block size is around 28 yr. No specific PDF shape is assumed in their analyses.
Kharin and Zwiers (2007) [...] Fig. 4). To this end they choose blocks of 30 yr and base their return-period calculations on these 30-yr blocks. Uncertainties in return periods are obtained through 1000 times resampling of block data. [...] choose 10-yr blocks for the location parameter of the GEV distribution. The other two GEV parameters are kept constant in their approach.

Assuming a non-stationary climate
Assuming GEV distributions
Schönwiese (2005, 2007) analyse monthly total precipitation data from a German station network of 132 time series, covering the period 1901-2000. They use a decomposition technique which results in estimations of Gumbel distributions with a time-dependent location and scale parameter. Kharin and Zwiers (2005) estimate extremes in transient climate-change simulations. Their sample period is 1990-2100. They assume annual extremes of temperature and precipitation to be distributed according to a GEV distribution with all three parameters time-varying (linear trends). In doing so, their GEV model has six unknown parameters to be estimated. Brown et al. (2008) essentially follow the same approach for extreme daily temperatures over the period 1950-2004. Fowler et al. (2010) estimate GEV distributions with linearly changing location parameters and apply this technique to UK extreme precipitation simulations over the period 2000-2080. Their approach deviates from that of Kharin and Zwiers (2005) and Brown et al. (2008) in that they do not assume this approach to be the only approach possible. They estimate eight different modelling approaches and select the best fitting model using Akaike's AIC criterion. Hanel et al. (2009) apply GEV distributions where all three parameters are time-varying. Furthermore, the GEV location parameter may vary over the region. This non-stationary model has been applied to the 1-day summer and 5-day winter precipitation maxima in the river Rhine basin, in a model simulation for the period 1950-2099. A similar approach has been followed by Hanel and Buishand (2011).
Katz et al. (2002) assume a generalized Pareto distribution for US flood damages where a linear trend is assumed in the log-transformed scale parameter (their Fig. 5). Parey et al. (2007) assume a POT model with time-varying parameters and analyse 47 temperature stations in France over the 1950-2003 period. As in Fowler et al. (2010) they consider a suite of models, such as situations where station data are assumed to be stationary versus those where they are assumed to be non-stationary. Coelho et al. (2008) apply a flexible generalized Pareto model that accounts for spatial and temporal variation in excess distributions. Non-stationarity is introduced by using time-varying thresholds (local polynomial with a window of 20 yr). Sugahara et al. (2009) apply the same approach as Coelho et al. (2008), using large p quantiles of daily rainfall amounts. A sensitivity analysis was performed by estimating four different GPD models. Acero et al. (2011) use the POT-GPD approach where thresholds are made time-varying, allowing them to change linearly over time. An automatic declustering approach was used to select independent extreme events exceeding the threshold.
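For reference, a GEV model with linear trends in all three parameters, as described above for annual extremes, can be written down as follows. This is one possible parameterisation given as an illustration; the cited studies may differ in detail (for example, by modelling the scale parameter on a logarithmic scale), and the symbols $\mu_0, \mu_1, \sigma_0, \sigma_1, \xi_0, \xi_1$ are our own notation.

```latex
\begin{align}
G_t(x) &= \exp\left\{-\left[1 + \xi_t\,\frac{x-\mu_t}{\sigma_t}\right]^{-1/\xi_t}\right\},
  \qquad 1 + \xi_t\,\frac{x-\mu_t}{\sigma_t} > 0, \\
\mu_t &= \mu_0 + \mu_1 t, \qquad
\sigma_t = \sigma_0 + \sigma_1 t, \qquad
\xi_t = \xi_0 + \xi_1 t, \\
\mathrm{AIC} &= 2k - 2\ln\hat{L}.
\end{align}
```

Here $k$ is the number of estimated parameters and $\hat{L}$ the maximized likelihood. The full model has $k = 6$, a model with a trend in the location parameter only has $k = 4$, and the stationary GEV has $k = 3$. Comparing candidate models by AIC then amounts to choosing the one with the smallest value.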

Assuming normal or log-normal distributions
Wigley (2009) analyses changes in return periods using OLS trend fitting plus a normal distribution for the residuals. He gives an example for monthly mean summer temperatures in England (the CET database). We return to this approach in more detail in Sect. 4.1. Visser and Petersen (2009) apply a trend model from the group of structural time series models, the so-called Integrated Random Walk (IRW) model. This IRW model has the advantage of being flexible, where the flexibility can be chosen by maximum likelihood optimization. The OLS straight line is a special case of the IRW model. They apply this trend model to an indicator for extreme cold conditions in the Netherlands for the period 1901-2008. Return periods are generated along with uncertainty information on temporal changes in these return periods (cf. the TXX_t example shown in Figs. 1, 4 and 6). Alexander et al. (2006) show trends estimated by a 21-term binomial filter in their Figs. 2 through 7. Results using straight lines are shown in their Tables 1 and 2, and Figs. 12 and 13. The slope of these trends has been estimated by Kendall's tau-based slope estimator. Klein  apply LOWESS smoothers in their Figs. 3, 4, 6 and 7. Results using straight lines are also presented. Here, OLS fits are used where significance is tested using a Student's t-test. Pielke (2006) shows several examples of trend estimation for disaster data. Both OLS straight lines (after taking a log-transformation) and 11-yr centred moving averages are shown.

Stationarity
We have seen in Sect. 2 that methods differ with respect to their assumption of stationarity (Sects. 2.6.1, 2.6.2 and 2.6.3). At first glance one may judge this choice as a matter of taste. As long as one makes his or her assumptions clear, all seems okay at this point. Of course, there is no problem as long as the processes underlying the data at hand are truly stationary, such as in the study of Wehner (2010) who fits GEV distributions to pre-industrial control runs from 15 climate models, part of the CMIP3 dataset. The same holds for Villarini et al. (2011), who apply GEV distributions only to flood stations whose data are stationary over time.
However, inferences might go wrong if data are assumed to be stationary when they are not. Figure 2 gives an illustration of this point by simulation. Suppose that a specific weather index shows an increasing trend pattern over time. However, the year-to-year variability slowly decreases over time (heteroscedastic residuals). Now, if we were to assume these data to be stationary, we would conclude that the frequency of high extremes is decreasing over time. This conclusion could easily be interpreted as an absence of climate change. However, the increasing trend in these data contradicts this conclusion. The example shows that conclusions on the influence of climate change should not be drawn from the behaviour of extremes alone. Proper methods for stationarity checks should be applied.
A second danger of assuming stationarity while data are in fact non-stationary occurs if GEV distributions are applied. GEV distributions are very well suited to fit data which are stable at first and start to rise at the end. See the simulation example in Fig. 3, upper panel. This example is composed of an exponential curve to which normally distributed noise is added. Now, if we regard this hundred-year long record as stationary and fit, for example, the Gumbel distribution to these data, a perfect fit is found, as illustrated in the lower panel.
This result might seem surprising, but it is not. The residuals of the simulated series are normally distributed, having symmetric tails. Due to the higher values at the end of the series the right-hand tail of the distribution will become "thicker" than the tail of the normal distribution if we disregard the non-stationarity at the end of the series. And this is exactly the shape of the right-hand tail of the Gumbel distribution, and more generally the GEV distribution. In practice the GEV distribution will give a good fit on many such occasions since it has three fit parameters instead of the two of the Gumbel distribution.
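The mechanism can be reproduced with a small simulation in the spirit of Fig. 3. The sketch below is not the authors' simulation: the amplitude and time scale of the exponential curve, the random seed and the use of a method-of-moments Gumbel fit (instead of maximum likelihood) are all assumptions made for this example.

```python
import math
import random

random.seed(1)  # fixed seed so that the sketch is reproducible

n = 100
# Exponential curve: flat at first, rising sharply at the end (arbitrary scale).
trend = [8.0 * math.exp((t - (n - 1)) / 15.0) for t in range(n)]
# Add symmetric, normally distributed noise with unit standard deviation.
series = [m + random.gauss(0.0, 1.0) for m in trend]

# Pool all values as if the record were stationary.
mean = sum(series) / n
var = sum((x - mean) ** 2 for x in series) / n
sd = math.sqrt(var)
skewness = sum((x - mean) ** 3 for x in series) / n / sd ** 3

# Normal distribution fitted to the pooled data.
loglik_normal = sum(-0.5 * math.log(2.0 * math.pi * var)
                    - (x - mean) ** 2 / (2.0 * var) for x in series)

# Gumbel distribution fitted by the method of moments:
# variance = pi^2 * beta^2 / 6 and mean = loc + gamma * beta.
EULER_GAMMA = 0.5772156649015329
beta = sd * math.sqrt(6.0) / math.pi
loc = mean - EULER_GAMMA * beta

def gumbel_logpdf(x):
    z = (x - loc) / beta
    return -math.log(beta) - z - math.exp(-z)

loglik_gumbel = sum(gumbel_logpdf(x) for x in series)

print(f"skewness of pooled data: {skewness:.2f}")
print(f"log-likelihood normal: {loglik_normal:.1f}, Gumbel: {loglik_gumbel:.1f}")
```

Although the noise is symmetric, the pooled values are right-skewed because of the rise at the end of the record. Comparing the two log-likelihoods indicates which distribution describes the pooled data better when the trend is ignored.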
Our conclusion is that care should be taken if climate is assumed to be stationary. If data are assumed to be stationary when they are not, inferences might become misleading.

Block stationarity
As we have described in Sect. 2.6.2, a number of authors assume their data to be stationary over short periods of time, typically periods of 20 to 30 yr. Such assumptions are often made in climatology and are clearly reflected in the definition of "climate" (IPCC, 2007). Of course, if the extreme indicator at hand shows stable behaviour over the block period chosen, the choice of stationarity is satisfactory. However, due to rapid climate change, the stationarity assumption may be invalid, even for very short periods. Young et al. (2011) give such examples for 23-yr extreme wind speed and wave height data. They find many significant rising trends (their Table 1).
Again, our conclusion is that care should be taken in assuming stationarity, even for such short periods of time (20 to 30 yr). Changes in extreme weather variables may be highly significant even over these short periods.
Wigley estimated linear trends and normal distributions for monthly mean temperatures in England (the CET database, Parker and Horton, 2005). Cooley estimated GEV distributions with time-varying parameters for annual maxima of daily maximum temperatures, also taken from the CET database. He finds a linear fit for the GEV location (mean) parameter, and constant variance and shape parameters. Cooley discusses the advantages of taking the GEV distribution rather than the normal distribution. Who is right, or are both right?

We re-estimated the trend in the CET TXX_t data (footnote 3) with the IRW trend model (cf. Fig. 1), and checked the distribution of the residuals. The flexibility of the IRW trend is estimated by ML optimization, and the resulting trend appears to be a straight line, mathematically equal to the OLS linear trend. The innovations (= one-step-ahead prediction errors) do not show obvious non-normal behaviour and we conclude that a straight line, along with normally distributed residuals, gives plausible results for these TXX_t data. Compared to the trend of Cooley, our trend appears to have a slightly steeper slope: 0.0155 ± 0.005 (1-σ) against his slope estimate of 0.0142. This result implies that (i) more than one PDF may be applied to the same data and (ii) the choice of the PDF shape (slightly) influences the trend slope estimate (cf. the simulation example shown in Fig. 3).
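Why an IRW trend can collapse to a straight line is easiest to see in state-space form. The following is a minimal sketch, not the software used for the paper: it runs a Kalman filter for the IRW model with a hand-fixed slope-noise variance `q` (the paper estimates the flexibility by ML), a vague prior controlled by the hypothetical parameter `kappa`, and synthetic data that only mimic a slowly rising temperature series. With `q = 0` the filtered level and slope at the end of the record agree with the OLS line.

```python
import random


def irw_filter(y, q, r=1.0, kappa=1e6):
    """Kalman filter for the integrated random walk (IRW) trend model:
        mu_t = mu_{t-1} + beta_{t-1},  beta_t = beta_{t-1} + zeta_t,
        y_t  = mu_t + eps_t,  var(zeta_t) = q,  var(eps_t) = r.
    Returns the filtered (level, slope) after the last observation."""
    mu, beta = 0.0, 0.0
    p00, p01, p11 = kappa, 0.0, kappa  # vague prior; symmetric 2x2 covariance
    for i, obs in enumerate(y):
        if i > 0:
            # Prediction step: x <- T x, P <- T P T' + Q
            mu = mu + beta
            p00 = p00 + 2.0 * p01 + p11
            p01 = p01 + p11
            p11 = p11 + q
        # Update step with the scalar observation
        f = p00 + r
        k0, k1 = p00 / f, p01 / f
        innovation = obs - mu
        mu += k0 * innovation
        beta += k1 * innovation
        p00, p01, p11 = p00 - k0 * p00, p01 - k0 * p01, p11 - k1 * p01
    return mu, beta


def ols_line(y):
    """Intercept and slope of the OLS straight line against t = 0, 1, ..."""
    n = len(y)
    t_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


# Synthetic series (hypothetical values, not the CET data)
rng = random.Random(3)
y = [14.0 + 0.0155 * t + rng.gauss(0.0, 1.0) for t in range(130)]

level, slope = irw_filter(y, q=0.0)
a, b = ols_line(y)
print(level, a + b * (len(y) - 1), slope, b)
```

Setting `q` above zero lets the slope drift, which is what produces the more flexible trend shapes discussed for the disaster data.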

Comparing four PDF shapes
To get a better grip on this "PDF shape discussion" we have tested four PDF shapes frequently encountered in the literature, on the same data. The PDF shapes are (i) the normal distribution, (ii) the log-normal distribution, (iii) the Gumbel distribution and (iv) the GEV distribution (of which the Gumbel distribution is a special case). For this test, we performed two groups of simulations yielding a number of TXX_t and RX1D_t "look alikes". We varied the time series length N (65, 130 and 1300 yr) and the number of effective days N_eff (1, 60, 180 and 365 days). The latter parameter mimics the effective number of independent daily data within a year for a certain weather variable. Details are given in Appendix A.
[Fig. 3, caption fragment; panel title: Quantile-Quantile Plot with 0-1 Line.] The "measurements" are obtained by choosing an exponential as a "trend" and adding normally distributed white noise (constant variance) to this trend. If it is assumed that the measurements follow a stationary process, the data appear to follow an extreme value (Gumbel) distribution. This is shown in the lower four panels, which are generated by the S-PLUS Envstats module. Shown is the Kolmogorov-Smirnov test, where the data are compared to a Gumbel distribution. The Gumbel distribution appears to fit very well (the QQ plot shows all data on the 0-1 line; the p value of the KS test is 0.93).

[Figure panel title: Results of Kolmogorov-Smirnov GOF]
An example from these simulations is given in Fig. 5. Here, we have plotted four PDFs for the same TXX_t simulation (N_eff = 60 days; N = 130 yr). This simulation resembles the Wigley-Cooley case with daily CET temperatures since 1880. The four panels show the Kolmogorov-Smirnov goodness-of-fit test, along with three graphic presentations (as in the lower panel of Fig. 3). The panels show that the only distribution which does not fit very well is the Gumbel distribution (the right tail deviates in the QQ plot, lower left panel).
Although the simulation exercise described in Appendix A is certainly not exhaustive, the following inferences can be made:
- Both log-normal and GEV distributions fit very well for the vast majority of simulations (both TXX_t and RX1D_t simulations). This result is in line with the many examples of these PDFs in the literature, applied to real data.
- The Gumbel distribution fits only moderately to the TXX_t simulations. Much better fits are found for data which are skewed in nature, such as in the case of the RX1D_t simulations. This result is in line with the findings of Trömel and Schönwiese (2007), who find Gumbel distributions for 132 precipitation series in Germany. No Gumbel distributions have been reported in the literature for temperature data, which is in line with our TXX_t simulation results.
- The normal distribution fits well for the TXX_t simulations as long as the number of years is rather small (sample periods shorter than ∼130 yr). This result is in line with the Wigley-Cooley discussions for CET data since 1880. For skewed data, as in the second group of simulations, the normal distribution is not a good choice.
[Figure legend: trend differences between consecutive years; corresponding 95% confidence limits.]

One might conclude from the inferences above that the GEV distribution would be the ideal PDF choice in general: (i) it fits in almost all cases and (ii) it has an interpretational background in relation to extremes. However, we note that the estimation of time-varying GEVs, in combination with linearity assumptions on the three parameters, demands the estimation of six parameters (Kharin and Zwiers, 2005). And the linearity assumption for GEV parameters might be limiting in some cases. In contrast, the estimation of flexible trends and normal distributions (as in the TXX_t examples for CET and De Bilt) (i) does not fit for skewed data and (ii) lacks interpretation. However, it demands the estimation of only one parameter. Also, uncertainty information on extremes is gained more easily (cf. Fig. 6). The same advantage is gained after taking logarithms of the indicator at hand, as in Visser and Petersen (2009, their Fig. 5 and Appendix). The simulations in Appendix A show that log-normal distributions fit very well.
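For reference, the parameter count can be made explicit with the standard form of the GEV distribution function. The notation below (location µ, scale σ, shape ξ, and the coefficient subscripts 0 and 1) is an assumption introduced here for illustration; it is not taken from Kharin and Zwiers (2005), and some authors put the linear trend on log σ rather than on σ.

```latex
% GEV distribution function, defined where 1 + \xi (x - \mu)/\sigma > 0
G(x;\mu,\sigma,\xi) = \exp\left\{ -\left[ 1 + \xi\,\frac{x-\mu}{\sigma} \right]^{-1/\xi} \right\},
\qquad
\lim_{\xi \to 0} G(x;\mu,\sigma,\xi) = \exp\left\{ -\exp\left( -\frac{x-\mu}{\sigma} \right) \right\} \quad \text{(Gumbel)}

% Linear time dependence in all three parameters gives six coefficients
\mu_t = \mu_0 + \mu_1 t, \qquad
\sigma_t = \sigma_0 + \sigma_1 t, \qquad
\xi_t = \xi_0 + \xi_1 t
```

The Gumbel distribution is the ξ → 0 limit, which is why it has two fit parameters against three for the stationary GEV distribution.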
We found one example in the literature where different PDFs are analysed for the same data. Sobey (2005) analyses detrended high and low water levels according to four different PDF shapes: the Gumbel distribution, the Fréchet distribution, the Weibull distribution and the log-normal distribution (the first three distributions are special cases of the GEV distribution). Furthermore, he gives guidance for choosing a suitable distribution for the data at hand. For both extreme high and extreme low water levels at San Francisco the log-normal distribution fits the data best. He identifies the Gumbel and Weibull distributions as promising alternatives. His results are consistent with the findings from our simulation study above (although different in detail).

Available statistical techniques do not suffice in all cases
Uncertainty information is an important source of additional information pertaining to inferences on extremes. Within climate science, and particularly within the Intergovernmental Panel on Climate Change (IPCC), there has been increased attention to dealing with uncertainties over the last decade or so (see e.g. Moss and Schneider, 2000; Petersen, 2000, 2012; IPCC, 2005; Risbey and Kandlikar, 2007; Swart et al., 2009; Hulme and Mahony, 2010; Mastrandrea et al., 2010). We scanned the literature for its treatment of statistical uncertainties. In doing so, we discerned three levels of statistical uncertainty information:
- Class 0: research giving no statistical uncertainty information.
- Class 1: research giving point-estimate uncertainty for extreme statistics. Here, we mean uncertainty statistics at one specific point in time, such as confidence limits for a return period R_t or confidence limits for a trend estimate µ_t. An example for extremes has been given in the three panels of Fig. 6. An example for trends has been given in Fig. 1, upper panel.
- Class 2: research giving uncertainty information both for point estimates and for differential estimates. Here, we mean "Class 1" uncertainty information along with uncertainty information on differential statistics such as the return-period differential [R_t − R_s], or trend differentials [µ_t − µ_s] (times "t" and "s" lie in the sample period with t > s) (footnote 4). An example has been given in Fig. 4.
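To make the two statistics concrete, the sketch below computes a return period R_t and a differential [R_t − R_s] under the simplest setting discussed in this paper: a normally distributed annual indicator with a time-varying mean. The threshold, standard deviation and trend values are hypothetical, and only point values are computed; attaching confidence limits to the differential is exactly the "Class 2" information discussed here.

```python
import math


def normal_sf(z):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def return_period(threshold, mean, sd):
    """Average return period (yr) for an annual value exceeding `threshold`,
    assuming the annual indicator is normal with the given mean and sd."""
    return 1.0 / normal_sf((threshold - mean) / sd)


# Hypothetical values for illustration only (not taken from the paper)
threshold = 35.0                   # exceedance level, deg C
sd = 1.8                           # residual standard deviation, deg C
trend = {1950: 30.0, 2010: 31.0}   # trend values mu_s and mu_t, deg C

r_s = return_period(threshold, trend[1950], sd)
r_t = return_period(threshold, trend[2010], sd)
differential = r_t - r_s
print(f"R_s = {r_s:.0f} yr, R_t = {r_t:.0f} yr, R_t - R_s = {differential:.0f} yr")
```

Because the return period is a strongly nonlinear function of the mean, a modest shift in µ_t produces a large change in R_t, which is why uncertainty on the differential cannot simply be read off from the two point-estimate intervals.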

[Figure residue. Axis: return period (years). Legend: return period for exceedance of 35.0 °C; corresponding 95% confidence limits.]
With respect to return periods or the chance of crossing pre-defined thresholds, we only rarely found examples of "Class 0". In most cases "Class 1" uncertainty information is given: Feng and Nadarajah (2007), Della-Marta et al. (2009), Fowler et al. (2010), Wehner (2010) and Lucio et al. (2010). However, we found that "Class 2" uncertainty information is lacking almost completely. The only example we found was in a previous paper of ours (Visser and Petersen, 2009). There, we give approximate uncertainty estimates for return period differentials in an Appendix.
As for trends, we only rarely found examples of "Class 0" uncertainty. Examples lacking uncertainty information are mostly found in the estimation of trends in disaster data: although OLS linear trends have been applied (and, thus, uncertainty information is easily available), no uncertainty information is given in publications. Other examples are those where moving averages or other digital filters have been applied. These trend models are not statistical in nature and, thus, do not give uncertainty information.
Since most articles apply OLS linear trend fits to their data, both "Class 1" and "Class 2" uncertainty information are covered at the same time (cf. footnote 2). Examples are Klein Tank and Können (2003), Klein Tank et al. (2006), Alexander et al. (2006), Brown et al. (2010), Min et al. (2011) and Charpentier (2011). Brown et al. (2008) give full statistical uncertainty information for the time-varying location parameter of the GEV distribution. Trends from the class of structural time series models (STMs), as shown here in Figs. 1 and 4, give a generalization of the OLS linear trend: they also give full statistical uncertainty information (Visser, 2004; Visser et al., 2010).

[Fig. 7a, caption fragment. Axis: annual losses (USD 2009 billion). Legend: losses from global weather-related disasters; IRW trend; 95% confidence limits of the trend.] The trend has been estimated by the OLS straight line fit after taking logarithms. The lower panel shows the IRW trend fit on logarithms of the same data. Flexibility of the trend has been optimized by ML estimation (Visser, 2004).
Our "uncertainty scan" shows that full uncertainty information ("Class 2") is missing for statistics such as return periods or the chance of crossing thresholds. The reason for that is simple: the statistical literature on extremes, such as Coles (2001), does not report methods to compute these differential uncertainties. Therefore, our conclusion is a simple one: such methods should be developed. For trend estimation, we conclude that full uncertainty information is available as long as OLS linear trends or trend models from the class of STMs are chosen.
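For the OLS straight line, the differential follows directly from the slope: µ_t − µ_s = b (t − s), so its standard error is (t − s) times the standard error of b. The sketch below illustrates this with a hand-written OLS fit; the function names and the toy series are hypothetical, and the factor 1.96 is a normal approximation (for short series a Student-t quantile would be more appropriate).

```python
import math


def ols_trend(y):
    """OLS straight line against t = 0, 1, ...: intercept, slope, se(slope)."""
    n = len(y)
    t_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y)) / sxx
    intercept = y_mean - slope * t_mean
    rss = sum((v - intercept - slope * t) ** 2 for t, v in enumerate(y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return intercept, slope, se_slope


def trend_differential(y, t, s, z=1.96):
    """Point estimate and approximate 95% limits for mu_t - mu_s (t > s)."""
    _, slope, se = ols_trend(y)
    diff = slope * (t - s)
    half_width = z * se * (t - s)
    return diff, diff - half_width, diff + half_width


# Toy series: straight line with a small alternating disturbance
y = [2.0 + 0.5 * t + (0.1 if t % 2 else -0.1) for t in range(20)]
diff, lower, upper = trend_differential(y, t=19, s=0)
print(f"mu_19 - mu_0 = {diff:.2f} (approx. 95% limits {lower:.2f} to {upper:.2f})")
```

For flexible trends no such closed form in a single slope exists, which is why the covariance between trend values, as delivered by STMs, is needed there.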

[Fig. 7b, caption fragment. Axis: number of disasters, by year (1950-2010). Legend: number of great natural disasters; IRW trend; 95% confidence limits of the trend.] The lower panel shows an IRW trend, fit on logarithms of the same annual data (i.e. y_t = log(x_t + 1)). Flexibility of the trend has been optimized by ML estimation. Details of the IRW trend fit are given by Visser (2004). Source upper graph: Munich Re (2010b), p. 37 and their website.

Best modelling practices and uncertainty
As described at the end of Sect. 2.2, some authors have chosen to apply more than one trend model to analyse their data. This type of sensitivity analysis does not only evaluate uncertainties in estimators, but also tries to find the influence of underlying model assumptions, thus often moving beyond the realm of statistical uncertainty into scenario (what-if) uncertainty. See Mills (2010) and Charpentier (2011, Sect. 2). Another example is given by Moberg and Jones (2005), who evaluate trends in extreme weather indicators using two trend models: the OLS linear trend and the RES method. The latter method is more appropriate if the data contain outliers and behave non-normally. Zhang et al. (2004) report Monte Carlo experiments in which three ways of estimating linear trends have been evaluated (OLS linear trend, Kendall tau-based method and time-varying GEV distributions). In fact, the evaluation of different trend models, and corresponding uncertainty inferences, is a way of evaluating structural uncertainty, i.e. evaluating the potential influence of specific model assumptions.
An illustration of the importance of considering more than one trend model is given in Fig. 7a. The upper panel shows the economic losses due to global weather-related disasters, as published by Munich Re (2010a). The trend is estimated by fitting the OLS linear trend model, after taking logarithms of the event data. The result is an exponentially increasing trend. If an IRW trend is estimated, where the flexibility is optimized by ML (Visser, 2004), a different trend pattern arises (lower panel): an increase up to 1995 and a stabilization afterwards. The trend value in 2009 is significantly higher than trend values before 1987 (tested for α = 0.05, graph not shown here). A comparable example is given in Fig. 7b. The upper panel shows the number of great natural disasters, as published by Munich Re (2010b) and reprinted in Pielke (2010, p. 167). Again, the result is an exponentially increasing trend. If an IRW trend is estimated, a different trend pattern arises (lower panel): an increase up to 1992 and a decrease afterwards. The trend value in 2010 is not significantly higher than the trend values before 1980 (tested for α = 0.05, graph not shown here). These two examples illustrate that the interpretation of trend patterns in extreme indicators might be influenced by the trend method chosen.
Another approach to assess structural uncertainty is the evaluation of the stationarity/non-stationarity of the data at hand (cf. discussion in Sect. 3). Examples are:
- Feng and Nadarajah (2007) [...]
We also found other sensitivity approaches which could be categorized under the term "best modelling practices". In the field of future extremes it might be of importance to evaluate extreme statistics on the basis of more than one GCM or RCM. Examples are:
- Kharin et al. (2007) give multi-model uncertainty limits for 20-yr return periods in their Figs. 3, 5, 6 and 7, based on 14 IPCC AR4 models.
- Barriopedro et al. (2011, Figs. 4 and S12) evaluate return periods for mega-heatwaves on the basis of 11 RCMs and one reanalysis run.
A second sensitivity approach deals with the sensitivity of trend estimates and corresponding uncertainties in relation to the sample period length. Examples are:
- Moberg and Jones (2005) [...] 1901-1999, 1921-1999, 1901-1950 and 1946-1999 [...]
[...] Sobey (2005) analyses detrended high and low water levels according to four different PDF shapes. From his analysis he gives guidance for choosing a suitable distribution for the data at hand.
In our judgment, some form of sensitivity analysis is important to assess the reliability of results. This conclusion of course pertains more generally to environmental research.

Coupling extremes or disasters to climate change
There are several ways to couple trends in extremes or disasters to (anthropogenic) climate change (see, e.g. Hegerl and Zwiers, 2011; Min et al., 2011 for spatio-temporal approaches). One has to be careful, however, in coupling individual extremes to climate change. In fact, statistical inferences are about chances for groups of events and not about individual events.
Even though most publications do not strictly couple single extremes to climate change, that is, with 100 % certainty, many are suggestive about the connection while they actually focus on the changed chances. A recent example on flooding is Pall et al. (2011), and an example of suggestive information on the Pakistan floods in 2010 is given in Fig. 8. [...] Table TS4.2 of the Technical Summary (p. 51) and in the Executive Summary (p. 543), present the health impacts from the 2003 heat wave as an example of "wide-ranging impact of changes in current climate". Thus, the text implicitly suggests that the 2003 heat wave is the result of recent climate change.
However, one can never attribute a specific extreme weather event of the past -such as that particular heat wave -to changes in current climate. In fact, we agree with Schär and Jendritzky (2004) who stated the following: "The European heatwave of 2003: was it merely a rare meteorological event or a first glimpse of climate change to come? Probably both." Stott et al. (2004) come to a comparable conclusion: "It is an ill-posed question whether the 2003 heatwave was caused, in a simple deterministic sense, by a modification of the external influence on climate -for example, increasing concentration of greenhouse gases in the atmosphere -because almost any such weather event might have occurred by chance in an unmodified climate." Finally, IPCC-SREX (2011, p. 6) concludes that "the attribution of single extreme events to anthropogenic climate change is challenging".

Conclusions
In this article, we have given a concise overview of methods applied in the peer-reviewed literature to make inferences on extreme indicators. Furthermore, we have evaluated these methods for specific choices that researchers have made. These choices are (i) the choice of a specific type of stationarity, (ii) the choice of a specific PDF shape for the data (or residuals) at hand, (iii) the treatment of uncertainties and (iv) the coupling of extremes or disasters to climate change. We draw the following conclusions:
- In making a choice for treating data as stationary or non-stationary, good testing is essential. Inferences on extremes may be wrong if data are assumed stationary while they are not (cf. Figs. 2 and 3). Some researchers choose block-stationarity (blocks of 20 to 30 yr). However, climate may be non-stationary even for such short periods (cf. Figs. 1 and 4). Thus, such an assumption needs testing too.
- In calculating statistics such as average return periods, a certain PDF shape is assumed. We found that often more than one PDF shape fits the same data (cf. the Cooley-Wigley example, and Fig. 5). From a simulation study we conclude that both the GEV and the log-normal PDF fit very well to a variety of indicators (both symmetric and skewed data/residuals). The normal PDF performs well for data which are (i) essentially symmetrical in nature (such as extremes for temperature data) and (ii) have relatively short sample periods (up to ∼130 yr). The Gumbel PDF fits well for data which are skewed in nature (such as extreme indicators for precipitation). For symmetrical situations the Gumbel PDF does not perform very well.
- Statistical techniques are not available for all cases of interest. We found that theory is lacking for uncertainties for differential statistics of return periods, i.e. uncertainties for a particular difference [R_t − R_s]. For trends these statistics are available as long as OLS trends or structural time series models (STMs) are chosen (cf. Figs. 1 and 4).
- It is advised to test conclusions on extremes with respect to assumptions underlying the modelling approach chosen (structural uncertainty). Examples are given for (i) the application of different trend models to the same data, (ii) stationary versus non-stationary GEV models, (iii) evaluation of extremes for a suite of GCMs or RCMs to evaluate statistics in the future, and (iv) the role of the sample period length. An example has been given where the choice of a specific trend model influences the inferences made (Fig. 7).
- The coupling of extremes to climate change should be performed by spatio-temporal detection methods. However, in the communication of extremes to the media it occurs that researchers couple one specific exceptional extreme event or disaster to climate change. This (suggestive) coupling should be avoided (Fig. 8). Statistical inferences are always directed to chances for groups of data. They do not apply to one specific occurrence within that group.

www.clim-past.net/8/265/2012/

Table A1. Judgments of distributional fits for (i) simulated meteorological data (cf. Fig. 5) and (ii) daily precipitation data in the Netherlands. Meaning of codes: −− stands for a very bad fit; − stands for a bad fit; + stands for a good fit; ++ stands for a very good fit. These judgments are based on visual inspection of the actual fits and on p values of the Kolmogorov-Smirnov goodness of fit tests. All judgments are based on three repeated simulations (using different seeds in random number generation). Data in black are for 65 yr of simulation, in blue for 130 and in green for 1300 yr of simulation.
Column headings: (a) simulations based on normally distributed daily data: Normal, Log-normal, Gumbel, GEV; (b) simulations based on 100 yr of daily precipitation data in the Netherlands: Normal, Log-normal, Gumbel, GEV.

Simulation and PDF shapes
As described in Sect. 4.2, we have tested four PDF shapes frequently encountered in the literature, on the same data. The PDF shapes are: the normal, the log-normal, the Gumbel and the GEV distribution (of which the Gumbel distribution is a special case). For this test, we performed two groups of simulations, one yielding TXX_t "look alikes" and one yielding RX1D_t "look alikes". The first set is based entirely on random drawings from a normal distribution for daily values; the second set is based on real daily precipitation totals over the period 1906-2005 (De Bilt, the Netherlands). We varied the time series length N (65, 130 and 1300 yr) and the number of effective days N_eff (1, 60, 180 and 365 days). The latter parameter mimics the effective number of independent daily data within a year for a certain weather variable.
The distributional fit has been judged with two criteria: visual inspection of the QQ plot and the p value from the Kolmogorov-Smirnov goodness of fit test (p < 0.05: bad result; p > 0.80: very good result). See Fig. 5 for an example. Each judgment was repeated three times to rule out the influence of incidental deviating simulation results.
Table A1 shows that the log-normal and the GEV distribution give good fits for all simulations (all judgments are "+/++" or "++"). This result is independent of the specific choices made for N_eff or N. The normal distribution fits well for the TXX_t "look alikes" as long as time series are shorter than ∼130 yr and N_eff is smaller than 180 days. The fits for the precipitation simulations are moderate to bad throughout. For the Gumbel distribution, the situation is the other way around: a moderate result for the temperature simulations and a good result for the precipitation simulations. Time series of 1300 yr in length are the only exception here.
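The first group of simulations can be mimicked along the following lines. This is a simplified sketch under stated assumptions, not the code behind Table A1: daily values are standard normal, the annual maximum is taken over N_eff independent draws, the Gumbel parameters are fitted by the method of moments rather than by ML, only the normal and Gumbel shapes are compared, and only the Kolmogorov-Smirnov distance is reported (no p values or QQ plots). All function names are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def simulate_annual_maxima(n_years, n_eff, seed=0):
    """Annual maxima of n_eff independent N(0, 1) 'daily' values."""
    rng = random.Random(seed)
    return [max(rng.gauss(0.0, 1.0) for _ in range(n_eff))
            for _ in range(n_years)]


def normal_cdf(x, mu, sigma):
    return 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2.0)))


def gumbel_cdf(x, loc, scale):
    return math.exp(-math.exp(-(x - loc) / scale))


def ks_distance(sample, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n)
               for i, x in enumerate(xs))


def fit_and_compare(sample):
    """Fit normal and Gumbel shapes by moments; return their KS distances."""
    mean, sd = statistics.fmean(sample), statistics.stdev(sample)
    scale = sd * math.sqrt(6.0) / math.pi   # Gumbel scale from the variance
    loc = mean - EULER_GAMMA * scale        # Gumbel location from the mean
    return {
        "normal": ks_distance(sample, lambda x: normal_cdf(x, mean, sd)),
        "gumbel": ks_distance(sample, lambda x: gumbel_cdf(x, loc, scale)),
    }


for n_eff in (1, 60, 180, 365):
    distances = fit_and_compare(simulate_annual_maxima(130, n_eff, seed=n_eff))
    print(n_eff, {k: round(v, 3) for k, v in distances.items()})
```

Smaller distances indicate a better fit. Because the parameters are estimated from the same sample, standard KS p values would be too optimistic, which is one reason the judgments in Table A1 combine the test with visual inspection.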