Similarity estimators for irregular and age-uncertain time series

. Paleoclimate time series are often irregularly sampled and age uncertain, which is an important technical challenge to overcome for successful reconstruction of past climate variability and dynamics. Visual comparison and interpolation-based linear correlation approaches have been used to infer dependencies from such proxy time series. While the ﬁrst is subjective, not measurable and not suitable for the comparison of many data sets at a time, the latter in-troduces interpolation bias, and both face difﬁculties if the underlying dependencies are nonlinear. In this paper we investigate similarity estimators that could be suitable for the quantitative investigation of dependencies in irregular and age-uncertain time series. We compare the Gaussian-kernel-based cross-correlation (gXCF, Rehfeld et al. , 2011) and mutual information (gMI, Rehfeld et al., 2013) against their interpolation-based counterparts and the new event synchronization function (ESF). We test the efﬁciency of the methods in estimating coupling strength and coupling lag numerically, using ensembles of synthetic stalagmites with short, autocorrelated,


Introduction
Time series are often used to assess the properties of the processes that generated them, in climate science (Rehfeld et al., 2011) but also in many other scientific fields ranging from ecology (Lhermitte et al., 2011) to astrophysics (Scargle, 1989). Time series similarity measures quantify the degree of statistical association and are, particularly in the geoscientific context, often equated with Pearson correlation (Chatfield, 2004). They help to identify the strength of dependencies between climate processes and potential lead-lag relationships. For modern-day weather stations, both daily temperature and the time of observations are logged precisely. To identify relationships between distant weather evolution, time series of temperature anomalies can be compared. Paleoclimate data are crucial to investigate climate interrelationships beyond the instrumental record. Paleoclimate time series are, however, more challenging than the data sources in other disciplines: neither observation time nor the climatic variable are Fig. 1. Illustration: assume that the climatic process Y is driven by process X at a given lag. They are sampled by a paleoclimate proxy archive (X) and an automatized measurement device (Y ), resulting in corresponding time series. A typical task in paleoclimate data analysis is to estimate the strength of statistical association between such time series; the delay time can hint at physical driving mechanisms.
known precisely. Both have to be reconstructed, resulting in irregular and age-uncertain time series, because variability in the growth of the archive impacts on the temporal resolution of the resulting proxy time series (Fig. 1). The dependency of reconstructed paleoclimate time series, and their relationship to global or external forcing, is often inferred from similarities, coinciding maxima/minima or trends, between graphical visualizations of the time series (for example in Zhang et al., 2008Zhang et al., , 2011Cheng et al., 2012;Sinha et al., 2011). Visual comparison is, however, inherently subjective, cannot be quantified and tested in a hypothesis test and will not suffice with the growing number of paleoclimatic data sets available.
Standard statistical techniques, such as estimating the Pearson correlation (XC), cannot readily be applied when the sampling of the time series is irregular. XC is, in principle, computed by taking the arithmetic mean over the products of coeval, centralized and standardized observations and reflects the goodness of a linear fit to the scatter plot of the data. If the two time series to be correlated are irregular, coeval observations are only given in the special case that both time series have the same timescale. In practice, this would arise only if, for example, two proxies were measured on the same samples. In the general case the irregularity precludes the direct computation.
Interpolating the time series to a regular coinciding timescale, however, results in a loss of high-frequency variability and a spectral bias towards low frequencies (Schulz and Stattegger, 1997). In a comparison of correlation analysis techniques the Gaussian-kernel-based Pearson correlation was identified as a reliable and robust estimator for irregular time series (Rehfeld et al., 2011). However, relationships in the climate system are not always linear, and therefore not necessarily identifiable by linear techniques such as Pearson correlation. This is not a problem in the geosciences alone, and similarity measures that can capture nonlinear interrelationships exist. Mutual information (MI), an entropybased measure, has been used to investigate nonlinear dependencies of processes from observations (Donges et al., 2009;Runge et al., 2012;Hlinka et al., 2013). In this mea-sure, the joint and marginal distributions of processes X and Y are evaluated. Its advantage is that it is model free and able to quantify nonlinear dependencies, but it is symmetric, MI(X, Y ) = MI(−X,Y ), and more difficult to quantify as the quantification bias changes considerably for different sample sizes and estimator techniques (Khan et al., 2007;Kraskov et al., 2004). It has been adapted and tested for irregular and autocorrelated time series (Rehfeld et al., 2013) in a Gaussian-kernel-based variant. Both MI and XC depend on the notion of a scatter plot between the data.
An alternative, especially in the analysis of extreme events, could be found in the measure of event synchronization (ES, Quian Quiroga et al., 2002), which is not based on the available time series, but the relative timing of distinguished events in two time series. Originally conceived for neurophysiological signals, it has become a popular measure to investigate dependencies in precipitation time series (Malik et al., 2010(Malik et al., , 2011Rheinwalt et al., 2012), but it has not been tested for its suitability on short and autocorrelated time series. In its original form it provides a measure for the strength of synchronization and for the direction of a potential coupling between the processes generating the events, but not for the lag of the potential coupling. Although stated differently in the original paper, ES does not require regular observation intervals.
A number for an individual correlation coefficient can be interpreted, when its level of significance is determined as well. For the usually short and autocorrelated paleoclimatic time series, this can be done by bootstrapping the result (Mudelsee, 2002), or by testing the similarity for mutually uncorrelated surrogate time series with similar autocorrelation properties (Rehfeld et al., 2011(Rehfeld et al., , 2013. The values of the different estimators, however, cannot be compared directly, as they vary on different scales. In this paper we evaluate the impact of age uncertainty and time series irregularity on the accuracy of the estimators. Furthermore we propose the concept of a link strength, to summarize the hypothesis test results of different estimators. If no outcome is significant, it is zero, if three out of five employed estimators yield a significant similarity, the link strength is 3/5 and if all tests for null correlation were rejected the link strength is equal to unity. The advantage of this approach lies in its robustness due to the different estimators, and in the easy consideration of uncertain data sets. If the uncertainty of the time series can be modeled, for example using the Monte Carlo techniques in age modeling software such as StalAge (Scholz and Hoffmann, 2011) or COPRA (Breitenbach et al., 2012), it can be incorporated in the link strength considerations in a straightforward manner.
In this paper we will investigate how well each of these estimators identify the strength and the delay time of actual coupling between paleoclimatic processes from irregular and age-uncertain time series. First we review the similarity measures (XC, MI), and develop a event synchronization function (ESF) based on the concept of ES. We simulate artificial stalagmites with linearly and nonlinearly coupled proxy time series based on autoregressive (AR) and threshold-autoregressive (TAR) models. Using these and the stalagmite time series from Dandak (Sinha et al., 2007;Berkelhammer et al., 2010) and Wanxiang (Zhang et al., 2008) caves, we investigate how the similarity estimators perform for irregular, age-uncertain and autocorrelated time series, and how they are impacted by age uncertainty.

Methods
In this section we first give necessary definitions for time series and similarity measures, and derive the ESF and the link strength concept.

Time series
Time series are a collection of measurements of specific properties of a dynamical process, together with the time when the observation (or measurement) took place. The individual data points of the series are often regarded as observations of processes, which may be deterministic, stochastic, or a combination of both. In classical time series analysis the observation times of the process X t are expected to be regular and certain, and the observation values to be measured exactly.
In contrast to this, for irregular time series no unique sampling rate can be defined, and the observation times cannot be directly related to an index anymore, but have to be given explicitly for each measurement.
Definition 1 (Irregular time series) An irregular time series x(t) = (t i , x i ) is defined by its observation times t i and the respective observations x i , where i = 1, . . . , N. The two vectors have a common length N x , with t x 1 < t x 2 < · · · < t x N x as observation times.
In the following we focus on the age-uncertain paleoclimate proxy time series for which a growth model of the archive has been combined with pointwise age information, for example from uranium/thorium measurements. Input data to this age modeling are (i) a dating table with its entries containing depths, associated age estimates and their uncertainties, usually given as standard deviations, and (ii) the proxy observations.

Definition 2 (Dating table)
..,N dat contains N dat pointwise age estimates T i taken at depths D i and their corresponding age standard deviations σ T i .

Definition 3 (Proxy observation series)
Proxy observation series X d = (d j , x j ) are given for j = 1, . . . , N obs measurement depths d j and proxy measurements x j .
For paleoclimate archives, the ages at few depths are estimated, with some uncertainty. Age models are then created to Significant similarities between the time series at two locations, X and Y, can arise from (a) direct physical coupling, (b) a teleconnection, (c) a common driving mechanism or (d) by chance as false positives.
interpolate from these few dates to a time axis for the proxy time series, which is sampled much more densely in depth than the dating table. Thus, an age model is defined here as one potential depth-age relationship t i (z i ) out of the possible ensemble of age models T. For Monte Carlo (MC) age modeling, whole ensembles of age models, T are created, sampling the probability space inherent in the dating table (cf. def. 2). By convention, usually the most likely age model is selected as the time axis for proxy time series (Breitenbach et al., 2012;Scholz and Hoffmann, 2011). Finally the dating table is combined with the proxy observation series using a single-age model to form a time-uncertain time series.

Estimating similarity of irregular time series
Similarity measures reflect statistical properties of time series, which may not reflect the same climatic parameters. Different estimators focus on different characteristic properties related to the distributions of the observations. We summarize them in Table 1.
Assume that the processes X and Y generated time series x(t) and y(t). These processes, and the time series, are similar if, for example, coeval minima or maxima were observed. Comparison can then give information about functional relationships between processes underlying time series: given that two processes X and Y are not independent, there may either be a causal relationship or they are both driven by a global common driver, or there are unobservable intermediate processes, as illustrated in Fig. 2. A significant similarity estimate may therefore arise for such physical reasons -or as a false positive of the statistical test. If a transfer function between these two processes exists in a form Y t = F(X t+ ), this results in a repetition of a pattern, though maybe distorted, that occurs in X t at t 0 and in Y t at a time t = t 0 + later. A similarity estimator can help identify F and quantifies the similarities in the contemporary evolution of two time series: Definition 4 (Similarity estimator) A similarity estimator S = F ((t x , x)(t y , y)) reflects the similarity between x(t) and y(t) to a numeric value in an interval [a, b] For most similarity measures a = −1, b = 1 is considered, but for different estimators different bounds exist. Here we only require that the relationship between true dependency 110 K. Rehfeld and J. Kurths: Similarity estimators  Malik et al. (2010) and estimated similarity is monotonically increasing, which is what we test for using artificially generated time series. If the delay time in the transfer function is nonzero, a similarity function gives the similarity between two time series for increasing delay: Definition 5 (Similarity function) A similarity function S( ) gives the estimated similarity over different lag times : The spacing of the lag vector is uniform and depends on the mean time resolution of the time series: t = max( t x , t y ). To indicate that we are focusing on bivariate similarity we also use the alternative notation S(X, Y ) which does not explicitly refer to the possible lags.
Similarity measures as required in this context should be symmetric, reflexive, translation and scale invariant (Batyrshin et al., 2012). The estimators presented here fulfill these requirements.

Kernel-based estimators for Pearson correlation
Pearson correlation is defined as the mean over coeval products of standardized observations (Chatfield, 2004). For irregular time series the inter-sampling time intervals vary and the classical definition cannot be applied. Rehfeld et al. (2011) tested different correlation estimators for irregular time series and found that a Gaussian-kernel-based estimator performed best. In the definition of the correlation function ρ(k t) at the lag k t: the kernel b k (t y j − t x i ) weights those products higher whose time lag lies closer to k t: where h = t/4 or 0.25 for the rescaled time axis, t x i = t orig i / x t , and d denotes the distance between the product inter-observation time and the desired lag, d = t y j −t x i −k t; k denotes the lag index. The standard width parameter h is chosen to result in a main lobe width of t, the mean sampling interval or common sampling period in the bivariate case. Note that the observations have to be standardized to zero mean and unit variance before the analysis.

Kernel-based estimators for mutual information
Mutual information I (X, Y ) = I xy is a measure of the dependency (linear or nonlinear) between two random variables, X and Y . This measure from information theory can be interpreted as the uncertainty reduction in variable X, given that Y was observed. It is symmetric, that is, relationships of opposite sign but the same association strength, correlation and anti-correlation give the same MI. By definition, the measure yields a null result if, and only if, the two random variables, in this case time series of observations, are independent (Kraskov et al., 2004;Cover and Thomas, 2006).
While more complex estimators exist (e.g., Kraskov et al., 2004), the simplest estimator iŝ where p x,y is the two-dimensional joint probability density function of the variables X and Y and p x resp. p y are the one-dimensional probability distributions of X resp. Y . The unit of measurement of MI depends on the logarithm chosen in the estimator: it is measured in bits if the logarithmic base 2 is chosen, and in nats for the natural logarithm.
In case of irregular sampling, however, the bivariate observation set (X t , Y t ) at regular observation points t that are required for a scatter plot is not available. In standard interpolation procedures, both (t x , x) and (t y , y) would be resampled to obtain a bivariate set of observations with regular observation time intervals, (t r , x r , y r ). This is undesirable for paleoclimate records (a) because every interpolation routine involves an assumption on the dynamics of the underlying process, and this is difficult to justify for climate data, and (b) it reduces the observable variability in the process (Schulz and Stattegger, 1997;Stoica and Sandgren, 2006;Babu and Stoica, 2010).
There are two main points where this problem can be addressed: either by reconstructing bivariate observations while avoiding variance reduction as much as possible or by a modification of the joint distribution, for example by introducing weights proportional to the sampling time distance similar to the Gaussian-kernel-based XC (Rehfeld et al., 2011). For MI the latter is difficult to achieve. But following the former solution, the probabilities required for Eq. (4) are straightforward to derive from relative frequencies.
Algorithmically, this can be described as follows: 1. A local reconstruction of the signal is performed by estimating for each point i in the time series X = (t x , x) a corresponding observation from Y = (t y , y), by estimating a local, observation-time weighted mean y lr j around a time point t x i in Y , with the Gaussian-kernel-based local weight b k (d) defined as in Eq. (3). For MI the standard deviation of the Gaussian weight function is set to h = 0.5. If there are less than five observations y i available in a time window ±3 t around t x i this reconstruction is not performed. Repeating this for each time point 2. Afterwards the procedure is repeated by stepping through t y j , which yields X y = (t y j , x lr j , y j ) .
3. The local reconstruction Y x and the original observations Y are then concatenated into one series Y r = {Y ∪ Y x } combining locally reconstructed and original observations. Similarly, a time series X r = (X ∪X y ) is obtained.
4. Based on this set of bivariate observations (X r , Y r ) the joint density of X and Y can be estimated using standard binning estimators for MI.
The reconstructed set of bivariate observations can also be used to construct Gaussian-weighted scatter plots, where the size of the marker reflects the amount of weight placed on the reconstructed observation (cf. Figs. 4b and 5b). MI is difficult to estimate in practice, first and foremost because of the large bias effects produced in the inference of the joint and marginal probabilities. Elaborate algorithms have been devised to improve this (described, for example, in Kraskov et al., 2004;Papana and Kugiumtzis, 2009;Roulston, 1999), but no straightforward solution to this has been found yet. We have tested several algorithms and finally resorted to the most simple equidistant binning estimator (Kraskov et al., 2004), due to its computational efficiency and simplicity. Bias effects are predominantly tied to the temporal sampling and length of the time series due to the occurrence of empty bins. Thus, if necessary, we can estimate and subtract the bias using uncorrelated processes with the same observation times as in X and Y . However, for use as a similarity measure comparable to XCF and ES in the context of paleoclimate networks, we only require that the estimated MI be proportional to the actual association strength. For bivariate normally distributed and linearly correlated X and Y , MI is by definition proportional to their estimated correlation coefficient r 2 xy : and can, by inversion of this equation, be scaled to the positive range of the correlation coefficient so thatÎ ∈ [0, 1] (Nazareth et al., 2007). The expected value for mutual information of these processes at the lag of coupling is then given by MI(X(t), Y (t +l)) = −0.5 log(1−r 2 xy ). For the evaluation of the joint and marginal distributions, n bins = 10 equidistant bins were employed. In principle, the number of bins should be adapted to the respective length of the time series involved, to reduce bias effects from empty bins.

Event synchronization function
The concept of event synchronization (ES) was introduced by Quian Quiroga et al. (2002). The motivation behind the development was to obtain a simple, fast method that quantifies the synchronization between time series where certain events can be distinguished. The primary application was focused on neurophysiological signals (Quian Quiroga et al., 2002;Kreuz et al., 2009), but it was also applied later for the investigation of rainfall patterns in the Asian monsoon domain (Malik et al., 2010(Malik et al., , 2011 and Europe (Rheinwalt et al., 2012).
The main idea behind ES is that two time series are synchronized, if events in time series x occur close in time to events in time series y. Considering the temporal order of the events (e.g., if an event in y occurred before one in x), it is also possible to infer which process is leading. In the following we will define the event synchronization function, ESF, further developing the ES concept (Quian Quiroga et al., 2002;Malik et al., 2010).
Given two time series (t x , x) and (t y , y) that represent observations of autocorrelated stochastic processes, events are given by the set of observations that are considered extreme, in that their observation value lies above or below the q/2 resp. (1 − q/2) percentiles of the distributions of x and y. The actual value of the observation at the event points is not relevant for the further analysis. Once the events are defined, only the observation times are considered in the event time vectors t * x and t * x . Next a temporal threshold τ is defined to evaluate the relationship between the events in X and Y with a maximum separation time: Here, t x is the mean sampling rate of X, and t * x and t * y are the inter-event times in X and Y , respectively. Subsequently, the co-occurrence of events in X and Y is counted and summed for all events as where N x and N y , respectively, give the total numbers of events in X and Y . The counter variable J xy lm is defined as C(Y |X) is obtained by exchanging X vs. Y in the above expression, and combining both, gives the strength of the event synchronization and the direction of the association. Unless double counting of events occurs, these are normalized to 0 ≤ Q ≤ 1 resp. −1 ≤ q ≤ 1. Q = 1 corresponds to completely synchronous occurrence of events in X and Y , and q = 1 implies that all events in Y precede those in X.
For the previous studies (Quian Quiroga et al., 2002;Malik et al., 2010Malik et al., , 2011 local definitions of the temporal threshold τ were used, preventing, in most cases, events from being double counted, and adapting it to the local inter-event rate. The chosen definition of τ is motivated by the fact that, to be able to compare the results for ES to those obtained from MI and XCF, a similarity function over the delay is needed. Thus, the delay τ cannot be allowed to be arbitrarily large or small, as in Malik et al. (2010) or Quian Quiroga et al. (2002).
The ESF is obtained by shifting the observation times of time series X according to the desired lag: which, using the delay time τ from Eq. (7), makes it possible to use the ESF as a similarity function.

An approach to similarity assessment of time-uncertain time series
Age uncertainty is a key obstacle to be overcome for a comprehensive understanding of past earth system dynamics. To investigate the potential dependency structure of paleoclimate processes X and Y as they are reflected in natural archives, the contribution of age uncertainty to the uncertainty of the similarity S(X, Y ) is important. Thus the aim is to estimate the distribution p(S(X, Y )) of similarity for given data sets X and Y , where Both input data sets consist of a dating table (Def. 2) D with dating depth vector D, the corresponding estimated ages T and their uncertainties σ T y and a set of proxy measurements X d resp. Y d (Def. 3), visualized as Step 1 in Fig. 3. The smoothing resulting from the size of the samples in depth direction, σ D , is assumed to be negligible here. The input proxy measurements are mapped to observation times in the age modeling process. In general, algorithms to assess similarity between time series are not capable of processing probability distributions or confidence intervals instead of singleton values, neither for the observation times nor for the measurement values.
For Pearson correlation, an analytical approach to propagate the uncertainty around the input data into the correlation estimate is possible. However, Pearson correlation alone is insufficient to characterize similarity between paleoclimate time series in general and in the context of paleoclimate networks. Therefore, a Monte Carlo-based approach based on time series ensembles which are obtained via age modeling is used here, to keep the flexibility regarding similarity estimators: 1. In a first step the input data sets X and Y are processed.
The monotonicity of the depth control variables, d and D is checked.

A Monte
tively, for all i = 1, . . . , N X dtg pointwise age estimates corresponding to j = 1, . . . , N Y dtg entries in the dating table. This results in dating matricesX andŶ with N ens columns containing the sampled ages. If no distribution of ages is otherwise given, the ages are expected to be Gaussian distributed with the given standard deviation. 4. Each of the members of the ensemble of proxy time series is used as an input to the similarity statistic S(X, Y ). This results in a distribution of estimates p(S(X,Ŷ)).
5. Analysis of distribution S(X,Ŷ): apart from inspection of mean, variance and skewness of this distribution, a hypothesis test can be conducted, comparing S(X,Ŷ) with a distribution obtained from suitable surrogate time series S(X * ,Ŷ * ).
This approach is general in the sense that it is independent of the specific function F([X,Ŷ]) that maps the uncertain input to some output estimate. Apart from F = S, F may represent any bivariate statistic, and with minor modification is also applicable to calculate the influence of sampling uncertainty on univariate statistics, like the autocorrelation coefficients or persistence times (Rehfeld et al., 2011;Mudelsee, 2002). Bivariate similarity assessment is often concerned with estimation of a potential coupling strength α (hinting towards the same process of origin) and/or the lag of coupling for model-building. For Pearson correlation, the ratio of shared vs. total variance between two linearly correlated processes at a given lag , S( ), is given in the maximum of the crosscorrelation function. While the relation to the overall variance of the processes does not necessarily hold by definition for other similarity measures, they, too, will observe the maximum of their similarity function max(Ŝ), at the lag of coupling .

Synthetic data
"True" growth histories for two synthetic stalagmites SS1 and SS2 and according climate histories are obtained via simulation. These pseudo-archives are then "dated", climate histories are "sampled". Then the age modeling procedure is performed and its output is fed into similarity estimation. Finally, we assess how much of the similarity that was originally present in the climate history is still recognizable significantly, considering the uncertainties. The test strategy is illustrated in Fig. 3.

The synthetic stalagmite
A synthetic (or virtual) stalagmite is grown for the sensitivity analysis. The main parameters controlled are -the growth rate λ in mm yr −1 , -the total length of the stalagmite (in mm), -the type of accumulation (linear growth, or growth modeled via randomly distributed accumulation rates).
A growth rate of µ(λ(z)) = 1 mm yr −1 is chosen. Linear growth may be a reasonable first order approximation (Telford et al., 2004), but microscopically, the growth rates of natural archives vary. Therefore, Gamma-distributed accumulation times are drawn for each depth z i = {0, . . ., Z}mm of the stalagmite, with the sampling time step mean µ(λ(z)) determined by the desired growth rate and shape and scale parameters α and β as (α, β) = (α, µ(λ(z))/α). This way, the mean sampling rate can be kept constant, even when the irregularity of the sampling distribution is changed (Rehfeld et al., 2011). The cumulative sum of the accumulation times then gives the "true" ages of the archive at the depths z i :

The simulated climate history
We attach each synthetic stalagmite SS1 and SS2 to a climate history. The climate/pseudo-proxy simulation is based on the assumption that SS1 lies in an area whose climate is controlling that around SS2, through a teleconnection or, for example, by being situated downstream of the same monsoon branch (cf. Fig. 2). We simulate climate variability using two different coupling schemes, one linear, one nonlinear, to investigate how the proposed methods perform.

K. Rehfeld and J. Kurths: Similarity estimators
Linearly coupled AR(1) processes Assuming that the archive SS2 samples the same climate variability as SS1, in the same way though at a later time, we model such a causal sequence using coupled AR(1) processes. Then, the true proxy history of climate as recorded in SS1 is given by and it determines part of the proxy history of SS2: Here, ε and ξ are additional Gaussian white noise whose variances σ ε and σ ξ are scaled such that the variances of X and Y are equal to unity. α ∈ [−1, 1] is the coupling strength between SS1 and SS2 and φ the autocorrelation of SS1.
Since there is no autocorrelative term in Y t , the true similarity S(X, Y ) is equal to the cross-correlation: S(X, Y ) = ρ xy = α (Rehfeld et al., 2011).

Nonlinear threshold-AR(1) processes
Let us assume that SS1 samples climate variability in a certain place, and that this can be modeled as in Eq. (15). Then the climate variability in another place, where SS2 is located, could be controlled in a nonlinear manner: the processes are negatively correlated, similar to Eq. (16) with α < 0. If, however, a threshold in the climate system is exceeded, X(t) > τ , the correlation changes and might even become positive. Such a multi-scale behavior can be modeled using threshold-AR processes (TAR, Tsay, 1989), which are similar to the regime-dependent AR models Zwiers and Storch (1990) used to model the behavior of the Southern Oscillation. Assume that the negative coupling α below the threshold τ , here τ = 0, for X(t −1) τ turns into a positive correlation, with the same magnitude, for X(t − 1) > τ . Then the proxy history of SS2 can be modeled as where the κ = −1 if X(t −1) τ and κ = 1 when X(t −1) > τ . For convenience, the variance of the innovation term ξ is scaled such that the overall variance of Y is equal to unity in both cases.

"Dating" of the synthetic stalagmite
Mimicking the real-life situation, the true growth history of the synthetic stalagmite z(t true ) is, in the following, inaccessible. The stalagmite is subjected to dating along its depth. The dating table contains the dating depths D, the estimated age at these depths T j , the proxy measurement sample width σ D and the age uncertainty σ T . In real life, the stalagmite would be dated using radiometric dating techniques based on uranium-thorium (Sinha et al., 2007;Dykoski et al., 2005;Breitenbach et al., 2012) or radiocarbon (Yadava et al., 2004;Webster et al., 2007), yielding an estimate of T (z j ) at a few points. The corresponding dating uncertainty, in reality dependent on many factors from initial isotope concentrations, overall age of the core, dating technique, lab and contamination (Fairchild and Baker, 2012), often lies between 0.1 to 0.5 % of the age for stalagmites, but may be considerably higher.
For the synthetic stalagmites, dating "samples" are taken at equidistant depths D j and the center points of the assumed age distribution are taken directly from the true age-depth relationship. The age uncertainty, however, is modeled as increasing proportionally with age, as p· T j . p here denotes the (im-)precision of the dating and is varied in the following numerical experiments.

Age modeling for SS1 and SS2
Age modeling aims to reconstruct the "true" depth-age relationship that is inaccessible in real paleoclimate archives.
Based on the synthetic stalagmite dating tables D x and D y for SS1 and SS2, the "observation times" for the proxy observations X d and Y d , t x and t y , are constructed by interpolation from the known ages (see Eq. 13). In Monte Carlobased numerical frameworks such as StalAge (Scholz and Hoffmann, 2011) or COPRA (Breitenbach et al., 2012), an ensemble of age models T = {t k , z k } k=1,...,N ens is created, which, in their entirety, reflect the age uncertainty of the estimated depth-age relationship. Based on this ensemble of age models, the uncertainty in the similarity estimates can be inferred, as is visible in Fig. 3.
In summary, the test plan is thus as follows: 1. Simulate a growth history z(t) of a synthetic stalagmite of length Z mm, corresponding to a "true" agedepth relationship t true i (z i ), resp. z i (t true ). For this, assume gamma-distributed growth and an accumulation rate λ = 1 mm yr −1 . Z can be varied to study the influence of changing time series length.
2. Simulate proxy histories {T , x} SS1 and {T , y} SS2 according to the true growth history using coupled autoregressive processes (cf. Eqs. 16 and 17). Forget the true growth history.
3. Sample the true growth history at the dating depths and infer corresponding uncertainties.
4. Create N ens surrogate dating tables for SS1 and SS2 with increasing uncertainty of the ages according to the (im)precision p (i.e., an ensemble of dating tables).
5. Assess if the estimates S(X,Ŷ) are statistically significant for the given uncertainty, and how they are influenced by sampling heterogeneity and time uncertainty.
The core of the COPRA algorithm is used for MC simulations. N ens = 2000 MC iterations are used to sample the probability space and linear interpolation is employed to infer ages between point estimates of the age at depth.

Tests on synthetic stalagmites
We evaluate the performance of the different estimators described in Sect. 2, for which parameter choices and references are given in Table 1.

Characterization of linear proxy dependency
We first consider the linear dependency case, where the proxy history of SS1 is linearly correlated with that of SS2 a lag time later. We chose a length for the stalagmite of L = 100 mm for which we expect the time series to be roughly 100 yr long (cf. Sect. 2.3.2) and linearly correlated, as in Fig. 4a. For each test 100 time series were generated from AR1 processes (cf. Sect> 2.3.3), where process Y is coupled to process X at an intrinsic lag and with a coupling strength α. The autocorrelation parameter was set to φ = 0.8, the coupling lag to = 5 and the coupling parameter to α = 0.6. For such stochastic processes, the true similarity function is single peaked, with its peak height determined by α, and its location on the lag-axis by the coupling lag . The time series are irregular, therefore a direct scatter plot of the data is not possible. Figure 4b shows a weighted scatter plot where the time series have been reconstructed using Gaussian weights, as for the MI estimation in Sect. 2.2.2.
The tests were guided by two questions: do the similarity estimators reflect the actual similarity (here, the coupling strength at lag , α) truthfully and monotonically? and, how well do they identify the lag of coupling as the maximum of the similarity function?
To answer the first question, we fix the imprecision at zero (at the dating points) and vary the coupling strength by setting the parameter α in Eq. (16) to values from 0.1 to 1. The results are given in Fig. 4c. The expected value of the similarity, α est , and the variance of the estimate are computed from the mean and standard deviations of the estimated, α est,i , for 100 realizations for each value of the coupling parameter. Each of the similarity measures returns estimates whose expectation values increase monotonically with the actual similarity, α true in Eq. (16), except for the ESF, which has a single reversal which may be due to the low number of MC realizations (100) for each point in this diagram.
In practical data analysis, the potential lag and strength of (primary) coupling, identified as the maximum of the similarity function is of interest (e.g., for model-building). If no age uncertainty exists at the dating points, the maximum of the similarity function is correctly identified in 50-60% of the ensemble cases. When timescale uncertainty exists in the time series, this becomes difficult quickly (Fig. 4d). When the fraction of correct identifications has dropped to 1 n ≈ 0.05, where n is the number of lags for which S( ) has been estimated, the maxima of the similarity functions are perfectly uncorrelated. This limit is approached as an imprecision of more than 10 % is reached. Increasing imprecision contained in the time series also results in increasing estimation error (i.e., root mean square error(RMSE)) for the similarity at the lag of coupling, S( ) (results not shown). When the stalagmite length is increased, the time series length increases and both the RMSE and the false identification rate decreases for all estimators.

Nonlinear dependencies
For the nonlinear TAR model, the time series in Fig. 5a are not as straightforward to compare visually as the linearly coupled ones in Fig. 4a. The weighted scatter plot for these time series in Fig. 5b shows the two different slopes of the positive and negative correlation regimes above and below the threshold value of zero.
The comparison of true vs. estimated coupling strength α in Fig. 5c shows no monotonous behavior for the linear correlation measures and no overall increase of their expected similarity estimates with the coupling strength. The MI estimators retain a monotonic increase, starting from a considerable bias value, while the ESF increases monotonically, but does not show consistent similarity estimate increases until the coupling strength is rather large. The monotonicity and linearity of the response for gMI, iMI and ESF improve considerably when the time series are chosen longer, that is, with a length of 200 or more (results not shown).
In the identification of the maximum lag the Gaussian MI succeeds most often for imprecisions up to 2.5 %. For more imprecise data sets the ESF remains stable, while the other measures perform worse and worse. The linear estimators, gXCF and iXCF do not identify the maxima correctly, neither the coupling strength, nor the lag of coupling.

Error source attribution
Age uncertainty has a considerable impact on the accuracy of similarity estimates, as we have shown in the previous section. But to what extent can this impact be attributed to the short length of the time series, or the time series irregularity that results from the increasing age uncertainty? The uncertainty around the ages in the dating table is, in Monte Carlo-based age-depth modeling, reflected by drawing different "dates" from distributions around these ages for each MC realization. These realizations will therefore have different partial slopes between any date D i and D i+1 . This corresponds to different estimated growth rates for the individual segments of the synthetic core. At a proxy sampling rate over depth that is constant, this will lead to uneven observation times for the time series which correspond to the MC realizations, and this irregularity increases with the age uncertainty. The RMSE of S( ) is, however, also dependent on the irregularity of the time series, as it was shown for both XCF and MI previously (Rehfeld et al., 2011(Rehfeld et al., , 2013. To separate these sources of uncertainty, M = 2000 realizations of coupled climate histories, as defined in 2.3.2, were generated in three different ways: age uncertain, irregularly and regularly sampled. The age-uncertain ensembles were the direct product of the age modeling efforts, as in the previous sections and with same parameter settings (φ = 0.8, α = 0.9, = 5) For the irregular data set the proxy histories were re-generated with the true coupling strength on the irregular timescales of the age modeling output. To assess the impact of regular sampling, regular time series of the same length, average temporal spacing and coupling scheme were also simulated. We evaluated the performance of the different estimators for the different sampling schemes at increasing dating imprecision using the root mean square error (RMSE) of the estimators for the target coupling parameter α: where bias(α est ) = α true − α est . We did this separately for each sampling scheme to obtain the RMSE reg , the "baseline" RMSE for each estimator under regular sampling, RMSE irreg for the irregularly sampled ensembles and the RMSE au for the age-uncertain ensemble. Coupling strength, autocorrelation and time series length were fixed to the same values for the three different sampling schemes. To improve the comparability for the MI estimators, the bias offset was estimated from mutually uncorrelated time series with the same autocorrelation and length and subtracted prior to the conversion to the XCF scale.
Based on the assumption that the RMSE should increase from regular to irregular to age-uncertain time series, RMSE reg < RMSE irreg < RMSE au , the "baseline" contribution is estimated from regular time series as RMSE reg , the additional contribution from timescale irregularity as RMSE irreg − RMSE reg and the additional RMSE of the ageuncertain time series' similarity as RMSE au − RMSE irreg .
The results, averaged over the realistic imprecision values (the 2nd-5th points in Figs. 4d and 5d), are given in Fig. 6.
Ideally the RMSE should of course be as small as possible. For the linear (CAR) case in Fig. 3.3, the smallest RMSE is observed for the ESF and the gXCF, the largest -by far -for the interpolation-based iXCF. While the regular (estimator) bias is low for the correlation estimators, the contribution of increasing irregularity of the time series sampling (due to the uncertain inputs) is non-negligible particularly for the interpolation-based cases. The age uncertainty alone accounts for additional, but generally smaller, error. While a large amount of the uncertainty of the interpolation-based estimators, iMI and iXCF, is due to sampling irregularity, ES has a large RMSE for regular time series, which is even higher than that for regular to slightly irregular time series. Therefore the contribution of irregular sampling to the cumulative uncertainty, as depicted in Fig. 3.3, is negative, thus improving the estimation efficiency! In the nonlinear (TAR) case the picture is quite different. The correlation-based estimators are not able to tell the coupling strength, regardless of the sampling scheme. The gMI estimator ranks lowest, with a lower uncertainty contribution from irregular sampling compared to the iMI estimator. The ESF, again, improves its accuracy when the time series are irregular. The overall error level is higher than for the linear case.  Fig. 6. Attribution of the uncertainty to its sources for (a) the linear CAR model and (b) the nonlinear TAR model: general (estimator) error in red, error introduced via irregular sampling (orange) and additional error due to the age uncertainty (yellow). The source-dependent RMSE was averaged over the second through to fifth imprecision levels given in Figs. 4d and 5d, as these correspond to the error levels most likely found in real-world studies. Errorbars indicate the associated standard deviation. For event synchronization the RMSE is lower for irregular than regular sampling, folding the irregular part of the bar backwards.

The link strength concept
Each of the tested similarity estimators comes with different underlying assumptions, estimator bias and variance, and they refer to different properties of the time series: the goodness of a linear fit to the joint distribution (XCF), the sharpness of the joint vs. the marginal distributions (MI) or the relative positions of extreme points, or events, in the time series (ES). Therefore direct results obtained from the different estimators are difficult to compare, and they respond to coupling strength increases differently (Figs. 4c and 5c). The MI estimates, to this end, have to be converted to the XCF scale and thus are bound to the interval [0, 1], not [−1, 1] as for XC. This, together with the substantial and non-negative bias, induces a different proportionality between the actual coupling and the inferred association strength. Inferred ES, on the other hand, increases nonlinearly, but monotonically, with the coupling.
The main use of similarity measures is to assess the association strength between dynamics of processes. This can only be interpreted properly, if the significance of this estimate is known. To unify the results obtained from different similarity estimators, we propose to use a link strength p(X, Y ), to ho-mogenize and summarize the results obtained for individual similarity measures.
The link strength p(X, Y ) for two observed time series X and Y is defined as the relative frequency of significant estimates from the N sim employed estimators S i : as illustrated in Fig. 7. The link strength of the individual estimators, P q i (X, Y ) is recorded on a binary scale: where S hi/lo refer to the critical values of a hypothesis test, the null hypothesis being that both X and Y are autocorrelated, but mutually uncorrelated, Gaussian distributed stochastic processes. The significance q determines the critical values S hi,xy i and S lo,xy i which are obtained from the q hi = 1 − 0.5q and q lo = 0.5q quantiles of surrogate similarity estimates S i (X * , Y * ).
Independent AR(1) surrogate time series X * and Y * are generated on the same time axes as X and Y according to Eq. (15). The individual AR(1) persistence time for actual paleoclimate data can be obtained using an efficient leastsquares fitting algorithm (Rehfeld et al., 2011;Mudelsee, 2002). The link strength can be extended to incorporate age uncertainties by computing the similarities for N mc realizations of an age model and adding a second summation over these in Eq. (19).

118
K. Rehfeld and J. Kurths: Similarity estimators 4 Application to real stalagmite data Now after having ensured the efficacy of the estimators using synthetic data sets, we apply the estimators to real-world stalagmite data sets from India, (the Dandak cave δ 18 O record originally published in Sinha et al., 2007), and China (the Wanxiang record, Zhang et al., 2008). Comparisons of these data sets have been performed by Berkelhammer et al. (2010) and Rehfeld et al. (2011). Thirteen U/Th dates constrain the age model of the Dandak cave record, 19 are available for the Wanxiang cave record. Age modeling was performed on the full proxy data sets, comprising of 1875 and 703 oxygen isotope measurements over depth and using the COPRA algorithm with 1000 realizations (Breitenbach et al., 2012). The time series were cut to the overlapping time period from 600 to 1550 AD and detrended by subtracting the long-term mean, estimated using a Gaussian kernel smoother with a width W of 1000 yr. Berkelhammer et al. (2010) determined an averaged correlation of 0.27 for 50 yr overlapping time windows, while Rehfeld et al. (2011) found a lag zero correlation coefficient of 0.290 and 0.295 for iXCF and gXCF, respectively. This correlation was found to be significant at the 95 % level in the two-sided test for zero correlation, the null hypothesis being that the time series are autocorrelated but mutually uncorrelated.
Does this correlation persist, when the age uncertainties are considered in the analysis? We estimated the similarities for the two records considering all five estimators of Table 1 and for the original records as well as the results from age modeling, and give the results in Fig. 8. The histograms of similarity estimates for 100 realizations of the age models show a considerable spread. The mean similarity for the correlation estimators (indicated by the solid red line in Figs. 8a and 8b) is higher than that of the 95 % quantile of the surrogate distribution. The mean gMI estimate (8c) is close to the critical value, while the iMI (8d) and ES (8e) results lie well below. The median link strength (red line in Fig. 8f) is equal to 0.4. In contrast, the original age models published by Berkelhammer et al. (2010) and Zhang et al. (2008) yield significant results for all estimators except the ESF, resulting in an overall link strength of 0.8.
When we compute the similarities using the COPRA ensembles for the more sparse Dandak δ 18 O time series published earlier (Sinha et al., 2007) the outcome is quite different -the link strength is only 0.2.

Discussion
Age uncertainty clearly affects all estimators of similarity for time series, and it is an illusion that it would be possible to mitigate the effects of uncertainty on the time axis for any type of analysis depending on observation times. Even if the observation -or accumulation -time of a grown archive is known precisely at some depths, an observation time reconstruction from age modeling requires an assumption on the accumulation behavior which, necessarily, will be wrong to some extent, as stochasticity and irregularity in the growth will always be present. This is a fact not challenged by the choice of a different interpolation routine (e.g., to a continuous cubic spline), which is often preferred by geoscientists (Breitenbach et al., 2012;Scholz and Hoffmann, 2011). On the positive side, and although counterintuitive, incorporating (small) age uncertainty in the analysis might even improve the estimate when a deterministic (thus necessarily wrong) assumption on the growth of the archive is made.
A low imprecision of 0-0.5 % or an age uncertainty of approximately 1-2 yr over a period of 200 yr results in minimal relative estimation error and maximal confidence on the similarity peak position for the time series similarity functionŝ S. If a similarity analysis for real-world data sets covering a time span of 100 000 yr was desired, this would amount to an "allowed" age error of 500 yr at a mean time series resolution of 500 yr, which is a lower than what is usually found (Taylor et al., 2004). Thus, the resolution desired in the analysis is necessarily dependent on age uncertainty -only if that is lower, or comparable, would an analysis of such short time series with full consideration of age uncertainties be feasible. One way to achieve higher certainty could be the incorporation of layer-counted data in the age modeling process, for example, for annually laminated archives (Breitenbach et al., 2012).
The similarity estimators tested show different behavior, dependent on the signal type. The correlation-based estimators perform better for the linear coupling scheme, but fail for the nonlinear processes.
The gXCF and iXCF error split is dominated by the age uncertainty as the largest source of error in the linear CAR case. Both have small baseline bias for regular sampling. gXCF estimates coupling strength more effectively, however, for both age uncertainty and irregular sampling contributions of iXCF are significantly larger due to interpolation effects. In the nonlinear coupling scheme there is little difference whether the time series is regular, irregular or age uncertain -the correlation-based methods cannot capture such type of dependencies.
gMI and iMI perform badly on the first glance in the linear CAR case, as their baseline bias for regular sampling RMSE is large. However, one needs to take into account that the RMSE is determined by both variance and bias -and that MI estimation, especially using binning estimators, is always associated with a significant positive bias, particularly for short time series. This bias, however, decreases with increasing time series length. If a direct comparison of MI and XC estimates is desired, this bias should be subtracted from the MI estimate prior to scaling it to the correlation scale. In the nonlinear TAR case the Gaussian-kernel-based version has the lowest overall RMSE.
The ESF, originally intended for the analysis of event series, performs well and has the lowest total RMSE, followed closely by gXCF, in the linear test case. There, its baseline RMSE dominates the RMSE split, and the RMSE for irregular sampling is lower than that for regular sampling. This is similar for the nonlinear processes. One reason for this might be that, for irregularly sampled time series of the same mean observation time distance, the number of observations spaced closely together is higher, which might increase the chances to find multiple events spaced closely together, resulting in effective double-counting of events. The comparably small contribution from age uncertainty in the linear test indicates that neither the relative nor the absolute observation time distance between the time series are crucially important to the measure. Thus, it is quite a robust similarity measure with respect to age uncertainty and comparable to gXCF for linear coupling and gMI for nonlinear coupling, which both ultimately depend on the notion of simultaneous observations.
Although the irregularity of the time series is rather low (the inter-sampling-time distribution is narrow and close to normally distributed) the estimators that do not require the time series to be sampled regularly perform better than the interpolation-based records, which confirms the previous finding (Rehfeld et al., 2011(Rehfeld et al., , 2013) that large sampling irregularity (i.e., the presence of gaps) leads to large interpolation bias, where the adapted estimators gXCF and gMI are particularly suitable. We have applied the similarity estimators to investigate the similarities between the Dandak and Wanxiang cave records. We find that the link strength aptly summarizes the results of the similarity significance tests: the time series are quite likely to be correlated, but age uncertainty blurs the results. There are several other parameters which can have a critical impact on the analysis: the choice of the significance level for link strength estimation, the detrending width and the respective resolution of the time series. The dependence of the results on the detrending parameter ( Fig. 9) illustrates the timescale dependence of the analysis: a small detrending width W results in a high-pass filter and very low link strengths, large W yields high similarity on larger timescales. This indicates that the paleoclimatic records are more clearly associated at centennial to multicentennial timescales than at decadal timescales, which are more impacted by age uncertainty. A higher temporal resolution of proxy measurements improves the accuracy of the estimators, particularly for the data-demanding MI estimators. Bootstrapping of the time series to successively lower lengths could be used to test the robustness of the estimators against such effects.
We have only considered five similarity estimators (gXCF, iXCF, gMI, iMI and ESF) here, but this could be expanded for other concepts, for example, based on (cross-)recurrence plots (Romano et al., 2005;Marwan et al., 2007;Marwan, 2002;Lange, 2011), recurrence networks (Feldhoff et al., 2012), convergent cross mapping (Sugihara et al., 2012) or distance measures (Lhermitte et al., 2011). The notion of Fig. 9. Sensitivity of the link strength result for the original records of Berkelhammer et al. (2010) and Zhang et al. (2008) to changes in the detrending parameter W of a Gaussian-kernel detrending and the significance level in the hypothesis test. a link strength, instead of XC, MI or ES values, makes it straightforward to extend the analysis to a whole ensemble of time series, be it from age modeling or out of a database of paleoclimate records. If age uncertainty does not impact the cross similarity, the link strength will not drop substantially. The actual value of the link strength can be interpreted in terms of a "degree of confidence": if the value is close to the significance level, a relationship cannot be concluded with confidence. If the link strength is close to one, all the estimators return significant similarity estimates and a similarity can be deduced with certainty.
In the future it could be evaluated whether p values from the surrogate tests can replace the binary thresholding for the link strength metric to improve the sensitivity of the link strength estimate. The ESF alone, however, could be particularly suitable for the analysis of extreme events since it does not place strong restrictions on the time series beyond stationarity, and performs particularly well for irregular time series.
The NESToolbox containing scripts and programs for the similarity analysis of age-uncertain time series in Matlab and the open source software Octave are available with this paper. We also include a function to simulate age uncertainties that arise for archives for which the chronology is based on layer counting, trees, ice cores or laminated sediments, so that these, too, can be investigated using the methods presented in this paper.

Conclusions
In this paper we have investigated similarity estimators that do not require regular sampling in time and can capture linear (gXCF) and nonlinear (gMI and ESF) relationships. We found that interpolation to regular spacing of the observation times results in worse estimates. By contrast, the adapted estimators are more efficient in the presence of sampling time irregularity and cope with age uncertainty better. Ta-ble 1 gives a comprehensive overview over the similarity estimators, parameter choices and further references. gXCF and ESF perform particularly well if the relationship is linear, but the correlation estimator fails in the presence of nonlinear coupling, where the ESF and gMI are better suited to infer dependences. The significance of results from different estimators and under varying time series length and sampling can be unified using the concept of a link strength. It combines similarity estimators and significance tests and is given by the relative frequency of positive significance tests and could be especially useful in the analysis of large paleoclimatic data sets where it is infeasible to check each pair of time series for similarity individually. We have shown that age uncertainty is the largest contributor to estimation error for time series similarity, and for a reliable of similarity function shape and coupling structure, the timescale imprecision should be as low as possible. When it exceeds 5 % of the time series length coupling phenomena on timescales close to the sampling resolution can no longer be deduced. While time series irregularity can be well addressed by the use of the adapted estimators, age uncertainty cannot, and should therefore be reduced as much as possible by measuring more ages, improved dating techniques or the use of additional temporal information from layer counting (Breitenbach et al., 2012) where possible. This is, in essence, good news, because the irregular growth of the archives cannot be reversed, but measurement devices can be optimized.