Palaeo-sea-level and palaeo-ice-sheet databases: problems, strategies, and perspectives

Sea-level and ice-sheet databases have driven numerous advances in understanding the Earth system. We describe the challenges and offer best strategies that can be adopted to build self-consistent and standardised databases of geological and geochemical information used to archive palaeo-sea-levels and palaeo-ice-sheets. There are three phases in the development of a database: (i) measurement, (ii) interpretation, and (iii) database creation. Measurement should include the objective description of the position and age of a sample, description of associated geological features, and quantification of uncertainties. Interpretation of the sample may have a subjective component, but it should always include uncertainties and alternative or contrasting interpretations, with any exclusion of existing interpretations requiring a full justification. During the creation of a database, an approach based on accessibility, transparency,


Introduction
The rapid acquisition of palaeoclimate data and the development of strategies to assimilate these data into models have resulted in a growing need for open-access and user-friendly databases with the goal of machine readability (Overpeck et al., 2011). Within the palaeo-sea-level and palaeo-ice-sheet communities, there is the further requirement of standardisation (Hijma et al., 2015). These communities use field data to reconstruct the elevation of past sea levels and the dimensions and extent of former ice sheets. As an example of assimilation of data into models, databases of sea-level index points have constrained model estimates of the rates of glacial isostatic adjustment (GIA) during and following the last deglaciation (e.g. Milne et al., 2005; Bradley et al., 2011; Engelhart et al., 2011; Whitehouse et al., 2012; Peltier et al., 2015; Roy and Peltier, 2015). The results from these studies have contributed, in turn, to estimating current rates of ice-sheet mass loss and sea-level rise from geodetic observations (Vaughan et al., 2013). Other databases have been used to assess the magnitude of the sea-level highstand during the last interglacial period (Kopp et al., 2009; Dutton and Lambeck, 2012) and improve our understanding of global ocean volume and earth dynamic topography during the Pliocene (Rowley et al., 2013; Rovere et al., 2014, 2015). Likewise, the worldwide timing of the Last Glacial Maximum (e.g. Clark et al., 2009) and global deglaciation of valley glaciers (e.g. Shakun et al., 2015) have been determined from ice-sheet databases.
The generation of databases of past sea-level changes began with Daly (1934) and Godwin (1940), with early examples of reconstructing temporal changes in former ice-sheet margins by Prest et al. (1968) and Bryson et al. (1969). The need for standardisation among studies as new sea-level data emerged was recognised and implemented by International Geoscience Programme (IGCP) projects, starting with IGCP Project 61 in 1974 (van de Plassche, 1986). Subsequent IGCP projects produced Holocene databases in the United Kingdom (Shennan and Horton, 2002), the US Atlantic coast (Engelhart and Horton, 2012), South America (Milne et al., 2005), and elsewhere (Khan et al., 2015). Several recent studies have constructed deglacial databases of ice-sheet retreat, but they have used different criteria and approaches to data assimilation (e.g. Dyke, 2004; Clark et al., 2009; Tarasov et al., 2012; Briggs et al., 2014; Hughes et al., 2016; Stokes et al., 2015; Stroeven et al., 2015).
The process of setting up a sea-level or ice-sheet database can be divided into three phases: (i) measurement, (ii) interpretation, and (iii) database creation. In this paper, we build on the results of PALSEA (PALeo constraints on SEA level rise; Siddall et al., 2010) workshops over the last 8 years to report the main challenges identified for each phase and the possible solutions that can be adopted.

Measurements
A common denominator of palaeo-sea-level and palaeo-ice-sheet data is that they originate from two types of direct measurements. Field measurements are taken to determine the position, location, and elevation of a particular feature (e.g. a fossil coral or a glacial deposit). Metadata, such as cross sections and photographs, may also be used to illustrate the local geological and geomorphological context. Laboratory measurements include establishing the age of a feature (e.g. a 14C or cosmogenic surface exposure age) which was sampled in the field. Sample information on location, elevation, and shielding for cosmogenic surface exposure ages is critical for recalculation of ages as inferred production rates change (Balco, 2011).
Any measurement of palaeo-sea-level and palaeo-ice-sheet data needs clearly specified measures of uncertainties. The scientific value of the data is maximised if uncertainties are reduced, but missing information often exacerbates difficulties in quantifying uncertainties. For example, uncertainties related to the elevation of a sea-level index point are potentially large if the original study did not indicate the tidal or geodetic data to which the elevation is referenced (van de Plassche, 1986). Elevation errors, which greatly affect palaeo-relative-sea-level (RSL) calculations, can be avoided by employing state-of-the-art GPS and levelling techniques (e.g. Muhs et al., 2011; Rovere et al., 2015). Despite this, high-accuracy GPS systems are to date seldom applied to measure Quaternary and Pliocene sea-level proxies. Although the laboratory error is often indicated as part of laboratory procedures, this is not always the case with instrumental errors in a field measurement.
Ideally, multiple studies measuring and interpreting the same proxy should have overlapping uncertainty ellipses (cases 1 and 2 in Fig. 1a). Unfortunately, there are many examples where measurements do not overlap (case 3 in Fig. 1a) or cannot be realistically compared due to the lack of details on measurement techniques or on interpretation (case 4 in Fig. 1a). In the worst case, some studies may fail to report the error and cannot be compared. Incomplete reporting limits the longevity of data, requiring new studies to remeasure the same proxies.
Measurement of palaeo-sea-level and palaeo-ice-sheet data can either be obtained by direct field or laboratory activities, or derived from a previous publication and inserted in the database. In all cases, the transfer of information should be objective and complete, reporting only what can be read in the original publication and/or what is measured in the field, with no further interpretation. An important goal for the future is for different communities to agree on standardised measurements and data reporting norms (e.g. Hijma et al., 2015). Precision of terminology is vital to avoid misinterpretations of field and laboratory measurements. This will facilitate seamless interfacing with database systems for archiving and further analysis. Palaeo-sea-level and palaeo-ice-sheet databases need to include standardised documentation of fundamental data fields: (i) position (i.e. geographical location and elevation or depth), referred to a specific sea-level datum and, if available, the positioning techniques applied; (ii) age, including laboratory identification number, details on the dating technique used, and ideally the raw data; (iii) description of the feature, including metadata and images to complement the quantitative information; and (iv) quantification of measurement uncertainties.
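As an illustration of how the fundamental data fields listed above might be carried in a single record, the following Python sketch defines a minimal measurement structure. The class name, field names, units, and example values are all assumptions made for illustration; they do not reproduce any community standard or existing database schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Measurement:
    """One objectively reported measurement (hypothetical schema, not a standard)."""
    sample_id: str                      # persistent identifier, e.g. an IGSN
    latitude: float                     # decimal degrees
    longitude: float                    # decimal degrees
    elevation_m: float                  # metres relative to `datum`
    elevation_unc_m: float              # elevation uncertainty, metres
    datum: str                          # tidal or geodetic datum of the elevation
    age_ka: float                       # age, thousands of years
    age_unc_ka: float                   # age uncertainty, thousands of years
    dating_method: str                  # e.g. "U-series" or "14C"
    lab_id: str                         # laboratory identification number
    description: str                    # objective description of the feature
    source: str                         # original publication or field campaign
    positioning_technique: Optional[str] = None   # e.g. "differential GPS"
    raw_data: dict = field(default_factory=dict)  # e.g. measured isotope ratios
    images: list = field(default_factory=list)    # paths or URLs of photographs

    def __post_init__(self) -> None:
        # Reject records that cannot be located or compared with other studies.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude out of range")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude out of range")
        if not self.datum.strip():
            raise ValueError("elevation must be referred to a stated datum")
        if self.elevation_unc_m <= 0 or self.age_unc_ka <= 0:
            raise ValueError("uncertainties must be positive")


# Made-up values for illustration only; this is not a real sample.
example = Measurement(
    sample_id="DEMO-0001", latitude=-20.0, longitude=115.0,
    elevation_m=3.0, elevation_unc_m=0.2, datum="mean sea level",
    age_ka=125.0, age_unc_ka=1.5, dating_method="U-series",
    lab_id="LAB-123", description="Coral colony in growth position",
    source="hypothetical field survey",
    positioning_technique="differential GPS",
)
```

Making the datum and both uncertainties mandatory, while leaving the positioning technique and raw data optional, mirrors the distinction in the text between what must be documented and what should be documented when available.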

Interpretation
Once measured, field and laboratory data are interpreted to reconstruct the palaeo-sea-level and the spatial and temporal extent of the palaeo-ice-sheet. Commonly it is the interpretation that will be most interesting for the final users, who may not be experts but need to compare the reconstructions with independent estimates, such as model predictions.
There is often a subjective component to the interpretation of field data. In Fig. 1b, we show fossil corals, a typical example of a sea-level indicator. An objective assessment of the coral age can be determined using U-series techniques (e.g. applying the template used in Dutton and Lambeck, 2012). The position of the deposit relative to a tidal or geodetic datum can be measured with appropriate accuracy. The taxonomy of the sample can be reported, which should include information on the benthic assemblage and its relation to sea level and geological and sedimentological properties. The subjectivity relates to the interpretation of the palaeo-water depth (i.e. relation to sea level) of the coral. One possible interpretation following investigation of the depth distribution of corals in the deposit is that the corals are in situ (i.e. in living position) and sea level at the time of deposition was somewhere above the measured elevation (i.e. it is a lower limiting data point, Int. 1 in Fig. 1b). Another interpretation could be that the corals are allochthonous, and instead represent a storm deposit. In this case, it is only possible to infer that the deposit represents the top of a marine sequence and that the palaeo-sea-level was located below the measured elevation of the deposit (i.e. it is an upper limiting data point, Int. 2 in Fig. 1b). A final interpretation may instead recognise elements (e.g. microatolls, intertidal geological facies within the deposit) that tie the deposit to the palaeo-sea-level around the measured elevation within an uncertainty (i.e. identifying a reference water level and indicative range; Shennan, 1986; van de Plassche, 1986; Horton et al., 2000; Int. 3 in Fig. 1b). Whenever controversial interpretations such as those summarised above exist, a database should document all of them.
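The three interpretations of the same measured elevation can be expressed as different constraints on palaeo-sea-level. The Python sketch below is one possible way to do this, under stated assumptions: the names are hypothetical, the reference water level is expressed relative to the same datum as the elevation, the indicative range is treated as a full width centred on the reference water level, and the measurement uncertainty is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Interpretation:
    """One interpretation of a sample (hypothetical structure)."""
    kind: str                             # "lower_limiting", "upper_limiting" or "index_point"
    reference_water_level_m: float = 0.0  # used only for index points
    indicative_range_m: float = 0.0       # full width, used only for index points
    note: str = ""


def sea_level_bounds(elevation_m: float,
                     interp: Interpretation) -> Tuple[Optional[float], Optional[float]]:
    """Return (lower, upper) bounds on palaeo-sea-level; None means unbounded."""
    if interp.kind == "lower_limiting":
        # In situ marine organism: sea level was at or above the sample.
        return (elevation_m, None)
    if interp.kind == "upper_limiting":
        # Top of a marine sequence, e.g. a storm deposit: sea level was below it.
        return (None, elevation_m)
    if interp.kind == "index_point":
        # Assumed convention: sea level = elevation - reference water level,
        # bracketed by half the indicative range on either side.
        centre = elevation_m - interp.reference_water_level_m
        half = interp.indicative_range_m / 2.0
        return (centre - half, centre + half)
    raise ValueError(f"unknown interpretation kind: {interp.kind}")


# All three interpretations are stored against the same measured elevation.
elevation_m = 3.0  # made-up value
interpretations = [
    Interpretation("lower_limiting", note="corals in living position"),
    Interpretation("upper_limiting", note="allochthonous storm deposit"),
    Interpretation("index_point", reference_water_level_m=-0.5,
                   indicative_range_m=1.0, note="microatolls present"),
]
bounds = [sea_level_bounds(elevation_m, i) for i in interpretations]
```

Keeping the measurement separate from the list of interpretations means that a new or revised interpretation can be appended without touching the measured elevation.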
If one interpretation is more likely than the others, or is supported by independent studies, this information should be inserted in the database within the metadata. Issues may emerge in the interpretation of laboratory data, such as the use of different calibration curves to establish the age of an indicator. The interpretation of data can be subject to changes with scientific advances. As an example, old 14C ages or cosmogenic surface exposure ages can be recalibrated following the availability of new calibration curves or calibration schemes or new production rates and scaling models, respectively. But the possibility of recalibrating these measurements depends on the presence of primary data, such as δ13C measurements, description of the dated material, sample thickness, etc. (Balco, 2011; Törnqvist et al., 2015). In principle, if measurement data are present in a database, obtaining secondary data from new interpretations can be streamlined relatively easily. Uncertainties of sea-level and ice-sheet indicators are usually treated as Gaussian distributions, with the exception of limiting data that only provide information on maximum or minimum sea level (Int. 1 and Int. 2 in Fig. 1b). In the case of Gaussian uncertainties, the uncertainty of the interpretation can be combined with the uncertainty of the measurement (dashed line in Fig. 1b) in quadrature, i.e. as the square root of the sum of the squared uncertainties, assuming the uncertainties are independent; more complicated uncertainties may require Monte Carlo sampling. As understanding of habitat distribution for marine species or coastal facies increases, and more consideration is given to the physical processes that perturb sample elevation over time, an increasing amount of data will use more accurate uncertainty distributions that extend beyond the Gaussian approximation.
We recommend recording multiple percentiles of these non-Gaussian distributions to reflect not only the width of the uncertainty but also the shape of its probability distribution.
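The two approaches to uncertainty described above can be sketched in a few lines of Python. Independent Gaussian uncertainties are combined as the square root of the sum of their squares, while a non-Gaussian case is handled by Monte Carlo sampling and summarised with several percentiles. The function names, the relation "sea level = elevation + water depth", the triangular water-depth distribution, and all numbers are assumptions chosen for illustration only.

```python
import math
import random


def combine_in_quadrature(*sigmas: float) -> float:
    """Combine independent Gaussian 1-sigma uncertainties."""
    return math.sqrt(sum(s * s for s in sigmas))


def monte_carlo_percentiles(draw_elevation, draw_water_depth, n=20000,
                            percentiles=(5, 17, 50, 83, 95), seed=0):
    """Sample sea level = elevation + water depth and summarise by percentiles."""
    rng = random.Random(seed)  # fixed seed so that the summary is reproducible
    samples = sorted(draw_elevation(rng) + draw_water_depth(rng)
                     for _ in range(n))
    # Simple nearest-rank lookup into the sorted samples.
    return {p: samples[min(n - 1, int(p / 100 * n))] for p in percentiles}


# Gaussian case: measurement (0.2 m) and interpretation (0.5 m) uncertainties.
total_sigma = combine_in_quadrature(0.2, 0.5)

# Non-Gaussian case: a skewed, made-up water-depth distribution for a coral
# (0 to 6 m, most likely 1 m) added to a Gaussian elevation measurement.
result = monte_carlo_percentiles(
    draw_elevation=lambda rng: rng.gauss(3.0, 0.2),
    draw_water_depth=lambda rng: rng.triangular(0.0, 6.0, 1.0),
)
for p in sorted(result):
    print(f"{p:>2}th percentile: {result[p]:.2f} m")
```

Storing several percentiles rather than a single standard deviation preserves the asymmetry of the distribution: in this example the upper tail is longer than the lower tail, which a symmetric error bar would hide.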
Palaeo-sea-level and palaeo-ice-sheet databases that incorporate interpretations must therefore be (i) flexible, to take into account the fact that, although the measurement must be unequivocal, interpretation of the data can be multiple and vary or evolve over time; and (ii) consistent in the reporting of interpretations and uncertainties of data.

Database creation
A database is primarily a collection of data records and secondarily a platform for exchange of data and information.
The process of creating a database must necessarily start from the identification of the agents that will interact with it. Data creators provide the original data sets and should carry out their work with databases in mind. In palaeoclimate sciences, these are usually geologists and geochemists, who carry out the main part of the measurement and interpretation process. Data compilers collect data from different sources and, if necessary, reinterpret them. Measurement and interpretation constitute the backbone of every palaeo-sea-level and palaeo-ice-sheet database, but there are other key elements to be considered, which we summarise under the ATTAC³ acronym (Fig. 1c): accessibility, transparency, trust, availability, continuity, completeness, and communication of content.
Accessibility is a challenge due to the heterogeneity of the user communities. A majority of published databases today use a spreadsheet format, which is easy to access for most users. However, some information (e.g. images or non-Gaussian uncertainty estimates) is more simply presented in relational databases. Furthermore, relational databases enable different presentation formats for different end-user communities.
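To make the contrast between a spreadsheet and a relational database concrete, the sketch below uses Python's built-in sqlite3 module to store one sample with several interpretations and then flattens the result into spreadsheet-like rows. The table names, column names, and values are hypothetical and do not reproduce the schema of any existing sea-level or ice-sheet database.

```python
import sqlite3

# Hypothetical two-table schema: one row per sample, many interpretations per sample.
SCHEMA = """
CREATE TABLE sample (
    sample_id       TEXT PRIMARY KEY,
    elevation_m     REAL NOT NULL,
    elevation_unc_m REAL NOT NULL,
    datum           TEXT NOT NULL
);
CREATE TABLE interpretation (
    interpretation_id INTEGER PRIMARY KEY,
    sample_id         TEXT NOT NULL REFERENCES sample(sample_id),
    kind              TEXT NOT NULL,
    source            TEXT NOT NULL,
    preferred         INTEGER NOT NULL DEFAULT 0
);
"""


def build_demo_database() -> sqlite3.Connection:
    """Create an in-memory database holding made-up demonstration data."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.execute("INSERT INTO sample VALUES (?, ?, ?, ?)",
                ("DEMO-0001", 3.0, 0.2, "mean sea level"))
    con.executemany(
        "INSERT INTO interpretation (sample_id, kind, source, preferred) "
        "VALUES (?, ?, ?, ?)",
        [("DEMO-0001", "lower_limiting", "study A", 0),
         ("DEMO-0001", "upper_limiting", "study B", 0),
         ("DEMO-0001", "index_point", "study C", 1)],
    )
    con.commit()
    return con


def spreadsheet_view(con: sqlite3.Connection) -> list:
    """Flatten samples and interpretations into spreadsheet-like rows."""
    return con.execute(
        "SELECT s.sample_id, s.elevation_m, s.elevation_unc_m, "
        "       i.kind, i.source, i.preferred "
        "FROM sample AS s JOIN interpretation AS i ON i.sample_id = s.sample_id "
        "ORDER BY i.interpretation_id"
    ).fetchall()


rows = spreadsheet_view(build_demo_database())
```

The same tables could feed a second query that returns only the preferred interpretation for modellers, which is the kind of community-specific presentation that a single flat spreadsheet makes harder.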
Transparency is critical in interdisciplinary research fields.
Scientists must trust each other on the applied methodology, but at the same time they have to be able to understand the applied procedures. As a database creator and compiler cannot know all future users of the data and the fields in which they are applied, the database description must be as detailed as possible. The description should include appropriate metadata and use standardised language and comments in data fields. Indicating the quality of each data field in understandable formats will help the end user to make appropriate use of the data (Düsterhus and Hense, 2014).
Trust is built by database compilers sharing credit with the scientists delivering the data (Costello, 2009; Kattage et al., 2014). Data creators and compilers are confronted with the risk that their original publications are no longer cited when their data are included in a larger citable database and thus will not gain credit under current performance metrics, such as the H-index (Hirsch, 2005). To ensure the availability of high-quality data sets in the future, data creators need to be given appropriate credit. Trust in a database requires consistent data quality, transparently applied procedures, and a consistent and trustworthy host. It also requires effective software design for the database and within the data processing. Ideally, the code should be openly available and well documented.
Figure 2 (caption fragment). ... (Brooks and Edwards, 2006); (G) the UK (Shennan and Horton, 2002); (H) northwest Europe (Vink et al., 2007); (I) the Mediterranean (Vacchi et al., 2014, 2016); (J) China (Zong, 2004); (K) Malay Peninsula (Horton et al., 2005); (L) New Zealand (Clement et al., 2016); and (M) Antarctica (Briggs and Tarasov, 2013).
Availability of a database for the long term requires long-term funding (see below). Today, most databases are attached to journal articles as a spreadsheet in the supplement. This ensures persistence, but most journals do not allow the database to be maintained or upgraded.
Continuity of updating is important to stay relevant and reflect the changing interpretations of the data. To allow citability of the database (e.g. with digital object identifiers (DOIs); Paskin, 2005; Quadt et al., 2012), version control is essential. Furthermore, the use of unique and persistent identifiers, such as the International Geo Sample Number (IGSN) that is currently used for geological samples, should be encouraged to ensure that over different update cycles a data point can be uniquely referenced by scientists.
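One practical consequence of persistent identifiers is that two releases of a database can be compared record by record. The following Python sketch, with a hypothetical function name and made-up identifiers and ages, reports which data points were added, removed, or changed between two versions keyed by sample identifier.

```python
def diff_versions(old: dict, new: dict) -> dict:
    """Compare two database releases keyed by persistent sample identifier."""
    old_ids, new_ids = set(old), set(new)
    return {
        "added": sorted(new_ids - old_ids),
        "removed": sorted(old_ids - new_ids),
        # A record counts as changed if any of its stored fields differ.
        "changed": sorted(k for k in old_ids & new_ids if old[k] != new[k]),
    }


# Made-up releases: DEMO-0001 is recalibrated and DEMO-0003 is new in version 2.
version_1 = {
    "DEMO-0001": {"age_ka": 125.0, "age_unc_ka": 1.5},
    "DEMO-0002": {"age_ka": 8.2, "age_unc_ka": 0.1},
}
version_2 = {
    "DEMO-0001": {"age_ka": 124.6, "age_unc_ka": 1.2},
    "DEMO-0002": {"age_ka": 8.2, "age_unc_ka": 0.1},
    "DEMO-0003": {"age_ka": 6.0, "age_unc_ka": 0.2},
}
changes = diff_versions(version_1, version_2)
```

A change log of this kind, published with each versioned release, lets an end user who cited an earlier version see exactly which data points differ in the current one.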
Completeness of the database is important, especially in the context of uncertainties (Hijma et al., 2015). Even when the basic elements (like position, age, and elevation) are complete, for many applications they are of limited use when associated uncertainties are not clearly indicated or defined.
Communication of the content, for example through interfaces for visualisation software, helps to open the database for new audiences. Advanced visualisation approaches (e.g. Rovere et al., 2012; Unger et al., 2012) require standardised protocols for data extraction and consistent data types. These properties have to be determined in the design phase of the database; thus it is important to consider its applications right from the beginning.

The community structure
Any database should be aimed at serving a community of end users, who extract content for further analysis, and give feedback on specific needs regarding data sets or analyses. Databases should be centralised and interconnected via the Internet in order to reach the maximum possible number of end users, with the widest possible geographic distribution. The data are more likely to be used if the end users have a unique access point for the data sets, such as a WebGIS portal. In the geological domain, there are large initiatives to build data repositories, which are already well established and used by scientists worldwide. Two examples are the NOAA World Data Center for Paleoclimatology (Wahl et al., 2010) and PANGAEA (http://www.pangaea.de/). Some journals link PANGAEA databases to online versions of associated papers.
Most funding agencies require that data collected in the framework of a project be archived and made available through data repositories. This is achieved through a "data management plan" (National Science Foundation of the United States) or the "open data policy" (European Union), which requests that the project leaders state where they plan to store the data collected within their project. Currently, a researcher working on sea-level and/or ice-sheet databases only has the choice to store the new data sets in different repositories, which might have the effect of dispersing the data across several repositories, decentralising data storage (see example in Fig. 3).
In the framework of a single research project, the data creator is also a data compiler, and often the first end user. It is, therefore, necessary to ensure that the data sets collected in the framework of a single project have a standardised structure and are available to other end users.
A significant concern regarding the maintenance of a healthy research community is appropriate crediting of authorship. How does an end user using thousands of data points from dozens of source publications provide appropriate credit? Journals often allow for only a limited number of citations, and often the citation credit goes to the data compiler, who created the review database, and not to the data creator. If this question is not addressed, the long-term result will be that data creators will have no incentive to support the inclusion of their work in a centralised database. This issue must be addressed by journal editors. In some cases, editors have made exceptions to standard journal length rules in order to include all the original papers in the reference list (e.g. Khan et al., 2015). Alternatively, some journals allow longer, online-only papers with space for a full reference list (e.g. Kopp et al., 2016).
A number of sea-level databases have been produced in the framework of single research projects (Table 1). In general, there are two formats in which the data are provided: data repositories in the form of spreadsheets (R) and interactive interfaces that allow the visualisation, extraction, or download of data (I). In Fig. 2 we show the geographic coverage for a number of databases representing late-glacial and Holocene RSL data which were compiled from different original studies following, where appropriate, the IGCP guidelines. Each index point has a defined location, age, elevation relative to former sea level, and appropriate accounting of errors (details of the databases are given in Table 2).

Figure 3. Example of a data management plan (DMP) for a project on Pleistocene sea-level markers obtained with the IEDA (Interdisciplinary Earth Data Alliance, http://www.iedadata.org/) DMP toolbox. Note that, to correctly store sea-level data, at least four independent repositories are needed.
Overview: We will produce new data in the format of field notes, photos of field sites, GPS data, GIS datasets and databases. At each site we will collect samples that will be dated with U-series techniques.
Data description: We expect to produce information on paleo sea levels measured in the field and we will revise information as necessary, on sites published in literature.
Description of existing data and samples: We will use existing published data on sea levels in the area of interest. We will collect all the information in a geodatabase built with GIS software, from which, at the end of the project, we will extract the information to be submitted to the repositories listed hereafter.
Data analysis summary: Samples collected in the field will be dated using U-series.
Includes field work? Yes
Description of field work: We will collect elevation and stratigraphic data in the field with a differential GPS receiver. Cameras of the researchers will be synchronized with their handheld GPS time to geotag field photos. Field notes will be digitized at the end of each day of field survey and stored in PDF format. Pictures, text description, field sketches and associated IGSN numbers will be used as metadata.

Expected data product #1
Intended repository: UNAVCO
Timeline for data release: Two years from acquisition/analysis

Expected data product #3
Responsible investigator: Researcher
Product description: Results of U-series analyses. Metadata will include photos of samples, description of facies, lat/lon coordinates and associated IGSN numbers.
Intended repository: EarthChem
Timeline for data release: Two years from acquisition/analysis

Expected data product #4
Data type: Observational
Responsible investigator: Researcher
Product description: New site stratigraphies as well as stratigraphies re-evaluated or re-measured from literature data as necessary.
Intended repository: GeoStrat
Timeline for data release: Two years from acquisition/analysis

Clim. Past, 12, 911-921, 2016 www.clim-past.net/12/911/2016/

Table 1. List of published global sea-level databases that follow (where appropriate) the IGCP format. Formats in which the data are provided: spreadsheets (R) and interactive interfaces that allow the visualisation, extraction, or download of data (I).

Database: Dutton and Lambeck (2012)
Description: Compilation of last interglacial coral U-Th age data, elevation data, and associated sample information. The first worksheet of the Excel file contains the data and calculated ages and elevations that have been normalised to common decay constants and elevation benchmarks, respectively; the second worksheet contains definitions of column headings and data units; the third worksheet contains a lookup table for data sources listed by number in the first worksheet. Some entries in the database are annotated by comment fields to denote supplemental information for data or calculations not included in the original publications.
Accessibility: Annexed to publ. (R)

Database: Klemann et al. (2013)
Description: Storage of different accessible compilations in relational database system PostgreSQL. Contains the regional databases A, B, C, and D shown in Fig. 2 and further data mainly from published compilations or grey literature. Access via visualisation and analysis software SLIVISU (beta version) or direct access (password protected).
Accessibility: Online (I), on request

Database: Khan et al. (2015)
Description: Compilation of global Holocene relative sea-level data. Each database entry includes location, sea level, sea-level error, age, and age error, as well as the original source of publication.
Accessibility: Annexed to publ. (R)

Database: Kopp et al. (2009)
Description: Multi-proxy database of last interglacial index and limiting relative sea-level index points. A legend worksheet defines column headings and data units.
Accessibility: Annexed to publ. (R)

Database: Kopp et al. (2016, https://www.ncdc.noaa.gov/paleo/study/19823)
Description: Database of Common Era (last 3000 years) relative sea-level data. Each database entry includes location, sea level, sea-level error, age, and age error, as well as the original source publication. There is a front page of definitions of column headings and data units.
Accessibility: Online ( Online (I)
Pilot database intended as an initial release of Holocene geological relative sea-level data that have been compiled according to a recently developed protocol (Hijma et al., 2015). The database is provided in two versions: a complete version that consists of 77 variables and that includes all the underlying data, as well as a processed version with only the 11 most critical variables. It is anticipated that this latter version will be adequate for most users, while the former provides a full documentation for those who wish to carry out more detailed analyses.

Concluding remarks
The discussions of the PALSEA community on sea-level and ice-sheet databases can be framed around the following points:
1. Any set-up of sea-level or ice-sheet databases must be divided into (i) measurement, (ii) interpretation, and (iii) database creation.
2. Storage of measurements should include position, age, description of geological features, and quantification of uncertainties. All must be described as objectively as possible with relevant metadata.
3. Interpretation of geological data will retain a subjective component, but it should always include uncertainties and include all the possible interpretations.
4. When creating a database, all the aspects related to the ATTAC³ approach must be taken into account.
5. The community structure that creates and benefits from a database must be considered, and the needs and concerns of each part of the community must be respected.
There remains the need for a centralised database structure for the sea-level and ice-sheet communities. Despite this need, dedicated funding for "user-friendly", field-specific database creation is rarely available because funding mostly prioritises projects that follow the classic hypothesis-driven research approach. Data management is often restricted to archiving at a general level. The tasks of database creation, maintenance, and guarantee of accessibility are limited to single projects, and the possibility to hire ad hoc personnel (e.g. experts in geoinformatics) to fulfil these requirements is often disregarded by funding agencies. We favour interdisciplinary research collaborations focusing on field-specific database development and maintenance, including projects that amalgamate and reanalyse published data sets into new databases. These new databases enhance the legacy of monetary investments originally made to collect sea-level and ice-sheet data. Many of the aspects discussed in this paper will also be valid for other types of geological data and may be of interest to additional geoscientific communities.