Technical note: The Linked Paleo Data framework – a common tongue for paleoclimatology

Paleoclimatology is a highly collaborative scientific endeavor, increasingly reliant on online databases for data sharing. Yet there is currently no universal way to describe, store and share paleoclimate data: in other words, no standard. Data standards are often regarded by scientists as mere technicalities, though they underlie much scientific and technological innovation, as well as facilitating collaborations between research groups. In this article, we propose a preliminary data standard for paleoclimate data, general enough to accommodate all the archive and measurement types encountered in a large international collaboration (PAGES 2k). We also introduce a vehicle for such structured data (Linked Paleo Data, or LiPD), leveraging recent advances in knowledge representation (Linked Open Data). The LiPD framework enables quick querying and extraction, and we expect that it will facilitate the writing of open-source community codes to access, analyze, model and visualize paleoclimate observations. We welcome community feedback on this standard, and encourage paleoclimatologists to experiment with the format for their own purposes.


Introduction
Science is entering a data-intensive era, where insight is increasingly gained by extracting information from large volumes of data (Hey, 2012). This is particularly critical in paleoclimatology, as understanding past changes in the climate system requires observations across large spatial and temporal scales. Paleoclimatic observations are typically limited to small geographic domains, so investigating large scales requires integrating many disparate studies and datasets. Observational work in paleoclimatology exemplifies the "long-tail" approach to data collection (Heidorn, 2008): the majority of observations are gathered by independent scientists with no formal language for describing their data and metadata to each other, or to machines, in a standardized fashion. This results in a "Digital Tower of Babel", making the curation, access, re-use and valorization of paleoclimate data far more difficult than it should be, and hindering scientific progress.
Recognizing the need for data sharing, paleoclimate investigators have made a major effort over the past decade to make their data available to the broader community, largely through online archiving systems like the World Data Center for Paleoclimatology and Pangaea. Nonetheless, the lack of consistent formatting and metadata standards (i.e. a common tongue) has made the re-use of such data needlessly labor-intensive by preventing computers from participating in the task of making connections across datasets. As the number of records in these archives has grown, making connections manually has become more and more challenging, hampering integrative efforts at the very time they should be flourishing. Paleoclimatologists thus need a common tongue to describe their datasets to each other and to machines. Achieving this goal requires addressing two major hurdles: (1) the lack of a common language used to describe our datasets (a data standard), and (2) the lack of an accepted data format: a "rule book" that describes how the data are encoded, and that allows programmatic access to the data.
These two issues are clearly related, but somewhat distinct in practice. The data format must be universally readable, a condition satisfied by, for instance, netCDF files, which have been used for paleoclimate syntheses (Wahl et al., 2010). However, such files only allow for fixed schemas and require identical fields for all proxies. In reality, each paleoclimatic dataset may have a unique set of data and metadata properties. Moreover, the netCDF format is designed for large gridded datasets, and is justifiably popular in the atmosphere and ocean science communities. It is, however, unfamiliar to most paleogeoscientists, because it was not designed with the peculiarities of paleogeoscientific data in mind, and would only accommodate them with extreme effort. Further, to enhance the relevance of paleoclimate data to other fields, one would like this data container to be compatible with the Linked Data paradigm (Bizer et al., 2009), which allows for data-driven discovery between datasets that would otherwise be unlikely or impossible. For the broadest applicability, we require a more flexible format.

McKay & Emile-Geay: The Linked Paleo Data framework

Elaborating a data standard is an even greater challenge. It requires that the community of paleogeoscientists agree on the meaning of, and relationship between, the terms they use every day, often informally, in different contexts, and with different cultural norms. For instance, some scientists use the term "proxy" liberally to describe any paleoclimatic variable, whereas others restrict its use to relationships that have been rigorously quantified. Developing a consistent standard, "a common tongue", is critical to the community moving forward, but will be an iterative community process.
In this technical note, we propose a solution to both problems, and present LiPD (Linked Paleo Data), a new, flexible linked-data format designed for paleoclimate data. Such a data container is a necessary first step towards a "semantic web of paleoclimatology" (Emile-Geay and Eshleman, 2013), and provides a straightforward framework in which communities and researchers can explicitly describe their data and metadata in common terms that the community, and computers, can understand. In the process, we introduce a preliminary data standard for paleoclimatology. Indeed, such a standard is essential to structuring the metadata, though the container is flexible enough to accommodate many revisions and updates to this standard. As discussed above, an accepted standard needs to evolve out of community-wide discussions and the establishment of a consensus, which has yet to take place in our field. One goal of the present work is to spark such a discussion by giving the worldwide paleoclimate community a blueprint to improve upon.

A flexible container for paleoclimate data
Paleoclimate observations come in many varieties; standardizing the data and providing the framework to encode meaning to the parameters and metadata requires a flexible and extensible format. The linked data variety of JavaScript Object Notation (JSON-LD) provides a lightweight and human-readable solution to this problem. JSON-LD may be unfamiliar to most paleogeoscientists; however, JSON is becoming a leading format for data exchange on the web, and has a rich set of existing tools to interact with it and a robust user community. JSON-LD augments JSON by defining each property via a Web-defined schema, and is being used by Google, the BBC, and Microsoft, among many other institutions. More importantly for the paleogeosciences, it is almost infinitely customizable, meaning that it can adapt to fit any dataset, and evolve with emerging data standards in the community. Here we present the structure of the Linked Paleo Data (LiPD) data format, which utilizes JSON-LD and provides a structure that is common to the overwhelming majority (if not all) of paleoclimate observational datasets.
Despite their variety, all paleoclimate datasets share the same major features.
1. Some basic metadata about the dataset (e.g. ...)
LiPD encodes these data and metadata into a structured hierarchy that allows explicit description of any aspect of the dataset at any level of the data (Figure 1). LiPD serializes this hierarchy in JSON-LD as nested lists and key-value pairs. LiPD adopts the GeoJSON standard to describe geographic metadata. The GeoJSON standard defaults to the WGS84 ellipsoid, with units of decimal degrees for latitude and longitude and meters above sea level for elevation. This standard readily accommodates polygonal and multipoint geographic features and additional location metadata, allowing a much richer suite of geographic metadata than is typically recorded with paleoclimate datasets.
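As an illustration of the geographic metadata described above, the following minimal Python sketch builds a GeoJSON point feature and reads it back. The site name and coordinates are placeholders invented for the example (they do not describe any real core), and the `siteName` property and the `lat_lon_elev` helper are hypothetical names, not part of the LiPD specification.

```python
import json

# Hypothetical site: coordinates are placeholders, not a real core location.
location = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        # GeoJSON coordinate order is [longitude, latitude, elevation].
        "coordinates": [-20.0, 60.0, -2000.0],
    },
    # "siteName" is an assumed property name used only for this sketch.
    "properties": {"siteName": "Example marine core site"},
}


def lat_lon_elev(feature):
    """Return (latitude, longitude, elevation) from a GeoJSON point feature.

    Elevation is optional in GeoJSON, so None is returned when it is absent.
    """
    lon, lat, *rest = feature["geometry"]["coordinates"]
    elev = rest[0] if rest else None
    return lat, lon, elev


# Round-trip through a JSON string, as would happen when a file is written and read.
restored = json.loads(json.dumps(location, indent=2))
print(lat_lon_elev(restored))
```

Because GeoJSON lists longitude before latitude, a small accessor of this kind may help avoid swapped coordinates when datasets are combined.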
An advantage of using JSON as the default container for this information is that it is an extremely common vehicle for all manner of data, and can be parsed by nearly all modern programming languages. Because each LiPD dataset comprises a JSON-LD file and one or more CSV files, each dataset is packaged using BagIt³, which provides a simple format for collecting and validating files for distribution, and can be readily serialized into a compressed file for exchange between users.
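To make the pairing of a JSON metadata file with a CSV data file concrete, here is a minimal sketch using only the Python standard library. Apart from `dataSetName`, the key names (`measurementTable`, `columns`, `number`, `filename`), the dataset name and the numerical values are assumptions made up for illustration. The BagIt packaging step is not shown.

```python
import csv
import io
import json

# Hypothetical metadata; only "dataSetName" is taken from the text, the
# other keys and all values are assumptions for this sketch.
metadata_text = """
{
  "dataSetName": "Example-Core-01",
  "measurementTable": {
    "filename": "Example-Core-01.csv",
    "columns": [
      {"number": 1, "variableName": "depth", "units": "cm"},
      {"number": 2, "variableName": "d18O", "units": "permil"}
    ]
  }
}
"""

# Made-up measurements standing in for the contents of the CSV file.
csv_text = "0.5,1.92\n1.5,2.05\n2.5,1.87\n"


def load_dataset(metadata_str, csv_str):
    """Parse the metadata and attach each CSV column to its variable name."""
    meta = json.loads(metadata_str)
    rows = [[float(x) for x in row]
            for row in csv.reader(io.StringIO(csv_str)) if row]
    values = {}
    for col in meta["measurementTable"]["columns"]:
        idx = col["number"] - 1  # column numbers are assumed to be 1-based
        values[col["variableName"]] = [row[idx] for row in rows]
    return meta, values


meta, values = load_dataset(metadata_text, csv_text)
print(meta["dataSetName"], values["d18O"])
```

Keeping the numbers in plain CSV and the description in JSON means either part can be opened with everyday tools, while the metadata tells a program what each column contains.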

A preliminary data standard for paleoclimatology
The flexible container described in section 2 can serialize any set of paleoenvironmental data with rich metadata. However, this framework only becomes useful when a common vocabulary with explicit meanings is applied to the data. Developing this vocabulary requires buy-in from experts across the disparate domains of the paleogeosciences, and will be a gradual process of evolving standards. To begin this conversation, here we outline a preliminary standard for required metadata, based on phase 2 of the Past Global Changes (PAGES) past two thousand years (2k) project⁴. The following are the minimal metadata for every dataset in the network. Many records include additional desirable data and metadata; an ongoing extended metadata table is available here.
It is illustrative to look at a simple but realistic example to examine how a dataset is structured in LiPD using this preliminary data standard. We use the dataset of Thornalley et al. (2009) as an example in the following and in Figure 2. This is an intentionally minimal example that does not include all possible information. For example, the chronological metadata can describe any type of chronology, whether it is primarily based on tie-point constraints or layer counts. Additionally, metadata about how the ages and their uncertainties are modeled for undated layers are also readily stored, including the details needed to reproduce the analyses, and even large ensembles of simulated age-depth relations. Indeed, the need to store and share these data and metadata is a primary motivation of this effort.

Connectivity and compatibility
This technical note is focused on a technical description of the structure of a new data format (LiPD) and a preliminary data standard that can be used with it. Most paleogeoscientists will never want to, or need to, interact with LiPD on this level. The goal of any machine-readable data format is ... tion is needed. On the output side, we have begun developing open-source utilities for the analytical platforms that are most commonly used by paleogeoscientists (Matlab, Python, and R). On the input side, we have developed interactivity with Google spreadsheets, a free, cloud-based alternative to Microsoft Excel, recognizing that spreadsheets are the bread and butter of lab scientists, and recognizing the need for distributed editing of data and metadata (which Google spreadsheets' version control makes possible and reversible). Additionally, we have developed utilities that convert datasets formatted for the World Data Center for Paleoclimatology in Microsoft Excel into LiPD files, so that users who format their data for the former can instantly turn them into LiPD.
Conversely, a partnership with WDC Paleo will ensure that LiPD-formatted datasets are easily archived on their site.

These utilities are in various stages of development, and are available in a public GitHub repository⁵. They are all designed to plug into the workflow of paleogeoscientists. Our hope is that as paleogeoscientists discover and explore the utility of this framework, the community of contributors will continue to expand; for example, LiPD integration with Neotoma⁶ and the Neotoma R package (Goring et al., 2015) is planned for 2016. Finally, LiPD is the backbone of the LinkedEarth project⁷, which will enable users to edit datasets via an intuitive wiki platform, leveraging the flexibility of LiPD while eliminating its complexity from the user experience.

Discussion
The data container and preliminary data standard described here are extremely flexible, and can accommodate any paleoclimatic or paleoenvironmental data that are based on any expansion of dependent/independent variable pairs. This encompasses all paleoclimate and paleoenvironmental datasets that we can imagine. A primary challenge for developing a sufficiently broad paleodata framework has long been defining and agreeing on all of the relevant terms for such a diverse community. The framework presented here addresses this challenge by accommodating the complexity and inevitable proliferation of terms, variables and interpretations inherent to the interdisciplinary field of paleoclimatology, and by assigning explicit meaning to the terms through the Linked Open Data framework. Implementation of these semantics will be an evolving, community-driven process. This is critical for two reasons: first, defining an ontology⁸ a priori has proven impossible to date; second, even if it were possible, such an ontology would be meaningless if it were not used. We will thus rely on usage and community discussion to reach agreement on terminology, and the community has clearly demonstrated its desire and willingness to contribute to these discussions. Indeed, LiPD and the preliminary data standard discussed in this technical note are the outcome of considerable community input and development. The concepts formalized here have emerged from half a decade of formal and informal development with hundreds of paleogeoscientists. The earliest formal development of these concepts arose from the clear recognition of the need through two large community projects organized through Past Global Changes (PAGES): the PAGES 2k project and the PAGES Arctic Holocene Transitions project. The call for standardization from the community working on these projects was clear, and PAGES has made the development of formats and standards a priority as part of its "Data Stewardship" integrated activities effort. Feedback on early versions of the LiPD framework and the preliminary dataset was cultivated through the PAGES International Program Office, which reached out to the large community (>5000) of paleoscientists involved with PAGES to solicit input and feedback on these ideas.
For the most part, we used the online platform Authorea, which allows online publishing, editing and feedback on manuscripts⁹, to share information on this format and gather input. Through this process we received excellent feedback from the community (acknowledged below) that greatly contributed to the framework. We view LiPD as a community product that evolved prior to sub...
⁸ a formal definition of all the concepts used by the data model, and the relationships between these concepts
⁹ https://www.authorea.com/users/17200/articles/19163/_show_article
... derived from foraminifera extracted from a marine sediment core. On one hand, these two records are measuring the same variable, and there are times when researchers might be interested in investigating all δ¹⁸O records regardless of the details of the archive on which they were measured. On the other hand, there are some important differences between the two measurements that users would like to include in the data repository. If we were to describe each variable in a single term, we would have to decide whether to call them both "δ¹⁸O", or to call one "δ¹⁸O-skeletal aragonite" and the other "δ¹⁸O-foraminifera >120µm size class". By taking advantage of JSON's capacity to build hierarchical metadata structures, we can encode an entire set of metadata at the appropriate level in the dataset as:
    {
      "variableName": "d18O",
      "description": "d18O measured on skeletal aragonite",
      "units": "permil",
      "standard": "VSMOW",
      "material": "skeletal aragonite",
      "instrument": "Micromass Optima gas source triple-collector mass spectrometer"
    },
and:
    {
      "variableName": "d18O",
      "description": "d18O measured on G. bulloides > 120 microns",
      "units": "permil",
      "standard": "VSMOW",
      "material": "foraminifera calcite",
      "instrument": "Micromass Optima gas source triple-collector mass spectrometer",
      "species": "Globigerina bulloides"
    },
This makes the commonalities and differences between the datasets explicit. Moreover, additional levels of metadata may be introduced into the descriptor to accommodate climate interpretation, calibration procedures or forward models as described above. The power of the hierarchical structure is that it allows the metadata to be placed at the appropriate level, avoiding the logical contradictions in lumping and splitting that become necessary when trying to incorporate information from several levels into a single term, or when several users describe the same dataset in slightly different ways.
An important consideration for re-use and provenance tracking is versioning: each version of a LiPD record, or collection of LiPD records, should be associated with a unique identifier, which is crucial to reproducibility. We propose the following versioning scheme:
Individual records: A number of the form I₁.I₂.I₃, where I₁ is an integer associated with a publication (e.g. Thornalley et al., 2009), I₂ is a counter updated every time a modification is made to the data, and I₃ is another counter updated whenever a modification is made only to the metadata.
Data compilations: A number of the form C₁.C₂.C₃, where C₁ is an integer associated with a publication (e.g. PAGES2k Consortium, 2013), C₂ is a counter updated every time a record is added or removed, and C₃ is a counter updated every time a modification is made to the data or metadata in an individual record.
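The versioning scheme above could be sketched in Python as follows. This is one possible reading rather than a reference implementation: the class and method names are hypothetical, counters are assumed to start at 0, and no counter is reset when another one increases, since the text does not specify either point.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RecordVersion:
    """Version I1.I2.I3 of an individual record."""
    publication: int       # I1: integer associated with a publication
    data_rev: int = 0      # I2: incremented when the data are modified
    metadata_rev: int = 0  # I3: incremented when only the metadata are modified

    def after_change(self, data_changed: bool) -> "RecordVersion":
        """Return the next version; data_changed=False means metadata only."""
        if data_changed:
            return replace(self, data_rev=self.data_rev + 1)
        return replace(self, metadata_rev=self.metadata_rev + 1)

    def __str__(self) -> str:
        return f"{self.publication}.{self.data_rev}.{self.metadata_rev}"


@dataclass(frozen=True)
class CompilationVersion:
    """Version C1.C2.C3 of a data compilation."""
    publication: int         # C1: integer associated with a publication
    membership_rev: int = 0  # C2: incremented when a record is added or removed
    record_rev: int = 0      # C3: incremented when any member record is modified

    def record_added_or_removed(self) -> "CompilationVersion":
        return replace(self, membership_rev=self.membership_rev + 1)

    def record_modified(self) -> "CompilationVersion":
        return replace(self, record_rev=self.record_rev + 1)

    def __str__(self) -> str:
        return f"{self.publication}.{self.membership_rev}.{self.record_rev}"


v = RecordVersion(publication=1)
print(v.after_change(data_changed=True))   # 1.1.0
print(v.after_change(data_changed=False))  # 1.0.1

c = CompilationVersion(publication=1)
print(c.record_added_or_removed().record_modified().record_modified())  # 1.1.2
```

Making the version objects immutable means every change yields a new identifier, in keeping with the requirement that each version be uniquely identifiable.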
We are presently implementing a large-scale test of the LiPD framework by using it as the primary data archive for Phase 2 of the PAGES2k global temperature database (PAGES2k Consortium, in prep). Consequently, the framework for describing the proxy data is fairly mature and field-tested. It also means that a large (>600 datasets), robust collection of LiPD files will soon be available publicly. LiPD has evolved to meet the needs of these diverse data; however, it may not be universal, and we welcome suggestions for increased generality.
The standards for reporting and storing geochronological data are much less tested and will require far more community input. For instance, there seems to be no universal way of reporting radiocarbon, U/Th, or ²¹⁰Pb dates. Ideally, coordination between geochronologists would yield a universal standard for all radiometric age models; however, if there is to be any standard, it is more likely to first emerge within each sub-community. JSON-LD is flexible enough to encompass any possibility, but doing so in a way that allows research algorithms to easily read those chronologies and generate age models from them will likely require more work.
Finally, it is important to realize that the JSON-LD implementation described here is just one way to represent the underlying data model. One of the many features of linked open data is that the same data model could be serialized into other representations, such as XML or Turtle, without any loss of information. This makes the framework highly flexible and allows the community to move forward with implementing these concepts without trying to predict community needs and the evolution of technology. LiPD is not a rigid container that one must force paleoclimate data into, but rather a flexible system designed to wrap around a data set. We are committed to the continued development and expansion of LiPD and look forward to evolving this preliminary ...
... data and metadata, including:
(a) One or more tables of measurements, and their metadata
(b) Variable names, units, standards, and interpretations (including forward models)
6. Geochronological data and metadata, which can include
(a) Table(s) of radiometric dating measurements and associated metadata
(b) Age model ensembles
(c) Author interpretation and methodological choices
dataSetName: name of the dataset; that is, an alphanumeric string that uniquely characterizes this record in the database, often based on site, authors, year and ancillary information. Example: RAPiD-12-1K.