gbif/pipelines

Index dwc:GeologicalContext terms

Closed this issue · 15 comments

A request by email from representatives of si.edu, fsu.edu, and colorado.edu, supported by the TDWG Earth Sciences and Paleobiology Interest Group:

[We] co-facilitate a paleo data working group mostly composed of people working with paleo collections and creating records that are published to aggregators. One point of discussion for the group has been the accessibility and discoverability of paleo data within aggregators like GBIF. A vital data point for fossil occurrences is found in the Darwin Core Geological Context terms, but these terms are not searchable in the current GBIF portal interface. We are inquiring about the current situation for indexing these terms, the possibility of indexing some or all of them if they aren’t already, and whether they can be made searchable within the search interface. We note that the availability of geologic context terms in the iDigBio Portal, both via their search interface and their API directly, has greatly increased the use of iDigBio by paleo collections in the United States. This inquiry is also supported by the TDWG Earth Sciences and Paleobiology Interest Group.

The terms being requested for search are all in the DwC namespace:

  • earliestEonOrLowestEonothem
  • latestEonOrHighestEonothem
  • earliestEraOrLowestErathem
  • latestEraOrHighestErathem
  • earliestPeriodOrLowestSystem
  • latestPeriodOrHighestSystem
  • earliestEpochOrLowestSeries
  • latestEpochOrHighestSeries
  • earliestAgeOrLowestStage
  • latestAgeOrHighestStage
  • lowestBiostratigraphicZone
  • highestBiostratigraphicZone
  • group
  • formation
  • member
  • bed

The new ability to query GBIF occurrence records based on the presence of DWCA extension data is really awesome! I'm curious if this issue is on a development timeline yet?

Hi @ekrimmel
We haven't scheduled this work yet.

I strongly agree that the Darwin Core Geological Context terms should be indexed and added to the GBIF portal's search interface. Simply put, without them, paleontological datasets on GBIF are limited in use, which is disappointing (and frustrating) given the significant resources the paleo community puts toward digitizing its lithostratigraphic data. For this reason, I often direct researchers to iDigBio because their portal does accommodate these terms.

I also support the added functionality generated from indexing the Darwin Core Geological Context terms (or whatever needs to be done to make them searchable) and adding them to the GBIF portal search interface. It will be a great benefit to paleontologists and their ability to conduct their research using GBIF.

There are today 10,207,610 fossils in the GBIF index.

The table below shows how often the various fields are filled:

field                                   all    fossils
v_geologicalcontextid               5951468     770879
v_earliesteonorlowesteonothem       1062653    1016752
v_latesteonorhighesteonothem        1031205     998128
v_earliesteraorlowesterathem        3093787    3018294
v_latesteraorhighesterathem         1494325    1461525
v_earliestperiodorlowestsystem      5158558    5016526
v_latestperiodorhighestsystem       2001610    1950960
v_earliestepochorlowestseries       4271142    4178616
v_latestepochorhighestseries        1766363    1725848
v_earliestageorloweststage          2968623    2944163
v_latestageorhigheststage            822583     815833
v_lowestbiostratigraphiczone         839441     838786
v_highestbiostratigraphiczone        207628     207000
v_lithostratigraphicterms           1949315    1941785
v_group                             1855677    1836342
v_formation                         4679810    4676516
v_member                            1156953    1156154
v_bed                                421526     421506

SQL for convenience:
SELECT  
count(v_geologicalcontextid) as v_geologicalcontextid,
count(v_earliesteonorlowesteonothem) as v_earliesteonorlowesteonothem,
count(v_latesteonorhighesteonothem) as v_latesteonorhighesteonothem,
count(v_earliesteraorlowesterathem) as v_earliesteraorlowesterathem,
count(v_latesteraorhighesterathem) as v_latesteraorhighesterathem,
count(v_earliestperiodorlowestsystem) as v_earliestperiodorlowestsystem,
count(v_latestperiodorhighestsystem) as v_latestperiodorhighestsystem,
count(v_earliestepochorlowestseries) as v_earliestepochorlowestseries,
count(v_latestepochorhighestseries) as v_latestepochorhighestseries,
count(v_earliestageorloweststage) as v_earliestageorloweststage,
count(v_latestageorhigheststage) as v_latestageorhigheststage,
count(v_lowestbiostratigraphiczone) as v_lowestbiostratigraphiczone,
count(v_highestbiostratigraphiczone) as v_highestbiostratigraphiczone,
count(v_lithostratigraphicterms) as v_lithostratigraphicterms,
count(v_group) as v_group,
count(v_formation) as v_formation,
count(v_member) as v_member,
count(v_bed) as v_bed
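
To put the raw counts in perspective, they can be turned into fill rates. The sketch below is only illustrative: it copies a handful of the fossil counts from the table above and divides them by the 10,207,610 fossil total, which assumes both numbers come from the same snapshot of the index. The names `FOSSIL_TOTAL`, `fossil_counts` and `fill_rate` are made up for this example.

```python
# Total number of fossil records, as quoted in the comment above.
FOSSIL_TOTAL = 10_207_610

# A subset of the fossil counts from the table above (non-null verbatim values).
fossil_counts = {
    "v_earliestperiodorlowestsystem": 5_016_526,
    "v_formation": 4_676_516,
    "v_earliestepochorlowestseries": 4_178_616,
    "v_earliesteraorlowesterathem": 3_018_294,
    "v_earliestageorloweststage": 2_944_163,
    "v_group": 1_836_342,
    "v_member": 1_156_154,
    "v_earliesteonorlowesteonothem": 1_016_752,
    "v_bed": 421_506,
}


def fill_rate(count, total=FOSSIL_TOTAL):
    """Return the percentage of records in which a field is filled."""
    return 100.0 * count / total


rates = {field: fill_rate(n) for field, n in fossil_counts.items()}

# Print the fields from most to least frequently filled.
for field, pct in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{field:<34}{pct:5.1f}%")
```

Under that assumption, roughly half of the fossil records carry a period/system value and a little under half carry a formation, while bed is filled in only a small fraction.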

In an attempt to make it easier to get this request implemented, it might be nice to clarify what processing, if any, is expected.
As far as I know, the fields in the standard do not have a vocabulary:
https://dwc.tdwg.org/list/#dwc_earliestEonOrLowestEonothem

Would this just be text exactly as provided by the publisher, or would it have to be normalised and aligned with a vocabulary? (I imagine the latter.) Is there an existing vocabulary, or should one be created?

So exciting to see continued attention being put towards this, @MortenHofft – thank you! Processing the values in these fields would be very helpful for discoverability.

There are two main categories of terms in the DwC Geological Context class: those referring to chronostratigraphy (time) and those referring to lithostratigraphy (rock). You are correct that neither has an official vocabulary to reference for processing; however, there are resources for each that could be the basis of a community vocabulary maintained on GBIF.

A vocabulary for chronostratigraphy is relatively straightforward. The International Commission on Stratigraphy publishes a chronostratigraphic chart (unfortunately not in a very digitally accessible format) that includes official values you would expect to find for chronostratigraphy. There are also regional variations for chronostratigraphic values, but accommodating those could be part of a second layer of complexity for vocabulary development. The following DwC terms describe chronostratigraphy:

  • earliestEonOrLowestEonothem
  • latestEonOrHighestEonothem
  • earliestEraOrLowestErathem
  • latestEraOrHighestErathem
  • earliestPeriodOrLowestSystem
  • latestPeriodOrHighestSystem
  • earliestEpochOrLowestSeries
  • latestEpochOrHighestSeries
  • earliestAgeOrLowestStage
  • latestAgeOrHighestStage

Lithostratigraphy is more complex. Macrostrat makes many values for North American stratigraphic units available via a GUI and an API. I am unsure whether there are complementary regional equivalents or a globally scoped source. You could expect to find values for the following DwC terms on Macrostrat:

  • group
  • formation
  • member
  • bed

The final two terms in the DwC Geological Context class are lowestBiostratigraphicZone and highestBiostratigraphicZone, and I do not think there is a good source of globally scoped values to use in a vocabulary for these fields. Processing for these fields is also lower priority for us.

We recognize that the data coming from providers in these fields is not as uniform as it could be (@tucotuco acquired unique value lists for us last year so that we could see just how "not uniform"). But even as data providers recognize this and agree to better regulate ourselves, we'll never be able to provide data as uniform as GBIF processing can make it. Members of our Paleo Data Working Group would be excited to put time into creating and maintaining controlled vocabularies for the GeologicalContext terms if it means GBIF will begin indexing them, so please let us know how we can help!

Thank you for a thorough answer and for volunteering to help! Does the below make sense?

Would it make sense to start with a vocabulary for chronostratigraphy, and just have raw values for lithostratigraphy and BiostratigraphicZone? If global resources for the latter two are discovered or created, we could revisit.

Chronostratigraphy
For chronostratigraphy, the page you link to points to this interactive resource, which links to a GitHub repo with JSON, which in turn links to a vocabulary hosted by CSIRO.
At first glance it looks easy to generate a GBIF vocabulary based on e.g. the JSON.
Ideally, inconsistent values would get a quality issue flag (e.g. Era: Mesoproterozoic; System: Orosirian is wrong, I assume, as Orosirian sits under the Paleoproterozoic Era).
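
As a rough illustration of such a consistency check, here is a minimal sketch. It assumes the vocabulary's parent/child hierarchy has already been loaded; the small `PERIODS_BY_ERA` excerpt is typed in by hand for the example, and the flag name `ERA_PERIOD_MISMATCH` is hypothetical, not an existing GBIF issue flag.

```python
# Hand-written excerpt of the era -> period/system hierarchy (illustrative only).
# A real check would load the full hierarchy from the vocabulary instead.
PERIODS_BY_ERA = {
    "Paleoproterozoic": {"Siderian", "Rhyacian", "Orosirian", "Statherian"},
    "Mesoproterozoic": {"Calymmian", "Ectasian", "Stenian"},
}

# Reverse lookup: period/system -> the era it sits under.
ERA_OF_PERIOD = {
    period: era for era, periods in PERIODS_BY_ERA.items() for period in periods
}

# Hypothetical flag name used only in this sketch.
ERA_PERIOD_MISMATCH = "ERA_PERIOD_MISMATCH"


def check_era_period(era, period):
    """Return a list of issue flags for an era/period pair.

    A flag is raised only when both values are known to the hierarchy and
    the period's parent era differs from the supplied era. Unknown values
    are left alone rather than flagged.
    """
    expected_era = ERA_OF_PERIOD.get(period)
    if era in PERIODS_BY_ERA and expected_era is not None and expected_era != era:
        return [ERA_PERIOD_MISMATCH]
    return []


print(check_era_period("Mesoproterozoic", "Orosirian"))   # inconsistent pair
print(check_era_period("Paleoproterozoic", "Orosirian"))  # consistent pair
```

Leaving unknown values unflagged is a deliberate choice in this sketch: a value that fails to match the vocabulary is a different problem from two matched values that contradict each other.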

What you propose above sounds like a terrific start!

@timrobertson100 @marcos-lg (anyone I should tag instead?) would it make sense to import to the vocabulary server straight away so any further editing could happen within the tool? The vocabulary has nested concepts.

@ekrimmel and the rest of the Paleo Data Working Group - I have created an issue and a worksheet for a Chronostratigraphy vocabulary: gbif/vocabulary#121. The first step would be to map the verbatim values from the unique values list that was created to the concepts, and it would be great if people from the working group could do this! If you want to get started, I propose we have a short call where I present the layout of the sheet, the parent/child relationship between the concepts, and how we usually deal with mapping the verbatim values.
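
To make the mapping step concrete, here is a small sketch of how verbatim values might be matched to concepts once the sheet exists. The lookup entries are invented for illustration and are not taken from the worksheet or the unique values list; `VERBATIM_TO_CONCEPT`, `normalise` and `map_to_concept` are hypothetical names.

```python
import re

# Hypothetical lookup from normalised verbatim strings to concept names.
# These entries are illustrative only, not taken from the actual worksheet.
VERBATIM_TO_CONCEPT = {
    "paleogene": "Paleogene",
    "palaeogene": "Paleogene",
    "lower cretaceous": "Lower Cretaceous",
    "early cretaceous": "Lower Cretaceous",
}


def normalise(verbatim):
    """Collapse runs of whitespace, trim the ends and lower-case the value."""
    return re.sub(r"\s+", " ", verbatim).strip().lower()


def map_to_concept(verbatim):
    """Return the matching concept name, or None if the value is unmapped."""
    return VERBATIM_TO_CONCEPT.get(normalise(verbatim))


for value in ["  PALAEOGENE ", "Early   Cretaceous", "Tertiary?"]:
    print(repr(value), "->", map_to_concept(value))
```

Values that come back as `None` are the ones that would still need a human decision in the sheet.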

Please contact helpdesk@gbif.org if you want to set up a meeting.

Thank you, @CecSve! I'll coordinate within our group and see who is available to help, then reach out via the help desk email to set up a meeting.

I am aware that substantial work went into collating data in preparation for upload to the vocabulary server (https://registry.gbif.org/vocabulary/search, gbif/vocabulary#121), but I would recommend finalizing and uploading it before starting to use it in data interpretation / mapping and search support. With @CecSve still absent, is there any update on the status of the vocabulary pre-import spreadsheet, @ekrimmel? Would you consider it ready for import? I should say that I presently do not have access to that working sheet (https://docs.google.com/spreadsheets/d/1k3YpAeRT3HxR9DBnkh0jkZZl12jimkHU3_H_pCPOUHc/edit?usp=sharing).

Deployed to PROD

Thanks @muttcg! Occurrence pages now show the geologicalContext terms (for example https://www.gbif.org/occurrence/4142185317), and the terms are searchable in the API: https://api.gbif.org/v1/occurrence/search?earliestPeriodOrLowestSystem=Paleogene @ekrimmel
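
For anyone who wants to script the search, here is a minimal Python sketch built around the URL above, using only the standard library. The helper names `build_search_url` and `search_occurrences` are made up for this example; the endpoint and the `earliestPeriodOrLowestSystem` parameter come from the link above, and `limit` is the API's usual paging parameter.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Occurrence search endpoint, as used in the example URL above.
API_URL = "https://api.gbif.org/v1/occurrence/search"


def build_search_url(limit=5, **filters):
    """Build an occurrence search URL from keyword filters plus a page size."""
    params = dict(filters)
    params["limit"] = limit
    return f"{API_URL}?{urlencode(params)}"


def search_occurrences(limit=5, **filters):
    """Fetch one page of search results and return the decoded JSON response."""
    with urlopen(build_search_url(limit, **filters), timeout=30) as response:
        return json.load(response)


# Building the URL does not touch the network.
print(build_search_url(earliestPeriodOrLowestSystem="Paleogene"))
```

Calling `search_occurrences(earliestPeriodOrLowestSystem="Paleogene")` should then return a page of results; the total number of matches is expected in the response's `count` field and the records themselves in `results`.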