Index dwc:GeologicalContext terms
Closed this issue · 15 comments
A request by email from representatives of si.edu, fsu.edu, and colorado.edu, supported by the TDWG Earth Sciences and Paleobiology Interest Group:
[We] co-facilitate a paleo data working group mostly composed of people working with paleo collections and creating records that are published to aggregators. One point of discussion for the group has been the accessibility and discoverability of paleo data within aggregators like GBIF. A vital data point for fossil occurrences is found in the Darwin Core Geological Context terms, but these terms are not searchable in the current GBIF portal interface. We are inquiring about what the current situation is for indexing of these terms and the possibilities of indexing some or all of these terms if they aren't already, and if they can be made searchable within the search interface. We note that the availability of geologic context terms in the iDigBio Portal, both via their search interface and their API directly, has greatly increased the use of iDigBio by paleo collections in the United States. This inquiry is also supported by the TDWG Earth Sciences and Paleobiology Interest Group.
The terms being requested for search are all in the DwC namespace:
- earliestEonOrLowestEonothem
- latestEonOrHighestEonothem
- earliestEraOrLowestErathem
- latestEraOrHighestErathem
- earliestPeriodOrLowestSystem
- latestPeriodOrHighestSystem
- earliestEpochOrLowestSeries
- latestEpochOrHighestSeries
- earliestAgeOrLowestStage
- latestAgeOrHighestStage
- lowestBiostratigraphicZone
- highestBiostratigraphicZone
- group
- formation
- member
- bed
The new ability to query GBIF occurrence records based on the presence of DWCA extension data is really awesome! I'm curious if this issue is on a development timeline yet?
I strongly agree that the Darwin Core Geological Context terms should be indexed and added to the GBIF portal's search interface. Simply put, without them, paleontological datasets on GBIF are limited in use, which is disappointing (and frustrating) given the significant resources the paleo community puts toward digitizing its lithostratigraphic data. For this reason, I often direct researchers to iDigBio because their portal does accommodate these terms.
I also support the added functionality that would come from indexing the Darwin Core Geological Context terms (or whatever needs to be done to make them searchable) and adding them to the GBIF portal search interface. It will be a great benefit to paleontologists and their ability to conduct their research using GBIF.
There are today 10,207,610 fossils in the GBIF index.
The table below shows how often the various fields are filled.
data scope | v_geologicalcontextid | v_earliesteonorlowesteonothem | v_latesteonorhighesteonothem | v_earliesteraorlowesterathem | v_latesteraorhighesterathem | v_earliestperiodorlowestsystem | v_latestperiodorhighestsystem | v_earliestepochorlowestseries | v_latestepochorhighestseries | v_earliestageorloweststage | v_latestageorhigheststage | v_lowestbiostratigraphiczone | v_highestbiostratigraphiczone | v_lithostratigraphicterms | v_group | v_formation | v_member | v_bed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
all | 5951468 | 1062653 | 1031205 | 3093787 | 1494325 | 5158558 | 2001610 | 4271142 | 1766363 | 2968623 | 822583 | 839441 | 207628 | 1949315 | 1855677 | 4679810 | 1156953 | 421526 |
fossils | 770879 | 1016752 | 998128 | 3018294 | 1461525 | 5016526 | 1950960 | 4178616 | 1725848 | 2944163 | 815833 | 838786 | 207000 | 1941785 | 1836342 | 4676516 | 1156154 | 421506 |
SQL for convenience
SELECT
count(v_geologicalcontextid) as v_geologicalcontextid,
count(v_earliesteonorlowesteonothem) as v_earliesteonorlowesteonothem,
count(v_latesteonorhighesteonothem) as v_latesteonorhighesteonothem,
count(v_earliesteraorlowesterathem) as v_earliesteraorlowesterathem,
count(v_latesteraorhighesterathem) as v_latesteraorhighesterathem,
count(v_earliestperiodorlowestsystem) as v_earliestperiodorlowestsystem,
count(v_latestperiodorhighestsystem) as v_latestperiodorhighestsystem,
count(v_earliestepochorlowestseries) as v_earliestepochorlowestseries,
count(v_latestepochorhighestseries) as v_latestepochorhighestseries,
count(v_earliestageorloweststage) as v_earliestageorloweststage,
count(v_latestageorhigheststage) as v_latestageorhigheststage,
count(v_lowestbiostratigraphiczone) as v_lowestbiostratigraphiczone,
count(v_highestbiostratigraphiczone) as v_highestbiostratigraphiczone,
count(v_lithostratigraphicterms) as v_lithostratigraphicterms,
count(v_group) as v_group,
count(v_formation) as v_formation,
count(v_member) as v_member,
count(v_bed) as v_bed
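A note on what these counts measure: in standard SQL, `count(column)` counts only the rows where that column is not NULL, so each figure is the number of records with the field filled. The minimal sketch below illustrates this with Python's built-in `sqlite3` module. The table name `occurrence_demo` and the three sample rows are invented for illustration; they are not the GBIF schema or data.

```python
import sqlite3

def filled_counts(rows):
    """Count non-NULL values per column, the way count(column) does in SQL."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table: only two of the verbatim columns are modelled here.
    conn.execute("CREATE TABLE occurrence_demo (v_formation TEXT, v_bed TEXT)")
    conn.executemany("INSERT INTO occurrence_demo VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT count(*) AS total, "
        "count(v_formation) AS v_formation, "
        "count(v_bed) AS v_bed "
        "FROM occurrence_demo"
    ).fetchone()
    conn.close()
    return {"total": result[0], "v_formation": result[1], "v_bed": result[2]}

# Invented sample rows; None becomes NULL in SQLite.
sample = [
    ("Green River Fm", None),
    ("Green River Fm", "Kimball Mountain Bed"),
    (None, None),
]
counts = filled_counts(sample)
print(counts)
```

Rows where a field is NULL add to `total` but not to that field's count, which is why the per-field numbers in the table differ so much from one another.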
To make it easier to get this request implemented, it would be good to clarify what processing, if any, is expected.
The fields in the standard do not have a vocabulary as far as I know
https://dwc.tdwg.org/list/#dwc_earliestEonOrLowestEonothem
Would this just be text exactly as provided by the publisher or would it have to be normalised and aligned with a vocabulary (I imagine the latter). Is there an existing vocabulary or should one be created?
So exciting to see continued attention being put towards this, @MortenHofft, thank you! Processing the values in these fields would be very helpful for discoverability.
There are two main categories of terms in the DwC Geological Context class: those referring to chronostratigraphy (time) and those referring to lithostratigraphy (rock). You are correct that neither has an official vocabulary to reference for processing; however, there are resources for each that could be the basis of a community vocabulary maintained on GBIF.
A vocabulary for chronostratigraphy is relatively straightforward. The International Commission on Stratigraphy publishes a chronostratigraphic chart (unfortunately not in a very digitally accessible format) that includes official values you would expect to find for chronostratigraphy. There are also regional variations for chronostratigraphic values, but accommodating those could be part of a second layer of complexity for vocabulary development. The following DwC terms describe chronostratigraphy:
- earliestEonOrLowestEonothem
- latestEonOrHighestEonothem
- earliestEraOrLowestErathem
- latestEraOrHighestErathem
- earliestPeriodOrLowestSystem
- latestPeriodOrHighestSystem
- earliestEpochOrLowestSeries
- latestEpochOrHighestSeries
- earliestAgeOrLowestStage
- latestAgeOrHighestStage
Lithostratigraphy is more complex. Macrostrat makes a lot of values for North American stratigraphic units available via GUI and an API. Unsure if there are complementary regional equivalents, or a globally scoped source. You could expect to find values for the following DwC terms on Macrostrat:
- group, e.g. "Admire Gp"
- formation, e.g. "Green River Fm"
- member, e.g. "Douglas Creek Mbr"
- bed, e.g. "Kimball Mountain Bed"
The final two terms in the DwC Geological Context class are lowestBiostratigraphicZone and highestBiostratigraphicZone, and I do not think there is a good source of globally scoped values to use in a vocabulary for these fields. Processing for these fields is also lower priority for us.
We recognize that the data coming from providers in these fields is not as uniform as it could be (@tucotuco acquired unique value lists for us last year so that we could see just how "not uniform"). But even as data providers recognize this and agree to better regulate ourselves, we'll never be able to provide data as uniform as GBIF processing can make it. Members of our Paleo Data Working Group would be excited to put time into creating and maintaining controlled vocabularies for the GeologicalContext terms if it means GBIF will begin indexing them, so please let us know how we can help!
Thank you for a thorough answer and for volunteering to help! Does the proposal below make sense?
Would it make sense to start with a vocabulary for chronostratigraphy, and just have raw values for lithostratigraphy and BiostratigraphicZone? If global resources for the latter two are discovered or created, we could revisit.
Chronostratigraphy
For chronostratigraphy, the page you link to links to this interactive resource, which links to a GitHub repo with JSON, which in turn links to a vocabulary hosted by CSIRO.
At first glance it looks easy to generate a GBIF vocabulary based on e.g. the JSON.
Ideally, inconsistent values would get a quality issue flag (e.g. Era: Mesoproterozoic; System: Orosirian is wrong, I assume, as Orosirian sits under the Paleoproterozoic Era).
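The consistency check suggested here could be sketched roughly as follows. This is only an illustration under stated assumptions: the lookup covers just the Paleoproterozoic and Mesoproterozoic periods from the ICS chart, and the flag name `ERA_PERIOD_MISMATCH` is hypothetical, not an existing GBIF issue flag.

```python
# Partial lookup (period/system -> era/erathem) for two Proterozoic eras only.
# A real check would load the full chronostratigraphy vocabulary instead.
PERIOD_TO_ERA = {
    "Siderian": "Paleoproterozoic",
    "Rhyacian": "Paleoproterozoic",
    "Orosirian": "Paleoproterozoic",
    "Statherian": "Paleoproterozoic",
    "Calymmian": "Mesoproterozoic",
    "Ectasian": "Mesoproterozoic",
    "Stenian": "Mesoproterozoic",
}

# Hypothetical flag name, chosen for this sketch only.
ERA_PERIOD_MISMATCH = "ERA_PERIOD_MISMATCH"

def check_era_period(era, period):
    """Return a list of issue flags for one era/period pair."""
    if not era or not period:
        return []  # nothing to compare when either value is missing
    expected_era = PERIOD_TO_ERA.get(period.strip().title())
    if expected_era is None:
        return []  # period not in the partial lookup: no judgement made
    if expected_era.lower() != era.strip().lower():
        return [ERA_PERIOD_MISMATCH]
    return []

# The example from the comment: Orosirian does not belong to the Mesoproterozoic.
print(check_era_period("Mesoproterozoic", "Orosirian"))
print(check_era_period("Paleoproterozoic", "Orosirian"))
```

Unknown or missing values are left unflagged in this sketch, so the flag only fires when both values are recognised and contradict each other.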
What you propose above sounds like a terrific start!
@timrobertson100 @marcos-lg (anyone I should tag instead?) would it make sense to import to the vocabulary server straight away so any further editing could happen within the tool? The vocabulary has nested concepts.
@ekrimmel and the rest of the Paleo Data Working Group - I have created an issue and a worksheet for a Chronostratigraphy vocabulary: gbif/vocabulary#121. The first step would be to map the verbatim values from the unique values list that was created to the concepts, and it would be great if people from the working group could do this! If you want to get started, I propose we have a short call where I present the layout of the sheet, the parent/child relationship between the concepts, and how we usually deal with mapping the verbatim values.
Please contact helpdesk@gbif.org if you want to set up a meeting.
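To make the mapping step described above more concrete, here is a minimal sketch of resolving verbatim values to vocabulary concepts with a parent/child relationship. The `Concept` class, the `labels` field, and the three sample concepts are assumptions made for illustration; they do not reflect the layout of the worksheet or the GBIF vocabulary server's data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

@dataclass
class Concept:
    name: str
    parent: Optional[str] = None                    # name of the parent concept, if any
    labels: Set[str] = field(default_factory=set)   # verbatim spellings mapped to this concept

def normalise(value: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(value.lower().split())

def build_lookup(concepts: List[Concept]) -> Dict[str, Concept]:
    """Index every concept by its own name and by each mapped verbatim label."""
    lookup: Dict[str, Concept] = {}
    for concept in concepts:
        for label in {concept.name, *concept.labels}:
            lookup[normalise(label)] = concept
    return lookup

def interpret(verbatim: Optional[str], lookup: Dict[str, Concept]) -> Optional[Concept]:
    """Return the matching concept, or None when the value is not mapped."""
    if verbatim is None:
        return None
    return lookup.get(normalise(verbatim))

# Illustrative concepts only; a real vocabulary would be far larger.
CONCEPTS = [
    Concept("Cenozoic"),
    Concept("Paleogene", parent="Cenozoic", labels={"Palaeogene"}),
    Concept("Eocene", parent="Paleogene"),
]
LOOKUP = build_lookup(CONCEPTS)

match = interpret("  PALAEOGENE ", LOOKUP)
print(match.name, match.parent)
```

Values that do not resolve return `None`, so they can be collected and reviewed by hand rather than silently guessed.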
Thank you, @CecSve! I'll coordinate within our group and see who is available to help, then reach out via the help desk email to set up a meeting.
I am aware that substantial work went into collating data in preparation for upload to the vocabulary server (https://registry.gbif.org/vocabulary/search, gbif/vocabulary#121), but I would recommend finalizing and uploading it before starting to use it in data interpretation, mapping, and search support. With @CecSve still absent, is there any update on the status of the vocabulary pre-import spreadsheet, @ekrimmel? Would you consider it ready for import? I should say that I presently do not have access to that working sheet (https://docs.google.com/spreadsheets/d/1k3YpAeRT3HxR9DBnkh0jkZZl12jimkHU3_H_pCPOUHc/edit?usp=sharing).
Deployed to PROD
Thanks @muttcg! Occurrence pages now show the terms within geologicalContext (for example https://www.gbif.org/occurrence/4142185317), and the terms are searchable in the API: https://api.gbif.org/v1/occurrence/search?earliestPeriodOrLowestSystem=Paleogene @ekrimmel
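For anyone who wants to script this, a minimal sketch of building the search URL from the example above with the Python standard library follows. Only `earliestPeriodOrLowestSystem` is confirmed in this thread; that the other Geological Context terms work as parameters in the same way is an assumption here. The helper names `build_search_url` and `fetch_json` are made up for the sketch.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.gbif.org/v1/occurrence/search"

def build_search_url(**params):
    """Build an occurrence search URL, skipping parameters set to None."""
    query = {key: value for key, value in params.items() if value is not None}
    return f"{BASE_URL}?{urlencode(query)}" if query else BASE_URL

def fetch_json(url):
    """Fetch and decode a JSON response (needs network access; not called here)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)

# Same filter as the example URL above, limited to a few records per page.
url = build_search_url(earliestPeriodOrLowestSystem="Paleogene", limit=5)
print(url)
```

Calling `fetch_json(url)` would perform the actual request; it is kept separate so that the URL can be inspected first.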