gbif/pipelines

Request for new fields to index and expose in search and download


There are quite a lot of ideas for fields to add to the index so that users can search and download using these new filters.
I've collected them all here (at least the ones I could find).

From #400

  • earliestEonOrLowestEonothem
  • latestEonOrHighestEonothem
  • earliestEraOrLowestErathem
  • latestEraOrHighestErathem
  • earliestPeriodOrLowestSystem
  • latestPeriodOrHighestSystem
  • earliestEpochOrLowestSeries
  • latestEpochOrHighestSeries
  • earliestAgeOrLowestStage
  • latestAgeOrHighestStage
  • lowestBiostratigraphicZone
  • highestBiostratigraphicZone
  • group
  • formation
  • member
  • bed

From #7

  • acceptedKey

From #182

  • verbatimTaxonKey

From #425

  • gbifRegion
  • publishedByGbifRegion

From #515

  • fieldNumber (only in index as verbatim blob) #514
  • preparations (currently indexed as keywords) #634
  • sex (currently indexed as keywords)
  • startDayOfYear (currently indexed as shorts)
  • endDayOfYear (currently indexed as shorts)
  • higherGeography
  • island
  • islandGroup
  • georeferencedBy
  • higherClassification
  • previousIdentifications

From #662

  • datasetName (has since been added)
  • datasetID (has since been added)

From #664

  • otherCatalogNumbers

From #666 (comment)

  • taxonConceptID

From #515 (comment)

  • isSequenced

Thanks for collating this. We could take the approach of responding to requests as they come in, which is certainly not to be discounted, but perhaps we might just consider what it would take to index everything? Most of the fields will be incredibly sparsely populated, and I'm not sure our original concerns from a decade ago about blowing up index sizes would hold true today.

First of all, let me know if this is not the right issue to address this in.

Second, the following dataset recently changed its DwC terms between versions 1.8 and 1.9, shifting from 'individualCount' to 'organismQuantity' + 'organismQuantityType': https://www.gbif.org/dataset/91fa1a0d-a208-40aa-8a6e-f2c0beb9b253. When a user downloads the simple version of the dataset, the 'individualCount' column is populated only with NAs, and the updated information stored in 'organismQuantity' + 'organismQuantityType' is not included. Is there a way to ensure that data is not 'lost' when publishers begin using new terms for the same thing?
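
To make the question concrete, here is a rough user-side sketch of the fallback I have in mind. It assumes the records come from a source that carries all three columns (for example the publisher's own archive), and it is not a description of how GBIF interprets these terms. The function names and the set of accepted 'organismQuantityType' values are hypothetical and only for illustration.

```python
from typing import Optional

# Assumption: these organismQuantityType values mean "a count of individuals".
# The real values used by a given publisher may differ.
INDIVIDUAL_TYPES = {"individual", "individuals"}


def _to_int(value) -> Optional[int]:
    """Parse a whole number; return None for blanks, 'NA', or non-integers."""
    if value is None:
        return None
    text = str(value).strip()
    if text == "" or text.upper() == "NA":
        return None
    try:
        number = float(text)
    except ValueError:
        return None
    return int(number) if number.is_integer() else None


def effective_individual_count(record: dict) -> Optional[int]:
    """Hypothetical helper: prefer individualCount, else fall back to
    organismQuantity when its type indicates a count of individuals."""
    count = _to_int(record.get("individualCount"))
    if count is not None:
        return count
    qty_type = str(record.get("organismQuantityType") or "").strip().lower()
    if qty_type in INDIVIDUAL_TYPES:
        return _to_int(record.get("organismQuantity"))
    return None


# Example record shaped like the case described above.
row = {
    "individualCount": "NA",
    "organismQuantity": "12",
    "organismQuantityType": "individuals",
}
print(effective_individual_count(row))
```

Quantities with any other type (for example a percentage cover) are deliberately left as missing, since they cannot be read as a number of individuals.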

taxonConceptID would be another one to search by, e.g. for Avibase IDs or taxonid.org identifiers:
https://www.gbif.org/occurrence/3457928716

  • sex (currently indexed as keywords)

We are working on a controlled vocabulary for this (gbif/vocabulary#83), but it is not finalized yet.