gbif/checklistbank

Regression for Nematoda

Closed this issue · 10 comments

Nematoda (worms) vs Nemapoda (insects) - the verbatim value most likely refers to the worm phylum, whose spelling it matches exactly, not the dipteran insect genus

{
  "count": 2483,
  "verbatim_kingdom": "null",
  "verbatim_phylum": "null",
  "verbatim_class": "null",
  "verbatim_order": "null",
  "verbatim_family": "null",
  "verbatim_genus": "Nematoda",
  "verbatim_species": "null",
  "verbatim_infra": "null",
  "verbatim_rank": "null",
  "verbatim_verbatimRank": "null",
  "verbatim_scientificName": "Nematoda",
  "verbatim_generic": "null",
  "verbatim_author": "null",
  "current_kingdom": "Animalia",
  "current_phylum": "Nematoda",
  "current_class": "null",
  "current_order": "null",
  "current_family": "null",
  "current_genus": "null",
  "current_subGenus": "null",
  "current_species": "null",
  "current_scientificName": "Nematoda",
  "current_acceptedScientificName": "Nematoda",
  "current_kingdomKey": 1,
  "current_phylumKey": 5967481,
  "current_classKey": "null",
  "current_orderKey": "null",
  "current_familyKey": "null",
  "current_genusKey": "null",
  "current_subGenusKey": "null",
  "current_speciesKey": "null",
  "current_taxonKey": 5967481,
  "current_acceptedTaxonKey": 5967481,
  "proposed_kingdom": "Animalia",
  "proposed_phylum": "Arthropoda",
  "proposed_class": "Insecta",
  "proposed_order": "Diptera",
  "proposed_family": "Sepsidae",
  "proposed_genus": "Nemapoda",
  "proposed_subGenus": "null",
  "proposed_species": "null",
  "proposed_scientificName": "Nemapoda",
  "proposed_acceptedScientificName": "Nemapoda",
  "proposed_kingdomKey": 1,
  "proposed_phylumKey": 54,
  "proposed_classKey": 216,
  "proposed_orderKey": 811,
  "proposed_familyKey": 3523,
  "proposed_genusKey": 6134734,
  "proposed_subGenusKey": "null",
  "proposed_speciesKey": "null",
  "proposed_taxonKey": 6134734,
  "proposed_acceptedTaxonKey": 6134734,
  "_key": 542,
  "changes": {
    "phylum": true,
    "phylumKey": true,
    "class": true,
    "classKey": true,
    "order": true,
    "orderKey": true,
    "family": true,
    "familyKey": true,
    "genus": true,
    "genusKey": true,
    "scientificName": true,
    "acceptedScientificName": true,
    "taxonKey": true
  },
  "reviewed": false
}

A match for purely Nematoda still matches the phylum: http://backbonebuild-vh.gbif.org:9000/species/match?verbose=true&name=Nematoda

and even with the (wrong) genus given as in the example record it works:
http://backbonebuild-vh.gbif.org:9000/species/match?verbose=true&name=Nematoda&genus=Nematoda

@timrobertson100 any idea how that can be?

Only when the genus rank is requested does it snap to the fly:
http://backbonebuild-vh.gbif.org:9000/species/match?verbose=true&name=Nematoda&genus=Nematoda&rank=genus
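
To compare the variants side by side, the three requests above can be generated from one set of parameters. This is only a minimal sketch: the base URL and parameter names are copied from the links in this thread, the `match_url` helper is hypothetical, and nothing here actually calls the service.

```python
from urllib.parse import urlencode

# Base URL as it appears in the links in this thread.
BASE = "http://backbonebuild-vh.gbif.org:9000/species/match"


def match_url(**params):
    """Hypothetical helper: build a match URL, always with verbose=true."""
    query = {"verbose": "true", **params}
    return f"{BASE}?{urlencode(query)}"


# The three parameter combinations discussed above.
variants = {
    "name only": match_url(name="Nematoda"),
    "name + genus": match_url(name="Nematoda", genus="Nematoda"),
    "name + genus + rank": match_url(name="Nematoda", genus="Nematoda", rank="genus"),
}

for label, url in variants.items():
    print(f"{label}: {url}")
```

According to the results reported here, only the last variant matches the fly when both a name and a genus are given.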

I'm just using the standard production code, but looking through it I found this

@mdoering - does it look to you as though a request for "genus = Nematoda" and one for "scientificName = Nematoda" would produce the same key? My thinking is that this might introduce unpredictable results that depend on the calling order once caching comes into play

Yes it would, as far as I can tell. I don't know what that key is used for. But appending without a delimiter is also dangerous, because very different parameters can generate the same key: genus=foo & sciName=bar would give the same key as sciName=foobar
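
The collision is easy to reproduce. The following is a minimal sketch, not the production code: `naive_key` and `safe_key` are hypothetical functions, and the only assumption carried over from the discussion is that the key is built by concatenating parameter values without a delimiter or field names.

```python
def naive_key(**params):
    """Hypothetical: concatenate the values with no delimiter and no field names."""
    return "".join(value for value in params.values() if value)


def safe_key(**params):
    """Hypothetical fix: keep the field names and sort them, so the key is
    unambiguous and does not depend on argument order."""
    return tuple(sorted((name, value) for name, value in params.items() if value))


# Different requests, same naive key.
print(naive_key(genus="foo", sciName="bar") == naive_key(sciName="foobar"))  # True
print(naive_key(genus="Nematoda") == naive_key(sciName="Nematoda"))  # True

# With field names in the key, the requests stay distinct.
print(safe_key(genus="foo", sciName="bar") == safe_key(sciName="foobar"))  # False
```

A tuple of (name, value) pairs is hashable, so it could serve directly as a cache key; any delimiter-based scheme would work as well, provided the delimiter cannot occur inside a value.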

And yes, just calling with the genus snaps to the fly: http://backbonebuild-vh.gbif.org:9000/species/match?verbose=true&genus=Nematoda

That aside: I use the production code to make sure I apply the same cleaning operations, but then I skip the caching and call directly here, replicating what the production code does.

So while I think the key generation looks suspicious, I don't think that is the cause here.

@mdoering @MattBlissett and I have diagnosed this and we understand it at least.

The code interprets the rank here and so it actually ends up calling ...genus=Nematoda&name=Nematoda&rank=genus.
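
One way such client-side interpretation can end up with rank=genus is sketched below. This is an assumption about the behaviour, not the actual cleaning code: `infer_rank` is a hypothetical function that, when no rank is supplied, takes the most specific populated verbatim field. It treats the string "null" used in the report record above as a missing value.

```python
# Verbatim field suffixes from most to least specific, with the rank each implies.
# Field names follow the report record; the mapping itself is an assumption.
RANK_FIELDS = [
    ("infra", "infraspecific"),
    ("species", "species"),
    ("genus", "genus"),
    ("family", "family"),
    ("order", "order"),
    ("class", "class"),
    ("phylum", "phylum"),
    ("kingdom", "kingdom"),
]


def is_set(value):
    """The report encodes missing values as the string "null"."""
    return value not in (None, "", "null")


def infer_rank(record):
    """Hypothetical: use the declared rank, else the most specific populated field."""
    if is_set(record.get("verbatim_rank")):
        return record["verbatim_rank"]
    for suffix, rank in RANK_FIELDS:
        if is_set(record.get(f"verbatim_{suffix}")):
            return rank
    return None


record = {
    "verbatim_genus": "Nematoda",
    "verbatim_scientificName": "Nematoda",
    "verbatim_rank": "null",
    "verbatim_species": "null",
}
print(infer_rank(record))  # genus
```

Under this assumption, a record whose only populated classification field is the genus is sent to the lookup service as a genus-rank request, which is the variant that matches the fly.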

So in this case, it's not really doing all that bad a job, given that the original record does declare the genus as Nematoda.

However, I think we should do 2 things:

  1. Stop all this cleaning on the client side and rely completely on the lookup service. It's a leftover from the days before we had a cleaning routine in the lookup service itself
  2. Verify and address a potential cache key generation issue, ensuring that "name=foobar" and "genus=foo and species=bar" do not produce the same key

@ahahn-gbif
Are you OK that we close this, knowing that we've logged this?

I.e. this is actually not such bad behavior and shouldn't block the release, given that the publisher states genus=Nematoda. We should probably alert the publisher.

Agree to close. Agree to alert the publisher in principle, when capacity allows.

Thanks

This appears to be fixed by recent changes: it is no longer listed in the latest report showing proposed changes here