How to deal with intermediate ranks not used in backbone?
On helpdesk (January 2nd 2023), a correspondence was started on whether GBIF should be stricter with ranks and apply a much higher penalty for matches across wrong rank "groups", i.e. suprafamily, family, genus, species and infraspecific names. That discussion continues in this issue.
There are similar problems with other higher taxa whose names are also those of (often rare) genera, e.g. Vertebrata. It's a simple change, but it might have more impact than desired, e.g. if a very wrong rank was given. The coding change would be simple if we think it's worth trying.
@mdoering can you expand on how such a penalty would work?
All matching is based on an overall matching score (confidence) that is the sum of various individual matching scores. If that goes below a threshold, I believe it's 80 currently, the match is discarded as a NO_MATCH.
You can see the individual scores in the remarks when verbose is on:
https://api.gbif.org/v1/species/match?verbose=true&name=Vertebrata
Similarity: name=100; authorship=0; classification=-2; rank=0; status=1; singleMatch=5
https://api.gbif.org/v1/species/match?verbose=true&name=Vertebrata&rank=phylum
Similarity: name=100; authorship=0; classification=-2; rank=-35; status=1; singleMatch=5
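To make the arithmetic behind those two remarks concrete, here is a minimal sketch that parses the `Similarity:` note and totals it. It assumes the confidence is a plain sum of the individual scores and that the cut-off is 80 (the thread only says "I believe"); the function names are hypothetical and this is not the matching service's code.

```python
import re

# Assumption: the thread only says the threshold is believed to be 80.
MATCH_THRESHOLD = 80

def parse_similarity(note: str) -> dict:
    """Turn 'Similarity: name=100; rank=-35' into {'name': 100, 'rank': -35}."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(-?\d+)", note)}

def confidence(scores: dict) -> int:
    # Assumption: the overall confidence is the plain sum of the individual scores.
    return sum(scores.values())

# The two remarks quoted in this thread for name=Vertebrata.
no_rank = parse_similarity(
    "Similarity: name=100; authorship=0; classification=-2; rank=0; status=1; singleMatch=5"
)
phylum = parse_similarity(
    "Similarity: name=100; authorship=0; classification=-2; rank=-35; status=1; singleMatch=5"
)

for label, scores in (("no rank", no_rank), ("rank=phylum", phylum)):
    total = confidence(scores)
    print(label, total, "match" if total >= MATCH_THRESHOLD else "NO_MATCH")
```

Under these assumptions the request without a rank totals 104 and the request with rank=phylum totals 69, which falls below 80 and is consistent with the genus no longer matching.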
Individual scores can be negative, e.g. the above plant genus Vertebrata is no longer considered a match when the rank phylum is given. Rank similarity scores are set to 6 if the ranks are equal; otherwise the distance in the rank enumeration is calculated and used as a negative score (penalty). As phylum and genus are quite far apart, that leads to -35, which already makes a match hardly possible. So in those cases we already have a big penalty that prevents matches. The problem we had was the unknown rank superdomain, which results in a zero rank score: https://api.gbif.org/v1/species/match?verbose=true&name=Vertebrata&rank=superdomain
If we added superdomain to the ranks enum this would address the problem.
We could also score all unknown (=other) ranks badly, but that would lead to more no-matches for good name matches when an awkward rank is given, e.g. one in a non-English language we don't yet support. The German term Gattung is understood, but the Dutch Geslacht and likely many others are not. So I think it is too dangerous to apply a penalty to all unknown ranks.
The original idea with rank groups was that there are distinct groups of ranks (infraspecific/trinomial, species/binomial, genus, family and above groups), so we could increase the penalty if the match is between such groups and increase the score slightly if it's still within the same group. Species, genus and family group names are explicitly dealt with in the zoological code, and you can just use a genus name as a subgenus without publishing it as a new name.
Simply adding superdomain to our rank enum would probably address our problem at hand best though.
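The rank scoring and the rank-group idea described above can be sketched as follows. This is an illustration only: the shortened `RANKS` list, the grouping function and the `cross_group_penalty` parameter are hypothetical, and the magnitudes do not reproduce the service's real values (the thread reports -35 for phylum versus genus, which implies a different scale than this toy list).

```python
# Hypothetical, shortened rank order; the real GBIF Rank enum has many more entries.
RANKS = ["domain", "kingdom", "phylum", "class", "order",
         "family", "genus", "species", "subspecies"]

def rank_group(rank: str) -> str:
    """Illustrative grouping following the groups named in the thread."""
    if rank == "subspecies":
        return "infraspecific"
    if rank in ("species", "genus"):
        return rank
    return "family and above"

def rank_similarity(given: str, candidate: str, cross_group_penalty: int = 0) -> int:
    # Unknown ranks get neither a reward nor a penalty (score 0).
    if given not in RANKS or candidate not in RANKS:
        return 0
    # Equal ranks get the small bonus mentioned in the thread.
    if given == candidate:
        return 6
    # Otherwise the distance in the rank order becomes a negative score.
    score = -abs(RANKS.index(given) - RANKS.index(candidate))
    # Optional extra penalty when the two ranks fall into different groups.
    if rank_group(given) != rank_group(candidate):
        score -= cross_group_penalty
    return score

print(rank_similarity("genus", "genus"))        # equal ranks
print(rank_similarity("superdomain", "genus"))  # unknown rank, no penalty
print(rank_similarity("phylum", "genus"))       # distance-based penalty
print(rank_similarity("phylum", "genus", cross_group_penalty=10))
```

In this sketch an unrecognised superdomain scores 0, so it cannot veto a genus match, whereas once it is understood as a known high rank it would receive a distance-based penalty like any other mismatch.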
Thank you for explaining this in detail. I see your point that it might muddy the picture if unknown ranks received a negative score.
I suspect it is not the enum resource you are referring to, but I can see the superdomain -> domain mapping has been added to this resource. Is it this resource you mean?
Indeed, with that rank parser it should work, but we don't seem to be using it in the deployed version. Will need to update.
There's also this resource: https://github.com/gbif/gbif-api/blob/dev/src/main/java/org/gbif/api/vocabulary/Rank.java
Yes, that's the GBIF rank enum. The name parser maps its own, richer enum to the GBIF one.
Ideally we'd just have a single enum in name parser...
We could deploy a new version with that fix if it's desired. Shall we do a release? It will only apply to new interpretations, but we could reinterpret the few problem cases we know about.
Yes, that would be great, thank you. I can make sure we re-interpret the datasets we know of once the new version is in use.
@CecSve I have released a new version and deployed it to UAT where it behaves as expected:
https://api.gbif-uat.org/v1/species/match?verbose=true&name=Vertebrata&rank=superdomain
I will deploy this to prod early next week
Live on prod now: https://api.gbif.org/v1/species/match?verbose=true&name=Vertebrata&rank=superdomain
@CecSve you can go ahead with reinterpretation of the relevant cases
This returns "No match because of too little confidence". Is that expected please, @mdoering?
Yes, that's the expected result. And without a match it would be linked to incertae sedis.
Thank you - I am looking into which datasets are affected and could be fixed by the new version. It is unfortunately a bit more complex than what I anticipated and I will make a new issue to track the process. For example, missing taxonRanks (example), using BiotaNA instead of Biota in scientificName (issue created for the dataset with most occurrences gbif/portal-feedback#4534)
FYI, assigning taxonRank = kingdom also seems to work as a fix. Ideally taxonRank = superdomain would give the same end result, but it currently does not, even though the records have been reinterpreted (example).
@timrobertson100 Species match cache maybe? We did not clear anything, just deployed a new nub-ws service.
Thanks @mdoering - I follow now.
I've flushed the caches and reprocessed this record, which should be calling this lookup, which returns what you expect. However, in the KV cache we have this after flushing:
0SuperdomainBiota column=v:j, timestamp=1674034819069, value={"synonym":true,"usage":{"key":7326344,"name":"Biota (D.Don) Endl.","rank":"GENUS"},"acceptedUsage":{"key":2684854,"name":"Platycladus Spach","rank":"GENUS"},"classification":[{"key":6,"name":"Plantae","rank":"KINGDOM"},{"key":7707728,"name":"Tracheophyta","rank":"PHYLUM"},{"key":194,"name":"Pinopsida","rank":"CLASS"},{"key":640,"name":"Pinales","rank":"ORDER"},{"key":8144,"name":"Cupressaceae","rank":"FAMILY"},{"key":2684854,"name":"Platycladus","rank":"GENUS"}],"diagnostics":{"matchType":"EXACT","status":"SYNONYM","lineage":[],"alternatives":[]}}
I suspect we're backing varnish by the wrong lookup
Even though the HBase cache key has the rank (Superdomain), the request that was issued by the pipelines code only had the name property. This is what was issued: http://api.gbif.org/v1/species/match2?name=Biota&verbose=false&strict=false
Moving this to pipelines
The problem lies here and here where the client is parsing the rank to null before executing the HTTP call.
@mdoering - should it be using an updated RankParser or should it just be passing through the verbatim value, please?
The verbatim value would not harm at all. I think that's the safest option, but updating the RankParser would also address the problem at hand. Unless pipelines does something more clever with rank cleaning than just using the parser...
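As an illustration of the "pass the verbatim value through" option, here is a minimal sketch of building the lookup URL so that an unparsed rank is still sent rather than dropped. The function name `build_match_url` is hypothetical, the real pipelines client is not Python, and the `rank` parameter on `match2` is assumed to behave as it does on the `match` URLs quoted earlier in this thread.

```python
from urllib.parse import urlencode

# Endpoint taken from the request quoted in this thread.
MATCH_ENDPOINT = "https://api.gbif.org/v1/species/match2"

def build_match_url(name, verbatim_rank=None):
    """Build a match URL, forwarding the verbatim rank instead of a parsed one."""
    params = {"name": name, "verbose": "false", "strict": "false"}
    # Forward whatever rank string the record carries, even if a local
    # parser would not recognise it; only skip empty or missing values.
    if verbatim_rank and verbatim_rank.strip():
        params["rank"] = verbatim_rank.strip()
    return f"{MATCH_ENDPOINT}?{urlencode(params)}"

print(build_match_url("Biota"))
print(build_match_url("Biota", "Superdomain"))
```

The point of the sketch is only that the request then carries the same rank that appears in the cache key, so the cached response corresponds to the lookup that was actually made.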
It's also used for normalizing the taxon rank, so I opted to add it to the vocabulary. Passing verbatim values may bring in some other regression (e.g. a value the parser is able to handle but the service is not), so I err on the side of caution and take the conservative approach. With SUPERDOMAIN in the rank, I think we should just need to bump the gbif-api versions in key-value-store and pipelines.
Discussion on the PR suggests this requires more thought
The RankParser is used in the matching service, so unless something else is done to the verbatim data, the service does it already.
@timrobertson100 will you let me know once this should work?
Is the test done? It seems like it is in production? I guess not?
@CecSve Yes, I will release the fixed version today. I plan to re-interpret all data today/tomorrow
Great - thank you!
Deployed to prod