gbif/pipelines

How to deal with intermediate ranks not used in backbone?

Closed this issue · 28 comments

On the helpdesk (January 2nd 2023), a correspondence was started on whether GBIF should be stricter with ranks and apply a much higher penalty for matches across the wrong rank "groups", i.e. suprafamily, family, genus, species and infraspecific names. That discussion is continued in this issue.

There is a similar problem with other higher taxa whose names are often also rare genera, e.g. Vertebrata. It's a simple change, but it might have more impact than desired, e.g. if a very wrong rank was given. The coding change would be simple if we think it's worth trying.

@mdoering can you expand on how such a penalty would work?

All matching is based on an overall matching score (confidence) that is the summary of various individual matching scores. If that goes below a threshold, I believe it's 80 currently, the match is discarded as a NO_MATCH.

You can see the individual scores in the remarks when verbose is on:

https://api.gbif.org/v1/species/match?verbose=true&name=Vertebrata

Similarity: name=100; authorship=0; classification=-2; rank=0; status=1; singleMatch=5

https://api.gbif.org/v1/species/match?verbose=true&name=Vertebrata&rank=phylum

Similarity: name=100; authorship=0; classification=-2; rank=-35; status=1; singleMatch=5
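As a rough illustration of how the two remarks above relate to the threshold, the sketch below parses a `Similarity:` line and adds up its parts. It assumes the overall confidence is the plain sum of the individual scores and that the threshold is 80 (mentioned above only as a belief); the real service may combine or cap the scores differently. The function names `parse_similarity` and `is_match` are made up for this example.

```python
def parse_similarity(remark):
    """Parse 'Similarity: name=100; authorship=0; ...' into a dict of int scores."""
    _, _, body = remark.partition(":")
    scores = {}
    for part in body.split(";"):
        key, _, value = part.strip().partition("=")
        scores[key] = int(value)
    return scores


def is_match(remark, threshold=80):
    """Assumption: confidence is the plain sum of the individual scores."""
    return sum(parse_similarity(remark).values()) >= threshold


no_rank = "Similarity: name=100; authorship=0; classification=-2; rank=0; status=1; singleMatch=5"
phylum = "Similarity: name=100; authorship=0; classification=-2; rank=-35; status=1; singleMatch=5"

print(sum(parse_similarity(no_rank).values()), is_match(no_rank))  # 104 True
print(sum(parse_similarity(phylum).values()), is_match(phylum))    # 69 False
```

Under these assumptions the rank penalty alone is enough to push the second request below the threshold.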

Individual scores can be negative, e.g. the above plant genus Vertebrata is not considered a match anymore when the rank phylum is given. Rank similarity scores are set to 6 if equal; otherwise the distance in the rank enumeration is calculated and used as a negative score (penalty). As phylum and genus are quite far apart, that leads to -35 now, which already makes a match hardly possible. So in those cases we already have a big penalty which prevents matches. The problem we had was the unknown rank superdomain, which results in a zero rank score: https://api.gbif.org/v1/species/match?verbose=true&name=Vertebrata&rank=superdomain
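The rank scoring described above could be sketched roughly as follows. This is only an illustration: `RANKS` is a hypothetical, heavily abbreviated ordering (the real GBIF rank enum has many more entries), and `step` is an assumed scaling factor, so the sketch does not reproduce the actual -35 for phylum vs. genus. It does show why an unrecognised rank such as superdomain ends up with a score of 0 rather than a penalty.

```python
# Hypothetical, abbreviated rank ordering; NOT the real GBIF Rank enum.
RANKS = ["domain", "kingdom", "phylum", "class", "order",
         "family", "genus", "species", "subspecies"]


def rank_similarity(given, candidate, step=1):
    """Return 6 for equal ranks, 0 if a rank is unknown, else a negative distance."""
    if given not in RANKS or candidate not in RANKS:
        return 0  # unknown rank: neither bonus nor penalty
    if given == candidate:
        return 6
    # Penalty grows with the distance between the two ranks in the ordering.
    return -step * abs(RANKS.index(given) - RANKS.index(candidate))


print(rank_similarity("genus", "genus"))        # 6
print(rank_similarity("phylum", "genus"))       # -4 with this toy ordering
print(rank_similarity("superdomain", "genus"))  # 0, so no penalty is applied
```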

If we added superdomain to the ranks enum this would address the problem.
We could also score all unknown (=other) ranks badly, but that would lead to more no-matches for good name matches when an awkward rank is given, e.g. in some non-English language we don't yet support. The German term Gattung is understood, but the Dutch Geslacht, and likely many others, is not. So I think it is too dangerous to apply a penalty to all unknown ranks.

The original idea with rank groups was that there are distinct groups of ranks (infraspecific/trinomial, species/binomial, genus, family and above groups), so we could increase the penalty if the match is between such groups and increase the score slightly if it's still within the same group. Species, genus and family group names are explicitly dealt with in the zoological code, and you can just use a genus name as a subgenus without publishing it as a new name.

Simply adding superdomain to our rank enum would probably address our problem at hand best though.

Thank you for explaining this in detail. I see your point that it might muddy the picture if unknown ranks received a negative score.

I suspect it is not the enum resource you are referring to, but I can see the superdomain -> domain mapping has been added to this resource. Is it this resource you mean?

Indeed, with that rank parser it should work, but we don't seem to be using it in the deployed version. Will need to update.

Yes, that's the GBIF rank enum. The name parser maps its own, richer enum to the GBIF one.
Ideally we'd just have a single enum in name parser...

Just to be sure - will a fix be applied now @mdoering and I can close the issue?

We could deploy a new version with that fix if it's desired. Shall we do a release? It will only apply to new interpretations, but we could reinterpret the few problem cases we know about.

Yes, that would be great, thank you. I can make sure we re-interpret the datasets we know of once the new version is in use.

@CecSve I have released a new version and deployed it to UAT where it behaves as expected:
https://api.gbif-uat.org/v1/species/match?verbose=true&name=Vertebrata&rank=superdomain

I will deploy this to prod early next week

Live on prod now: https://api.gbif.org/v1/species/match?verbose=true&name=Vertebrata&rank=superdomain
@CecSve you can go ahead with reinterpretation of the relevant cases

This returns No match because of too little confidence. Is that expected please @mdoering?

Yes, that's the expected result. And without a match it would be linked to incertae sedis.

Thank you - I am looking into which datasets are affected and could be fixed by the new version. It is unfortunately a bit more complex than what I anticipated and I will make a new issue to track the process. For example, missing taxonRanks (example), using BiotaNA instead of Biota in scientificName (issue created for the dataset with most occurrences gbif/portal-feedback#4534)

FYI, the fix of assigning taxonRank = kingdom also seems to work. Ideally we would like the same end result for taxonRank = superdomain, which currently does not work even though the records have been reinterpreted (example).

@timrobertson100 Species match cache maybe? We did not clear anything, just deployed a new nub-ws service

Thanks @mdoering - I follow now.

I've flushed the caches, reprocessed this record which should be calling this lookup which returns as you expect. However, in the KV cache we have this after flushing:

 0SuperdomainBiota                             column=v:j, timestamp=1674034819069, value={"synonym":true,"usage":{"key":7326344,"name":"Biota (D.Don) Endl.","rank":"GENUS"},"accep
                                               tedUsage":{"key":2684854,"name":"Platycladus Spach","rank":"GENUS"},"classification":[{"key":6,"name":"Plantae","rank":"KINGDOM"},{"k
                                               ey":7707728,"name":"Tracheophyta","rank":"PHYLUM"},{"key":194,"name":"Pinopsida","rank":"CLASS"},{"key":640,"name":"Pinales","rank":"
                                               ORDER"},{"key":8144,"name":"Cupressaceae","rank":"FAMILY"},{"key":2684854,"name":"Platycladus","rank":"GENUS"}],"diagnostics":{"match
                                               Type":"EXACT","status":"SYNONYM","lineage":[],"alternatives":[]}}

I suspect we're backing varnish by the wrong lookup

Even though the HBase cache key has the rank (Superdomain) the request that was issued by the pipelines code only had the name property. This is what was issued http://api.gbif.org/v1/species/match2?name=Biota&verbose=false&strict=false

Moving this to pipelines

The problem lies here and here where the client is parsing the rank to null before executing the HTTP call.
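To make the mismatch concrete, here is a minimal sketch of the behaviour described above, not the actual pipelines client code. The parser vocabulary, the cache key layout and all function names are hypothetical; the only details taken from the thread are that the cache key still carries the verbatim rank while the outgoing request ends up with the name only, because the rank is parsed to null first.

```python
from urllib.parse import urlencode

# Hypothetical parser vocabulary that does not yet know "superdomain".
KNOWN_RANKS = {"kingdom", "phylum", "class", "order", "family", "genus", "species"}


def parse_rank(verbatim):
    """Return a normalised rank, or None when the value is not recognised."""
    if verbatim and verbatim.strip().lower() in KNOWN_RANKS:
        return verbatim.strip().upper()
    return None


def cache_key(name, verbatim_rank):
    """Hypothetical key layout: the verbatim rank is part of the key."""
    return f"{verbatim_rank or ''}|{name}"


def match_query(name, verbatim_rank):
    """Build the query string; an unparsable rank is silently dropped."""
    params = {"name": name, "verbose": "false", "strict": "false"}
    rank = parse_rank(verbatim_rank)
    if rank is not None:
        params["rank"] = rank
    return urlencode(params)


print(cache_key("Biota", "Superdomain"))    # Superdomain|Biota
print(match_query("Biota", "Superdomain"))  # name=Biota&verbose=false&strict=false
```

In this sketch the cached entry is keyed by a rank the service never saw, so the stored result is whatever the name-only lookup returned.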

@mdoering - should it be using an updated RankParser or should it just be passing through the verbatim value, please?

The verbatim value would not harm at all. I think that's the safest option, but updating the RankParser would also address the problem at hand. Unless pipelines does something more clever with rank cleaning than just using the parser...

It's also used for normalizing the taxon rank, so I opted to add it to the vocabulary. Passing verbatim values may bring in some other regression (e.g. a value the parser is able to handle but the service not) so I err on the side of caution and take the conservative approach. With the SUPERDOMAIN in the rank, I think we should just need to bump the gbif-api versions in key-value-store and pipelines.

Discussion on the PR suggests this requires more thought

The RankParser is used in the matching service, so unless something else is done to the verbatim data the service does it already

Thanks. The parser is used to normalise the key for the cache, and for interpreting the field.

@mdoering - this should mirror the change you applied behind the service. I see we just made it a synonym of domain.
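A synonym-based fix of the kind mentioned here might look like the sketch below. The dictionary is a hypothetical stand-in for the parser's vocabulary, not its real contents; it only encodes what the thread states: superdomain is treated as a synonym of domain, the German Gattung is understood, and the Dutch Geslacht is not.

```python
# Hypothetical synonym table standing in for the rank parser vocabulary.
RANK_SYNONYMS = {
    "domain": "DOMAIN",
    "superdomain": "DOMAIN",  # added as a synonym of domain
    "kingdom": "KINGDOM",
    "genus": "GENUS",
    "gattung": "GENUS",       # German term, understood per the thread
}


def parse_rank(verbatim):
    """Map a verbatim rank to a normalised rank, or None if unknown."""
    if verbatim is None:
        return None
    return RANK_SYNONYMS.get(verbatim.strip().lower())


print(parse_rank("Superdomain"))  # DOMAIN
print(parse_rank("Gattung"))      # GENUS
print(parse_rank("Geslacht"))     # None (not in this toy vocabulary)
```

Once superdomain resolves to a known rank, the rank is no longer dropped and the usual rank-distance penalty can apply to a genus candidate.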

@timrobertson100 will you let me know once this should work?

@CecSve Just for an update.

We're testing the fix in dev before a release to production

Is the test done? It seems like it is in production? I guess not?

@CecSve Yes, I will release the fixed version today. I plan to re-interpret all data today/tomorrow

Great - thank you!

Deployed to prod