gbif/pipelines

Diagnostic: Missing names on record


This record shows as incertae sedis but the lookup should find the species.

I'll investigate, cc @mdoering

The lookup cache contains:

hbase(main):001:0> scan 'name_usage_kv', { FILTER => "RowFilter(=, 'substring:Lissotarsus reticulata')" }
ROW                                                       COLUMN+CELL                                                                                                                                                             
 6|||||||||Lissotarsus reticulata Chaudoir, 1842|||||     column=v:j, timestamp=1696041217943, value={"synonym":true,"usage":{"key":9355155,"name":"Lissotarsus reticulatus Chaudoir, 1842","rank":"SPECIES"},"acceptedUsage":{"ke
                                                          y":7811407,"name":"Platyderus reticulatus (Chaudoir, 1842)","rank":"SPECIES"},"classification":[{"key":1,"name":"Animalia","rank":"KINGDOM"},{"key":54,"name":"Arthropod
                                                          a","rank":"PHYLUM"},{"key":216,"name":"Insecta","rank":"CLASS"},{"key":1470,"name":"Coleoptera","rank":"ORDER"},{"key":3792,"name":"Carabidae","rank":"FAMILY"},{"key":3
                                                          260555,"name":"Platyderus","rank":"GENUS"},{"key":7811407,"name":"Platyderus reticulatus","rank":"SPECIES"}],"diagnostics":{"matchType":"FUZZY","confidence":99,"status"
                                                          :"SYNONYM","lineage":[],"alternatives":[]},"iucnRedListCategory":{"category":"NOT_EVALUATED","code":"NE","scientificName":"Lissotarsus reticulatus Chaudoir, 1842","taxo
                                                          nomicStatus":"SYNONYM","acceptedName":"Platyderus reticulatus (Chaudoir, 1842)"},"issues":[]}                                                                           
1 row(s) in 45.7350 seconds

Formatted for readability:

The cell timestamp (1696041217943) corresponds to Saturday, September 30, 2023 2:33:37.943 AM UTC.

{
  "synonym":true,
  "usage":{
    "key":9355155,
    "name":"Lissotarsus reticulatus Chaudoir, 1842",
    "rank":"SPECIES"
  },
  "acceptedUsage":{
    "key":7811407,
    "name":"Platyderus reticulatus (Chaudoir, 1842)",
    "rank":"SPECIES"
  },
  "classification":[
    {
      "key":1,
      "name":"Animalia",
      "rank":"KINGDOM"
    },
    {
      "key":54,
      "name":"Arthropoda",
      "rank":"PHYLUM"
    },
    {
      "key":216,
      "name":"Insecta",
      "rank":"CLASS"
    },
    {
      "key":1470,
      "name":"Coleoptera",
      "rank":"ORDER"
    },
    {
      "key":3792,
      "name":"Carabidae",
      "rank":"FAMILY"
    },
    {
      "key":3260555,
      "name":"Platyderus",
      "rank":"GENUS"
    },
    {
      "key":7811407,
      "name":"Platyderus reticulatus",
      "rank":"SPECIES"
    }
  ],
  "diagnostics":{
    "matchType":"FUZZY",
    "confidence":99,
    "status":"SYNONYM",
    "lineage":[
      
    ],
    "alternatives":[
      
    ]
  },
  "iucnRedListCategory":{
    "category":"NOT_EVALUATED",
    "code":"NE",
    "scientificName":"Lissotarsus reticulatus Chaudoir, 1842",
    "taxonomicStatus":"SYNONYM",
    "acceptedName":"Platyderus reticulatus (Chaudoir, 1842)"
  },
  "issues":[
    
  ]
}
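For anyone who wants to check the cache entry's age, the cell timestamp in the scan output above is in epoch milliseconds. A minimal sketch of the conversion using only `java.time` follows; the class and method names are made up for illustration and are not part of the pipelines code.

```java
import java.time.Instant;

// Hypothetical helper, not part of gbif/pipelines: converts an HBase cell
// timestamp (milliseconds since the Unix epoch) into a UTC instant.
class CellTimestamp {

  static Instant toInstant(long epochMillis) {
    return Instant.ofEpochMilli(epochMillis);
  }

  public static void main(String[] args) {
    // Timestamp taken from the scan output for the cached row.
    System.out.println(toInstant(1696041217943L)); // 2023-09-30T02:33:37.943Z
  }
}
```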

The lookup appears to have worked and was cached as expected, but the result wasn't included in the interpreted record. Reprocessing yields the same result.

With @muttcg's help, we have diagnosed this, and it's behaving as intended @mdoering

It's dropping into this branch:

      if (usageMatch == null || isEmpty(usageMatch) || checkFuzzy(usageMatch, identification)) {
        // "NO_MATCHING_RESULTS". This
        // happens when we get an empty response from the WS
        addIssue(tr, TAXON_MATCH_NONE);
        tr.setUsage(INCERTAE_SEDIS);
        tr.setClassification(Collections.singletonList(INCERTAE_SEDIS));
      }

The web service is returning a fuzzy match (reticulata vs. reticulatus). As we described in this issue, if there are no higher taxa on the record (there aren't in this case), we don't assume a fuzzy match is correct, because doing so produced too many mistakes. This record needs a higher taxon added in order to match.
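To make the rule concrete, here is a rough sketch of the kind of guard described above. It is not the actual `checkFuzzy` implementation: the class, enum and method names are hypothetical, and which ranks count as "higher taxa" is an assumption on my part.

```java
// Illustrative only: a stand-in for the fuzzy-match guard, not the real
// checkFuzzy from gbif/pipelines. All names here are hypothetical.
class FuzzyMatchGuard {

  enum MatchType { EXACT, FUZZY, HIGHERRANK, NONE }

  // True if at least one supplied higher-taxon value is non-blank.
  static boolean hasHigherTaxon(String... higherTaxa) {
    if (higherTaxa == null) {
      return false;
    }
    for (String taxon : higherTaxa) {
      if (taxon != null && !taxon.trim().isEmpty()) {
        return true;
      }
    }
    return false;
  }

  // A fuzzy match is rejected when the record supplies no higher taxon
  // (assumed here to be kingdom, phylum, class, order, family) to corroborate it.
  static boolean rejectFuzzy(MatchType matchType, String... higherTaxa) {
    return matchType == MatchType.FUZZY && !hasHigherTaxon(higherTaxa);
  }

  public static void main(String[] args) {
    // The record in this issue: fuzzy match, no higher taxa supplied.
    System.out.println(rejectFuzzy(MatchType.FUZZY)); // true
    // Same match once a default kingdom is applied to the dataset.
    System.out.println(rejectFuzzy(MatchType.FUZZY, "Animalia")); // false
  }
}
```

Under these assumptions, a rejected match is what ends up flagged with `TAXON_MATCH_NONE` and set to incertae sedis, while supplying any higher taxon lets the fuzzy match through.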

I don't think we want to change this behavior - agree?

As it happens, this is a narrowly scoped dataset (titled "Coleoptera...") so we could add a default of kingdom = Animalia in the registry which would at least improve this dataset.

Ah, that makes sense. From a user perspective it would be great to be able to understand why this happened, but yes, we should keep the behavior. And we should definitely add a default classification to the dataset; I see this is done already.

We could add more, but I'll start conservatively

Animalia was enough for this example, but there were also records being interpreted as Fungi, so I added Animalia / Arthropoda / Insecta, and that has put the dataset into better shape.