gbif/pipelines

Stricter fuzzy matching

Closed this issue · 5 comments

GBIF fuzzy-matches names in order to accommodate things like spelling mistakes.
However, this is too lenient and makes some terrible matches in cases where there are no higher taxa in the record.

As an example, the lookup for Minorisa, which is a genus of Bacteria and not known to the GBIF Backbone, results in Minorissa Walker, 1870. This is discussed as a major source of problems in metagenomic datasets in this thread.

This has been discussed and the impact on data analyzed. We would like to change the behavior so that if the lookup returns a match type of FUZZY, the match is only used if at least one of kingdom, phylum, class, order or family is also present on the occurrence record. If not, the record should be treated the same as if the service had responded with NONE.
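To make the proposed rule concrete, here is a minimal sketch in Python. It is not the pipelines implementation: the function names and the dictionary-based record are hypothetical, and only the FUZZY and NONE match types and the five higher ranks come from the description above.

```python
from typing import Mapping, Optional

# Higher ranks named in the proposal; at least one must be present
# on the occurrence record for a FUZZY match to be kept.
HIGHER_RANKS = ("kingdom", "phylum", "class", "order", "family")


def has_higher_taxonomy(record: Mapping[str, Optional[str]]) -> bool:
    """Return True if any higher-rank field is present and non-blank."""
    return any((record.get(rank) or "").strip() for rank in HIGHER_RANKS)


def effective_match_type(match_type: str, record: Mapping[str, Optional[str]]) -> str:
    """Downgrade a FUZZY match to NONE when the record has no higher taxonomy.

    `match_type` is the value returned by the name lookup; `record` is a
    hypothetical dict of the verbatim classification fields on the occurrence.
    """
    if match_type == "FUZZY" and not has_higher_taxonomy(record):
        return "NONE"
    return match_type
```

With this sketch, a record carrying only `{"genus": "Minorisa"}` and a FUZZY lookup result would be treated as NONE, while the same record with a kingdom supplied would keep its FUZZY match.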

Table of most affected datasets:

fuzzy_occ_count = number of occurrences with a fuzzy taxon match and no higher taxonomy (kingdom to family = null); ratio = fuzzy_occ_count / occ_count_total

datasetkey fuzzy_occ_count occ_count_total ratio
7aa26702-f762-11e1-a439-00145eb45e9a 1 1 1
bec95074-2220-4c3c-92ed-5ce729bd4307 320 320 1
8915b910-f762-11e1-a439-00145eb45e9a 23 45 0.511111
633ea97c-f762-11e1-a439-00145eb45e9a 1 2 0.5
9c8c249c-9426-4a03-a5bc-9150966b3c2c 9 21 0.428571
865368a8-f762-11e1-a439-00145eb45e9a 275 677 0.406204
d8b9c36d-c983-4b3e-a278-bbb013b5d117 45 120 0.375
663c1199-b99b-41cd-b55e-3f02feb788ea 4 11 0.363636
8919516a-f762-11e1-a439-00145eb45e9a 17 47 0.361702
79266357-23f4-439c-96f0-a501a5ec2f67 1 3 0.333333
6305b4a0-f762-11e1-a439-00145eb45e9a 1 3 0.333333
85d9bfe4-f762-11e1-a439-00145eb45e9a 4 13 0.307692
2a570f00-86e1-46ca-bfb6-7dd077e66004 28 94 0.297872
2ea221ef-6fa8-4bac-b23e-be2a2e7feb7d 218 767 0.284224
c684e21e-eed5-4c0b-9d5e-028c08d9d893 51 199 0.256281
3b9a6c7a-98cb-4e1e-bc33-7fecb2599a87 133 544 0.244485
9c76ffd1-36b0-47ab-baf9-d4ec14202872 913 3786 0.241152
e04232b1-0244-4083-852a-d652fac1e41a 517 2421 0.213548
6313ed72-f762-11e1-a439-00145eb45e9a 1 5 0.2
63591906-f762-11e1-a439-00145eb45e9a 1 5 0.2
62d5fd96-f762-11e1-a439-00145eb45e9a 1 5 0.2
24814443-5120-4da1-8b49-f345383ed4b4 128 652 0.196319
569ccbb8-79b7-4fbc-902d-5d295bf10d36 39 205 0.190244
ffa7390a-800c-4b37-8081-e5352e0310c7 6 34 0.176471
85288d1e-f762-11e1-a439-00145eb45e9a 287 1659 0.172996
7aecd33c-f762-11e1-a439-00145eb45e9a 37 215 0.172093
717776ae-f762-11e1-a439-00145eb45e9a 1 6 0.166667
5e756c82-e503-438c-8caf-63a8f8fa1725 40 253 0.158103
aee51547-6e86-4dc9-b149-0df501253d4c 115 764 0.150524
7ffd804c-d389-41ad-84ba-4ad5d51d8f04 33 222 0.148649
dcf889ca-c279-40e1-b69a-4d9d98f2d318 26 177 0.146893
7b28e160-f762-11e1-a439-00145eb45e9a 166 1139 0.145742
974c7838-232a-4fb8-bc02-ef408fcb6944 13 90 0.144444
7bd55aae-61fb-42da-986f-ceedff623749 430 3000 0.143333
62f5b442-f762-11e1-a439-00145eb45e9a 1 7 0.142857
00f03652-0e86-4038-a8a3-fc41ee4a1bc2 1 7 0.142857
85d1878e-f762-11e1-a439-00145eb45e9a 6 42 0.142857
4309a5ab-7050-4387-9abb-7dd497c96904 28 197 0.142132
ca0b53d1-7c0d-42c8-86a9-e0548bc4a39a 11 79 0.139241
c8cc0aeb-b615-499b-9508-24a3b3f9eba4 272 1968 0.138211
8433c45a-f762-11e1-a439-00145eb45e9a 69 517 0.133462
9f9286fe-ca42-4195-874b-13dc6f5d4861 871 6668 0.130624
9ee980a9-b373-4b8a-a5be-2926368e500e 326 2500 0.1304
76805c1e-0691-4d13-ab59-2211efa1a78a 13 100 0.13
7b269446-f762-11e1-a439-00145eb45e9a 186 1445 0.12872
7b0bc332-f762-11e1-a439-00145eb45e9a 127 1001 0.126873
5d17a4fd-7249-46c0-a56b-1924d477b29e 49 387 0.126615
fe7b0de4-c800-4564-afdc-d98ab465a3e3 125 988 0.126518
6e51f70c-4da6-468b-8e76-16ea0ce328d1 1 8 0.125
7145ce2e-f762-11e1-a439-00145eb45e9a 1 8 0.125
e92b921d-fc07-447a-a45c-ce6708a6f7ec 1 8 0.125
62882bb6-f762-11e1-a439-00145eb45e9a 2 16 0.125
300184f6-9998-4689-a153-7c83984def36 62 512 0.121094
cb811b65-8593-43d4-9ef7-699c4c40b50f 37 308 0.12013
266628c1-56a0-46cb-b136-3b77dbc32268 48 407 0.117936
51aff167-af50-46ef-b64c-d8f9087d695c 8 68 0.117647
85d88d4a-f762-11e1-a439-00145eb45e9a 2 17 0.117647
89268632-f762-11e1-a439-00145eb45e9a 760 6782 0.112061
843672e0-f762-11e1-a439-00145eb45e9a 50 447 0.111857
7b084162-f762-11e1-a439-00145eb45e9a 65 589 0.110357
014f4ebe-c9f3-40dc-beb1-45f4e365dbd2 606 5549 0.109209
857e1bda-f762-11e1-a439-00145eb45e9a 313 2867 0.109173
6da08497-6cf5-4f0b-aaad-8eb916ffea9d 119 1101 0.108084
52ff7a75-4354-464c-9765-eca68cf90859 116 1077 0.107707
85cdfbfa-f762-11e1-a439-00145eb45e9a 272 2532 0.107425
890e7e05-bb6d-4520-a31a-dcef3c3fb0b0 6 56 0.107143
7a6cbe9a-f762-11e1-a439-00145eb45e9a 2 19 0.105263
9f113fae-1666-4bdc-99af-4d0c94beb936 374 3554 0.105234
7b0f4480-f762-11e1-a439-00145eb45e9a 8 78 0.102564
dde7e62b-68eb-4a3c-b3d4-f7e4c1cf094a 708 6954 0.101812
762e6b74-e481-491e-9809-b03356438e93 1 10 0.1
32288bee-fc04-4d05-b1c5-68bf46f04e99 1 10 0.1
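
For anyone wanting to reproduce the ratio column from their own counts, a small sketch follows. The helper names are hypothetical; the three sample rows are copied from the table above, and rounding to six decimals is an assumption made to match how the values are printed.

```python
from typing import List, Tuple


def fuzzy_ratio(fuzzy_occ_count: int, occ_count_total: int) -> float:
    """Share of a dataset's occurrences that are fuzzy matches without higher taxonomy."""
    if occ_count_total <= 0:
        raise ValueError("occ_count_total must be positive")
    return round(fuzzy_occ_count / occ_count_total, 6)


def rank_datasets(rows: List[Tuple[str, int, int]]) -> List[Tuple[str, int, int, float]]:
    """Attach the ratio to each (datasetkey, fuzzy_occ_count, occ_count_total) row
    and sort by ratio, highest first."""
    with_ratio = [(key, fuzzy, total, fuzzy_ratio(fuzzy, total)) for key, fuzzy, total in rows]
    return sorted(with_ratio, key=lambda row: row[3], reverse=True)


# Three rows taken from the table above.
sample = [
    ("9ee980a9-b373-4b8a-a5be-2926368e500e", 326, 2500),
    ("8915b910-f762-11e1-a439-00145eb45e9a", 23, 45),
    ("865368a8-f762-11e1-a439-00145eb45e9a", 275, 677),
]
ranked = rank_datasets(sample)
```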

One could even consider not accepting any fuzzy match for a monomial, i.e. rank subgenus and higher. There are many very similarly named genera, while fuzzy matches for binomials should be far more accurate. With most of the names being within Animalia, I doubt the proposed check removes most problems. There are lots of homonyms and closely spelled genera between insects and molluscs, for example.
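
A rough sketch of that stricter variant, in Python. The function names are hypothetical, and the monomial test is a naive assumption: it expects a canonical name with authorship already stripped (e.g. "Minorissa", not "Minorissa Walker, 1870") and simply counts whitespace-separated tokens. A real implementation would more likely rely on the parsed name or its rank.

```python
def is_monomial(canonical_name: str) -> bool:
    """Naive check: a canonical name consisting of exactly one token."""
    return len(canonical_name.split()) == 1


def accept_match(match_type: str, canonical_name: str) -> bool:
    """Reject NONE outright, and reject FUZZY matches on monomials.

    Fuzzy matches on binomials (and longer names) are still accepted,
    since the extra epithet makes a wrong match much less likely.
    """
    if match_type == "NONE":
        return False
    if match_type == "FUZZY" and is_monomial(canonical_name):
        return False
    return True
```

Under this rule the Minorisa lookup would be rejected regardless of whether a kingdom is present, because the matched name is a single-word genus.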

The fix is deployed in PROD, and the datasets from the list were reinterpreted.

Implemented the logic from the initial issue; the monomial rule wasn't included.

I have added default taxonomic values to many of the datasets in the table above, which should prevent too much harm being caused by this change.

For example, this dataset from the list now has default values for insects:
https://www.gbif.org/dataset/865368a8-f762-11e1-a439-00145eb45e9a