Stricter fuzzy matching
Closed this issue · 5 comments
GBIF fuzzy-matches names in order to accommodate things like spelling mistakes.
However, this is too lenient and makes some terrible matches in cases where there are no higher taxa in the record.
As an example, the lookup for `Minorisa`, which is a genus of Bacteria and not known to the GBIF Backbone, results in `Minorissa Walker, 1870`. This is discussed as a source of major issues in the metagenomic datasets in this thread.
This has been discussed and the impact on data analyzed. We would like to change the behavior such that if the result of the lookup is a match type of `FUZZY`, it is only used if a `kingdom`, `phylum`, `class`, `order` or `family` is also present on the occurrence record. If not, the record should be treated the same as if the service had responded with a `NONE` response.
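The rule proposed above could be sketched roughly as follows. This is only an illustration, not the actual pipeline code: the function names and the dictionary-based record are assumptions, and only the `FUZZY` and `NONE` match types and the five rank names come from this issue.

```python
from typing import Mapping, Optional

# Higher ranks named in the issue; any one of them makes a fuzzy match usable.
HIGHER_RANKS = ("kingdom", "phylum", "class", "order", "family")


def has_higher_taxonomy(record: Mapping[str, Optional[str]]) -> bool:
    """Return True if at least one higher-rank field is present and non-blank."""
    return any((record.get(rank) or "").strip() for rank in HIGHER_RANKS)


def effective_match_type(match_type: str, record: Mapping[str, Optional[str]]) -> str:
    """Downgrade a FUZZY match without higher taxonomy to NONE.

    Every other match type is returned unchanged.
    """
    if match_type == "FUZZY" and not has_higher_taxonomy(record):
        return "NONE"
    return match_type
```

With this sketch, a record holding only `scientificName = Minorisa` would be treated as unmatched, while the same record with a kingdom supplied would keep its fuzzy match.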
Table of most affected datasets:
fuzzy_occ_count = number of occurrences with a fuzzy taxon match and no higher taxonomy (kingdom to family all null); ratio = fuzzy_occ_count / occ_count_total
datasetkey | fuzzy_occ_count | occ_count_total | ratio |
---|---|---|---|
7aa26702-f762-11e1-a439-00145eb45e9a | 1 | 1 | 1 |
bec95074-2220-4c3c-92ed-5ce729bd4307 | 320 | 320 | 1 |
8915b910-f762-11e1-a439-00145eb45e9a | 23 | 45 | 0.511111 |
633ea97c-f762-11e1-a439-00145eb45e9a | 1 | 2 | 0.5 |
9c8c249c-9426-4a03-a5bc-9150966b3c2c | 9 | 21 | 0.428571 |
865368a8-f762-11e1-a439-00145eb45e9a | 275 | 677 | 0.406204 |
d8b9c36d-c983-4b3e-a278-bbb013b5d117 | 45 | 120 | 0.375 |
663c1199-b99b-41cd-b55e-3f02feb788ea | 4 | 11 | 0.363636 |
8919516a-f762-11e1-a439-00145eb45e9a | 17 | 47 | 0.361702 |
79266357-23f4-439c-96f0-a501a5ec2f67 | 1 | 3 | 0.333333 |
6305b4a0-f762-11e1-a439-00145eb45e9a | 1 | 3 | 0.333333 |
85d9bfe4-f762-11e1-a439-00145eb45e9a | 4 | 13 | 0.307692 |
2a570f00-86e1-46ca-bfb6-7dd077e66004 | 28 | 94 | 0.297872 |
2ea221ef-6fa8-4bac-b23e-be2a2e7feb7d | 218 | 767 | 0.284224 |
c684e21e-eed5-4c0b-9d5e-028c08d9d893 | 51 | 199 | 0.256281 |
3b9a6c7a-98cb-4e1e-bc33-7fecb2599a87 | 133 | 544 | 0.244485 |
9c76ffd1-36b0-47ab-baf9-d4ec14202872 | 913 | 3786 | 0.241152 |
e04232b1-0244-4083-852a-d652fac1e41a | 517 | 2421 | 0.213548 |
6313ed72-f762-11e1-a439-00145eb45e9a | 1 | 5 | 0.2 |
63591906-f762-11e1-a439-00145eb45e9a | 1 | 5 | 0.2 |
62d5fd96-f762-11e1-a439-00145eb45e9a | 1 | 5 | 0.2 |
24814443-5120-4da1-8b49-f345383ed4b4 | 128 | 652 | 0.196319 |
569ccbb8-79b7-4fbc-902d-5d295bf10d36 | 39 | 205 | 0.190244 |
ffa7390a-800c-4b37-8081-e5352e0310c7 | 6 | 34 | 0.176471 |
85288d1e-f762-11e1-a439-00145eb45e9a | 287 | 1659 | 0.172996 |
7aecd33c-f762-11e1-a439-00145eb45e9a | 37 | 215 | 0.172093 |
717776ae-f762-11e1-a439-00145eb45e9a | 1 | 6 | 0.166667 |
5e756c82-e503-438c-8caf-63a8f8fa1725 | 40 | 253 | 0.158103 |
aee51547-6e86-4dc9-b149-0df501253d4c | 115 | 764 | 0.150524 |
7ffd804c-d389-41ad-84ba-4ad5d51d8f04 | 33 | 222 | 0.148649 |
dcf889ca-c279-40e1-b69a-4d9d98f2d318 | 26 | 177 | 0.146893 |
7b28e160-f762-11e1-a439-00145eb45e9a | 166 | 1139 | 0.145742 |
974c7838-232a-4fb8-bc02-ef408fcb6944 | 13 | 90 | 0.144444 |
7bd55aae-61fb-42da-986f-ceedff623749 | 430 | 3000 | 0.143333 |
62f5b442-f762-11e1-a439-00145eb45e9a | 1 | 7 | 0.142857 |
00f03652-0e86-4038-a8a3-fc41ee4a1bc2 | 1 | 7 | 0.142857 |
85d1878e-f762-11e1-a439-00145eb45e9a | 6 | 42 | 0.142857 |
4309a5ab-7050-4387-9abb-7dd497c96904 | 28 | 197 | 0.142132 |
ca0b53d1-7c0d-42c8-86a9-e0548bc4a39a | 11 | 79 | 0.139241 |
c8cc0aeb-b615-499b-9508-24a3b3f9eba4 | 272 | 1968 | 0.138211 |
8433c45a-f762-11e1-a439-00145eb45e9a | 69 | 517 | 0.133462 |
9f9286fe-ca42-4195-874b-13dc6f5d4861 | 871 | 6668 | 0.130624 |
9ee980a9-b373-4b8a-a5be-2926368e500e | 326 | 2500 | 0.1304 |
76805c1e-0691-4d13-ab59-2211efa1a78a | 13 | 100 | 0.13 |
7b269446-f762-11e1-a439-00145eb45e9a | 186 | 1445 | 0.12872 |
7b0bc332-f762-11e1-a439-00145eb45e9a | 127 | 1001 | 0.126873 |
5d17a4fd-7249-46c0-a56b-1924d477b29e | 49 | 387 | 0.126615 |
fe7b0de4-c800-4564-afdc-d98ab465a3e3 | 125 | 988 | 0.126518 |
6e51f70c-4da6-468b-8e76-16ea0ce328d1 | 1 | 8 | 0.125 |
7145ce2e-f762-11e1-a439-00145eb45e9a | 1 | 8 | 0.125 |
e92b921d-fc07-447a-a45c-ce6708a6f7ec | 1 | 8 | 0.125 |
62882bb6-f762-11e1-a439-00145eb45e9a | 2 | 16 | 0.125 |
300184f6-9998-4689-a153-7c83984def36 | 62 | 512 | 0.121094 |
cb811b65-8593-43d4-9ef7-699c4c40b50f | 37 | 308 | 0.12013 |
266628c1-56a0-46cb-b136-3b77dbc32268 | 48 | 407 | 0.117936 |
51aff167-af50-46ef-b64c-d8f9087d695c | 8 | 68 | 0.117647 |
85d88d4a-f762-11e1-a439-00145eb45e9a | 2 | 17 | 0.117647 |
89268632-f762-11e1-a439-00145eb45e9a | 760 | 6782 | 0.112061 |
843672e0-f762-11e1-a439-00145eb45e9a | 50 | 447 | 0.111857 |
7b084162-f762-11e1-a439-00145eb45e9a | 65 | 589 | 0.110357 |
014f4ebe-c9f3-40dc-beb1-45f4e365dbd2 | 606 | 5549 | 0.109209 |
857e1bda-f762-11e1-a439-00145eb45e9a | 313 | 2867 | 0.109173 |
6da08497-6cf5-4f0b-aaad-8eb916ffea9d | 119 | 1101 | 0.108084 |
52ff7a75-4354-464c-9765-eca68cf90859 | 116 | 1077 | 0.107707 |
85cdfbfa-f762-11e1-a439-00145eb45e9a | 272 | 2532 | 0.107425 |
890e7e05-bb6d-4520-a31a-dcef3c3fb0b0 | 6 | 56 | 0.107143 |
7a6cbe9a-f762-11e1-a439-00145eb45e9a | 2 | 19 | 0.105263 |
9f113fae-1666-4bdc-99af-4d0c94beb936 | 374 | 3554 | 0.105234 |
7b0f4480-f762-11e1-a439-00145eb45e9a | 8 | 78 | 0.102564 |
dde7e62b-68eb-4a3c-b3d4-f7e4c1cf094a | 708 | 6954 | 0.101812 |
762e6b74-e481-491e-9809-b03356438e93 | 1 | 10 | 0.1 |
32288bee-fc04-4d05-b1c5-68bf46f04e99 | 1 | 10 | 0.1 |
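For anyone wanting to reproduce the ordering of the table, the ratio column is simply the fuzzy count divided by the total count. A small sketch using three rows copied from the table (the `DatasetCounts` type and `fuzzy_ratio` helper are made up for illustration):

```python
from typing import NamedTuple


class DatasetCounts(NamedTuple):
    dataset_key: str
    fuzzy_occ_count: int   # fuzzy matches with no higher taxonomy
    occ_count_total: int   # all occurrences in the dataset


def fuzzy_ratio(row: DatasetCounts) -> float:
    """Share of a dataset's occurrences affected by the stricter rule."""
    return row.fuzzy_occ_count / row.occ_count_total


rows = [
    DatasetCounts("865368a8-f762-11e1-a439-00145eb45e9a", 275, 677),
    DatasetCounts("8915b910-f762-11e1-a439-00145eb45e9a", 23, 45),
    DatasetCounts("bec95074-2220-4c3c-92ed-5ce729bd4307", 320, 320),
]

# Most affected datasets first, as in the table above.
ranked = sorted(rows, key=fuzzy_ratio, reverse=True)
for row in ranked:
    print(f"{row.dataset_key} | {fuzzy_ratio(row):.6f}")
```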
One could even consider not accepting any fuzzy match for a monomial, i.e. a name of rank subgenus or higher. There are many very similarly named genera, while fuzzy matches for binomials should be far more accurate. With most of the names being within Animalia, I doubt you would remove most problems. There are lots of homonyms and closely spelled genera between insects and molluscs, for example.
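A rough sketch of what the stricter monomial rule might look like, assuming the match response carries the rank of the matched taxon. The rank labels, the `MONOMIAL_RANKS` set and the `accept_match` function are illustrative assumptions, not taken from the GBIF code base.

```python
# Assumption: names at subgenus rank or above are treated as monomials.
# The list is illustrative and not exhaustive.
MONOMIAL_RANKS = {
    "KINGDOM", "PHYLUM", "CLASS", "ORDER", "FAMILY", "GENUS", "SUBGENUS",
}


def accept_match(match_type: str, matched_rank: str) -> bool:
    """Reject every fuzzy match on a monomial; accept other matches unchanged."""
    if match_type == "NONE":
        return False
    if match_type == "FUZZY" and matched_rank.upper() in MONOMIAL_RANKS:
        return False
    return True
```

Under this variant the higher taxonomy on the record would no longer matter for genera: a fuzzy genus match is dropped regardless, while a fuzzy match on a binomial is still accepted.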
The fix is deployed in PROD, and the datasets from the list were reinterpreted.
Implemented the logic from the initial issue; the monomial rule wasn't included.
I have added default taxonomic values to many of the datasets in the table above, which should prevent too much harm being caused by this change.
For example, this dataset from the list now has default values for insects:
https://www.gbif.org/dataset/865368a8-f762-11e1-a439-00145eb45e9a
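To illustrate why dataset-level defaults help, here is a minimal sketch of filling missing higher taxonomy before the name lookup. The `apply_defaults` helper and the default value `class = Insecta` are hypothetical; the actual defaults configured for the dataset are not shown in this thread.

```python
from typing import Dict, Mapping, Optional


def apply_defaults(
    record: Mapping[str, Optional[str]],
    defaults: Mapping[str, str],
) -> Dict[str, Optional[str]]:
    """Fill blank or missing fields from dataset-level defaults.

    Values already supplied on the record are never overwritten.
    """
    merged = dict(record)
    for field, value in defaults.items():
        if not (merged.get(field) or "").strip():
            merged[field] = value
    return merged


# Hypothetical default for an insect dataset.
insect_defaults = {"class": "Insecta"}
record = apply_defaults({"scientificName": "Minorissa"}, insect_defaults)
```

Because the record now carries a class, a fuzzy match on it would still be used under the new rule rather than being treated as `NONE`.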