DominikBuchner/BOLDigger

BOLDigger hit type 2 overflagging

Closed this issue · 1 comments

Many records in the BOLD System carry information in the specific epithet that is not strictly part of the species name,
e.g.: sp. a AK-2021, sp. (Johor), communis A1A2, cf. alpium.

Because of this, many hits are labelled as type 2 in the BOLDigger hit pipeline.

Therefore, I think it would be useful to add a species name cleaning step at the beginning of the BOLDigger hit process, as follows:

Delete the species name completely if it contains:
"sp." missing species name (e.g. sp. CFJS-2021b)
"cf." doubtful species name (e.g. cf. micrura)
"aff." doubtful species name (e.g. aff. hornsundi)
"grp." group, doubtful species name (e.g. pedellus grp.)
" / " doubtful species name

Delete everything after the following markers, leaving only the species name:
" ssp." subspecies name
" var." variety name (e.g. australogibba var. subcapitata)
" " additional information appended after the species name (e.g. bilobata CEA)

After this, I would delete any remaining entries that contain numbers (e.g. sp0949C, Malaise3164).

I am unsure whether hybrids should be deleted. It might make sense to remove them by default but provide an option to keep them.
" x " hybrids (e.g. pennsylvanicus x firmus)
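
The rules above could be sketched roughly as follows. This is only an illustration of the proposal, not BOLDigger code: the function name `clean_epithet`, the `drop_hybrids` flag, and the order in which the rules are applied are all assumptions made for this example.

```python
import re

# Markers that make the whole epithet unreliable (first rule set above).
DROP_MARKERS = ("sp.", "cf.", "aff.", "grp.")
# Markers after which everything is discarded (second rule set above).
TRUNCATE_MARKERS = (" ssp.", " var.")


def clean_epithet(epithet: str, drop_hybrids: bool = True) -> str:
    """Return a cleaned specific epithet, or "" if it should be discarded.

    Hypothetical helper: the name and the drop_hybrids option are
    assumptions for illustration, not part of BOLDigger.
    """
    text = epithet.strip()
    tokens = text.split()

    # Open nomenclature or ambiguous names: drop the epithet entirely.
    if any(tok in DROP_MARKERS for tok in tokens) or " / " in text:
        return ""

    # Hybrids such as "pennsylvanicus x firmus": dropped by default,
    # returned unchanged if the caller opts out.
    if " x " in text:
        return "" if drop_hybrids else text

    # Cut off subspecies and variety names.
    for marker in TRUNCATE_MARKERS:
        text = text.split(marker, 1)[0]

    # Keep only the first word, dropping trailing codes like "CEA".
    text = text.split(" ", 1)[0] if text else ""

    # Finally, discard anything that still contains a digit.
    if re.search(r"\d", text):
        return ""
    return text
```

With the examples from this issue, `clean_epithet("cf. alpium")` and `clean_epithet("sp0949C")` return an empty string, while `clean_epithet("communis A1A2")` returns `"communis"` and `clean_epithet("australogibba var. subcapitata")` returns `"australogibba"`. Matching the markers as whole tokens is a deliberate choice in this sketch, so that an epithet that merely ends in "sp" or "cf" is not dropped by accident.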

Fixed with BOLDigger2.