unipept/unipept-database

Additions for the species invalidation list

Closed · 3 comments

During database construction, we heuristically invalidate some species. These species should be added:

Check that everything between bacteria and bacterium is also negative in the lineage.

After some more research, it seems that NCBI changed its naming strategy. An unknown species with a known genus used to be called unidentified <genusname>, which we invalidated. NCBI now seems to use <genusname> bacterium, which we don't invalidate. This potentially affects up to 23 million proteins.

An invalidation rule could be to invalidate if rank is species and name ends in a space followed by "bacterium".
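A minimal sketch of that rule, just to pin down what it would and would not match. The function name and the plain-string rank and name arguments are hypothetical; the actual database construction code may represent taxa differently.

```python
def should_invalidate(rank: str, name: str) -> bool:
    """Return True if the proposed rule would invalidate this taxon.

    Hypothetical helper: rank and name are assumed to be plain strings
    as they appear in the NCBI taxonomy dump.
    """
    # Proposed rule: rank is species AND the name ends in " bacterium"
    # (a space followed by the word "bacterium").
    return rank == "species" and name.endswith(" bacterium")
```

Note that the leading space matters: a name consisting only of "bacterium" would not match, and neither would a taxon with the same name at genus level.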

I found 631 currently valid taxa that this rule would invalidate. There are also 314 additional currently valid taxa that match ' bacterium.*' at species level. Should these be invalidated too? They look invalid to me, at least.
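To make the difference between the two groups explicit, here is a rough sketch that separates names hit by the strict "ends in ' bacterium'" rule from names that only match the broader ' bacterium.*' pattern. It assumes the pattern is applied as a search anywhere in the name; the function name, labels, and sample names in the comments are illustrative only.

```python
import re

# Broader pattern from the comment above, assumed to match anywhere in the name.
BROAD_PATTERN = re.compile(r" bacterium.*")


def classify(rank: str, name: str) -> str:
    """Label a taxon as 'strict', 'broad_only', or 'unaffected' (hypothetical labels)."""
    if rank != "species":
        return "unaffected"
    if name.endswith(" bacterium"):
        # Caught by the strict rule, e.g. "<genusname> bacterium".
        return "strict"
    if BROAD_PATTERN.search(name):
        # Only caught by the broader pattern, e.g. "<name> bacterium <suffix>".
        return "broad_only"
    return "unaffected"
```

Since every name ending in " bacterium" also matches the broader pattern, adopting the broad rule alone would cover both groups.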

None of these are valid taxon names.