Additions for the species invalidation list
During database construction, we heuristically invalidate some species. These species should be added:
- Bacterium: 178K proteins
- Proteobacteria bacterium: 164K proteins
- Ruminococcaceae bacterium: 131K proteins
Check that everything between bacteria and bacterium is also negative in the lineage.
After doing some more research, it seems that NCBI changed their naming strategy. If they had an unknown species with a known genus, they used to call that "unidentified <genusname>", which we invalidated. It now seems that they use "<genusname> bacterium", which we don't invalidate. This potentially affects up to 23 million proteins.
An invalidation rule could be to invalidate if rank is species and name ends in a space followed by "bacterium".
I found 631 currently valid taxa that this rule would invalidate. There are also 314 additional currently valid taxa which match ' bacterium.*' at species level. Should these be invalidated, too? They look invalid to me, at least.
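For reference, the two candidate rules could be sketched roughly as below. This is only an illustration, not the actual database construction code: the function name `classify`, the rank string "species", and the example names are assumptions made up for the sketch.

```python
import re

# Strict rule: the name ends in a space followed by "bacterium".
STRICT = re.compile(r" bacterium$")
# Broader rule: the name contains " bacterium" followed by anything.
BROAD = re.compile(r" bacterium")


def classify(name: str, rank: str) -> str:
    """Return which rule (if any) would invalidate a taxon.

    Hypothetical helper: "strict" means the ends-with rule matches,
    "broad_only" means only the ' bacterium.*' pattern matches,
    "keep" means neither rule applies.
    """
    if rank != "species":  # both rules only look at species-level taxa
        return "keep"
    if STRICT.search(name):
        return "strict"
    if BROAD.search(name):
        return "broad_only"
    return "keep"


# Made-up example names, except the one listed at the top of this issue.
examples = [
    ("Ruminococcaceae bacterium", "species"),
    ("Examplaceae bacterium XYZ-1", "species"),
    ("Escherichia coli", "species"),
]
for name, rank in examples:
    print(name, "->", classify(name, rank))
```

Taxa in the "broad_only" group would correspond to the additional matches that the strict rule misses, so the two groups can be reviewed separately before deciding whether to invalidate both.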
None of these are valid taxon names.