leylabmpi/Struo2

GTDB202 tree is missing species

Closed this issue · 3 comments

Hi all,

I'm trying to perform some analyses using the Kraken2 version of the database (http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/kraken2/) and the phylogenetic tree (http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/phylogeny/gte50comp-lt5cont.nwk) and have noticed that a number of species in the database are missing from the tree.

A few examples (of the ~200 or so I found to be missing):

s__1XD42-69 sp003612565
s__43-108 sp001915545
s__Acetatifactor sp003612485
s__Achromobacter_anxifer
s__Acidaminococcus_sp900314165
s__Amylolactobacillus_amylophilus

Are there criteria that led to these species being excluded from the tree, or are they missing by accident? Is it possible to get a tree containing everything in the database?

Thanks for your time!

The lack of overlap is a combination of different formatting (e.g., s__Acetatifactor sp003612485 should be s__Acetatifactor_sp003612485) and the fact that some species in the GTDB did not have a "good" representative genome assembly and were thus not included in the Struo2 databases.
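One way to see how much of the gap is purely a formatting difference is to normalize the names before comparing them with the tree tips. The following is a minimal Python sketch, not part of Struo2: the helper names (`normalize`, `tip_labels`, `missing_from_tree`) are hypothetical, the regex-based Newick reading assumes tip labels contain no commas, colons, semicolons, or parentheses, and the toy tree and species names are made up for illustration.

```python
import re


def normalize(name):
    """Trim whitespace and join the words of a species name with underscores."""
    return "_".join(name.split())


def tip_labels(newick):
    """Collect leaf labels from a Newick string.

    A leaf label directly follows "(" or ","; internal node labels and
    support values follow ")" and are therefore not matched.
    Assumption: labels contain no commas, colons, semicolons, or parentheses.
    """
    raw = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return {normalize(label.strip("'\"")) for label in raw}


def missing_from_tree(db_species, newick):
    """Return normalized species names that have no matching tip in the tree."""
    tips = tip_labels(newick)
    return sorted(n for n in map(normalize, db_species) if n not in tips)


# Toy data (made up): two of the three names are in the tree once spaces
# are replaced by underscores; the third is absent.
toy_tree = "((s__Examplea_sp000000001:0.1,s__Exampleb_alpha:0.2)0.95:0.3,s__Examplec_beta:0.4);"
toy_species = ["s__Examplea sp000000001", "s__Exampleb_alpha", "s__Exampled sp000000002  "]
print(missing_from_tree(toy_species, toy_tree))
```

Names still reported as missing after normalization would then be candidates for the second explanation, i.e. species without a suitable representative genome.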

Thanks for the fast response!

I'm not sure this is necessarily the case -- for example sp003612485 shows up in the Struo2 database, taxonomy file and names.dmp, but not the tree:

$ grep sp003612485 bac120_taxonomy_r202.tsv
RS_GCF_003612485.1	RS_GCF_003612485.1	d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Acetatifactor;s__Acetatifactor sp003612485

$ grep sp003612485 names.dmp
263910	|	s__Acetatifactor sp003612485	|		|	scientific name	|

$ grep sp003612485 gte50comp-lt5cont.nwk | wc -l
       0

So it's still not quite clear why these would be present in the Struo2 Kraken2 database but not in the tree.

sp003612485 isn't in the actual Kraken2 or HUMAnN3 database. The taxonomy files include all GTDB r202 taxa, but not all of them were included in the actual database.
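To tell taxa that are only listed in the taxonomy files apart from taxa that actually have sequence in the database, the names.dmp entries can be compared against a report of the database contents. The sketch below is hedged: it assumes the report was produced with kraken2-inspect and is tab-separated with the taxonomy ID in the second-to-last column and the name in the last, and that taxa without sequence are not listed in it. The function names and the file paths in the usage comment are hypothetical.

```python
def parse_names_dmp(lines):
    """Map taxid -> scientific name from names.dmp-style lines.

    Fields are separated by "|" with surrounding tabs, e.g.
    263910 | s__Acetatifactor sp003612485 | | scientific name |
    """
    names = {}
    for line in lines:
        fields = [f.strip() for f in line.strip().rstrip("|").split("|")]
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[fields[0]] = fields[1]
    return names


def taxids_in_report(lines):
    """Collect taxids from a tab-separated report.

    Assumption: the taxid is the second-to-last column, as in
    Kraken2-style reports.
    """
    taxids = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2 and cols[-2].strip().isdigit():
            taxids.add(cols[-2].strip())
    return taxids


def taxonomy_only(names_lines, report_lines):
    """Return {taxid: name} for taxa in names.dmp that the report does not list."""
    present = taxids_in_report(report_lines)
    return {t: n for t, n in parse_names_dmp(names_lines).items() if t not in present}


# Usage sketch (hypothetical paths; the report would first be written with
# something like `kraken2-inspect --db <db_dir> > inspect.txt`):
# with open("names.dmp") as names, open("inspect.txt") as report:
#     for taxid, name in sorted(taxonomy_only(names, report).items()):
#         print(taxid, name, sep="\t")
```

Any species returned by `taxonomy_only` would be expected to be absent from the tree as well, since the tree only covers genomes that went into the database.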