Generating sample table when updating database with MAGs - GTDB taxID and Accession Number
PeterCx opened this issue · 5 comments
Hi there,
I am trying to update the GTDB r207 database, which I downloaded using Struo2, with my own MAGs. It is not clear how I obtain some of the required information, including "ncbi_organism_name", "gtdb_taxid" and "accession".
I have annotated my MAGs using GTDB-Tk. Using FastANI, I have de-replicated my genomes, removing those at or above 95% ANI. This has left me with ~3000 MAGs. Given that these MAGs are not close to any other genome in GTDB, I don't understand how I can get a taxid. I have attached the information I currently have from GTDB about my MAGs.
GTDB_MAG_Information.txt
Your help is greatly appreciated.
Kind regards,
P
You should get the GTDB taxids via https://github.com/shenwei356/gtdb-taxdump
I used that taxdump for setting taxids in GTDB-r207
Hi Nick,
Thanks for your response. A few things are still not clear to me. I have the GTDB taxids for r207, obtained through the link above, but it's not clear how I generate taxids for my own MAGs. I have used the command below, which I found here:
gtdb_to_taxdump.py \
  TaxID/gtdbtk.bac120.summary.tsv \
  https://data.gtdb.ecogenomic.org/releases/release207/207.0/bac120_taxonomy_r207.tsv.gz \
  TaxID/taxID_info.tsv
This shows a taxid in the output file taxID_info.tsv.
How do I get the ncbi_organism_name and accession required for the database update? I am confused because most of my MAGs cannot be assigned a taxonomy beyond the genus level.
Many thanks
P
You could go from NCBI taxids for each of your MAGs to GTDB taxids, via gtdb_to_taxdump.py.
Another approach is getting the GTDB taxids directly from the GTDB taxdump created by https://github.com/shenwei356/gtdb-taxdump. You would probably need to create your own script for this, however. The process would likely be MAGs => GTDB-Tk (GTDB taxonomy) => map taxonomy to gtdb-taxdump => get GTDB taxids
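As a rough illustration of the "map taxonomy to gtdb-taxdump" step, here is a minimal Python sketch. It assumes the taxdump contains a names.dmp file in the usual NCBI layout (fields separated by "|", with a "scientific name" class), and that matching on the lowest classified rank name is acceptable. The function names and the example rows are hypothetical, and the taxids shown are made up, not real GTDB taxids.

```python
def load_name_to_taxid(names_dmp_lines):
    """Build {scientific name: taxid} from names.dmp-style lines."""
    name_to_taxid = {}
    for line in names_dmp_lines:
        # NCBI-style row: taxid | name | unique name | name class |
        fields = [f.strip() for f in line.rstrip("\n").split("|")]
        if len(fields) < 4:
            continue
        taxid, name, _unique_name, name_class = fields[:4]
        if name_class == "scientific name":
            name_to_taxid[name] = int(taxid)
    return name_to_taxid


def lowest_rank_name(lineage):
    """Return the lowest classified name in a GTDB lineage string, or None."""
    for rank in reversed(lineage.split(";")):
        name = rank.strip()[3:]  # drop the rank prefix such as 'g__'
        if name:
            return name
    return None


def lineage_to_taxid(lineage, name_to_taxid):
    """Look up the taxid of the lowest classified rank; None if not found."""
    return name_to_taxid.get(lowest_rank_name(lineage))


# Made-up names.dmp rows for illustration only (NOT real GTDB taxids).
example_names_dmp = [
    "10\t|\tArchaea\t|\t\t|\tscientific name\t|\n",
    "20\t|\tAenigmatarchaeota\t|\t\t|\tscientific name\t|\n",
]
name_to_taxid = load_name_to_taxid(example_names_dmp)
lineage = "d__Archaea;p__Aenigmatarchaeota;c__;o__;f__;g__;s__"
print(lineage_to_taxid(lineage, name_to_taxid))
```

One caveat with a name-only lookup: GTDB placeholder names can be identical at different ranks (for example the same string used for both a family and a genus), so a dictionary keyed on the name alone keeps only one of them. Checking the rank in nodes.dmp would be needed to tell such cases apart.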
Hi @nick-youngblut, I am experiencing a similar issue. I have a GTDB-Tk output file with GTDB taxonomies, but I don't have any TaxIDs. How do I go about the step of mapping the taxonomy to gtdb-taxdump?
Thank you for any assistance you can provide.
This is what I ended up doing:
# `df` is assumed to be an already-loaded pandas DataFrame with "genome" and "gtdb_classification" columns
# Create lineage dataframe based on gtdb_classification column
# This is what a cell looks like: 'd__Archaea;p__Aenigmatarchaeota;c__Aenigmatarchaeia;o__GW2011-AR5;f__GCA-2688965;g__GCA-2688965;s__GCA-2688965 sp002688965'
# Indexing by genome keeps the genome ID in the first column of the exported file
gtdb_lineages = df.set_index("genome")["gtdb_classification"].str.split(";", expand=True)

# Write a function to extract the scientific name
def get_sci_name_from_row(row):
    """Read a pandas Series (a row) and return the lowest-rank scientific name."""
    # Walk the ranks from species up to domain
    for value in reversed(row.to_list()):
        # Trim the rank prefix (e.g. 's__'); an empty remainder means unclassified at that rank,
        # so move on to the next higher rank. Non-string cells (missing ranks) are skipped too.
        if isinstance(value, str) and (value_fmt := value[3:]):
            return value_fmt
    return None

# Export to a text file
gtdb_lineages.apply(get_sci_name_from_row, axis=1).to_csv("scinames.csv")
Now I run that with TaxonKit (my --data-dir on TaxonKit is set to the GTDB r207 taxdump):
cut -f 2 -d , scinames.csv | taxonkit name2taxid > taxids.csv
This gives me a text file with the TaxIDs from the custom GTDB taxdump. I hope it helps.
Best,
Vini
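To carry the TaxIDs back to the individual MAGs, something like the following sketch may be useful. It assumes the taxonkit name2taxid output has one "name<TAB>taxid" pair per line, that a name with no match has an empty taxid field, and that an ambiguous name appears on several lines. The function names and the example data are hypothetical, and the taxids are made up.

```python
def parse_name2taxid(tsv_text):
    """Parse 'name<TAB>taxid' lines into {name: [taxid, ...]}."""
    hits = {}
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        # Skip lines without a taxid (e.g. a header line or an unmatched name)
        if len(fields) < 2 or not fields[1].strip():
            continue
        hits.setdefault(fields[0], []).append(int(fields[1]))
    return hits


def assign_taxids(genome_to_name, hits):
    """Split genomes into uniquely assigned and unresolved (none or several taxids)."""
    assigned, unresolved = {}, {}
    for genome, name in genome_to_name.items():
        taxids = hits.get(name, [])
        if len(taxids) == 1:
            assigned[genome] = taxids[0]
        else:
            unresolved[genome] = taxids
    return assigned, unresolved


# Made-up example: names, genome IDs and taxids are for illustration only.
example_output = "0\t\nFoo\t5\nBar\t7\nBar\t8\nBaz\t\n"
genome_to_name = {"mag1": "Foo", "mag2": "Bar", "mag3": "Baz"}
assigned, unresolved = assign_taxids(genome_to_name, parse_name2taxid(example_output))
print(assigned)
print(unresolved)
```

Keeping the unresolved genomes separate makes it easier to review names that matched several taxa or none at all before building the sample table.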