leylabmpi/Struo2

Generating sample table when updating database with MAGs - GTDB taxID and Accession Number

PeterCx opened this issue · 5 comments

Hi there,

I am trying to update the GTDB r207 database that I downloaded using Struo2 with my own MAGs. It is not clear how to obtain some of the required information, including "ncbi_organism_name", "gtdb_taxid" and "accession".

I have annotated my MAGs using GTDB-Tk. Using FastANI, I have de-replicated my genomes, removing those with 95% ANI. This has left me with ~3000 MAGs. Given that these MAGs are not close to any other genome in GTDB, I don't understand how I can get a taxid. I have attached the current information I have from GTDB about my MAGs.
GTDB_MAG_Information.txt

Your help is greatly appreciated.

Kind regards,

P

You should get the GTDB taxids via https://github.com/shenwei356/gtdb-taxdump

I used that taxdump for setting taxids in GTDB-r207

Hi Nick,

Thanks for your response. A few things are still not clear to me. I have the GTDB taxids for r207, obtained through the link above, but it's not clear how I generate taxids for my own MAGs. I have used the command below, which I found here:

gtdb_to_taxdump.py \
  TaxID/gtdbtk.bac120.summary.tsv \
  https://data.gtdb.ecogenomic.org/releases/release207/207.0/bac120_taxonomy_r207.tsv.gz \
  > TaxID/taxID_info.tsv

This shows a taxid in the output file taxID_info.tsv.
How do I get the ncbi_organism_name and accession required for the database update? I am confused because most of my MAGs cannot be assigned a taxonomy beyond the genus level.

Many thanks

P

You could go from NCBI taxids for each of your MAGs to GTDB taxids, via gtdb_to_taxdump.py.

Another approach is getting the GTDB taxids directly from the GTDB taxdump created by https://github.com/shenwei356/gtdb-taxdump. You would probably need to create your own script for this, however. The process would likely be MAGs => GTDB-Tk (GTDB taxonomy) => map taxonomy to gtdb-taxdump => get GTDB taxids
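As a rough illustration of the "map taxonomy to gtdb-taxdump" step, here is a minimal Python sketch. It assumes the taxdump contains a names.dmp file in the usual NCBI taxdump layout (fields separated by "|", with a "scientific name" name class) and that names are stored without the rank prefix. The function names `parse_names_dmp` and `lowest_classified_name` are hypothetical, and the taxids in the example are made up. Because GTDB can reuse the same name at two ranks (for example a family and a genus), the lookup returns every matching taxid rather than picking one.

```python
from collections import defaultdict


def parse_names_dmp(lines):
    """Map scientific name -> list of taxids from names.dmp lines.

    `lines` can be any iterable of lines, e.g. an open file handle.
    """
    name_to_taxids = defaultdict(list)
    for line in lines:
        # Each line looks like: taxid | name | unique name | name class |
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 4 and fields[3] == "scientific name":
            name_to_taxids[fields[1]].append(fields[0])
    return dict(name_to_taxids)


def lowest_classified_name(lineage):
    """Return the most specific non-empty name in a GTDB lineage string."""
    for taxon in reversed(lineage.split(";")):
        # 'g__Foo' -> ('g', '__', 'Foo'); an unclassified rank such as 's__' has an empty name
        _prefix, sep, name = taxon.strip().partition("__")
        if sep and name:
            return name
    return None


# Example with made-up taxids (real ones come from the gtdb-taxdump names.dmp)
example_names = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "10\t|\tGCA-2688965\t|\t\t|\tscientific name\t|\n",
    "11\t|\tGCA-2688965\t|\t\t|\tscientific name\t|\n",
]
lookup = parse_names_dmp(example_names)
name = lowest_classified_name("d__Archaea;f__GCA-2688965;g__GCA-2688965;s__")
print(name, lookup.get(name, []))
```

With a real taxdump you would pass `open("names.dmp")` instead of the example list. If a name returns more than one taxid, the rank column in nodes.dmp can be used to pick the one at the intended rank.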

Hi @nick-youngblut, I am experiencing a similar issue. I have a GTDB-Tk output file with GTDB taxonomies, but I don't have any TaxIDs. How do I go about the step of mapping the taxonomy to gtdb-taxdump?

Thank you for any assistance you can provide.

This is what I ended up doing:

# `df` is a pandas DataFrame with one row per MAG and the columns "genome" and "gtdb_classification"
# Create a lineage dataframe (one column per rank) from the gtdb_classification column
# This is what a cell looks like: 'd__Archaea;p__Aenigmatarchaeota;c__Aenigmatarchaeia;o__GW2011-AR5;f__GCA-2688965;g__GCA-2688965;s__GCA-2688965 sp002688965'
gtdb_lineages = df.set_index("genome")["gtdb_classification"].str.split(";", expand=True)

# Write a function to extract the scientific name
def get_sci_name_from_row(row):
    """Read a pandas Series (a row) and return its lowest-rank scientific name, or None."""
    # Walk the ranks from species up to domain
    for value in reversed(row.to_list()):
        # Skip missing cells; trimming the rank prefix (e.g. 's__') leaves '' if the rank is unclassified
        if isinstance(value, str) and (value_fmt := value[3:]):
            return value_fmt
    return None

# Export to a CSV file (genome in column 1, name in column 2, no header row)
gtdb_lineages.apply(get_sci_name_from_row, axis=1).to_csv("scinames.csv", header=False)

Then I run the names through TaxonKit (my TaxonKit --data-dir is set to the GTDB r207 taxdump):

cut -f 2 -d , scinames.csv | taxonkit name2taxid > taxids.csv

This gives me a text file with the TaxIDs from the custom GTDB taxdump. I hope it helps.

Best,
Vini
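
To get one taxid per MAG for the sample table, the two files can be joined back together on the scientific name. The following is a minimal sketch, not part of the workflow above. It assumes scinames.csv has the genome in column 1 and the name in column 2, and that taxids.csv holds the tab-separated "name, taxid" lines printed by taxonkit name2taxid. The function name `join_taxids`, the `skip_header` parameter and the example taxids are hypothetical. Names that match several taxa keep all of their taxids, joined with ";", so they can be resolved by hand.

```python
import csv
import io


def join_taxids(scinames_file, taxids_file, skip_header=False):
    """Yield (genome, name, taxids) tuples by joining the two files on the name."""
    # Collect every distinct taxid reported for each name
    name_to_taxids = {}
    for row in csv.reader(taxids_file, delimiter="\t"):
        if len(row) >= 2 and row[1]:
            taxids = name_to_taxids.setdefault(row[0], [])
            if row[1] not in taxids:
                taxids.append(row[1])

    reader = csv.reader(scinames_file)
    if skip_header:
        next(reader, None)  # drop a header row if the CSV was written with one
    for row in reader:
        if len(row) < 2:
            continue
        genome, name = row[0], row[1]
        # An empty string means no taxid was found for this name
        yield genome, name, ";".join(name_to_taxids.get(name, []))


# Example with made-up taxids; with real files, pass open("scinames.csv") and open("taxids.csv")
scinames = io.StringIO("MAG1,GCA-2688965 sp002688965\nMAG2,GCA-2688965\n")
taxids = io.StringIO("GCA-2688965 sp002688965\t12\nGCA-2688965\t10\nGCA-2688965\t11\n")
for genome, name, gtdb_taxid in join_taxids(scinames, taxids):
    print(genome, name, gtdb_taxid, sep="\t")
```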