hariszaf/pema

NCBI Taxon ID included in the final_table.tsv file?

cpavloud opened this issue · 12 comments

One thing that has been requested is to enhance the final_table.tsv file to include (apart from the columns it already includes) the NCBI Taxon ID for each ASV/OTU and the accession number of the sequence that was its closest match in the database used. The NCBI Taxon ID could then be used as the taxonConceptID when submitting data to GBIF/OBIS using the DwC-A format (as discussed here).

For example, instead of the current final_table.tsv file, which looks like this
OTU_id,ERR0000008,ERR0000009,Classification
Otu1,1123,2,Eukaryota;Arthropoda;Insecta;Plecoptera;Capniidae;Allocapnia;Allocapnia aurora
Otu2,3,0,Eukaryota;Porifera;Demospongiae;Hadromerida;Polymastiidae;Polymastia;Polymastia littoralis

ideally it could look something like this
OTU_id,ERR0000008,ERR0000009,Classification,Accession_number,NCBI_Taxon_ID
Otu1,1123,2,Eukaryota;Arthropoda;Insecta;Plecoptera;Capniidae;Allocapnia;Allocapnia aurora,JN200445,608846
Otu2,3,0,Eukaryota;Porifera;Demospongiae;Hadromerida;Polymastiidae;Polymastia;Polymastia littoralis,NC_023834,1473587

If it is not possible to retrieve the accession number and/or the NCBI taxon ID, I think we can find some workarounds.
Perhaps it will be possible to retrieve the NCBI Taxon ID using Biopython's Bio.Entrez module.
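
A rough sketch of what a Bio.Entrez lookup might look like, in case it helps the discussion. It assumes Biopython is installed and network access to NCBI is available; the function names `first_taxid` and `ncbi_taxid_for_name` are made up for illustration and are not part of pema.

```python
def first_taxid(esearch_record):
    """Return the first ID in a parsed ESearch record as an int, or None."""
    ids = esearch_record.get("IdList", [])
    return int(ids[0]) if ids else None


def ncbi_taxid_for_name(name, email):
    """Query the NCBI taxonomy database for a scientific name (needs network)."""
    # Imported here so that first_taxid() can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="taxonomy", term=f'"{name}"[Scientific Name]')
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return first_taxid(record)
```

A call such as `ncbi_taxid_for_name("Allocapnia aurora", "you@example.org")` would return an integer ID when NCBI knows the name and `None` otherwise.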

@cpavloud I found out about the ncbi-taxonomist tool.

We could use it I think.

Would you like to have a look and share any thoughts?

I am not sure how it would work exactly (the ncbi-taxonomist page does not provide very good examples/explanations), but we could give it a try.

Think of a while loop that starts from the end of the taxonomy in each row of the finalTable.tsv file and runs ncbi-taxonomist for each level.
For each level, we'll query for an NCBI taxonomy ID, and when we get a hit we'll have something like this:

Assuming we are looking for Saprospiraceae,

ncbi-taxonomist collect -n 'Saprospiraceae'

would return:

{"taxid":131567,"rank":"no rank","names":{"cellular organisms":"scientific_name"},"parentid":null,"name":"cellular organisms"}
{"taxid":2,"rank":"superkingdom","names":{"Bacteria":"scientific_name"},"parentid":131567,"name":"Bacteria"}
{"taxid":1783270,"rank":"clade","names":{"FCB group":"scientific_name"},"parentid":2,"name":"FCB group"}
{"taxid":68336,"rank":"clade","names":{"Bacteroidetes/Chlorobi group":"scientific_name"},"parentid":1783270,"name":"Bacteroidetes/Chlorobi group"}
{"taxid":976,"rank":"phylum","names":{"Bacteroidetes":"scientific_name"},"parentid":68336,"name":"Bacteroidetes"}
{"taxid":1937959,"rank":"class","names":{"Saprospiria":"scientific_name"},"parentid":976,"name":"Saprospiria"}
{"taxid":1936988,"rank":"order","names":{"Saprospirales":"scientific_name"},"parentid":1937959,"name":"Saprospirales"}
{"taxid":89374,"rank":"family","names":{"Saprospiraceae":"scientific_name","Saprospira group":"Synonym"},"parentid":1936988,"name":"Saprospiraceae"}
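
If only the ID of the queried taxon is needed, that output can be parsed line by line. A minimal sketch, assuming the output is one JSON object per line exactly as shown above; the helper name `taxid_for` is hypothetical.

```python
import json


def taxid_for(name, collect_output):
    """Return the taxid of the record whose name matches `name`, else None."""
    for line in collect_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("name") == name:
            return record["taxid"]
    return None


# Last two lines of the output shown above, used as sample input.
sample = "\n".join([
    '{"taxid":1936988,"rank":"order","names":{"Saprospirales":"scientific_name"},'
    '"parentid":1937959,"name":"Saprospirales"}',
    '{"taxid":89374,"rank":"family","names":{"Saprospiraceae":"scientific_name",'
    '"Saprospira group":"Synonym"},"parentid":1936988,"name":"Saprospiraceae"}',
])

print(taxid_for("Saprospiraceae", sample))  # prints 89374
```

Matching on the `name` field rather than blindly taking the last line keeps the helper safe if the lineage is ever printed in a different order.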

So, for example, if you have this classification in the finalTable.tsv

Main genome;Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Saccharomycotina;Saccharomycetes;Saccharomycetales;Dipodascaceae;Geotrichum

you would search for Geotrichum
and then for Dipodascaceae
and then for Saccharomycetales
etc etc.

and get the last line for each of your searches?

I would search for Geotrichum; if that has a hit, I'd get either

  • only its NCBI taxonomy ID, or
  • the NCBI taxonomy IDs of its whole lineage.

We could think about that.

If I did not get a hit, I would continue with Dipodascaceae, and so on.
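
The fallback described here could be sketched roughly as follows. This is only an illustration: `lookup` stands in for whatever actually performs the query (ncbi-taxonomist, Bio.Entrez, ...), and both it and the function name `resolve_deepest_taxid` are hypothetical. The toy dictionary holds only IDs that appear elsewhere in this thread.

```python
def resolve_deepest_taxid(classification, lookup):
    """Walk the classification from its most specific level upwards and
    return (taxon, taxid) for the first level that `lookup` can resolve."""
    taxa = [t.strip() for t in classification.split(";") if t.strip()]
    for taxon in reversed(taxa):
        taxid = lookup(taxon)  # expected to return an int, or None on no hit
        if taxid is not None:
            return taxon, taxid
    return None  # no level could be resolved


# Toy lookup table standing in for a real NCBI query.
known = {"Bacteria": 2, "Patescibacteria": 1783273}

row = "Main genome;Bacteria;Patescibacteria;Saccharimonadia;Saccharimonadales"
print(resolve_deepest_taxid(row, known.get))  # prints ('Patescibacteria', 1783273)
```

Returning the matched taxon together with its ID makes it clear at which level the search stopped.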

@cpavloud have a look. Would that be OK?

root@3bbfa77ef486:/mnt/analysis# more extenedFinalTable.tsv 
OTU	ERR0000001	Classification	TAXON:NCBI_TAX_ID
Otu4056	1	Main genome;Bacteria;Patescibacteria;Saccharimonadia;Saccharimonadales	Patescibacteria:1783273
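
For reference, a column like this could be appended with a few lines of Python. This is a sketch under assumptions, not the code used in pema: `extend_table` and `lookup` are hypothetical names, and `lookup` is any callable that maps a taxon name to an NCBI taxonomy ID or `None`.

```python
import csv
import io


def extend_table(tsv_text, lookup):
    """Append a TAXON:NCBI_TAX_ID column holding the most specific
    taxon of each classification that `lookup` can resolve."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")

    header = next(reader)
    class_col = header.index("Classification")
    writer.writerow(header + ["TAXON:NCBI_TAX_ID"])

    for row in reader:
        label = ""  # left empty when no level has an ID
        for taxon in reversed(row[class_col].split(";")):
            taxid = lookup(taxon.strip())
            if taxid is not None:
                label = f"{taxon.strip()}:{taxid}"
                break
        writer.writerow(row + [label])
    return out.getvalue()


table = (
    "OTU\tERR0000001\tClassification\n"
    "Otu4056\t1\tMain genome;Bacteria;Patescibacteria;Saccharimonadia;Saccharimonadales\n"
)
known = {"Bacteria": 2, "Patescibacteria": 1783273}  # toy lookup data
print(extend_table(table, known.get), end="")
```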

If there were no NCBI taxonomy IDs for Saccharimonadia and Saccharimonadales, I think we are fine :)

Exactly!
The thing is that a name in a reference database does not always have an NCBI taxonomy ID.
So I thought we could work up the taxonomy that was found, one rank at a time, starting from the species level.
I'll add this ASAP.

Just FYI, here is what you get if you search the NCBI Taxonomy database for Saccharimonadales

[image: screenshot of the NCBI Taxonomy search result]

and Saccharimonadia

[image: screenshot of the NCBI Taxonomy search result]

This feature is now ready and will be part of pema:v.2.1.4.

The issue is now resolved.

Re-opening the issue:
In case it is helpful, we can go from the sequence accession number to the NCBI Taxon ID: https://www.biostars.org/p/10959/

This is definitely useful for ITS #52