Estimating abundance on ZYMO sample D6311 log dist
Hello developers,
Thank you for the tool. I am benchmarking taxor (taxor version: 0.1.3, SeqAn version: 3.4.0-rc.1) on a ZYMO sample sequenced on ONT, using the prebuilt database containing Archaea, Bacteria, Fungi, and Viruses.
taxor search --index-file /taxor/refseq-abfv-k22-s12.hixf --query-file ZYMO_D6311_14.nanoq.10.1000.fastq.gz --output-file ZYMO_D6311_14.nanoq.10.1000.taxor --threads 30 --error-rate 0.15
taxor profile --search-file ZYMO_D6311_14.nanoq.10.1000.taxor --cami-report-file ZYMO_D6311_14.nanoq.10.1000.taxor.cami --seq-abundance-file ZYMO_D6311_14.nanoq.10.1000.taxor.abundance --binning-file ZYMO_D6311_14.nanoq.10.1000.taxor.binning --sample-id ZYMO_D6311_14.nanoq.10.1000.taxor --threads 30
I am surprised to see that taxor predicted 38.59% Viruses in the sample.
The CAMI report file shows:
@SampleID:ZYMO_D6311_14.nanoq.10.1000.taxor
@Version:0.10.0
@Ranks:superkingdom|phylum|class|order|family|genus|species
@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE
10239 superkingdom 10239 Viruses 38.5958
2 superkingdom 2 Bacteria 60.8576
1224 phylum 2|1224 Bacteria|Pseudomonadota 0.725765
1239 phylum 2|1239 Bacteria|Bacillota 60.1319
2731618 phylum 10239|2731618 Viruses|Uroviricota 38.3566
2732410 phylum 10239|2732410 Viruses|Hofneiviricota 0.239164
1236 class 2|1224|1236 Bacteria|Pseudomonadota|Gammaproteobacteria 0.725765
2731619 class 10239|2731618|2731619 Viruses|Uroviricota|Caudoviricetes 38.3566
2732411 class 10239|2732410|2732411 Viruses|Hofneiviricota|Faserviricetes 0.239164
91061 class 2|1239|91061 Bacteria|Bacillota|Bacilli 60.1319
order 10239|2731618|2731619| Viruses|Uroviricota|Caudoviricetes| 114.635
1385 order 2|1239|91061|1385 Bacteria|Bacillota|Bacilli|Bacillales 59.8872
186826 order 2|1239|91061|186826 Bacteria|Bacillota|Bacilli|Lactobacillales 0.244694
2732094 order 10239|2732410|2732411|2732094 Viruses|Hofneiviricota|Faserviricetes|Tubulavirales 0.239164
72274 order 2|1224|1236|72274 Bacteria|Pseudomonadota|Gammaproteobacteria|Pseudomonadales 0.725765
10860 family 10239|2732410|2732411|2732094|10860 Viruses|Hofneiviricota|Faserviricetes|Tubulavirales|Inoviridae 0.239164
1300 family 2|1239|91061|186826|1300 Bacteria|Bacillota|Bacilli|Lactobacillales|Streptococcaceae 0.244694
135621 family 2|1224|1236|72274|135621 Bacteria|Pseudomonadota|Gammaproteobacteria|Pseudomonadales|Pseudomonadaceae 0.725765
186817 family 2|1239|91061|1385|186817 Bacteria|Bacillota|Bacilli|Bacillales|Bacillaceae 0.398114
186820 family 2|1239|91061|1385|186820 Bacteria|Bacillota|Bacilli|Bacillales|Listeriaceae 59.4891
1301 genus 2|1239|91061|186826|1300|1301 Bacteria|Bacillota|Bacilli|Lactobacillales|Streptococcaceae|Streptococcus 0.244694
1386 genus 2|1239|91061|1385|186817|1386 Bacteria|Bacillota|Bacilli|Bacillales|Bacillaceae|Bacillus 0.398114
1623287 genus 10239|2731618|2731619|||1623287 Viruses|Uroviricota|Caudoviricetes|||Detrevirus 0.24911
1637 genus 2|1239|91061|1385|186820|1637 Bacteria|Bacillota|Bacilli|Bacillales|Listeriaceae|Listeria 59.4891
2560098 genus 10239|2731618|2731619|||2560098 Viruses|Uroviricota|Caudoviricetes|||Beetrevirus 0.185685
2732875 genus 10239|2732410|2732411|2732094|10860|2732875 Viruses|Hofneiviricota|Faserviricetes|Tubulavirales|Inoviridae|Primolicivirus 0.239164
286 genus 2|1224|1236|72274|135621|286 Bacteria|Pseudomonadota|Gammaproteobacteria|Pseudomonadales|Pseudomonadaceae|Pseudomonas 0.725765
1129145 species 10239|2731618|2731619||||1129145 Viruses|Uroviricota|Caudoviricetes||||Pseudomonas phage phi297 0.110057
1129146 species 10239|2731618|2731619|||1623287|1129146 Viruses|Uroviricota|Caudoviricetes|||Detrevirus|Detrevirus PMG1 0.24911
1225792 species 10239|2731618|2731619||||1225792 Viruses|Uroviricota|Caudoviricetes||||Pseudomonas phage JBD25 0.437994
1449437 species 10239|2731618|2731619||||1449437 Viruses|Uroviricota|Caudoviricetes||||Pseudomonas phage vB_PaeP_Tr60_Ab31 0.104323
1458852 species 10239|2731618|2731619||||1458852 Viruses|Uroviricota|Caudoviricetes||||Listeria phage LP-030-3 26.7424
1591073 species 10239|2731618|2731619||||1591073 Viruses|Uroviricota|Caudoviricetes||||Listeria phage vB_LmoS_293 7.65777
1639 species 2|1239|91061|1385|186820|1637|1639 Bacteria|Bacillota|Bacilli|Bacillales|Listeriaceae|Listeria|Listeria monocytogenes 0.255591
1642 species 2|1239|91061|1385|186820|1637|1642 Bacteria|Bacillota|Bacilli|Bacillales|Listeriaceae|Listeria|Listeria innocua 16.7856
1755689 species 10239|2731618|2731619||||1755689 Viruses|Uroviricota|Caudoviricetes||||Pseudomonas phage YMC11/02/R656 0.132222
1777052 species 10239|2731618|2731619||||1777052 Viruses|Uroviricota|Caudoviricetes||||Pseudomonas phage JBD44 0.182903
2011081 species 10239|2732410|2732411|2732094|10860|2732875|2011081 Viruses|Hofneiviricota|Faserviricetes|Tubulavirales|Inoviridae|Primolicivirus|Primolicivirus Pf1 0.239164
2545800 species 2|1224|1236|72274|135621|286|2545800 Bacteria|Pseudomonadota|Gammaproteobacteria|Pseudomonadales|Pseudomonadaceae|Pseudomonas|Pseudomonas sp. FDAARGOS_761 0.162082
2560663 species 10239|2731618|2731619|||2560098|2560663 Viruses|Uroviricota|Caudoviricetes|||Beetrevirus|Beetrevirus JBD67 0.185685
2678528 species 2|1239|91061|1385|186820|1637|2678528 Bacteria|Bacillota|Bacilli|Bacillales|Listeriaceae|Listeria|Listeria sp. LM90SB2 42.4479
2866282 species 2|1224|1236|72274|135621|286|2866282 Bacteria|Pseudomonadota|Gammaproteobacteria|Pseudomonadales|Pseudomonadaceae|Pseudomonas|Pseudomonas sp. PS1(2021) 0.330644
287 species 2|1224|1236|72274|135621|286|287 Bacteria|Pseudomonadota|Gammaproteobacteria|Pseudomonadales|Pseudomonadaceae|Pseudomonas|Pseudomonas aeruginosa 0.23304
Is there any way I can fix this?
According to the ZYMO website, these are the expected proportions:
Listeria monocytogenes - 89.1%, Pseudomonas aeruginosa - 8.9%, Bacillus subtilis - 0.89%, Saccharomyces cerevisiae - 0.89%, Escherichia coli - 0.089%, Salmonella enterica - 0.089%, Lactobacillus fermentum - 0.0089%, Enterococcus faecalis - 0.00089%, Cryptococcus neoformans - 0.00089%, and Staphylococcus aureus - 0.000089%.
This is an issue we have observed with all tools in our benchmarking of taxonomic abundance estimation. When you use a database that contains both bacteria and viruses, all tools assign a share of the bacterial reads to phages that infect the respective bacterial species. The indexed database has a much bigger impact on the results than the tool used. So in your case, it would make sense to use a bacteria-only database. I would also try reducing the accepted error rate to 0.05 if your nanopore reads are of high quality, which could also resolve the issue.
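To gauge how much of the viral fraction might really be host reads, the species-rank rows of the CAMI report can be grouped by genus. The sketch below is only a heuristic and not part of taxor: the function name `fold_phages_into_hosts` is hypothetical, it assumes the report file is tab-separated as in the CAMI profiling format, and it only recognizes phages whose species name has the form "<Genus> phage ...". Viral species named differently (for example "Detrevirus PMG1") are ignored.

```shell
# Hypothetical helper: per bacterial genus, print the bacterial percentage,
# the percentage assigned to "<Genus> phage ..." species, and their sum.
# Assumes tab-separated CAMI columns: TAXID, RANK, TAXPATH, TAXPATHSN, PERCENTAGE.
fold_phages_into_hosts() {
    awk -F '\t' '
        # Skip header lines, malformed rows, and every rank except species.
        /^@/ || NF < 5 || $2 != "species" { next }
        {
            n = split($4, names, "|")
            species = names[n]              # last TAXPATHSN element is the species name
            split(species, words, " ")
            genus = words[1]                # heuristic: first word of the species name
            if (names[1] == "Viruses" && words[2] == "phage")
                phage[genus] += $5
            else if (names[1] == "Bacteria")
                host[genus] += $5
        }
        END {
            for (g in host)
                printf "%s\t%.4f\t%.4f\t%.4f\n", g, host[g], phage[g], host[g] + phage[g]
        }
    ' "$1" | sort
}

# Usage (file name taken from the commands in this thread):
# fold_phages_into_hosts ZYMO_D6311_14.nanoq.10.1000.taxor.cami
```

Applied to the report in this thread, the two "Listeria phage" species (26.7424 + 7.65777) would add roughly 34.4 percentage points to the 59.5% assigned to bacterial Listeria species, giving about 93.9% for the genus. This is a rough sanity check on where the viral hits come from, not a corrected abundance estimate.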
Thank you @JensUweUlrich. That makes sense. I am using new R10.4 library data; I will try reducing the error rate.
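The retry discussed above could look like the following sketch. It reuses only the `taxor search` flags that appear in the original command; the wrapper name `rerun_search` and the output file name in the usage comment are hypothetical. A bacteria-only index would be passed as the first argument, but no such file is named in this thread, so its path is left to the reader.

```shell
# Hypothetical wrapper around the search command from the original post.
# Arguments: index file, query FASTQ, output file, optional error rate (default 0.05).
rerun_search() {
    taxor search \
        --index-file "$1" \
        --query-file "$2" \
        --output-file "$3" \
        --threads 30 \
        --error-rate "${4:-0.05}"
}

# Usage with the files from this thread (the output name is an assumption):
# rerun_search /taxor/refseq-abfv-k22-s12.hixf \
#     ZYMO_D6311_14.nanoq.10.1000.fastq.gz \
#     ZYMO_D6311_14.nanoq.10.1000.e005.taxor
```

Writing to a separate output file keeps the results of the 0.15 run available for comparison. `taxor profile` would then be rerun on the new search file as in the original post.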