leylabmpi/Struo2

Kraken2 classification % goes down after adding more genomes

dgolden96 opened this issue · 4 comments

Hi there,

I've got the db-update functionality of the pipeline working, and I recently added about 4,000 genomes from JGI GOLD to the pre-made GTDB_release207 custom database. My issue is that the database now classifies a smaller percentage of sample reads than it did before. Initially, I thought the issue might be that my sample TSV's taxonomy column only had genus and species names, but I've repeated the process after expanding the taxonomy to mirror that of the metadata used for GTDB_release207, and I haven't seen any change. Any guesses about why this might be? I'll attach my sample TSV here as a .txt file.
JGI_downloads_data_distinct_name_acc_premade_filter_tax_fixed.txt

How much overlap is there between the 4k genomes that you added and the existing genomes in the database? In particular, what is the taxonomic and ANI distance between your new genomes and the existing genomes?

We don't have those distances calculated yet, but the genomes that we're adding in our sample TSV are all soil-associated genomes with NCBI taxIDs that weren't in the pre-built GTDB. We selected them from the JGI GOLD repository with the goal of maximizing the ecosystem-specific diversity of the database. Ostensibly, there shouldn't be all that much overlap.

It would be best to check. The simplest explanation is that you have too many highly similar references, which reduces the ability of Kraken2 to classify reads down to the most finely resolved taxonomic levels.
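One quick way to see whether this is happening is to compare the Kraken2 reports from before and after the update: if many near-identical references were added, reads tend to shift from species-level assignments up to genus or higher (the LCA of the similar genomes). A minimal sketch, assuming the standard six-column kreport layout (percent, clade reads, directly assigned reads, rank code, taxid, name); the function name and file path are hypothetical:

```python
from collections import Counter

def reads_by_rank(kreport_path):
    """Sum directly assigned reads per rank code (U, R, D, K, P, C, O, F, G, S, ...)
    from a standard Kraken2 report, so two reports can be compared side by side."""
    counts = Counter()
    with open(kreport_path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            direct_reads = int(fields[2])  # reads assigned directly to this taxon
            rank_code = fields[3]
            counts[rank_code] += direct_reads
    return counts
```

Comparing `reads_by_rank("before.kreport")` against `reads_by_rank("after.kreport")` would show whether reads are moving out of the S (species) bucket and into higher ranks or U (unclassified), rather than the overall percentage simply dropping for other reasons.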

Much appreciated! I'll start preparing that analysis. My supervisor on this project and I are discussing whether we might use the fastANI package to investigate that possibility, but we have a couple of reasons to think that low genomic distance between our existing database and the genomes in the sample TSV may not be a problem in our dataset. I'll get my notes together (along with some data and a couple of files) and present the issue in more detail in a follow-up issue soon. Thanks in advance!
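For anyone checking the same thing: fastANI can compare the new genomes against the database genomes in batch (e.g. `fastANI --ql new_genomes.txt --rl db_genomes.txt -o fastani_out.tsv`), and its output is tab-delimited with columns query, reference, ANI, fragments mapped, total fragments. A minimal sketch for flagging pairs above a similarity cutoff; the function name, file name, and 95% threshold are assumptions, not part of Struo2:

```python
import csv

def flag_similar(fastani_tsv, ani_cutoff=95.0):
    """Return (query, reference, ANI) tuples at or above the cutoff,
    parsed from fastANI's tab-delimited output. Note that fastANI only
    reports pairs above roughly 80% ANI; absent pairs are dissimilar."""
    hits = []
    with open(fastani_tsv) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            query, ref, ani = row[0], row[1], float(row[2])
            if ani >= ani_cutoff:
                hits.append((query, ref, ani))
    return hits
```

New genomes that show up in `flag_similar` against existing database entries would be the candidates for the effect described above; 95% ANI is a commonly used species-level boundary, but the right cutoff depends on the dataset.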