leylabmpi/Struo2

Kraken2_build step stalling


Hi there,

I'm continuing to troubleshoot the db-update process for a kraken2 database, and I've hit a wall at the kraken2_build step. The pipeline doesn't throw any errors; it just continues to run indefinitely (12+ hours without failure or completion). It seems similar to the problem described here: DerrickWood/kraken2#428

So far, I've tried the workaround mentioned in the comments of that linked issue, adding the --fast-build flag to the kraken2 call in the db-update snakefile, but it doesn't seem to have solved the problem. Any chance you've seen this before and/or have any thoughts on what might be causing it? I should definitely have enough RAM: I'm using 28 cores with 16 GB per core.
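For reference, the edit amounts to appending --fast-build to the kraken2-build call in the snakefile. A minimal sketch of the resulting command (the database path and thread count here are placeholders; the actual Struo2 db-update rule fills these in from the config):

# sketch only: append --fast-build to the existing build command
kraken2-build --build \
    --db kraken2_db/ \
    --threads 28 \
    --fast-build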

Thanks!

I've (thankfully) never experienced that issue. How many genomes are included in the build?

The database to be updated is the full GTDB_release207, and the sample TSV I'm trying to add includes ~4,000 genomes

A related question: if we instead passed the reads left unclassified by the GTDB database into a second database (built via db-create with only the non-GTDB genomes), should that give results similar to a single combined database from the db-update workflow? There are methods for combining outputs for the same sample across different databases, though I imagine there could be downstream effects on the Bracken estimates.

The downside of a 2-step classification approach versus a 1-step approach is that there is no direct "competition" between the references during classification across the 2 steps. So, some reads could be falsely classified in the 1st step when, with the 2 reference databases combined, they would instead be assigned to a taxon from the 2nd database.
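For illustration, a rough sketch of the 2-step approach using kraken2's --unclassified-out option, assuming paired-end reads and hypothetical database/file names (gtdb_db/, custom_db/, reads_1.fq, reads_2.fq):

# Step 1: classify against the GTDB database, saving unclassified reads
# (the '#' in --unclassified-out is expanded to 1/2 for the read pairs)
kraken2 --db gtdb_db/ --threads 28 --paired \
    --unclassified-out unclassified#.fq \
    --report step1.kreport --output step1.kraken \
    reads_1.fq reads_2.fq

# Step 2: classify the leftovers against the custom (non-GTDB) database
kraken2 --db custom_db/ --threads 28 --paired \
    --report step2.kreport --output step2.kraken \
    unclassified_1.fq unclassified_2.fq

Note that Bracken would then need to be run separately on each report against its own database, which is where the downstream effects on the abundance estimates come in.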

Same problem here. I ran the kraken2 database build using 40 cores (7 GB each), and after 24 hours the process had stalled at this point:

Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map already present, skipping map creation.
Estimating required capacity (step 2)...
Estimated hash table requirement: 75566900660 bytes
Capacity estimation complete. [37m21.355s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 16 bits reserved for taxid.

@MixalisSn do you think that the stalling could be due to limited memory?
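(If memory were the bottleneck, one option to test would be kraken2-build's --max-db-size, which downsamples minimizers to cap the hash table at a given number of bytes, e.g. below the ~75.6 GB estimate in your log. A sketch with a hypothetical database path:)

# cap the hash table at 60 GB (value in bytes); trades some sensitivity for memory
kraken2-build --build --db kraken2_db/ --threads 40 --max-db-size 60000000000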

@nick-youngblut I thought 120 GB would be enough. Anyway, I added the --fast-build flag, kept the same resources, and the build completed successfully.