leylabmpi/Struo2

Kraken2_build step stalling


Hi there,

I'm continuing to troubleshoot the db-update process for a kraken2 database, and I've hit a wall at the kraken2_build step. The pipeline doesn't throw any errors; it just continues to run indefinitely (12+ hours without failure or completion). It seems similar to the problem described here: DerrickWood/kraken2#428

So far, I've tried the workaround mentioned in the comments of that linked issue, adding the --fast-build flag to the kraken2 call in the db-update snakefile, but it doesn't seem to have solved the problem. Any chance you've seen this before and/or have any thoughts on what might be causing it? I should definitely have enough RAM: I'm using 28 cores with 16 GB per core.
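For reference, the edit amounts to appending --fast-build to the kraken2-build call in the snakefile. A minimal sketch of the resulting command (the database path and thread count here are placeholders; the actual Struo2 db-update rule fills these in from the config):

# sketch only: append --fast-build to the existing build command
kraken2-build --build \
    --db kraken2_db/ \
    --threads 28 \
    --fast-build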

Thanks!

I've (thankfully) never experienced that issue. How many genomes are included in the build?

The database to be updated is the full GTDB_release207, and the sample TSV I'm trying to add includes ~4,000 genomes

A related question: if we instead passed the reads left unclassified by the GTDB database into a second database (built via db-create with only the non-GTDB genomes), should that give results similar to a single combined database from the db-update workflow? There are methods for combining outputs for the same sample across different databases, though I imagine there could be downstream effects on the Bracken estimates.

The downside of a 2-step classification approach versus a 1-step approach is that there is no direct "competition" between the references during classification across the 2 steps. So, some reads could be falsely classified in the 1st step when, with the 2 reference databases combined, they would instead be assigned to a taxon from the 2nd database.
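For illustration, a rough sketch of the 2-step approach using kraken2's --unclassified-out option, assuming paired-end reads and hypothetical database/file names (gtdb_db/, custom_db/, reads_1.fq, reads_2.fq):

# Step 1: classify against the GTDB database, saving unclassified reads
# (the '#' in --unclassified-out is expanded to 1/2 for the read pairs)
kraken2 --db gtdb_db/ --threads 28 --paired \
    --unclassified-out unclassified#.fq \
    --report step1.kreport --output step1.kraken \
    reads_1.fq reads_2.fq

# Step 2: classify the leftovers against the custom (non-GTDB) database
kraken2 --db custom_db/ --threads 28 --paired \
    --report step2.kreport --output step2.kraken \
    unclassified_1.fq unclassified_2.fq

Note that Bracken would then need to be run separately on each report against its own database, which is where the downstream effects on the abundance estimates come in.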

Same problem here. I ran the kraken2 database build using 40 cores (7 GB each), and after 24 hours the process had stalled at this point:

Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map already present, skipping map creation.
Estimating required capacity (step 2)...
Estimated hash table requirement: 75566900660 bytes
Capacity estimation complete. [37m21.355s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 16 bits reserved for taxid.

@MixalisSn do you think that the stalling could be due to limited memory?
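(If memory were the bottleneck, one option to test would be kraken2-build's --max-db-size, which downsamples minimizers to cap the hash table at a given number of bytes, e.g. below the ~75.6 GB estimate in your log. A sketch with a hypothetical database path:)

# cap the hash table at 60 GB (value in bytes); trades some sensitivity for memory
kraken2-build --build --db kraken2_db/ --threads 40 --max-db-size 60000000000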

@nick-youngblut I thought 120 GB would be enough. Anyway, I added the --fast-build flag, kept the same resources, and the build completed successfully.