v3.2.4 stalls on very large dataset but v3.2.1 does not

Question

peter-kanvas opened this issue 4 months ago · 2 comments

I have a workflow that uses kmc to count all kmers in an extremely large dataset of about 250,000 fasta files. The workflow was originally built with v3.2.1 of kmc, but it stalled when I updated to v3.2.4. Unfortunately, it doesn't exit or report an error. Here are the details I can provide:

KMC call:

kmc -fm -ci0 -cx100000000000 -t94 -k75 -m745 @reference_list database databse_dir

Result:

The program spends some time printing * characters, and then it prints Stage 1: 0% before stalling. There are 511 bin files in the workdir. Htop shows no processor activity, but the commands are still listed.
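
Because the process neither exits nor reports an error, a pipeline has no signal that anything went wrong. One way to surface a hang is a wall-clock limit around the call. The following is only a minimal sketch: `run_with_timeout` is a hypothetical helper name, and picking a limit that comfortably exceeds a normal run on this dataset is left as an assumption.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run `cmd`; return its exit code, or None if it exceeded `timeout_s` seconds."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising,
        # so a hung run does not stay listed in htop afterwards.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a hung command: a child process that just sleeps.
    hung = [sys.executable, "-c", "import time; time.sleep(10)"]
    print(run_with_timeout(hung, timeout_s=0.5))
```

For the real case, `cmd` would be the KMC call above split into its arguments. A wall-clock limit cannot tell a true stall from a slow run, so it only helps when the expected runtime is roughly known.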

Before changing versions, I spent time trying to make sure that none of the fasta.gz files were corrupted.

gzip -t was clean for all genomes
py_fasta_validator did not indicate a problem with any of the fasta formatting
I ran kmc on each genome individually and it returned a result for all of them. It did fail on a few genomes at first, but those passed when I reran them; this could be because I used xargs to parallelize 94 at a time.
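
For anyone who wants to repeat the first check from inside a Python-based pipeline, the sketch below decompresses each file end to end, which is roughly what `gzip -t` verifies. It is an illustration only: `gzip_is_readable` is a hypothetical helper, and it says nothing about whether the FASTA content itself is well formed.

```python
import gzip
import zlib


def gzip_is_readable(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream at `path` decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data in chunks so memory use stays small.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # Covers bad headers, CRC mismatches, and truncated files.
        return False
    return True
```

Running this over the full list and collecting the paths that return False gives a list of files to re-download before counting.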

Answer 1 · 2024-07-05T09:17:17.000Z

Hello,

this sounds bad.
Is this data downloadable somehow, so that I could try to reproduce this bug?

Some ideas you could try to narrow it down:

use fewer threads and less RAM
check on some subset of input files, for example, if it also occurs on half of the files, a quarter of the files, and so on
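
The halving idea can be sketched in a few lines of Python. This is a rough illustration, not part of KMC: `bisect_failing_subset` and the `fails` callback are hypothetical names, and `fails` is assumed to run the tool on a list of files and return True when the stall shows up.

```python
from typing import Callable, List, Sequence


def bisect_failing_subset(
    files: Sequence[str],
    fails: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Shrink `files` by halving while one half still reproduces the failure."""
    current = list(files)
    while len(current) > 1:
        mid = len(current) // 2
        first, second = current[:mid], current[mid:]
        if fails(first):
            current = first
        elif fails(second):
            current = second
        else:
            # Neither half fails alone: the problem needs files from both
            # halves (or depends on total size), so stop shrinking here.
            break
    return current
```

If the stall is triggered by a single input file, this narrows the list down to that file. If it depends on the total amount of data, the loop stops early and returns the smallest set that still failed, which is still a more convenient reproduction case than the full dataset.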

I would really like to fix it because it seems to be quite a serious bug, but without reproducing this, it may be really challenging.

Answer 2 · 2024-07-05T15:28:19.000Z

The data is publicly available. They are all the genomes I could collect from the gtdb database via NCBI. I'm attaching two lists. One contains the ftp links I used to download all the genomes; they may or may not still be valid. The other is a subset of the genomes that I used when I encountered the error. You'll need about 351G of space to download all the genomes, and the final database ends up being about 4 TB. I'm working on an AWS EC2 instance (r5a.24xlarge) running Amazon Linux 2023. KMC was installed using mamba, and the call was made from within a snakemake pipeline, which I cannot share.

I've already moved past the problem and have to get to the downstream analysis. I'll try the changes you suggested the next time I run this pipeline (likely in a few weeks).

reference_genome_list_kan002_v3.txt.gz
gtdb_in_genbank_ftp_links.txt.gz