mikolmogorov/Flye

Nanopore Long Read Assembly: Stuck on Polishing Stage, No Eukaryotic Organism Represented

Tpowell7 opened this issue · 1 comment

Hello

I am currently trying to assemble metagenomes from 5.4 million nanopore reads with an average length of 4.4 kb and 27.2 billion base pairs in total. The initial stages seem to work and the output directories are created, but during the polishing step the process never gets past the following:

[2024-08-15 13:57:21] root: INFO: Resuming previous run
[2024-08-15 13:57:21] root: INFO: >>>STAGE: polishing
[2024-08-15 13:57:21] root: INFO: Polishing genome (1/1)
[2024-08-15 13:57:21] root: INFO: Running minimap2

I've let my assembly run for a few days as an sbatch job, but the job fails with "OUT_OF_MEMORY". Checking the job summary, I see that I reserved 80G of memory but the max memory used is only 55.48G:

flye --nano-hq LFDNA_nanopore.fastq -o flye_out --meta --resume -t 20
slurmstepd: error: Detected 1 oom_kill event in StepIdName : nplongread
User :
Account :
Partition : med
Nodes : cpu-10-75
Cores : 10
GPUs : 0
State : OUT_OF_MEMORY
ExitCode : 0:125
Submit : 2024-08-12T23:47:18
Start : 2024-08-12T23:47:59
End : 2024-08-14T23:04:34
Reserved walltime : 3-00:00:00
Used walltime : 1-23:16:35
Used CPU time : 19-00:34:47
% User (Computation): 99.09%
% System (I/O) : 0.91%
Mem reserved : 80G
Max Mem used : 55.48G (cpu-10-75)
Max Disk Write : 154.92G (cpu-10-75)
Max Disk Read : 726.61G (cpu-10-75)

In addition, I've used flye before on prokaryotic organisms and it seemed to work, but I am looking at eukaryotic diatoms now. When a previous assembly was made from these reads, binning did not produce any diatom species. However, when I check overrepresented sequences during QC of the reads, diatom-related genes make up most of them. I also know for sure that diatoms make up the majority of the sample. Does flye just need more time, or am I missing something here?

This is most likely caused by running out of memory, as the error message suggests. The reported "Max Mem used" may not reflect short spikes in memory usage, so the job can be killed even though the recorded peak is below the 80G reservation. I suggest rerunning with more RAM.
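
As a rough sketch of what rerunning with more RAM could look like, the snippet below writes a Slurm batch script that repeats the same Flye command with a larger memory reservation. The job name, partition, and walltime are copied from the job summary above. The 200G figure is only a placeholder and not a Flye recommendation, the file name `flye_polish.sbatch` is made up for illustration, and requesting 20 CPUs to match `-t 20` is an assumption (the summary above shows 10 cores).

```shell
#!/bin/bash
# Sketch: write a Slurm batch script that reruns the same Flye command
# with a larger memory reservation.
set -euo pipefail

MEM_GB=200   # hypothetical value; choose what your nodes can provide
THREADS=20   # matches -t 20 in the original command

# Unquoted EOF so that ${THREADS} and ${MEM_GB} are filled in now.
cat > flye_polish.sbatch <<EOF
#!/bin/bash
#SBATCH --job-name=nplongread
#SBATCH --partition=med
#SBATCH --cpus-per-task=${THREADS}
#SBATCH --mem=${MEM_GB}G
#SBATCH --time=3-00:00:00

flye --nano-hq LFDNA_nanopore.fastq -o flye_out --meta --resume -t ${THREADS}
EOF

echo "Wrote flye_polish.sbatch"
```

Submit it with `sbatch flye_polish.sbatch`. Because the command keeps `--resume` and the same output directory, Flye should pick up from the polishing stage rather than start over. Once the job finishes, `sacct -j <jobid> --format=JobID,State,ReqMem,MaxRSS,Elapsed` shows the requested and recorded peak memory side by side. Keep in mind that the recorded peak is sampled periodically, so it can miss brief spikes.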