mikolmogorov/Flye

Correcting bubbles step crash

mprous1 opened this issue · 11 comments

For me, Flye 2.9.3 or 2.9.4 usually fails at the 'Correcting bubbles' step on Linux Mint, regardless of whether I compiled it myself or installed it through Miniconda (installation in the default Miniconda environment apparently does not work because of Python 3.11).

flye --nano-corr nuclear.fq -t 16 -o flye_out

[2024-05-19 22:31:22] root: INFO: Correcting bubbles
[2024-05-19 22:32:34] root: ERROR: Command '['flye-modules', 'polisher', '--bubbles', '/home/mprous/nanopore/flye_out/40-polishing/bubbles_1.fasta', '--subs-mat', '/home/mprous/miniconda3/envs/medaka/lib/python3.10/site-packages/flye/config/bin_cfg/nano_r94_substitutions.mat', '--hopo-mat', '/home/mprous/miniconda3/envs/medaka/lib/python3.10/site-packages/flye/config/bin_cfg/nano_r94_g36_homopolymers.mat', '--out', '/home/mprous/nanopore/flye_out/40-polishing/consensus_1.fasta', '--threads', '16']' died with <Signals.SIGSEGV: 11>.
[2024-05-19 22:32:34] root: ERROR: Pipeline aborted

It does not always happen; smaller datasets usually seem to work. I'm not sure where the threshold is, maybe around 20-30 GB of fastq data. The genomes are around 200-400 Mb with 20-60x coverage and usually haploid.

I have no idea what the problem could be; the computer should have enough resources: 32 cores, 128 GB RAM (peak RAM usage is usually less than half of that).

Sorry for my late response! Could it be related to #584? Are you using error-corrected reads? Does the error ever happen with reads that have not been error-corrected?

If the error does happen, is it reproducible? I.e., if it crashes and you restart using the --resume option, does it crash again?
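I.e., rerun the same command with the --resume flag added, something like this (based on your command above):

flye --nano-corr nuclear.fq -t 16 -o flye_out --resume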

I haven't observed similar errors on my end, so I'd need an example that consistently reproduces the error to work on that.

Not sure if it is related to #584. It happens often on my computer, but it never happened (for the same dataset) with Flye 2.9-b1768 on a server, also using --nano-corr. The reads have not been corrected in any way; they are R10.4.1 nanopore reads basecalled with Dorado 0.5 and now 0.7 (1-10% duplex reads). I'm using --nano-corr because it perhaps produces more contiguous assemblies (?).

--resume did not help; it died at the same step, I think... In one instance I remember that with --resume it strangely produced a genome twice the expected size without a decrease in coverage (though I'm not sure if it died at a different step that time).

I need to check whether it would crash at the same step with --nano-hq.
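I.e., something like this (the output directory name here is just an example):

flye --nano-hq nuclear.fq -t 16 -o flye_out_hq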

I did more tests with earlier Flye versions on the same dataset (2.9 and 2.9.2 installed from Bioconda), but got the same result: a crash at the bubble correction step. There was no difference between --nano-corr and --nano-hq, and --resume did not help (it starts running minimap2 and then crashes at the same step). The error message is the same every time.

But the same dataset works without problems on a server with Flye 2.9 in its own environment, and I would assume the newer versions would too. Could there be some software conflict?

I can nevertheless send a dataset (14 GB) for testing, but I'm not sure there is much point if the issue is more likely the software environment.

I think it may be specific to your hardware/environment then. Could you give more info about both your personal machine and the server, e.g. processor type and OS? Also, the output of cat /proc/cpuinfo from both systems would be helpful.

On your personal machine, if you build from source instead of using Bioconda, does it still crash?
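For reference, a from-source build is roughly the following (assuming a standard GCC toolchain; see the repository README for the exact requirements):

git clone https://github.com/mikolmogorov/Flye
cd Flye
make
bin/flye --version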

My computer where Flye crashes:
Operating system: Linux Mint 21.3 Virginia (base: Ubuntu 22.04 jammy), kernel 5.15.0-107-generic x86_64, compiler gcc 11.4.0, desktop Cinnamon 6.0.4 (GTK 3.24.33, wm: muffin, dm: LightDM 1.30.0)

Hardware: 24-core (8-mt/16-st) Intel Core i9-14900KF, 128 GB RAM, NVIDIA GeForce RTX 4090 (driver: NVIDIA 535.171.04)

I'm away for the next three weeks, so I can't access the computer to check the output of cat /proc/cpuinfo.

Yes, building flye from source on my computer produces the same error.

The computing cluster where it works:
https://hpc.ut.ee/services/HPC-services/Rocket

The cpuinfo from the Rocket cluster is attached, although it might not be from the same node that Flye ran on.
cpu_UTHPC.txt

Thanks for the info; I don't see anything unusual, though. 128 GB should be more than enough for the bubble correction step. It would be hard to fix without reproducing the issue, but since the same input works fine on the server, it is more likely somehow tied to your machine.

Could you also send the full flye.log of a failed run?

If you are interested in digging in, it would be great to have a stack trace of the crash. First, try running the command manually, e.g.

flye-modules polisher --bubbles /home/mprous/nanopore/flye_out/40-polishing/bubbles_1.fasta --subs-mat /home/mprous/miniconda3/envs/medaka/lib/python3.10/site-packages/flye/config/bin_cfg/nano_r94_substitutions.mat --hopo-mat /home/mprous/miniconda3/envs/medaka/lib/python3.10/site-packages/flye/config/bin_cfg/nano_r94_g36_homopolymers.mat --out /home/mprous/nanopore/flye_out/40-polishing/consensus_1.fasta --threads 16

If it still fails (it should), run it under the GDB debugger (gdb --args FLYE_CMD) and then type run. It should then produce a stack trace. If the stack trace is obfuscated, you'll need to rebuild using make debug.
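Once GDB stops at the segfault, you can print the full backtrace of the crashing thread, or of all threads, at the (gdb) prompt:

(gdb) bt
(gdb) thread apply all bt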

Sorry for the slow response.

I ran
flye-modules polisher --bubbles /home/mprous/nanopore/flye_out/40-polishing/bubbles_1.fasta --subs-mat /home/mprous/Flye/flye/config/bin_cfg/nano_r94_substitutions.mat --hopo-mat /home/mprous/Flye/flye/config/bin_cfg/nano_r94_g36_homopolymers.mat --out /home/mprous/nanopore/flye_out/40-polishing/consensus_1.fasta --threads 16

which finished 100% without any error messages.

I thought I could then resume, but it started again with minimap2 polishing and bubble correction, and failed again.
flye.log

Log file is attached.

I then ran
gdb --args flye-modules polisher --bubbles /home/mprous/nanopore/flye_out/40-polishing/bubbles_1.fasta --subs-mat /home/mprous/Flye/flye/config/bin_cfg/nano_r94_substitutions.mat --hopo-mat /home/mprous/Flye/flye/config/bin_cfg/nano_r94_g36_homopolymers.mat --out /home/mprous/nanopore/flye_out/40-polishing/consensus_1.fasta --threads 16
run

This produced the following output:

GNU gdb (Ubuntu 12.1-0ubuntu1~22.04.2) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
https://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from flye-modules...
(gdb) run
Starting program: /home/mprous/Flye/bin/flye-modules polisher --bubbles /home/mprous/nanopore/flye_out/40-polishing/bubbles_1.fasta --subs-mat /home/mprous/Flye/flye/config/bin_cfg/nano_r94_substitutions.mat --hopo-mat /home/mprous/Flye/flye/config/bin_cfg/nano_r94_g36_homopolymers.mat --out /home/mprous/nanopore/flye_out/40-polishing/consensus_1.fasta --threads 16
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff7a22640 (LWP 32094)]
[New Thread 0x7ffff7221640 (LWP 32095)]
0% [New Thread 0x7ffff6a20640 (LWP 32096)]
[New Thread 0x7ffff621f640 (LWP 32097)]
[New Thread 0x7ffff5a1e640 (LWP 32098)]
[New Thread 0x7ffff521d640 (LWP 32099)]
[New Thread 0x7ffff4a1c640 (LWP 32100)]
[New Thread 0x7fffd7fff640 (LWP 32101)]
[New Thread 0x7fffd77fe640 (LWP 32102)]
[New Thread 0x7fffd6ffd640 (LWP 32103)]
[New Thread 0x7fffd67fc640 (LWP 32104)]
[New Thread 0x7fffd5ffb640 (LWP 32105)]
[New Thread 0x7fffd57fa640 (LWP 32106)]
[New Thread 0x7fffd4ff9640 (LWP 32107)]
[New Thread 0x7fffbffff640 (LWP 32108)]
[New Thread 0x7fffbf7fe640 (LWP 32109)]
10%
Thread 6 "flye-modules" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5a1e640 (LWP 32098)]
Alignment::addInsertion (this=this@entry=0x7ffff5a1d9c0, pos=pos@entry=20, base=base@entry=67 'C', reads=std::vector of length 30, capacity 30 = {...}) at polishing/alignment.cpp:142
142 maxVal = std::max(maxVal, sum);
(gdb)

I then ran flye-modules polisher again without gdb --args, but this time it failed: 0% 10% 20% 30% Segmentation fault (core dumped).

I don't know if any of this helps.

Thank you! That's helpful; indeed, it looks like a bug. It's hard to tell where exactly, though: this line causes a memory reallocation, but the memory might have been corrupted elsewhere. Any chance you could share the bubbles_1.fasta file?

Thank you! I tried this input on my server in debug mode with the memory sanitizer and it did not produce any errors. Unfortunately, I'm not sure how to proceed; it is likely something specific to a combination of hardware and system libraries.
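One more thing you could try on your machine, if you're willing: running the failing command under Valgrind (no rebuild needed, although it will be much slower) might catch the corruption closer to its origin, e.g.:

valgrind --track-origins=yes flye-modules polisher --bubbles /home/mprous/nanopore/flye_out/40-polishing/bubbles_1.fasta --subs-mat /home/mprous/Flye/flye/config/bin_cfg/nano_r94_substitutions.mat --hopo-mat /home/mprous/Flye/flye/config/bin_cfg/nano_r94_g36_homopolymers.mat --out /home/mprous/nanopore/flye_out/40-polishing/consensus_1.fasta --threads 1

(--threads 1 because Valgrind serializes threads anyway, and a single-threaded trace is easier to read.)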

OK, thanks for looking into this. Maybe I have to update or change some software; maybe Linux Mint is the problem. I also have issues with the whole system freezing completely, particularly with Medaka (never with Flye).