mikolmogorov/Flye

Error parsing bubbles file during polishing

Closed this issue · 8 comments

Hi,

I get the following error during the polishing step:

flye --nano-hq lr.fastq.gz -o flye.hap -t 64 --keep-haplotypes --scaffold --asm-coverage 50 --genome-size 3.1g --resume

[2023-03-27 09:54:55] INFO: Starting Flye 2.9.1-b1780
[2023-03-27 09:54:55] INFO: Resuming previous run
[2023-03-27 09:54:55] INFO: >>>STAGE: polishing
[2023-03-27 09:54:55] INFO: Polishing genome (1/1)
[2023-03-27 09:54:55] INFO: Running minimap2
[2023-03-27 14:48:58] INFO: Separating alignment into bubbles
[2023-03-27 14:50:07] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:08] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:08] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:08] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:08] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:08] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:08] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:08] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:08] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:08] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:09] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:09] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:09] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:09] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:09] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:09] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:09] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:09] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:09] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:10] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:10] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:10] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:10] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:10] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:10] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:10] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:10] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:10] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:10] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:11] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:11] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:11] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:11] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:11] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:11] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:12] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:12] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:12] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:12] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:13] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:13] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:13] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:14] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:14] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:14] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:15] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:15] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:15] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:15] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:15] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:17] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:17] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:17] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:18] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:19] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:19] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:20] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:21] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:23] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:24] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:26] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:29] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:30] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 14:50:30] WARNING: Input contain non-ACGT characters - they will be converted to arbitrary ACGTs
[2023-03-27 16:35:16] INFO: Alignment error rate: 0.002858
[2023-03-27 16:35:16] INFO: Correcting bubbles
0% 10% 20% terminate called after throwing an instance of 'std::runtime_error'
  what():  Error parsing bubbles file
[2023-03-27 18:09:31] ERROR: Command '['flye-modules', 'polisher', '--bubbles', 'flye.hap/40-polishing/bubbles_1.fasta', '--subs-mat', 'flye/config/bin_cfg/nano_r94_substitutions.mat', '--hopo-mat', 'flye/config/bin_cfg/nano_r94_g36_homopolymers.mat', '--out', 'flye.hap/40-polishing/consensus_1.fasta', '--threads', '64']' died with <Signals.SIGABRT: 6>.
[2023-03-27 18:09:31] ERROR: Pipeline aborted

The error first occurred on a machine with 48 threads and 386G of RAM allocated to the job. I resumed from the last completed stage (see log above) on a machine with 64 threads and 480G of RAM allocated to the job, but I got the same error. I've run the exact same command many times on similar data and never had this issue before. When inspecting the 40-polishing sub-directory, the minimap2 log shows no issue. The bubbles file is quite large: 358GB.

Thank you for your help,
Guillaume

This may have something to do with the warning in the log. Flye does not expect non-ACGT FASTA characters in general and may not handle them correctly. Do you think you may have those in your reads? Did you get this warning earlier in the run as well? If you don't expect these characters but see the warning, this may be a FASTA/FASTQ formatting error. The Flye parser is supposed to catch those, but maybe not in 100% of cases. Otherwise, I've never seen this error in a released version.
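For anyone who wants to check whether their reads contain non-ACGT characters before running Flye, here is a minimal sketch. It is not part of Flye; the function names are made up for illustration, and it assumes strict 4-line FASTQ records (no wrapped sequence lines).

```python
import gzip
from collections import Counter
from itertools import islice


def count_non_acgt(fastq_lines):
    """Count characters other than A/C/G/T (case-insensitive) in FASTQ sequences.

    Assumption: every record is exactly 4 lines (header, sequence, '+', quality).
    """
    counts = Counter()
    # The sequence is the 2nd line of each 4-line record: indices 1, 5, 9, ...
    for seq in islice(fastq_lines, 1, None, 4):
        for ch in seq.strip().upper():
            if ch not in "ACGT":
                counts[ch] += 1
    return counts


def count_non_acgt_in_file(path):
    """Hypothetical convenience wrapper: handles plain or gzip-compressed FASTQ."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_non_acgt(handle)
```

Calling `count_non_acgt_in_file("lr.fastq.gz")` would return a per-character tally, for example how many `N` or other IUPAC codes are present. An empty result would suggest the warning comes from something other than the read sequences themselves.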

I probably should have mentioned this earlier, but my input long reads are Illumina-corrected ONT (R9.4) reads. I don't use the --nano-corr preset, as it was giving me much worse results than --nano-hq. Because the reads are Illumina-corrected, they do contain non-ACGT characters, so I had this warning on the first run too. I have run Flye on similar data for 40+ other genomes and none had this issue :(

I just tried Flye 2.9.2 and got exactly the same error at the same location.

@GuillaumeHolley thanks. I will need an input example that reproduces the problem to fix it. Can you come up with a BAM file that could be provided as input to --polish-target and results in the crash? You mentioned that the input is quite large, so it would be helpful if you could narrow it down to a smaller example. E.g., you can split the BAM into two roughly equal parts, keep the part that still gives you an error, and so on.

Misha
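The halving strategy suggested above can be sketched generically. This is only an illustration: `narrow_failing_input` and the `fails` callback are hypothetical names, and in practice `fails` would have to write the subset (for example a list of contigs or reads) to a BAM file and rerun the polishing step on it.

```python
def narrow_failing_input(items, fails):
    """Repeatedly halve `items`, keeping a half for which `fails(half)` is True.

    `fails` is a user-supplied predicate (hypothetical here) that reruns the
    tool on a subset and reports whether the crash is reproduced.
    Stops when one item remains or when neither half fails on its own.
    """
    current = list(items)
    while len(current) > 1:
        mid = len(current) // 2
        left, right = current[:mid], current[mid:]
        if fails(left):
            current = left
        elif fails(right):
            current = right
        else:
            # The failure needs records from both halves (or is not
            # deterministic), so halving cannot shrink the input further.
            break
    return current
```

The search needs roughly log2(n) reruns when a single record triggers the crash. If the crash depends on the input as a whole, the loop stops early and returns the smallest subset that still failed.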

Just wanted to check if you were able to fix the problem?

Hi @fenderglass,

Unfortunately not. The problem is that even though the issue has occurred on multiple occasions, it is still a rare occurrence. Any attempt to reproduce the issue using a smaller input resulted in Flye finishing the job as expected.

Very strange. But is the whole (problematic) run failing consistently? Could it be that you have disk data corruption somewhere?
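One rough way to look for corruption is to checksum the input files and compare the result against a checksum taken from the original copy of the data. The sketch below is a generic SHA-256 helper, not something Flye provides; the function names and the chunk size are arbitrary choices.

```python
import hashlib


def sha256_of_chunks(chunks):
    """Return the SHA-256 hex digest of an iterable of bytes objects."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so very large files never sit in memory."""
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        return sha256_of_chunks(iter(lambda: handle.read(chunk_size), b""))
```

A mismatch between the two copies would point to corruption in storage or transfer. A match only shows the copies are identical and does not rule out a problem in files written later, such as the bubbles file.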

Closing the thread because of inactivity. If the issue is still unresolved, feel free to reopen!