yechengxi/DBG2OLC

segmentation fault

jaworskicoline opened this issue · 5 comments

Hello,
I have been using DBG2OLC for a while and had never encountered any problems until now.
I am now getting a segmentation fault (apparently during the extension step) with two new input files:
Command:
DBG2OLC k 17 KmerCovTh 2 MinOverlap 20 AdaptiveTh 0.005 LD1 0 MinLen 200 Contigs /path-to-shortread-assembly/Platanus_contigs.fa RemoveChimera 1 f /path-to-longreads/nanopore_raw.fasta
STDERR file:
/pbs/ocelote/i0n3/mom_priv/jobs/1878034.head1.cm.cluster.SC: line 26: 26147 Segmentation fault DBG2OLC k 17 KmerCovTh 2 MinOverlap 20 AdaptiveTh 0.005 LD1 0 MinLen 200 Contigs /path-to-shortread-assembly/Platanus_contigs.fa RemoveChimera 1 f /path-to-longreads/nanopore_raw.fasta
LOG file:
Loading contigs.
129303309 k-mers in round 1.
118727528 k-mers in round 2.
Analyzing reads...
File1: /path-to-longreads/nanopore_raw.fasta
Long reads indexed.
Total Kmers: 7571672670
Matching Unique Kmers: 1651108081
Compression time: 5243 secs.
Scoring method: 3
Match method: 2
Loading long read index
Loading file: ReadsInfoFrom_nanopore_raw.fasta
1491967 reads loaded.
Average size: 8
Loaded.
1491967 reads.
Calculating reads overlaps, round 1
Multiple alignment for error correction.
1000000 sequences aligned.
Avg alignment size: 12
total alignments: 46693535
Avg alignment size: 16
Avg sparse alignment size: 3
total alignments: 104831203
Loading file: CleanedLongReads.txt
1398904 reads loaded.
Average size: 6
Done.
MSA time: 2414 secs.
1000000 reads aligned.
Avg alignment size: 9
total alignments: 1676627
Avg alignment size: 15
Avg sparse alignment size: 4
2323906 alignments calculated.
Round 1 takes 2479 secs.
Calculating reads overlaps, round 2
1000000 reads aligned.
Avg alignment size: 25
total alignments: 138384
Avg alignment size: 43
Avg sparse alignment size: 4
363757 alignments calculated.
Round 2 takes 18 secs.
1371784 contained out of 1398904
1121 tips in the graph.
Graph simplification.
Iteration: 0
2766 branching positions.
4953 linear nodes.
Iteration: 1
1298 branching positions.
5853 linear nodes.
Iteration: 2
233 branching positions.
6689 linear nodes.
Iteration: 3
211 branching positions.
6696 linear nodes.
2263 edges deleted.
70538 chimeric reads deleted.
44 bad nodes removed in aggrassive cleaning.
615 tips in the graph.
Loading contigs.
Collecting information for consensus.
1398904 reads.
Calculating reads overlaps.
1000000 reads aligned.
Avg alignment size: 35
Avg sparse alignment size: 2
total alignments: 1744676
Avg alignment size: 42
Avg sparse alignment size: 3
15484865 alignments calculated.
526 secs.
Loading non-contained sequences.
77857 loaded.
frag sum: 423189042
offset sum: 157479740
Extension warning.
Extension warning.
Extension warning.
Extension warning.
Extension warning.
Extension warning.


I first suspected that the long-read file was causing the problem, since DBG2OLC completes successfully when I use an alternative file instead. It also completes successfully when I split my nanopore_raw.fasta into multiple files and feed each one to DBG2OLC (obviously the output makes little sense with only a portion of the raw reads). So I thought it might be a memory problem?
Then I saw that others had segmentation fault problems and that you recommended checking the short-read assembly file. I used Platanus, which worked well every other time I have used it (I'm not sure why it would cause formatting problems with this new data set). I also tried SparseAssembler on the new data set, but DBG2OLC still ended with a segmentation fault.

Do you have any other ideas that I could try to solve the problem?
Thank you so much for your advice.
Coline Jaworski

After you split nanopore_raw.fasta into multiple files, did you feed all of them to the program? If so, the program is supposed to output the same result.

However, I am not sure whether very long reads cause new problems for the program: when it was developed, nanopore reads were relatively short.
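To check whether unusually long reads are actually present in the input, the read lengths can be summarized before running the assembler. The following is a minimal sketch, not part of DBG2OLC; the function names and the `long_cutoff` value are hypothetical and chosen only for illustration.

```python
def fasta_lengths(handle):
    """Yield the sequence length of each record in a FASTA stream."""
    length = None
    for line in handle:
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line.strip())
    if length is not None:
        yield length


def n50(lengths):
    """Return the length L such that reads of length >= L hold half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for value in ordered:
        running += value
        if running >= half:
            return value
    return 0


def summarize(lengths, long_cutoff=100_000):
    """Summarize read lengths; long_cutoff is an arbitrary, assumed threshold."""
    lengths = list(lengths)
    return {
        "reads": len(lengths),
        "max": max(lengths, default=0),
        "n50": n50(lengths),
        "over_cutoff": sum(1 for x in lengths if x > long_cutoff),
    }
```

Calling `summarize(fasta_lengths(open("nanopore_raw.fasta")))` on the failing file and on a file that completes would show whether the two differ in maximum read length or in the number of very long reads.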

No, I fed them to the program separately (again, the output would not make much sense, but it was a test to see whether the program rejected the file). I cut the raw nanopore file in half, and neither half completed. I cut it into 10 parts, and all ten completed.
I also tried splitting the contig assembly. Split in half, only one half completed; I had to split the other half twice more before every fragment completed.
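The splitting test described above could be scripted roughly as follows. This is only a sketch under assumptions: it distributes FASTA records round-robin across output files, which may not match how the files were actually cut, and the `.partN` output names are hypothetical.

```python
from contextlib import ExitStack


def split_fasta_round_robin(in_handle, out_handles):
    """Distribute FASTA records across out_handles in round-robin order."""
    current = -1  # no record seen yet; lines before the first header are skipped
    for line in in_handle:
        if line.startswith(">"):
            current = (current + 1) % len(out_handles)
        if current >= 0:
            out_handles[current].write(line)


def split_file(path, n_parts):
    """Write path.part0 .. path.part{n_parts-1} (hypothetical naming)."""
    with ExitStack() as stack:
        src = stack.enter_context(open(path))
        outs = [
            stack.enter_context(open(f"{path}.part{i}", "w"))
            for i in range(n_parts)
        ]
        split_fasta_round_robin(src, outs)
```

Round-robin assignment keeps the parts similar in size and composition, so a failure in only one part would point at specific records rather than at overall input size.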

I'm not sure what to conclude from that, except that the file formats look OK. I feel it may still be related to memory, and/or to handling too many contigs on the very long nanopore reads?

What would be the smartest way to solve my problem? I can't see how fragmenting the input files and concatenating the outputs would work, because we can't know beforehand which contigs and nanopore reads belong together. Am I wrong?

Would cutting the very long nanopore reads into pieces provide a solution? It would mean losing some of the benefit of using nanopore data, but if I cut them so that consecutive pieces overlap at their ends, maybe the program would be able to join them back together?
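The idea of cutting long reads into overlapping pieces can be sketched as below. This is an illustration only: `max_len`, `overlap`, and the `_pos<start>` suffix added to read names are assumptions, and nothing here guarantees that DBG2OLC will rejoin the pieces or that the segmentation fault goes away.

```python
def chunk_sequence(seq, max_len=20000, overlap=2000):
    """Yield (start, piece) windows of at most max_len bases.

    Consecutive windows share `overlap` bases; the last one may be shorter.
    """
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    if len(seq) <= max_len:
        yield 0, seq
        return
    step = max_len - overlap
    start = 0
    while True:
        yield start, seq[start:start + max_len]
        if start + max_len >= len(seq):
            break
        start += step


def read_fasta(handle):
    """Yield (name, sequence) for each record in a FASTA stream."""
    name, parts = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            fields = line[1:].split()
            name, parts = (fields[0] if fields else "unnamed"), []
        elif line and name is not None:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def split_long_reads(in_handle, out_handle, max_len=20000, overlap=2000):
    """Write every read as one or more overlapping pieces."""
    for name, seq in read_fasta(in_handle):
        for start, piece in chunk_sequence(seq, max_len, overlap):
            out_handle.write(f">{name}_pos{start}\n{piece}\n")
```

Reads no longer than `max_len` pass through unchanged apart from the `_pos0` suffix, so only the longest reads are affected.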

Happy to hear what you think, and thank you very much for your feedback!

Cutting very long nanopore reads could be one solution, if it works. It would definitely lead to a worse N50, though.

Do you mind sharing your data with me so I can try to debug it?

I'd have to check whether I can share it, because the data is not mine. I'm using a final merging step, so I guess it might join those cut long reads back together. Do you think the problem occurs because too many contigs have to be handled simultaneously on the very long reads? If so, would some kind of trimming of the contig assembly be a solution?

I don't think trimming the contig assembly could help.