TimoLassmann/kalign

Segmentation fault (core dumped) - 10 million sequence dataset

Issue closed · 5 comments

Hi TimoLassmann/kalign,
I hope this email finds you well. As mentioned earlier, I am working on large datasets of SARS-CoV-2 sequences. My current dataset has ~10 million sequences, and I am again getting a segmentation fault during alignment. The same run worked fine with ~1 million sequences.

[2022-10-27 10:45:32] : LOG : reading fasta
[2022-10-27 10:56:22] : LOG : Detected protein sequences.
[2022-10-27 10:57:53] : LOG : CPU Time: 932.92u 00:15:32.91 Elapsed: 00:15:33.00
[2022-10-27 10:57:53] : LOG : Detected: 10842878 sequences.
[2022-10-27 10:57:56] : LOG : Calculating pairwise distances
[2022-10-27 11:23:25] : LOG : CPU Time: 3737.16u 01:02:17.15 Elapsed: 00:25:29.00
[2022-10-27 11:23:25] : LOG : 32 anchors
[2022-10-27 11:23:25] : LOG : Building guide tree.
[2022-10-27 11:56:22] : LOG : CPU Time: 14340.20u 03:59:00.19 Elapsed: 00:32:57.00
[2022-10-27 11:58:47] : LOG : Aligning
Segmentation fault (core dumped)

Hi,
I never aligned that many sequences! Would it be possible for you to share the input file with me?
Thanks, T

Great - send me the link.

The input file contained a handful of sequence entries without a sequence. To address this, kalign now runs some basic checks on the input before the alignment steps.
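
For anyone who wants to screen a large FASTA file for this problem before running an alignment, here is a minimal sketch in Python. It is not kalign's own check (kalign is written in C, and its validation code is not shown in this thread); the function name `find_empty_records` is hypothetical, and the sketch assumes a plain FASTA layout where each record starts with a `>` header line.

```python
from typing import Iterable, List


def find_empty_records(lines: Iterable[str]) -> List[str]:
    """Return the headers of FASTA records that contain no sequence characters.

    Hypothetical helper for illustration; not part of kalign.
    """
    empty = []
    header = None  # header of the record currently being read
    length = 0     # sequence characters seen so far for that record
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no sequence data
        if line.startswith(">"):
            # A new header closes the previous record; report it if it was empty.
            if header is not None and length == 0:
                empty.append(header)
            header = line[1:]
            length = 0
        elif header is not None:
            length += len(line)
    # The last record has no following header, so check it separately.
    if header is not None and length == 0:
        empty.append(header)
    return empty


example = [">seq1", "ACGT", ">seq2", ">seq3", "", ">seq4", "AC", "GT"]
print(find_empty_records(example))  # ['seq2', 'seq3']
```

Because the function only iterates over lines, an open file handle can be passed in directly, so a file with millions of records is scanned as a stream without loading it all into memory.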