algbio/themisto

Segmentation fault when building without the -c argument

PandorasRed opened this issue · 11 comments

Hello,

I tried to build a Themisto index with this command:

themisto build -k 31 -m 100000 --input-file data.fa --index-prefix data_index --temp-dir tmp --n-threads 32

and the program crashed with a segmentation fault.

Hi,

Thanks for the report. How large is the input file? Is it possible to attach it here? If not, would it be possible to get more information such as the crash backtrace from gdb?
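For readers who have not captured a backtrace before, here is a minimal sketch of one way to do it. It only assembles a gdb command line around the build command from the report; `--batch`, `-ex` and `--args` are standard gdb options, while the helper name `gdb_backtrace_command` is made up for this illustration. It assumes gdb is installed and that `themisto` is on the PATH.

```python
import shlex

def gdb_backtrace_command(program_args):
    """Wrap a command so gdb runs it non-interactively and prints a backtrace when it stops."""
    # --batch: exit after the commands finish; -ex: run a gdb command;
    # --args: everything after it is the program and its arguments.
    return ["gdb", "--batch", "-ex", "run", "-ex", "bt", "--args", *program_args]

# The build command from the original report.
themisto_args = [
    "themisto", "build", "-k", "31", "-m", "100000",
    "--input-file", "data.fa", "--index-prefix", "data_index",
    "--temp-dir", "tmp", "--n-threads", "32",
]

command = gdb_backtrace_command(themisto_args)
# Print the command so it can be pasted into a shell,
# or pass `command` to subprocess.run() to execute it directly.
print(shlex.join(command))
```

A binary compiled with debug symbols gives a more detailed backtrace, but even a stripped build usually shows the function names involved.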

The file is ~12 GB, and GDB gives this backtrace:

#0 0x0000000000526f05 in Coloring::mark_redundant_color_sets(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, BOSS<sdsl::int_vector<(unsigned char)1>, sdsl::wt_pc<sdsl::huff_shape, sdsl::int_vector<(unsigned char)1>, sdsl::rank_support_v<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)0, (unsigned char)1>, sdsl::byte_tree<false> > >&) ()
#1 0x000000000052b378 in Coloring::add_colors(BOSS<sdsl::int_vector<(unsigned char)1>, sdsl::wt_pc<sdsl::huff_shape, sdsl::int_vector<(unsigned char)1>, sdsl::rank_support_v<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)1, (unsigned char)1>, sdsl::select_support_mcl<(unsigned char)0, (unsigned char)1>, sdsl::byte_tree<false> > >&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<long, std::allocator<long> >, long, long, long) ()
#2 0x000000000051ff92 in build_index_main(int, char**) ()
#3 0x00000000004abde7 in main ()
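The template arguments make the frames above hard to scan. As a reading aid, the sketch below reduces each gdb frame to its number, address and function name. The regular expression and the helper name `summarize_frames` are assumptions made for this illustration, and the long argument lists of frames #0 and #1 are abbreviated as `...` in the sample text.

```python
import re

# Frames copied from the backtrace above; the long template argument
# lists of frames #0 and #1 are abbreviated as "..." for readability.
BACKTRACE = """\
#0 0x0000000000526f05 in Coloring::mark_redundant_color_sets(...) ()
#1 0x000000000052b378 in Coloring::add_colors(...) ()
#2 0x000000000051ff92 in build_index_main(int, char**) ()
#3 0x00000000004abde7 in main ()
"""

# "#<n> <address> in <qualified name>(" ; the name stops at the first parenthesis.
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+)\s+in\s+([\w:~]+)\s*\(")

def summarize_frames(text):
    """Return (frame number, address, function name) for each gdb frame line."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append((int(match.group(1)), match.group(2), match.group(3)))
    return frames

frames = summarize_frames(BACKTRACE)
for number, address, function in frames:
    print(f"#{number} {function} @ {address}")
```

Read from the bottom up, the call chain is main, build_index_main, Coloring::add_colors, Coloring::mark_redundant_color_sets, so the crash happens inside the coloring step of the build.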

Would it be possible to get the log for the run?

Also, could you provide some details such as:
Does the program work with other inputs, such as coli3.fna in the example_data folder?
Did you manage to run the program successfully if you provided a colorfile?
What compiler did you use to compile the program?
What operating system are you running on?

I'm currently running it again because I forgot to log it, so I'll come back with the log when it has finished.

I haven't tried providing a colorfile, but with the --no-color option it works correctly (I extracted the unitigs afterwards and they appear to be correct).
I also just tried the example data and it works fine, and it still works fine when I remove the -c option.

I'm using GCC 11.3.0 on a Unix server.

Hello,

Here is the log that Themisto printed:

0.0520 Wed Aug 31 10:16:13 2022 Themisto-v.2.1.0-13-g9959404
0.0690 Wed Aug 31 10:16:13 2022 Maximum k-mer length (size of the de Bruijn graph node labels): 31
0.0880 Wed Aug 31 10:16:13 2022 Build configuration:
Input file = threshold1HG002.hap1.Google.fa.unitigs.fa
Input format = fasta
Index de Bruijn graph output file = data_index.tdbg
Index coloring output file = data_index.tcolors
Temporary directory = tmp
k = 31
Number of threads = 32
Memory megabytes = 100000
User-specified colors = false
Load DBG = false
Handling of non-ACGT characters = delete
Verbosity = normal
0.1020 Wed Aug 31 10:16:13 2022 Starting
0.1100 Wed Aug 31 10:16:13 2022 Assigning colors
34.3950 Wed Aug 31 10:16:47 2022 Splitting sequences at non-ACGT characters
113.4010 Wed Aug 31 10:18:06 2022 Building de Bruijn Graph
113.4050 Wed Aug 31 10:18:06 2022 Building KMC database
Validating input alphabet
Calling KMC with: kmc -b -fm -k32 -m93 -ci1 -cs1 -cx4294967295 -t32 tmp/TzgoTTijZycYcc4BeRgOLcHsj tmp/KMC-KHZLbdEI7MjTMwj9eQ9GyGiFg tmp


Stage 1: 100%
Stage 2: 100%
147.5320 Wed Aug 31 10:18:40 2022 Building KMC database finished
149.1220 Wed Aug 31 10:18:42 2022 Building BOSS from KMC database
920.0280 Wed Aug 31 10:31:33 2022 Sorting 2445922690 (k+1)-mers
2322.7260 Wed Aug 31 10:54:56 2022 Adding dummy (k+1)-mers
2818.2890 Wed Aug 31 11:03:11 2022 Constructing Wheeler BOSS components.
3372.3710 Wed Aug 31 11:12:25 2022 Deleting KMC database
3378.3470 Wed Aug 31 11:12:31 2022 Building de Bruijn Graph finished (2702982421 nodes)
3378.3470 Wed Aug 31 11:12:31 2022 Building colors
3383.5650 Wed Aug 31 11:12:37 2022 Marking redundant color sets
Segmentation fault
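To see where the time went and which stage was running when the crash occurred, the log can be summarized with a short script. This is only a sketch: it assumes the first field of each timestamped line is elapsed seconds (which matches the wall-clock timestamps above), and the names `parse_stages` and `stage_durations` are made up for this illustration.

```python
import re

# Timestamped lines copied from the log above.
LOG = """\
113.4050 Wed Aug 31 10:18:06 2022 Building KMC database
147.5320 Wed Aug 31 10:18:40 2022 Building KMC database finished
149.1220 Wed Aug 31 10:18:42 2022 Building BOSS from KMC database
920.0280 Wed Aug 31 10:31:33 2022 Sorting 2445922690 (k+1)-mers
2322.7260 Wed Aug 31 10:54:56 2022 Adding dummy (k+1)-mers
2818.2890 Wed Aug 31 11:03:11 2022 Constructing Wheeler BOSS components.
3372.3710 Wed Aug 31 11:12:25 2022 Deleting KMC database
3378.3470 Wed Aug 31 11:12:31 2022 Building de Bruijn Graph finished (2702982421 nodes)
3378.3470 Wed Aug 31 11:12:31 2022 Building colors
3383.5650 Wed Aug 31 11:12:37 2022 Marking redundant color sets
Segmentation fault
"""

# <elapsed seconds> <weekday> <month> <day> <hh:mm:ss> <year> <message>
LINE_RE = re.compile(r"^(\d+\.\d+) \w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4} (.*)$")

def parse_stages(text):
    """Return (elapsed_seconds, message) for every timestamped log line."""
    stages = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stages.append((float(match.group(1)), match.group(2)))
    return stages

def stage_durations(stages):
    """Duration of each stage, measured up to the start of the next logged stage."""
    return [(msg, nxt - start) for (start, msg), (nxt, _) in zip(stages, stages[1:])]

stages = parse_stages(LOG)
durations = stage_durations(stages)
for message, seconds in durations:
    print(f"{seconds:10.3f} s  {message}")
print("Last stage logged before the crash:", stages[-1][1])
```

The last timestamped message, "Marking redundant color sets", agrees with the top frame of the gdb backtrace.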

Is it possible for you to provide the source of the input(s) that cause the segfault?

It's the unitigs produced by BCALM2 from the assembled version of this dataset: https://github.com/human-pangenomics/HG002_Data_Freeze_v1.0 (this was done by a member of my team who is currently on vacation, so I can't ask him exactly what he used).

Could you provide the input file that causes the segfault?

I have a small update on this. @iosfwd was able to reproduce the crash on his machine, but on my machine it works. We're investigating.

We were able to localize the bug to the KMC submodule which does the k-mer listing in the construction. Updating the KMC submodule seems to have fixed the issue. Commit 7e84e6a should work now.