muellan/metacache

Segmentation fault (core dumped) issue on custom DB

donovan-parks opened this issue · 2 comments

Hi,

It appears a number of users have experienced segmentation faults under different conditions (issues #27, #28). I am experiencing this with a custom DB I just built. The DB builds without issue; I've posted the build output below. The segmentation fault occurs even when running MetaCache v2.2.3 with default settings, e.g.:

> metacache query my_db my_seqs.fq.gz -out results.txt

The segmentation fault occurs immediately after the DB is loaded:

> metacache query my_db my_seqs.fq.gz -out results.txt
Reading database metadata ...
Reading 1 database part(s) ...
Completed database reading.
Classifying query sequences.
Per-Read mappings will be written to file: results.txt
[>                                                                         ] 0%Segmentation fault (core dumped)

This seems to be the same issue as previously reported. I'm running this on a machine with plenty of memory (512 GB; the reference DB is ~223 GB on disk and appears to require about the same amount of RAM) and disk space (>100 GB free). Notably, the following command also causes a segmentation fault:

> metacache info my_db lin > lineages.tsv
Reading database from file 'gtdb_r214_db_ext' ... Reading database metadata ...
Completed database reading.
done.
Segmentation fault (core dumped)

Any insights into what might be causing this segmentation fault?

Thanks,
Donovan

Output from building custom DB:

Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 184936 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version    2.3.2
database version     20200820
------------------------------------------------
sequence type        mc::char_sequence
target id type       unsigned int 32 bits
target limit         4294967295
------------------------------------------------
window id type       unsigned int 32 bits
window limit         4294967295
window length        127
window stride        112
------------------------------------------------
sketcher type        mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type         unsigned int 32 bits
feature hash         mc::same_size_hash<unsigned int>
kmer size            16
kmer limit           16
sketch size          16
------------------------------------------------
bucket size type     unsigned char 8 bits
max. locations       254
location limit       254
------------------------------------------------
Reading sequence to taxon mappings from ./genomes/assembly_summary.txt
Processing reference sequences.
Added 10919768 reference sequences in 7244.87 s
targets              10919768
ranked targets       10411403
taxa in tree         184936
------------------------------------------------
buckets              964032481
bucket size          max: 254 mean: 55.5022 +/- 68.7783 <> 1.65723
features             531291666
dead features        0
locations            29487841493
------------------------------------------------
All targets are ranked.
Writing database to file ... Writing database metadata to file 'gtdb_r214_db_ext.meta' ... done.
Writing database part to file 'gtdb_r214_db_ext.cache0' ... done.
done.
Total build time: 9440.64 s

Is there anywhere in the code that assumes sequence IDs are at most a certain length? I know some of the sequences I used to build the DB have long sequence IDs. I would have expected this to break the DB build, but perhaps the problem is isolated to querying.
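In case it helps anyone check their own inputs, here is a quick sketch that lists the longest sequence IDs across a set of FASTA files. It assumes the ID is the first whitespace-delimited token of each `>` header (MetaCache's own header parsing may differ) and that the file paths are passed on the command line:

```python
# check_id_lengths.py -- report the ten longest sequence IDs across FASTA files.
# Assumes the ID is the first whitespace-delimited token of each '>' header.
import gzip
import heapq
import sys

def read_ids(paths):
    for path in paths:
        # Reference genomes are often gzipped; fall back to plain text otherwise.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as fh:
            for line in fh:
                if line.startswith(">"):
                    tokens = line[1:].split()
                    if tokens:
                        yield len(tokens[0]), tokens[0], path

# Print the ten longest IDs seen across all given files.
for length, seq_id, path in heapq.nlargest(10, read_ids(sys.argv[1:])):
    print(f"{length}\t{seq_id}\t{path}")
```

Something like `python check_id_lengths.py genomes/*.fna.gz` would show whether any IDs are unusually long.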

@donovan-parks and I investigated this issue. I came to the conclusion that inserting sequences with duplicate IDs (accession numbers) leads to database metadata corruption. Manual de-duplication fixed it for me as well as for donovan-parks.
The latest release should prevent similar problems in the future.
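For anyone hitting this on an older release, here is a minimal sketch of the kind of manual de-duplication described above: it keeps the first record for each ID and drops later duplicates. As in the sketch above, treating the first whitespace-delimited header token as the accession is an assumption, not MetaCache's documented behavior:

```python
# dedup_fasta.py -- keep only the first record per sequence ID (accession).
# Assumes the ID is the first whitespace-delimited token of each '>' header.
import gzip
import sys

def open_maybe_gzip(path):
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")

seen = set()
keep = True  # whether lines of the current record should be written
with open_maybe_gzip(sys.argv[1]) as fh:
    for line in fh:
        if line.startswith(">"):
            tokens = line[1:].split()
            seq_id = tokens[0] if tokens else ""
            keep = seq_id not in seen
            seen.add(seq_id)
        if keep:
            sys.stdout.write(line)
```

Usage would look like `python dedup_fasta.py genomes.fna.gz > genomes_dedup.fna`, followed by rebuilding the database from the de-duplicated input.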