Segmentation fault (core dump) issue on custom DB
donovan-parks opened this issue · 2 comments
Hi,
It appears a number of users are experiencing segmentation faults under different conditions (issues 27, 28). I am experiencing this with a custom DB I just built. The DB builds without issue; I've posted the build output below. The segmentation fault occurs even when running MetaCache v2.2.3 with default settings, e.g.:
> metacache query my_db my_seqs.fq.gz -out results.txt
The segmentation fault occurs immediately after the DB is loaded:
> metacache query my_db my_seqs.fq.gz -out results.txt
Reading database metadata ...
Reading 1 database part(s) ...
Completed database reading.
Classifying query sequences.
Per-Read mappings will be written to file: results.txt
[> ] 0%Segmentation fault (core dumped)
This seems to be the same issue as previously reported. I'm running this on a machine with plenty of memory (512 GB; the reference DB is ~223 GB on disk and appears to require about the same amount of RAM) and disk space (>100 GB free). Notably, the following command also causes a segmentation fault:
> metacache info my_db lin > lineages.tsv
Reading database from file 'gtdb_r214_db_ext' ... Reading database metadata ...
Completed database reading.
done.
Segmentation fault (core dumped)
Any insights into what might be causing this segmentation fault?
Thanks,
Donovan
Output from building the custom DB:
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 184936 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version 2.3.2
database version 20200820
------------------------------------------------
sequence type mc::char_sequence
target id type unsigned int 32 bits
target limit 4294967295
------------------------------------------------
window id type unsigned int 32 bits
window limit 4294967295
window length 127
window stride 112
------------------------------------------------
sketcher type mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type unsigned int 32 bits
feature hash mc::same_size_hash<unsigned int>
kmer size 16
kmer limit 16
sketch size 16
------------------------------------------------
bucket size type unsigned char 8 bits
max. locations 254
location limit 254
------------------------------------------------
Reading sequence to taxon mappings from ./genomes/assembly_summary.txt
Processing reference sequences.
Added 10919768 reference sequences in 7244.87 s
targets 10919768
ranked targets 10411403
taxa in tree 184936
------------------------------------------------
buckets 964032481
bucket size max: 254 mean: 55.5022 +/- 68.7783 <> 1.65723
features 531291666
dead features 0
locations 29487841493
------------------------------------------------
All targets are ranked.
Writing database to file ... Writing database metadata to file 'gtdb_r214_db_ext.meta' ... done.
Writing database part to file 'gtdb_r214_db_ext.cache0' ... done.
done.
Total build time: 9440.64 s
Is there anywhere in the code that assumes sequence IDs are at most a certain length? I know some of the sequences I used to build the DB have long sequence IDs. I would have expected this to break the DB build, but perhaps the problem is isolated to querying.
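In case it helps with debugging, this is a quick way to list the longest sequence IDs across the input files. It's only a sketch: it assumes gzipped FASTA files under ./genomes/ and takes the ID to be the first whitespace-separated token of each header line.

# Sketch: report the 20 longest sequence IDs across gzipped FASTA files.
# The ./genomes/*.fna.gz pattern is an assumption about the input layout.
import glob
import gzip

ids = []
for path in glob.glob("./genomes/*.fna.gz"):
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    ids.append((len(tokens[0]), tokens[0], path))

ids.sort(reverse=True)
for length, seq_id, path in ids[:20]:
    print(length, seq_id, path)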
@donovan-parks and I investigated this issue. I came to the conclusion that inserting sequences with duplicate IDs (accession numbers) led to database metadata corruption. Manual de-duplication fixed it for me as well as for @donovan-parks.
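For anyone who runs into this before upgrading, a minimal sketch of how duplicate accessions can be spotted ahead of a build; it assumes gzipped FASTA files under ./genomes/ and that the accession is the first whitespace-separated token of each header line:

# Sketch: list accessions that occur more than once across the reference files.
# The ./genomes/*.fna.gz pattern is an assumption about the input layout.
import collections
import glob
import gzip

seen = collections.defaultdict(list)
for path in glob.glob("./genomes/*.fna.gz"):
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    seen[tokens[0]].append(path)

for accession, paths in sorted(seen.items()):
    if len(paths) > 1:
        print(accession, *paths)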
The latest release should prevent similar problems in the future.