metacache-build-refseq gets killed at 46% every time I try to build the database
clanzett opened this issue · 2 comments
When I try to build the RefSeq database, the process gets killed at 46%. This happens at exactly the same percentage every time, and there is enough storage available. Hardware specs: 64 GB RAM, 16 cores.
Building new database 'refseq' from reference sequences.
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 2429955 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version 2.0.1 (20210305)
database version 20200820
------------------------------------------------
sequence type mc::char_sequence
target id type unsigned short int 16 bits
target limit 65535
------------------------------------------------
window id type unsigned int 32 bits
window limit 4294967295
window length 127
window stride 112
------------------------------------------------
sketcher type mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type unsigned int 32 bits
feature hash mc::same_size_hash<unsigned int>
kmer size 16
kmer limit 16
sketch size 16
------------------------------------------------
bucket size type unsigned char 8 bits
max. locations 254
location limit 254
------------------------------------------------
Reading sequence to taxon mappings from genomes/refseq/archaea/assembly_summary.txt
Reading sequence to taxon mappings from genomes/refseq/bacteria/assembly_summary.txt
Reading sequence to taxon mappings from genomes/refseq/viral/assembly_summary.txt
Reading sequence to taxon mappings from genomes/taxonomy/assembly_summary_refseq.txt
Reading sequence to taxon mappings from genomes/taxonomy/assembly_summary_refseq_historical.txt
Processing reference sequences.
[=================> ] 24%
[==================================> ] 46%Killed
I'm afraid the reason might be that the latest RefSeq release has simply become too large to build a complete database (with default settings) within 64 GB.
As I see it, there are three ways around this:
- Reduce the database memory footprint by sampling the reference genomes more coarsely. The main disadvantage is that this can lead to lower mapping sensitivity. The main option for that is `-sketchlen`; the default is 16, and reducing it reduces the memory footprint proportionally. If your reads are much longer than 100 bp, you could instead increase the sampling window length with option `-winlen`. The default is 128, so e.g. `-winlen 256` would cut the memory consumption roughly in half.
- Use the default settings, but split the database up into 2 or more parts and use MetaCache's merge mode, see: database partitioning.
- Buy more RAM - I guess this is the least attractive option.
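To get a feel for how much the first option helps, here is a rough back-of-the-envelope estimate (my own illustration, not MetaCache code). It assumes the stride follows the window length as `winlen - kmersize + 1`, which matches the window length 127 / stride 112 / k-mer size 16 shown in the build log above:

```python
def estimated_locations(total_bases, winlen, sketchlen):
    """Rough count of (feature, location) entries in the database.

    Assumes the window stride is winlen - kmersize + 1 (with k = 16,
    as in the build log) and that each window contributes
    `sketchlen` feature locations to the hash table.
    """
    winstride = winlen - 16 + 1          # stride follows window length
    windows = total_bases // winstride   # number of reference windows
    return windows * sketchlen

total = 500_000_000_000  # pretend 500 Gbp of reference sequence

default    = estimated_locations(total, winlen=128, sketchlen=16)
half_sketch = estimated_locations(total, winlen=128, sketchlen=8)
double_win  = estimated_locations(total, winlen=256, sketchlen=16)

print(half_sketch / default)  # 0.5: sketchlen scales memory proportionally
print(double_win / default)   # roughly 0.47: doubling winlen about halves it
```

The absolute numbers are fictitious; only the ratios matter, and they match the reply: halving `-sketchlen` halves the location count exactly, while doubling `-winlen` cuts it roughly in half (not exactly, because the stride grows by slightly more than a factor of two).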
Oh, and you should of course make sure that the reference genome files were downloaded properly. And maybe you should also do a quick test with just the virus genomes to make sure that everything works in principle.
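One quick way to check the downloads is to verify that every compressed genome file decompresses cleanly; a truncated `.gz` from an interrupted download will fail partway through. A hypothetical helper along these lines (plain Python, nothing MetaCache-specific):

```python
import gzip
import pathlib

def corrupt_gz_files(root):
    """Return the .gz files under `root` that fail to decompress,
    which usually indicates a truncated or interrupted download."""
    bad = []
    for path in sorted(pathlib.Path(root).rglob("*.gz")):
        try:
            with gzip.open(path, "rb") as fh:
                while fh.read(1 << 20):   # stream through the whole file
                    pass
        except (OSError, EOFError):       # BadGzipFile is an OSError
            bad.append(path)
    return bad

# e.g. corrupt_gz_files("genomes/refseq") -> re-download anything listed
```

Anything this reports should be deleted and fetched again before rebuilding the database.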
Yes, it does indeed seem to be a memory problem. I increased the memory to 128 GB and the build has now made it past the 46% mark. We will see if the build completes this time; I will keep you posted.
If not, I will try to build a viral-only DB and check whether there is a general problem with the build process.
Anyway, thanks for your quick reply.