muellan/metacache

metacache-build-refseq gets killed at 46% every time I try to build the database

Closed this issue · 2 comments

When I try to build the RefSeq database, the process gets killed at 46%. This happens every time, at exactly the same percentage. There is enough disk storage available. Hardware specs: 64 GB RAM, 16 cores.

Building new database 'refseq' from reference sequences.
Max locations per feature set to 254
Reading taxon names ... done.
Reading taxonomic node mergers ... done.
Reading taxonomic tree ... 2429955 taxa read.
Taxonomy applied to database.
------------------------------------------------
MetaCache version    2.0.1 (20210305)
database version     20200820
------------------------------------------------
sequence type        mc::char_sequence
target id type       unsigned short int 16 bits
target limit         65535
------------------------------------------------
window id type       unsigned int 32 bits
window limit         4294967295
window length        127
window stride        112
------------------------------------------------
sketcher type        mc::single_function_unique_min_hasher<unsigned int, mc::same_size_hash<unsigned int> >
feature type         unsigned int 32 bits
feature hash         mc::same_size_hash<unsigned int>
kmer size            16
kmer limit           16
sketch size          16
------------------------------------------------
bucket size type     unsigned char 8 bits
max. locations       254
location limit       254
------------------------------------------------
Reading sequence to taxon mappings from genomes/refseq/archaea/assembly_summary.txt
Reading sequence to taxon mappings from genomes/refseq/bacteria/assembly_summary.txt
Reading sequence to taxon mappings from genomes/refseq/viral/assembly_summary.txt
Reading sequence to taxon mappings from genomes/taxonomy/assembly_summary_refseq.txt
Reading sequence to taxon mappings from genomes/taxonomy/assembly_summary_refseq_historical.txt
Processing reference sequences.
[=================>                                                        ] 24%
[=================>                                                        ] 24%

[==================================>                                       ] 46%Killed

I'm afraid the reason might be that the latest RefSeq release has become too large to build a complete database (with default settings) within 64 GB of RAM.

As I see it there are three ways around this:

  • Reduce the database memory footprint by sampling the reference genomes more coarsely. The main disadvantage is that this can lower mapping sensitivity. The main option for this is -sketchlen; the default is 16, and reducing it shrinks the memory footprint proportionally. If your reads are much longer than 100 bp, you could instead increase the sampling window length with option -winlen. The default is about 128 (your log shows 127), so e.g. -winlen 256 would cut the memory consumption roughly in half.

  • Use the default settings, but instead split the database up into 2 or more parts and use MetaCache's merge mode, see: database partitioning.

  • Buy more RAM. I guess this is the least attractive option.
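
To get a feel for the first option, here is a rough back-of-the-envelope sketch in Python. It is not part of MetaCache; the function name `relative_footprint` is made up for illustration. It rests on two assumptions: that the window stride equals winlen - kmerlen + 1 (which matches the 127 / 16 / 112 values in the log above), and that the database size scales with the number of sketched features per reference base. Real memory usage will deviate from this, so treat the numbers as estimates only.

```python
def relative_footprint(sketchlen, winlen, kmerlen=16,
                       base_sketchlen=16, base_winlen=127):
    """Estimate database size relative to the baseline settings.

    Hypothetical helper, not a MetaCache API. Baseline values are
    taken from the build log (sketch size 16, window length 127).
    """
    if winlen < kmerlen:
        raise ValueError("window length must be at least the k-mer length")
    # Assumption: consecutive windows overlap by kmerlen - 1 bases,
    # so the stride is winlen - kmerlen + 1 (127 - 16 + 1 = 112 in the log).
    stride = winlen - kmerlen + 1
    base_stride = base_winlen - kmerlen + 1
    # Assumption: footprint is proportional to features stored per base,
    # i.e. sketch size divided by window stride.
    return (sketchlen / stride) / (base_sketchlen / base_stride)


# Compare a few candidate settings against the defaults.
for sketchlen, winlen in [(16, 127), (8, 127), (16, 256), (8, 256)]:
    ratio = relative_footprint(sketchlen, winlen)
    print(f"-sketchlen {sketchlen:2d} -winlen {winlen:3d}: ~{ratio:.2f}x baseline")
```

Under these assumptions, halving -sketchlen halves the estimate, and -winlen 256 gives a factor of 112/241, or about 0.46, which is in line with the "roughly half" figure above.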

Oh, and you should of course make sure that the reference genome files were downloaded properly. And maybe you should also do a quick test with just the virus genomes to make sure that everything works in principle.

Yes, it does indeed seem to be a memory problem. I increased the memory to 128 GB and the build has now passed the 46% mark. We will see whether it completes this time. I will keep you posted.

If not, I will try to build a viral-only database and check whether there is a general problem with the build process.

Anyway, thanks for your quick reply.