database file parameters used in count output are too large
brianwalenz opened this issue · 0 comments
The prefixSize used for writing count output is too large when the inputs are also large.
https://github.com/marbl/meryl/blob/master/src/meryl/merylOp-countThreads.C#L404
That line sets the output prefix size based on the 'optimal' prefix used for counting. This works fine for moderate kmer sizes (e.g., 22), but for larger ones (e.g., 28) the database chunks are too big for merging.
Example:
prefix     # of   struct   kmers/    segs/      min     data    total
bits     prefix   memory   prefix   prefix   memory   memory   memory
------  -------  -------  -------  -------  -------  -------  -------
14        16 kP    66 MB    98 kM    130 S    64 MB  8320 MB  8386 MB
15        32 kP   117 MB    49 kM     64 S   128 MB  8192 MB  8309 MB
16        64 kP   217 MB    24 kM     31 S   256 MB  7936 MB  8153 MB  Best Value!
17       128 kP   420 MB    12 kM     16 S   512 MB  8192 MB  8612 MB
18       256 kP   824 MB   6314 M      8 S  1024 MB  8192 MB  9016 MB
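For reference, the columns in the table above appear to satisfy data memory = segs/prefix × min memory and total memory = struct memory + data memory, and "Best Value!" marks the row with the smallest total. The sketch below only reproduces that arithmetic from the numbers in the table; the struct and function names are made up for illustration and are not taken from the meryl source.

```cpp
#include <cstdint>
#include <vector>

// One row of the memory-estimate table, values copied from the issue.
// All names here are hypothetical, not meryl's.
struct PrefixEstimate {
  uint32_t prefixBits;     // "prefix bits"
  uint64_t structMB;       // "struct memory"
  uint64_t segsPerPrefix;  // "segs/prefix"
  uint64_t minMB;          // "min memory"
};

// "data memory" as it appears in the table: segs/prefix * min memory.
uint64_t dataMB(const PrefixEstimate &e) {
  return e.segsPerPrefix * e.minMB;
}

// "total memory" as it appears in the table: struct memory + data memory.
uint64_t totalMB(const PrefixEstimate &e) {
  return e.structMB + dataMB(e);
}

// Prefix bits of the row with the smallest total memory (0 if rows is empty).
uint32_t bestPrefixBits(const std::vector<PrefixEstimate> &rows) {
  uint32_t bestBits = 0;
  uint64_t bestTotal = UINT64_MAX;

  for (const PrefixEstimate &e : rows) {
    if (totalMB(e) < bestTotal) {
      bestTotal = totalMB(e);
      bestBits  = e.prefixBits;
    }
  }
  return bestBits;
}

// The five rows shown in the issue.
const std::vector<PrefixEstimate> issueTable = {
  {14,  66, 130,   64},
  {15, 117,  64,  128},
  {16, 217,  31,  256},
  {17, 420,  16,  512},
  {18, 824,   8, 1024},
};
```

Calling `bestPrefixBits(issueTable)` picks the 16-bit row, which matches the prefixSize that ends up in the count output.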
> meryl dumpIndex 001.meryl
Opened '001.meryl'.
magic 0x646e496c7972656d33302e765f5f7865 'merylIndex__v.03'
prefixSize 16
suffixSize 40
numFilesBits 6 (64 files)
numBlocksBits 10 (1024 blocks)
After merging, however, the prefix size is more reasonable (though this is, IIRC, a fixed, hardcoded size). Merging seems to want to use around 1 GB per input database; I'm not sure why.
> meryl dumpIndex 00x.meryl/
Opened '00x.meryl/'.
magic 0x646e496c7972656d33302e765f5f7865 'merylIndex__v.03'
prefixSize 12
suffixSize 44
numFilesBits 6 (64 files)
numBlocksBits 6 (64 blocks)
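A quick sanity check on the two dumps: in both, prefixSize + suffixSize is 56 bits, which fits k=28 at 2 bits per base, and numFilesBits + numBlocksBits equals prefixSize. The sketch below just does that arithmetic. It assumes the reported block count is per file (so the total is files × blocks), which these dumps do not confirm, and the struct and function names are hypothetical rather than meryl's.

```cpp
#include <cstdint>

// Fields as printed by 'meryl dumpIndex'; the struct itself is illustrative.
struct IndexParams {
  uint32_t prefixSize;
  uint32_t suffixSize;
  uint32_t numFilesBits;
  uint32_t numBlocksBits;
};

// Kmer size, assuming 2 bits per base.
uint32_t kmerSize(const IndexParams &p) {
  return (p.prefixSize + p.suffixSize) / 2;
}

// Number of distinct prefixes the database is split into.
uint64_t numPrefixes(const IndexParams &p) {
  return uint64_t{1} << p.prefixSize;
}

// Total blocks, ASSUMING the block count is per file.
uint64_t numBlocks(const IndexParams &p) {
  return uint64_t{1} << (p.numFilesBits + p.numBlocksBits);
}

// Values from the two dumps in this issue.
const IndexParams countOutput = {16, 40, 6, 10};  // 001.meryl
const IndexParams mergeOutput = {12, 44, 6,  6};  // 00x.meryl/
```

Under that assumption, the count output has 65536 prefixes in 65536 blocks and the merged output has 4096 prefixes in 4096 blocks, so each prefix in the count output covers 16 times fewer suffix values.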