Scalability issues
h-2 opened this issue · 11 comments
As discussed in the video-call, we have been experimenting with different formats here at deCODE.
Unfortunately, I cannot share the data files, but one very visible observation we made with regard to Savvy was that between 75k and 100k samples there is a huge jump in file size; it more than doubles. After that, the Savvy file size always seems to be very close to BCF.
I suspect that the mechanism for determining when to use sparse vectors has a bug and does not cope with the large size, resulting in an uncompressed fallback.
Thanks for reporting. Can you provide the command line options you used for generating the SAV files and what format fields are present in the files you are evaluating (e.g., GT:DP:AD:GQ:PL)? I can try to reproduce this trend using data I have access to.
If I remember correctly, you were experiencing a crash when enabling PBWT. Did you use v2.0.1 from the releases section or did you use a commit from the master branch? If the former, updating to the latest from master may resolve this issue, since a lot has been fixed since v2.0.1.
I'm actually a bit surprised that SAV does so much better than BCF for the smaller datasets. Sparse vectors are only used for fields that have mostly zero values (GT, DS, HDS, etc.). PBWT is necessary for DP, AD, GQ, and PL to compress well in the current version of SAV.
Can you provide the command line options you used for generating the SAV files
--phasing=none --update-info=never
what format fields are present in the files you are evaluating
GT:AD:MD:DP:GQ:PL
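Roughly, the import invocation looks like this (a sketch only; file names are placeholders, and it assumes the options above are passed to the sav import subcommand):

```sh
# Sketch of the import step; input/output names are placeholders and the
# flags are the ones listed above (unphased data, INFO fields left untouched).
sav import --phasing=none --update-info=never input.bcf output.sav
```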
If I remember correctly, you were experiencing a crash when enabling PBWT. Did you use v2.0.1 from the releases section or did you use a commit from the master branch? If the former, updating to the latest from master may resolve this issue, since a lot has been fixed since v2.0.1.
I will give the master branch a try if I can make it build (no internet access on our research network, no cget, no conda).
Thanks. Can you describe the MD field?
The only dependencies needed to build the sav CLI are libzstd, libz, and shrinkwrap (https://github.com/jonathonl/shrinkwrap/archive/395361020c84f664d50c1ec51055e107d9178ad3.tar.gz). Shrinkwrap is a header-only library. The other two you likely already have available in your environment.
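An offline build could look something like the sketch below (a rough outline only; the paths are examples and the exact CMake variables your checkout expects may differ):

```sh
# Rough offline-build sketch: unpack the header-only shrinkwrap archive and
# point CMake at it along with the system zlib/libzstd. Paths are examples
# and the exact cache variables may differ for your checkout.
tar xzf shrinkwrap-395361020c84f664d50c1ec51055e107d9178ad3.tar.gz
mkdir -p savvy/build && cd savvy/build
cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_PREFIX_PATH=/path/to/unpacked/shrinkwrap \
      ..
make
```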
Can you describe the MD field?
It's a single integer field described as "Read depths of multiple alleles." Maybe @hannespetur can shed some more light?
The master-branch binary is now working, and it doesn't crash with PBWT enabled. In fact, using PBWT gives very good compression in a quick test I just did.
We will re-run the general comparison on the WGS dataset. On which fields do you recommend activating PBWT? All of them?
P.S.: Shrinkwrap looks very useful! I will definitely try replacing some of my code with that!
Great. I think --pbwt-fields AD,MD,DP,GQ,PL should work well. Any fields that compress well with your delta encoding "should" benefit from PBWT. Your delta encoding will be faster to [de]serialize, though, which makes it quite appealing.
Can you describe the MD field?
It's a single integer field described as "Read depths of multiple alleles." Maybe @hannespetur can shed some more light?
Yes, it just indicates how many reads support more than one variant allele equally well. It's going to be 0 most of the time.
Ok. If MD is mostly zero, then --sparse-fields GT,MD --pbwt-fields AD,DP,GQ,PL will probably work best.
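For example (a sketch with placeholder file names):

```sh
# Placeholder file names; sparse encoding for the mostly-zero fields (GT, MD),
# PBWT for the depth/quality fields.
sav import --sparse-fields GT,MD --pbwt-fields AD,DP,GQ,PL input.bcf output.sav
```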
Using --sparse-fields GT,MD --pbwt-fields AD,DP,GQ,PL over the default options improves the compression ratios quite a bit.
In most of the benchmarks, running the savvy master build without the --pbwt-fields option results in a ~2x smaller file size than BCF, but with --sparse-fields GT,MD --pbwt-fields AD,DP,GQ,PL it is ~3x smaller than BCF.
These scalability issues are still occurring, though; they now show up when going from 150k to 200k samples (instead of 75k to 100k). At 200k samples, the savvy output (without PBWT) is almost exactly the same size as its BCF input. Savvy with PBWT does still compress at 200k, but its compression ratio relative to BCF drops from ~3.2x (150k) to ~2.4x (200k). There must be some fallback to no compression on the sparse fields being triggered here, right?
Hmm... there's no application logic that I can think of that would explain this.
I have a 200k data set that I can use to try to reproduce. Are your smaller datasets merely sample subsets of the 200k dataset? If so, are monomorphic variant records (AC==0) removed when subsetting, or does each sample set have the same number of variant records?
Using a higher level of compression might scale better (adding -6 or -10). Increasing the block size (--block-size 8192) could also affect compression, but it usually only makes a small difference in my experience.
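For example (again a sketch with placeholder file names):

```sh
# Same field options as above, plus a higher compression level (-10) and a
# larger block size; file names are placeholders.
sav import -10 --block-size 8192 --sparse-fields GT,MD --pbwt-fields AD,DP,GQ,PL input.bcf output.sav
```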
I have a 200k data set that I can use to try to reproduce. Are your smaller datasets merely sample subsets of the 200k dataset? If so, are monomorphic variant records (AC==0) removed when subsetting, or does each sample set have the same number of variant records?
Yes. The original VCF contains 487k samples that I subset into smaller and smaller sample sizes while also removing AC==0 from the file, using bcftools view --trim-alt-alleles -S<subset> ... and removing lines that have "ALT=.".
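In outline, the subsetting step looks something like this (file names are placeholders, and the -e expression is just one way to express the ALT="." removal):

```sh
# Subset samples, trim ALT alleles not seen in the subset, then drop records
# whose ALT collapsed to "." after trimming. Placeholder file names; the -e
# expression is one possible way to express that last filter.
bcftools view --trim-alt-alleles -S subset_samples.txt full.vcf.gz -Ou \
  | bcftools view -e 'ALT="."' -Ob -o subset.bcf
```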
Using a higher level of compression might scale better (adding -6 or -10). Increasing the block size (--block-size 8192) could also affect compression, but it usually only makes a small difference in my experience.
Ok thanks, I will give it a try.
The -10 option helped a lot. It improved the 200k compression ratio by a factor of 2.1x compared to not using -10.
-6 and increasing the block size did almost nothing (less than a 0.5% change in file size).