Optimise the workflow to enable the use of gnomAD at the 'getpileupsummaries' step
calizilla opened this issue · 0 comments
Using the gnomAD resource at the getpileupsummaries step
uses excessive memory because it is much larger than the small_exac resource.
Using 2 Gadi hugemem CPUs (64 GB RAM) and setting -Xmx60G
for GATK, the following fatal error is encountered:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
This error occurs after 11.5 hours in GATK v 4.2.1.0 and after 1.2 hours in GATK v 4.4.0.0.
Another user reported the same issue on a 768 GB job on another compute platform using GATK v 4.5.0.0.
The repository contains the following notice about this, but it is right at the very end of the user guide so is most likely missed by everyone:
I recommend NOT using gnomAD variants as a common biallelic resource as it requires extensive benchmarking to overcome Java "OutOfMemory" errors, specific to your sample and sample coverage
To enable use of gnomAD, add scatter-gather by contig functionality. Benchmarking on chromosome 1 found that a job using 1 Gadi hugemem CPU with -Xmx28G
completed in 13 minutes 20 seconds with perfect CPU efficiency and produced successful GetPileupSummaries output.
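
A rough sketch of what scatter-gather by contig could look like, written as a Python helper that only builds the GATK command lines (it does not run them). The function names, file names, and output naming scheme are hypothetical and not part of this repository. The sketch assumes that restricting GetPileupSummaries to one contig with -L behaves as in the chromosome 1 benchmark above, and that the per-contig tables are merged with GATK's GatherPileupSummaries tool.

```python
from pathlib import Path


def scatter_commands(bam, resource_vcf, contigs, out_dir, java_mem="28G"):
    """Build one GetPileupSummaries command per contig (scatter step).

    All paths and the output naming scheme are illustrative assumptions.
    """
    commands = {}
    for contig in contigs:
        out_table = Path(out_dir) / f"{Path(bam).stem}.{contig}.pileups.table"
        commands[contig] = [
            "gatk", "--java-options", f"-Xmx{java_mem}",
            "GetPileupSummaries",
            "-I", str(bam),
            "-V", str(resource_vcf),  # e.g. the gnomAD common biallelic resource
            "-L", contig,             # restrict this job to a single contig
            "-O", str(out_table),
        ]
    return commands


def gather_command(scatter_tables, sequence_dict, out_table):
    """Build the GatherPileupSummaries command that merges per-contig tables."""
    cmd = [
        "gatk", "GatherPileupSummaries",
        "--sequence-dictionary", str(sequence_dict),
    ]
    for table in scatter_tables:
        cmd += ["-I", str(table)]
    cmd += ["-O", str(out_table)]
    return cmd


if __name__ == "__main__":
    # Hypothetical inputs; contig names must match the reference (chr1 vs 1).
    contigs = ["chr1", "chr2"]
    scatter = scatter_commands("tumour.bam", "gnomad.vcf.gz", contigs, "pileups")
    tables = [cmd[-1] for cmd in scatter.values()]
    for cmd in scatter.values():
        print(" ".join(cmd))
    print(" ".join(gather_command(tables, "ref.dict", "tumour.pileups.table")))
```

Each scatter command could then be submitted as its own job (for example 1 hugemem CPU each), with the gather step run once all per-contig tables exist. Larger contigs may need more memory than smaller ones, so the -Xmx28G default here is only the chromosome 1 figure from the benchmark, not a tested value for every contig.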