Sydney-Informatics-Hub/Somatic-ShortV

Optimise the workflow to enable the use of gnomAD at the 'getpileupsummaries' step

calizilla opened this issue · 0 comments

Using the gnomAD resource at the getpileupsummaries step consumes excessive memory, because gnomAD is much larger than the small_exac resource.

Using 2 Gadi hugemem CPUs (64 GB RAM) and setting -Xmx60G for GATK, the following fatal error is encountered:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

This error occurs after 11.5 hours in GATK v 4.2.1.0 and after 1.2 hours in GATK v 4.4.0.0.

Another user reported the same issue on a 768 GB job on another compute platform using GATK v 4.5.0.0.

The repository contains the following notice about this, but it sits at the very end of the user guide, so most users are likely to miss it:

 I recommend NOT using gnomAD variants as a common biallelic resource as it requires extensive benchmarking to overcome Java "OutOfMemory" errors, specific to your sample and sample coverage

To enable use of gnomAD, add scatter-gather by contig functionality. Benchmarking on chromosome 1 found that 1 Gadi hugemem CPU with -Xmx28G completed in 13 minutes 20 seconds, with perfect CPU efficiency and successful GetPileupSummaries output.
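
A rough sketch of what scatter-gather by contig could look like, not the repository's implementation. It builds one GetPileupSummaries command per contig, then a single GatherPileupSummaries command to merge the per-contig tables. The function names, file names, and output layout are hypothetical. Restricting each task with a bare `-L <contig>` is an assumption and may not match the benchmarked run. The 28G heap default is taken from the chromosome 1 benchmark and would likely need tuning for other contigs and coverage.

```python
from pathlib import Path


def scatter_commands(bam, resource, contigs, out_dir, heap="28G"):
    """Build one GetPileupSummaries command per contig (hypothetical helper).

    Returns a dict mapping contig name to an argument list, so each entry
    can be submitted as its own job or passed to subprocess.run.
    """
    commands = {}
    for contig in contigs:
        table = str(Path(out_dir) / f"{contig}.pileups.table")
        commands[contig] = [
            "gatk", "--java-options", f"-Xmx{heap}",
            "GetPileupSummaries",
            "-I", bam,        # tumour or normal BAM
            "-V", resource,   # common biallelic resource, e.g. gnomAD
            "-L", contig,     # assumption: limit this task to one contig
            "-O", table,
        ]
    return commands


def gather_command(scatter, sequence_dict, merged_table):
    """Build the GatherPileupSummaries command that merges per-contig tables."""
    cmd = ["gatk", "GatherPileupSummaries",
           "--sequence-dictionary", sequence_dict]
    for args in scatter.values():
        # Reuse the -O path of each scatter task as a gather input.
        cmd += ["-I", args[args.index("-O") + 1]]
    cmd += ["-O", merged_table]
    return cmd


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    contigs = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
    scatter = scatter_commands("tumour.bam", "gnomad.vcf.gz", contigs, "pileups")
    for args in scatter.values():
        print(" ".join(args))
    print(" ".join(gather_command(scatter, "ref.dict", "tumour.pileups.table")))
```

Since 2 hugemem CPUs correspond to 64 GB RAM in the failing run, 1 CPU should give roughly 32 GB, so a 28G heap leaves a few GB of headroom for JVM overhead in each per-contig task.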