HKU-BAL/Clair3

Disk space consumption with --gvcf option

kim-fehl opened this issue · 4 comments

After running an analysis with the --gvcf option on a 50 GB BAM file containing 4 ONT runs, with the HG19 reference, the resulting tmp output subfolder takes 419 GB, plus 117 GB in the main output folder. It would probably make sense to remove the partial VCF files after concatenating and sorting them, and to compress the output. For instance, a 117 GB GVCF file takes only 8.5 GB when bzip2-compressed, and tools such as lbzip2 can decompress it in parallel. Perhaps you want to minimize dependencies, but disk space efficiency also matters when renting servers with fast SSDs.

547M	./tmp/full_alignment_output/candidate_bed
3.6G	./tmp/full_alignment_output
233G	./tmp/gvcf_tmp_output
117G	./tmp/merge_output
18G	./tmp/pileup_output
174M	./tmp/phase_output/phase_vcf
48G	./tmp/phase_output/phase_bam
48G	./tmp/phase_output
419G	./tmp
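
To illustrate the suggestion, here is a minimal sketch of merging partial files into one bzip2-compressed output and deleting each part as soon as it has been written. It is not Clair3's implementation: the function name `merge_and_compress`, the `*.vcf` naming of the parts, and the assumption that the parts are plain text and already in the desired order are all hypothetical.

```python
import bz2
from pathlib import Path


def merge_and_compress(parts_dir: Path, out_path: Path) -> int:
    """Concatenate partial VCF files into one bzip2 file and delete the parts.

    Hypothetical helper: assumes the parts are plain-text files matching
    *.vcf in parts_dir and that lexicographic filename order is the
    desired output order. Returns the number of parts merged.
    """
    parts = sorted(parts_dir.glob("*.vcf"))
    with bz2.open(out_path, "wt") as out:
        for index, part in enumerate(parts):
            with part.open("rt") as src:
                for line in src:
                    # Keep header lines ('#...') from the first part only.
                    if index > 0 and line.startswith("#"):
                        continue
                    out.write(line)
            # Remove the part right away so peak disk usage stays low.
            part.unlink()
    return len(parts)
```

Deleting each part before the output file is closed keeps peak disk usage low, but an interrupted run loses the parts already removed. A more cautious variant would delete them only after the compressed file has been closed successfully.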

Will come back later with a solution.

In the next release, we will 1) compress the intermediate files for GVCF output, and 2) provide an option for users to delete intermediate files as soon as they are no longer needed.

Scheduled for the v0.1-r7 release.

v0.1-r7 released with #61