dpryan79/MethylDackel

The extraction speed differs a lot between bams from different mapping software

chenzixi07 opened this issue · 4 comments

Hi @dpryan79 ,
Thanks for your software. It is very helpful.

While testing MethylDackel with bams from different mapping softwares, I found the extraction speed differs a lot.

For now, the mapping softwares I tested are bismark (v0.16.3) and bitmapperbs (v1.0.2.3). After mapping, bam files were sorted and deduplicated with biobambam2 (v2.0.87).

The extraction time of the bam mapped by bismark is about 2 hours, while it only cost 20 minutes for the the bam mapped by bitmapperbs.

The bam file created by bitmapperbs is smaller, as it does not contain MD,XM,XR,XZ tags in bismark bams. Does it matters?

Thanks,
Zixi Chen

That shouldn't matter and 2 hours seems like a LOT of time unless you're running with a single core. How much did the alignment rate differ between the two tools? Were the same settings otherwise used?

That shouldn't matter and 2 hours seems like a LOT of time unless you're running with a single core. How much did the alignment rate differ between the two tools? Were the same settings otherwise used?

I am using 4 cores for MethylDackel. The alignment rate for bitmapperbs is ~89%, while for bismark it's ~ 84%.
Since the bam file is too big to upload, I pasted the script I used for the test, you may test it with your own data.

bitmapperbs mapping
/TJPROJ1/RNA/WORK/software/BitMapperBS/conda/bin/bitmapperBS
--search /TJPROJ1/RNA/WORK/data_test/test_WGBS_v8/references/hg38
--sensitive
--pe
--seq1 /TJPROJ1/RNA/WORK/data_test/test_WGBS_v8/test_hsa/test.all_data/A_1_clean_1.fq.gz
--seq2 /TJPROJ1/RNA/WORK/data_test/test_WGBS_v8/test_hsa/test.all_data/A_1_clean_2.fq.gz
--bam -o A_1.bitmapperBS.sensitive.bam
--max 700
--threads 8
--ambiguous_out
--mapstats A_1.mapping_log \

bismark mapping
perl /PUBLIC/software/RNA/Bismark/bismark_0.16.3/bismark.v16.3 -q --multicore 1 -p 4 -X 700 --dovetail --bowtie2 --prefix map --score_min L,0,-0.2 --gzip
--path_to_bowtie /PUBLIC/software/RNA/bowtie2-2.2.5
-o /TJPROJ1/RNA/WORK/data_test/test_WGBS_v8/test_hsa/test.test_on_tj1.node128/test.map_bismark
/TJPROJ1/RNA/WORK/data_test/test_WGBS_v8/references/hg38
-1 /TJPROJ1/RNA/WORK/data_test/test_WGBS_v8/test_hsa/test.all_data/A_1_clean_1.fq.gz
-2 /TJPROJ1/RNA/WORK/data_test/test_WGBS_v8/test_hsa/test.all_data/A_1_clean_2.fq.gz

sort&deduplicate
/TJPROJ1/RNA/WORK/software/biobambam2/biobambam2/bin/bamsort
level=9
tmpfile=temp
inputthreads=8
outputthreads=8
sortthreads=8
index=1
indexfilename=/TJPROJ1/RNA/WORK/data_test/test_WGBS_v8/test_hsa/test.test_on_tj1.node128/test.biobambam2_sort/map.A_1_clean_1.fq.gz_bismark_bt2_pe.sort.bam.bai
I=/TJPROJ1/RNA/WORK/data_test/test_WGBS_v8/test_hsa/test.all_data/map.A_1_clean_1.fq.gz_bismark_bt2_pe.bam
O=/TJPROJ1/RNA/WORK/data_test/test_WGBS_v8/test_hsa/test.test_on_tj1.node128/test.biobambam2_sort/map.A_1_clean_1.fq.gz_bismark_bt2_pe.sort.bam

/TJPROJ1/RNA/WORK/software/biobambam2/biobambam2/bin/bammarkduplicates
level=9
tmpfile=temp
I=/TJPROJ1/RNA/WORK/data_test/test_WGBS_v8/test_hsa/test.test_on_tj1.node128/test.biobambam2_sort/A1.bitmapperbs.sort.bam
O=/TJPROJ1/RNA/WORK/data_test/test_WGBS_v8/test_hsa/test.test_on_tj1.node128/test.biobambam2_bammarkduplicates/A1.bitmapperbs.sort.dedup.bam
M=bammarkduplicates.log
markthreads=8
rmdup=1

MethylDackel extract
/TJPROJ1/RNA/WORK/software/MethylDackel/MethylDackel_v0.5.1/MethylDackel extract
/TJPROJ1/RNA/WORK/data_test/test_WGBS_v8/references/hg38/genome.fa
/TJPROJ1/RNA/WORK/data_test/test_WGBS_v8/test_hsa/test.all_data/A1.sorted.dedup_bammarkduplicates.bam
-@ 4 --CHG --CHH --cytosine_report -o test.0.5.1

Hmm, there's nothing obviously amiss with that, I have no good explanation why you're running into such a big performance difference.

Hmm, there's nothing obviously amiss with that, I have no good explanation why you're running into such a big performance difference.

Quite weird. Thanks anyway.