Sambamba should be able to convert SAM->BAM and sort at the same time like samtools
moldach opened this issue · 1 comments
`Sambamba` has a number of functions which are reported to be quicker than their counterparts.
I'm trying to exchange the following functions (`view`, `sort`, `markdup` & `index`) in our pipeline for faster alternatives.
While most of the functions appear faster in my benchmarks, the use of `view`/`sort` raises some concern. The critical difference between `Samtools` and `Sambamba` appears to be the first step in the pipeline, as `samtools sort` both sorts and converts SAM -> BAM (the typical job of `view`).
I'm wondering whether this is also possible with `Sambamba`, as it appears to be the bottleneck.
Old Pipeline (`Samtools` & `Picard`)
# Convert SAM to BAM & Sort
./samtools-1.3.1/samtools sort -@ 8 -o proband_bwaMEM_sort.bam proband_bwaMEM.sam
# Markdups
java -Xmx4G -jar picard.jar MarkDuplicates \
VALIDATION_STRINGENCY=LENIENT READ_NAME_REGEX=null \
I=proband_bwaMEM_sort.bam \
O=proband_bwaMEM_sort_dedupped.bam \
M=proband_output.metrics.bwaMEM.txt;
# Samtools index
./samtools-1.3.1/samtools index proband_bwaMEM_sort_dedupped.bam;
Sambamba Implementation
#----------------
# sam -> bam
sambamba-0.8.0-linux-amd64-static view -S proband_bwaMEM.sam \
-f bam \
-t 8 \
-o proband_tmp.bam
#-----------------
# sort
sambamba-0.8.0-linux-amd64-static sort \
-t 32 \
proband_tmp.bam \
-o proband_bwaMEM_sort_sambamba.bam
#-----------------
# MarkDuplicates
sambamba-0.8.0-linux-amd64-static markdup \
-t 32 \
--overflow-list-size 800000 \
proband_bwaMEM_sort_sambamba.bam \
proband_bwaMEM_sort_dedupped_sambamba.bam
#-----------------
# Sambamba index
sambamba-0.8.0-linux-amd64-static index \
-t 32 \
proband_bwaMEM_sort_dedupped_sambamba.bam \
proband_bwaMEM_sort_dedupped_sambamba.bam.bai
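One way to avoid writing the intermediate `proband_tmp.bam` might be to stream the output of `sambamba view` straight into `sambamba sort`. The sketch below is untested against the data above. The function name `sam_to_sorted_bam` is hypothetical, and it assumes that `sambamba sort` accepts BAM on `/dev/stdin` in the installed version, which should be checked before relying on it. `-l 0` asks `view` for uncompressed BAM so the conversion step does not spend time compressing data that is immediately re-read.

```shell
#!/bin/sh
# Hypothetical helper (not part of sambamba): convert SAM to BAM and sort
# in a single pipeline, without an intermediate BAM file on disk.
# ASSUMPTION: `sambamba sort` can read BAM from /dev/stdin in your version.
sam_to_sorted_bam() {
    in_sam=$1          # input SAM file
    out_bam=$2         # sorted BAM to write
    threads=${3:-8}    # threads for each step (default 8)

    # view: SAM input (-S), BAM output (-f bam), no compression (-l 0);
    # with no -o given, the BAM stream goes to stdout.
    sambamba view -S -f bam -l 0 -t "$threads" "$in_sam" |
        sambamba sort -t "$threads" -o "$out_bam" /dev/stdin
}
```

Usage would look like `sam_to_sorted_bam proband_bwaMEM.sam proband_bwaMEM_sort_sambamba.bam 8`. Whether this actually closes the gap to `samtools sort` would need to be benchmarked.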
Benchmarks
| Samtools/Picard | Time | Sambamba | Time |
|---|---|---|---|
| View/Sort | 01:22:35 | View | 01:21:58 |
| | | Sort | 00:46:01 |
| Markdup | 04:16:02 | Markdup | 00:48:04 |
| Index | 00:32:59 | Index | 00:15:48 |
| Total | 06:11:46 | Total | 03:11:57 |
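To make the view/sort comparison explicit, the timings from the table can be converted to seconds and compared directly. This is only a small arithmetic sketch; the helper name `to_seconds` is made up for illustration, and the three durations are copied from the table.

```shell
#!/bin/sh
# Convert an HH:MM:SS duration to whole seconds.
# awk reads fields such as "08" as decimal numbers, so leading zeros are safe.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Timings copied from the benchmark table.
samtools_sort=$(to_seconds 01:22:35)   # samtools sort (SAM -> sorted BAM)
sambamba_view=$(to_seconds 01:21:58)   # sambamba view (SAM -> BAM)
sambamba_sort=$(to_seconds 00:46:01)   # sambamba sort (BAM -> sorted BAM)

sambamba_total=$((sambamba_view + sambamba_sort))

echo "samtools sort:        ${samtools_sort}s"
echo "sambamba view + sort: ${sambamba_total}s"

# Shell arithmetic is integer-only, so compute the ratio in awk.
awk -v a="$sambamba_total" -v b="$samtools_sort" \
    'BEGIN { printf "ratio: %.2f\n", a / b }'
```

With these figures, the two Sambamba steps take 7679 s against 4955 s for the single `samtools sort` call, a ratio of about 1.55.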
Although the overall time for `markdup` and `index` is greatly improved, I found when playing around with the number of cores for `sambamba view` and `sambamba sort` (8, 16 & 32 cores) that their speed, even at the optimum number of cores, was slower than the `samtools` function:
samtools sort -@ 8 -o proband_bwaMEM_sort.bam proband_bwaMEM.sam
This is not a bug. You can see the same in earlier speed tests. But thanks for reporting.