biod/sambamba

Sambamba should be able to convert SAM->BAM and sort at the same time like samtools

moldach opened this issue · 1 comments

Sambamba has a number of functions that are reported to be faster than their counterparts.

I'm trying to exchange the following functions (view, sort, markdup & index) in our pipeline for faster alternatives.

While most of the functions appear faster in my benchmarks, the view/sort step raises some concern. The critical difference between Samtools and Sambamba appears to be the first step in the pipeline: samtools sort both sorts and converts SAM -> BAM (typically the job of view).

I'm wondering whether this is also possible with Sambamba, as this step appears to be a bottleneck.

Old Pipeline (Samtools & Picard)

# Convert SAM to BAM & Sort 
./samtools-1.3.1/samtools sort -@ 8 -o proband_bwaMEM_sort.bam proband_bwaMEM.sam

# Markdups
java -Xmx4G -jar picard.jar MarkDuplicates \
        VALIDATION_STRINGENCY=LENIENT READ_NAME_REGEX=null \
        I=proband_bwaMEM_sort.bam \
        O=proband_bwaMEM_sort_dedupped.bam \
        M=proband_output.metrics.bwaMEM.txt;



# Samtools index

./samtools-1.3.1/samtools index proband_bwaMEM_sort_dedupped.bam;

Sambamba Implementations

#----------------
# sam -> bam
sambamba-0.8.0-linux-amd64-static view -S proband_bwaMEM.sam \
    -f bam \
    -t 8 \
    -o proband_tmp.bam


#-----------------
# sort
sambamba-0.8.0-linux-amd64-static sort \
    -t 32 \
    proband_tmp.bam \
    -o proband_bwaMEM_sort_sambamba.bam

#-----------------
# MarkDuplicates
sambamba-0.8.0-linux-amd64-static markdup \
    -t 32 \
    --overflow-list-size 800000 \
    proband_bwaMEM_sort_sambamba.bam \
    proband_bwaMEM_sort_dedupped_sambamba.bam

#-----------------
# Sambamba index

sambamba-0.8.0-linux-amd64-static index \
    -t 32 \
    proband_bwaMEM_sort_dedupped_sambamba.bam \
    proband_bwaMEM_sort_dedupped_sambamba.bam.bai

Benchmarks

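| Samtools/Picard | Time     | Sambamba | Time     |
|-----------------|----------|----------|----------|
| View/Sort       | 01:22:35 | View     | 01:21:58 |
|                 |          | Sort     | 00:46:01 |
| Markdup         | 04:16:02 | Markdup  | 00:48:04 |
| Index           | 00:32:59 | Index    | 00:15:48 |
|                 | 06:11:46 |          | 03:11:57 |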
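
To make the view/sort gap concrete, the timings above can be converted to seconds and compared directly. This is only a small sketch: `to_seconds` is a hypothetical helper name, and the three durations are copied from the benchmark figures in this issue.

```shell
# Hypothetical helper: convert an HH:MM:SS duration to seconds.
# awk treats the fields as decimal numbers, so leading zeros are harmless.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Durations taken from the benchmark figures in this issue.
samtools_sort=$(to_seconds 01:22:35)   # samtools sort (SAM -> sorted BAM in one step)
sambamba_view=$(to_seconds 01:21:58)   # sambamba view (SAM -> BAM)
sambamba_sort=$(to_seconds 00:46:01)   # sambamba sort
sambamba_total=$(( sambamba_view + sambamba_sort ))

echo "samtools sort:        ${samtools_sort}s"
echo "sambamba view + sort: ${sambamba_total}s"
echo "difference:           $(( sambamba_total - samtools_sort ))s"
```

With these numbers the two-step sambamba route takes 7679 s against 4955 s for the single samtools sort call, so the conversion step alone accounts for most of the difference.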

Although the overall time for markdup and index is greatly improved, I tried different numbers of cores for sambamba view and sambamba sort (8, 16, and 32) and found that, even at the optimum number of cores, they were slower together than the single samtools command:

samtools sort -@ 8 -o proband_bwaMEM_sort.bam proband_bwaMEM.sam
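
For anyone wanting to repeat the core-count comparison, a timing loop along the following lines could be used. It is a rough sketch rather than the method used for the benchmarks in this issue: `elapsed_seconds` and the `proband_sort_t*.bam` output names are hypothetical, and the loop only runs if the sambamba binary and `proband_tmp.bam` are present.

```shell
# Hypothetical helper: run a command and print its wall-clock time in whole seconds.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $(( end - start ))
}

SAMBAMBA=sambamba-0.8.0-linux-amd64-static   # binary name as used in this issue

# Time sambamba sort at each thread count, but only if the tool and input exist.
if command -v "$SAMBAMBA" >/dev/null 2>&1 && [ -f proband_tmp.bam ]; then
    for t in 8 16 32; do
        secs=$(elapsed_seconds "$SAMBAMBA" sort -t "$t" \
            proband_tmp.bam -o "proband_sort_t${t}.bam")
        echo "threads=${t}: ${secs}s"
    done
fi
```

Whole-second resolution is coarse, but it is adequate for steps that run for tens of minutes, as they do here.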

This is not a bug. You can see the same in earlier speed tests. But thanks for reporting.