Why does `sambamba sort` give me a different sort order across multiple runs (on the same BAM file)? Isn't `sort` by coordinates deterministic?
etrh opened this issue · 2 comments
Running `sambamba sort` on the same BAM file (`input.bam`) gives me differing results from one run to another. Is this expected behavior? I'm not sure what's causing this, but the order of queries seems to be flipped when I run a `diff`.
I'm using `sambamba 0.8.2` and comparing the BAM files by looking at their MD5 hashes via:
samtools view sorted_input.bam | md5sum
The amount of memory (`-m`) seems to play a role here as well. I ran this twice:
sambamba sort -m 3500MB -o input_sorted_with_3500MB.bam input.bam
and the resulting MD5 hashes from both runs were the same.
Then I changed the amount of memory slightly and ran:
sambamba sort -m 3200MB -o input_sorted_with_3200MB.bam input.bam
These two runs with 3200MB of memory also shared the same MD5 sum with each other. However, that MD5 sum was different from the one produced by the prior runs with 3500MB of memory.
Strangely enough, I also tried running `sambamba sort` on the already sambamba-sorted BAM file, and to my surprise the MD5 sums were again different between the sorted and twice-sorted BAM files. (This does not happen when I run `samtools sort` on a BAM file that has already been sorted by sambamba. That is to say, the MD5 hash remains the same after running `samtools sort` on an already sambamba-sorted BAM file.)
I'm really confused about why this is happening. Isn't `sambamba sort` supposed to be deterministic?
I have confirmed that all of these BAM files with differing MD5 sums are in fact identical when sorted by read name (i.e. with `samtools sort -n`). The coordinate-sorted ones, however, have swapped rows in them. These swaps seem to happen only within a shared `RNAME:POS` key/group. So the ordering differs at the level of QNAME and FLAG among records that share the same key (the key here being the combination of RNAME and POS).
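To illustrate how ties within an `RNAME:POS` group can end up in a different order while the output is still validly coordinate-sorted, here is a toy Python sketch. It is not sambamba's implementation: the records, the `chunked_sort` helper, and the idea that the memory limit translates into a chunk size whose chunks are merged without preserving input order are all assumptions made for the demonstration.

```python
import hashlib
import heapq

# Hypothetical alignment records: (RNAME, POS, QNAME, FLAG).
RECORDS = [
    ("chr1", 100, "readB", 99),
    ("chr1", 50, "readC", 0),
    ("chr1", 100, "readA", 147),
    ("chr1", 100, "readD", 16),
    ("chr1", 50, "readE", 16),
]


def coord_key(rec):
    """Coordinate sort key: RNAME and POS only, so QNAME/FLAG never break ties."""
    return (rec[0], rec[1])


def chunked_sort(records, chunk_size):
    """Toy external sort: sort fixed-size chunks, then merge them.

    The chunks are merged in reverse order to mimic a merge step that does
    not preserve input order among equal keys (an assumption for this demo).
    """
    chunks = [
        sorted(records[i:i + chunk_size], key=coord_key)
        for i in range(0, len(records), chunk_size)
    ]
    return list(heapq.merge(*reversed(chunks), key=coord_key))


def digest(records):
    """MD5 of the records rendered as tab-separated lines (order-sensitive)."""
    text = "\n".join("\t".join(map(str, r)) for r in records)
    return hashlib.md5(text.encode()).hexdigest()


small = chunked_sort(RECORDS, 2)  # stands in for a lower memory limit
large = chunked_sort(RECORDS, 3)  # stands in for a higher memory limit

print([r[2] for r in small])
print([r[2] for r in large])
# Order-sensitive digests differ, order-insensitive digests agree.
print(digest(small) == digest(large))
print(digest(sorted(small)) == digest(sorted(large)))
```

In this model the same chunk size always reproduces the same output, a different chunk size permutes records only inside a shared key, and hashing after a full sort on all fields shows that the content is identical.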
TL;DR: Sorting is probably deterministic, but compressing the result is not necessarily.
If you want to compare the content of the files, you should convert your BAMs to SAMs via `sambamba view` prior to comparing them. You are likely picking up differences in how the same information is compressed. Without looking at the code, I am sure compression happens in some sort of block fashion via a buffer of a certain size.
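As a generic illustration of that point (byte-level hashes of compressed data can differ even when the decompressed content is identical), here is a small Python sketch. It uses plain `zlib` rather than the BGZF block format that BAM actually uses, and the payload is made up, so treat it as an analogy and not as a statement about what sambamba does.

```python
import hashlib
import zlib

# Made-up SAM-like payload, repeated so the compressor has something to work on.
payload = b"chr1\t100\treadA\t99\n" * 1000

# Same content, two different compression settings.
fast = zlib.compress(payload, 1)
best = zlib.compress(payload, 9)

# Hashes of the compressed bytes differ ...
print(hashlib.md5(fast).hexdigest() == hashlib.md5(best).hexdigest())

# ... while the decompressed content is byte-for-byte identical.
print(zlib.decompress(fast) == zlib.decompress(best) == payload)
```

This is why comparing the decompressed records is the more meaningful check when the question is whether two files hold the same data.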
@mschilli87 I just converted two of the BAM files in question into SAM, and the resulting MD5 sums (`md5sum converted_input.sam`) were exactly the same as those from `samtools view input.bam | md5sum`. In other words, the BAM-to-SAM converted file and the output of `samtools view input.bam` are identical.