amplab/snap

High number of duplicates with -so option

carlosmag opened this issue · 3 comments

Hi,
I am getting about 2.5 times more duplicates marked when running snap paired with the -so option than when running snap paired without -so and piping the output to samtools markdup or Picard MarkDuplicates.
Stats obtained with samtools flagstat: 646886 duplicates (with -so) vs 264372 duplicates (without -so, marked externally).

Is there any issue with snap or interoperability with other tools?

Test genome and bam files here
Reference genome here

SNAP version 1.0dev.102
samtools 1.10
Picard 2.23.0

We are aware of the differences between SNAP duplicate marking and Picard MarkDuplicates. The differences in SNAP are mainly due to: (1) not taking soft-clips into account and (2) not marking singleton read duplicates (i.e., when only one read in the pair is mapped).
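To illustrate point (1), here is a minimal Python sketch of how soft-clipping can change the coordinate used to compare reads. It is not SNAP's or Picard's actual code. The function name `unclipped_five_prime` is hypothetical, and the sketch assumes that a clip-aware tool compares the 5' position after extending the alignment over leading or trailing clipped bases (S and H CIGAR operations), while a clip-unaware tool compares the reported POS directly.

```python
import re

# Matches one CIGAR operation, e.g. "5S" or "45M".
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def unclipped_five_prime(pos, cigar, reverse):
    """Return the 1-based 5' coordinate of a read, extended over clipped bases.

    pos     -- 1-based leftmost aligned position (SAM POS field)
    cigar   -- CIGAR string (SAM CIGAR field)
    reverse -- True if the read maps to the reverse strand (SAM flag 0x10)

    Hypothetical helper for illustration only.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]

    if not reverse:
        # Forward strand: the 5' end is on the left, so subtract leading clips.
        leading = 0
        for n, op in ops:
            if op not in "SH":
                break
            leading += n
        return pos - leading

    # Reverse strand: the 5' end is on the right, so take the alignment end
    # (POS plus reference-consuming operations) and add trailing clips.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trailing = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trailing += n
    return pos + ref_len - 1 + trailing


# Two forward reads from the same fragment; the second has 5 soft-clipped bases.
a = unclipped_five_prime(100, "50M", reverse=False)
b = unclipped_five_prime(105, "5S45M", reverse=False)
print(a, b)  # both reads share the same clip-aware 5' coordinate
```

Under these assumptions the two reads share a clip-aware 5' coordinate but have different POS values (100 and 105), so a tool that keys on POS and a tool that keys on the unclipped coordinate would group them differently.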

I tried your dataset on a version of a new release that we are currently working on and saw significantly fewer differences (~1000-2000 reads compared with Picard MarkDuplicates). Unfortunately, the new version of the code has diverged significantly from the version that you are using, making it difficult for us to backport these changes onto your version. The new release will include bug fixes as well as support for affine gap scoring and performance improvements. We plan to release it in the next few months, before the end of summer.

I will keep this issue open and will update here once we have resolved the discrepancies.

--Arun

The newly released 1.0 version has nearly identical duplicate marking to Picard. I'm going to close this. If you still see problems, please reopen it or open a new issue.