Inconsistencies in Picard MarkDuplicates Histogram Data Output
eric-holt opened this issue · 2 comments
Bug Report
Affected tool(s)
Picard MarkDuplicates
Affected version(s)
Latest public release version 3.0.0
Description
While using the MarkDuplicates tool in Picard version 3.0.0, I have encountered an issue with the output metrics file, specifically the histogram data. In the histogram data, the "non_optical_sets" value is greater than the "all_sets" value for the first row of each sample. This is counter-intuitive as "non_optical_sets" should be a subset of "all_sets" and thus should not exceed it.
Furthermore, the sum of "all_sets" is exactly equal to "READ_PAIRS_EXAMINED" minus "READ_PAIR_DUPLICATES" from the metrics data, suggesting that "all_sets" represents the number of reads that are not duplicates. This interpretation is not clear from the documentation.
Steps to reproduce
Here is the code used to generate the metrics:
dedup() {
dir=$1
mkdir -p $dir/duplicate_metrics
mkdir -p $dir/deduped
for file in $dir/*.bam; do
base=$(basename "$file" .bam)
out="$dir/deduped/${base}_dedup.bam"
met="$dir/duplicate_metrics/${base}.txt"
java -jar ~/picard.jar MarkDuplicates \
I=$file \
O=$out \
M=$met\
ASSUME_SORT_ORDER=coordinate \
REMOVE_DUPLICATES=true \
&& echo "Successfully processed $file" \
|| echo "Error processing $file"
done
}
Unfortunately, I am unable to provide the input files due to confidentiality reasons.
Expected behavior
The "non_optical_sets" value should not exceed the "all_sets" value. The sum of "all_sets" should not be equal to "READ_PAIRS_EXAMINED" minus "READ_PAIR_DUPLICATES" unless "all_sets" is intended to represent the number of reads that are not duplicates.
Actual behavior
The "non_optical_sets" value is greater than the "all_sets" value for the first row of each sample. The sum of "all_sets" is exactly equal to "READ_PAIRS_EXAMINED" minus "READ_PAIR_DUPLICATES".
@eric-holt I think the behavior here is correct, although I do agree that it feels a bit unintuitive. The issue is that a single duplicate set can contain both optical and non-optical duplicates, in which case the sizes of the different types of duplicate sets will be different. Consider for example a duplicate set that consists of 1 non-duplicate, 2 non-optical duplicates and 1 optical duplicate. This will be counted in the "all_sets" histogram as a duplicate set of size 4, in the "non_optical_sets" histogram as a duplicate set of size 3, and in the "optical_sets" histogram as a duplicate set of size 2. Because of this, we should not expect the "all_sets" value for a particular bin of the histogram to always be larger than the "optical_sets" or "non_optical_sets" values for the same bin. However, we should expect the sum of all bins for the "all_sets" histogram to be larger than the sum of all bins for either the "optical_sets" or "non_optical_sets" histograms.
As far as the sum of "all_sets" being exactly equal to "READ_PAIRS_EXAMINED" minus "READ_PAIR_DUPLICATES", this is just a result of the way these metrics are defined. Each duplicate set will have exactly 1 non-duplicate read pair. And a single non-duplicate read pair (with no corresponding duplicates) is counted in the "all_sets" histogram as a duplicate set of size 1. So there is a 1-to-1 mapping between duplicate sets and and non-duplicate read pairs. Since "READ_PAIRS_EXAMINED" minus "READ_PAIR_DUPLICATES" gives the number of non-duplicate reads, this will be equal to the sum of the "all_sets" histogram (the total number of duplicate sets, including singleton sets), as you have observed.
This seems to be resolved.