broadinstitute/picard

Inconsistencies in Picard MarkDuplicates Histogram Data Output

eric-holt opened this issue · 2 comments

Bug Report

Affected tool(s)

Picard MarkDuplicates

Affected version(s)

Latest public release version 3.0.0

Description

While using the MarkDuplicates tool in Picard version 3.0.0, I have encountered an issue with the output metrics file, specifically the histogram data. In the histogram data, the "non_optical_sets" value is greater than the "all_sets" value for the first row of each sample. This is counter-intuitive as "non_optical_sets" should be a subset of "all_sets" and thus should not exceed it.

Furthermore, the sum of "all_sets" is exactly equal to "READ_PAIRS_EXAMINED" minus "READ_PAIR_DUPLICATES" from the metrics data, suggesting that "all_sets" represents the number of reads that are not duplicates. This interpretation is not clear from the documentation.

Steps to reproduce

Here is the code used to generate the metrics:

dedup() {
    dir=$1
    mkdir -p $dir/duplicate_metrics
    mkdir -p $dir/deduped
for file in $dir/*.bam; do 
    base=$(basename "$file" .bam)
    out="$dir/deduped/${base}_dedup.bam"
    met="$dir/duplicate_metrics/${base}.txt"
    java -jar ~/picard.jar MarkDuplicates \
        I=$file \
        O=$out \
        M=$met\
        ASSUME_SORT_ORDER=coordinate \
        REMOVE_DUPLICATES=true \
        && echo "Successfully processed $file" \
        || echo "Error processing $file"
done
}

Unfortunately, I am unable to provide the input files due to confidentiality reasons.

Expected behavior

The "non_optical_sets" value should not exceed the "all_sets" value. The sum of "all_sets" should not be equal to "READ_PAIRS_EXAMINED" minus "READ_PAIR_DUPLICATES" unless "all_sets" is intended to represent the number of reads that are not duplicates.

Actual behavior

The "non_optical_sets" value is greater than the "all_sets" value for the first row of each sample. The sum of "all_sets" is exactly equal to "READ_PAIRS_EXAMINED" minus "READ_PAIR_DUPLICATES".

@eric-holt I think the behavior here is correct, although I do agree that it feels a bit unintuitive. The issue is that a single duplicate set can contain both optical and non-optical duplicates, in which case the sizes of the different types of duplicate sets will be different. Consider for example a duplicate set that consists of 1 non-duplicate, 2 non-optical duplicates and 1 optical duplicate. This will be counted in the "all_sets" histogram as a duplicate set of size 4, in the "non_optical_sets" histogram as a duplicate set of size 3, and in the "optical_sets" histogram as a duplicate set of size 2. Because of this, we should not expect the "all_sets" value for a particular bin of the histogram to always be larger than the "optical_sets" or "non_optical_sets" values for the same bin. However, we should expect the sum of all bins for the "all_sets" histogram to be larger than the sum of all bins for either the "optical_sets" or "non_optical_sets" histograms.

As far as the sum of "all_sets" being exactly equal to "READ_PAIRS_EXAMINED" minus "READ_PAIR_DUPLICATES", this is just a result of the way these metrics are defined. Each duplicate set will have exactly 1 non-duplicate read pair. And a single non-duplicate read pair (with no corresponding duplicates) is counted in the "all_sets" histogram as a duplicate set of size 1. So there is a 1-to-1 mapping between duplicate sets and and non-duplicate read pairs. Since "READ_PAIRS_EXAMINED" minus "READ_PAIR_DUPLICATES" gives the number of non-duplicate reads, this will be equal to the sum of the "all_sets" histogram (the total number of duplicate sets, including singleton sets), as you have observed.

This seems to be resolved.