broadinstitute/picard

MarkDuplicates with single end data does not write the full report

mattsoup opened this issue · 3 comments

This issue tracker is for bug reports only. Before opening a new issue here, please make a post on our support forum so that our support team can look at your issue and determine whether it needs to be escalated into a bug report.


Instructions

  • Use a concise yet descriptive title;
  • Determine whether your issue is a bug report, a feature request, or a documentation request;
  • Choose the corresponding template block below and fill it in, replacing or deleting text in italics (surrounded by _) as appropriate;
  • Delete the other template blocks and this header.

Bug Report

Affected tool(s)

MarkDuplicates

Affected version(s)

  • Latest public release version 3.1.1
  • Latest development/master branch as of [date of test?]

Description

MarkDuplicates with paired end data writes the expected report file. However, with the same data but with all read 2s filtered out (so, 'single end'), the output report has no value for ESTIMATED_LIBRARY_SIZE, and no histogram data is generated.

Steps to reproduce

picard MarkDuplicates INPUT='Filter_on_data_16' OUTPUT='/mnt/galaxy/data6/user-data/010/975/dataset_10975886.dat' METRICS_FILE='/mnt/galaxy/data6/user-data/010/975/dataset_10975885.dat' REMOVE_DUPLICATES='false' ASSUME_SORTED='true' DUPLICATE_SCORING_STRATEGY='SUM_OF_BASE_QUALITIES' OPTICAL_DUPLICATE_PIXEL_DISTANCE='100' VALIDATION_STRINGENCY='LENIENT' TAGGING_POLICY=All QUIET=true

Expected behavior

The full MarkDuplicates report should be written regardless of whether the input is paired or single end.

Actual behavior

Single end reports do not have estimated library size or the histogram data written.

Did you reset the samflags on the remaining single reads, or are they still marked as paired ?

For this particular test I didn't reset the flags, but we found the problem with actual single end data that was never marked as paired.

Looks like ESTIMATED_LIBRARY_SIZE is only calculated for paired-end reads:

this.ESTIMATED_LIBRARY_SIZE = estimateLibrarySize(this.READ_PAIRS_EXAMINED - this.READ_PAIR_OPTICAL_DUPLICATES,
this.READ_PAIRS_EXAMINED - this.READ_PAIR_DUPLICATES);

public static Long estimateLibrarySize(final long readPairs, final long uniqueReadPairs) {

One could potentially check if the input contains all single-end or paired-end reads (what about paired-end reads with an unmapped mate being counted towards UNPAIRED_READS_EXAMINED?) and use the corresponding values as inputs to the function, but I'm not sure if this behavior was ever intended. @yfarjoun @lbergelson Any opinions?