MarkDuplicates with single end data does not write the full report
mattsoup opened this issue · 3 comments
This issue tracker is for bug reports only. Before opening a new issue here, please make a post on our support forum so that our support team can look at your issue and determine whether it needs to be escalated into a bug report.
Instructions
- Use a concise yet descriptive title;
- Determine whether your issue is a bug report, a feature request, or a documentation request;
- Choose the corresponding template block below and fill it in, replacing or deleting text in italics (surrounded by
_
) as appropriate; - Delete the other template blocks and this header.
Bug Report
Affected tool(s)
MarkDuplicates
Affected version(s)
- Latest public release version 3.1.1
- Latest development/master branch as of [date of test?]
Description
MarkDuplicates with paired end data writes the expected report file. However, with the same data but with all read 2s filtered out (so, 'single end'), the output report has no value for ESTIMATED_LIBRARY_SIZE
, and no histogram data is generated.
Steps to reproduce
picard MarkDuplicates INPUT='Filter_on_data_16' OUTPUT='/mnt/galaxy/data6/user-data/010/975/dataset_10975886.dat' METRICS_FILE='/mnt/galaxy/data6/user-data/010/975/dataset_10975885.dat' REMOVE_DUPLICATES='false' ASSUME_SORTED='true' DUPLICATE_SCORING_STRATEGY='SUM_OF_BASE_QUALITIES' OPTICAL_DUPLICATE_PIXEL_DISTANCE='100' VALIDATION_STRINGENCY='LENIENT' TAGGING_POLICY=All QUIET=true
Expected behavior
The full MarkDuplicates report should be written regardless of whether the input is paired or single end.
Actual behavior
Single end reports do not have estimated library size or the histogram data written.
Did you reset the samflags on the remaining single reads, or are they still marked as paired ?
For this particular test I didn't reset the flags, but we found the problem with actual single end data that was never marked as paired.
Looks like ESTIMATED_LIBRARY_SIZE
is only calculated for paired-end reads:
picard/src/main/java/picard/sam/DuplicationMetrics.java
Lines 111 to 112 in c8b2c06
One could potentially check if the input contains all single-end or paired-end reads (what about paired-end reads with an unmapped mate being counted towards UNPAIRED_READS_EXAMINED
?) and use the corresponding values as inputs to the function, but I'm not sure if this behavior was ever intended. @yfarjoun @lbergelson Any opinions?