getzlab/rnaseqc

Library Complexity Estimation question

Hi Aaron,
I find that in many cases the Estimated Library Complexity is higher than the number of Total Reads, let alone Mapped Reads. I looked at Picard's EstimateLibraryComplexity page and couldn't find anything that reconciles this. I may be missing something basic; could you please explain?

Thanks,
Binyamin

The complexity estimation formula is something that was migrated from the original RNA-SeQC. I don't fully understand the significance of the equation, but my best explanation is that it is choosing some number which minimizes the difference between the observed number of unique fragments and the expected count given the ratio of duplicates in the sample.
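
For reference, here is the relation Picard documents for its library size estimate (a Lander-Waterman model). That RNA-SeQC uses the same form is an assumption and is not confirmed in this thread. Here $X$ is the estimated library size, $N$ is the number of fragments (read pairs) examined, and $C$ is the number of unique fragments observed:

```latex
\frac{C}{X} = 1 - e^{-N/X}
```

Under that assumption, the reported estimate is the value of $X$ that satisfies this equation for the observed $N$ and $C$.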

However, eyeballing the equation, it doesn't look like it should be able to exceed the number of unique fragments in the sample. Would you mind sharing the counts of total, unique, duplicate, and mapped reads in your sample?

Examples of samples where Estimated Library Complexity > Total Reads:
Sample | Total Reads | Duplicate Reads | Mapped Reads | Mapped Unique Reads | Mapped Duplicate Reads | Estimated Library Complexity

sample1 | 170035455 | 27503680 | 154920387 | 127416707 | 27503680 | 190745164
sample2 | 124674494 | 16439632 | 112075650 | 95636018 | 16439632 | 171063844
sample3 | 129655783 | 17936652 | 118193394 | 100256742 | 17936652 | 173703316
sample4 | 230679529 | 36815144 | 210450663 | 173635519 | 36815144 | 263181864

Examples of samples where Estimated Library Complexity < Total Reads:
Sample | Total Reads | Duplicate Reads | Mapped Reads | Mapped Unique Reads | Mapped Duplicate Reads | Estimated Library Complexity

sample5 | 142125449 | 28594359 | 126888209 | 98293850 | 28594359 | 118202495
sample6 | 76333639 | 19017293 | 72214090 | 53196797 | 19017293 | 55899352
sample7 | 75365039 | 15340455 | 70754460 | 55414005 | 15340455 | 69308773
sample8 | 63131996 | 15289542 | 59435495 | 44145953 | 15289542 | 47360371

Thanks for sharing these stats. Francois shared this seqanswers post with me, which helped me understand as well. Ultimately, the estimated complexity is an estimate of the number of unique molecules in your library. If your sample has fewer total reads than the estimated complexity, then it's likely that there are unique fragments not represented in your sample, although you're probably already capturing the most common ones. Does that help?
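
To make that interpretation concrete, here is a minimal sketch that solves the Lander-Waterman relation `C / X = 1 - exp(-N / X)` by bisection, in the style of Picard's library size estimate. Two things are assumptions and not confirmed in this thread: that RNA-SeQC uses this same relation, and that the inputs are fragment (read pair) counts, approximated here by halving the mapped and mapped-unique read counts from the tables above. The function name `estimate_library_size` is hypothetical.

```python
import math


def estimate_library_size(n_fragments, n_unique):
    """Solve n_unique / X = 1 - exp(-n_fragments / X) for X by bisection.

    Assumed model (Lander-Waterman); not taken from the RNA-SeQC source.
    """
    if not 0 < n_unique < n_fragments:
        raise ValueError("need 0 < n_unique < n_fragments")

    def f(x):
        # Positive while x is below the root, negative above it.
        return n_unique / x - 1.0 + math.exp(-n_fragments / x)

    lo, hi = float(n_unique), 100.0 * n_unique
    while f(hi) > 0:          # widen the bracket until it contains the root
        hi *= 10.0
    for _ in range(100):      # plain bisection
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# (total reads, mapped reads, mapped unique reads) from the tables above
samples = {
    "sample1": (170035455, 154920387, 127416707),
    "sample5": (142125449, 126888209, 98293850),
}
for name, (total, mapped, unique) in samples.items():
    # Assumption: fragments are read pairs, so halve the read counts.
    estimate = estimate_library_size(mapped / 2, unique / 2)
    print(name, round(estimate), "exceeds total reads:", estimate > total)
```

Under these assumptions the estimate for sample1 comes out near 1.9e8, above its total read count, while the estimate for sample5 falls below its total read count, matching the pattern in the tables. Because `1 - exp(-N / X)` is always less than 1, the solution `X` is always larger than the number of unique fragments, and a low duplicate rate pushes it well past the number of reads sequenced.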

I'm closing this issue as I haven't heard back in a month. If you still have further questions, feel free to reopen it.