macs3-project/MACS

Q: impact of sequencing throughput on peak calling

SplitInf opened this issue · 4 comments

Dear @taoliu

Thank you for macs and for the constant improvements you have been adding.

Use case
I have some histone chip data on cancer cell lines where they were sequenced extra deep by the sequencing service centre (>80M reads).

Describe the problem
Surprisingly, despite all cell lines being quite similar, the number of peaks called varied. A few deeply sequenced samples have significantly more peaks being called. Visually inspecting peak calling indicates that there are noticeably more small/noise-like peaks being called. It's also surprising since the input samples were sequenced deeply and would, in theory, help to mitigate the signal/noise problem.

Describe the solution you tried
QC results indicate all samples were successful, with an average RiP% > 15% ; RiBL% <5%. Down-sampling of the samples in question helped to lower the number of peaks being called. My question is: Are these behaviours expected and there is a recommended way to handle samples with uneven sequencing depths? Thanks!

It's related to a better way to control the threshold of calling peaks. One way is to use the IDR approach if you have multiple replicates. Or if you have one deeply sequenced sample, you can divide them into multiple technical replicates, then use the IDR approach later ( it's called pseudo replicate in the ENCODE pipeline, for example encode Histone ChIP-seq pipeline ). Another way is to try to use the cutoff-analysis approach to decide a better cutoff in MACS.

Thank you for your reply @taoliu! Digging deeper into my dataset, I realized that these problematic samples have noticeably higher proportion of peaks with low signalValue (but with high confidence).
I tried running the cutoff analysis and found that even using a very stringent qvlaue couldn't get rid of these weak peaks. My question now is, would you recommend using signalValue threshold? The same antibody was used for all samples and processed together, so I don't think it's the problem with the reagent. Many thanks again.

@SplitInf Yes. I would suggest you apply two thresholds -- signalValue (it is the log2foldchange values, so 1 means 2 folds), and p or q-value. It's common that when the number is big, a small log2fc can have a high confidence (p or q-value). As in RNA-seq differentially expressed genes analysis (e.g. volcano plot), you can use both at the same time to narrow down the results.

thanks again for the explanation