Identify and remove PCR jackpots
Opened this issue · 9 comments
PCR jackpots are paired 5' and 3' peaks that are about 1 read length apart (or less). Can we automatically remove those peaks?
Idea:
A reads pre-processing prior to peak calling
For every 3' peak of value n:
Look at 40nt upstream in the 5' channel, take the maximum --> m
If m*.75 < n < m*1.25, then remove 3' peak.
Some pseudo-code:
... make peaks list, strand-specific...
3_peaks = {
loc --> {strand --> + or -, value --> int}
}
5_peaks = {
loc --> {strand --> + or -, value --> int}
}
range_to_look = 40
def trim_pcr_jackpoints(3_peaks, 5_peaks):
for peak_loc, peak_info in 3_peaks.values():
start_looking = peak_info["value"]
end_looking = start_looking - range_to_look if peak_info["strand"] == "+" else start_looking + range_to_look
# Look from start to end for a peak
for i in range(start_looking, end_looking):
if i in 5_peaks:
# Then compare 5_peaks[i]["value"] and peak_info["value], see if they're close. If so, remove both peaks.
I think it might be worth looking into how much of an issue this actually is! ie - I know we sometimes have internal peaks that are called but appear to be False Positives. Can these peaks be explained as PCR jackpots? (ie do they have the signature referenced above?)
The reason why I ask is that what Jenny was seeing was clearly not PCR jackpots.
If this does seem to be an issue for at least some cases then I could see this being added as an optional filtering step - ie "quality control" on the Peak calling that users could optionally turn on. But if we can't find any cases of this causing false positives I am a little hesitant to tack on the complexity.
Absolutely, I agree
Probably worth giving an update on Jenny's problem
There were actually quite a few examples of what I think might be PCR jackpots. Here's an example:
Scrolling through, almost all of these were 15-18nt reads. Is there another explanation that could explain that?
Usually during read trimming, we filter reads less than 15nt. I increased that threshold to 20nt. It lowered the read coverage after-mapping (from ~18 million to ~16 million), but eliminated enough of those presumed-jackpots for Jenny to find clear 3' motifs.
Just to check - we can't explain this via high sequence similarity to other regions? I thought we had verified that was the status of at least some of these Jackpot-like reads? To be honest its a little hard for me to tell exactly what is going on from the screenshot alone so perhaps I should pull up her data myself when I have more time. But- to me this still doesn't really look like what I would expect from a PCR jackpot, especially since she only amplified 8 times.
The main reason I am hesitant to jump on the "PCR jackpot" bandwagon with this error as it seems like a surprising source of error to me - at least for the rendseq pipeline where the PCR primers should be annealing to the same location for each read (ie all reads should have part of the adapter already as the result of the cir-ligation I would think), so to me it suggest that either a) - there was a bias somehow introduced earlier in the library prep that meant some reads didn't receive the regions of PCR primer homology or b) - there is some other issue in the analysis pipeline that can explain this behavior (which I again - I thought we had explained was the result of high sequence similarity - but again I might have mis-remembered).
But I take it this was after bowtie aligning/only taking reads which mapped to 1 unique location?
But I take it this was after bowtie aligning/only taking reads which mapped to 1 unique location?
Indeed! bowtie -n 2 -l 28 -e 40 -m 1
, unique alignments with <2 mismatches in the first 28 bases, where reads have a quality score sum no greater than 40.
there is some other issue in the analysis pipeline that can explain this behavior (which I again - I thought we had explained was the result of high sequence similarity - but again I might have mis-remembered).
You remember correctly! That was a problem earlier on
Nice, I would imagine that with those options it should work! Just as a sanity check though - have you have independently blasted the region you shared above to see how much sequence similarity it has to other parts of the genome?
Do you see what I mean about the PCR jackpot being an odd explanation from a library prep point of view? I could be totally off on that, but it still gives me the feeling there aught to be another explanation
Do you see what I mean about the PCR jackpot being an odd explanation from a library prep point of view? I could be totally off on that, but it still gives me the feeling there aught to be another explanation
Yeah, totally--I'm going to pull up these files and run them by Cassandra today and Gene tomorrow and see what they think, I'll update here afterwards
Sure! If Cassandra has the time. I will also be in tomorrow and can take a look then as well.