dpryan79/MethylDackel

How to exclude the first 10bp of each read, irrespective of orientation

DaGaMs opened this issue · 9 comments

I'm making a separate issue from #101 here since this is not a bug but a question, really:

I have a directional library of ~140 bp reads after adapter and quality trimming. I want to ignore the first 10bp of each F and R read. How would I do this? --nOT 10,0,0,0 --nOB 10,0,0,0? I'm really confused by this sentence in the command line help:

"Include calls at positions from A through B on read #1 and C through D on read #2"

If I don't know the exact length of the R2 read in a F1R2 pair, how do I say "ignore 10bp off the 5' end"? --nOT 10,0,10,0?

I think we need to first distinguish what you mean by the first 10 bases of a read. Are these the first 10 bases produced by the sequencer (what I presume you want) or something else? Normally bias arises during library prep, such as when using a Tn5-based prep, and therefore it's the first 10 (or so) bases produced by the sequencer that should get ignored.

Assuming you want to exclude the first 10 bases produced by the sequencer, then --nOT 10,0,0,140 --nOB 10,0,0,140 would do it (presuming you originally had 150-base reads).

Hmm... so how does this work for trimmed reads? I interpret this syntax like this: if not all reads are 150bp, then F1 reads will still have their first 10bp ignored, even if the F1 is only e.g. 140bp and not 150bp long. But for R2, no "ignoring" will happen if the first 10bp were trimmed, because 0,140 in fact covers the whole read. That's the part I don't get...

Correct, since it's impossible for the tool to know what the original length was. However, you do know what it was, so you can just subtract 10 from that.
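To make the arithmetic concrete, here is a minimal Python sketch of the inclusion rule as the help text describes it. It is a toy model, not MethylDackel's code: `included_positions` is a hypothetical helper, positions are taken to be 1-based in whatever coordinates the bounds use, and a bound of 0 is assumed to mean "no limit on that side".

```python
def included_positions(read_length, lower, upper):
    """Return the 1-based positions kept under inclusion bounds (lower, upper).

    Assumption: a bound of 0 means "no limit on that side".
    """
    first = lower if lower > 0 else 1
    last = min(upper, read_length) if upper > 0 else read_length
    return list(range(first, last + 1))


original_length = 150   # read length before trimming (from the thread)
n_ignore = 10           # bases to ignore at the sequencer's start of read 2
upper = original_length - n_ignore   # the 140 in "0,140"

# Untrimmed read 2: positions 141..150 are dropped.
full = included_positions(150, 0, upper)
dropped_full = 150 - len(full)

# Read 2 already shortened to 140 bases: the same bounds drop nothing.
trimmed = included_positions(140, 0, upper)
dropped_trimmed = 140 - len(trimmed)

print(upper, dropped_full, dropped_trimmed)  # 140 10 0
```

The second case is the one raised above: once a read is shorter than the original length, fixed bounds derived from that length no longer remove anything from it.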

Yes, but I don't know how many bases were quality trimmed for each read, because the trimming is adaptive. Also, there could have been different adapter/barcode contents, so each read could be slightly different in length. That means I can't express "10bp from the end of the read in OT coordinates", because that position is not well defined for arbitrary-length reads. Would it not be possible to address the positions in the orientation of the read? E.g. --OT 10,0,10,0 meaning the first 10 bases in the 5'->3' direction of F1 and the first 10 bases in the 5'->3' direction of R2?

The adapter is on the other end of the read. You don't need to know how many bases were quality trimmed: if the bias is in the first 10 bases and they were quality trimmed, then they're already gone, so removing yet another 10 bases will only decrease your signal. If you have some use case where the bias follows the 5' trimming of reads regardless of how much was trimmed, I'd be curious how it arose.

I think my brain is slightly fried, it's late and I have been staring at code for too long 😅 I'm probably missing something totally obvious. I mean this:

read F1
1                  20
X^^----------------->
                         <------------------XX
                         20                  1
                                       read R2

The X are removed by Trim Galore. Then F1 is 19bp long and R2 is 18bp long. If I say --OT 2,0,0,18, for example, wouldn't the two bases marked as ^ on F1 be ignored, but nothing on R2?
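The toy example can be checked against a literal reading of the help text quoted earlier ("Include calls at positions from A through B on read #1 and C through D on read #2"). The Python sketch below is only a model of that sentence: `kept` is a hypothetical helper, and treating 0 as "no limit" is an assumption.

```python
def kept(read_length, lower, upper):
    """1-based positions included under bounds (lower, upper); 0 = no limit (assumed)."""
    first = lower if lower > 0 else 1
    last = min(upper, read_length) if upper > 0 else read_length
    return set(range(first, last + 1))


# Bounds A,B,C,D = 2,0,0,18 applied to the trimmed reads from the diagram.
a, b, c, d = 2, 0, 0, 18
f1_len, r2_len = 19, 18

f1_ignored = sorted(set(range(1, f1_len + 1)) - kept(f1_len, a, b))
r2_ignored = sorted(set(range(1, r2_len + 1)) - kept(r2_len, c, d))

print(f1_ignored)  # [1]
print(r2_ignored)  # []
```

In this model nothing is ignored on R2, as suspected, and a lower bound of 2 skips only position 1 of F1; skipping both bases marked ^ would need a lower bound of 3.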

Is there a reason you're quality trimming the beginning of your reads? That's extremely unusual. Adapters and lower quality are at the other ends of them.

Sorry, I was so knackered yesterday evening I couldn't respond any more. So, one case is that we have barcodes at the beginnings and ends of fragments, and the demultiplexing doesn't always work 100%. The sequence composition bias of the barcode might also affect the quality of the first few bases after the barcode. In any case, my mbias plots showed a problem at the beginnings of the FW and RV reads, so I wanted to trim there.

Ok, one last request for help: my mbias plots look like this one here:

[Image: 071-015_scrbsl_bacteriophage_lambda_CpG mbias plot]

Red is read 1, blue is read 2. The data come from one of the new non-bisulfite methods, so I couldn't use the original visualisation, as it is inverted. What would be a sensible setting for this data? --OT 1,130,1,130 --OB 10,140,10,140?