help combining bismark pipeline with methylated UMIs
shawpa opened this issue · 8 comments
I have a whole genome em-seq methylation library with dual indexes as well as methylated UMI's at the beginning of each R1 and R2. These are from Twist and were used to make EM-seq libraries. I am not sure how I need to modify my current bismark pipeline to use the --barcode option when I do the deduplication. I saw in a previous thread that you have a program called UMI-Grinder but I'm not sure where it fits (if at all) in the pipeline. Normally I do the following:
- Trimming with trim galore
- PE Alignment with bismark bowtie 2 option
- deduplication
- methylation calling
The UMIs are going to be the first 5 bases of each read1 and read2. The next 2 bases are supposed to be skipped and then my genomic sequencing should begin. I looked at the UMI-Grinder documention but the input seems to be a bam. Not sure how I get to a bam from a fastq without aligning and I can't align because the UMI's at the beginning of the read. Thank you for your help.
Hi there, apologies for the slow response.
UMIbam is indeed a tool for UMI-aware deduplication, kind of a slightly more capable version of deduplicate_bismark
.
What you need here is a way of transferring the UMI-sequence from the beginning of both reads to the readIDs, and then also remove the 2 additional bp, Trim Galore already has dedicated methods of doing this for the the --clock
or --implicon
modes:
--implicon This is a special mode of operation for paired-end data, such as required for the IMPLICON method, where a UMI sequence
is getting transferred from the start of Read 2 to the readID of both reads. Following this, Trim Galore will exit.
In it's current implementation, the UMI carrying reads come in the following format:
Read 1 5' FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF 3'
Read 2 3' UUUUUUUUFFFFFFFFFFFFFFFFFFFFFFFFFFFF 5'
Where UUUUUUUU is a random 8-mer unique molecular identifier (UMI) and FFFFFFF... is the actual fragment to be
sequenced. The UMI of Read 2 (R2) is written into the read ID of both reads and removed from the actual sequence.
Here is an example:
R1: \@HWI-D00436:407:CCAETANXX:1:1101:4105:1905 1:N:0: CGATGTTT
ATCTAGTTCAGTACGGTGTTTTCGAATTAGAAAAATATGTATAGAGGAAATAGATATAAAGGCGTATTCGTTATTG
R2: \@HWI-D00436:407:CCAETANXX:1:1101:4105:1905 3:N:0: CGATGTTT
CAATTTTGCAGTACAAAAATAATACCTCCTCTATTTATCCAAAATCACAAAAAACCACCCACTTAACTTTCCCTAA
After --implicon trimming:
R1: \@HWI-D00436:407:CCAETANXX:1:1101:4105:1905 1:N:0: CGATGTTT:CAATTTTG
ATCTAGTTCAGTACGGTGTTTTCGAATTAGAAAAATATGTATAGAGGAAATAGATATAAAGGCGTATTCGTTATTG
R2: \@HWI-D00436:407:CCAETANXX:1:1101:4105:1905 3:N:0: CGATGTTT:CAATTTTG
CAGTACAAAAATAATACCTCCTCTATTTATCCAAAATCACAAAAAACCACCCACTTAACTTTCCCTAA
It should be easy enough to implement this additional trimming, you could for example use the code for --implicon
and modify it slightly to meet your needs. If this is a method is getting used more widely it might make sense to add it as an additional mode of Specialised Trimming.
It seems like the --clock is most similar to what I am doing because I have UMIs on both ends. I have no "fixed" bases so that part of the code could be removed. Hopefully I can find someone to implement this type of code change. Do the following steps sound right:
- Remove adapter sequences in case any are in my reads.
- Run modified --clock script to append UMI's to R1 and R2.
- Hard clip at least 2 bases on the 3' and 5' ends.
- Run trimgalore using default settings for length and quality trimming. Not sure if this step would be necessary.
Am I missing anything important on the trimming step?
For deduplication, I would use UMIbam instead of deduplicate_bismark in my pipeline.
I think you can start with Steps 2+3 (this can be done in one go), and then proceed with Step 4 (which will remove adapter and quality). If you add both barcodes from R1 and R2 as a string after a colon together, deduplicate_bismark
should work. Something like this:
@somereadID:CGATG_CAATT
If you can't find someone to apply a local fix I can probably take a look for you.
I just got my data and I want to confirm how the RID would need to be formatted so that the current deduplicate_bismark
would work properly. An example of the R1 and R2 are below. I cut off some of the read for clarity.
@LH00512:33:22N7MVLT3:4:1101:1564:1080 1:N:0:GATAGCCA+CTGTGTTG
GNATAGTAATTTGGTATTTATATATTGTGTTTGTTAAATTTAGTGATGTATATAGGTTTAAATATTGTTTTTTATGTGTTTTAATTTTTTATAGTTTTTTAAT
@LH00512:33:22N7MVLT3:4:1101:1564:1080 2:N:0:GATAGCCA+CTGTGTTG
CNCTGTACTCCACATTTTAAAATATATAAAACCACAAAAACATCATCTTCTATATCTTATTCTTACTAACTATATAACAATAACATTCTTAAATAAACT
So the UMI for R1 in this case is GNATA and the UMI for R2 is CNCTG. Not a great example because of the N's. You had previously said it needed to be formatted as
@somereadID:CGATG_CAATT
So does this mean my RIDs should look like the following:
@LH00512:33:22N7MVLT3:4:1101:1564:1080 1:N:0:GATAGCCA+CTGTGTTG:GNATA_CNCTG
GTAATTTGGTATTTATATATTGTGTTTGTTAAATTTAGTGATGTATATAGGTTTAAATATTGTTTTTTATGTGTTTTAATTTTTTATAGTTTTTTAAT
@LH00512:33:22N7MVLT3:4:1101:1564:1080 2:N:0:GATAGCCA+CTGTGTTG:GNATA_CNCTG
TACTCCACATTTTAAAATATATAAAACCACAAAAACATCATCTTCTATATCTTATTCTTACTAACTATATAACAATAACATTCTTAAATAAACT
Or do I also need to remove the P7 and P5 barcodes from the RID?
Yes that's correct, it will take take the last element after the last :
, so here GNATA_CNCTG
and use this in addition to the chromosome, start, end and orientation of a read for the deduplication process.
Hi Felix! I will be the one helping @shawpa with this task. Right now, the functionality we are looking to implement is to remove the first N bases of R1 and R2 and then appending those to each of the read ids as Annie showed above. We are also looking to be able to receive either a list or file of "valid" UMIs and if either R1 or R2 has a UMI that is not valid, throw away the pair. I've used cutadapt/trim_galore a bunch, but never added functionality. I would love to add the functionality and make it open source, what is the best way of going about this?
Hi Zach,
This is a very interesting idea, incidentally we have talked about Twist libraries and also libraries from Watchmaker Genomics that will apparently have a similar in-line barcoding strategy for both R1 and R2. As it turns out, it seem I may have a real interest in making this become reality myself :) My first instinct would probably be to have such a 5' inline barcode transfer function (for a variable number of nucleotides) set up as a stand-alone function (for both SE and PE), and potentially one that can be called as part of the trimming process. The latter would require a little more thinking though....
Adding functionality to validate UMIs against a UMI-allowed or whitelist seems like a rather particular, application-specific step and I wonder if this isn't best handled afterwards using a post-processing script of some sort?
I could probably have a go at UMI/inline-barcode transfer function as a stand-alone process (similar to e.g. --clock
or --implicon
). Should I have a go?
Hey Felix, sorry for the late response. If you are up for it, then by all means give it a go! I agree about validating UMI as a post-processing script. We will handle that separately. If there is anything I can do to help, please let me know. Thank you for help