CpG count query
MarkOEze opened this issue · 6 comments
Hello,
I have some DNA methylation data, which I prepared using EM-Seq and got around 236 million paired-end reads after sequencing on the Illumina NextSeq platform. I have some concerns regarding the number of CpGs I obtained. It is known that there are 28 million CpGs in the human genome; however, the number of CpGs I got, especially at 1X, 3X, and 5x coverage, suggests that the CpG counts may have been overestimated.
I trimmed my reads using Trimgalore, aligned them and extracted the methylation information using Bismark. Please look at the snippets of my codes and the corresponding results below.
Alignment:
bismark --bowtie2 --gzip -N 1 --un --ambiguous --phred33-quals -p 4 --multicore 20 --genome /mnt/hcs/dsm-pathology-chatterjee-genomics/Euan_Rodger/Ref_Genomes/GRCh37 -1 AAFJNNMM5-8861-04-00-01_S1_L001_R1_001_val_1.fq.gz -2 AAFJNNMM5-8861-04-00-01_S1_L001_R2_001_val_2.fq.gz
Methylation extraction:
bismark_methylation_extractor --gzip --bedGraph --comprehensive --s --multicore 10 --merge_non_CpG --cutoff 1 -o ./bedgraph_10_2/ AAFJNNMM5-8861-06-00-01_S3_L001_R1_001_val_1_bismark_bt2_pe.bam
N/B: I repeated this for all the samples at 3X, 5X, and 10X coverage (codes not shown).
Sample | 1x coverage | 3x coverage | 5x coverage | 10x coverage |
---|---|---|---|---|
S1 | 48,841,280 | 29,043,575 | 12,581,307 | 650,604 |
S2 | 50,761,706 | 34,860,543 | 18,166,369 | 1,472,475 |
S3 | 44,959,784 | 21,303,364 | 7,247,631 | 265,220 |
I also counted the number of CpGs with methylkit using the bam files obtained from the Bismark alignment.
Result:
Sample | 1x coverage | 3x coverage | 5x coverage | 10x coverage |
---|---|---|---|---|
S1 | 49,783,470 | 31,603,367 | 16,688,990 | 1,992,364 |
S2 | 52,039,475 | 37,017,454 | 22,484,852 | 3,749,399 |
S3 | 45,501,338 | 26,431,339 | 13,237,981 | 1,623,263 |
Although the second approach seems to give more reasonable results, I still have serious concerns about the accuracy of the CpG count at 1x, 3x, and 5x coverage.
Sorry about the long text; I wanted to provide as much relevant information as possible. Please let me know if you require more information.
I look forward to hearing from you soon.
Thank you.
Kind regards,
Mark.
Hi Mark,
Forgive me if this a naive suggestion, but could the confusion simple arise from the fact that there are indeed ~28 million CpG dyads in the genome, but the calls reported by Bismark are at single bp. In other words, to account for top and bottom strand CpGs you would need to multiply by 2, so ~56M in total?
CG
GC
Thanks Felix. Hope you are good. Really long time. :-). Mark is doing PhD in my lab.
I am wondering what is the best way you will suggest to utilise methylation extractor (or other tools) to obtain CpG dyads only (i.e, 28 million, not 56 million) with coverage info.
Thanks,
Aniruddha
I'm good, thanks! You can continue to the bedGraph/coverage step, either by using --bedGraph
from within the methylation extractor, or run bismark2bedGraph
afterwards. Next, you can run coverage2cytosine
with the coverage file as input, and selecting --merge_CpG
. This should do exactly what you need.
I added a comment to this for issue #317.