FelixKrueger/Bismark

CpG count query

MarkOEze opened this issue · 6 comments

Hello,

I have some DNA methylation data, which I prepared using EM-Seq and got around 236 million paired-end reads after sequencing on the Illumina NextSeq platform. I have some concerns regarding the number of CpGs I obtained. It is known that there are 28 million CpGs in the human genome; however, the number of CpGs I got, especially at 1X, 3X, and 5x coverage, suggests that the CpG counts may have been overestimated.

I trimmed my reads using Trimgalore, aligned them and extracted the methylation information using Bismark. Please look at the snippets of my codes and the corresponding results below.

Alignment:
bismark --bowtie2 --gzip -N 1 --un --ambiguous --phred33-quals -p 4 --multicore 20 --genome /mnt/hcs/dsm-pathology-chatterjee-genomics/Euan_Rodger/Ref_Genomes/GRCh37 -1 AAFJNNMM5-8861-04-00-01_S1_L001_R1_001_val_1.fq.gz -2 AAFJNNMM5-8861-04-00-01_S1_L001_R2_001_val_2.fq.gz

Methylation extraction:
bismark_methylation_extractor --gzip --bedGraph --comprehensive --s --multicore 10 --merge_non_CpG --cutoff 1 -o ./bedgraph_10_2/ AAFJNNMM5-8861-06-00-01_S3_L001_R1_001_val_1_bismark_bt2_pe.bam

N/B: I repeated this for all the samples at 3X, 5X, and 10X coverage (codes not shown).

Sample 1x coverage 3x coverage 5x coverage 10x coverage
S1 48,841,280 29,043,575 12,581,307 650,604
S2 50,761,706 34,860,543 18,166,369 1,472,475
S3 44,959,784 21,303,364 7,247,631 265,220

I also counted the number of CpGs with methylkit using the bam files obtained from the Bismark alignment.
Result:

Sample 1x coverage 3x coverage 5x coverage 10x coverage
S1 49,783,470 31,603,367 16,688,990 1,992,364
S2 52,039,475 37,017,454 22,484,852 3,749,399
S3 45,501,338 26,431,339 13,237,981 1,623,263

Although the second approach seems to give more reasonable results, I still have serious concerns about the accuracy of the CpG count at 1x, 3x, and 5x coverage.

Sorry about the long text; I wanted to provide as much relevant information as possible. Please let me know if you require more information.

I look forward to hearing from you soon.

Thank you.

Kind regards,
Mark.

Hi Mark,

Forgive me if this a naive suggestion, but could the confusion simple arise from the fact that there are indeed ~28 million CpG dyads in the genome, but the calls reported by Bismark are at single bp. In other words, to account for top and bottom strand CpGs you would need to multiply by 2, so ~56M in total?

CG
GC

Thanks Felix. Hope you are good. Really long time. :-). Mark is doing PhD in my lab.
I am wondering what is the best way you will suggest to utilise methylation extractor (or other tools) to obtain CpG dyads only (i.e, 28 million, not 56 million) with coverage info.
Thanks,
Aniruddha

I'm good, thanks! You can continue to the bedGraph/coverage step, either by using --bedGraph from within the methylation extractor, or run bismark2bedGraph afterwards. Next, you can run coverage2cytosine with the coverage file as input, and selecting --merge_CpG. This should do exactly what you need.

I added a comment to this for issue #317.