/EPIC_annotation

An annotation for mapping CpGs on the EPIC DNA methylation microarray platform to genomic features.

Title: Annotating DNA methylation array
Date: 07/08/2019
Output: html_document (keep_md: true; theme: spacelab; toc: true; toc_depth: 3; toc_float: collapsed)
Editor options: chunk_output_type: console

This repository contains code for generating an annotation for the Illumina EPIC methylation array.

There are two annotation files: one mapped to hg19 (hg19_epic_annotation.rds) and one mapped to hg38/GRCh38 (hg38_epic_annotation.rds). Both contain the same annotation columns, but the hg38 annotation is missing 237 probes because their mappings are lost when converting coordinates from hg19 to hg38.

Currently these files are not tracked in git because they are too large (~250 MB).

The process

1. Starting annotation

I started with the default annotations provided by Illumina. I used two files: the latest B4 annotation (MethylationEPIC_v-1-0_B4.csv) and the list of probes that are missing between B3 and B2 (MethylationEPIC Missing Legacy CpG (v1.0_B3 vs. v1.0_B2) Annotations.csv). Both can be found in the product files list on Illumina's website.

Taking the intersection of these two lists of probes, I used the provided genomic location (chromosome and position) to map annotations to each CpG. Note that Illumina's provided annotations are based on hg19.

An example of the starting coordinates from Illumina that this annotation is based on:

| cpg | chr | start |
|---|---|---|
| cg00000029 | chr16 | 53468112 |
| cg00000103 | chr4 | 73470186 |
| cg00000109 | chr3 | 171916037 |
| cg00000155 | chr7 | 2590565 |
| cg00000158 | chr9 | 95010555 |
| cg00000165 | chr1 | 91194674 |
| cg00000221 | chr17 | 54534248 |
| cg00000236 | chr8 | 42263294 |
| cg00000289 | chr14 | 69341139 |
| cg00000292 | chr16 | 28890100 |

I also kept some probe-specific information that some users may find useful. The columns for these variables are all prefixed with "ilmn_".


2. Annotate CpGs

Transcript-related features, enhancers, CpG islands

I used the R package annotatr to access UCSC annotations for CpG islands and transcripts, and FANTOM5 annotations for enhancers.
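The CpG island context labels (the `cpg_id` column, with values such as `shore` and `sea`) can be illustrated with a small sketch. It is written in Python purely for illustration; the annotation itself was built with R packages, and this is not annotatr's code. It assumes shores are the 2 kb flanking each island, shelves are the next 2 kb beyond the shores, and everything else is open sea. The function name `cpg_island_context` and the island coordinates are hypothetical.

```python
from typing import List, Tuple

SHORE_BP = 2000  # assumed flank width next to an island
SHELF_BP = 2000  # assumed additional flank beyond each shore

# Lower rank means closer to an island.
RANK = {"island": 0, "shore": 1, "shelf": 2, "sea": 3}


def cpg_island_context(pos: int, islands: List[Tuple[int, int]]) -> str:
    """Classify a position relative to CpG islands on one chromosome.

    `islands` holds inclusive (start, end) pairs on the same chromosome.
    When several islands are nearby, the closest category wins.
    """
    best = "sea"
    for start, end in islands:
        if start <= pos <= end:
            label = "island"
        else:
            # Distance from the position to the nearest island edge.
            dist = start - pos if pos < start else pos - end
            if dist <= SHORE_BP:
                label = "shore"
            elif dist <= SHORE_BP + SHELF_BP:
                label = "shelf"
            else:
                label = "sea"
        if RANK[label] < RANK[best]:
            best = label
    return best


# Hypothetical island, for illustration only (not a real UCSC record).
toy_islands = [(10_000, 11_000)]
for pos in (10_500, 9_000, 7_500, 2_000):
    print(pos, cpg_island_context(pos, toy_islands))
```

With the toy island above, the four positions are labelled island, shore, shelf, and sea, in that order.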

UCSC transcript- and CpG island-related elements:

| cpg | chr | start | cpg_id | cpg_width | genes_id | genes_symbol | genes_tx_id | genes_width |
|---|---|---|---|---|---|---|---|---|
| cg00000029 | chr16 | 53468112 | shore | 2000 | promoter, 1to5kb | RBL2, RBL2 | uc002ehi.4, uc010vgv.1 | 1000, 4000 |
| cg00000103 | chr4 | 73470186 | sea | 491623 | intergenic | NA | NA | 480899 |
| cg00000109 | chr3 | 171916037 | sea | 398648 | intron, intron, intron | FNDC3B, FNDC3B, FNDC3B | uc003fhy.3, uc003fhz.4, uc003fia.3 | 93324, 93324, 93324 |
| cg00000155 | chr7 | 2590565 | sea | 3182 | intron, intron | BRAT1, BRAT1 | uc003smi.3, uc003smj.2 | 6826, 6826 |
| cg00000158 | chr9 | 95010555 | sea | 143935 | intron, intron, intron, intron, intron | IARS, IARS, IARS, IARS, IARS | uc004ars.2, uc004art.2, uc004aru.4, uc010mqr.3, uc010mqt.2 | 2306, 2306, 2306, 2306, 2306 |
| cg00000165 | chr1 | 91194674 | shore | 2000 | intergenic | NA | NA | 107309 |
| cg00000221 | chr17 | 54534248 | sea | 656815 | exon, intronexonboundary | ANKFN1, ANKFN1 | uc002iun.1, uc002iun.1 | 100, 400 |
| cg00000236 | chr8 | 42263294 | sea | 10587 | exon, exon, exon, 3UTR, 3UTR | VDAC3, VDAC3, VDAC3, VDAC3, VDAC3 | uc003xpc.3, uc031tay.1, uc022aul.1, uc003xpc.3, uc022aul.1 | 567, 567, 567, 475, 475 |
| cg00000289 | chr14 | 69341139 | shore | 2000 | exon, exon, exon, exon, exon, 3UTR, 3UTR, 3UTR, 3UTR, 3UTR | ACTN1, ACTN1, ACTN1, ACTN1, ACTN1, ACTN1, ACTN1, ACTN1, ACTN1, ACTN1 | uc001xkk.3, uc010ttb.2, uc001xkl.3, uc001xkm.3, uc001xkn.3, uc001xkk.3, uc010ttb.2, uc001xkl.3, uc001xkm.3, uc001xkn.3 | 895, 895, 895, 895, 895, 736, 736, 736, 736, 736 |
| cg00000292 | chr16 | 28890100 | shore | 2000 | 1to5kb, exon, exon, intron | ATP2A1, ATP2A1, ATP2A1, no_associated_gene | uc002drp.1, uc002drn.1, uc002dro.1, uc010vct.2 | 4000, 302, 302, 931314 |
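The `genes_*` columns hold comma-separated values, one entry per overlapping transcript feature. The sketch below shows one way to unpack such a row into per-feature records. It is written in Python for illustration only, and it assumes the entries are aligned by position across the four columns, which the cg00000236 row suggests but which I have not checked for the whole annotation. The helper name `unpack` is hypothetical.

```python
from collections import defaultdict

# The cg00000236 row, copied from the UCSC elements table.
row = {
    "cpg": "cg00000236",
    "genes_id": "exon, exon, exon, 3UTR, 3UTR",
    "genes_symbol": "VDAC3, VDAC3, VDAC3, VDAC3, VDAC3",
    "genes_tx_id": "uc003xpc.3, uc031tay.1, uc022aul.1, uc003xpc.3, uc022aul.1",
    "genes_width": "567, 567, 567, 475, 475",
}


def unpack(cell: str) -> list:
    """Split a comma-separated cell into a list of trimmed entries."""
    return [item.strip() for item in cell.split(",")]


columns = ("genes_id", "genes_symbol", "genes_tx_id", "genes_width")

# Assumption: the i-th entry of each column describes the same feature.
records = list(zip(*(unpack(row[col]) for col in columns)))

# Collect the feature types that each transcript contributes.
features_by_tx = defaultdict(list)
for feature, symbol, tx_id, width in records:
    features_by_tx[tx_id].append(feature)

for tx_id, features in features_by_tx.items():
    print(tx_id, features)
```

For this row, uc003xpc.3 and uc022aul.1 each contribute an exon and a 3UTR entry, while uc031tay.1 contributes only an exon.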

Enhancers

| cpg | chr | start | enhancers_id | enhancers_width |
|---|---|---|---|---|
| cg00000776 | chr4 | 156388205 | enhancer | 116 |
| cg00003578 | chr1 | 12600529 | enhancer | 328 |
| cg00004667 | chr1 | 16292746 | enhancer | 536 |
| cg00004963 | chr6 | 147124996 | enhancer | 324 |
| cg00005325 | chr1 | 201684967 | enhancer | 354 |
| cg00005461 | chr3 | 46131480 | enhancer | 363 |
| cg00007021 | chr8 | 101819246 | enhancer | 437 |
| cg00007969 | chr1 | 41633437 | enhancer | 488 |
| cg00009088 | chr11 | 60930188 | enhancer | 335 |
| cg00009585 | chr15 | 33111077 | enhancer | 345 |

Placental partially methylated domains (PMDs) from Schroeder et al. 2013:

The PMD coordinates were taken from the primary article.

| cpg | chr | start | pmd_width | pmd_id |
|---|---|---|---|---|
| cg00000103 | chr4 | 73470186 | 332252 | chr4:73435322-73767574 |
| cg00000165 | chr1 | 91194674 | 81136 | chr1:91192805-91273941 |
| cg00000363 | chr1 | 230560793 | 68156 | chr1:230492946-230561102 |
| cg00000596 | chr8 | 133098502 | 77607 | chr8:133063957-133141564 |
| cg00000776 | chr4 | 156388205 | 162183 | chr4:156298095-156460278 |
| cg00000884 | chr4 | 154609857 | 74720 | chr4:154606053-154680773 |
| cg00000974 | chr20 | 6750606 | 1147 | chr20:6749547-6750694 |
| cg00001099 | chr8 | 87081553 | 201811 | chr8:86879841-87081652 |
| cg00001249 | chr14 | 60389786 | 171588 | chr14:60386751-60558339 |
| cg00001520 | chr14 | 37666489 | 24805 | chr14:37641880-37666685 |
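The `pmd_id` column encodes each domain as `chr:start-end`, and in the rows shown `pmd_width` equals end minus start. The sketch below parses an ID, recomputes the width, and checks that a CpG falls inside the domain. It is written in Python for illustration only; `Region`, `parse_region`, and `contains` are hypothetical names, not functions from the repository.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int

    @property
    def width(self) -> int:
        # Matches pmd_width in the rows shown: end minus start.
        return self.end - self.start


def parse_region(region_id: str) -> Region:
    """Parse an ID of the form 'chr4:73435322-73767574'."""
    match = re.fullmatch(r"(\w+):(\d+)-(\d+)", region_id)
    if match is None:
        raise ValueError(f"unrecognised region id: {region_id!r}")
    return Region(match.group(1), int(match.group(2)), int(match.group(3)))


def contains(region: Region, chrom: str, pos: int) -> bool:
    """True if the position lies within the region (both ends inclusive)."""
    return region.chrom == chrom and region.start <= pos <= region.end


# First row of the PMD table: cg00000103 at chr4:73470186.
pmd = parse_region("chr4:73435322-73767574")
print(pmd.width)                         # 332252
print(contains(pmd, "chr4", 73470186))   # True
```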

Imprinting regions

These placental imprinted regions were collected from several sources. The merging of these regions into a combined resource is documented at github.com/wvictor14/human_methylation_imprints.

| cpg | chr | start | imprint_tissue_specificity | imprint_methylated_allele | imprint_sources | imprint_region |
|---|---|---|---|---|---|---|
| cg00000924 | chr11 | 2720463 | other | M | Court 2014, Hanna 2016 | 11:2719948-2722440 |
| cg00050654 | chr4 | 4576493 | placental-specific | M | Sanchez-Delgado 2016 | 4:4576220-4577911 |
| cg00059930 | chr13 | 48894382 | other | M | Court 2014 | 13:48892341-48895763 |
| cg00082664 | chr4 | 154710796 | placental-specific | M | Sanchez-Delgado 2016, Hamada 2016 | 4:154709200-154715220 |
| cg00082664 | chr4 | 154710796 | placental-specific | M | Sanchez-Delgado 2016, Hamada 2016 | 4:154709200-154715220 |
| cg00083059 | chr6 | 39902348 | placental-specific | M | Hanna 2016 | 6:39901897-39902693 |
| cg00096536 | chr4 | 154711906 | placental-specific | M | Sanchez-Delgado 2016, Hamada 2016 | 4:154709200-154715220 |
| cg00096536 | chr4 | 154711906 | placental-specific | M | Sanchez-Delgado 2016, Hamada 2016 | 4:154709200-154715220 |
| cg00098799 | chr15 | 99409360 | other | M | Court 2014 | 15:99408496-99409650 |
| cg00155882 | chr8 | 141110747 | other | M | Court 2014, Hanna 2016 | 8:141107717-141111081 |
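Two details of the imprinting columns are worth noting when working with them: `imprint_region` omits the "chr" prefix used in the `chr` column, and some CpG rows appear more than once. The sketch below normalises the chromosome name, confirms each CpG lies inside its region, and groups unique CpGs by region. It is written in Python for illustration only, and `split_region` is a hypothetical helper, not part of the repository.

```python
from collections import defaultdict

# (cpg, chr, start, imprint_region) copied from a few rows of the table,
# including one repeated row.
rows = [
    ("cg00000924", "chr11", 2720463, "11:2719948-2722440"),
    ("cg00082664", "chr4", 154710796, "4:154709200-154715220"),
    ("cg00082664", "chr4", 154710796, "4:154709200-154715220"),
    ("cg00096536", "chr4", 154711906, "4:154709200-154715220"),
    ("cg00083059", "chr6", 39902348, "6:39901897-39902693"),
]


def split_region(region: str):
    """Turn '4:154709200-154715220' into ('chr4', 154709200, 154715220)."""
    chrom, span = region.split(":")
    start, end = span.split("-")
    return "chr" + chrom, int(start), int(end)


cpgs_by_region = defaultdict(list)
outside = []  # CpGs that do not fall inside their stated region
for cpg, chrom, pos, region in rows:
    r_chrom, r_start, r_end = split_region(region)
    if not (chrom == r_chrom and r_start <= pos <= r_end):
        outside.append(cpg)
        continue
    # Skip repeated rows so each CpG is listed once per region.
    if cpg not in cpgs_by_region[region]:
        cpgs_by_region[region].append(cpg)

for region, cpgs in cpgs_by_region.items():
    print(region, cpgs)
print("outside:", outside)
```

For these rows every CpG falls inside its region, and the region 4:154709200-154715220 collects two distinct CpGs.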

3. Map to hg38

Lastly, I mapped the annotation to the genome assembly hg38 using the UCSC liftOver tool as implemented in R. This results in a loss of 237 CpGs.
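To show why a coordinate conversion can drop probes, here is a toy sketch of block-based position mapping. It is written in Python for illustration only and is not the UCSC liftOver algorithm or its chain file format: the blocks, the chromosome name `chrT`, and the probe names are all made up, and strand and cross-chromosome moves are ignored. The point is only that a position outside every aligned block has no target coordinate and is lost.

```python
from typing import Dict, List, Optional, Tuple

# (source_start, source_end, target_start); the source interval is half-open.
Block = Tuple[int, int, int]


def lift_position(chrom: str, pos: int,
                  blocks: Dict[str, List[Block]]) -> Optional[int]:
    """Map a position through aligned blocks; None if no block covers it."""
    for src_start, src_end, dst_start in blocks.get(chrom, []):
        if src_start <= pos < src_end:
            # Keep the same offset within the block.
            return dst_start + (pos - src_start)
    return None


# Entirely hypothetical alignment blocks and probes.
toy_blocks = {"chrT": [(100, 200, 150), (300, 400, 340)]}
toy_probes = {
    "probe_a": ("chrT", 120),
    "probe_b": ("chrT", 250),  # falls in the gap between the two blocks
    "probe_c": ("chrT", 399),
}

lifted = {}
lost = []
for name, (chrom, pos) in toy_probes.items():
    new_pos = lift_position(chrom, pos, toy_blocks)
    if new_pos is None:
        lost.append(name)
    else:
        lifted[name] = (chrom, new_pos)

print(lifted)  # {'probe_a': ('chrT', 170), 'probe_c': ('chrT', 439)}
print(lost)    # ['probe_b']
```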