Annotating DNA methylation array

07/08/2019

This repository contains code for generating an annotation for the Illumina EPIC methylation array.

There are two annotation files, one mapped to hg19 (hg19_epic_annotation.rds) and one mapped to hg38/GRCh38 (hg38_epic_annotation.rds). Both contain the same annotation information (columns), but the hg38 annotation is missing 237 probes whose coordinates could not be converted from hg19 to hg38.

Currently these files are not tracked in git because they are too large (~250 MB).
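
As a quick check of the probe difference, the two files can be compared by probe ID. The sketch below assumes each .rds file holds a data frame with a `cpg` column (as in the example tables in this README); the helper name `probes_lost` and the toy data are hypothetical.

```r
# Hypothetical helper: probe IDs present in the hg19 annotation but absent
# from the hg38 one. Assumes both inputs are data frames with a `cpg` column.
probes_lost <- function(anno_hg19, anno_hg38) {
  setdiff(anno_hg19$cpg, anno_hg38$cpg)
}

# Toy stand-ins for the real files. Which probe is dropped here is made up
# purely for illustration.
toy_hg19 <- data.frame(
  cpg = c("cg00000029", "cg00000103", "cg00000109"),
  stringsAsFactors = FALSE
)
toy_hg38 <- toy_hg19[toy_hg19$cpg != "cg00000103", , drop = FALSE]

lost <- probes_lost(toy_hg19, toy_hg38)
print(lost)
```

With the real files, something like `probes_lost(readRDS("hg19_epic_annotation.rds"), readRDS("hg38_epic_annotation.rds"))` should return the 237 missing probe IDs, provided the ID column is indeed named `cpg`.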

The process

1. Starting annotation

I started with the default annotations provided by Illumina. I used two files: the latest B4 annotation (MethylationEPIC_v-1-0_B4.csv) and the list of legacy probes missing from B3 relative to B2 (MethylationEPIC Missing Legacy CpG (v1.0_B3 vs. v1.0_B2) Annotations.csv). Both are available in the product files list on Illumina's website.

Taking the intersection of these two lists of probes, I used the provided genomic location (chromosome and position) to map annotations to each cpg. Note that Illumina's provided annotations are based on hg19.

An example of the starting coordinates from Illumina that this annotation is based on:

cpg	chr	start
cg00000029	chr16	53468112
cg00000103	chr4	73470186
cg00000109	chr3	171916037
cg00000155	chr7	2590565
cg00000158	chr9	95010555
cg00000165	chr1	91194674
cg00000221	chr17	54534248
cg00000236	chr8	42263294
cg00000289	chr14	69341139
cg00000292	chr16	28890100
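
A minimal sketch of how coordinates like these could be derived from an Illumina manifest. It assumes the manifest has already been read into a data frame with `Name`, `CHR` and `MAPINFO` columns, and that `CHR` lacks the "chr" prefix. These column names, the helper `manifest_to_coords`, and the unmapped toy probe are assumptions for illustration, not taken from this repository's code.

```r
# Hypothetical helper: reduce a manifest-like data frame to cpg / chr / start.
# Assumed columns: Name (probe ID), CHR (chromosome without "chr"), MAPINFO
# (hg19 position). Probes without a mapped position are dropped.
manifest_to_coords <- function(manifest) {
  keep <- !is.na(manifest$CHR) & !is.na(manifest$MAPINFO)
  data.frame(
    cpg = manifest$Name[keep],
    chr = paste0("chr", manifest$CHR[keep]),
    start = manifest$MAPINFO[keep],
    stringsAsFactors = FALSE
  )
}

# Toy manifest: the first two rows reuse coordinates from the table above,
# the third ("cg_unmapped") is invented to show the filtering.
toy_manifest <- data.frame(
  Name = c("cg00000029", "cg00000103", "cg_unmapped"),
  CHR = c("16", "4", NA),
  MAPINFO = c(53468112L, 73470186L, NA),
  stringsAsFactors = FALSE
)

coords <- manifest_to_coords(toy_manifest)
print(coords)
```

If the manifest CSV starts with a preamble before the column header, `read.csv()` would need a suitable `skip` value; how many lines to skip is not stated here and should be checked against the file.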

I also kept some probe-specific information that I thought some may find useful. The columns for these variables are all prefixed with "ilmn_".

2. Annotate cpgs

Transcript-related features, enhancers, cpg islands

I used the R package annotatr to access UCSC annotations for cpg islands / transcripts, and FANTOM5 for enhancers.

UCSC transcript- and cpg island-related elements:

cpg	chr	start	cpg_id	cpg_width	genes_id	genes_symbol	genes_tx_id	genes_width
cg00000029	chr16	53468112	shore	2000	promoter, 1to5kb	RBL2, RBL2	uc002ehi.4, uc010vgv.1	1000, 4000
cg00000103	chr4	73470186	sea	491623	intergenic	NA	NA	480899
cg00000109	chr3	171916037	sea	398648	intron, intron, intron	FNDC3B, FNDC3B, FNDC3B	uc003fhy.3, uc003fhz.4, uc003fia.3	93324, 93324, 93324
cg00000155	chr7	2590565	sea	3182	intron, intron	BRAT1, BRAT1	uc003smi.3, uc003smj.2	6826, 6826
cg00000158	chr9	95010555	sea	143935	intron, intron, intron, intron, intron	IARS, IARS, IARS, IARS, IARS	uc004ars.2, uc004art.2, uc004aru.4, uc010mqr.3, uc010mqt.2	2306, 2306, 2306, 2306, 2306
cg00000165	chr1	91194674	shore	2000	intergenic	NA	NA	107309
cg00000221	chr17	54534248	sea	656815	exon, intronexonboundary	ANKFN1, ANKFN1	uc002iun.1, uc002iun.1	100, 400
cg00000236	chr8	42263294	sea	10587	exon, exon, exon, 3UTR, 3UTR	VDAC3, VDAC3, VDAC3, VDAC3, VDAC3	uc003xpc.3, uc031tay.1, uc022aul.1, uc003xpc.3, uc022aul.1	567, 567, 567, 475, 475
cg00000289	chr14	69341139	shore	2000	exon, exon, exon, exon, exon, 3UTR, 3UTR, 3UTR, 3UTR, 3UTR	ACTN1, ACTN1, ACTN1, ACTN1, ACTN1, ACTN1, ACTN1, ACTN1, ACTN1, ACTN1	uc001xkk.3, uc010ttb.2, uc001xkl.3, uc001xkm.3, uc001xkn.3, uc001xkk.3, uc010ttb.2, uc001xkl.3, uc001xkm.3, uc001xkn.3	895, 895, 895, 895, 895, 736, 736, 736, 736, 736
cg00000292	chr16	28890100	shore	2000	1to5kb, exon, exon, intron	ATP2A1, ATP2A1, ATP2A1, no_associated_gene	uc002drp.1, uc002drn.1, uc002dro.1, uc010vct.2	4000, 302, 302, 931314
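
Because one cpg can overlap several transcripts, each cell above holds a comma-separated list. One way to get from one row per overlap to one row per cpg in base R is sketched below. `collapse_by_cpg` is a hypothetical helper and not necessarily how the annotation files were built; the toy rows are copied from the table above.

```r
# Hypothetical helper: collapse a long table (one row per cpg-annotation
# overlap) into one row per cpg, joining values with ", ".
collapse_by_cpg <- function(hits, cols) {
  ids <- unique(hits$cpg)  # keeps probes in first-seen order
  collapsed <- lapply(cols, function(col) {
    vapply(
      ids,
      function(id) paste(hits[[col]][hits$cpg == id], collapse = ", "),
      character(1),
      USE.NAMES = FALSE
    )
  })
  names(collapsed) <- cols
  data.frame(cpg = ids, collapsed, stringsAsFactors = FALSE)
}

# Long-format toy input: two overlaps for each of two probes.
hits <- data.frame(
  cpg = c("cg00000029", "cg00000029", "cg00000155", "cg00000155"),
  genes_id = c("promoter", "1to5kb", "intron", "intron"),
  genes_symbol = c("RBL2", "RBL2", "BRAT1", "BRAT1"),
  stringsAsFactors = FALSE
)

wide <- collapse_by_cpg(hits, cols = c("genes_id", "genes_symbol"))
print(wide)
```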

Enhancers

cpg	chr	start	enhancers_id	enhancers_width
cg00000776	chr4	156388205	enhancer	116
cg00003578	chr1	12600529	enhancer	328
cg00004667	chr1	16292746	enhancer	536
cg00004963	chr6	147124996	enhancer	324
cg00005325	chr1	201684967	enhancer	354
cg00005461	chr3	46131480	enhancer	363
cg00007021	chr8	101819246	enhancer	437
cg00007969	chr1	41633437	enhancer	488
cg00009088	chr11	60930188	enhancer	335
cg00009585	chr15	33111077	enhancer	345
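
The annotatr step could look roughly like the sketch below. `build_annotations()` and `annotate_regions()` are annotatr functions, but the particular set of annotation codes, the wrapper name `annotate_cpgs`, and the single-base ranges are my assumptions rather than the exact code used here. The packages are loaded inside the function only so that the definition can be sourced without Bioconductor installed.

```r
# Sketch only: needs the Bioconductor packages GenomicRanges and annotatr,
# plus the genome-specific data packages that annotatr pulls in.
annotate_cpgs <- function(coords, genome = "hg19") {
  # coords: data frame with columns cpg, chr, start
  suppressPackageStartupMessages({
    library(GenomicRanges)
    library(annotatr)
  })

  # One single-base range per probe, carrying the probe ID as metadata.
  regions <- GRanges(
    seqnames = coords$chr,
    ranges = IRanges(start = coords$start, width = 1),
    cpg = coords$cpg
  )

  # Assumed selection of built-in annotations: cpg islands/shores/shelves/sea,
  # gene features, and FANTOM5 enhancers.
  codes <- paste0(genome, c("_cpgs", "_basicgenes", "_genes_intergenic",
                            "_genes_intronexonboundaries", "_enhancers_fantom"))
  annotations <- build_annotations(genome = genome, annotations = codes)

  annotated <- annotate_regions(
    regions = regions,
    annotations = annotations,
    ignore.strand = TRUE,
    quiet = TRUE
  )

  # One row per probe-annotation overlap.
  data.frame(annotated)
}
```

Calling `annotate_cpgs(coords)` on a cpg/chr/start table returns a long table that would still need to be collapsed to one row per cpg.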

Placental partially methylated domains (PMDs) from Schroeder et al. 2013:

The PMD coordinates were taken from the primary article.

cpg	chr	start	pmd_width	pmd_id
cg00000103	chr4	73470186	332252	chr4:73435322-73767574
cg00000165	chr1	91194674	81136	chr1:91192805-91273941
cg00000363	chr1	230560793	68156	chr1:230492946-230561102
cg00000596	chr8	133098502	77607	chr8:133063957-133141564
cg00000776	chr4	156388205	162183	chr4:156298095-156460278
cg00000884	chr4	154609857	74720	chr4:154606053-154680773
cg00000974	chr20	6750606	1147	chr20:6749547-6750694
cg00001099	chr8	87081553	201811	chr8:86879841-87081652
cg00001249	chr14	60389786	171588	chr14:60386751-60558339
cg00001520	chr14	37666489	24805	chr14:37641880-37666685
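
For custom region sets such as these PMDs, the overlap amounts to asking whether a probe's position falls inside a region on the same chromosome. The base R sketch below illustrates that logic under the assumption that regions are stored as chr/start/end; `annotate_pmds` is a hypothetical helper, and the two regions are taken from the table above.

```r
# Hypothetical helper: attach the first containing region (if any) to each probe.
# coords: data frame with cpg, chr, start; regions: data frame with chr, start, end.
annotate_pmds <- function(coords, regions) {
  hit <- vapply(seq_len(nrow(coords)), function(i) {
    idx <- which(regions$chr == coords$chr[i] &
                   regions$start <= coords$start[i] &
                   regions$end >= coords$start[i])
    if (length(idx) == 0) NA_integer_ else idx[1]
  }, integer(1))

  data.frame(
    coords,
    pmd_width = regions$end[hit] - regions$start[hit],
    pmd_id = ifelse(
      is.na(hit),
      NA_character_,
      paste0(regions$chr[hit], ":", regions$start[hit], "-", regions$end[hit])
    ),
    stringsAsFactors = FALSE
  )
}

pmds <- data.frame(
  chr = c("chr4", "chr1"),
  start = c(73435322L, 91192805L),
  end = c(73767574L, 91273941L),
  stringsAsFactors = FALSE
)
probes <- data.frame(
  cpg = c("cg00000103", "cg00000029", "cg00000165"),
  chr = c("chr4", "chr16", "chr1"),
  start = c(73470186L, 53468112L, 91194674L),
  stringsAsFactors = FALSE
)

pmd_anno <- annotate_pmds(probes, pmds)
print(pmd_anno)
```

Probes that fall outside every region get `NA` in both new columns. For large probe sets a range-overlap package would be much faster than this row-by-row loop.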

Imprinting regions

These placental imprinted regions were collected from several sources. The merging of these regions into a combined resource is documented at github.com/wvictor14/human_methylation_imprints.

cpg	chr	start	imprint_tissue_specificity	imprint_methylated_allele	imprint_sources	imprint_region
cg00000924	chr11	2720463	other	M	Court 2014, Hanna 2016	11:2719948-2722440
cg00050654	chr4	4576493	placental-specific	M	Sanchez-Delgado 2016	4:4576220-4577911
cg00059930	chr13	48894382	other	M	Court 2014	13:48892341-48895763
cg00082664	chr4	154710796	placental-specific	M	Sanchez-Delgado 2016, Hamada 2016	4:154709200-154715220
cg00082664	chr4	154710796	placental-specific	M	Sanchez-Delgado 2016, Hamada 2016	4:154709200-154715220
cg00083059	chr6	39902348	placental-specific	M	Hanna 2016	6:39901897-39902693
cg00096536	chr4	154711906	placental-specific	M	Sanchez-Delgado 2016, Hamada 2016	4:154709200-154715220
cg00096536	chr4	154711906	placental-specific	M	Sanchez-Delgado 2016, Hamada 2016	4:154709200-154715220
cg00098799	chr15	99409360	other	M	Court 2014	15:99408496-99409650
cg00155882	chr8	141110747	other	M	Court 2014, Hanna 2016	8:141107717-141111081

3. Map to hg38

Lastly, I mapped the annotation to the genome assembly hg38 using UCSC's liftOver tool as implemented in R. This results in a loss of 237 cpgs.
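
A rough sketch of the liftover step, assuming the rtracklayer implementation of liftOver and a locally downloaded, uncompressed hg19-to-hg38 chain file. The wrapper name `lift_to_hg38` and the `chain_path` argument are hypothetical, and the package choice is my assumption since this README does not name it. As before, packages are loaded inside the function so the definition can be sourced without Bioconductor installed.

```r
# Sketch only: needs the Bioconductor packages GenomicRanges and rtracklayer.
lift_to_hg38 <- function(coords, chain_path) {
  # coords: data frame with columns cpg, chr, start (hg19 coordinates)
  # chain_path: path to an uncompressed hg19 -> hg38 chain file
  suppressPackageStartupMessages({
    library(GenomicRanges)
    library(rtracklayer)
  })

  gr <- GRanges(
    seqnames = coords$chr,
    ranges = IRanges(start = coords$start, width = 1),
    cpg = coords$cpg
  )

  chain <- import.chain(chain_path)

  # liftOver() returns one list element per input range; probes that cannot
  # be mapped yield empty elements and disappear on unlist().
  lifted <- unlist(liftOver(gr, chain))

  data.frame(
    cpg = lifted$cpg,
    chr = as.character(seqnames(lifted)),
    start = start(lifted),
    stringsAsFactors = FALSE
  )
}
```

Comparing the input and output probe IDs, for example with `setdiff(coords$cpg, lifted$cpg)` where `lifted` is the returned data frame, lists the probes lost in the conversion.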