Jun 19, 2019: Version 1.1 released.
- Reannotated probes with InfiniumAnnotation (Ref 2).
- Improved file naming: adjacent normal samples are named `.A` and blood normal samples are named `.N`.
- Improved file naming: a unique sample ID was added.
May 30, 2018: Version 1.0 released.
- Probe annotations were lifted over to hg38.
## Get file description from github and set up environment
wget https://raw.githubusercontent.com/ding-lab/CPTAC3.catalog/master/CPTAC3.Catalog.dat
wget https://raw.githubusercontent.com/ding-lab/CPTAC3.catalog/master/BamMap/katmai.BamMap.dat
conda env create -f methyl-pipeline.yml
conda activate methyl-pipeline
## Make input
python make_pipeline_input.py ${Path to processing folder} ${Batch_name} ${TXT with file names and UUID}
## Softlink IDAT files
for i in `cat ../katmai.BamMap.dat | grep Methylation | awk -F "\t" '{OFS="\t"; print $6}'`; do ln -s $i ${Path to processing folder}; done
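
For readers who prefer to script this step, the following is a minimal Python sketch of the same idea as the shell loop above: it keeps rows that mention `Methylation`, takes the sixth tab-separated column as the IDAT path, and links it into the processing folder. The function name `link_methylation_idats` is hypothetical and not part of this repository, and the sketch assumes a POSIX filesystem where symlinks are allowed.

```python
import os


def link_methylation_idats(bammap_path, dest_dir):
    """Symlink the column-6 path of every row that mentions 'Methylation'."""
    os.makedirs(dest_dir, exist_ok=True)
    linked = []
    with open(bammap_path) as handle:
        for line in handle:
            # Same filter as `grep Methylation`: match anywhere in the line.
            if "Methylation" not in line:
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 6:
                continue  # malformed row, nothing to link
            source = fields[5]  # awk's $6 is index 5 in Python
            target = os.path.join(dest_dir, os.path.basename(source))
            # Skip links that already exist so the step can be re-run.
            if not os.path.lexists(target):
                os.symlink(source, target)
            linked.append(target)
    return linked
```

Unlike the plain `ln -s` loop, this sketch skips targets that already exist, which is an added convenience rather than the behavior of the original command.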
## Pipeline
tmux new-session -d -s methylation "source activate methyl-pipeline; Rscript cptac_methylation_v1.1.R ${Path to processing folder} |& tee tmux.methylation.log"
tmux new-session -d -s remapping "source activate methyl-pipeline; Rscript cptac_methylation_liftover.R ${Path to processing folder}/Processed |& tee tmux.mapping.log"
The raw data from Illumina's EPIC methylation arrays were available as
IDAT files from the CPTAC consortium. The methylation analysis was
performed using the cross-package workflow methylationArrayAnalysis
available on Bioconductor. In brief, the raw IDAT files
were processed to obtain the methylated (M) and unmethylated (U) signal
intensities for each locus. The processing step included an unsupervised
normalization step called functional normalization that has been
previously implemented for Illumina 450K methylation arrays (Ref 1). A detection p
value was also calculated for each locus, and this p value captured the
quality of detection at the locus with respect to negative control
background probes included in the array. Loci having common SNPs (with
MAF > 0.01), as per dbSNP builds 132 through 147 via the UCSC
snp132common through snp147common tracks, were removed from further
analysis. Beta values were calculated as M/(M+U), which equals the
fraction methylated at each locus. Beta values of loci whose detection
p values were > 0.01 were assigned NA in the output file. All loci
were annotated with the EPIC manifest `MethylationEPIC_v-1-0_B2.csv`
(from the zip archive `infinium-methylationepic-v1-0-b2-manifest-file-csv.zip`
provided by Illumina) through the IlluminaHumanMethylationEPICanno.ilm10b2.hg19
package on Bioconductor.
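
As an illustration of the beta-value rule described above, here is a minimal Python sketch. It is not the pipeline's R code; the function name `beta_values` and the use of `None` to stand in for NA are assumptions made for this example, as is the guard against a zero total intensity.

```python
def beta_values(meth, unmeth, det_p, p_cutoff=0.01):
    """Return M / (M + U) per locus, or None where detection fails.

    meth, unmeth: methylated (M) and unmethylated (U) signal intensities.
    det_p: detection p value for each locus.
    """
    betas = []
    for m, u, p in zip(meth, unmeth, det_p):
        total = m + u
        # Loci with detection p > cutoff are reported as NA (None here).
        # The zero-total guard is an assumption added to avoid division by zero.
        if p > p_cutoff or total == 0:
            betas.append(None)
        else:
            betas.append(m / total)
    return betas


# Three loci: two pass detection, the third has p = 0.2 and becomes None.
print(beta_values([3000.0, 100.0, 50.0], [1000.0, 900.0, 40.0], [0.0, 0.001, 0.2]))
```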
To map EPIC arrays to the GRCh38 assembly, all probes were reannotated with annotation information from InfiniumAnnotation (Ref 2) through the following steps:
1. Get the annotation information `anno/annotation_liftedOver_hg38.txt` from the IlluminaHumanMethylationEPICanno.ilm10b2.hg19 package using the script `annotation_liftOver_hg19tohg38.R`.
2. Remap the probes with GRCh38 information from InfiniumAnnotation using the script `remapping.py`.
3. Replace the annotation in the output files with the new annotation `anno/annotation_remap_hg38.txt` generated in step 2. Refer to the script `cptac_methylation_liftover.R`.
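
The replacement step can be pictured with a small Python sketch. This is not the code in `cptac_methylation_liftover.R`; it only illustrates swapping the Build 37 `chr` and `pos` columns for `CpG_chrm`, `CpG_beg`, and `CpG_end` by joining on the probe ID in `Locus`. The function name `replace_annotation`, the in-memory dict layout, and the `"NA"` fill for probes missing from the remapped annotation are all assumptions.

```python
OLD_COLUMNS = ("chr", "pos")                       # Build 37 columns to drop
NEW_COLUMNS = ("CpG_chrm", "CpG_beg", "CpG_end")   # Build 38 replacements


def replace_annotation(rows, remap):
    """Swap hg19 coordinates for hg38 ones, matching rows on 'Locus'.

    rows: list of dicts, one per probe, each with a 'Locus' key.
    remap: dict mapping probe ID to a dict holding the NEW_COLUMNS values.
    """
    updated = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in OLD_COLUMNS}
        new = remap.get(row["Locus"], {})
        for column in NEW_COLUMNS:
            # Assumption: probes without a remapped entry are filled with "NA".
            kept[column] = new.get(column, "NA")
        updated.append(kept)
    return updated
```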
1. Fortin JP, Labbe A, Lemire M, Zanke BW. 2014. "Functional normalization of 450k methylation array data improves replication in large cancer studies." Genome Biology 15 (503):1–17.
2. Zhou W, Laird PW, Shen H. 2017. "Comprehensive characterization, annotation and innovative use of Infinium DNA Methylation BeadChip probes." Nucleic Acids Research 45 (4):e22.
The output files follow the naming scheme `[SubjectID].[T, N, or A].[SampleID].csv`,
where T, N, or A specifies whether the sample is a tumor, blood normal, or tissue normal sample (for
example: C3N-01375.T.CPTXXXXXXXXXX.csv). SampleID is a unique
identifier of the sample.
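
A short Python sketch of how such file names could be split into their parts is given here. The helper `parse_output_name` is hypothetical, and the pattern assumes that neither the subject ID nor the sample ID contains a dot.

```python
import re

# Letter codes as described in the naming scheme.
SAMPLE_TYPES = {"T": "tumor", "N": "blood normal", "A": "tissue normal"}

# Assumption: SubjectID and SampleID never contain a '.' themselves.
FILENAME_RE = re.compile(
    r"^(?P<subject>[^.]+)\.(?P<type>[TNA])\.(?P<sample>[^.]+)\.csv$"
)


def parse_output_name(filename):
    """Split '[SubjectID].[T|N|A].[SampleID].csv' into its components."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return {
        "subject_id": match.group("subject"),
        "sample_type": SAMPLE_TYPES[match.group("type")],
        "sample_id": match.group("sample"),
    }


print(parse_output_name("C3N-01375.T.CPTXXXXXXXXXX.csv"))
```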
The column headings of the output file are listed below.
Column | Content from Illumina EPIC Manifest | Status | Replacement | Content from InfiniumAnnotation
---|---|---|---|---|
Locus | The IlmnID | Keep | - | - |
chr | Chromosome containing the CpG (Build 37) | Replace | CpG_chrm | Mapped to Build 38 |
pos | Chromosomal coordinates of the CpG (Build 37) | Replace | CpG_beg and CpG_end | Mapped to Build 38 |
strand | The Forward (F) or Reverse (R) designation of the Design Strand | Keep | - | - |
Probe_rs | rsid(s) of SNP(s) located in the probe | Keep | - | - |
Probe_maf | Minor allele frequency of SNP(s) | Keep | - | - |
Random_Loci | CpG loci chosen randomly by consortium members during the design process are marked “True” | Keep | - | - |
Methyl27_Loci | CpGs carried over from the HumanMethylation27 array (92% carryover) | Keep | - | - |
Methyl450_Loci | CpGs carried over from the HumanMethylation450 array (94% carryover) | Keep | - | - |
UCSC_RefGene_Group_on37 | Gene region feature category describing the CpG position, from UCSC. | Keep | - | - |
Relation_to_Island_on37 | The location of the CpG relative to the CpG island | Keep | - | - |
Phantom5_Enhancers_on37 | Classifications from the FANTOM 5 enhancers as a low- or high-CpG density region | Keep | - | - |
DMR_on37 | Differentially methylated regions (experimentally determined) | Keep | - | - |
X450k_Enhancer_on37 | Predicted enhancer elements as annotated in the original 450K design | Keep | - | - |
HMM_Island_on37 | Hidden Markov Model Islands. Chromosomal map coordinates of computationally predicted CpG islands | Keep | - | - |
Regulatory_Feature_Name_on37 | Chromosomal map coordinates of the regulatory feature | Keep | - | - |
Regulatory_Feature_Group_on37 | Description of the regulatory feature referenced in “Regulatory_Feature_Name” | Keep | - | - |
GencodeBasicV12_NAME | Target gene name(s), from the basic GENCODE build | Replace | geneNames | Gene models follow GENCODE version 22 (hg38). |
GencodeBasicV12_Accession | The basic GENCODE accession number(s) of the target transcript(s) | Replace | transcriptIDs | Gene models follow GENCODE version 22 (hg38). |
GencodeBasicV12_Group | Gene region feature category describing the CpG position, from basic GENCODE | Remove | - | - |
GencodeCompV12_NAME | Target gene name(s), from the complete GENCODE build | Remove | - | - |
GencodeCompV12_Accession | The complete GENCODE accession number(s) of the target transcript(s) | Remove | - | - |
GencodeCompV12_Group | Gene region feature category describing the CpG position, from complete GENCODE | Remove | - | - |
DNase_Hypersensitivity_NAME_on37 | Chromosomal coordinates of the DNase hypersensitive region from ENCODE | Keep | - | - |
DNase_Hypersensitivity_Evidence_Count_on37 | Number of supporting experimental evidence for DNase hypersensitive region from ENCODE | Keep | - | - |
OpenChromatin_NAME_on37 | Chromosomal coordinates of open chromatin region from ENCODE | Keep | - | - |
OpenChromatin_Evidence_Count_on37 | Number of supporting experimental evidence for open chromatin region from ENCODE | Keep | - | - |
TFBS_NAME_on37 | Chromosomal coordinates of transcription factor binding site region from ENCODE | Keep | - | - |
TFBS_Evidence_Count_on37 | Number of supporting experimental evidence for transcription factor binding site region from ENCODE | Keep | - | - |
- | - | Add | transcriptTypes | gene annotation based on GENCODE v22 |
- | - | Add | genesUniq | gene annotation based on GENCODE v22 |
- | - | Add | distToTSS | gene annotation based on GENCODE v22 |
- | - | Add | CGI | CpG island definition based on UCSC genome browser |
- | - | Add | CGIposition | CpG island definition based on UCSC genome browser |
- | - | Add | mapQ_A | mapping quality score, 0-60, with 60 being the best. |
- | - | Add | mapQ_B | mapping quality score, 0-60, with 60 being the best. |
- | - | Add | MASK.mapping | whether the probe is masked for mapping reason. |
- | - | Add | MASK.typeINextBaseSwitch | whether the probe has a SNP in the extension base that causes a color channel switch from the official annotation |
- | - | Add | MASK.sub30.copy | whether the 30bp 3'-subsequence of the probe is non-unique. |
- | - | Add | MASK.extBase | probes masked for extension base inconsistent with specified color channel (type-I) or CpG (type-II) based on mapping. |
- | - | Add | MASK.snp5.GMAF1p | whether 5bp 3'-subsequence (including extension for typeII) overlap with any of the SNPs with global MAF >1%. |
- | - | Add | MASK.general | recommended general purpose masking merged from "MASK.sub30.copy", "MASK.mapping", "MASK.extBase", "MASK.typeINextBaseSwitch" and "MASK.snp5.GMAF1p" |
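
To show how the mask columns might be used downstream, here is a minimal Python sketch that flags a probe when any of the five component masks is set. Two assumptions are made and should be checked against the actual output files: that "merged" means a logical OR of the components, and that mask values are stored as `TRUE`/`FALSE` text. The helper names are hypothetical.

```python
# The five component masks that MASK.general is described as merging.
COMPONENT_MASKS = (
    "MASK.sub30.copy",
    "MASK.mapping",
    "MASK.extBase",
    "MASK.typeINextBaseSwitch",
    "MASK.snp5.GMAF1p",
)


def is_true(value):
    """Interpret TRUE/T/1 (any case) as a set mask; assumed encoding."""
    return str(value).strip().upper() in ("TRUE", "T", "1")


def general_mask(row):
    """Assumed merge rule: a probe is masked if any component mask is set."""
    return any(is_true(row.get(name, "FALSE")) for name in COMPONENT_MASKS)


def unmasked_loci(rows):
    """Return the Locus IDs of probes that pass every component mask."""
    return [row["Locus"] for row in rows if not general_mask(row)]
```

In practice the precomputed `MASK.general` column can be used directly; the sketch only makes the assumed relationship between the columns explicit.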