Leveraging methylation information to infer TF binding

Conclusions: Incorporating methylation data improves TF binding inference over standard motif binding profiles, i.e., PWM p-values, and does so in a cell-type specific manner. Although most TFs do not bind to methylated promoter regions, our analysis highlights several exceptions to this rule, indicating that the role of methylation in TF binding is likely cell-type and context specific.
Authors: Daniel Morgan; Dawn DeMeo; Kimberly Glass

Clone current version & run camb_motif_pipeline_gamma.sh

    git clone https://github.com/dcolinmorgan/mili_benchmark ~/src/mili_benchmark
    bash camb_motif_pipeline_gamma.sh -b0 -c'A549 K562 GM12878 SKNSH HepG2 HeLa' -o'outdirXX'

This benchmark can be summarized by three main steps (numbered) and takes four primary inputs, including the locations of:

potential transcription factor binding sites, which can be defined by position weight matrices mapped onto the DNA
850k Methyl array data
WGBS methylation data
ChIP-seq data

The output are TF-cell specific intersections of these four data types which are read into jupyternotebooks for analysis. Several other steps provide checks against bias for supplemental figures (bullet points)

IMPORTANT: bedtools must be installed in order to run. Please see https://bedtools.readthedocs.io/ for more information.

Separate hg38 MEME file into individual motif files (730 total PWM files)
Run FIMO with threshold=0.0001
Convert fimo-output to the BED format
For every motif BED file and for every cell line assign motif locations (~) 1 where ChIP for the corresponding TF assayed in that cell line is also observed (otherwise 0). This data to assess prediction within the following motif subsets:
- nonCG motif
  Input: motif locations devoid of CG
  Output: white in Figure S1
- CG motif
  Input:CG containing motif locations
  Output: Figure S1,S2,S7, Confirm baseline motif AUROC (Glass et al. 2013)
- CG motif ∩ WGBS
  Input: motif locations with WGBS methyl-data
  Parameters tested: sequence read depth
  Output: Figure S4
- CG motif ∩ WGBS ∩ array
  Input: motif locations with both WGBS and methyl array data
  Output: Figures 2, S3,S5,S6
  Overlay Yin 2017 data (Figure 3)
  Additional parameters tested: add +/- 0 : 10kb buffer sizes (Figure 4) \

Calculate & summarize AUROCs from intersection files

requirements : numpy==1.18.1 pandas==1.0.1 seaborn==0.10.0 scipy==1.4.1

   python mili_benchmark/src/python/run_predScore.py -i outdirXX -o outdirXX/test

Following this, the jupyter notebook processes AUROCs and figures

Among other things, these checks are performed herewithin:

Count multiple CpGs per motif region (varies per TF, ~15) with PWM hit
1. Compare pairwise distances
2. Compare mean/median/max/min CpG
Compare full WGBS and only subset within array to ChIP, calculate AUROC
1. threshold WGBS reads >10, calculate AUROC

Workflow figure from manuscript

Figure 1. Intersection schema between data modalities. Schematic workflow of bedtools2 intersection calls to calculate the prediction accuracy. The original motif is used as the template, onto which methylation information is supplemented/overwritten to predict ChIP-seq binding activity, where possible. Intersections: 1. Motif to WGBS, 2. Motif-WGBS to methyl array, 3. Motif+WGBS+methyl to ChIP. This narrow view is then expanded by adding buffers before and after motif sites (H0A) and methyl sites (H0B).