/mili_benchmark

mix of python, bash and R functions, scripts and notebooks to benchmark the use of methylation as proxy to infer TF binding

Primary LanguageJupyter Notebook

Leveraging methylation information to infer TF binding

Conclusions: Incorporating methylation data improves TF binding inference over standard motif binding profiles, i.e., PWM p-values, and does so in a cell-type specific manner. Although most TFs do not bind to methylated promoter regions, our analysis highlights several exceptions to this rule, indicating that the role of methylation in TF binding is likely cell-type and context specific.
Authors: Daniel Morgan; Dawn DeMeo; Kimberly Glass


Clone current version & run camb_motif_pipeline_gamma.sh

    git clone https://github.com/dcolinmorgan/mili_benchmark ~/src/mili_benchmark
    bash camb_motif_pipeline_gamma.sh -b0 -c'A549 K562 GM12878 SKNSH HepG2 HeLa' -o'outdirXX'

This benchmark can be summarized by three main steps (numbered) and takes four primary inputs, including the locations of:

  1. potential transcription factor binding sites, which can be defined by position weight matrices mapped onto the DNA
  2. 850k Methyl array data
  3. WGBS methylation data
  4. ChIP-seq data

The output are TF-cell specific intersections of these four data types which are read into jupyternotebooks for analysis. Several other steps provide checks against bias for supplemental figures (bullet points)

IMPORTANT: bedtools must be installed in order to run. Please see https://bedtools.readthedocs.io/ for more information.

  1. Separate hg38 MEME file into individual motif files (730 total PWM files)
  2. Run FIMO with threshold=0.0001
  3. Convert fimo-output to the BED format
  4. For every motif BED file and for every cell line assign motif locations (~) 1 where ChIP for the corresponding TF assayed in that cell line is also observed (otherwise 0). This data to assess prediction within the following motif subsets:
    • nonCG motif
      Input: motif locations devoid of CG
      Output: white in Figure S1
    • CG motif
      Input:CG containing motif locations
      Output: Figure S1,S2,S7, Confirm baseline motif AUROC (Glass et al. 2013)
    • CG motif ∩ WGBS
      Input: motif locations with WGBS methyl-data
      Parameters tested: sequence read depth
      Output: Figure S4
    • CG motif ∩ WGBS ∩ array
      Input: motif locations with both WGBS and methyl array data
      Output: Figures 2, S3,S5,S6
      Overlay Yin 2017 data (Figure 3)
      Additional parameters tested: add +/- 0 : 10kb buffer sizes (Figure 4) \

Calculate & summarize AUROCs from intersection files

requirements : numpy==1.18.1 pandas==1.0.1 seaborn==0.10.0 scipy==1.4.1

   python mili_benchmark/src/python/run_predScore.py -i outdirXX -o outdirXX/test

Following this, the jupyter notebook processes AUROCs and figures

Among other things, these checks are performed herewithin:

  1. Count multiple CpGs per motif region (varies per TF, ~15) with PWM hit
    1. Compare pairwise distances
    2. Compare mean/median/max/min CpG
  2. Compare full WGBS and only subset within array to ChIP, calculate AUROC
    1. threshold WGBS reads >10, calculate AUROC

Workflow figure from manuscript

Figure 1. Intersection schema between data modalities
Figure 1. Intersection schema between data modalities. Schematic workflow of bedtools2 intersection calls to calculate the prediction accuracy. The original motif is used as the template, onto which methylation information is supplemented/overwritten to predict ChIP-seq binding activity, where possible. Intersections: 1. Motif to WGBS, 2. Motif-WGBS to methyl array, 3. Motif+WGBS+methyl to ChIP. This narrow view is then expanded by adding buffers before and after motif sites (H0A) and methyl sites (H0B).