
Ocelot: Improved Epigenome Imputation Reveals Asymmetric Predictive Relationships Across Histone Modifications

Ocelot is a machine learning approach for imputing epigenomes across tissues and cell types. It ranked first in the ENCODE Imputation Challenge, achieving high accuracy on held-out prospective data. Beyond its predictive performance, it offers a new way to investigate cross-histone regulation using large-scale epigenomics datasets. Please contact hyangl@umich.edu or gyuanfan@umich.edu if you have any questions or suggestions.

Figure1


Installation

Clone a copy of the code:

git clone https://github.com/GuanLab/Ocelot.git

Required dependencies

  • python (3.6.5)
  • numpy (1.13.3). It comes pre-packaged in Anaconda.
  • pyBigWig A package for quickly accessing and creating bigWig files.
conda install pybigwig -c bioconda
  • lightgbm (2.3.0) A gradient boosting tree-based algorithm with fast training speed and high efficiency.
pip install lightgbm
  • tensorflow (1.14.0) A popular deep learning package.
conda install tensorflow-gpu
  • keras (2.2.5) A popular deep learning package using tensorflow backend.
conda install keras
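
To confirm that an environment matches the list above, each package can be imported and its version reported. The following is a minimal sketch and not part of the Ocelot repository: the helper name `report_versions` is hypothetical, and the import names are assumed to be the conventional ones for these packages.

```python
import importlib

# Import names paired with the versions listed above.
# None means the list above does not pin a version for that package.
EXPECTED = {
    "numpy": "1.13.3",
    "pyBigWig": None,
    "lightgbm": "2.3.0",
    "tensorflow": "1.14.0",
    "keras": "2.2.5",
}


def report_versions(expected=EXPECTED):
    """Return {package: (installed_version_or_None, expected_version)}.

    Hypothetical helper for illustration only.
    """
    report = {}
    for name, wanted in expected.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Package is not installed in this environment.
            report[name] = (None, wanted)
            continue
        # Not every package exposes __version__, so fall back to "unknown".
        report[name] = (getattr(module, "__version__", "unknown"), wanted)
    return report
```

Calling `report_versions()` and comparing the two values in each tuple shows which packages are missing or differ from the versions listed.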

Dataset

Code of Ocelot and evaluation on the challenge data

  • data_challenge
  • code_challenge

Reproducing all of these imputations and evaluations requires considerable time, even with supercomputing resources. We therefore also provide the processed data, trained models and predictions, together with the scripts needed to reproduce them.

For benchmarking, predictions from Avocado and ChromImpute are also provided.

Mapping between letter, id and histone mark in challenge

For simplicity, we map the epigenetic marks to capital letters as follows:

letter  id   mark
C       M02  DNase-seq
D       M18  H3K36me3
E       M17  H3K27me3
F       M16  H3K27ac
G       M20  H3K4me1
H       M22  H3K4me3
I       M29  H3K9me3
J       M01  ATAC-seq

For example, in the "CDEH_I" design, we used four marks (C, D, E, H) as cell type-specific features to predict mark I.
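
The naming scheme can be decoded programmatically. The sketch below is illustrative and not taken from the repository: the table is copied from the mapping above, `parse_design` is a hypothetical helper, and it assumes every design name has the form "<feature letters>_<single target letter>", as in the "CDEH_I" example.

```python
# Letter -> (challenge id, mark), copied from the mapping table above.
MARKS = {
    "C": ("M02", "DNase-seq"),
    "D": ("M18", "H3K36me3"),
    "E": ("M17", "H3K27me3"),
    "F": ("M16", "H3K27ac"),
    "G": ("M20", "H3K4me1"),
    "H": ("M22", "H3K4me3"),
    "I": ("M29", "H3K9me3"),
    "J": ("M01", "ATAC-seq"),
}


def parse_design(design):
    """Split a design such as 'CDEH_I' into (feature marks, target mark).

    Hypothetical helper; assumes one underscore and a single target letter.
    """
    features, sep, target = design.partition("_")
    if not sep or not features or len(target) != 1:
        raise ValueError("expected '<features>_<target>', got %r" % design)
    for letter in features + target:
        if letter not in MARKS:
            raise ValueError("unknown mark letter %r in %r" % (letter, design))
    return [MARKS[letter][1] for letter in features], MARKS[target][1]


features, target = parse_design("CDEH_I")
print(features, "->", target)
# ['DNase-seq', 'H3K36me3', 'H3K27me3', 'H3K4me3'] -> H3K9me3
```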

Data processing and model building scripts for ENCODE3 imputation

  • data_encode3
  • code_encode3

Data processing and model building scripts for Roadmap imputation

  • data_roadmap
  • code_roadmap

Code for SHAP analysis

  • code_shap

Code of Ocelot final submission to the ENCODE Imputation Challenge