DEcode: A Jupyter Notebook repository from Akmazad

Welcome to DEcode!

The goal of this project is to enable you to utilize genomic big data in identifying regulatory mechanisms for differential expression (DE).

DEcode predicts inter-tissue variations and inter-person variations in gene expression levels from TF-promoter interactions, RNABP-mRNA interactions, and miRNA-mRNA interactions.

You can read more about this method in this paper (full text is available at https://rdcu.be/b5r3p) where we conducted a series of evaluation and applications by predicting transcript usage, drivers of aging DE, gene coexpression relationships on a genome-wide scale, and frequent DE in diverse conditions.

Run DEcode on Code Ocean

You can run DEcode on Code Ocean platform without setting up a computational environment. Our Code Ocean capsule provides reproducible workflows, all processed data, and pre-trained models for tissue- and person-specific transcriptomes and DEprior, at gene- or transcript level.

Prepare input features

Prerequisites

Before running the scripts to create custom input data, you must have the following python libraries installed on your system:

pandas
scipy

Also you need to install the following R packages:

GenomicFeatures
tidyverse
data.table
rtracklayer
plyranges
optparse

Usage

First, download the example files by running the following commands:

#GTF file from GTEXv7 (hg19)
mkdir gtf
wget https://storage.googleapis.com/gtex_analysis_v7/reference/gencode.v19.genes.v7.patched_contigs.gtf -P ./gtf/

#eCLIP-seq peaks from Encode (hg19)
mkdir bed_rna
wget https://www.encodeproject.org/files/ENCFF039BKT/@@download/ENCFF039BKT.bed.gz -P ./bed_rna/
wget https://www.encodeproject.org/files/ENCFF379UQU/@@download/ENCFF379UQU.bed.gz -P ./bed_rna/

#ChIP-seq peaks from Encode (hg19)
mkdir bed_promoter
wget https://www.encodeproject.org/files/ENCFF553GPK/@@download/ENCFF553GPK.bed.gz -P ./bed_promoter/
wget https://www.encodeproject.org/files/ENCFF549TYR/@@download/ENCFF549TYR.bed.gz -P ./bed_promoter/

This will create directories gtf, bed_rna, and bed_promoter, and download the GTF file, two eCLIP-seq bed files, and two ChIP-seq bed files into them.

Our input data for the gene-level model was constructed based on the gencode.v19.transcripts.patched_contigs.gtf file from the GTEXv7 dataset. This file contains only one representative transcript for each gene for the human genome (hg19).

If you use a different gene model, please make sure to filter GTF file and select a representative transcript for each gene before running the pipeline.

The input data for the transcript-level model was created based on https://storage.googleapis.com/gtex_analysis_v7/reference/gencode.v19.transcripts.patched_contigs.gtf. It is not necessary to filter the GTF file for the transcript-level model.

Convert genome coordinates to RNA coordinates using the following command:

Rscript functions/bed_to_RNA_coord.R -b ./bed_rna/ -n 100 -g gtf/gencode.v19.genes.v7.patched_contigs.gtf -t rna -o custom_RNA

Arguments

bed_directory: Character string specifying the directory containing the bed files
bin: Numeric value specifying the size of bins for the genomic features
gtf_file: Character string specifying the path to the GTF file
input_type: Character string specifying the experiment type of the bed files, i.e. "promoter" or "rna"
output: Character string specifying the path and filename of the output file

This will convert bed files in the genome coordinates in the ./bed_rna/ directory to RNA coordinates using the gencode.v19.genes.v7.patched_contigs.gtf file and output as custom_RNA.txt.

If you want to map ChIP-seq peaks to promoters, use the -t option as promoter.

Rscript functions/bed_to_RNA_coord.R -b ./bed_promoter/ -n 100 -g gtf/gencode.v19.genes.v7.patched_contigs.gtf -t promoter -o custom_promoter

To convert RNA-coordinate peaks to Pandas format, use the following command:

python functions/to_sparse.py custom_RNA.txt
python functions/to_sparse.py custom_promoter.txt

# clean up
rm custom_RNA.txt 
rm custom_promoter.txt

This will convert the RNA-coordinate peaks in the custom_RNA.txt file and the custom_promoter.txt file to sparse Pandas DataFrames (custom_RNA.pkl and custom_promoter.pkl).

Place custom_RNA.pkl, custom_RNA_gene_name.txt.gz, and custom_RNA_feature_name.txt.gz in the directory where RNA features are located, for example: ./data/toy/RNA_features/. Also place custom_promoter.pkl, custom_promoter_gene_name.txt.gz, and custom_promoter_feature_name.txt.gz in the directory for promoter features, for example: ./data/toy/Promoter_features/.
Modify the code (Run_DEcode_toy.ipynb) as follows:

mRNA_data_loc = "./data/toy/RNA_features/"
mRNA_annotation_data = ["POSTAR","TargetScan","custom_RNA"]
promoter_data_loc = "./data/toy/Promoter_features/"
promoter_annotation_data = ["GTRD","custom_promoter"]

This modification to the code will instruct it to utilize the custom_RNA.pkl and custom_promoter.pkl files as part of the RNA annotation data and promoter annotation data, respectively.

If you find DEcode useful in your work, please cite our manuscript.

Tasaki, S., Gaiteri, C., Mostafavi, S. & Wang, Y. Deep learning decodes the principles of differential gene expression. Nature Machine Intelligence (2020) [link to paper] (full text is available at https://rdcu.be/b5r3p)

Source databases for traning data.

GTEx transcriptome data - GTEx portal
Transcription factor binding peaks - GTRD
RNA binding protein binding peaks - POSTAR2
miRNA binding locations - TargetScan

Akmazad/DEcode