DeepSTARR

DeepSTARR is a deep learning model built to quantitatively predict the activities of developmental and housekeeping enhancers from DNA sequence in Drosophila melanogaster S2 cells.

For more information, see the DeepSTARR publication:
DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers
Bernardo P. de Almeida, Franziska Reiter, Michaela Pagani, Alexander Stark. Nature Genetics, 2022.
Presentation at ISCB Webinar

This repository contains the code used to process genome-wide and oligo UMI-STARR-seq data and train DeepSTARR.

Genome-wide enhancer activity maps of developmental and housekeeping enhancers

We used UMI-STARR-seq (Arnold et al., 2013; Neumayr et al., 2019) to generate genome-wide, high-resolution, quantitative activity maps of developmental and housekeeping enhancers, which represent the two main transcriptional programs in Drosophila S2 cells (Arnold et al., 2017; Haberle et al., 2019; Zabidi et al., 2015).

The raw sequencing data are available from GEO under accession number GSE183939.
You can find the code to process the data here.

DeepSTARR model

DeepSTARR is a multi-task convolutional neural network that maps 249 bp long DNA sequences to both their developmental and their housekeeping enhancer activities. We adapted the Basset convolutional neural network architecture (Kelley et al., 2016) and designed DeepSTARR with four convolution layers, each followed by a max-pooling layer, and two fully connected layers. The convolution layers identify local sequence features (e.g. TF motifs) and increasingly complex patterns (e.g. TF motif syntax), while the fully connected layers combine these features and patterns to predict enhancer activity separately for each enhancer type.
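
To make the layer stack described above more concrete, the following sketch traces how the shape of a 249 bp one-hot input might change through four convolution + max-pooling stages, two fully connected layers and a two-task output. The filter counts, kernel sizes, pooling window, padding mode and dense-layer widths are illustrative assumptions that are not stated in this README; the training code in this repository remains the reference for the actual hyperparameters.

```python
# Shape walk-through of a DeepSTARR-like network (illustration only).
# All hyperparameters marked "hypothetical" are assumptions, not README facts.

SEQ_LEN = 249       # input length stated in the README
N_CHANNELS = 4      # one-hot encoding of A, C, G, T
N_TASKS = 2         # developmental and housekeeping activity

CONV_LAYERS = [(256, 7), (60, 3), (60, 5), (120, 3)]  # (filters, kernel), hypothetical
POOL_SIZE = 2                                         # hypothetical pooling window
DENSE_UNITS = [256, 256]                              # hypothetical dense widths


def trace_shapes(seq_len=SEQ_LEN, channels=N_CHANNELS):
    """Return a list of (layer_name, output_shape) tuples."""
    shapes = [("input", (seq_len, channels))]
    length = seq_len
    for i, (filters, kernel) in enumerate(CONV_LAYERS, start=1):
        # Assumes 'same' padding, so the convolution keeps the length unchanged.
        shapes.append((f"conv{i} (kernel={kernel})", (length, filters)))
        # Non-overlapping max-pooling shortens the sequence axis (floor division).
        length = length // POOL_SIZE
        shapes.append((f"maxpool{i}", (length, filters)))
        channels = filters
    shapes.append(("flatten", (length * channels,)))
    for j, units in enumerate(DENSE_UNITS, start=1):
        shapes.append((f"dense{j}", (units,)))
    # One regression output per enhancer type.
    shapes.append(("outputs (dev, hk)", (N_TASKS,)))
    return shapes


if __name__ == "__main__":
    for name, shape in trace_shapes():
        print(f"{name:22s} {shape}")
```

Under these assumptions, the sequence axis shrinks from 249 to 124, 62, 31 and finally 15 positions before the features are flattened and passed to the fully connected layers.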

You can find the code used to train DeepSTARR and compute nucleotide contribution scores here.
The data used to train and evaluate the DeepSTARR model, as well as the final trained model, are available on Zenodo at https://doi.org/10.5281/zenodo.5502060.
DeepSTARR is also deposited in Kipoi.

Tutorial

An end-to-end example that trains DeepSTARR, computes nucleotide contribution scores and discovers TF motifs with TF-MoDISco is available in the following Colab notebook: https://colab.research.google.com/drive/1Xgak40TuxWWLh5P5ARf0-4Xo0BcRn0Gd. You can run this notebook yourself to experiment with DeepSTARR.

Predict developmental and housekeeping enhancer activity of new DNA sequences

To predict the developmental and housekeeping enhancer activity in Drosophila melanogaster S2 cells for new DNA sequences, please run:

# Clone this repository
git clone https://github.com/bernardo-de-almeida/DeepSTARR.git
cd DeepSTARR/DeepSTARR

# download the trained DeepSTARR model from zenodo (https://doi.org/10.5281/zenodo.5502060)

# create 'DeepSTARR' conda environment by running the following:
conda create --name DeepSTARR python=3.7 tensorflow=1.14.0 keras=2.2.4 # or tensorflow-gpu/keras-gpu if you are using a GPU
source activate DeepSTARR
pip install git+https://github.com/AvantiShri/shap.git@master
pip install 'h5py<3.0.0'
pip install deeplift==0.6.13.0

# Run prediction script
python DeepSTARR_pred_new_sequence.py -s Sequences_example.fa -m DeepSTARR.model

Where:

  • -s FASTA file with input DNA sequences
  • -m trained DeepSTARR model downloaded from zenodo (DeepSTARR.model in the command above)
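
Before running the prediction script, it can help to check that the input FASTA file contains sequences of the length the model expects. The sketch below reads FASTA records and one-hot encodes them into a 249 x 4 representation. It is not the code used by DeepSTARR_pred_new_sequence.py; the function names, the A/C/G/T channel order and the all-zero encoding for ambiguous bases such as N are assumptions made for illustration.

```python
import io

# Channel order A, C, G, T is an assumption for this sketch.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}
EXPECTED_LEN = 249  # input length stated in the README


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def one_hot_encode(seq):
    """Encode a DNA string as a list of 4-element rows (unknown bases -> zeros)."""
    return [ONE_HOT.get(base, [0, 0, 0, 0]) for base in seq]


def encode_fasta(handle, expected_len=EXPECTED_LEN):
    """Return {name: one-hot matrix}, rejecting sequences of the wrong length."""
    encoded = {}
    for name, seq in read_fasta(handle):
        if len(seq) != expected_len:
            raise ValueError(
                f"{name}: length {len(seq)} bp, expected {expected_len} bp"
            )
        encoded[name] = one_hot_encode(seq)
    return encoded


if __name__ == "__main__":
    # Hypothetical in-memory record; a real run would open the FASTA file instead.
    example = io.StringIO(">example_seq\n" + "ACGT" * 62 + "A\n")
    matrices = encode_fasta(example)
    print(len(matrices["example_seq"]), "positions x 4 channels")
```

Sequences that are shorter or longer than 249 bp raise an error here rather than being padded or trimmed, since the README does not say how such inputs should be handled.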

UMI-STARR-seq with designed oligo libraries to test more than 40,000 wildtype and mutant Drosophila and human enhancers

We designed and synthesised (in oligo pools by Twist Bioscience) wildtype and TF motif-mutant sequences of Drosophila and human enhancers. The activity of each sequence in the oligo libraries was assessed experimentally by UMI-STARR-seq in Drosophila melanogaster S2 (both developmental and housekeeping UMI-STARR-seq; see figure below) and human HCT116 cells, respectively.

The raw sequencing data are available from GEO under accession number GSE183939.
You can find the code to analyse Drosophila and human oligo UMI-STARR-seq screens here.

Code for Figures

Scripts to reproduce each main figure can be found here and the respective processed data here.

UCSC Genome Browser tracks

Genome browser tracks showing genome-wide UMI-STARR-seq and DeepSTARR predictions in Drosophila, including nucleotide contribution scores for all enhancer sequences, together with the enhancers used for mutagenesis, mutated motif instances and respective log2 fold-changes in enhancer activity, are available at https://genome.ucsc.edu/s/bernardo.almeida/DeepSTARR_manuscript.
Dynamic sequence tracks and contribution scores are also available as a Reservoir Genome Browser session.
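
The log2 fold-changes in enhancer activity mentioned above compare a motif-mutant sequence with its wildtype counterpart. The sketch below shows the arithmetic under two possible conventions: activities on a linear scale, or activities that are already log2-transformed. Which convention applies to the tracks, and whether a pseudocount was used, is not stated in this README, so the function names and the optional pseudocount are assumptions.

```python
import math


def log2_fold_change(mutant_activity, wildtype_activity, pseudocount=0.0):
    """log2(mutant / wildtype) for activities on a linear scale.

    The optional pseudocount is a hypothetical safeguard against zeros.
    """
    numerator = mutant_activity + pseudocount
    denominator = wildtype_activity + pseudocount
    if numerator <= 0 or denominator <= 0:
        raise ValueError("activities (plus pseudocount) must be positive")
    return math.log2(numerator / denominator)


def log2_fold_change_from_log2(mutant_log2, wildtype_log2):
    """Same quantity when both activities are already log2-transformed."""
    return mutant_log2 - wildtype_log2


if __name__ == "__main__":
    # Hypothetical values: a mutation that reduces activity four-fold.
    print(log2_fold_change(1.0, 4.0))
    print(log2_fold_change_from_log2(0.0, 2.0))
```

A negative value means the mutation reduced enhancer activity relative to the wildtype sequence, and a value near zero means the mutated motif instance had little measurable effect.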

Questions

If you have any questions, requests or comments, please contact me at bernardo.almeida94@gmail.com.