deepRAM

deepRAM is an end-to-end deep learning toolkit for predicting protein binding sites and motifs. It helps users run experiments using many state-of-the-art deep learning methods and addresses the challenge of selecting model parameters in deep learning models using a fully automatic model selection strategy. This helps avoid hand-tuning and thus removes any bias in running experiments, making it user friendly without losing its flexibility. While it was designed with ChIP-seq and CLIP-seq data in mind, it can be used for any DNA/RNA sequence binary classification problem.

deepRAM lets users assemble a deep learning model from its components: the input sequence representation (one-hot or k-mer embedding), whether to use a CNN and with how many layers, and whether to use an RNN and, if so, how many layers and of which type. For CNNs, the user can also choose dilated convolution.
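To make the two input representations concrete, here is a minimal sketch in plain Python. It is not deepRAM's own code: the function names, the all-zero row for unknown bases, and the default alphabet are assumptions made for illustration. Only the k-mer length of 3 and stride of 1 mirror the defaults listed under Usage.

```python
def one_hot(seq, alphabet="ACGT"):
    """Encode a sequence as a list of one-hot rows, one row per base."""
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in seq.upper():
        row = [0] * len(alphabet)
        if base in index:  # assumption: unknown bases (e.g. N) stay all-zero
            row[index[base]] = 1
        rows.append(row)
    return rows


def kmer_tokens(seq, kmer_len=3, stride=1):
    """Split a sequence into k-mers, the 'words' an embedding layer would look up."""
    seq = seq.upper()
    return [seq[i:i + kmer_len] for i in range(0, len(seq) - kmer_len + 1, stride)]


print(one_hot("ACGT"))        # one row per base, a single 1 in each row
print(kmer_tokens("ACGTAC"))  # overlapping 3-mers with stride 1
```

For RNA input the same sketch would be called with `alphabet="ACGU"`. With one-hot encoding each base becomes a fixed sparse vector, while with k-mer tokens each short word is mapped to a learned dense vector.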

Dependency

We recommend using the Anaconda 3 platform.

Usage

usage: deepRAM.py [-h] [--train_data TRAIN_DATA] [--test_data TEST_DATA]
                  [--data_type DATA_TYPE] [--train TRAIN]
                  [--predict_only PREDICT_ONLY]
                  [--evaluate_performance EVALUATE_PERFORMANCE]
                  [--models_dir MODELS_DIR] [--model_path MODEL_PATH]
                  [--motif MOTIF] [--motif_dir MOTIF_DIR]
                  [--tomtom_dir TOMTOM_DIR] [--out_file OUT_FILE]
                  [--Embedding EMBEDDING] [--Conv CONV] [--RNN RNN]
                  [--RNN_type RNN_TYPE] [--kmer_len KMER_LEN]
                  [--stride STRIDE] [--word2vec_train WORD2VEC_TRAIN]
                  [--word2vec_model WORD2VEC_MODEL]
                  [--conv_layers CONV_LAYERS] [--dilation DILATION]
                  [--RNN_layers RNN_LAYERS]

sequence specificities prediction using deep learning approach

optional arguments:
  -h, --help            show this help message and exit
  --train_data TRAIN_DATA
                        path for training data with format: sequence label
  --test_data TEST_DATA
                        path for test data containing test sequences with or
                        without label
  --data_type DATA_TYPE
                        type of data: DNA or RNA. default: DNA
  --train TRAIN         use this option for automatic calibration, training
                        model using train_data and predict labels for
                        test_data. default: True
  --predict_only PREDICT_ONLY
                        use this option to load pretrained model (found in
                        model_path) and use it to predict test sequences
                        (train will be set to False). default: False
  --evaluate_performance EVALUATE_PERFORMANCE
                        use this option to calculate AUC on test_data. If
                        True, test_data should be format: sequence label.
                        default: False
  --models_dir MODELS_DIR
                        The directory to save the trained models for future
                        prediction including best hyperparameters and
                        embedding model. default: models/
  --model_path MODEL_PATH
                        If train is set to True, This path will be used to
                        save your best model. If train is set to False, this
                        path should have the model that you want to use for
                        prediction. default: BestModel.pkl
  --motif MOTIF         use this option to generate motif logos. default:
                        False
  --motif_dir MOTIF_DIR
                        directory to save motifs logos. default: motifs
  --tomtom_dir TOMTOM_DIR
                        directory of TOMTOM, i.e:meme-5.0.3/src/tomtom
  --out_file OUT_FILE   The output file used to store the prediction
                        probability of testing data
  --Embedding EMBEDDING
                        Use embedding layer: True or False. default: False
  --Conv CONV           Use conv layer: True or False. default: True
  --RNN RNN             Use RNN layer: True or False. default: False
  --RNN_type RNN_TYPE   RNN type: LSTM or GRU or BiLSTM or BiGRU. default:
                        BiLSTM
  --kmer_len KMER_LEN   length of kmer used for embedding layer, default= 3
  --stride STRIDE       stride used for embedding layer, default= 1
  --word2vec_train WORD2VEC_TRAIN
                        set it to False if you have already trained word2vec
                        model. If you set it to False, you need to specify the
                        path for word2vec model in word2vec_model argument.
                        default: True
  --word2vec_model WORD2VEC_MODEL
                        If word2vec_train is set to True, This path will be
                        used to save your word2vec model. If word2vec_train is
                        set to False, this path should have the word2vec model
                        that you want to use for embedding layer. default:
                        word2vec
  --conv_layers CONV_LAYERS
                        number of convolutional modules. default= 1
  --dilation DILATION   the spacing between kernel elements for convolutional
                        modules (except the first convolutional module).
                        default= 1
  --RNN_layers RNN_LAYERS
                        number of RNN layers. default= 1

Motifs identification and visualization

You need to install WebLogo and TOMTOM (part of the MEME Suite) to match identified motifs with known motifs of transcription factors and RBPs. See their documentation for installation and usage.

Installation

  1. Download deepRAM
git clone https://github.com/MedChaabane/deepRAM.git
cd deepRAM
  2. Install required packages
pip3 install -r Prerequisites
  3. Install deepRAM
python setup.py install

Datasets

  1. ChIP-seq datasets can be downloaded from: http://tools.genes.toronto.edu/deepbind/nbtcode
  2. CLIP-seq datasets can be downloaded from: https://github.com/xypan1232/iDeepS/tree/master/datasets/clip

We provide two preprocessing scripts that convert these datasets to the deepRAM input data format (sequence label; see Example input data).

Example with CLIP-seq

Preprocess the CLIP-seq files (train and test) to match the deepRAM data format (sequence label):

python preprocess_2.py --CLIP_data datasets/CLIP-seq/1_PARCLIP_AGO1234_hg19/30000/training_sample_0/sequences.fa.gz --output CLIP_train.gz
python preprocess_2.py --CLIP_data datasets/CLIP-seq/1_PARCLIP_AGO1234_hg19/30000/test_sample_0/sequences.fa.gz --output CLIP_test.gz
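
To check a preprocessed file before training, a small reader like the one below may help. This is a sketch under stated assumptions: the README only says the format is "sequence label", so the whitespace separator, the 0/1 labels, and the possible header row are guesses, and the demo sequences are made up.

```python
import gzip
import os
import tempfile


def read_sequence_label(path):
    """Yield (sequence, label) pairs from a gzipped 'sequence label' file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            # Skip blank lines and anything without a 0/1 label (e.g. a header row).
            if len(fields) < 2 or fields[1] not in ("0", "1"):
                continue
            yield fields[0], int(fields[1])


# Round-trip a tiny made-up file to show the assumed layout.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("sequence\tlabel\nACGUACGU\t1\nGGCAUUAG\t0\n")

pairs = list(read_sequence_label(demo_path))
print(pairs)
```

Pointing `read_sequence_label` at CLIP_train.gz and counting the labels is a quick way to confirm that both classes are present.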

Train the DeepBind architecture on CLIP_train.gz and evaluate its performance on CLIP_test.gz:

python deepRAM.py --train_data CLIP_train.gz --test_data CLIP_test.gz --data_type RNA --train True --evaluate_performance True --model_path DeepBind.pkl --out_file prediction.txt --Embedding False --Conv True --RNN False --conv_layers 1 
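
The --evaluate_performance option reports AUC on the test data. For readers who want to see what that number measures, here is a minimal pure-Python sketch of AUC as the probability that a positive example scores higher than a negative one. It is not deepRAM's own evaluation code, and it makes no assumption about the layout of prediction.txt: labels and scores are passed in as plain lists.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
print(example_auc)
```

The pairwise form is quadratic in the number of examples, which is fine for a spot check. For large test sets a rank-based implementation from an established library is the better choice.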

Visualize motifs and match them with known motifs:

python deepRAM.py --test_data CLIP_test.gz --data_type RNA --predict_only True --model_path DeepBind.pkl --motif True --motif_dir motifs --tomtom_dir meme-5.0.3/src/tomtom --out_file prediction.txt --Embedding False --Conv True --RNN False --conv_layers 1

Make sure to specify the TOMTOM directory with the --tomtom_dir argument.