DeepSplicer: An Improved Method of Splice Sites Prediction using Deep Learning

OluwadareLab, University of Colorado, Colorado Springs

Developers:
              Victor Akpokiro
              Department of Computer Science
              University of Colorado, Colorado Springs
              Email: vakpokir@uccs.edu

Contact:
              Oluwatosin Oluwadare, PhD
              Department of Computer Science
              University of Colorado, Colorado Springs
              Email: ooluwada@uccs.edu

1. Content of folders:

src: DeepSplicer source code. deepsplicer.py
src: Hyper-parameter tuning source code.
src: DeepSplicer cross-validation source code. deepsplicer_cross_val.py
models: Folder for the saved DeepSplicer model files
log: Folder for the result log files
plots: Folder for the result plots

2. Datasets:

In our research, we used five carefully selected datasets from the following organisms: Homo sapiens, Oryza sativa japonica, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans. We downloaded the reference genomic sequences (FASTA file format) from Albaradei, S. et al. and the corresponding annotations (GTF file format) from Ensembl. Our data were constructed with a sequence length of 400.

3. One-Hot encoding:

We used one-hot encoding to transform our genomic sequence data and labels into vectors of 0s and 1s. In other words, every element of a vector is 0 except the one that corresponds to the nucleotide base at that position, which is 1. Adenine (A) is [1 0 0 0], Cytosine (C) is [0 1 0 0], Guanine (G) is [0 0 1 0], Thymine (T) is [0 0 0 1].
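
The mapping above can be illustrated with a short Python sketch. This is not the DeepSplicer preprocessing code; the function name one_hot_encode is hypothetical, and rejecting characters other than A, C, G, and T is an assumption, since this README does not say how ambiguous bases are handled.

```python
# Mapping taken from the README: A, C, G, T in that order.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def one_hot_encode(sequence):
    """Return one 4-element vector per nucleotide in the sequence."""
    encoded = []
    for base in sequence.upper():
        if base not in ONE_HOT:
            # Assumption: ambiguous bases (e.g. N) are rejected in this sketch.
            raise ValueError(f"unsupported base: {base!r}")
        # Copy the vector so callers cannot mutate the shared mapping.
        encoded.append(list(ONE_HOT[base]))
    return encoded


print(one_hot_encode("ACGT"))
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

With this scheme, a sequence of length 400 becomes a list of 400 vectors with four elements each.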

4. Usage:

To use DeepSplicer, type the following in the terminal, where sequence is either acceptor or donor:

python deepsplicer.py -n model_name -s sequence -o organism_name -e encoded_sequence_file -l encoded_label_file

Arguments:
- model_name: A string for the name of the model
- sequence: A string specifying whether the input dataset is acceptor or donor
- organism_name: A string specifying the organism, one of ["hs", "at", "oriza", "d_mel", "c_elegans"]
- encoded_sequence_file: A file containing the encoded sequence data
- encoded_label_file: A file containing the encoded label data
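
As an illustration of how these five flags fit together, the sketch below parses them with Python's argparse module. It is not taken from deepsplicer.py: the dest names, the required and choices settings, and the example file names (encoded_seq.txt, encoded_label.txt) are all assumptions made for the example.

```python
import argparse

# Organism codes listed in this README.
ORGANISMS = ["hs", "at", "oriza", "d_mel", "c_elegans"]


def build_parser():
    """Build a parser that mirrors the flags described above (hypothetical)."""
    parser = argparse.ArgumentParser(prog="deepsplicer.py")
    parser.add_argument("-n", dest="model_name", required=True,
                        help="name of the model")
    parser.add_argument("-s", dest="sequence", required=True,
                        choices=["acceptor", "donor"],
                        help="acceptor or donor input dataset")
    parser.add_argument("-o", dest="organism", required=True,
                        choices=ORGANISMS, help="organism name")
    parser.add_argument("-e", dest="encoded_sequence_file", required=True,
                        help="file containing the encoded sequence data")
    parser.add_argument("-l", dest="encoded_label_file", required=True,
                        help="file containing the encoded label data")
    return parser


# Example invocation with hypothetical file names.
args = build_parser().parse_args(
    ["-n", "my_model", "-s", "donor", "-o", "hs",
     "-e", "encoded_seq.txt", "-l", "encoded_label.txt"]
)
print(args.model_name, args.sequence, args.organism)
```

Restricting -s and -o with choices makes a mistyped organism code fail early with a usage message rather than partway through a run.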

5. Output:

DeepSplicer outputs three files:

.h5: The DeepSplicer model and weights file.
.txt: A log file containing the accuracy and evaluation metric results.
.png: A plot of the prediction accuracy.

6. Note:

The dataset sequence length is 400.
The DeepSplicer folders [log, models, plots] are essential for the code to function.
Genomic sequence input data should be transformed using one-hot encoding.
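
Since the log, models, and plots folders must exist, a small helper such as the one below could create any that are missing before a run. This is a minimal sketch and not part of DeepSplicer: the function name ensure_output_dirs and the base_dir parameter are hypothetical, and it assumes the folders are expected directly under the directory the script is run from.

```python
import os

# Folder names taken from this README.
REQUIRED_DIRS = ["log", "models", "plots"]


def ensure_output_dirs(base_dir="."):
    """Create the required folders under base_dir if they are missing."""
    created = []
    for name in REQUIRED_DIRS:
        path = os.path.join(base_dir, name)
        # exist_ok=True leaves an existing folder and its contents untouched.
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created
```

Calling ensure_output_dirs() once from the repository folder before running deepsplicer.py would be enough under that assumption.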