OluwadareLab, University of Colorado, Colorado Springs
Developers:
Victor Akpokiro
Department of Computer Science
University of Colorado, Colorado Springs
Email: vakpokir@uccs.edu
Contact:
Oluwatosin Oluwadare, PhD
Department of Computer Science
University of Colorado, Colorado Springs
Email: ooluwada@uccs.edu
- src: DeepSplicer source code. deepsplicer.py
- src: Hyper-parameter tuning source code.
- src: DeepSplicer cross-validation source code. deepsplicer_cross_val.py
- models: Folder for the saved DeepSplicer models
- log: Folder for the result logs
- plots: Folder for the result plots
In our research, we used five carefully selected datasets from the following organisms: Homo sapiens, Oryza sativa japonica, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans. We downloaded the reference genomic sequence datasets (FASTA file format) from Albaradei, S. et al. and the corresponding annotation sequences (GTF file format) from Ensembl. Our data were constructed to permit a sequence length of 400.
We used one-hot encoding to transform our genomic sequence data and labels into vectors of 0s and 1s. In other words, every element in a vector is 0 except the element that corresponds to the nucleotide base at that position of the input sequence, which is 1. Adenine (A) is [1 0 0 0], Cytosine (C) is [0 1 0 0], Guanine (G) is [0 0 1 0], Thymine (T) is [0 0 0 1].
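The mapping above can be illustrated with a minimal sketch. This is not the DeepSplicer preprocessing code: the function name `one_hot_encode` is hypothetical, and encoding unknown symbols (such as N) as all-zero rows is an assumption made only for this example.

```python
# Column index of each base, following the mapping stated above:
# A -> [1 0 0 0], C -> [0 1 0 0], G -> [0 0 1 0], T -> [0 0 0 1].
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def one_hot_encode(sequence):
    """Return one 4-element row of 0s and 1s per nucleotide (hypothetical helper)."""
    encoded = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        # Assumption: symbols outside A/C/G/T are left as an all-zero row.
        if base in BASE_INDEX:
            row[BASE_INDEX[base]] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    for base, row in zip("ACGT", one_hot_encode("ACGT")):
        print(base, row)
```

Under this sketch, a sequence of length 400 would become a list of 400 rows with 4 entries each.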
Usage: To run DeepSplicer, type the following in the terminal: python deepsplicer.py -n model_name -s sequence(acceptor or donor) -o organism_name -e encoded_sequence_file -l encoded_label_file
- Arguments:
- model_name: A string for the name of the model
- sequence: A string to specify acceptor or donor input dataset
- organism: A string to specify the organism name, i.e. one of ["hs", "at", "oriza", "d_mel", "c_elegans"]
- encoded sequence file: A file containing the encoded sequence data
- encoded label file: A file containing the encoded label data
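To make the flags listed above concrete, here is a rough sketch of how such a command line could be parsed with Python's standard `argparse` module. It is not taken from deepsplicer.py: the `dest` names and the `build_parser` function are hypothetical, and restricting `-s` and `-o` to fixed choices is an assumption based on the values listed above.

```python
import argparse


def build_parser():
    """Build a parser for the five flags described above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="deepsplicer.py")
    parser.add_argument("-n", dest="model_name", required=True,
                        help="name of the model")
    # Assumption: only these two values are accepted for -s.
    parser.add_argument("-s", dest="sequence", required=True,
                        choices=["acceptor", "donor"],
                        help="acceptor or donor input dataset")
    # Assumption: only the organism codes listed above are accepted for -o.
    parser.add_argument("-o", dest="organism", required=True,
                        choices=["hs", "at", "oriza", "d_mel", "c_elegans"],
                        help="organism name")
    parser.add_argument("-e", dest="encoded_sequence_file", required=True,
                        help="file containing the encoded sequence data")
    parser.add_argument("-l", dest="encoded_label_file", required=True,
                        help="file containing the encoded label data")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.model_name, args.sequence, args.organism)
```

With this sketch, an invalid value for `-s` or `-o` is rejected by `argparse` before any data is loaded.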
DeepSplicer outputs three files:
- .h5: The DeepSplicer model and weight file.
- .txt: A log file that contains the accuracy and evaluation metric results.
- .png: A plot of the prediction accuracy.
- Dataset sequence length is 400.
- The DeepSplicer folders [log, models, plots] are essential for code functionality.
- Genomic sequence input data should be transformed using one-hot encoding.
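The notes above could be checked before a run with a small pre-flight sketch such as the following. It is not part of DeepSplicer: the helper names `ensure_output_dirs` and `has_expected_length` are hypothetical, and creating the folders relative to a base directory is an assumption about where the script expects them.

```python
import os

# Folder names and sequence length taken from the notes above.
REQUIRED_DIRS = ["log", "models", "plots"]
SEQUENCE_LENGTH = 400


def ensure_output_dirs(base_dir="."):
    """Create the log, models, and plots folders if they are missing."""
    paths = []
    for name in REQUIRED_DIRS:
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths


def has_expected_length(sequence, expected=SEQUENCE_LENGTH):
    """Return True if the sequence has exactly the expected length."""
    return len(sequence) == expected


if __name__ == "__main__":
    print(ensure_output_dirs())
    print(has_expected_length("A" * 400))
```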