
non-Coding RNA Deep Learning Classifier


NCC : non-Coding RNA Classifier

A new AI model, trained and tested on a freshly updated dataset of small non-coding RNA (ncRNA or sncRNA) sequences, built to classify small non-coding RNA efficiently. Experimental methods for identifying ncRNA families are time-consuming, labor-intensive, and expensive, which makes them impractical for the demands of high-throughput technology.



Performance comparison of several prediction methods

Method/Model   Accuracy   Sensitivity   Precision   F-score   MCC
RNAcon         0.3737     0.3787        0.4500      0.3605    0.3341
GraPPLE        0.6487     0.6684        0.7325      0.7050    0.6857
nRC            0.6960     0.6889        0.6878      0.6878    0.6627
ncRFP          0.7972     0.7878        0.7904      0.7883    0.7714
ncDLRES        0.8430     0.8344        0.8419      0.8407    0.8335
ncDENSE        0.8687     0.8677        0.8703      0.8667    0.8574
--> NCC        0.9897     0.9870        0.9892      0.9880    0.9889
MncR           > 97%      -             -           -         -
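
For readers unfamiliar with the column names, here is a minimal sketch of how such metrics can be computed from predicted and true class labels. It assumes macro-averaging over classes for sensitivity, precision and F-score, and the multi-class generalisation of MCC. How the values in the table were actually averaged is not stated in this README, and the function names are illustrative, not part of the repo.

```python
import math


def confusion_matrix(y_true, y_pred, labels):
    """Build a matrix where rows are true classes and columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[pred_label]] += 1
    return matrix


def classification_metrics(matrix):
    """Accuracy, macro sensitivity/precision/F-score and multi-class MCC."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(k))
    true_counts = [sum(matrix[i]) for i in range(k)]                        # row sums
    pred_counts = [sum(matrix[i][j] for i in range(k)) for j in range(k)]   # column sums

    recalls = [matrix[i][i] / true_counts[i] if true_counts[i] else 0.0
               for i in range(k)]
    precisions = [matrix[i][i] / pred_counts[i] if pred_counts[i] else 0.0
                  for i in range(k)]
    f_scores = [2 * p * r / (p + r) if (p + r) else 0.0
                for p, r in zip(precisions, recalls)]

    # Multi-class Matthews correlation coefficient.
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    mcc = numerator / denominator if denominator else 0.0

    return {
        "accuracy": correct / total,
        "sensitivity": sum(recalls) / k,
        "precision": sum(precisions) / k,
        "f_score": sum(f_scores) / k,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Toy labels for illustration only, not results from this repo.
    y_true = ["tRNA", "tRNA", "miRNA", "miRNA"]
    y_pred = ["tRNA", "miRNA", "miRNA", "miRNA"]
    print(classification_metrics(confusion_matrix(y_true, y_pred, ["tRNA", "miRNA"])))
```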

The main modules of this Repo

Functions                        Files
Data collection functions        rfam_query.py
Data Analysis                    Analysis.ipynb
Data transformation              ncc_DataTransform.py
AI Models                        ncc_Model.py
Training and testing the model   ncc_TrainTest.py

Data collection functions


The methods for collecting datasets from the Rfam database and assembling the main dataset are in the rfam_query.py file.

# Update this list if you need more or fewer RNA families to be downloaded from the Rfam db
def get_RNA_Families_in_interest() -> list:
    return [
        'Cis-reg; IRES;',
        'Cis-reg; leader;',
        'Cis-reg; riboswitch;',
        'Gene; ribozyme;',
        'Gene; rRNA;',
        'Gene; miRNA;',
        'Gene; snRNA; snoRNA; CD-box;',
        'Gene; snRNA; snoRNA; HACA-box;',
        'Gene; snRNA; snoRNA; scaRNA;',
        'Gene; tRNA;',
        'Intron;'
    ]
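
The class labels in the FASTA sample further down (IRES, tRNA, riboswitch, HACA-box) look like the last component of these Rfam type strings. Assuming that is how the labels are derived, a small helper could do the mapping. The name `family_type_to_label` is hypothetical and is not taken from rfam_query.py.

```python
def family_type_to_label(family_type: str) -> str:
    """Return the last non-empty ';'-separated component of an Rfam type string.

    Assumption: the short class label is simply the most specific component.
    """
    parts = [part.strip() for part in family_type.split(";") if part.strip()]
    if not parts:
        raise ValueError(f"empty Rfam type string: {family_type!r}")
    return parts[-1]


if __name__ == "__main__":
    # A few of the type strings listed above.
    for family_type in ("Cis-reg; IRES;", "Gene; snRNA; snoRNA; HACA-box;", "Intron;"):
        print(family_type, "->", family_type_to_label(family_type))
```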

Data Analysis

A Jupyter Notebook with some statistical analysis of the dataset, which helps finalize the data input of the AI model. The final dataset has more than 50,000 labeled RNA sequences in FASTA format, as shown below:

>IRES
ATACCTTTCTCGGCCTTTTGGCTAAGATCAAGTGTAGTATCTGTTCTTATCAGTTTAATATCTGATACGTGGGCCA ...
>tRNA
GCACCACTCTGGCCTTTTGGCTTAGATCAAGTGTAGTATCTGTTCTTATTAGTTTAACCACTAATATGGTCGCACC ...
>tRNA
ATACCTTTCTCGGCCTTTTGGCTAAGATCAAGTGTAGTATCTGTTTTTATCAGTTTAATATCTGATATGTGGTCCA ...
>riboswitch
ATTACTTCTCAGCCTTTTGGCTAAGATCAAGTGTAATAAATCTCATTGTGCTTTATGCCTAATGTGTGCTTATATT ...
>HACA-box
CCAGCTCTCTTTGCCTTTTGGCTTAGATCAAGTGTAGTATCTGTTCTTTTCAGTTTAATCTCTGAAAGTGTTCTAA ...
>tRNA
ACAGCTGATGCCGCAGCTACACTATGTATTAATCGGATTTTTGAACTTGGAGTACGGTTCTGGAGCTTGCTCCACC ...
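
A file in this layout (a `>label` header line followed by the sequence) can be summarised with a few lines of plain Python. This is a minimal sketch for checking the class balance. It does not use the fastaparser package listed in the requirements, the function names are illustrative, and the toy sequences in the usage part are made up.

```python
from collections import Counter


def iter_fasta(lines):
    """Yield (label, sequence) pairs from an iterable of FASTA lines."""
    label = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                yield label, "".join(chunks)
            label = line[1:].strip()
            chunks = []
        elif label is not None:
            # Sequences may span several lines; collect them all.
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


def class_counts(lines):
    """Count how many sequences carry each class label."""
    return Counter(label for label, _ in iter_fasta(lines))


if __name__ == "__main__":
    toy = [">IRES", "ATACC", ">tRNA", "GCACC", "ACTCT", ">tRNA", "ATACC"]
    print(class_counts(toy))
```

For a real file, pass an open file handle instead of the list, since a file object iterates line by line.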

Data transformation

Padding, cutting and encoding the RNA sequences before loading them into the AI model. If you want to change the encoding method, edit this file. One-hot encoding is used.

# Ribosome encoding
# --------------------------------------
A_rep_8d = [1, 0, 0, 0, 0, 0, 1, 0]
U_rep_8d = [0, 1, 0, 0, 0, 0, 0, 1]
G_rep_8d = [0, 0, 1, 0, 1, 0, 0, 0]
C_rep_8d = [0, 0, 0, 1, 0, 1, 0, 0]
X_rep_8d = [0, 0, 0, 0, 0, 0, 0, 0]
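
The padding, cutting and encoding steps can be sketched as follows. This is not the code from ncc_DataTransform.py. It assumes right-padding with the all-zero `X` vector, truncation from the right, and that `T` in DNA-alphabet FASTA input is treated as `U`. The function name and the `max_len` parameter are illustrative.

```python
# 8-dimensional base representations copied from the section above.
A_rep_8d = [1, 0, 0, 0, 0, 0, 1, 0]
U_rep_8d = [0, 1, 0, 0, 0, 0, 0, 1]
G_rep_8d = [0, 0, 1, 0, 1, 0, 0, 0]
C_rep_8d = [0, 0, 0, 1, 0, 1, 0, 0]
X_rep_8d = [0, 0, 0, 0, 0, 0, 0, 0]

ENCODING = {
    "A": A_rep_8d,
    "U": U_rep_8d,
    "T": U_rep_8d,  # assumption: T is encoded like U
    "G": G_rep_8d,
    "C": C_rep_8d,
}


def encode_sequence(sequence, max_len):
    """Cut or pad a sequence to max_len and encode each base as an 8-d vector."""
    trimmed = sequence.upper()[:max_len]  # cutting
    # Unknown symbols (e.g. N) fall back to the all-zero X vector.
    encoded = [list(ENCODING.get(base, X_rep_8d)) for base in trimmed]
    # Padding: fill the remaining positions with the X vector.
    encoded.extend(list(X_rep_8d) for _ in range(max_len - len(encoded)))
    return encoded


if __name__ == "__main__":
    matrix = encode_sequence("AUGC", max_len=6)
    print(len(matrix), len(matrix[0]))
```

Every sequence then becomes a `max_len` x 8 matrix, so a batch of sequences can be stacked into a fixed-shape array.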

AI Models

The Keras model used for this task. It consists of a bidirectional RNN at the input, followed by a DenseNet CNN.

Training and testing the model

A Jupyter Notebook for training and evaluating/testing the selected model, along with some metrics.

Requirements

  • python
    • docker - Docker SDK for Python
    • wget
    • fastaparser - A Python FASTA file Parser and Writer

NEED TO UPDATE

Resources

Rfam

The Rfam database is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models.