This repository contains the code and data for a project that uses machine learning models to classify a given DNA sequence as corresponding to coding or non-coding RNA. The models are trained on the following sequence features:
- ORF length
- Protein-coding potential
- GC content
- k-mer periodicity
- Minimum free energy
- Transcript length
- CpG islands
- BLAST similarity
- Conservation score
- Nucleotide frequencies
- Hexamer score
- Fickett score
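As an illustration of how a few of these features can be derived from a raw sequence, here is a minimal sketch in plain Python. The function names (`gc_content`, `nucleotide_frequencies`, `longest_orf_length`, `extract_features`) are hypothetical and not taken from this repository, and the ORF search is a simplification: it only scans the three forward-strand reading frames for an ATG followed by an in-frame stop codon.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def nucleotide_frequencies(seq: str) -> dict:
    """Relative frequency of each of A, C, G and T."""
    seq = seq.upper()
    counts = Counter(seq)
    total = len(seq)
    return {base: (counts[base] / total if total else 0.0) for base in "ACGT"}


def longest_orf_length(seq: str) -> int:
    """Length in nucleotides (stop codon included) of the longest ATG...stop
    ORF found in the three forward reading frames. Assumption: the reverse
    strand and ORFs without a stop codon are ignored."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def extract_features(seq: str) -> dict:
    """Collect a small, illustrative subset of the features listed above."""
    features = {
        "transcript_length": len(seq),
        "gc_content": gc_content(seq),
        "orf_length": longest_orf_length(seq),
    }
    for base, freq in nucleotide_frequencies(seq).items():
        features[f"freq_{base}"] = freq
    return features


if __name__ == "__main__":
    print(extract_features("CCATGAAATGGTAACC"))
```

Features such as minimum free energy, BLAST similarity, or conservation scores depend on external tools or reference data and are not covered by this sketch.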
The dataset used in this project contains both coding and non-coding RNA sequences. The sequences are stored in FASTA format and have been pre-processed to extract the features listed above, such as ORF length and protein-coding potential. The dataset is split into training and test sets: 80% of the data is used for training and the remaining 20% for testing.
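An 80/20 split like the one described can be sketched with scikit-learn's `train_test_split`. The feature matrix below is random placeholder data, not the project's dataset, and the use of `stratify` (to keep the coding/non-coding ratio equal in both sets) and a fixed `random_state` are assumptions rather than documented choices of this repository.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 sequences with 5 numeric features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Labels: 0 = non-coding, 1 = coding (50 of each in this toy example).
y = np.array([0] * 50 + [1] * 50)

# Hold out 20% of the rows for testing; stratify keeps the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```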
The following machine learning models have been implemented for this project:
- Logistic Regression
- Support Vector Machine (SVM)
- Naive Bayes
- Random Forest
These models are trained on the features extracted from the dataset and are used to predict whether a given DNA sequence corresponds to coding or non-coding RNA.
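The sketch below shows one way the four model types could be trained and compared with scikit-learn. It is not the repository's training script: the data comes from `make_classification` as a stand-in for the extracted features, the hyperparameters are defaults or arbitrary choices, `GaussianNB` is assumed as the Naive Bayes variant, and feature scaling for Logistic Regression and the SVM is an added assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the feature table (12 features, binary label).
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# One estimator per model type named in this README.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training set and score it on the held-out set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```

Accuracy on synthetic data says nothing about performance on real sequences; the loop is only meant to show how the models share a common fit/predict interface.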
To run the code in this repository, you will need:
- Python 3.x
- NumPy
- Pandas
- Scikit-learn
- Matplotlib
- Biopython
To use the code in this repository, follow these steps:
- Clone the repository to your local machine.
- Install the required dependencies using pip.
- Use the trained models to predict whether a given DNA sequence is coding RNA or non-coding RNA.
This project shows that machine learning models can be used to classify DNA sequences as coding or non-coding RNA based on sequence-derived features. The models implemented here can be further optimized and applied to biological problems such as identifying potential drug targets and understanding gene expression. So, let's go and explore the world of RNA with machine learning!