This project aims to identify genomic regions that govern gene regulation and to classify them as highly interacting or non-interacting using Explainable AI. Examining these sequences also helps determine which genomic properties influence a cell's transformation into a diseased state.
- Model biological sequencing and DNA data to understand their underlying properties and patterns.
- Use deep learning algorithms to distinguish highly interacting regions and their boundaries.
- Classify genomic sequences as potentially interacting or non-interacting.
- Identify traits of highly interacting regions that make them more interactive.
- Understand how these specific regions influence gene regulation.
- Find genomic sequence properties that influence a cell's transformation into a diseased cell.
- Programming Languages: Python
- Libraries and Frameworks: TensorFlow, Keras, scikit-learn, Matplotlib, Seaborn
- Tools: Jupyter Notebook, Git
- Data: Sub-kb Hi-C data from D. melanogaster; ChIP-seq data from the ENCODE project
- Preprocessing Flow:
  - Generate dummy sequences in which each nucleotide (A, C, G, T) is drawn with probability 0.25.
  - Create files embedding highly interacting regions using random functions.
- Training and Testing:
  - Train Markov models for highly interacting regions on the respective files.
  - Perform cross-validation and visualize results with ROC curves, AUC, accuracy, and F1 scores.
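The preprocessing and Markov-model steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's actual implementation: the first-order model, the additive pseudocount, the `CG`-repeat region, the sequence lengths, and all function names are assumptions chosen for the example, and sequences are kept in memory rather than written to files.

```python
import math
import random

ALPHABET = "ACGT"


def random_sequence(length, rng):
    """Background sequence: each nucleotide is drawn with probability 0.25."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def embed_region(sequence, region, rng):
    """Overwrite a randomly placed window of `sequence` with `region`."""
    start = rng.randrange(len(sequence) - len(region) + 1)
    return sequence[:start] + region + sequence[start + len(region):]


def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order log transition probabilities log P(next | prev)."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, row in counts.items():
        total = sum(row.values())
        model[prev] = {nxt: math.log(c / total) for nxt, c in row.items()}
    return model


def log_likelihood(seq, model):
    """Sum of log transition probabilities along the sequence."""
    return sum(model[prev][nxt] for prev, nxt in zip(seq, seq[1:]))


def log_odds(seq, pos_model, neg_model):
    """Positive score: the sequence looks more like the 'interacting' class."""
    return log_likelihood(seq, pos_model) - log_likelihood(seq, neg_model)


rng = random.Random(0)
region = "CG" * 20  # hypothetical stand-in for a highly interacting region
negatives = [random_sequence(200, rng) for _ in range(200)]
positives = [embed_region(random_sequence(200, rng), region, rng)
             for _ in range(200)]

# Train one model per class on the first 150 sequences, score the rest.
pos_model = train_markov(positives[:150])
neg_model = train_markov(negatives[:150])
held_pos = [log_odds(s, pos_model, neg_model) for s in positives[150:]]
held_neg = [log_odds(s, pos_model, neg_model) for s in negatives[150:]]

print("mean log-odds, embedded sequences:", sum(held_pos) / len(held_pos))
print("mean log-odds, background sequences:", sum(held_neg) / len(held_neg))
```

Classifying by the sign of the log-odds score is one common way to use a pair of Markov models; a higher-order model or a tuned threshold may suit real data better.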
- Model Architecture:
  - Conv1D layers with Batch Normalization and LeakyReLU activation
  - MaxPooling1D layers for downsampling and Dropout layers to reduce overfitting
  - Dense layers for final classification
- Perform cross-validation on simulated and real datasets.
- Evaluate the model using metrics such as accuracy, AUC, and ROC curves.
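As a rough illustration of the architecture listed above, the following Keras sketch stacks Conv1D, Batch Normalization, LeakyReLU, MaxPooling1D, and Dropout blocks ahead of Dense layers. The one-hot input encoding, the number of blocks, filter counts, kernel size, dropout rate, and the names `one_hot` and `build_model` are all assumptions made for this example; the notebooks may use different settings.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases stay all-zero."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((len(seq), 4), dtype="float32")
    for i, base in enumerate(seq.upper()):
        if base in index:
            encoded[i, index[base]] = 1.0
    return encoded


def build_model(seq_len, n_blocks=2, filters=32, kernel_size=8, dropout=0.3):
    """Binary classifier over one-hot sequences (hyperparameters are assumed)."""
    inputs = keras.Input(shape=(seq_len, 4))
    x = inputs
    for _ in range(n_blocks):
        x = layers.Conv1D(filters, kernel_size, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU()(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
        x = layers.Dropout(dropout)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64)(x)
    x = layers.LeakyReLU()(x)
    # Single sigmoid unit: probability that the region is highly interacting.
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy", keras.metrics.AUC(name="auc")],
    )
    return model


model = build_model(seq_len=200)
batch = np.stack([one_hot("ACGT" * 50), one_hot("TTGCA" * 40)])
probs = model.predict(batch, verbose=0)  # untrained, so values are arbitrary
print(probs.shape)
```

Compiling with both accuracy and AUC keeps the training logs aligned with the evaluation metrics named above.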
- Simulated Data:
  - Generated dummy data with embedded highly interacting regions.
  - Cross-validation results showing accuracy and AUC scores.
- Simulated Data:
  - Achieved high testing accuracy with consistent results across folds.
- Drosophila Data:
  - Moderate accuracy indicating room for improvement.
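Consistency across folds is usually judged from per-fold metrics. The sketch below shows one way to collect accuracy, F1, and AUC for each fold with scikit-learn. The synthetic features from `make_classification` and the `LogisticRegression` classifier are placeholders for illustration only, not the project's data or model, and `cross_validate` here is a hypothetical helper rather than a function from the repository.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold


def cross_validate(X, y, make_model, n_splits=5, seed=0):
    """Return one dict of metrics per fold, training a fresh model each time."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rows = []
    for train_idx, test_idx in folds.split(X, y):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        scores = model.predict_proba(X[test_idx])[:, 1]
        preds = (scores >= 0.5).astype(int)
        rows.append({
            "accuracy": accuracy_score(y[test_idx], preds),
            "f1": f1_score(y[test_idx], preds),
            "auc": roc_auc_score(y[test_idx], scores),
        })
    return rows


# Placeholder data and classifier; substitute real features and the real model.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
results = cross_validate(X, y, lambda: LogisticRegression(max_iter=1000))

summary = {
    name: (np.mean([r[name] for r in results]), np.std([r[name] for r in results]))
    for name in results[0]
}
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

A small standard deviation across folds supports a claim of consistent results; a large one suggests the score depends heavily on the split.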
- Data/: Contains all datasets and related files.
  - fruitfly/: Data specific to fruit flies.
    - Fruitfly Bed files/: BED files for different shifts.
    - Fruitfly Datasets/: Dataset files for different shifts.
    - Fruitfly Fasta/: FASTA files for different shifts.
  - deepbind-exe-file/: Contains input files for DeepBind executions.
  - dummy_markov_data/: Contains Markov model data.
- colab_files/: Contains Jupyter notebooks for various data processing and analysis tasks.
- __MACOSX/: Contains system files for macOS, which are not needed for the project execution.
- Clone the repository:
  git clone <repository-url>
- Navigate to the project directory:
  cd btechproj-main
- Install required dependencies:
  - Ensure you have Python installed.
  - Install dependencies using pip: pip install -r requirements.txt
  - Alternatively, if a requirements.txt file is not provided, manually install the dependencies mentioned in the notebooks and scripts.
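When installing manually, a quick check of which libraries are still missing can save time. The snippet below is a small sketch using only the standard library; the module names are the assumed import names for the libraries listed in this README (for example, scikit-learn is imported as `sklearn`), and `missing_modules` is a hypothetical helper, not part of the repository.

```python
import importlib.util

# Assumed import names for the libraries this README lists.
REQUIRED = ["tensorflow", "keras", "sklearn", "matplotlib", "seaborn"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All listed libraries are importable.")
```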
- Fruitfly Data: BED, FASTA, and dataset files for different shifts (e.g., shift_200, shift_500).
  - BED files: Contain genomic region coordinates.
  - FASTA files: Contain DNA sequences.
  - Dataset files: Contain the datasets used for analysis.
- DeepBind Data: Input files for running DeepBind, a tool for predicting protein binding.
- Markov Model Data: Files related to the Markov models used in the analysis.
The Jupyter notebooks in the colab_files/ directory provide various analysis and processing steps, such as:
- Markov models with cross-validation
- TensorFlow and PyTorch implementations
- Binding site predictions
- Pipelines for converting BED to FASTA and other tasks
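To make the BED-to-FASTA step concrete, here is a minimal standard-library sketch. It assumes that a "shift" means moving each BED interval by a fixed number of base pairs before extracting its sequence; the README does not define the term, so treat that interpretation, the in-memory `genome` dictionary, and the function names as illustrative assumptions. BED coordinates are 0-based and half-open, which matches Python slicing.

```python
def parse_bed(lines):
    """Yield (chrom, start, end) from BED lines, skipping headers and blanks."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        yield chrom, int(start), int(end)


def bed_to_fasta(bed_lines, genome, shift=0):
    """Build FASTA text for each BED interval after moving it by `shift` bp.

    `genome` maps chromosome names to sequence strings (hypothetical input;
    real pipelines usually read an indexed genome FASTA instead).
    Headers use the shifted 0-based, half-open coordinates.
    """
    records = []
    for chrom, start, end in parse_bed(bed_lines):
        if chrom not in genome:
            continue
        new_start, new_end = start + shift, end + shift
        if new_start < 0 or new_end > len(genome[chrom]):
            continue  # skip intervals shifted off the chromosome
        sequence = genome[chrom][new_start:new_end]
        records.append(f">{chrom}:{new_start}-{new_end}\n{sequence}\n")
    return "".join(records)


# Toy example with a made-up 20 bp chromosome.
genome = {"chr2L": "ACGTACGTACGTACGTACGT"}
bed = ["chr2L\t2\t6", "chr2L\t10\t14"]
print(bed_to_fasta(bed, genome), end="")
print(bed_to_fasta(bed, genome, shift=1), end="")
```

Intervals that fall outside the chromosome after shifting are dropped here rather than clipped, so every extracted sequence keeps its original length.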
- Running Jupyter Notebooks:
  - Navigate to the colab_files/ directory: cd colab_files
  - Start Jupyter Notebook: jupyter notebook
  - Open the desired notebook and follow the instructions within.
- Data Processing Scripts:
  - Data processing scripts are located within the Data/ directory. Run these scripts as needed for your analysis.
- Fork the repository.
- Create a new branch (git checkout -b feature-branch).
- Commit your changes (git commit -am 'Add new feature').
- Push to the branch (git push origin feature-branch).
- Create a new Pull Request.
This project is licensed under the MIT License. See the LICENSE file for details.
For any inquiries or issues, please contact [Your Name] at [your-email@example.com].