Cell-Penetrating Peptide (CPP) Predictor Project Summary

Project Overview

This project aimed to develop a machine learning model that predicts whether a given peptide sequence is a cell-penetrating peptide (CPP). CPPs are short peptides that can traverse cell membranes and potentially deliver molecular cargo into cells, which makes them valuable in drug delivery and biotechnology.

Data Preparation and Feature Extraction

We started with a small dataset of 30 peptide sequences (15 CPPs and 15 non-CPPs).
Features were extracted based on amino acid composition and physicochemical properties.
The dataset was split into training (24 samples) and test (6 samples) sets.
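
The preparation steps above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the summary does not list the 15 features, so the residue groupings, the `extract_features` and `stratified_split` helpers, and the resulting 23-value vector are all assumptions. The split is stratified here, which the summary does not confirm.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# Hypothetical residue groupings (common textbook conventions,
# not the project's confirmed feature definitions).
POSITIVE = "KR"
NEGATIVE = "DE"
HYDROPHOBIC = "AILMFVW"


def extract_features(sequence):
    """Return composition fractions plus three simple physicochemical descriptors."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty peptide sequence")
    n = len(seq)
    counts = Counter(seq)  # missing residues count as 0
    composition = [counts[aa] / n for aa in AMINO_ACIDS]
    # Approximate net charge per residue (histidine ignored).
    net_charge = (sum(counts[aa] for aa in POSITIVE)
                  - sum(counts[aa] for aa in NEGATIVE)) / n
    hydrophobic = sum(counts[aa] for aa in HYDROPHOBIC) / n
    return composition + [net_charge, hydrophobic, float(n)]


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test


# TAT (YGRKKRRQRRR) is a well-known CPP, used here only as an illustrative input.
tat_features = extract_features("YGRKKRRQRRR")

labels = [1] * 15 + [0] * 15  # 15 CPPs and 15 non-CPPs, as in the dataset described
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With 15 samples per class and a 20% test fraction, the split yields 24 training and 6 test indices, with 3 test samples from each class.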

Initial Model Development

We implemented three types of models:
- Feedforward Neural Network (FFN)
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)
Performance varied across the three architectures, and all of them were at risk of overfitting given only 24 training samples.

Feature Selection and Model Refinement

We performed feature selection with scikit-learn's SelectKBest, reducing the feature set from 15 to 5.
The selected features were: 7, 8, 9, 10, and 14.
We experimented with simpler models like Logistic Regression and Decision Trees.
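
A rough sketch of the selection step, run on synthetic data rather than the project's peptides. The summary does not name the scoring function, so `f_classif` is an assumption, and the signal is planted in columns 7, 8, 9, 10, and 14 only to mirror the reported indices (whether those are 0-based is also not stated).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Synthetic stand-in for the 24 training samples with 15 features each.
y = np.array([0, 1] * 12)
X = rng.normal(size=(24, 15))

# Plant a strong class signal in five columns (illustrative choice only).
informative = [7, 8, 9, 10, 14]
X[:, informative] += 5.0 * y[:, None]

# Keep the 5 features with the highest ANOVA F-score (assumed scoring function).
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
selected = selector.get_support(indices=True)

print(selected)          # column indices that survived selection
print(X_selected.shape)  # (n_samples, 5)
```

Fitting the selector on the training split only, and then calling `selector.transform` on the test split, keeps test labels out of the selection step.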

Hyperparameter Tuning

We used GridSearchCV to tune hyperparameters for our best-performing model.
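
The tuning step might look roughly like the sketch below. The summary does not say which model was tuned or which grid was searched, so the Random Forest estimator, the parameter grid, the 4-fold stratified cross-validation, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(0)

# Synthetic stand-in for 24 training samples with 5 selected features.
y = np.array([0, 1] * 12)
X = rng.normal(size=(24, 5))
X[:, 0] += 2.0 * y  # give one column some class signal

# Hypothetical grid; the project's actual grid is not given in the summary.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 3, None]}

# Stratified folds keep both classes present in every fold of a tiny dataset.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_  # mean cross-validated accuracy of the best setting
print(best_params, best_score)
```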

Final Model and Analysis

A Random Forest classifier was chosen as the final model due to its good performance and interpretability.
Feature importance analysis revealed that Feature 14 was by far the most important, followed by Features 9, 8, and 7.

Simple Decision Tree Model

We created a simple decision tree using the top two features (14 and 9), which provided easily interpretable rules:

If Feature 14 ≤ 0.70, classify as non-CPP
If Feature 14 > 0.70:
- If Feature 9 ≤ 0.04 or > 0.09, classify as CPP
- If 0.04 < Feature 9 ≤ 0.09, classify as non-CPP
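
The rules above translate directly into a small function. The name `classify_peptide` and its argument names are illustrative, and the inputs are assumed to be on the same scale as the thresholds quoted in the rules.

```python
def classify_peptide(feature_14, feature_9):
    """Apply the two-feature decision rules from the summary."""
    # Low Feature 14 is always classified as non-CPP.
    if feature_14 <= 0.70:
        return "non-CPP"
    # High Feature 14: only the middle band of Feature 9 is non-CPP.
    if 0.04 < feature_9 <= 0.09:
        return "non-CPP"
    return "CPP"


print(classify_peptide(0.50, 0.02))  # non-CPP (Feature 14 too low)
print(classify_peptide(0.80, 0.02))  # CPP
print(classify_peptide(0.80, 0.05))  # non-CPP (Feature 9 in the middle band)
print(classify_peptide(0.80, 0.20))  # CPP
```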

Visualizations and Interpretations

We created decision boundary plots to visualize how the model separates CPPs from non-CPPs.
Feature importance was visualized using both Mean Decrease in Impurity and Permutation Importance methods.
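
Both importance measures can be computed as in the sketch below, which uses synthetic data with the signal planted in a single column. The data, the forest settings, and the variable names are assumptions. Permutation importance is computed on the training data here for brevity; a held-out set is usually preferable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic stand-in: 30 samples, 5 features, signal only in column 4.
y = np.array([0, 1] * 15)
X = rng.normal(size=(30, 5))
X[:, 4] += 4.0 * y

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean Decrease in Impurity: accumulated during training, sums to 1.
mdi = model.feature_importances_

# Permutation importance: drop in score when one column is shuffled.
perm = permutation_importance(model, X, y, n_repeats=20, random_state=0)
perm_mean = perm.importances_mean

for i in range(X.shape[1]):
    print(f"feature {i}: MDI={mdi[i]:.3f}  permutation={perm_mean[i]:.3f}")
```

Comparing the two rankings is a useful sanity check, since impurity-based importance is computed from the training data alone and can overstate features the forest has overfit.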

Key Findings

Feature 14 (likely representing a key physicochemical property) is crucial for CPP classification.
Features 14 and 9 interact non-linearly in determining CPP status: among peptides with Feature 14 > 0.70, only those with Feature 9 between 0.04 and 0.09 are classified as non-CPP.
A simple rule-based classifier using just two features can provide reasonable performance.

Limitations and Future Work

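The small dataset size (30 samples) limits the model's generalizability. With only 6 test samples, a single misclassification shifts test accuracy by about 17 percentage points, so reported performance is a coarse estimate.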
Further data collection, focusing on the identified important features, would be beneficial.
Consultation with domain experts is needed to interpret the biological significance of the findings.

Next Steps

Identify the exact peptide properties that the important features represent.
Validate the model's findings with peptide science experts.
Collect more data, especially around the identified decision boundaries.
Investigate peptides that are near the decision boundaries for potential insights.
Consider developing a more robust model with a larger dataset while maintaining interpretability.

This project demonstrates the potential of machine learning in predicting cell-penetrating peptides, while also highlighting the importance of combining data-driven approaches with domain expertise in bioinformatics research.

cesco345/cell_penetrating_peptides_predictor