Collecting AMP MIC data from different sources, then running a GAN to output promising sequences

Antimicrobial-Peptides

Data and code for the paper "Deep learning regression model for antimicrobial peptide design". This repository contains code for training a model to predict the antimicrobial activity of peptides against various bacteria, including E. coli and P. aeruginosa.

Data

GRAMPA (link to csv file) is a database of peptides and their antimicrobial activity against various bacteria. The database contains the following key columns:

  • bacterium: the target bacterium.
  • sequence: the sequence of amino acids that make up the peptide.
  • strain: the strain of bacterium, when available.
  • value: the MIC (minimum inhibitory concentration) of the peptide against the bacterium.

The database also contains the following auxiliary columns:

  • database: the database from which the row's information was scraped.
  • url_source: a link to the database page from which the row's information was scraped.
  • modifications: modifications that have been applied to the sequence.
  • unit: the unit of measurement of MIC, always uM.
  • is_modified: a binary column stating whether or not the sequence was modified.
  • has_unusual_modification: a binary column stating whether or not the sequence was modified in any way other than by C-terminal amidation.
  • has_cterminal_amidation: a binary column stating whether or not the sequence was modified with C-terminal amidation.
  • datasource_has_modifications: a column stating whether the database for that row contained modification information. When this column is False, the sequence may have been modified irrespective of the value of is_modified.
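
As a rough illustration of how these columns might be used together, the sketch below selects rows for one bacterium and keeps only peptides whose modification status is known and limited to C-terminal amidation at most. It reads a tiny in-memory stand-in rather than the real file: the sample rows (sequences, strain, values) are invented placeholders, the helper names as_bool and select_rows are hypothetical, and the True/False spelling of the binary columns is an assumption that should be checked against the actual CSV.

```python
import csv
import io

# Invented placeholder rows using the column names described above.
# None of these sequences or values come from GRAMPA.
SAMPLE_CSV = """bacterium,sequence,strain,value,unit,is_modified,has_unusual_modification,has_cterminal_amidation,datasource_has_modifications
E. coli,AAKKLLKK,ATCC 25922,1.2,uM,False,False,False,True
E. coli,KKLLAAKKGG,,0.5,uM,True,False,True,True
P. aeruginosa,AAKKLLKK,,2.0,uM,False,False,False,True
E. coli,RRWWRF,,3.1,uM,True,True,False,True
E. coli,GGKKLL,,4.0,uM,False,False,False,False
"""


def as_bool(text):
    """Interpret a binary column. The accepted spellings are an assumption."""
    return text.strip().lower() in {"true", "1"}


def select_rows(rows, bacterium):
    """Keep rows for one bacterium with trustworthy, non-unusual modification info."""
    return [
        row
        for row in rows
        if row["bacterium"] == bacterium
        # Without modification info, is_modified cannot be relied on.
        and as_bool(row["datasource_has_modifications"])
        # Allow unmodified peptides and C-terminal amidation only.
        and not as_bool(row["has_unusual_modification"])
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
ecoli = select_rows(rows, "E. coli")
for row in ecoli:
    print(row["sequence"], row["value"], row["unit"])
```

To run the same filter on the real data, the io.StringIO(SAMPLE_CSV) argument would be replaced by an open file handle for the GRAMPA CSV.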

Training a model

To train a model for E. coli using a 1:1 ratio of random negative examples, running for 60 epochs:

git clone git@github.com:zswitten/Antimicrobial-Peptides.git
cd Antimicrobial-Peptides
pip install -r requirements.txt
python src/train_model.py --negatives=1 --bacterium='E. coli' --epochs=60
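
The README does not describe how src/train_model.py builds its random negative examples. The sketch below shows one plausible reading of a 1:1 ratio: for every known peptide, draw one random sequence of amino acids with a length borrowed from the known set. The function name random_negatives, its parameters, and the length-matching strategy are all assumptions made for illustration, not the repository's implementation.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def random_negatives(positives, ratio=1, seed=0):
    """Draw ratio * len(positives) random sequences (hypothetical helper).

    Each negative copies the length of a randomly chosen positive and is
    rejected if it happens to equal a known positive sequence.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    known = set(positives)
    target = int(ratio * len(positives))
    negatives = []
    while len(negatives) < target:
        length = len(rng.choice(positives))
        candidate = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        if candidate not in known:
            negatives.append(candidate)
    return negatives


# Invented placeholder peptides, not taken from GRAMPA.
positives = ["AAKKLLKK", "KKLLAAKKGG", "RRWWRF"]
negatives = random_negatives(positives, ratio=1)
print(len(positives), len(negatives))
```

Under this reading, --negatives=1 would correspond to ratio=1, so the training set contains as many random sequences as measured peptides.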

This notebook contains code for reproducing the figures in the paper.