- Task 1: AMP binary classification.
- Task 2: AMP regression.
- Other tasks: AMP ranking and multilabel classification will be published in subsequent papers (coming soon).
# clone project
git clone https://github.com/William-Zhanng/SenseXAMP.git
cd SenseXAMP
# create conda virtual environment
conda create -n torch1.7 python=3.8
conda activate torch1.7
# install all requirements
pip install -r requirements.txt
Before utilizing SenseXAMP, it's important to prepare the datasets appropriately. Our research utilized several dataset versions, and it's crucial to have all the following versions of datasets ready before running SenseXAMP:
- Format: .csv
- Description: The "ori_datasets" version includes train.csv, val.csv, and test.csv. These datasets contain only sequences and their corresponding labels.
- Obtaining Method: Download the "ori_datasets" version of the datasets from here.
- Format: .h5
- Description: The "esm_embeddings" version is in .h5 format and is generated with the esm-1b model. It is derived from ori_datasets.
- Obtaining Method: Run the script tools/esm_emb_gen.py (the embedding files are too large to provide for download).
- Format: .h5
- Description: The "stc_info" version is in .h5 format and is obtained by calculating protein descriptors from the sequences. It is derived from ori_datasets.
- Obtaining Method:
  - Download from here.
  - Or run the script tools/stc_gen.py (the stc_csv version must be generated first).
- Format: .csv
- Description: The "stc_csv" version is in .csv format and is obtained by calculating protein descriptors from the sequences. It also includes train.csv, val.csv, and test.csv, and is derived from ori_datasets.
- Obtaining Method: Run the script tools/generate_csv.py.
Note: The "stc_csv" dataset version is primarily intended for comparative methods like SMEP and is not necessary for using SenseXAMP.
- Download the "ori_datasets" version of the datasets from here.
- Download the "stc_info" version of the datasets from here.
- Generate the "esm_embeddings" version of the dataset by running tools/esm_emb_gen.py.
Download our model checkpoints from here.
We recommend organizing all dataset versions according to the project structure provided at the end of this README.
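Before training, it can help to confirm that the required dataset versions are in place. The following is a minimal sketch, not part of the SenseXAMP code base: the helper name `missing_dataset_versions` is hypothetical, and the folder names are taken from the project structure in this README. The optional "stc_csv" version is deliberately not checked.

```python
from pathlib import Path

# Dataset versions that this README says SenseXAMP needs; "stc_csv" is
# described as optional, so it is left out here.
REQUIRED_VERSIONS = ("ori_datasets", "esm_embeddings", "stc_info")


def missing_dataset_versions(root="datasets"):
    """Return the required version folders that are absent or empty under root."""
    root = Path(root)
    missing = []
    for name in REQUIRED_VERSIONS:
        folder = root / name
        # A folder counts as ready only if it exists and holds at least one entry.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    gaps = missing_dataset_versions()
    print("all dataset versions found" if not gaps else f"missing: {', '.join(gaps)}")
```

Run it from the repository root. An empty list means every required folder exists and is non-empty; it does not validate the files inside.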
Here is an example of generating esm-1b embeddings for the ori_datasets version of the AMPlify dataset:
python tools/esm_emb_gen.py --dataset_dir ./datasets/ori_datasets/AMPlify --fname AMPlify.h5
After running this command, an AMPlify.h5 file will be generated in the datasets/esm_embeddings/all directory.
Tip: Since our proposed balanced classification (cls) dataset is a subset of the imbalanced cls dataset, generating embeddings for ori_datasets/cls_benchmark_imbalanced/ with this script yields a cls_benchmark.h5 file that can be used for experiments on both the balanced and the imbalanced cls datasets.
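The subset relationship described in the tip can be checked directly on the CSV files before reusing one embeddings file for both datasets. This is a small sketch using only the standard library. The column name `Sequence` is an assumption (adjust it to the actual CSV header), and both helper names are hypothetical.

```python
import csv


def read_sequences(csv_path, column="Sequence"):
    """Collect the set of sequences stored in one CSV file.

    The default column name "Sequence" is an assumption, not taken from the repo.
    """
    with open(csv_path, newline="") as handle:
        return {row[column] for row in csv.DictReader(handle)}


def uncovered_sequences(subset_csv, superset_csv, column="Sequence"):
    """Return sequences present in subset_csv but absent from superset_csv."""
    return read_sequences(subset_csv, column) - read_sequences(superset_csv, column)
```

An empty result for a balanced split checked against the matching imbalanced split indicates that embeddings generated from the imbalanced data cover every balanced sequence.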
In this project, the model, datasets, and hyperparameters are all set in config.py. Therefore, before running run.py, please ensure that the corresponding config.py is correctly configured.
We provide configs for the main experiments in the paper, which can be used as references.
The provided configs are named according to the following rule:
DATASET_MODEL.py
The keywords that appear in the names of the provided configs are explained below.
About datasets:
- ecoli: Our proposed E.coli regression dataset.
- saureus: Our proposed S.aureus regression dataset.
- benchmark_balanced: Our balanced classification dataset.
- benchmark_imbalanced: Our imbalanced classification dataset.
- AMPlify: The AMPlify dataset.
- AmPEP: The DeepAmpep-30 dataset.
About models:
- fcn: The model-PD branch of SenseXAMP, utilizing structured data exclusively. Refer to the paper for detailed information.
- onlyslfattn: The model-EB branch of SenseXAMP, exclusively using embeddings from esm-1b. Detailed information is available in the paper.
- SenseXAMP: The comprehensive SenseXAMP model.
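The naming rule can be sketched as a small parser that recovers the dataset and model keywords from a config filename. This is illustrative only: `parse_config_name` is a hypothetical helper, and the keyword sets simply mirror the lists above. Because dataset keywords may contain underscores while the listed model keywords do not, the split happens at the last underscore.

```python
from pathlib import Path

# Dataset and model keywords as listed in this README.
DATASETS = {"ecoli", "saureus", "benchmark_balanced", "benchmark_imbalanced", "AMPlify", "AmPEP"}
MODELS = {"fcn", "onlyslfattn", "SenseXAMP"}


def parse_config_name(config_path):
    """Split a DATASET_MODEL.py config name into (dataset, model)."""
    stem = Path(config_path).stem
    # Dataset names may contain underscores, model names do not,
    # so split on the last underscore only.
    dataset, sep, model = stem.rpartition("_")
    if not sep or dataset not in DATASETS or model not in MODELS:
        raise ValueError(f"{config_path!r} does not follow DATASET_MODEL.py")
    return dataset, model
```

For instance, `parse_config_name("./configs/cls_task/benchmark_balanced_SenseXAMP.py")` yields the pair `benchmark_balanced` and `SenseXAMP`. Custom configs with other names raise `ValueError`.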
For example, to train SenseXAMP on our proposed balanced classification dataset:
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node 1 run.py \
--config ./configs/cls_task/benchmark_balanced_SenseXAMP.py --mode train
For example, to evaluate SenseXAMP on the test set of our proposed balanced classification dataset (make sure ckpt_path in the config file points to your checkpoint):
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node 1 run.py \
--config ./configs/cls_task/benchmark_balanced_SenseXAMP.py --mode test
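When scripting several runs, the two commands above can be assembled programmatically. The sketch below only mirrors the flags shown in this README. `build_launch` is a hypothetical helper, and it accepts only the `train` and `test` modes because those are the only ones documented here.

```python
import os
import sys


def build_launch(config, mode, gpu="0", nproc=1):
    """Return (argv, env) mirroring the launch commands shown in this README."""
    if mode not in ("train", "test"):
        raise ValueError("mode must be 'train' or 'test'")
    argv = [
        sys.executable, "-m", "torch.distributed.launch",
        "--nproc_per_node", str(nproc),
        "run.py", "--config", config, "--mode", mode,
    ]
    # Copy the current environment and pin the visible GPU, as in the README.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    return argv, env
```

The returned pair can be passed to `subprocess.run(argv, env=env, check=True)` from the repository root. Nothing is executed by the helper itself.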
The core project framework of SenseXAMP is outlined below. It is recommended to refer to this project structure when organizing all files.
SenseXAMP/
├── Ampmm_base/ : Core implementation of the trainer, including models, loss, dataloader, etc.
│   ├── data/
│   ├── models/
│   ├── runner/
│   └── utils/
├── configs/ : Configs for the different experiments; you can also write your own configs.
│   ├── cls_task/
│   │   ├── benchmark_imbalanced_SenseXAMP.py
│   │   ├── benchmark_balanced_SenseXAMP.py
│   │   ├── ...
│   │   └── Your_own_config.py
│   └── reg_task/
│       ├── ecoli_SenseXAMP.py
│       ├── saureus_SenseXAMP.py
│       ├── ...
│       └── Your_own_config.py
├── datasets/ : Different versions of the datasets
│   ├── ori_datasets/ : 'ori_datasets' version
│   │   ├── cls_benchmark_imbalanced/
│   │   ├── cls_benchmark_balanced/
│   │   ├── ...
│   │   └── AMPlify/
│   ├── stc_datasets/ : 'stc_datasets' version, same structure as 'ori_datasets'
│   ├── esm_embeddings/ : 'esm_embeddings' version
│   └── stc_info/ : 'stc_info' version
├── experiments/ : Experiment results (including automatically saved model checkpoints)
├── tools/ : Useful scripts, such as those that generate the different dataset versions
├── utils/ : Code required by the implementation of Ampmm_base
├── requirements.txt
└── run.py