This is the official repository for the NeurIPS 2024 paper MutaPLM: Protein Language Modeling for Mutation Explanation and Engineering.
```
pytorch==1.13.1+cu117
transformers==4.36.1
peft==0.9.0
pandas
numpy
scipy
evoprotgrad
nltk
rouge_score
sequence_models
scikit-learn
```
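The dependencies above can be installed with pip. The PyTorch version is pinned to a CUDA 11.7 build, so it needs the matching wheel index; the other packages come from PyPI (a sketch — adjust the index URL if your CUDA version differs):

```shell
# Install the CUDA 11.7 build of PyTorch first (wheel index per pytorch.org).
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117

# Then the remaining dependencies.
pip install transformers==4.36.1 peft==0.9.0 pandas numpy scipy \
    evoprotgrad nltk rouge_score sequence_models scikit-learn
```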
The pre-training dataset and the MutaDescribe dataset are available at HuggingFace. Download the data and place them under the `data` folder.
Before running the scripts, you should:
- Download the PLM checkpoint esm2_t33_650M_UR50D and put it in `ckpts/esm2-650m`.
- Download the LLM checkpoint BioMedGPT-LM and put it in `ckpts/biomedgpt-lm`. If you intend to perform evaluation only, you can just download the configuration files.
- Download the fine-tuned checkpoint MutaPLM and put it in `ckpts/mutaplm`.
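Before launching any script, it may help to confirm that the data and checkpoints landed in the expected locations. A minimal sketch (the directory names are taken from the setup steps above; the check itself is not part of the repository):

```python
import os

# Directories named in the setup instructions above.
EXPECTED_DIRS = [
    "data",                # pre-training + MutaDescribe datasets
    "ckpts/esm2-650m",     # ESM-2 650M protein language model
    "ckpts/biomedgpt-lm",  # BioMedGPT-LM checkpoint (or config files only)
    "ckpts/mutaplm",       # fine-tuned MutaPLM checkpoint
]

def check_layout(root="."):
    """Return the expected directories that are missing under `root`."""
    return [d for d in EXPECTED_DIRS if not os.path.isdir(os.path.join(root, d))]

missing = check_layout()
if missing:
    print("Missing:", ", ".join(missing))
```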
For pre-training on protein literature, run the following script:

```shell
bash scripts/train/pretrain.sh
```

For fine-tuning on the MutaDescribe dataset, run the following script:

```shell
bash scripts/train/finetune.sh
```

For evaluating MutaPLM on mutation explanation, run the following script:

```shell
bash scripts/test/mutaplm_explain.sh
```

For evaluating MutaPLM on mutation engineering, run the following script:

```shell
bash scripts/test/mutaplm_engineer.sh
```
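Mutation explanation produces free-text effect descriptions, which are scored with text-overlap metrics such as ROUGE (the `rouge_score` package in the requirements). As a stdlib-only illustration of what ROUGE-1 F1 measures — not the evaluation code the scripts actually run — comparing a predicted explanation against a reference:

```python
from collections import Counter

def rouge1_f1(reference: str, prediction: str) -> float:
    """Unigram-overlap F1 between a reference and a predicted explanation."""
    ref = Counter(reference.lower().split())
    pred = Counter(prediction.lower().split())
    overlap = sum((ref & pred).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Example: a predicted mutation effect vs. a reference description.
ref = "the mutation decreases catalytic activity"
pred = "this mutation reduces catalytic activity"
print(round(rouge1_f1(ref, pred), 3))  # → 0.6
```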
```bibtex
@misc{luo2024mutaplm,
    title={MutaPLM: Protein Language Modeling for Mutation Explanation and Engineering},
    author={Yizhen Luo and Zikun Nie and Massimo Hong and Suyuan Zhao and Hao Zhou and Zaiqing Nie},
    year={2024},
    eprint={2410.22949},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2410.22949},
}
```