Code for ProSST: A Pre-trained Protein Sequence and Structure Transformer with Disentangled Attention. (NeurIPS 2024)
- Our MSA-enhanced model ProtREM achieves a Spearman's rho of 0.518 on the ProteinGym benchmark.
git clone https://github.com/ginnm/ProSST.git
cd ProSST
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:$(pwd)
from prosst.structure.quantizer import PdbQuantizer
processor = PdbQuantizer(structure_vocab_size=2048) # can be 20, 128, 512, 1024, 2048, 4096
result = processor("example_data/p1.pdb", return_residue_seq=False)
Output:
[407, 998, 1841, 1421, 653, 450, 117, 822, ...]
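Before passing the tokens to a model, it can help to sanity-check them. The sketch below is not part of the ProSST codebase: `check_structure_tokens` is a hypothetical helper, and it assumes that the quantizer output has already been reduced to a plain list of integers, that there is one token per residue, and that token IDs fall in `[0, structure_vocab_size)`.

```python
from typing import Sequence


def check_structure_tokens(
    tokens: Sequence[int], vocab_size: int, n_residues: int
) -> None:
    """Raise ValueError if the tokens look inconsistent with the structure.

    Hypothetical helper. Assumes one structure token per residue and
    token IDs in the half-open range [0, vocab_size).
    """
    if len(tokens) != n_residues:
        raise ValueError(f"expected {n_residues} tokens, got {len(tokens)}")
    out_of_range = [t for t in tokens if not 0 <= t < vocab_size]
    if out_of_range:
        raise ValueError(
            f"tokens outside [0, {vocab_size}): {out_of_range[:5]}"
        )


# The first eight tokens from the example output above.
tokens = [407, 998, 1841, 1421, 653, 450, 117, 822]
check_structure_tokens(tokens, vocab_size=2048, n_residues=8)
```

A check like this would catch a mismatch between the `structure_vocab_size` used for quantization and the one the chosen model expects, for example tokens produced with a 2048-entry vocabulary being used with a smaller one.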
from transformers import AutoModelForMaskedLM, AutoTokenizer
model = AutoModelForMaskedLM.from_pretrained("AI4Protein/ProSST-2048", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("AI4Protein/ProSST-2048", trust_remote_code=True)
See AI4Protein/ProSST-* for more models.
Zero-shot mutant effect prediction
Download the dataset from Google Drive. (The archive contains the quantized structures for ProteinGym.)
cd example_data
unzip proteingym_benchmark.zip
cd ..  # return to the repository root, where zero_shot/ lives
python zero_shot/proteingym_benchmark.py --model_path AI4Protein/ProSST-2048 \
--structure_dir example_data/structure_sequence/2048
If you use ProSST in your research, please cite the following paper:
@inproceedings{
li2024prosst,
title={ProSST: Protein Language Modeling with Quantized Structure and Disentangled Attention},
author={Mingchen Li and Yang Tan and Xinzhu Ma and Bozitao Zhong and Huiqun Yu and Ziyi Zhou and Wanli Ouyang and Bingxin Zhou and Pan Tan and Liang Hong},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024}
}