
ProteinChat: Towards Enabling ChatGPT-Like Capabilities on Protein 3D Structures

This repository holds the code and data of ProteinChat: Towards Enabling ChatGPT-Like Capabilities on Protein 3D Structures.

The technical report is available here

Examples

(Figure: an example interaction with ProteinChat)

Introduction

  • In this work, we make an initial attempt towards enabling ChatGPT-like capabilities on protein 3D structures by developing a prototype system, ProteinChat.
  • ProteinChat works in a similar way to ChatGPT: users upload a protein 3D structure and ask various questions about this protein, and ProteinChat answers these questions in a multi-turn, interactive manner.
  • The ProteinChat system consists of a protein 3D structure encoder (based on ESM inverse folding), a large language model (LLM), and an adaptor. The protein encoder takes a protein 3D structure as input and learns a representation for this protein. The adaptor transforms the protein representation produced by the protein encoder into another representation that is acceptable to the LLM. The LLM takes the representation transformed by the adaptor and users' questions about this protein as inputs and generates answers. All these components are trained end-to-end (a minimal sketch of this data flow is given after this list).
  • To train ProteinChat, we collected an instruction tuning dataset containing 143,508 proteins and 143,508 instructions.
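The data flow described above can be sketched as follows. This is a minimal, illustrative sketch only: the class, attribute, and method names are hypothetical and do not correspond to the actual modules in this repository.

import torch.nn as nn

class ProteinChatSketch(nn.Module):
    # Hypothetical wrapper tying together the three components described above.
    def __init__(self, protein_encoder, adaptor, llm):
        super().__init__()
        self.protein_encoder = protein_encoder  # e.g. an ESM inverse-folding based encoder
        self.adaptor = adaptor                  # maps protein features into the LLM's embedding space
        self.llm = llm                          # e.g. a Vicuna-13B language model

    def forward(self, structure, question_tokens):
        protein_repr = self.protein_encoder(structure)   # representation of the protein 3D structure
        protein_embeds = self.adaptor(protein_repr)       # representation acceptable to the LLM
        # the (hypothetical) generate interface consumes the protein embedding and the question
        return self.llm.generate(protein_embeds, question_tokens)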

(Figure: overview of the ProteinChat system)

Datasets

The dataset contains 143,508 proteins (represented using 3D structures) with 143,508 instructions. The instruction set is available at this link. The processed protein files (83 GB in total) are available at this link. The data is curated from the Protein Data Bank. More details can be found here.

Getting Started

Installation

These instructions largely follow those in MiniGPT-4.

1. Prepare the code and the environment

Git clone our repository, create a Python environment, and activate it via the following commands:

git clone https://github.com/UCSD-AI4H/proteinchat
cd proteinchat
conda env create -f environment.yml
conda activate proteinchat
pip install einops

Verify that torch and torchvision are installed correctly by running python -c "import torchvision; print(torchvision.__version__)". If it outputs the version number without any warnings or errors, you are good to go. If it outputs warnings or errors, uninstall the packages with conda uninstall pytorch torchvision torchaudio cudatoolkit and reinstall them following here. You need to find the correct command for the CUDA version your GPU driver supports (check nvidia-smi).

2. Prepare the pretrained Vicuna weights

The current version of ProteinChat is built on the v0 version of Vicuna-13B. Please refer to our instruction here to prepare the Vicuna weights. The final weights should be in a single folder with a structure similar to the following:

vicuna_weights
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
...   

Then, set the path to the Vicuna weights in the model config file here at Line 16.
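For reference, in MiniGPT-4-style configs the relevant entry is a llama_model field; the exact key name and line may differ in this repository's config, so treat the snippet below as an assumption and verify it against the file referenced above.

llama_model: "/path/to/vicuna_weights/"  # set this to the folder prepared in the previous step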

Training

You need roughly 45 GB of GPU memory for training.

The training configuration file is train_configs/minigpt4_stage2_esm.yaml. You may want to change the number of epochs and other hyper-parameters there, such as max_epoch, init_lr, min_lr, warmup_steps, and batch_size_train. Please adjust iters_per_epoch so that iters_per_epoch * batch_size_train equals your training set size. Due to the GPU memory consumption, we set batch_size_train=1.
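As a rough illustration of these settings, the relevant entries in train_configs/minigpt4_stage2_esm.yaml might look like the snippet below; the values are placeholders rather than recommended hyper-parameters, and the surrounding YAML structure may differ from this sketch.

run:
  max_epoch: 10            # number of training epochs (placeholder)
  init_lr: 3e-5            # initial learning rate (placeholder)
  min_lr: 1e-5             # minimum learning rate (placeholder)
  warmup_steps: 200        # learning-rate warmup steps (placeholder)
  batch_size_train: 1      # set to 1 due to GPU memory consumption
  iters_per_epoch: 143508  # iters_per_epoch * batch_size_train = training set size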

Start training the LLaMA model on the protein dataset by running bash finetune.sh.
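The script is expected to wrap a launch command roughly like the one below; this is an assumption about its contents, so check finetune.sh in the repository for the actual command and the number of GPUs it uses.

# hypothetical sketch of what finetune.sh runs
torchrun --nproc_per_node=1 train.py --cfg-path train_configs/minigpt4_stage2_esm.yaml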

Demo

It takes around 24 GB of GPU memory to run the demo.

Find the checkpoint saved during the training process above, which is located under the folder minigpt4/output/minigpt4_stage2_esm/ by default. Copy it to the folder ckpt by running cp minigpt4/output/minigpt4_stage2_esm/.../checkpoint_xxx.pth ckpt/, and modify the ckpt entry in eval_configs/proteinchat_eval.yaml to point to the location of your checkpoint.
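For example, if the checkpoint was copied into ckpt/, the entry in eval_configs/proteinchat_eval.yaml would point to it as in the snippet below; checkpoint_xxx is a placeholder for your actual checkpoint file name, and the surrounding YAML structure may differ.

ckpt: "ckpt/checkpoint_xxx.pth"  # path to the checkpoint copied above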

Now launch the demo (demo.py) in the same environment by running bash demo.sh on your local machine. Then, open the URL created by the demo and try it out!
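demo.sh is presumably a thin wrapper that launches demo.py with the evaluation config; this is an assumption about its contents, so check the script for the actual invocation and flags.

# hypothetical sketch of what demo.sh runs
python demo.py --cfg-path eval_configs/proteinchat_eval.yaml --gpu-id 0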

Acknowledgement

License

This repository is under the BSD 3-Clause License. Much of the code is based on MiniGPT-4 (BSD 3-Clause License, here), which in turn is based on Lavis (BSD 3-Clause License, here).

Disclaimer

This is a prototype system that has not been systematically and comprehensively validated by biologists yet. Please use with caution.

Trained models and demo websites will be released after we thoroughly validate the system with biologists.

Citation

If you're using ProteinChat in your research or applications, please cite using this BibTeX:

@article{guo2023proteinchat,
  title={ProteinChat: Towards Enabling ChatGPT-Like Capabilities on Protein 3D Structures},
  author={Guo, Han and Huo, Mingjia and Xie, Pengtao},
  year={2023}
}