Mutational Effect Transfer Learning

Primary language: Python. License: MIT.

This repository contains the Mutational Effect Transfer Learning (METL) framework for pretraining and finetuning biophysics-informed protein language models. You can use it to train models on your own data or recreate the results from our manuscript. This framework uses PyTorch Lightning.

  • To access pretrained METL models in pure PyTorch with minimal software dependencies, see our metl-pretrained repository.
  • To recreate the results from our preprint, see our metl-pub repository.
  • To run your own molecular simulations, see our metl-sim repository.

For more information, please see our manuscript:

Biophysics-based protein language models for protein engineering.
Sam Gelman, Bryce Johnson, Chase Freschlin, Sameer D'Costa, Anthony Gitter+, Philip A Romero+.
bioRxiv, 2024. doi:10.1101/2024.03.15.585128
+ denotes equal contribution.

Installation

Clone this repository and install the required packages using conda or mamba:

conda env create -f environment.yml
conda activate metl

Installation typically takes approximately 5 minutes.

For GPU support, make sure you have the appropriate CUDA version installed. Add cudatoolkit to the environment.yml file before creating the conda environment.
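After creating the environment, it can help to confirm whether PyTorch actually sees a GPU before starting a long run. The following is a minimal sketch, not part of this repository; the helper name `describe_gpu_support` is hypothetical, and it only uses the standard `torch.cuda` calls.

```python
import importlib.util


def describe_gpu_support() -> str:
    """Return a short message describing whether PyTorch can use a GPU.

    Hypothetical helper for illustration; not part of the METL codebase.
    """
    # Avoid an ImportError if the environment was not created correctly.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"

    import torch

    if torch.cuda.is_available():
        # Report the name of the first visible CUDA device.
        return f"CUDA is available: {torch.cuda.get_device_name(0)}"
    return "PyTorch is installed, but no CUDA device is visible"


print(describe_gpu_support())
```

If the message reports that no CUDA device is visible, training will fall back to CPU, which is sufficient for the short demos described below.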

Pretraining on Rosetta data

Rosetta pretraining data is stored in the rosetta_data directory. This repository contains a sample Rosetta dataset for avGFP with 10,000 variants, which can be used to pretrain a toy avGFP METL-Local model. For more information on how to acquire or create a Rosetta dataset, see the README in the rosetta_data directory.

Once you've downloaded or created a Rosetta pretraining dataset, you can pretrain a METL model using train_source_model.py. The notebook pretraining.ipynb shows a complete example of how to pretrain a METL model using the sample avGFP dataset.

You can run the pretraining script on the sample dataset using the following command:

python code/train_source_model.py @args/pretrain_avgfp_local.txt

Training on the full sample dataset can take a while, so for demonstration purposes you may want to limit the number of epochs and the amount of data:

python code/train_source_model.py @args/pretrain_avgfp_local.txt --max_epochs 5 --limit_train_batches 5 --limit_val_batches 5 --limit_test_batches 5

The test metrics are expected to show poor performance after such a short training run. For instance, pearson_total_score may be around 0.24.

Running the limited pretraining demo takes approximately 5 minutes on CPU.

See the help message for an explanation of all the arguments:

python code/train_source_model.py --help
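The `@args/pretrain_avgfp_local.txt` syntax in the commands above resembles Python's `argparse` argument files, where `@path` is expanded into the arguments listed in that file. The sketch below illustrates that mechanism under the assumption that the training scripts rely on `fromfile_prefix_chars="@"`; the parser, defaults, and file contents here are made up for illustration and are not copied from this repository.

```python
import argparse
import tempfile
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # fromfile_prefix_chars="@" tells argparse to replace "@path" with the
    # arguments stored in that file (one argument per line by default).
    parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
    # Flag names are taken from the README; types and defaults are assumptions.
    parser.add_argument("--max_epochs", type=int, default=100)
    parser.add_argument("--limit_train_batches", type=int, default=None)
    return parser


def parse_with_args_file(file_lines, extra_args):
    """Write file_lines to a temporary argument file and parse it with extras."""
    with tempfile.TemporaryDirectory() as tmp:
        args_file = Path(tmp) / "demo_args.txt"
        args_file.write_text("\n".join(file_lines) + "\n")
        # Arguments given after the file come later, so they take precedence.
        return build_parser().parse_args([f"@{args_file}", *extra_args])


args = parse_with_args_file(
    file_lines=["--max_epochs", "500"],
    extra_args=["--max_epochs", "5", "--limit_train_batches", "5"],
)
print(args.max_epochs, args.limit_train_batches)
```

With standard `argparse` behavior, the last occurrence of a repeated option wins, which is why flags placed after the argument file can shorten a run without editing the file itself.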

Finetuning on experimental data

Experimental data is stored in the dms_data directory. For demonstration purposes, this repository contains the avGFP experimental dataset from Sarkisyan et al. (2016). See the metl-pub repository to access the other experimental datasets we used in our manuscript. See the README in the dms_data directory for information about how to use your own experimental dataset.

In addition to experimental data, you will need a pretrained METL model to finetune. You can pretrain METL models yourself using this repository, or you can use our pretrained METL models from the metl-pretrained repository.

Once you have a pretrained METL model and an experimental dataset, you can finetune the model using train_target_model.py. The notebook finetuning.ipynb shows a complete example of how to finetune a METL model using the sample avGFP dataset. For demonstration purposes, it uses the command:

python code/train_target_model.py @args/finetune_avgfp_local.txt --enable_progress_bar false --enable_simple_progress_messages --max_epochs 50 --unfreeze_backbone_at_epoch 25

After the short demonstration pretraining and finetuning runs, the test set Spearman correlation is expected to be around 0.6.

Running the finetuning demo takes approximately 7 minutes on CPU.
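For readers unfamiliar with the metrics quoted in this README, the sketch below shows how Pearson and Spearman correlations relate: Spearman correlation is the Pearson correlation of the ranks. It is a pure-Python illustration with made-up scores, not the evaluation code used by the training scripts.

```python
from statistics import mean


def average_ranks(values):
    """Rank values starting at 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation: covariance normalized by both spreads."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up experimental scores and model predictions for five variants.
measured = [0.1, 0.4, 0.35, 0.8, 0.9]
predicted = [0.2, 0.3, 0.5, 0.7, 1.5]
print(round(spearman(measured, predicted), 3))  # prints 0.9
```

Because Spearman correlation depends only on rank order, it rewards a model for ordering variants correctly even when the predicted scores are on a different scale from the measurements.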

See the help message for an explanation of all the arguments:

python code/train_target_model.py --help