This repository contains pretrained METL models with minimal dependencies. For more information, please see the metl repository and our manuscript:
Biophysics-based protein language models for protein engineering.
Sam Gelman, Bryce Johnson, Chase Freschlin, Sameer D'Costa, Anthony Gitter+, Philip A Romero+.
bioRxiv, 2024. doi:10.1101/2024.03.15.585128
+ denotes equal contribution.
- Create a conda environment (or use an existing one):
conda create --name myenv python=3.9
- Activate the conda environment:
conda activate myenv
- Clone this repository
- Navigate to the cloned repository
cd metl-pretrained
- Install the package with
pip install .
- Import the package in your script with
import metl
- Load a pretrained model using
model, data_encoder = metl.get_from_uuid(uuid)
or one of the other loading functions (see the examples below).
  - model is a PyTorch model loaded with the pretrained weights.
  - data_encoder is a helper object that encodes sequences and variants so they can be fed into the model.
Model checkpoints are available to download from Zenodo.
Once you have a checkpoint downloaded, you can load it into a PyTorch model using metl.get_from_checkpoint().
Alternatively, you can use metl.get_from_uuid() or metl.get_from_ident() to automatically download, cache, and load the model based on the model identifier or UUID.
See the examples below.
Source models predict Rosetta energy terms.
Identifier | UUID | Params | RPE | Output | Description | Download |
---|---|---|---|---|---|---|
METL-G-20M-1D | D72M9aEp | 20M | 1D | Rosetta energies | METL-G | Download |
METL-G-20M-3D | Nr9zCKpR | 20M | 3D | Rosetta energies | METL-G | Download |
METL-G-50M-1D | auKdzzwX | 50M | 1D | Rosetta energies | METL-G | Download |
METL-G-50M-3D | 6PSAzdfv | 50M | 3D | Rosetta energies | METL-G | Download |
Identifier | UUID | Protein | Params | RPE | Output | Description | Download |
---|---|---|---|---|---|---|---|
METL-L-2M-1D-GFP | 8gMPQJy4 | avGFP | 2M | 1D | Rosetta energies | METL-L | Download |
METL-L-2M-3D-GFP | Hr4GNHws | avGFP | 2M | 3D | Rosetta energies | METL-L | Download |
METL-L-2M-1D-DLG4_2022 | 8iFoiYw2 | DLG4 | 2M | 1D | Rosetta energies | METL-L | Download |
METL-L-2M-3D-DLG4_2022 | kt5DdWTa | DLG4 | 2M | 3D | Rosetta energies | METL-L | Download |
METL-L-2M-1D-GB1 | DMfkjVzT | GB1 | 2M | 1D | Rosetta energies | METL-L | Download |
METL-L-2M-3D-GB1 | epegcFiH | GB1 | 2M | 3D | Rosetta energies | METL-L | Download |
METL-L-2M-1D-GRB2 | kS3rUS7h | GRB2 | 2M | 1D | Rosetta energies | METL-L | Download |
METL-L-2M-3D-GRB2 | X7w83g6S | GRB2 | 2M | 3D | Rosetta energies | METL-L | Download |
METL-L-2M-1D-Pab1 | UKebCQGz | Pab1 | 2M | 1D | Rosetta energies | METL-L | Download |
METL-L-2M-3D-Pab1 | 2rr8V4th | Pab1 | 2M | 3D | Rosetta energies | METL-L | Download |
METL-L-2M-1D-TEM-1 | PREhfC22 | TEM-1 | 2M | 1D | Rosetta energies | METL-L | Download |
METL-L-2M-3D-TEM-1 | 9ASvszux | TEM-1 | 2M | 3D | Rosetta energies | METL-L | Download |
METL-L-2M-1D-Ube4b | HscFFkAb | Ube4b | 2M | 1D | Rosetta energies | METL-L | Download |
METL-L-2M-3D-Ube4b | H48oiNZN | Ube4b | 2M | 3D | Rosetta energies | METL-L | Download |
These models output a length-55 vector corresponding to the following energy terms (in order):
total_score
fa_atr
fa_dun
fa_elec
fa_intra_rep
fa_intra_sol_xover4
fa_rep
fa_sol
hbond_bb_sc
hbond_lr_bb
hbond_sc
hbond_sr_bb
lk_ball_wtd
omega
p_aa_pp
pro_close
rama_prepro
ref
yhh_planarity
buried_all
buried_np
contact_all
contact_buried_core
contact_buried_core_boundary
degree
degree_core
degree_core_boundary
exposed_hydrophobics
exposed_np_AFIMLWVY
exposed_polars
exposed_total
one_core_each
pack
res_count_buried_core
res_count_buried_core_boundary
res_count_buried_np_core
res_count_buried_np_core_boundary
ss_contributes_core
ss_mis
total_hydrophobic
total_hydrophobic_AFILMVWY
total_sasa
two_core_each
unsat_hbond
centroid_total_score
cbeta
cenpack
env
hs_pair
pair
rg
rsigma
sheet
ss_pair
vdw
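Because the output is positional, it can help to attach the term names to a prediction. The following is a minimal sketch that pairs a length-55 vector with the names listed above. ENERGY_TERMS and label_energies are illustrative names, not part of the metl package, and the sketch assumes one row of model output has already been converted to a plain list of floats (for example with .tolist() on a PyTorch tensor).

```python
# Term names copied from the list above, in output order.
ENERGY_TERMS = """
total_score fa_atr fa_dun fa_elec fa_intra_rep fa_intra_sol_xover4 fa_rep
fa_sol hbond_bb_sc hbond_lr_bb hbond_sc hbond_sr_bb lk_ball_wtd omega
p_aa_pp pro_close rama_prepro ref yhh_planarity buried_all buried_np
contact_all contact_buried_core contact_buried_core_boundary degree
degree_core degree_core_boundary exposed_hydrophobics exposed_np_AFIMLWVY
exposed_polars exposed_total one_core_each pack res_count_buried_core
res_count_buried_core_boundary res_count_buried_np_core
res_count_buried_np_core_boundary ss_contributes_core ss_mis
total_hydrophobic total_hydrophobic_AFILMVWY total_sasa two_core_each
unsat_hbond centroid_total_score cbeta cenpack env hs_pair pair rg rsigma
sheet ss_pair vdw
""".split()


def label_energies(values):
    """Pair one prediction row with the energy term names (hypothetical helper)."""
    values = list(values)
    if len(values) != len(ENERGY_TERMS):
        raise ValueError(
            f"expected {len(ENERGY_TERMS)} values, got {len(values)}"
        )
    return dict(zip(ENERGY_TERMS, values))


# Stand-in for one row of model output; real values come from the model.
example_row = [float(i) for i in range(55)]
labeled = label_energies(example_row)
print(labeled["total_score"], labeled["vdw"])
```

The length check guards against passing the output of a model with a different number of outputs, such as the binding model described below.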
The GB1 experimental data measured the binding interaction between GB1 variants and Immunoglobulin G (IgG). To match this experimentally characterized function, we implemented a Rosetta pipeline to model the GB1-IgG complex and compute 17 attributes related to energy changes upon binding. We pretrained a standard METL-Local model and a modified METL-Bind model, which additionally incorporates the IgG binding attributes into its pretraining tasks.
Identifier | UUID | Protein | Params | RPE | Output | Description | Download |
---|---|---|---|---|---|---|---|
METL-BIND-2M-3D-GB1-STANDARD | K6mw24Rg | GB1 | 2M | 3D | Standard Rosetta energies | Trained for the function-specific synthetic data experiment, but only trained on the standard energy terms, to use as a baseline. Should perform similarly to METL-L-2M-3D-GB1. | Download |
METL-BIND-2M-3D-GB1-BINDING | Bo5wn2SG | GB1 | 2M | 3D | Standard + binding Rosetta energies | Trained on both the standard energy terms and the binding-specific energy terms. | Download |
METL-BIND-2M-3D-GB1-BINDING predicts the standard energy terms listed above as well as the following binding energy terms (in order):
complex_normalized
dG_cross
dG_cross/dSASAx100
dG_separated
dG_separated/dSASAx100
dSASA_hphobic
dSASA_int
dSASA_polar
delta_unsatHbonds
hbond_E_fraction
hbonds_int
nres_int
per_residue_energy_int
side1_normalized
side1_score
side2_normalized
side2_score
Target models are fine-tuned source models that predict functional scores from experimental sequence-function data.
DMS Dataset | Identifier | UUID | RPE | Output | Description | Download |
---|---|---|---|---|---|---|
avGFP | None | YoQkzoLD | 1D | Functional score | The METL-L-2M-1D-GFP model, fine-tuned on 64 examples from the avGFP DMS dataset. This model was used for the GFP design experiment described in the manuscript. | Download |
avGFP | None | PEkeRuxb | 3D | Functional score | The METL-L-2M-3D-GFP model, fine-tuned on 64 examples from the avGFP DMS dataset. This model was used for the GFP design experiment described in the manuscript. | Download |
METL uses relative position embeddings (RPEs) based on 3D protein structure. The implementation of relative position embeddings is similar to the original paper by Shaw et al. However, instead of using the default 1D sequence-based distances, we calculate relative distances based on a graph of the 3D protein structure. These 3D RPEs enable the transformer to use 3D distances between amino acid residues as the positional signal when calculating attention. When using 3D RPEs, the model requires a protein structure in the form of a PDB file, corresponding to the wild-type protein or base protein of the input variant sequence.
Our testing showed that 3D RPEs improve performance for METL-Global models but do not make a difference for METL-Local models. We provide both 1D and 3D models in this repository. The 1D models do not require the PDB structure as an additional input.
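To make the difference between 1D and 3D relative positions concrete, here is a toy sketch. It is not the METL implementation. It assumes that the structure graph connects residues whose representative atoms lie within a distance threshold, that the relative distance is the number of hops on the shortest path through that graph, and that distances are clipped to a maximum bucket. The function names, the threshold, and the coordinates are all hypothetical.

```python
from collections import deque
from math import dist


def sequence_offsets(n_residues, clip):
    """1D case: signed sequence offset j - i, clipped to [-clip, clip]."""
    return [[max(-clip, min(clip, j - i)) for j in range(n_residues)]
            for i in range(n_residues)]


def structure_hops(coords, threshold, clip):
    """3D case (toy version): shortest-path hops in a contact graph, clipped."""
    n = len(coords)
    # Connect residues that are within `threshold` of each other.
    neighbors = [[j for j in range(n)
                  if j != i and dist(coords[i], coords[j]) <= threshold]
                 for i in range(n)]
    table = []
    for start in range(n):
        # Breadth-first search gives the hop count to every reachable residue.
        hops = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in hops:
                    hops[nxt] = hops[node] + 1
                    queue.append(nxt)
        # Unreachable residues share the farthest bucket.
        table.append([min(clip, hops.get(j, clip)) for j in range(n)])
    return table


# Five made-up residue positions forming a hairpin: the two ends of the
# chain are far apart in sequence but close in space.
coords = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (8, 4, 0), (0, 4, 0)]
one_d = sequence_offsets(len(coords), clip=3)
three_d = structure_hops(coords, threshold=5.0, clip=3)
print(one_d[0][4], three_d[0][4])
```

In this toy hairpin, residues 0 and 4 land in the farthest 1D bucket but are direct neighbors in the structure graph, which is the kind of signal a structure-based embedding can expose to attention.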
METL source models are assigned identifiers that can be used to load the model with metl.get_from_ident().
This example:
- Automatically downloads and caches METL-G-20M-1D using metl.get_from_ident("metl-g-20m-1d").
- Encodes a pair of dummy amino acid sequences using data_encoder.encode_sequences().
- Runs the sequences through the model and prints the predicted Rosetta energies.
Todo: show how to extract the METL representation at different layers of the network
import metl
import torch
model, data_encoder = metl.get_from_ident("metl-g-20m-1d")
# these are amino acid sequences
# make sure all the sequences are the same length
dummy_sequences = ["SMART", "MAGIC"]
encoded_seqs = data_encoder.encode_sequences(dummy_sequences)
# set model to eval mode
model.eval()
# no need to compute gradients for inference
with torch.no_grad():
    predictions = model(torch.tensor(encoded_seqs))
print(predictions)
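The example above notes that all sequences must be the same length. A small pre-check can catch this before encoding. The sketch below is an illustration only: check_sequences is a hypothetical helper, and the restriction to the 20 canonical amino acids is an assumption, since the symbols accepted by data_encoder.encode_sequences() are not listed here.

```python
# Assumption: only the 20 canonical amino acids are allowed.
CANONICAL_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def check_sequences(sequences):
    """Return the shared sequence length, or raise ValueError (hypothetical helper)."""
    if not sequences:
        raise ValueError("no sequences given")
    lengths = {len(seq) for seq in sequences}
    if len(lengths) != 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    for seq in sequences:
        unknown = set(seq) - CANONICAL_AMINO_ACIDS
        if unknown:
            raise ValueError(f"unexpected characters in {seq!r}: {sorted(unknown)}")
    return lengths.pop()


shared_length = check_sequences(["SMART", "MAGIC"])
print(shared_length)
```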
If you are using a model with 3D relative position embeddings, you will need to provide the PDB structure of the wild-type or base protein.
predictions = model(torch.tensor(encoded_seqs), pdb_fn="../path/to/file.pdb")
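When a script switches between 1D and 3D models, the choice of whether to pass a structure can be derived from the identifier. The sketch below assumes the naming pattern seen in the tables above, where the fourth dash-separated field is 1D or 3D. The helpers uses_3d_rpe and forward_kwargs are hypothetical; only the pdb_fn keyword comes from the example above. Target models listed with an identifier of None are not covered, so use their RPE column instead.

```python
def uses_3d_rpe(ident):
    """Read the RPE field from an identifier such as METL-L-2M-3D-GB1."""
    parts = ident.upper().split("-")
    if len(parts) < 4 or parts[3] not in {"1D", "3D"}:
        raise ValueError(f"cannot find an RPE field in {ident!r}")
    return parts[3] == "3D"


def forward_kwargs(ident, pdb_fn=None):
    """Build the extra keyword arguments for the model call (hypothetical helper)."""
    if uses_3d_rpe(ident):
        if pdb_fn is None:
            raise ValueError(f"{ident} uses 3D RPEs and needs a PDB file")
        return {"pdb_fn": pdb_fn}
    # 1D models do not take a structure.
    return {}


kwargs_1d = forward_kwargs("metl-g-20m-1d")
kwargs_3d = forward_kwargs("METL-L-2M-3D-TEM-1", pdb_fn="../path/to/file.pdb")
print(kwargs_1d, kwargs_3d)
```

The result could then be unpacked into the call, for example model(torch.tensor(encoded_seqs), **kwargs_3d).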
METL target models can be loaded using the model's UUID and metl.get_from_uuid().
This example:
- Automatically downloads and caches YoQkzoLD using metl.get_from_uuid(uuid="YoQkzoLD").
- Encodes several variants specified in variant notation. A wild-type sequence is needed to encode variants.
- Runs the sequences through the model and prints the predicted DMS scores.
import metl
import torch
model, data_encoder = metl.get_from_uuid(uuid="YoQkzoLD")
# the GFP wild-type sequence
wt = "SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLSYGVQCFSRYPDHMKQ" \
     "HDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKN" \
     "GIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"
# some example GFP variants to compute the scores for
variants = ["E3K,G102S",
            "T36P,S203T,K207R",
            "V10A,D19G,F25S,E113V"]
encoded_variants = data_encoder.encode_variants(wt, variants)
# set model to eval mode
model.eval()
# no need to compute gradients for inference
with torch.no_grad():
    predictions = model(torch.tensor(encoded_variants))
print(predictions)
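In the variant notation used above, each mutation is written as the wild-type residue, the position, and the new residue, with multiple mutations separated by commas. The sketch below applies such a variant to a wild-type string, which can be handy for checking a variant before encoding it. The helper apply_variant and its offset parameter are hypothetical. The default of 0-based positions is an observation from the example (wt[3] is E for E3K), not a documented property of data_encoder.encode_variants().

```python
def apply_variant(wt, variant, offset=0):
    """Return the mutated sequence for a variant such as "E3K,V10A".

    `offset` is subtracted from each position; 0 means 0-based positions.
    """
    seq = list(wt)
    for mutation in variant.split(","):
        mutation = mutation.strip()
        wt_aa, new_aa = mutation[0], mutation[-1]
        position = int(mutation[1:-1]) - offset
        if not 0 <= position < len(seq):
            raise ValueError(f"{mutation}: position is outside the sequence")
        if wt[position] != wt_aa:
            raise ValueError(
                f"{mutation}: wild type has {wt[position]} at this position"
            )
        seq[position] = new_aa
    return "".join(seq)


# First 26 residues of the GFP wild-type sequence from the example above.
wt_prefix = "SKGEELFTGVVPILVELDGDVNGHKF"
mutated = apply_variant(wt_prefix, "E3K,V10A")
print(mutated)
```

The wild-type check makes an off-by-one numbering mistake fail loudly instead of silently mutating the wrong residue.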