This package provides inference code for LigandMPNN & ProteinMPNN models. The code and model parameters are available under the MIT license.
git clone https://github.com/dauparas/LigandMPNN.git
cd LigandMPNN
bash get_model_params.sh "./model_params"
#setup your conda/or other environment
#conda create -n ligandmpnn_env python=3.11
#pip3 install torch
#pip install prody
python run.py \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/default"
To run the model you will need to have Python>=3.0, PyTorch, Numpy installed, and to read/write PDB files you will need Prody.
For example to make a new conda environment for LigandMPNN run:
conda create -n ligandmpnn_env python=3.11
pip3 install torch
pip install prody
Main differences compared with ProteinMPNN code
- Input PDBs are parsed using Prody preserving protein residue indices, chain letters, and insertion codes. If there are missing residues in the input structure the output fasta file won't have added
X
to fill the gaps. The script outputs .fasta and .pdb files. It's recommended to use .pdb files since they will hold information about chain letters and residue indices. - Adding bias, fixing residues, and selecting residues to be redesigned now can be done using residue indices directly, e.g. A23 (means chain A residue with index 23), B42D (chain B, residue 42, insertion code D).
- Model writes to fasta files:
overall_confidence
,ligand_confidence
which reflect the average confidence/probability (with T=1.0) over the redesigned residuesoverall_confidence=exp[-mean_over_residues(log_probs)]
. Higher numbers mean the model is more confident about that sequence. min_value=0.0; max_value=1.0. Sequence recovery with respect to the input sequence is calculated only over the redesigned residues.
To download model parameters run:
bash get_model_params.sh "./model_params"
To run the model of your choice specify --model_type
and optionally the model checkpoint path. Available models:
- ProteinMPNN
--model_type "protein_mpnn"
--checkpoint_protein_mpnn "./model_params/proteinmpnn_v_48_002.pt" #noised with 0.02A Gaussian noise
--checkpoint_protein_mpnn "./model_params/proteinmpnn_v_48_010.pt" #noised with 0.10A Gaussian noise
--checkpoint_protein_mpnn "./model_params/proteinmpnn_v_48_020.pt" #noised with 0.20A Gaussian noise
--checkpoint_protein_mpnn "./model_params/proteinmpnn_v_48_030.pt" #noised with 0.30A Gaussian noise
- LigandMPNN
--model_type "ligand_mpnn"
--checkpoint_ligand_mpnn "./model_params/ligandmpnn_v_32_005_25.pt" #noised with 0.05A Gaussian noise
--checkpoint_ligand_mpnn "./model_params/ligandmpnn_v_32_010_25.pt" #noised with 0.10A Gaussian noise
--checkpoint_ligand_mpnn "./model_params/ligandmpnn_v_32_020_25.pt" #noised with 0.20A Gaussian noise
--checkpoint_ligand_mpnn "./model_params/ligandmpnn_v_32_030_25.pt" #noised with 0.30A Gaussian noise
- SolubleMPNN
--model_type "soluble_mpnn"
--checkpoint_soluble_mpnn "./model_params/solublempnn_v_48_002.pt" #noised with 0.02A Gaussian noise
--checkpoint_soluble_mpnn "./model_params/solublempnn_v_48_010.pt" #noised with 0.10A Gaussian noise
--checkpoint_soluble_mpnn "./model_params/solublempnn_v_48_020.pt" #noised with 0.20A Gaussian noise
--checkpoint_soluble_mpnn "./model_params/solublempnn_v_48_030.pt" #noised with 0.30A Gaussian noise
- ProteinMPNN with global membrane label
--model_type "global_label_membrane_mpnn"
--checkpoint_global_label_membrane_mpnn "./model_params/global_label_membrane_mpnn_v_48_020.pt" #noised with 0.20A Gaussian noise
- ProteinMPNN with per residue membrane label
--model_type "per_residue_label_membrane_mpnn"
--checkpoint_per_residue_label_membrane_mpnn "./model_params/per_residue_label_membrane_mpnn_v_48_020.pt" #noised with 0.20A Gaussian noise
Default settings will run ProteinMPNN.
python run.py \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/default"
--temperature 0.05
Change sampling temperature (higher temperature gives more sequence diversity).
python run.py \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--temperature 0.05 \
--out_folder "./outputs/temperature"
--seed
Not selecting a seed will run with a random seed. Running this multiple times will give different results.
python run.py \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/random_seed"
--verbose 0
Do not print any statements.
python run.py \
--seed 111 \
--verbose 0 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/verbose"
--save_stats 1
Save sequence design statistics.
#['generated_sequences', 'sampling_probs', 'log_probs', 'decoding_order', 'native_sequence', 'mask', 'chain_mask', 'seed', 'temperature']
python run.py \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/save_stats" \
--save_stats 1
--fixed_residues
Fixing specific amino acids. This example fixes the first 10 residues in chain C and adds global bias towards A (alanine). The output should have all alanines except the first 10 residues should be the same as in the input sequence since those are fixed.
python run.py \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/fix_residues" \
--fixed_residues "C1 C2 C3 C4 C5 C6 C7 C8 C9 C10" \
--bias_AA "A:10.0"
--redesigned_residues
Specifying which residues need to be designed. This example redesigns the first 10 residues while fixing everything else.
python run.py \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/redesign_residues" \
--redesigned_residues "C1 C2 C3 C4 C5 C6 C7 C8 C9 C10" \
--bias_AA "A:10.0"
Design 15 sequences; with batch size 3 (can be 1 when using CPUs) and the number of batches 5.
python run.py \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/batch_size" \
--batch_size 3 \
--number_of_batches 5
Global amino acid bias. In this example, output sequences are biased towards W, P, C and away from A.
python run.py \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--bias_AA "W:3.0,P:3.0,C:3.0,A:-3.0" \
--out_folder "./outputs/global_bias"
Specify per residue amino acid bias, e.g. make residues C1, C3, C5, and C7 to be prolines.
# {
# "C1": {"G": -0.3, "C": -2.0, "P": 10.8},
# "C3": {"P": 10.0},
# "C5": {"G": -1.3, "P": 10.0},
# "C7": {"G": -1.3, "P": 10.0}
# }
python run.py \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--bias_AA_per_residue "./inputs/bias_AA_per_residue.json" \
--out_folder "./outputs/per_residue_bias"
Global amino acid restrictions. This is equivalent to using --bias_AA
and setting bias to be a large negative number. The output should be just made of E, K, A.
python run.py \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--omit_AA "CDFGHILMNPQRSTVWY" \
--out_folder "./outputs/global_omit"
Per residue amino acid restrictions.
# {
# "C1": "ACDEFGHIKLMNPQRSTVW",
# "C3": "ACDEFGHIKLMNPQRSTVW",
# "C5": "ACDEFGHIKLMNPQRSTVW",
# "C7": "ACDEFGHIKLMNPQRSTVW"
# }
python run.py \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--omit_AA_per_residue "./inputs/omit_AA_per_residue.json" \
--out_folder "./outputs/per_residue_omit"
Designing sequences with symmetry, e.g. homooligomer/2-state proteins, etc. In this example make C1=C2=C3, also C4=C5, and C6=C7.
#total_logits += symmetry_weights[t]*logits
#probs = torch.nn.functional.softmax((total_logits+bias_t) / temperature, dim=-1)
#total_logits_123 = 0.33*logits_1+0.33*logits_2+0.33*logits_3
#output should be ***ooxx
python run.py \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/symmetry" \
--symmetry_residues "C1,C2,C3|C4,C5|C6,C7" \
--symmetry_weights "0.33,0.33,0.33|0.5,0.5|0.5,0.5"
Design homooligomer sequences. This automatically sets --symmetry_residues
and --symmetry_weights
assuming equal weighting from all chains.
python run.py \
--model_type "ligand_mpnn" \
--seed 111 \
--pdb_path "./inputs/4GYT.pdb" \
--out_folder "./outputs/homooligomer" \
--homo_oligomer 1 \
--number_of_batches 2
Outputs will have a specified ending; e.g. 1BC8_xyz.fa
instead of 1BC8.fa
python run.py \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/file_ending" \
--file_ending "_xyz"
Zero indexed names in /backbones/1BC8_0.pdb, 1BC8_1.pdb, 1BC8_2.pdb etc
python run.py \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/zero_indexed" \
--zero_indexed 1 \
--number_of_batches 2
Specify which chains (e.g. "ABC") need to be redesigned, other chains will be kept fixed. Outputs in seqs/backbones will still have atoms/sequences for the whole input PDB.
python run.py \
--model_type "ligand_mpnn" \
--seed 111 \
--pdb_path "./inputs/4GYT.pdb" \
--out_folder "./outputs/chains_to_design" \
--chains_to_design "B"
Parse and design only specified chains (e.g. "ABC"). Outputs will have only specified chains.
python run.py \
--model_type "ligand_mpnn" \
--seed 111 \
--pdb_path "./inputs/4GYT.pdb" \
--out_folder "./outputs/parse_these_chains_only" \
--parse_these_chains_only "B"
Run LigandMPNN with default settings.
python run.py \
--model_type "ligand_mpnn" \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/ligandmpnn_default"
Run LigandMPNN using 0.05A model by specifying --checkpoint_ligand_mpnn
flag.
python run.py \
--checkpoint_ligand_mpnn "./model_params/ligandmpnn_v_32_005_25.pt" \
--model_type "ligand_mpnn" \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/ligandmpnn_v_32_005_25"
Setting --ligand_mpnn_use_atom_context 0
will mask all ligand atoms. This can be used to assess how much ligand atoms affect AA probabilities.
python run.py \
--model_type "ligand_mpnn" \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/ligandmpnn_no_context" \
--ligand_mpnn_use_atom_context 0
Use fixed residue side chain atoms as extra ligand atoms.
python run.py \
--model_type "ligand_mpnn" \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/ligandmpnn_use_side_chain_atoms" \
--ligand_mpnn_use_side_chain_context 1 \
--fixed_residues "C1 C2 C3 C4 C5 C6 C7 C8 C9 C10"
Run SolubleMPNN (ProteinMPNN-like model with only soluble proteins in the training dataset).
python run.py \
--model_type "soluble_mpnn" \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/soluble_mpnn_default"
Run global label membrane MPNN (trained with extra input - binary label soluble vs not) --global_transmembrane_label #1 - membrane, 0 - soluble
.
python run.py \
--model_type "global_label_membrane_mpnn" \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/global_label_membrane_mpnn_0" \
--global_transmembrane_label 0
Run per residue label membrane MPNN (trained with extra input per residue specifying buried (hydrophobic), interface (polar), or other type residues; 3 classes).
python run.py \
--model_type "per_residue_label_membrane_mpnn" \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/per_residue_label_membrane_mpnn_default" \
--transmembrane_buried "C1 C2 C3 C11" \
--transmembrane_interface "C4 C5 C6 C22"
Choose a symbol to put between different chains in fasta output format. It's recommended to PDB output format to deal with residue jumps and multiple chain parsing.
python run.py \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/fasta_seq_separation" \
--fasta_seq_separation ":"
Specify multiple PDB input paths. This is more efficient since the model needs to be loaded from the checkpoint once.
#{
#"./inputs/1BC8.pdb": "",
#"./inputs/4GYT.pdb": ""
#}
python run.py \
--pdb_path_multi "./inputs/pdb_ids.json" \
--out_folder "./outputs/pdb_path_multi" \
--seed 111
Specify fixed residues when using --pdb_path_multi
flag.
#{
#"./inputs/1BC8.pdb": "C1 C2 C3 C4 C5 C10 C22",
#"./inputs/4GYT.pdb": "A7 A8 A9 A10 A11 A12 A13 B38"
#}
python run.py \
--pdb_path_multi "./inputs/pdb_ids.json" \
--fixed_residues_multi "./inputs/fix_residues_multi.json" \
--out_folder "./outputs/fixed_residues_multi" \
--seed 111
Specify which residues need to be redesigned when using --pdb_path_multi
flag.
#{
#"./inputs/1BC8.pdb": "C1 C2 C3 C4 C5 C10",
#"./inputs/4GYT.pdb": "A7 A8 A9 A10 A12 A13 B38"
#}
python run.py \
--pdb_path_multi "./inputs/pdb_ids.json" \
--redesigned_residues_multi "./inputs/redesigned_residues_multi.json" \
--out_folder "./outputs/redesigned_residues_multi" \
--seed 111
Specify which residues need to be omitted when using --pdb_path_multi
flag.
#{
#"./inputs/1BC8.pdb": {"C1":"ACDEFGHILMNPQRSTVWY", "C2":"ACDEFGHILMNPQRSTVWY", "C3":"ACDEFGHILMNPQRSTVWY"},
#"./inputs/4GYT.pdb": {"A7":"ACDEFGHILMNPQRSTVWY", "A8":"ACDEFGHILMNPQRSTVWY"}
#}
python run.py \
--pdb_path_multi "./inputs/pdb_ids.json" \
--omit_AA_per_residue_multi "./inputs/omit_AA_per_residue_multi.json" \
--out_folder "./outputs/omit_AA_per_residue_multi" \
--seed 111
Specify amino acid biases per residue when using --pdb_path_multi
flag.
#{
#"./inputs/1BC8.pdb": {"C1":{"A":3.0, "P":-2.0}, "C2":{"W":10.0, "G":-0.43}},
#"./inputs/4GYT.pdb": {"A7":{"Y":5.0, "S":-2.0}, "A8":{"M":3.9, "G":-0.43}}
#}
python run.py \
--pdb_path_multi "./inputs/pdb_ids.json" \
--bias_AA_per_residue_multi "./inputs/bias_AA_per_residue_multi.json" \
--out_folder "./outputs/bias_AA_per_residue_multi" \
--seed 111
This sets the cutoff distance in angstroms to select residues that are considered to be close to ligand atoms. This flag only affects the num_ligand_res
and ligand_confidence
in the output fasta files.
python run.py \
--model_type "ligand_mpnn" \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--ligand_mpnn_cutoff_for_score "6.0" \
--out_folder "./outputs/ligand_mpnn_cutoff_for_score"
You can specify residue using chain_id + residue_number + insersion_code; e.g. redesign only residue B82, B82A, B82B, B82C.
python run.py \
--seed 111 \
--pdb_path "./inputs/2GFB.pdb" \
--out_folder "./outputs/insertion_code" \
--redesigned_residues "B82 B82A B82B B82C" \
--parse_these_chains_only "B"
Parse atoms in the PDB files with zero occupancy too.
python run.py \
--model_type "ligand_mpnn" \
--seed 111 \
--pdb_path "./inputs/1BC8.pdb" \
--out_folder "./outputs/parse_atoms_with_zero_occupancy" \
--parse_atoms_with_zero_occupancy 1
out_dict = {}
out_dict["logits"] - raw logits from the model
out_dict["probs"] - softmax(logits)
out_dict["log_probs"] - log_softmax(logits)
out_dict["decoding_order"] - decoding order used (logits will depend on the decoding order)
out_dict["native_sequence"] - parsed input sequence in integers
out_dict["mask"] - mask for missing residues (usually all ones)
out_dict["chain_mask"] - controls which residues are decoded first
out_dict["alphabet"] - amino acid alphabet used
out_dict["residue_names"] - dictionary to map integers to residue_names, e.g. {0: "C10", 1: "C11"}
out_dict["sequence"] - parsed input sequence in alphabet
out_dict["mean_of_probs"] - averaged over batch_size*number_of_batches probabilities, [protein_length, 21]
out_dict["std_of_probs"] - same as above, but std
Get probabilities/scores for backbone-sequence pairs using autoregressive probabilities: p(AA_1|backbone), p(AA_2|backbone, AA_1) etc. These probabilities will depend on the decoding order, so it's recomended to set number_of_batches to at least 10.
python score.py \
--model_type "ligand_mpnn" \
--seed 111 \
--autoregressive_score 1\
--pdb_path "./outputs/ligandmpnn_default/backbones/1BC8_1.pdb" \
--out_folder "./outputs/autoregressive_score_w_seq" \
--use_sequence 1\
--batch_size 1 \
--number_of_batches 10
Get probabilities/scores for backbone using probabilities: p(AA_1|backbone), p(AA_2|backbone) etc. These probabilities will depend on the decoding order, so it's recomended to set number_of_batches to at least 10.
python score.py \
--model_type "ligand_mpnn" \
--seed 111 \
--autoregressive_score 1\
--pdb_path "./outputs/ligandmpnn_default/backbones/1BC8_1.pdb" \
--out_folder "./outputs/autoregressive_score_wo_seq" \
--use_sequence 0\
--batch_size 1 \
--number_of_batches 10
Get probabilities/scores for backbone-sequence pairs using single aa probabilities: p(AA_1|backbone, AA_{all except AA_1}), p(AA_2|backbone, AA_{all except AA_2}) etc. These probabilities will depend on the decoding order, so it's recomended to set number_of_batches to at least 10.
python score.py \
--model_type "ligand_mpnn" \
--seed 111 \
--single_aa_score 1\
--pdb_path "./outputs/ligandmpnn_default/backbones/1BC8_1.pdb" \
--out_folder "./outputs/single_aa_score_w_seq" \
--use_sequence 1\
--batch_size 1 \
--number_of_batches 10
Get probabilities/scores for backbone-sequence pairs using single aa probabilities: p(AA_1|backbone), p(AA_2|backbone) etc. These probabilities will depend on the decoding order, so it's recomended to set number_of_batches to at least 10.
python score.py \
--model_type "ligand_mpnn" \
--seed 111 \
--single_aa_score 1\
--pdb_path "./outputs/ligandmpnn_default/backbones/1BC8_1.pdb" \
--out_folder "./outputs/single_aa_score_wo_seq" \
--use_sequence 0\
--batch_size 1 \
--number_of_batches 10
- Support for ProteinMPNN CA-only model.
- Examples for scoring sequences only.
- Side-chain packing scripts.
- TER
If you use the code, please cite:
@article{dauparas2023atomic,
title={Atomic context-conditioned protein sequence design using LigandMPNN},
author={Dauparas, Justas and Lee, Gyu Rie and Pecoraro, Robert and An, Linna and Anishchenko, Ivan and Glasscock, Cameron and Baker, David},
journal={Biorxiv},
pages={2023--12},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}
@article{dauparas2022robust,
title={Robust deep learning--based protein sequence design using ProteinMPNN},
author={Dauparas, Justas and Anishchenko, Ivan and Bennett, Nathaniel and Bai, Hua and Ragotte, Robert J and Milles, Lukas F and Wicky, Basile IM and Courbet, Alexis and de Haas, Rob J and Bethel, Neville and others},
journal={Science},
volume={378},
number={6615},
pages={49--56},
year={2022},
publisher={American Association for the Advancement of Science}
}