Code to sample sequences with a contextual Masked EnTransformer as described in "Contextual protein and antibody encodings from equivariant graph transformers".
In your virtual environment, pip install as follows:
pip install torch torchvision torchaudio -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
Sampling works well on both CPUs and GPUs, and is just as fast on CPUs: under 2 minutes for 10,000 sequences.
Download and extract trained models from Zenodo.
tar -xvzf model.tar.gz
To design/generate all positions on the protein, run:
MODEL=trained_models/ProtEnT_backup.ckpt
OUTDIR=./sampled_sequences
PDB_DIR=data/proteins
python3 ProteinSequenceSampler.py \
--output_dir ${OUTDIR} \
--model $MODEL \
--from_pdb $PDB_DIR \
--sample_temperatures 0.2,0.5 \
--num_samples 100
The above command samples all sequences at 100% masking (i.e. only coordinate information is used by the model). You may sample at any other masking rate between 0 and 100%, in which case the positions to mask are selected at random. For more options, run:
python3 ProteinSequenceSampler.py --help
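As an illustration of what random masking at a given rate means, here is a minimal sketch in plain Python. It is not the repository's implementation: the function names `random_mask` and `apply_mask`, the `_` mask token, and the uniform choice of positions are all assumptions made for this example.

```python
import random


def random_mask(sequence_length, mask_rate, seed=None):
    """Pick positions to mask uniformly at random (hypothetical helper).

    mask_rate is a fraction between 0 and 1; 1.0 masks every position,
    so only coordinate information would remain.
    """
    if not 0.0 <= mask_rate <= 1.0:
        raise ValueError("mask_rate must be between 0 and 1")
    rng = random.Random(seed)
    num_masked = round(sequence_length * mask_rate)
    return sorted(rng.sample(range(sequence_length), num_masked))


def apply_mask(sequence, masked_positions, mask_token="_"):
    """Replace the residues at masked_positions with mask_token."""
    masked = set(masked_positions)
    return "".join(
        mask_token if i in masked else residue
        for i, residue in enumerate(sequence)
    )


if __name__ == "__main__":
    sequence = "ACDEFGHIKLMNPQRSTVWY"
    positions = random_mask(len(sequence), mask_rate=0.25, seed=0)
    print(positions)
    print(apply_mask(sequence, positions))
```

At a rate of 1.0 every position is masked, which corresponds to the 100% masking case described above.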
To design/generate specific positions on an antibody, run:
MODEL=trained_models/ProtEnT_backup.ckpt
OUTDIR=./sampled_sequences
PDB_DIR=data/proteins
python3 ProteinSequenceSampler.py \
--output_dir ${OUTDIR} \
--model $MODEL \
--from_pdb $PDB_DIR \
--sample_temperatures 0.2,0.5 \
--num_samples 100 \
--antibody \
--mask_ab_indices 10,11,12
# To sample for a specific region
# --mask_ab_region h3
The above command masks and samples only the antibody positions given by --mask_ab_indices (10, 11 and 12 here). To sample a whole region instead, use --mask_ab_region. For more options, run:
python3 AntibodySequenceSampler.py --help
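The commands above pass --sample_temperatures 0.2,0.5. As a rough illustration of what a sampling temperature typically does, the sketch below applies temperature-scaled softmax sampling to per-position logits. This is an assumption about the general technique, not a description of how the sampler scripts work internally; the function names and the 20-letter alphabet ordering are hypothetical.

```python
import math
import random

# Assumed alphabet ordering, for illustration only.
ALPHABET = list("ACDEFGHIKLMNPQRSTVWY")


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_residue(logits, temperature, rng=random):
    """Draw one amino acid from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(ALPHABET, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in ALPHABET]
    for temperature in (0.2, 0.5):
        print(temperature, sample_residue(logits, temperature, rng))
```

Lower temperatures concentrate probability on the highest-scoring residues, so samples are less diverse; higher temperatures flatten the distribution.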
To generate/design the interface residues for the first partner (order determined by partners.json), run:
MODEL=trained_models/ProtPPIEnT_backup.ckpt
OUTDIR=./sampled_ppi_sequences
PDB_DIR=data/ppis
PPI_PARTNERS_DICT=data/ppis/heteromers_partners_example.json
python3 PPIAbAgSequenceSampler.py \
--output_dir ${OUTDIR} \
--model $MODEL \
--from_pdb $PDB_DIR \
--sample_temperatures 0.2,0.5 \
--num_samples 100 \
--partners_json ${PPI_PARTNERS_DICT} \
--partner_name p0
# to design interface residues on second partner use
# --partner_name p1
# to design interface residues on both partners use
# --partner_name both
To generate/design the interface residues on the antibody of an antibody-antigen complex, run:
MODEL=trained_models/ProtAbAgEnT_backup.ckpt
OUTDIR=./sampled_abag_sequences
PDB_DIR=data/abag/
PPI_PARTNERS_DICT=data/abag/1n8z_partners.json
python3 PPIAbAgSequenceSampler.py \
--output_dir ${OUTDIR} \
--model $MODEL \
--from_pdb $PDB_DIR \
--sample_temperatures 0.2,0.5 \
--num_samples 100 \
--partners_json ${PPI_PARTNERS_DICT} \
--partner_name Ab \
--antibody
# To specify sampling at a specific CDR loop:
# --mask_ab_region h3
# To specify sampling at specific indices:
# --mask_ab_indices 10,11,12
A Dockerfile is provided as an example/demo of package use; see the example commands below. For production use, you may need to mount a host data directory as a subdirectory of /code, the directory where the package code is located.
docker build -t masked-protein-ent .
docker run -it masked-protein-ent
An example Jupyter notebook for Colab is provided in MaskedProteinEnT-colab-example.ipynb. Because the Colab platform changes frequently, the notebook may not keep working in the long term, and some edits might be required.
EnTransformer code is based on Phil Wang's implementation of EGNN (Satorras et al. 2021) with equivariant transformer layers. The trained models and the antibody CDR sequence recovery reported for the different models in Figure 2 are available at https://zenodo.org/record/8313466. If you use this repository to generate or score sequences, please cite:
Mahajan, S. P., Ruffolo, J. A., Gray, J. J., "Contextual protein and antibody encodings from equivariant graph transformers", 2021.
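For readers who want to compare sampled sequences against the native one, sequence recovery is commonly computed as the fraction of positions where the sampled residue matches the native residue. The sketch below assumes that definition; the function name `sequence_recovery` and the optional `positions` argument (for example, to restrict to masked CDR positions) are hypothetical and not part of this repository.

```python
def sequence_recovery(native, designed, positions=None):
    """Fraction of positions where designed matches native.

    positions: optional iterable of 0-based indices to score
    (defaults to every position).
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if positions is None:
        positions = range(len(native))
    positions = list(positions)
    if not positions:
        raise ValueError("no positions to score")
    matches = sum(native[i] == designed[i] for i in positions)
    return matches / len(positions)


if __name__ == "__main__":
    native = "ACDE"
    designed = "ACDF"
    print(sequence_recovery(native, designed))          # 0.75
    print(sequence_recovery(native, designed, [0, 1]))  # 1.0
```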