/dyMEAN

This repo contains the codes for our paper "End-to-End Full-Atom Antibody Design"

Primary LanguagePythonMIT LicenseMIT

dyMEAN: End-to-End Full-Atom Antibody Design

This repo contains the codes for our paper End-to-End Full-Atom Antibody Design.

Quick Links

Setup

There are 3 necessary and 1 optional prerequisites: setting up conda environment (necessary), obtaining scorers (necessary), preparing antibody pdb data (necessary), and downloading baselines (optional).

1. Environment

We have provided the env.yml for creating the runtime conda environment just by running:

conda env create -f env.yml

2. Scorers

Please first prepare the scorers for TMscore and DockQ as follows:

The source code for assessing TMscore is at evaluation/TMscore.cpp. Please compile it by:

g++ -static -O3 -ffast-math -lm -o evaluation/TMscore evaluation/TMscore.cpp

To prepare the DockQ scorer, please clone its official github and compile the prerequisites according to its instructions. After that, please revise the DOCKQ_DIR variable in the configs.py to point to the directory containing the DockQ project (e.g. ./DockQ).

The lDDT scorer is in the conda environment, and the $\Delta\Delta G$ scorer is integrated into our codes, therefore they don't need additional preparations.

3. PDB data

Please download all the structure data of antibodies from the download page of SAbDab. Please enter the Downloads tab on the left of the web page and download the archived zip file for the structures, then decompress it:

wget https://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/archive/all/ -O all_structures.zip
unzip all_structures.zip

You should get a folder named all_structures with the following hierarchy:

├── all_structures
│   ├── chothia
│   ├── imgt
│   ├── raw

Each subfolder contains the pdb files renumbered with the corresponding scheme. We use IMGT in the paper, so the imgt subfolder is what we care about.

Since pdb files are heavy to process, usually people will generate a summary file for the structural database which records the basic information about each structure for fast access. We have provided the summary of the dataset retrieved at November 12, 2022 (summaries/sabdab_summary.tsv). Since the dataset is updated on a weekly basis, if you want to use the latest version, please download it from the official website.

(Optional) 4. Baselines

If you are interested in the pipeline baselines, including the following projects and integrate their dependencies according to your needs:

After adding these projects, please also remember to revise the corresponding paths in ./configs.py. We have also provided the scripts for cascading the modules in ./scripts/pipeline_inference.sh.

Experiments

The trained checkpoints for each task are provided at the github release page. To use them, please download the ones you are interested in and save them into the folder ./checkpoints. We provide the names, training configurations (under ./scripts/train/configs), and descriptions of the checkpoints as follows:

checkpoint(s) configure description
cdrh3_design.ckpt single_cdr_design.json Epitope-binding CDR-H3 design
struct_prediction.ckpt struct_prediction.json Complex structure prediction
affinity_opt.ckpt & ddg_predictor.ckp single_cdr_opt.json Affinity optimization on CDR-H3
multi_cdr_design.ckpt multi_cdr_design.json Design all 6 CDRs simultaneously
multi_cdr_opt.ckpt & multi_cdr_ddg_predictor multi_cdr_opt.json Optimize affinity on all 6 CDRs simultaneously
full_design.ckpt full_design.json Design the entire variable domain, including the framework region

Data Preprocessing

Data

To preprocess the raw data, we need to first generate summaries for each benchmark in json format, then split the datasets into train/validation/test sets, and finally transform the pdb data to python objects. We have provided the script for all these procedures in scripts/data_preprocess.sh. Suppose the IMGT-renumbered pdb data are located at all_structures/imgt/, and that you want to store the processed data (~5G) at all_data, you can simply run:

bash scripts/data_preprocess.sh all_structures/imgt all_data

which takes about 1 hour to process SAbDab, RAbD, Igfold test set, and SKEMPI V2.0. It is normal to see reported errors in this process because some antibody structures are wrongly annotated or have wrong format, which will be dropped out in the data cleaning phase.

(Optional) Conserved Template

We have provided the conserved template from SAbDab in ./data/template.json. If you are interested in the extracting process, it is also possible to extract a conserved template from a specified dataset (e.g. the training set for the CDR-H3 design task) by running the following command:

python -m data.framework_templates \
    --dataset ./all_data/RAbD/train.json \
    --out ./data/new_template.json

CDR-H3 Design

We use SAbDab for training and RAbD for testing. Please first revise the settings in scripts/train/configs/cdr_design.json (path to datasets and other hyperparameters) and then run the below command for training:

GPU=0,1 bash scripts/train/train.sh scripts/train/configs/single_cdr_design.json

Normally the training procedure takes about 7 hours on 2 GeForce RTX 2080 Ti GPUs. We have also provided the trained checkpoint at checkpoints/cdrh3_design.ckpt. Then please revise the path to the test set in scripts/test/test.sh and run the following command for testing:

GPU=0 bash scripts/test/test.sh ./checkpoints/cdrh3_design.ckpt ./all_data/RAbD/test.json ./results

which will save the generated results to ./results.

Complex Structure Prediction

We use SAbDab for training and IgFold for testing. The training and testing procedure are similar to those of CDR-H3 design. After revising the settings in scripts/train/configs/cdr_design.json and scripts/test/test.sh as mentioned before, please run the following command for training:

GPU=0,1 bash scripts/train/train.sh scripts/train/configs/struct_prediction.json

Normally the training procedure takes about 8 hours on 2 GeForce RTX 2080 Ti GPUs. We have also provided the trained checkpoint at checkpoints/struct_prediction.ckpt. Then please run the following command for testing:

GPU=0 bash scripts/test/test.sh ./checkpoints/struct_prediction.ckpt ./all_data/IgFold/test.json ./results

Affinity Optimization

We use SAbDab for training and the antibodies in SKEMPI V2.0 for testing. Similarly, please first revise the settings in scripts/train/configs/affinity_opt.json, scripts/test/optimize_test.sh, and additionally scripts/train/train_predictor.sh. Then please conduct training of dyMEANOpt (~ 5h):

GPU=0,1 bash scripts/train/train.sh scripts/train/configs/single_cdr_opt.json

Then we need to train a predictor of ddg on the representations of generated complex (~ 40min):

GPU=0 bash scripts/train/train_predictor.sh checkpoints/cdrh3_opt.ckpt

We have provided the trained checkpoints at checkpoints/cdrh3_opt.ckpt and checkpoints/cdrh3_ddg_predictor.ckpt. The optimization test can be conducted through:

GPU=0 bash scripts/test/optimize_test.sh checkpoints/cdrh3_opt.ckpt checkpoints/cdrh3_ddg_predictor.ckpt ./all_data/SKEMPI/test.json 0 50

which will do 50 steps of gradient search without restrictions on the maximum number of changed residues (change 0 to any number to restrict the upperbound of $\Delta L$).

Proof-of-Concept Applications

We also provide inference API and in silico demos for common applications in the real world problems, which are located in the ./api and ./demos.

Inference API

We provide the design API and the optimize API in ./api, which can be easily integrated into python codes.

Design

The design API (./api/design.py) can be used to generate CDRs given the sequences of the framework region, the PDB file of the antigen as well as the epitope definitions. We will use the an interesting scenario to illustrate the usage of the design API.

We want to design an antibody combining to the open state of the transient receptor potential cation channel subfamily V member 1 (TRPV1), which plays a critical role in acute and persistent pain. Instead of handcraft the epitope on TRPV1, we try to mimic an existing binder which is a double-knot toxin (DkTx). Therefore, we need to first extract the epitope definition by analyzing the binding pattern of the toxin, then design an antibody with given sequences of the framework regions.

1. Extract the Epitope Definition

We provide the PDB file of the complex of the transient receptor potential cation channel subfamily V member 1 (TRPV1, chain ABCD) and the double-knot toxin (DkTx, chain EF) in ./demos/data/7l2m.pdb. The original PDB has 4 symmetric units, so we manually split the two toxins (chain EF) in the middle to form 4 symmetric chains e,f,E,F. Each antibody only need to focus on one unit. Here we choose the chain E as an example.

We generate the epitope definition by analyzing the binding interface of chain E to the TRPV1:

python -m api.binding_interface \
    --pdb ./demos/data/7l2m.pdb \
    --receptor A B C D \
    --ligand E \
    --out ./demos/data/E_epitope.json

Now the epitope definition (i.e. the residues of TRPV1 on the binding interface) is saved to ./demos/data/E_epitope.json. By changing the value of the argument "ligand" to e, f, and F, we can obtain the epitope definitions for other units (don't forget to revise the output path as well).

2. Obtain the Sequences of the Framework Regions

Depending on the final purposes of designing the antibody, framework regions with different physiochemical properties may be desired. Since here we are only providing a proof-of-concept case, we randomly pick up one from the existing dataset:

heavy chain (H): 'QVQLKESGPGLLQPSQTLSLTCTVSGISLSDYGVHWVRQAPGKGLEWMGIIGHAGGTDYNSNLKSRVSISRDTSKSQVFLKLNSLQQEDTAMYFC----------WGQGIQVTVSSA'
light chain (L): 'YTLTQPPLVSVALGQKATITCSGDKLSDVYVHWYQQKAGQAPVLVIYEDNRRPSGIPDHFSGSNSGNMATLTISKAQAGDEADYYCQSWDGTNSAWVFGSGTKVTVLGQ'

The original CDR-H3 is masked by "-". Designing multiple CDRs are also supported, which will be illustrated later.

3. Design the CDRs

The last step is to design the CDRs with the design API:

from api.design import design

ckpt = './checkpoints/cdrh3_design.ckpt'
root_dir = './demos/data'
pdbs = [os.path.join(root_dir, '7l2m.pdb') for _ in range(4)]
toxin_chains = ['E', 'e', 'F', 'f']
remove_chains = [toxin_chains for _ in range(4)]
epitope_defs = [os.path.join(root_dir, c + '_epitope.json') for c in toxin_chains]
identifiers = [f'{c}_antibody' for c in toxin_chains]

# use '-' for masking amino acids
frameworks = [
    (
        ('H', 'QVQLKESGPGLLQPSQTLSLTCTVSGISLSDYGVHWVRQAPGKGLEWMGIIGHAGGTDYNSNLKSRVSISRDTSKSQVFLKLNSLQQEDTAMYFC----------WGQGIQVTVSSA'),
        ('L', 'YTLTQPPLVSVALGQKATITCSGDKLSDVYVHWYQQKAGQAPVLVIYEDNRRPSGIPDHFSGSNSGNMATLTISKAQAGDEADYYCQSWDGTNSAWVFGSGTKVTVLGQ')
    ) \
    for _ in pdbs
]  # the first item of each tuple is heavy chain, the second is light chain

design(ckpt=ckpt,  # path to the checkpoint of the trained model
       gpu=0,      # the ID of the GPU to use
       pdbs=pdbs,  # paths to the PDB file of each antigen (here antigen is all TRPV1)
       epitope_defs=epitope_defs,  # paths to the epitope definitions
       frameworks=frameworks,      # the given sequences of the framework regions
       out_dir=root_dir,           # output directory
       identifiers=identifiers,    # name of each output antibody
       remove_chains=remove_chains,# remove the original ligand
       enable_openmm_relax=True,   # use openmm to relax the generated structure
       auto_detect_cdrs=False)  # manually use '-'  to represent CDR residues

These codes are also added as an example in ./api/design.py, so you can directly run it by:

python -m api.design

Here we use "-" to mark the CDR-H3 manually, but you can also set auto_detect_cdrs=True to let the CDR be automatically decided by the IMGT numbering system. The types of the CDRs to design will be automatically derived from the given checkpoint. Currently the API support re-designing single or multiple CDRs, as well as designing the full antibody (by passing "-" * length as the input).

Enabling Openmm relax will slow down the generation process a lot, but will rectify the bond lengths and angles to conform to the physical constraints.

Optimize

The optimize API (./api/optimize.py) is straight-forward. We optimize ./demos/data/1nca.pdb as an example:

from api.optimize import optimize, ComplexSummary

ckpt = './checkpoints/cdrh3_opt.ckpt'
predictor_ckpt = './checkpoints/cdrh3_ddg_predictor.ckpt'
root_dir = './demos/data/1nca_opt'
summary = ComplexSummary(
    pdb='./demos/data/1nca.pdb',
    heavy_chain='H',
    light_chain='L',
    antigen_chains=['N']
)
optimize(
    ckpt=ckpt,  # path to the checkpoint of the trained model
    predictor_ckpt=predictor_ckpt,  # path to the checkpoint of the trained ddG predictor
    gpu=0,      # the ID of the GPU to use
    cplx_summary=summary,   # summary of the complex as well as its PDB file
    num_residue_changes=[1, 2, 3, 4, 5],  # generate 5 samples, changing at most 1, 2, 3, 4, and 5 residues, respectively
    out_dir=root_dir,  # output directory
    batch_size=16,     # batch size
    num_workers=4,     # number of workers to use
    optimize_steps=50  # number of steps for gradient desend
)

Codes for this example is also added to ./api/optimize.py, so you can directly run them by:

python -m api.optimize

Then you will get the following results:

├── demos/data/1nca_opt
│   ├── 1nca_0_1.pdb
│   ├── 1nca_1_2.pdb
│   ├── 1nca_2_3.pdb
│   ├── 1nca_3_4.pdb
│   ├── 1nca_4_5.pdb
│   ├── 1nca_original.pdb

where the 1nca_original.pdb is the original complex, and 1nca_a_b.pdb means the $a$-th candiates with constraints of changing up to $b$ residues.

Complex Structure Prediction

The complex structure prediction API (./api/structure_prediction.py) predicts the complex structure given the antigen, the sequences of the heavy chain and the light chain, and the definition of the epitope. Global docking is still very challenging so we narrow the scope to the epitope of interest. We predict./demos/data/1nca.pdb as an example:

from api.structure_prediction import structure_prediction

ckpt = './checkpoints/struct_prediction.ckpt'
root_dir = './demos/data'
n_sample = 10  # sample 10 conformations
pdbs = [os.path.join(root_dir, '1nca_antigen.pdb') for _ in range(n_sample)]
epitope_defs = [os.path.join(root_dir, '1nca_epitope.json') for _ in range(n_sample)]
identifiers = [f'1nca_model_{i}' for i in range(n_sample)]

seqs = [
    (
        ('H', 'QIQLVQSGPELKKPGETVKISCKASGYTFTNYGMNWVKQAPGKGLKWMGWINTNTGEPTYGEEFKGRFAFSLETSASTANLQINNLKNEDTATFFCARGEDNFGSLSDYWGQGTTVTVSS'),
        ('L', 'DIVMTQSPKFMSTSVGDRVTITCKASQDVSTAVVWYQQKPGQSPKLLIYWASTRHIGVPDRFAGSGSGTDYTLTISSVQAEDLALYYCQQHYSPPWTFGGGTKLEIK')
    ) \
    for _ in pdbs
]  # the first item of each tuple is heavy chain, the second is light chain

structure_prediction(
    ckpt=ckpt,  # path to the checkpoint of the trained model
    gpu=0,      # the ID of the GPU to use
    pdbs=pdbs,  # paths to the PDB file of each antigen (here antigen is all TRPV1)
    epitope_defs=epitope_defs,  # paths to the epitope definitions
    seqs=seqs,      # the given sequences of the framework regions
    out_dir=root_dir,           # output directory
    identifiers=identifiers,    # name of each output antibody
    enable_openmm_relax=True)   # use openmm to relax the generated structure

Codes for this example is also added to ./api/structure_prediction.py, so you can directly run them by:

python -m api.structure_prediction

Then you will get the following results:

├── demos/data
│   ├── 1nca_model_0.pdb
│   ├── 1nca_model_1.pdb
│   ├── 1nca_model_2.pdb
│   ├── ...

where there should be a total of 10 sampled conformations. Note that the first or last few residues might be discarded in the results if they are out of the variable domain according to the IMGT numbering system.

In Silico "Display"

In vitro display are commonly used for selecting binding mutants from antibody libraries. Here we implement an in silico version with the design API by generating and filtering candidates from existing dataset against the antigen with an epitope definition. Further, we need an metric to evaluate how well the generated antibody binds to the target. Here we use FoldX as the affinity predictor, so to run this demo, you may need to first download the it from the official website and revise the path in ./configs.py correspondingly. We still use the TRPV1 example in the previous section, and use the RAbD benchmark as the antibody library providing the framework regions:

python -m demos.display \
    --ckpt checkpoints/multi_cdr_design.ckpt \
    --pdb demos/data/7l2m.pdb \
    --epitope_def demos/data/E_epitope.json \
    --library ./all_data/rabd_all.json \
    --n_sample 30 \
    --save_dir demos/display \
    --gpu 0

which will results in 30 candidates with their affinity predicted by FoldX.

Contact

Thank you for your interest in our work!

Please feel free to ask about any questions about the algorithms, codes, as well as problems encountered in running them so that we can make it clearer and better. You can either create an issue in the github repo or contact us at jackie_kxz@outlook.com.

Others

The files below are borrowed from existing repositories: