Protenix: Protein + X

A trainable PyTorch reproduction of AlphaFold 3.

For more information on the model's performance and capabilities, see our technical report.

⚡ Try it online

Installation and Preparations

Installing Protenix

Follow these steps to set up and run Protenix:

  1. Install Docker (with GPU support). Ensure that Docker is installed and configured with GPU support:

    • Install Docker if not already installed.
    • Install the NVIDIA Container Toolkit to enable GPU support.
    • Verify the setup with:
      docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
  2. Pull the Docker image, which was built from this Dockerfile:

    docker pull ai4s-cn-beijing.cr.volces.com/infra/protenix:v0.0.1
  3. Clone this repository, cd into it, and install the package:

    git clone https://github.com/bytedance/protenix.git 
    cd ./protenix
    pip install -e .
  4. Run Docker with an interactive shell

    docker run --gpus all -it -v $(pwd):/workspace -v /dev/shm:/dev/shm ai4s-cn-beijing.cr.volces.com/infra/protenix:v0.0.1 /bin/bash

After running the commands above, you will be inside the container's environment and can execute commands as you would in a normal Linux terminal.

Setting up kernels

  • Custom CUDA layernorm kernels, modified from FastFold and OneFlow, speed up training by about 30%-50%, depending on the training stage. To use this feature, run the following command:
    export LAYERNORM_TYPE=fast_layernorm
    If the environment variable LAYERNORM_TYPE is set to fast_layernorm, the model uses the layernorm we have developed; otherwise, it uses the standard PyTorch layernorm. The kernels are compiled the first time fast_layernorm is called.
  • The DeepSpeed DS4Sci_EvoformerAttention kernel is a memory-efficient attention kernel developed in a collaboration between OpenFold and the DeepSpeed4Science initiative. To use this feature, pass:
    --use_deepspeed_evo_attention true
    on the command line. DS4Sci_EvoformerAttention is implemented on top of CUTLASS. You need to clone the CUTLASS repository and set the environment variable CUTLASS_PATH to its path. The Dockerfile already includes this setting:
    RUN git clone -b v3.5.1 https://github.com/NVIDIA/cutlass.git  /opt/cutlass
    ENV CUTLASS_PATH=/opt/cutlass
    The kernels will be compiled when DS4Sci_EvoformerAttention is called for the first time.
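
The two kernel options above are controlled by one environment variable and one command-line flag. As a minimal sketch, the Python helper below assembles both before a run is launched. The function name `kernel_settings` and its return shape are hypothetical and not part of Protenix; only `LAYERNORM_TYPE`, `fast_layernorm`, `CUTLASS_PATH`, and `--use_deepspeed_evo_attention` come from the text above.

```python
import os


def kernel_settings(fast_layernorm, deepspeed_evo_attention, env=None):
    """Return (env, flags) for a run. Hypothetical helper, not a Protenix API."""
    # Work on a copy so the caller's environment is never modified in place.
    env = dict(os.environ if env is None else env)
    flags = []

    if fast_layernorm:
        # Selects the custom CUDA layernorm described above.
        env["LAYERNORM_TYPE"] = "fast_layernorm"
    else:
        # Without this variable the PyTorch layernorm is used.
        env.pop("LAYERNORM_TYPE", None)

    if deepspeed_evo_attention:
        # DS4Sci_EvoformerAttention builds on CUTLASS, so CUTLASS_PATH must be set.
        if not env.get("CUTLASS_PATH"):
            raise RuntimeError(
                "CUTLASS_PATH is not set; clone CUTLASS and export its path first."
            )
        flags += ["--use_deepspeed_evo_attention", "true"]

    return env, flags
```

The returned `env` can be passed as the `env` argument of `subprocess.run`, and `flags` can be appended to the command being launched.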

Preparing the datasets

To download the wwPDB dataset and the preprocessed training data, you need at least 1 TB of disk space.
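
Because the download is large, it can help to check free space on the target filesystem first. The sketch below uses the standard-library `shutil.disk_usage`; the helper name `has_enough_space` is hypothetical, and treating the requirement as exactly 10^12 bytes is an assumption of this sketch.

```python
import shutil

# The stated requirement, read as 1 TB in decimal units (an assumption).
REQUIRED_BYTES = 10**12


def has_enough_space(path, required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has at least required_bytes free."""
    # `path` must already exist; disk_usage reports on the filesystem containing it.
    return shutil.disk_usage(path).free >= required_bytes
```

For example, `has_enough_space("/af3-dev")` could be called before starting the download, provided that directory already exists.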

Use the following commands to download the preprocessed wwPDB training databases:

wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data.tar.gz
tar -xzvf /af3-dev/release_data/release_data.tar.gz -C /af3-dev/release_data/
rm /af3-dev/release_data/release_data.tar.gz

The data should be placed in the /af3-dev/release_data/ directory. You can also download it to a different directory, but remember to modify DATA_ROOT_DIR in configs/configs_data.py accordingly. The data hierarchy after extraction is as follows:

├── components.v20240608.cif [408M] # ccd source file
├── components.v20240608.cif.rdkit_mol.pkl [121M] # rdkit Mol object generated by ccd source file
├── indices [33M] # chain or interface entries
├── mmcif [283G]  # raw mmcif data
├── mmcif_bioassembly [36G] # preprocessed wwPDB structural data
├── mmcif_msa [450G] # msa files
├── posebusters_bioassembly [42M] # preprocessed posebusters structural data
├── posebusters_mmcif [361M] # raw mmcif data
├── recentPDB_bioassembly [1.5G] # preprocessed recentPDB structural data
└── seq_to_pdb_index.json [45M] # sequence to pdb id mapping file
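
After extraction, a quick check that every entry in the listing above is present can catch an incomplete download early. The following is a minimal sketch, assuming the entries sit directly under the data root; the names `EXPECTED_ENTRIES` and `missing_entries` are hypothetical and not part of Protenix.

```python
from pathlib import Path

# Entry names copied from the listing above.
EXPECTED_ENTRIES = [
    "components.v20240608.cif",
    "components.v20240608.cif.rdkit_mol.pkl",
    "indices",
    "mmcif",
    "mmcif_bioassembly",
    "mmcif_msa",
    "posebusters_bioassembly",
    "posebusters_mmcif",
    "recentPDB_bioassembly",
    "seq_to_pdb_index.json",
]


def missing_entries(data_root, expected=EXPECTED_ENTRIES):
    """Return the expected files or directories that do not exist under data_root."""
    root = Path(data_root)
    return [name for name in expected if not (root / name).exists()]
```

Calling `missing_entries("/af3-dev/release_data")` returns an empty list when the layout is complete. If the data lives elsewhere, pass the same directory that DATA_ROOT_DIR points to.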

With the above data, you can run the training demo from scratch. components.v20240608.cif and components.v20240608.cif.rdkit_mol.pkl are also used in the inference pipeline to generate CCD reference features. If you only want to run inference, the full released data is not necessary; you can download these two files separately:

wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data/components.v20240608.cif
wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data/components.v20240608.cif.rdkit_mol.pkl

Data processing scripts are still being organized and prepared, and distillation data will be released in the future.

Running your first prediction

Model checkpoints

Use the following command to download the pretrained checkpoint [1.4G]:

wget -P /af3-dev/release_model/ https://af3-dev.tos-cn-beijing.volces.com/release_model/model_v1.pt 

The checkpoint should be placed in the /af3-dev/release_model/ directory.

Notebook demo

You can use notebooks/protenix_inference.ipynb to run the model inference.

Inference demo

You can run the script inference_demo.sh to do model inference:

bash inference_demo.sh

The arguments in this script are explained as follows:

  • load_checkpoint_path: path to the model checkpoints.
  • input_json_path: path to a JSON file that fully describes the input.
  • dump_dir: path to a directory where the results of the inference will be saved.
  • dtype: data type used in inference. Valid options include "bf16" and "fp32".
  • use_deepspeed_evo_attention: whether to use the EvoformerAttention provided by DeepSpeed.
  • use_msa: whether to use the MSA feature. The default is true. To disable the MSA feature, add --use_msa false to the inference_demo.sh script.
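
To make the arguments listed above concrete, here is a minimal sketch that assembles them into a `--key value` list. The helper `inference_args` is hypothetical and not part of Protenix, and it deliberately leaves out the entry point that inference_demo.sh invokes, since that is not named here. Only the `use_msa` default of true is taken from the text; `dtype` and `use_deepspeed_evo_attention` must be given explicitly because their defaults are not stated.

```python
def inference_args(load_checkpoint_path, input_json_path, dump_dir, *,
                   dtype, use_deepspeed_evo_attention, use_msa=True):
    """Build the argument list described above. Hypothetical helper."""
    # Only the two data types named in the argument list are accepted.
    if dtype not in ("bf16", "fp32"):
        raise ValueError(f"dtype must be 'bf16' or 'fp32', got {dtype!r}")

    def as_flag(value):
        # Booleans are written as lowercase true / false on the command line.
        return "true" if value else "false"

    return [
        "--load_checkpoint_path", str(load_checkpoint_path),
        "--input_json_path", str(input_json_path),
        "--dump_dir", str(dump_dir),
        "--dtype", dtype,
        "--use_deepspeed_evo_attention", as_flag(use_deepspeed_evo_attention),
        "--use_msa", as_flag(use_msa),
    ]
```

For example, `inference_args("/af3-dev/release_model/model_v1.pt", "input.json", "./output", dtype="bf16", use_deepspeed_evo_attention=True)` yields a list that can be appended to the inference command; the paths `input.json` and `./output` are placeholders.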

Detailed information on the format of the input JSON file and the output files can be found here.

Training and Finetuning

Training demo

After the installation and data preparations, you can run the following command to train the model from scratch:

bash train_demo.sh 

The key arguments in this script are explained as follows:

  • dtype: data type used in training. Valid options include "bf16" and "fp32".

    • --dtype fp32: the model will be trained in full FP32 precision.
    • --dtype bf16: the model will be trained in BF16 mixed precision. By default, the SampleDiffusion, ConfidenceHead, Mini-rollout, and Loss parts still run in FP32 precision. If you want to train and infer the model in full BF16 mixed precision, pass the following arguments to train_demo.sh:
      --skip_amp.sample_diffusion_training false \
      --skip_amp.confidence_head false \
      --skip_amp.sample_diffusion false \
      --skip_amp.loss false \
  • use_deepspeed_evo_attention: whether to use the EvoformerAttention provided by DeepSpeed, as mentioned above.

  • ema_decay: the decay rate of the EMA, default is 0.999.

  • sample_diffusion.N_step: the number of steps for the diffusion process during evaluation, which is reduced to 20 to improve efficiency.

  • data.train_sets/data.test_sets: the datasets used for training and evaluation. If there are multiple datasets, separate them with commas.

  • Some settings follow those in the AlphaFold 3 paper. The table below shows the training settings for the initial training and the different fine-tuning stages:

    Arguments                             Initial training   Fine tuning 1   Fine tuning 2   Fine tuning 3
    train_crop_size                       384                640             768             768
    diffusion_batch_size                  48                 32              32              32
    loss.weight.alpha_pae                 0                  0               0               1.0
    loss.weight.alpha_bond                0                  1.0             1.0             0
    loss.weight.smooth_lddt               1.0                0               0               0
    loss.weight.alpha_confidence          1e-4               1e-4            1e-4            1e-4
    loss.weight.alpha_diffusion           4.0                4.0             4.0             0
    loss.weight.alpha_distogram           0.03               0.03            0.03            0
    train_confidence_only                 False              False           False           True
    full BF16-mixed speed (A100, s/step)  ~12                ~30             ~44             ~13
    full BF16-mixed peak memory (G)       ~34                ~35             ~48             ~24

    We recommend training on A100-80G or H20/H100 GPUs. With full BF16-mixed precision training, the initial training stage can also be run on A800-40G GPUs. On GPUs with less memory, such as the A30, you will need to reduce the model size, for example by decreasing model.pairformer.nblocks and diffusion_batch_size.

  • In this version, we do not use the template and RNA MSA features for training, as set by default in configs/configs_base.py and configs/configs_data.py:

    --model.template_embedder.n_blocks 0 \
    --data.msa.enable_rna_msa false \

    This will be considered in our future work.

  • The model also supports distributed training with PyTorch’s torchrun. For example, if you’re running distributed training on a single node with 4 GPUs, you can use:

    torchrun --nproc_per_node=4 runner/train.py

    You can also pass other arguments with --<ARGS_KEY> <ARGS_VALUE> as you want.
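
The per-stage settings in the table above can be expressed as `--<ARGS_KEY> <ARGS_VALUE>` pairs. The sketch below is a hedged illustration, assuming the names in the Arguments column can be passed verbatim as argument keys. The stage labels, `SETTINGS`, and `stage_overrides` are hypothetical names chosen for this example, and writing booleans in lowercase is an assumption based on how flags appear elsewhere in this README.

```python
# Column order matches the table above; stage labels are shorthand for this sketch.
STAGES = ("initial_training", "fine_tuning_1", "fine_tuning_2", "fine_tuning_3")

# Values copied from the table (speed and memory rows are measurements, not arguments).
SETTINGS = {
    "train_crop_size": (384, 640, 768, 768),
    "diffusion_batch_size": (48, 32, 32, 32),
    "loss.weight.alpha_pae": (0, 0, 0, 1.0),
    "loss.weight.alpha_bond": (0, 1.0, 1.0, 0),
    "loss.weight.smooth_lddt": (1.0, 0, 0, 0),
    "loss.weight.alpha_confidence": (1e-4, 1e-4, 1e-4, 1e-4),
    "loss.weight.alpha_diffusion": (4.0, 4.0, 4.0, 0),
    "loss.weight.alpha_distogram": (0.03, 0.03, 0.03, 0),
    "train_confidence_only": (False, False, False, True),
}


def stage_overrides(stage):
    """Return the '--key value' list for one training stage."""
    column = STAGES.index(stage)  # raises ValueError for an unknown stage
    flags = []
    for key, values in SETTINGS.items():
        value = values[column]
        # Booleans are lowercased (assumption); numbers use Python's default formatting.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        flags += [f"--{key}", text]
    return flags
```

A command list such as `["torchrun", "--nproc_per_node=4", "runner/train.py", *stage_overrides("fine_tuning_1")]` could then be handed to `subprocess.run`.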

Finetune demo

If you want to fine-tune the model on a specific subset, such as an antibody dataset, you only need to provide a PDB list file and load the pretrained weights as finetune_demo.sh shows:

checkpoint_path="/af3-dev/release_model/model_v1.pt"
...

--load_checkpoint_path ${checkpoint_path} \
--load_checkpoint_ema_path ${checkpoint_path} \
--data.weightedPDB_before2109_wopb_nometalc_0925.base_info.pdb_list examples/subset.txt \

Here, subset.txt is a file containing PDB IDs, such as:

6hvq
5mqc
5zin
3ew0
5akv
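
A malformed entry in the list file is easy to miss, so a small validation step before fine-tuning may help. The sketch below checks each line against the classic 4-character PDB ID pattern (a digit from 1 to 9 followed by three alphanumeric characters). The helper `read_pdb_list` is hypothetical and not part of Protenix, and normalizing IDs to lowercase is an assumption based on the example above.

```python
import re

# Classic PDB IDs: a digit 1-9 followed by three alphanumeric characters.
PDB_ID = re.compile(r"[1-9][a-z0-9]{3}")


def read_pdb_list(path):
    """Read one PDB ID per line, skipping blank lines and rejecting malformed entries."""
    ids = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            entry = line.strip().lower()  # lowercase normalization is an assumption
            if not entry:
                continue
            if not PDB_ID.fullmatch(entry):
                raise ValueError(
                    f"line {line_number}: {entry!r} is not a 4-character PDB ID"
                )
            ids.append(entry)
    return ids
```

Running `read_pdb_list("examples/subset.txt")` on the file shown above would return the five IDs in order.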

Acknowledgements

The implementation of the layernorm operators is based on OneFlow and FastFold. We used OpenFold for some module implementations, except for LayerNorm.

Contribution

Please check Contributing for more details.

Code of Conduct

Please check Code of Conduct for more details.

Security

If you discover a potential security issue in this project, or think you may have discovered one, we ask that you notify ByteDance Security via our security center or vulnerability reporting email.

Please do not create a public GitHub issue.

License

This project, including code and model parameters, is made available under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License. You can find details at: https://creativecommons.org/licenses/by-nc/4.0/

For commercial use, please reach out to us at ai4s-bio@bytedance.com for the commercial license. We welcome all types of collaborations.