Basic example of running AlphaFold2 on Summit, utilizing the pre-built Singularity container.
NOTE: This assumes you have access to the OLCF's Summit Supercomputer.
The primary thing being provided here is the container. A common issue is that users are unable to build containers targeting Summit themselves, due to Summit's ppc64le architecture.
Most users do not have access to a system on which they can build for this architecture. We also do not currently allow users to build containers directly on Summit, BUT we do provide the Singularity runtime. Therefore, we have decided to provide this AlphaFold container to users as a pre-built container.
It is worth noting that the container we provide contains ONLY the ML/DL portions of AlphaFold; not all packages for a full AlphaFold application run are in the container. The other components generate the features that are input into the inference procedure. These features are produced by running the accessory tools HMMER, HHSuite, and Kalign to generate MSAs for AlphaFold. Some of these tools (HMMER) rely on the x86 architecture, and all of them use only CPUs. It is therefore better (and less expensive, in terms of your Summit allocation) to pre-generate the input features elsewhere.

After creating your input MSAs, pickle them; our AlphaFold workflow will look for the pickled input features. These examples assume the pre-processing phase is already done, so you should bring your own pickled data that is ready to run against the AlphaFold model. We find OLCF's Andes cluster to be a great resource for generating these pre-processed input features.
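As a sketch of that hand-off, pickling a features dictionary from a shell session might look like the following. The dictionary contents and file name here are illustrative placeholders only; in a real run the dictionary would come from AlphaFold's data pipeline, not be written by hand.

```shell
# Hedged sketch: after generating MSAs on Andes, serialize the resulting
# feature dictionary with Python's pickle module so the Summit-side
# workflow can pick it up. The dictionary below is a placeholder standing
# in for real pipeline output.
python3 - <<'EOF'
import pickle

# Placeholder feature dictionary (illustrative keys, not the real schema).
features = {"sequence": "MKT", "num_res": 3}

with open("features.pkl", "wb") as fh:
    pickle.dump(features, fh, protocol=4)
EOF
```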
Here we provide an example with some CASP14 data.
Outline of what you can find here:
alphafold1103.sif
: Singularity container with the ML/DL portions of AlphaFold.

alphafold/run_alphafold_summit_dl.py
: Adjusted run_alphafold.py. Essentially comments out the alphafold.data and alphafold.relax portions, leaving the model portion.

run_af_summit_dl.sh
: Simple wrapper to provide inputs and launch AlphaFold.

batch_submit.sh
: Job submission script example.
Thank you to Dr. Mu Gao for his outstanding assistance, enabling us to share these basic examples.
/gpfs/alpine/stf007/world-shared/AlphaFold/alphafold1103.sif
- You may copy it to your directory or use it from the above location
NOTE: It was built on top of the cuda-ppc64le:11.0.3-cudnn8-devel-ubuntu18.04
base container.
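For orientation, a direct invocation of the container might look roughly like the following. This is a hedged sketch, not the actual command used by the workflow: the script path is a placeholder, and in practice the provided run_af_summit_dl.sh wrapper handles the invocation for you.

```shell
# Hypothetical direct use of the container (illustrative only;
# use the provided run_af_summit_dl.sh wrapper in practice).
# --nv exposes the node's NVIDIA GPUs inside the container.
singularity exec --nv alphafold1103.sif \
  python /path/to/alphafold/run_alphafold_summit_dl.py --help
```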
We generated all of the input features on the Andes cluster, because the codes required to generate these features use only CPUs (no GPUs), and Summit's nodes are too expensive a resource for this type of CPU-based preprocessing. To do this, you will need to install HHSuite, Kalign, and HMMER on Andes; all of these are fairly easy to build. We then ran these HMM-generating codes on several nodes of Andes. We also had to replicate the reduced BFD dataset and the other sequence datasets in order to make running several instances of HHSuite efficient.
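As an illustration of that preprocessing step, the accessory tools can be driven roughly like this on Andes. These are not the exact commands from our runs: the database paths, query names, and thread counts below are assumptions you would adapt to your own installation.

```shell
# Illustrative sketch of the MSA-generation step on Andes.
# All paths and file names are placeholders.

# HMMER: search the query against UniRef90, saving the alignment (-A).
jackhmmer --cpu 8 -N 1 -A target_uniref90.sto target.fasta uniref90.fasta

# HHSuite: hhblits search against the (reduced) BFD database.
hhblits -i target.fasta -d /path/to/bfd_db -oa3m target_bfd.a3m -cpu 8 -n 3

# Kalign: re-align sequences (used by AlphaFold's template pipeline).
kalign -i hits.fasta -o hits_aligned.fasta
```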
Our provided batch_submit.sh can be adjusted as needed. Within it, we provide a target list of sequences to run against (casp14_fm.lst) and an output directory. Those, in turn, get fed to the run_af_summit_dl.sh AlphaFold wrapper script.
The data for the example can be found here: /gpfs/alpine/stf007/world-shared/AlphaFold/.
You can see we set these variables in the run_af_summit_dl.sh script:
fea_dir=/gpfs/alpine/stf007/world-shared/AlphaFold/casp14
af_dir=/gpfs/alpine/stf007/world-shared/AlphaFold/alphafold
data_dir=/gpfs/alpine/stf007/world-shared/AlphaFold/alphafold_databases
You can run against these as a test, as they are world-readable. You will need to change the project allocation and the output directory in the batch_submit.sh script.
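For orientation, a Summit LSF submission along the lines of batch_submit.sh might look like the sketch below. The project ID, resource shape, and the wrapper's argument handling are placeholders, not the contents of the provided script; adapt batch_submit.sh itself rather than this sketch.

```shell
#!/bin/bash
# Hypothetical LSF script in the spirit of batch_submit.sh; values are placeholders.
#BSUB -P PROJ123              # your project allocation
#BSUB -W 2:00                 # walltime
#BSUB -nnodes 1
#BSUB -J af_dl
#BSUB -o af_dl.%J.out

# Launch one resource set with one GPU for the ML/DL inference phase.
# How run_af_summit_dl.sh receives its target and output directory is
# assumed here, not verbatim from the provided script.
jsrun -n 1 -a 1 -g 1 -c 7 ./run_af_summit_dl.sh
```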
Creating other datasets to run against is left up to you at this point, but this should help you run AlphaFold against data that is ready for the ML/DL phase.
References:
Jumper et al., "Highly accurate protein structure prediction with AlphaFold," Nature, 2021.

Original AlphaFold GitHub repository.
License:
While the code for AlphaFold is licensed under the Apache 2.0 license, the model parameters themselves are licensed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license and are for non-commercial use only.