This repository contains a HyperQueue workflow that implements a LiGen virtual screening + docking pipeline, and also a CADD pipeline that adds integration with GROMACS.
Both pipelines use HyperQueue to build and execute a task graph.
There are several external dependencies required to run the workflows. They can be installed in two ways, with a Dockerfile or natively on the target system.
Before installing the dependencies, you have to download AmberTools23.tar.bz2 from https://ambermd.org/GetAmber.php (registration is required to download it) and put it into the `deps` directory.
To use the Python packages, you should have Python 3.10 or 3.11.
Install Docker and run:
$ docker build -t cadd .
You can convert the resulting image to Singularity/Apptainer if you need to execute it on an HPC cluster.
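For example, one possible way to do the conversion (a sketch, assuming Apptainer is installed and the image was built with the `cadd` tag as above):

$ apptainer build cadd.sif docker-daemon://cadd:latest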
To install the dependencies natively (without Docker), you should have at least the following packages available:
- C/C++ compiler
- CMake
- CPython development headers (`python-dev`)
- MPI implementation (for compiling `mpi4py`)
  - Preferably OpenMPI; AmberTools seems to have some issues with MPICH
  - For example `libopenmpi-dev`
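On Ubuntu 22.04, for example, these prerequisites can typically be installed like this (the package names below, such as `build-essential` for the C/C++ compiler, are assumptions; the Dockerfile is the authoritative reference):

$ sudo apt-get update
$ sudo apt-get install -y build-essential cmake python3-dev libopenmpi-dev   # assumed package names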
You will then need to install several dependencies. You can examine the Dockerfile to see how it installs these dependencies on Ubuntu 22.04.
Steps 3 - 5 are only needed for the CADD pipeline.
- Create a virtual environment
$ python3 -m venv venv
$ source venv/bin/activate
(venv) $ python3 -m pip install -U setuptools wheel pip
(venv) $ python3 -m pip install uv
- Install Python dependencies
(venv) $ uv pip sync requirements.txt
- Install native dependencies
(venv) $ python3 env.py install
  - The installation step will generate an `env.sh` file, which you should load before using this package (and before executing the `check-env` command):
    (venv) $ source env.sh
  - You can also run the scripts in the `deps` directory manually, in the same order as in the Dockerfile.
- Check if everything has been installed correctly
(venv) $ python3 main.py check-env
To run the LiGen workflow, follow these steps:

1. Prepare the input dataset.
   It needs to consist of a `.pdb` protein file, a `.mol2` probe file, and a `.smi` file containing ligands.
2. Get access to an Apptainer file containing the LiGen tools (it is proprietary).
3. Prepare a YAML file that configures the workflow parameters. Here is an example:

   data:
     protein_pdb: <path>
     probe_mol2: <path>
     smi: <path>
     max_molecules_per_smi: <number>

   The `max_molecules_per_smi` parameter specifies the number of ligands per HyperQueue task. A number such as `10` is a reasonable default. The paths are resolved relative to the directory from which the script is executed (step 4).
4. Execute the workflow.
   (venv) $ python3 cadd.py ligen <workdir> <params-file> <ligen-container> [--dock] [--local-cluster]

   - `workdir` will store intermediate files and outputs of the workflow.
   - `params-file` is a path to a YAML file with the workflow parameters (step 3).
   - `ligen-container` is a path to the LiGen Apptainer image (step 2).
   - `--dock` specifies whether docking should also be performed. Without it, only virtual screening is performed.
   - `--local-cluster` specifies whether a new HyperQueue cluster should be created. If unset, the code will try to connect to an existing HyperQueue instance on the local node.
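   If you do not pass `--local-cluster`, one possible way to provide that existing instance is to start a HyperQueue server and worker on the node yourself before launching the workflow; a minimal sketch (assuming the `hq` binary is available on your `PATH`):

   $ hq server start &    # start a HyperQueue server on the local node
   $ hq worker start &    # start a worker that connects to the server
   (venv) $ python3 cadd.py ligen <workdir> <params-file> <ligen-container>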