This application provides a distributed system for drug discovery, with load balancing and fault tolerance, to identify promising molecules that interact with a specified protein of interest.
The core workflow lets users supply a SMILES text file, the receptor of interest, and the docking parameters to run large-scale docking simulations. Out of the box, the system uses QuickVina 2.0 for docking.
Execution is optimized for SLURM environments: computations are distributed in parallel across multiple CPUs and nodes, giving runtime that scales near-linearly with the number of molecules processed. The application also provides load balancing and fault-tolerance mechanisms.
Please clone the repository using:
`git clone https://github.com/akshat998/cs224b.git`
Please ensure that the following packages are installed:
- RDKit version 2021.09.5
- Open Babel 3.1.0
- Python 3.7.13 (or higher)
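As a minimal sketch, the version requirements above can be checked at runtime with a small helper; the function names here are illustrative and not part of the repository:

```python
# Hedged sketch: compare installed package versions against the
# requirements listed above. Version strings come from this README;
# the helper names are illustrative only.
def version_tuple(version):
    """Convert a dotted version string into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())

REQUIRED = {"RDKit": "2021.09.5", "Open Babel": "3.1.0", "Python": "3.7.13"}

def meets_requirement(installed, required):
    """True if the installed version is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)
```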
- `DATA/`: Place the receptor file and the docking executables here.
- `OUTPUTS/`: Stores the results of the docking simulations (created on the fly).
- `all.ctrl`: Contains all user-specifiable parameters for the screening process, including the docking parameterization.
- `dataset_calc.py`: Python script that runs docking on the specified ligands.
- `submit.sh`: Slurm submission script that submits an array of jobs for processing.
- `load_balancer.py`: Evenly distributes the workload across nodes, based on the number of molecules in the user-provided file and the number of nodes requested.
- `monitor_and_resubmit.py`: Monitors ongoing jobs and resubmits tasks for nodes that have crashed unexpectedly.
To get started with the docking simulations, follow the steps outlined below. These steps ensure that your configuration is correctly set up for your specific docking scenario:
1. **Configure receptor location:** Open `all.ctrl` and specify the exact location of your receptor in the designated section.
2. **Set docking parameters:** In `all.ctrl`, enter the appropriate `CENTER-X/Y/Z` and `SIZE-X/Y/Z` coordinates to define your docking area.
3. **Specify SMILES list path:** In `all.ctrl`, input the path to your SMILES list file, which defines the molecular inputs for the simulation. Each line must contain a single SMILES string followed by a newline character (e.g., `C[C@@H](N)C(=O)O\n`).
4. **Slurm cluster account:** In `submit.sh`, replace `TODO` in `#SBATCH --account=TODO` with your actual Slurm cluster account name to ensure proper job submission.
5. **Job submission configuration:** In `submit.sh`, adjust `#SBATCH --array=1-999` to the number of jobs you want to submit for your docking calculation. Ensure this number matches the `MAX_NUM_JOBS` parameter set in `all.ctrl`.
6. **Executable permissions:** Make sure the docking executables have the correct permissions by running `chmod 777 ./DATA/qvina`.
7. **Assign subtasks to each node with load balancing:** In `all.ctrl`, set the `USE_LOAD_BALANCER` flag. If set to `False`, molecules are distributed randomly across the nodes. If set to `True`, molecules are divided based on the number of atoms, giving a more balanced compute load per node. Then run `python3 load_balancer.py`; this generates `partition_i.smi` files in the `DATA` directory, and each i-th node processes all molecules listed in its respective file.
8. **Submit your job:** Finally, submit your job to the Slurm cluster with `sbatch submit.sh`.
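The load-balancing idea in step 7 can be sketched as follows. This is an illustration of the concept (sort by a size proxy, then deal round-robin across nodes), not the repository's actual `load_balancer.py` implementation, and the letter-count proxy for atom count is a deliberate simplification:

```python
def partition_by_size(smiles_list, num_nodes):
    """Split SMILES across nodes so per-node workloads are roughly even.

    Sorts molecules from largest to smallest by a crude size proxy
    (count of letters in the SMILES string, standing in for atom count)
    and deals them round-robin across the node partitions.
    """
    ranked = sorted(smiles_list,
                    key=lambda s: sum(ch.isalpha() for ch in s),
                    reverse=True)
    partitions = [[] for _ in range(num_nodes)]
    for idx, smi in enumerate(ranked):
        partitions[idx % num_nodes].append(smi)
    return partitions
```

Each partition would then correspond to one `partition_i.smi` file in `DATA/`.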
By following these steps, you'll be properly set up to conduct your docking simulations. Ensure all paths and parameters are double-checked for accuracy before submitting your job.
The `monitor_and_resubmit.py` script provides monitoring and management for the distributed docking runs. It tracks the progress of ongoing jobs and resubmits jobs for nodes that have crashed unexpectedly, keeping your simulations running without manual intervention.
- Job Monitoring: Check the progress of running jobs to ensure they are proceeding as expected.
- Automatic Resubmission: Automatically detects and resubmits failed or crashed jobs to maintain continuous operation without manual intervention.
To use the `monitor_and_resubmit.py` script, run it with one of the following modes, depending on your monitoring and resubmission needs:
- **Check progress:** Checks the current status of submitted jobs. If a crash is detected on any node, that node's job is resubmitted. If all jobs have completed, you are advised to use the finish-and-resubmit mode. If all jobs are in progress without any crashes, no action is taken.
  - Command: `python3 monitor_and_resubmit.py check_progress [job_id]`
- **Finish and resubmit:** Use this mode once all jobs have finished. It consolidates output files, cleans up intermediate files, and prepares a new batch of submissions for any molecules whose calculations are incomplete.
  - Command: `python3 monitor_and_resubmit.py finish_and_resubmit [job_id]`
  - The final output file, containing each SMILES string and its docking score, is written to `DATA/combined_output.txt` in the format `SMILES,Docking Score`.
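As a quick illustration of consuming `DATA/combined_output.txt`, the sketch below parses `SMILES,Docking Score` lines and returns the best-scoring molecules (more negative docking scores indicate stronger predicted binding). The helper name is hypothetical, not part of the repository:

```python
def top_hits(lines, k=3):
    """Parse 'SMILES,score' lines and return the k lowest-scoring entries."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        smiles, score = line.rsplit(",", 1)  # split on the last comma
        rows.append((smiles, float(score)))
    return sorted(rows, key=lambda row: row[1])[:k]
```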
- The script first checks job status using SLURM's `squeue` command to determine whether jobs are still running or have crashed.
- For crashed jobs, the script generates a new SLURM script for each crashed job part, resubmits it, and keeps the workspace clean by deleting temporary files.
- In finish-and-resubmit mode, the script combines output files for analysis, deletes intermediate files to free up space, and identifies any incomplete molecule calculations. It then updates the configuration file to reflect the new number of molecules and their file paths, and prepares the system for a new round of submissions.
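The crash-detection idea can be sketched as a pure function over `squeue` output. This assumes the output of something like `squeue -h -o %K --job <id>` (one array-task index per line; `%K` is SLURM's job-array-index format code); the function is illustrative, not the script's actual code:

```python
def missing_tasks(squeue_output, expected_tasks):
    """Return expected array-task indices that no longer appear in squeue.

    Tasks absent from the output have either finished or crashed and are
    therefore candidates for inspection and resubmission.
    """
    active = {line.strip() for line in squeue_output.splitlines() if line.strip()}
    return [task for task in expected_tasks if str(task) not in active]
```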
- **Time comparison:** Analyze how docking time scales with the number of atoms. Raw timing data for docking on a single CPU is at `Experiments/EXP1_time_vs_num_atoms/TIME.csv`.
- **Scaling behavior:** Examine the total runtime for docking the NCI Open Compound Collection. The data is available at `Experiments/EXP2_linear_scaling/cpu_runtime_data.csv`.
  - To conduct this experiment, modify `submit.sh`: use the `--array` option to specify the number of nodes (e.g., `#SBATCH --array=1-10` for 10 nodes), and set `#SBATCH --ntasks-per-node=40` to define the number of CPUs per node. In this scenario, a total of 400 CPUs are deployed across 10 nodes.
- **Load balancing effects:** Investigate the impact of load balancing on processing times. The comparison data is at `Experiments/EXP3_load_balancing/timings_data.csv`.
  - To execute this experiment, toggle the `USE_LOAD_BALANCER` parameter in `all.ctrl` between `True` and `False`.
- **Fault tolerance:** Explore how different levels of fault tolerance affect the number of molecular failures across three replicates, detailed in `Experiments/EXP4_fault_taul/fault_taul_timings.csv`.
  - Setup: 10 nodes, each equipped with 40 CPUs.
  - Simulating early node crashes: run `scancel job_id` while the corresponding `job_id` is in pending (PD) status.
  - Simulating crashes during calculations: run `scancel job_id` while the `job_id` is in running (R) status.
  - Handling early crashes: run `python3 monitor_and_resubmit.py check_progress [job_id]`; the script automatically resubmits the crashed node.
  - Resuming partial calculations: run `python3 monitor_and_resubmit.py finish_and_resubmit [job_id]` to continue after a partial set of molecules has been processed.
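Putting the EXP2 settings together, the `submit.sh` header would look roughly like the sketch below; only the `--array` and `--ntasks-per-node` lines come from the experiment description above, and the account name is a placeholder:

```shell
#!/bin/bash
# Sketch of a submit.sh header for the EXP2 scaling run:
# 10 array tasks (nodes) x 40 CPUs per node = 400 CPUs total.
#SBATCH --account=your_account        # placeholder: your Slurm account
#SBATCH --array=1-10                  # one array task per node
#SBATCH --ntasks-per-node=40          # CPUs per node
```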
Please open a GitHub issue 😄 and be as clear and descriptive as possible. Feel free to also reach out directly: (akshat98m[AT]stanford[DOT]edu)