/FragFold

Primary LanguagePythonMIT LicenseMIT

Fragment binding prediction with ColabFold

Scripts for predicting how short fragments of natural proteins bind to full-length proteins, as described in the manuscript. This program is built on top of MMseqs2 and ColabFold, extending them to efficiently predict interactions between a full-length protein and fragments derived from a protein.

This code is associated with the following article: A. Savinov, S. Swanson, A. E. Keating, G.-W. Li. High-throughput computational discovery of inhibitory protein fragments with AlphaFold. bioRxiv (2023). doi: 10.1101/2023.12.19.572389. https://www.biorxiv.org/content/10.1101/2023.12.19.572389v1

Please cite this article if you make use of FragFold.

Associated Source Data can also be found here: https://figshare.com/articles/dataset/Source_Data_for_Savinov_and_Swanson_et_al_2023/24841269
doi: 10.6084/m9.figshare.24841269

Installing FragFold

1. Install local colabfold

The locally installable version of ColabFold is used to predict structures. Go to the repo and follow the instructions to install ColabFold on your system (this can take >1 hr). You can ignore the recommendation to edit your ~/.bashrc to add local ColabFold to your PATH, this is handled by FragFold.

We developed and tested FragFold on this specific commit. Note that CUDA 11.8 is required to use a GPU with ColabFold.

Log the output while installing to verify that each step of the process completed successfully, this can be particularly helpful for debugging issues later on.

bash install_colabbatch_linux.sh | tee install_colabbatch_linux.log

Once installation has completed successfully, run a small test job to verify that it's working. The following example is taken from the localcolabfold README.

/path/to/installation/colabfold_batch example.fasta colabfold_batch_test

Where example.fasta is a short amino acid sequence

>sp|P61823
MALKSLVLLSLLVLVLLLVRVQPSLGKETAAAKFERQHMDSSTSAASSSNYCNQMMKSRN
LTKDRCKPVNTFVHESLADVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKYPN
CAYKTTQANKHIIVACEGNPYVPVHFDASV

If you're running on a system with a NVIDIA gpu, look for Running on GPU in the output. If you see that not GPU is detected, check if one is currently available with nvidia-smi. If you're still having issues, look back at the logs to check if the version of tensorflow that was installed is compatible with CUDA 11.8.

2. Install FragFold

Use the following commands to clone the repo and install FragFold and its dependencies on your computer.

git clone https://github.com/swanss/FragFold.git
cd FragFold
bash install_fragfold.sh

install_fragfold.sh uses conda to set up an environment containing FragFold as well as nextflow, the workflow system used to run jobs. The script should print installation complete, if not, there was an issue during one of steps.

Running FragFold

FragFold uses nextflow to manage the execution of the pipeline. MSA building with MMseq2s, MSA processing/concatenation, ColabFold, and the analysis and aggregation are each defined as individual processes in FragFold/nextflow/modules.nf and then used to construct workflows for modeling homomeric or heteromeric interactions between fragments and full-length proteins, e.g. FragFold/nextflow/heteromeric_fragments.nf. The details of execution, such as whether a process should be run locally or submitted to a job queue in an HPC environment is controlled by the config file FragFold/nextflow/nextflow.config.

FtsZ example

The following example demonstrates how to model homomeric interactions between full-length FtsZ (with slight truncations of the terminal residues) with 30aa fragments derived from the protein.

Setting system-specific variables

Before running the pipeline, edit FragFold/nextflow/nextflow.config. First, set general parameters according to your install.

// Define system-specific variables for all processes
executor.queueSize = 50
executor.conda = '/home/user/mambaforge/envs/fragfold'
env.repo_dir = '/home/gridsan/user/FragFold'
env.colabfold_dir = '/home/gridsan/user/localcolabfold/colabfold-conda'
env.alphafold_params_dir = '/home/gridsan/user/localcolabfold/colabfold'

Tips

  • You can get the location of your FragFold conda install with conda info --envs
  • colabfold_dir will be determined based on where you ran the install
  • alphafold_params_dir can be found with the colabfold directory

Creating process profiles

Next you will need to edit the process profiles (in FragFold/nextflow/nextflow.config) according to your HPC environment.

withLabel: cpu_network {
    executor = 'slurm'
    time = '1h'
    cpus = 1
    memory = '4 GB'
    queue = 'download' // edit this value
    clusterOptions = '--ntasks-per-node=1'
}

In this example, we define a process profile with the label cpu_network. If you're using a HPC with SLURM, you will only need to change queue to a partition on your HPC (see all partitions with sinfo). Note that the nodes in the partition set for cpu_network must have network access, as this is required for contacting the mmseqs server.

Repeat this for all of the profiles that are defined in the config file. Note that if you have nodes with different memory/cpu/time limits, you will need to adjust these values to match those.

Tips

  • If you'd like to run all steps locally, go to FragFold/nextflow/modules.nf and replace the labels for each process with 'standard'
  • If your cluster uses a different resource manager, check the nextflow executor docs to see if it is supported and edit the profiles accordingly.

Setting job-specific variables

Now we set the job-specific parameters that control FragFold execution. In order to run the example FragFold/nextflow/ftsZ_homomeric_example.nf, you will only need to edit the path of the query sequence to match your system.

// Define parameters
params.job_name = "ftsZ_test"
params.query_seq = "path/to/repo/FragFold/example/ftsZ.fasta" // edit this path according to your repo
params.fragment_ntermres_start = 160
params.fragment_ntermres_final = 170
params.fragment_length = 30
params.protein_nterm_res = 150
params.protein_cterm_res = 200
params.protein_copies = 1

Running nextflow.

Run nextflow with the following command:

NEXTFLOWDIR=/home/gridsan/user/FragFold/nextflow #directory containing nextflow scripts
WORKDIR=/home/gridsan/user/FragFold/example #directory where results will be stored
nextflow run ${NEXTFLOWDIR}/ftsZ_homomeric_example.nf -w $WORKDIR -resume

After the job is completed you should see output like this:

(fragfold) user@d-19-1-1:/state/partition1/user/user$ nextflow run ${NEXTFLOWDIR}/ftsZ_homomeric_example.nf -w $WORKDIR -resume
N E X T F L O W  ~  version 23.10.1
Launching `/home/gridsan/user/keatinglab_shared/user/savinovCollaboration/FragFold/nextflow/ftsZ_homomeric_example.nf` [tiny_agnesi] DSL2 - revision: f2c61140f9
executor >  slurm (1)
[b5/8446c6] process > build_msa          [100%] 1 of 1 ✔
[51/8c5e4b] process > process_msa        [100%] 1 of 1 ✔
[66/32bf2a] process > colabfold (5)      [100%] 11 of 11 ✔
[89/a44267] process > create_summary_csv [100%] 1 of 1 ✔

To inspect the output of intermediate steps, use the names of the processes to find the directory. For example to find output from colabfold, go to /home/gridsan/user/FragFold/nextflow/practice/66/32bf2a*.

The final output csv (*_colabfold_predictions.csv) is copied to the working directory where the nextflow job was submitted.

A note on file systems and NextFlow compatibility

Nextflow uses file locking to store task metadata and can only be run in a directory that has locking. This is not compatible with certain file system types, such as Lustre. To check what file systems are available on your system, use findmnt, the output will report the file system type for available directories and also describe whether it is available in the OPTIONS column (If you see noflock, then you know that locking is not supported).

One workaround for this is to run nextflow on a file system that supports locking (e.g. a local filesystem) for task metadata storage while running the processes/and storing the output on a shared lustre directory. This works because the directory where results are stored does not need file locking (as outputs are always stored in separate directories to avoid collisions). To do this, simply run the nextflow command in a directory with locking and add the -w argument to specify the working directory.

As an example, this is how we start nextflow on our HPC:

LLsub -i --time=1-00:00 --partition=xeon-p8
USER=$(whoami) && mkdir -p /state/partition1/user/$USER && cd /state/partition1/user/$USER && conda activate fragfold
NEXTFLOWDIR=/home/gridsan/user/FragFold/nextflow #directory containing nextflow scripts
WORKDIR=/home/gridsan/user/FragFold/example #directory where results will be stored
nextflow run ${NEXTFLOWDIR}/ftsZ_homomeric_example.nf -w $WORKDIR -resume

Submitting nextflow as a job for large colabfold jobs

For most users, it will take days for all the submitted colabfold jobs to complete. In order to keep nextflow running until all the processes are complete, we submit it as a job with a long time limit. See the example script: TODO

Downstream analysis

Given the colabfold predictions, you can predict peaks by running predict_alphafold_peaks.py.

cd example/ftsZ_predictpeaks
sbatch run_predict_alphafold_peaks.sh