AHRD Snakemake

AHRD (Automated Assignment of Human Readable Descriptions) annotates proteins with descriptions and GO terms.
The preparation needed to run AHRD includes making the reference protein databases available locally, performing sequence similarity searches against those databases, and creating the configuration files.

Snakemake is a Python-based workflow management tool that is used here to perform all steps necessary to run AHRD.
The user only has to provide the query FASTA files as well as the required bandwidth, storage space and computing power.

Table of contents

  1. Getting started
    1. Run the workflow using conda for runtime installation of dependencies
    2. Use Singularity to run the workflow in a Docker container
    3. Run the workflow in an HPC environment
  2. Workflow visualization
  3. Hardware requirements
    1. Storage Space
    2. AHRD’s Memory Usage
    3. Runtime Examples
  4. License
  5. Authors

1 Getting started

  • Install mamba and git:
    conda install -c conda-forge -c bioconda mamba git
  • Clone the AHRD_Snakemake pipeline:
    git clone https://github.com/groupschoof/AHRD_Snakemake.git
  • Create an empty conda environment:
    conda create --name ahrd_snakemake
  • … and use mamba to install packages in it faster than with conda:
    mamba env update --name ahrd_snakemake --file AHRD_Snakemake/workflow/environment.yaml
  • Activate the conda environment:
    conda activate ahrd_snakemake
  • Go to the resources subfolder:
    cd AHRD_Snakemake/resources
  • Provide the protein sequences that are to be annotated, e.g.:
    wget ftp://ftp.ensemblgenomes.org/pub/release-51/plants/fasta/oryza_sativa/pep/Oryza_sativa.IRGSP-1.0.pep.all.fa.gz
    gunzip Oryza_sativa.IRGSP-1.0.pep.all.fa.gz
  • Edit the species.tsv file to include the desired protein sequence files:
    nano species.tsv
    (Put the name of the species in the first column and the path of the FASTA file in the second,
    e.g.: Oryza_sativa resources/Oryza_sativa.IRGSP-1.0.pep.all.fa
    Use tabs as the column separator! See the example after this list.)
  • Go back to the main workflow folder:
    cd ..
  • Edit the workflow configuration file to adjust the CPU usage to your system:
    nano config/config.yaml
    (Parts of the workflow can make use of more resources than others. Small jobs don’t scale much beyond 8 cores. However, large jobs should be given as many cores as you can reasonably spare.)
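
If you prefer not to edit species.tsv by hand, the example entry can also be appended from inside the resources folder. This is only a sketch; it assumes species.tsv holds one tab-separated species per line with no header row (printf writes the tab character explicitly):

printf 'Oryza_sativa\tresources/Oryza_sativa.IRGSP-1.0.pep.all.fa\n' >> species.tsv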

1.1 Run the workflow using conda for runtime installation of dependencies (faster and should be tried first)

This will use all cores on the local system (change all to a number to restrict CPU usage):
snakemake --use-conda --cores all

  • Once the workflow is finished, you will find your annotations (species.ahrd_output.tsv) in the results folder.

1.2 Use Singularity to run the workflow in a Docker container (more stable but slower)

  • First, Singularity needs to be installed, e.g.:
    mamba install -c conda-forge singularity
  • Then AHRD_Snakemake can be started using our Docker image:
    snakemake --use-conda --use-singularity --cores all

1.3 Run the workflow in an HPC environment

You need a configuration profile to enable Snakemake to use your cluster computing environment (Slurm, LSF, etc.). See https://github.com/snakemake-profiles/doc for a list of available profiles. In this example we use Slurm:

  • Create and enter a folder for cookiecutters
    mkdir $HOME/.cookiecutters
    cd $HOME/.cookiecutters
  • Get the slurm cookiecutter and check out a commit known to work with Snakemake < 7.0
    git clone https://github.com/Snakemake-Profiles/slurm.git
    cd slurm
    git checkout e725a99
  • Create and enter the snakemake configuration folder
    mkdir $HOME/.config/snakemake
    cd $HOME/.config/snakemake
  • Install cookiecutter in a new conda environment:
    conda create --name cookiecutter
    conda activate cookiecutter
    conda install -c conda-forge cookiecutter
  • Use cookiecutter
    cookiecutter slurm
    Follow the questions and answer them as best you can. Often using the default works just fine.
  • Start the workflow while specifying the number of parallel jobs and the name of your cluster configuration profile (don’t forget to first cd back to the workflow folder):
    conda activate ahrd_snakemake
    snakemake --use-conda --jobs 16 --profile NAME_OF_YOUR_CLUSTER_CONFIGURATION_PROFILE
  • If you want to run this administrative process non-interactively, the subshell needs to have conda initialized and the environment activated:
    bash -c "source ~/.bashrc; conda activate ahrd_snakemake; snakemake --use-conda --jobs 16 --profile NAME_OF_YOUR_CLUSTER_CONFIGURATION_PROFILE;" &> log.txt &
    Don’t forget to remove the job from your shell’s control before you log out of it (disown %1).
  • On HPC you can also use --use-conda --use-singularity if just using --use-conda fails.

2 Workflow visualization

To render the workflow’s rule dependency graph as an SVG (the dot command is part of Graphviz):

snakemake --rulegraph | dot -Tsvg > rulegraph.svg
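
As an alternative view, Snakemake’s standard --dag option draws the full job graph (one node per job instead of one per rule):

snakemake --dag | dot -Tsvg > dag.svg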

3 Hardware requirements

3.1 Storage Space

Folder     Item                       Size
resources  Reference GO annotations   13 GB
           SwissProt                  0.09 GB
           Uniref90                   30 GB
           Query FASTA files          0.02-0.13 GB each
results    SwissProt DiamondDB        2.74 GB
           Uniref90 DiamondDB         60 GB
           SwissProt search results   0.2-1.5 GB each
           Uniref90 search results    0.6-5 GB each
workflow   Snakemake folder           1.6 GB
Overall                               ca. 130 GB

The resources are downloaded automatically (except the query FASTA files), so the sizes above also indicate the bandwidth / data usage requirements.
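
To check the actual footprint on disk after a run (a generic example, executed from the AHRD_Snakemake folder):

du -sh resources results workflow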

3.2 AHRD’s Memory Usage

Formula:
11000 + fastaSizeInMB * 540 = projectedMemUsageInMB

Example 1 (Oryza sativa, 23 MB FASTA):
11000 + 23 * 540 = 23420 MB ≈ 23 GB

Example 2 (Hordeum vulgare, 131 MB FASTA):
11000 + 131 * 540 = 81740 MB ≈ 82 GB
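
The same estimate can be computed directly from an uncompressed query FASTA. A minimal sketch, using the Oryza sativa file from the Getting started section as an example:

FASTA=resources/Oryza_sativa.IRGSP-1.0.pep.all.fa
SIZE_MB=$(du -m "$FASTA" | cut -f1)   # approximate FASTA size in MB
echo "projected memory: $(( 11000 + SIZE_MB * 540 )) MB"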

3.3 Runtime Examples

Step      Task                       Cores  Time
Download  Reference GO annotations   -      4 min
          Uniref90                   -      3 h
Diamond   Create Uniref90 DiamondDB  48     21 min
          Search Rice in Uniref90    32     1.5 h
          Search Barley in Uniref90  32     4 h
AHRD      Annotate Rice              8      3.3 h
          Annotate Barley            8      3.5 h

Download times depend on your location relative to the UniProt servers, how busy the servers are, and of course your connection.

Diamond scales very well. Give it more cores and memory and it will put them to good use.

AHRD’s bottleneck is parsing the data; the actual annotation step is quick by comparison. AHRD is parallelized, but it does not scale very well.

4 License

See attached file LICENSE.txt for details.

5 Authors

Florian Boecker and Prof. Dr. Heiko Schoof

INRES Crop Bioinformatics
University of Bonn
Katzenburgweg 2
53115 Bonn
Germany