AHRD (Automated Assignment of Human Readable Descriptions) annotates proteins with descriptions and GO terms.
The preparations to run AHRD include making protein databases locally available, performing sequence similarity searches against these databases and creating configuration files.
Snakemake is a Python-based workflow management system that is used here to perform all steps necessary to run AHRD.
The user only has to provide the query FASTA files and supply the required bandwidth, storage space and computing power.
- Install the Python 3 version of Miniconda from here: https://docs.conda.io/en/latest/miniconda.html
- Install mamba and git:
conda install -c conda-forge -c bioconda mamba git
- Clone the AHRD_Snakemake pipeline:
git clone https://github.com/groupschoof/AHRD_Snakemake.git
- Create an empty conda environment:
conda create --name ahrd_snakemake
- … and use mamba to install packages in it faster than with conda:
mamba env update --name ahrd_snakemake --file AHRD_Snakemake/workflow/environment.yaml
- Activate the conda environment:
conda activate ahrd_snakemake
- Go to the resources subfolder:
cd AHRD_Snakemake/resources
- Provide the protein sequences that are to be annotated, e.g.:
wget ftp://ftp.ensemblgenomes.org/pub/release-51/plants/fasta/oryza_sativa/pep/Oryza_sativa.IRGSP-1.0.pep.all.fa.gz
gunzip Oryza_sativa.IRGSP-1.0.pep.all.fa.gz
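- Optionally, sanity-check the download by counting the protein sequences (a quick check with standard tools; the exact number depends on the release):
grep -c '^>' Oryza_sativa.IRGSP-1.0.pep.all.fa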
- Edit the species.tsv file to include the desired protein sequence files (see the example below):
nano species.tsv
(Put the name of the species in the first column and the path of the FASTA file in the second,
e.g.: Oryza_sativa resources/Oryza_sativa.IRGSP-1.0.pep.all.fa
Use tabs as the column separator!)
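With a second species added, the species.tsv would look like this (columns are tab-separated; the Hordeum_vulgare file name is only a placeholder, use the path of whatever FASTA you actually placed in resources/):
Oryza_sativa	resources/Oryza_sativa.IRGSP-1.0.pep.all.fa
Hordeum_vulgare	resources/Hordeum_vulgare.pep.all.fa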
- Go back to the main workflow folder:
cd ..
- Edit the workflow configuration file to adjust the CPU usage to your system (a hypothetical sketch follows below):
nano config/config.yaml
(Some parts of the workflow can make use of more resources than others. Small jobs don’t scale much beyond 8 cores, but large jobs should be given as many cores as you can reasonably spare.)
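The real key names are defined in config/config.yaml itself; purely as a hypothetical sketch (placeholder keys, not the actual ones), thread settings in such a file might look like this:
# hypothetical example only -- open config/config.yaml to see the real key names
threads_diamond: 32   # large jobs: give Diamond as many cores as you can spare
threads_ahrd: 8       # small jobs: little benefit beyond 8 cores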
1.1 Run the workflow using conda for runtime installation of dependencies (faster and should be tried first)
This will use all cores on the local system (change all to a number to restrict the CPU usage):
snakemake --use-conda --cores all
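For example, to preview the planned jobs with a dry run first and then restrict the actual run to 16 cores (both are standard Snakemake options):
snakemake -n --use-conda --cores all
snakemake --use-conda --cores 16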
- Once the workflow is finished, you will find your annotations (species.ahrd_output.tsv) in the results folder.
- First, Singularity needs to be installed, e.g.:
mamba install -c conda-forge singularity
- Then AHRD_Snakemake can be started using our Docker image:
snakemake --use-conda --use-singularity --cores all
You need a configuration profile to enable Snakemake to use your cluster computing environment (SLURM, LSF, etc.). See https://github.com/snakemake-profiles/doc for a list of available profiles. In this example we use SLURM:
- Create and enter a folder for cookiecutters
mkdir $HOME/.cookiecutters
cd $HOME/.cookiecutters
- Get the slurm cookiecutter and check out a commit known to work with Snakemake < 7.0:
git clone https://github.com/Snakemake-Profiles/slurm.git
cd slurm
git checkout e725a99
- Create and enter the snakemake configuration folder
mkdir $HOME/.config/snakemake
cd $HOME/.config/snakemake
- Install cookiecutter in a new conda environment:
conda create --name cookiecutter
conda activate cookiecutter
conda install -c conda-forge cookiecutter
- Use cookiecutter to generate the profile:
cookiecutter slurm
Answer the prompts as best you can; often the defaults work just fine.
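You can check what cookiecutter generated by looking into the new profile folder (its name is whatever you entered at the profile name prompt; the name below is just a placeholder):
ls $HOME/.config/snakemake/NAME_OF_YOUR_CLUSTER_CONFIGURATION_PROFILE/
cat $HOME/.config/snakemake/NAME_OF_YOUR_CLUSTER_CONFIGURATION_PROFILE/config.yaml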
- Start the workflow while specifying the number of parallel jobs and the name of your cluster configuration profile (don’t forget to first cd back to the workflow folder):
conda activate ahrd_snakemake
snakemake --use-conda --jobs 16 --profile NAME_OF_YOUR_CLUSTER_CONFIGURATION_PROFILE
- If you want to run this administrative process non-interactively, the subshell needs to have conda initialized and the environment activated:
bash -c "source ~/.bashrc; conda activate ahrd_snakemake; snakemake --use-conda --jobs 16 --profile NAME_OF_YOUR_CLUSTER_CONFIGURATION_PROFILE;" &> log.txt &
Don’t forget to remove the job from your shell’s control before you log out (disown %1, as shown below).
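After starting it this way, you can detach the job and follow the progress in the log file (standard shell job control; the log name matches the redirection above):
disown %1
tail -f log.txt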
- On HPC you can also use --use-conda --use-singularity if just using --use-conda fails (see the example below).
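Assuming the same cluster profile as above, the combined invocation is:
snakemake --use-conda --use-singularity --jobs 16 --profile NAME_OF_YOUR_CLUSTER_CONFIGURATION_PROFILE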
- A graph of the workflow’s rules can be generated with:
snakemake --rulegraph | dot -Tsvg > rulegraph.svg
| Folder    | Content                  | Size               |
|-----------|--------------------------|--------------------|
| resources | Reference GO annotations | 13 GB              |
|           | SwissProt                | 0.09 GB            |
|           | Uniref90                 | 30 GB              |
|           | Query FASTA files        | each 0.02-0.13 GB  |
| results   | SwissProt DiamondDB      | 2.74 GB            |
|           | Uniref90 DiamondDB       | 60 GB              |
|           | SwissProt search results | each 0.2-1.5 GB    |
|           | Uniref90 search results  | each 0.6-5 GB      |
| workflow  | Snakemake folder         | 1.6 GB             |
| Overall   |                          | ca. 130 GB         |
The resources are automatically downloaded (except the query FASTA files), so this also indicates the bandwidth / data usage requirements.
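If you want to compare the actual disk usage against these numbers, a simple du from the AHRD_Snakemake folder does the job (the results folder only exists once the workflow has started):
du -sh resources results workflow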
Formula:
projectedMemUsageInMB = 11000 + fastaSizeInMB * 540
Example 1 (Oryza sativa):
11000 + 23 MB * 540 = 23420 MB ≈ 23 GB
Example 2 (Hordeum vulgare):
11000 + 131 MB * 540 = 81740 MB ≈ 82 GB
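The same calculation as a small shell sketch, here applied to the rice FASTA from above (integer arithmetic, so the result is only a rough projection):
# projected AHRD memory usage in MB, using the formula above
fastaSizeInMB=$(du -m resources/Oryza_sativa.IRGSP-1.0.pep.all.fa | cut -f1)
echo "$((11000 + fastaSizeInMB * 540)) MB"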
| Step     | Task                      | Cores    | Time   |
|----------|---------------------------|----------|--------|
| Download | Reference GO annotations  |          | 4 min  |
|          | Uniref90                  |          | 3 h    |
| Diamond  | Create Uniref90 DiamondDB | 48 cores | 21 min |
|          | Search Rice in Uniref90   | 32 cores | 1.5 h  |
|          | Search Barley in Uniref90 | 32 cores | 4 h    |
| AHRD     | Annotate Rice             | 8 cores  | 3.3 h  |
|          | Annotate Barley           | 8 cores  | 3.5 h  |
Download times depend on your location relative to the UniProt servers, how busy the servers are and, of course, your connection.
Diamond scales very well: give it more cores and memory and it will put them to good use.
AHRD’s bottleneck is parsing the data; the actual annotation step is quick by comparison. The parsing is parallelized, but it does not scale very well.
See attached file LICENSE.txt for details.
Florian Boecker and Prof. Dr. Heiko Schoof
INRES Crop Bioinformatics
University of Bonn
Katzenburgweg 2
53115 Bonn
Germany