AHRD (Automated Assignment of Human Readable Descriptions) annotates proteins with descriptions and GO terms.
The preparations to run AHRD include making protein databases locally available, performing sequence similarity searches against these databases and creating configuration files.
Snakemake is a Python-based workflow management system that is used here to perform all steps necessary to run AHRD.
The user only has to provide the query FASTA files and supply the required bandwidth, storage space and computing power.
- Install the Python 3 version of Miniconda from here: https://docs.conda.io/en/latest/miniconda.html
- Install mamba and git:
conda install -c conda-forge -c bioconda mamba git
- Clone the AHRD_Snakemake pipeline:
git clone https://github.com/groupschoof/AHRD_Snakemake.git
- Create an empty conda environment:
conda create --name ahrd_snakemake
- … and use mamba to install packages in it faster than with conda:
mamba env update --name ahrd_snakemake --file AHRD_Snakemake/workflow/environment.yaml
- Activate the conda environment:
conda activate ahrd_snakemake
- Go to the resources subfolder:
cd AHRD_Snakemake/resources
- Provide the protein sequences that are to be annotated, e.g.:
wget ftp://ftp.ensemblgenomes.org/pub/release-51/plants/fasta/oryza_sativa/pep/Oryza_sativa.IRGSP-1.0.pep.all.fa.gz
gunzip Oryza_sativa.IRGSP-1.0.pep.all.fa.gz
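- Optionally, sanity-check the download by counting the protein sequences (a quick check with standard tools; the exact number depends on the release):
grep -c '^>' Oryza_sativa.IRGSP-1.0.pep.all.fa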
- Edit the species.tsv file to include the desired protein sequence files (see the example below):
nano species.tsv
(Put the name of the species in the first column and the path of the FASTA file in the second,
e.g.: Oryza_sativa resources/Oryza_sativa.IRGSP-1.0.pep.all.fa
Use tabs as the column separator!)
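With a second species added, the species.tsv would look like this (columns are tab-separated; the Hordeum_vulgare file name is only a placeholder, use the path of whatever FASTA you actually placed in resources/):
Oryza_sativa	resources/Oryza_sativa.IRGSP-1.0.pep.all.fa
Hordeum_vulgare	resources/Hordeum_vulgare.pep.all.fa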
- Go back to the main workflow folder:
cd ..
- Edit the workflow configuration file to adjust the CPU usage to your system (a hypothetical sketch follows below):
nano config/config.yaml
(Some parts of the workflow can make use of more resources than others. Small jobs don’t scale much beyond 8 cores, but large jobs should be given as many cores as you can reasonably spare.)
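The real key names are defined in config/config.yaml itself; purely as a hypothetical sketch (placeholder keys, not the actual ones), thread settings in such a file might look like this:
# hypothetical example only -- open config/config.yaml to see the real key names
threads_diamond: 32   # large jobs: give Diamond as many cores as you can spare
threads_ahrd: 8       # small jobs: little benefit beyond 8 cores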
1.1 Run the workflow using conda for runtime installation of dependencies (faster and should be tried first)
This will use all cores on the local system (change all to a number to restrict the CPU usage):
snakemake --use-conda --cores all
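For example, to preview the planned jobs with a dry run first and then restrict the actual run to 16 cores (both are standard Snakemake options):
snakemake -n --use-conda --cores all
snakemake --use-conda --cores 16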
- Once the workflow is finished, you will find your annotations (species.ahrd_output.tsv) in the results folder.
- First, Singularity needs to be installed, e.g.:
mamba install -c conda-forge singularity
- Then AHRD_Snakemake can be started using our Docker image:
snakemake --use-conda --use-singularity --cores all
You need a configuration profile to enable Snakemake to use your cluster computing environment (SLURM, LSF, etc.). See https://github.com/snakemake-profiles/doc for a list of available profiles. In this example we use SLURM:
- Create and enter a folder for cookiecutters
mkdir $HOME/.cookiecutters
cd $HOME/.cookiecutters
- Get the slurm cookiecutter and check out a commit known to work with Snakemake < 7.0:
git clone https://github.com/Snakemake-Profiles/slurm.git
cd slurm
git checkout e725a99
- Create and enter the snakemake configuration folder
mkdir $HOME/.config/snakemake
cd $HOME/.config/snakemake
- Install cookiecutter in a new conda environment:
conda create --name cookiecutter
conda activate cookiecutter
conda install -c conda-forge cookiecutter
- Use cookiecutter to generate the profile:
cookiecutter slurm
Answer the prompts as best you can; often the defaults work just fine.
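You can check what cookiecutter generated by looking into the new profile folder (its name is whatever you entered at the profile name prompt; the name below is just a placeholder):
ls $HOME/.config/snakemake/NAME_OF_YOUR_CLUSTER_CONFIGURATION_PROFILE/
cat $HOME/.config/snakemake/NAME_OF_YOUR_CLUSTER_CONFIGURATION_PROFILE/config.yaml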
- Start the workflow while specifying the number of parallel jobs and the name of your cluster configuration profile (don’t forget to first cd back to the workflow folder):
conda activate ahrd_snakemake
snakemake --use-conda --jobs 16 --profile NAME_OF_YOUR_CLUSTER_CONFIGURATION_PROFILE
- If you want to run this administrative process non-interactively, the subshell needs to have conda initialized and the environment activated:
bash -c "source ~/.bashrc; conda activate ahrd_snakemake; snakemake --use-conda --jobs 16 --profile NAME_OF_YOUR_CLUSTER_CONFIGURATION_PROFILE;" &> log.txt &
Don’t forget to remove the job from your shell’s control before you log out (disown %1, as shown below).
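After starting it this way, you can detach the job and follow the progress in the log file (standard shell job control; the log name matches the redirection above):
disown %1
tail -f log.txt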
- On HPC you can also use --use-conda --use-singularity if just using --use-conda fails (see the example below).
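Assuming the same cluster profile as above, the combined invocation is:
snakemake --use-conda --use-singularity --jobs 16 --profile NAME_OF_YOUR_CLUSTER_CONFIGURATION_PROFILE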
- A graph of the workflow’s rules can be generated with:
snakemake --rulegraph | dot -Tsvg > rulegraph.svg
| Folder    | Content                  | Size               |
|-----------|--------------------------|--------------------|
| resources | Reference GO annotations | 13 GB              |
|           | SwissProt                | 0.09 GB            |
|           | Uniref90                 | 30 GB              |
|           | Query FASTA files        | each 0.02-0.13 GB  |
| results   | SwissProt DiamondDB      | 2.74 GB            |
|           | Uniref90 DiamondDB       | 60 GB              |
|           | SwissProt search results | each 0.2-1.5 GB    |
|           | Uniref90 search results  | each 0.6-5 GB      |
| workflow  | Snakemake folder         | 1.6 GB             |
| Overall   |                          | ca. 130 GB         |
The resources are automatically downloaded (except the query FASTA files), so this also indicates the bandwidth / data usage requirements.
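If you want to compare the actual disk usage against these numbers, a simple du from the AHRD_Snakemake folder does the job (the results folder only exists once the workflow has started):
du -sh resources results workflow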
Formula:
projectedMemUsageInMB = 11000 + fastaSizeInMB * 540
Example 1 (Oryza sativa):
11000 + 23 MB * 540 = 23420 MB ≈ 23 GB
Example 2 (Hordeum vulgare):
11000 + 131 MB * 540 = 81740 MB ≈ 82 GB
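The same calculation as a small shell sketch, here applied to the rice FASTA from above (integer arithmetic, so the result is only a rough projection):
# projected AHRD memory usage in MB, using the formula above
fastaSizeInMB=$(du -m resources/Oryza_sativa.IRGSP-1.0.pep.all.fa | cut -f1)
echo "$((11000 + fastaSizeInMB * 540)) MB"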
| Step     | Task                      | Cores    | Time   |
|----------|---------------------------|----------|--------|
| Download | Reference GO annotations  |          | 4 min  |
|          | Uniref90                  |          | 3 h    |
| Diamond  | Create Uniref90 DiamondDB | 48 cores | 21 min |
|          | Search Rice in Uniref90   | 32 cores | 1.5 h  |
|          | Search Barley in Uniref90 | 32 cores | 4 h    |
| AHRD     | Annotate Rice             | 8 cores  | 3.3 h  |
|          | Annotate Barley           | 8 cores  | 3.5 h  |
Download times depend on your location relative to the UniProt servers, how busy the servers are and, of course, your connection.
Diamond scales very well: give it more cores and memory and it will put them to good use.
AHRD’s bottleneck is parsing the data; the actual annotation step is quick by comparison. The parsing is parallelized, but it does not scale very well.
See attached file LICENSE.txt for details.
Florian Boecker and Prof. Dr. Heiko Schoof
INRES Crop Bioinformatics
University of Bonn
Katzenburgweg 2
53115 Bonn
Germany