MakeGNU

A Snakemake pipeline for running WhatsGNU. This workflow downloads microbial genome sequences, annotates them with prokka, performs pangenome analysis with Roary, and investigates proteomic novelty with WhatsGNU.

Set Up

Installing the pipeline

git clone https://github.com/ArwaAbbas/MakeGNU
cd MakeGNU

Creating the environment

For most of the tools used in the pipeline, a separate conda environment is created when the rule runs. These dependencies are listed in Envs/. However, because of an issue in prokka involving an outdated tbl2asn binary, some manual setup is necessary at the moment. First, we'll create the base snakemake environment:

conda create -c bioconda -c conda-forge -n MakeGNU snakemake
conda activate MakeGNU

Then we'll add prokka to the base environment and manually replace the outdated script.

conda install -c conda-forge -c bioconda -c defaults prokka=1.14.5
wget ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/tbl2asn/linux64.tbl2asn.gz -O linux64.tbl2asn.gz 
gunzip linux64.tbl2asn.gz
mv linux64.tbl2asn ~/anaconda3/envs/MakeGNU/bin/tbl2asn
chmod +x ~/anaconda3/envs/MakeGNU/bin/tbl2asn

The path to the script being replaced may differ slightly depending on whether you're using anaconda, miniconda, etc.
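One way to avoid hard-coding the install location is to read CONDA_PREFIX, which conda sets when an environment is activated. The sketch below is illustrative only: tbl2asn_target is a hypothetical helper name, and the fallback simply mirrors the anaconda3 path used above.

```shell
# Hypothetical helper: print where tbl2asn should live for the active environment.
# CONDA_PREFIX is set by `conda activate`; if it is unset or empty, fall back to
# the anaconda3 path shown in the commands above.
tbl2asn_target() {
    printf '%s/bin/tbl2asn\n' "${CONDA_PREFIX:-$HOME/anaconda3/envs/MakeGNU}"
}

# Usage, after `conda activate MakeGNU` and `gunzip linux64.tbl2asn.gz`:
#   mv linux64.tbl2asn "$(tbl2asn_target)"
#   chmod +x "$(tbl2asn_target)"
```

With the MakeGNU environment active, this should resolve to the same bin directory regardless of which conda distribution is installed.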

Preparing inputs and configuration files

The pipeline currently needs these inputs from the user:

A config.yaml that contains some fields the user will modify to run.
Query proteome ".faa" files. The names of the queries are specified in the config file. If the user is beginning with the nucleotide sequence of a whole genome assembly (see below), they can optionally use prokka to annotate the genome and create the ".faa" files.
Two CSV files that map the names of .faa and .gff files (usually something like "GCA_#########.#.faa" or "GCA_#########.#.gff") to a biologist-friendly strain name. See the WhatsGNU documentation for more details.
A reference proteome for the organism of interest. Currently this is REQUIRED for MakeGNU to run.

A Working Example Using the Test Data

This is what the directory looks like before running any rules:

Data
- Query_fna (contains microbial genomes)
- ReferenceProteome (contains the reference proteome from a bacterial strain)
- Dummy_query (contains a small faa file used to help create the WhatsGNU database)
- strain_name_list_faa.csv
- strain_name_list_gff.csv
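
Before running any rules, it can help to confirm that the layout listed above is in place. The following is a minimal sketch, not part of the pipeline: check_inputs is a hypothetical helper, and it only tests that the five entries exist, not that their contents are valid.

```shell
# Hypothetical pre-flight check: report any entry from the Data layout above
# that is missing. Returns 0 when everything is present, 1 otherwise.
check_inputs() {
    data_dir="${1:-Data}"
    missing=0
    for d in Query_fna ReferenceProteome Dummy_query; do
        if [ ! -d "$data_dir/$d" ]; then
            echo "missing directory: $data_dir/$d"
            missing=1
        fi
    done
    for f in strain_name_list_faa.csv strain_name_list_gff.csv; do
        if [ ! -f "$data_dir/$f" ]; then
            echo "missing file: $data_dir/$f"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, from the MakeGNU root directory:
#   check_inputs Data && echo "inputs look complete"
```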

Annotating bacterial genomes to be queried

If you are starting with nucleotide sequences, this step will use prokka to annotate the genomes and pull out the ".faa" files to be used by WhatsGNU.

Execute the following in the MakeGNU root directory. This README can't cover every Snakemake parameter or error you may encounter, but here are some helpful tips: the -p flag prints the shell commands that will be executed; to do a dry run (see the commands without running them), pass -np; and to see the reason each rule is run, add -r.

snakemake all_query --cores 2 --use-conda --configfile test_config.yaml
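
To preview a target before committing to a full run, the same command can be combined with the -np flags mentioned above. The wrapper below is only a convenience sketch: preview_target is a hypothetical name, and the default config file is the test configuration used throughout this example.

```shell
# Hypothetical wrapper: dry-run a MakeGNU target, printing the shell commands
# (-p) without executing them (-n). The second argument optionally overrides
# the config file.
preview_target() {
    snakemake "$1" -np --cores 2 --use-conda --configfile "${2:-test_config.yaml}"
}

# Usage:
#   preview_target all_query
#   preview_target all_query my_config.yaml
```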

The directory structure should now look like this. Query_faa and Annotations are new:

Data
- Query_faa (contains your proteomes to be queried)
- Annotations
  - prokka_QUERY (contains all the outputs from prokka)
- Query_fna (contains microbial genomes)
- ReferenceProteome
- Dummy_query (contains a small faa file used to help create the WhatsGNU database)
- strain_name_list_faa.csv
- strain_name_list_gff.csv

Downloading and annotating reference genomes

snakemake download_genomes --cores 2 --use-conda --configfile test_config.yaml 
snakemake unzip_genome_files --cores 2 --configfile test_config.yaml
snakemake rename_genome_files --cores 2 --configfile test_config.yaml
snakemake all_database_processing --cores 2 --use-conda --configfile test_config.yaml
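
Because each of these steps depends on the one before it, it may be convenient to chain them so that a failure stops the sequence. This is a sketch under that assumption: run_database_steps is a hypothetical helper, and the flags simply mirror the four commands listed above.

```shell
# Hypothetical helper: run the four database steps in order and stop at the
# first one that fails. Flags mirror the individual commands listed above.
run_database_steps() {
    config="${1:-test_config.yaml}"
    snakemake download_genomes --cores 2 --use-conda --configfile "$config" &&
    snakemake unzip_genome_files --cores 2 --configfile "$config" &&
    snakemake rename_genome_files --cores 2 --configfile "$config" &&
    snakemake all_database_processing --cores 2 --use-conda --configfile "$config"
}

# Usage:
#   run_database_steps                  # uses test_config.yaml
#   run_database_steps my_config.yaml
```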

The directory structure should now look similar to this.

Data
- Genomes
- Query_faa
- Query_fna
- Annotations
- ReferenceProteome
- Dummy_query
- strain_name_list_faa.csv
- strain_name_list_gff.csv
- genome_list.txt
Results
- Annotations
  - prokka_GENOMEID (contains all prokka output files)
  - all_modified_faa
  - all_modified_gff

Once the reference database has been built, these database processing steps do not need to be rerun when you have additional genomes to analyze.

Creating a basic report

snakemake all_basic --cores 2 --use-conda --configfile test_config.yaml

Creating an ortholog report

snakemake analyze_pangenome --cores 2 --use-conda --configfile test_config.yaml 
snakemake roary_cleanup --cores 2 --configfile test_config.yaml

Once the pangenome analysis has been done on the reference genomes, the above steps do not need to be rerun when you have additional query genomes to analyze.

snakemake all_ortholog --cores 2 --use-conda --configfile test_config.yaml

Final directory structure should look like this:

Data
- Genomes
- Query_faa
- Query_fna
- Annotations
- ReferenceProteome
- Dummy_query
Results
- Annotations
  - prokka_GENOMEID
  - all_modified_faa
  - all_modified_gff
- Roary
- WhatsGNU_db
- WhatsGNU_basic_results
- WhatsGNU_ortholog_results

Visualization of WhatsGNU results

A full description of the types of plots created is available on the WhatsGNU GitHub.

snakemake all_histogram --cores 2 --use-conda --configfile test_config.yaml

New directory structure:

Data
- Genomes
- Query_faa
- Query_fna
- Annotations
- ReferenceProteome
- Dummy_query
Results
- Annotations
  - prokka_GENOMEID
  - all_modified_faa
  - all_modified_gff
- Roary
- WhatsGNU_db
- WhatsGNU_basic_results
  - Plots
- WhatsGNU_ortholog_results
  - Plots