MakeGNU is a Snakemake pipeline for running WhatsGNU. The workflow downloads microbial genome sequences, annotates them with prokka, performs pangenome analysis with Roary, and investigates proteomic novelty with WhatsGNU.
git clone https://github.com/ArwaAbbas/MakeGNU
cd MakeGNU
For most of the tools used in the pipeline, a separate conda environment is created when the rule runs. These dependencies are listed in Envs/. However, because of an issue in prokka (its bundled tbl2asn is outdated), a little bit of finagling is necessary at the moment. First, we'll create the base snakemake environment:
conda create -c bioconda -c conda-forge -n MakeGNU snakemake
conda activate MakeGNU
Then we'll add prokka to the base environment and manually replace the outdated tbl2asn binary.
conda install -c conda-forge -c bioconda -c defaults prokka=1.14.5
wget ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/tbl2asn/linux64.tbl2asn.gz -O linux64.tbl2asn.gz
gunzip linux64.tbl2asn.gz
mv linux64.tbl2asn ~/anaconda3/envs/MakeGNU/bin/tbl2asn
chmod +x ~/anaconda3/envs/MakeGNU/bin/tbl2asn
The path to the tbl2asn file you are replacing may be slightly different depending on whether you're using anaconda, miniconda, etc.
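If you are unsure where your conda installation keeps the environment, the path can be derived from CONDA_PREFIX, which conda sets when an environment is activated. The following is a minimal sketch, assuming the MakeGNU environment is active. The helper name tbl2asn_target is made up for illustration and is not part of the pipeline.

```shell
# Print the path where tbl2asn should live for a conda environment prefix.
# Uses the first argument if given, otherwise $CONDA_PREFIX (exported by `conda activate`).
# tbl2asn_target is a hypothetical helper, not something MakeGNU ships.
tbl2asn_target() {
    local prefix="${1:-${CONDA_PREFIX:-}}"
    if [ -z "$prefix" ]; then
        echo "No conda environment is active; run 'conda activate MakeGNU' first." >&2
        return 1
    fi
    printf '%s/bin/tbl2asn\n' "$prefix"
}

# Usage, after `conda activate MakeGNU` and `gunzip linux64.tbl2asn.gz`:
#   target="$(tbl2asn_target)" && mv linux64.tbl2asn "$target" && chmod +x "$target"
```

This avoids hard-coding ~/anaconda3, so the same two commands should work regardless of which conda distribution created the environment.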
The pipeline currently needs these inputs from the user:
- A config.yaml that contains some fields the user will modify to run.
- Query proteome ".faa" files. The names of the queries are specified in the config file. If the user is starting from the nucleotide sequence of a whole genome assembly (see below), they can optionally use prokka to annotate the genome and create the ".faa" files.
- Two CSV files that map the names of .faa and .gff files (usually something like "GCA_#########.#.faa/gff") to biologist-friendly strain names. See the WhatsGNU documentation for more details.
- A reference proteome for the organism of interest. Currently this is REQUIRED for MakeGNU to run.
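To illustrate what the name-mapping CSV files do, here is a small lookup sketch. It assumes a plain two-column layout (file name, then strain name, comma-separated, with no header or quoting). That layout and the helper name lookup_strain are assumptions for illustration only, so check the WhatsGNU documentation for the exact format it expects.

```shell
# Look up the strain name for a file name in a two-column CSV.
# ASSUMPTION: each row is "<file name>,<strain name>" with no header and no quoting.
# lookup_strain is a hypothetical helper, not part of MakeGNU or WhatsGNU.
lookup_strain() {
    local csv="$1" file_name="$2"
    # Print the second column of the first matching row; exit non-zero if none matches.
    awk -F',' -v key="$file_name" '
        $1 == key { print $2; found = 1; exit }
        END { exit !found }
    ' "$csv"
}

# Usage:
#   lookup_strain strain_name_list_faa.csv GCA_000000001.1.faa
```

A quick lookup like this can help confirm that every query file has an entry before starting a long run.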
This is what the directory looks like before running any rules:
- Data
- Query_fna (contains microbial genomes)
- ReferenceProteome (contains the reference proteome from a bacterial strain)
- Dummy_query (contains a small faa file used to help create the WhatsGNU database)
- strain_name_list_faa.csv
- strain_name_list_gff.csv
If you are starting with nucleotide sequences, this step will use prokka to annotate the genomes and pull out the ".faa" files to be used by WhatsGNU.
Execute the following in the MakeGNU root directory. This README can't go over every single Snakemake parameter or error you may encounter, but here are some helpful tips: the -p flag prints the shell commands that will be executed; to do a dry run (see the commands without running them), pass -np; and if you want to see the reason each rule is run, use -r.
snakemake all_query --cores 2 --use-conda --configfile test_config.yaml
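One way to combine the dry-run tip with the command above is to preview a target first and only execute it if the preview succeeds. This is a sketch, not part of the pipeline: the function name run_with_preview and the SNAKEMAKE override variable are hypothetical, and the flags are the ones already used in this README.

```shell
# Dry-run a target, then execute it only if the dry run succeeded.
# SNAKEMAKE is a hypothetical override for the executable; it defaults to `snakemake`.
run_with_preview() {
    local target="$1" config="${2:-test_config.yaml}"
    local smk="${SNAKEMAKE:-snakemake}"
    # -n: dry run, -p: print the shell commands that would be executed.
    "$smk" "$target" -np --cores 2 --use-conda --configfile "$config" || return 1
    # Real run, still printing shell commands as they execute.
    "$smk" "$target" -p --cores 2 --use-conda --configfile "$config"
}

# Usage:
#   run_with_preview all_query test_config.yaml
```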
The directory structure should now look like this. New output compared with the listing above: Query_faa, Annotations, and prokka_QUERY.
- Data
- Query_faa (contains your proteomes to be queried)
- Annotations
- prokka_QUERY (contains all the outputs from prokka)
- Query_fna (contains microbial genomes)
- ReferenceProteome
- Dummy_query (contains a small faa file used to help create the WhatsGNU database)
- strain_name_list_faa.csv
- strain_name_list_gff.csv
snakemake download_genomes --cores 2 --use-conda --configfile test_config.yaml
snakemake unzip_genome_files --cores 2 --configfile test_config.yaml
snakemake rename_genome_files --cores 2 --configfile test_config.yaml
snakemake all_database_processing --cores 2 --use-conda --configfile test_config.yaml
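The four commands above are meant to run in order. A small wrapper can stop at the first failure so that later rules do not start from incomplete output. This is only a sketch: run_in_order and the SNAKEMAKE override are hypothetical names, and for simplicity it passes --use-conda to every rule, including the two that the commands above run without it.

```shell
# Run Snakemake rules one after another, stopping at the first failure.
# First argument: config file; remaining arguments: rule names, in order.
# SNAKEMAKE is a hypothetical override for the executable; it defaults to `snakemake`.
run_in_order() {
    local config="$1"
    shift
    local smk="${SNAKEMAKE:-snakemake}" rule
    for rule in "$@"; do
        echo ">>> running rule: $rule" >&2
        "$smk" "$rule" --cores 2 --use-conda --configfile "$config" || {
            echo "Rule $rule failed; stopping." >&2
            return 1
        }
    done
}

# Usage:
#   run_in_order test_config.yaml download_genomes unzip_genome_files \
#       rename_genome_files all_database_processing
```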
The directory structure should now look similar to this.
- Data
- Genomes
- Query_faa
- Query_fna
- Annotations
- ReferenceProteome
- Dummy_query
- strain_name_list_faa.csv
- strain_name_list_gff.csv
- genome_list.txt
- Results
- Annotations
- prokka_GENOMEID (contains all prokka output files)
- all_modified_faa
- all_modified_gff
- Annotations
Once the reference database has been built, and you have additional genomes to analyze, these database processing steps do not need to be rerun.
snakemake all_basic --cores 2 --use-conda --configfile test_config.yaml
snakemake analyze_pangenome --cores 2 --use-conda --configfile test_config.yaml
snakemake roary_cleanup --cores 2 --configfile test_config.yaml
Once the pangenome analysis has been done on the reference genomes, and you have additional query genomes to analyze, the above steps do not need to be rerun.
snakemake all_ortholog --cores 2 --use-conda --configfile test_config.yaml
Final directory structure should look like this:
- Data
- Genomes
- Query_faa
- Query_fna
- Annotations
- ReferenceProteome
- Dummy_query
- Results
- Annotations
- prokka_GENOMEID
- all_modified_faa
- all_modified_gff
- Roary
- WhatsGNU_db
- WhatsGNU_basic_results
- WhatsGNU_ortholog_results
- Annotations
A full description of the types of plots created is available on the WhatsGNU GitHub.
snakemake all_histogram --cores 2 --use-conda --configfile test_config.yaml
New directory structure:
- Data
- Genomes
- Query_faa
- Query_fna
- Annotations
- ReferenceProteome
- Dummy_query
- Results
- Annotations
- prokka_GENOMEID
- all_modified_faa
- all_modified_gff
- Roary
- WhatsGNU_db
- WhatsGNU_basic_results
- Plots
- WhatsGNU_ortholog_results
- Plots
- Annotations
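To check that a run produced the directories listed above, the names can be searched for below the project root. The sketch below looks directories up by name only, so it makes no assumption about how they are nested. The helper name missing_outputs is hypothetical.

```shell
# Print the names of expected output directories that do not exist anywhere below a root.
# Returns 0 if all were found, 1 if at least one is missing.
# missing_outputs is a hypothetical helper, not part of MakeGNU.
missing_outputs() {
    local root="$1"
    shift
    local name status=0
    for name in "$@"; do
        # Search by directory name so no particular nesting is assumed.
        if [ -z "$(find "$root" -type d -name "$name" | head -n 1)" ]; then
            echo "$name"
            status=1
        fi
    done
    return "$status"
}

# Usage, from the MakeGNU root directory:
#   missing_outputs . Roary WhatsGNU_db WhatsGNU_basic_results WhatsGNU_ortholog_results Plots
```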