kb_iMAG-viz

Under active development as of July 2019 [use at your own risk!]

Workflow Description

This package of scripts takes genome annotation data as input and performs a taxon-specific assessment of genome quality.

The core steps are as follows:

1. Annotate your query genomes and reference genomes using the SAME tool for all genomes (e.g., RAST, Prokka, etc.).
2. Munge the genome annotation data so that each file has one genome per line, with the text string of each annotation separated by tabs.
3. Generate genome completeness and contamination estimates for all genomes using CheckM.
4. Generate taxonomic identifications for all genomes using GTDB-Tk.
5. Combine annotation and genome quality information into a single large table.
6. Choose a taxonomic level (e.g., Phylum) and divide the single large table into multiple tables, where each table corresponds to one taxon at that level (e.g., p__Crenarchaeota, p__Altiarchaeota, p__Euryarchaeota, etc.).
7. For each taxon, generate a presence/absence count table for the annotations represented.
8. Perform dimensional reduction of the annotation count tables.
9. Plot the dimensional reduction results, with points colored by taxonomy and shaped by genome type (i.e., isolate, SAG, MAG).
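
To make the table-splitting and presence/absence steps above more concrete, here is a minimal Python sketch. It is not the code used by kb_iMAG-viz. The input layout (genome ID in the first tab-separated field, annotations in the rest), the example annotations, the lineage strings, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict

# Made-up example data: genome ID first, then one annotation string
# per tab-separated field (assumed layout, not the package's real format).
ANNOTATION_LINES = [
    "genome_A\tDNA polymerase\tATP synthase\tDNA polymerase",
    "genome_B\tATP synthase\tRibosomal protein L2",
    "genome_C\tDNA polymerase",
]

# Made-up GTDB-style lineage strings, one per genome.
TAXONOMY = {
    "genome_A": "d__Archaea;p__Crenarchaeota",
    "genome_B": "d__Archaea;p__Crenarchaeota",
    "genome_C": "d__Archaea;p__Euryarchaeota",
}


def parse_annotations(lines):
    """Map each genome ID to the set of annotation strings on its line."""
    genomes = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        genomes[fields[0]] = {f for f in fields[1:] if f}
    return genomes


def taxon_at_rank(lineage, prefix="p__"):
    """Return the lineage entry starting with the rank prefix, or None."""
    for entry in lineage.split(";"):
        if entry.strip().startswith(prefix):
            return entry.strip()
    return None


def presence_absence_by_taxon(genomes, taxonomy, prefix="p__"):
    """Build one presence/absence table per taxon at the chosen rank."""
    groups = defaultdict(dict)
    for genome_id, annotations in genomes.items():
        taxon = taxon_at_rank(taxonomy.get(genome_id, ""), prefix)
        groups[taxon][genome_id] = annotations

    tables = {}
    for taxon, members in groups.items():
        # Columns are the annotations seen in at least one genome of the taxon.
        columns = sorted(set().union(*members.values()))
        rows = {
            genome_id: [1 if col in anns else 0 for col in columns]
            for genome_id, anns in members.items()
        }
        tables[taxon] = (columns, rows)
    return tables


tables = presence_absence_by_taxon(parse_annotations(ANNOTATION_LINES), TAXONOMY)
for taxon, (columns, rows) in tables.items():
    print(taxon, columns)
    for genome_id, row in rows.items():
        print("  ", genome_id, row)
```

Each table only has columns for annotations that occur in that taxon, so table widths differ between taxa.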

For the test data and running instructions below, steps 1-4 have already been run, so you are effectively starting at step 5.
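
As a rough illustration of the dimensional reduction step in the workflow, the sketch below runs a PCA on a tiny presence/absence matrix using numpy's singular value decomposition. This is a generic PCA, not necessarily how kb_iMAG-viz implements its pca option. The function name and the example matrix are made up.

```python
import numpy as np


def pca_scores(matrix, n_components=2):
    """Project the rows of a genomes-by-annotations matrix onto its
    top principal components (generic PCA via SVD)."""
    X = np.asarray(matrix, dtype=float)
    X = X - X.mean(axis=0)  # center each annotation column
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    total = (S ** 2).sum()
    # Fraction of total variance captured by each retained component.
    if total > 0:
        explained = (S[:n_components] ** 2) / total
    else:
        explained = np.zeros_like(S[:n_components])
    return scores, explained


# Made-up table: 4 genomes (rows) x 3 annotations (columns).
table = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]
scores, explained = pca_scores(table)
print(scores.shape)
print(explained)
```

The two columns of `scores` are the coordinates one would plot, with one point per genome.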

Installation

Requirements (eventually the only requirement will be Docker, but for now one must install these Python and R packages manually):

Python3
Python packages (pandas, numpy)
R
R packages (ggplot2, ggpubr)
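
A quick way to confirm the Python requirements are in place before running anything is sketched below. It uses only the standard library; the helper name is made up, and it does not check the R packages.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["pandas", "numpy"])
print("Missing Python packages:", missing or "none")
```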

Running Instructions

Clone this repo

git clone https://github.com/jungbluth/kb_iMAG-viz

Set application location as a variable

PATH_TO_KB_IMAG_VIZ="/Applications/ResearchSoftware/kb_iMAG-viz"

Change permissions to executable

chmod +x ${PATH_TO_KB_IMAG_VIZ}/kb_iMAG-viz-workflow.py

Optional: if running on the test data, skip the time-consuming count-table generation step by copying the precomputed count tables to your current working directory. If you do this, set the --generate_count_tables flag to 'n' in the "Run application" step below.

cp ${PATH_TO_KB_IMAG_VIZ}/test/output/*count-data* ./

Run application

${PATH_TO_KB_IMAG_VIZ}/kb_iMAG-viz-workflow.py \
-i ${PATH_TO_KB_IMAG_VIZ}/test/query-genomes/TARA-MAGs_Delmont-Archaea-only-2017.RAST.txt \
--taxa_level Phylum \
--save_master_table Yes \
--path_to_kb_imagviz ${PATH_TO_KB_IMAG_VIZ} \
--generate_count_tables n \
--dimensional_reduction_method pca \
--plotting_method ggplot

Sweet, it worked! Grab a beer and review the newly-produced pdf files to learn something about your genomez. :)
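
If you would rather launch the run from Python (for example, to loop over several input tables), the command shown above can be assembled as an argument list. This is only a sketch: the function name and its parameters are made up, and it uses only the flags and values that appear in the example command. It prints the command rather than executing it.

```python
import os
import shlex


def build_command(install_dir, input_table, taxa_level="Phylum",
                  generate_count_tables="n"):
    """Assemble the workflow command as an argument list for subprocess.run."""
    script = os.path.join(install_dir, "kb_iMAG-viz-workflow.py")
    return [
        script,
        "-i", input_table,
        "--taxa_level", taxa_level,
        "--save_master_table", "Yes",
        "--path_to_kb_imagviz", install_dir,
        "--generate_count_tables", generate_count_tables,
        "--dimensional_reduction_method", "pca",
        "--plotting_method", "ggplot",
    ]


install_dir = "/Applications/ResearchSoftware/kb_iMAG-viz"
input_table = os.path.join(
    install_dir, "test", "query-genomes",
    "TARA-MAGs_Delmont-Archaea-only-2017.RAST.txt",
)
cmd = build_command(install_dir, input_table)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```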