Make sure you have some kind of conda installation set up. I like
miniconda.
https://www.docs.conda.io/en/latest/miniconda.html
Once you have miniconda, install gifrop.
From the command line:
conda create -n gifrop
conda activate gifrop
conda install -c conda-forge -c bioconda -c defaults gifrop
# Test that gifrop was installed correctly
gifrop --help
pan_pipe --help
If you want to follow along you can clone the repo I’ll be working from.
git clone https://www.github.com/jtrachsel/gifrop_examples.git
cd gifrop_examples
- Two strains of Salmonella enterica
- LT2: famous reference strain
- USDA15WA-1: Isolate under investigation in the Bearson Lab
Two genomic islands of interest here: an 80 kb metal tolerance island (SGI4) and a 28 kb multidrug resistance module.
For this example I have already annotated the genomes with prokka and generated the pangenome with Roary.
Show gene_presence_absence.csv
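A quick way to see which genomes ended up in the pangenome is to pull the isolate names out of the header row. This is only a sketch: `list_isolates` is a made-up helper (not part of gifrop or Roary), and it assumes the usual Roary layout where the first 14 columns are per-gene metadata and every column after that is one isolate.

```shell
# list_isolates FILE: hypothetical helper, not part of gifrop or Roary.
# Prints one isolate name per line from the header of a Roary
# gene_presence_absence.csv. Assumes 14 metadata columns come first
# and that no header field contains a comma.
list_isolates() {
  head -n 1 "$1" | tr -d '"\r' | tr ',' '\n' | tail -n +15
}

# Example call (adjust the path to wherever the file lives):
#   list_isolates gene_presence_absence.csv
```

For the two-strain example this should print just the two genome names, which is a cheap sanity check before running gifrop.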
cd test1
cd pan
gifrop --get_islands --threads 8
cd gifrop_out
ls
cd ../../..
Show clustered island info, explain columns. Show gpa_clust.
Show secondary cluster heatmap.
Identified 80kb metal island, 28kb MDR module, phage etc.
- show extracted fasta and gff.
Campylobacter draft genomes from an AMR transfer experiment.
1. C. coli donor Tet resistant
2. C. jejuni recipient Tet sensitive
3. C. jejuni result Tet resistant
cd test2
ls
# takes ~7 minutes
pan_pipe --prokka_args '--cpus 2' --roary_args '-p 8' --gifrop_args '--threads 8'
# pan_pipe --prokka_args '--cpus 24' --roary_args '-p 60' --gifrop_args '--threads 60'
cd pan
cd gifrop_out
ls
cd ../../../
Show secondary cluster by genome. Show transferred tet resistance island.
A messy dataset. Should show some of the main problems with this approach.
7 Salmonella strains.
I precomputed these results for the sake of time.
pan_pipe --prokka_args '--cpus 2 --proteins LT2.gbk' --roary_args '-p 8' --gifrop_args '--threads 8'
# takes ~ 7 minutes
Look at clustered island info, try and find MDR and SGI4.
Look at secondary cluster by genome heatmap.
Look at secondary cluster 132? heatmap, show problems.
Look at clustering graph.
Potential solutions:
- Do some pruning on the clustering graph before community
identification.
- remove low weight edges?
- remove edges of weight 1
- calculate a median edge weight for each primary cluster
- remove edges weighted less than 1/4 of median edge weight?
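The pruning ideas above can be sketched on a plain edge list. This is not how gifrop does anything today: `prune_edges` is a hypothetical helper, and the input format (tab-separated, no header, columns primary_cluster, from, to, weight) is an assumption made for illustration. It drops edges of weight 1 and edges weighted less than 1/4 of the median edge weight of their primary cluster.

```shell
# prune_edges FILE: hypothetical helper, not part of gifrop.
# FILE is an assumed tab-separated edge list with no header:
#   primary_cluster <TAB> from <TAB> to <TAB> weight
# Surviving edges are printed unchanged, grouped by primary cluster.
prune_edges() {
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k4,4n "$1" |
  awk -F '\t' '
    # Emit the edges of one primary cluster that pass both filters.
    function emit_cluster(   i, med) {
      if (n == 0) return
      # Weights arrive sorted within each cluster, so the median
      # is the middle value (or the mean of the two middle values).
      med = (n % 2) ? w[(n + 1) / 2] : (w[n / 2] + w[n / 2 + 1]) / 2
      for (i = 1; i <= n; i++)
        if (w[i] > 1 && w[i] >= med / 4) print line[i]
      n = 0
    }
    # A new primary cluster starts: finish the previous one first.
    ($1 "") != (cur "") { emit_cluster(); cur = $1 }
    { n++; w[n] = $4 + 0; line[n] = $0 }
    END { emit_cluster() }
  '
}
```

The pruned list would then go back into community identification. Whether 1/4 of the median is the right cutoff is an open question, so the threshold is worth experimenting with.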