This repository contains a snakemake pipeline to analyze the structural diversity of microbial genomes using PanGraph. The pipeline is a simplified version of the one used for the analysis of E. coli ST131 genomes in this preprint.
The input consists of a set of GenBank files, each containing a single record (the bacterial chromosome). Together, the files should form a collection of genomes whose structure can be meaningfully compared.
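The single-record requirement can be checked before running the pipeline. The sketch below is not part of the repository; it assumes the files carry a .gbk extension and relies on the fact that every record in a GenBank flat file starts with a LOCUS line. The function names are hypothetical.

```python
from pathlib import Path


def count_genbank_records(path):
    """Count the records in a GenBank flat file by counting LOCUS header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith("LOCUS"))


def files_with_wrong_record_count(gbk_dir):
    """Return {file name: record count} for files that do not hold exactly one record."""
    counts = {
        path.name: count_genbank_records(path)
        for path in sorted(Path(gbk_dir).glob("*.gbk"))
    }
    return {name: n for name, n in counts.items() if n != 1}
```

Calling files_with_wrong_record_count("data/gbk") returns an empty dictionary when every file holds exactly one record. Files with plasmids or multiple contigs would show up with a count above one.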
The pipeline requires a working installation of conda or mamba. To run, it requires an environment with snakemake (v.7+). This can be created with:
conda create -n snakemake -c conda-forge -c bioconda snakemake=7
A new dataset can be added with the following three steps:
- Place each input data file at data/gbk/{acc}.gbk, where {acc} is an id of the record (e.g. accession number).
- In addition, a file datasets/{dataset_name}/acc_nums.txt should contain a list of ids for all entries of the dataset. Here {dataset_name} is the name of the dataset.
- Finally, update the config.yaml file by adding the new dataset name under the datasets key. You must also specify the id of a guide-strain from the dataset, which is used as the reference for the structural comparison:

      datasets:
        {dataset_name}:
          guide-strain: "{acc}"
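Before launching the pipeline, it can help to verify that the three steps are consistent with each other. The following is a minimal sketch, not part of the repository. It assumes that acc_nums.txt lists one id per line, which the text above does not state explicitly, and it takes the guide strain as an argument instead of parsing config.yaml. The function name check_dataset is hypothetical.

```python
from pathlib import Path


def check_dataset(dataset_name, guide_strain, root="."):
    """Return a list of problems found in the file layout of one dataset."""
    root = Path(root)
    acc_file = root / "datasets" / dataset_name / "acc_nums.txt"
    if not acc_file.is_file():
        return [f"missing {acc_file}"]

    # Assumption: one id per line; blank lines are ignored.
    accs = [line.strip() for line in acc_file.read_text().splitlines() if line.strip()]

    problems = []
    # Every listed id needs a matching GenBank file in data/gbk.
    for acc in accs:
        gbk = root / "data" / "gbk" / f"{acc}.gbk"
        if not gbk.is_file():
            problems.append(f"missing {gbk}")

    # The guide strain must be one of the entries of the dataset.
    if guide_strain not in accs:
        problems.append(f"guide strain {guide_strain} is not listed in {acc_file}")
    return problems
```

An empty list means that every listed id has a matching GenBank file and that the guide strain belongs to the dataset.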
To run the pipeline, first activate the environment:
conda activate snakemake
You can then run the pipeline on a cluster that uses the SLURM workload manager with:
snakemake all --profile cluster
Or alternatively on your local machine with:
snakemake -c <n. cores> --use-conda all