/FlexCore

Extract flexible bacterial core genome alignments and compute SNP distances. Additionally, perform rarefaction-sampling analysis of alignments

Primary LanguagePython

FlexCore

Python conda

Extract per-site flexible (default ≥95%) bacterial core genomes from read-mapped whole genome alignments. Optionally calculate core genome SNP counts and distances.

For flexible core extraction alone, this can be done rapidly with BactCore

For overall speed and performance obtaining pairwise SNPs from strict core genomes, this can be done with snp-dists.

Dependencies

FlexCore.py is written in python3 and requires the following python packages:

  • numpy
  • screed

Usage

First set up a conda environment with the appropriate dependencies:

conda env create -f environment.yml
conda activate env-fc

Then run the main script FlexCore.py referencing your alignment file:

python FlexCore.py --alignment <alignment file>

Input

Multi-fasta whole genome alignment derived from mapping to a reference and variant calling such as from snippy, snippy-core and snippy-clean_full_aln:

snippy-core --ref ref.fa snippyoutfiles 
snippy-clean_full_aln core.full.aln > clean.full.aln

Output

File Description
Coresites.fasta the core genome alignment
rSNPs95.csv the comma separated non-redundant (exclusive) pairwise distances (optional)

SNP distances are calculated by the number of comparable sites per pair in the core genome alignment. Though a core genome cutoff for the total core alignment may be ≥95%, in principle any number of sites may be missing between a pair of sequences (they may have far fewer sites without gaps or ambiguous bases than the overall alignment size). As such, calculating SNP distance by dividing by the total alignment length may be innapropriate. The SNP counts, SNP distance (SNP count/length of pairwise alignment) and adjusted SNP counts are output (SNP distance times by overall alignment length)

Options

Flag                 Short flag Description Required Default val
--alignment -a Provide path and filename of alignment
--cutoff -c Per-site percent core (integer). Default=95(%) 95
--nproc -p Number of processes 1
--no-dists Do not calculate SNP distances, output core alignment only False
--keepref Retain the reference sequence in the core calculation and SNP distances False

The --keepref option expects the reference file to be named >Reference in line with snippy-core output

Similar software

Other software (that I'm aware of) for extracting flexible core genomes following reference mapping and variant calling:




AlignRarefaction

Generate rarefaction data from read-mapped (reference-based) bacterial whole-genome alignments. A population of bacterial genomes is input and varying population sizes sampled to assess the impact of increased number and diversity of genomes

Usage

  1. Run the main script AlignRarefaction.py referencing the path to your sample of (fasta format) genomes:

    python AlignRarefaction.py --alignment </path/to/alignment/alignmentfile> 

Input

Multi-fasta whole genome alignment derived from mapping to a reference and variant calling such as from snippy, snippy-core and snippy-clean_full_aln:

snippy-core --ref ref.fa snippyoutfiles 
snippy-clean_full_aln core.full.aln > clean.full.aln

Output

File Description
rarefaction.csv Core alignment lengths for each iteration, for each subsampled population size

...

Options

Flag           Short flag Description Required Default val
--alignment -a Provide path and name of alignment file
--cutoff -c Per-site percent core (integer) 95
--step -s Step for random population size sampling 10
--minpop -mp Minimum (starting) random population size 20
--iterations -itr Number of iterations per population size 100
--keepref Remove reference sequence from the alignment False