This pipeline detects and visualises phenotype-associated genome rearrangement events in bacterial genomes that are mediated by homologous recombination between repetitive elements (such as IS elements).
Currently, this package only supports binary phenotypes.
blastn: 2.2.31+, R 4.2.2, Python 3.9.16
python modules: argparse, SeqIO, csv, pandas, gzip, os, Bio.Seq
R modules: optparse, plyr, ggplot2, ggforce, ggpubr
genome_rearrangement can be installed using the git clone
command
git clone https://github.com/DorothyTamYiLing/GWarrange.git
For identifing candidate repeat sequence categories and estimating size of repeat sequence clusters in selected reference genome
#example:
bash scripts/homo_main.sh -gff ./example_data/ref.gff -fna ./example_data/ref.fasta
Arguments:
gff : gff file for selected reference genome, with fasta sequences lines removed
fna : complete assembly for selected reference genome
thread_blast : number of threads for BLAST (default: 8)
freq : number of occurrence/BLAST hit of a sequence in the .gff file to be defined as repeat sequence (default: 2)
idcov : values for -perc_identity and -qcov_hsp_perc in BLAST, in string format (default: "80_80")
dist : defines repeat sequences to belong to the same repeat sequence cluster when they are less than this number of base pairs apart (default: 1000)
#example:
bash scripts/GWarrange.sh -gen genomes.fna.gz -pheno /full/path/to/pheno.txt \
-gen_size 4300 -startgene startgene.fasta -replist replist.fasta \
-thread 8 \
-fsmlite_arg "-v -s 3 -S 44 -t tmp -m 200 -M 200" \
-pyseer_arg "--min-af 0.05 --max-af 0.95 --no-distances" \
-ext_mrg_min "100_3" -ext_mrg_max "7000_3"
Arguments:
gen : gzipped/gunzipped multifasta file of genomes set (original sequence without IS replacement). Must be in *.fna.gz suffix.
pheno : phenotype file (file format: sample names in 1st column, binary phenotype in 2nd column; no header, tab-delimited) (must be in full directory path)
gen_size : genome size
startgene : chosen gene for reorientating genomes
replist : representatives of repeat loci categories are aligned with reference genome
ext_mrg_min : minimum extending and merging neighbouring repeat sequences into blocks (default: "100_3")
ext_mrg_max : maximum extending and merging neighbouring repeat sequences into blocks (default: "7000_3")
flk_len : minimum length (bp) of flanking sequences (both side) for the kmer to be blasted with the genomes (default: 30bp)
pyseer_arg : additional arguments for running pyseer, apart from --phenotypes, --kmers, --output-patterns and --cpu. Full path should be provided for any additional input file. (default: "--min-af 0.05 --max-af 0.95")
fsmlite_arg : additional arguments for running fsm-lite, apart from -l (default: "-v -t tmp -m 200 -M 200")
unitigcaller_arg : additional arguments for running unitigcaller, apart from --call, --pyseer, --refs, --out and --threads. (default: "")
string_type : "kmer" or "kmers_and_unitigs" ("kmer" for performing k-mer-based GWAS only; "kmers_and_unitigs" for performing both k-mers and unitigs based GWAS, then instead of significant k-mers without placeholder sequences, significant unitigs will be analysed together with significant k-mer containing placeholder sequences for the purpose of efficient run time (default: "kmer")
thread : number of thread for BLAST, unitig-caller and pyseer (default: "8")
Split k-mer plots parameters:
dedupk : Number of significant digits (e.g. 2,3,4) for rounding off mean upstream flank start coordinate, for selecting split kmers with unique proportion and genome position information for plotting (default: 2)
exp_fac : how much the arrow expand horizontally for visibility in relative to the genome size. Smaller number leads to larger arrow expansion. In split kmers' plots. (default: 86)
yaxis : height of split kmers' plots (default: 360)
arr_dist : vertical distance between arrows, in split kmers' plots (default: 70)
split_h : height of the device in pdf function in R, in split kmers' plots (default: 7)
split_w : width of the device in pdf function in R, in split kmers' plots (default: 10)
merge : merge arrows into one when they are less than this number of base pair apart (default: 40000)
Intact k-mer plots parameters:
intkrd : The closest multiplier of selected value (e.g. 100, 1000, 10000) used for rounding off median genome position of intact kmer, for selecting intact k-mers with unique genome position information for plotting (default: 1000)
intact_h : height of the device in png function in R, in intact kmers' plots (default: 100)
intact_w : height of the device in png function in R, in intact kmers' plots (default: 180)
intact_res : Resolution for intact kmers' plots (default: 150)
For tutorials, please go to here
For pipeline and output files description, go to here
Range of IS elements can be found in https://github.com/thanhleviet/ISfinder-sequences for multiple bacterial species.