A pipeline to automatically detect the allele of espW in E. coli genomes.
Recent investigations (publication forthcoming) identified a homopolymer of varying size in the espW gene in E. coli O157:H7 isolates. This pipeline is designed to determine which allele is present in an input genome.
The allele of espW is determined as follows:
- Three alleles of espW (deletion, full length, and insertion) are used as queries to search against the input genome using blastn. The search strategy can distinguish which allele is present in the isolate.
- If blastn is unable to determine the espW allele, then ariba is used to recruit reads and assemble the espW gene. The resulting assemblies are then screened to see if the espW allele can be determined.
- If espW is not assembled by ariba, then espwAlleleCaller will report the allele as "absent". If espW was assembled but the assembly did not allow for clear identification of the allele, then espwAlleleCaller will report the allele as "ambiguous".
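The fallback order described above can be summarized in a few lines of Python. This is only a minimal sketch of the decision logic, not the code used by espwAlleleCaller: the function name `resolve_allele` and its three parameters are hypothetical, and a value of `None` is assumed to mean "no call could be made".

```python
from typing import Optional


def resolve_allele(blast_call: Optional[str],
                   ariba_assembled: bool,
                   ariba_call: Optional[str]) -> str:
    """Hypothetical helper mirroring the three steps listed above.

    blast_call      -- allele found by blastn, or None if blastn could not decide
    ariba_assembled -- True if ariba produced an espW assembly
    ariba_call      -- allele found in the ariba assembly, or None if unclear
    """
    # Step 1: a blastn call is used directly when one is available.
    if blast_call is not None:
        return blast_call
    # Step 2: with no blastn call, a missing assembly means "absent".
    if not ariba_assembled:
        return "absent"
    # Step 3: an assembly that does not reveal the allele is "ambiguous".
    if ariba_call is None:
        return "ambiguous"
    return ariba_call
```

For example, `resolve_allele(None, True, None)` returns `"ambiguous"`, while `resolve_allele(None, False, None)` returns `"absent"`.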
Here is an overview of the workflow:
flowchart TB
email[/"email address"/]
input[/"input file"/]
download_fna["download fastas
from NCBI"]
failed_download(["failed downloads"])
blast["blastn"]
blast_alleles(["espW alleles"])
failed_blasts(["failed blasts"])
download_srr["download reads"]
ariba["ariba"]
ariba_alleles(["espW alleles"])
write["write results"]
out(["output file"])
email --> download_fna
input --> download_fna
download_fna --> blast
download_fna --> failed_download
blast --> blast_alleles
blast --> failed_blasts
input --> download_srr
failed_download --> download_srr
failed_blasts --> download_srr
download_srr --> ariba
ariba --> ariba_alleles
blast_alleles --> write
ariba_alleles --> write
write --> out
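In the flowchart, both failed downloads and failed blasts feed into the read download step. The sketch below illustrates that routing under stated assumptions: `select_for_ariba` is a hypothetical name, and the two dictionaries (download status and blastn call per input key) are illustrative data structures rather than the ones used internally by the pipeline.

```python
from typing import Dict, List, Optional


def select_for_ariba(downloaded: Dict[str, bool],
                     blast_calls: Dict[str, Optional[str]]) -> List[str]:
    """Return the keys that should go on to read download and ariba.

    downloaded  -- maps each input key to True if its fasta was downloaded
    blast_calls -- maps each key to the blastn allele call, or None on failure
    """
    keys = []
    for key, ok in downloaded.items():
        # A genome is routed to ariba if its download failed or if
        # blastn did not produce an allele call for it.
        if not ok or blast_calls.get(key) is None:
            keys.append(key)
    return keys
```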
- download the two required git repositories
git clone https://github.com/ncezid-biome/BIOME-scripts.git
git clone https://github.com/ncezid-biome/espwAlleleCaller.git
- set up the environment
echo "MISC_DIR = '$(pwd)/BIOME-scripts/misc-python-scripts/'" > ./espwAlleleCaller/miscDirectory.py
conda env create -f ./espwAlleleCaller/environment.yml
conda activate espwallelecaller
Check the installation with the following command:
./espwAlleleCaller/espwAlleleCaller.py --check_env
If everything was installed correctly, the following message will be printed to screen:
environment is suitable
usage:
espwAlleleCaller.py [-ieosnvhc]
required arguments:
-i, --in [file] filename of a tab-separated file with three columns and no headers: key, ncbi accession, srr id
-e, --email [str] email address (used to query NCBI)
optional arguments:
-o, --out [file] filename to write the output
-s, --seq_dir [directory] the directory where sequence files will be downloaded (will be created if necessary)
-n, --num_threads [int] the number of threads to use for parallel processing
-v, --version print the version
-h, --help print this help message
-c, --check_env check that all dependencies are installed
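The input file passed with -i is a tab-separated file with three columns and no headers (key, ncbi accession, srr id). Before starting a long run, it can be useful to confirm that every line has exactly three fields. The following is a minimal sketch of such a check; `read_input_table` is a hypothetical helper and is not part of espwAlleleCaller.

```python
import csv
from typing import List, Tuple


def read_input_table(path: str) -> List[Tuple[str, str, str]]:
    """Read a headerless three-column TSV: key, ncbi accession, srr id."""
    rows = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for line_number, row in enumerate(reader, start=1):
            # Skip completely blank lines.
            if not row:
                continue
            # Every remaining line must have exactly three fields.
            if len(row) != 3:
                raise ValueError(
                    f"line {line_number}: expected 3 tab-separated fields, "
                    f"found {len(row)}"
                )
            key, accession, srr = (field.strip() for field in row)
            rows.append((key, accession, srr))
    return rows
```

Calling `read_input_table("input.tsv")` returns a list of (key, accession, srr) tuples, or raises `ValueError` naming the first malformed line. The file name `input.tsv` is a placeholder.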