espW Allele Caller

A pipeline to automatically detect the allele of espW in E. coli genomes.

Joseph S. Wirth, 2023

Overview

Background

Recent investigations (publication forthcoming) identified a homopolymer of varying size in the espW gene in E. coli O157:H7 isolates. This pipeline is designed to determine which allele is present in an input genome.

Workflow

The allele of espW is determined as follows:

1. Three alleles of espW (deletion, full length, and insertion) are used as queries to search the input genome with blastn. The search strategy can distinguish which allele is present in the isolate.
2. If blastn cannot determine the espW allele, then ariba is used to recruit reads and assemble the espW gene. The resulting assemblies are then screened to determine the espW allele.
3. If ariba does not assemble espW, then espwAlleleCaller reports the allele as "absent". If espW is assembled but the assembly does not allow clear identification of the allele, then espwAlleleCaller reports the allele as "ambiguous".
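
The fallback order described above can be sketched in Python. This is only an illustration of the decision logic, not the code used by espwAlleleCaller: the function `call_espw_allele`, its two callback parameters, and the allele label strings are hypothetical names chosen for this example, and the real blastn and ariba steps are replaced by simple callables.

```python
from typing import Callable, List, Optional

# Assumption: label strings for the three alleles named in this README;
# the strings written by the real pipeline may differ.
ALLELES = ("deletion", "full length", "insertion")


def call_espw_allele(
    blast_call: Optional[str],
    assemble_with_ariba: Callable[[], List[str]],
    screen_assemblies: Callable[[List[str]], Optional[str]],
) -> str:
    """Return an allele label, "absent", or "ambiguous" (hypothetical helper)."""
    # Step 1: accept the blastn call if it identified one of the alleles.
    if blast_call in ALLELES:
        return blast_call

    # Step 2: blastn was inconclusive, so recruit reads and assemble espW.
    # The callback is only invoked here, so reads are not needed otherwise.
    assemblies = assemble_with_ariba()

    # Step 3: nothing assembled means the gene is reported as absent.
    if not assemblies:
        return "absent"

    # An assembly that does not clearly match an allele is ambiguous.
    allele = screen_assemblies(assemblies)
    return allele if allele in ALLELES else "ambiguous"
```

Passing the ariba step as a callable keeps the expensive read download and assembly lazy, which mirrors the workflow: reads are only needed when the blastn search fails.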

Here is an overview of the workflow:

flowchart TB
    email[/"email address"/]
    input[/"input file"/]
    download_fna["download fastas
    from NCBI"]
    failed_download(["failed downloads"])
    blast["blastn"]
    blast_alleles(["espW alleles"])
    failed_blasts(["failed blasts"])
    download_srr["download reads"]
    ariba["ariba"]
    ariba_alleles(["espW alleles"])
    write["write results"]
    out(["output file"])

    email --> download_fna
    input --> download_fna
    download_fna --> blast
    download_fna --> failed_download
    blast --> blast_alleles
    blast --> failed_blasts
    input --> download_srr
    failed_download --> download_srr
    failed_blasts --> download_srr
    download_srr --> ariba
    ariba --> ariba_alleles
    blast_alleles --> write
    ariba_alleles --> write
    write --> out

Installation

Installing `espwAlleleCaller` using a `conda` environment

Download the two required git repositories:

git clone https://github.com/ncezid-biome/BIOME-scripts.git
git clone https://github.com/ncezid-biome/espwAlleleCaller.git

Set up the environment:

echo "MISC_DIR = '$(pwd)/BIOME-scripts/misc-python-scripts/'" > ./espwAlleleCaller/miscDirectory.py
conda env create -f ./espwAlleleCaller/environment.yml
conda activate espwallelecaller

Checking installation

Check the installation with the following command:

./espwAlleleCaller/espwAlleleCaller.py --check_env

If everything is installed correctly, the following message is printed to the screen:

environment is suitable

Running `espwAlleleCaller`

usage:
    espwAlleleCaller.py [-ieosnvhc]

required arguments:
    -i, --in             [file] filename of a tab-separated file with three columns and no headers: key, ncbi accession, srr id
    -e, --email          [str] email address (used to query NCBI)

optional arguments:
    -o, --out            [file] filename to write the output
    -s, --seq_dir        [directory] the directory where sequence files will be downloaded (will be created if necessary)
    -n, --num_threads    [int] the number of threads to use for parallel processing
    -v, --version        print the version
    -h, --help           print this help message
    -c, --check_env      check that all dependencies are installed
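
As an illustration of the input format, the following Python sketch writes a three-column, tab-separated file with no header and builds a matching command line. The helpers `write_input_file` and `build_command` are hypothetical and not part of espwAlleleCaller, the accession and SRR values are placeholders rather than real identifiers, and the sketch assumes each option takes its value as the next argument.

```python
from pathlib import Path
from typing import List, Optional, Sequence, Tuple


def write_input_file(rows: Sequence[Tuple[str, ...]], path: Path) -> None:
    """Write (key, ncbi accession, srr id) rows as headerless TSV (hypothetical helper)."""
    for row in rows:
        # Each row needs exactly three non-empty fields without tabs or newlines.
        if len(row) != 3 or any(
            not field.strip() or "\t" in field or "\n" in field for field in row
        ):
            raise ValueError(f"expected three clean fields, got: {row!r}")
    with open(path, "w") as handle:
        for row in rows:
            handle.write("\t".join(row) + "\n")


def build_command(
    in_file: str,
    email: str,
    out_file: Optional[str] = None,
    num_threads: Optional[int] = None,
) -> List[str]:
    """Build an argument list using the long options from the help message."""
    cmd = ["./espwAlleleCaller/espwAlleleCaller.py", "--in", in_file, "--email", email]
    if out_file is not None:
        cmd += ["--out", out_file]
    if num_threads is not None:
        cmd += ["--num_threads", str(num_threads)]
    return cmd


if __name__ == "__main__":
    # Placeholder values only; replace with a real key, accession, and SRR id.
    write_input_file([("isolate_1", "ACCESSION_1", "SRR_ID_1")], Path("input.tsv"))
    print(" ".join(build_command("input.tsv", "user@example.com", num_threads=4)))
```

The resulting list can be passed to `subprocess.run` once the conda environment is active.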
