beav - a bacterial genome and mobile element annotation pipeline

beav: Bacteria/Element Annotation reVamped

beav is a command line tool that streamlines bacterial genome and mobile genetic element annotation. It combines multiple annotation tools, automating the process of running, parsing, and combining the results into a single easy-to-read output. Annotated features include secretion systems, anti-phage defense systems, integrative & conjugative/mobilizable elements, integrons, prophage regions, amino acid biosynthesis pathways, small carbon metabolite catabolism pathways, and biosynthetic gene clusters. Type VI secretion system (T6SS) vgrG operons are automatically identified. Plasmid origin of transfer (oriT) elements are also characterized.

The beav pipeline also includes several tools and databases that enhance the annotation of plant-associated microbes, including phytopathogens and symbionts. Custom bakta databases provide correct gene names and annotations for phytopathogen virulence genes, effectors, and genes important for mutualist symbiosis. Other tools annotate promoter elements such as the pip box, tts box, nod box, tra box, and vir box.

An optional Agrobacterium-specific pipeline identifies the presence of Ti and Ri plasmids and classifies them under the Weisberg et al. 2020 scheme. It also annotates Ti/Ri plasmid elements including T-DNA borders, overdrive, virbox, trabox, and other binding sites, and determines the biovar and genomospecies of the input strain. Virulence and T-DNA genes, including opine synthase and transport/catabolism loci, are also correctly named and annotated.

beav generates a Circos plot annotating important features of the genome and, if the Agrobacterium-specific analysis is run, of the pTi/pRi plasmid. The Circos script can also be run separately.

Example Circos plot of whole genome annotations automatically generated by beav.

Example Circos plot visualizing oncogenic Ti/Ri plasmids generated by the optional Agrobacterium-specific pipeline.

Quick Start

#download and install beav with conda/mamba
mamba create -n beav beav
conda activate beav
#download all prerequisite databases
beav_db
#run beav
beav --input /path/to/file/test.fna --threads 8 --skip_tiger

Installation

The beav pipeline requires a number of programs and databases to be installed, so we strongly recommend using conda to install beav and all of its dependencies.

Once the tool is installed, run the beav_db tool to download all necessary databases.

From conda (Recommended)

We recommend installing beav with either mamba or conda with the libmamba solver, as this greatly reduces the time needed to solve the environment.

instructions for conda:

conda create -n beav
conda install -n beav beav

alternative instructions using mamba:

conda create -n beav
mamba install -n beav beav

or as one combined command:

conda create -n beav beav
or
mamba create -n beav beav

The conda environment can then be activated using:

conda activate beav

Alternative: From source

Clone the beav github repository.

git clone https://github.com/weisberglab/beav.git

If installing from source, DBSCAN-SWA, TIGER2, and GapMind (PaperBLAST) need to be installed in the software folder within the beav folder. The BEAV_DIR environment variable must then be set to point to the beav directory.

Prerequisites:

Program	Install location
Bakta	PATH
IntegronFinder	PATH
MacSyFinder	PATH
DefenseFinder	PATH
TIGER2	$BEAV_DIR/software
GapMind (PaperBLAST)	$BEAV_DIR/software
DBSCAN-SWA	$BEAV_DIR/software
antiSMASH	PATH
EMBOSS	PATH
HMMER	PATH
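After a source install, it can help to confirm that each PATH program resolves and that the software folder is in place. The following is a minimal sketch of such a check. The executable names (bakta, integron_finder, macsyfinder, defense-finder, antismash, hmmsearch, seqret) are assumptions about how each package is usually invoked and may differ between versions.

```shell
# Report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
    fi
}

# Executable names are assumptions; adjust them to match your installation.
for tool in bakta integron_finder macsyfinder defense-finder antismash hmmsearch seqret; do
    check_tool "$tool"
done

# TIGER2, GapMind, and DBSCAN-SWA are expected under $BEAV_DIR/software.
if [ -n "${BEAV_DIR:-}" ] && [ -d "$BEAV_DIR/software" ]; then
    echo "ok: $BEAV_DIR/software"
else
    echo "missing: BEAV_DIR is unset or has no software folder"
fi
```

Any line reported as missing points to a program that still needs to be installed, or to a matching skip option that should be used when running beav.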

Databases for each of these programs can then be installed manually. Alternatively, the following can be used to install them automatically.

Install all databases

conda activate beav 
beav_db

Database script optional parameters

usage: beav_db [--skip_bakta_db] [--light] [--bakta_db_path DIRECTORY] [--update]
    --skip_bakta_db 
        Skip downloading the Bakta databases
    --light
        Install the light version of Bakta databases
    --bakta_db_path DIRECTORY
        Install Bakta databases in nondefault location 
    --update
        Update Bakta databases

Usage

NOTE: If you get an error stating "ModuleNotFoundError: No module named 'nrpys'", then you can run the following command (with the beav conda environment activated) to force reinstall it:

python -m pip install --upgrade --force-reinstall nrpys

NOTE: There is currently a bug in the latest DefenseFinder models that causes an error in MacSyFinder when running it. We recommend running beav with `--skip_defensefinder` until the MacSyFinder bug fix is released in bioconda. Alternatively, copying the patched file to the MacSyFinder python library folder of your conda environment will fix the issue.

Patching instructions

To do so, find the python version of your conda environment:

python --version

Then download the patched registries.py file:

wget https://raw.githubusercontent.com/gem-pasteur/macsyfinder/27ee21ceb8e7100d9183b084356f791487aca4ad/macsypy/registries.py

Then copy it to the correct folder in your conda env, changing the python version as necessary:

cp registries.py $CONDA_PREFIX/lib/python3.9/site-packages/macsypy/
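To avoid editing the python version by hand, the folder name can be derived from the output of `python --version`. This is a small sketch; the helper name py_lib_dir is hypothetical and not part of beav or MacSyFinder.

```shell
# Turn the output of `python --version` (e.g. "Python 3.9.18") into the
# matching library folder name (e.g. "python3.9").
py_lib_dir() {
    ver=${1#Python }     # strip the leading "Python "
    major=${ver%%.*}     # text before the first dot
    rest=${ver#*.}       # text after the first dot
    minor=${rest%%.*}    # text before the next dot
    printf 'python%s.%s\n' "$major" "$minor"
}

py_lib_dir "Python 3.9.18"   # prints python3.9

# With the beav environment activated, the copy step could then be written as:
# cp registries.py "$CONDA_PREFIX/lib/$(py_lib_dir "$(python --version)")/site-packages/macsypy/"
```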

usage: beav [--input INPUT] [--output OUTPUT_DIRECTORY] [--strain STRAIN] [--bakta_arguments BAKTA_ARGUMENTS] [--antismash_arguments ANTISMASH_ARGUMENTS] [--tiger_blast_database DBPATH] [--run_operon_email EMAIL] [--agrobacterium] [--skip_macsyfinder] [--skip_integronfinder] [--skip_defensefinder] [--skip_tiger] [--skip_gapmind] [--skip_dbscan-swa] [--skip_antismash] [--continue] [--gbk] [--threads THREADS] [--help]
    BEAV- Bacterial Element Annotation reVamped
    Input/Output: 
        --input, -i STRAIN.fna
                Input file in fasta nucleotide format (Required)
        --output DIRECTORY
                Output directory (default: current working directory)
        --strain STRAIN
                Strain name (default: input file prefix)
        --bakta_arguments ARGUMENTS
                Additional arguments and database options specific to Bakta 
        --antismash_arguments ARGUMENTS
                Additional arguments and database options specific to antiSMASH (Default: "$antismash_args")
        --tiger_blast_database DBPATH
                Path to a reference genome blast database for TIGER2 ICE analysis (Required unless --skip_tiger is used)
        --run_operon_email EMAIL
                Annotate predicted operons using the Operon-mapper webserver. Must input an email address for the job
    Options:
        --agrobacterium
                Agrobacterium specific tools that identify biovar/species group, Ti/Ri plasmid, T-DNA borders, virboxes and traboxes
        --skip_macsyfinder
                Skip detection and annotation of secretion systems
        --skip_integronfinder
                Skip detection and annotation of integrons 
        --skip_defensefinder
                Skip detection and annotation of anti-phage defense systems 
        --skip_tiger
                Skip detection and annotation of integrative conjugative elements (ICEs)
        --skip_gapmind
                Skip detection of amino acid biosynthesis and carbon metabolism pathways
        --skip_dbscan-swa
                Skip detection and annotation of prophage
        --skip_antismash
                Skip detection and annotation of biosynthetic gene clusters
        --continue
                Continue running BEAV from any point in the pipeline. Rerun programs that gave an error or were skipped.
        --gbk
                Use a GenBank file as input
    General:
        --help, -h
                Show BEAV help message
        --threads, -t
                Number of CPU threads

Options

--antismash_arguments

Additional arguments can be passed to antiSMASH using the --antismash_arguments option. This gives access to the full set of antiSMASH options, including additional databases.

--tiger_blast_database

Required if running TIGER. Users must provide a path to a blast database of reference genomes using the --tiger_blast_database option.
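If no reference database exists yet, one can be built with the BLAST+ makeblastdb program. The sketch below assumes a combined nucleotide FASTA of reference genomes at the placeholder path used in the examples. Whether TIGER2 needs extra makeblastdb options is not covered here, so check the TIGER2 documentation before relying on it.

```shell
# Placeholder path to a combined FASTA of reference genomes.
ref="/path/to/databases/blast/refseq_genomic.fna"

# Build a nucleotide BLAST database next to the FASTA file. Without -out,
# makeblastdb names the database after the input file, so the same path
# can then be given to --tiger_blast_database.
if command -v makeblastdb >/dev/null 2>&1 && [ -f "$ref" ]; then
    makeblastdb -in "$ref" -dbtype nucl
    db_status="built"
else
    db_status="skipped"   # makeblastdb not found or FASTA file missing
fi
echo "BLAST database: $db_status"
```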

--bakta_arguments

Additional arguments can be passed to bakta using the --bakta_arguments option.

--agrobacterium

The --agrobacterium option activates an additional pipeline to provide Agrobacterium-specific annotation.

--skip_PROGRAM

The skip options allow for specified programs to be skipped if the annotation is not needed or required programs are not installed.
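When some programs are not installed, the matching skip options can be assembled automatically. The following is a rough sketch of that idea. The skip flags are taken from the usage text, while the executable names and the helper skip_if_missing are assumptions made for illustration.

```shell
# Print the given skip flag if the executable is not on PATH, else nothing.
# $1 = executable name (assumed), $2 = beav skip flag
skip_if_missing() {
    command -v "$1" >/dev/null 2>&1 || printf '%s ' "$2"
}

skip_flags="$(skip_if_missing macsyfinder --skip_macsyfinder)\
$(skip_if_missing integron_finder --skip_integronfinder)\
$(skip_if_missing defense-finder --skip_defensefinder)\
$(skip_if_missing antismash --skip_antismash)"

# Print the resulting command rather than running it.
echo "beav --input /path/to/file/test.fna --threads 8 --skip_tiger $skip_flags"
```

Printing the command first makes it easy to review which steps would be skipped before starting a long run.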

--continue

The continue option checks the output of an existing beav run and reruns programs that errored or were skipped. This option also allows the pipeline to be used with existing Bakta runs.

--gbk

A GenBank file can be used as the input file when the --gbk option is used.

Examples

Minimal run

beav --input /path/to/file/test.fna --threads 8 --skip_tiger

Standard run

beav --input /path/to/file/test.fna --threads 8 --tiger_blast_database /path/to/databases/blast/refseq_genomic.fna

Standard run with operon annotation (remote)

beav --input /path/to/file/test.fna --threads 8 --tiger_blast_database /path/to/databases/blast/refseq_genomic.fna --run_operon_email myemail@email.com

Standard run with genbank input

beav --input /path/to/file/test.gbk --threads 8 --tiger_blast_database /path/to/databases/blast/refseq_genomic.fna --gbk

Complex run

beav --input /path/to/file/test.fna --threads 8 --bakta_arguments '--db /path/to/alternative-data-bases/bakta-1.7/' --tiger_blast_database /path/to/databases/blast/allagro.fna --agrobacterium --skip_integronfinder
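To annotate several genomes in one go, the minimal run can be wrapped in a loop. This is a sketch under assumptions: the directory /path/to/genomes, the results/ layout, and the helper strain_name are hypothetical, and only options listed in the usage text are used. The loop prints each command instead of running it.

```shell
# Derive a strain name from a FASTA path: /data/genomes/A6.fna -> A6
strain_name() {
    base=${1##*/}                   # drop the directory part
    printf '%s\n' "${base%.fna}"    # drop the .fna extension
}

genome_dir="/path/to/genomes"   # hypothetical directory of .fna files
for fna in "$genome_dir"/*.fna; do
    [ -e "$fna" ] || continue   # no matches: the glob stays literal, so skip it
    strain=$(strain_name "$fna")
    # echo prints the command; remove it to actually run beav.
    echo beav --input "$fna" --output "results/$strain" --strain "$strain" --threads 8 --skip_tiger
done
```

Giving each genome its own output directory keeps the results of separate runs from overwriting each other.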

Standalone Circos plot generation

To generate Circos plots from your GenBank file independently of the beav pipeline, make sure the beav conda environment is activated:

conda activate beav

Usage:

beav_circos -i <GenBank_file> [-c <Contig_for_subset_visualization>] [--pTi <Contig_for_oncogenic_visualization>]

Examples:

# Generate a general Circos plot for all contigs
beav_circos -i test.gbk

# Generate a general Circos plot for all contigs and an oncogenic Circos plot for a single contig
beav_circos -i test.gbk --pTi contig_1

# Generate a general Circos plot for all contigs and an oncogenic Circos plot for a set of contigs
beav_circos -i test.gbk --pTi "contig_1 contig_2"

# Generate a general Circos plot for a single contig
beav_circos -i test.gbk -c contig_1

# Generate a general Circos plot for a set of contigs
beav_circos -i test.gbk -c "contig_1 contig_2"

Citation

Beav can be cited as:

Jung J.M., Rahman A., Schiffer A.M., and Weisberg A.J., Beav: a bacterial genome and mobile element annotation pipeline. (2024) bioRxiv 2024.01.25.577299; doi: https://doi.org/10.1101/2024.01.25.577299