
Making whole genome bacterial sequencing data analysis easy


TORMES

An automated pipeline for whole bacterial genome analysis directly from raw Illumina paired-end sequencing data.
(repository under development)


What is TORMES?

TORMES is an open-source, user-friendly pipeline for whole bacterial genome sequencing (WGS) analysis directly from raw Illumina paired-end sequencing data. TORMES works with any bacterial WGS dataset, regardless of the number, origin or species of the samples. By following very simple instructions, TORMES automates all the steps of a typical WGS analysis, including:

  1. Sequence quality filtering
  2. De novo genome assembly
  3. Draft genome ordering against a reference (optional)
  4. Genome annotation
  5. Multi-Locus Sequence Typing (MLST, optional)
  6. Antibiotic resistance genes screening
  7. Virulence genes screening
  8. Pangenome comparison (optional)

When working with Escherichia or Salmonella sequence data, an extended analysis can be enabled with the -g/--genera option, including:

  1. Antibiotic resistance screening based on point mutations
  2. Plasmid replicons screening
  3. Serotyping
  4. fimH-Typing (only for Escherichia)

Once the WGS analysis has finished, TORMES summarizes the results in an interactive web-like file that can be opened in any web browser, making the results easy to analyze, compare and share.


TORMES is written in a combination of Bash, R and R Markdown. Once the WGS analysis has finished, TORMES automatically generates an R Markdown file (unique to each analysis) that is used to render the report in the R environment. This file can be modified by the user to generate user-specific reports (by including additional information, tables, figures, etc.). Additional information can be found in the Rendering customized reports section. We will probably transition the TORMES code (Bash) to Snakemake, so that the pipeline itself is also open to user customization.
TORMES is a pipeline, and it relies on many bioinformatic tools that constitute its backbone; they are listed in the Required dependencies section.
We think that the availability of open-source, user-friendly pipelines will broaden the application of certain technologies, such as WGS, whose use is sometimes limited by the difficulty of analyzing the large amount of data generated by high-throughput DNA sequencing (HTS) platforms. TORMES was devised with this aim, inspired by some excellent currently available software, such as Nullarbor, EnteroBase and the Center for Genomic Epidemiology. Users are encouraged to try every available tool and to find those that best fit their studies.
We would like TORMES to be regularly updated with the newest tools and databases, and to improve the pipeline by taking users' suggestions into account. Currently, we are working on the following improvements:

  • Possibility to start the pipeline from (draft) genomes.
  • Extend the pipeline to data generated by other HTS platforms.
  • Include additional analysis for Staphylococcus aureus (by using the -g/--genera option).
  • Include the use of customized databases.

Installation

TORMES is a pipeline that requires many dependencies to work. It has been devised to be used as a conda environment. To install TORMES and all its dependencies, run:

wget https://anaconda.org/nmquijada/tormes-1.0/2019.04.25.180147/download/tormes-1.0.yml
conda env create -n tormes-1.0 --file tormes-1.0.yml

To activate the TORMES environment, run:

conda activate tormes-1.0

Additionally, the first time you use TORMES, run the following command (after activating the TORMES environment):

tormes-setup

This step will install additional dependencies that are not available in conda, including the MiniKrakenDB_8GB database required by Kraken (the database is ~8 GB and might take some time to download). It will also automatically create the config_file.txt required for TORMES to work (see below).
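
Before launching an analysis it can help to confirm that the environment is really active. The following is a minimal sketch and not part of TORMES: the function name tormes_env_ready is hypothetical, and it assumes the environment was created under the name tormes-1.0 as shown above, so that conda exports that name in the CONDA_DEFAULT_ENV variable.

```shell
# Hypothetical helper (not shipped with TORMES).
# Assumption: the environment is named "tormes-1.0", as in the commands above.
tormes_env_ready() {
    # conda sets CONDA_DEFAULT_ENV to the name of the active environment.
    if [ "${CONDA_DEFAULT_ENV:-}" != "tormes-1.0" ]; then
        echo "environment not active: run 'conda activate tormes-1.0'"
        return 1
    fi
    # The tormes command should be reachable once the environment is active.
    if ! command -v tormes >/dev/null 2>&1; then
        echo "tormes not found on PATH"
        return 1
    fi
    echo "tormes-1.0 is active and tormes is on PATH"
}

# Usage: tormes_env_ready && tormes --version
```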


Required dependencies

TORMES is a pipeline and it requires several dependencies to work:

Additional software when working with -g/--genera Escherichia.

Additional software when working with -g/--genera Salmonella.

TORMES will look for the software listed in the config_file.txt, which is a simple tab-separated text file indicating each software/database and its location. A config_file.txt is created automatically after running the tormes-setup command. However, you can change the path to any software if you would like to use a different version (if you do so, keep the software names and the tab separation).
You can find an example of the config_file.txt here.
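
After editing the config_file.txt by hand, a quick check can catch broken paths before a long run. This is a rough sketch under stated assumptions, not a TORMES feature: the function name check_tormes_config is hypothetical, each line is assumed to hold a name and a location separated by a tab, and blank lines or lines starting with # are assumed to be ignorable.

```shell
# Hypothetical helper (not shipped with TORMES).
# Assumption: each line of the config file is "name<TAB>location".
check_tormes_config() {
    local config="$1" tab name location rest bad=0
    tab=$(printf '\t')
    [ -r "$config" ] || { echo "cannot read: $config" >&2; return 2; }
    while IFS="$tab" read -r name location rest || [ -n "$name" ]; do
        # Skip blank lines and comment lines (an assumption about the format).
        case "$name" in ''|'#'*) continue ;; esac
        if [ -z "$location" ]; then
            echo "MALFORMED  $name (no tab-separated location)"
            bad=1
        elif [ ! -e "$location" ] && ! command -v "$location" >/dev/null 2>&1; then
            # Neither an existing path nor a command found on PATH.
            echo "MISSING    $name -> $location"
            bad=1
        fi
    done < "$config"
    return "$bad"
}

# Usage: check_tormes_config config_file.txt && echo "config looks fine"
```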


Usage

Usage: tormes [options]

OBLIGATORY OPTIONS:
        -m/--metadata   Path to the file with the metadata regarding the samples
                        The file must have an specific organization for the program to work
                        If you don't have any or you would like to have an example or extra information,
                        please type:
                        tormes example-metadata
        -o/--output     Path and name of the output directory

OTHER OPTIONS:
        -a/--adapter    Path to the adapters file
                        (default="PATH/TO/TORMES/files/adapters.fa")
        --assembler     Select the assembler to use. Options available: 'spades', 'megahit'
                        (default='spades')
        -c/--config     Path to the configuration file with the location of all dependencies
                        (default="PATH/TO/TORMES/files/config_file.txt")
        --citation      Show citation
        --fast          Faster analysis (default='0')
                        ('megahit' is used as assembler and contig ordering and pangenome analysis are disabled)
        --filtering     Select the software for filtering the reads.
                        Options available: 'prinseq', 'sickle', 'trimmomatic'
                        (default="prinseq")
        -g/--genera     Type genera name to allow special analysis (default='none')
                        Options available: 'Escherichia', 'Salmonella'
        -h/--help       Show this help
        --min_len       Minimum length to the reads to survive after filtering (default=125) <integer>
        --no_mlst       Disable MLST analysis (default='0')
        --no_pangenome  Disable pangenome analysis (default='0')
        -q/--quality    Minimum mean phred score of the reads to survive after filtering (default=25) <integer>
        -r/--reference  Type path to reference genome (fasta, gbk) (default='none')
                        Reference will be used for contig ordering of the draft genome
        -t/--threads    Number of threads to use (default=1) <integer>
        --title         Path to a file containing the title in the project that will be used as title in the report
                        Avoid using special characters. TORMES will perform a default title if this option is not used
        -v/--version    Show version


Example:

tormes --metadata salmonella_metadata.txt --output Salmonella_TORMES_2018 --reference S_enterica-CT02021853.fasta --threads 32 --genera Salmonella
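
For repeated runs, the call can be wrapped so that obvious input problems are caught first. The wrapper below is only a sketch: run_tormes and the DRY_RUN variable are hypothetical names, refusing to reuse an existing output directory is a precaution of this example rather than documented TORMES behaviour, and only options listed in the usage above are passed.

```shell
# Hypothetical wrapper (not shipped with TORMES).
# Arguments: metadata file, output directory, threads, then any extra options.
run_tormes() {
    [ "$#" -ge 3 ] || { echo "usage: run_tormes METADATA OUTDIR THREADS [options]" >&2; return 2; }
    local metadata="$1" outdir="$2" threads="$3"
    shift 3
    [ -r "$metadata" ] || { echo "metadata file not readable: $metadata" >&2; return 1; }
    # Precaution of this sketch: never write into an existing directory.
    [ ! -e "$outdir" ] || { echo "output directory already exists: $outdir" >&2; return 1; }
    # Rebuild the argument list as the full tormes command line.
    set -- tormes --metadata "$metadata" --output "$outdir" --threads "$threads" "$@"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"      # only print the command
    else
        "$@"           # run it
    fi
}

# Usage: DRY_RUN=1 run_tormes salmonella_metadata.txt Salmonella_TORMES_2018 32 --genera Salmonella
```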

Obligatory options

A metadata text file, provided with the -m/--metadata option, is needed for TORMES to work. This file includes all the information regarding the samples and requires a specific organization:

  • Columns should be tab separated.
  • First column must be called Samples and contain the sample names (avoid special characters).
  • Second column must be called Read1 and contain the path to the R1 (forward) reads (either fastq or fastq.gz).
  • Third column must be called Read2 and contain the path to the R2 (reverse) reads (either fastq or fastq.gz).
  • Fourth (and so on) columns are descriptive. The information included here is not needed for TORMES to work but will be included in the interactive report. You can add as many description columns as needed (including information such as isolation date or source, different codification of each sample, etc.).
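
The rules above can be checked mechanically before starting a run. The following is a minimal sketch, not a TORMES command: check_metadata is a hypothetical name, and the check only covers the three fixed column names and the existence of the read files.

```shell
# Hypothetical helper (not shipped with TORMES).
check_metadata() {
    local meta="$1" tab header sample r1 r2 rest f lineno=0 status=0
    tab=$(printf '\t')
    [ -r "$meta" ] || { echo "cannot read: $meta" >&2; return 2; }
    # The first three column names are fixed: Samples, Read1, Read2.
    IFS= read -r header < "$meta"
    case "$header" in
        "Samples${tab}Read1${tab}Read2"|"Samples${tab}Read1${tab}Read2${tab}"*) ;;
        *) echo "bad header: expected Samples<TAB>Read1<TAB>Read2"; status=1 ;;
    esac
    # Every line after the header must point to two existing read files.
    while IFS="$tab" read -r sample r1 r2 rest || [ -n "$sample" ]; do
        lineno=$((lineno + 1))
        if [ "$lineno" -eq 1 ] || [ -z "$sample" ]; then continue; fi
        for f in "$r1" "$r2"; do
            [ -f "$f" ] || { echo "$sample: read file not found: $f"; status=1; }
        done
    done < "$meta"
    return "$status"
}

# Usage: check_metadata salmonella_metadata.txt && echo "metadata looks fine"
```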

This is an example of what the metadata file should look like:

Samples    Read1                    Read2                    Description1                 Description2
Sample1    Forward read location    Reverse read location    Description 1 of Sample 1    Description 2 of Sample 1
Sample2    Forward read location    Reverse read location    Description 1 of Sample 2    Description 2 of Sample 2

If you run into problems when preparing the metadata file, you can generate a template metadata file by typing: tormes example-metadata.
This command will generate a file called samples_metadata.txt in your working directory that can be used as a template for your own dataset.
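
When there are many samples, the three mandatory columns can also be produced from a directory of reads. The sketch below is not part of TORMES: make_metadata is a hypothetical name, and it assumes that read pairs are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz, which may not match your files. Description columns still have to be added by hand.

```shell
# Hypothetical helper (not shipped with TORMES).
# Assumption: reads are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz.
make_metadata() {
    local dir="$1" r1 r2 sample
    printf 'Samples\tRead1\tRead2\n'
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue            # no match: the pattern stays literal
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        sample=$(basename "$r1" _R1.fastq.gz)
        if [ -f "$r2" ]; then
            printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2"
        else
            echo "skipping $sample: no R2 file" >&2
        fi
    done
}

# Usage: make_metadata /path/to/reads > samples_metadata.txt
```

Paths are written exactly as given in the argument, so passing an absolute directory yields absolute read paths.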


Output

TORMES stores every file generated during the analysis in different directories according to the step of the analysis (assembly, annotation, etc.), all of them inside the main output directory specified with the -o/--output option:

  • annotation: one directory per sample containing all the annotation files generated by Prokka.
  • antibiotic_resistance_genes: results of the screening for antibiotic resistance genes by using Abricate against three databases: ARG-ANNOT, CARD and ResFinder.
  • assembly: files resulting from genome assembly with SPAdes or Megahit (in gzipped directories, to unzip them type tar xzf file-name.tgz) and the assembly stats generated with Quast.
  • cleaned_reads: reads that survived after quality filtering using Prinseq, Trimmomatic or Sickle.
  • draft_genomes: stores the draft genomes. If the -r/--reference option is used, draft genomes will be ordered against a reference by using Mauve and stored here. Contigs < 200 bp are removed.
  • mlst: results of Multi-Locus Sequence Typing (MLST) by using mlst.
  • pangenome: results of pangenome comparison based on the presence/absence of genes between the samples by using Roary.
  • report_files.tgz: files necessary for the generation of the interactive web-like report. See further instructions here.
  • sequencing_assembly_report.txt: tabulated file including information of the sequencing (number of reads, average read length, sequencing depth), the assembly (number of contigs, genome length, average contig length, N50, GC content) and consensus taxonomic assignment.
  • species_identification: consensus taxonomic assignment of each sample by using Kraken.
  • tormes.log: log file of TORMES analysis progress.
  • tormes_report.html: interactive web report generated automatically after the WGS analysis that summarizes the results. It can be opened in any browser, shared and analyzed in a simple way.
  • virulence_genes: results of the screening for virulence genes by using Abricate against the Virulence Factors Database.
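
A finished run can be inspected against the list above. This is a minimal sketch with a hypothetical function name, check_tormes_output. It simply tests for the entries named in this section, so mlst and pangenome will be reported as missing when those optional steps were disabled.

```shell
# Hypothetical helper (not shipped with TORMES).
check_tormes_output() {
    local outdir="$1" entry missing=0
    for entry in annotation antibiotic_resistance_genes assembly cleaned_reads \
                 draft_genomes mlst pangenome report_files.tgz \
                 sequencing_assembly_report.txt species_identification \
                 tormes.log tormes_report.html virulence_genes; do
        if [ -e "$outdir/$entry" ]; then
            echo "found    $entry"
        else
            echo "missing  $entry"
            missing=$((missing + 1))
        fi
    done
    echo "$missing expected entries missing"
    [ "$missing" -eq 0 ]
}

# Usage: check_tormes_output Salmonella_TORMES_2018
```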

Once the WGS analysis has finished, TORMES summarizes the results in an interactive web-like report file. An example of a report file can be visualized here.
To generate the report file, tormes calls tormes-report (included in the TORMES pipeline), which generates an R Markdown file called tormes_report.Rmd. The user can modify this file to generate customized reports without re-running the entire analysis.


Rendering customized reports

Reports are generated by rendering the "tormes_report.Rmd" file in the R environment. This file is generated automatically once the TORMES WGS analysis has finished and is unique to each study. The file is written in R Markdown and can be modified manually to generate customized reports.
R Markdown is a file format for creating dynamic documents with R. Excellent documentation about this format is already available on the R Markdown webpage from RStudio. The R Markdown Reference Guide and Cheat Sheet are also recommended.
The user is encouraged to modify the "tormes_report.Rmd" file to generate user-customized reports by following the guidelines above. Once the "tormes_report.Rmd" file has been modified, a new report can be rendered by running the following command in the same directory (the TORMES environment should be activated):


Rscript -e 'library(rmarkdown); rmarkdown::render("tormes_report.Rmd", "html_document", encoding="UTF-8")'

This command will generate a new "tormes_report.html".
Please note that all the material (tables, figures, etc.) to be included in the report needs to be in the same directory as the "tormes_report.Rmd" file.
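
When a report is re-rendered several times, it can be convenient to keep the previous version. The wrapper below is only a sketch: render_tormes_report and the backup file name tormes_report.previous.html are hypothetical, and it assumes that rendering writes tormes_report.html next to the .Rmd file, as described above. The Rscript call itself is the command shown earlier in this section.

```shell
# Hypothetical wrapper (not shipped with TORMES).
render_tormes_report() {
    local dir="${1:-.}"
    if [ ! -f "$dir/tormes_report.Rmd" ]; then
        echo "tormes_report.Rmd not found in $dir" >&2
        return 1
    fi
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not on PATH; is the TORMES environment active?" >&2
        return 1
    fi
    # Keep a copy of the previous report before it is regenerated.
    if [ -f "$dir/tormes_report.html" ]; then
        cp "$dir/tormes_report.html" "$dir/tormes_report.previous.html"
    fi
    # Render from inside the directory so relative paths in the .Rmd resolve.
    ( cd "$dir" && Rscript -e 'library(rmarkdown); rmarkdown::render("tormes_report.Rmd", "html_document", encoding="UTF-8")' )
}

# Usage: render_tormes_report /path/to/directory/with/tormes_report.Rmd
```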


Citation

Please cite the following publication if you are using TORMES:


Narciso M. Quijada, David Rodríguez-Lázaro, Jose María Eiros and Marta Hernández (2019). TORMES: an automated pipeline for whole bacterial genome analysis. Bioinformatics, 35(21), 4207–4212, https://doi.org/10.1093/bioinformatics/btz220


The dependencies described in the Required dependencies section are the backbone of TORMES, and users are encouraged to cite them when using TORMES.


License

TORMES is free software, licensed under GPLv3.


Versions history

  • v.1.1: solved issues from v.1.0. New features: quality control reports of the reads before/after quality filtering and the possibility to start the analysis directly from (draft) genomes (will be released soon).
  • v.1.0 (current): original version of the TORMES pipeline.