An automated pipeline for whole bacterial genome analysis directly from raw Illumina paired-end sequencing data.
(repository under development)
- What is TORMES?
- Installation
- Usage
- Output
- Rendering customized reports
- Open-source and networking community
- Citation
- License
- Versions history
TORMES is an open-source, user-friendly pipeline for whole bacterial genome sequencing (WGS) analysis directly from raw Illumina paired-end sequencing data. TORMES works with any bacterial WGS dataset, regardless of the number, origin or species of the samples. By following very simple instructions, TORMES automates all the steps of a typical WGS analysis, including:
- Sequence quality filtering
- De novo genome assembly
- Draft genome ordering against a reference (optional)
- Genome annotation
- Multi-Locus Sequence Typing (MLST, optional)
- Antibiotic resistance genes screening
- Virulence genes screening
- Pangenome comparison (optional)
When working with Escherichia or Salmonella sequence data, extensive analysis can be enabled (by using the -g/--genera option), including:
- Antibiotic resistance screening based on point mutations
- Plasmid replicons screening
- Serotyping
- fimH-Typing (only for Escherichia)
Once the WGS analysis has finished, TORMES summarizes the results in an interactive web-like file that can be opened in any web browser, making the results easy to analyze, compare and share.
TORMES is written in a combination of Bash, R and R Markdown. Once the WGS analysis has finished, TORMES automatically generates an R Markdown file (unique for each analysis) that is used to render the report in the R environment. This file can be modified by the user to generate user-specific reports (by including additional information, tables, figures, etc.). Additional information can be found in the Rendering customized reports section. We will probably transition the TORMES code (Bash) to Snakemake, so that the pipeline itself can also be customized by users.
TORMES is a pipeline, and it relies on a number of bioinformatic tools that constitute its backbone and that are listed in the Required dependencies section.
We think that the availability of open-source, user-friendly pipelines will broaden the application of certain technologies, such as WGS, that are sometimes limited by the analysis of the large amount of data generated by high-throughput DNA sequencing (HTS) platforms. TORMES was devised with this aim, inspired by some excellent currently available software, such as Nullarbor, EnteroBase and the Center for Genomic Epidemiology. Users are encouraged to try every tool available and to find those that best fit their studies.
We would like TORMES to be regularly updated with the newest tools and databases, and to improve the pipeline by taking users' suggestions into account. Currently, we are working on the following improvements:
- Possibility to start the pipeline from (draft) genomes.
- Extend the pipeline to data generated by other HTS platforms.
- Include additional analysis for Staphylococcus aureus (by using the -g/--genera option).
- Include the use of customized databases.
TORMES requires many dependencies to work, so it has been designed to be used as a conda environment. To install TORMES and all its dependencies, run:
wget https://anaconda.org/nmquijada/tormes-1.0/2019.04.25.180147/download/tormes-1.0.yml
conda env create -n tormes-1.0 --file tormes-1.0.yml
To activate the TORMES environment, run:
conda activate tormes-1.0
Additionally, the first time you use TORMES, run (after activating the TORMES environment):
tormes-setup
This step will install additional dependencies not available in conda, including the MiniKrakenDB_8GB required by Kraken to work (the database size is ~8GB and it might take some time to download). It will also automatically create the config_file.txt required for TORMES to work (see below).
TORMES is a pipeline and it requires several dependencies to work:
- ABRicate
- FastTree
- GNUParallel
- ImageMagick
- Kraken
- Megahit
- mlst
- Prinseq
- progressiveMauve
- Prokka
- Quast
- R
- Roary
- roary2svg.pl
- Sickle
- SPAdes
- Trimmomatic
Additional software when working with -g/--genera Escherichia.
Additional software when working with -g/--genera Salmonella.
TORMES looks for the software listed in config_file.txt, a simple tab-separated text file that gives each software/database and its location. A config_file.txt is created automatically when running the tormes-setup command. However, you can change the path to any software if you would like to use a different version (if you do so, keep the software names and the tab separation).
You can find an example of the config_file.txt here.
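If you edit config_file.txt by hand, a quick format check can catch a broken tab or a wrong path before a run starts. The following is only a minimal sketch: `check_config` is a hypothetical helper (it is not shipped with TORMES), and it assumes that each non-empty line holds exactly two tab-separated fields, a name and a location that exists on disk.

```shell
# Sketch only: check_config is a hypothetical helper, not part of TORMES.
# Assumes each non-empty line is: <software/database name><TAB><location>
check_config() {
    local config="$1" status=0 tab name location extra
    tab=$(printf '\t')
    while IFS="$tab" read -r name location extra || [ -n "$name" ]; do
        # Skip blank lines
        [ -z "$name$location$extra" ] && continue
        if [ -z "$location" ] || [ -n "$extra" ]; then
            echo "Malformed line (expected 2 tab-separated fields): $name" >&2
            status=1
        elif [ ! -e "$location" ]; then
            echo "Location not found for $name: $location" >&2
            status=1
        fi
    done < "$config"
    return "$status"
}
```

For instance, `check_config PATH/TO/TORMES/files/config_file.txt && echo "config looks fine"` prints the message only when every line passes both checks.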
Usage: tormes [options]
OBLIGATORY OPTIONS:
-m/--metadata Path to the file with the metadata regarding the samples
The file must have an specific organization for the program to work
If you don't have any or you would like to have an example or extra information,
please type:
tormes example-metadata
-o/--output Path and name of the output directory
OTHER OPTIONS:
-a/--adapter Path to the adapters file
(default="PATH/TO/TORMES/files/adapters.fa")
--assembler Select the assembler to use. Options available: 'spades', 'megahit'
(default='spades')
-c/--config Path to the configuration file with the location of all dependencies
(default="PATH/TO/TORMES/files/config_file.txt")
--citation Show citation
--fast Faster analysis (default='0')
('megahit' is used as assembler and contig ordering and pangenome analysis are disabled)
--filtering Select the software for filtering the reads.
Options available: 'prinseq', 'sickle', 'trimmomatic'
(default="prinseq")
-g/--genera Type genera name to allow special analysis (default='none')
Options available: 'Escherichia', 'Salmonella'
-h/--help Show this help
--min_len Minimum length to the reads to survive after filtering (default=125) <integer>
--no_mlst Disable MLST analysis (default='0')
--no_pangenome Disable pangenome analysis (default='0')
-q/--quality Minimum mean phred score of the reads to survive after filtering (default=25) <integer>
-r/--reference Type path to reference genome (fasta, gbk) (default='none')
Reference will be used for contig ordering of the draft genome
-t/--threads Number of threads to use (default=1) <integer>
--title Path to a file containing the title in the project that will be used as title in the report
Avoid using special characters. TORMES will perform a default title if this option is not used
-v/--version Show version
Example:
tormes --metadata salmonella_metadata.txt --output Salmonella_TORMES_2018 --reference S_enterica-CT02021853.fasta --threads 32 --genera Salmonella
A metadata text file is needed for TORMES to work and is supplied with the -m/--metadata option. This file includes all the information regarding the samples and requires a specific organization:
- Columns should be tab separated.
- First column must be called Samples and contain the sample names (avoid special characters).
- Second column must be called Read1 and contain the path to the R1 (forward) reads (either fastq or fastq.gz).
- Third column must be called Read2 and contain the path to the R2 (reverse) reads (either fastq or fastq.gz).
- Fourth (and so on) columns are descriptive. The information included here is not needed for TORMES to work but will be included in the interactive report. You can add as many description columns as needed (including information such as isolation date or source, alternative codes for each sample, etc.).
This is an example of how the metadata file should look:
Samples | Read1 | Read2 | Description1 | Description2 |
---|---|---|---|---|
Sample1 | Forward read location | Reverse read location | Description 1 of Sample 1 | Description 2 of Sample 1 |
Sample2 | Forward read location | Reverse read location | Description 1 of Sample 2 | Description 2 of Sample 2 |
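The layout described above can be checked before launching a run. The sketch below is only an illustration under stated assumptions: `validate_metadata` is a hypothetical helper (not part of TORMES) that verifies the three fixed column names and that every read file listed actually exists.

```shell
# Sketch only: validate_metadata is a hypothetical helper, not part of TORMES.
validate_metadata() {
    local metadata="$1" status=0 tab header expected
    local header_line sample read1 read2 rest reads
    tab=$(printf '\t')
    # The first three column names are fixed: Samples, Read1, Read2
    header=$(head -n 1 "$metadata" | cut -f 1-3)
    expected=$(printf 'Samples\tRead1\tRead2')
    if [ "$header" != "$expected" ]; then
        echo "Header must start with Samples, Read1, Read2 (tab separated)" >&2
        return 1
    fi
    {
        read -r header_line   # skip the header row
        while IFS="$tab" read -r sample read1 read2 rest || [ -n "$sample" ]; do
            [ -z "$sample" ] && continue   # ignore blank lines
            # Both read files of each sample must exist
            for reads in "$read1" "$read2"; do
                if [ ! -f "$reads" ]; then
                    echo "Sample $sample: read file not found: $reads" >&2
                    status=1
                fi
            done
        done
    } < "$metadata"
    return "$status"
}
```

A call such as `validate_metadata salmonella_metadata.txt` returns a non-zero status and lists the offending samples on stderr if anything is missing.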
If you run into problems when preparing the metadata file, you can generate a template by typing: tormes example-metadata.
This command will generate a file called samples_metadata.txt in your working directory that can be used as a template for your own dataset.
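For larger datasets, the three mandatory columns can also be filled in from a directory listing. This is a minimal sketch, not a TORMES feature: `make_metadata` is a hypothetical helper, and it assumes (purely for illustration) that reads are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`. Adjust the suffixes to your own naming scheme.

```shell
# Sketch only: make_metadata is a hypothetical helper, not part of TORMES.
# Assumed naming scheme: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
make_metadata() {
    local reads_dir="$1" out="$2" r1 r2 sample
    # Header with the three mandatory column names
    printf 'Samples\tRead1\tRead2\n' > "$out"
    for r1 in "$reads_dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue            # no match: the glob stays literal
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        if [ ! -f "$r2" ]; then
            echo "Skipping $r1: no matching R2 file" >&2
            continue
        fi
        sample=$(basename "$r1" _R1.fastq.gz)
        printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2" >> "$out"
    done
}
```

Passing an absolute directory, for example `make_metadata "$PWD/reads" samples_metadata.txt`, keeps the read paths valid regardless of where the analysis is later started. Descriptive columns can then be appended in a spreadsheet or text editor.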
TORMES stores every file generated during the analysis in a different directory according to the step of the analysis (assembly, annotation, etc.). All of them are located within the main output directory specified with the -o/--output option:
- annotation: one directory per sample containing all the annotation files generated by Prokka.
- antibiotic_resistance_genes: results of the screening for antibiotic resistance genes by using ABRicate against three databases: ARG-ANNOT, CARD and ResFinder.
- assembly: files resulting from genome assembly with SPAdes or Megahit (in gzipped directories; to unzip them type tar xzf file-name.tgz) and the assembly stats generated with Quast.
- cleaned_reads: reads that survived quality filtering using Prinseq, Trimmomatic or Sickle.
- draft_genomes: stores the draft genomes. If the -r/--reference option is used, draft genomes will be ordered against a reference by using Mauve and stored here. Contigs < 200 bp are removed.
- mlst: results of Multi-Locus Sequence Typing (MLST) by using mlst.
- pangenome: results of pangenome comparison based on the presence/absence of genes between the samples by using Roary.
- report_files.tgz: files necessary for the generation of the interactive web-like report. See further instructions here.
- sequencing_assembly_report.txt: tabulated file including information of the sequencing (number of reads, average read length, sequencing depth), the assembly (number of contigs, genome length, average contig length, N50, GC content) and consensus taxonomic assignment.
- species_identification: consensus taxonomic assignment of each sample by using Kraken.
- tormes.log: log file of TORMES analysis progress.
- tormes_report.html: interactive web report generated automatically after the WGS analysis that summarizes the results. It can be opened in any browser, shared and analyzed in a simple way.
- virulence_genes: results of the screening for virulence genes by using ABRicate against the Virulence Factors Database.
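The gzipped assembly directories mentioned in the list above can be unpacked in one go. The loop below is only a sketch: `unpack_assemblies` is a hypothetical helper (not part of TORMES) that applies the `tar xzf` command to every `.tgz` archive found in the assembly directory of a finished run, whatever the archives are called.

```shell
# Sketch only: unpack_assemblies is a hypothetical helper, not part of TORMES.
# $1 is the main output directory given to -o/--output.
unpack_assemblies() {
    local outdir="$1" archive
    for archive in "$outdir"/assembly/*.tgz; do
        [ -e "$archive" ] || continue      # nothing to unpack
        # Extract next to the archive instead of the current directory
        tar xzf "$archive" -C "$outdir/assembly"
    done
}
```

With the example run shown earlier, this would be called as `unpack_assemblies Salmonella_TORMES_2018`.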
Once the WGS analysis has finished, TORMES summarizes the results in an interactive web-like report file. An example of a report file can be visualized here.
To generate the report file, tormes calls tormes-report (included in the TORMES pipeline), which generates an R Markdown file called tormes_report.Rmd. This file can be modified by the user to generate customized reports without re-running the entire analysis.
Reports are generated by rendering the "tormes_report.Rmd" file in the R environment. This file is generated automatically once the TORMES WGS analysis has finished and it is unique for each study. The file is written in R Markdown code and it can be manually modified and used for the generation of customized reports.
R Markdown is a file format for creating dynamic documents with R. Excellent documentation about this format is already available on the R Markdown webpage from RStudio. The R Markdown Reference Guide and Cheat Sheet are also recommended.
The user is encouraged to modify the "tormes_report.Rmd" file to generate user-customized reports by following the guidelines above. Once the "tormes_report.Rmd" file has been modified, it can be used to render a new report by running the following command in the same directory (the TORMES environment should be activated):
Rscript -e 'library(rmarkdown); rmarkdown::render("tormes_report.Rmd", "html_document", encoding="UTF-8")'
This command will generate a new "tormes_report.html".
Please note that all the information (tables, figures, etc.) to be included in the report file needs to be in the same directory as the "tormes_report.Rmd" file.
TORMES was devised with the aim of being an open-source and easy tool that everybody can use for their WGS experiments. Bacterial bioinformatics is developing rapidly, and the availability of open code and tools is crucial for the scientific community to benefit from these developments.
Additionally, TORMES is intended to be a networking project with users providing their feedback and personal experience so that TORMES can become a more complete pipeline including as many analyses and genera as possible.
It has been almost a year since we launched this tool and we are very happy with the response of the community. Most of the suggestions are being considered for further improvements of the TORMES pipeline, and some users have also shared code that could be used to extend TORMES analysis and/or to overcome some issues/challenges.
We are working towards a finer tool for WGS that can be freely provided to the community, and the feedback from users is proving pivotal.
- In this issue - nmquijada#16 - @biobrad has compiled some of the issues he ran into when running TORMES and shares instructions on how to solve them.
- Additionally, @biobrad has developed Tormesbot, a tool to assist other microbiologists who are not computer savvy in manipulating the metadata and passing arguments to an HPC environment.
Please cite the following publication if you are using TORMES:
Narciso M. Quijada, David Rodríguez-Lázaro, Jose María Eiros and Marta Hernández (2019). TORMES: an automated pipeline for whole bacterial genome analysis. Bioinformatics, 35(21), 4207–4212, https://doi.org/10.1093/bioinformatics/btz220
The dependencies described in this section are the backbone of TORMES, and users are encouraged to cite them when using TORMES.
TORMES is free software, licensed under GPLv3.
- v.1.1: solved issues from v.1.0. New features: quality control reports of the reads before/after quality filtering and the possibility to start the analysis directly from (draft) genomes (will be released soon).
- v.1.0 (current): original version of the TORMES pipeline.