Shotgun Metagenomics & Metatranscriptomics Nextflow Pipeline

Introduction

Chiamh/meta-omics-nf is a bioinformatics pipeline that takes raw metagenomic (MGX) and/or metatranscriptomic (MTX) reads and annotates them at the taxonomic and functional levels.

The pipeline is built using Nextflow DSL2 , a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker containers (also compatible with Singularity) for ease of installation and computational reproducibility.

This pipeline currently only accepts paired-end reads as inputs.

Pipeline summary for metagenomic reads

Adapter trimming and quality control using fastp (0.22.0)
Optional removal of host (human) reads by mapping to a reference genome using bwa (0.7.17)
Taxonomic classification of non-human reads using Kraken2 (2.1.2) and taxonomic abundance re-estimation using Bracken (2.6.1)
Optional removal of microbial spike-in sequences
Functional annotation of reads by mapping to a microbial gene catalog of choice using bowtie2 (2.4.4)
Functional annotation of unmapped reads in the previous step, to a larger protein database e.g. UHGP or Uniref90 using diamond (2.0.12)
Annotation of mappable reads into Clusters of Orthologous groups (COGs) using EggNOG mapper (2.1.6)

Pipeline summary for metatranscriptomic reads

Adapter trimming and quality control using fastp (0.22.0)
Optional removal of host (human) reads using STAR (2.7.9a), a splice aware aligner.
Optional computational removal of prokaryotic and eukaryotic rRNAs using a k-mer based strategy with bbmap (38.93)
Optional sequence de-duplication using bbmap (38.93) clumpify.sh
Taxonomic classification of non-human reads using Kraken2 (2.1.2)
Functional annotation of reads by mapping to a microbial gene catalog of choice using bowtie2 (2.4.4)
Functional annotation of unmapped reads in the previous step, to a larger protein database e.g. UHGP or Uniref90 using diamond (2.0.12)
Annotation of mappable reads into Clusters of Orthologous groups (COGs) using EggNOG mapper (2.1.6)

Quick Start

Install Nextflow (>=21.04.0) and add the nextflow executable to your $PATH
Install Docker

Clone the pipeline and refer to the help message

$ git clone https://github.com/Chiamh/meta-omics-nf

$ nextflow run ./meta-omics-nf/main.nf --help

Add a custom config file which contains the paths to various pre-installed databases. Refer to the test.config file in this repo for an example.
Add a custom profile in the nextflow.config file, allowing you to specify the use of docker or singularity, and/or a task scheduler.

Make sure all helper scripts in meta-omics-nf/bin have execute permissions
```
$ chmod +x ./meta-omics-nf/bin/*
```

Run the pipeline

$ nextflow run ./meta-omics-nf/main.nf -profile docker,your_profile --rna_reads /path/to/metatranscriptomes --dna_reads /path/to/metagenomes --outdir /path/to/results

You can specifiy multiple profiles separated by comma, e.g. -profile docker,sge.
The taxonomic classification, nucleotide alignment, translated search and annotation modules can be quite memory intensive depending on the databases used
Delete the work/ directory after running the pipeline to free up space taken up by intermediate files
There are modular workflows (decontaminate and classify) to reduce the size of intermediate files produced by the pipeline. See the help message for more details.
There is a concatenate workflow (-entry concatenate) to merge fastq.gz files across lanes for the same sample ID.

You have the flexibility to turn off DNA spike in removal and/or the eggNOG annotation modules. See the help message for more details

 $ nextflow run ./meta-omics-nf/main.nf -profile docker,your_profile -entry decontaminate --rna_reads /path/to/metatranscriptomes --dna_reads /path/to/metagenomes --outdir /path/to/results
 
 $ nextflow run ./meta-omics-nf/main.nf -profile docker,your_profile -entry classify --rna_reads /path/to/DECONTAMINATED_metatranscriptomes --dna_reads /path/to/DECONTAMINATED_metagenomes --outdir /path/to/results

Output files (bolded items are noteworthy for downstream analyses)

Caveat: Read counts (paired-end) for some reports are not directly comparable with read counts (unpaired) for other reports. This is necessary because kraken2 is run in paired-end mode, whereas functional annotations take "unpaired" inputs.

Why is it preferable to perform functional annotations using unpaired despite paired-end data? Read this.

decont/DNA (for metagenomes) or decont/RNA (for metatranscriptomes)
- These folders contain decontaminated reads in fastq.gz format. Decontamination means adapter removal, host read removal, rRNA removal (for MTX) and de-duplication (for MTX). These steps are on by default, but are optional.
kraken2_out/DNA
- *kraken2.out : Kraken2 raw output with taxonomic classification for each read.
- *kraken2.tax : Kraken2 report
- *kraken2_minimizer.tax : Kraken2 report with distinct minimizer count information.
- *bracken.tax : Bracken report file with a simlar format to the Kraken2 report
- *bracken.out : Bracken output, tab separated with taxonomic classifications, read counts (paired-end) and abundance estimations.
- *k2.s.tsv : Simplified species level Kraken2 report, tab separated. Column 1: relative abundance; Column 2: paired read counts rooted at this taxon; Column 3: Number of minimizers in read data associated with this taxon; Column 4: An estimate of the number of distinct minimizers in read data associated with this taxon; Column 5: taxonomic classification at species level or unclassified. Note: reads classified at genus level and above are not included in this report.
kraken2_out/RNA
- *kraken2.out : Kraken2 raw output with taxonomic classification for each read.
- *kraken2.tax : Kraken2 report
- *kraken2_minimizer.tax : Kraken2 report with distinct minimizer count information.
- *k2.s.tsv : Simplified species level Kraken2 report, tab separated. Column 1: relative abundance; Column 2: paired read counts rooted at this taxon; Column 3: Number of minimizers in read data associated with this taxon; Column 4: An estimate of the number of distinct minimizers in read data associated with this taxon; Column 5: taxonomic classification at species level or unclassified. Note: reads classified at genus level and above are not included in this report.
MGX_panalign_out/ (for metagenomes) or MTX_panalign_out/ (for metatranscriptomes)
- *bt2_pangenome_aligned.bam : BAM file after alignment of reads (unpaired) to pangene catalog.
- *bt2_pangenome_aligned_filtered_cov.tsv : Tab separated file containing read coverage across pangenes. Only pangenes with >= 50% coverage are reported here.
- *bt2_pangenome_unaligned.fastq.gz : All reads that did not align to the pangene catalog.
MGX_dmnd_out/ (for metagenomes) or MTX_panalign_out/ (for metatranscriptomes)
- *uniref90_aligned.out : Tab separated file containing alignment information to Uniref90 database. Columns are in BLAST format in this order: qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send slen evalue bitscore salltitles
- *uniref90_unaligned.fa : All reads that did not align to the Uniref90 database (and also by consequence, the pangene catalog)
MGX_annotations/ (for metagenomes) or MTX_annotations/ (for metatranscriptomes)
- *decont.emapper.annotations : Tab separated output from eggnog mapper 2.1.6.
  
  Note: Apart from reads which did not align to pangene catalogs or Uniref90 databases, the following annotations are over features with >= 50% read coverage.
- *all_aligned_taxonomy.tsv : Tab separated output. Column 1: Pangene ID; Column 2: Read ID; Column 3: Kraken2 taxonomic classification; Column 4: Uniref90 ID.
- *all_aligned_taxonomy_summary.tsv : Tab separated output summarizing read counts per kraken2 taxon per gene feature. Column 1: Number of unpaired reads; Column 2: Pangene ID; Column 3: Kraken2 classification; Column 4: Uniref90 ID.
- *unaligned_taxonomy.tsv : Tab separated output. Column 1: Read ID; Column 2: Kraken2 taxonomic classification. This details the taxonomic classifications for reads which did not align to pangene catalogs or Uniref90 databases
- *panalign_annot.tsv : Tab separated output containing functional annotations (COGs, Uniref90 and Gene Onthologies) for pangenes with >= 50% read coverage. Columns are in order: pangene length, percent_cov, unpaired_read_count, pangene_desc, uniref90_ID, uniref90_desc, uniref90_GO, emapper_max_annot_OG, emapper_OG, emapper_max_annot_lvl, eggnog_cat, eggnog_GO, eggnog_desc
- *transl-search_annot.tsv : Tab separated output containing functional annotations (COGs, Uniref90 and Gene Onthologies) for Uniref90 clusters (with no pangene assignment) with >= 50% read coverage. Columns are in order: uniref90_ID, percent_cov, AA_length, unpaired_read_count, uniref90_desc, emapper_max_annot_OG, emapper_OG, emapper_max_annot_lvl, eggnog_cat, eggnog_GO, eggnog_desc, uniref90_GO

Updates

There is now a "concatenate" subworkflow to merge fastq.gz files across different lanes for the same sample ID. Fixed bug in the concat module due to imprecise globbing. (Apr 2024)
Fixed code to not have the same variable name appear more than once in the input block for any process.
Add distinct minimizer information to kraken2 reports for false positive filtering. (Mar 2023)
Add taxonomy summary reports (Mar 2023)
Better templates for singularity.config and running singularity. Fix conflict in the main nextflow.config file. (Apr 2024)

Contact

Minghao Chia: chia_minghao@gis.a-star.edu.sg, chiaminghao@gmail.com