BaitCapture

BaitCapture is a bioinformatics workflow designed for processing sequencing data obtained from targeted resistome bait-capture sequencing, built using Nextflow.

Introduction
Quick start
Pipeline summary
Usage
- Input type: Samplesheet
- Input type: Folder
  - Case #1: Default file name pattern
  - Case #2: Alternate file name pattern
Output
Testing the workflow
Running the workflow on high-performance compute clusters
- Waffles
Advanced usage
Contributions and support
Citations

Introduction

BaitCapture is based on a Bash script workflow originally created by Shay et al. (2023). Although it was designed with bait-capture sequencing data in mind, BaitCapture can be used for any paired-end Illumina sequencing dataset where many sequence reads need to be aligned to a reference database of gene targets.

BaitCapture offers the following features:

Quality control: Assess the quality of raw and pre-processed sequence data.
Pre-processing:
- Read decontamination using a host reference genome
- Quality-based trimming
- Adapter removal
Read alignment: Align reads against a reference database of gene targets using KMA, BWA-MEM2, or BWA.
Alignment reports:
- mapstats.tsv: A table of read alignment statistics against each gene target for each sample, including KMA-specific alignment statistics.
- sumstats.tsv: A table of summary statistics for each sample, including the on-target alignment rate, and the number of reads lost from host decontamination and filtering.
- presence_absence.tsv: A table of presence-absence calls for each gene target in each sample, based upon user-defined thresholds.
- presence_absence_clusters.tsv: A table of presence-absence calls for each gene target cluster in each sample, with clusters defined by a target metadata file (e.g. resistance mechanism).

Quick start

Pipeline summary

The steps of the workflow are:

Report the quality of the raw sequence data using FastQC.
(Optional) Trim the raw sequence reads using fastp.
(Optional) Decontaminate the trimmed sequence reads using a host reference genome with BWA-MEM2.
Report the quality of the pre-processed sequence data using FastQC.
Obtain total read and bp counts from the raw and pre-processed sequence data using fastq-scan.
Align trimmed and/or decontaminated reads against the database of gene targets using KMA, BWA-MEM2, or BWA.
Obtain sequence coverage and depth statistics from the alignments using AlignCov, Mosdepth, and SAMtools.
Create a MultiQC report and other summary reports with custom scripts.

Usage

If you are new to Nextflow and the nf-core framework, please refer to this page on how to set up Nextflow.

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
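As a minimal sketch of the -params-file route, the commands below write a small YAML parameters file and show how it could be passed to Nextflow. The parameter names (input, targets, outdir, aligner) are taken from the pipeline's --help output; the file name params.yaml and the values are placeholders chosen for illustration.

```shell
# Write a parameters file; the name params.yaml is arbitrary.
# Keys correspond to the pipeline options --input, --targets, --outdir, --aligner.
cat > params.yaml <<'EOF'
input: samplesheet.csv
targets: targets.fa
outdir: results
aligner: kma
EOF

# Pass the file with Nextflow's -params-file option (commented out so the
# snippet can be pasted without launching a run):
# nextflow run OLC-Bioinformatics/BaitCapture -profile docker -params-file params.yaml
```

Keeping parameters in a file rather than on the command line makes a run easier to repeat and to record alongside the results.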

BaitCapture can be run using two different input types:

A samplesheet, including sample names and paths to paired-end gzipped FASTQ files, or
A folder containing paired-end gzipped FASTQ files.

Input type: Samplesheet

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz

Each row represents one sample and lists its pair of gzipped FASTQ files in the fastq_1 and fastq_2 columns.

Now, you can run the pipeline using:

nextflow run OLC-Bioinformatics/BaitCapture \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --targets targets.fa \
   --outdir <OUTDIR>
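When the FASTQ files already sit in one folder, a samplesheet with the three columns shown above could be generated rather than typed by hand. The function below is a hypothetical helper, not part of BaitCapture: it assumes files end in _R1_001.fastq.gz and _R2_001.fastq.gz, and it uses the text before the first underscore as the sample name purely for convenience (a samplesheet may use any sample names).

```shell
# Hypothetical helper: print a samplesheet for every *_R1_001.fastq.gz file
# in a folder, pairing it with the matching *_R2_001.fastq.gz file.
make_samplesheet() {
  dir="$1"
  echo "sample,fastq_1,fastq_2"
  for r1 in "$dir"/*_R1_001.fastq.gz; do
    [ -e "$r1" ] || continue                      # glob matched nothing
    r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"   # expected mate file
    if [ ! -e "$r2" ]; then
      echo "missing mate for $r1" >&2             # warn and skip unpaired files
      continue
    fi
    base="$(basename "$r1")"
    sample="${base%%_*}"                          # text before the first underscore
    echo "$sample,$r1,$r2"
  done
}

# Usage: make_samplesheet data > samplesheet.csv
```

The generated file can then be edited, for example to give samples more descriptive names, before it is passed to --input.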

Input type: Folder

Instead of a samplesheet, the user can provide a path to a directory containing gzipped FASTQ files. In this case, the sample name will be the name of the file up until the first underscore (_).

Case #1: Default file name pattern

The default file naming pattern that --input_folder searches for is "/*_R{1,2}_001.fastq.gz". For example, for a folder data/ containing sequencing files that looks as follows:

data
├── ERR9958133_R1_001.fastq.gz
├── ERR9958133_R2_001.fastq.gz
├── ERR9958134_R1_001.fastq.gz
└── ERR9958134_R2_001.fastq.gz

The workflow can be run simply by using:

nextflow run OLC-Bioinformatics/BaitCapture \
   -profile <docker/singularity/.../institute> \
   --input_folder data/ \
   --targets targets.fa \
   --outdir <OUTDIR>

And the sample names will be:

ERR9958133
ERR9958134
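The naming rule for folder input (the file name up to the first underscore) can be mimicked in the shell to preview sample names before a run. The function sample_name below is a hypothetical helper written for illustration; it applies the rule literally and is not the pipeline's own code. The sampleA file names are invented to show an edge case.

```shell
# Hypothetical helper mirroring the stated rule: the sample name is the
# file name up to the first underscore.
sample_name() {
  base="$(basename "$1")"
  printf '%s\n' "${base%%_*}"
}

sample_name data/ERR9958133_R1_001.fastq.gz    # ERR9958133
sample_name data/sampleA_rep1_R1_001.fastq.gz  # sampleA
sample_name data/sampleA_rep2_R1_001.fastq.gz  # sampleA (same name as above)
```

If the rule is applied literally, files whose identifiers contain underscores would end up sharing a sample name. How the pipeline treats such a clash is not stated here, so renaming the files or using a samplesheet is the safer choice in that situation.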

Case #2: Alternate file name pattern

If the names of the gzipped FASTQ files do not end with _R{1,2}_001.fastq.gz, an alternate sequencing file pattern must be specified using --pattern. For example, for a folder more-data/ that looks as follows:

more-data
├── SAMN000214_R1.fastq.gz
├── SAMN000214_R2.fastq.gz
├── SAMN000215_R1.fastq.gz
└── SAMN000215_R2.fastq.gz

The workflow can be run using:

nextflow run OLC-Bioinformatics/BaitCapture \
   -profile <docker/singularity/.../institute> \
   --input_folder more-data/ \
   --pattern "/*_R{1,2}.fastq.gz" \
   --targets targets.fa \
   --outdir <OUTDIR>

And the sample names will be:

SAMN000214
SAMN000215

Note

When providing an argument to --pattern, the string must be enclosed in double quotes ("") and must be prepended with a forward slash (/).

Output

The pipeline will output the following directories:

results/
├── fastp
├── fastqc
│   ├── preprocessed
│   └── raw
├── multiqc
│   ├── multiqc_data
│   └── multiqc_plots
├── pipeline_info
└── summary

fastp/: Contains fastp reports for trimmed FASTQ files.
fastqc/: Contains FastQC reports for raw and pre-processed FASTQ files.
multiqc/: Contains MultiQC reports.
pipeline_info/: Contains Nextflow logs and reports.
summary/: Contains alignment reports for each sample.

Testing the workflow

To check if BaitCapture, Nextflow, and your container manager have been configured properly, a test run of the workflow can be performed by first cloning the GitHub repository and then running the test workflow as follows:

git clone https://github.com/OLC-Bioinformatics/BaitCapture
cd BaitCapture/
nextflow run . \
  -profile test,<docker/singularity/.../institute> \
  --outdir <OUTDIR>

If your <OUTDIR> was results/, you could then run the following command to inspect the summary statistics for the test sample:

$ cat results/summary/sumstats.tsv 
sampleid        raw_total_reads raw_total_bp    fastp_total_reads       fastp_total_bp  decontam_total_reads        decontam_total_bp   mapped_total_reads      mapped_total_bp percent_reads_lost_fastp        percent_reads_lost_decontam    percent_reads_on_target
SRR14739083     553624  83043600        464776  69619648        455764  68278988        98485   14753774        16.05   1.94    21.61
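The percentage columns in this output can be reproduced from the count columns, which is a quick way to check how they relate. The formulas below are inferred from the column names and happen to match the values shown for the test sample; they are not confirmed to be the pipeline's exact computation, and the helper names pct_lost and pct_of are hypothetical.

```shell
# Hypothetical helper: percentage of reads lost between two read counts.
pct_lost() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.2f\n", (before - after) / before * 100 }'
}

# Hypothetical helper: one read count as a percentage of another.
pct_of() {
  awk -v part="$1" -v total="$2" \
    'BEGIN { printf "%.2f\n", part / total * 100 }'
}

pct_lost 553624 464776   # raw -> fastp reads:      16.05
pct_lost 464776 455764   # fastp -> decontam reads:  1.94
pct_of   98485  455764   # mapped / decontam reads: 21.61
```

Under these assumed formulas, each loss is measured relative to the step immediately before it, and the on-target rate is measured against the reads that remain after pre-processing rather than against the raw reads.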

Running the workflow on high-performance compute clusters

Nextflow is capable of running several jobs in parallel using job submission managers (e.g. SLURM) that have been configured on high-performance compute (HPC) clusters. For your convenience, profiles have been added to simplify running the workflow on commonly used clusters.

Waffles

To run BaitCapture on the National Microbiology Laboratory's HPC cluster Waffles using Singularity, use -profile waffles. For example:

nextflow run OLC-Bioinformatics/BaitCapture \
  -profile waffles \
  --input samplesheet.csv \
  --targets targets.fa \
  --outdir <OUTDIR>

Advanced usage

More usage information can be obtained at any time by running nextflow run OLC-Bioinformatics/BaitCapture --help:

$ nextflow run OLC-Bioinformatics/BaitCapture --help
N E X T F L O W  ~  version 23.10.1
Launching `https://github.com/OLC-Bioinformatics/BaitCapture` [stupefied_bassi] DSL2 - revision: 0fd15d548b [dev]


------------------------------------------------------
  olc/baitcapture v1.0.0-g0fd15d5
------------------------------------------------------
Typical pipeline command:

  nextflow run olc/baitcapture --input samplesheet.csv --targets targets.fa --outdir results/ -profile singularity

Input/output options
  --input                            [string]  Path to comma-separated file containing information about the samples in the experiment.
  --input_folder                     [string]  Path to folder containing paired-end gzipped FASTQ files.
  --pattern                          [string]  Naming of sequencing files for `--input_folder`. Must use double-quotes (`""`) and a prepended slash (`/`). 
                                               [default: "/*_R{1,2}_001.fastq.gz"] 
  --targets                          [string]  Path to FASTA file of gene targets for alignment.
  --outdir                           [string]  The output directory where the results will be saved. You have to use absolute paths to storage on Cloud 
                                               infrastructure. 
  --host                             [string]  Path to FASTA file of host genome to use for host DNA removal (decontamination).
  --adapters                         [string]  Path to FASTA file of adapter sequences to use for adapter removal with FASTP.
  --target_metadata                  [string]  Path to comma-separated file containing information about the metadata for targets used in the 
                                               experiment. 
  --email                            [string]  Email address for completion summary.
  --multiqc_title                    [string]  MultiQC report title. Printed as page header, used for filename if not otherwise specified.

Workflow execution options
  --aligner                          [string]  Alignment tool to use for aligning (preprocessed) reads to the provided database of gene targets). (accepted: 
                                               bwamem2, kma, bwa) [default: kma] 
  --skip_trimming                    [boolean] Indicate whether to skip trimming of raw reads.
  --report_all                       [boolean] Report undetected targets in merged results files.

Target detection thresholds
  --fold_cov_threshold               [number]  The minimum fold-coverage of a target that must be achieved to call a positive detection. [default: 0.9]
  --len_cov_threshold                [integer] The minimum length (in bp) that a target must be covered by to call a positive detection. [default: 0]
  --mapped_reads_threshold           [integer] The minimum number of reads that must be mapped to a target to call a positive detection. [default: 2]
  --prop_cov_threshold               [number]  The minimum percentage of length (in bp) that a target must be covered by to call a positive detection. 
                                               [default: 0.9] 
  --pident_threshold                 [number]  The minimum percentage identity match to a target that must be achieved to call a positive detection (only 
                                               available with `--aligner kma`). 

Generic options
  --multiqc_methods_description      [string]  Custom MultiQC yaml file containing HTML including a methods description.

 !! Hiding 23 params, use the 'validation.showHiddenParams' config value to show them !!
------------------------------------------------------
If you use olc/baitcapture for your analysis please cite:

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/olc/baitcapture/blob/master/CITATIONS.md
------------------------------------------------------
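Each detection threshold in the help text is described as a minimum that must be met to call a positive detection, which suggests that a target is called present only when all of them are satisfied. The sketch below illustrates that reading using the default values listed above. It is an assumption about how the thresholds combine, not the pipeline's own code: the function call_target and its four arguments are hypothetical, and --pident_threshold is left out because it applies only to KMA and has no default listed.

```shell
# Hypothetical helper, not part of BaitCapture: call presence/absence for one
# target, assuming every threshold must be met.
# Arguments: fold_cov len_cov mapped_reads prop_cov
call_target() {
  awk -v fold_cov="$1" -v len_cov="$2" -v mapped_reads="$3" -v prop_cov="$4" \
      -v fold_min=0.9 -v len_min=0 -v reads_min=2 -v prop_min=0.9 '
    BEGIN {
      # Defaults above mirror --fold_cov_threshold, --len_cov_threshold,
      # --mapped_reads_threshold and --prop_cov_threshold.
      ok = (fold_cov >= fold_min && len_cov >= len_min &&
            mapped_reads >= reads_min && prop_cov >= prop_min)
      print (ok ? "present" : "absent")
    }'
}

call_target 1.5 850 12 0.97   # every minimum met
call_target 1.5 850 1 0.97    # too few mapped reads
```

With the defaults, the first call prints present and the second prints absent, because a single mapped read falls below the minimum of 2.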

Contributions and support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

If you use BaitCapture in a publication, please consider citing the software using the Zenodo DOI: 10.5281/zenodo.11283946.

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core initiative, including nf-core/mag and nf-core/ampliseq, and reused here under the MIT license.

Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.
