BaitCapture is a bioinformatics workflow designed for processing sequencing data obtained from targeted resistome bait-capture sequencing, built using Nextflow.
- Introduction
- Quick start
- Pipeline summary
- Usage
- Output
- Testing the workflow
- Running the workflow on high-performance compute clusters
- Advanced usage
- Contributions and support
- Citations
BaitCapture is based upon a Bash script workflow originally created by Shay et al. (2023). Although it was designed with bait-capture sequencing data in mind, BaitCapture can be used for any paired-end Illumina sequencing dataset where the user needs to align many sequence reads to a reference database of gene targets.
BaitCapture offers the following features:
- Quality control: Assess the quality of raw and pre-processed sequence data.
- Pre-processing:
  - Read decontamination using a host reference genome
  - Quality-based trimming
  - Adapter removal
- Read alignment: Align reads against a reference database of gene targets using KMA, BWA-MEM2, or BWA.
- Alignment reports:
  - `mapstats.tsv`: A table of read alignment statistics against each gene target for each sample, including KMA-specific alignment statistics.
  - `sumstats.tsv`: A table of summary statistics for each sample, including the on-target alignment rate and the number of reads lost from host decontamination and filtering.
  - `presence_absence.tsv`: A table of presence-absence calls for each gene target in each sample, based upon user-defined thresholds.
  - `presence_absence_clusters.tsv`: A table of presence-absence calls for each gene target cluster in each sample, with clusters defined by a target metadata file (e.g. resistance mechanism).
The steps of the workflow are:
- Report the quality of the raw sequence data using FastQC.
- (Optional) Trim the raw sequence reads using fastp.
- (Optional) Decontaminate the trimmed sequence reads using a host reference genome with BWA-MEM2.
- Report the quality of the pre-processed sequence data using FastQC.
- Obtain total read and bp counts from the raw and pre-processed sequence data using fastq-scan.
- Align trimmed and/or decontaminated reads against the database of gene targets using KMA, BWA-MEM2, or BWA.
- Obtain sequence coverage and depth statistics from the alignments using AlignCov, Mosdepth, and SAMtools.
- Create a MultiQC report and other summary reports with custom scripts.
If you are new to Nextflow and the nf-core framework, please refer to this page on how to set up Nextflow.
Please provide pipeline parameters via the CLI or the Nextflow `-params-file` option.
Custom config files, including those provided by the `-c` Nextflow option, can be used to provide any configuration except for parameters; see docs.
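As a minimal sketch of the `-params-file` route, the snippet below writes a YAML parameters file and prints the command that would consume it. The file name `params.yaml`, the `results` output directory, and the `<PROFILE>` placeholder are assumptions for illustration; the keys are the same names as the double-dash options used elsewhere in this README.

```shell
# Write a parameters file (hypothetical name: params.yaml).
# Each key corresponds to a pipeline option, e.g. "targets" for --targets.
# The paths are placeholders and should point at your own files.
cat > params.yaml <<'EOF'
input: "samplesheet.csv"
targets: "targets.fa"
outdir: "results"
EOF

# Print (rather than run) the command that would use the file.
# Replace <PROFILE> with e.g. docker or singularity before running it.
echo "nextflow run OLC-Bioinformatics/BaitCapture -profile <PROFILE> -params-file params.yaml"
```

Keeping parameters in a file makes a run easier to repeat, while container and executor settings stay in config files passed with `-c`.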
BaitCapture can be run using two different input types:
- A samplesheet, including sample names and paths to paired-end gzipped FASTQ files, or
- A folder containing paired-end gzipped FASTQ files.
First, prepare a samplesheet with your input data that looks as follows:
`samplesheet.csv`:
sample,fastq_1,fastq_2
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
Each row represents one sample and its pair of gzipped FASTQ files.
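For folders with many samples, the samplesheet can be generated rather than typed by hand. The helper below is a sketch only: `make_samplesheet` is a hypothetical function that is not part of BaitCapture, it assumes files end in `_R1_001.fastq.gz` and `_R2_001.fastq.gz`, and it assumes the text before the first underscore is an acceptable sample name.

```shell
# make_samplesheet DIR
# Print a samplesheet (header plus one row per sample) for paired files
# named <sample>_..._R1_001.fastq.gz and <sample>_..._R2_001.fastq.gz in DIR.
make_samplesheet() {
  local dir="${1%/}" r1 r2 base                  # drop a trailing slash from DIR
  echo "sample,fastq_1,fastq_2"
  for r1 in "$dir"/*_R1_001.fastq.gz; do
    [ -e "$r1" ] || continue                     # the glob matched nothing
    r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"  # expected mate file
    if [ ! -e "$r2" ]; then
      echo "warning: no R2 file found for $r1" >&2
      continue
    fi
    base="$(basename "$r1")"
    echo "${base%%_*},$r1,$r2"                   # sample = text before first "_"
  done
}
```

For example, `make_samplesheet data/ > samplesheet.csv` writes one row per complete read pair and warns on standard error about any R1 file without a mate.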
Now, you can run the pipeline using:
nextflow run OLC-Bioinformatics/BaitCapture \
-profile <docker/singularity/.../institute> \
--input samplesheet.csv \
--targets targets.fa \
--outdir <OUTDIR>
Instead of a samplesheet, the user can provide a path to a directory containing gzipped FASTQ files.
In this case, the sample name will be the name of the file up until the first underscore (`_`).
The default file naming pattern that `--input_folder` searches for is `"/*_R{1,2}_001.fastq.gz"`.
For example, for a folder `data/` containing sequencing files that looks as follows:
data
├── ERR9958133_R1_001.fastq.gz
├── ERR9958133_R2_001.fastq.gz
├── ERR9958134_R1_001.fastq.gz
└── ERR9958134_R2_001.fastq.gz
The workflow can be run simply by using:
nextflow run OLC-Bioinformatics/BaitCapture \
-profile <docker/singularity/.../institute> \
--input_folder data/ \
--targets targets.fa \
--outdir <OUTDIR>
And the sample names will be:
- ERR9958133
- ERR9958134
If the names of the gzipped FASTQ files do not end with `_R{1,2}_001.fastq.gz`, an alternate sequencing file pattern must be specified using `--pattern`.
For example, for a folder `more-data/` that looks as follows:
more-data
├── SAMN000214_R1.fastq.gz
├── SAMN000214_R2.fastq.gz
├── SAMN000215_R1.fastq.gz
└── SAMN000215_R2.fastq.gz
The workflow can be run using:
nextflow run OLC-Bioinformatics/BaitCapture \
-profile <docker/singularity/.../institute> \
--input_folder more-data/ \
--pattern "/*_R{1,2}.fastq.gz" \
--targets targets.fa \
--outdir <OUTDIR>
And the sample names will be:
- SAMN000214
- SAMN000215
Note
When providing an argument to `--pattern`, the string must be enclosed in double quotes (`""`) and must be prepended with a forward slash (`/`).
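Before launching a run with `--input_folder`, it can help to check which naming scheme a folder actually uses. The function below is an illustrative sketch: `suggest_pattern` is a hypothetical helper, not a BaitCapture command, and it only distinguishes the two naming schemes shown in the examples above.

```shell
# suggest_pattern DIR
# Report whether the files in DIR fit the default naming scheme
# (*_R1_001.fastq.gz) or need the shorter --pattern from the example above.
suggest_pattern() {
  local dir="${1%/}"
  if ls "$dir"/*_R1_001.fastq.gz > /dev/null 2>&1; then
    echo "default"                               # no --pattern needed
  elif ls "$dir"/*_R1.fastq.gz > /dev/null 2>&1; then
    echo '--pattern "/*_R{1,2}.fastq.gz"'        # quoted, with leading slash
  else
    echo "unknown"                               # neither scheme matched
  fi
}
```

With the folders shown above, `suggest_pattern data/` would print `default` and `suggest_pattern more-data/` would print the `--pattern` argument to add to the command.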
The pipeline will output the following directories:
results/
├── fastp
├── fastqc
│ ├── preprocessed
│ └── raw
├── multiqc
│ ├── multiqc_data
│ └── multiqc_plots
├── pipeline_info
└── summary
- `fastp/`: Contains fastp reports for trimmed FASTQ files.
- `fastqc/`: Contains FastQC reports for raw and pre-processed FASTQ files.
- `multiqc/`: Contains MultiQC reports.
- `pipeline_info/`: Contains Nextflow logs and reports.
- `summary/`: Contains alignment reports for each sample.
To check if BaitCapture, Nextflow, and your container manager have been configured properly, a test run of the workflow can be performed by first cloning the GitHub repository and then running the test workflow as follows:
git clone https://github.com/OLC-Bioinformatics/BaitCapture
cd BaitCapture/
nextflow run . \
-profile test,<docker/singularity/.../institute> \
--outdir <OUTDIR>
If your `<OUTDIR>` was `results/`, you could then run the following command to inspect the summary statistics for the test sample:
$ cat results/summary/sumstats.tsv
sampleid raw_total_reads raw_total_bp fastp_total_reads fastp_total_bp decontam_total_reads decontam_total_bp mapped_total_reads mapped_total_bp percent_reads_lost_fastp percent_reads_lost_decontam percent_reads_on_target
SRR14739083 553624 83043600 464776 69619648 455764 68278988 98485 14753774 16.05 1.94 21.61
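The three percentage columns can be reproduced from the read counts in the same row. The sketch below rebuilds the example row as a tab-separated file (the file name `sumstats_example.tsv` is an assumption, and only the count columns are included) and recomputes the percentages with `awk`. The formulas are inferred from this single example row rather than taken from the pipeline's source, so treat them as an interpretation.

```shell
# Recreate the count columns of the example row as a tab-separated file.
printf 'sampleid\traw_total_reads\traw_total_bp\tfastp_total_reads\tfastp_total_bp\tdecontam_total_reads\tdecontam_total_bp\tmapped_total_reads\tmapped_total_bp\n' > sumstats_example.tsv
printf 'SRR14739083\t553624\t83043600\t464776\t69619648\t455764\t68278988\t98485\t14753774\n' >> sumstats_example.tsv

# Look columns up by header name, then derive the three percentages:
#   reads lost to fastp    = (raw - fastp) / raw
#   reads lost to decontam = (fastp - decontam) / fastp
#   reads on target        = mapped / decontam
percentages="$(awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  {
    raw      = $(col["raw_total_reads"])
    fastp    = $(col["fastp_total_reads"])
    decontam = $(col["decontam_total_reads"])
    mapped   = $(col["mapped_total_reads"])
    printf "%s\t%.2f\t%.2f\t%.2f\n", $1,
      100 * (raw - fastp) / raw,
      100 * (fastp - decontam) / fastp,
      100 * mapped / decontam
  }' sumstats_example.tsv)"
echo "$percentages"
```

For the test sample this prints `16.05`, `1.94`, and `21.61`, matching `percent_reads_lost_fastp`, `percent_reads_lost_decontam`, and `percent_reads_on_target` above. Each loss is therefore expressed relative to the reads entering that step, and the on-target rate is relative to the reads remaining after decontamination.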
Nextflow is capable of running several jobs in parallel using job submission managers (e.g. SLURM) that have been configured on high-performance compute (HPC) clusters. For your convenience, profiles have been added to simplify running the workflow on commonly used clusters.
To run BaitCapture on the National Microbiology Laboratory's HPC cluster Waffles using Singularity, use `-profile waffles`.
For example:
nextflow run OLC-Bioinformatics/BaitCapture \
-profile waffles \
--input samplesheet.csv \
--targets targets.fa \
--outdir <OUTDIR>
More usage information can be obtained at any time by running `nextflow run OLC-Bioinformatics/BaitCapture --help`:
$ nextflow run OLC-Bioinformatics/BaitCapture --help
N E X T F L O W ~ version 23.10.1
Launching `https://github.com/OLC-Bioinformatics/BaitCapture` [stupefied_bassi] DSL2 - revision: 0fd15d548b [dev]
------------------------------------------------------
olc/baitcapture v1.0.0-g0fd15d5
------------------------------------------------------
Typical pipeline command:
nextflow run olc/baitcapture --input samplesheet.csv --targets targets.fa --outdir results/ -profile singularity
Input/output options
--input [string] Path to comma-separated file containing information about the samples in the experiment.
--input_folder [string] Path to folder containing paired-end gzipped FASTQ files.
--pattern [string] Naming of sequencing files for `--input_folder`. Must use double-quotes (`""`) and a prepended slash (`/`).
[default: "/*_R{1,2}_001.fastq.gz"]
--targets [string] Path to FASTA file of gene targets for alignment.
--outdir [string] The output directory where the results will be saved. You have to use absolute paths to storage on Cloud
infrastructure.
--host [string] Path to FASTA file of host genome to use for host DNA removal (decontamination).
--adapters [string] Path to FASTA file of adapter sequences to use for adapter removal with FASTP.
--target_metadata [string] Path to comma-separated file containing information about the metadata for targets used in the
experiment.
--email [string] Email address for completion summary.
--multiqc_title [string] MultiQC report title. Printed as page header, used for filename if not otherwise specified.
Workflow execution options
--aligner [string] Alignment tool to use for aligning (preprocessed) reads to the provided database of gene targets). (accepted:
bwamem2, kma, bwa) [default: kma]
--skip_trimming [boolean] Indicate whether to skip trimming of raw reads.
--report_all [boolean] Report undetected targets in merged results files.
Target detection thresholds
--fold_cov_threshold [number] The minimum fold-coverage of a target that must be achieved to call a positive detection. [default: 0.9]
--len_cov_threshold [integer] The minimum length (in bp) that a target must be covered by to call a positive detection. [default: 0]
--mapped_reads_threshold [integer] The minimum number of reads that must be mapped to a target to call a positive detection. [default: 2]
--prop_cov_threshold [number] The minimum percentage of length (in bp) that a target must be covered by to call a positive detection.
[default: 0.9]
--pident_threshold [number] The minimum percentage identity match to a target that must be achieved to call a positive detection (only
available with `--aligner kma`).
Generic options
--multiqc_methods_description [string] Custom MultiQC yaml file containing HTML including a methods description.
!! Hiding 23 params, use the 'validation.showHiddenParams' config value to show them !!
------------------------------------------------------
If you use olc/baitcapture for your analysis please cite:
* The nf-core framework
https://doi.org/10.1038/s41587-020-0439-x
* Software dependencies
https://github.com/olc/baitcapture/blob/master/CITATIONS.md
------------------------------------------------------
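To make the target detection thresholds concrete, the sketch below applies the default values from the help text to a small table. Several things here are assumptions: `call_presence` is a hypothetical helper, the column names and order (`target`, `mapped_reads`, `prop_cov`, `fold_cov`, `len_cov`) are invented for illustration and do not describe the layout of `mapstats.tsv`, and the rule that a target must pass every threshold at once is a guess at how the thresholds combine. The KMA-only `--pident_threshold` is left out.

```shell
# call_presence FILE
# Append a 0/1 "present" column to a tab-separated table with the
# HYPOTHETICAL columns: target, mapped_reads, prop_cov, fold_cov, len_cov.
# Thresholds default to the values in the help text and can be overridden
# through environment variables (names chosen for this sketch).
call_presence() {
  awk -F'\t' -v OFS='\t' \
      -v min_reads="${MAPPED_READS_THRESHOLD:-2}" \
      -v min_prop="${PROP_COV_THRESHOLD:-0.9}" \
      -v min_fold="${FOLD_COV_THRESHOLD:-0.9}" \
      -v min_len="${LEN_COV_THRESHOLD:-0}" '
    NR == 1 { print $0, "present"; next }
    {
      # ASSUMPTION: a target is called present only if all thresholds pass.
      ok = ($2 >= min_reads && $3 >= min_prop && $4 >= min_fold && $5 >= min_len)
      print $0, ok
    }' "$1"
}
```

Under these assumptions, a target with 10 mapped reads, 95% of its length covered, and 3.2-fold coverage would be called present, while the same target with only one mapped read would not.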
If you would like to contribute to this pipeline, please see the contributing guidelines.
If you use BaitCapture in a publication, please consider citing the software using the Zenodo DOI: 10.5281/zenodo.11283946.
An extensive list of references for the tools used by the pipeline can be found in the `CITATIONS.md` file.
This pipeline uses code and infrastructure developed and maintained by the nf-core initiative, including nf-core/mag and nf-core/ampliseq, and reused here under the MIT license.
Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.