This Nextflow pipeline processes FastQ files from targeted sequencing to analyze and quantify variant support across samples. The pipeline handles adapter trimming, read alignment, amplicon matching, variant coverage analysis, and produces a comprehensive report of variant support metrics.
- Processing of paired-end FastQ files
- Adapter trimming using fastp
- Read alignment to a reference genome using minimap2
- BAM file processing (conversion, sorting, duplicate marking)
- Amplicon matching against probe hits from a probing pipeline
- Variant coverage analysis using bam-readcount
- Generation of detailed variant support spreadsheets with metrics including:
  - Probe hits per amplicon
  - Reference and alternate allele coverage
  - Variant Allele Frequency (VAF) calculations
Check the quick start guide here.
The pipeline uses Singularity containers for all processes, so you'll need:
- Nextflow (>= 21.04.0)
- Singularity (>= 3.0)
The following tools are used via containers:
- fastp (for adapter trimming)
- minimap2 (for read alignment)
- samtools (for SAM/BAM conversion and processing)
- sambamba (for BAM sorting and duplicate marking)
- bedtools (for amplicon matching)
- bam-readcount (for variant coverage analysis)
- Python with pandas/openpyxl (for results processing)
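All of these tools are run inside containers, so none of them need to be on your PATH. As a rough illustration only (the process name and image path below are placeholders, not the pipeline's actual values), Singularity support in a Nextflow pipeline is typically wired up in `nextflow.config` like this:

```nextflow
// Illustrative sketch; the pipeline's real nextflow.config may differ.
singularity {
    enabled    = true
    autoMounts = true                 // bind host paths into containers automatically
}

process {
    // Attach a container image to a specific process.
    // 'FASTP_TRIM' and the .sif path are placeholders.
    withName: 'FASTP_TRIM' {
        container = '/path/to/containers/fastp.sif'
    }
}
```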
The pipeline requires three main inputs:

1. FastQ files directory (`--fastq`)
   - Files must follow the pattern `<SAMPLE>_<MATEPAIR>_<anything>.fastq.gz` (a pairing sketch follows after this list)
   - `<SAMPLE>` must match the folder names from the probing results
   - `<MATEPAIR>` should be 1/2 or R1/R2
   - Example: `PX3492_AAGAGGCA-AGAGGATA_1_150bp_2_lanes.merge_chastity_passed.fastq.gz`
2. Probing results directory (`--probing_results`)
   - Contains subfolders for each sample with probe hit information
   - Each sample folder contains files like `<sample>.merge.hit.sequences.summary`
   - These files contain probe hit information in a FASTA-like format
3. Variant information JSON (`--variant_info_json`)
   - Contains detailed annotation for variants used in probing
   - Includes genomic coordinates, HGVS notation, probe sequences, etc.

In addition, the pipeline uses:

- Reference genome FASTA file and index
- Amplicon BED file with coordinates
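For reference, the FastQ naming pattern above is the kind of layout Nextflow's `Channel.fromFilePairs` can group into `(sample, [R1, R2])` tuples. The sketch below is illustrative only; the glob and the key-extraction logic are assumptions, not necessarily what `main.nf` does:

```nextflow
// Sketch: pair <SAMPLE>_<MATEPAIR>_<anything>.fastq.gz files by <SAMPLE>.
// Illustrative only; not taken from main.nf.
params.fastq = '/path/to/fastq_dir'

workflow {
    Channel
        .fromFilePairs("${params.fastq}/*_{1,2,R1,R2}_*.fastq.gz") { file ->
            // Drop everything from the mate-pair token onwards, e.g.
            // PX3492_AAGAGGCA-AGAGGATA_1_150bp_... -> PX3492_AAGAGGCA-AGAGGATA
            file.name.replaceFirst(/_(R?[12])_.*/, '')
        }
        .view { sample, reads -> "${sample}: ${reads}" }
}
```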
Basic usage, with only the required parameters:

```bash
nextflow run main.nf \
    --fastq /path/to/fastq_dir \
    --probing_results /path/to/probing_results \
    --out_dir /path/to/output_dir
```
A full run specifying all parameters:

```bash
nextflow run main.nf \
    --fastq /path/to/fastq_dir \
    --probing_results /path/to/probing_results \
    --out_dir /path/to/output_dir \
    --amplicon_bed_file /path/to/amplicon.bed \
    --variant_info_json /path/to/variant_info.json \
    --reference_file /path/to/reference.fa \
    --reference_index /path/to/reference.fa.fai \
    --cancer_types /path/to/cancer_types.tsv \
    --debug true \
    --dev true \
    --randomize true
```
Required parameters:

| Parameter | Description |
|---|---|
| `--fastq` | Directory containing FastQ files |
| `--probing_results` | Directory containing probing results |
| `--out_dir` | Output directory for results |
Optional parameters:

| Parameter | Description | Default |
|---|---|---|
| `--amplicon_bed_file` | BED file with amplicon coordinates | `/projects/trans_scratch/validations/M_Anglesio/hg19_bams/amplicon.bed` |
| `--variant_info_json` | JSON file with variant information | `/gsc/pipelines/probes/Anglesio-hg19a-ensembl69/variant_info.json` |
| `--reference_file` | Reference genome FASTA | `/projects/alignment_references/9606/hg19a/genome/bwa_64/hg19a.fa` |
| `--reference_index` | Reference genome index | `/projects/alignment_references/9606/hg19a/genome/bwa_64/hg19a.fa.fai` |
| `--cancer_types` | TSV file with sample cancer types | (None) |
| `--debug` | Enable output of intermediate files | `false` |
| `--dev` | Run only 3 samples for testing | `false` |
| `--randomize` | Randomize FastQ processing order | `false` |
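As a rough sketch (not copied from the pipeline's actual `nextflow.config`), the defaults listed above are the kind of values that would be declared in a `params` block and overridden on the command line with `--<name> <value>`:

```nextflow
// Sketch of a params block declaring the defaults listed above.
// Illustrative only; the actual nextflow.config may be organized differently.
params {
    fastq             = null    // required
    probing_results   = null    // required
    out_dir           = null    // required
    amplicon_bed_file = '/projects/trans_scratch/validations/M_Anglesio/hg19_bams/amplicon.bed'
    variant_info_json = '/gsc/pipelines/probes/Anglesio-hg19a-ensembl69/variant_info.json'
    reference_file    = '/projects/alignment_references/9606/hg19a/genome/bwa_64/hg19a.fa'
    reference_index   = '/projects/alignment_references/9606/hg19a/genome/bwa_64/hg19a.fa.fai'
    cancer_types      = null
    debug             = false
    dev               = false
    randomize         = false
}
```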
The pipeline provides different execution profiles:

- `standard` (default): Run locally
- `cluster`: Submit jobs to SLURM
- `prod`: Use production priority on the cluster

Example with profiles:

```bash
nextflow run main.nf -profile cluster,prod \
    --fastq /path/to/fastq_dir \
    --probing_results /path/to/probing_results \
    --out_dir /path/to/output_dir
```
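Profiles are defined in `nextflow.config`. A minimal sketch of what these three profiles could look like (the SLURM option shown is a placeholder, not the pipeline's actual setting):

```nextflow
// Illustrative profile definitions; the real nextflow.config may differ.
profiles {
    standard {
        process.executor = 'local'              // run on the local machine (default)
    }
    cluster {
        process.executor = 'slurm'              // submit each process as a SLURM job
    }
    prod {
        process.clusterOptions = '--qos=prod'   // placeholder for a production-priority QOS
    }
}
```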
Standard Nextflow command-line options can also be used:

- `-resume`: Resume a previous run
- `-with-report`: Generate execution report
- `-with-trace`: Generate execution trace
- `-with-timeline`: Generate execution timeline
- `-w`: Specify work directory
The pipeline performs the following steps:

1. Read Processing (a command-level sketch follows after this list)
   - Adapter trimming with fastp
   - Read alignment to reference genome with minimap2
   - SAM to BAM conversion
   - BAM sorting
   - Duplicate marking
2. Amplicon Matching
   - Extract reads matching probes from the probe hit files
   - Create a sub-BAM with those reads
   - Match reads to amplicons
   - Count reads per amplicon
3. Variant Analysis
   - Parse variant information from JSON
   - Calculate coverage at variant positions using bam-readcount
   - Calculate Variant Allele Frequency (VAF)
4. Output Generation
   - Create comprehensive spreadsheet with variant support metrics
   - Optional separation by cancer type
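As a rough illustration of the Read Processing step, the sketch below approximates what a combined trimming/alignment/dedup process could look like. The process name, file names, and tool options are simplified assumptions, not the pipeline's actual code; the real process definitions live in `workflows/genome_processes.nf`:

```nextflow
// Illustrative sketch of the read-processing commands; not the pipeline's
// actual process definitions (see workflows/genome_processes.nf for those).
process ALIGN_AND_DEDUP {
    input:
    tuple val(sample), path(reads)      // reads = [R1.fastq.gz, R2.fastq.gz]
    path reference                      // reference genome FASTA

    output:
    tuple val(sample), path("${sample}.dedup.bam")

    script:
    """
    # adapter trimming
    fastp -i ${reads[0]} -I ${reads[1]} \\
          -o ${sample}_1.trim.fastq.gz -O ${sample}_2.trim.fastq.gz

    # alignment (minimap2 short-read preset) and SAM -> BAM conversion
    minimap2 -ax sr ${reference} ${sample}_1.trim.fastq.gz ${sample}_2.trim.fastq.gz \\
        | samtools view -b -o ${sample}.bam -

    # coordinate sorting and duplicate marking
    sambamba sort -o ${sample}.sorted.bam ${sample}.bam
    sambamba markdup ${sample}.sorted.bam ${sample}.dedup.bam
    """
}
```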
The pipeline generates several output files, with the two primary results being:

- `variant_support.xlsx`: Excel spreadsheet with all variant support metrics
- `variant_support.txt`: Tab-separated text version of the same data

Additionally, the pipeline produces:

- `amplicon_counts.txt`: Raw counts of amplicon matches
- `all_variants_coverage.txt`: Coverage data for all variants
- Sample-specific BAM files in the `samples/` directory
If a cancer types TSV file is provided (`--cancer_types`), the output will be separated by cancer type in the Excel spreadsheet. This file should contain two columns (no header):

- Sample name
- Cancer type

Example:

```
PX3492_AAGAGGCA-AGAGGATA TYPE_A
PX3492_AAGAGGCA-ATAGAGAG TYPE_B
```
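One way a two-column TSV like this can be consumed in Nextflow is via `splitCsv`; this is only a sketch of the idea, not necessarily how the pipeline reads it:

```nextflow
// Sketch: read the cancer types TSV into (sample, cancer_type) tuples.
// Illustrative only; not taken from main.nf.
workflow {
    Channel
        .fromPath(params.cancer_types)
        .splitCsv(sep: '\t')                     // no header: each row is a plain list
        .map { row -> tuple(row[0], row[1]) }    // [sample_name, cancer_type]
        .view()
}
```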
The variant support spreadsheet (both Excel and text formats) contains the following key columns.

Sample information:

| Column | Description |
|---|---|
| Sample | Sample identifier matching the FastQ filename |
| CancerType | Cancer type classification (if provided) |

Variant information:

| Column | Description |
|---|---|
| OriginalVariant | Original variant notation (e.g., `AKT1:c.49G>A`) |
| ProbeVariant | Variant notation used for the probe (HGVS cDNA format) |
| Probe | Probe identifier (typically genomic coordinates) |
| Chr | Chromosome containing the variant |
| Coordinate | Genomic position of the variant |
| ReferenceAllele | Reference nucleotide at the variant position |
| AlternateAllele | Alternate nucleotide at the variant position |

Probe hit metrics:

| Column | Description |
|---|---|
| ProbeHits_S1 | Number of probe hits matching amplicon position 1 |
| ProbeHits_S2 | Number of probe hits matching amplicon position 2 |
| ProbeHits_S3 | Number of probe hits matching amplicon position 3 |
| TotalProbeHits | Total number of probe hits across all amplicons |
| SupportingAmplicons | Number of amplicons with at least one probe hit (0-3) |

Coverage and VAF metrics:

| Column | Description |
|---|---|
| BamCoverage | Total read depth at the variant position |
| ReferenceCoverage | Number of reads with the reference allele |
| AlternateCoverage | Number of reads with the alternate allele |
| VAF | Variant Allele Frequency (percentage of reads with the alternate allele) |
| A_cov | Number of reads with 'A' at the variant position |
| C_cov | Number of reads with 'C' at the variant position |
| G_cov | Number of reads with 'G' at the variant position |
| T_cov | Number of reads with 'T' at the variant position |
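The VAF column is presumably derived from the coverage columns as AlternateCoverage / BamCoverage × 100; for example, 25 alternate-allele reads at a position with a total depth of 500 would give a VAF of 5%.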
Common issues:

- Missing sample files: Ensure the sample names in the FastQ files match exactly with the sample folders in the probing results
- Missing dependencies: The pipeline uses Singularity containers, so ensure Singularity is properly installed and configured

The pipeline is configured with `errorStrategy = 'retry'`, which will attempt to retry failed processes before failing the pipeline.
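In `nextflow.config` terms this corresponds to something like the following; only `errorStrategy = 'retry'` is stated here, so the `maxRetries` value is an illustrative assumption:

```nextflow
// Sketch of a retry policy; the maxRetries value is assumed for illustration.
process {
    errorStrategy = 'retry'
    maxRetries    = 2
}
```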
Repository layout:

```
.
├── main.nf                              # Main workflow script
├── nextflow.config                      # Configuration file
├── README.md                            # This README
├── docs/                                # Extra documentation files
├── workflows/
│   ├── genome_processes.nf              # Genome alignment processes
│   ├── amplicon_matching_processes.nf   # Amplicon matching processes
│   ├── variant_processes.nf             # Variant analysis processes
│   └── utilities_processes.nf           # Utility processes
└── bin/
    └── build_variant_support_df.py      # Script for generating the final output
```
Check here for a comprehensive overview of the core workflow and process details.