Variant Support Nextflow Pipeline

Overview

This Nextflow pipeline processes FastQ files from targeted sequencing to analyze and quantify variant support across samples. The pipeline handles adapter trimming, read alignment, amplicon matching, variant coverage analysis, and produces a comprehensive report of variant support metrics.

Features

  • Processing of paired-end FastQ files
  • Adapter trimming using fastp
  • Read alignment to a reference genome using minimap2
  • BAM file processing (conversion, sorting, duplicate marking)
  • Amplicon matching against probe hits from a probing pipeline
  • Variant coverage analysis using bam-readcount
  • Generation of detailed variant support spreadsheets with metrics including:
    • Probe hits per amplicon
    • Reference and alternate allele coverage
    • Variant Allele Frequency (VAF) calculations

Quick Start

Check the quick start guide here.

Requirements

Software

The pipeline uses Singularity containers for all processes, so you'll need:

  • Nextflow (>= 21.04.0)
  • Singularity (>= 3.0)

The following tools are used via containers:

  • fastp (for adapter trimming)
  • minimap2 (for read alignment)
  • samtools (for SAM/BAM conversion and processing)
  • sambamba (for BAM sorting and duplicate marking)
  • bedtools (for amplicon matching)
  • bam-readcount (for variant coverage analysis)
  • Python with pandas/openpyxl (for results processing)

Input Data

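The pipeline takes three main inputs. The first two must always be supplied; the third, --variant_info_json, falls back to a default (see Optional Arguments):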

  1. FastQ files directory (--fastq)

    • Files must follow the pattern: <SAMPLE>_<MATEPAIR>_<anything>.fastq.gz
    • <SAMPLE> must match the folder names from the probing results
    • <MATEPAIR> should be 1/2 or R1/R2
    • Example: PX3492_AAGAGGCA-AGAGGATA_1_150bp_2_lanes.merge_chastity_passed.fastq.gz
  2. Probing results directory (--probing_results)

    • Contains subfolders for each sample with probe hit information
    • Each sample folder contains files like <sample>.merge.hit.sequences.summary
    • These files contain probe hit information in a FASTA-like format
  3. Variant information JSON (--variant_info_json)

    • Contains detailed annotation for variants used in probing
    • Includes genomic coordinates, HGVS notation, probe sequences, etc.
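
The FastQ naming pattern above can be illustrated with a short parser. This is a minimal sketch, not the pipeline's own code: the regular expression and the function name parse_fastq_name are assumptions, chosen so that a <SAMPLE> containing underscores (as in the example file name) is still recovered.

```python
import re

# Assumed pattern: <SAMPLE>_<MATEPAIR>_<anything>.fastq.gz, where <SAMPLE>
# may itself contain underscores and <MATEPAIR> is 1, 2, R1, or R2.
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+?)_(?P<mate>R?[12])_(?P<rest>.+)\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Return (sample, mate) for a FastQ file name, or None if it does not match."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    # Normalise R1/R2 to 1/2 so both spellings compare equal.
    mate = match.group("mate").lstrip("R")
    return match.group("sample"), mate


example = "PX3492_AAGAGGCA-AGAGGATA_1_150bp_2_lanes.merge_chastity_passed.fastq.gz"
print(parse_fastq_name(example))
```

The non-greedy sample group stops at the first _1_, _2_, _R1_, or _R2_ token, so a sample name that itself contains such a token would be split too early under this sketch.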

Reference Data

  • Reference genome FASTA file and index
  • Amplicon BED file with coordinates

Usage

Basic Usage

nextflow run main.nf \
  --fastq /path/to/fastq_dir \
  --probing_results /path/to/probing_results \
  --out_dir /path/to/output_dir

Full Parameters

nextflow run main.nf \
  --fastq /path/to/fastq_dir \
  --probing_results /path/to/probing_results \
  --out_dir /path/to/output_dir \
  --amplicon_bed_file /path/to/amplicon.bed \
  --variant_info_json /path/to/variant_info.json \
  --reference_file /path/to/reference.fa \
  --reference_index /path/to/reference.fa.fai \
  --cancer_types /path/to/cancer_types.tsv \
  --debug true \
  --dev true \
  --randomize true

Required Arguments

Parameter          Description
--fastq            Directory containing FastQ files
--probing_results  Directory containing probing results
--out_dir          Output directory for results

Optional Arguments

Parameter            Description                          Default
--amplicon_bed_file  BED file with amplicon coordinates   /projects/trans_scratch/validations/M_Anglesio/hg19_bams/amplicon.bed
--variant_info_json  JSON file with variant information   /gsc/pipelines/probes/Anglesio-hg19a-ensembl69/variant_info.json
--reference_file     Reference genome FASTA               /projects/alignment_references/9606/hg19a/genome/bwa_64/hg19a.fa
--reference_index    Reference genome index               /projects/alignment_references/9606/hg19a/genome/bwa_64/hg19a.fa.fai
--cancer_types       TSV file with sample cancer types    (None)
--debug              Enable output of intermediate files  false
--dev                Run only 3 samples for testing       false
--randomize          Randomize FastQ processing order     false

Profiles

The pipeline provides different execution profiles:

  • standard (default): Run locally
  • cluster: Submit jobs to SLURM
  • prod: Use production priority on the cluster

Example with profile:

nextflow run main.nf -profile cluster,prod \
  --fastq /path/to/fastq_dir \
  --probing_results /path/to/probing_results \
  --out_dir /path/to/output_dir

Advanced Nextflow Arguments

  • -resume: Resume a previous run
  • -with-report: Generate execution report
  • -with-trace: Generate execution trace
  • -with-timeline: Generate execution timeline
  • -w: Specify work directory

Pipeline Workflow

The pipeline performs the following steps:

  1. Read Processing

    • Adapter trimming with fastp
    • Read alignment to reference genome with minimap2
    • SAM to BAM conversion
    • BAM sorting
    • Duplicate marking
  2. Amplicon Matching

    • Extract reads matching probes from the probe hit files
    • Create a sub-BAM with those reads
    • Match reads to amplicons
    • Count reads per amplicon
  3. Variant Analysis

    • Parse variant information from JSON
    • Calculate coverage at variant positions using bam-readcount
    • Calculate Variant Allele Frequency (VAF)
  4. Output Generation

    • Create comprehensive spreadsheet with variant support metrics
    • Optional separation by cancer type
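
The coverage step in the Variant Analysis stage relies on bam-readcount output. The sketch below shows one way such a line could be turned into per-base counts, assuming the tab-separated layout documented for bam-readcount (chromosome, position, reference base, depth, then one colon-separated entry per base beginning with base:count). The function name parse_readcount_line and the sample line are hypothetical; the sample line is synthetic and its per-base entries are truncated for brevity.

```python
def parse_readcount_line(line):
    """Parse one bam-readcount line into position, depth, and per-base counts."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos = fields[0], int(fields[1])
    ref, depth = fields[2].upper(), int(fields[3])
    counts = {}
    for entry in fields[4:]:
        # Each entry starts with "base:count:"; later statistics are ignored here.
        base, count = entry.split(":")[:2]
        counts[base] = int(count)
    return {"chr": chrom, "pos": pos, "ref": ref, "depth": depth, "counts": counts}


# Synthetic example line (not real data); real entries carry more statistics.
line = "chr1\t12345\tC\t10\t=:0:0.00\tA:0:0.00\tC:7:60.00\tG:0:0.00\tT:3:60.00\tN:0:0.00"
record = parse_readcount_line(line)
print(record["depth"], record["counts"]["C"], record["counts"]["T"])
```

The per-base counts in the returned dictionary correspond to what the spreadsheet reports as A_cov, C_cov, G_cov, and T_cov.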

Output Files

The pipeline generates several output files, with the two primary results being:

  • variant_support.xlsx: Excel spreadsheet with all variant support metrics
  • variant_support.txt: Tab-separated text version of the same data

Additionally, the pipeline produces:

  • amplicon_counts.txt: Raw counts of amplicon matches
  • all_variants_coverage.txt: Coverage data for all variants
  • Sample-specific BAM files in samples/ directory

Cancer Type Classification

If a cancer types TSV file is provided (--cancer_types), the output will be separated by cancer type in the Excel spreadsheet. This file should contain two columns (no header):

  • Sample name
  • Cancer type

Example:

PX3492_AAGAGGCA-AGAGGATA    TYPE_A
PX3492_AAGAGGCA-ATAGAGAG    TYPE_B
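
A file in this layout can be read into a sample-to-type lookup with the standard csv module. This is a minimal sketch assuming the columns are separated by a single tab; load_cancer_types is a hypothetical helper, not a function from the pipeline.

```python
import csv
import io


def load_cancer_types(handle):
    """Read a two-column, header-less TSV into a {sample: cancer_type} dict."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        sample, cancer_type = row[0].strip(), row[1].strip()
        mapping[sample] = cancer_type
    return mapping


# In-memory stand-in for the file passed via --cancer_types.
example = io.StringIO(
    "PX3492_AAGAGGCA-AGAGGATA\tTYPE_A\n"
    "PX3492_AAGAGGCA-ATAGAGAG\tTYPE_B\n"
)
cancer_types = load_cancer_types(example)
print(cancer_types)
```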

Understanding the Variant Support Spreadsheet

The variant support spreadsheet (both Excel and text formats) contains the following key columns:

Sample Information

Column      Description
Sample      Sample identifier matching the FastQ filename
CancerType  Cancer type classification (if provided)

Variant Information

Column           Description
OriginalVariant  Original variant notation (e.g., AKT1:c.49G>A)
ProbeVariant     Variant notation used for the probe (HGVS cDNA format)
Probe            Probe identifier (typically genomic coordinates)
Chr              Chromosome containing the variant
Coordinate       Genomic position of the variant
ReferenceAllele  Reference nucleotide at the variant position
AlternateAllele  Alternate nucleotide at the variant position

Probe Hit Metrics

Column               Description
ProbeHits_S1         Number of probe hits matching amplicon position 1
ProbeHits_S2         Number of probe hits matching amplicon position 2
ProbeHits_S3         Number of probe hits matching amplicon position 3
TotalProbeHits       Total number of probe hits across all amplicons
SupportingAmplicons  Number of amplicons with at least one probe hit (0-3)
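
The two summary columns can be derived from the three per-amplicon counts. The sketch below assumes that TotalProbeHits is the plain sum of ProbeHits_S1 to ProbeHits_S3; the function name summarise_probe_hits is hypothetical and the input counts are made up.

```python
def summarise_probe_hits(s1, s2, s3):
    """Derive the summary columns from the three per-amplicon probe hit counts."""
    hits = (s1, s2, s3)
    return {
        # Assumption: the total is the sum over the three amplicon positions.
        "TotalProbeHits": sum(hits),
        # Amplicons with at least one hit, so the value is always in 0..3.
        "SupportingAmplicons": sum(1 for h in hits if h > 0),
    }


summary = summarise_probe_hits(12, 0, 5)
print(summary)
```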

Coverage Metrics

Column             Description
BamCoverage        Total read depth at the variant position
ReferenceCoverage  Number of reads with the reference allele
AlternateCoverage  Number of reads with the alternate allele
VAF                Variant Allele Frequency (percentage of reads with alternate allele)
A_cov              Number of reads with 'A' at the variant position
C_cov              Number of reads with 'C' at the variant position
G_cov              Number of reads with 'G' at the variant position
T_cov              Number of reads with 'T' at the variant position
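
As a worked illustration of the VAF column, the sketch below computes the percentage of reads carrying the alternate allele. It assumes the denominator is the total depth (BamCoverage) rather than the sum of reference and alternate reads; the source does not say which the pipeline uses, and variant_allele_frequency is a hypothetical name.

```python
def variant_allele_frequency(alt_reads, total_reads):
    """Return VAF as a percentage; 0.0 when the position has no coverage."""
    if total_reads <= 0:
        return 0.0  # avoid division by zero at uncovered positions
    return 100.0 * alt_reads / total_reads


# 3 alternate reads out of a depth of 10 (made-up numbers).
print(variant_allele_frequency(3, 10))
```

Returning 0.0 for uncovered positions is a design choice of this sketch; a real report may prefer to leave the cell empty so that "no coverage" is not confused with "no variant reads".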

Troubleshooting

Common Issues

  • Missing sample files: Ensure the sample names in the FastQ files match exactly with the sample folders in the probing results
  • Missing dependencies: The pipeline uses Singularity containers, so ensure Singularity is properly installed and configured
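
For the sample-name issue, a quick comparison of the two sets of names can show which side is missing. This is a minimal sketch that takes plain lists of names; how the names are collected (for example, by listing the FastQ and probing result directories) is left out, and unmatched_samples is a hypothetical helper.

```python
def unmatched_samples(fastq_samples, probing_folders):
    """Return (samples without a probing folder, folders without FastQ files)."""
    fastq_set, probe_set = set(fastq_samples), set(probing_folders)
    return sorted(fastq_set - probe_set), sorted(probe_set - fastq_set)


# Made-up names: the second sample differs between the two inputs.
missing_probing, missing_fastq = unmatched_samples(
    ["PX3492_AAGAGGCA-AGAGGATA", "PX3492_AAGAGGCA-ATAGAGAG"],
    ["PX3492_AAGAGGCA-AGAGGATA", "PX3492_AAGAGGCA-ATAGAGAC"],
)
print(missing_probing, missing_fastq)
```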

Error Handling

The pipeline is configured with errorStrategy = 'retry', so Nextflow re-runs a failed process before the failure stops the pipeline.

Project Structure

.
├── main.nf                 # Main workflow script
├── nextflow.config         # Configuration file
├── README.md               # Main documentation file
├── docs/                   # Extra documentation files
├── workflows/
│   ├── genome_processes.nf     # Genome alignment processes
│   ├── amplicon_matching_processes.nf  # Amplicon matching processes
│   ├── variant_processes.nf    # Variant analysis processes
│   └── utilities_processes.nf  # Utility processes
└── bin/
    └── build_variant_support_df.py  # Script for generating final output

Detailed Technical Documentation

Check here for a comprehensive overview of the core workflow and process details.