This Nextflow pipeline processes FastQ files from targeted sequencing to analyze and quantify variant support across samples. The pipeline handles adapter trimming, read alignment, amplicon matching, variant coverage analysis, and produces a comprehensive report of variant support metrics.
- Processing of paired-end FastQ files
- Adapter trimming using fastp
- Read alignment to a reference genome using minimap2
- BAM file processing (conversion, sorting, duplicate marking)
- Amplicon matching against probe hits from a probing pipeline
- Variant coverage analysis using bam-readcount
- Generation of detailed variant support spreadsheets with metrics including:
  - Probe hits per amplicon
  - Reference and alternate allele coverage
  - Variant Allele Frequency (VAF) calculations
Check the quick start guide here.
The pipeline uses Singularity containers for all processes, so you'll need:
- Nextflow (>= 21.04.0)
- Singularity (>= 3.0)
The following tools are used via containers:
- fastp (for adapter trimming)
- minimap2 (for read alignment)
- samtools (for SAM/BAM conversion and processing)
- sambamba (for BAM sorting and duplicate marking)
- bedtools (for amplicon matching)
- bam-readcount (for variant coverage analysis)
- Python with pandas/openpyxl (for results processing)
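All of these tools are run inside containers, so none of them need to be on your PATH. As a rough illustration only (the process name and image path below are placeholders, not the pipeline's actual values), Singularity support in a Nextflow pipeline is typically wired up in `nextflow.config` like this:

```nextflow
// Illustrative sketch; the pipeline's real nextflow.config may differ.
singularity {
    enabled    = true
    autoMounts = true                 // bind host paths into containers automatically
}

process {
    // Attach a container image to a specific process.
    // 'FASTP_TRIM' and the .sif path are placeholders.
    withName: 'FASTP_TRIM' {
        container = '/path/to/containers/fastp.sif'
    }
}
```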
The pipeline requires three main inputs:

1. FastQ files directory (`--fastq`)
   - Files must follow the pattern `<SAMPLE>_<MATEPAIR>_<anything>.fastq.gz` (a pairing sketch follows after this list)
   - `<SAMPLE>` must match the folder names from the probing results
   - `<MATEPAIR>` should be 1/2 or R1/R2
   - Example: `PX3492_AAGAGGCA-AGAGGATA_1_150bp_2_lanes.merge_chastity_passed.fastq.gz`
2. Probing results directory (`--probing_results`)
   - Contains subfolders for each sample with probe hit information
   - Each sample folder contains files like `<sample>.merge.hit.sequences.summary`
   - These files contain probe hit information in a FASTA-like format
3. Variant information JSON (`--variant_info_json`)
   - Contains detailed annotation for variants used in probing
   - Includes genomic coordinates, HGVS notation, probe sequences, etc.

In addition, the pipeline uses:

- Reference genome FASTA file and index
- Amplicon BED file with coordinates
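For reference, the FastQ naming pattern above is the kind of layout Nextflow's `Channel.fromFilePairs` can group into `(sample, [R1, R2])` tuples. The sketch below is illustrative only; the glob and the key-extraction logic are assumptions, not necessarily what `main.nf` does:

```nextflow
// Sketch: pair <SAMPLE>_<MATEPAIR>_<anything>.fastq.gz files by <SAMPLE>.
// Illustrative only; not taken from main.nf.
params.fastq = '/path/to/fastq_dir'

workflow {
    Channel
        .fromFilePairs("${params.fastq}/*_{1,2,R1,R2}_*.fastq.gz") { file ->
            // Drop everything from the mate-pair token onwards, e.g.
            // PX3492_AAGAGGCA-AGAGGATA_1_150bp_... -> PX3492_AAGAGGCA-AGAGGATA
            file.name.replaceFirst(/_(R?[12])_.*/, '')
        }
        .view { sample, reads -> "${sample}: ${reads}" }
}
```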
Basic usage, with only the required parameters:

```bash
nextflow run main.nf \
    --fastq /path/to/fastq_dir \
    --probing_results /path/to/probing_results \
    --out_dir /path/to/output_dir
```
A full run specifying all parameters:

```bash
nextflow run main.nf \
    --fastq /path/to/fastq_dir \
    --probing_results /path/to/probing_results \
    --out_dir /path/to/output_dir \
    --amplicon_bed_file /path/to/amplicon.bed \
    --variant_info_json /path/to/variant_info.json \
    --reference_file /path/to/reference.fa \
    --reference_index /path/to/reference.fa.fai \
    --cancer_types /path/to/cancer_types.tsv \
    --debug true \
    --dev true \
    --randomize true
```
Required parameters:

| Parameter | Description |
|---|---|
| `--fastq` | Directory containing FastQ files |
| `--probing_results` | Directory containing probing results |
| `--out_dir` | Output directory for results |
Optional parameters:

| Parameter | Description | Default |
|---|---|---|
| `--amplicon_bed_file` | BED file with amplicon coordinates | `/projects/trans_scratch/validations/M_Anglesio/hg19_bams/amplicon.bed` |
| `--variant_info_json` | JSON file with variant information | `/gsc/pipelines/probes/Anglesio-hg19a-ensembl69/variant_info.json` |
| `--reference_file` | Reference genome FASTA | `/projects/alignment_references/9606/hg19a/genome/bwa_64/hg19a.fa` |
| `--reference_index` | Reference genome index | `/projects/alignment_references/9606/hg19a/genome/bwa_64/hg19a.fa.fai` |
| `--cancer_types` | TSV file with sample cancer types | (None) |
| `--debug` | Enable output of intermediate files | `false` |
| `--dev` | Run only 3 samples for testing | `false` |
| `--randomize` | Randomize FastQ processing order | `false` |
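As a rough sketch (not copied from the pipeline's actual `nextflow.config`), the defaults listed above are the kind of values that would be declared in a `params` block and overridden on the command line with `--<name> <value>`:

```nextflow
// Sketch of a params block declaring the defaults listed above.
// Illustrative only; the actual nextflow.config may be organized differently.
params {
    fastq             = null    // required
    probing_results   = null    // required
    out_dir           = null    // required
    amplicon_bed_file = '/projects/trans_scratch/validations/M_Anglesio/hg19_bams/amplicon.bed'
    variant_info_json = '/gsc/pipelines/probes/Anglesio-hg19a-ensembl69/variant_info.json'
    reference_file    = '/projects/alignment_references/9606/hg19a/genome/bwa_64/hg19a.fa'
    reference_index   = '/projects/alignment_references/9606/hg19a/genome/bwa_64/hg19a.fa.fai'
    cancer_types      = null
    debug             = false
    dev               = false
    randomize         = false
}
```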
The pipeline provides different execution profiles:

- `standard` (default): Run locally
- `cluster`: Submit jobs to SLURM
- `prod`: Use production priority on the cluster

Example with profiles:

```bash
nextflow run main.nf -profile cluster,prod \
    --fastq /path/to/fastq_dir \
    --probing_results /path/to/probing_results \
    --out_dir /path/to/output_dir
```
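Profiles are defined in `nextflow.config`. A minimal sketch of what these three profiles could look like (the SLURM option shown is a placeholder, not the pipeline's actual setting):

```nextflow
// Illustrative profile definitions; the real nextflow.config may differ.
profiles {
    standard {
        process.executor = 'local'              // run on the local machine (default)
    }
    cluster {
        process.executor = 'slurm'              // submit each process as a SLURM job
    }
    prod {
        process.clusterOptions = '--qos=prod'   // placeholder for a production-priority QOS
    }
}
```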
Standard Nextflow command-line options can also be used:

- `-resume`: Resume a previous run
- `-with-report`: Generate execution report
- `-with-trace`: Generate execution trace
- `-with-timeline`: Generate execution timeline
- `-w`: Specify work directory
The pipeline performs the following steps:

1. Read Processing (a command-level sketch follows after this list)
   - Adapter trimming with fastp
   - Read alignment to reference genome with minimap2
   - SAM to BAM conversion
   - BAM sorting
   - Duplicate marking
2. Amplicon Matching
   - Extract reads matching probes from the probe hit files
   - Create a sub-BAM with those reads
   - Match reads to amplicons
   - Count reads per amplicon
3. Variant Analysis
   - Parse variant information from JSON
   - Calculate coverage at variant positions using bam-readcount
   - Calculate Variant Allele Frequency (VAF)
4. Output Generation
   - Create comprehensive spreadsheet with variant support metrics
   - Optional separation by cancer type
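As a rough illustration of the Read Processing step, the sketch below approximates what a combined trimming/alignment/dedup process could look like. The process name, file names, and tool options are simplified assumptions, not the pipeline's actual code; the real process definitions live in `workflows/genome_processes.nf`:

```nextflow
// Illustrative sketch of the read-processing commands; not the pipeline's
// actual process definitions (see workflows/genome_processes.nf for those).
process ALIGN_AND_DEDUP {
    input:
    tuple val(sample), path(reads)      // reads = [R1.fastq.gz, R2.fastq.gz]
    path reference                      // reference genome FASTA

    output:
    tuple val(sample), path("${sample}.dedup.bam")

    script:
    """
    # adapter trimming
    fastp -i ${reads[0]} -I ${reads[1]} \\
          -o ${sample}_1.trim.fastq.gz -O ${sample}_2.trim.fastq.gz

    # alignment (minimap2 short-read preset) and SAM -> BAM conversion
    minimap2 -ax sr ${reference} ${sample}_1.trim.fastq.gz ${sample}_2.trim.fastq.gz \\
        | samtools view -b -o ${sample}.bam -

    # coordinate sorting and duplicate marking
    sambamba sort -o ${sample}.sorted.bam ${sample}.bam
    sambamba markdup ${sample}.sorted.bam ${sample}.dedup.bam
    """
}
```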
The pipeline generates several output files, with the two primary results being:

- `variant_support.xlsx`: Excel spreadsheet with all variant support metrics
- `variant_support.txt`: Tab-separated text version of the same data

Additionally, the pipeline produces:

- `amplicon_counts.txt`: Raw counts of amplicon matches
- `all_variants_coverage.txt`: Coverage data for all variants
- Sample-specific BAM files in the `samples/` directory
If a cancer types TSV file is provided (`--cancer_types`), the output will be separated by cancer type in the Excel spreadsheet. This file should contain two columns (no header):

- Sample name
- Cancer type

Example:

```
PX3492_AAGAGGCA-AGAGGATA TYPE_A
PX3492_AAGAGGCA-ATAGAGAG TYPE_B
```
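One way a two-column TSV like this can be consumed in Nextflow is via `splitCsv`; this is only a sketch of the idea, not necessarily how the pipeline reads it:

```nextflow
// Sketch: read the cancer types TSV into (sample, cancer_type) tuples.
// Illustrative only; not taken from main.nf.
workflow {
    Channel
        .fromPath(params.cancer_types)
        .splitCsv(sep: '\t')                     // no header: each row is a plain list
        .map { row -> tuple(row[0], row[1]) }    // [sample_name, cancer_type]
        .view()
}
```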
The variant support spreadsheet (both Excel and text formats) contains the following key columns.

Sample information:

| Column | Description |
|---|---|
| Sample | Sample identifier matching the FastQ filename |
| CancerType | Cancer type classification (if provided) |

Variant information:

| Column | Description |
|---|---|
| OriginalVariant | Original variant notation (e.g., `AKT1:c.49G>A`) |
| ProbeVariant | Variant notation used for the probe (HGVS cDNA format) |
| Probe | Probe identifier (typically genomic coordinates) |
| Chr | Chromosome containing the variant |
| Coordinate | Genomic position of the variant |
| ReferenceAllele | Reference nucleotide at the variant position |
| AlternateAllele | Alternate nucleotide at the variant position |

Probe hit metrics:

| Column | Description |
|---|---|
| ProbeHits_S1 | Number of probe hits matching amplicon position 1 |
| ProbeHits_S2 | Number of probe hits matching amplicon position 2 |
| ProbeHits_S3 | Number of probe hits matching amplicon position 3 |
| TotalProbeHits | Total number of probe hits across all amplicons |
| SupportingAmplicons | Number of amplicons with at least one probe hit (0-3) |

Coverage and VAF metrics:

| Column | Description |
|---|---|
| BamCoverage | Total read depth at the variant position |
| ReferenceCoverage | Number of reads with the reference allele |
| AlternateCoverage | Number of reads with the alternate allele |
| VAF | Variant Allele Frequency (percentage of reads with the alternate allele) |
| A_cov | Number of reads with 'A' at the variant position |
| C_cov | Number of reads with 'C' at the variant position |
| G_cov | Number of reads with 'G' at the variant position |
| T_cov | Number of reads with 'T' at the variant position |
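The VAF column is presumably derived from the coverage columns as AlternateCoverage / BamCoverage × 100; for example, 25 alternate-allele reads at a position with a total depth of 500 would give a VAF of 5%.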
Common issues:

- Missing sample files: Ensure the sample names in the FastQ files match exactly with the sample folders in the probing results
- Missing dependencies: The pipeline uses Singularity containers, so ensure Singularity is properly installed and configured

The pipeline is configured with `errorStrategy = 'retry'`, which will attempt to retry failed processes before failing the pipeline.
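In `nextflow.config` terms this corresponds to something like the following; only `errorStrategy = 'retry'` is stated here, so the `maxRetries` value is an illustrative assumption:

```nextflow
// Sketch of a retry policy; the maxRetries value is assumed for illustration.
process {
    errorStrategy = 'retry'
    maxRetries    = 2
}
```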
Repository layout:

```
.
├── main.nf                              # Main workflow script
├── nextflow.config                      # Configuration file
├── README.md                            # This README
├── docs/                                # Extra documentation files
├── workflows/
│   ├── genome_processes.nf              # Genome alignment processes
│   ├── amplicon_matching_processes.nf   # Amplicon matching processes
│   ├── variant_processes.nf             # Variant analysis processes
│   └── utilities_processes.nf           # Utility processes
└── bin/
    └── build_variant_support_df.py      # Script for generating the final output
```
Check here for a comprehensive overview of the core workflow and process details.