Somatic variation workflow

This repository contains a Nextflow workflow to identify somatic variation in a paired normal/tumor sample. The workflow currently performs:

Alignment QC and statistics.
Somatic short variant calling (SNV and Indels).
Somatic structural variant calling (SV).
Modified sites calling (mod).

Introduction

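This workflow enables analysis of somatic variation using the following tools:

ClairS for somatic short variant calling
nanomonsv for somatic structural variant calling
modkit for modified base calling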

Quickstart

The workflow uses Nextflow to manage compute and software resources, so Nextflow must be installed before running the workflow.

The workflow can currently be run using either Docker or Singularity to provide isolation of the required software. Both methods are automated out-of-the-box provided either Docker or Singularity is installed.

It is not required to clone or download the git repository in order to run the workflow. For more information on running EPI2ME Labs workflows visit our website.

Workflow options

With Nextflow installed, users can obtain the workflow by running:

nextflow run epi2me-labs/wf-somatic-variation --help

to see the options for the workflow.

Input and Data preparation

The workflow relies on three primary input files:

A reference genome in FASTA format
An aligned BAM file for the tumor sample
An aligned BAM file for the normal sample
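
As a minimal sketch, the three inputs can be checked before launching the workflow. The helper names check_inputs and print_run_command are hypothetical, and the parameter names --ref, --bam_tumor and --bam_normal are assumptions that should be confirmed against the --help output; the command is only printed, not executed.

```shell
# Sketch only: check_inputs and print_run_command are hypothetical helpers.
# ASSUMPTION: --ref, --bam_tumor and --bam_normal are illustrative parameter
# names; confirm them with:
#   nextflow run epi2me-labs/wf-somatic-variation --help

check_inputs() {
    # $1 = reference FASTA, $2 = tumor BAM, $3 = normal BAM
    for f in "$1" "$2" "$3"; do
        # -s is true only when the file exists and is not empty
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            return 1
        fi
    done
    return 0
}

print_run_command() {
    # Print (do not run) a candidate command once all inputs are present
    check_inputs "$1" "$2" "$3" || return 1
    printf 'nextflow run epi2me-labs/wf-somatic-variation --ref %s --bam_tumor %s --bam_normal %s\n' \
        "$1" "$2" "$3"
}
```

Printing the command first makes it easy to review the paths before committing compute time to a run.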

The workflow is designed to work with human samples, and the reference genome should be either hg19 (GRCh37) or hg38 (GRCh38). Despite this, the majority of tasks within the workflow are species agnostic. The following options make the workflow check the genome build and require hg19 or hg38:

Insert classification in nanomonsv (enabled with --classify_insert)

The aligned BAM files can be generated starting from:

POD5/FAST5 files using the wf-basecalling workflow, or
FASTQ files using wf-alignment.

Both workflows will generate aligned BAM files that are ready to be used with wf-somatic-variation.

Demo data

The workflow comes with matched demo data accessible here:

wget -q -O demo_data.tar.gz https://ont-exd-int-s3-euwst1-epi2me-labs.s3.amazonaws.com/wf-somatic-variation/wf-somatic-variation-demo.tar.gz

This demo is derived from a tumor/normal pair of samples that we have made publicly accessible. Check out our blog post for more details.

Somatic short variant calling

The workflow currently implements a deconstructed version of ClairS (v0.1.0) to identify somatic variants in a paired tumor/normal sample. Running the ClairS steps as separate processes takes advantage of the parallel nature of Nextflow, which improves performance on high-performance, distributed systems.

Currently, ClairS supports the following basecalling models:

dna_r10.4.1_e8.2_400bps_sup@v3.5.2
dna_r9.4.1_e8_hac@v3.3
dna_r9.4.1_e8_sup@v3.3
dna_r9.4.1_450bps_hac_prom
dna_r9.4.1_450bps_hac

Providing any other model will prevent the workflow from starting.
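
The model list can be checked up front with a small helper. This is a sketch under the assumption that the five names listed here are the complete set for this version of the workflow; is_supported_model is a hypothetical function name, not something the workflow provides.

```shell
# Sketch only: is_supported_model is a hypothetical helper that mirrors the
# list of basecalling models given in this section.
is_supported_model() {
    # $1 = basecalling model name
    case "$1" in
        "dna_r10.4.1_e8.2_400bps_sup@v3.5.2") return 0 ;;
        "dna_r9.4.1_e8_hac@v3.3")             return 0 ;;
        "dna_r9.4.1_e8_sup@v3.3")             return 0 ;;
        "dna_r9.4.1_450bps_hac_prom")         return 0 ;;
        "dna_r9.4.1_450bps_hac")              return 0 ;;
        *)                                    return 1 ;;
    esac
}
```

For example, `is_supported_model "dna_r9.4.1_e8_sup@v3.3" && echo ok` prints ok, while an unlisted name returns a non-zero status.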

Indel calling

Currently, indel calling is supported only for dna_r10 basecalling models. When the user specifies an r9 model, the workflow automatically skips the indel processes and performs only SNV calling.
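
The rule above amounts to a prefix test on the model name. The sketch below assumes that matching on the dna_r10 prefix is sufficient; variant_types_for_model is a hypothetical helper and does not reflect how the workflow implements the check internally.

```shell
# Sketch only: variant_types_for_model is a hypothetical helper.
# ASSUMPTION: a model name starting with "dna_r10" is enough to identify
# an r10 model for the purpose of this illustration.
variant_types_for_model() {
    # $1 = basecalling model name
    case "$1" in
        dna_r10*) echo "snv indel" ;;  # r10: SNVs and indels are called
        *)        echo "snv" ;;        # anything else: SNVs only
    esac
}
```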

Somatic structural variant (SV) calling with Nanomonsv

The workflow calls somatic SVs from long-read sequencing data. Starting from the paired cancer/control samples, the workflow will:

Parse the SV signatures in the reads using nanomonsv parse
Call the somatic SVs using a custom version of nanomonsv get
Filter out the SVs in simple repeats using add_simple_repeat.py (optional)
Annotate transposable and repetitive elements using nanomonsv insert_classify (optional)
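
The steps above can be sketched as a dry run that prints the underlying commands. This is an illustration only: the argument order follows our reading of the upstream nanomonsv documentation and should be checked with nanomonsv parse --help, nanomonsv get --help and nanomonsv insert_classify --help. The workflow itself wires these steps differently, the output prefixes are hypothetical, and the optional add_simple_repeat.py filter is left out because its inputs are not described here.

```shell
# Sketch only: print_sv_steps is a hypothetical helper that prints, but does
# not run, one plausible sequence of nanomonsv commands.
# ASSUMPTION: positional argument order as in the upstream nanomonsv docs;
# the "tumor"/"normal" prefixes and the annot.txt name are illustrative.
print_sv_steps() {
    # $1 = tumor BAM, $2 = normal BAM, $3 = reference FASTA, $4 = work dir
    printf 'nanomonsv parse %s %s/tumor\n' "$1" "$4"
    printf 'nanomonsv parse %s %s/normal\n' "$2" "$4"
    printf 'nanomonsv get %s/tumor %s %s --control_prefix %s/normal --control_bam %s\n' \
        "$4" "$1" "$3" "$4" "$2"
    printf 'nanomonsv insert_classify %s/tumor.nanomonsv.result.txt %s/tumor.annot.txt %s\n' \
        "$4" "$4" "$3"
}
```

Calling `print_sv_steps tumor.bam normal.bam ref.fa work` lists the four commands in order, which helps when comparing them against the help text of the installed nanomonsv version.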

As of nanomonsv v0.7.1, users can provide the approximate single-base quality value (QV) for their dataset. To decide which value is most appropriate for your dataset, see the nanomonsv get documentation; in summary:

Basecaller	Quality value
guppy (v5)	10
guppy (v5 or v6)	15
dorado	20

To set the QV, pass it with --qv, for example --qv 20 for data basecalled with dorado.
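
A small guard can catch typos in the QV before a run starts. The sketch below accepts only the three values listed in the table; that restriction is an assumption of this example rather than a documented limit of the workflow, and qv_option is a hypothetical helper.

```shell
# Sketch only: qv_option is a hypothetical helper, not part of the workflow.
# ASSUMPTION: only the values shown in the table (10, 15, 20) are accepted.
qv_option() {
    # $1 = approximate single-base quality value for the dataset
    case "$1" in
        10|15|20)
            printf '%s %s\n' '--qv' "$1" ;;
        *)
            echo "unexpected QV '$1' (the table lists 10, 15 and 20)" >&2
            return 1 ;;
    esac
}
```

The printed option can then be appended to the nextflow run command line.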

Modified base calling

Modified base calling can be performed by specifying --mod. The workflow will call modified bases using modkit. The default behaviour of the workflow is to run modkit with the --cpg --combine-strands options set. It is possible to report strand-aware modifications by providing --force_strand, which will trigger modkit to run in default mode. The modkit run can be fully customized by providing --modkit_args. This will override any preset, and allow full control over the run of modkit.
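
The precedence between these options can be summarized in a few lines. This is a sketch of the behaviour described above, not the workflow's own code; modkit_options is a hypothetical helper that takes the state of --force_strand and the value of --modkit_args and prints the options modkit would receive.

```shell
# Sketch only: modkit_options is a hypothetical helper illustrating the
# precedence described in the text.
modkit_options() {
    # $1 = "true" if --force_strand was given
    # $2 = value passed to --modkit_args (empty string if not given)
    if [ -n "$2" ]; then
        # Custom arguments override every preset
        printf '%s\n' "$2"
    elif [ "$1" = "true" ]; then
        # Strand-aware output: modkit runs in its default mode, no extras
        printf '\n'
    else
        # Workflow default
        printf '%s\n' '--cpg --combine-strands'
    fi
}
```

Note that --modkit_args wins even when --force_strand is also set, which matches the statement that it overrides any preset.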

Output folder

The output directory has the following structure:

output/
├── execution # Execution reports
│   ├── report.html
│   ├── timeline.html
│   └── trace.txt
│
├── SAMPLE
│   ├── qc
│   │   ├── coverage
│   │   │   ├── SAMPLE_normal.mosdepth.global.dist.txt
│   │   │   ├── SAMPLE_normal.mosdepth.summary.txt
│   │   │   ├── SAMPLE_normal.per-base.bed.gz
│   │   │   ├── SAMPLE_normal.regions.bed.gz
│   │   │   ├── SAMPLE_normal.thresholds.bed.gz
│   │   │   ├── SAMPLE_tumor.mosdepth.global.dist.txt
│   │   │   ├── SAMPLE_tumor.mosdepth.summary.txt
│   │   │   ├── SAMPLE_tumor.per-base.bed.gz
│   │   │   ├── SAMPLE_tumor.regions.bed.gz
│   │   │   └── SAMPLE_tumor.thresholds.bed.gz
│   │   └── readstats
│   │       ├── SAMPLE_normal.flagstat.tsv
│   │       ├── SAMPLE_normal.readstats.tsv.gz
│   │       ├── SAMPLE_tumor.flagstat.tsv
│   │       └── SAMPLE_tumor.readstats.tsv.gz
│   │
│   ├── snv  # ClairS outputs
│   │   ├── change_counts  # Mutational change counts for the sample; for now, it only works for the SNVs
│   │   │   └── SAMPLE_changes.csv
│   │   ├── varstats  # Bcftools stats output
│   │   │   └── SAMPLE.stats
│   │   └── vcf  # VCF outputs
│   │       ├── SAMPLE_tumor_germline.vcf.gz
│   │       ├── SAMPLE_tumor_germline.vcf.gz.tbi
│   │       ├── SAMPLE_normal_germline.vcf.gz
│   │       ├── SAMPLE_normal_germline.vcf.gz.tbi
│   │       ├── SAMPLE_somatic_indels.vcf.gz
│   │       ├── SAMPLE_somatic_indels.vcf.gz.tbi
│   │       ├── SAMPLE_somatic_snv.vcf.gz
│   │       └── SAMPLE_somatic_snv.vcf.gz.tbi
│   │
│   ├── sv
│   │   ├── single_breakend
│   │   │   └── SAMPLE.nanomonsv.sbnd.result.txt
│   │   └── txt
│   │       └── SAMPLE.nanomonsv.result.annot.txt
│   │
│   └── mod
│       ├── modC   # Modified bases code
│       │   ├── DML   # Differentially methylated loci
│       │   │   └── SAMPLE.modC.dml.tsv
│       │   ├── DMR   # Differentially methylated regions
│       │   │   └── SAMPLE.modC.dmr.tsv
│       │   ├── DSS   # DSS input files
│       │   │   ├── modC.SAMPLE_normal.dss.tsv
│       │   │   └── modC.SAMPLE_tumor.dss.tsv
│       │   └── bedMethyl   # bedMethyl output files
│       │       ├── modC.SAMPLE_normal.bed.gz
│       │       └── modC.SAMPLE_tumor.bed.gz
│       └── raw   # Raw outputs from modkit
│           ├── SAMPLE_normal.bed
│           └── SAMPLE_tumor.bed
│
├── info  # single component runtime info
│   ├── mod
│   │   ├── params.json
│   │   └── versions.txt
│   ├── snv
│   │   ├── params.json
│   │   └── versions.txt
│   └── sv
│       ├── params.json
│       └── versions.txt
│
├── SAMPLE_somatic_mutype.vcf.gz
├── SAMPLE_somatic_mutype.vcf.gz.tbi
├── SAMPLE.nanomonsv.result.wf_somatic_sv.vcf.gz
├── SAMPLE.nanomonsv.result.wf_somatic_sv.vcf.gz.tbi
├── SAMPLE.normal.mod_summary.tsv
├── SAMPLE.tumor.mod_summary.tsv
├── SAMPLE.wf-somatic-snp-report.html
├── SAMPLE.wf-somatic-sv-report.html
├── SAMPLE.wf-somatic-variation-readQC-report.html
├── params.json
└── versions.txt

The primary outputs are:

output/SAMPLE_somatic_mutype.vcf.gz: the final VCF file with SNVs and, for r10 data, indels
output/SAMPLE.nanomonsv.result.wf_somatic_sv.vcf.gz: the final VCF with the somatic SVs from nanomonsv
output/*.html: the reports of the different stages
output/SAMPLE/snv/change_counts/SAMPLE_changes.csv: the mutation changes for the sample
output/SAMPLE/snv/vcf/SAMPLE_[tumor/normal]_germline.vcf.gz: the germline calls for both the tumor and normal BAM files
output/SAMPLE/sv/txt/SAMPLE.nanomonsv.result.annot.txt: the somatic SVs called with nanomonsv in tabular format
output/SAMPLE/sv/single_breakend/SAMPLE.nanomonsv.sbnd.result.txt: the single break-end SVs called with nanomonsv
output/SAMPLE/mod/: the results from modkit and DSS
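
After a run, the two final VCF files can be checked with a short script. The file names come from the output structure shown above; primary_vcfs and report_missing are hypothetical helpers, and whether both files exist depends on which parts of the workflow were enabled.

```shell
# Sketch only: primary_vcfs and report_missing are hypothetical helpers.
primary_vcfs() {
    # $1 = output directory, $2 = sample name
    printf '%s\n' \
        "$1/${2}_somatic_mutype.vcf.gz" \
        "$1/${2}.nanomonsv.result.wf_somatic_sv.vcf.gz"
}

report_missing() {
    # Print one "missing: PATH" line for every expected VCF that is absent
    primary_vcfs "$1" "$2" | while IFS= read -r path; do
        [ -e "$path" ] || printf 'missing: %s\n' "$path"
    done
}
```

An empty result from `report_missing output SAMPLE` means both files are present.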

Hardware limitations: SV calling requires a system that supports AVX2 instructions. Please ensure that your system supports them before running the workflow with SV calling.
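
On Linux, AVX2 support can usually be checked by looking for the avx2 flag in /proc/cpuinfo. The sketch below assumes a Linux host and GNU or BSD grep; has_avx2 is a hypothetical helper, and other operating systems expose CPU features differently.

```shell
# Sketch only: has_avx2 is a hypothetical helper.
# ASSUMPTION: a Linux-style cpuinfo file in which CPU features are listed as
# whitespace-separated flags.
has_avx2() {
    # $1 = optional path to a cpuinfo-style file (default: /proc/cpuinfo)
    # -w matches "avx2" as a whole word, -q suppresses output
    grep -qw avx2 "${1:-/proc/cpuinfo}" 2>/dev/null
}
```

For example, `has_avx2 && echo "AVX2 available" || echo "AVX2 not detected"` reports the result for the current machine.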
