MuTect2 + Strelka somatic VCF merger

vcfmerge

vcfmerge is a small Python script that merges somatic SNV/InDel calls from three separate bgzipped VCF files:

  • MuTect2 VCF (SNVs + InDels)
  • Strelka2 SNV VCF
  • Strelka2 InDel VCF

Two resulting VCF files are produced:

  • <tumor_id>_<normal_id>_vcfmerge_all.vcf - contains all calls, both PASS and rejected
  • <tumor_id>_<normal_id>_vcfmerge_somatic.vcf - contains somatic calls only (PASS in Strelka2, MuTect2, or both)
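
As a rough illustration of the naming scheme and the "PASS in at least one caller" rule, here is a minimal sketch. The function names and the dictionary layout are assumptions made for this example; they are not taken from vcfmerge.py itself.

```python
import os


def output_paths(tumor_id, normal_id, output_dir):
    """Build the two output file names following the pattern described above."""
    prefix = os.path.join(output_dir, f"{tumor_id}_{normal_id}_vcfmerge")
    return f"{prefix}_all.vcf", f"{prefix}_somatic.vcf"


def is_somatic(filters_by_caller):
    """Return True if at least one caller reported FILTER == PASS.

    filters_by_caller is a hypothetical mapping such as
    {"mutect2": "PASS", "strelka2": "LowEVS"}.
    """
    return any(value == "PASS" for value in filters_by_caller.values())
```

Under this sketch, a call rejected by every caller would still be written to the "all" file but would be left out of the "somatic" file.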

The script adds several dedicated INFO tags to the output VCF files to simplify downstream annotation, most importantly:

  • TDP - total sequencing depth of variant site in tumor (i.e. DP in tumor sample, MuTect2 values have priority)
  • TVAF - allelic fraction of alternate allele in tumor (i.e. AF in tumor sample, MuTect2 values have priority)
  • CDP - total sequencing depth of variant site in control (i.e. DP in control sample, MuTect2 values have priority)
  • CVAF - allelic fraction of alternate allele in control (i.e. AF in control sample, MuTect2 values have priority)
  • VARIANT_CALLERS - any of mutect2, strelka2, or mutect2,strelka2 (called by both)
  • MNV_SUPPORT_STRELKA - Strelka2 does not call multinucleotide variants (MNVs, or block substitutions) as single records. When MuTect2 reports an MNV with PASS status, the script collects the consecutive Strelka2 SNVs (all PASS) that together make up the same MNV
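
The tag logic above can be sketched roughly as follows. This is not the script's actual code: the record format (plain dictionaries that already hold depth and allelic fraction values), the function names, and the handling of unchanged bases inside an MNV are all assumptions made for illustration.

```python
def merge_site(mutect=None, strelka=None):
    """Combine per-caller values for one site, giving MuTect2 priority.

    Each argument is None (caller did not report the site) or a dict with
    the keys TDP, TVAF, CDP and CVAF (hypothetical record format).
    """
    callers = []
    if mutect is not None:
        callers.append("mutect2")
    if strelka is not None:
        callers.append("strelka2")
    if not callers:
        raise ValueError("site was not reported by any caller")

    # MuTect2 values win whenever MuTect2 reported the site.
    source = mutect if mutect is not None else strelka
    info = {key: source[key] for key in ("TDP", "TVAF", "CDP", "CVAF")}
    info["VARIANT_CALLERS"] = ",".join(callers)
    return info


def strelka_supports_mnv(pos, ref, alt, strelka_pass_snvs):
    """Check whether PASS SNVs from Strelka2 reconstruct a MuTect2 MNV.

    strelka_pass_snvs maps a position on the same chromosome to a
    (ref, alt) tuple for each PASS SNV. Bases that do not change inside
    the block are skipped, which is an assumption of this sketch.
    """
    if len(ref) != len(alt) or len(ref) < 2:
        return False  # not a block substitution
    for offset, (ref_base, alt_base) in enumerate(zip(ref, alt)):
        if ref_base == alt_base:
            continue
        if strelka_pass_snvs.get(pos + offset) != (ref_base, alt_base):
            return False
    return True
```

For example, a MuTect2 MNV `AC>GT` at position 100 would count as supported in this sketch only if Strelka2 has PASS SNVs `A>G` at 100 and `C>T` at 101.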

Usage

usage: vcfmerge.py [-h] [--mutect_vcf MUTECT_VCF]
                   [--strelka_snv_vcf STRELKA_SNV_VCF]
                   [--strelka_indel_vcf STRELKA_INDEL_VCF] [--compress]
                   [--force_overwrite]
                   tumor_sample_id control_sample_id output_dir

Merge somatic calls (SNVs/InDels) from multiple VCF files into a single VCF

positional arguments:
  tumor_sample_id       Sample ID for the tumor sample
  control_sample_id     Sample ID for the control sample
  output_dir            Directory for output files

optional arguments:
  -h, --help            show this help message and exit
  --mutect_vcf MUTECT_VCF
                        Bgzipped VCF input file with somatic query variants (SNVs) called with MuTect (version 2.x). (default: None)
  --strelka_snv_vcf STRELKA_SNV_VCF
                        Bgzipped VCF input file with somatic query variants (SNVs) called with Strelka (version 2.x). (default: None)
  --strelka_indel_vcf STRELKA_INDEL_VCF
                        Bgzipped VCF input file with somatic query variants (InDels) called with Strelka (version 2.x) (default: None)
  --compress            Compress output VCF with bgzip + tabix (default: False)
  --force_overwrite     Overwrite existing output files (default: False)
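
To show how the arguments fit together, the following sketch assembles a command line from Python. Only the flags and positional arguments listed in the help text are used; the helper name `build_command`, the default script location, and the file names in the usage note are hypothetical.

```python
import sys


def build_command(tumor_id, control_id, output_dir,
                  mutect_vcf=None, strelka_snv_vcf=None,
                  strelka_indel_vcf=None, compress=False,
                  force_overwrite=False, script="vcfmerge.py"):
    """Return the argument list for one vcfmerge.py run."""
    cmd = [sys.executable, script]

    # Optional input VCFs are only passed when a path was given.
    for flag, path in (("--mutect_vcf", mutect_vcf),
                       ("--strelka_snv_vcf", strelka_snv_vcf),
                       ("--strelka_indel_vcf", strelka_indel_vcf)):
        if path is not None:
            cmd += [flag, path]

    if compress:
        cmd.append("--compress")
    if force_overwrite:
        cmd.append("--force_overwrite")

    # Positional arguments come last, in the order shown in the usage line.
    cmd += [tumor_id, control_id, output_dir]
    return cmd
```

A call such as `build_command("TUMOR01", "NORMAL01", "merged", mutect_vcf="mutect2.vcf.gz", compress=True)` yields a list that could then be handed to `subprocess.run(cmd, check=True)`.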

Notes

  • The tumor and control sample identifiers given as input arguments must match the sample column names in the individual VCF files
  • The script currently ignores multi-allelic sites (i.e. sites with multiple alternate alleles) entirely. In our experience so far, however, very few such sites contain somatic events with a PASS status
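
Both notes can be checked on the input files before running the merge. The sketch below works on plain text lines from a VCF and relies only on the standard VCF layout (nine fixed columns followed by the sample columns, and comma-separated alternate alleles in the ALT field). The function names are made up for this example and do not come from vcfmerge.py.

```python
def missing_sample_ids(header_line, tumor_id, control_id):
    """Return the requested sample IDs that are absent from a #CHROM header line."""
    fields = header_line.rstrip("\n").split("\t")
    samples = fields[9:]  # sample columns follow the nine fixed VCF columns
    return [sid for sid in (tumor_id, control_id) if sid not in samples]


def is_multiallelic(alt_field):
    """Return True if the ALT column lists more than one alternate allele."""
    return "," in alt_field
```

An empty list from `missing_sample_ids` means both identifiers were found among the sample columns; records for which `is_multiallelic` returns True correspond to the sites the script skips.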