vcfmerge - a small Python script that merges somatic SNV/InDel calls from three individual (bgzipped) VCF files:
- MuTect2 VCF (SNVs + InDels)
- Strelka2 SNV VCF
- Strelka2 InDel VCF
Two resulting VCF files are produced:
- <tumor_id>_<normal_id>_vcfmerge_all.vcf - contains all calls, both rejected and PASSed
- <tumor_id>_<normal_id>_vcfmerge_somatic.vcf - contains only the somatic calls, i.e. those with PASS status in Strelka2, MuTect2, or both
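For downstream scripts that need to locate the two output files, the naming pattern above can be reproduced as in the following minimal sketch. The helper `output_paths` and the sample IDs are hypothetical and not part of vcfmerge; the sketch also assumes uncompressed output (no `--compress`).

```python
from pathlib import Path


def output_paths(output_dir, tumor_id, normal_id):
    """Return (all_calls_path, somatic_calls_path) following the README naming pattern."""
    prefix = f"{tumor_id}_{normal_id}_vcfmerge"
    out = Path(output_dir)
    return out / f"{prefix}_all.vcf", out / f"{prefix}_somatic.vcf"


# Hypothetical sample IDs and output directory, for illustration only
all_vcf, somatic_vcf = output_paths("results", "TUMOR01", "NORMAL01")
print(all_vcf.name)      # TUMOR01_NORMAL01_vcfmerge_all.vcf
print(somatic_vcf.name)  # TUMOR01_NORMAL01_vcfmerge_somatic.vcf
```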
The script produces multiple dedicated VCF INFO tags in the resulting output files to simplify downstream annotation, most importantly:
- TDP - total sequencing depth of variant site in tumor (i.e. DP in tumor sample, MuTect2 values have priority)
- TVAF - allelic fraction of alternate allele in tumor (i.e. AF in tumor sample, MuTect2 values have priority)
- CDP - total sequencing depth of variant site in control (i.e. DP in control sample, MuTect2 values have priority)
- CVAF - allelic fraction of alternate allele in control (i.e. AF in control sample, MuTect2 values have priority)
- VARIANT_CALLERS - any of mutect2, strelka2, or mutect2,strelka2 (called by both)
- MNV_SUPPORT_STRELKA - because Strelka2 does not properly call multinucleotide variants (MNVs, or block substitutions), the script gathers consecutive PASS SNVs from the Strelka2 calls when they correspond to a PASS MNV in MuTect2
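The tag logic described above can be illustrated with a small sketch. This is not the script's actual code: the functions `merge_site` and `mnv_strelka_support`, the dict-based record layout, and the return format are all assumptions made for illustration. In particular, it assumes that VARIANT_CALLERS lists every caller reporting the site, and that an MNV counts as supported only when every changed base has a matching Strelka2 PASS SNV.

```python
def merge_site(mutect=None, strelka=None):
    """Combine per-caller values for one site; MuTect2 values take priority.

    Each argument is a hypothetical dict such as
    {"TDP": 80, "TVAF": 0.25, "CDP": 60, "CVAF": 0.0}, or None if the
    caller did not report the site.
    """
    if mutect is None and strelka is None:
        raise ValueError("at least one caller record is required")
    callers = [name for name, rec in (("mutect2", mutect), ("strelka2", strelka))
               if rec is not None]
    tags = {"VARIANT_CALLERS": ",".join(callers)}
    for key in ("TDP", "TVAF", "CDP", "CVAF"):
        value = None
        # MuTect2 is checked first, so its value wins when both callers have one
        for rec in (mutect, strelka):
            if rec is not None and rec.get(key) is not None:
                value = rec[key]
                break
        tags[key] = value
    return tags


def mnv_strelka_support(mnv_pos, mnv_ref, mnv_alt, strelka_pass_snvs):
    """Return the Strelka2 PASS SNVs that together make up a MuTect2 MNV, else [].

    strelka_pass_snvs is a set of (pos, ref, alt) tuples on the same chromosome.
    """
    if len(mnv_ref) != len(mnv_alt) or len(mnv_ref) < 2:
        return []  # not a block substitution
    parts = [(mnv_pos + i, r, a)
             for i, (r, a) in enumerate(zip(mnv_ref, mnv_alt)) if r != a]
    return parts if all(p in strelka_pass_snvs for p in parts) else []


both = merge_site(mutect={"TDP": 80, "TVAF": 0.25, "CDP": 60, "CVAF": 0.0},
                  strelka={"TDP": 75, "TVAF": 0.24, "CDP": 58, "CVAF": 0.0})
print(both["VARIANT_CALLERS"], both["TDP"])  # mutect2,strelka2 80
print(mnv_strelka_support(100, "AC", "GT", {(100, "A", "G"), (101, "C", "T")}))
```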
    usage: vcfmerge.py [-h] [--mutect_vcf MUTECT_VCF]
                       [--strelka_snv_vcf STRELKA_SNV_VCF]
                       [--strelka_indel_vcf STRELKA_INDEL_VCF]
                       [--compress] [--force_overwrite]
                       tumor_sample_id control_sample_id output_dir

    Merge somatic calls (SNVs/InDels) from multiple VCF files into a single VCF

    positional arguments:
      tumor_sample_id       Sample ID for the tumor sample
      control_sample_id     Sample ID for the control sample
      output_dir            Directory for output files

    optional arguments:
      -h, --help            show this help message and exit
      --mutect_vcf MUTECT_VCF
                            Bgzipped VCF input file with somatic query variants (SNVs) called with MuTect (version 2.x). (default: None)
      --strelka_snv_vcf STRELKA_SNV_VCF
                            Bgzipped VCF input file with somatic query variants (SNVs) called with Strelka (version 2.x). (default: None)
      --strelka_indel_vcf STRELKA_INDEL_VCF
                            Bgzipped VCF input file with somatic query variants (InDels) called with Strelka (version 2.x) (default: None)
      --compress            Compress output VCF with bgzip + tabix (default: False)
      --force_overwrite     Overwrite existing output files (default: False)
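As an illustration of the usage above, a call to the script can be assembled from Python as in the sketch below. The helper `build_vcfmerge_command`, the input file names, the sample IDs, and the `python` interpreter name are hypothetical; only the flags and positional arguments are taken from the help text. The sketch prints the command rather than running it.

```python
import shlex


def build_vcfmerge_command(tumor_id, control_id, output_dir,
                           mutect_vcf=None, strelka_snv_vcf=None,
                           strelka_indel_vcf=None,
                           compress=False, force_overwrite=False):
    """Assemble the argument list for vcfmerge.py from the documented options."""
    cmd = ["python", "vcfmerge.py"]
    for flag, value in (("--mutect_vcf", mutect_vcf),
                        ("--strelka_snv_vcf", strelka_snv_vcf),
                        ("--strelka_indel_vcf", strelka_indel_vcf)):
        if value is not None:
            cmd += [flag, value]
    if compress:
        cmd.append("--compress")
    if force_overwrite:
        cmd.append("--force_overwrite")
    # Positional arguments come last, in the order shown in the usage line
    cmd += [tumor_id, control_id, output_dir]
    return cmd


# Hypothetical file names and sample IDs
cmd = build_vcfmerge_command(
    "TUMOR01", "NORMAL01", "results",
    mutect_vcf="mutect2.vcf.gz",
    strelka_snv_vcf="strelka2.snvs.vcf.gz",
    strelka_indel_vcf="strelka2.indels.vcf.gz",
    compress=True,
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```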
- The tumor and control sample identifiers provided as input arguments must match the names provided in the individual VCF files (sample columns)
- Note that the script currently ignores multi-allelic sites entirely (i.e. sites with multiple alternate alleles). In our experience so far, however, few such sites carry somatic events with PASS status
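Both notes can be checked before running the merge. The sketch below is an illustration, not part of vcfmerge: `vcf_sample_names` and `is_multiallelic` are hypothetical helpers, and the demo VCF is written to a temporary file just to keep the example self-contained. It relies on Python's `gzip` module, which can read bgzipped files, and on the VCF convention that multiple alternate alleles are comma-separated in the ALT column.

```python
import gzip
import os
import tempfile


def vcf_sample_names(vcf_path):
    """Return the sample column names from the #CHROM line of a (b)gzipped VCF."""
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Sample columns follow the nine fixed VCF columns
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached the records without finding a header line
    return []


def is_multiallelic(alt_field):
    """True when the ALT column lists more than one alternate allele."""
    return "," in alt_field


# Tiny demo VCF with hypothetical sample names
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                    "INFO", "FORMAT", "NORMAL01", "TUMOR01"])
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.vcf.gz")
    with gzip.open(path, "wt") as out:
        out.write("##fileformat=VCFv4.2\n" + header + "\n")
    samples = vcf_sample_names(path)

missing = {"TUMOR01", "NORMAL01"} - set(samples)
print(samples, "missing:", sorted(missing))
print(is_multiallelic("A,T"), is_multiallelic("A"))
```

If `missing` is non-empty, the IDs passed on the command line would not match the sample columns of that VCF.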