TimD1/vcfdist

"terminate called after throwing an instance of 'std::out_of_range'"

MariaNattestad opened this issue · 2 comments

Hi Tim

I got an error while running vcfdist on a DeepVariant output VCF.
Here is a reproducible example:

# Grab a chr20 subset for quick iteration:
bcftools view https://42basepairs.com/download/gs/deepvariant/case-study-outputs/1.5.0/PacBio/deepvariant.output.vcf.gz chr20 > deepvariant-small-sample.vcf

# Other inputs:
gs://deepvariant/case-study-testdata/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
gs://deepvariant/case-study-testdata/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
gs://deepvariant/case-study-testdata/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz


DV="deepvariant-small-sample.vcf"
TRUTH="HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz"
TRUTH_REGIONS="HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed"
REF="GCA_000001405.15_GRCh38_no_alt_analysis_set.fa"
OUTPUT="vcfdist_on_deepvariant/"
mkdir $OUTPUT

vcfdist "${DV}" "${TRUTH}" "${REF}" --bed "${TRUTH_REGIONS}" --prefix "${OUTPUT}" --verbosity 1  2>&1 | tee vcfdist_on_deepvariant.log

Output:

[INFO  vcfdist 13:40:31] Command: 'vcfdist deepvariant-small-sample.vcf /usr/local/google/home/marianattestad/commonly_used_deepvariant_files/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz /usr/local/google/home/marianattestad/commonly_used_deepvariant_files/GRCh38.no_alt_analysis_set.fa.gz --bed /usr/local/google/home/marianattestad/commonly_used_deepvariant_files/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed --prefix vcfdist_on_deepvariant/ --verbosity 1'
[INFO  vcfdist 13:40:31]
[INFO  vcfdist 13:40:31] Parsing VCF 'deepvariant-small-sample.vcf'
[WARN  vcfdist 13:40:31] QUERY VCF variant overlap at chr20:1278060, skipping
[WARN  vcfdist 13:40:31] QUERY VCF variant overlap at chr20:1288626, skipping
[WARN  vcfdist 13:40:31] QUERY VCF variant overlap at chr20:1339612, skipping
[WARN  vcfdist 13:40:31] QUERY VCF variant overlap at chr20:2248990, skipping
[WARN  vcfdist 13:40:31] QUERY VCF variant overlap at chr20:3220062, skipping
...
[INFO  vcfdist 13:40:31]     Haplotype 2
[INFO  vcfdist 13:40:31]       REF  0
[INFO  vcfdist 13:40:31]       SNP  71148
[INFO  vcfdist 13:40:31]       INS  5920
[INFO  vcfdist 13:40:31]       DEL  5642
[INFO  vcfdist 13:40:31]       CPX  0
[INFO  vcfdist 13:40:31]
[INFO  vcfdist 13:40:31]   Contigs:
[INFO  vcfdist 13:40:31]       'chr20' hap 1: 29067 variants
[INFO  vcfdist 13:40:31]       'chr20' hap 2: 82710 variants
[INFO  vcfdist 13:40:31]
[INFO  vcfdist 13:40:31]   VCF Overview:
[INFO  vcfdist 13:40:31]     VARIANTS  311095
[INFO  vcfdist 13:40:31]     KEPT HAP1 29067
[INFO  vcfdist 13:40:31]     KEPT HAP2 82710
[INFO  vcfdist 13:40:31]
terminate called after throwing an instance of 'std::out_of_range'
  what():  unordered_map::at

I got the same error on the full VCF that was .gz with a .tbi index as well.

TimD1 commented

Thanks for raising an issue! What commit and version of vcfdist are you running? I just downloaded and tested the files you provided (subsetted to chr20) and it runs fine on the most recent release (vcfdist v1.2.2).

Side note: vcfdist is designed to run on locally phased truth and query VCFs. It looks like neither of your VCFs are phased (as a result, reported precision and recall are around 75% when I run vcfdist). A phased version of your truth VCF is available from the GIAB consortium here. For the query VCF, I recommend phasing with WhatsHap.

I'll plan to make a few changes based on this issue:

  • add a warning if either input VCFs are unphased
  • update README, making it more clear that phased VCFs are expected
  • add --branch v1.2.2 to git clone instructions to ensure only more stable releases are cloned

Please let me know if you continue to have issues with a std::out_of_range() error. At the moment, I'm unable to reproduce it.

Okay, the requirement of phasing the query VCF might be outside the scope for me, since that means I can't easily use vcfdist as a drop-in replacement for hap.py in my workflow -- I was hopeful since I wasn't sure how much the phasing mattered based on reading the abstract and skimming the paper.
But thanks for your help!

To answer your question:

git log | head
commit 6014c05f9b28dffc19cabf716925578986ed05a3
Author: Tim Dunn <me.timd1@gmail.com>
Date:   Thu Apr 13 10:03:25 2023 -0400
...