Inconsistent classification of same variant
mbhall88 opened this issue · 2 comments
Me again 😬
I have a curious case where the same variants from two different variant callers gives a TP for one and a FP for the other...
Here is the variant (with context) from Clair3 with the vcfdist assessments
chromosome 3462134 . G C . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:TP:1.000000:1:0:gm:0:10999:5:0:0:.:. 1:TP:1.000000:1:0:gm:24:10999:5:0:0:0:0
chromosome 3462135 . C T . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:TP:1.000000:1:0:gm:0:10999:4:0:0:.:. 1:TP:1.000000:1:0:gm:16:10999:4:0:0:0:0
chromosome 3462136 . C T . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:TP:1.000000:1:0:gm:0:10999:3:0:0:.:. 1:TP:1.000000:1:0:gm:23:10999:3:0:0:0:0
chromosome 3462139 . G GGT . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE .:.:.:.:.:.:.:10999:.:.:0:.:. 1:TP:1.000000:5:0:gm:17:10999:2:0:0:0:0
chromosome 3462140 . T G . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:TP:1.000000:5:0:gm:0:10999:2:0:0:.:. .:.:.:.:.:.:.:10999:.:.:0:.:.
chromosome 3462141 . A T . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:TP:1.000000:5:0:gm:0:10999:2:0:0:.:. .:.:.:.:.:.:.:10999:.:.:0:.:.
chromosome 3462142 . T A . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE .:.:.:.:.:.:.:10999:.:.:0:.:. 1:TP:1.000000:5:0:gm:16:10999:2:0:0:0:0
chromosome 3462143 . T A . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:TP:1.000000:5:0:gm:0:10999:2:0:0:.:. .:.:.:.:.:.:.:10999:.:.:0:.:.
chromosome 3462144 . ACC A . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE .:.:.:.:.:.:.:10999:.:.:0:.:. 1:TP:1.000000:5:0:gm:25:10999:2:0:0:0:0
chromosome 3462145 . C T . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:TP:1.000000:5:0:gm:0:10999:2:0:0:.:. .:.:.:.:.:.:.:10999:.:.:0:.:.
chromosome 3462146 . C A . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:TP:1.000000:5:0:gm:0:10999:2:0:0:.:. .:.:.:.:.:.:.:10999:.:.:0:.:.
chromosome 3462151 . A G . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:TP:1.000000:1:0:gm:0:10999:1:0:0:.:. 1:TP:1.000000:1:0:gm:16:10999:1:0:0:0:0
and the same variants from DeepVariant
chromosome 3462134 . G C . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:TP:1.000000:1:0:gm:0:11000:5:0:0:.:. 1:TP:1.000000:1:0:gm:20:11000:5:0:0:0:0
chromosome 3462135 . C T . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:TP:1.000000:1:0:gm:0:11000:4:0:0:.:. 1:TP:1.000000:1:0:gm:12:11000:4:0:0:0:0
chromosome 3462136 . C T . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:TP:1.000000:1:0:gm:0:11000:3:0:0:.:. 1:TP:1.000000:1:0:gm:11:11000:3:0:0:0:0
chromosome 3462139 . G GGT . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE .:.:.:.:.:.:.:11000:.:.:0:.:. 1:FP:0.600000:5:2:lm:3:11000:2:0:0:0:0
chromosome 3462140 . T G . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:FN:0.600000:5:2:lm:0:11000:2:0:0:.:. .:.:.:.:.:.:.:11000:.:.:0:.:.
chromosome 3462141 . A T . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:FN:0.600000:5:2:lm:0:11000:2:0:0:.:. .:.:.:.:.:.:.:11000:.:.:0:.:.
chromosome 3462142 . T A . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE .:.:.:.:.:.:.:11000:.:.:0:.:. 1:FP:0.600000:5:2:lm:22:11000:2:0:0:0:0
chromosome 3462143 . T A . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:FN:0.600000:5:2:lm:0:11000:2:0:0:.:. .:.:.:.:.:.:.:11000:.:.:0:.:.
chromosome 3462145 . C T . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:FN:0.600000:5:2:lm:0:11000:2:0:0:.:. .:.:.:.:.:.:.:11000:.:.:0:.:.
chromosome 3462146 . C A . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:FN:0.600000:5:2:lm:0:11000:2:0:0:.:. .:.:.:.:.:.:.:11000:.:.:0:.:.
chromosome 3462151 . A G . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE 1:TP:1.000000:1:0:gm:0:11000:1:0:0:.:. 1:TP:1.000000:1:0:gm:14:11000:1:0:0:0:0
the truth variants for this region are
chromosome 3462134 2abfe98a G C . PASS . GT 1
chromosome 3462135 75aed307 C T . PASS . GT 1
chromosome 3462136 9c7391b9 C T . PASS . GT 1
chromosome 3462140 60d3b9d8 T G . PASS . GT 1
chromosome 3462141 a08fd538 A T . PASS . GT 1
chromosome 3462143 f3d80833 T A . PASS . GT 1
chromosome 3462145 b8c36e5e C T . PASS . GT 1
chromosome 3462146 f6f2bac0 C A . PASS . GT 1
chromosome 3462151 ecc19214 A G . PASS . GT 1
You can see in the clair3 VCF that there are positions marked at TP when there is no QUERY variant and vice versa. This doesn't seem to make a lot of sense? Then the same positions for DeepVariant are (correctly) marked as FPs and FNs.
I have attached two tarballs with the necessary data to reproduce this with the following command
vcfdist ATCC_10708__202309.100x.clair3.filter.vcf.gz truth.vcf.gz mutreference.fna --largest-variant 50 --credit-threshold 1.0 -p c3_ATCC_10708. -b ATCC_10708__202309.bed -mx 99
This may initially seem like a bug, but vcfdist
is correct and behaving exactly as intended in the example above. Note that the Clair3 VCF contains an additional variant not present in the DeepVariant VCF:
chromosome 3462144 . ACC A . PASS . GT:BD:BC:RD:QD:BK:QQ:SC:SG:PS:PB:BS:FE .:.:.:.:.:.:.:10999:.:.:0:.:. 1:TP:1.000000:5:0:gm:25:10999:2:0:0:0:0
For both VCFs, the cluster of variants in the range 346,2139-346,2146
are considered to be part of a complex variant (because they are in the same supercluster (SC=11000
) and sync group (SG=2
)). For the Clair3 VCF (but not the DeepVariant VCF), the set of variants in that region results in the same sequence as the Truth VCF, and causes all variants to be marked as TP
.
For the bases 3462139-3462146
(inclusive):
Reference Sequence: GTATTACC
Truth VCF Sequence: GGTTAATA
Clair3 VCF Sequence: GGTTAATA
DeepVariant Sequence: GGTTAATACC
Ahhh that is a MUCH better way of looking at it. It's so easy to get bogged down in looking at the individual variants and forget to zoom out a little and see the overall sequence. This is why I love vcfdist.
Thanks for the detailed response.