SNP calling and known genotypes (question + feature suggestion)
I have genotype array, WGS, and WGBS data for my samples, and I am using them to detect sample swaps. I find that Biscuit genotype calls are highly concordant with WGS genotype calls, except when the reference is 'C' and the true genotype is 'TT'. I understand that it is not possible to genotype accurately in this case, but I am curious about Biscuit's behavior. For example:
From the pileup, there is no evidence of a C allele:
chr1 852875 C 60 TTTTTTTTTTTTtttttttTTTTTTTtttTTTTttTTTttTTttTTTTtttTTTTTtttt
However, in the VCF, the allele support for this position shows 33 Cs and 26 Ts:
chr1 852875 . C T,A,G 34 PASS . DP:GT:GP:GQ:SP:CV:BT 60:0/1:84,4,115:99:C33,T26,A0:.:
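For anyone who wants to pull these counts out of the sample column programmatically, here is a minimal Python sketch. It assumes the SP field is a comma-separated list of base-plus-count tokens, as in the record above; `allele_support` is a hypothetical helper name, not part of Biscuit.

```python
import re


def allele_support(format_keys: str, sample: str) -> dict:
    """Map each base in the SP field to its supporting read count."""
    # Pair FORMAT keys with the sample values, e.g. "SP" -> "C33,T26,A0".
    fields = dict(zip(format_keys.split(":"), sample.split(":")))
    counts = {}
    for token in fields.get("SP", "").split(","):
        # Assumed token layout: one base letter followed by a count.
        match = re.fullmatch(r"([ACGTN])(\d+)", token)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


print(allele_support("DP:GT:GP:GQ:SP:CV:BT", "60:0/1:84,4,115:99:C33,T26,A0:.:"))
# → {'C': 33, 'T': 26, 'A': 0}
```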
Question: In this case, does Biscuit just generate the 'C' count from an expected distribution?
My suggestion is that a nice feature would be detecting sample swaps when genotype information is known. Basically just a script that compares a VCF of known genotypes to the Biscuit-generated VCF, ignores sites where it is difficult or impossible to genotype correctly from WGBS, and outputs a likelihood score that the two VCFs were generated from the same individual.
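A rough Python sketch of the comparison I have in mind. It is not Biscuit code, and all function names are hypothetical. It assumes single-sample, tab-separated VCFs, and it conservatively skips any site where the reference and called alleles include both C and T, or both G and A, since bisulfite conversion confounds those. It reports a plain concordance count rather than a true likelihood score.

```python
from typing import Dict, Iterable, Tuple

Site = Tuple[str, int]
Call = Tuple[str, Tuple[str, ...]]  # (REF base, sorted called bases)


def read_calls(lines: Iterable[str]) -> Dict[Site, Call]:
    """Collect first-sample SNP genotypes from tab-separated VCF records."""
    calls: Dict[Site, Call] = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 10:
            continue  # no FORMAT/sample columns
        keys, values = f[8].split(":"), f[9].split(":")
        if "GT" not in keys or keys.index("GT") >= len(values):
            continue
        gt = values[keys.index("GT")].replace("|", "/").split("/")
        alleles = [f[3].upper()] + f[4].upper().split(",")
        if not all(i.isdigit() and int(i) < len(alleles) for i in gt):
            continue  # missing or malformed genotype
        called = tuple(sorted(alleles[int(i)] for i in gt))
        if any(len(a) != 1 or a not in "ACGT" for a in (alleles[0],) + called):
            continue  # keep single-base SNP calls only
        calls[(f[0], int(f[1]))] = (alleles[0], called)
    return calls


def bisulfite_ambiguous(ref: str, *genotypes: Tuple[str, ...]) -> bool:
    """Assumed rule: C/T and G/A differences cannot be trusted from WGBS."""
    bases = {ref}.union(*genotypes)
    return {"C", "T"} <= bases or {"G", "A"} <= bases


def concordance(known: Dict[Site, Call], wgbs: Dict[Site, Call]) -> Tuple[int, int]:
    """Return (matching genotypes, sites compared) over shared, usable sites."""
    matched = compared = 0
    for site, (ref, gt_known) in known.items():
        if site not in wgbs:
            continue
        gt_wgbs = wgbs[site][1]
        if bisulfite_ambiguous(ref, gt_known, gt_wgbs):
            continue
        compared += 1
        matched += gt_known == gt_wgbs
    return matched, compared
```

Usage would be something like `concordance(read_calls(open("known.vcf")), read_calls(open("biscuit.vcf")))`, where both file names are placeholders. A low matched-to-compared ratio across many sites would suggest a swap.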
bcftools gtcheck will do this, if the VCFs are valid VCF v4.1. I'm going to take a whack at that.