zhou-lab/biscuit

SNP calling and known genotypes (question + feature suggestion)

Opened this issue · 2 comments

I have genotype array, WGS, and WGBS data for my samples, and I am using this information to detect sample swaps. I find that Biscuit genotype calls are highly concordant with WGS genotype calls, except where the reference is 'C' and the true genotype is 'TT'. I understand that accurate genotyping is not possible in this case, but I am curious about Biscuit's behavior. For example:

From the pileup, there is no evidence of a C allele:
chr1 852875 C 60 TTTTTTTTTTTTtttttttTTTTTTTtttTTTTttTTTttTTttTTTTtttTTTTTtttt

However, in the VCF, the allele support for this position shows 33 Cs and 26 Ts:
chr1 852875 . C T,A,G 34 PASS . DP:GT:GP:GQ:SP:CV:BT 60:0/1:84,4,115:99:C33,T26,A0:.:
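
To make the mismatch explicit, here is a minimal Python sketch that tabulates the two lines above. It assumes the SP field lists per-allele support as a base letter followed by a count (inferred from this one example, not from Biscuit's documentation); the helper names `parse_sample` and `parse_sp` are made up for illustration.

```python
from collections import Counter

# Values copied from the pileup and VCF lines above.
pileup_bases = "TTTTTTTTTTTTtttttttTTTTTTTtttTTTTttTTTttTTttTTTTtttTTTTTtttt"
format_keys = "DP:GT:GP:GQ:SP:CV:BT"
sample_col = "60:0/1:84,4,115:99:C33,T26,A0:.:"


def parse_sample(format_keys, sample_col):
    """Pair each FORMAT key with its value from the sample column."""
    return dict(zip(format_keys.split(":"), sample_col.split(":")))


def parse_sp(sp):
    """Parse 'C33,T26,A0' into {'C': 33, 'T': 26, 'A': 0}.

    Assumption: every entry is one base letter followed by an integer count.
    """
    return {entry[0]: int(entry[1:]) for entry in sp.split(",") if entry and entry != "."}


# Count observed bases in the pileup, ignoring strand (upper/lower case).
pileup_counts = Counter(pileup_bases.upper())
fields = parse_sample(format_keys, sample_col)
sp_counts = parse_sp(fields["SP"])

print("pileup C reads:", pileup_counts.get("C", 0))
print("VCF SP support:", sp_counts)
print("VCF genotype:  ", fields["GT"])
```

The pileup contributes no C observations, while the SP field reports C support, which is the discrepancy the question below is about.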

Question: In this case, does Biscuit just generate the 'C' count from an expected distribution?

My suggestion: a nice feature would be detecting sample swaps when genotype information is known. Basically, just a script that compares a VCF of known genotypes to the Biscuit-generated VCF, ignores sites where it is difficult or impossible to genotype correctly from WGBS, and outputs a likelihood score that the two VCFs were generated from the same individual.
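
A rough sketch of what I have in mind, in plain Python. It is not a Biscuit feature and it reports a simple concordance count rather than a real likelihood score. The rule for "hard to genotype from WGBS" is my own assumption: skip any site where the reference and called alleles together are exactly {C, T} or {G, A}. The function names and the demo records are hypothetical, and only the first sample column is read.

```python
import re

# Assumption: sites whose alleles are exactly C/T or G/A are confounded by
# bisulfite conversion and are skipped.
AMBIGUOUS_PAIRS = [{"C", "T"}, {"G", "A"}]


def read_genotypes(lines):
    """Map (chrom, pos) -> (ref, sorted tuple of called allele bases)."""
    calls = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = cols[0], int(cols[1]), cols[3], cols[4]
        keys, vals = cols[8].split(":"), cols[9].split(":")
        if "GT" not in keys:
            continue
        indices = re.split(r"[/|]", vals[keys.index("GT")])
        if "." in indices:
            continue  # missing call
        alleles = [ref] + alt.split(",")
        calls[(chrom, pos)] = (ref, tuple(sorted(alleles[int(i)] for i in indices)))
    return calls


def concordance(known, test):
    """Return (matching sites, compared sites) over shared, unambiguous sites."""
    matched = compared = 0
    for site in known.keys() & test.keys():
        ref, known_gt = known[site]
        _, test_gt = test[site]
        if {ref, *known_gt, *test_gt} in AMBIGUOUS_PAIRS:
            continue  # skip bisulfite-confounded site
        compared += 1
        matched += known_gt == test_gt
    return matched, compared


# Made-up records for illustration only.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample"
known_vcf = [
    header,
    "chr1\t100\t.\tA\tC\t50\tPASS\t.\tGT\t0/1",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t1/1",
    "chr1\t300\t.\tG\tT\t50\tPASS\t.\tGT\t1/1",
]
wgbs_vcf = [
    header,
    "chr1\t100\t.\tA\tC\t40\tPASS\t.\tDP:GT\t30:0/1",
    "chr1\t200\t.\tC\tT\t34\tPASS\t.\tDP:GT\t60:0/1",
    "chr1\t300\t.\tG\tT\t40\tPASS\t.\tDP:GT\t25:0/1",
]

matched, compared = concordance(read_genotypes(known_vcf), read_genotypes(wgbs_vcf))
print(f"{matched} of {compared} comparable sites agree")
```

For real files, the same functions should work on an open text handle (for example `gzip.open(path, "rt")` for a bgzipped VCF), since they only iterate over lines.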

bcftools gtcheck will do this, provided the VCFs are valid v4.1. I'm going to take a whack at that.

6fc5d23 fixes the VCF issue (don't look at how trivial the fix was, you will feel bad, I did). bcftools csq now works on the generated VCF files; bcftools gtcheck should too.