Phasing chromosome X is wrong when males are encoded as haploids in input VCF/BCF
Opened this issue · 0 comments
Hi,
Phasing results for chromosome X non-PAR regions are wrong when providing input VCF/BCF files with haploid male genotypes represented as 0 and 1 instead of 0/0 and 1/1.
The issue is in the lines 89-91 in the genotype_reader::readGenotypes()
:
for(int32_t i = 0 ; i < 2 * n_main_samples ; i += 2) {
bool a0 = (bcf_gt_allele(main_buffer[i+0])==1);
bool a1 = (bcf_gt_allele(main_buffer[i+1])==1);
bool mi = (main_buffer[i+0] == bcf_gt_missing || main_buffer[i+1] == bcf_gt_missing);
When the genotype is haploid, the second value (i.e. a1) will hold bcf_int32_vector_end
, denoting the end of genotype. Thus, current code will transform GT = '1' to GT = '1/0', which is heterozygous. All heterozygous calls will be replaced by missing values, which will completely invalidate phasing results.
I suggest the following change to support both haploid male gemotype coding (i.e. GT=0 and 1; and GT = 0/0 and 1/1) in input VCF/BCF:
for(int32_t i = 0 ; i < 2 * n_main_samples ; i += 2) {
bool a0 = (bcf_gt_allele(main_buffer[i+0])==1);
bool a1 = (main_buffer[i + 1] == bcf_int32_vector_end) ? a0 : (bcf_gt_allele(main_buffer[i+1])==1);
bool mi = (main_buffer[i + 1] == bcf_int32_vector_end) ? (main_buffer[i+0] == bcf_gt_missing) : (main_buffer[i+0] == bcf_gt_missing || main_buffer[i+1] == bcf_gt_missing);
P.S. I caught this when I got an error similar to Issue #33 . Then, I noticed that phase_common phased many of haploid samples as hets.