natsuhiko/rasqual

pasteFiles puts the AS counts at end of FORMAT field, when not all sample info is present

npklein opened this issue · 5 comments

We have VCF files created by the GATK genotyping pipeline. These include several FORMAT fields besides GT, such as AD, DP, GQ, and PL (GT:AD:DP:GQ:PGT:PID:PL). When we use pasteFiles to add AS counts, AS is appended to the end (GT:AD:DP:GQ:PGT:PID:PL:AS). Missing fields can be set to ., but the VCF format also allows trailing missing fields to be dropped entirely.

E.g. a sample field has DP, GQ, PGT, PID, and PL missing, so the value is ./.:0,0:0. The AS count then gets appended directly, giving ./.:0,0:0:0,0, which no longer matches the FORMAT field.
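One possible workaround, sketched here as a preprocessing step outside pasteFiles, is to pad each truncated sample field with . up to the number of FORMAT keys before appending AS. The function names `pad_sample_field` and `append_as` are hypothetical and are not part of RASQUAL; this only illustrates the mismatch described above and is not how pasteFiles works internally.

```python
def pad_sample_field(format_field, sample_field):
    """Pad a sample column with '.' so it has one value per FORMAT key.

    Hypothetical helper, not part of RASQUAL or pasteFiles.
    """
    n_keys = len(format_field.split(":"))
    values = sample_field.split(":")
    if len(values) > n_keys:
        raise ValueError("sample field has more values than FORMAT keys")
    # Restore the trailing fields that the VCF writer dropped.
    values += ["."] * (n_keys - len(values))
    return ":".join(values)


def append_as(format_field, sample_field, as_count):
    """Append an AS entry after padding, so FORMAT and sample stay aligned."""
    padded = pad_sample_field(format_field, sample_field)
    return format_field + ":AS", padded + ":" + as_count


fmt, sample = append_as("GT:AD:DP:GQ:PGT:PID:PL", "./.:0,0:0", "0,0")
print(fmt)     # GT:AD:DP:GQ:PGT:PID:PL:AS
print(sample)  # ./.:0,0:0:.:.:.:.:0,0
```

With the padding in place, AS lands in the eighth position in both the FORMAT column and the sample column.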

dg13 commented

Hi Niek,
Can you upload or paste a small example VCF file where pasteFiles fails, so we can replicate the issue on our side?
Dan

Hi Dan,

Thanks for the fast reply. Example vcf: http://pastebin.com/jkmQJm1Y

Dear Niek,

I understand your point. However, it is not realistic for pasteFiles to parse and check every possible FORMAT combination for each SNP and each individual, because of the computational cost. In addition, RASQUAL does not accept unphased or missing genotypes in the VCF. You have to impute and phase those genotypes first, and I expect this FORMAT problem will be resolved during that process.
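To see which samples would still need imputation or phasing, a quick check along these lines may help. It is a rough sketch that assumes diploid genotypes and a tab-separated VCF data line with FORMAT in the ninth column; the helper names `genotype_is_usable` and `unusable_samples` are made up for illustration and are not part of RASQUAL.

```python
from typing import List


def genotype_is_usable(gt: str) -> bool:
    """True if the GT value is phased ('|') and has no missing allele.

    Assumes diploid genotypes; hypothetical helper, not part of RASQUAL.
    """
    if "/" in gt:
        return False  # unphased separator
    alleles = gt.split("|")
    return len(alleles) == 2 and all(a.isdigit() for a in alleles)


def unusable_samples(vcf_line: str) -> List[int]:
    """Return 0-based indices of samples with unphased or missing GT."""
    cols = vcf_line.rstrip("\n").split("\t")
    gt_idx = cols[8].split(":").index("GT")
    bad = []
    for i, sample in enumerate(cols[9:]):
        values = sample.split(":")
        # Treat a sample field too short to contain GT as missing.
        gt = values[gt_idx] if gt_idx < len(values) else "."
        if not genotype_is_usable(gt):
            bad.append(i)
    return bad


line = "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:AD:DP\t0|1:3,4:7\t./.:0,0:0\t0/1:2,2:4"
print(unusable_samples(line))  # [1, 2]
```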

Best regards,
Natsuhiko

dg13 commented

Hi Niek,
As Natsuhiko has said, the best solution is to impute and phase your genotypes. You will need to do this anyway for RASQUAL to work, and I think it will also solve the formatting issue.
Best
Dan

Thanks for your quick help, I will do that.