cbg-ethz/SCIPhI

Output VCF has malformed INFO tag DP

winni2k opened this issue · 2 comments

The output of sciphi v0.1.7 produces something looking like this:

##fileformat=VCFv4.1
##source=SCIPhISCIPhI v0.1.7
##FILTER=<ID=LowQual,Description="Low quality">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for alt alleles">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       1271619 *       C       T       *       PASS    DP:30 ...

This raises errors when processing with bcftools. I believe the correct way to specify INFO tags should be with an equal sign instead of a colon. So for example, the last line of the example above should instead be:

1       1271619 *       C       T       *       PASS    DP=30 ...

This seems to fix the problem:

sed 's/DP:/DP=/' < sciphi.vcf > fixed_sciphi.vcf

Hi @winni2k,

You are right indeed. The "INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: =[,data]. Arbitrary keys are permitted, although the following sub-fields are reserved (albeit optional):" (https://www.internationalgenome.org/wiki/Analysis/vcf4.0/)

It is fixed in the current master branch. A bioconda update will follow soon.

Sorry for the inconvenience!

While we are on the subject of VCF tags: The PL tag is for genotype likelihoods as of VCF spec v4.1 (https://github.com/samtools/hts-specs/blob/master/VCFv4.1.pdf), which is log10 Pr(Data|Model). However, the values in the output VCF are actually phred scaled posterior genotype probabilities (log10 Pr(Model|Data)). I think the correct standard tag to use would be GP (see the VCFv4.1 spec for a definition of the GP tag). However, the GP tag is not phred scaled, so if you want phred scaled probabilities, then you will need to define your own tag.

I also recommend using the latest VCF spec as a reference, as it has more information on these details (https://github.com/samtools/hts-specs/blob/master/VCFv4.3.pdf).

Another useful reference is the documentation provided by BEAGLE, another tool that essentially converts pileups into genotype probabilities: https://faculty.washington.edu/browning/beagle/intro-to-vcf.html

Also, on reading the BEAGLE documentation, it looks like genotype probabilities need to sum to 1 because they are probabilities, and that might be hard to achieve with a rounded phred transformation.