TimD1/vcfdist

How does it deal with sex chromosomes?

jdidion opened this issue · 7 comments

The procedure for determining partial credit in hap.py's xcmp takes into account the sex of the query sample (and thus whether the X chromosome is haploid). I don't see a parameter for the sample sex in vcfdist, so how does it take this into account, or do you have another way of dealing with it?

TimD1 commented

Haploid comparison is currently not implemented. Most benchmarking studies I've seen have omitted the sex chromosomes for simplicity. It would be relatively simple but time-consuming to add. Is this important for your use-case? My current approach is to deliver a minimal working tool, and then extend it in the directions that people believe would be most useful.

Sure, that makes sense. I would say it's moderately important. I would definitely put stratification ahead of it in importance.

I am very interested in haploid comparison for bacterial variant calls. Is this on the roadmap? I would love to use this tool, as the partial credit functionality is super useful and is what I've always intuitively felt was fair.

TimD1 commented

Hi Michael,

Thanks for suggesting this. vcfdist should theoretically be able to evaluate haploid genomes in the current release (v1.3.0). However, this hasn't been my main focus and it's relatively untested atm. I may need to change a few things to make the workflow easier (the only thing I can think of off the top of my head is that GT is a required field, so each variant will need GT = 1 in the FORMAT field).

I can try running a few bacterial datasets through vcfdist tomorrow and let you know if everything works as expected, or you can just start using it and raise issues if you encounter any. But my guess is that others would like to use vcfdist for bacterial genomes too, so I'd like to fully support this.

- Tim

Thanks for the quick response, Tim. Depending on the variant caller used, you sometimes get your GTs as a single number, i.e. 1, or as an unphased 1/1. I assume the unphased form would not work with vcfdist? I guess I could just convert the 1/1 to 1.

TimD1 commented

Yeah, technically the correct input would be 1. If you provide input with all 1/1 variants, vcfdist will count each variant twice, since it handles each haplotype separately.
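For anyone whose caller emits 1/1, here is a minimal sketch of the conversion discussed above. It is not part of vcfdist; `haploidize_gt` is a hypothetical helper name, and it assumes an uncompressed, tab-delimited VCF read line by line. It only collapses genotypes whose two alleles are identical (1/1 or 1|1) and leaves anything else, such as 0/1, untouched so that unexpected heterozygous calls stay visible.

```python
import re


def haploidize_gt(line: str) -> str:
    """Collapse a homozygous diploid GT (e.g. 1/1 or 1|1) into a haploid GT (1).

    Hypothetical helper, not part of vcfdist. Header lines, records without
    FORMAT/sample columns, and non-homozygous genotypes are returned unchanged.
    """
    if line.startswith("#"):
        return line  # meta-information and column header lines
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return line  # no FORMAT/sample columns present
    format_keys = fields[8].split(":")
    if "GT" not in format_keys:
        return line
    gt_index = format_keys.index("GT")
    for i in range(9, len(fields)):
        sample_values = fields[i].split(":")
        if gt_index >= len(sample_values):
            continue  # trailing fields were dropped for this sample
        alleles = re.split(r"[/|]", sample_values[gt_index])
        if len(alleles) == 2 and alleles[0] == alleles[1]:
            sample_values[gt_index] = alleles[0]
        fields[i] = ":".join(sample_values)
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + newline


if __name__ == "__main__":
    record = "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t1/1:30\n"
    print(haploidize_gt(record), end="")
```

To convert a whole file, apply the function to each line of the input VCF and write the results to a new file, then pass that file to vcfdist.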

TimD1 commented

@mbhall88 I just tested and added some more support for monoploid/haploid variant comparison in release v1.3.1. The output summary VCFs should now be in the correct format. An input of GT=1 is expected, but no GT field at all will now also succeed (with a warning).