How does it deal with sex chromosomes?
jdidion opened this issue · 7 comments
The procedure for determining partial credit in hap.py's xcmp takes into account the sex of the query sample (and thus whether the X chromosome is haploid). I don't see a parameter for the sample sex in vcfdist, so how does it take this into account, or do you have another way of dealing with it?
Haploid comparison is currently not implemented. Most benchmarking studies I've seen have omitted the sex chromosomes for simplicity. It would be relatively simple but time-consuming to add. Is this important for your use-case? My current approach is to deliver a minimal working tool, and then extend it in the directions that people believe would be most useful.
Sure, that makes sense. I would say it's moderately important. I would definitely put stratification ahead of it in importance.
I am very interested in haploid comparison for bacterial variant calls. Is this on the roadmap? I would love to use this tool, as the partial credit functionality is super useful and what I've always intuitively felt was fair.
Hi Michael,
Thanks for suggesting this. vcfdist should theoretically be able to evaluate haploid genomes in the current release (v1.3.0). However, this hasn't been my main focus and it's relatively untested atm. I may need to change a few things to make the workflow easier (the only thing I can think of off the top of my head is that GT is a required field, so each variant will need GT = 1 in the FORMAT field).
I can try running a few bacterial datasets through vcfdist tomorrow and let you know if everything works as expected, or you can just start using it and raise issues if you encounter any. But my guess is that others would like to use vcfdist for bacterial genomes too, so I'd like to fully support this.
- Tim
Thanks for the quick response, Tim. Depending on the variant caller used, you sometimes get your GTs as a single number, i.e. 1, or as an unphased 1/1. I assume the unphased form would not work with vcfdist? I guess I could just convert the 1/1 to 1.
Yeah, technically the correct input would be 1. If you provide input with all 1/1 variants, vcfdist will count each variant twice, since it handles each haplotype separately.
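For anyone wanting to do the 1/1 to 1 conversion discussed above before running vcfdist, here is a minimal sketch in Python. It is not part of vcfdist; the function name `haploidize_gt` is made up for illustration, and it assumes an uncompressed, tab-delimited VCF with GT listed in the FORMAT column. It only collapses homozygous diploid genotypes (1/1, 1|1, 0/0, ...) and leaves everything else, including heterozygous calls like 0/1, untouched, since how to treat those in a haploid genome is a judgment call.

```python
import re

# Matches a homozygous diploid genotype such as 1/1, 1|1, 0/0 or 2|2.
_HOM_DIPLOID = re.compile(r"^(\d+)[/|]\1$")


def haploidize_gt(line: str) -> str:
    """Collapse homozygous diploid GT values (e.g. 1/1) to haploid (e.g. 1).

    Hypothetical helper, not a vcfdist function. Header lines, records
    without sample columns, and non-homozygous genotypes are returned as-is.
    """
    if line.startswith("#"):
        return line  # meta-information and column header lines
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return line  # no FORMAT/sample columns present
    keys = fields[8].split(":")
    if "GT" not in keys:
        return line
    gt_idx = keys.index("GT")
    for i in range(9, len(fields)):
        values = fields[i].split(":")
        if gt_idx < len(values):
            match = _HOM_DIPLOID.match(values[gt_idx])
            if match:
                values[gt_idx] = match.group(1)  # keep a single allele index
                fields[i] = ":".join(values)
    return "\t".join(fields) + newline
```

To convert a whole file, apply the function to each line of the input VCF and write the results to a new file, then check that no diploid genotypes remain before passing it to vcfdist.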