Support stratified variant comparison

Question

jdidion opened this issue 2 years ago · 4 comments

GIAB provides stratification bed files and hap.py gives the option to produce stratified results based on these files. It's very useful to be able to look at benchmarking performance by type of region. It would be great if vcfdist could support this option as well.

Answer 1 · 2023-03-13T21:57:53.000Z

I agree.

vcfdist currently outputs detailed per-variant, per-cluster, and per-phaseblock results in a custom CSV format.

I believe the best approach to supporting stratification would be to instead report results in hap.py's Intermediate VCF File format (Supplementary Note 3 of "Best practices for benchmarking..."). That way, vcfdist could be used as a comparison engine within hap.py, just like vcfeval or xcmp. This is what I plan to work on next; it shouldn't take too long.

Answer 2 · 2023-03-13T21:59:23.000Z

Sounds like a great idea.

Answer 3 · 2023-03-13T22:02:28.000Z

The only downside is that hap.py most likely can't deal with partial positive variants. I may need to consider them false positives for interoperability. The majority of vcfdist's improvement came from standardizing the variant representation and enforcing phasing, so this shouldn't impact results too much, although it's certainly not ideal.

Answer 4 · 2023-03-14T20:01:35.000Z

In the meantime, one option would be to run bedtools intersect on your high-confidence BED and each stratification BED, then run vcfdist once per intersected file, since vcfdist currently accepts only a single BED file for region selection.
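
For anyone who wants to see what that intersection produces before scripting it, here is a minimal pure-Python sketch of the same idea. It is not part of vcfdist or bedtools; the function names and the example coordinates are made up for illustration. It assumes each BED is already sorted and has no overlapping intervals within a chromosome (for example after `bedtools merge`), and it only looks at the first three columns. In practice, something like `bedtools intersect -a highconf.bed -b stratification.bed > highconf_in_strat.bed` (file names hypothetical) does the same job.

```python
from collections import defaultdict


def read_bed(lines):
    """Parse BED lines into {chrom: sorted list of (start, end)} (half-open)."""
    regions = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blanks and the usual BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    for chrom in regions:
        regions[chrom].sort()
    return dict(regions)


def intersect_intervals(a, b):
    """Two-pointer sweep over two sorted, non-overlapping interval lists.

    Returns only the overlapping portions.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def intersect_bed(a_regions, b_regions):
    """Intersect two parsed BEDs chromosome by chromosome."""
    result = {}
    for chrom, a_ivals in a_regions.items():
        overlaps = intersect_intervals(a_ivals, b_regions.get(chrom, []))
        if overlaps:
            result[chrom] = overlaps
    return result


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    highconf = read_bed(["chr1\t100\t200", "chr1\t300\t400"])
    strat = read_bed(["chr1\t150\t350", "chr2\t0\t10"])
    for chrom, ivals in intersect_bed(highconf, strat).items():
        for start, end in ivals:
            print(f"{chrom}\t{start}\t{end}")
```

You would then write one such intersected BED per stratification and give each to vcfdist in a separate run, comparing the per-run summaries by hand.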