How do different methods affect precision?
zhengluo-lz opened this issue · 4 comments
I am considering two workflows. In the first, I split the VCF file by chromosome, construct a VG file for each chromosome separately, and then perform SV (structural variant) identification. In the second, I construct a VG file for the whole genome first, partition it with vg chunk, and then perform SV identification. Will the precision of these two methods differ?
You can't really have a pipeline that's split by chromosome the entire way through. At some point you will need to map reads to the graph, and when you do that, you need to have the full graph available to the mapping algorithm.
I have another question. I observed that precision and recall both increase as MAF (minor allele frequency) increases. Do you set a MAF threshold and evaluate precision and recall only for sites above it?
Yes, higher MAF increases the chances that the variant is actually observed in the sample, so in general, higher MAF variants are more likely to be useful. I've seen different thresholds used in practice, but I don't know of a place where anyone has quantified the precision/recall effects of different thresholds.
There is a small literature on variant selection for pangenomes that you may be interested in looking at:
https://academic.oup.com/bioinformatics/article/37/Supplement_1/i460/6319683
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1595-x
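For anyone who wants to measure the threshold effect on their own calls, here is a minimal Python sketch of precision/recall evaluated at several MAF cutoffs. It is not part of vg or any existing evaluation tool. The function names, the variant IDs, and the toy MAF values are all hypothetical, and it assumes calls have already been matched to truth-set variants by a shared ID (SV matching in practice is fuzzier than exact ID equality).

```python
def precision_recall(called, truth):
    """Return (precision, recall) for two sets of variant IDs.

    A value is None when its denominator is zero.
    """
    tp = len(called & truth)
    precision = tp / len(called) if called else None
    recall = tp / len(truth) if truth else None
    return precision, recall


def stratify_by_maf(called, truth, maf, thresholds):
    """Evaluate precision/recall restricted to sites with MAF >= each threshold.

    `maf` maps variant ID -> minor allele frequency. Variants missing from
    `maf` (e.g. calls not present in the panel) are excluded at every
    threshold, which is an assumption of this sketch.
    """
    results = {}
    for t in thresholds:
        keep = {vid for vid, freq in maf.items() if freq >= t}
        results[t] = precision_recall(called & keep, truth & keep)
    return results


# Toy data, invented purely for illustration.
maf = {"sv1": 0.01, "sv2": 0.05, "sv3": 0.20, "sv4": 0.40}
truth = {"sv1", "sv3", "sv4"}    # variants really present in the sample
called = {"sv2", "sv3", "sv4"}   # variants the pipeline reported

results = stratify_by_maf(called, truth, maf, thresholds=[0.0, 0.1])
for threshold, (p, r) in results.items():
    print(f"MAF >= {threshold}: precision={p}, recall={r}")
```

With this toy input, the rare false negative (sv1) and the rare false positive (sv2) both drop out at the 0.1 cutoff, so both metrics rise. That matches the trend described above, but it says nothing about which threshold is appropriate for a real dataset.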
Thanks!