parameter settings for more support reads

Question

yekaizhou opened this issue 4 years ago · 4 comments

Hi,

I am using SVIM for a somewhat special purpose: I would like every SV call to have as many supporting reads as possible, without affecting the overall SV calling accuracy much. I will use these supporting reads for some downstream analysis.

In my tests, increasing CLUSTER_MAX_DISTANCE increases the number of supporting reads. Do other parameters have the same effect? If so, how should I combine and adjust their values to balance the number of supporting reads against overall accuracy?

Thanks!

Answer 1 · 2021-05-25T09:59:22.000Z

Hi,

thanks for your question. Could you please clarify your question a bit more so that I can answer it?

What do you mean precisely by "The support reads of every SV calls should be as much as possible"? Do you want to maximize the number of supporting reads for every call, or do you want to maximize something else? And why is that your goal?

I'm not sure what would be the point of maximizing the number of supporting reads. Even though we do not have complete knowledge about it, the set of supporting reads for a given variant is fixed. If you increase the CLUSTER_MAX_DISTANCE then you tell the clustering algorithm to construct larger clusters which (as you already observed) also increases the number of supporting reads for each cluster. You can increase that number but at some point you will merge more and more reads that do not come from the same variant. I don't see why that would be desirable but am curious to learn more about the background of your question.
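
The trade-off described above can be illustrated with a toy sketch. It is not SVIM's clustering algorithm: it groups hypothetical one-dimensional signature positions and starts a new cluster whenever the gap to the previous position exceeds a threshold. The `max_gap` parameter is only a stand-in for the role CLUSTER_MAX_DISTANCE plays, and the positions are made up.

```python
def cluster_positions(positions, max_gap):
    """Group 1-D positions; start a new cluster when the gap exceeds max_gap.

    Toy illustration only, not SVIM's actual clustering.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)   # close enough: same cluster
        else:
            clusters.append([pos])     # too far away: new cluster
    return clusters


# Hypothetical signature positions from reads covering two distinct variants.
signatures = [1000, 1004, 1010, 1300, 1305, 1312]

tight = cluster_positions(signatures, max_gap=50)
loose = cluster_positions(signatures, max_gap=500)

print([len(c) for c in tight])  # two clusters with three reads each
print([len(c) for c in loose])  # one cluster with six reads: variants merged
```

With the larger threshold every cluster has more "supporting reads", but only because reads from two separate variants were merged into one call.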

Cheers
David

Answer 2 · 2021-05-25T10:50:07.000Z

Hi David,

Thank you very much for your reply! Sorry for my vague description.

My ultimate goal is to phase the SV callset. I have already tagged the relevant reads with paternal and maternal information. A larger set of supporting reads with phase tags would give me more room to design the phasing logic. That is my initial plan.

I understand that a called variant has a fixed set of supporting reads. The clustering algorithm will merge the read sets and their corresponding variants. A proper value for CLUSTER_MAX_DISTANCE will result in good calling accuracy, and an unsuitably high value will have some side effects on the overall performance. Do I understand SVIM and its clustering algorithm correctly?

If so, given my objective, is there a parameter setting (perhaps adjusting more than just CLUSTER_MAX_DISTANCE) that would give me generally larger sets of supporting reads without greatly degrading the SV calling performance? For example, if I set a high cluster max distance, which other parameter(s) should I adjust to compensate for its effects?

Thanks!
Yekai

Answer 3 · 2021-05-27T12:16:46.000Z

Hi Yekai,

thanks for the additional explanation, which made it clear to me what you wish to achieve.

> I understand that a called variant has a fixed set of supporting reads. The clustering algorithm will merge the read sets and their corresponding variants. A proper value for CLUSTER_MAX_DISTANCE will result in good calling accuracy, and an unsuitably high value will have some side effects on the overall performance. Do I understand SVIM and its clustering algorithm correctly?

Yes, you understood correctly. An unsuitably high value will lead to the mixing of read signatures from separate variants which will harm the accuracy.

> If so, given my objective, is there a parameter setting (perhaps adjusting more than just CLUSTER_MAX_DISTANCE) that would give me generally larger sets of supporting reads without greatly degrading the SV calling performance? For example, if I set a high cluster max distance, which other parameter(s) should I adjust to compensate for its effects?

Here is what you can try:

- Choose a slightly higher --cluster_max_distance to get larger clusters. It's hard to predict the effect of this parameter on the accuracy for your concrete dataset, but with PacBio CLR data values up to 0.5 or 0.6 should be fine.
- Choose a slightly lower --min_mapq (maybe 5 or 10) to incorporate reads with lower mapping quality (this can also lead to more false positives).
- Choose a higher --segment_gap_tolerance and --segment_overlap_tolerance (e.g. 100) to incorporate reads with unaligned bases or overlapping alignment segments.
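
As a rough sketch, the suggestions above could be combined into one invocation. The example only assembles and prints the command line, it does not run SVIM. It assumes the `svim alignment` subcommand with the working directory, BAM file and reference genome as positional arguments (please check `svim alignment --help` for your version). The helper name `build_svim_command`, the file names, and the chosen values are placeholders for illustration.

```python
import shlex


def build_svim_command(working_dir, bam, genome,
                       cluster_max_distance=0.5,
                       min_mapq=10,
                       segment_gap_tolerance=100,
                       segment_overlap_tolerance=100):
    """Assemble an SVIM command line as an argument list (hypothetical helper)."""
    return [
        "svim", "alignment",
        "--cluster_max_distance", str(cluster_max_distance),
        "--min_mapq", str(min_mapq),
        "--segment_gap_tolerance", str(segment_gap_tolerance),
        "--segment_overlap_tolerance", str(segment_overlap_tolerance),
        # Assumed positional order: working directory, BAM file, reference.
        working_dir, bam, genome,
    ]


# Placeholder paths; replace with your own files.
cmd = build_svim_command("svim_out", "sample.bam", "reference.fa")
print(shlex.join(cmd))
```

Changing one value at a time and comparing the resulting callsets should make it easier to see which setting costs accuracy on a given dataset.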

You can also use --read_names to include the names of supporting reads for each variant in the VCF. This makes it possible to link the phase information that you have for each read to the variants.
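
Linking the read names back to phase tags could look roughly like the sketch below. It assumes that the read names appear as a comma-separated list in an INFO field called `READS` (please verify this against the header of your VCF). The example record, the read names, and the `read_phase` dictionary are made up for illustration.

```python
from collections import Counter


def supporting_reads(vcf_line):
    """Return the read names from the READS entry of a VCF record's INFO column."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]  # INFO is the eighth column of a VCF record
    for entry in info.split(";"):
        if entry.startswith("READS="):
            return entry[len("READS="):].split(",")
    return []  # record has no READS entry


def phase_counts(vcf_line, read_phase):
    """Count supporting reads per phase tag; unknown reads count as 'untagged'."""
    return Counter(read_phase.get(name, "untagged")
                   for name in supporting_reads(vcf_line))


# Hypothetical phase tags per read name (e.g. from your own read tagging step).
read_phase = {"read1": "paternal", "read2": "paternal", "read3": "maternal"}

# Made-up VCF record for illustration.
example = ("chr1\t1000\tsvim.DEL.1\tN\t<DEL>\t4\tPASS\t"
           "SVTYPE=DEL;END=1300;SVLEN=-300;SUPPORT=4;"
           "READS=read1,read2,read3,read4\tGT\t0/1")

counts = phase_counts(example, read_phase)
print(dict(counts))
```

Counting tags per variant in this way also shows quickly when a call draws support from both haplotypes, which may hint at a cluster that was merged too aggressively or at a homozygous variant.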

I hope this is somewhat helpful.

Best
David

Answer 4 · 2021-05-29T07:31:47.000Z

Thanks a lot David!