how many calls should we expect?
woodoo46 opened this issue · 5 comments
Hi there,
In general, how many calls (or vcf records) should we expect from the output, if I use default parameters? For example,
${svim} alignment --read_names --zmws ${workdir} ${inputfile} ${genome}
Thanks.
George
Hi George,
It's difficult to give you an expected number of records per sample - even when using default parameters. The number not only changes significantly depending on the read quality and coverage but also on the genome used as a reference. Usually, with increasing read coverage and quality the number of records increases as well. The same holds for more recent references. For example, you would have an increased number of records using the telomere-to-telomere (CHM13) reference compared to GRCh38.
Generally, the number of records is much higher than compared to other SV callers since SVIM doesn't filter the output by read support. To give you an example: I recently applied SVIM to the PacBio data of a medium-sized patient cohort. The coverage range was between 12 and 40X (GRCh38). The number of raw records was between 1.5 and 5 million. While this doesn't answer your question directly, I hope this gives you an impression of how the number of records behaves with increasing coverage for a specific reference genome.
Best,
Jakob
Hi Jakob,
Thanks for your reply. Your answered my question. I had more than 5M calls which is way too many compared with other SV callers. If this number is expected due to no internal filtering, then we would proceed to design filters on the calls.
Thank you very much!
George
By the way, do you have any recommended filtering setups? Thanks.
George
Hi George,
sorry that I haven't been in touch until now.
Since the quality score of SVIM highly depends on the number of supporting reads, I would recommend choosing a filtering setup with a coverage-dependent rather than a static threshold. The strictness of this threshold also depends on the study design. For example, if you are looking for SVs identified by multiple callers or sequencing technologies it might be useful to increase the threshold while for the identification of disease-causing SVs the threshold could be decreased to include as many potential pathogenic SVs as possible. So, if you have multiple patients with different coverages you could pick 40% of the individual genome-wide coverage as a rather strict threshold or 30% if you want to be sure not to discard any potential true positive variants. For a 40X coverage this would correspond to 16 or 12 as thresholds for the quality score, respectively (e.g. bcftools view -i 'QUAL >= 16' variants.vcf'
). Hope this helps!
Best,
Jakob
Thanks! We will give it a try.