With bayestyper genotyping SV, the genotype rate decreases as the depth of sample sequencing increases.

Question

yangqimeng99 opened this issue 8 months ago · 0 comments

Dear BayesTyper developer,

I hope this message finds you well. I am reaching out to discuss an unexpected issue I encountered while using BayesTyper, a tool I greatly admire for its excellence in genetic genotyping.

I am testing with a human sample, HG002, sequenced with 2x150 bp short reads, to genotype a set of structural variants (SVs) derived from HiFi reads. This SV set comprises only insertions (INS) and deletions (DEL) with alleles >50 bp. However, I have observed that the genotyping rate is lower with 30x short reads than with 20x or 10x coverage, which contradicts the common expectation that higher sequencing depth yields a better genotyping rate.

To be thorough, I ran the tests from both BAM and FASTQ input, and the results consistently show the pattern described above. Here is a brief outline of the commands I used (shown for BAM input):

kmc -k55 -ci1 -fbam ${inputBam} ${outputPrefix} ./kmc_tmp
bayesTyperTools makeBloom -k ${outputPrefix} -p ${threads}
bayesTyper cluster -v ${inputVCF} -s ${sampleTsv} -g ${refCanon} -d ${refDecoy} -p ${threads}
bayesTyper genotype -v bayestyper_unit_1/variant_clusters.bin -c bayestyper_cluster_data -s ${sampleTsv} -g ${refCanon} -d ${refDecoy} -o bayestyper_unit_1/bayestyper -z -p ${threads}

With the commands above, I obtained genotyping rates of 0.45, 0.48, 0.50, and 0.48 at sequencing depths of 30x, 20x, 10x, and 5x, respectively. I am at a loss as to why the rate drops at the higher depths, and I would deeply appreciate any insights or suggestions you could provide. Is there a factor in BayesTyper that lowers the genotyping rate as read depth increases, specifically when genotyping SVs with short reads? Or is there any chance that my commands or approach inadvertently introduce a bias or error?
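To make the metric concrete, here is a minimal sketch of how a genotyping rate like the ones above could be computed from the output VCF. It assumes that "genotyping rate" means the fraction of variant records whose genotype in the first sample column is fully called (no `.` in the GT field). The function name `genotype_rate` is hypothetical, and this is not necessarily how the figures above were produced.

```shell
# Hypothetical helper: reads an uncompressed VCF on stdin and prints the
# fraction of variant records with a fully called genotype in the first
# sample column (column 10). Assumes a single-sample VCF.
genotype_rate() {
  awk -F'\t' '
    /^#/ { next }                      # skip meta and header lines
    {
      total++
      split($10, fields, ":")          # GT is the first FORMAT key when present
      gt = fields[1]
      # Count as called only if no allele is missing ("." or "./.").
      if (gt != "" && gt !~ /\./) called++
    }
    END {
      if (total == 0) { print "NA"; exit }
      printf "%.4f\n", called / total
    }
  '
}
```

Assuming the `-o` prefix and `-z` option above produce a gzipped VCF at `bayestyper_unit_1/bayestyper.vcf.gz`, it could be run as `gzip -dc bayestyper_unit_1/bayestyper.vcf.gz | genotype_rate`. Computing the rate the same way for every depth rules out differences that come from the counting method rather than from the genotyper.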

Thank you very much for your time and assistance. I look forward to any guidance you can offer.

Best regards