[MRD automatic analysis notebook] Duplicate Records Issue During norm_coverage Building Step
Hi Ultima Genomics Team,
I’ve encountered an issue in the norm_coverage building step of the VariantCalling pipeline.
When multiple signatures share the same `chrom` and `pos` (which happens when a higher number of synthetic signatures is set), the `norm_coverage` DataFrame includes duplicate records.
This duplication results in inflated coverage values for the signatures with intersecting VCF records.
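To illustrate the mechanism, here is a minimal sketch on toy data. The table names (`signatures`, `features`), the column names other than `chrom` and `pos`, and the merge step are assumptions made for illustration and may not match what the notebook actually does; the point is only that a locus listed twice in `norm_coverage` multiplies rows in any downstream join on `chrom` and `pos`.

```python
import pandas as pd

# Hypothetical toy data: two signatures share the locus chr1:100.
signatures = pd.DataFrame(
    {
        "signature": ["matched", "synthetic_1", "synthetic_2"],
        "chrom": ["chr1", "chr1", "chr1"],
        "pos": [100, 100, 200],
        "coverage": [30, 30, 60],
    }
)

# Coverage indexed by locus; chr1:100 appears twice in the index.
x = signatures.set_index(["chrom", "pos"])["coverage"]
norm_coverage = (x / x.median()).rename("norm_coverage")

# Hypothetical per-locus table with exactly one row per locus.
features = pd.DataFrame(
    {"chrom": ["chr1", "chr1"], "pos": [100, 200], "n_reads": [5, 7]}
)

# A left merge on (chrom, pos) repeats the chr1:100 row once per duplicate.
joined = features.merge(
    norm_coverage.reset_index(), on=["chrom", "pos"], how="left"
)

print("rows before merge:", len(features))
print("rows after merge: ", len(joined))
print("n_reads total before:", features["n_reads"].sum())
print("n_reads total after: ", joined["n_reads"].sum())
```

Any sum or count taken over `joined` then counts the shared locus more than once, which is consistent with the inflation described above.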
Steps to Reproduce
1. Run the MRD analysis pipeline with a high number of synthetic signatures (e.g. 50-100).
2. Inspect the coverage values in the Jupyter notebook output. The synthetic signatures show higher coverage than the matched signature.
3. Inspect the `norm_coverage` variable in the notebook and observe that duplicate records exist wherever multiple signatures share the same `chrom` and `pos`.
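A quick way to confirm step 3 is to look for repeated index entries. The sketch below builds a small stand-in for `norm_coverage` (the values are made up) and assumes, as in the notebook, that it is a Series indexed by `chrom` and `pos`.

```python
import pandas as pd

# Stand-in for the notebook's norm_coverage; values are illustrative only.
index = pd.MultiIndex.from_tuples(
    [("chr1", 100), ("chr1", 100), ("chr2", 50)], names=["chrom", "pos"]
)
norm_coverage = pd.Series([1.0, 1.0, 0.5], index=index, name="norm_coverage")

# All records whose (chrom, pos) occurs more than once.
duplicated_loci = norm_coverage[norm_coverage.index.duplicated(keep=False)]

# Number of surplus records, i.e. rows beyond the first per locus.
n_extra = int(norm_coverage.index.duplicated(keep="first").sum())

print(duplicated_loci)
print("surplus records:", n_extra)
```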
Expected Behavior
Each combination of chrom and pos should have a unique norm_coverage value, ensuring accurate coverage without inflation.
Actual Behavior
Duplicate records for the same chrom and pos lead to inflated norm_coverage values, which reduces the accuracy of the background error estimated from the synthetic signatures.
Proposed Solution
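To eliminate duplicate records, I suggest dropping duplicates based on chrom and pos before setting the index. Since all signatures are evaluated against the same cfDNA sample, the coverage at a given chrom and pos should be identical across signatures, so keeping a single record per locus loses no information.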
Below is the proposed code change:
Original Code:
norm_coverage = (x / x.median()).rename("norm_coverage")
Proposed Code:
norm_coverage = (x / x.median()).rename("norm_coverage")
norm_coverage_df = norm_coverage.reset_index()
# Drop duplicates based on 'chrom' and 'pos'
norm_coverage_df_unique = norm_coverage_df.drop_duplicates(subset=['chrom', 'pos'], keep='first')
# Set index back to ['chrom', 'pos']
norm_coverage = norm_coverage_df_unique.set_index(['chrom', 'pos'])['norm_coverage']
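One possible variation, offered as a sketch rather than a tested patch: the proposed code still computes `x.median()` over the duplicated records, so loci shared by many signatures could weigh more heavily in the median. Dropping duplicates from `x` first avoids that. The helper name `build_norm_coverage` is hypothetical, and the sketch assumes that `x` is a Series indexed by `chrom` and `pos` and that duplicated loci carry identical coverage values.

```python
import pandas as pd


def build_norm_coverage(x: pd.Series) -> pd.Series:
    """Normalize coverage by its median after keeping one record per locus.

    Hypothetical helper; assumes x is indexed by (chrom, pos).
    """
    # Keep the first record for each (chrom, pos) index entry.
    unique_x = x[~x.index.duplicated(keep="first")]
    # The median is now taken over unique loci only.
    return (unique_x / unique_x.median()).rename("norm_coverage")


# Toy data: chr1:100 is shared by three signatures.
index = pd.MultiIndex.from_tuples(
    [("chr1", 100), ("chr1", 100), ("chr1", 100), ("chr1", 200), ("chr1", 300)],
    names=["chrom", "pos"],
)
x = pd.Series([30, 30, 30, 60, 90], index=index, name="coverage")

norm_coverage = build_norm_coverage(x)
print(norm_coverage)
```

In this toy case the median over the duplicated records is 30, while the median over unique loci is 60, so the two approaches give different normalized values even after the duplicates are removed.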
Reference
You can view the specific lines in the notebook here.
Impact
Implementing this change will ensure that norm_coverage accurately reflects the coverage without being artificially inflated due to duplicate records.
Thank you for addressing this issue!