pllittle/UNMASC

UNMASC failing on a subset of samples

Closed this issue · 4 comments

Hello,

I am attempting to run UNMASC on a cohort of Agilent SureSelect XT HS2 gene panel samples, which were aligned to hg19 using bwa-mem and pre-processed using Agilent AGeNT software (https://www.agilent.com/cs/library/software/public/AGeNTBestPractices.pdf), and I am getting some errors. I am attempting to call variants for 177 samples. Of these, 74 samples complete the entire UNMASC workflow with no issues and produce all final output files.

Among the remaining 103 samples, I am seeing three different issues:

  1. The most common error occurs during the OXOG/FFPE/ARTI filtering:
Determine OXOG,FFPE,ARTI status ...
Number of unique variants = 621
.621
tumor VAF segmentation on variants...
	chr1: 100%
	chr2: 100%
	chr3: 100%
	chr4: 100%
	chr5: 100%
	chr6: 100%
	chr7: 100%
	chr8: 100%
	chr9: 100%
	chr10: 100%
	chr11: 100%
	chr12: 100%
	chr13: 100%
	chr14: 100%
	chr15: 100%
	chr16: 100%
	chr17: 100%
	chr19: 100%
	chr20: 100%
Error in `$<-.data.frame`(`*tmp*`, "index", value = 1:0) : 
  replacement has 2 rows, data has 0
Calls: <Anonymous> ... UNMASC_tSEG -> segment_tVAF -> $<- -> $<-.data.frame
Execution halted

The chromosome at which this error occurs varies. Most samples reach chr20, but a few only complete up to chr4 or chr7 before failing.

  2. Other samples don't make it that far and just end in a NULL shortly after starting, with no error message:
% ------------------------------- %
% Welcome to the UNMASC workflow! %
% ------------------------------- %
Sun Oct 23 01:14:14 2022: Import image ...
Sun Oct 23 01:14:14 2022: Finding oxoG artifacts ...
Sun Oct 23 01:14:14 2022: Merge strand info ...
1 2 3 
4 5 6 
7 8 9 
10 11 12 
13 14 15 
16 17 18 
19 20 21 
22 23 24 
25 26 27 
28 29 30 

Sun Oct 23 01:14:15 2022: Finding FFPE artifacts ...
Sun Oct 23 01:14:15 2022: Merge strand info ...
1 2 3 
4 5 6 
7 8 9 
10 11 12 
13 14 15 16 
17 18 19 
20 21 22 
23 24 25 
26 27 28 29 
30 
Sun Oct 23 01:14:18 2022: Finding ARTI artifacts ...
Sun Oct 23 01:14:18 2022: Merge strand info ...
1 2 3 
4 5 6 
7 8 9 
10 11 12 
13 14 15 
16 17 18 
19 20 21 
22 23 24 
25 26 27 
28 29 30 

nrow of uniq_vcs = 815
nANNO ...
............25 out of 25

Infer H2M status ...
tANNO ...
..........21 out of 21

Infer ALLELE_STAT and tANNO ...

nANNO ...
............25 out of 25

Infer H2M status ...
tANNO ...
..........21 out of 21

Infer ALLELE_STAT and tANNO ...

NULL

  3. Lastly, a few samples fail immediately upon starting, which I assume is due to low sequencing quality or poor-quality variant calls:
% ------------------------------- %
% Welcome to the UNMASC workflow! %
% ------------------------------- %
Sun Oct 23 01:00:39 2022: Calculate mutID and light filtering ...
Sun Oct 23 01:00:39 2022: LowQCSample b/c low variant count after base filtering ...
NULL

Any insights into what could be causing these issues would be greatly appreciated and please let me know if you require any additional information in order to determine what may be causing the issue.

Thanks,

Javier

Hi @javi-a-lopez,

Sorry to hear about the errors.

Would it be possible for you to share the image.rds file for a sample that fails for cases (1) and (2)? I should be able to replicate the error and find the bug.

For (3), I suspect the per-locus depth or Qscore is too low. Could you send a gzipped vcf as an example?
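
As a quick check before sharing files, the depth and quality values in a gzipped VCF can be summarized with base R. This is only a minimal sketch: the tiny VCF below is made up so the code runs on its own, it assumes the variant caller writes depth as `DP=` in the INFO column, and the cutoffs UNMASC applies during base filtering are not stated in this thread.

```r
# Made-up three-record VCF, written to a temporary gzipped file (illustration only).
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
  "chr1\t1000\t.\tA\tG\t12.5\t.\tDP=8",
  "chr1\t2000\t.\tC\tT\t60\t.\tAC=1;DP=150",
  "chr2\t3000\t.\tG\tA\t.\t.\tDP=20"
)
path <- tempfile(fileext = ".vcf.gz")
con <- gzfile(path, "w")
writeLines(vcf_lines, con)
close(con)

# Read the file back and keep only the variant records (non-header lines).
con <- gzfile(path, "r")
lines <- readLines(con)
close(con)
body <- lines[!startsWith(lines, "#")]
fields <- strsplit(body, "\t", fixed = TRUE)

# Column 6 of a VCF record is QUAL; a missing value "." becomes NA.
qual <- suppressWarnings(as.numeric(vapply(fields, `[`, character(1), 6)))

# Column 8 is INFO; pull out DP where present (assumption: the caller reports it there).
info <- vapply(fields, `[`, character(1), 8)
has_dp <- grepl("(^|;)DP=[0-9]+", info)
dp <- rep(NA_real_, length(info))
dp[has_dp] <- as.numeric(sub("^(.*;)?DP=([0-9]+).*$", "\\2", info[has_dp]))

cat("records:", length(body), "\n")
print(summary(qual))
print(summary(dp))
```

Pointing `path` at a real file instead of the temporary one gives a rough picture of whether most loci have low depth or low quality.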

Best,

@pllittle

Hello, here are some image.rds files as well as a vcf file from one of the samples that shows the third error.
UNMASC_diagnostics.zip

Thank you very much for the help!

Hello again, I've tried a few things to troubleshoot, such as removing all variants from the chromosomes which throw error #1, but I'm still no closer to figuring out what's causing the error or why only in a subset of samples. Any thoughts?
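
For reference, the message in error #1 (`value = 1:0` and "replacement has 2 rows, data has 0") is what R produces when an index column built with `1:nrow(x)` is assigned to a data.frame that has zero rows. The sketch below reproduces that message with a hypothetical stand-in table; it is not UNMASC's code, and whether this is what happens inside `segment_tVAF` cannot be confirmed from this thread.

```r
# Hypothetical stand-in for a per-chromosome table that ends up with no rows.
seg <- data.frame(pos = integer(0), vaf = numeric(0))

# 1:nrow(seg) evaluates to 1:0, which is c(1L, 0L): a vector of length 2, not 0.
msg <- tryCatch({
  seg$index <- 1:nrow(seg)
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)  # "replacement has 2 rows, data has 0"

# seq_len(0) returns integer(0), so the same assignment succeeds on an empty table.
safe <- seg
safe$index <- seq_len(nrow(safe))
str(safe)
```

If this is the cause, a chromosome left with no usable variants after filtering would trigger it.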

Hi @javi-a-lopez,

Apologies for the delayed response. Thank you for the image files; they quickly helped me find some potential issues. I see that some additional documentation and tips for the user are needed. Here are my thoughts for now.

  1. Screening unmatched normals: When generating the image.rds files, the initial clustering of normal read counts creates the nCLUST directory with plots of the normal VAF (nVAF). Among the 30 unmatched normal controls, a subset appears to be of lower quality (STUDYNUMBERs 1 and 3 specifically). We expect the nVAF to be concentrated around 0, 0.5, or 1. Any concentrated deviation from 0.5 could indicate a somatic copy number change due to a tumor/normal sample swap or mislabeling.

  2. Read count distribution: Based on the default clustering of counts, the binomial distribution may be less favorable than the beta-binomial distribution. The data.frame SE within image.rds provides the metrics calculated from clustering normal read counts. You can run the unexported function as UNMASC:::run_nCLUST() to see the difference when switching between binom = TRUE vs binom = FALSE.

  3. Sparsity: UNMASC was benchmarked with a targeted gene panel, and performance improves with increased coverage (WES, WGS). The limited number of genes captured in your panel creates a sparsity problem that leads to poor segmentation of tumor and normal VAF per chromosome. This is a limitation of the current UNMASC implementation for your samples. One solution I can consider is pooling all loci together for a genome-wide segmentation to overcome this sparsity. This can be a future direction for UNMASC, but it will take a few weeks to implement, test, and debug.
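
To illustrate point 2, here is a small base R simulation of why the choice between binomial and beta-binomial matters for normal read counts. It is a sketch on simulated data only: the depth, number of loci, and overdispersion value `rho` are made-up assumptions, and it does not call `UNMASC:::run_nCLUST()` because its arguments other than `binom` are not described in this thread.

```r
set.seed(1)
n_loci <- 2000   # hypothetical number of heterozygous loci
depth  <- 200    # hypothetical read depth per locus
p      <- 0.5    # expected nVAF at a heterozygous germline locus

# Binomial: every locus shares the same alternate-allele probability.
alt_binom <- rbinom(n_loci, size = depth, prob = p)

# Beta-binomial: each locus draws its own probability from a Beta distribution,
# which adds locus-to-locus variation (overdispersion).
rho <- 0.05                       # assumed overdispersion, equal to 1 / (a + b + 1)
a <- p * (1 - rho) / rho
b <- (1 - p) * (1 - rho) / rho
alt_bb <- rbinom(n_loci, size = depth, prob = rbeta(n_loci, a, b))

# Compare the spread of the simulated nVAF with the binomial expectation.
var_binom      <- var(alt_binom / depth)
var_bb         <- var(alt_bb / depth)
expected_binom <- p * (1 - p) / depth

cat("binomial nVAF variance:     ", signif(var_binom, 3), "\n")
cat("beta-binomial nVAF variance:", signif(var_bb, 3), "\n")
cat("binomial expectation:       ", signif(expected_binom, 3), "\n")
```

When the observed nVAF spread around 0.5 is much wider than the binomial expectation, a binomial fit understates the noise, which is the situation where the beta-binomial setting is worth comparing.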

Best,
@pllittle