pllittle/UNMASC

About tumor-only calling


Hi developers,
I am trying to use UNMASC on my dataset. But I am still a little bit confused about the workflow.

  1. Can I use other tools such as MuTect? Or, if only Strelka is supported, how do I run a tumor-only case with configureStrelkaSomaticWorkflow.py? From my understanding, the normal BAM file is one of the unmatched normal BAMs. So for each unmatched normal, do I run Strelka on the tumor and that normal, and then merge the SNVs and indels from all pairs?
  2. In vcf = prep_UNMASC_VCF(outdir,DAT,FILTER, target_fn,anno_fn,ncores), what are the formats of DAT and FILTER? Is DAT the list of tumor-normal paired files? Can you show me an example?

DAT: A data.frame containing column names 'FILENAME' for the full
vcf filename and 'STUDYNUMBER' for a unique string mapped to
the control sample used when calling variants against the
same tumor.

Thanks,
FN

Hi @hfl112,

Thank you for your interest in UNMASC!

  1. You are more than welcome to use other somatic variant callers such as MuTect. I haven't provided any sample code for how to annotate MuTect's outputted VCF (currently only Strelka2).

The key to UNMASC is leveraging the distribution of read counts from among unmatched normal controls. So whichever somatic caller you use needs to have the tumor bam and unmatched normal bam (refer to the sample Strelka code in README.md). So there is no tumor-only step to generate the VCFs (each VCF is generated by a tumor and an unmatched normal pair). More unmatched normal bams improve the ability to identify regions that may harbor false positive variants. And yes these VCFs of SNVs and INDELs, for a single tumor, are merged together, annotated, and inputted into UNMASC.

  2. By typing ?UNMASC::prep_UNMASC_VCF you can see the documentation. DAT is a data.frame containing the columns FILENAME and STUDYNUMBER. Suppose you have 20 unmatched normal controls used against each tumor in your cohort. Then FILENAME could be "/var1.vcf", ..., "/var20.vcf", and STUDYNUMBER could be "N_1", "N_2", ..., "N_20". STUDYNUMBER is tracked because, at the end of running UNMASC, you might be curious which variant call came from which unmatched normal control. Some variants might appear in every vcf, some may not.

FILTER can be seen from the documentation. For example setting FILTER = list(nDP = 2,tDP = 2,Qscore = 3) means to retain variants with normal total read depth greater than or equal to 2, tumor total read depth greater than or equal to 2, and Qscore greater than or equal to 3. This is meant to pre-filter variants at extremely low depth or low variant quality score before running UNMASC. If these thresholds seem too stringent, you're welcome to set them all to zero.
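As a rough sketch of the two inputs described above, the snippet below builds a DAT data.frame for one tumor called against 20 unmatched normals, and the FILTER list from the example. The file names follow the "/var1.vcf" pattern used in the reply. The `toy` data.frame and its column names are purely illustrative assumptions, used only to show the "greater than or equal to" semantics; they are not UNMASC's internal column names.

```r
# Assumption: 20 unmatched normals were each called against one tumor.
n_normals <- 20

# DAT: one row per tumor/unmatched-normal VCF.
DAT <- data.frame(
  FILENAME    = sprintf("/var%d.vcf", seq_len(n_normals)),
  STUDYNUMBER = sprintf("N_%d", seq_len(n_normals)),
  stringsAsFactors = FALSE
)

# FILTER: the pre-filter thresholds from the example above.
FILTER <- list(nDP = 2, tDP = 2, Qscore = 3)

# Toy illustration (hypothetical columns) of what the thresholds retain:
# a variant is kept only if all three values meet or exceed their cutoffs.
toy <- data.frame(
  nDP    = c(1, 5, 10, 4),
  tDP    = c(8, 1, 12, 6),
  Qscore = c(30, 30, 2, 3)
)
keep <- toy$nDP >= FILTER$nDP &
  toy$tDP >= FILTER$tDP &
  toy$Qscore >= FILTER$Qscore

# The call from the question would then look like this (not run here;
# outdir, target_fn, anno_fn and ncores must be defined by the user):
# vcf <- UNMASC::prep_UNMASC_VCF(outdir, DAT, FILTER, target_fn, anno_fn, ncores)
```

Only the last toy row passes all three thresholds; setting every threshold to zero would retain all four rows.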

Hope this helps!

Thank you, it's really helpful !!
So if I have 10 tumor cases and 20 unmatched normals, I should run 10x20 tumor-normal paired somatic calls. Then, for each tumor sample, I generate the FILENAME/STUDYNUMBER list and run prep_UNMASC_VCF and run_UNMASC.

Another thing I want to ask: can I use ANNOVAR in place of the VEP annotation? If I convert the ANNOVAR output to MAF format, is that compatible with UNMASC?

Thank you so much,
FN

You're welcome @hfl112!

Yes, you're correct, 10x20 vcfs would be generated and annotated. If you ran the Strelka/VEP code provided, each tumor sample would then be run with prep_UNMASC_VCF and run_UNMASC functions.
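One way to keep track of the 10x20 design is to enumerate every tumor/normal pair and then split the table by tumor, so each tumor gets its own DAT. This is only a sketch: the sample IDs and the `vcfs/<tumor>/<normal>.vcf` directory layout are hypothetical and should be replaced with wherever your caller actually wrote its output.

```r
# Hypothetical sample identifiers.
tumors  <- sprintf("T_%d", 1:10)
normals <- sprintf("N_%d", 1:20)

# Every tumor paired with every unmatched normal: 10 x 20 = 200 calling runs.
pairs <- expand.grid(
  TUMOR       = tumors,
  STUDYNUMBER = normals,
  stringsAsFactors = FALSE
)

# Assumed output layout, one VCF per pair.
pairs$FILENAME <- file.path("vcfs", pairs$TUMOR, paste0(pairs$STUDYNUMBER, ".vcf"))

# One DAT data.frame per tumor, with the two columns DAT needs.
DAT_by_tumor <- lapply(split(pairs, pairs$TUMOR), function(x) {
  x[, c("FILENAME", "STUDYNUMBER")]
})
```

Each element of `DAT_by_tumor` can then be passed, one tumor at a time, to prep_UNMASC_VCF followed by run_UNMASC.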

You are welcome to use ANNOVAR over VEP. The challenge will be making sure the vcf data.frame for run_UNMASC contains the necessary columns and each column is formatted correctly. In the near future, I'll look into adding R functions for handling ANNOVAR output.
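When adapting ANNOVAR output, a small check like the one below can flag missing columns before the data.frame reaches run_UNMASC. The `required` vector here is a placeholder assumption, not the actual list of columns UNMASC expects; take the real names from the package documentation (for example ?UNMASC::run_UNMASC).

```r
# Return the names in `required` that are absent from `df`.
missing_columns <- function(df, required) {
  setdiff(required, colnames(df))
}

# Placeholder column names, for illustration only.
required <- c("Chr", "Position", "Ref", "Alt")

# Toy data.frame standing in for converted ANNOVAR output.
annovar_df <- data.frame(
  Chr      = "chr1",
  Position = 12345L,
  Ref      = "A",
  stringsAsFactors = FALSE
)

missing <- missing_columns(annovar_df, required)
if (length(missing) > 0) {
  message("Missing columns: ", paste(missing, collapse = ", "))
}
```

This only checks that columns exist; the type and formatting of each column still need to be verified against the documentation.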

Good luck and hope this helps!