pllittle/UNMASC

panel of normals input


stu2 commented

Hi, after reading the docs I am still a bit confused about exactly how to generate the vcf dataframe as the input into run_UNMASC. The column names for this dataframe mentioned in the docs include:
"nAD" (e.g. 10), control alternate depth
"nRD" (e.g. 20), control reference depth
"tAD" (e.g. 10), tumor alternate depth
"tRD" (e.g. 20), tumor reference depth
These are possible to generate if starting with a single normal sample and a matched tumor sample, but as I understand it we are meant to be able to use a panel of several unrelated normals which is seemingly not compatible with having only a single column for nAD and nRD. Also, my impression after skimming the Strelka manual is that it seems to require matched tumor/normal samples too, so it is unclear how its output is used in UNMASC if I understood correctly. Do you have a workflow for generating this dataframe from unrelated tumour and normal BAMs and VCFs?

pllittle commented

Hi @stu2, thank you for your interest in UNMASC!

Regarding your final comment, a function to auto-generate the data.frame from vcfs would benefit the workflow. I can provide an R script/function to process the annotated vcf produced by the example code in the README.md (using Strelka + VEP). For other variant callers and annotation formats, this step would need to be customized by the user.

To create the input R data.frame, as stated in the documentation for the function "run_UNMASC", the key column for appending variants from multiple unrelated samples together is "STUDYNUMBER". This column tracks which unmatched normal control (and vcf) each variant was generated from. For example, variants in the data.frame for

  1. tumor vs normal_1 vcf can be labeled STUDYNUMBER = "normal_1",
  2. tumor vs normal_2 vcf can be labeled STUDYNUMBER = "normal_2",
    and so on.
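
As a rough illustration of the labeling above, here is a minimal R sketch that tags each per-normal variant table with STUDYNUMBER and stacks the tables into one data.frame. The depth columns and STUDYNUMBER are the ones named in this thread; the locus columns (Chr, Position, Ref, Alt) and all values are hypothetical placeholders, so check the run_UNMASC documentation for the full set of required columns.

```r
# Toy variant tables for ONE tumor called against two unrelated normals.
# Chr/Position/Ref/Alt are hypothetical locus columns (an assumption);
# nAD/nRD/tAD/tRD are the depth columns quoted in this issue.
calls_normal_1 <- data.frame(
  Chr = c("chr1", "chr2"), Position = c(12345L, 67890L),
  Ref = c("A", "G"), Alt = c("T", "C"),
  nAD = c(0L, 10L), nRD = c(30L, 20L),
  tAD = c(10L, 12L), tRD = c(20L, 18L),
  stringsAsFactors = FALSE
)
calls_normal_2 <- data.frame(
  Chr = "chr1", Position = 12345L, Ref = "A", Alt = "T",
  nAD = 1L, nRD = 25L, tAD = 10L, tRD = 20L,
  stringsAsFactors = FALSE
)

# Name each table after the normal it was called against.
per_normal <- list(normal_1 = calls_normal_1, normal_2 = calls_normal_2)

# Add the STUDYNUMBER label to every row, then stack the tables.
tagged <- lapply(names(per_normal), function(id) {
  df <- per_normal[[id]]
  df$STUDYNUMBER <- id
  df
})
vcf <- do.call(rbind, tagged)
rownames(vcf) <- NULL

# Number of variant rows contributed by each normal.
table(vcf$STUDYNUMBER)
```

The same variant can appear once per normal; nAD and nRD then hold that particular normal's depths, which is why a single pair of control-depth columns is enough.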

Regarding Strelka, it is true that it expects a tumor with its matched normal. UNMASC is purposely designed to allow users to use their preferred variant caller (Strelka, Mutect, etc.) and replace the caller's expected matched normal bam with any unrelated normal bam and repeat for multiple unrelated normal bams.

Hope this helps!

stu2 commented

Thank you @pllittle, that was very helpful.

byeongill commented

I still don't understand. Please look at this example.
image
Input bam files of Strelka somatic calling for UNMASC
tumor 1 / normal 1
tumor 2 / one of the normal samples
tumor 3 / normal 3
tumor 4 / one of the normal samples
tumor 5 / normal 5
tumor 6 / one of the normal samples

For the tumors without a matched normal, is "one of the normal samples" selected randomly?
Or is each tumor-only sample run against all of the normal samples (t2-n1, t2-n3, t2-n5)?

pllittle commented

Hi @byeongill,

Thank you for the figure and questions!

To answer your question, UNMASC expects each tumor to be processed against multiple unmatched normal controls. So the second option you described (t2-n1, t2-n2, t2-n3, t2-n4, ..., t2-n20) is the correct one.

When benchmarking UNMASC, we selected 100 tumors and 20 strictly unmatched normals, so each tumor had 20 vcfs and there were 100 × 20 = 2,000 vcfs in total. When we ran UNMASC, the 20 vcfs per tumor were aggregated into one R data.frame as input.
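
To make the bookkeeping concrete, the sketch below enumerates every tumor/normal pairing for a design of that size and groups the resulting vcf names by tumor. The sample labels and the `<tumor>_<normal>.vcf` naming pattern are illustrative assumptions, not names used in the actual benchmark.

```r
# Hypothetical sample labels for 100 tumors and 20 unmatched normals.
tumors  <- sprintf("tumor%d", 1:100)
normals <- sprintf("normal%d", 1:20)

# One row per variant-caller run: every tumor against every normal.
runs <- expand.grid(tumor = tumors, normal = normals,
                    stringsAsFactors = FALSE)

# Assumed naming pattern for each run's output vcf.
runs$vcf <- sprintf("%s_%s.vcf", runs$tumor, runs$normal)

# Total number of vcfs across the whole design.
nrow(runs)

# The vcfs to aggregate into a single input data.frame, per tumor.
vcfs_by_tumor <- split(runs$vcf, runs$tumor)
length(vcfs_by_tumor[["tumor1"]])
```

Each element of `vcfs_by_tumor` lists the files that would be read, labeled with STUDYNUMBER, and stacked for one UNMASC run.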

In your case, if some normals are matched and some are unmatched (like 5 matched normals + 15 unmatched normals), you can still proceed to use all of them against each tumor. Just make sure to keep track of each vcf name (e.g. tumor1_normal1.vcf, tumor1_normal5.vcf).
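
One possible way to keep track of the vcf names is a small manifest that recovers the tumor and normal labels from each file name. This is only a sketch: it assumes the `<tumor>_<normal>.vcf` pattern from the examples above, and the `matched` flag relies on a hypothetical convention that a tumor and its own normal share the same numeric suffix.

```r
# Example file names following the assumed <tumor>_<normal>.vcf pattern.
vcf_files <- c("tumor1_normal1.vcf", "tumor1_normal5.vcf", "tumor2_normal1.vcf")

# Drop the extension, then split each stem on the underscore.
stem  <- sub("\\.vcf$", "", basename(vcf_files))
parts <- do.call(rbind, strsplit(stem, "_", fixed = TRUE))

manifest <- data.frame(
  vcf         = vcf_files,
  tumor       = parts[, 1],
  STUDYNUMBER = parts[, 2],   # label of the normal used for this vcf
  stringsAsFactors = FALSE
)

# Hypothetical convention: equal numeric suffixes mean a matched pair.
manifest$matched <- sub("^tumor", "", manifest$tumor) ==
  sub("^normal", "", manifest$STUDYNUMBER)

manifest
```

The `STUDYNUMBER` column of the manifest can then be copied onto the variants read from the corresponding vcf before the per-tumor tables are combined.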

Hope this helps! ^_^