pllittle/UNMASC

About output label

Closed this issue · 6 comments

Hi @pllittle ,
Sorry for bothering you again.
1. I see several labels in UNMASC's somatic mutation output. Do the labels follow the criteria in this table?
image
2. what does the "BAF-like" label mean?
3. If I want a high-confidence somatic mutation list, is it enough to remove mutations with AV labels (FFPE, OXOG, strand bias, and ARTI)? Or should I keep only the no_label mutations? How did you filter the mutations when comparing PPV with other methods (Figure 4 in your paper)?
4. If I keep only the mutations with no labels in the output, there are only a few mutations (< 10), and all of them are on chrX. Is that possible?

Thanks,
FN

Hi @hfl112,

  1. The LABEL column of the output file is the set of characterizations (telling the story) of each variant.
  2. BAF-like means the variant's VAF is "close" to the local segmental BAF. That could mean one of two things. Either the tumor is highly pure (purity near 1, as for cell lines) and the variant is either somatic or germline (distinguishing between the two would require checking whether it is in COSMIC or a germline database for extra context), or the variant is more likely germline in a less-than-pure tumor.
  3. For a high-confidence set, I would keep the no_label variants and the variants in COSMIC (cosmic>=10) without other labels. If you already know the samples are not pure tumors, you can also exclude BAF-like. The strand-bias and H2M variants can simply be excluded. FFPE/OXOG/ARTI require a quick inspection of the genome-wide plot. Is there a huge concentrated cluster of low-VAF black points in the png? If yes, then exclude the FFPE/OXOG/ARTI variants. Could you provide the sample's genome-wide png plot?
  4. If you have gender information per sample, that can be passed to run_UNMASC() to help with BAF-like labeling for chrX. If the final count seems too low, make sure the target BED file correctly captures the genes/gene panel you're interested in, because loci are labeled on or off target based on it. May I ask if the samples are targeted capture, whole exome, or whole genome? The input cutoffs (depth and Qscore) used by UNMASC may need to be tuned depending on coverage. It's also possible that UNMASC performed poorly on BAF segmentation (the png per sample can help debug these issues).
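
The filtering rule in point 3 could be sketched roughly as below. This is only an illustration in Python, not UNMASC code: the `;` separator inside the LABEL field, the exact label spellings, and a `COSMIC` label name are all assumptions that should be checked against the actual output file, and `keep_variant` and its parameters are hypothetical names.

```python
# Label spellings and the ";" separator are assumptions, not confirmed UNMASC output.
ARTIFACT_LABELS = {"FFPE", "OXOG", "ARTI"}


def parse_labels(label_field, sep=";"):
    """Split a LABEL string into a set, treating 'no_label' as the empty set."""
    labels = {part.strip() for part in label_field.split(sep) if part.strip()}
    labels.discard("no_label")
    return labels


def keep_variant(label_field, cosmic_count=0, tolerate_baf_like=False,
                 tolerate_artifacts=False, cosmic_min=10):
    """Return True if no disqualifying label remains on the variant."""
    tolerated = set()
    if tolerate_baf_like:
        # Reasonable only when the sample is known to be highly pure.
        tolerated.add("BAF-like")
    if tolerate_artifacts:
        # Only when the genome-wide plot shows no low-VAF cluster.
        tolerated |= ARTIFACT_LABELS
    if cosmic_count >= cosmic_min:
        # "COSMIC" is a hypothetical label name for well-supported COSMIC hits.
        tolerated.add("COSMIC")
    # Strand-bias and H2M are never tolerated, so they always disqualify.
    return not (parse_labels(label_field) - tolerated)
```

With these assumptions, `keep_variant("no_label")` keeps the variant, while `keep_variant("BAF-like")` drops it unless `tolerate_baf_like=True` is passed for a high-purity sample.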

I am running UNMASC on WES data with an average depth of 100X+ and the default parameter settings.
This sample is actually a patient-derived xenograft model, so the tumor purity should be much higher than in a patient tumor. I think I would keep all the "BAF-like" mutations.

From the plot, almost all candidates are actually H2M:
image

I see, this is a sign the normal VAF segmentation is problematic for this sample. It somehow incorrectly inferred H2M regions. Can you provide the corresponding nSEG/nSEG.png?

Here are the segmentation PNGs for this sample:
tSEG:
image
nSEG (all the VAFs < 0.5):
image

I also checked another test sample I've run; its nSEG is similar to the one posted here.

I see, I believe the issue may lie in the input preparation before running UNMASC. Which variant caller did you use, and can you provide example code? I'm curious whether any pre-filtering was done on the VCFs.

Ah, I only included the "PASS" variants from Strelka. I think the problem might be due to this filtering ...
If I include the non-PASS variants as well, though, annotation with VEP will take a long time, even using "-fork ", because I have 1,000,000+ unique mutations per sample.
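
To see how much the PASS-only filter removes before deciding, one option is to tally the FILTER column (the seventh tab-separated field in the VCF format) of the unfiltered calls. The sketch below is a generic, standard-library illustration rather than part of UNMASC or Strelka; `tally_filter_values` is a hypothetical helper name.

```python
from collections import Counter


def tally_filter_values(lines):
    """Count the FILTER values (7th column) over the data lines of a VCF."""
    counts = Counter()
    for line in lines:
        # Skip meta-information lines, the header line, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 7:
            counts[fields[6]] += 1
    return counts
```

It can be fed an open file handle, for example `tally_filter_values(open(path))`, or `gzip.open(path, "rt")` for a compressed VCF. Comparing the PASS count with the total shows how many candidates were dropped before UNMASC saw them.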