campbio/musicatk

Is there any way to get evidence that a DBS is actually 2 SBS? What are the relative rates?

Closed this issue · 3 comments

Discussed in #50

Originally posted by achevali October 22, 2021
Currently we have a switch to determine whether we should merge 2 adjacent SBS. Is there a baseline ratio of these occurrences? Is there a way to gain evidence that a given variant should be treated as 2 mutations or 1?

Strictly speaking, whether two adjacent mutations are marked as one 'DBS' or two 'SBS' depends on which allele each mutation is on. For example, take two adjacent SNPs with the reference:

-A-T-

Both positions are mutated, and both mutations are heterozygous. On the diploid human genome, the two mutations can then be arranged in one of two ways:

  • cis
Ref:      -A-T-
Allele1:  -G-C-
Allele2:  -A-T-
  • trans
Ref:      -A-T-
Allele1:  -G-T-
Allele2:  -A-C-

In the cis situation, the two mutations can clearly be marked as one DBS. In VCF format:

chrXXX 123 rsXXXXX AT GC 1000 PASS . GT 1/0

In the trans situation, marking these two mutations together is probably not appropriate. The VCF records should look like this:

chrXXX 123 rsXXXXY A G 1000 PASS . GT 1/0
chrXXX 124 rsXXXXZ T C 1000 PASS . GT 1/0
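As a rough illustration of the cis/trans distinction, here is a minimal Python sketch that classifies two adjacent SNVs from their VCF GT strings. The function name `classify_adjacent_snvs` is hypothetical, and the sketch assumes both records are diploid and were phased within the same phase set. Note that the example records above use the unphased separator (`1/0`), so on their own they would fall into the "unknown" case.

```python
def classify_adjacent_snvs(gt_first: str, gt_second: str) -> str:
    """Classify two adjacent SNVs as 'DBS', 'two SBS', or 'unknown'.

    Assumption: both GT strings come from the same sample and, if phased,
    from the same phase set, so haplotype order is comparable.
    """
    # An unphased genotype ("/") carries no haplotype information.
    if "|" not in gt_first or "|" not in gt_second:
        return "unknown"

    hap_first = gt_first.split("|")
    hap_second = gt_second.split("|")

    # cis: at least one haplotype carries an alternate allele at both positions.
    shares_haplotype = any(
        a not in ("0", ".") and b not in ("0", ".")
        for a, b in zip(hap_first, hap_second)
    )
    return "DBS" if shares_haplotype else "two SBS"


print(classify_adjacent_snvs("1|0", "1|0"))  # cis
print(classify_adjacent_snvs("1|0", "0|1"))  # trans
print(classify_adjacent_snvs("1/0", "1/0"))  # unphased
```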

But in tumor genome studies we have to face more complicated situations. Marking two adjacent SNPs as SBS or DBS says something about their relative location and even the potential origin of the mutations. This can be tough for researchers!

We can also see the difference between these situations from the PROTEIN's point of view! For example, suppose the reference AT belongs to a codon of the triplet genetic code, AUG, which encodes the amino acid Met.

Then in the cis situation, the codon becomes GCG, which encodes Ala.

In the trans situation, two new amino acids appear: GUG encodes Val, and ACG encodes Thr!
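The codon example can be checked with a small Python sketch. The lookup table contains only the four codons mentioned here (taken from the standard genetic code), and the helper `mutate` is a hypothetical name used only for illustration.

```python
# Only the codons used in this example, from the standard genetic code.
CODON_TABLE = {"AUG": "Met", "GCG": "Ala", "GUG": "Val", "ACG": "Thr"}


def mutate(codon: str, changes: dict) -> str:
    """Return a copy of `codon` with the 0-based positions in `changes` replaced."""
    bases = list(codon)
    for position, alt_base in changes.items():
        bases[position] = alt_base
    return "".join(bases)


ref = "AUG"

# cis: both changes (A>G and U>C) sit on allele 1; allele 2 stays reference.
cis_alleles = [mutate(ref, {0: "G", 1: "C"}), ref]

# trans: A>G sits on allele 1 and U>C sits on allele 2.
trans_alleles = [mutate(ref, {0: "G"}), mutate(ref, {1: "C"})]

for label, alleles in (("cis", cis_alleles), ("trans", trans_alleles)):
    print(label, [(codon, CODON_TABLE[codon]) for codon in alleles])
```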

Hi @wangshun1121, thanks for your comments. We would need access to the reads in the BAM file in order to correctly resolve whether or not the mutations are truly adjacent on the same allele. However, musicatk operates on MAF and VCF files, which do not contain this information. If the mutation caller did not phase the variants correctly, then annotators such as VEP and Oncotator probably will not predict the correct protein change either (although I think at some point Oncotator did have a method to try to predict which adjacent SBSs are likely DBSs based on alternate allele frequencies in the MAF; I'm not sure about VEP).

We have given users the ability to merge adjacent SBSs into DBSs as a "rescue" step. Although it is not perfect and may falsely convert a few pairs of adjacent SBSs that are truly separate mutations, this will hopefully not have a huge impact on the counts. Users can turn this check on or off in the create_musica function. Hopefully, people will be inclined to use mutation callers and annotators that appropriately handle this in the future. I'm going to close this issue as there is not much we can do about it from the point of view of this toolkit.
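For readers who want to experiment with this kind of rescue step outside the toolkit, here is a minimal Python sketch. It is not musicatk's implementation: the names `Variant`, `merge_adjacent_sbs`, and `max_vaf_diff` are hypothetical, and the optional allele-frequency check is only an assumed heuristic inspired by the idea mentioned above. Without read-level phasing it cannot prove that two changes are in cis.

```python
from typing import List, NamedTuple, Optional


class Variant(NamedTuple):
    sample: str
    chrom: str
    pos: int
    ref: str
    alt: str
    vaf: Optional[float] = None  # variant allele frequency, if available


def _mergeable(a: Variant, b: Variant, max_vaf_diff: Optional[float]) -> bool:
    """True if a and b are single-base changes at consecutive positions."""
    if any(len(x) != 1 for x in (a.ref, a.alt, b.ref, b.alt)):
        return False
    if (a.sample, a.chrom) != (b.sample, b.chrom) or b.pos != a.pos + 1:
        return False
    if max_vaf_diff is not None:
        # Assumed heuristic: cis changes in one clone tend to have similar VAFs.
        if a.vaf is None or b.vaf is None:
            return False
        return abs(a.vaf - b.vaf) <= max_vaf_diff
    return True


def merge_adjacent_sbs(
    variants: List[Variant], max_vaf_diff: Optional[float] = None
) -> List[Variant]:
    """Greedily merge pairs of adjacent SBSs into DBSs (runs of 3+ are not handled)."""
    ordered = sorted(variants, key=lambda v: (v.sample, v.chrom, v.pos))
    merged = []
    i = 0
    while i < len(ordered):
        cur = ordered[i]
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        if nxt is not None and _mergeable(cur, nxt, max_vaf_diff):
            both = cur.vaf is not None and nxt.vaf is not None
            vaf = (cur.vaf + nxt.vaf) / 2 if both else None
            merged.append(
                Variant(cur.sample, cur.chrom, cur.pos,
                        cur.ref + nxt.ref, cur.alt + nxt.alt, vaf)
            )
            i += 2
        else:
            merged.append(cur)
            i += 1
    return merged


calls = [
    Variant("S1", "chr1", 123, "A", "G", 0.40),
    Variant("S1", "chr1", 124, "T", "C", 0.42),
    Variant("S1", "chr1", 200, "C", "T", 0.30),
]
for v in merge_adjacent_sbs(calls, max_vaf_diff=0.05):
    print(v.chrom, v.pos, v.ref, v.alt)
```

Setting `max_vaf_diff=None` merges every adjacent pair, which is the more permissive behavior. A stricter threshold leaves pairs with very different allele frequencies as two SBSs.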