getzlab/SignatureAnalyzer

"Unusual context" error


Hi,

I converted Strelka VCF files to maf format using annovar and maftools. I removed non-exonic somatic variants from the maf file prior to running SignatureAnalyzer. I'm running into an error with the maf format. Here is my command:

import signatureanalyzer as sa
import pandas as pd
maf_df = pd.read_csv("Input_351_hg38_multianno_exonic_SNP_min.maf", sep='\t').loc[:, [
    'Hugo_Symbol',
    'Tumor_Sample_Barcode',
    'Chromosome',
    'Start_Position',
    'Reference_Allele',
    'Tumor_Seq_Allele2',
    'Variant_Type'
]]

_,spectra_sbs = sa.spectra.get_spectra_from_maf(maf_df, cosmic='cosmic3_exome', hgfile='hg38.2bit')

  * Mapping contexts: 17327 / 17328

Traceback (most recent call last):
File "/uufs/env/lib/python3.8/site-packages/signatureanalyzer/spectra.py", line 114, in get_spectra_from_maf
maf['context96.num'] = contig.apply(context96.__getitem__)
File "/uufs/env/lib/python3.8/site-packages/pandas/core/series.py", line 4108, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/lib.pyx", line 2467, in pandas._libs.lib.map_infer
KeyError: '-AAG'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "/uufs/env/lib/python3.8/site-packages/signatureanalyzer/spectra.py", line 116, in get_spectra_from_maf
raise KeyError('Unusual context: ' + str(e))
KeyError: "Unusual context: '-AAG'"

I get the same error when I run:

signatureanalyzer Input_351_hg38_multianno_exonic_SNP_min.maf --hg_build hg38.2bit -n 10 --cosmic cosmic3_exome --objective poisson --max_iter 30000 --prior_on_H L1 --prior_on_W L1

I don't have "AAG" in my file. Any thoughts? Thanks!

Hello,

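Looking at the traceback, it is likely that the maf file contains an insertion that is marked as a SNP in the Variant_Type column. The context for a single-base substitution is a 4-character key of the form (ref)(alt)(ref-1)(ref+1): the reference allele, the alternate allele, and the bases immediately before and after the position. It is built from the maf columns and the hg reference sequence. Read that way, '-AAG' is not a string from your file. It is a reference allele of '-', an alternate allele of 'A', and flanking bases 'A' and 'G'. A reference allele of '-' is how an insertion is written in maf, so that row is not a true SNP.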
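
A quick way to find such rows before building spectra is to check that every variant labeled SNP has a single A/C/G/T base in both allele columns. The sketch below is only an illustration in plain Python: the helper names `is_single_base` and `find_mislabeled_snps` and the toy rows are made up here and are not part of SignatureAnalyzer.

```python
VALID_BASES = set("ACGT")


def is_single_base(allele):
    """Return True only for a single A/C/G/T character.

    maf uses '-' as the placeholder allele for insertions and deletions,
    so '-' (or any multi-base string) fails this check.
    """
    return (
        isinstance(allele, str)
        and len(allele) == 1
        and allele.upper() in VALID_BASES
    )


def find_mislabeled_snps(rows):
    """Return the rows labeled SNP whose ref or alt allele is not a single base."""
    return [
        row
        for row in rows
        if row["Variant_Type"] == "SNP"
        and not (
            is_single_base(row["Reference_Allele"])
            and is_single_base(row["Tumor_Seq_Allele2"])
        )
    ]


# Toy rows, invented for illustration (not taken from the poster's file).
rows = [
    {"Reference_Allele": "C", "Tumor_Seq_Allele2": "T", "Variant_Type": "SNP"},
    # Insertion mislabeled as SNP: reference allele is '-'.
    {"Reference_Allele": "-", "Tumor_Seq_Allele2": "A", "Variant_Type": "SNP"},
    # Deletion mislabeled as SNP: alternate allele is '-'.
    {"Reference_Allele": "G", "Tumor_Seq_Allele2": "-", "Variant_Type": "SNP"},
]

bad = find_mislabeled_snps(rows)
print(f"{len(bad)} row(s) labeled SNP are not single-base substitutions")
```

With the `maf_df` from the original post, the same predicate could be applied to the `Reference_Allele` and `Tumor_Seq_Allele2` columns, and the failing rows dropped or relabeled, before calling `get_spectra_from_maf`.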

Thanks for helping me, that was it! There were about 200 instances total in both reference/alternative columns. Thanks!