Error: not one of the known HGVSc strings: c.-1_1dupAA
Opened this issue · 3 comments
Let's make sure this wasn't a one-off error. CC @gongyixiao @kpjonsson
If it's never seen again, we can ignore it.
It's not matching this regex:
Line 691 in 62f75a8
I'm not sure whether this is a misformatted string or not. @gongyixiao, was this using a MAF annotated with vcf2maf/maf2maf?
@cband Is this a type of HGVSc string that should be captured by the regex?
This is the variant:
Chromosome 20
Start_Position 18446001
End_Position 18446002
Reference_Allele -
Tumor_Seq_Allele2 TT
Hugo_Symbol DZANK1
HGVSc c.-1_1dupAA
HGVSp p.Met1?
This situation has appeared again and is mentioned here: mskcc/tempo#838
I would propose replacing the parsing of dup|ins|del|inv HGVSc strings with the following:
elif re.match(r'^c\..*_(-?\d+).*(dup)([ATCG]+)$', hgvsc):
    # dup over a range (e.g. c.-1_1dupAA): use the position after the underscore
    position, hgvsc_type, sequence = re.match(r'^c\..*_(-?\d+).*(dup)([ATCG]+)$', hgvsc).groups()
elif re.match(r'^c\.(-?\d+).*(dup|ins|del|inv)([ATCG]+)$', hgvsc):
    # single position, or a non-dup range: use the first position (may be negative)
    position, hgvsc_type, sequence = re.match(r'^c\.(-?\d+).*(dup|ins|del|inv)([ATCG]+)$', hgvsc).groups()
else:
    sys.exit('Error: not one of the known HGVSc strings: ' + hgvsc)
# convert the 1-based HGVSc position to a 0-based string index
position = int(position) - 1
if hgvsc_type in ('dup', 'ins'):
    alt_allele = sequence
elif hgvsc_type == 'del':
    ref_allele = sequence
elif hgvsc_type == 'inv':
    ref_allele = sequence
    alt_allele = self.reverse_complement(sequence)
# for positions upstream of the CDS, drop the bases that fall before the start codon
ref_allele = ref_allele if position > -1 else ref_allele[position * -1:]
alt_allele = alt_allele if position > -1 else alt_allele[position * -1:]
## start of mutated region in CDS
cds = re.search(self.cds_seq + '.*', self.cdna_seq).group()
seq_5p = cds[0:position] if position > -1 else ''
seq_3p = cds[position:len(cds)] if position > -1 else cds
#print self.hgvsp + '\t' + self.variant_class + '\t' + self.variant_type + '\t' + self.ref_allele + '\t' + self.alt_allele + \
# '\t' + self.cds_position + '\nFull CDS: ' + self.cds_seq + '\nSeq_5: ' + seq_5p + '\nSeq_3' + seq_3p + '\n>mut_1--' + mut_cds_1 + '\n>mut_2--' + mut_cds_2 + '\n>mut_3--' + mut_cds_3
self.wt_cds = seq_5p + ref_allele + seq_3p[len(ref_allele):len(seq_3p)]
self.mt_cds = seq_5p + alt_allele + seq_3p[len(ref_allele):len(seq_3p)]
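To make the index arithmetic easier to check, here is a rough standalone trace of the wild-type/mutant CDS construction on a toy sequence. The function name `build_cds_pair` is hypothetical, the alleles are assumed to start out as empty strings (the fragment above does not show how they are initialised), and the inversion branch is left out for brevity. It only shows what the proposed logic yields, not whether that is the biologically intended result.

```python
def build_cds_pair(cds, position, hgvsc_type, sequence):
    """Toy version of the wt/mt CDS construction (hypothetical helper).

    `position` is the 1-based number parsed from the HGVSc string and `cds`
    is assumed to start at the first base of the start codon.
    """
    position = int(position) - 1      # convert to a 0-based string index
    ref_allele, alt_allele = '', ''   # assumption: both default to empty
    if hgvsc_type in ('dup', 'ins'):
        alt_allele = sequence
    elif hgvsc_type == 'del':
        ref_allele = sequence
    if position < 0:
        # drop the bases that fall upstream of the CDS
        ref_allele = ref_allele[-position:]
        alt_allele = alt_allele[-position:]
    seq_5p = cds[:position] if position > -1 else ''
    seq_3p = cds[position:] if position > -1 else cds
    wt_cds = seq_5p + ref_allele + seq_3p[len(ref_allele):]
    mt_cds = seq_5p + alt_allele + seq_3p[len(ref_allele):]
    return wt_cds, mt_cds


# c.-1_1dupAA parses to position 1, so the dup lands at the very start of the CDS
print(build_cds_pair('ATGGCC', 1, 'dup', 'AA'))  # ('ATGGCC', 'AAATGGCC')
# a deletion inside the CDS, e.g. c.4delG
print(build_cds_pair('ATGGCC', 4, 'del', 'G'))   # ('ATGGCC', 'ATGCC')
```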
For dup variants I preferred the number after the underscore, because the position before the underscore refers to the start of the reference allele, whereas the second position is the start of the alt allele. If there is no underscore, we can process the string like the others (ins|del|inv).
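As a quick sanity check of the two patterns, the sketch below runs them in isolation on a few strings, including the one from this issue. `parse_hgvsc` and the two constant names are hypothetical and not part of the existing script; the example inputs other than `c.-1_1dupAA` are made up for illustration.

```python
import re

# Hypothetical standalone helper mirroring the two proposed regex branches.
RANGE_DUP = re.compile(r'^c\..*_(-?\d+).*(dup)([ATCG]+)$')
SINGLE_POS = re.compile(r'^c\.(-?\d+).*(dup|ins|del|inv)([ATCG]+)$')


def parse_hgvsc(hgvsc):
    """Return (position, type, sequence), or None if neither pattern matches."""
    # Try the range-dup pattern first so the number after the underscore wins.
    match = RANGE_DUP.match(hgvsc) or SINGLE_POS.match(hgvsc)
    if match is None:
        return None
    position, hgvsc_type, sequence = match.groups()
    return int(position), hgvsc_type, sequence


print(parse_hgvsc('c.-1_1dupAA'))   # (1, 'dup', 'AA')
print(parse_hgvsc('c.35delG'))      # (35, 'del', 'G')
print(parse_hgvsc('c.76_77insT'))   # (76, 'ins', 'T')
print(parse_hgvsc('c.1A>G'))        # None
```

Note that a non-dup range such as `c.76_77insT` still falls through to the second pattern and keeps the first position, which matches the behaviour described above.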
Looking forward to hearing someone's thoughts on this solution.