taylor-lab/neoantigen-dev

Error: not one of the known HGVSc strings: c.-1_1dupAA


Let's make sure this wasn't a one-off error. CC @gongyixiao @kpjonsson

If it's never seen again, we can ignore it.

It's not matching this regex:

position, hgvsc_type, sequence = re.match(r'^c\.(\d+).*(dup|ins|del|inv)([ATCG]+)$', hgvsc).groups()
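To make the failure concrete, here is a minimal sketch that runs the quoted pattern against the reported string and against an ordinary range duplication. The helper name `parse_with_original` and the second test string are made up for illustration; only the pattern itself comes from the code above.

```python
import re

# Pattern quoted above: it requires a digit immediately after "c."
ORIGINAL_PATTERN = r'^c\.(\d+).*(dup|ins|del|inv)([ATCG]+)$'


def parse_with_original(hgvsc):
    """Return (position, type, sequence), or None when the pattern does not match.

    Hypothetical helper, not part of the repository code.
    """
    match = re.match(ORIGINAL_PATTERN, hgvsc)
    return match.groups() if match else None


# The leading "-" (a 5' UTR position) is not a digit, so the match fails.
print(parse_with_original('c.-1_1dupAA'))    # None
# An ordinary range duplication parses, capturing the first position.
print(parse_with_original('c.10_12dupATG'))  # ('10', 'dup', 'ATG')
```

In the real code the `None` result is what leads to the "not one of the known HGVSc strings" exit.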

I'm not sure whether this is a misformatted string or not. @gongyixiao, was this using a MAF annotated with vcf2maf/maf2maf?

@cband Is this a type of HGVSc string that should be captured by the regex?

This is the variant:

Chromosome         20
Start_Position     18446001
End_Position       18446002
Reference_Allele   -
Tumor_Seq_Allele2  TT
Hugo_Symbol        DZANK1
HGVSc              c.-1_1dupAA
HGVSp              p.Met1?

This situation has appeared again and is mentioned here: mskcc/tempo#838

I would propose replacing the parsing of dup|ins|del|inv HGVSc strings with the following:

        elif re.match(r'^c\..*_(-?\d+).*(dup)([ATCG]+)$', hgvsc):
            position, hgvsc_type, sequence = re.match(r'^c\..*_(-?\d+).*(dup)([ATCG]+)$', hgvsc).groups()

        elif re.match(r'^c\.(-?\d+).*(dup|ins|del|inv)([ATCG]+)$', hgvsc):
            # Use the same pattern as the test above; with plain (\d+) a negative position returns None here
            position, hgvsc_type, sequence = re.match(r'^c\.(-?\d+).*(dup|ins|del|inv)([ATCG]+)$', hgvsc).groups()

        else:
            sys.exit('Error: not one of the known HGVSc strings: ' + hgvsc)

        position = int(position) - 1
        if hgvsc_type in ('dup', 'ins'):  # tuple membership rather than a substring test
            alt_allele = sequence
        elif hgvsc_type == 'del':
            ref_allele = sequence
        elif hgvsc_type == 'inv':
            ref_allele = sequence
            alt_allele = self.reverse_complement(sequence)
        ref_allele = ref_allele if position > -1 else ref_allele[position * -1:]
        alt_allele = alt_allele if position > -1 else alt_allele[position * -1:]

        ## start of mutated region in CDS
        cds = re.search(self.cds_seq + '.*', self.cdna_seq).group()

        seq_5p = cds[0:position] if position > -1 else ''
        seq_3p = cds[position:len(cds)] if position > -1 else cds


        #print self.hgvsp + '\t' + self.variant_class + '\t' + self.variant_type + '\t' + self.ref_allele + '\t' + self.alt_allele + \
        #      '\t' + self.cds_position + '\nFull CDS: ' + self.cds_seq + '\nSeq_5: ' + seq_5p + '\nSeq_3' + seq_3p + '\n>mut_1--' + mut_cds_1 + '\n>mut_2--' + mut_cds_2 + '\n>mut_3--' + mut_cds_3
        self.wt_cds = seq_5p + ref_allele + seq_3p[len(ref_allele):len(seq_3p)]
        self.mt_cds = seq_5p + alt_allele + seq_3p[len(ref_allele):len(seq_3p)]
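As a rough check of the assembly step above, the sketch below replays the same slicing logic on a toy sequence. Everything here is illustrative: `build_cds` is a hypothetical standalone function, the nine-base CDS is invented, `ref_allele` and `alt_allele` are assumed to default to empty strings (their initialization is not shown in the snippet), and the `inv` branch is left out for brevity.

```python
def build_cds(cds, position, hgvsc_type, sequence):
    """Toy re-implementation of the wild-type/mutant CDS assembly.

    `position` is the 1-based value captured by the regex.
    """
    ref_allele, alt_allele = '', ''  # assumed defaults, not shown in the original
    position = int(position) - 1     # convert to a 0-based index
    if hgvsc_type in ('dup', 'ins'):
        alt_allele = sequence
    elif hgvsc_type == 'del':
        ref_allele = sequence
    # Trim allele bases that would fall before the CDS start
    if position < 0:
        ref_allele = ref_allele[-position:]
        alt_allele = alt_allele[-position:]
    seq_5p = cds[:position] if position > -1 else ''
    seq_3p = cds[position:] if position > -1 else cds
    wt_cds = seq_5p + ref_allele + seq_3p[len(ref_allele):]
    mt_cds = seq_5p + alt_allele + seq_3p[len(ref_allele):]
    return wt_cds, mt_cds


# c.-1_1dupAA parsed as position 1: the duplicated bases land in front of the CDS.
print(build_cds('ATGGCCTGA', 1, 'dup', 'AA'))   # ('ATGGCCTGA', 'AAATGGCCTGA')
# A deletion of GCC starting at position 4.
print(build_cds('ATGGCCTGA', 4, 'del', 'GCC'))  # ('ATGGCCTGA', 'ATGTGA')
```

Whether placing the duplicated bases at the parsed position is biologically right for every dup is a separate question; the sketch only shows what the slicing does.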

For dup variants I preferred the number after the underscore, because the first position (before the underscore) actually refers to the start of the reference allele, whereas the second position is the start of the alt allele. If there is no underscore, we can process the variant like the others (ins|del|inv).
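The two-pattern dispatch described here can be sketched as a standalone function, which makes it easy to see which position each kind of string yields. This is only a sketch under assumptions: `parse_indel_hgvsc` is a hypothetical name, it raises `ValueError` instead of calling `sys.exit` so it can be tested, and it returns the 1-based position without the later `- 1` adjustment.

```python
import re

# Range duplications: capture the position after the underscore.
DUP_RANGE = re.compile(r'^c\..*_(-?\d+).*(dup)([ATCG]+)$')
# Everything else: capture the first position, allowing a leading minus.
SIMPLE = re.compile(r'^c\.(-?\d+).*(dup|ins|del|inv)([ATCG]+)$')


def parse_indel_hgvsc(hgvsc):
    """Return (position, type, sequence) for a dup/ins/del/inv HGVSc string."""
    # Order matters: try the dup-with-range pattern first, then the general one.
    for pattern in (DUP_RANGE, SIMPLE):
        match = pattern.match(hgvsc)
        if match:
            position, hgvsc_type, sequence = match.groups()
            return int(position), hgvsc_type, sequence
    raise ValueError('not one of the known HGVSc strings: ' + hgvsc)


print(parse_indel_hgvsc('c.-1_1dupAA'))      # (1, 'dup', 'AA')
print(parse_indel_hgvsc('c.35dupA'))         # (35, 'dup', 'A')
print(parse_indel_hgvsc('c.100_102delATG'))  # (100, 'del', 'ATG')
```

Compiling each pattern once also avoids running the same regex twice per branch, as the proposed `elif` chain does.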

Looking forward to hearing someone's thoughts on this solution.