Index Error for variant using am37.c_to_g()
Opened this issue · 7 comments
We encountered a bug when running c_to_g via assembly mapper for variant 'NM_001291722.1:c.283-3C>T':
$ hgvs-shell
Python 3.6.9 (default, Jan 8 2020, 08:00:01)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.
############################################################################
hgvs-shell -- interactive hgvs
hgvs version: 1.5.1
data provider url: postgresql://anonymous:anonymous@uta.biocommons.org/uta/uta_20180821
schema_version: 1.1
data_version: uta_20180821
sequences source: SeqRepo (/Users/paige.taylor/Data/seqrepo/2018-08-21)
The following variables are defined:
* global_config
* hp, parser, hgvs_parser -- Parser instance
* hdp, hgvs_data_provider -- UTA data provider instance
* vm, variant_mapper, hgvs_variant_mapper -- VariantMapper instance
* am37, hgvs_assembly_mapper_37 -- GRCh37 Assembly Mapper instance
* am38, projector, hgvs_assembly_mapper_38 -- GRCh38 Assembly Mapper instances
* hn, normalizer, hgvs_normalizer -- Normalizer instance
* hv, validator, hgvs_validator) -- Validator instance
The following functions are available:
* parse, normalize, validate
* g_to_c, g_to_n, g_to_t,
* c_to_g, c_to_n, c_to_p,
* n_to_c, n_to_g,
* t_to_g,
* get_relevant_transcripts
When submitting bug reports, include the version header shown above
and use these variables/variable names whenever possible.
In [1]: var = hp.parse_hgvs_variant('NM_001291722.1:c.283-3C>T')
In [2]: am37.c_to_g(var)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
~/virtualenv/hgvs_v15/lib/python3.6/site-packages/hgvs/shell.py in <module>
----> 1 am37.c_to_g(var)
~/virtualenv/hgvs_v15/lib/python3.6/site-packages/hgvs/assemblymapper.py in c_to_g(self, var_c)
114 alt_ac = self._alt_ac_for_tx_ac(var_c.ac)
115 var_out = super(AssemblyMapper, self).c_to_g(
--> 116 var_c, alt_ac, alt_aln_method=self.alt_aln_method)
117 return self._maybe_normalize(var_out)
118
~/virtualenv/hgvs_v15/lib/python3.6/site-packages/hgvs/variantmapper.py in c_to_g(self, var_c, alt_ac, alt_aln_method)
301 pos_n = mapper.g_to_n(pos_g)
302 edit_g = hgvs.edit.NARefAlt(
--> 303 ref='', alt=self._get_altered_sequence(mapper.strand, pos_n, var_n))
304 pos_g.uncertain = var_c.posedit.pos.uncertain
305 var_g = hgvs.sequencevariant.SequenceVariant(
~/virtualenv/hgvs_v15/lib/python3.6/site-packages/hgvs/variantmapper.py in _get_altered_sequence(self, strand, interval, var)
524
525 if edit.type == 'sub':
--> 526 seq[pos_start] = edit.alt
527 elif edit.type == 'del':
528 del seq[pos_start:pos_end]
IndexError: list assignment index out of range
@sptaylor: I can confirm this bug.
I accidentally discovered that this bug doesn't happen if I use a uta_20150827 (discovered through a sloppy copy-paste from old notes). That discovery only increases my curiosity.
I don't have funding, or time to donate, to work on this right now.
@reece we are experiencing the same error in a different context.
when using the Assembly Mapper's g_to_c function with strict_bounds=True, this variant NC_000001.10:g.16890441C>G throws a HGVSInvalidIntervalError due to gaps in the sequence. when using the same function with strict_bounds=False we get
'Traceback (most recent call last):
{some lines ommitted for brevity}
File "/usr/local/lib/python3.6/site-packages/hgvs/assemblymapper.py", line 100, in g_to_c
var_g, tx_ac, alt_aln_method=self.alt_aln_method)
File "/usr/local/lib/python3.6/site-packages/hgvs/variantmapper.py", line 259, in g_to_c
ref='', alt=self._get_altered_sequence(mapper.strand, pos_g, var_g))
File "/usr/local/lib/python3.6/site-packages/hgvs/variantmapper.py", line 526, in _get_altered_sequence
seq[pos_start] = edit.alt
IndexError: list assignment index out of range'
We propose a try/except in the else statement code block at line 225 of variantmapper.py (also below, hgvs1.5.1) to raise the HGVSInvalidInteralError. Does this make sense or is there a deeper issue?
'if not pos_c.uncertain:
edit_c = self._convert_edit_check_strand(mapper.strand, var_g.posedit.edit)
if edit_c.type == 'ins' and pos_c.start.offset == 0 and pos_c.end.offset == 0 and pos_c.end - pos_c.start > 1:
pos_c.start.base += 1
pos_c.end.base -= 1
edit_c.ref = ''
else:
# variant at alignment gap
pos_g = mapper.c_to_g(pos_c)
edit_c = hgvs.edit.NARefAlt(
ref='', alt=self._get_altered_sequence(mapper.strand, pos_g, var_g))'
Validation problem?
Other tools seem to think this is an error, so perhaps it's a validation problem:
ClinGen allele registry: dies with "intronic position inside exon"
Variant Validator: ExonBoundaryError: Position c.283-3 does not correspond with an exon boundary for transcript NM_001291722.1
Exon positions in UTA
Where is "c.283" ?
In [46]: var_c = hp.parse_hgvs_variant('NM_001291722.1:c.283C>T') # remove the 3
In [47]: am37.c_to_n(var_c)
Out[47]: SequenceVariant(ac=NM_001291722.1, type=n, posedit=422C>T, gene=None)
Double check - NM_001291722.1 CDS start = 139
139 + 283 = transcript pos 422 so c_to_n is correct here
GRCh37 Tx exons around 422 are:
['CYFIP2', 'NM_001291722.1', 'NC_000005.9', 'splign', 1, 3, 346, 424, 156721791, 156721866, '75=3D', None, None, 738621, 366974, 6225535, 3655233, 3833217]
['CYFIP2', 'NM_001291722.1', 'NC_000005.9', 'splign', 1, 4, 424, 526, 156723677, 156723782, '3I102=', None, None, 738621, 366974, 6225536, 3655234, 3833382]
It looks like perhaps UTA is calling the exon wrong here? Why have a deletion of 3 bases at exon/intron boundary rather than shift the exon boundary?
Other investigation
I started to look into this and noticed something...
var_c = hp.parse_hgvs_variant('NM_001291722.1:c.283-3C>T')
mapper = am37._fetch_AlignmentMapper("NM_001291722.1", "NC_000005.9", "splign")
pos_c = var_c.posedit.pos
pos_c_to_n = mapper.c_to_n(pos_c)
pos_g = mapper.c_to_g(pos_c)
pos_c_to_g_to_n = mapper.g_to_n(pos_g)
print(pos_c_to_n)
print(pos_c_to_g_to_n)
These should be the same? Output:
422-3
418_419
Not sure if related, but I've found a few other examples of where conversion seems inconsistent (if you convert from c to g then back to c it's changed)
# These examples are from ./tests/data/clinvar.gz so presumably real HGVS from the wild
# Yes, they are probably wrong, eg "NM_000180.3:c.3233-3236dup" is probably a typo, dash is supposed to be underscore
example_hgvs = [
'NM_001145026.1:c.715A>G',
'NM_004543.4:c.7432-2025_7536+372del2502',
'NM_000180.3:c.3233-3236dup',
'NM_032383.3:c.1303+1G>A',
'NM_001011.3:c.148+1G>A',
'NM_000158.3:c.2052+5289_2052+5297delGTGTGGTGGinsTGTTTTTTACATGACAGGT',
'NM_003611.2:c.2122-2125dupAAGA',
'NM_001017975.4:c.1687-1G>C',
'NM_006904.6:c.1776+1523dupA',
'NM_001122679.1:c.2968-2A>T'
]
for hgvs_str in example_hgvs:
var_c = hp.parse_hgvs_variant(hgvs_str)
var_g = am37.c_to_g(var_c)
var_c2 = am37.g_to_c(var_g, var_c.ac) # This should in theory be the same as we started
print(var_c)
print(var_c2)
print("-" * 20)
Between the dashed lines, the two should be consistent
NM_001145026.1:c.715A>G
NM_001145026.1:c.661-2_661-1insGAGAATAGTGAATCTTTTTTATGGAGTACAGCCAGCCCTTCTCCAACCCTTGGTGGAGTTACACCTCCATCGCGTACCACACATTCATCAAGCACGTTGACACAGAATGAGATCAGCTCTGTGTGGAAAGAGCCTATCAGTTTTGTAGTGACACACTTGAG
--------------------
NM_004543.4:c.7432-2025_7536+372del
NM_004543.4:c.7431+1917_7536+372del
--------------------
NM_000180.3:c.3233-3236dup
NM_000180.3:c.2226dup
--------------------
NM_032383.3:c.1303+1G>A
NM_032383.3:c.1304T>A
--------------------
NM_001011.3:c.148+1G>A
NM_001011.3:c.149=
--------------------
NM_000158.3:c.2052+5289_2052+5297delinsTGTTTTTTACATGACAGGT
NM_000158.3:c.2053-3358_2053-3350delinsTGTTTTTTACATGACAGGT
--------------------
NM_003611.2:c.2122-2125dup
NM_003611.2:c.1654+9dup
--------------------
NM_001017975.4:c.1687-1G>C
NM_001017975.4:c.1686G>C
--------------------
NM_006904.6:c.1776+1523dup
NM_006904.6:c.1777-723dup
--------------------
NM_001122679.1:c.2968-2A>T
NM_001122679.1:c.2966G>T
Some of these are def invalid, though in that case, they should throw an error?
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been stalled for 7 days with no activity.
This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.