Coordinate is out of bounds error when roundtripping GRCh38 variants via `g_to_c` + `c_to_g` and `relevant_transcripts`
mihaitodor opened this issue · 5 comments
Describe the bug
I'm parsing a g.
variant, then I'm fetching some relevant transcripts for it via the GRCh38
AssemblyMapper, then I'm projecting it to any of the NM
relevant transcript and then I'm projecting that back to a g.
variant. Naively, what I think should happen is that it should be able to round-trip and print the input variant, but right now it's just throwing this error. Note that it does work as expected if you use, for example variant = 'NC_000011.10:g.8263343T>C'
. Also, it works if I use GRCh37
.
To Reproduce
Steps to reproduce the behavior:
import hgvs.assemblymapper
import hgvs.dataproviders.uta
import hgvs.parser
database_schema = 'uta_20210129b'
data_provider = hgvs.dataproviders.uta.connect(
db_url=f"postgresql://anonymous:anonymous@uta.biocommons.org/uta/{database_schema}")
b38_assembly_mapper = hgvs.assemblymapper.AssemblyMapper(
data_provider, assembly_name='GRCh38', alt_aln_method='splign')
parser = hgvs.parser.Parser()
variant = 'NC_000001.10:g.145592073A>T'
parsed_variant = parser.parse_hgvs_variant(variant)
transcripts = b38_assembly_mapper.relevant_transcripts(parsed_variant)
print(transcripts)
relevant_transcript = 'NM_006468.7'
# relevant_transcript = 'NM_001303456.1'
var_c = b38_assembly_mapper.g_to_c(parsed_variant, relevant_transcript)
b38_projected_variant = b38_assembly_mapper.c_to_g(var_c)
print(b38_projected_variant)
Running the above code fails with the following error:
Traceback (most recent call last):
File "/Users/mihaitodor/Projects/genomex/spdi/./test.py", line 25, in <module>
b38_projected_variant = b38_assembly_mapper.c_to_g(var_c)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/hgvs/assemblymapper.py", line 115, in c_to_g
var_out = super(AssemblyMapper, self).c_to_g(var_c, alt_ac, alt_aln_method=self.alt_aln_method)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/hgvs/variantmapper.py", line 283, in c_to_g
pos_g = mapper.c_to_g(var_c.posedit.pos)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/hgvs/alignmentmapper.py", line 277, in c_to_g
return self.n_to_g(self.c_to_n(c_interval), strict_bounds=strict_bounds)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/hgvs/alignmentmapper.py", line 263, in c_to_n
n_start = pos_c_to_n(c_interval.start)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/hgvs/alignmentmapper.py", line 260, in pos_c_to_n
raise HGVSInvalidIntervalError(f"c.{pos} coordinate is out of bounds")
hgvs.exceptions.HGVSInvalidIntervalError: c.*578 coordinate is out of bounds
Expected behavior
The code should print NC_000001.10:g.145592073A>T
at the end.
Additional context
@reece mentioned on the biocommons
Slack that this looks like a bug.
Looking into this, it turns out that the hgvs_c string here is NM_006468.7:c.*578=
, so the UTR region of this transcript (in which that variant lies) does have a different sequence than the reference genome. We call that a "transcript-ref-disagree" position.
Somehow in assembly 38 this position seems to be outside of transcript coding space, therefore there is a coordinate is out of bounds
error message. To work around here, you could set hgvs.global_config.mapping.strict_bounds = False
. Doing so results in NC_000001.11:g.145842814_145842815=
Looking into the underlying alignment in UTA, The last exon (that is fully coding for the UTR) has alignment issues in both assemblies. in assembly 37 this transcript has one exon more but the alignment is 183=1X377=3I1027=
. Assembly 38 is one exon shorter and the second to last has an alignment of 476=1588I
. I believe that's why that specific position comes out as "out of bounds".
I would conclude from that that neither assembly is a good match for the terminal UTR region here, and we should prob not assume that we can map variants in that region across assemblies.
I don't think there's a hgvs bug here, this is just the nature of the reference assemblies...
Thank you for looking into this @andreasprlic and for providing such a detailed explanation!
Given this issue, do you think it would make sense to bubble up strict_bounds
as an optional parameter for c_to_g
in AssemblyMapper
and VariantMapper
? That way, users could disable this validation selectively. In my case, I'm OK to disable it globally and that seems to work fine.
This issue was auto-closed due to inactivity, but I think it's related
#608
@mihaitodor if your goal is identifying if the location of a variant in another genome assembly, I am concerned about trying to do that in regions where the assemblies have changed a lot. Just because you can compute some coordinates does not mean these are biologically "the same". I really think in the example you provided above, the lifted over coordinates should not be trusted. As such I would not encourage you to map across assemblies when strict_bounds=False are necessary. I also would recommend to think about a QC approach to make sure that any lifted over variant can be considered to be equivalent in the context of both assemblies, otherwise treat them as distinct variants.
Thank you @andreasprlic! Indeed, in such cases it's probably better to error out. Initially, we had some code based on pyliftover which I guess would have similar issues? I opened #711 after @reece mentioned that it would be handy to have some direct liftover support in hgvs.
The code I'm trying to add some features to is just a reference implementation for now, but indeed, we'll have to be stricter in production-ready implementations.