biocommons/hgvs

Is it possible to get exon information and/or gene name for a given transcript?

wlymanambry opened this issue · 5 comments

Is there a way to get information about the exons for a given variant, ie which exon it falls in as well as the CDS block that the variant falls in? Also, is there a way to get the gene name for a given transcript? I read through a lot of previous issues and I saw a mention of a UTA rest interface but couldn't find the info for these questions.

Thanks!

Possibly an easier way to do this is to annotate your variants, eg with Ensembl VEP - and then read off the exon ID that it provides.

But to do it with this library, HGVS talks with UTA contains transcript information, including exon coordinates. The data provider API is briefly documented

You can call UTA methods to eventually get exons, then iterate over them to work out what overlaps with the variant. To modify the example from the HGVS README file (warning: untested code)

import hgvs.parser
import hgvs.assemblymapper
import hgvs.dataproviders.uta

hdp = hgvs.dataproviders.uta.connect()
am = hgvs.assemblymapper.AssemblyMapper(hdp, assembly_name='GRCh37', alt_aln_method='splign', replace_reference=True)
hp = hgvs.parser.Parser()

hgvs_g = 'NC_000007.13:g.36561662C>T'
var_g = hp.parse_hgvs_variant(hgvs_g)


# identify transcripts that overlap this genomic variant
transcripts = am.relevant_transcripts(var_g)


variant_start = var_g.posedit.pos.start.base
variant_end = var_g.posedit.pos.end.base + 1

print(f"{var_g.ac} {variant_start}-{variant_end}")

for transcript_id in transcripts:
    tx_identity_info = hdp.get_tx_identity_info(transcript_id)
    hgnc_symbol = tx_identity_info["hgnc"]
    print(f"transcript: {transcript_id}, gene: {hgnc_symbol}")
    
    # I can't get CDS block but we can get this...
    cds_start = tx_identity_info["cds_start_i"]
    cds_end = tx_identity_info["cds_end_i"]

    for exon in hdp.get_tx_exons(transcript_id, var_g.ac, 'splign'):
        if variant_end > exon["alt_start_i"] and variant_start < exon["alt_end_i"]:
            exon_id = exon["ord"] + 1
            print(f"{transcript_id} exon_id: {exon_id}")

gives output:

NC_000007.13 36561662-36561663
transcript: NM_001177506.1, gene: AOAH
NM_001177506.1 exon_id: 20
transcript: NM_001637.3, gene: AOAH
NM_001637.3 exon_id: 20
transcript: NM_001177507.1, gene: AOAH
NM_001177507.1 exon_id: 19

the CDS block

The CDS id is not provided in UTA, so if you really, really need this, you'll have to read the GTF yourself, and perhaps put it into an IntervalTree

is there a way to get the gene name for a given transcript?

Example is in script above as well, but as a 1 off:

import hgvs.dataproviders.uta

hdp = hgvs.dataproviders.uta.connect()
hdp.get_tx_identity_info('NM_001177507.1')["hgnc"]

@wlymanambry hope this helped, are you able to close this issue if happy?

It did, thanks so much for the help! (it can be closed)

Just to say that this data is in UTA... (the underlying database used by the hgvs library)

reece commented

Thanks @davmlaw and @andreasprlic for replying.