SACGF/cdot

Transcript aliases - HGVS using gene names (canonical transcripts) or LRG

davmlaw opened this issue · 2 comments

Often someone will want to resolve a gene symbol, ie: BRCA2:c.36del

Data providers could take an argument eg resolve_genes_with_canonical_transcripts=True which would look up the MANE select transcript, then substitute it for eg NM_000059.4:c.36del

The same goes with LRG, eg Variant Validator gives "LRG_1:g.8638G>T automapped to equivalent RefSeq record
NG_007400.1:g.8638G>T"

PROBLEM

The way the biocommons HGVS code current works (and our data provider) is to retrieve the transcript (build independent) then get coordinates out later. So switching the transcripts out before knowing what build it is could cause troubles, as eg Eg NUDT4B - has transcript NM_001355407 which is MANE select (v2) but no present at all in 37

Example: OR8G1 has NM_001002905.1 as the "37 RefSeq select" and NM_001002905.2 as "38 MANE Select" (NM_001002905.2 does not exist in 37) - this is pretty rare though (out of 19k MANE transcripts - 20 don't have anything for 37, 22 don't have the version)

This would be seemingly straight forward to do inside the data provider by just switching out transcripts using an aliases (eg LRG or gene names for canonical transcript, eg:

# cdot.hgvs.dataproviders.json_data_provider.JSONDataProvider

self.aliases = {
    "LRG_1t1": "NM_000088.3",
}

    def _resolve_transcript_alias(self, tx_ac):
        """ Returns an alias if we have one, otherwise tx_ac unmodified """
        return self.aliases.get(tx_ac, tx_ac)

    def _get_transcript(self, tx_ac):
        tx_ac = self._resolve_transcript_alias(tx_ac)
        return self.transcripts.get(tx_ac)

    def get_seq(self, ac, start_i=None, end_i=None):
        ac = self._resolve_transcript_alias(ac)
        return super().get_seq(ac, start_i=start_i, end_i=end_i)

LRG

wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/LRG_RefSeqGene

Records look like:

9606	1277	COL1A1	NG_007400.1	LRG_1	NM_000088.3	t1	NP_000079.2	p1	reference standard

Seemingly build independent

You could possibly do it this way then just throw an exception the few times it fails (ie returns a MANE transcript that isn't in 37 - note only if you have a data file with both genome builds too) the downside is there is no way to return a message, eg that you substituted out a transcript in there, so this may be slightly dodgy

Few thoughts while looking through code:

Interval tree and gene stuff should be separated as interval takes way longer. Can move the canonical into that loop too (or just always do it)

You may want to indicate a sorting mechanism to pick what gets swapped out eg refseq vs ensembl etc