Transcript aliases - HGVS using gene names (canonical transcripts) or LRG
davmlaw opened this issue · 2 comments
Often someone will want to resolve a gene symbol, ie: BRCA2:c.36del
Data providers could take an argument eg resolve_genes_with_canonical_transcripts=True
which would look up the MANE select transcript, then substitute it for eg NM_000059.4:c.36del
The same goes with LRG, eg Variant Validator gives "LRG_1:g.8638G>T automapped to equivalent RefSeq record
NG_007400.1:g.8638G>T"
PROBLEM
The way the biocommons HGVS code current works (and our data provider) is to retrieve the transcript (build independent) then get coordinates out later. So switching the transcripts out before knowing what build it is could cause troubles, as eg Eg NUDT4B - has transcript NM_001355407 which is MANE select (v2) but no present at all in 37
Example: OR8G1 has NM_001002905.1 as the "37 RefSeq select" and NM_001002905.2 as "38 MANE Select" (NM_001002905.2 does not exist in 37) - this is pretty rare though (out of 19k MANE transcripts - 20 don't have anything for 37, 22 don't have the version)
This would be seemingly straight forward to do inside the data provider by just switching out transcripts using an aliases (eg LRG or gene names for canonical transcript, eg:
# cdot.hgvs.dataproviders.json_data_provider.JSONDataProvider
self.aliases = {
"LRG_1t1": "NM_000088.3",
}
def _resolve_transcript_alias(self, tx_ac):
""" Returns an alias if we have one, otherwise tx_ac unmodified """
return self.aliases.get(tx_ac, tx_ac)
def _get_transcript(self, tx_ac):
tx_ac = self._resolve_transcript_alias(tx_ac)
return self.transcripts.get(tx_ac)
def get_seq(self, ac, start_i=None, end_i=None):
ac = self._resolve_transcript_alias(ac)
return super().get_seq(ac, start_i=start_i, end_i=end_i)
LRG
wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/LRG_RefSeqGene
Records look like:
9606 1277 COL1A1 NG_007400.1 LRG_1 NM_000088.3 t1 NP_000079.2 p1 reference standard
Seemingly build independent
You could possibly do it this way then just throw an exception the few times it fails (ie returns a MANE transcript that isn't in 37 - note only if you have a data file with both genome builds too) the downside is there is no way to return a message, eg that you substituted out a transcript in there, so this may be slightly dodgy
Few thoughts while looking through code:
Interval tree and gene stuff should be separated as interval takes way longer. Can move the canonical into that loop too (or just always do it)
You may want to indicate a sorting mechanism to pick what gets swapped out eg refseq vs ensembl etc