Announcing cdot: a data provider to load lots of transcripts fast
davmlaw opened this issue · 23 comments
Hi, thanks for the HGVS project, it's great.
I've written a data provider with 5.5 times as many transcripts as UTA, plus a REST service (more than 10x faster than public UTA, and it works behind firewalls):
pip install cdot
from cdot.hgvs.dataproviders import RESTDataProvider
hdp = RESTDataProvider()
It takes a different approach from UTA, converting existing RefSeq/Ensembl GTFs to JSON.
This lets you use flat files instead of Postgres, or a trivial (~300 line) REST service. It aims to be fast and simple, and doesn't implement all of the provider functionality yet (or maybe ever - i.e. just transcript/genome conversions).
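For example, a minimal sketch of the flat-file usage (the JSON file name is illustrative; substitute whichever cdot data file you've downloaded):

from cdot.hgvs.dataproviders import JSONDataProvider
from hgvs.assemblymapper import AssemblyMapper
from hgvs.parser import Parser

# Transcript alignments come from a local cdot JSON file - no Postgres needed
hdp = JSONDataProvider(["cdot-refseq.grch38.json.gz"])
am = AssemblyMapper(hdp, assembly_name="GRCh38", alt_aln_method="splign")
var_c = Parser().parse_hgvs_variant("NM_000546.6:c.215C>G")
print(am.c_to_g(var_c))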
See project: https://github.com/SACGF/cdot
This is a glue project, and everything is MIT-licensed, so I'd be happy to merge it into the core HGVS code or make it a biocommons project if you'll have it - see https://github.com/SACGF/cdot/wiki/Project-goals---directions
Thanks!
OMG, I am so thrilled that this exists! I had hoped that something like this would arise to solve clear issues with UTA currency, but also as a way to make it much easier to support custom transcripts. I haven't tried cdot yet, but the direction is spot-on.
What does it take to load from RefSeq? I have ~6 years of NCBI GFF3 archives. That might be useful to generate historical transcript alignments.
Kudos! And thanks for the contribution.
Let's do talk about working together on this... are you on the biocommons slack workspace? Try this: https://join.slack.com/t/biocommons/shared_invite/zt-12xzu037o-abkp_8a8Ta1Sc8JaBkpyqA (expires 2022-03-02).
Hi, as an example, here's the script that downloads and generates RefSeq data for GRCh37:
https://github.com/SACGF/cdot/blob/main/generate_transcript_data/refseq_transcripts_grch37.sh
There are scripts for GRCh38 and Ensembl in the same directory.
I'll join Slack at work tomorrow. I'm also keen to define a JSON transcript format - with the super-long-term goal of a GA4GH standard and of pushing Ensembl and RefSeq to provide an API for every transcript they've ever made - but yeah, happy to build whatever glue is needed in the meantime.
@davmlaw I successfully used cdot because it provides a flat file for transcript mapping, and thus in principle needs no internet connection to map coordinates from, say, coding to genomic. However, this fails without a connection:
from cdot.hgvs.dataproviders import JSONDataProvider
from hgvs.assemblymapper import AssemblyMapper
from hgvs.parser import Parser
hdp = JSONDataProvider(['cdot-0.2.1.refseq.grch37_grch38.json.gz'])
am = AssemblyMapper(
    hdp,
    assembly_name='GRCh37',
    alt_aln_method='splign',  # blat, splign, genewise
    replace_reference=True)
hp = Parser()
var_c = hp.parse_hgvs_variant('NM_001637.3:c.100A>T')
am.c_to_g(var_c)
# HGVSDataNotAvailableError: Failed to fetch NM_001637.3 from bioutils.seqfetcher (network fetching) (Failed to fetch NM_001637.3 (HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=...): Max retries exceeded with url: /entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_001637.3&rettype=fasta&seq_start=501&seq_stop=501&tool=bioutils&email=biocommons-dev@googlegroups.com (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at ...>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))))
You mentioned
This allows you to use flat files instead of Postgres [...]
so what am I doing wrong in the above snippet? To clarify, my aim is to map coding to genomic coordinates without an internet connection. Thanks for your help!
I haven't used cdot yet, so I can't provide complete guidance.
The error you received is from seqfetcher, which fetches sequences during normalization. This is distinct from the transcript alignments that cdot provides.
It appears that seqfetcher failed because the call was rejected by NCBI. That could be due to a sporadic outage at NCBI, failure of those credentials, too many recent attempts from your IP, local network issues, or something else.
See https://hgvs.readthedocs.io/en/stable/installation.html for details. You should think of cdot as replacing uta. Consider installing seqrepo to obviate remote sequence lookups.
Thank you for your answer @reece !
Consider installing seqrepo to obviate remote sequence lookups.
How does that work? That is, how can I use SeqRepo to avoid looking up sequences remotely?
Yes - exactly as Reece said - HGVS requires a way to get sequences (SeqRepo) and transcripts (UTA/cdot).
You can install them all locally; for SeqRepo, copy/paste the three command-line snippets from the instructions here, then set the HGVS_SEQREPO_DIR environment variable before you call HGVS.
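For example, a minimal offline sketch (assuming a SeqRepo snapshot at /usr/local/share/seqrepo/latest and a downloaded cdot JSON file; both paths are illustrative):

import os
# Point hgvs at the local SeqRepo *before* the data provider is constructed
os.environ["HGVS_SEQREPO_DIR"] = "/usr/local/share/seqrepo/latest"

from cdot.hgvs.dataproviders import JSONDataProvider
from hgvs.assemblymapper import AssemblyMapper
from hgvs.parser import Parser

hdp = JSONDataProvider(["cdot-0.2.1.refseq.grch37_grch38.json.gz"])
am = AssemblyMapper(hdp, assembly_name='GRCh37',
                    alt_aln_method='splign', replace_reference=True)
var_c = Parser().parse_hgvs_variant('NM_001637.3:c.100A>T')
print(am.c_to_g(var_c))  # sequences now come from local SeqRepo, not NCBI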
Hi @davmlaw, I share the same excitement as @reece! cdot simplifies the DevOps of provisioning and maintaining a relational database inside Kubernetes. I tried out the cdot data provider for hgvs (https://github.com/SACGF/cdot). The two main gaps I see are missing functionality for (1) getting the predicted amino-acid-change HGVS nomenclature (critical for clinical reporting) and (2) all possible transcripts for a given variant position. Are you planning to add these to the cdot data provider? I am happy to open these as issues in your GitHub repo if that helps. Thanks!!
Hi, I've added the issues to the cdot repo (linked above); please add any further info if needed. I hope to find time within a few weeks and will ping you when done.
Hi @roysomak4 - I've released a new version of cdot (0.2.8) that handles protein identifiers (so c_to_p works) and also returns all possible transcripts for a given position (this only works on local JSON files, not yet on the REST server).
If you use local JSON files, you'll need to download the latest data files.
Let me know if there are any issues.
Thanks @davmlaw! I will run my tests on the local JSON file and share the results shortly.
Hi @davmlaw, the c_to_p function is working fine. I checked with one EGFR and one TP53 variant, and the correct p. nomenclature is generated for both.
The function that should return possible transcripts for a given variant returns an empty list (example below) regardless of the variant:
>>> import hgvs.parser
>>> from hgvs.assemblymapper import AssemblyMapper
>>> from cdot.hgvs.dataproviders import JSONDataProvider
>>> hdp = JSONDataProvider(["./cdot-0.2.8.refseq.grch38.json.gz"])
>>> am = AssemblyMapper(hdp, assembly_name='GRCh38', alt_aln_method='splign', replace_reference=True)
>>> hp = hgvs.parser.Parser()
>>> hgvs_g = 'NC_000017.11:g.7676154G>C'
>>> var_g = hp.parse_hgvs_variant(hgvs_g)
>>> transcripts = am.relevant_transcripts(var_g)
>>> transcripts
[]
>>> hgvs_g = 'NC_000007.14:g.55181370G>A'
>>> var_g = hp.parse_hgvs_variant(hgvs_g)
>>> transcripts = am.relevant_transcripts(var_g)
>>> transcripts
[]
I am using GRCh38
Thanks,
Somak
Hi, this was due to an off-by-one error (I assumed var_g.posedit.pos.start.base would be one past the end).
I've fixed it and released an updated cdot; the data files haven't changed, just the library.
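For context, a minimal sketch of the coordinate subtlety involved (illustrative only, not cdot's actual code): HGVS g. positions are 1-based and inclusive, while interval trees typically use half-open intervals, so the query end must be extended by one.

# Illustrative only - not cdot's actual implementation
from intervaltree import IntervalTree

tree = IntervalTree()
# A transcript covering genomic bases 100..200 (1-based, inclusive)
# is stored as the half-open interval [100, 201)
tree[100:201] = "NM_EXAMPLE.1"  # hypothetical accession

start = end = 150  # an SNV: start == end in HGVS coordinates
# Query [start, end + 1) so a single-base variant still overlaps
hits = tree[start:end + 1]
print([iv.data for iv in hits])  # ['NM_EXAMPLE.1']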
Thanks for the quick fix! I can confirm that the transcript function is working now (see log below).
In terms of speed, I observed that starting with v0.2.8, loading the JSON file takes much longer than it did in the prior versions I tested. I am loading cdot-0.2.8.refseq.grch38.json.gz. It used to be almost instantaneous in versions prior to 0.2.8; with v0.2.9 it takes about 10-15 seconds. I tried loading the data from an NVMe SSD and there seems to be no change in the load time.
>>> hgvs_g = 'NC_000017.11:g.7676154G>C'
>>> hp = hgvs.parser.Parser()
>>> var_g = hp.parse_hgvs_variant(hgvs_g)
>>> str(var_g)
'NC_000017.11:g.7676154G>C'
>>> var_c = am.g_to_c(var_g, 'NM_000546.6')
>>> str(var_c)
'NM_000546.6:c.215C>G'
>>> var_p = am.c_to_p(var_c)
>>> str(var_p)
'NP_000537.3:p.(Pro72Arg)'
>>> transcripts = am.relevant_transcripts(var_g)
>>> transcripts
['NM_000546.5', 'NM_001126114.2', 'NM_001126118.1', 'NM_001126112.2', 'NM_001126113.2', 'NM_001276695.1', 'NM_001276696.1', 'NM_001276761.1', 'NM_001276760.1', 'NM_001126113.3', 'NM_001126114.3', 'NM_001126118.2', 'NM_001276695.3', 'NM_001126112.3', 'NM_001276696.3', 'NM_001276760.3', 'NM_001276761.3', 'NM_000546.6', 'NM_001276696.2', 'NM_001276695.2', 'NM_001276760.2', 'NM_001276761.2']
Hi, yeah, it takes a while to build the interval tree, but it only does so lazily, when you first need it.
If you don't use that functionality, load time should be the same as before.
You can't store the interval tree as JSON, but you can pickle it, so I may look into that as an optional local cache - something like the sketch below.
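A hedged sketch of the caching idea only (the cache file name and build function are hypothetical, not cdot's API):

import pickle
from pathlib import Path

CACHE = Path("transcript_intervals.pickle")  # hypothetical cache file

def load_or_build_tree(build_tree):
    """Return the interval tree, building it once and pickling it for reuse."""
    if CACHE.exists():
        with CACHE.open("rb") as f:
            return pickle.load(f)  # fast path: skip the 10-15s build
    tree = build_tree()  # slow path: construct the tree from the JSON data
    with CACHE.open("wb") as f:
        pickle.dump(tree, f)
    return tree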
This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event.
Hi, cdot has been used by a few other people now, and I think I've ironed out the bugs.
I might make a pull request adding some (commented-out) data provider options to the documentation examples, along the lines of the sketch below.
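For example (the JSON file name is illustrative):

# Default: connect to public UTA (Postgres)
from hgvs.dataproviders.uta import connect
hdp = connect()

# Alternative: cdot REST service
# from cdot.hgvs.dataproviders import RESTDataProvider
# hdp = RESTDataProvider()

# Alternative: cdot local JSON flat file
# from cdot.hgvs.dataproviders import JSONDataProvider
# hdp = JSONDataProvider(["cdot-refseq.grch38.json.gz"])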
A few things to finish off first, but I hope to make a pull request adding cdot to the docs in a week or two.
Happy to let this close for now