Announcing cdot: a data provider to load lots of transcripts fast
davmlaw opened this issue · 23 comments
Hi, thanks for the HGVS project, it's great.
I've written a data provider with 5.5 times as many transcripts as UTA, plus a REST service (more than 10x faster than public UTA, and it works behind firewalls):
pip install cdot
from cdot.hgvs.dataproviders import RESTDataProvider
hdp = RESTDataProvider()
It takes a different approach from UTA, converting existing RefSeq/Ensembl GTFs to JSON.
This lets you use flat files instead of Postgres, or a trivial (~300 line) REST service. It aims to be fast and simple, and doesn't implement all of the provider functionality yet (or maybe ever - i.e. just transcript/genome conversions).
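For example, a minimal sketch of the flat-file usage (the JSON file name is illustrative; substitute whichever cdot data file you've downloaded):

from cdot.hgvs.dataproviders import JSONDataProvider
from hgvs.assemblymapper import AssemblyMapper
from hgvs.parser import Parser

# Transcript alignments come from a local cdot JSON file - no Postgres needed
hdp = JSONDataProvider(["cdot-refseq.grch38.json.gz"])
am = AssemblyMapper(hdp, assembly_name="GRCh38", alt_aln_method="splign")
var_c = Parser().parse_hgvs_variant("NM_000546.6:c.215C>G")
print(am.c_to_g(var_c))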
See project: https://github.com/SACGF/cdot
This is a glue project, and everything is MIT-licensed, so I'd be happy to merge it into the core HGVS code or make it a biocommons project if you'll have it - see https://github.com/SACGF/cdot/wiki/Project-goals---directions
Thanks!
OMG, I am so thrilled that this exists! I had hoped that something like this would arise to solve clear issues with UTA currency, but also as a way to make it much easier to support custom transcripts. I haven't tried cdot yet, but the direction is spot-on.
What does it take to load from RefSeq? I have ~6 years of NCBI GFF3 archives. That might be useful to generate historical transcript alignments.
Kudos! And thanks for the contribution.
Let's do talk about working together on this... are you on the biocommons slack workspace? Try this: https://join.slack.com/t/biocommons/shared_invite/zt-12xzu037o-abkp_8a8Ta1Sc8JaBkpyqA (expires 2022-03-02).
Hi, as an example, here's the script that downloads and generates RefSeq data for GRCh37:
https://github.com/SACGF/cdot/blob/main/generate_transcript_data/refseq_transcripts_grch37.sh
There are scripts for GRCh38 and Ensembl in the same directory.
I'll join Slack at work tomorrow. I'm also keen to define a JSON transcript format - with the super-long-term goal of a GA4GH standard and of pushing Ensembl and RefSeq to provide an API for every transcript they've ever made - but yeah, happy to build whatever glue is needed in the meantime.
@davmlaw I successfully used cdot because it provides a flat file for transcript mapping, and thus in principle needs no internet connection to map coordinates from, say, coding to genomic. However, this fails without a connection:
from cdot.hgvs.dataproviders import JSONDataProvider
from hgvs.assemblymapper import AssemblyMapper
from hgvs.parser import Parser
hdp = JSONDataProvider(['cdot-0.2.1.refseq.grch37_grch38.json.gz'])
am = AssemblyMapper(
    hdp,
    assembly_name='GRCh37',
    alt_aln_method='splign',  # blat, splign, genewise
    replace_reference=True)
hp = Parser()
var_c = hp.parse_hgvs_variant('NM_001637.3:c.100A>T')
am.c_to_g(var_c)
# HGVSDataNotAvailableError: Failed to fetch NM_001637.3 from bioutils.seqfetcher (network fetching) (Failed to fetch NM_001637.3 (HTTPSConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=...): Max retries exceeded with url: /entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_001637.3&rettype=fasta&seq_start=501&seq_stop=501&tool=bioutils&email=biocommons-dev@googlegroups.com (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at ...>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))))
You mentioned
This allows you to use flat files instead of Postgres [...]
so what am I doing wrong in the above snippet? To clarify, my aim is to map coding to genomic coordinates without an internet connection. Thanks for your help!
I haven't used cdot yet, so I can't provide complete guidance.
The error you received is from seqfetcher, which fetches sequences during normalization. This is distinct from the transcript alignments that cdot provides.
It appears that seqfetcher failed because the call was rejected by NCBI. That could be due to a sporadic outage at NCBI, failure of those credentials, too many recent attempts from your IP, local network issues, or something else.
See https://hgvs.readthedocs.io/en/stable/installation.html for details. You should think of cdot as replacing uta. Consider installing seqrepo to obviate remote sequence lookups.
Thank you for your answer @reece !
Consider installing seqrepo to obviate remote sequence lookups.
How does that work? That is, how can I use SeqRepo to avoid looking up sequences remotely?
Yes - exactly as Reece said - HGVS requires a way to get sequences (SeqRepo) and transcripts (UTA/cdot).
You can install them all locally; for SeqRepo, copy/paste the three command-line snippets from the instructions here, then set the HGVS_SEQREPO_DIR environment variable before you call HGVS.
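For example, a minimal offline sketch (assuming a SeqRepo snapshot at /usr/local/share/seqrepo/latest and a downloaded cdot JSON file; both paths are illustrative):

import os
# Point hgvs at the local SeqRepo *before* the data provider is constructed
os.environ["HGVS_SEQREPO_DIR"] = "/usr/local/share/seqrepo/latest"

from cdot.hgvs.dataproviders import JSONDataProvider
from hgvs.assemblymapper import AssemblyMapper
from hgvs.parser import Parser

hdp = JSONDataProvider(["cdot-0.2.1.refseq.grch37_grch38.json.gz"])
am = AssemblyMapper(hdp, assembly_name='GRCh37',
                    alt_aln_method='splign', replace_reference=True)
var_c = Parser().parse_hgvs_variant('NM_001637.3:c.100A>T')
print(am.c_to_g(var_c))  # sequences now come from local SeqRepo, not NCBI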
Hi @davmlaw, I share the same excitement as @reece! cdot simplifies the DevOps of provisioning and maintaining a relational database inside Kubernetes. I tried out the cdot data provider for hgvs (https://github.com/SACGF/cdot). The two main gaps I see are missing functionality for (1) getting the predicted amino-acid-change HGVS nomenclature (critical for clinical reporting) and (2) all possible transcripts for a given variant position. Are you planning to add these to the cdot data provider? I am happy to open these as issues in your GitHub repo if that helps. Thanks!!
Hi, I've added the issues to the cdot repo (linked above); please add any further info if needed. I hope to find time within a few weeks and will ping you when done.
Hi @roysomak4 - I've released a new version of cdot (0.2.8) that handles protein identifiers (so c_to_p works) and also returns all possible transcripts for a given position (this only works on local JSON files, not yet on the REST server).
If you use local JSON files, you'll need to download the latest data files.
Let me know if there are any issues.
Thanks @davmlaw! I will run my tests on the local JSON file and share the results shortly.
Hi @davmlaw, the c_to_p function is working fine. I checked with one EGFR and one TP53 variant, and the correct p. nomenclature is generated for both.
The function that should return possible transcripts for a given variant returns an empty list (example below) regardless of the variant:
>>> import hgvs.parser
>>> from hgvs.assemblymapper import AssemblyMapper
>>> from cdot.hgvs.dataproviders import JSONDataProvider
>>> hdp = JSONDataProvider(["./cdot-0.2.8.refseq.grch38.json.gz"])
>>> am = AssemblyMapper(hdp, assembly_name='GRCh38', alt_aln_method='splign', replace_reference=True)
>>> hp = hgvs.parser.Parser()
>>> hgvs_g = 'NC_000017.11:g.7676154G>C'
>>> var_g = hp.parse_hgvs_variant(hgvs_g)
>>> transcripts = am.relevant_transcripts(var_g)
>>> transcripts
[]
>>> hgvs_g = 'NC_000007.14:g.55181370G>A'
>>> var_g = hp.parse_hgvs_variant(hgvs_g)
>>> transcripts = am.relevant_transcripts(var_g)
>>> transcripts
[]
I am using GRCh38
Thanks,
Somak
Hi, this was due to an off-by-one error (I assumed var_g.posedit.pos.start.base would be one past the end).
I've fixed it and released an updated cdot; the data files haven't changed, just the library.
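For context, a minimal sketch of the coordinate subtlety involved (illustrative only, not cdot's actual code): HGVS g. positions are 1-based and inclusive, while interval trees typically use half-open intervals, so the query end must be extended by one.

# Illustrative only - not cdot's actual implementation
from intervaltree import IntervalTree

tree = IntervalTree()
# A transcript covering genomic bases 100..200 (1-based, inclusive)
# is stored as the half-open interval [100, 201)
tree[100:201] = "NM_EXAMPLE.1"  # hypothetical accession

start = end = 150  # an SNV: start == end in HGVS coordinates
# Query [start, end + 1) so a single-base variant still overlaps
hits = tree[start:end + 1]
print([iv.data for iv in hits])  # ['NM_EXAMPLE.1']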
Thanks for the quick fix! I can confirm that the transcript function is working now (see log below).
In terms of speed, I observed that starting with v0.2.8, loading the JSON file takes much longer than it did in the prior versions I tested. I am loading cdot-0.2.8.refseq.grch38.json.gz. It used to be almost instantaneous in versions prior to 0.2.8; with v0.2.9 it takes about 10-15 seconds. I tried loading the data from an NVMe SSD and there seems to be no change in the load time.
>>> hgvs_g = 'NC_000017.11:g.7676154G>C'
>>> hp = hgvs.parser.Parser()
>>> var_g = hp.parse_hgvs_variant(hgvs_g)
>>> str(var_g)
'NC_000017.11:g.7676154G>C'
>>> var_c = am.g_to_c(var_g, 'NM_000546.6')
>>> str(var_c)
'NM_000546.6:c.215C>G'
>>> var_p = am.c_to_p(var_c)
>>> str(var_p)
'NP_000537.3:p.(Pro72Arg)'
>>> transcripts = am.relevant_transcripts(var_g)
>>> transcripts
['NM_000546.5', 'NM_001126114.2', 'NM_001126118.1', 'NM_001126112.2', 'NM_001126113.2', 'NM_001276695.1', 'NM_001276696.1', 'NM_001276761.1', 'NM_001276760.1', 'NM_001126113.3', 'NM_001126114.3', 'NM_001126118.2', 'NM_001276695.3', 'NM_001126112.3', 'NM_001276696.3', 'NM_001276760.3', 'NM_001276761.3', 'NM_000546.6', 'NM_001276696.2', 'NM_001276695.2', 'NM_001276760.2', 'NM_001276761.2']
Hi, yeah, it takes a while to build the interval tree, but it only does so lazily, when you first need it.
If you don't use that functionality, load time should be the same as before.
You can't store the interval tree as JSON, but you can pickle it, so I may look into that as an optional local cache - something like the sketch below.
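A hedged sketch of the caching idea only (the cache file name and build function are hypothetical, not cdot's API):

import pickle
from pathlib import Path

CACHE = Path("transcript_intervals.pickle")  # hypothetical cache file

def load_or_build_tree(build_tree):
    """Return the interval tree, building it once and pickling it for reuse."""
    if CACHE.exists():
        with CACHE.open("rb") as f:
            return pickle.load(f)  # fast path: skip the 10-15s build
    tree = build_tree()  # slow path: construct the tree from the JSON data
    with CACHE.open("wb") as f:
        pickle.dump(tree, f)
    return tree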
This issue was closed by stalebot. It has been reopened to give more time for community review. See biocommons coding guidelines for stale issue and pull request policies. This resurrection is expected to be a one-time event.
Hi, cdot has been used by a few other people now, and I think I've ironed out the bugs.
I might make a pull request adding some (commented-out) data provider options to the documentation examples, along the lines of the sketch below.
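For example (the JSON file name is illustrative):

# Default: connect to public UTA (Postgres)
from hgvs.dataproviders.uta import connect
hdp = connect()

# Alternative: cdot REST service
# from cdot.hgvs.dataproviders import RESTDataProvider
# hdp = RESTDataProvider()

# Alternative: cdot local JSON flat file
# from cdot.hgvs.dataproviders import JSONDataProvider
# hdp = JSONDataProvider(["cdot-refseq.grch38.json.gz"])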
A few things to finish off first, but I hope to make a pull request adding cdot to the docs in a week or two.
Happy to let this close for now