Edinburgh-Genome-Foundry/codon-usage-tables

Kazusa codon database is down

Closed this issue · 4 comments

https://www.kazusa.or.jp/codon/ is down (been down for a few days now). Wondering if you've got any thoughts for implementing a failover?

(PS Hello Zulko and EGF, we love the stuff you're working on!)

Zulko commented

I hit this problem too yesterday, and I added an optional "timeout" in my local version of the Python library, so it won't just hang when connecting to Kazusa. I will push it online soon. I don't know of any mirror for Kazusa, so if it proves unreliable the only long-term workaround is to add more tables to the tables folder.
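To illustrate the timeout-plus-local-tables idea, here is a minimal sketch using only the standard library. It is not the code from the library: `fetch_text`, `load_table`, the `opener` parameter and the `local_tables` dict are hypothetical names made up for this example.

```python
import socket
import urllib.error
import urllib.request


def fetch_text(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return the body at `url` as text, or None if the server is down or too slow.

    `opener` is a hypothetical hook so the network call can be swapped out in tests.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, socket.timeout, ConnectionError):
        # Unreachable host, refused connection, or no answer within `timeout`.
        return None


def load_table(url, local_tables, name, timeout=5.0, opener=urllib.request.urlopen):
    """Try the remote source first, then fall back to a table shipped locally.

    `local_tables` is an assumed dict mapping a table name to its raw text.
    Returns a (source, text) pair so callers know where the data came from.
    """
    text = fetch_text(url, timeout=timeout, opener=opener)
    if text is not None:
        return "remote", text
    if name in local_tables:
        return "local", local_tables[name]
    raise LookupError(f"{name!r}: remote source unavailable and no local copy")
```

Returning the source alongside the text is a design choice for this sketch only; it lets a caller warn the user when a possibly older local table was used instead of a fresh download.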

And thanks for the support, don't hesitate to reach out for anything related to our software (I'm always glad to learn it can be helpful outside of our foundry), or if you ever need a biofoundry - we're always happy to discuss projects. We're also hiring a computational person at the moment, in case you know anyone willing to move to Edinburgh (great environment, great city).

Hi,

I have been experiencing problems with Kazusa too lately. If you are still considering the option of adding more tables to the data folder, I suggest considering Hive as the source. Its tables are based on many more CDSs than Kazusa's, and the DB is regularly updated. Unfortunately it doesn't seem to offer programmatic access to its resources, but it should be fairly easy to parse the Refseq_CDS.tsv (or genbank_CDS.tsv) file from the page "Available Files to Download". The RefSeq file is 96 MB in total, but it seems to contain strain-level tables too.

I am happy to help with this if you like the idea. I would cluster at the species level and add tables with the average frequencies from their clades to the data folder. If it turns out to be too big for distribution, there could be an option to download and build it on the first run of the library.

Simone
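The species-level averaging proposed above could look roughly like the sketch below. It assumes the strain-level tables have already been parsed into `(species, {codon: frequency})` pairs; that input shape and the name `average_by_species` are assumptions for illustration, not the layout of the Hive files.

```python
from collections import defaultdict


def average_by_species(strain_tables):
    """Collapse strain-level codon tables into one averaged table per species.

    `strain_tables` is an iterable of (species, {codon: frequency}) pairs
    (a hypothetical, already-parsed input). The result is an unweighted mean:
    every strain counts equally, and a codon missing from a strain counts as 0.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for species, table in strain_tables:
        counts[species] += 1
        for codon, frequency in table.items():
            sums[species][codon] += frequency
    return {
        species: {codon: total / counts[species] for codon, total in codons.items()}
        for species, codons in sums.items()
    }
```

A mean weighted by the number of CDSs behind each strain table might be preferable if that count is available; the unweighted version is used here only to keep the example short.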

Zulko commented

Hello,

If Kazusa keeps giving you problems, I agree that having more tables in the repo (potentially from Kazusa) would be a good idea.

The one thing I worry about with adding Hive as a source is that it may confuse users as to which database (Kazusa or Hive) they are using. Plus I am not sure how the Washington Uni would react to us hosting a big chunk of their data.

I see two possible approaches:

  • Add Hive tables to this project, separated from the Kazusa ones, and let users determine which database they want to use via a global setting in their Python script. If the Hive codon tables are big (say, more than 10 MB), their download should be a pip install option.
  • Or fork this project under "hive codon tables" and make a Hive-specific repo (maybe with the same APIs as this one, so we can easily switch in projects).

Two final remarks:

  • I am not at the EGF anymore (@veghp is taking over the software projects of the foundry) and I am not sure how much I'll be able to help beyond commenting on questions and PRs (I'll find out soon).
  • There is a Synbiopython project which should get support for codon tables soon. Tagging @neilswainston in the discussion for awareness.