SACGF/cdot

Split code and data releases?

holtgrewe opened this issue · 5 comments

It might be worth creating a pip-installable cdot package that has the reusable Python code for GFF parsing etc. and would continue to live in one repository.

The data builds could then go to a separate directory, and one could have two series of releases: one using data identical to the VEP releases and one that aggregates all historical releases.

It's a good idea to split them, though I am not sure whether we need a separate pip package. That may confuse some people.

I think you can just make as many releases as you want - we could label them as dates maybe for simplicity

You are right, one can probably just check out a given tree-ish (git hash, branch, tag) of the tools directory.

As for releases, I guess it makes sense to think about different streams of releases:

  • as many transcripts as possible with all latest and greatest versions
  • equivalent to VEP releases

One could think of "latest NCBI", but I think having 2-4 releases a year that match VEP is better.

On second thought, maybe just doing a release that matches the latest VEP/Ensembl release would be sufficient. There, attach one file per genome release for each of RefSeq and Ensembl, matching the release used by VEP. Also attach files that have "as many transcripts as possible". One would thus just track VEP releases, which would make things more predictable.

Another idea is to track the versions of cdot and release as ${VEP_VERSION}+${CDOT_VERSION}.
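To make the combined label concrete, here is a rough sketch of how such a release name could be built and split again. The helper names (`format_release`, `parse_release`) and the example version numbers are made up for illustration; nothing like this exists in cdot as far as this thread shows.

```python
from typing import NamedTuple


class CombinedRelease(NamedTuple):
    """Hypothetical container for the two halves of a combined release label."""
    vep_version: str
    cdot_version: str


def format_release(vep_version: str, cdot_version: str) -> str:
    """Join the VEP and cdot versions with '+', as proposed in the comment."""
    return f"{vep_version}+{cdot_version}"


def parse_release(label: str) -> CombinedRelease:
    """Split a combined label back into its VEP and cdot parts."""
    vep, sep, cdot = label.partition("+")
    if not sep or not vep or not cdot:
        raise ValueError(f"not a combined release label: {label!r}")
    return CombinedRelease(vep, cdot)


# Illustrative values only, not real release numbers.
label = format_release("110", "0.2.22")
release = parse_release(label)
```

One upside of this scheme is that the VEP part stays readable on its own, so a user who only cares about matching their VEP install can ignore everything after the `+`.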

What do you think?

There are a lot of different concepts mushed together into a "cdot version":

  1. cdot client version / pip package
  2. data version: the client major/minor needs to be the same to read the data, but there is no need to regenerate historical GFFs if the data hasn't changed
  3. releasing new GFFs when they come out, then updating "all historical transcripts" files

For (2), I think we should move to a new repo. That would also fix the setup issue in pull request 5: just have a requirements.txt, we don't need to give it a pip package.

For (3), I think we can just do it like you said. I think we should keep the data hosted as releases on this repo (cdot client), even though they are generated by cdot data and will carry that version. There will be a note saying you can download any data that has the same major/minor version.
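As a sketch of the "same major/minor" rule, a client-side check could look roughly like this. The function names are hypothetical and it assumes plain `major.minor.patch` version strings; it is not the actual cdot implementation.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a 'major.minor.patch' string, ignoring a leading 'v'."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def client_can_read(client_version: str, data_version: str) -> bool:
    """Data is readable when client and data agree on major and minor; patch may differ."""
    return major_minor(client_version) == major_minor(data_version)
```

With this rule, a patch-level client release would not require regenerating any data files, which matches the point about not rebuilding historical GFFs when the data hasn't changed.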

I looked, and there is very little code shared between client and data.

Thinking on it a bit more, splitting repos would lead to people raising issues in wrong places, and the repo is already pretty small in the grand scheme of things.

I will decouple the client code and the data generation code as much as possible.

I think we just need a requirements.txt and a separate JSON schema version

We can always split into 2 repos later if we decide it's best.

To summarise:

  • Kept same repo
  • Data/code versions now separate, data driven by generate_transcript_data.json_schema_version.JSON_SCHEMA_VERSION
  • I tag the data releases with tags like "data_v0.2.22" to distinguish them from code tags, e.g. the next code tag is v0.2.22

Will see how this goes; we can change it later if we come up with a better way.
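For anyone scripting against the releases, telling the two kinds of tag apart could look roughly like this. It assumes the tag shapes are exactly the ones described above ("data_v" prefix for data, plain "v" for code); `classify_tag` is a hypothetical helper, not part of cdot.

```python
import re
from typing import Tuple

# Matches "v0.2.22" (code) and "data_v0.2.22" (data); the pattern is an assumption
# based on the two example tags in this thread.
TAG_PATTERN = re.compile(r"^(?P<kind>data_)?v(?P<version>\d+\.\d+\.\d+)$")


def classify_tag(tag: str) -> Tuple[str, str]:
    """Return ("data" or "code", version string) for a release tag."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognised tag: {tag!r}")
    kind = "data" if match.group("kind") else "code"
    return kind, match.group("version")
```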