geneontology/reactome-go-cams

Re-synchronize reactome-go-cams with known good source (noctua-models master)

kltm opened this issue · 10 comments

kltm commented

@vanaukenk @dustine32 @ukemi

I've taken a little bit of a look at this repo (reactome-go-cams), noctua-models master, and noctua-models dev, and I think I have a better handle on what's going on. I believe that:

  • The shared reactome (R-HSA-*) models in noctua-models master and noctua-models dev are pretty much the same, except for ordering within the file (based on eyeballing them and on the fact that all the files are exactly the same size)
  • There are more reactome models in noctua-models dev than master; these are likely due to previous iterations and trials for reactome that were not wiped out in dev, but overwritten
  • The files in reactome-go-cams are quite a bit different to the ones in noctua-models: they seem to be structured differently, have different content patterns, and seem to mostly be larger
    • Given their earlier date (14m vs 8m), I believe that what we have in production is "truer"
    • There has been some churn in the reactome files in noctua-models master (i.e. automated saves to GH); while it may be due to tweaks to minerva that have happened, I have not checked whether any people have ever modified them
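
The "same except for ordering" observation above could be checked a little more firmly than by eye and file size. This is only a minimal sketch: it compares the files line by line while ignoring order, the function name `same_lines_ignoring_order` is hypothetical, and a line-level check is an assumption that would still report a difference if blank nodes were relabeled or triples regrouped, even where the underlying RDF graphs are equivalent.

```python
from collections import Counter
from pathlib import Path


def same_lines_ignoring_order(path_a, path_b):
    """Return True if both files hold the same multiset of non-blank lines.

    Hypothetical helper: a line-level approximation only, not a real
    RDF graph comparison.
    """
    def line_counts(path):
        text = Path(path).read_text(encoding="utf-8")
        # Count each stripped, non-empty line so duplicates are respected.
        return Counter(line.strip() for line in text.splitlines() if line.strip())

    return line_counts(path_a) == line_counts(path_b)
```

If this returns False for a pair of files, that alone does not show the models differ in content; it only says the serialized lines differ.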

What I would propose is:

  • The contents of noctua-models master matching R-HSA-* are copied into reactome-go-cams and we declare that to be the source of truth (for things like resetting noctua-dev or emergencies)
  • Moving forward, when we next update reactome-go-cams from the reactome upstream (not sure when that's scheduled), all operations pass through reactome-go-cams first
    • ^ I believe this was violated at some point in the last year, given the differences
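
The copy step in the proposal above could look roughly like the following. It is a sketch under assumptions, not the procedure actually used: the function name `copy_reactome_models` is hypothetical, the `.ttl` suffix in the default pattern is assumed from the file names mentioned in this thread, and the directory layout of both checkouts has to be supplied by the caller.

```python
import shutil
from pathlib import Path


def copy_reactome_models(src_dir, dest_dir, pattern="R-HSA-*.ttl"):
    """Copy every file in src_dir matching `pattern` into dest_dir.

    Hypothetical helper. Returns the sorted list of copied file names.
    Files already in dest_dir that do not exist in src_dir are left alone.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for model in sorted(src.glob(pattern)):
        # copy2 keeps timestamps, which may help when comparing dates later.
        shutil.copy2(model, dest / model.name)
        copied.append(model.name)
    return copied
```

Because stale files in the destination are not removed, a true "wipe the slate clean" reset would also need to clear the destination first.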

I'd appreciate feedback on that proposal, as well as somebody maybe double checking my thought process here.

Confirming that Ben's 2020-11-24 models (geneontology/noctua-models#162) are the last "real" load/generation of the Reactome R-HSA-* models. These are what's currently in noctua-models master and can get copied to this repo (I can do this).

I agree that the next load should first make a stop here in this repo.

kltm commented

Great--thank you for pulling that out!

Hm. As part of that, there would be two "new" models:

	R-HSA-9648002.ttl
	R-HSA-9670095.ttl
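
For reference, one way such a list of "new" models might be produced is a set difference over file names between two checkouts. This is a hedged sketch only: `model_names` and `compare_model_sets` are hypothetical names, and the directories to compare are whatever the caller passes in.

```python
from pathlib import Path


def model_names(directory, pattern="R-HSA-*.ttl"):
    """Return the set of file names in `directory` matching `pattern`."""
    return {p.name for p in Path(directory).glob(pattern)}


def compare_model_sets(reference_dir, candidate_dir):
    """List model files present in only one of the two directories.

    Hypothetical helper: compares names only, not file contents.
    """
    reference = model_names(reference_dir)
    candidate = model_names(candidate_dir)
    return {
        "only_in_candidate": sorted(candidate - reference),
        "only_in_reference": sorted(reference - candidate),
    }
```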

Also, looking at the models on noctua-models master and these, they do not seem to be "matching". For example:
https://github.com/geneontology/noctua-models/blob/09ad209599eee50191f774c64fbe0ed448814812/models/R-HSA-74217.ttl
and
https://github.com/geneontology/noctua-models/blob/master/models/R-HSA-74217.ttl
are quite different.

I'm thinking that we did some kind of in-place update at some point since then? Or maybe the serialization is a little different after getting cycled through minerva? Perhaps @balhoff may remember?

Either way, at this point, it may be "safer" to go with what is currently in production and being used in the pipeline (and wipe the slate clean). It would be nice to understand/remember what happened and why, but I suspect using noctua-models master as the template for now may be the most expedient path forward in either case.

Thoughts?

ukemi commented

It doesn't surprise me that these models are very different. @deustp01 and I have been working intensively in this area.

ukemi commented

The other piece that may be of use here is that I use the Reactome models as templates for my hand-curated mouse models. To do this, I almost always rearrange the models for better viewing, then I save. So technically I do modify the models.

kltm commented

@ukemi Okay, that may explain at least some of the churn then as models being modified and cycled through minerva, which has a different serialization (?) than what Ben originally put in, as well as some shifts in content.

While I think there is maybe a bit of a mystery here still, I'm still in camp Let's Rebase Off of Master. Would anybody have a problem with that? Again, in the future when we bring another cycle of Reactome in, we'd be doing it through this repo and then replacing the current R-HSA-* models with them.

ukemi commented

I'm not exactly sure what you mean by "Rebase" off of Master. But this sounds best to me since this is the version of the models that we used for the paper and they are the ones that I've been messing with layout-wise. However, we should also completely update the models wrt Reactome's latest release.

"However, we should also completely update the models wrt Reactome's latest release."

There's a Reactome release scheduled for mid-September so it would make sense to time the GO-CAM update to use that version of Reactome.

kltm commented

Okay, great! @dustine32 made a PR and I took it.
Moving forward, issues to think about:

  • when to schedule refreshes from reactome
  • what to do about things like layout
  • double-checking the serialization and testing the incoming models (using what we have now as reference)

ukemi commented

"when to schedule refreshes from reactome"

I would go with @deustp01's proposal above and schedule refreshes to coincide with official Reactome releases.

"schedule refreshes to coincide with official Reactome releases."

Practical question - how often to refresh? Four times a year? Less often? My usability guesses are that twice a year might be enough for now and that a reliable schedule will be more important to users than the exact frequency. I hope that, as a few GO-CAM re-builds are done, we will figure out how to automate the process and also identify additional QA features on the Reactome side that will ensure a clean GO-CAM conversion. At that point, there can perfectly well be a refresh with every Reactome release.