add a CLI command to dump all models in JSON response format
balhoff opened this issue · 15 comments
See geneontology/api-gorest-2021#6 (comment)
@kltm is this something we should go ahead and start on?
@balhoff If it's not too hard: yes. This is part of a discretionary project, so we are clear to move ahead.
The idea would be to get a dump of separate JSON files into a directory, each looking like the JSON model contents of the returned responses. Alternatively, if a mega-file were somehow much easier, we could feed that into jq or something and work it out.
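For the mega-file route, here's a minimal sketch of the splitting step (assuming the combined dump is a JSON array of model objects that each carry an `id` field; the file and directory names are made up):

```sh
# Hypothetical: split a combined dump (models.json, a JSON array of model
# objects) into one pretty-printed file per model, named after each "id".
mkdir -p jsonout
jq -c '.[]' models.json | while read -r model; do
  # Sanitize the model id (e.g. "gomodel:...") into a filename.
  id=$(printf '%s' "$model" | jq -r '.id' | tr ':/' '__')
  printf '%s\n' "$model" | jq '.' > "jsonout/${id}.json"
done
```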
Considering how we're going to use this, a tarballed version may eventually find its way into releases.
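(For what it's worth, that packaging step could be as simple as the following, with made-up paths:)

```sh
# Hypothetical release packaging: bundle the per-model JSON dump directory.
tar -czf noctua-models-json.tar.gz -C /tmp jsonout
```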
@balhoff Apologies for tagging you here again, but I think my message above crossed with your time away. I just wanted to let you know that I tested; note that we're using CURIEs rather than URIs for this use case (just like when communicating with noctua).
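A quick way to spot-check the identifiers in the dump (a sketch only; it assumes the dumped files follow the Minerva response layout, with an `individuals` array whose `type` entries carry `id` fields):

```sh
# We expect CURIEs like "MGI:MGI:1924374" here, not full URIs.
# The path and the JSON layout are assumptions.
jq -r '.individuals[]?.type[]?.id' jsonout/*.json | sort -u | head
```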
Great--thank you! I'm trying it out now.
@balhoff I attempted to run the full command, but it seemed to error out towards the end, after about an hour and a half of runtime, with:
```
2022-08-11 15:16:27,327 WARN (com.bigdata.rdf.ServiceProviderHook:171) Running.
2022-08-11 15:16:27,327 WARN (com.bigdata.rdf.ServiceProviderHook:171) Running.
java.lang.IllegalStateException: Manager on ontology OntologyID(OntologyIRI(<http://model.geneontology.org/MGI_MGI_1924374>) VersionIRI(<http://model.geneontology.org/MGI_MGI_1924374>)) is null; the ontology is no longer associated to a manager. Ensure the ontology is not being used after being removed from its manager.
    at uk.ac.manchester.cs.owl.owlapi.OWLImmutableOntologyImpl.getOWLOntologyManager(OWLImmutableOntologyImpl.java:202)
    at uk.ac.manchester.cs.owl.owlapi.concurrent.ConcurrentOWLOntologyImpl.withReadLock(ConcurrentOWLOntologyImpl.java:162)
    at uk.ac.manchester.cs.owl.owlapi.concurrent.ConcurrentOWLOntologyImpl.getOWLOntologyManager(ConcurrentOWLOntologyImpl.java:238)
    at org.geneontology.minerva.ModelContainer.getOWLOntologyManager(ModelContainer.java:95)
    at org.geneontology.minerva.ModelContainer.dispose(ModelContainer.java:103)
    at org.geneontology.minerva.CoreMolecularModelManager.unlinkModel(CoreMolecularModelManager.java:805)
    at org.geneontology.minerva.CoreMolecularModelManager.dispose(CoreMolecularModelManager.java:822)
    at org.geneontology.minerva.BlazegraphMolecularModelManager.dispose(BlazegraphMolecularModelManager.java:826)
    at org.geneontology.minerva.cli.CommandLineInterface.modelsToJSON(CommandLineInterface.java:530)
    at org.geneontology.minerva.cli.CommandLineInterface.main(CommandLineInterface.java:240)
```
It also seemed to come out a few models short of the full set:
```
sjcarbon@moiraine:/tmp/jsonout$:) ls -AlF | wc -l
41940
sjcarbon@moiraine:~/local/src/git/noctua-models[master]$:( ls -AlF models/ | wc -l
42130
```
But maybe that's because some models can't produce JSON for some reason? (Although it could also be due to an increase in models while I was doing the periodic update flush--I may have to double-check that.)
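(If it helps, one way to diff the two sets, assuming `.ttl` sources in noctua-models and `.json` files in the dump directory:)

```sh
# List model ids present in noctua-models but missing from the JSON dump.
# File extensions and paths are assumptions.
comm -23 <(ls models/ | sed 's/\.ttl$//' | sort) \
         <(ls /tmp/jsonout/ | sed 's/\.json$//' | sort)
```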
Any thoughts on this error?
Thanks for testing. I added parallelism to the output, since it was so slow. I'll disable this and see if it fixes the error; the IllegalStateException above looks like a model being disposed after its manager was already torn down, which would fit a race in the parallel dump loop. I suspect Minerva is not as robust to multithreading as we would like.
@kltm I updated the branch without the parallelism. The job ran to completion on my laptop.
Currently testing on the pipeline build machine.
@balhoff The output now seems in line with what I'd expect--thank you!
Noting that this took two hours on a pretty peppy machine--it might be good to explore what's going on here, as we are likely hitting the same issue writ small all the time in responses.
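One low-cost way to start exploring is to sample thread stacks while the dump runs (standard JDK tooling; `<pid>` stands in for the Minerva process id):

```sh
# Repeated jstack samples that keep landing in the same frames point
# at the hot spots in the dump.
for i in 1 2 3; do jstack <pid> >> stacks.txt; sleep 30; done
```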
From here, we can look at adding this to the pipeline, and then move on to supporting the GO-CAM API...
This ticket can be closed once this is added to the pipeline and the products are available in a location accessible to the API. geneontology/api-gorest-2021#6
@balhoff IIRC, you were thinking that there might be an optimization that could be done to help accelerate the JSON dump process?
So far I've just done some profiling; it looks like about half the time is spent reading models out of the database, and the other half running the queries that categorize nodes by high-level terms.