add a CLI command to dump all models in JSON response format
balhoff opened this issue · 15 comments
See geneontology/api-gorest-2021#6 (comment)
@kltm is this something we should go ahead and start on?
@balhoff If it's not too hard: yes. This is part of a discretionary project, so we are clear to move ahead.
The idea would be to get a dump of separate JSON files into a directory, each looking like the JSON model contents of the returned responses. Alternatively, if a mega-file were somehow much easier, we could feed that into jq or something and work it out.
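For the mega-file route, here's a minimal sketch of the splitting step (assuming the combined dump is a JSON array of model objects that each carry an `id` field; the file and directory names are made up):

```sh
# Hypothetical: split a combined dump (models.json, a JSON array of model
# objects) into one pretty-printed file per model, named after each "id".
mkdir -p jsonout
jq -c '.[]' models.json | while read -r model; do
  # Sanitize the model id (e.g. "gomodel:...") into a filename.
  id=$(printf '%s' "$model" | jq -r '.id' | tr ':/' '__')
  printf '%s\n' "$model" | jq '.' > "jsonout/${id}.json"
done
```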
Considering how we're going to use this, a tarballed version may eventually find its way into releases.
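(For what it's worth, that packaging step could be as simple as the following, with made-up paths:)

```sh
# Hypothetical release packaging: bundle the per-model JSON dump directory.
tar -czf noctua-models-json.tar.gz -C /tmp jsonout
```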
@balhoff Apologies for tagging you here again, but I think my message above crossed with your time away. I just wanted to let you know that I tested; note that we're using CURIEs rather than URIs for this use case (just like when communicating with noctua).
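A quick way to spot-check the identifiers in the dump (a sketch only; it assumes the dumped files follow the Minerva response layout, with an `individuals` array whose `type` entries carry `id` fields):

```sh
# We expect CURIEs like "MGI:MGI:1924374" here, not full URIs.
# The path and the JSON layout are assumptions.
jq -r '.individuals[]?.type[]?.id' jsonout/*.json | sort -u | head
```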
Great--thank you! I'm trying it out now.
@balhoff I attempted to run the full command, but it seemed to error out towards the end, after about an hour and a half of runtime, with:
```
2022-08-11 15:16:27,327 WARN (com.bigdata.rdf.ServiceProviderHook:171) Running.
2022-08-11 15:16:27,327 WARN (com.bigdata.rdf.ServiceProviderHook:171) Running.
java.lang.IllegalStateException: Manager on ontology OntologyID(OntologyIRI(<http://model.geneontology.org/MGI_MGI_1924374>) VersionIRI(<http://model.geneontology.org/MGI_MGI_1924374>)) is null; the ontology is no longer associated to a manager. Ensure the ontology is not being used after being removed from its manager.
    at uk.ac.manchester.cs.owl.owlapi.OWLImmutableOntologyImpl.getOWLOntologyManager(OWLImmutableOntologyImpl.java:202)
    at uk.ac.manchester.cs.owl.owlapi.concurrent.ConcurrentOWLOntologyImpl.withReadLock(ConcurrentOWLOntologyImpl.java:162)
    at uk.ac.manchester.cs.owl.owlapi.concurrent.ConcurrentOWLOntologyImpl.getOWLOntologyManager(ConcurrentOWLOntologyImpl.java:238)
    at org.geneontology.minerva.ModelContainer.getOWLOntologyManager(ModelContainer.java:95)
    at org.geneontology.minerva.ModelContainer.dispose(ModelContainer.java:103)
    at org.geneontology.minerva.CoreMolecularModelManager.unlinkModel(CoreMolecularModelManager.java:805)
    at org.geneontology.minerva.CoreMolecularModelManager.dispose(CoreMolecularModelManager.java:822)
    at org.geneontology.minerva.BlazegraphMolecularModelManager.dispose(BlazegraphMolecularModelManager.java:826)
    at org.geneontology.minerva.cli.CommandLineInterface.modelsToJSON(CommandLineInterface.java:530)
    at org.geneontology.minerva.cli.CommandLineInterface.main(CommandLineInterface.java:240)
```
It also seemed to come out a few models short of the full set:
```
sjcarbon@moiraine:/tmp/jsonout$:) ls -AlF | wc -l
41940
sjcarbon@moiraine:~/local/src/git/noctua-models[master]$:( ls -AlF models/ | wc -l
42130
```
But maybe that's because some models can't produce JSON for some reason? (Although it could also be due to an increase in models while I was doing the periodic update flush--I may have to double-check that.)
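(If it helps, one way to diff the two sets, assuming `.ttl` sources in noctua-models and `.json` files in the dump directory:)

```sh
# List model ids present in noctua-models but missing from the JSON dump.
# File extensions and paths are assumptions.
comm -23 <(ls models/ | sed 's/\.ttl$//' | sort) \
         <(ls /tmp/jsonout/ | sed 's/\.json$//' | sort)
```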
Any thoughts on this error?
Thanks for testing. I added parallelism to the output, since it was so slow. I'll disable this and see if it fixes the error; the IllegalStateException above looks like a model being disposed after its manager was already torn down, which would fit a race in the parallel dump loop. I suspect Minerva is not as robust to multithreading as we would like.
@kltm I updated the branch without the parallelism. The job ran to completion on my laptop.
Currently testing on the pipeline build machine.
@balhoff The output now seems in line with what I'd expect--thank you!
Noting that this took two hours on a pretty peppy machine--it might be good to explore what's going on here, as we are likely hitting the same issue writ small all the time in responses.
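One low-cost way to start exploring is to sample thread stacks while the dump runs (standard JDK tooling; `<pid>` stands in for the Minerva process id):

```sh
# Repeated jstack samples that keep landing in the same frames point
# at the hot spots in the dump.
for i in 1 2 3; do jstack <pid> >> stacks.txt; sleep 30; done
```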
From here, we can look at adding this to the pipeline, and then move on to supporting the GO-CAM API...
This ticket can be closed once this is added to the pipeline and the products are available in a location accessible to the API. geneontology/api-gorest-2021#6
@balhoff IIRC, you were thinking that there might be an optimization that could be done to help accelerate the JSON dump process?
So far I've just done some profiling; it looks like about half the time is spent reading models out of the database, and the other half running the queries that categorize nodes by high-level terms.