draeger-lab/ModelPolisher

Current state

Closed this issue · 9 comments

A quick rundown of current issues that hinder productive use of ModelPolisher:

  • There was a logic error when using AnnotateDB: a check in one place was inverted and ADB was used when it should not have been and vice versa. This should be fixed in master after pulling in changes from the refactoring branch today
  • Both COBRAparser and JSONParser currently error on some models from bigg_models data and are being partially rewritten to fix this issue
  • identifiers.org (MIRIAM registry) changed to a JSON format from XML and the parser is currently being rewritten, i.e. some more recent annotations are not handled yet by ModelPolisher and the "Report this ID" message is written to the console for these
  • A docker bigg-1.6 image is available now, but only integrated in the bigg-1.6 branch, as both issues above need to be solved first, before we can guarantee compatibility with BiGG

JSONparser should now parse all models in bigg_models_data, however two issues remain:

  • Some (long) GPRs can't be parsed, which will require having a look at the respective code in the utils package.
  • Non-Integers in formulas cannot be parsed by the FBCSpeciesPlugin (e.g H96.4365P0.8756C37.9375K0.0017O6.9303S0.1371Na0.0001Fe1.7854N14.7005R-1.0Co0.0057 for M_bm_cofactors_c in iJB785) -> how to deal with this?

Implementing parsing for annotations and notes is blocked until integration with the new identifiers.org format is finished.

During the rewrite of the parser it became apparent, that we do not adhere to the BiGG specification when handling pseudoreactions (https://github.com/SBRG/bigg_models/wiki/BiGG-Models-ID-Specification-and-Guidelines), as the BiGG ID was prefixed "R_" in all cases.
This apparently led to some duplicate entries regarding demand/sink/exchange reactions and likely also biomass. A separate issue against bigg_models will be opened with all IDs and models affected.
Additionally some IDs with more than a few underscores were not parsed correctly.

The core of BiGGID parsing is rewritten now, but the corresponding code in COBRAParser, SBMLPolisher and BiGGAnnotation needs to be adapted, as there are some issues that need to be investigated.
Currently there is a Nullpointer issue regarding AnnotateDB and the mismatch between Gene and GPR IDs reappeared.

Next step is writing unit tests for the Parsers/BiGGId part to assure that everything is working as expected for all three of them.
Then integration with the new identifiers.org format will be finished and annotation related code refactored.

According to @bgoli it will become possible to encode non-integer molarities in the FBC package version 3. The specification text currently discussed is:

The optional attribute charge contains a signed double referring to the Species object’s charge (in terms of electrons, not the SI unit coulombs). Note, that unlike FBC versions one and two a Species may, for the purposes of charge, be interpreted as a pseudoisomer or aggregate molecule and may assume a non-integer value. Non-integer charges should be used with caution as their use may have unintended side-effects, for example, with respect to the accuracy of reaction balancing.

For now Formulas that cannot be parsed are ignored.
There are still problems with ClassCastExceptions in the GPR parsing code, however I'll open another issue to discuss this.

We also seem to have some annotations where, e.g. H20 is annotated as hydroxide additionally.
Need to check whether similar things happen for other species/reactions/etc. and queries need to be adapted.

Integration with the rewritten BiGGId class should now be finished, however there are some small issues with probably empty ids being produced somewhere (currently only seen when annotating species), so code needs to be checked for passing empty ids when creating BiGGIds. This also needs some further test cases before merging. Changes to the previous version will be discussed in another issue, as there are still some smaller things to discuss.

ncbigi annotations are no longer supported by identifiers.org, but should still be resolvable using the direct URL https://www.ncbi.nlm.nih.gov/protein/. I'll open another issue to clarify if this should be done.

There now is a progress bar for the annotation process, however it's slightly inaccurate due to gene products being added to the model if required by a reaction rule, so it jumps back to a lower percentage if too much additional gene products are created.
Some code I haven't pushed yet adds information on what is being processed currently for both the SBMLPolisher and BiGGAnnotation in the form of "Annotating <Species/Reaction/etc.> (step number/total)". This also helps a bit with debugging, as we now directly have context for logged information.

Annotation validation and correction is now centralized in the Registry class.
Retrieved URIs are first converted into identifiers.org URIs, and then ids are checked for validity and corrected, if possible.
Using information for alternative URIs should now allow to correctly convert each valid MIRIAM annotation to a identifiers.org URL.
All newly created created annotations should be https (will document this in another issue).

There are some additional issues tha remain to be done before merging into master and producing at least a beta release.

Retrieval of annotations from BiGGDB might be a bit wonky as described in SBRG/bigg_models#363.
Need to have a look whether this has to do with how the query is constructed or if we need to add a correction step directly after fetching the annotation.
Additionally both the functionality to retrieve bigg ids from annotations if no valid BiGGId is present for an entity and interaction with AnnotateDB need to be tested for the current changes.

And to add to all of that, what needs to be done most of all in the coming days is adding tests.
At least the parts covering parsing the registry, retrieving and correcting annotations and the BiGGId creation process should be covered by test cases before we have a new release.

Maybe it could be helpful to turn some of the statements from above into individual tracker items to work on.

AnnotateDB should not be used for now in ModelPolisher, as it currently retrieves wrong annotations.

@matthiaskoenig What is the current status of the AnnotateDB project?

@mephenor Could you open an issue for the wrong annotations on https://github.com/matthiaskoenig/annotatedb/issues
I am not aware of what the issues are. I just used the bigg database which was available at that time. So all the issues should be issues in the original bigg model database. If there is a newer version I can simply update the AnnotateDB project.

Without knowing what is the problem I cannot fix it ;)

The issue with AnnotatedDB is more likely a problem with the ModelPolisher integration/queries.
I get annotations that probably should be something like identifiers.org/envipath/650babc9-9d68-4b73-9332-11972ca26f7b/compound/799908db-b8c9-4982-86cb-1f225e2ad08c but are identifiers.org/650babc9-9d68-4b73-9332-11972ca26f7b/compound/799908db-b8c9-4982-86cb-1f225e2ad08c and sabiork.reactions for species.
There might be more, but I haven't had the time to check this thoroughly, though.
Sorry for the misleading wording.

@matthiaskoenig @draeger
I checked the query and some of the AnnotateDB content and this actually seems to be an issue with ADB content:

For sabiork.reaction target namespace both bigg.reaction and bigg.metabolite are present as source namespace in the mapping_view.

For envipath namespace the urlpattern is {$id}, but should rather be identifiers.org/envipath/{$id}, as this can be resolved.

I am currently checking, if there are more cases where both source namespaces are present, but the annotation is only applicable to one of them