HajoRijgersberg/OM

Use Common Serialization Style

Closed this issue · 14 comments

jmkeil commented

Following up #42 and #49 (@dr-shorthair), I would also like to propose to use a standard serialization for OM.

This has the advantage that one can do automated changes beyond regex replacements, also automatically triggered after each push. A disadvantage would be that comments in the documents get lost (but they could be moved into annotations).

I think the best choice would be the OWL API serialization because:

  • it puts the ontology resource at the top of the document
  • it produces a stable order of the statements
  • it is used by Protege, Robot, …
  • if some tool uses something else, the style can be restored with a single Robot call (see the sketch below)
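For example, restoring the common style after an edit with some other tool could look roughly like this (a sketch; the file names are placeholders and ROBOT is assumed to be on the PATH):

```sh
# Re-serialize with the OWL API (bundled in ROBOT) to restore the common style;
# "owl" here is ROBOT's name for RDF/XML output.
robot convert --input om-2.0.rdf --format owl --output om-2.0.normalized.rdf
```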

An alternative would be, for example, the Apache Jena serialization:

  • it preserves the exact set of triples
    • it does not automatically add new (inferred) statements
    • it does not swap the subject and object of statements whose predicate is a symmetric property
  • CLI tools are available too
    but
  • it does not produce a stable statement order -> bad for git diff
  • it does not put the ontology resource at the top of the document

One could also use this occasion to switch to Turtle as the main development serialization. From my point of view, it has the following advantages compared to RDF/XML (a short comparison follows after the list):

  • smaller
  • eases diffs and (automated) merges (fewer merge conflicts)
  • provides consistent handling of prefixes
  • eases string manipulation based changes (regex replace, scripts)
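To illustrate the size difference, here is one and the same (made-up) statement in both serializations, assuming the usual prefix declarations at the top of each file:

```xml
<owl:Class rdf:about="http://www.ontology-of-units-of-measure.org/resource/om-2/Unit">
  <rdfs:label xml:lang="en">unit</rdfs:label>
</owl:Class>
```

```ttl
om:Unit a owl:Class ;
    rdfs:label "unit"@en .
```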

Automated generation of other serializations would be possible with a release pipeline. (But that is another issue.)

If you want this, the following steps must be done:

  • put every XML comment that should survive into an annotation (e.g. something like dcterms:rights for the copyright comment at the top of the document; sketched below)
  • use some tool to do the initial conversion
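For the copyright comment, such an annotation could look roughly like this (a sketch in Turtle; it assumes the ontology IRI is the om-2 namespace IRI, and the rights text would be copied from the existing XML comment):

```ttl
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://www.ontology-of-units-of-measure.org/resource/om-2/>
    dcterms:rights "…"@en .  # text of the existing copyright comment
```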

HajoRijgersberg commented

Thanx Jan Martin, for this comprehensive overview. Very difficult to weigh the different pros and cons, I immediately admit. Such issues have always been the reason why I have worked with a manual text file so far. But apart from that, it is always good to look ahead. I really appreciate that.
As to the Turtle serialization, that seems to be something that could/may be done. Do you perhaps know about an automated tool that could do that for us, leaving the order of all productions in the text file intact including all comments etc.?

jmkeil commented

As to the Turtle serialization, that seems to be something that could/may be done. Do you perhaps know about an automated tool that could do that for us, leaving the order of all productions in the text file intact including all comments etc.?

No, I don't know one. That is a very special requirement, and I doubt that one exists or that it would be worth the effort to develop one. The switch from RDF/XML to Turtle was intended as just a minor addition, given that a switch to a standard serialization will be done. The main issue is the use of a standard serialization, as it would enable the use of all kinds of tools for automated updates of the ontology and would therefore ease and speed up the work on many other issues.

dr-shorthair commented

OWLAPI is the most widely used serializer.

jmkeil commented

@HajoRijgersberg: Any progress on this? To better understand your requirement: What do you use the comments and order for?

HajoRijgersberg commented

Thanx both, I have to dive into OWLAPI.
The comments and order, or in general the well-structured, human-readable file, serve the latter purpose: human readability. This is an important factor in the transparency of OM.

I have been thinking a lot in the meanwhile about the entire issue. I would like to know your opinion on the following (I think not ideal) approach: what if I created a script that writes the entire OM, and different versions of OM (DL, EL, etc.)? The script would read from a database containing all units, quantities and so on, and the relations between them (and other concepts). Both the database and the script could be in this Git repository. Every time the database gets altered/extended and/or the script has changed, a new version of OM could be generated using the script.

jmkeil commented

I think this kind of automation is the direction to go to improve the development process. That is what I meant by

automated changes beyond regex replacements, also automatically triggered after each push

GitHub Actions / pipelines can be triggered automatically by pushes, releases, pull requests, …, and they are configured in the repository itself.
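A minimal sketch of such a workflow (a hypothetical file .github/workflows/serializations.yml; it assumes an earlier step has installed ROBOT and put it on the PATH):

```yaml
# Hypothetical sketch of .github/workflows/serializations.yml
name: generate-serializations
on: [push]  # run on every push
jobs:
  convert:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # (an install step for ROBOT would go here)
      - name: Generate Turtle serialization
        run: robot convert --input om-2.0.rdf --format ttl --output om-2.0.ttl
```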

But I recommend sticking with RDF-based technologies: TTL instead of CSVs or databases, and SPARQL CONSTRUCT or SPARQL UPDATE where feasible.
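For example, a one-time change could then be expressed as a SPARQL UPDATE like this (an illustrative sketch with made-up labels, not one of the actual changes needed for OM):

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Illustrative sketch: rename an English label wherever it occurs
DELETE { ?s rdfs:label "old label"@en }
INSERT { ?s rdfs:label "new label"@en }
WHERE  { ?s rdfs:label "old label"@en }
```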

From my point of view, the first steps are:

  1. apply standard serialization: #80 (this issue)
  2. set up basic release pipeline: #90

After that, one could start to add automated generation scripts.

HajoRijgersberg commented

Thanx Jan Martin, the idea of pipelines sounds really good. Two questions, if you allow me:
1. Earlier I argued why we (still) have to work with a manual rdf file. In short: because of transparency through human readability, in turn through structure, order and comments. I write 'still' in brackets because I still keep the hope that there will be an ontology editor that will maintain such kinds of things. But I assume that in principle it would be no problem for the pipelines to be based on that manual source file? I thought I'd check with you for optimal clarity.
2. As I wrote earlier, I would certainly like to switch to Turtle (in the long term). I write 'in the long term' in brackets because it is difficult to say when exactly that will be (due to time limitations, etc.). But do I understand correctly that the Turtle format would not be essential for the pipelines? I guess they can also handle XML formats? Of course, derived versions of the ontology in Turtle may already be generated. This question again to you for optimal clarity.

jmkeil commented

Earlier I argued why we (still) have to work with a manual rdf file. In short: because of transparency through human readability, in turn through structure, order and comments.

I don't think the new serialization will worsen transparency. The TTL serialization of the OWL API has a (different) clear structure too:

  • ordered by resource type (ontology, annotation property, data property, object property, class, individual, annotations) and for each type in alphabetical order
  • clear indentation for human readability (see the fragment below)
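For example, the body of an OWL API Turtle file typically looks roughly like this (an illustrative fragment in the style Protégé produces; the names are made up):

```ttl
#################################################################
#    Classes
#################################################################

###  http://example.org/ExampleClass
:ExampleClass rdf:type owl:Class ;
    rdfs:label "example class"@en .
```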

But human readability is, on the one hand, also a matter of taste. On the other hand, you as the author are used to the structure of the file you wrote. I completely understand and respect that you don't want to give that up lightly.

I write 'still' in brackets because I still keep the hope that there will be an ontology editor that will maintain such kinds of things.

I don't expect that there will ever be such an editor. The reason is that (with some exceptions, like databases) tools typically parse/de-serialize a file into an internal representation at the beginning to work on it, and at the end they generate a new serialization from the internal representation. That way, serialization and de-serialization are completely separate modules of the software, which makes them easier to maintain. Adapting an existing serialization would require an (additional) module that is concerned with both serialization and de-serialization, which is probably error-prone and much harder to maintain (especially if the format is not designed for in-place updates).

But I assume that in principle it would be no problem for the pipelines to be based on that manual source file? I thought I'd check with you for optimal clarity.
2. As I wrote earlier, I would certainly like to switch to Turtle (in the long term). I write 'in the long term' in brackets because it is difficult to say when exactly that will be (due to time limitations, etc.). But do I understand correctly that the Turtle format would not be essential for the pipelines? I guess they can also handle XML formats?

Yes, a pipeline could also work on the current serialization to perform regular tasks. However, it would not be possible to use tools for one-time changes that get pushed back into the repository. For example, it would not be possible to use SPARQL UPDATE to perform the updates for issues like #79 and #84. Given that neither you nor anyone else has the time to do all these changes manually soon, the further evolution of OM would benefit significantly from the option to use automation for one-time changes.

Of course, derived versions of the ontology in Turtle may already be generated.

I see the main benefit of Turtle in the ontology maintenance, as it eases (from my point of view) editing, comparing and merging. The serialization format of the file one imports into another ontology or uploads into a triple store does not really matter. But my major point in this issue is the use of a standard serialization.

HajoRijgersberg commented

Thanx again for your response, Jan Martin.

I don't think the new serialization will worsen transparency.

It's not so much the ttl; I would certainly like to move to ttl, for two reasons: it is even more human-readable, and it is more popular (I think).

The TTL serialization of the OWL API has a (different) clear structure too:
ordered by resource type (ontology, annotation property, data property, object property, class, individual, annotations) and for each type in alphabetical order
clear indentation for human readability

The structure of om-2.0.rdf is different: it is organized, among other things, by application area. (It should even be improved: simplified, as a matter of fact. The second level, namely, is per quantity and unit (to put it simply). These will be integrated in the future. It is good that I tried this out (the current structure), as I have learnt how to improve it in a next step.)
The idea is to move to application-area-oriented subontologies in the future. In the present structure I have made a preliminary step toward that.

But human readability is, on the one hand, also a matter of taste. On the other hand, you as the author are used to the structure of the file you wrote.

True, but it is more than that: the quality of OM is supported in two ways: by your ABECTO, and by me as the author being able to overview the complete file and its contents (which also enables other people to do so). This is of course a huge guarantee (although never an ultimate one) for the quality of OM, which we can never give up. (There's also the problem that we could never go back once we had given it up, but I'm not sure how important that problem is compared to the statement about the guarantee of the quality of OM.)

I completely understand and respect

I appreciate that, really a lot. One of my primary concerns, namely, is that I may disappoint you in the above matters, where you do so much for OM. It would really make me sad if I disappointed you, I'm telling you honestly. On the other hand, of course, I have to carry the responsibility for OM to the extent that I am able to carry it. I'm sure you'll understand. And I'm also sure that we will make steps with your pipelines, maybe not everything as you envisioned it, but definitely important steps, like automatically deriving ttl, DL, EL, etc. versions from the original om-2.0.rdf every time it gets updated. That would really be very great! :) Of course I'll describe all that in the readme of the OM Git.

that you don't want to give that up lightly.

As described above, unfortunately we cannot do that.

I don't expect that there will ever be such an editor. (...)

That editor should "only" be able to remember the order of statements and comments as much as possible and put them back afterwards. I don't think that should be part of the serialization; RDF, namely, does not support such functionality (I think?).

Yes, a pipeline could also work on the current serialization to perform regular tasks.

That is great to read. I hope we (you) could start with that! :)

However, it would not be possible to use tools for one-time changes that get pushed back into the repository. For example, it would not be possible to use SPARQL UPDATE to perform the updates for issues like #79 and #84.

I understand. We have to postpone that to OM 3.0. I would be a fool if I didn't develop that one in ttl.

Given that neither you nor anyone else has the time to do all these changes manually soon,

Not soon, but I have manually developed OM 1 and 2. Number 3 will also be developed, manually, by me. I am making preparations presently.

the further evolution of OM would benefit significantly from the option to use automation for one-time changes.

True, but as argued above, in summary: the price in terms of the transparency, and therefore the quality, of OM is, unfortunately, too high. :/

I see the main benefit of Turtle in the ontology maintenance, as it eases (from my point of view) editing, comparing and merging. The serialization format of the file one imports into another ontology or uploads into a triple store does not really matter. But my major point in this issue is the use of a standard serialization.

As to the standard serialization, purely for my optimal understanding: the XML format is also a standard serialization, isn't it?

jmkeil commented

As to the standard serialization, purely for my optimal understanding: the XML format is also a standard serialization, isn't it?

Yes, RDF/XML is a standardized serialization, and OM is compliant with it. My wording on this point was not ideal: the point is to use a common serialization style (compliant with the standardized serialization format), produced by a widely used serialization implementation that produces a stable order of statements, so that the style and order can be restored after using arbitrary tools to update the ontology.

The idea is to move to application-area-oriented subontologies in the future. In the present structure I have made a preliminary step toward that.

Did you already consider splitting up the ontology into several files? That could enable a specific order and the use of a common serialization style at the same time. With a release pipeline, they could be merged into one file later on.
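For example (a sketch with hypothetical file names, using ROBOT's merge command):

```sh
# Merge discipline-specific files into one release file
robot merge --input om-core.ttl --input om-astronomy.ttl --output om-2.0.ttl
```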

Number 3 will also be developed, manually, by me. I am making preparations presently.

Will you share earlier states for feedback / discussion, e.g. in another branch?

I close this issue now, as there is a clear decision not to change the serialization style.

HajoRijgersberg commented

Thanx for your answer, Jan Martin. Please allow me to ask some further questions, although this issue has been closed:

The point is to use a common serialization style

But, again for my understanding, RDF/XML is also a common serialization style, right?

a widely used serialization implementation that produces a stable order of statements, so that the style and order can be restored after using arbitrary tools to update the ontology.

That would be good, but only if it were ordered according to application area, among other things. Note that we already have this fixed order, but I assume you mean using an automated tool (not manually, as I'm doing with OM).

Did you already consider to split up the ontology into several files?

Yes, OM is organized in such a way that a preliminary step is made toward splitting it up into these several application-area-specific ontologies in the future.
By the way, I would like to call these application areas 'disciplines' in the future.

Will you share earlier states for feedback / discussion, e.g. in another branch?

Yes, that's a very good idea. A number of the present issues, by the way, relate to this future version, in the sense that I would like to deal with them / incorporate them in OM 3.0.

jmkeil commented

But, again for my understanding, RDF/XML is also a common serialization style, right?

RDF/XML is a serialization language / RDF file format specified in a W3C recommendation. However, this standard provides some flexibility, e.g. in terms of statement order, indentation, …. With the term serialization style (a self-made term, borrowed from code style; maybe there is a better term for it) I refer to additional rules that narrow down the flexibility in the serialization language specification to gain readability and comparability of the serialization. That way, RDF/XML is a standardized serialization language, which can be used in several serialization styles.
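For illustration, here is one and the same (made-up) statement in two different but equally standard-compliant RDF/XML styles:

```xml
<!-- style A: typed node element -->
<owl:Class rdf:about="http://example.org/ExampleClass"/>

<!-- style B: plain description with an explicit type statement -->
<rdf:Description rdf:about="http://example.org/ExampleClass">
  <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
</rdf:Description>
```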

HajoRijgersberg commented

Clear (I think), thanx!