unitsofmeasurement/uom-systems

UCUM format seems broken for microliter.

Closed this issue · 18 comments

I'm using the following dependency:

<dependency>
  <groupId>systems.uom</groupId>
  <artifactId>systems-ucum-java8</artifactId>
  <version>0.7</version>
</dependency>
<dependency>
  <groupId>systems.uom</groupId>
  <artifactId>systems-quantity</artifactId>
  <version>0.7</version>
</dependency>

I have the following code:

final UnitFormat unitFormat =
    ServiceProvider.current().getUnitFormatService().getUnitFormat("CS");
final Unit<?> microliter = MetricPrefix.MICRO(Units.LITRE);
System.out.println(unitFormat.format(microliter)); // prints "nst"!

final Unit<?> microliter2 = unitFormat.parse("uL");
System.out.println(unitFormat.format(microliter2)); // prints "nst"!

Formatting MetricPrefix.MICRO(Units.LITRE) with the UCUM unit format should give me uL, but it gives me nst. Why?

Even worse, if I parse uL and the print it back _with the same UnitFormat instance, it gives me a different value. It isn't even providing round-trip consistency!!!

If I'm making a mistake here, it would be a really stupid one, because that code seems simple and straightforward.

What's the problem?

keilw commented

It looks like STERE = addUnit(new ProductUnit<Volume>(METER.pow(3))); overlaps both LITER definitions we now got in UCUM ;-/ How it gets to NANO instead of MICRO, that's another problem, but the identical definition of STERE and LITER is the main problem.

This is a huge problem. A blocker problem. i.e. 0.7 cannot be used with this behavior.

@magnonasc when can this be addressed? We'll need a new release as soon as possible with a fix.

keilw commented

The first question is, how important is STERE and which fields really use it?
https://en.wikipedia.org/wiki/Stere states, it is a legacy unit, used if at all in Scandinavia for wood volumes (maybe at IKEA?;-)

So a quick solution would be to comment it out for a 0.7.1 (quick fix, I had not planned it before a few weeks) release and see, if there are ways to handle it in UCUMFormat? The "late binding" UCUM libraries like UOMo also have all sorts of special cases, not just around LITER.

If it's in the UCUM, then shouldn't the implementation remember if I wanted a liter or a stere? If I parse "mL" why does it give me a nanostere? The implementation gets confused if there are two units with the same definition? Doesn't it keep track of the name?

keilw commented

the UCUMFormat implementation is pretty much modeled after EBNFUnitFormat. There is no "keeping track of names", it parses an expression. In the early days some of that was even preprocessed by JavaCC. We stopped doing that some while ago. See the JavaDoc in https://github.com/unitsofmeasurement/uom-se/blob/master/src/main/java/tec/uom/se/format/EBNFUnitFormat.java

keilw commented

The key are operations, it parses things that are based on and if that expression exists both behind LITER and STERE, then one erases the other. One would have to do some very detailed debugging to see what exactly happens, but the principles and operation parsing clearly points in that direction. Both CS and CI are affected btw, so one using dm³ does not really help either.

@keilw if you could work with @magnonasc to get to the bottom of this, I would appreciate it. I'm not very happy about the situation, and I'm stuck here with a large release that needs this --- and it's completely broken. Thanks for anything you can do.

keilw commented

@garretwilson would pruning the STERE help you for now? I am not sure, if 100% of the UCUM catalog is fully exposed, some fields still look commented out. And we're at this point not obliged to do all of them. I reopened #81 to cover as many of the active units in the system as we can. How the parser-centric method of evaluating operations and derived units can be improved, I cannot say if we find a quick fix that's available in a few days. Because UCUM has distinct names even for things that are essentially the same (also Units.METRE or SI.METRE vs. USCustomary.METER, but they are at least in separate systems) not just two LITER entries, it is more complex and problematic than the EBNF methodology developed in JSR 275

@keilw first I would prefer that someone get to the bottom of exactly why this is happening. (I won't to be sure that the "fix" actually fixes the problem.) Why does this happen only for uL and not for L, for example?

keilw commented

Is that the only prefix where it happens?

I don't know. @magnonasc please investigate and work with with @keilw to get to the bottom of this.

(To illustrate how frustrating this is, GlobalMentor, Inc. this week is releasing 1.0 of a reasonable-sized system for a large biotech client. Out of all the units, the first version is only using one: uL. We don't do any conversions. We even hard-code uL in the UI. Simply persisting this single unit using JSR 363 with uol-systems doesn't work. It makes me leery of the reliability of the whole uol-systems code.)

keilw commented

Sorry for that. I also use a much smaller implementation (it only requires 2 quantities, so something along the lines of https://github.com/unitsofmeasurement/uom-impl-enum is currently all it takes) for a client project with 2 major companies (one in the Global 50, the other one in Global 500 list) involved. Not many of them discuss changes or improvements in the open, but I know at least one other Giant pharma group already uses JSR 363 and some extensions (possibly their own in-house)

Unlike the RI https://github.com/unitsofmeasurement/unit-ri or API https://github.com/unitsofmeasurement/unit-api not every extension module applies code-coverage, so it might be a lot less than 50% in several cases. If you plan to use modules other than the RI or uom-se (which should have a similar coverage, but it is not the official RI for now) everyone who contributes to these should probably ensure code coverage or similar metrics (e.g. "codacy" for all modules other than the API has "B", that is not bad, mostly Checkstyle like nagging, but I know companies in Real Time or Embedded where it would have to be "A++" ;-) are as high as we and customers feel is appropriate.

You don't even want to know how bad some of the most popular Open Source tools or frameworks do or would do against some of these metrics ;-O

keilw commented

For Java SE 8 itself, it was less than a month after GA the first Update 5 came out: https://en.wikipedia.org/wiki/Java_version_history#Java_8_updates
Meaning, Update 1 was probably a week or so later, at least if you care to get it from OpenJDK ;-)

keilw commented

@garretwilson @magnonasc JSR-275 included a BNF definition in JavaCC: https://github.com/unitsofmeasurement/jsr-275/blob/master/src/main/javacc/javax/measure/unit/format/UnitParser.jj It was (must have been a Maven plugin) generated and then used by the Java code. Eventually we gave up doing that, but the definitions match what's behind https://github.com/unitsofmeasurement/uom-se/blob/master/src/main/java/tec/uom/se/format/EBNFUnitFormat.java.
Although UCUMFormat refers to a UCUMFormatParser.jj file I never saw it nor could I find it in JScience 5. It was probably find/replace of words.
https://github.com/unitsofmeasurement/uom-systems/blob/master/ucum-java8/src/main/java/systems/uom/ucum/format/UCUMFormat.java mentions, that the BNF in UCUM contained an error or still does. UCUM also follows a BNF, see http://unitsofmeasure.org/ucum.html#section-Syntax-Rules, it looks similar to that used by EBNFUnitFormat, but it is not necessarily the same. Whether the BNF of UCUM itself still contains flaws or the Java parser/lexer has to be adopted, we shall see. It seems that STERE was always there, but cannot say, if e.g. LITER was tested or worked back then.

keilw commented

@garretwilson @magnonasc Please check out a new test case https://github.com/unitsofmeasurement/uom-systems/blob/master/ucum-java8/src/test/java/systems/uom/ucum/format/UCUMFormatVolumeTest.java

UCUM is really exceptionally weird and error-prone when it comes to volumes and LITRE because of that redundant definition. Plus m³ overshadowed by STERE. The math is however always right, microliter becomes nanom³ (or nanostere), etc.

Parse operations look like they are perfectly fine. LITER, LITER_DM3 and STERE all get parsed correctly.
Because LITER has no C/I variant, you always get LITER_DM3 parsing context-insensitive.

Only format() currently seems to mix-up m³ and STERE, both are of course identical with 1 STERE=1m³.

keilw commented

@garretwilson @magnonasc After some probes into a few UCUM units, I created #97. Leaving this one open for now, but it looks like LITER, LITER_DM3 and STERE as such are not affected, while multiples and submultiples of various UCUM units have problems formatting.

keilw commented

Closing this, will deal with it in #97