google/truth

Support approximately-equal floating point assertions using ULPs

dimo414 opened this issue · 10 comments

Google's C++ test framework supports approximate comparison of floating point numbers without requiring the user to specify a precision (though that's available too). The ASSERT_DOUBLE_EQ/ASSERT_FLOAT_EQ macros assert that two numbers are within four ULPs of one another, which (the documentation claims) should work well enough for most use-cases.
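For reference, the general technique behind such a check can be sketched in Java by mapping each double's bit pattern onto a monotonically ordered integer scale and comparing distances. This is a rough illustration of the idea, not Truth or GoogleTest API; the class and method names are made up:

```java
// Sketch of a GoogleTest-style "within N ULPs" check in Java.
// Names are illustrative only, not part of any real API.
public final class UlpComparison {
  /** Maps a double's bit pattern onto a scale where long ordering matches numeric ordering. */
  private static long toOrderedBits(double d) {
    long bits = Double.doubleToLongBits(d);
    // Doubles use sign-magnitude representation; flip negative values so that
    // the resulting longs increase monotonically with the doubles.
    return bits < 0 ? Long.MIN_VALUE - bits : bits;
  }

  /** True if a and b are at most maxUlps representable doubles apart. */
  static boolean almostEqualUlps(double a, double b, int maxUlps) {
    if (Double.isNaN(a) || Double.isNaN(b)) {
      return false; // NaN is never "close" to anything, including itself.
    }
    long distance = Math.abs(toOrderedBits(a) - toOrderedBits(b));
    return distance <= maxUlps;
  }

  public static void main(String[] args) {
    // 0.1 + 0.2 is one representable double above 0.3.
    System.out.println(almostEqualUlps(0.1 + 0.2, 0.3, 4)); // true
    System.out.println(almostEqualUlps(1.0, 2.0, 4)); // false
  }
}
```

Note that this bit-distance formulation sidesteps the question of whose ULP counts when the two values straddle a power of two, which comes up later in this thread.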

I'm sure this has been discussed before, so apologies if I'm dredging up something that's already been long-decided. But if ULP-based assertions are intentionally absent from DoubleSubject and FloatSubject I think the documentation should explain why that decision was made and what users should do instead. Currently the class makes no mention of ULPs, and expects callers to always pick an appropriate tolerance.

I was confident that I would find that our approximate equality was introduced by @PeteGillin , but it appears not :) (internally: CL 82886021, way back in 2014, so no API Review doc that I can see)

Pete did extend approximate equality to Fuzzy Truth, so he may have thoughts.

I recall reading the Random ASCII post linked to by GoogleTest during at least some discussions of approximate equality. I think that at least somewhat discouraged me from exploring ULPs, given the "Know what you’re doing" section's statement that "There is no silver bullet," which mentions specific weaknesses of ULPs.

That said, we could dig into how GoogleTest's approach has worked out in practice. I have always felt bad that most users are probably just typing "0.000001" with as many zeros as they feel like typing, without actually having much idea of what the "right" value is. And see also https://errorprone.info/bugpattern/FloatingPointAssertionWithinEpsilon.

Yeah, I added the support for floating point arrays (PrimitiveDoubleArraySubject.usingTolerance(...)) and iterables (IterableSubject.comparingElementsUsing(tolerance(...))), but the support for single floating point values (DoubleSubject.isWithin(...)) had been around for a long time before that. For the APIs I added, I think we decided that following that precedent was "obviously" the right thing to do.

Like you, I can only speculate about how that decision was made in the first place.

I would, however, agree with the assertion that there is no silver bullet. That Random ASCII post looks like it goes into loads of detail and seems pretty good. The short version is that the kinds of errors you'd consider acceptable depend on the calculations behind the number and no assumption will always be right. I do think that it's possible that research into how these things are actually used could result in something that would be right often enough to be useful, but I think we'd want to do that research rather than guess.

For what it's worth, we do have some rules of thumb in our docs. Again, if anyone can produce better advice with evidence that it's better, I'm happy to LGTM that.

This may be a weak, "easy-way-out" position to take, but I'd be inclined to simply mirror what googletest is doing, barring some evidence that they regret adding this assertion.

It seems better to offer an out-of-the-box, usually-reasonable approach than to force everyone to come up with a magic constant for each usage that doesn't have any semantic meaning to their tests.

I am sympathetic. This is just one of those things that is not on fire, just a slow bleed of inconvenience :\

One thing that might be fun is to edit usingTolerance to ignore the supplied tolerance (except perhaps in the case of a 0 tolerance) and apply the ULP logic instead. Then we could run that against all our tests and see if anything actually breaks.

Caveats:

  • If it makes tests more lenient -- and too lenient -- we won't know. (Maybe we can get some idea by changing the behavior of isNotWithin?)
  • Even if tests pass now, they might fail, I don't know, on different platforms or with different JDKs or whatever.

It might also be interesting to compare the results against overriding the tolerance to always be zero: I imagine that that "works" fairly often, and we might get more interesting data by comparing the results of that with the results of the ULP change.

(Vaguely related, even less of a priority: another thing we could do, when an approximate comparison fails but is "close," is to output a tolerance that would have let it pass.)
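That suggestion could look something like this toy sketch (all names are hypothetical, and the definition of "close" here is an arbitrary cutoff chosen for illustration, not anything Truth actually does):

```java
// Toy sketch of suggesting a passing tolerance in a failure message.
// Class, method, and the "close" cutoff are all hypothetical.
public final class ToleranceHint {
  static String failureHint(double expected, double actual, double tolerance) {
    double diff = Math.abs(expected - actual);
    // Arbitrary notion of "close": within 16x of the requested tolerance.
    if (diff > tolerance && diff < tolerance * 16) {
      return String.format(
          "expected %s within %s of %s (a tolerance of at least %s would have passed)",
          actual, tolerance, expected, diff);
    }
    return String.format("expected %s within %s of %s", actual, tolerance, expected);
  }

  public static void main(String[] args) {
    // 0.1 + 0.2 differs from 0.3 by about 5.55e-17, just outside 1e-17.
    System.out.println(failureHint(0.3, 0.1 + 0.2, 1e-17));
  }
}
```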

That might be fun, indeed. (I would suggest doing the experiment with isWithin rather than usingTolerance as you'll hit far more cases. And, indeed, doing it with isNotWithin could be instructive too.)

You are correct that other platforms could see different behaviour. With f.p., it's even theoretically possible that behaviour could change on the same exact machine according to load, although I don't know whether that's a big deal in practice. But it'd still be instructive, we'd just have to take the results with that caveat.

One other thing worth bearing in mind is that we might want to change ProtoTruth, which also inherited the semantic from DoubleSubject. (Just be grateful that we are moving away from the behaviour in the old Google-internal thing that ProtoTruth is replacing.)

I ran a test that overrides all DoubleSubject.isWithin tolerances to 0. It produces, well, let's say n failures :)

The next question is what happens if we change it to check "within 4 ULPs" instead of "within a tolerance of 0." I'm not sure if that's as trivial to implement. To follow googletest, maybe we'd use doubleToLongBits? Or maybe we can just call nextUp and nextDown 4 times each and see if we hit the number?
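The nextUp/nextDown idea is easy to prototype. Here's a toy sketch of that reading of the check (illustrative only, and, as discussed below, not necessarily matching googletest's actual semantics):

```java
// Sketch of the "call nextUp/nextDown four times" reading of "within 4 ULPs".
// Illustrative only; not how googletest necessarily defines it.
public final class NextUpWalk {
  /** True if b lies within maxSteps representable doubles of a, in either direction. */
  static boolean withinSteps(double a, double b, int maxSteps) {
    double lo = a;
    double hi = a;
    for (int i = 0; i < maxSteps; i++) {
      lo = Math.nextDown(lo); // widen the window one representable double downward
      hi = Math.nextUp(hi);   // and one upward
    }
    return lo <= b && b <= hi;
  }

  public static void main(String[] args) {
    // 0.1 + 0.2 is one representable double above 0.3.
    System.out.println(withinSteps(0.3, 0.1 + 0.2, 4)); // true
    System.out.println(withinSteps(1.0, 2.0, 4)); // false
  }
}
```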

First of all, I haven't looked at the googletest implementation, and I'm not even sure exactly what "within four ULPs of one another" means.

If x and y are not close to a power of two then ULP(x) = ULP(y) and the semantic is obvious. In that case, I think you can take doubleToLongBits of both and check that the results differ by at most four, right? (Masking off the two least significant bits and testing for equality isn't quite the same thing, since two values within four ULPs can still straddle a mask boundary.)

If x and y are close enough to a power of two then ULP(x) != ULP(y) and "four ULPs" is ambiguous. If I had to guess, I'd guess that it uses the larger of the two ULPs, maybe? That would seem intuitively more reasonable. But I'd have to check the code to be sure. Anyway, this is not the same as calling nextUp four times, which I think will give you four ULPs but some of the four will be ULP(x) and some ULP(y).
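Java's Math.ulp makes the boundary behaviour concrete: the ULP doubles in size when crossing a power of two, so stepping four times from a value just below 2.0 mixes two different step sizes. A small demo:

```java
// Demo: the ULP size doubles at a power of two, so "four ULPs" is
// ambiguous for a pair of values straddling the boundary.
public final class UlpBoundary {
  public static void main(String[] args) {
    double below = Math.nextDown(2.0); // largest double strictly less than 2.0
    System.out.println(Math.ulp(below)); // 2.220446049250313E-16
    System.out.println(Math.ulp(2.0));   // 4.440892098500626E-16 (twice as large)
    // Stepping up from `below` mixes the two sizes: the first nextUp step
    // advances by ULP(below), and later steps advance by ULP(2.0).
  }
}
```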

I'm being vague about how close to a power of two you have to be there for two reasons. One is that there are a couple of different definitions of ULP floating around, and they differ close to powers of two (per Wikipedia). (I guess IEEE 754 must give a definition, since it specifies required accuracies in terms of ULPs, but I don't know.) The other reason is that I'd have to think more than I have done :-).

TL;DR is that, if it's important to be consistent with googletest then I think we need to port the googletest code into Java rather than trying to infer the semantic from documentation.

P.S. There are some things in DoubleUtils that might help with the impl.

(Actually, I suppose it's possible that the meaning of "within four ULPs of one another" when ULP(x) != ULP(y) is exactly the mix of the two ULPs that corresponds to calling nextUp four times. But I stick with the conclusion in my TL;DR.)