edf-hpc/verrou

Documentation of uncounted operation

hbrunie opened this issue · 6 comments

Hi,

what does uncounted operation mean?
Thanks,
Cheers,
Hugo

From memory, these are floating-point operations that verrou did not perturbate, either because it cannot perturbate them yet (e.g. comparisons) or because it was not configured to perturbate them (e.g. x87 floating-point operations aren't perturbated by default).

Thank you for these information.

Does sqrt fall into one of these categories?
27 ==9757== Uncounted operation: Sqrt64Fx2 (5751)

Also I launched verrou using these exclude libraries:

sym lib

  •  /lib64/libm.so.6
    
  •  /lib64/libm-2.26.so
    

But I don't think this message comes from these exclude, as many other function are excluded and do not throw any warning.

Also this is not in the same topic, but if you accept to answer here,
Verrou shows me that there is a NaN created from log10(a), because a<0.
This happens only when Verrou applies random rounding.
But this does not happen in "normal" execution.
A part from the fact that the program is sensitive to small perturbations, what can I learn from this? How should I investigate this?
Thank you for this tool!

Regarding Sqrt64Fx2, I cannot answer with my current Verrou knowledge (I'm only a user and occasional patch submitter, not one of the main developers), so you'll need to wait for @ffevotte or @lathuili to pass by.


Regarding log10 NaNing out with random rounding, such "edge effects" can either be a false positive of verrou's analysis or an indication of a genuine fragility in your computation. You need to investigate further in order to understand which one it is.

One reason why it could indicate a genuine problem is that it suggests that your computation is passing a zero, or something close to zero, to log10. This is suspicious, because log(0) is mathematically undefined in the set of reals (which floating-point computations usually try to emulate, and which some HPC compilers assume floating-point numbers always contain), and logarithms of very small numbers can be less precise than those of "normal" numbers and easily diverge into -inf under very small changes to initial conditions or parallel computation scheduling.

As a first debugging step, I would walk up the backtrace emitted by Verrou when the NaN is created, to try to understand which part of your codebase is (indirectly) leading to a negative value being passed into log10 under random rounding. Note that if the failure is spurious, you can use --vr-seed together with the RNG seed printed by Verrou at the beginning of execution in order to stabilize the pseudorandom rounding into a failing configuration.

If this investigation leads you to believe that this is a false-positive of verrou, you have several options:

  • Locate the piece(s) of code that compute the negative input to log10 via delta-debugging, and add this specifically to your exclusion list if it's not an overly wide exclusion like a BLAS routine.
  • Modify the incriminated log10(a) call with a log10(max(a, 0)) call, possibly under a verrou-specific conditional compilation mode in order to avoid hurting your performance.

Thank you for such precise answer!

I'll leave the issue open for the Sqrt64Fx2 question.
I will investigate in depth the source of the close to zero values passed to my log10.
Thanks a lot,
Hugo

I'm back after a long break: sorry for the late answer.

Verrou is not yet able to instrument all floating point operations :

  • as we are not able to instrument properly the scalar operations of the x87 register, they are instrumented, by default, only to be counted. The user can see that perturbation instrumentation is missing by looking the counters.

  • there are also operations we do not instrument at all although the stochatic approach requires their perturbations. The warning "uncount" is here to signal to the user a lack in the implementation of the stochastic approach.

Amoung these uncounted operations there are the square root, cosinus and sinus when done with hardware instructions. When the compiler decides to use inlined hardware instructions instead of the dynanic library, the libm.ex exclusion file is not useful. In your case you are simply ignoring the rounding error implied by the square root. If the evaluation of the accuracy is not sufficient for your application, you have problem in the rest of your code and so ignoring the square root is not a problem for the localization step.
If you want to hide the warnings you can use the suppression mecanism of valgrind.
If you want to instrument the square root, for now I can only propose a workaround. You need to recompile your application to use the libm dynamic library and use the Interlibmath via LD_PRELOAD (short documentation Interlimath/README.md and one exemple in unitTest/check-libM/Makefile). The use of interlibmath is not yet straightforward.

To make it work without this workaround will be possible but requires a verrou developpment which is not (yet) high ranked in your roadmap.

I'm curious to see the mathematical expression f(x) which compute a. If we can prove that f(x)> 0 over the definition set, the problem is in the floating point implementation of f(x) or is a verrou false positive (delta debug can pinpoint both). If we can not prove it, the problem is the algorithm (or at least our algorithm understanding).