What about subnormals?
jfbastien opened this issue · 72 comments
Forking from: #141 (comment)
Should denormal behavior be:
- Unspecified (say so in Nondeterminism.md).
- Fully specified as IEEE-754 compliance.
- Fully specified as IEEE-754 compliance for scalars, and do something for SIMD because existing hardware (especially ARMv7 NEON) doesn't support denormals.
- Specified as DAZ/FTZ (not IEEE-754 compliant).
We should probably let ourselves change this based on developer feedback, but I'd like to make some decision for MVP. I suggested on esdiscuss that JavaScript go full unspecified, and just do DAZ/FTZ because it's often faster. Yes, x86 handles denormals better than it used to, but that's not universal, ignores current hardware, and doesn't look towards what new hardware will do. I like leaving the door open :-)
For @sunfishcode searchability, I'll use the word "subnormals" too :-)
For @jfbastien searchability, I'll say the word "denormals" too :-). But IEEE 754-2008 is about 7 years old now, so it's time to be up to date :-).
I would argue for IEEE 754 compliance from the beginning. The rationale is that hardware and software all tends toward the standard over time. I think we're uncorking a long-term bottle of annoyance and incompatibility to deviate.
As for subnormals, it seems the same cycle has repeated multiple times in floating point history. It's been tempting for hardware to cut corners to either do FTZ or something else, and has always tended back to IEEE 754 full compliance. Even GPUs are implementing full IEEE now. Of our tier 1 platforms, the only one I am aware of where subnormals are not implemented at all is Float32x4 on Arm NEON (i.e. SIMD). Float64x2 on NEON is fully IEEE compliant. Scalar arithmetic on arm is of course compliant.
The subnormal situation seems to come down to SIMD, specifically the case of Arm NEON above.
Based on conversations with hardware designers, microarchitectures have gotten so much better at superscalar floating point, even Arm cores, that it might be acceptable to spec Float32 as IEEE as well as Float32 SIMD operations as IEEE, and then simply not do Float32x4 as SIMD on arm. But this is something we should measure and motivate when SIMD is coming into WebAsm. At that point, if the performance really justifies weakening the spec, we can weaken the spec. Otherwise, it would be hard to tighten up the spec later.
My objection to that is: which developers care about denormals? Developers usually learn about them when they have stray denormals in their compute kernel and code goes orders of magnitude slower. I'm still looking for someone who wants denormal support.
If we specify FTZ as a developer-controlled mode, we're not weakening the spec (as in, we're not making semantics any looser); we're giving strictly more power to developers and this is something that they are specifically asking for (in our discussions with asm.js-using gamedevs). If we consider that wasm will always be run on emerging platforms (which often start w/ terrible denormal perf) and very old platforms, then this will be a consistent feature request, not one that will get definitively fixed. Furthermore, from my discussions with Intel, even with new, optimized denormals, they're not equivalent in speed to normal numbers and they're also not optimized for all ops. This is why setting FTZ is standard practice; it's just one less perf cliff to worry about.
Setting DAZ/FTZ should be a global property that's set once for the entire wasm application, though?
- Or can it be set/unset arbitrarily and mess with AOT compilation?
- What of threading setting/unsetting it?
- Wasm shares its process with other modules and non-wasm code. What does that imply when setting/unsetting?
I would expect it to be a global flag on a wasm module that cannot be toggled and can thus be assumed for AOT/cached compilation. That still leaves questions w/ dynamic linking but I'd default to the simple option of: if you try to dynamically link a wasm module w/ a different flag than you, loading fails.
I want to see data, and the burden of proof for violating IEEE 754 is very high in my book. Even if the performance gains are huge, they'll likely be spotty, i.e. on only a small number of platforms. I'd recommend that we find a way to spec it in an "opt-in to nondeterminism because I want speed" fashion. But mostly I trust the IEEE 754 specification and the numerical expertise that went into it more than hardware manufacturers on the cutting edge.
I think the phase where we gather sufficient data to motivate a digression from IEEE 754 hasn't come yet, so spec'ing nondeterminism or deviance seems premature now.
I think mandating FTZ is a no-go, since mode-switching seems to be really expensive on Intel.
To be clear, it's not nondeterminism that is being discussed: it's (deterministically) flushing denormals to zero. Also, we have had reports (e.g. and these guys iirc) specifically about people hitting denormal perf problems in JS. That setting FTZ (globally, not toggling dynamically) is standard practice for whole domains (games, signal processing) demonstrates that this is something developers expect.
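For concreteness, "setting FTZ globally" in a native app is just a one-time write to the SSE control register at startup. A minimal sketch using the standard SSE/SSE3 intrinsics (the function name is made up; a wasm engine would do the moral equivalent when entering a module compiled with such a flag):

#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE (SSE)
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE (SSE3)

// Called once at startup; both bits live in MXCSR, so this is cheap and per-thread.
void enable_ftz_daz() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // FTZ: subnormal results become +/-0
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // DAZ: subnormal inputs are treated as +/-0
}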
I agree with titzer.
I don’t think there is much long-term value to specifying floating point behavior that is not IEEE.
-Filip
I think the vast majority of developers that encounter denormals discover them while troubleshooting unexpected performance hits. In the DSP world it tends to be a very common performance gotcha so I would advocate defaulting FTZ with an optional global IEEE-compliant mode.
My preference would be full IEEE 754 support, or barring that to use DAZ. Undefined behavior would be a very terrible decision in my opinion. Inconsistent semantics make it very difficult to implement algorithms from computational geometry, which require exactly computing things like the sign of a determinant for example. A common technique to speed up these calculations is to use a floating point filter as a quick check before falling back to a more expensive exact arithmetic test. If the floating point behavior is not specified, then it becomes much more difficult (and in some cases impossible) to construct such a filter.
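To make the filter idea concrete, here is a minimal sketch (illustrative, not taken from any particular library) of a filtered 2D orientation test. The error bound is only derivable because every double operation is guaranteed to round per IEEE 754; with unspecified or flushing semantics no such bound exists and the exact fallback would always be needed:

#include <cfloat>
#include <cmath>

// Sign of the determinant deciding which side of the line through a and b the
// point c lies on. Returns +1/-1 when the sign is provably correct under IEEE 754
// double rounding (and the intermediate products don't underflow), or 0 when the
// result is too close to call and an exact-arithmetic fallback is required.
int orient2d_filtered(double ax, double ay, double bx, double by,
                      double cx, double cy) {
    double l = (bx - ax) * (cy - ay);
    double r = (by - ay) * (cx - ax);
    double det = l - r;
    // Conservative, illustrative bound (a few ulps of the operand magnitudes).
    double errbound = 8.0 * DBL_EPSILON * (std::fabs(l) + std::fabs(r));
    if (det > errbound) return 1;
    if (det < -errbound) return -1;
    return 0;  // inconclusive: fall back to exact arithmetic
}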
Here is a discussable proposal which I believe gives most people what they want, though it makes some tradeoffs (as any proposal must):
- Default to full support for subnormals.
- A function attribute sets either "standard" or "maybe_flush", with "maybe_flush" meaning that it's nondeterministic whether subnormals are flushed as input and/or output at each operation within the function body (and does not apply within called functions).
- Float32x4 is not accelerated on 32-bit NEON in "standard" mode, but is in "maybe_flush" mode.
The following questions seem interesting:
Is "standard" the right default? Losing Float32x4 on 32-bit NEON by default is not pretty (though compilers and tools could help detect problems and guide developers to solutions). Compiler flags are a nuisance. However, abrupt underflow is also sometimes problematic, and it's non-IEEE, so it's a question of priorities and perhaps also short-term versus long-term.
Is function-body the right scope for mode switching? It's somewhat fine-grained, but also gives implementations a natural optimization boundary, because dealing with mode changes in the middle of a function is awkward. Inlining can blur such boundaries, but optimizers would at least have the option of declining to do inlining (or other interprocedural optimizations) across boundaries where the modes differ. And implementations might be able to avoid the cost of mode switching across many function boundaries when the mode doesn't actually change.
Is "maybe_flush" what we want, or would a straight "flush" be better? "maybe_flush" avoids requiring CPUs to have both DAZ and FTZ flags. And, some implementations may wish to stay in "standard" mode in some cases. But, it does introduce nondeterminism which could lead to different problems.
What is the advantage of hardcoding in the spec that mode switching must occur on function boundaries, and not in other places?
This makes perfect sense to me.
-Fil
Structured mode switching, rather than just arbitrary dynamic mode switching, means that one can always statically determine the mode for any operation, which is an important property. Putting mode switches at function boundaries achieves this, though another option would be to have a mode-switch AST node which would be like a block node but would set the mode within its lexical extent.
Between function attributes and AST nodes, I chose function attributes because it gives implementations a few more options for avoiding mode switching costs. However, AST nodes would give applications some more flexibility, so we can consider both choices here.
This proposal sounds pretty reasonable to me. I prefer the "opt-in" to nondeterminism option. The AST node has the advantage of allowing the source producer to do inlining without changing the semantics.
Should it be an AST node, or a per-operation property? That won't cause code bloat because of the way we specify operations.
I'd also be fine(r) if this was a SIMD only feature, where vector operations could specify IEEE, FTZ, or DontCare.
That sounds OK, though I would stick with IEEE and DontCare.
Having an FTZ attribute would mean mode switching in embedded scenarios - like if a native app uses JavaScriptCore.framework and then the JS code goes and loads some wasm (this is something that I think we'll eventually want to support). The rest of the app will almost certainly be IEEE.
-Filip
For use cases like games I think DontCare isn't sufficient: folks actually want Fastest. If the HW makes denorms free then great, but otherwise they want FTZ.
Right, they want Fastest. FTZ won’t be Fastest if you have to mode-switch on every native API boundary.
How about renaming DontCare to Fastest? The point is: "I care less about the semantics of denorms than I care about how fast my code runs".
-Filip
sgtm.
Fastest + lots of API calls != fast
:-)
Having a compile-time per-function mode or an AST node for denormals can cause code-size explosion, since the compiler would have to emit the same code twice, once for each mode, in case a common function ended up being called from two separate functions that had the denormal setting set to different modes. Or is the plan to have such a setting be "shallow", so that it doesn't propagate to any called functions? I think the usefulness of such a setting would be very limited.
I think having a per-operation flag for scalar arithmetic isn't likely to give as big of a benefit, at the cost of either a lot more opcodes or a bigger encoding. So that leaves either an AST node or a per-function attribute if we want the modes to apply to scalar arithmetic.
If the mode were limited to SIMD operations (which are presumably more rare than Float32/Float64 arithmetic), then it'd be easier to accept a per-operation flag.
@titzer size won't explode unless an application uses both opcodes with and without this mode. Applications will just have a dictionary of ops they use, so they'll get the shortest encoding for their usage.
@juj Yes, shallow.
@jfbastien Per-operation flags will make it more tedious for implementations to effectively minimize mode switching costs because they'll have to run an algorithm to compute mode switch points. Scoped mode switching makes it easier to just do something simple and get good results (given reasonable input). And, we don't really want to encourage applications to be switching subnormal modes every operation anyway. (This is different from rounding modes, where switching per-operation is one of the main use cases.)
@titzer There are important use cases which want this feature for scalar operations as well.
There's an interesting implied semantic difference between "maybe_flush" and "dont_care". There exists hardware which implements subnormals but rounds them differently, and "dont_care" could permit that. I'd be ok supporting that, for now.
Using a per-op flag will be harder to optimize only if the developer's code switches from one mode to another. In that case the developer is already opting into getting slow code! The trivial case where all ops have the same mode is easy to optimize, and removes headaches when inlining.
@jfbastien Even assuming a sane application, with per-operation flags:
for (...) {
  if (...) {
    x = y + z @subnormal_flushing_mode;
  }
  ...
}
The ideal thing here would be to set the mode outside the loop rather than setting it inside the loop. To do this, an implementation would have to scan the entire loop body and ascertain that there are no operations which need a different mode. And, this is a simple example; more complex cases are possible too.
With scoped flags:
@subnormal_flushing_mode {
  for (...) {
    if (...) {
      x = y + z;
    }
    ...
  }
}
It's still up to the application to use scope boundaries responsibly, but this way it's trivial for the implementation to just set the mode in one place without any other analysis.
Agreed, but the same problem occurs as soon as we do inlining. Making the behavior mandatory means that either we can't inline, or we have to handle things the same way as if it were a per-op property.
I'm ok with AST-node scoping, if that's the consensus (and @titzer liked it). Although be aware that if we do this, I'm going to be writing testcases which bait optimizers into optimizing across scope boundaries in invalid ways, so I wouldn't advise adopting this approach if you're hoping to do fancy optimizations without worrying about this ;-).
#243 is a PR which spells out the AST-node approach.
I'm really not a fan of the scoped approach:
- It encodes the same information as per-operation, but makes it hard to see that callees aren't affected by the block scope.
- Such attribute blocks aren't used anywhere else in our current AST specification, making them fairly opaque to the optimizer. Should we support attribute blocks for other properties too? We shouldn't design a one-off feature IMO.
Per-operation flags aren't any harder to optimize properly, and they're easy to optimize in a silly way for quick-and-dirty compilation. That's similar to the memory access alignment property.
Size-wise a per-operation approach won't have any impact because of parameterized macro compression.
I agree that there's some visual ambiguity.
I disagree that it encodes the same information -- it encodes more, because it includes a hint as to where a reasonable place to put mode-switching code is. This hint is what will help simple implementations achieve decent results. I disagree about the ease of achieving decent results using quick-and-dirty optimization without this hint.
Optimizing implementations can trivially convert from scoped AST nodes to per-operation flags when constructing their own IR if they wish.
As to whether we should support attribute blocks for other properties, it really depends on the property and how we expect it to be used. FutureFeatures.md has a proposal for how floating point rounding modes can be handled, for example, and they're more naturally per-operation because they have different expected usage patterns. Switching between rounding modes frequently happens in one of the major use cases for dynamic rounding modes -- interval arithmetic.
I agree that there's little difference in terms of overall encoding size.
The other issue I have with the scoped approach is that sub-trees that are exactly the same don't have the same semantics, and don't yield the same results.
If we go this route I strongly want us to design scoped annotations for more than just subnormals, before we commit to this approach for subnormals. This implies figuring out when an attribute is per-operation versus block scoped.
I would actually advocate that we table the subnormal discussion until after MVP, since that discussion has more bearing on SIMD than scalar operations.
The reason I'm pushing for this at this time is that it's important for scalar too.
A few things: to echo Luke's earlier comments, I confirmed with the Intel performance architects I know from a prior lifetime that FTZ/DAZ are still recommended by Intel for all users. And small cores still have much more significant performance issues around denormals than large cores do. Also, in that same prior life, when we talked to FP-intensive application developers we flat-out never heard that denormals were something they wanted. IEEE-ness almost always meant roundings/internal precision (x87), comparison semantics, and FMA. In that spirit, what I think this issue should be about, rather than "subnormal" behavior, is an "FP model", a la https://msdn.microsoft.com/en-us/library/e7s85ffb.aspx
Haswell is the best x86 chip at handling subnormals today that Agner has data for, and it has a staggering 124-clock stall when an operation on normal operands produces a subnormal result. Other chips have stalls from 150 to over 200 clocks. And, some applications use algorithms which produce a lot of subnormal values, and they can't easily be avoided. If we do nothing for subnormals, these kinds of applications become essentially unusable.
Nondeterminism always has the risk of impacting user-visible behavior. Making non-determinism opt-in helps, but we can still expect many developers to just blindly enable any flag that has the potential to make their code faster. Limiting the nondeterminism to subnormal values does have a plausible chance of limiting the impacts to user-visible behavior.
These are the conditions which drive us to the somewhat extraordinary length of proposing a subnormal_mode AST node.
These conditions are all very different than the conditions surrounding superficially similar-seeming things like "fast-math" flags commonly supported by compilers. Automatic FMA formation or looser NaN comparison semantics or similar things at the WebAssembly level would not have anywhere near the same performance impact, and at the same time would dramatically increase nondeterminism in ways that are very likely to lead to observable behavior differences.
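For anyone who wants to see the effect on their own machine, a trivial microbenchmark sketch is below (deliberately not a representative application, just an isolation of the penalty): it times the same accumulation loop with a normal operand and with a subnormal one, so any difference is the cost of subnormal handling.

#include <chrono>
#include <cstdio>

// Each iteration multiplies a fixed operand by 0.5 and accumulates it; when the
// operand is subnormal, every multiply produces a subnormal result.
static double accumulate(double x, long iters) {
    volatile double operand = x;  // volatile so the multiply isn't hoisted out of the loop
    volatile double acc = 0.0;
    for (long i = 0; i < iters; ++i)
        acc = acc + operand * 0.5;
    return acc;
}

static double run_ms(double x, long iters) {
    auto t0 = std::chrono::steady_clock::now();
    accumulate(x, iters);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const long iters = 50000000;
    std::printf("normal operand:    %.1f ms\n", run_ms(1.0, iters));
    std::printf("subnormal operand: %.1f ms\n", run_ms(1e-310, iters));  // 1e-310 is subnormal for f64
}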
The majority of your first two paragraphs is the best argument I've seen for my suggestion that FTZ/DAZ be the default, with any application that relies on denormals opting in to the performance pitfalls. With the latter being post-MVP, as Ben suggests.
So can a developer rely on wasm rounding at each operation, or are we going to allow FMA recognition? This is in fact identical to supporting denormals in operations. Both are specifying the operational semantics of an IR node. FMA, tree height reduction/reassociation/etc. may not affect performance as much as denormals, except when your application does not converge to the correct answer. If by superficially similar you mean that allowing reassociation/fusion goes in the "fast" direction and denormals goes in the "slow" direction, I agree. But we should hammer out a set of default behaviors more comprehensively than just denormals.
Citing per-operation stall counts is inconsequential without hard data showing the net slow-down of enabling denormals in a representative application.
Do you have this data?
I somewhat agree with the claim that this isn’t as bad as “fast-math". I’m fine with a flag that enables non-deterministic denormals, so long as it is correct for an implementation to ignore it and just support denormals per IEEE anyway. That’s almost certainly what our implementation will do.
-Filip
@pizlonator Indeed, my proposal is designed to let you do exactly that.
@davidsehr: Many of the kinds of optimizations you're describing are more appropriate for WebAssembly producers. Even in applications which don't mind the precision effects of automatic FMA formation, it can still be important to ensure that the FMA formation is done consistently, so that there aren't e.g. visible discontinuities between tiles. Leaving things up to the WebAssembly implementation makes it difficult to guarantee this. Having the WebAssembly code ship with two versions of key functions, an FMA version and a non-FMA version, and pick between them using a feature test, means that we can do the optimization without sacrificing robustness.
Floating point reassociation for tree height reduction is trickier, because it depends much more on architectural and microarchitectural details. However at the same time, the risks of numerical instability are well known to be far greater, so the risks of harming portability are far greater, so this isn't something we can let implementations do, without a very strong motivation. (I think there are things we could do in this space, but this is a different conversation.)
As for whether we should flush subnormals by default, this is also something we can address in WebAssembly producers. Producers can each have their own defaults, tailored to the needs of their users, independently of what WebAssembly's own default is.
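In source terms, the "two versions plus a feature test" pattern is roughly the following sketch (the feature-test function is a hypothetical stand-in for whatever detection mechanism WebAssembly ends up providing); the point is that the producer decides where fusion happens, so results stay consistent rather than depending on what an optimizer silently fused:

#include <cmath>

// Hypothetical feature test; a placeholder for a real mechanism, resolved once at startup.
static bool have_fast_fma() { return false; }

static double dot3_fma(const double* a, const double* b) {
    double s = 0.0;
    for (int i = 0; i < 3; ++i)
        s = std::fma(a[i], b[i], s);  // fused: one rounding per term
    return s;
}

static double dot3_plain(const double* a, const double* b) {
    double s = 0.0;
    for (int i = 0; i < 3; ++i)
        s = s + a[i] * b[i];          // unfused: two roundings per term
    return s;
}

double dot3(const double* a, const double* b) {
    // The producer picks the variant, so rounding behavior is a deliberate, consistent choice.
    return have_fast_fma() ? dot3_fma(a, b) : dot3_plain(a, b);
}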
It seems like, rather than weakening wasm semantics to maybe let the impl emit FMAs that it'd be better to instead have a separate FMA op (as is already proposed in FutureFeatures). This is symmetric with our choices in other cases (like, say, strict aliasing) where we leave the nondeterminism up to the C++ compiler so we can keep our semantics deterministic (thereby reducing cross-browser or cross-version breakage).
Question, though: other than vendors not wanting to set FTZ for philosophical reasons, is there or is there expected to be hardware that won't have FTZ as a builtin capability (such that if we changed fastest to always flush we'd actually be penalizing wasm running on that hardware)? If 'yes', that's a good reason for admitting this limited form of nondeterminism; but if 'no', then it seems like we're basically agreeing to provide divergent behavior between browsers for no great reason, in which case we have more work to do to reach consensus.
I support the non-determinism instead of a deterministic FTZ-mode because I believe that changing the FTZ mode is not free.
For example, turning FTZ on and off is very awkward on ARM. You need to do a read-modify-write operation that involves three instructions, for example to disable FTZ:
vmrs <Rt>, fpscr             // read FPSCR into a general-purpose register <Rt>
bic  <Rt>, <Rt>, #0x01000000 // clear the FTZ bit (FZ, bit 24) to enable IEEE subnormals
vmsr fpscr, <Rt>             // write FPSCR back
Note that I don't know what the cost of vmrs/vmsr is. I don't have time to benchmark it right now. I just don't like it because three instructions is three instructions too many, when you're trying to emit great code that runs fast.
My overall preference would be for not having FTZ at all - even in a nondeterministic form. We don’t have benchmark evidence that any of this will make a difference when you factor in the other overheads of wasm (like the lack of undef and the need to have some safety checks on memory accesses). I think that FTZ makes sense in native apps where it’s valid to enable FTZ for the whole process at start-up, but I don’t think it makes sense when you have to support one mode for the browser (JS in particular) and another for wasm. It may be that we’re overthinking a very small optimization and coming up with overly complicated proposals that amount to little more than premature optimization.
That said, I'm probably fine with having some FTZ story for SIMD, since not all platforms support IEEE for SIMD. But when we do add FTZ for SIMD, we should add it in a manner that takes into account the potential overheads of mode-flipping. That probably means having a non-deterministic FTZ setting, like in Dan's proposal. AArch32 implementations will have to do some mode-flipping, but that will be better than not having SIMD at all, while other implementations will probably just do full IEEE, since that will be cheaper than mode-flips. In any case, that's not relevant to the MVP.
-Filip
@pizlonator the es-discuss discussion I pointed at earlier goes into JS semantics, as well as SIMD.
We do have data from scientific computing and gaming folks that denormals are actively harmful to performance, though I agree that controlling the whole process makes the solution easier. We don't have data that anyone wants denormals in their computation (besides implementors equating "want IEEE 754 support" to "must support denormals that don't flush").
Designing WebAssembly for JS embedding is important, but many usecases won't pay the 3-instruction cost you mention. And what if JS didn't have denormals, or at the minimum didn't specify them? Furthermore, many WebAssembly usecases won't be embedded in JS, or if they are won't require denormals at all.
If we define FTZ module-wide (which was my expectation at the beginning of this thread), then I wouldn't expect the mode flipping to be significant. I asked Intel people about the costs and they said that, while it's not something you want to be doing in a loop, it's not super expensive and has gotten cheaper. Thus, if you factor in the cost of doing the call and associated bookkeeping, I expect the mode flipping wouldn't be significant. There is the use case where someone writes a wasm module full of tiny-bodied functions that are called repeatedly from JS (say, some Math library), but these are simply cases where devs shouldn't set ftz.
@jfbastien: I disagree with changing JS semantics with respect to denormals. That es-discuss debate didn't appear to be supportive of such a change. In any case, I wouldn't gate wasm semantics on a JS change, especially a JS change that might break the web. I vaguely recall that in WebKit we found that disabling denormals did break the web. As for the non-browser uses of wasm, I think they are interesting but that's not the killer app. The killer app here is the browser, so let's optimize for that.
@lukewagner: You’re totally right about module-wide FTZ being acceptable. I’d be OK with that.
Maybe I’m missing something, but nobody who is advocating FTZ has provided hard data on how much faster this is. Referencing cycle counts for specific operations isn’t good enough for me, since that doesn’t reveal how often those denormal cases are actually hit. We should be making decisions based on how they improve end-to-end performance. My acceptance of Dan’s maybe-FTZ proposal, or the module-wide FTZ proposal, is predicated upon someone showing some speed-up numbers of what this actually buys us.
-Filip
I've seen complaints from developers in places like the Web Audio WG mailing list about how denormals caused 10x-100x performance degradation in their applications; is that sort of thing not sufficient and we need data showing that denormals are impairing performance of top 100 websites, or something like that? What kind of motivation do we need to justify actually dealing with denormals, given the cost?
Any developer who has spent time with audio DSP hits the denormal problem pretty quickly. Exponential decays or IIR filters, unless carefully designed, exhibit a pretty massive performance hit (at least 10x) once they drop into the denormal range.
One of the reasons I am excited about wasm is because it will enable things that really aren't possible with JavaScript due to performance, like audio DSP and video codecs, where control over things like denormals is critical.
I read over one of these discussions. It’s true that audio tends to hit this case more so than other kinds of code. Just to clarify though: from what I read, they aren’t arguing that some program runs 100x slower; they’re arguing that an instruction in some program runs 100x slower, and that makes the program run sufficiently slowly that you see a CPU usage spike. That’s bad, but "100x performance degradation” is an overstatement.
Moreover, when people claim "100x slower" the numbers they cite are usually cycle counts, which isn't very interesting. How fast one instruction runs on a slow path isn't necessarily indicative of overall performance. One benchmark measurement that I did find was http://charm.cs.illinois.edu/newPapers/06-13/paper.pdf, but that only claims >100x slow-down on Pentium 4 - newer CPUs do much better. Also, that's for a benchmark that does nothing but a floating point math loop where all but one of the input values are denormal. I'm not sure that Pentium 4 is a very interesting CPU anymore, and I'm not sure that such a microbenchmark is representative.
The audio use case seems like a strong motivating example for a per-module mandatory FTZ mode, but cycle counts in an architecture manual and a paper based on a microbenchmark on old hardware isn’t very satisfactory to me. I’d like to understand: is the claim here that one person’s legacy audio code runs too slow in asm.js, or is this a systemic problem affecting many audio codes?
-Filip
This is sounding pretty convincing. Question: is the only available method of battling this slow down to enable FTZ, or are there other tricks that people use also?
-Filip
As for the non-browser uses of wasm, I think they are interesting but that’s not the killer app. The killer app here is the browser, so lets optimize for that.
@pizlonator I was thinking about in-browser usecases that are fully wasm, with little to no JS glue around. Yes, out-of-browser is also a usecase to which my argument applies, but I agree we can mostly ignore it for this discussion.
This is sounding pretty convincing. Question: is the only available method of battling this slow down to enable FTZ, or are there other tricks that people use also?
If you can't enable FTZ (e.g. when doing DSP in a Java VM), there are several tricks you can use, such as testing for and flushing denormal values explicitly, or injecting inaudible noise into filters to keep them out of the denormal range.
These sorts of workarounds will work with wasm, but it would be nicer to have a fastmath flag (or better yet, a strictmath flag with fastmath being the default).
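To make those two tricks concrete, here is a sketch of both on a one-pole lowpass (the constants are illustrative, not tuned for any real application):

#include <cmath>

// One-pole lowpass: y[n] = y[n-1] + a * (x[n] - y[n-1]). With a decaying input
// the state y creeps toward zero and can linger in the subnormal range, which is
// exactly where the slowdowns discussed above show up.
struct OnePoleLowpass {
    double y = 0.0;
    double a = 0.01;

    // Trick 1: test for and explicitly flush tiny state (threshold is illustrative).
    double process_flushing(double x) {
        y += a * (x - y);
        if (std::fabs(y) < 1.0e-20)
            y = 0.0;
        return y;
    }

    // Trick 2: mix in an inaudible offset so the state never reaches the subnormal range.
    double process_dithered(double x) {
        const double anti_denormal = 1.0e-20;
        y += a * ((x + anti_denormal) - y);
        return y;
    }
};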
To be clear though, do you want a fast-math flag in the style of Java, a fast-math flag in the style of C compilers, or a fast-math flag that just means FTZ?
-Filip
Thought about this argument more, and I’m no longer so happy with deterministic FTZ even if it’s module-wide.
I expect that wasm users will modularize their code. This is the tendency in any language that supports modules: you create separate modules for separate things.
A deterministic module-wide FTZ setting will make cross-module calls slower in cases where there is a settings mismatch.
I’m still not convinced about this FTZ thing. Another thing that occurred to me about empirically observed FTZ slow-downs is that they may be due to the presence of denormals changing the convergence characteristics of a numerical fixpoint - that is the fixpoint may take longer to converge. Of course it’s sad when native code exhibits different behavior in wasm than it did natively, but that ship has already sailed. I still don’t see evidence that the lack of FTZ prevents people from writing performant code; it feels like a nice-to-have. And having an FTZ setting that sometimes makes fine-grained cross-module calls slow seems broken.
-Filip
To be clear though, do you want a fast-math flag in the style of Java, a fast-math flag in the style of C compilers, or a fast-math flag that just means FTZ?
I was thinking more in line with C compilers where IEEE compliance is not guaranteed and denormal support may be disabled. Thinking about it more however, it would probably need to be more explicit.
What @davidsehr proposed is to formalize a full math model, with more than just control on denormals. If we're doing scoped attributes then allow developers to wrap regions where reassociation is OK, where FP contraction can be done, and so on.
I suggest we discuss denormals in this issue, and figure out a wider math model in another issue, potentially punting to post-MVP: figuring out the denormal default matters IMO, but I think we can agree on other math behavior for MVP (essentially, not fast math).
@pizlonator I see the theoretical multi-module app situation you're talking about (once we have dynamic linking, that is), but if we have a clear default mode (as a non-normative note in the spec and in llvm-wasm) then 99% of modules will all have that default mode. It would also make sense to issue a console warning when dynamically linking heterogeneous ftz-mode modules.
I'd argue concerns about the cost of the FTZ switch at module boundaries are also less relevant since the cost of calling out of/into an asm.js module is already elevated in SpiderMonkey (or was the last time I checked, anyhow). The overhead there can be aggressively optimized over time, but you're still going to effectively be transitioning between runtime environments, which means argument values (unless they're ints or floats in registers) are being marshaled into/out of the heap and various other setup is happening. I suspect there will always be some overhead involved here, so the introduction of more in the case of FTZ state mismatch is reasonable given the upside (superior, predictable performance in applications that need FTZ).
There are definitely scenarios where people will want to call into/out of wasm a lot, and in those cases we'll want to strongly discourage the use of FTZ. But the same is true for many existing native APIs - IIRC DirectX on Win32 is rather opinionated about x87 modes etc and it's just something game developers deal with.
@jfbastien To address the need for a full math model, I created #260.
Such an overhead doesn't exist in JSC. I don't want the spec to require me to have inter-module call overhead.
This is not about native call overhead. This is inter-module call overhead between wasm modules.
I think it's silly to design wasm just for games, and to have a module mode flag that masquerades as a performance feature but could cause slow downs due to mode switching.
-Filip
How is that different from the myriad of existing performance techniques, though?
Tools like PGO can reduce your performance or break your application if guided by bad data/configured incorrectly. You might opt to use a lookup table in a scenario where it's actually more expensive than the computation due to memory characteristics. You might hand-inline some logic into your JavaScript, pushing its size over a threshold and causing some JS engines not to optimize it (FWIW, this can happen in .NET too). In the bad old days on x86, MMX and x87 shared registers so if you mixed those two you paid an enormous mode switch cost to bounce between them. Threading a performance-sensitive algorithm can reduce performance if it ends up highly contended on a lock or atomic.
There are very few optimizations you can make thoughtlessly that have no chance of hurting performance. Optimization is something that has to be an informed decision. FTZ is the same. AFAIK we're talking about an optional FTZ flag that defaults to off, so the vast majority of developers will be fine with the default and not turn it on. Many of those developers will be turning it on because their native application already had FTZ enabled, so they were paying that cost to begin with.
FWIW my SpiderMonkey example was not to imply that JSC is exactly like SpiderMonkey, but to imply that there will probably be some sort of overhead for JS<->WASM or Module<->Module transitions in most engines (eventually, if not right when the MVP is implemented). The design is already making various performance sacrifices for good reasons.
We could always make FTZ an advisory flag so it's spec-compliant for JSC to ignore it, and then we'll find out whether users care or not :-)
The issue here is that we are introducing the need for calls between wasm modules (and JS<->wasm calls) to have an overhead where previously there was no need for any such overhead. And we are doing it to support an alleged optimization that relies on non-compliance with IEEE. And we have unreliable data supporting the allegation that it's an optimization at all.
I don't know what SpiderMonkey does, but the JSC approach being taken in our prototype will not have inter-module call overhead. We'd rather go down the path of reducing module overheads rather than increasing them.
An FTZ advisory flag would be sort of OK and definitely better than a mandatory one.
I think the larger issue here is that dealing with FTZ modes feels somewhat premature, and it's a consideration that is in very strong need of data showing a real-world impact on a set of representative benchmarks. We don't yet have a fully functional implementation, let alone a maximally performant one, and no benchmarks on which to base real measurements. There's been a lot of hearsay and anecdotes about potential slowdowns, but for me the bar for deviating from IEEE should be really, really high. It's easy to add FTZ or nondeterministic denormals later and much harder to take them away, so I think we should be careful about adding a controversial performance feature that could fail to pan out and also cause us grief later.
Consistent complaints from people who work on realtime audio and multimedia software are not 'hearsay' and FTZ is only controversial if you're talking about wanting to leave it out of an environment to simplify things. Mind you, simplification is a noble goal. But please don't miscategorize an important feature for real-world workloads, heavily used in existing production applications, as a 'controversial performance feature that could fail to pan out.'
If it's an advisory flag that people only use if they need it, the only way it would cause us grief is if applications ship with it for measurable real performance gains and then somehow we end up with architectural reasons to regret it later (like because we implemented it wrong). I'm not sure how badly we could mess up a module-wide FTZ flag. If we're concerned, we can punt with an explicit statement that we will 'do it right' post-MVP.
Exactly. The course that would make me happiest is to do it right post-MVP, and not mention FTZ in the MVP. The downsides of adding FTZ to the MVP in the currently proposed forms are:
Downside of a Nondeterministic FTZ flag: it’s nondeterministic, which can lead to divergence between implementations. My own experience with FTZ is that some codes unexpectedly require either the presence of FTZ or the lack of it because it influences how some numeric fixpoint converges.
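To make that divergence concrete, here is a minimal sketch (plain C, binary32 arithmetic; the iteration counts are my own back-of-the-envelope numbers, not data from this thread) of a loop whose termination count depends on whether FTZ is in effect:

```c
#include <stdio.h>

/* Halve x until it reaches zero. With full IEEE-754 gradual underflow the
 * value passes through the subnormal range before hitting zero; with FTZ,
 * the first result that would be subnormal is flushed to zero instead, so
 * the loop exits earlier. Any "did the delta reach zero?" convergence test
 * is therefore sensitive to whether FTZ is on. */
int main(void) {
    float x = 1.0f;
    int n = 0;
    while (x != 0.0f) {
        x *= 0.5f;
        n++;
    }
    /* Roughly 150 halvings with subnormals, roughly 127 with FTZ enabled
     * (assuming round-to-nearest and strict binary32 arithmetic). */
    printf("reached zero after %d halvings\n", n);
    return 0;
}
```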
Downside of a Deterministic FTZ flag: it cannot be polyfilled and we can’t ever kill it. It also raises the bar for how much work is needed to achieve a compliant implementation.
I think I understand your argument in favor of FTZ: it is something that is beneficial to enable per-process in native apps that do audio, and those who do it feel strongly about it. I take it as a given that they feel strongly about it because they know things about this that I don’t. But I also know that it’s not the only way to get good performance in such code - you can chop away the denormals yourself if you really care, and people sometimes do this. This makes me suspect that FTZ may be more of a convenience nice-to-have than a performance showstopper.
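For reference, a hedged sketch of the two approaches mentioned above as they appear in native audio code, using the standard x86 SSE intrinsics; the one-pole filter and the 1e-30f threshold are illustrative assumptions, not code from this thread:

```c
#include <math.h>
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE */

/* Option 1: the per-process (strictly, per-thread) switch audio apps flip
 * once at startup: flush subnormal results to zero and treat subnormal
 * inputs as zero. */
void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}

/* Option 2: "chop away the denormals yourself" -- snap tiny feedback values
 * to zero so a decaying filter state never lingers in the subnormal range.
 * The threshold is an arbitrary illustrative choice. */
static inline float flush_small(float x) {
    return (fabsf(x) < 1e-30f) ? 0.0f : x;
}

/* Example: one-pole lowpass whose state would otherwise decay into
 * subnormals once the input goes silent. */
float onepole(float in, float *state, float coeff) {
    *state = flush_small(in + coeff * (*state - in));
    return *state;
}
```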
Also, arguments about the performance of FTZ in native code aren’t directly transferable to wasm given wasm’s early state, for the following reasons:
- In wasm we don’t have a notion of enabling FTZ per-process, so we have to resort to something else. That puts us in somewhat uncharted territory. We do not have hard data on the cost of FTZ mode switching, nor on the cost of denormals, across all of our target architectures. Hence, whatever we do here now, we would be doing it blind: while we know that whole-process FTZ is profitable, we don’t know if that profitability will hold when you consider the costs of inter-module mode flipping.
- In wasm there will be other kinds of overheads already. If audio code uses FTZ to get a 1% speed-up, the speed-up in wasm will probably be less than 1%. That’s because wasm will have some as-yet unknown overhead from other things (memory accesses, lack of undef). The higher those base overheads, the less things like FTZ matter. This makes it difficult to make claims about the overhead of FTZ in wasm based on reports of FTZ overhead in native code.
- We don’t know exactly when critical mass adoption of wasm will happen, and what the dominant CPUs will be once that happens. Undoubtedly some CPU(s) that seem important today will seem less important then, and also, there may be some new CPU(s) with entirely new constraints. This makes it profitable to defer semantic-changing perf features that are motivated by the CPUs of today. We should defer these things to when, as @titzer said, we have a well-optimized wasm implementation and we can run real apps on that implementation. To me that means that we should do it post-MVP.
- Adding FTZ later is so easy! On the other hand, removing it is impossible. So, I believe that the bar for adding FTZ right now should be: does the lack of FTZ prevent widespread adoption of the MVP? I doubt that this will be the dominant issue influencing whether people try out wasm.
-Filip
I expect FTZ isn't something we're going to see broadly across benchmarks; it'll have 0 impact on 99.9% of apps and a 2x slowdown on the .1% of apps that happen to run into slow denormal ops on hot paths. But this is exactly the description of a post-MVP feature, so maybe that is the right path. Starting with less nondeterminism, one less thing to implement, and more polyfill fidelity in v.1 is a good consolation prize.
For now or post-MVP, I had one idea for a refinement on how to define FTZ: give modules a list of global options which are ignored if not known to the browser. Make "FTZ" an optional feature (one that could be permanently not-implemented while still being conforming). Engines which want "fastest" could unconditionally set-and-forget "FTZ". Codes that really want mandatory FTZ could feature test and, if FTZ wasn't present, take a different code path that did explicit denormal flushing. I wonder if llvm-wasm could even include a flag that did all this automatically (scoped or globally).
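A rough sketch of what that source-level pattern might look like; `wasm_has_ftz()` is a purely hypothetical feature-test hook (nothing like it exists in the current design), stubbed out here so the sketch compiles on its own:

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical feature test: "was the advisory FTZ option honored?"
 * Stubbed to false; a real toolchain would wire this to whatever
 * mechanism the spec eventually provides. */
static bool wasm_has_ftz(void) { return false; }

static inline float flush_small(float x) {
    return (fabsf(x) < 1e-30f) ? 0.0f : x;  /* illustrative threshold */
}

/* Kernel with two paths: rely on hardware FTZ when the option was honored,
 * otherwise flush decaying values explicitly. */
void decay_buffer(float *buf, int n, float coeff) {
    if (wasm_has_ftz()) {
        for (int i = 0; i < n; i++)
            buf[i] *= coeff;
    } else {
        for (int i = 0; i < n; i++)
            buf[i] = flush_small(buf[i] * coeff);
    }
}
```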
For reference, I dug up a few instances where FTZ & denormals crossed the Web Audio mailing list:
https://lists.w3.org/Archives/Public/public-audio/2014AprJun/0091.html
https://bugzilla.mozilla.org/show_bug.cgi?id=1027624
https://lists.w3.org/Archives/Public/public-audio/2014AprJun/0094.html
As discussed today: we'll wait for data before coming to a conclusion. Leave bug #148 open, don't change the FAQ with #260 just yet. Re-discuss when @titzer and @pizlonator can discuss over a higher-throughput medium than GitHub issues.
So until this morning I thought this conversation was about some mode for allowing FTZ, with subnormals by default. And I think we can get data and implement something good, so I wasn't too worried about this.
But having FTZ by default would incur a nontrivial penalty on every call across the FFI.
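To spell out where that penalty comes from: with FTZ as the default, an x86 engine would have to flip MXCSR around every crossing into or out of full-IEEE code. A minimal sketch using the raw SSE control-register intrinsics (not actual engine code; MXCSR writes are not free on typical x86 cores):

```c
#include <xmmintrin.h>  /* _mm_getcsr / _mm_setcsr */

#define MXCSR_FTZ 0x8000u  /* flush-to-zero bit */
#define MXCSR_DAZ 0x0040u  /* denormals-are-zero bit */

/* Entering FTZ-by-default wasm code from JS: save the caller's MXCSR and
 * force FTZ/DAZ on. */
unsigned enter_wasm_ftz(void) {
    unsigned saved = _mm_getcsr();
    _mm_setcsr(saved | MXCSR_FTZ | MXCSR_DAZ);
    return saved;
}

/* Returning to JS (which must keep full IEEE semantics): restore it. */
void leave_wasm_ftz(unsigned saved) {
    _mm_setcsr(saved);
}
```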
If you are very adamant about FTZ, then maybe we should move up the mode switching to an MVP issue, which could allow us to skirt the whole debate about defaults.
I am currently proposing we fix this with #271.
#271 is now merged.