WebAssembly/binaryen

Future of Binaryen in a stack machine world?

kripken opened this issue · 78 comments

The WebAssembly project recently decided to switch from being an AST to being a stack machine. This issue is to discuss the implications of that for the Binaryen project, which is AST-based.

History: Binaryen's initial design and goals were simple: WebAssembly was an AST. By building a compiler that uses that AST as its internal IR, we could stay very close to our output format, which allows

  • Fast compilation to the target, since almost no lowering is needed.
  • WebAssembly-to-WebAssembly optimization, as we can load it, optimize it, and write it.
  • Using WebAssembly as our IR means optimizations very specific to WebAssembly are easy to express (e.g., block return optimizations), which generic compilers might find harder to do.
  • ASTs are convenient as an input IR, allowing compilers to WebAssembly to use Binaryen as their backend. The asm2wasm and mir2wasm projects are examples of this.
  • ASTs are a reasonable optimization IR, allowing flexible optimizations to be written in nice ways.

But the foundations for all that are now in question with WebAssembly's pivot to a stack machine.

The first practical consequence is that things in the WebAssembly MVP will not be expressible as an AST, e.g.

call gimme-i32
call gimme-i32
call no-value-returned
i32.add

This stack machine code does two calls to get i32s, then a void call. The void call pushes nothing onto the stack, and the add consumes the two i32s. In an AST that order of operations is not directly expressible, although a "first" or "reverse-comma" operator can work: first(x, y) evaluates x, then y, and then returns x's value (the opposite of the comma operator in C, JS, etc.). So we can write

(i32.add
  (call gimme-i32)
  (first
    (call gimme-i32)
    (call no-value-returned)
  )
)

This technically works, but is awkward. In particular, the "first" operator vanishes when actually emitting WebAssembly. This puts us in the uncomfortable position of optimizations needing to take into account that more AST nodes might mean less code size. And the obvious simple IR for such optimizations is just the linear list of stack machine instructions.

That's just the beginning: A major justification for the stack machine approach is multi-values, which are not in the MVP but will be added later. As with "first", technically one could invent some AST construct, but it's awkward. Further possible future features already showing up in discussions include pick (copy a value from up the stack) and other stack machine tricks. It's clear that WebAssembly is moving in a direction that Binaryen's initial design is just not a good fit for.

The bottom line is that Binaryen was designed based on what WebAssembly was at the time. It seemed like an elegant approach to use the WebAssembly AST as a compiler IR. That tied Binaryen's internals to the WebAssembly design, which was a risk, but it paid off - until now, as WebAssembly has pivoted. So we need to decide what to do.

Options include:

  • Deprecate Binaryen: Consider a fresh start in another toolchain project, or shift efforts to an existing one.
  • Keep on trucking, for now: Add "first", and add other things later, despite Binaryen getting less effective at its purpose over time. The maintenance burden will rise as the IRs get more incompatible. This delays what I suspect is the inevitable.
  • Redesign Binaryen (1 IR): Switch to a stack machine IR. That would entail an almost complete rewrite, so it's not much different than the deprecate-binaryen option.
  • Redesign Binaryen (2 IRs): Add a stack machine IR alongside the AST. This seems very complex and risky, especially if the IRs are meant to be bidirectionally translatable, which is necessary if we are to keep both consuming and emitting WebAssembly.
  • Stop consuming WebAssembly: Keep the current AST, but it would be the "Binaryen IR" and no longer WebAssembly. This is sort of like forking the IR: WebAssembly is going in a stack machine direction, but the Binaryen IR is not following it there. Binaryen IR would be isomorphic to a subset of WebAssembly, but no more. In particular, this means we would no longer consume WebAssembly, we would only be able to emit it.

More details on the stop-consuming-WebAssembly option:

  • Binaryen would be a compiler to WebAssembly and nothing more: not a WebAssembly interpreter, not a WebAssembly linker, etc. etc.
  • Binaryen's IR would not be able to express all of WebAssembly directly. We would have an output layer that can translate it into WebAssembly. Since we would be a subset of WebAssembly, that should be a simple process, with few changes to our current code. In the short term we just wouldn't emit code not isomorphic to an AST (i.e. we would emit a subset of WebAssembly), but perhaps in time we could have our WebAssembly output layer emit stack machine-specific patterns as a final-layer optimization. (So we might eventually have 2 IRs, but not bidirectionally translatable ones.)
  • Binaryen's AST would stay representable in text format in the current s-expression format. As that format will likely end up being changed for the stack machine, this means forking the s-expression format, into something I guess we can call "Binaryen's IR text format". We would fork the spec test suite and add it to the Binaryen test suite. wasm-shell would run those tests, and should be renamed to binaryen-shell (or byn-shell?).
  • Binaryen's wasm-opt tool would be renamed binaryen-opt (or byn-opt?) as it would operate on Binaryen IR, not WebAssembly. Binaryen would stop being a wasm-to-wasm optimizer, which was a goal we thought could be useful for other projects - we would be dropping that goal.
  • Binaryen's wasm-as, wasm-dis tools would likewise be renamed as they would operate on a Binaryen binary format. In other words, as with the s-expression format, this would be a fork, of the binary format. Note also that this means that Binaryen would no longer be a tool people can use to disassemble and assemble wasm files for toolchain purposes.
  • Binaryen would still have an interpreter - in fact some optimization passes, like Precompute, need one - but it would be a Binaryen IR interpreter and not a WebAssembly interpreter. (Perhaps useful as an emterpreter replacement.)
  • No changes to asm2wasm and mir2wasm, as Binaryen's IR and C API are not changing. There should be no downside whatsoever for those compilers, and in particular not for the entire emscripten-asm2wasm path for compiling C++ to WebAssembly, which already works now.
  • s2wasm is tricky since its input is basically WebAssembly. We would need to modify or replace it, in tandem with the wasm backend. Several options here, with varying amounts of work.
  • No more wasm.js or binaryen.js, which could execute WebAssembly in JS for polyfilling purposes. You might say that wasm.js fulfilled its purpose of helping the toolchain side before JS engines got proper wasm support, and at this stage, we don't need it as much. So that is probably not a big deal. For binaryen.js, it could no longer work as a (slow) client-side polyfill for WebAssembly, but other interpreters could be compiled instead.
  • The bottom line is that Binaryen would be refocused from a general-purpose WebAssembly compiler and toolchain library - trying to do everything in that space - to a far more narrow niche of a specific compiler library for WebAssembly (that uses an AST designed to be isomorphic to a subset of WebAssembly).

As you can see from the amount of text, the stop-consuming-WebAssembly option is the one I've thought most about. Not because I like it necessarily, but because all the other options have downsides that worry me a lot more. But I have no definite opinion on any of this yet, hoping to hear what other people think.

Thoughts?

Can you give an example of stack-machine code that cannot be expressed (with only encoding loss) in a structured AST? There might not be any such examples, given that dropping stack elements is currently constrained to block boundaries.

An explicit block1 was considered some time ago; see WebAssembly/design#650. As you note, if it is explicit then it does not solve the problem, as using it changes the code. While the text format might use it just for presentation, what if people use it in excess and this still needs to be encoded? Also, this is still limited to structured uses within the scope of the block.

But this cannot express multiple uses of the function results, and does not work for functions returning multiple values in general. A general solution is needed, and I appeal to people to solve the general case first and then optimize it for high-frequency cases.

The example could also be presented as the following:

(letc (($c0 (gimme-i32))
       ($c1 (gimme-i32)))
  (no-value-returned)
  (i32.add $c0 $c1))

And when people write references to these local constants that do not fit the stack usage pattern, they would simply be encoded with get_value aka pick/pick_and_void. We could also move the function arguments onto the stack. Why not consider this proposal? It was communicated some time ago yet seems to have been ignored, and no rationale has been given.

The other option is the 'Stop consuming WebAssembly' option, but I would look at it from a different perspective and view it as conceding that the wasm binary code is just a compressed encoding of the language that developers will actually work with. The language that binaryen consumed would be the wasm text format, and debugging would step through that code, not the stack code, which would be a throwaway intermediate encoding, just as debuggers do not step through compressed encodings. This would focus on the SSA decoded format, which is really what this is all about, and it might be better for analysis and transforms, but distant from the stack machine code. This would be a last-resort strategy, working around the product of the wasm group, but why should the group be in such a position? It is our group.

In the past we've kicked around the idea of defining a flavor of wasm (an IR, if you like) which is not exactly wasm but more natural for many compilers to produce (aka a producer-oriented/subwasm/flat wasm). What you are describing as Binaryen's new non-isomorphic IR sounds a lot like that in some respects. As you have observed, Binaryen would then be a one-way tool (at least the compiler part would be; obviously from the user perspective Binaryen has always been a loosely-related collection of tools; e.g. we could still have wasm-as and wasm-dis if we wanted to, they would just not use the same IR).

I think you are basically correct about the consequences of moving Binaryen to this style of IR. Presumably it would evolve a bit (since it doesn't have to precisely track the current flavor of wasm, we could rethink pain points that exist or that we find along the way), and libbinaryen (C API, optimizers, etc) would track this evolution.
For s2wasm it would depend on what we wanted to do with LLVM; we could make LLVM target the "flatwasm" IR if we wanted to. Moving the compiler part of LLVM fully to a non-wasm IR might have interesting consequences for linking formats and the other parts of LLVM, but that's a decision we can make independently of what we want Binaryen to be.

Practically speaking, we may choose over some near-term period to keep on trucking and paper over the differences while we work out what we want the end product to be.

@JSStats: what you're suggesting sounds more like for the WebAssembly design than for Binaryen? This issue is just for Binaryen responding to a WebAssembly change.

@dschuff: I agree that trucking on for a while sounds reasonable, but it's not quite that easy. I've been working towards that in the land-drop branch (increasingly misnamed ;) since it's all about the stack now), and it's been quite a lot of work so far. In addition, it looks like we'd need to rewrite our s-expression parser from scratch. If it's indeed the case that this is a significant amount of work, it seems unwise to do it only to likely get rid of it later - so it makes sense to focus on the change in direction now.

That makes sense; if we do want to just go toward a new IR, we do have sexpr-wasm-prototype which can take the role of binary encoding, decoding, and standalone interpretation. Then the IR can be designed for ease of production from whatever class of compilers we care about, and for wasm-specific optimizations.

Yeah, sexpr-wasm-prototype is perfect for that, it has a proper stack-based interpreter in fact. And I think I saw @binji already has a web port so it could be used as a clientside polyfill (if not I'd be happy to help with that).

I think in the forked-IR path, if we go that route, binaryen might still end up with some wasm import facilities, but they might not be lossless. For example a "wasm2byn" could handle patterns that can't work in an AST by introducing new locals (sort of like the project whose name escapes me that translates wasm into LLVM IR in the sense that it's also not meant to be lossless). And of course we can emit wasm, so optimizing wasm to wasm in binaryen might still be useful, although that's an open question.

I am leaning more in the forked-IR direction. I considered in more detail the option of rewriting binaryen to use a single stack-based IR, but I just don't think it makes sense. The original design principles still relevant are

  • A single main IR has major simplicity and efficiency benefits.
  • That single main IR should be transformable so we can run optimization passes on it. (Without this requirement, the single main IR could just be the wasm binary format.)

But doing those with a stack machine IR seems very hard. On the one hand it's true that some optimizations are easy to write (e.g. removing dead code) while others become possible (stack-machine specific things). But on the other, consider a simple pass that reduces eqz(gt(x, y)) into lte(x, y) (i.e. !(x > y) into x <= y, which is true on integers; a sketch of this rewrite in both forms appears after the list below). Right now in our AST an eqz node has a pointer to the child node. In a stack machine, the options are

  • No child pointer. The child is then defined by the location in the array, and might be right before the parent, but also might not be (and even trickier with more than one child).
    • Optimization passes would therefore need to maintain an expression stack, and every optimization pass would need to consult that data structure for child operations. This might not be so bad when traversing in the natural direction, but anything else (other orders, random access, etc.) becomes tricky.
    • Code receiving a pointer to a node must also receive a "context" pointer to the array, as the node by itself is no longer enough (in our AST right now, we've intentionally made it so a node pointer is always enough; in fact expressions stand on their own even without knowing the module, which makes function parallelism trivial).
    • On the other hand, not having a child pointer would make each node smaller, which is a significant upside. But nodes would still need a "next" pointer (forming a linked list) if we want it to be efficient to move nodes around, insert them, etc., which seems like a strong requirement. So this would only help when the number of child nodes is >1. Still significant, but less.
  • Have a child pointer. This means
    • Node sizes strictly increase compared to our current AST, given the "next" pointer.
    • Each modification of the code needs to update two places, the linked list and the child pointers, which adds overhead compared to now and also introduces bug risk.
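
To make that concrete, here is a minimal sketch of the eqz(gt) => lte rewrite mentioned above (the signed comparison opcodes and local names are just illustrative). In the AST, the pass looks at an eqz node, follows its child pointer, and replaces the pair:

(i32.eqz
  (i32.gt_s (get_local $x) (get_local $y))
)
;; becomes
(i32.le_s (get_local $x) (get_local $y))

In the equivalent stack machine code there is no child pointer; the "child" is just whichever earlier instructions left the top value on the stack, which the pass must first reconstruct:

get_local $x
get_local $y
i32.gt_s
i32.eqz
;; becomes
get_local $x
get_local $y
i32.le_s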

@kripken Do you know what the decision-making process is for WebAssembly? I'm increasingly getting the feeling that major decisions are made in private discussions behind closed doors, or other unknown places, with little important discussion happening in the design repo. Did the change to a stack machine catch you by surprise as well?

Based on what I know so far (which is no doubt less than what you know), "Keep on trucking, for now" looks like the best option (for now), because

  • I don't know if the stack machine is a final, irreversible decision.
  • It looks like there are AST equivalents for the new stack machine semantics (have you discovered anything in the stack machine that can't be expressed in AST form with a new operator like 'first'?)
  • By maintaining an AST form for Wasm code, you maintain knowledge about the two-way isomorphism between AST and stack machine. I think that's just useful knowledge from an academic perspective. It could be useful to other engineers in the future and useful to the computer science field.
  • Converting stack-based wasm to an AST is a potentially useful feature to many people. By keeping this ability, you reduce the need for people to migrate to other tools, making your tool more popular and by implication, you yourself ;)
  • It is probably easier to adapt gradually to the changes in Wasm as they happen. We're predicting certain things in the future like a pick operator and multi-values, but we don't know that these will actually happen in the form currently being considered. I think before ripping out functionality or engaging in a major rewrite, you should be certain it is necessary.
  • (edit) As you mentioned, ASTs seem easier to work with for some common optimizations.

This puts us in the uncomfortable position of optimizations needing to take into account that more AST nodes might mean less code size

Are "virtual" nodes really uncomfortable in terms of code complexity, or are they mainly uncomfortable psychologically? If it's the latter, I think you'll get used to it after a while.

I am leaning more in the forked-IR direction.

If "forked-IR" means "an AST-based IR that is very similar to, and easily convertible to, the Wasm stack machine, such that arbitrary Wasm stack-machine code can be converted easily to the IR," it sounds like a good plan to me. It's even better if Wasm-to-IR-to-Wasm and/or IR-to-Wasm-to-IR are (guaranteed to be) lossless roundtrips.

@qwertie: Yes, I was taken by surprise by the stack machine change. And from what I've seen online and off I was the only browser-vendor person that opposed it. And you can see work going on for it (e.g. in the spec repo). So for better or for worse, I think it's safe to assume it's happening, even if nothing is truly final until it lands, is specced, and ships.

there are AST equivalents for the new stack machine semantics (have you discovered anything in the stack machine that can't be expressed in AST form with a new operator like 'first'?)

Yes, a "first" can solve the current issues, with some awkwardness, but as discussed above multi-values are almost certain to happen, and "pick" and others seem very possible. It's clear the awkwardness for an AST will increase.

By maintaining an AST form for Wasm code, you maintain knowledge about the two-way isomorphism between AST and stack machine. I think that's just useful knowledge from an academic perspective. It could be useful to other engineers in the future and useful to the computer science field.

I agree and those are some of the reasons I opposed WebAssembly moving to a stack machine. (My main reason is the negative implications for the text format.)

Converting stack-based wasm to an AST is a potentially useful feature to many people.

I agree, and I think we might still support a form of that, but it wouldn't be lossless. For example, a wasm2binaryen tool might use additional locals for stack machine code we can't directly represent.
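
For example, the stacky code at the top of this issue could be imported by introducing a local (the name $t here is hypothetical) to stand in for the value that "first" would have carried:

(i32.add
  (call gimme-i32)
  (block
    (set_local $t (call gimme-i32))
    (call no-value-returned)
    (get_local $t)
  )
)

Round-tripping that back to wasm would emit the extra set_local/get_local pair rather than the original stack-only sequence, hence the loss.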

Are "virtual" nodes really uncomfortable in terms of code complexity, or are they mainly uncomfortable psychologically? If it's the latter, I think you'll get used to it after a while.

I think there is a fundamental issue here. The optimizer currently has some useful principles like

  • AST nodes ~ work. That makes it easier to write optimization passes. In particular, it is obviously the "right thing" in most cases to replace some AST nodes with a smaller number of AST nodes.
  • Removing AST nodes is good. And not just individually as implied by the previous point: if optimization passes decrease the number of nodes, we have a good guarantee of repeated optimization reaching a fixed point (sort of like a loss function in gradient descent or a Lyapunov function in differential equations; i.e., we have a monotonically decreasing function in the number of nodes, and it can't decrease forever, so we must halt).
  • The same points also apply when replacing "work" with "code size", i.e., when optimizing for size over speed.

Now, there are exceptions to those principles. For example, a block does no work by itself, e.g. (block (block ..) x) is the same as (block .. x). So we lose the "AST nodes ~ work" property. But we do still have the principle that decreasing AST nodes is good (for AST size and code size). Another example is that some math operation might be faster with more nodes, like say turning an integer divide by a constant into a sequence of shifts and adds. That violates the first two principles, so the optimizer has to be careful about it. In this case, we can just not do the reverse of that operation, but it does mean we need to keep that in mind when writing additional integer math optimizations. At least, though, the third principle remains in that more AST nodes means more code size, so we clearly know not to do it when optimizing for size.

The problem with "first" is even trickier in that all three of those principles are no longer true: adding such nodes can reduce output code size and work, but the opposite is true in other cases. So we can't just not do the inverse, and it isn't trivial what to do when optimizing for size, and so all optimization passes that remove or create "first"s must be coordinated. Now, of course this isn't an insurmountable problem. But it (and multi-values, and pick, etc.) add complexity and awkwardness to our optimization IR, and not just in a few places. An AST is just not a good 1-to-1 IR for such stack machine code.

If "forked-IR" means "an AST-based IR that is very similar to, and easily convertible to, the Wasm stack machine, such that arbitrary Wasm stack-machine code can be converted easily to the IR,"

No, sorry, the name might be confusing. Forking here means that we diverge from WebAssembly by not adopting the stack machine elements that are awkward in our AST. A better name might be "only output wasm", as it implies wasm is only our output format, while the forked IR is our input format, internal IR, binary and text format, etc. But the direction of Binaryen IR => WebAssembly would remain a very easy and fast operation, of course (since we would be pretty much a subset of WebAssembly).

OIC - "~" means "varies proportionally with". Aka x ∝ y...

It seems to me you can model most of this with a per-node weight - say, most operations have weight 1, multiplications have weight 2-ish, divisions have weight 10-ish... and you can define three weights, one for "optimize for speed", one for "optimize for size" and one for "balanced". If an algorithm needs positive weights, that's okay: it might be better to give first a weight of 0.001 rather than 0 since w; x; y + z has the (small) virtue of being easier to understand than w; first (y, x) + z. On the other hand, another optimizer might have chosen w; first (y, x) + z intentionally, let's say because w might modify memory read by x... anyway, I've never written optimizers so I'm probably not helping :)

I see that multi-values are likely to happen as currently envisioned because they seem easy for the back-end to support. But local variables do everything we would use pick for, so I think it'll be a while before they consider that - unless they eliminate local variables entirely (a slightly horrifying thought).

FWIW I support your opposition to the stack machine, or at least I think there should be a closely-related AST variant of Wasm for the text format and for binaryen. I recognize the (small) value of the stack machine in the final back-end, but still wonder if there's some way to achieve a similar benefit in the AST paradigm (I'm cursing myself for not having found the time to learn Wasm better, otherwise I'd love to investigate this.)

It sounds like you weren't really consulted about this decision, but do you know what the decision-making process is?

@kripken A sketch of a single-pass stack-machine to text format algorithm is at WebAssembly/design#753. This does not use a first operator, but rather lexical constants, which would seem to work with the current proposal in the stack-machine-to-text direction, but be fragile in the other direction without the pick operator. Perhaps you can work it into something concrete, or if you find any show stoppers or loss then please report it so it can be understood and pondered. You could explore the utility of the pick operator, which would probably eliminate a lot of local variables.

OIC - "~" means "varies proportionally with". Aka x ∝ y...

Yeah, sorry, I should have been more clear.

It seems to me you can model most of this with a per-node weight [..]

Yes, but the issue is that "first" would need to have a negative weight in the context of code size. That's already troubling, but in addition, even that isn't always true: the weight depends on the surrounding code, and it shouldn't be negative in other cases, so it doesn't even have an intrinsically definable weight.

But local variables do everything we would use pick for, so I think it'll be awhile before they consider that

It might be a while, but I hope it'll happen at least for the text format: with pick you can avoid oddities like a let variable that can only be used once and in certain places.

do you know what the decision-making process is?

I don't know that there is anything formal written up. For the informal stuff I'm not sure since I'm not very good at politics ;)

@kripken I'm not seeing it. What's an example where first would have a negative weight?

Imagine that we want to do a call, set its result to a local, then a store, and then use the call result (and the store and the call can't be reordered):

(i32.eqz
  (first
    (tee_local $x (call ..))
    (store ..)
  )
)

vs.

(set_local $x (call ..))
(store ..)
(i32.eqz
  (get_local $x)
)

Both of those have 5 AST nodes, but the former has one less node in the stack machine output since it doesn't need the get_local.
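
Roughly, the first form would emit something like the following (with the store's operands elided, as above)

call ..
tee_local $x
;; ..store operands..
store ..
i32.eqz

while the second would emit

call ..
set_local $x
;; ..store operands..
store ..
get_local $x
i32.eqz

i.e. one extra instruction for the get_local.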

A negative weight isn't needed here: if first's weight is zero or near-zero, the first version has lower total weight so the optimizer should prefer it.

@kripken If you introduce first now then the text format will probably be stuck with it, and given that the lexical constants can also represent this pattern, there would then be two text formats for the same stack machine code, introducing a loss. I don't even think first aka block1 is a familiar operator to JS programmers, so why bother introducing it? It's a dead end. A plan is needed.

@qwertie: sure, but it can't be zero in other cases, and a specific non-zero value might work in some but not others.

The underlying issue is that in a stack machine, it's easy to see the cost of this operator (or rather, what it does, since it doesn't exist there). In an AST an analysis is needed to calculate the cost.

drom commented

Agree with @JSStats that first is neither readable nor native to the "stack machine" concept.

At this stage I'd like to propose that we go in the forked-IR, aka "only output wasm" direction:

  • The internal IR, our AST, does not change, but it is now called Binaryen IR. In time it will diverge from WebAssembly (e.g. on the stack machine issue) but it will remain a core goal to stay as close as possible (same types, semantics, etc.) and basically to have an IR that trivially compiles to WebAssembly with almost no work because it's practically WebAssembly already.
  • We no longer focus on WebAssembly as an input. It is an output for us, and it is our primary output aside from our internal IR. (But see wasm2byn below.)
  • Our main focus is on supporting compilers to WebAssembly, e.g., asm2wasm, mir2wasm, s2wasm, etc. The more specific goals when supporting such compilers remain unchanged: to make it easy to write a compiler to WebAssembly, to be fast, and to emit good code.
  • The current binary format support we have becomes the Binaryen IR Binary Format, perhaps with suffix .byn (pronounced "bine"; see previous discussion).
  • The current s-expression support we have becomes the Binaryen IR Text Format, perhaps with suffix .tyn (pronounced "tine"; spelling is parallel to .byn, replacing 'b' for binary with 't' for text). Our own .wast files in the test suite are renamed to the new suffix.
  • asm2wasm => asm2byn, wasm-as => byn-as, wasm-dis => byn-dis, wasm-opt => byn-opt, wasm-shell => byn-shell. No actual change but input and output where relevant are switched from WebAssembly to Binaryen IR.
  • Add a byn2wasm tool that emits WebAssembly. As our current binary format is identical to wasm, we can share almost all that code for now (just have a flag for the divergent points). emcc will call this tool to emit the wasm binary when compiling, etc.
  • wasm.js => byn-emcc.js, this integrates with emcc and allows running Binaryen IR in a compiled interpreter in JS. Useful for toolchain testing, testing the Binaryen IR interpreter (which is important since we use it in the compiler, e.g. --precompute so far), and ensuring our semantics don't diverge too much from WebAssembly. (I previously considered removing this, but changed my mind.)
  • binaryen.js was a standalone wasm interpreter. We no longer input wasm, so we leave that to other tools, and can remove it. (But see wasm2byn below.)
  • s2wasm => s2byn, it would now emit Binaryen IR. Note that s2wasm must track changes in the wasm backend, which for now doesn't emit stack machine code that is a problem for our IR, but if that happens, s2wasm might need to do lossy conversion (see wasm2byn below).
  • We can still run a subset of the wasm spec tests, basically everything but where stack machine notation pops up. This would let us keep coverage on all the rest of the suite, which is very useful for testing our interpreter and keeping our semantics as close as possible to WebAssembly. And it makes sense to keep our text format compatible with .wast files as much as possible anyhow (and as with the binary format almost all the parsing code could be shared).
  • wasm.h, wasm-validation.h and so forth become byn-*.h, etc.
  • Internally, we use the variable name "wasm" for a module in various places, that should be renamed to "module".

(Assuming we agree on this proposal, I am of course signing up to do all the above work; but if anyone wants to help that's great.)

Possible future steps, worth mentioning now since they might influence the discussion:

  • With our own text and binary formats, I think it makes sense to introduce optional basic block + branches support. We have that support in our C and C++ APIs already, in the form of the relooper API and its IR, so this would just reflect that in our IR's text and binary formats. This would be little work. It would help in internal testing as well as external compilers, e.g., right now mir2wasm uses the Binaryen C API with the relooper, and if we add basic blocks to the text and binary formats then it and similar projects could in principle generate one of those formats directly if they prefer. Basically, this would make our IR easy to target for a wide range of compilers: AST-based or CFG based, both would be supported.
  • Introduce a more sophisticated wasm output layer, that can e.g. reorder stack machine code into forms we can't directly represent in our IR. This might be a secondary IR.
  • Introduce a wasm2byn tool that imports WebAssembly. This would not be lossless as we can't represent all WebAssembly directly, so it might e.g. use more locals for some stack machine code, etc. But overall it would be a simple importer to write, and could be useful for various things even without being lossless, like a wasm-to-wasm optimizer, a standalone wasm interpreter, etc. (which would be trivial by just chaining wasm2byn to our byn* tools).

Thoughts?

@kripken A fundamental problem with this plan is that it still does not address multiple values. Development needs to move forward; develop a counter proposal that addresses the issues, rather than just resisting development and shuffling the chairs. People care about wasm because it is a deployment platform.

I think the general point you are making about the disconnect between the language that producers want to easily emit and the twisted stack machine constraints is a key point. The same was seen with the x87, a numerical stack machine, where compilers just wanted to connect inputs and outputs but had to work around frustrating stack machine constraints, to optimize for these constraints, and to be able to annotate where values were for debugging. The pick operator helps a little as it can access values in general up the stack, but if the requirement that a block stack be empty except for the out values stands, then this will require really frustrating swap and drop sequences to clean up the stack.

If wasm is just a target, and not optimized for code size, then the expressionless register based code is much simpler for producers.

@JSStats: I agree we need a plan for multi-values, and if we don't have a good one then we can't move forward here.

One option is to just not add them to our AST. This assumes we can still generate good code without it. I don't know if that's true.

Another is to add it to our AST with something like

(set_local_multi $x $y
  (multi
    (..expression to assign to $x..)
    (..expression to assign to $y..)
  )
)

i.e. multi creates a multiple value, and set_local_multi can destructure it into components and assign them into separate locals. And multiple values can flow through blocks, ifs, and function returns. This seems plausible, but I'm not sure.
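
As a very rough sketch of how that might lower (this is only a guess at the eventual stack machine encoding, assuming pushed values are popped in reverse order):

;; ..expression to assign to $x.. leaves a value on the stack
;; ..expression to assign to $y.. leaves a second value on the stack
set_local $y
set_local $x

and a call to a function returning multiple values would likewise just push all of its results.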

@kripken The lexical constants (the pick aka get_value proposal) seem able to address the multiple values use case with minimal complexity for an expression-based language: function return values are immediately assigned to these lexical constants and are then handled by single-value expressions. It is close to the stack-machine proposal, but with some significant differences from what has been proposed.

The other option is to depend on local variables, and this takes wasm down the expressionless encoding path. The encoding efficiency was not as good, and the live range of definitions is not as well defined, but it would be a simple compiler target.

A language with more general multiple values support was explored in WebAssembly/wabt#66 and the comments detail the design and operations.

Yes, I was taken by surprise by the stack machine change. And from what I've seen online and off I was the only browser-vendor person that opposed it.

The superficial history is that most of it "happened" in ongoing conversations between a couple of Mozillians and Googlers. Luke & Dan have been arguing in the more stack-ish direction for a while. Ben & I were rather skeptical at first, but feared "slow drift" (such as more incremental changes of the drop kind). The pivot point was the suggestion to "allow void-expressions anywhere", which quite plainly breaks the AST semantics. So, to see the end game, we prototyped a stack machine in the formal spec, ml-proto and V8. It turned out to be simple enough to convert us over.

That said, I can definitely see the downsides. It is less structured, after all. And structure is usually good.

@kripken
In general I'm on board with Binaryen's IR not needing to be wasm per se. Binaryen (as you are conceiving it here) is essentially a tool and optimization framework for compilers to more easily target wasm, and the IR you use for those optimizations doesn't need to correspond or translate 1:1 with wasm, even if you use it to implement a wasm->wasm optimizer.

I'm a bit confused as to why we need to formalize and have a binary format (and, e.g., a full-blown interpreter) for this IR. I can buy having a text format (which is useful for writing tests). But the following things make me nervous:

It sounds like you are conflating 2 different things with this proposal: a compiler IR meant for optimizations, and a machine format/VM that you want compilers to target. Do we want to try to make this IR stable? (my opinion: no) Do we want to write a linker for this format? (my opinion: no) Do we want to define ABIs in this format? (probably not) A wise soul once cautioned the LLVM community about doing this, and even though LLVM IR might be an extreme example compared to wasm, the underlying lesson is that the goals of an optimization IR are different from those of a platform (and here especially I mean "platform" in a broader sense than just a virtual machine, e.g. ecosystem, interchange format, etc).

For example, we know that it's likely that an optimizing compiler might want to use a CFG and not do its own control-flow restructuring, so Binaryen's library includes a relooper API. This is fine if we convert it to a structured IR before optimizing. But we wouldn't want to require that, so what do we do if we force everything through a binary format that matches the IR? Making a binary format that supports both seems like a bad idea because even aside from the complexity, it's incompatible with the stated goal of keeping the IR close to wasm.

Another way to say it is, the goal of keeping the IR as close to wasm as possible may be opposed to the goal of making it useful for optimizations, and making it more like a "platform" (binary format, etc) is opposed to the goal of making it flexible for different use cases.

Hopefully that wasn't too ramble-y.

I definitely don't want to define a new VM or machine format or platform here. The only goal I have in mind is a library that makes compiling to wasm easy, fast, and emits good code. Period.

And I believe an interpreter serves that goal very well:

  • In the --precompute pass, in just a few lines we find stuff the interpreter can run and replace it with the result (see the small sketch after this list). This is much, much easier than writing such a pass by hand, and it also has a guarantee of working on all of our IR.
  • This could also be used to remove global ctors by running them at compile time. Emscripten's asm.js optimizer does this now (by running the asm.js in a JS engine!). Compare this to LLVM's GlobalOpt which tries to do the same, but just handles the IR that it was written to handle, and aborts otherwise (e.g. it fails on removing the ctors added in a C++ iostream hello world).
  • A first-class interpreter is also very useful for testing. I've wished many times LLVM's lli worked better than it does.
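
For example, --precompute can use the interpreter to fold a constant subexpression (a minimal illustrative case, not taken from a real test):

(i32.mul
  (i32.add (i32.const 2) (i32.const 3))
  (get_local $x)
)
;; becomes
(i32.mul
  (i32.const 5)
  (get_local $x)
)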

And we already have a fully-functional interpreter now, so it makes sense to keep it.

Regarding a binary format: it's useful for the same reasons LLVM has a binary format - it's convenient and fast (in particular it will improve our compile times, unless we write all-in-one tools instead of chaining independent ones). And as with an interpreter, we already have one.

Linking: I don't feel strongly about this one. On the one hand it could be useful (like llvm-link is useful). But it seems very low priority at best.

Stability: I agree, not a goal. (While in theory it could help compilers targeting us, the burden on us would be far too great.)

CFG support: You're right that adding this to our IR formats diverges us more from wasm. But:

  • Not by a lot. Literally it's just adding something like (cfg (..block1..) (..) ..) - see the sketch after this list. It's not much of a divergence.
  • It's useful. As you said, many compilers want to emit CFGs. Why should they need to use our C API if the IR format would be easier for them?
  • It would also be easier for us for testing: look at the current relooper tests, they're a bunch of ugly C programs using the C API. It would be much nicer to use IR in text form. Just like in LLVM, you want to write tests in IR; you don't want to write C programs using the LLVM C API.
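
Nothing like this exists yet; purely as an illustration, a tiny CFG in such a text form might look something like the following (the cfg/bblock/branch names are hypothetical):

(cfg $entry
  (bblock $entry
    (..code..)
    (branch (..condition..) $then $else)
  )
  (bblock $then (..code..) (branch $exit))
  (bblock $else (..code..) (branch $exit))
  (bblock $exit (..code..))
)

The relooper would then turn this into structured blocks/loops/ifs, just as it does today through the C API.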

I'm not happy about diverging from wasm, and would still love to find a way to avoid it. But it does have upsides, like being able to make small but useful additions to our IR formats such as CFG support.

Another reason for Binaryen to have its own file format (and not keep using wasm) is that wasm's stack machine pivot causes some issues with unreachable code, namely that adding or removing a branch that alters code reachability can turn wasm code from valid to invalid and vice versa. Details are still being discussed last I heard, but that seems like an annoying property in an object file format (especially when bringing up a new compiler).

Overall, I think it's fair to say that wasm is focused on what browsers want to consume, while an object file format would have somewhat different objectives in my opinion:

  • Fast to link object files together (O(1) with low constant factor)
  • Fast to "link" an object file to a shippable wasm file that browsers can consume
  • As optimizable/LTOable as possible given those constraints (since on the web, code size matters more than on any other platform by a huge margin)
  • As easy as possible for compilers to emit

.byn files as described above should be able to achieve those goals (fast linking hasn't changed; "link" to wasm is an almost trivial operation since types&semantics are essentially identical; optimizability is pretty good by being AST-based, as evidence we have the current Binaryen IR optimization passes; relatively easy to emit by avoiding stack oddities as mentioned above, and also as discussed earlier we can add CFG support, etc.).

I had originally thought that the toolchain using raw wasm files for object files was an excellent opportunity for wasm - avoiding an extra format simplifies the ecosystem substantially. And that led to the original Binaryen design. But I think that's just not viable any more, wasm's design is evolving in ways that make sense for browsers but have downsides for toolchains. A single format can't fit all I suppose.

Those are all good goals, however:
(TL;DR: we shouldn't tie our "primary" object code format to something other than real wasm if we can help it. Our most primitive mechanisms should be as primitive as possible).

First, let's distinguish the container/file format from the content of the code section, and consider just the code (and, generally, any necessary relocations that go with it).
Any code format that we call "the" object file format or even that we just want to be "the primary" object file format has to be as close to real, actual wasm as possible, for a few reasons: There will always be those who want to

  1. produce real wasm themselves (even in "relocatable" intermediate format),
  2. produce real, linked, browser-ready wasm (i.e. do their own linking),
  3. use "standard" libraries from the emscripten and/or LLVM ecosystem,
  4. use libraries produced by producers who want to produce real wasm themselves

...and of course we also want to support those who want not-1 and not-2.

We shouldn't impose a second format that isn't real wasm on those who want to do 1 but not 2, or on those who want 1 and 3, or who want 1 and 4.

If we have a code format that isn't wasm (in particular if our linker and tools use such a format), 1 and 3 become mutually exclusive (unless we ship all of emscripten/LLVM libraries and ports twice). Also a user who wants 4 will be required to do 1, and 4 and 3 become mutually exclusive (or at the very least, mixing 3rd-party libraries will be challenging) because even if we ship our libraries and ports twice, we can't require that everyone do it.

So all this is to say, I really think our primary linking tools should support real wasm, and linking should be as dumb as possible (i.e. should not require the linker to understand wasm semantics).

(wanted to get this down before I forgot it, idea/design sketch to follow, which hopefully supports what we want).

Let's suppose we have a container format, and a relocation format that fits in it, along with section-merging semantics and relocation types that cover what we need (e.g. signature table indices, function table indices, types, etc). It could be ELF or the wasm binary format, it doesn't matter for this discussion. Let's call these .o or wasm object files. We'd want to make this as stable as possible.

Tools that linked this format would support any third-party tools that produced wasm directly, as long as they also produced our relocation info. So, what about our "higher-level" toolchain, users who want to have simpler compilers, and LTO?

Here are a couple of thoughts:

  1. For simpler compilers, we can still define a second code format (e.g. .byn, or an optimizer object file). It could be flexible (i.e. could support both CFG and structured control flow) and designed for more powerful optimizations. Notably, while we wouldn't want to break it arbitrarily, the stability requirements would be much less onerous. Binaryen tooling would of course include a tool such as byn2o.
  2. For mixing byn-object and wasm object files, the simple option would be to run this byn2o tool on any byn-object files, and link everything as wasm. This allows any part of the link to be in any format, with only one linker required (and it can be a simple linker), and all the interesting semantics (symbol resolution, etc) can be defined only once: at the wasm-object layer. It allows any local optimizations on the byn format with high fidelity to the original source and allows any kind of additional optimization-related metadata we want to have in byn, without imposing that on users who don't need it.
  3. That leaves LTO. First let's leave aside LLVM LTO for the moment: we can have that in addition if we want it in any case, and we might in fact want it because LLVM IR has more C semantic information than byn or any other low-level format will. Also, byn optimization will be focused on wasm-specific stuff: local wasm-specific stuff (AST formation/optimizations, control flow reconstruction, etc) is handled adequately in separate compilation. So that leaves inter-procedural stuff: inlining/outlining, DCE, and probably stuff that feeds better DCE like global constant propagation. Some of that could be handled simply in a way like ELF function-sections. But we probably want to do better because size is so important. For that use case I think we should still link at the wasm layer, and then use our proposed wasm2byn tool to convert the whole linked program to byn for optimization. Compared to the opposite (making everything link and optimize at the byn layer) it has the following advantages:
  • It works even with wasm files produced by non-byn-emitting tools (this case can also be served by using wasm2byn on wasm input before linking but see the next point).
  • Linking semantics only have to be defined at one layer: the wasm layer. This seems like a big advantage because for example LLVM IR has some quite tricky bits and subtlety around its implementation of module merging for LTO. Even when it's simple, it still has to be duplicated from the lower-level definition. And we'd have to define how those semantics map in both directions (byn2wasm and wasm2byn) which is not the case with LLVM IR today. Practically speaking, I actually think the merging and symbol-resolution semantics won't be too terrible, but where there are corner cases, it will be easier to e.g. make the optimizer conservative and then improve where possible, rather than have to make it exactly precise all the time.

Doing this means that to the extent that byn2wasm->wasm2byn is not exactly a precise round-trip, the code that gets LTO'd won't be exactly the same as what a byn-producing tool generated, but I can't really see why that would be a problem. Also, much of the byn optimization will still have happened before link-time, so any cases where it's critical can be handled there. In fact we could even include some metadata in the wasm object to aid that, or do a full-on LLVM-style LTO if we wanted. But we wouldn't have to, and it could be added later, rather than being required.

Very good points.

First note, I think we should prototype and measure this, and shouldn't decide without that (e.g. converting byn to wasm for linking is a nice idea, but if it adds overhead, it might make sense to implement direct byn linking or something else).

A second and larger issue: I didn't get into this earlier, but byn files aren't just useful for optimizing. Their being in an AST form which is higher-level than wasm's stack machine makes them more introspectable. In particular, as I've argued for a long time, I believe wasm has been going down a bad path in terms of its text format, which I worry might be less usable than e.g. LLVM IR and asm.js. Perhaps a novel wasm text format will help there in the future, but I'm skeptical. And introspectability of object files matters a lot - it is a crucial part of toolchain usability. LLVM got this right pretty early, for example, to its great benefit.

That's one reason why I mentioned before that binaryen's byn files could have a parallel tyn text format. That would be AST-based and as it diverges from s-expression format, we could turn it into something nice. We could also evolve byn and tyn files together for their mutual benefit, without being tied down by wasm binary format decisions that are regrettable from a text format viewpoint.

Again, the big picture here is that wasm is making something browsers want to consume first and foremost. It's not meant to be an object file format. I believed it would also make a good one in the early days, but optimizability and introspectability make me think it no longer is.

I do see your points about some people wanting to emit and link raw wasm files. But I'd like to convince you and them not to do that ;) Think about wasm files as a weird browser deployment format, like gzip: every compiled program shipping on the web will be gzipped, but compilers don't gzip, they leave that to a specific tool. Likewise, there is no need for people to emit wasm directly - and it will get harder over time, in particular with layer 1 etc. - we just need to make the tool that generates wasm files super-easy to use like gzip.

A huge motivation we had in the past for emitting wasm directly was for super-fast linking. But we can get that with byn files too. Given that, and the points above, I don't see much benefit to compilers emitting wasm directly. But, I'm not opposed either; we could just implement all the above and see how the ecosystem evolves.

Good counterpoints, @kripken. It seems like not only do the browser people have all the power over Wasm design, but they aren't concerned with pleasing any other audience, so we can't expect Wasm to end up suitable for any purpose other than code execution. Forking into two formats is unfortunate, but avoids relying on the browser people to offer concessions to us.

I guess the fork itself is already set in stone, and the question is just, how much can still be done on the "browser Wasm" side (just linking?) I wonder if using Wasm as the linking format might indirectly encourage the browser people not to completely forget about stakeholders other than themselves (e.g. because people will take more feature requests to the Wasm CG instead of whoever ends up in charge of ".byn" - maybe that's you.)

Where did this name "byn" come from, and post-MVP, will it cater to ALL languages, not just systems languages? The naming scheme of all the tools here is rather baffling to me.

I've argued for a long time I believe wasm has been going down a bad path in terms of its text format

What path is that? Do you have inside information or am I just not monitoring all the right channels? I'm frustrated that no one will talk to me about my text format proposal - no bikeshedding, not even a simple "for" or "against".

The problems need to be solved: wasm is a deployment format, a web format, a format that developers will work with, so it needs to be more than a cryptic, convoluted encoding.

There is a proposal that might address the 'dead code' problem; it looks like blocks will unwind. It's moving in the right direction to keep a useful structured presentation. If there are show stoppers then let's talk about them and solve the problems.

If wasm will not fly by the judgement of the community group then that would be a show stopper, and no one would want that outcome, so it will be resolved. I think the group has come a long way in the past six months; I see light at the end of the tunnel. Just a little more patience, focus on the problems.

I get the impression that binaryen might have run into a small barrier of having to change strategies. Perhaps some help to get over it would clear the way, but the problems need to be understood, and we need some clarity on where wasm is going, which might take just a little longer.

@qwertie Your text format does not appear to be constrained to a 'familiar' structured code style, and could adapt even to a linear stack machine code, so that is the reason it is a distant consideration for me. The show stoppers in the presentation format for me are other matters, such as structure and variable references etc, things that could be presented in a range of formats. I would be happy at this stage to defer choices over a particular text format - defer the bike shedding - until post-MVP, so long as there are some demonstrations of workable familiar text formats.

@qwertie

Where did this name "byn" come from

Literally from shortening binaryen. This is meant to be a binary file format for Binaryen IR. But I think once we experiment with it, it might be a useful object format more generally, we'll see how things evolve.

About the text format, there was discussion in issues and PRs. For example this and this. Also, PRs modifying TextFormat.md are relevant; there could be more there.

But sadly much of the debate was not public. To summarize it, I've been saying that we should consider the text format as part of the wasm design, and not leave it to the end; and I've opposed changes to wasm that I believe make it less structured and less of an AST, as I believe ASTs are significantly better for text representation. But on both the group went another way.

At this point, it was decided as mentioned elsewhere to use a linear list of stack machine instructions for the wasm text format, since there just isn't time to do anything more given the current design of wasm (a stack machine) and the timetable (short). I think that's why your issues have not received responses.

I'm confused, because I thought it was clear that a structured text format is still usable even in the stack machine - and that LES works fine even for a linear list of instructions (with just a little bit of structure given by the block-expressions {...}, loop (label) {...}, and if (...) {...}.)

I'm not sure what you mean by structured here, but in any case, all I meant was that given wasm's current design as a stack machine, there is really just one natural text format, a simple linear list (plus blocks etc). It's essentially just writing the binary format as text, in its natural order. And it was agreed to have that as the wasm text format, so I don't think anyone is interested to talk about other options. I think that makes sense given where we are at this point (but as mentioned before I did all I could throughout the process to convince people to avoid getting there).

For example, code that conceptually does this (and that I wish we could represent as this!),

if (x == y) {
  x = 1;
}

will look something like this in the wasm text format:

get_local y
get_local x
i32.eq
if
  i32.const 1
  set_local x
end

If LES is compatible with that, then perhaps it's worth stressing that.

First, linking and platforms:
(Let's use "platform" instead of just "linking". Platform is where you define not just execution semantics, but also linking, ABIs, and other conventions. Fundamentally it's where you define interoperability). The first observation there is that we can't have only Binaryen as a platform. Unless we can mandate that everyone use Binaryen, we either have to define completely and stably how it maps to the underlying wasm platform (thereby duplicating all the platform design work, creating a huge maintenance burden, and running into a lot of pitfalls) or say that we don't care about interoperating outside of our sub-platform. As an outside example, even systems that use a common framework like LLVM don't use that layer as their "platform" in this sense; they use the underlying target's native platform primitives (ELF object files, etc). LLVM could be capable of expressing everything needed to have this interoperability; in fact they tried it in the past, but it runs counter to the other goals they have for LLVM. They got optimizability and readability right (as you mentioned upthread), but it came at the cost of LLVM's ability to be a good platform unto itself. LLVM doesn't need to be a platform in that sense, because it sits on other stable platforms where that work is already done. In essence they made the same choice I'm advocating here, for some of the same reasons.

Introspectability:
This is another way of saying "understandability" or "readability", and it sounds to me like this is probably the most important thing to you (or at least more important to you than to most of the other involved parties). First, understandability is subjective (for that matter, even ease of targeting is subjective). We've already seen that when we discussed a flavor of wasm more friendly to producers, but your idea of what that would ideally be was quite different from @sunfishcode's (AST vs flat 3-address style, etc). I think you can do much better on that front if you can focus on it without being required to also maintain ABIs and worry about the implications of whether your binary format has attached metadata or sections, or what you have to do to your frontend in order to interoperate with Joe Schmo's compiler (or say, Herb Sutter's) who doesn't want to use Binaryen. In short, you can be opinionated.

Philosophical rant:
If we want to have an opinion about what code on the web should look like, we have a forum where we dictate that to our users: it's JavaScript. The whole point of wasm is to let our users dictate to us how they want to code on the web. Interoperability is absolutely key to that. In my opinion it's even more important than introspectability. Our primitives can't be opinionated because they have to be general. In other words I think a byn format will be much better at the things you care about if you don't have to care about all the things, and you can have more control over the things you care about.

@qwertie I agree that "binaryen" being a repository with "all the wasm tools" and also a compiler library and also a linker and a file format and a compiler flag is confusing. I think we should probably try to factor tools, names etc and maybe even repos in a less confusing way before we release/publicize a toolchain. But that's another thread :)

Fair points in that post and in discussion today. Maybe I'm signing up for too much work too early by laying out a large plan here. Instead, a more incremental plan might be

  • Keep the current AST as it is, i.e. not follow wasm into full stack machine land.
  • All tools remain as is, i.e. we still have wasm-as, wasm-dis, etc., but importing wasm in wasm-as is no longer 1-to-1 in all cases (if it encounters overly-stacky code, it might use extra locals or such to fit it into our AST - see the sketch after this list), and importing wasm in s-expression format might have some limitations too.
  • Rephrase the project's goals to something in between the current ("compiler and toolchain library for wasm, i.e. helps do stuff on wasm") and new proposal ("compiler library for wasm, helps build compilers to wasm"). Not sure exactly where in the middle. On the one hand the loss of 1-1 wasm import means we aren't a fully general wasm library any more. Yet, for practical purposes, maybe we are good enough, or maybe this is a temporary limitation we'll end up fixing (with a second IR as discussed above).
  • I'll do some prototyping of the optimization and object format stuff discussed above, as well as introspectability, and see how that goes before proposing a more elaborate big plan.
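
To make the point about extra locals concrete, here is a minimal sketch of what such an import might do (the function names and the scratch local $t are invented for illustration). Given stacky input like

call $get          ;; pushes an i32
call $log          ;; returns nothing, so it can't sit in an AST operand slot
i32.eqz

the importer could spill the pending value into a fresh local instead of needing a "first" node:

(i32.eqz
  (block                           ;; block whose value is its last expression
    (set_local $t (call $get))     ;; spill the stacked i32 into the scratch local
    (call $log)                    ;; the void call runs in between
    (get_local $t)                 ;; restore the value as the block's result
  )
)

Emitting this again produces an extra set_local/get_local pair, which is exactly the sense in which the import is no longer 1-to-1.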

Reading code is subjective for the consumer, be it people or machines, but that does not mean that readability is not a concern. If I were to claim that readability is subjective for the machine too and thus not a concern of the group, then surely people would shake their heads. We have a large community group of experienced programmers to make subjective judgements on the readability of the wasm format. I can point to many areas where readability for people aligns with readability for machines too. These are not matters that can just be cast aside as subjective and ignored. The choice of chairs is also a subjective matter for the community group, but does that mean it can be arbitrary? The group members will make a judgement!

A lossless transform of the stack machine code to a structured presentation format seems possible; it has been discussed elsewhere. If there are show stoppers then please raise them. In the absence of show stoppers it must be accepted as technically plausible, so claims that the only possible presentation of the wasm code is linear stack machine code are not sound, and are not a sound basis for binaryen diverging.

The binaryen IR will need to address the multiple values problem too, to be as semantically expressive and efficient as the wasm code. It does not appear to have a path to doing so, so it is unfortunately a dead end, stuck where the wasm design was over six months ago.

If binaryen development were to engage and embrace the path being explored then it might add to the development effort. I had personally encouraged this exploration many months ago, explaining how pick could be used to address the problems and giving examples of where I was at. So binaryen is six months behind now and is not adding much to the development of the wasm design, but that seems understandable.

A possible path forward, to get over this barrier, might be to code up that lossless transform. Perhaps doing so in the binaryen development environment is a chunk of work, but someone could prototype it in a rapid prototyping language (ocaml etc.) and use s-exps etc. I really do not see the technical show stoppers; the decoder might need to be more type aware, but is that a show stopper? Attempting to write this transform would shed light on any show stoppers and add to the wasm design development.

Another path might be for binaryen to add an SSA decoder and to work on the SSA form. Wasm is designed to be easily decoded to SSA form, so this is not a big task. This might also help keep decoder efficiency in mind, might make some decoder efficiency tests possible in binaryen, and perhaps help discover ways to optimize code to make this decoder more efficient (fewer intermediate phis etc.).

@JSStats: Regarding multiple values, my perspective is that we can do them in a second IR if that is necessary. But it seems premature to think it will be - wasm isn't adding multiple values based on perf data that I am aware of. It seems possible a compiler emitting wasm without multiple values (perhaps with just simple support for function calls) will still be high-performance.

Regarding the lossless transform, the implications of it are discussed above in this issue - it means some new AST nodes with awkward semantics that have downsides for the Binaryen optimizer. But if we add a secondary IR at the stack machine level, then it could do multiple value opts if that makes sense, and a lossless import of wasm to that IR would of course be possible.

@kripken Supporting multiple values returned from function calls is a measurable runtime performance feature. The hardware can likely return some of these values in registers, so it can be done efficiently. Without this feature those return values would need to go through the linear memory. The CIL style solution is to bundle them into a return structure. So can we agree on that point, that it is a runtime performance feature?
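
To make that concrete, here is a rough, purely illustrative sketch ($divmod, $divmod_out, the operand order, and the out-pointer convention are all invented, and the multi-value form is not valid MVP wasm):

;; with multiple values, both results could simply arrive on the value stack:
call $divmod          ;; pushes quotient, then remainder
set_local $rem
set_local $quot

;; MVP-style workaround: the extra result goes through linear memory via an out-pointer
get_local $outptr
call $divmod_out      ;; returns the quotient; writes the remainder to memory at $outptr
set_local $quot
get_local $outptr
i32.load
set_local $rem

The second form pays for a store in the callee and a load in the caller on every call, which is the runtime cost being claimed here.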

From there it is a matter of connecting these return values to their uses in the wasm encoding. At one extreme all the results could be stored in local variables, and at the other extreme all pushed to the value stack, which is the proposed design solution. These choices might not affect the runtime performance of an optimizing compiler, I can agree on that, but they might for a baseline compiler, and they might have a measurable impact on decoder efficiency, for some of the same reasons that they affect readability for people (people reading code want to know the scope or live range of values too, and the expression pattern helps a lot here, as would block scoped constants). If you wrote an SSA decoder for binaryen these issues would be clear: you would see the difference it makes to the work needed at control flow merge points and to the known live set for a baseline compiler, and probably learn more and be back contributing to the wasm design development. Binaryen does some great optimizations of the expressions, and perhaps you could do something similar with a different set of constraints.

I would like to understand the 'awkward semantics' and the 'downsides for the Binaryen optimizer', and to see if people can propose some solutions; understanding these can be very valuable input into the wasm design, and getting this right is the next step. I don't see the same conclusion that you seem to be drawing from the discussions - perhaps an example would help me understand?

The structured stack machine design is moving towards an SSA encoding, which should be good for optimizations and transforms. If the function arguments were pushed onto the stack it would be closer again, and if loops could accept (pop) arguments then these could be emitted in SSA style; the pick operator gives the scoped SSA definitions. Writing an SSA decoder might help reveal some opportunities here.

@JSStats: there are examples in the discussion earlier in this issue (e.g. here).

For example, code that conceptually does this (and that I wish we could represent as this!),

if (x == y) {
  x = 1;
}

@kripken But we can represent stack machine code that way! The proposal I keep begging people to read also includes a flat representation (which is the same language used differently). But rather than simply transliterating the s-expressions syntax, I proposed a more compact representation, something like this:

$y; $x; i32_eq; if {
    1; set_local $x
}

Introspectability:
This is another way of saying "understandability" or "readability"

@dschuff Maybe. I had assumed he was talking more about computational introspectability: optimization, decompilation, making it easy to generate code at runtime. Various computational tasks are easier with an AST.

understandability is subjective

I'm pretty sure there's an objective case to be made about an AST being more understandable to humans than a flat instruction list... to me it's so intuitively obvious that I'm not sure how to argue the case.

@kripken I agree with your incremental plan.... except:

if it encounters overly-stacky code, it might use extra locals or such to fit it into our AST

Doesn't the first operator solve this problem though? (Apologies if we're going in circles, but unlike JSStat's proposal, first is a very simple solution that seems to allow perfect round-trips, and it's too early to have two IRs or commit to a particular way of handling multi-values)
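
For concreteness, the stacky input from the earlier import sketch would round-trip with first like this (same invented names):

(i32.eqz
  (first
    (call $get)
    (call $log)
  )
)

Since first itself emits nothing, writing this out reproduces the original three instructions exactly.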

there are examples in the discussion earlier

Well, one example. As I pointed out though, the example didn't illustrate a "negative weight" problem, and the general idea of assigning weights to operators seems useful in itself. Can you give a more problematic example?

@kripken Thanks, an example. So what is the problem here? The s-exp has a first node that is not in the stack machine encoding. The solution is to decode it away to a more workable representation, or simply not use first and instead use scoped constants (just as SSA definitions are referenced).

(i32.eqz
  (first
    (tee_local $x (call ..))
    (store $x (i32.const 1))
  )
)

<call args>
call
tee-local $x
i32.const 1
store
i32.eqz

Alternative representation:
$def1: ..
$def2: ..
$def3: call $def1 $def2
_: set_local $x $def3
$def4: i32.const 1
_: store $def3 $def4
$def5: i32.eqz $def4

It should be easy to decode the wasm stack code into this format. It requires some type information, such as the number of arguments and results for function calls, but is that a show stopper?

Converting back to the stack code might not be so hard either, though it might require live range information - but I thought binaryen already has that information to optimize the emitted expressions.

Converting to the text format is also relatively easy. The definitions could just be presented as scoped constants, but with some live range information it could be usefully emitted as expressions where they fit.

Are there problems with this approach? If the wasm encoding frustrates this then we need to bring them closer. Experience trying would likely help move the wasm design along.

@qwertie first does not solve the problem in general, and we do need a general solution and a plan for wasm, otherwise the MVP might be a dead end. Your following example is not a familiar structured language, and the s-exp proposal already seems to allow something similar, so this is just another way to present it. I think that the choice between s-exp and LES does not block the path of wasm, not in the way that sorting out the handling of multiple values does.

$y; $x; i32_eq; if {
    1; set_local $x
}

@qwertie: interesting, I didn't realize LES could do that. It's too bad your proposal isn't getting more attention, I agree.

About first: It can't have zero weight, e.g. assuming the order of operations allows it we want to do this transform:

(block
  (call a)
  (call b
    (first
      (call c)
      (call d)
    )
  )
  (call e)
)

==>

(block
  (call a)
  (call b
    (call c)
  )
  (call d)
  (call e)
)

As for it having positive weight, the example from before showed it must have the smallest positive weight (as the other nodes it is compared to could be arbitrary), which is a global constraint on the AST and its weight system - are we sure we don't need some other node to have the smallest weight, now and in the future?

That leaves a negative weight or something more complicated than a fixed weight. Of course it's doable, but I just don't think the complexity in the main IR is worth it. A secondary IR dedicated to stack machine-style opts would be the natural place for it and related things (multi-values, etc.).

drom commented

But rather than simply transliterating the s-expressions syntax, I proposed a more compact representation, something like this:

$y; $x; i32_eq; if {
    1; set_local $x
}

@qwertie in good old Forth it would be:

$y $x = if
  1 to $x
then

@qwertie

I had assumed he was talking more about computational introspectability: optimization, decompilation, making it easy to generate code at runtime. Various computational tasks are easier with an AST.

That's true, but I would consider that as just part of "optimizability" (or specifically it would be suitability to analysis which is of course part of optimization) which I had already considered as the other major goal of binaryen IR. My take on @kripken's comments (and ones that he has made in the past, and of course his ongoing concerns about the text format) was that he also considers understandability a primary goal. e.g. there is no doubt that LLVM IR is pretty good in that respect.

I'm pretty sure there's an objective case to be made about an AST being more understandable to humans than a flat instruction list... to me it's so intuitively obvious that I'm not sure how to argue the case.

2 responses.
Firstly: sure, when you put it like that it sounds obvious, but we are not talking about a structured AST vs, say, a traditional assembly language. A stack machine does not have the same structure as a pure AST form, but it also does not have the "no-structure" that an assembly language has. Here's another interesting observation: we've had a lot of lisp fans hanging around here since the beginning, who would no doubt share your intuitions about ASTs, but the switch to a stack machine has suddenly brought a lot of Forth fans out of the woodwork, who might have different opinions about how intuitive stack machines are :)

Secondly: I am actually extremely skeptical that even if, say, we stay as an AST and get the best and most intuitive possible syntax for it, that the output of an optimizing compiler will be any more (or less) usefully readable than assembly code is for native C developers, or JVM or CLR bytecode is for users of those systems. Compilers do a lot of transformations (and doing more transformations often makes them more useful), and the output will always necessarily be much lower-level and well-removed from the source. Wasm developers will be using wasm for the same purposes that assembly or VM bytecodes are used today (as opposed to what you used to be able to do on web pages; i.e. read the JS source that the page author wrote). I say "used to" because optimized and minified JS is already no more (or less) useful than other VM bytecodes, and wasm will not make this situation any better, or worse. That isn't to say that understandability isn't valuable, or that we shouldn't try to make it as understandable as possible. But IMO we need to keep our expectations realistic as to the result and its utility.

I think there are big differences in text usability between those categories:

  1. x86 or ARM assembly is very low-level and has a lot of target specific noise.
  2. JVM and CLR bytecode on the other hand is portable, and also contains higher-level information like objects.
  3. Minified JS (with prettified whitespace) has the benefits of 2, plus (a) structured control flow, (b) nested expressions, and (c) is still a human-readable/writable language, so you can easily hack in new code while debugging it.

And asm.js is somewhere between 2 and 3 (has structured control flow, nested expressions, and hackability; lacks obvious objects).

So while I agree we need to keep people's expectations realistic, I think we can be pretty close to what the web currently has (minified JS, asm.js). The wasm MVP isn't going to do that, as it's shipping a linear list of stack machine instructions (somewhere between 1 and 2 - portable like 2, but no objects). And while wasm might get a better text format later, without an AST it will always have to deal with non-AST patterns, which means it's going to be no better than decompiled linear code as in 1 and 2 (decompilers can sometimes do a good job of course, but sometimes they don't) and it's hard to hope it will have hackability as in 3c. So I'd argue that would still be worse than asm.js.

Anyhow, if in fact wasm gets a better text format later, Binaryen could adopt it as its text format, and just like Binaryen's IR is the AST-friendly subset of wasm stack machine code, Binaryen's text format would be the AST-friendly subset of the new text format. Which would avoid the issues just mentioned and so be comparable to asm.js, but also strictly better than it (no annoying coercions, etc.). We'd also have the option to alter Binaryen IR in ways that help the text format, if we find that makes sense.

@drom LES has three advantages over Forth syntax: (1) Similarity to JS + (2) familiarity to typical developers (two longstanding design criteria for the text format); and (3) general purpose use related or unrelated to WebAssembly (edit: for instance, using the same parser for binaryen, even though Wasm is a stack machine and binaryen is AST-based).

The wasm MVP isn't going to do that, as it's shipping a linear list of stack machine instructions (somewhere between 1 and 2 - portable like 2, but no objects). And while wasm might get a better text format later, without an AST it will always have to deal with non-AST patterns

@kripken I don't care how compressed the timeframe is, I'm certain it's possible to have the better text format now. If they run with my proposal (taking line break as a terminator), they can start with a flat format now and add structure later, when they have time, without necessarily changing the official spec.

IMO the documented spec should just have a section on syntactic sugar for infix operators, which the browsers will initially ignore and which is straightforward to code in an assembler. But if this time crunch is serious they can even skip that part.

Meanwhile I can write LES parsers for them to use, or they can take up my original proposal of defining Wasm as a subset of LES (and therefore write Wasm-specific parsers). The biggest thing I'm asking for is to change some opcode names, e.g. i32_add or i32'add instead of i32.add, to keep the grammar of LES at LL(2) instead of LL(*). What do you think, can you talk to the others for me?

assuming the order of operations allows it we want to do this transform:

(block
  (call a)
  (call b
    (first (call c) (call d))
  )
  (call e)
)
==>
(block
  (call a)
  (call b (call c))
  (call d)
  (call e)
)

By itself, I don't see any reason to perform this transformation (and since it changes the order of ops you wouldn't do it without a good reason). So, why do it?

are we sure we don't need some other node to have the smallest weight, now and in the future?

Being effective now is enough. In the future, the weights can be changed, but with multivalues, we might switch to a completely different solution (whether it's separate IR or an AST-based solution) for which first is syntactic sugar or even deprecated.

Moving call d earlier, if possible, would have advantages for performance: it would avoid the need to preserve the result of call c across call d. But just use an IR closer to the wasm stack-machine code; for example, in the following, this transform is justified by the real issue of reducing the live set at calls, rather than by some more remote cost associated with nodes.

(block
  (call a)
  ($def1 (call c))
  (call d)
  (call b $def1)
  (call e))
=>
(block
  (call a)
  (call d)
  ($def1 (call c))
  (call b $def1)
  (call e))

@qwertie LES alone does not make a 'better text format'. You seem to be arguing that LES can present both a structured language with expressions, and a list of stack machine instructions, so it seems already accepted that the choice of LES versus s-exp versus something else does not block wasm development. LES is free to change the naming if that helps, we can all accept that and opcode names do not block wasm design. Is there anything related to LES that requires changes to the wasm design?

@qwertie: I actually think the bikeshedding of small syntax stuff (like opcode names as you mentioned) is exactly something they want to avoid. The sentiment seemed to be "we'll just ship the flat s-expr format of what we have now, there is nothing left to fiddle with, we have no time for that".

If you still want to push LES as an option, I think one possible productive route could be to make a browser addon that shows LES for wasm modules in the developer tools. Seeing it in action could convince people. Especially if your addon receives interest from the developer community.

About that last transform: it's beneficial to do things like it because by removing unnecessary nodes the code is easier for other passes to optimize (the more canonical the representation, the better, passes depend on such conventions). A further reason is that removing nodes makes later optimization passes faster. In fact the Binaryen optimizer does that transform now with blocks, for those reasons.
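
For example, roughly the kind of block merging meant here (names invented): an inner block with no label uses and no value of its own can be folded into its parent, which leaves the emitted wasm unchanged but gives later passes fewer nodes to walk.

(block
  (block
    (call $a)
    (call $b)
  )
  (call $c)
)

==>

(block
  (call $a)
  (call $b)
  (call $c)
)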

drom commented

we've had a lot of lisp fans hanging around here since the beginning, who would no doubt share your intuitions about ASTs, but the switch to a stack machine has suddenly brought a lot of Forth fans out of the woodwork, who might have different opinions about how intuitive stack machines are :)

@dschuff yes, stack machine based bytecode brought me out of the woodwork. And it brings some hope that it will be done right this time. Forth has some valuable 45 years of experience dealing with stack machines, and has developed a robust "woodworking" tool set that is worth learning from before reinventing it from scratch. IMHO

I am building a compiler for a new language, and came to the conclusion that the most important backend I could build would be a wasm backend, given that it has the potential to become the standard way to ship processor-agnostic binaries for server, desktop and mobile, both inside and outside the browser.

I just now came back to get an update on the progress of wasm, for the first time in 6 months, and discovered the switch from an AST to a stack-based architecture. I consider this to be a huge mistake (and I think the binaryen folks on this thread are being too diplomatic about how to cope with the upstream change!).

There are several reasons why switching to a stack-based architecture is a bad idea:

(1) As pointed out in other comments, this is a major step backwards for the transparency of code and ease of learning on the web: stack-based code is nearly impossible to read.

(2) Stack-based architectures are massively inefficient at runtime, which wastes battery on mobile, causes UI latency and jank (which incurs a cognitive burden in users), kills baby polar bears etc. etc.
The following paper shows how much overhead a stack-based VM incurs compared to a register-based VM, after significant effort was expended to optimize both:

https://www.usenix.org/legacy/events/vee05/full_papers/p153-yunhe.pdf

As far as I can tell, the reason for switching to a stack-based architecture was initially triggered by the desire to allow void expressions anywhere, but ended up coming down to the increased ease of building the runtime (since you don't have to build as much compiler backend infrastructure if you're already at a lower level than an AST), as evidenced in the following comment by @rossberg-chromium from earlier in this issue:

The superficial history is that most of it "happened" in ongoing
conversations between a couple of Mozillians and Googlers. Luke & Dan have
been arguing in the more stack-ish direction for a while. Ben & I were
rather skeptical at first, but feared "slow drift" (such as more
incremental changes of the drop kind). The pivot point was the suggestion
to "allow void-expressions anywhere", which quite plainly breaks the AST
semantics. So, to see to the end game, we prototyped a stack machine in
formal spec, ml-proto and V8. It turned out to be simple enough to convert
us over.

It is likely that the overhead gap between compilers forced to compile from a stack-based IR and those compiling from an AST-based IR would be even larger than the difference between stack-based and register-based IRs, since the higher level of an AST-based IR makes so many optimizations simpler. For an example of the types of hoops you have to jump through to safely perform even simple dead-code elimination in a stack-based IR, see this paper:

http://set.ee/publications/bytecode07.pdf

So many simple optimizations require careful thought, soundness proofs, and new abstractions (such as swapping virtual stack snapshots due to differences in control flow) when reasoning about a stack-based VM, and most peephole optimizations are simply undoing inefficiencies imposed on the code by lowering to a stack-based form (e.g. replacing DUP SWAP with DUP), because the representation is so low-level:

http://www.rigwit.co.uk/thesis/chap-5.pdf
https://users.ece.cmu.edu/~koopman/stack_compiler/stack_co.html

Some other reasons why stack-based representations can be more inefficient include the following (there are many -- the quoted points are taken from this Wikipedia page):

  • Stack machines must always properly unwind the stack before exiting an execution frame (they cannot leave unneeded values on the stack, the way that a register machine can simply leave unneeded values in registers, depending on the calling convention).
  • "Some in the industry believe that stack machines execute more data cache cycles for temporary values and local variables than do register machines."
  • "In register machines, a subexpression which is used multiple times with the same result value can be evaluated just once and its result saved in a fast register. . . With stack machines, in contrast, results can be stored in one of two ways: A temporary variable in memory. . . The second way leaves a computed value on the data stack, duplicating it as needed." (both are inefficient)
  • "Rigid code order: Scheduling memory accesses requires explicit, spare registers. (Out-of-order execution) is not possible on stack machines without exposing some aspect of the micro-architecture to the programmer."
  • "it takes about 1.88 stack-machine instructions to do the work of a register machine's RISC instruction."
  • Stack machines "hide a faster register machine inside . . . if the code stream instead had explicit register-select fields which directly manipulated the underlying register file, the compiler could make better use of all registers and the program would run faster."
  • "Interpreters for virtual stack machines are often slower than interpreters for other styles of virtual machine. This slowdown is worst when running on host machines with deep execution pipelines, such as current x86 chips. A program has to execute more instructions when compiled to a stack machine than when compiled to a register machine or memory-to-memory machine. Every variable load or constant requires its own separate Load instruction, instead of being bundled within the instruction which uses that value. "
  • "In some interpreters, the interpreter must execute a N-way switch jump to decode the next opcode and branch to its steps for that particular opcode. . . The host machine's prefetch mechanisms are unable to predict and fetch the target of that indexed or indirect jump. So the host machine's execution pipeline must restart each time the hosted interpreter decodes another virtual instruction. This happens more often for virtual stack machines than for other styles of virtual machine."
  • "Pure stack machines are quite inefficient for procedures which access multiple fields from the same object. The stack machine code must reload the object pointer for each pointer+offset calculation."

(3) AOT compilation from a stack-based IR is much more complicated than AOT compilation from an AST. This means that some implementers of wasm will simply choose to build an interpreter, rather than an AOT compiler, leading to great inefficiencies. (On the other hand, if an interpreter is what is wanted, building an interpreter for an AST-based IR would not be much more complicated than building an interpreter for a stack-based IR -- so compilation suffers with a stack-based design, but interpretation does not suffer either way.)

Basically, stack based IRs / bytecode systems are a terrible idea if you care about either interpreter efficiency, difficulty of writing optimizing AOT compilers, or about code transparency, readability or learnability. It is sad that four guys made this decision, perhaps without considering all aspects of these issues, which will affect both billions of devices and billions of users. This is a major step back for the web.

Is there any chance the decision could be revisited?

(I see this is a binaryen bug, but at least people are talking about the issues here -- is there a better venue for this discussion?)

@lukehutch Personally I agree with you - I think it was a mistake for wasm to make this change.

However, the specific arguments about perf are mostly incorrect - it's true that executing a stack machine can be very inefficient compared to a register machine etc., but the goal in wasm isn't to actually execute the code that way - VMs would translate it to something efficient first (optimized machine code in most cases).

With that out of the way, as I said, I definitely agree that the stack machine change was a bad idea for the rest of your reasons - code transparency, readability, learnability. And more generally, openness in the wide sense.

Sadly I don't think there is much of a chance to revisit this. The small group of people driving the wasm design are all in agreement on this matter (as you quoted), and everyone is working hard towards shipping. But, if you want to try, then opening an issue on the wasm design repo would be the right place.

@kripken Correct, wasm will typically be AOT-compiled down to native code. But most of the points I raised are also true of AOT compiled code, not just interpreted code.

From the Wikipedia page on stack machines: "The object code translators for the HP 3000 and Tandem T/16 are another example.(15)(16) They translated stack code sequences into equivalent sequences of RISC code. Minor 'local' optimizations removed much of the overhead of a stack architecture. Spare registers were used to factor out repeated address calculations. The translated code still retained plenty of emulation overhead from the mismatch between original and target machines. Despite that burden, the cycle efficiency of the translated code matched the cycle efficiency of the original stack code. And when the source code was recompiled directly to the register machine via optimizing compilers, the efficiency doubled. This shows that the stack architecture and its non-optimizing compilers were wasting over half of the power of the underlying hardware."

Refs:

(15) HP3000 Emulation on HP Precision Architecture Computers, Arndt Bergh, Keith Keilman, Daniel Magenheimer, and James Miller, Hewlett-Packard Journal, Dec 1987, p87-89. (PDF)

(16) Migrating a CISC Computer Family onto RISC via Object Code Translation, K. Andrews and D. Sand, Proceedings of ASPLOS-V, October, 1992

Both of these papers are old, but the point is still true today that optimizing stack-based IR code is exceptionally difficult. Stack-based IRs make the code less efficient, and most of the difficult work of building an optimizing compiler is just using local optimizations to undo the inefficiency of the stack machine. Once you undo that inefficiency, you're left with code that is borderline obfuscated, difficult to optimize further, and much less efficient than code that has not passed through a stack IR stage.

Thanks on the pointer to the wasm design repo, I'll post there.

Yeah, it's true that

This shows that the stack architecture and its non-optimizing compilers were wasting over half of the power of the underlying hardware.

But in wasm we are talking about optimizing compilers, and specifically ones using SSA IRs. The wasm stack machine code is translated into the VM's SSA IR, at which point it doesn't matter if it came from a stack machine representation or register machine representation or AST or anything else - it's a bunch of operations on virtual registers in basic blocks in a CFG. Then it can be optimized exactly like the best native compilers do today.
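
As a tiny sketch of that translation (the vN names and the SSA notation are invented, purely for illustration):

get_local $x        ;; v1 = x
i32.const 1         ;; v2 = 1
i32.add             ;; v3 = add v1, v2
set_local $y        ;; y = v3

From this point on the optimizer only sees the virtual registers, not a stack.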

We also have perf numbers on wasm running in current VMs - it's very close to native speed. asm.js was already close, and wasm improves on that. So throughput is just not an issue here.

@lukehutch It's not so bad, sleep on it and take another look. There is not really a 'stack machine'; rather, wasm is designed to decode quickly to validated SSA form, and with a few extensions code could be encoded in SSA form, so the 'stack machine' can be viewed as just an efficient encoding. The stack encoding is frustrating to work with, so just don't: it's not intended to be an IR, it's an encoding. Decode it to SSA form and work with that. Wasm is designed to make this easy even for untrusted code, by being able to validate it and decode it to SSA in a single pass - that is good, isn't it? Even the local variables are just placeholders for holding definitions while decoding and are expected to be gone at the end of the decoding stage.

The design is not even complete; we don't have high performance decode/compile implementations yet, and I suspect that some more changes will be necessary to support them. Wasm is optimized to compile down to machine code AOT, before it runs, not incrementally compiled as needed, and not optimized for an interpreter, which is still possible.

@Wllang then why not encode directly in SSA form? Adding the stack-based form only obfuscates the problem. That doesn't make any sense; there is no value added by the stack-based representation other than the intellectual curiosity of how to optimally solve the new problems that a stack machine introduces when you're trying to convert to a more useful form.

The problem is that SSA form is not compact - it's much larger than an AST or stack-based representation.

@lukehutch I am trying to do just that, and have made some recent progress, but I don't have a conclusion on the encoding efficiency just yet. It is possible that @kripken is right, that the local variables will prove to be an efficient encoding and SSA 'much larger', but I am a long way from conceding just yet, and if I do I'll be able to explain why. There are certainly many functions that encode well in SSA form, but I also see data flow patterns that are a challenge, and I need to work through them.

@lukehutch Here's a quick example of where I'm at with it, the main function from a zlib benchmark. Compare it to binaryen IR, and I hope people see merit in this; perhaps imagine it in your preferred language format. Obviously the code below needs work - there are lots of patterns that are not optimal, I'm just trying to get to the proof of concept stage. E.g. many of the br_case paths need jump threading and can just fall through, the loop could fall through here, and I have a long list of other patterns that need work. You'll just have to wait on the encoding efficiency, or perhaps ask @kripken if he really knows.

(func $main ((n0 i32) (n1 i32)) (i32)
   (declare (export "main"))
   (set n2 (i32.sub (i32.load :offset 4 0) 16))
   (i32.store :offset 4 0 n2)
   (set n14
        (block $block-1
          (set n4
               (block $block-2
                 (br_if $block-2 (values 500) (i32.lt_s n0 2))
                 (set n3 (i32.add (i32.load8_s (i32.load :offset 4 n1)) -48))
                 (when (i32.gt_u n3 5)
                   (i32.store n2 n3)
                   (drop (call $printf 160 n2))
                   (br $block-1 (values -1)))
                 (br_case nil (0 1 2 3 4 5 0) n3
                      (case (br $block-1 (values 0)))
                      (case (br $block-2 (values 60)))
                      (case (br $block-2 (values 250)))
                      (case (br $block-2 (values 500)))
                      (case (br $block-2 (values 2500)))
                      (case (br $block-2 (values 5000))))))
          (set n5 (call $malloc 100000))
          (block $block-12
            (loop $repeat-8 (n6 n7 n8) (values 0 0 17)
               (set (n9 n10)
                    (block $block-9
                      (when (i32.lt_s n7 1)
                        (unless (i32.and n6 7)
                          (br $block-9 (values (i32.and n6 31) 0)))
                        (br $block-9 (values n7 (i32.rem_u (i32.mul n6 n6) 6714))))
                      (br $block-9 (values (i32.add n7 -1) n8))))
               (i32.store8 (i32.add n5 n6) n10)
               (set n11 (i32.add n6 1))
               (br_if $repeat-8 (values n11 n9 n10) (i32.ne n11 100000))
               (br $block-12 (values))))
          (block $block-14
            (loop $repeat-13 (n12) (values 0)
               (call $doit n5 100000 n12)
               (set n13 (i32.add n12 1))
               (br_if $repeat-13 (values n13) (i32.lt_s n13 n4))
               (br $block-14 (values))))
          (drop (call $puts 176))
          (values 0)))
   (i32.store :offset 4 0 (i32.add n2 16))
   (values n14))

@Wllang thanks for the example. What is this exactly? Is this some SSA code converted from the lowlevel stack representation, rendered into s-expression form, or is this a proposed alternative to the current stack machine model?

@lukehutch It's binaryen output decoded to SSA, retaining the control structure. There are already provisions in wasm to encode much of this, and it's a minor variation of the encoding:

  • provisions for blocks returning multiple values, presented as (set (v1 v2) (block ...))

  • provisions for loops having a signature which is used for the phi nodes here. The phi nodes and their values popped on loop entry are presented as (loop (n1 n2) (values v1 v2) ...) in the loop header here.

  • To encode the above requires adding the pick operator.

  • The above is presented in packed expression format, but it can be presented one operator per statement, it's just a different presentation. It's not clear above which definitions need pick and which are consumed in stack order, but that seems an encoding detail. Some redundant type information is also omitted in the presentation above.

  • The encoding benefits from having block ends unwind the stack.

  • The function arguments are pushed on the stack.

  • Changed br_table to a table-case operator as the constraint on br_table of passing a fixed set of values to all targets was trouble to work with. Need to revisit this.

  • The lack of a label and separate signature for the loop exit path was trouble, so wrapped them all in a block and need to revisit this.

  • The control structure need not be preserved, it can be convert to basic blocks, perhaps numbering them to retain some ordering.

  • It is not necessary to start with binaryen wasm output. E.g. it is possible to start with machine code, structure the CFG, convert to SSA, and then convert to the above format - and I do that too: write out C, compile it, and decode it back to wasm just to double check. I found much simpler algorithms than the relooper to structure the code, one or two pages of code.

Here's the above translated roughly to C; the C compiler cleans it up. Obviously a linear memory scheme needs to be chosen too, but for trusted code aligned to the target memory it would be as simple as follows. I have had wasm code compiled in this way running on IoT devices. Add pointer masking and good type derivation to minimise it, and the code has a new layer of security and relatively good performance even without memory management. Add C type annotations and it can interoperate with C libraries cleanly. It's not valid C code due to all the casting and needs some compiler flags; perhaps there is a better approach, need to revisit.

uint32_t main(uint32_t n0, uint32_t n1) {
    const uint32_t n2 = (*(uint32_t *)(0 + 4)) - 0x10;
    *(uint32_t *)(0 + 4) = n2;
    uint32_t n14;
    {
        uint32_t n4;
        {
            if ((int32_t)n0 < (int32_t)2) {
                n4 = 0x1F4;
                goto $block_2;
            }
            const uint32_t n3 = ((int32_t)*(int8_t *)(*(uint32_t *)(n1 + 4))) + -48;
            if (n3 > 5) {
                *(uint32_t *)n2 = n3;
                _printf(0xA0, n2);
                n14 = -1;
                goto $block_1;
            }
            switch (n3) {
            case 1:
                    n4 = 0x3C;
                    goto $block_2;
            case 2:
                    n4 = 0xFA;
                    goto $block_2;
            case 3:
                    n4 = 0x1F4;
                    goto $block_2;
            case 4:
                    n4 = 0x9C4;
                    goto $block_2;
            case 5:
                    n4 = 0x1388;
                    goto $block_2;
            case 0:
            default:
                    n14 = 0;
                    goto $block_1;
            }
        }
    $block_2:;
        const uint32_t n5 = _malloc(0x186A0);
        {
            uint32_t n6 = 0;
            uint32_t n7 = 0;
            uint32_t n8 = 0x11;
        $repeat_8:;
            {
                uint32_t n9;
                uint32_t n10;
                {
                    if ((int32_t)n7 < (int32_t)1) {
                        if ((n6 & 7) == 0) {
                            n9 = n6 & 0x1F;
                            n10 = 0;
                            goto $block_9;
                        }
                        n9 = n7;
                        n10 = ((n6 * n6) % 0x1A3A);
                        goto $block_9;
                    }
                    n9 = n7 + -1;
                    n10 = n8;
                    goto $block_9;
                }
            $block_9:;
                *(uint8_t *)(n5 + n6) = n10;
                const uint32_t n11 = n6 + 1;
                if (n11 != 0x186A0) {
                    n6 = n11;
                    n7 = n9;
                    n8 = n10;
                    goto $repeat_8;
                }
                goto $block_12;
            }
        }
    $block_12:;
        {
            uint32_t n12 = 0;
        $repeat_13:;
            {
                doit(n5, 0x186A0, n12);
                const uint32_t n13 = n12 + 1;
                if ((int32_t)n13 < (int32_t)n4) {
                    n12 = n13;
                    goto $repeat_13;
                }
                goto $block_14;
            }
        }
    $block_14:;
        _puts(0xB0);
        n14 = 0;
    }
 $block_1:;
    *(uint32_t *)(0 + 4) = n2 + 0x10;
    return n14;
}

@Wllang Thanks for explaining. The thing I don't understand is: why encode in stack form at all, if all the consumers of wasm are going to have to undo the stack representation (as you have) in order to do anything useful?

(@kripken I don't think code size is the full answer here -- there must be a non-SSA form that is also non-stack-based, and much easier to work with than the current stack-based system.)

@lukehutch The efficiency comes from the common case of definitions that have a single use and are consumed in a stack-like order - in other words, code that can be represented as an expression and encoded pre-order or post-order. Serial encodings are just not a good representation for some 'useful things' to do with the code, whether pre-order or post-order.

@lukehutch, the most compact representation for expressions is a pre- or post-order serialisation. The latter turns out to be slightly more efficient to process on the consumer side. A stack machine essentially is just the generalisation of such a post-order encoding. As such, it is pretty much optimal. In particular, it keeps all "register indices" entirely implicit.

Of course, what's optimal for expressions may not be optimal in general. However, single-use values are very common, so the stack machine encoding is quite effective.
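
A small illustration (names invented): the expression below and its post-order serialisation, which is exactly the stack encoding, with no operand names or register indices at all.

(i32.mul
  (i32.add (get_local $a) (get_local $b))
  (i32.const 2))

;; post-order serialisation = the stack encoding:
get_local $a
get_local $b
i32.add
i32.const 2
i32.mul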

@Wllang @rossberg-chromium are you saying that for this particular stack machine, there is a guaranteed bijective mapping between the stack representation and the AST, and the stack representation is only used as an efficient serialization mechanism for the AST? If the stack representation is only used to serialize the AST, and it is neither intended to be interpreted nor to be displayed in stack-based form in browser debugging tools, then it makes sense that this is a non-issue.

(It is not true for stack machines in general that the stack representation can be unambiguously translated into an AST, depending on what operations are supported, and what the semantics of those operations are. For example, all deserialization bets are off if the stack machine supports a goto instruction that is able to jump to any instruction, or if it supports operations that can push a varying number of parameters onto the stack, depending on an input value.)

@lukehutch I gave up long ago on seeking an encoding with a one-to-one mapping to a useful presentation language, and it took me too long to realize that it's not necessary, as you can define a canonical decoding that should be sufficient - sorry to people on that one. So consider the wasm binary as having many possible encodings for the same canonical language, and variations on that canonical language can be encoded in an unknown section. Perhaps you'll come to the same realisation at some point, but it seems healthy to explore and challenge.

Practically the stack code is also interpreted and is even the default view-source format it seems. I would like to see some web browser hooks to allow custom presentation formats for view-source so the community can explore a range of text formats, and perhaps it will be possible some day.

The wasm code has structured control flow, which relates back to the validation. There are no operations that push a dynamic number of values and the stack depth is always uniquely defined (ignoring some differences in unreachable code).

@rossberg-chromium, that's the current state now, but if as expected wasm ends up adding stacky operations like pick, multiple return values, etc., then the situation for the inverse will worsen, won't it? Or has something changed?

@kripken, true, then you need auxiliary lets in general -- one reason why I'm not particularly fond of pick or other stack hacking ops, and would rather prefer a destructuring let block as primitive, which is more structured.
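
To illustrate the contrast (both syntaxes below are hypothetical - neither pick nor a destructuring let exists in wasm today, and $divmod is made up):

;; stack-hacking style: pick copies a value from deeper in the stack (indexing convention invented)
call $divmod        ;; pushes quotient, then remainder
pick 1              ;; copy the quotient back to the top for another use

;; structured style along the lines described above: bind both results at once
(let (($q $r) (call $divmod))
  (i32.add (get_local $q) (get_local $r)))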

Practically the stack code is also interpreted and is even the default view-source format it seems.

If the stack code is interpreted directly as a stack machine only when single-stepping in the debugger, that's fine (and, presumably, the debugger would also be free to single-step through an AST-ified representation of the stack format if it wanted to give the user the option of stepping at a higher level). However, if the fundamental stack-based execution model of wasm is preserved even in the AOT-compiled code, meaning that there is an actual stack machine (beyond the call stack) used to store all intermediate values even in compiled code, then there are enormous efficiency implications. See the links I provided in my initial comment for quantitative evidence and qualitative explanations for how badly the stack machine model can impact performance (and therefore battery life, etc.).

So I guess this is at the core of my concern: what will AOT compilers do with wasm stack machine code, and how will it be run outside of the debugger?

However, if the fundamental stack-based execution model of wasm is preserved even in the AOT-compiled code, meaning that there is an actual stack machine (beyond the call stack) used to store all intermediate values even in compiled code, then there are enormous efficiency implications.

As mentioned above, nothing to worry about: the stack machine representation is converted into standard compiler SSA form and optimized, just like clang or gcc would, into machine code. That is how SpiderMonkey and V8 work right now, so you can benchmark this if you want, no need to speculate. For example, here's emscripten's corrections benchmark:

test_corrections (test_benchmark.benchmark) ... 
        clang: mean: 14.877 (+-0.001) secs  median: 14.877  range: 14.876-14.878  (noise: 0.004%)  (3 runs)
     sm-asmjs: mean: 16.035 (+-0.105) secs  median: 15.961  range: 15.948-16.183  (noise: 0.658%)  (3 runs)   Relative: 1.08 X slower
      sm-wasm: mean: 15.561 (+-0.210) secs  median: 15.412  range: 15.408-15.857  (noise: 1.348%)  (3 runs)   Relative: 1.05 X slower
ok

That shows asm.js being 8% slower than native clang, and wasm is slightly faster, at 5% slower.

@kripken thanks for confirming this. What is the reason for the remaining 5% overhead? Does it come down to reduced opportunities for optimization in wasm SSA form code? Or overhead of sandboxing?

Could be those, but it could also be noise - gcc and clang have differences between them too, often much larger. Sometimes one register allocator happens to work better on a particular function, etc.