WebAssembly/module-linking

For imports, should field name be optional, rather than module name?


From the explainer:

The module field of import becomes optional (allowing single-level imports).

But this runs counter to the desugaring of two-level imports into a one-level instance import:

(import "module" "field" (func))
;; ==>
(import "module" (instance (export "field" (func))))

Although it shouldn't matter semantically, afaict it would be clearer if the field name were optional rather than the module name.

Although I suppose that would make

(import "f" (func))

seem weird. I guess it is a question of how often we expect one-level imports to refer to instances/modules vs other things.
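
For instance (the names here are made up purely for illustration), compare a one-level import that names an instance with one that names a plain function:

(import "wasi_file" (instance (export "read" (func))))  ;; the string reads like a module name
(import "f" (func))                                      ;; the string reads like a field name

If most one-level imports end up naming instances or modules, keeping the module name reads more naturally; if they mostly name other things, keeping the field name does.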

Although we often informally call the second string of import the "field name", technically, in the spec, there is only the name component of the import tuple. Note that export also has a name field, so dropping the module component of import makes import more symmetric with export, which is nice.

Should the binary format be updated to reflect that? (where 0x00 comes first)

Hmm, good question. Aesthetically, it should probably be nm:name (matching the export binary production), but, thinking about the order of 0x00 0xff, we need to ensure the 0xff is unambiguously invalid, as opposed to being read as the first byte of the nm:name. When 0x00 0xff was followed by d:importdesc, that follows from importdesc starting with a definition-kind byte (which we can easily ensure never hits 0xff). But I think 0xff is a valid first byte of a LEB128? But now I'm thinking we should probably have a more robust encoding of an invalid string, like a 1-byte-length string where the 1 byte is invalid UTF-8... I think maybe 0x01 0xff fits this bill? (Edit: oops, no it doesn't; better suggestion below.) If so, then we could have 0x01 0xff go first, as the module name.
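
Concretely, the byte-level situation is:

    0xff = 0b1111_1111   ;; continuation bit set, so it is a legal first byte of a LEB128 length prefix
                         ;; but it is never a valid byte anywhere in well-formed UTF-8
                         ;; (lead bytes are 0x00-0x7f or 0xc2-0xf4, continuation bytes 0x80-0xbf)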

Oh, for imports the binary format currently says

    mod:name 0x00 0xff d:importdesc             ->    {module mod, desc d}

That's an MVP-invalid encoding because, as you mentioned, 0xff can't start an importdesc. I was wondering if we want to keep that or instead switch to:

    0x00 nm:name 0xff d:importdesc             ->    {name nm, desc d}

(not that this really matters, it's quite minor)

Oh, I see. Yes, I suppose that could technically work. But what I was thinking is that we could avoid this implicit dependency between the binary format of import and the first byte of importdesc by having:

    0x01 0xff nm:name d:importdesc             ->    {name nm, desc d}

where 0x01 is the string length and 0xff is not valid UTF-8, so the module name decodes as an invalid 1-byte string. What's vaguely nice about this is that it makes the invalid-ness of the name self-contained in the decoding of the module name itself.
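
As a rough sketch of what that would look like on the wire (assuming the MVP importdesc encoding, where a function import is 0x00 followed by a type index), a single-level (import "f" (func)) of type 0 might encode as:

    0x01 0xff    ;; "module name absent" marker: length 1, byte 0xff, not valid UTF-8
    0x01 0x66    ;; nm:name = "f" (length 1, byte 0x66)
    0x00 0x00    ;; d:importdesc: func, typeidx 0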

But that would be a much worse dependency! I would strongly prefer not playing nasty tricks that depend on the definition of UTF-8 well-formedness. Besides general hygiene, it's more future-proof to not shut the door on the possibility of allowing arbitrary "binary blob" names, for which we have already seen possible use cases emerging.

Ah, if you don't take UTF-8 as fixed over time, that's a good point.

Actually, thinking about this more, I have a really hard time imagining a backwards-compatible change to wasm that simply removes the UTF-8 restriction. If you tried to do that, then what happens to all the places (like JS API and ESM-integration, probably every other language binding, and also devtools) that today decode the bytes as UTF-8 into strings of characters? To be backwards-compatible, these embeddings would need to continue attempting to decode as UTF-8 and, if that failed... hide the export? Decode in some other way? Decode into a non-string? But now if I have some binary blob that just happens to resemble UTF-8 it will be interpreted differently, which seems bad.

Thus, if we ever wanted to allow binary blobs in import/export strings, I think we would be forced to prefix them with a byte pattern that is invalid UTF-8 (such as 0x01 0xff), regardless. And since there are an infinite number of invalid UTF-8 patterns, there would be no problem adding one more if we claimed 0x01 0xff for "not present". Another way to think of it is adding a generic binary decoding rule:

    name? ::= 0x01 0xff -> ϵ
            | nm:name   -> nm

which is (exclusively) used by import. Thus, the only binary format coupling is between name and name?.

(On a side note: even with explicitly distinguishing UTF-8 strings from binary blobs, binary blobs seem like they would be really bad for the ecosystem in general. I doubt it will actually happen.)

Perhaps a more likely future extension would be adding a string section that defines strings that all other sections can use anywhere there is a name production. For that, the most natural encoding seems to be adding a case to the encoding of name that embeds the string index:

    name ::= 0x01 0xfe x:stringidx -> strings[x]
           | nm:name               -> nm
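
For example (purely hypothetical, since no such string section exists today), an import name could then be written as an index into that section:

    0x01 0xfe 0x02    ;; name = strings[2]: the invalid-UTF-8 prefix 0x01 0xfe, then x = 2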

As of #35, with this proposal now being a new layer rather than a modification of the core binary format, this is no longer an issue, so closing.

(That being said, I chatted some more with @rossberg about the idea of the string section described in the preceding comment and it seems to make sense, with the addition of making sure to carve out room in the binary encoding for a proper discriminant, so that multiple not-a-UTF-8-string options could be added in the future.)