tc39/proposal-arraybuffer-base64

Is encouraging binary encoding/decoding a good idea? Should it be so prominent?

domenic opened this issue · 10 comments

I found the arguments at whatwg/html#6811 (comment) by @Kaiido somewhat persuasive. Basically, if you're encoding your bytes to and from a string, you're probably doing something wrong, and you should instead modify your APIs or endpoints to accept bytes anyway.

There are definitely cases where it's useful, mostly around parsing and serializing older file formats. But I'm not sure they need to be promoted to the language (or web platform).

Relatedly, even if we think this is a capability worth including, I worry that putting it on ArrayBuffer makes it seem too prominent. It makes base64 decoding/encoding feel "promoted" on the same level as fundamental binary-data operations such as slicing or indexed access. From this perspective something like https://github.com/lucacasonato/proposal-binary-encoding (with static methods) seems nicer in that it silos off this functionality to a separate utility class.

you should instead modify your APIs or endpoints to accept bytes anyway

I don't know about you, but a substantial portion of the code I write talks to APIs which I am not in a position to modify. I think that's probably the case for many developers. Just to pick a few examples I've encountered: Google's speech-to-text API, the Google Drive API, and GitHub's API all expect you to provide binary data encoded with base64 in some circumstances.

In the other direction a great many APIs return data in base64, usually as part of a larger response - for example, JSON-based APIs generally base64-encode binary data which they wish to return as part of the response (what else could you do?).
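Just to make that concrete, here's roughly what consuming such a response looks like today with only atob/btoa to reach for (a sketch; the endpoint and field names are hypothetical):

// hypothetical JSON API that returns binary data base64-encoded, e.g.
// { "name": "photo.png", "data": "iVBORw0KGgo..." }
const res = await fetch('https://api.example.com/attachment/123')
const { data } = await res.json()

// base64 string -> "binary string" -> bytes, one char code at a time
const binary = atob(data)
const bytes = Uint8Array.from(binary, c => c.charCodeAt(0))

// and the other direction, for APIs that expect base64 in the request body
// (the spread runs into argument-count limits for large buffers)
const requestField = btoa(String.fromCharCode(...bytes))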

It makes base64 decoding/encoding feel "promoted" on the same level as fundamental binary-data operations such as slicing or indexed access.

Ehh... I don't think the fact that two APIs are exposed in the same way implies they are equally promoted. (Though actually indexing is done with syntax, not a method call, so indexing is strictly more promoted than this would be.) I have wanted Array.prototype.map approximately a thousand times more frequently than I've wanted Array.prototype.copyWithin, for example.

And ASCII serialization/deserialization is a pretty fundamental operation on binary data, so ArrayBuffer seems like the right place to put those methods. Certainly I as a developer would not think to look for a class outside of ArrayBuffer to find the method for base64-encoding an ArrayBuffer.

sffc commented

JSON is a fundamental part of the language, and JSON requires that array buffers be stored as text, so I think Base64 is fundamental enough to be this prominent.

Not disagreeing with the conclusion, but there are other ways to represent binary data in JSON and the suitability varies. Strings are often the most practical option, but for small binary values, arrays of numbers are usually better. A real world example of where strings are the worst option is the “challenge” and “user handle” binary values that get exchanged in WebAuthn.

Every demo of WebAuthn I’ve seen encodes these (tiny — 64 bytes or less) binaries as urlsafe base64 strings in JSON during interchange. (I’m not sure why they add the extra steps for urlsafe given it’s sent in a JSON body — any flavor of base64 would be fine — but they all seem to do it.)

Encoding those values as ordinary JSON arrays of numbers is more direct, less error-prone, and the size doesn’t make a material difference:

// JSON-serializable representation:
[ ...new Uint8Array(buffer) ];

// Simpler and safer restoration from JSON:
Uint8Array.from(array);
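Putting those two snippets together, a sketch of the full round trip (the 32-byte challenge here is just made up for illustration):

// hypothetical WebAuthn-ish exchange: the challenge travels as a plain
// JSON array of numbers rather than a base64 string
const challenge = crypto.getRandomValues(new Uint8Array(32))
const payload = JSON.stringify({ challenge: [ ...challenge ] })

// ...and on the receiving side:
const restored = Uint8Array.from(JSON.parse(payload).challenge)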

I would like to point out that there are a few encoding/decoding types that are practically everywhere both client-side and server-side:

  • Base64 string ↔ raw binary, due to JSON, XML, and URLs not supporting arbitrary data without lots of escaping
    • The "URL-safe" encoding simply replaces + and / with - and _ respectively; this would be a good candidate for an encoder option, but that's about it, since a single decoder could easily handle both by changing a lookup table slightly (see the sketch after this list).
  • Hex string ↔ raw binary, used both for raw data (Base64 would be better, but some people are just lazy) and for cryptographic constants
  • Native string ↔ raw UTF-8, because basically everything requires it (it's been the default text-to-binary conversion for Node's buffers since the moment .toString was added, and WHATWG's TextEncoder has never supported encoding to anything other than UTF-8)
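For reference, here's a sketch of what each of those conversions tends to look like in userland today, without engine support (not production code, and the btoa/atob route only works on "binary strings"):

// native string <-> raw UTF-8
const utf8 = new TextEncoder().encode('hëllo')
const text = new TextDecoder().decode(utf8)

// hex string <-> raw binary
const toHex = u8 => [...u8].map(b => b.toString(16).padStart(2, '0')).join('')
const fromHex = hex => Uint8Array.from(hex.match(/../g) ?? [], h => parseInt(h, 16))

// base64 string <-> raw binary (the spread hits argument limits on large inputs)
const toB64 = u8 => btoa(String.fromCharCode(...u8))
const fromB64 = s => Uint8Array.from(atob(s), c => c.charCodeAt(0))

// and the URL-safe flavor really is just a character substitution away
const fromB64Url = s => fromB64(s.replaceAll('-', '+').replaceAll('_', '/'))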

I've had a WIP transcoding proposal sitting privately, and there is a very significant performance boost to be had by doing this within the engine: engines can iterate strings via their native representation (whether it be a cons string or a flat string) and build the result with zero unnecessary copies. Additionally, while this isn't in and of itself an argument for putting it in JS engines, all three are embarrassingly parallel tasks very well-suited to SSE vectorization, and JS engines are much more likely than embedders to pursue that kind of optimization where possible, since they already have to care a lot about architectural specifics for their JITs and WebAssembly.

I do agree with the original commenter that base64url should be discouraged. It's wasteful to send 33% more bytes over the wire and to waste processing time encoding/decoding to and from strings.

And the WebAuthn/JSON use case of sending things back and forth between APIs can be dealt with via other communication strategies, such as FormData + Blob.

I have abused fetch's power to retrieve multiple files from a server to the browser by doing something like:

// parse the multipart/form-data response and pull out every part named 'files'
const fd = await response.formData()
const files = fd.getAll('files')
const ab = await files[0].arrayBuffer()

Much quicker and easier than having to use any zip/tar stuff. And there isn't any reason you can't do the same thing on the server now either, since Node.js and Deno support the same fetch API on the backend.
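For instance, a minimal sketch of the server half, assuming a runtime where Response, FormData, and Blob are all available (e.g. a Deno.serve handler or any framework that accepts a Response); the file names and contents here are made up:

const bytesA = new Uint8Array([1, 2, 3])
const bytesB = new Uint8Array([4, 5, 6])

function handler (_req) {
  const fd = new FormData()
  fd.append('files', new Blob([bytesA]), 'a.bin')
  fd.append('files', new Blob([bytesB]), 'b.bin')
  // Response sets the multipart/form-data content type (with boundary) itself,
  // so the client-side snippet above can unpack it via response.formData()
  return new Response(fd)
}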

Just send the WebAuthn binary using something like:

const fd = new FormData()
fd.append('challenge', new Blob([uint8array]))
fetch(url, { body: fd })
// and do this on the backend:
const fd = await req.formData()
const challenge = new Uint8Array(await fd.get('challenge').arrayBuffer())

this way of sending formdata works both ways.

As such I am -1 on implementing a new binary encoder.

The platform has evolved to better handle binary data nowadays, without requiring things to be sent via base64 or JSON.
We have things such as BSON and protobuf and other binary representations. JSON isn't the only solution, and it isn't the best solution for everything.

You can also use fetch to convert a base64 data URL into something else:

// let the browser's data: URL parser do the base64 decoding
const b64toRes = (base64, type = 'application/octet-stream') =>
  fetch(`data:${type};base64,${base64}`)

// then read the resulting Response however you like:
const res = await b64toRes(base64String) // base64String: whatever you got handed
await res.arrayBuffer()
await res.blob()
await res.json()
res.body // stream

but again, base64 should have been avoided in the first place


Speaking of FormData (off topic)... would it be a great idea to have something like formdata.append('stuff', typedArray)?

I am, in retrospect, coming around to the opinion that @domenic is right that this doesn't belong on the ArrayBuffer itself, but rather in a built-in module or something. It's also worth mentioning that a built-in module or separate global would be a lot easier to implement for those maintaining embedded runtimes like XS.

@jimmywarting Nice, I’d never considered FormData here. It seems pretty well-suited and (relatively) direct. I’m curious if you know whether there are any gotchas attached to it? Sometimes those characteristics are deceptive with APIs like this*.

* I might be unduly wary of novel usage of FormData cause I proposed %Object.fromEntries% a few years back and today one of the most common ways I see it used in the wild is Object.fromEntries(new FormData(form)) ... which is unsafe for duplicate keys & doesn’t do what people imagine for e.g. checkbox inputs. I now feel partly responsible for some untold but prob large number of sporadic bugs on the open web ._.
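For anyone curious, the footgun looks roughly like this (hypothetical markup):

// <form id="prefs">
//   <input type="checkbox" name="topping" value="olives" checked>
//   <input type="checkbox" name="topping" value="onions" checked>
//   <input type="checkbox" name="extraCheese" value="yes">
// </form>
const fd = new FormData(document.querySelector('#prefs'))

fd.getAll('topping')    // ['olives', 'onions']: both values survive
Object.fromEntries(fd)  // { topping: 'onions' }: the duplicate key collapses,
                        // and the unchecked checkbox is simply absent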

@bathos

I’m curious if you know whether there are any gotchas attached to it?

Well, IE doesn't support fetch at all, so there is that...
But other than that I don't see any problem with sending a FormData back to the client/browser from the server, and vice versa. Also, there isn't any reason why you can't just send pure raw binary directly: fetch(url, { body: arrayBuffer })

Deno has a pretty well-spec'ed fetch + FormData built in.
The latest node-fetch@3 has support for taking a stream and converting it back into FormData using something like:

import { Response } from 'node-fetch'

app.post('/path', async (req, res) => {
  // wrap the incoming request stream in a Response to reuse its multipart parser
  const fd = await new Response(req).formData()
  new Response(fd).arrayBuffer().then(Buffer.from).then(buf => res.send(buf))
})

Though the undici fetch implementation doesn't support decoding FormData yet (see nodejs/undici#974).

formdata-polyfill comes with a neat way of converting FormData into Blobs for further usage in other ways.

But this is all just off topic, so I'm going to hide this.

(Hid my question in turn, but appreciate the answer, thanks.)

It is the opinion of the committee that this is worth doing. It's true that it's better to avoid the overhead when possible, but often it simply isn't, and we should make accommodations for that reality.