nodejs/node

vm.Script could be used to hide the source by shipping only bytecode.

hashseed opened this issue · 13 comments

I was playing with this idea this morning, and @bmeck asked me to put this into words.

vm.Script can serialize a code cache into a buffer, and can also load from a cache produced earlier. With Ignition (the bytecode interpreter) launched in V8, we could "abuse" this to ship only bytecode and hide the source.

The thing with Ignition is: once a function has been compiled, we don't need the source code anymore. The optimizing compiler can construct the graph from bytecode alone. So a script can be fully shipped as bytecode. There are a couple of things missing though. For a proof of concept, these issues can be hacked around before actually thinking about changing the V8 API to accommodate them.

  • Eager compilation: every function must be compiled to bytecode already. V8 doesn't do that out of the box, but there is a command line flag called --serialize_eager that you could turn on to force eager compilation if a code cache is being created.
  • The source: vm.Script always expects the script source to be provided. With Ignition we don't actually need it, but there is a checksum at deserialization time to check that the source matches expectations. That checksum is simply the script length at this point, so an empty (whitespace-only) string of the same length would do.
  • Platform dependency: V8's serializer simply walks and serializes the object graph. In the case of the code cache, we walk the object graph of the function (SharedFunctionInfo). Depending on whether the platform is 32-bit or 64-bit, the object layout is different, and the code cache would look different. I'm not 100% sure whether x64 and arm64 would produce the same code cache, either.
  • Version dependency: V8's bytecode is purely internal, and not versioned. So for a different version of V8, the bytecode needs to be recompiled.
  • Function.prototype.toString() would just show a window into whatever dummy source was provided. Duh.

Once these issues are solved, you could ship bytecode and hide the source, without worrying about crashing the optimizing compiler.
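A minimal sketch of the round trip described above, using only the public vm.Script API (createCachedData, the cachedData option, and cachedDataRejected). The sample script and the file name add.js are made up for illustration. The last step only probes whether a same-length placeholder passes the header check on the running V8 version; it deliberately does not run that script, because without eager compilation any lazily compiled function would still need the real source. No V8 flags are set here.

```javascript
const vm = require('vm');

// Hypothetical sample script; any source text would do.
const source = 'function add(a, b) { return a + b; }\nadd(2, 3);';

// 1. Compile once and serialize V8's code cache into a Buffer.
const original = new vm.Script(source, { filename: 'add.js' });
const cache = original.createCachedData();

// 2. Consume the cache together with the real source. This is what
//    vm.Script already supports today.
const withSource = new vm.Script(source, {
  filename: 'add.js',
  cachedData: cache,
});
const acceptedWithSource = withSource.cachedDataRejected === false;
const result = withSource.runInThisContext();

// 3. Probe the length-only source check with a placeholder of identical
//    length. Whether this is accepted depends on the V8 version, so the
//    outcome is reported rather than assumed. The script is not executed.
const placeholder = ' '.repeat(source.length);
const withPlaceholder = new vm.Script(placeholder, {
  filename: 'add.js',
  cachedData: cache,
});
const acceptedWithPlaceholder = withPlaceholder.cachedDataRejected === false;

console.log({
  cacheBytes: cache.length,
  acceptedWithSource,
  result,
  acceptedWithPlaceholder,
});
```

If the placeholder is rejected, vm.Script silently falls back to compiling the placeholder text itself, which is why checking cachedDataRejected before running anything matters.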

Oh and this would only work on versions where V8 uses Ignition. For example at this shameless plug.

bmeck commented

@hashseed it sounds like we can provide the source so that things like debuggers can show the source though? I think showing the source can be useful, but avoiding extra parsing and compilation costs would be good.

I was just pointing out the possibility of hiding the source, if required by use case. If the source is available, then there is not much difference to what vm.Script already does now, except for maybe forced eager compilation.

The checksum is really important here, I think, for transparency. Say, in an open source situation, you publish the bytecode together with the original code; a collision-free checksum provides a guarantee that the bytecode is true to the source. Is there a way to do this without a dummy checksum, using a strong hash as a legitimate checksum?

The header of the code cache contains a bunch of different fields that have to match: V8 version, source length, command line flags, etc. There is also a checksum over the payload, but that's intended for error detection, not security. It uses a Fletcher checksum, so it is fairly easy to find a collision.
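To illustrate why a Fletcher-style checksum is no integrity guarantee, here is a textbook Fletcher-16 (two running sums modulo 255). This is not V8's actual implementation, only the general scheme; the payload bytes are made up. Because the sums are taken modulo 255, a 0x00 byte and a 0xFF byte are indistinguishable.

```javascript
// Textbook Fletcher-16: illustrative only, not the checksum code used by V8.
function fletcher16(bytes) {
  let sum1 = 0;
  let sum2 = 0;
  for (const byte of bytes) {
    sum1 = (sum1 + byte) % 255; // running sum of the bytes
    sum2 = (sum2 + sum1) % 255; // running sum of the running sums
  }
  return (sum2 << 8) | sum1;
}

// Two different payloads: the middle byte is 0x00 in one and 0xFF in the other.
const payloadA = Uint8Array.from([0x01, 0x00, 0x02]);
const payloadB = Uint8Array.from([0x01, 0xff, 0x02]);

const checksumA = fletcher16(payloadA);
const checksumB = fletcher16(payloadB);
const collide = checksumA === checksumB;

console.log({ checksumA, checksumB, collide });
```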

What about switching to a Blake2b hash? That’s very fast (faster than MD5) and as hard to find collisions in as SHA2 (i.e., impossible).

I'll just put this here and walk away...

.pyc

What about switching to a Blake2b hash? That’s very fast (faster than MD5) and as hard to find collisions in as SHA2 (i.e., impossible).

Might be worth experimenting with. But you'd still need a safe way to store/transmit the checksum.

As mentioned, the current checksum is to detect accidental data corruption only.
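A rough sketch of what an out-of-band strong hash could look like with Node's crypto module. It assumes the digest reaches the consumer over some trusted channel, which is exactly the open problem mentioned above and is not solved here. BLAKE2b is used when the linked OpenSSL build lists it, with SHA-256 as a fallback; the helper names digest and matchesTrustedDigest are hypothetical.

```javascript
const crypto = require('crypto');

// Prefer BLAKE2b if this Node build offers it; otherwise fall back to SHA-256.
const algorithm = crypto.getHashes().includes('blake2b512')
  ? 'blake2b512'
  : 'sha256';

// Hypothetical helper: hash a Buffer with the chosen algorithm.
function digest(data) {
  return crypto.createHash(algorithm).update(data).digest();
}

// Hypothetical helper: true only if `data` hashes to a digest that was
// obtained through a trusted channel (assumed to exist).
function matchesTrustedDigest(data, trustedDigest) {
  const actual = digest(data);
  return (
    actual.length === trustedDigest.length &&
    crypto.timingSafeEqual(actual, trustedDigest)
  );
}

// Made-up example: the publisher hashes the source that the bytecode
// claims to come from, and a consumer checks what they received.
const source = Buffer.from('function add(a, b) { return a + b; }');
const trusted = digest(source);
const tampered = Buffer.from('function add(a, b) { return a - b; }');

const sourceOk = matchesTrustedDigest(source, trusted);
const tamperedOk = matchesTrustedDigest(tampered, trusted);

console.log({ algorithm, sourceOk, tamperedOk });
```

Note that this only ties a digest to some bytes. Proving that a given bytecode blob was really compiled from the published source would still require recompiling with the same V8 version and flags and comparing the results.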

.pyc

What I'm pointing out here is precisely how someone could implement something similar to .pyc for Node.

bmeck commented

JS code is usually the smallest representation. Bytecode takes less space than native code, but is still larger than JS source on average.

Code caching for individual files was implemented in V8 about two years ago. Prior to bytecode, however, the source still needed to be available for parsing when code was recompiled for optimization. TurboFan can create its graph from bytecode, so the source is no longer necessary.

I may be wrong, but I think @indutny's experiments were way before the code cache, and were about putting code into V8's startup snapshot. However, the startup serializer/deserializer had many limitations back then, which were fine for V8's default startup snapshot but did not work for arbitrary code.

Trott commented

Should this remain open?

bmeck commented

@Trott No bandwidth currently to move it, but it's still relevant and comes up on social media somewhat often.

Another application of this would be sending pre-compiled code between processes via IPC. Of course the "cached code object" would need to be serialized, but even so, it would likely be faster than passing the original source code to the target process and recompiling it there. Unfortunately, I can't think of a practical way to test this with the way vm.Script currently works.
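For what it's worth, the cache Buffer that vm.Script exposes can already be handed to another process together with the source. A rough sketch, assuming a child started with spawnSync and a base64/JSON message over stdin standing in for a real IPC channel; the message shape and the sample script are made up. It makes no claim about speed, and the child may reject the cache (for example if its V8 version or flags differ), in which case it simply compiles from source.

```javascript
const vm = require('vm');
const { spawnSync } = require('child_process');

// Hypothetical sample script to pre-compile in the parent.
const source = 'function greet(name) { return "hello " + name; }\ngreet("ipc");';
const cache = new vm.Script(source).createCachedData();

// Code run by the child: read one JSON message from stdin, rebuild the
// script from source plus cache, and report what happened on stdout.
const childCode = `
const vm = require('vm');
let input = '';
process.stdin.setEncoding('utf8');
process.stdin.on('data', (chunk) => { input += chunk; });
process.stdin.on('end', () => {
  const msg = JSON.parse(input);
  const script = new vm.Script(msg.source, {
    cachedData: Buffer.from(msg.cache, 'base64'),
  });
  process.stdout.write(JSON.stringify({
    rejected: script.cachedDataRejected,
    result: script.runInThisContext(),
  }));
});
`;

// Serialize the cache as base64 so it survives the JSON transport.
const message = JSON.stringify({ source, cache: cache.toString('base64') });
const child = spawnSync(process.execPath, ['-e', childCode], {
  input: message,
  encoding: 'utf8',
});

const reply = JSON.parse(child.stdout);
console.log(reply);
```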

The discussion seems to have quieted down a bit. Closing.

We can reopen this some time later.