w3f/polkadot-spec

RFC: AbortHandler

bkchr opened this issue · 6 comments

bkchr commented

Proposal

We would like to introduce a new host function:

fn AbortHandler_abort(u64);

This host function should abort the execution of the runtime when being called. The given parameter is to be interpreted as utf8-string and follows the default encoding for parameters in Polkadot. The caller can expect that the function will never return.

Why

This improves the Observability for runtimes. Currently runtime implementations abuse the Logging_log host function to give information about why it crashed. Runtimes are allowed to panic when they are in a state that isn't recoverable. A good example is when calling any runtime function and the given parameters fail to decode. This often happens when validating transactions and spams the node output with errors. Node operators get confused from these log outputs.

Another important use case is testing. Currently there is no functionality to find out why the runtime panicked in a test and you can only check that the execution failed. While it would be much better to at least get some "error message" to find out why it failed.

bkchr commented

You can argue that this is maybe not needed or only for testing. Polkadot/Kusama already disable logging for production builds that go onto real networks. So, there you can not see these messages ever. Which is fine by itself and would make this host function irrelevant/not needed. The problem is more when the network is running into issues and we need to debug things. Then not knowing what the problem is, requires some more work to actually fix it. There are ways to rebuild the wasm binary with logging enabled and then use this. This would enable someone to find get some better observability.

I can also see advantages in doing it this way.

fn AbortHandler_abort(u64);

I think the status quo is to use two u32 arguments instead of clumping the pointer and the length together.

Now, regarding the behavior. It's worth clarifying what happens if

  1. [ptr..][..len] is out of bounds.
  2. The string has non-UTF8 characters.

It might seem obvious, but I still would insist such things be clarified in an RFC.

Now regarding the function itself.

It definitely provides better error reporting. When you generally call a runtime API, and it panics right now, you would not get the panic message because, even if it is logged, there is no reliable way to distinguish other logs from the panic message. This function enables returning panic messages.

Now the question is how badly you would want to have those messages. For debugging problems, this is certainly a good tool. I do not really like the approach that requires rebuilding the runtime with logging enabled to just get the panic message. This would suck enormously when the time works against us. It would suck even more if the version of the compiler (or another build environment variable) plays a role (That actually did happen in polkadot).

On the other hand, it also has costs.

First, obviously, using this function (or any logging function for that matter) in the panic routine should increase the size of the wasm binary.

Second, it comes with a rake you could step on. Runtimes should not panic, as we know, but they do. It's possible that a type with a careless Debug implementation (e.g. a type that prints a byte array verbatim as decimal numbers, that happened within the polkadot codebase) gets printed on a panic path. This aids a potential DoS attacker with extra leverage. Yet, if you think about it, it serves no purpose.

Why do I say it serves no purpose if I just agreed that it aids with debuggability?

Well, this is because the Substrate Runtime's or PVF's main objective is to encode the state transition function. Those are in one way or another the bottleneck of the system. I am a proponent of the idea that if a something is not absolutely necessary to be present in the onchain wasm image it should not be there (one of the early examples).

Logging and panic reporting are not required since they can be implemented out-of-band. The first direction is like defmt where the formatting is deferred to the host. Only format values are sent to the host and the format strings are supplied separately from the binary. The second direction is the omniscient trace-based debugger framework I mentioned elsewhere (write-up is still pending 😮‍💨), the principle is similar to the first direction. Those solutions are quite advanced though and will require a lot of engineering effort.

On the other hand, the cost of this function is negligible. Somebody might say that introducing this function means that we are stuck with it forever1, and all the future polkadot host implementers will have to deal with it. While it is somewhat true, I would argue that it is not that bad. The function terminates the execution immediately: this does not introduce any new mechanics and also does not change the logical state of the system and therefore removes all those pesky questions. A legit implementation is to just trap.

Footnotes

  1. Although I think we still haven't figured out the stability story.

Shouldn't this function be under the Misc module? Respectively ext_misc_abort.

What's the state of this?

See #593

bkchr commented

What's the state of this?

Sorry. I wanted to actually convert this into a proper RFC, I will still do this.