connectrpc/vanguard-go

More robust transcoding of `google.protobuf.Any` error details

jhump opened this issue · 0 comments


RPC errors can include arbitrary additional details in the form of google.protobuf.Any messages. These messages are generally transmitted in the Protobuf binary format, and the in-memory format of a de-serialized google.protobuf.Any message also contains bytes in the Protobuf binary format.

When translating errors into responses for REST clients, the error is transcoded into JSON. This means that any error details will also be transcoded to JSON. Furthermore, when receiving an error body from a REST server, we first try to de-serialize the body as a JSON-formatted RPC error. This requires transcoding the other direction, from JSON to Protobuf binary format.

The issue is that transcoding a google.protobuf.Any message from binary to JSON or vice versa requires having the schema for that message type. So there is a class of failures where an RPC error includes an unrecognized message type as an error detail. When that happens the transcoding will fail because there is no schema available to inform the transcoding step for that message type.

Today when this happens:

  1. If we cannot transcode to JSON, for a REST client, the client will instead receive this less-than-helpful error: {"code":13,"message":"failed to marshal end error"}.
  2. If we cannot transcode from JSON, from a REST server, we synthesize an RPC error with a code based on the HTTP status code and a google.api.HttpBody error detail message that contains the error body that could not be transcoded and its content-type.

Similar software has similar issues:

  1. In the gRPC-JSON transcoder filter for Envoy, if this happens, no error body will be sent and instead the error details will remain only in grpc-status, grpc-message, and/or grpc-status-details-bin response headers or trailers. (Likely even less useful to a REST client than Vanguard's behavior described above.)
  2. In gRPC-Gateway, when this happens, the error body is hardcoded to {"error": "failed to marshal error message"} (not too far from Vanguard's default behavior). However, gRPC-Gateway allows the application to configure a custom error handler which could potentially handle such marshalling errors in a much more robust fashion.

Ideally, when the only reason we'd fail to transcode an RPC error to/from JSON is missing schemata for the error details therein, we would produce a slightly different error format -- one that preserves as much of the original error data as possible. Here are some concrete tactics we could take:

  1. Instead of attempting to marshal or unmarshal the RPC status to/from JSON, we should process it in a custom fashion where error details can be transcoded one-at-a-time. This allows the processing to detect if/when a single error detail cannot be processed and to use a fallback strategy for that one message. This preserves the status code and message and possibly even a subset of error details, even if there is an unrecognized error detail message.
  2. When there is an unknown error detail and we are transcoding to JSON, serialize the message to a JSON object with @type and @value keys, where the @value property is a base64-encoded string holding the binary bytes that cannot otherwise be transcoded.
    • Ideally support for this special form would be symmetric: when transcoding an error detail from JSON, if we see that it has just @type and @value properties and we can't otherwise unmarshal the Any message, then try to base64-decode the @value property and construct the Any error detail from that.
  3. When there is an unknown error detail and we are transcoding from JSON, de-serialize the message as a google.protobuf.Struct message and then append that to the RPC error details.
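
To make the three tactics concrete, here is a rough sketch using only the Go standard library. It is not Vanguard's code: `anyDetail`, `rpcStatus`, `errUnknownType`, and the `toJSONFunc`/`fromJSONFunc` hooks are hypothetical stand-ins for the real Protobuf types and for whatever schema-aware transcoding is in place. The sketch assumes those hooks report a missing schema with a distinguishable error, so the fallbacks only kick in for that one failure mode.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
)

// anyDetail stands in for google.protobuf.Any: a type URL plus binary payload.
type anyDetail struct {
	TypeURL string
	Value   []byte
}

// rpcStatus stands in for the RPC status being transcoded.
type rpcStatus struct {
	Code    int
	Message string
	Details []anyDetail
}

// errUnknownType is what the hypothetical hooks below return when no schema
// is available for a detail's message type.
var errUnknownType = errors.New("unknown message type")

// Hypothetical hooks for the schema-aware path (for example, backed by protojson).
type (
	toJSONFunc   func(d anyDetail) ([]byte, error)
	fromJSONFunc func(raw []byte) (anyDetail, error)
)

// detailToJSON transcodes one detail, falling back to the @type/@value form
// (tactic 2) only when the schema is missing.
func detailToJSON(d anyDetail, known toJSONFunc) ([]byte, error) {
	raw, err := known(d)
	if err == nil || !errors.Is(err, errUnknownType) {
		return raw, err
	}
	return json.Marshal(map[string]string{
		"@type":  d.TypeURL,
		"@value": base64.StdEncoding.EncodeToString(d.Value),
	})
}

// statusToJSON handles details one at a time (tactic 1), so the code, the
// message, and every recognized detail survive an unrecognized one.
func statusToJSON(s rpcStatus, known toJSONFunc) ([]byte, error) {
	details := make([]json.RawMessage, 0, len(s.Details))
	for _, d := range s.Details {
		raw, err := detailToJSON(d, known)
		if err != nil {
			return nil, err
		}
		details = append(details, raw)
	}
	return json.Marshal(struct {
		Code    int               `json:"code"`
		Message string            `json:"message"`
		Details []json.RawMessage `json:"details,omitempty"`
	}{s.Code, s.Message, details})
}

// detailFromJSON is the reverse direction. For an unknown type it first tries
// the symmetric @type/@value form. Otherwise it returns the decoded object as
// a generic map (tactic 3), which the caller would wrap in a
// google.protobuf.Struct before appending it to the error details.
func detailFromJSON(raw []byte, known fromJSONFunc) (anyDetail, map[string]any, error) {
	d, err := known(raw)
	if err == nil || !errors.Is(err, errUnknownType) {
		return d, nil, err
	}
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(raw, &fields); err != nil {
		return anyDetail{}, nil, err
	}
	var typeURL, encoded string
	if len(fields) == 2 &&
		json.Unmarshal(fields["@type"], &typeURL) == nil && typeURL != "" &&
		json.Unmarshal(fields["@value"], &encoded) == nil {
		if value, err := base64.StdEncoding.DecodeString(encoded); err == nil {
			return anyDetail{TypeURL: typeURL, Value: value}, nil, nil
		}
	}
	var asStruct map[string]any
	if err := json.Unmarshal(raw, &asStruct); err != nil {
		return anyDetail{}, nil, err
	}
	return anyDetail{}, asStruct, nil
}
```

With hooks that always report an unknown type, a status carrying a detail whose bytes are `0x01 0x02 0x03` would serialize with that detail as `{"@type":"...","@value":"AQID"}`, and feeding that object back through `detailFromJSON` recovers the original bytes. Any other JSON object comes back as the generic map instead of failing the whole error.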

Also, it's worth noting that gRPC-Gateway synthesizes an additional error string field in the JSON form of the RPC error, because that is a very common part of JSON error representations expected by REST clients. It could be useful for Vanguard to do the same -- for instance, we could put the string form of code there, which satisfies the expected shape (a string field named error) and also makes the error more human-readable without the human having to memorize the 16 RPC codes.
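
As an illustration of that last idea, here is a small sketch of what the synthesized field could look like. The `restError` type and `marshalRESTError` function are hypothetical, and using the upper-case `google.rpc.Code` enum names for the `error` string is an assumption; a lower-case spelling would work just as well.

```go
package main

import "encoding/json"

// codeNames maps the 16 non-OK RPC codes to their google.rpc.Code enum names.
var codeNames = map[int]string{
	1: "CANCELLED", 2: "UNKNOWN", 3: "INVALID_ARGUMENT", 4: "DEADLINE_EXCEEDED",
	5: "NOT_FOUND", 6: "ALREADY_EXISTS", 7: "PERMISSION_DENIED", 8: "RESOURCE_EXHAUSTED",
	9: "FAILED_PRECONDITION", 10: "ABORTED", 11: "OUT_OF_RANGE", 12: "UNIMPLEMENTED",
	13: "INTERNAL", 14: "UNAVAILABLE", 15: "DATA_LOSS", 16: "UNAUTHENTICATED",
}

// restError is a hypothetical JSON shape: the usual status fields plus a
// synthesized "error" string for REST clients that expect one.
type restError struct {
	Error   string            `json:"error"`
	Code    int               `json:"code"`
	Message string            `json:"message"`
	Details []json.RawMessage `json:"details,omitempty"`
}

// marshalRESTError fills the "error" field with the string form of the code,
// using "UNKNOWN" for any code outside the known range.
func marshalRESTError(code int, message string, details []json.RawMessage) ([]byte, error) {
	name, ok := codeNames[code]
	if !ok {
		name = "UNKNOWN"
	}
	return json.Marshal(restError{Error: name, Code: code, Message: message, Details: details})
}
```

For code 13 with no details, this would yield `{"error":"INTERNAL","code":13,"message":"..."}`, so the body still parses as an RPC status while also having the `error` string that many REST clients look for.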