googleapis/gapic-showcase

rest streaming JSON response error

Opened this issue · 9 comments

The gapic showcase tests added in this python PR uncovered an issue with the gapic showcase server. It seems that when using the StreamingSequence server in rest mode, if you request partial data before raising an error, it results in a broken JSON response

Reproduction

# start a showcase server
gapic-showcase run

In a separate terminal:

# create a new sequence on the server
gapic-showcase sequence create-streaming-sequence --streaming_sequence.content "once upon a time there was a" --streaming_sequence.responses '{"status": {"code": 12, "message":"error"}, "response_index": 5}'

# create a rest attempt
 curl -X POST http://localhost:7469/v1beta1/streamingSequences/0:stream -H 'x-goog-request-params: name=streamingSequences/0' -H 'Content-Type: application/json' -H 'x-goog-api-client: gl-python/3.9.2 rest/2.28.1 gax/2.11.1 gapic/0.0.0'  -d '{}'

we get:

[{
  "content": "once"
},{
  "content": "upon"
},{
  "content": "a"
},{
  "content": "time{"error":{"code":501,"message":"error","details":[],"Body":"","Header":null,"Errors":null}}"
},{
  "content": "there"
}]

we'd expect something like:

[{
  "content": "once"
},{
  "content": "upon"
},{
  "content": "a"
},{
  "content": "time"
},{
  "content": "there"
},{
  "error":{"code":501,"message":"error","details":[],"Body":"","Header":null,"Errors":null}
}]

Speculation

Looking though the showcase server code, it looks like stream.send and ReportGRPCError are relevant. They seem to be using different methods to write to the same buffer. Maybe we need to change the flushing strategy?

Hey @noahdietz! I did some digging with this and was able to replicate Dan's results in node (notes on the mirrored buganizer bug) but I have a feeling that what's causing this is some code in genrest. Any tips on how to resolve this? My golang is meh and so is my REST fallback abilities so I think I've made it as far as I can independently tracing this and confirming that it is a consistent issue. Happy to gvc

cc @vchudnov-g if you feel like reassigning this to yourself - also an option!

I'm going to bump this up to p1 because it's a blocker for both me and Dan at this point. I am happy to GVC to pair on it though and am ultimately excited to learn more about the root cause so I can be more helpful digging into these issues in the future!!

I've started looking at this.

To be honest, I'm kind of skeptical that our current implementation of REST server streaming will actually work the way we want it to (exactly the same as gRPC) for this specific testing scenario. Let me explain.

In gRPC server streaming, the client Recv()s responses as the server sends them, but can receive an error status at any point in the stream. If the server closes its end and returns a non-OK Status, the next Recv() call by the client will produce an error. The key aspect being that the "status code" isn't sent until the very end.

Contrast that with how Go's HTTP 1.1 lib (and HTTP 1.1 in general) works, if Write is called on the response, and WriteHeader hasn't been called yet, it will send 200 OK, then start writing the body. This basically precludes us from sending an error status code later on, like gRPC would do. See the comment here in our impl and the docs for ResponseWriter (which implements the Flusher we use).

The ReportGRPCError implementation is writing the error data to the response, but also trying to WriteHeader after it's already been done by the Flusher (having been flushed at least once at this point). So that WriteHeader call is definitely failing (silently). So basically we can report an error status at the beginning of the stream, but not after it has started.

We might be able to look at using HTTP 1.1 Trailer Headers to report an error status after the fact and mimic the behavior in client libraries, but this is an update to our REGAPIC server streaming impl (meaning we'd need to verify that Google HTTP apis do this and update clients to standardize regapic impl on that). Here is an example of a server doing it in Go: https://pkg.go.dev/net/http#example-ResponseWriter-Trailers

I'm not sure if @vchudnov-g already accounted for/looked into this or not.

TBF grpc-gateway the OSS gRPC-HTTP transcoding proxy (which I think we decided to not use for some reason that is escaping me) has the same limitation. Just reading the code for response stream handling shows as much - they can only write the header once, and once data has been written, the response status cannot be changed.

So I think if I understand properly there are two problems here:

  1. the JSON payload for streams containing errors is broken
  2. the status code is fixed at the start of the stream, so streams that fail later still result in a 200 status code

Ideally they should both be fixed, but at least from my perspective, issue 2 seems like something we can work around if needed

The way I understood it, issue 2 happens because of issue 1 and when it's sending that error. I have some thoughts about issue 2 that I shared on the internal bug - PTAL!

the status code is fixed at the start of the stream, so streams that fail later still result in a 200 status code...issue 2 seems like something we can work around if needed

I'm not so sure, at least not without compromising the regapic server-stream logic in our clients. HTTP 1.1 requires that a status code be sent back in the initial response, we cannot write a chunk of data to the response until the response code has been sent.

There is a mechanism for when the client is actively/about to send(ing) a massive request payload and it wants confirmation from the backend to continue via the Expect header. This basically has the server send back 100 Continue, and then ultimately send a final response code once the client as finished and the service is ready to write the response. This does not fit our use cases here.

I think we need to see what Google infra is doing for this case e.g. talk to ESF folks, and I was hoping @vchudnov-g had either already done so during initial regapic designs or had a contact.

Hi folks. Apologies for being late to the party; I had my hands full with other stuff the past week.

I was reading over this conversation, and I think I mostly understand what's been discussed. I just looked at our REGAPIC streaming design docs and they did not go into much detail on errors.

So my understanding at the moment is this:

  1. The JSON response payload should indeed be fixed so that it terminates before the error is to be inserted. I don't know why that behavior currently differs between grpc and rest transports.
  2. Currently the trailing header that @noahdietz pointed out seems like the desirable solution, since they worked with the chunked-encoding we're already using and would allow us to terminate the JSON stream at the right point and then set (modify?) the status header to indicate there is an error. As @noahdietz points out, we would have to make sure this is consistent with what Google infrastructure is already doing, and we would also want to double-check that rest clients in all languages are indeed checking the error code after the stream ends, and not assuming the error code is provided at the outset.

I'm also going to add an internal note about an implementation detail we may be able to use.

I'll work on issue 1 in #1378

I'll reassign this issue to myself to do the workaround discussed on chat and internally