grpc/grpc-node

Inconsistent Error Handling for Graceful Server Shutdown during Keepalive Ping

w1am opened this issue · 12 comments

w1am commented

Problem Description

I am experiencing a discrepancy in the error messages returned when reading from a gRPC stream and gracefully shutting down the connection midstream. Specifically, the grpc-js client incorrectly interprets a cancelled keepalive ping as an "Unavailable" error instead of the expected "Cancelled" error during such a shutdown. This behavior is related to my PR #2622. While the client logs Error [ERR_HTTP2_PING_CANCEL]: HTTP2 ping cancelled, it eventually throws an "Unavailable" error, which seems inconsistent with the actual server state.

Reproduction Steps

  1. Clone my repository.
  2. Run yarn to install dependencies.
  3. Execute ./server.sh to start the server.
  4. In a separate terminal, run node client.js to start the client.
  5. Wait for the client to establish a connection and start reading from the stream.
  6. Gracefully stop the server.
  7. Observe the logs to see the error messages.

Relevant code snippet handling ping failures:

transport.ts:429

this.session!.ping((err: Error | null, duration: number, payload: Buffer) => {
  if (err) {
    console.log(err); // Outputs: Error [ERR_HTTP2_PING_CANCEL]: HTTP2 ping cancelled
    this.keepaliveTrace('Ping failed with error ' + err.message);
    this.handleDisconnect();
  }
  this.keepaliveTrace('Received ping response');
  this.clearKeepaliveTimeout();
  this.maybeStartKeepalivePingTimer();
});

The log displays Error [ERR_HTTP2_PING_CANCEL]: HTTP2 ping cancelled, indicating a correct cancellation due to server shutdown. However, the final error reported to the client is "Unavailable".

Full log:

Event #0: test-event
Event #1: test-event
Event #2: test-event
Event #3: test-event
Event #4: test-event
Event #5: test-event
Event #6: test-event
Event #7: test-event
Event #8: test-event
Event #9: test-event
Event #10: test-event
Event #11: test-event
Manually stop the Docker container now...
!goaway called!
Error [ERR_HTTP2_PING_CANCEL]: HTTP2 ping cancelled
    at new NodeError (internal/errors.js:322:7)
    at pingCallback (internal/http2/core.js:897:10)
    at processTicksAndRejections (internal/process/task_queues.js:83:21) {
  code: 'ERR_HTTP2_PING_CANCEL'
}
!close called!
Event #12: test-event
Event #13: test-event
Event #14: test-event
Event #15: test-event
Event #16: test-event
Event #17: test-event
Event #18: test-event
Event #19: test-event
Event #20: test-event
Event #21: test-event
Event #22: test-event
Event #23: test-event
Event #24: test-event
Event #25: test-event
Event #26: test-event
Event #27: test-event
Event #28: test-event
Event #29: test-event
Event #30: test-event
Event #31: test-event
Event #32: test-event
Event #33: test-event
Event #34: test-event
Event #35: test-event
Event #36: test-event
Event #37: test-event
Event #38: test-event
Event #39: test-event
Event #40: test-event
Event #41: test-event
Event #42: test-event
Event #43: test-event
Event #44: test-event
Error: UnavailableError: 14 UNAVAILABLE: Connection dropped

Environment

  • OS: Linux Fedora 39
  • Node version: 14.21.3
  • Node installation method: t/j n
  • Package: @grpc/grpc-js@1.9.11

Additional Context

N/A

This ping cancellation you are observing is a session-level event, and a description of how the ping specifically ended. It is not supposed to have any direct correspondence to how an individual request on that session ends. We interpret any ping error we see to mean that the server is no longer available, so the UNAVAILABLE error code is appropriate.

When you say that you "Gracefully stop the server", it's not clear what that means. In the context of gRPC, that would generally involve allowing existing requests to end on their own while not accepting new requests, but that is clearly not what is happening here.

If you want to have more control over the status code the client sees, you should control that explicitly on the server. Most gRPC server implementations have a function that cancels all open requests (generally while shutting down). In grpc-js, that is Server#forceShutdown.

w1am commented

@murgatroid99 Thanks for the explanation! I'm running into a strange situation where I get "Unavailable" when connecting to my server in CI and "Cancelled" in the 'Latest' server environment.

When you say that you "Gracefully stop the server", it's not clear what that means

By graceful shutdown, I mean stopping the docker-compose (my server) using Ctrl + C.

However, previous versions (<=1.9.1) of gRPC-Js before this modification in transport.ts was added, I was getting the same error message (Cancelled) across both environments (CI and Latest). I would tend to agree that 'Unavailable' should be the right message. However, I still cannot figure out why I get different error messages for different server environments after upgrading gRPC-js.

Well, what is different between your CI and Latest server environments?

w1am commented

@murgatroid99 They are the same server but running in different environment.

The exact error you are seeing is directly related to how you are running the server (in a docker container) and how you are stopping it (sending SIGINT to docker-compose). If you are not doing those same things in both environments, it is unsurprising that you would see different errors, and even if you are, identical handling is not guaranteed on the client.

The expectation in gRPC is that a server gracefully shuts down by closing the listening port, sending each open session a GOAWAY with code 0, and then waiting for existing requests to end on their own before ending the process. These actions are generally encapsulated in a single operation in the gRPC API, like Server#tryShutdown in grpc-js. A server can also shutdown more aggressively by cancelling all pending requests after sending the GOAWAY.

A client will make its best effort to handle any other event that causes a connection to become unusable in the middle of processing a request, but specific outcomes are not guaranteed.

w1am commented

In my case, receiving a GOAWAY with code 0 is unexpectedly triggering handleDisconnect. This seems contrary to the expected behavior, where handleDisconnect should only respond to abrupt disconnects without a GOAWAY or with non-zero GOAWAY codes.

In your log, handleDisconnect is triggered by the ping error, not by the GOAWAY. If the server sends a GOAWAY with code 0, indicating that it is shutting down gracefully, and then immediately stops responding while there are still pending requests, it is not actually shutting down gracefully, so the client is back to making its best effort to handle the error.

w1am commented

I ran the test in another scenario to better illustrate the problem:

  1. I read data from a stream every second.
  2. I closed the connection as soon as the client connected to the server (at t=1s)

In this test, I set the keepAliveInterval to 10s and the keepAliveTimeout to 10s.

In server version 23.10.0:
After shutting down the server at t=1s, the session was closed at t=6s.

At t=10s, since the session had already closed, it returned CANCELLED to the client because the keepalive ping could not be sent.

In server version 24.0.0:
After shutting down the server at t=1s, the session was closed at t=10s.

The client managed to send the keepalive ping to the server before the server closed, and it returned UNAVAILABLE (by the ping error handler, i.e handleDisconnect) to the client.

I realized that the difference between the two servers is that in version 24.0.0, they upgraded to .NET 7, but nothing else changed. The changelog for .NET 7 can be found here.

Conclusion

For both server, the server was shut down at the same time (t=1s) and different error message is being returned because In version 23.10.0, the session closed prematurely at t=6s, making it impossible for the client to send the keepalive ping, resulting in a CANCELLED error. In contrast, in version 24.0.0, with the .NET 7 upgrade, the session remained open until t=10s. This allowed the client enough time to send the keepalive ping before the server shutdown, leading to an UNAVAILABLE error instead.

So, it appears that the discrepancy might have been introduced by the .NET 7 upgrade after all, but nothing obvious is indicated in the changelog.

The Node client does not currently end a call with the CANCELLED status because the keepalive ping could not be sent. The PR you sent would have introduced that behavior, but I didn't merge that PR. Either the CANCELLED status is coming from a different code path, or you are using your branch with your proposed change.

w1am commented

It seems that in .NET 7, the default value for the graceful ShutdownTimeout has changed, which affects how much time the server allows for ongoing requests to complete gracefully before forcefully terminating them during a shutdown process but it wasn't listed in the changelog. I am closing this issue.