grpc/grpc-node

Calls trigger ERR_HTTP2_SESSION_ERROR with node 20.10.0

AlbertLlebaria opened this issue · 29 comments

Problem description

When running multiple gRPC requests asynchronously with the same gRPC client, gRPC client at some point starts cancelling thoe calls with the following error message:
Received RST_STREAM with code 2 triggered by internal client error: Session closed with error code 2

I've enabled the following ENV variables GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG and these are the logs for one of the calls that received the RST_STREAM error.

D 2023-11-28T11:52:30.785Z | subchannel_call | [5567] write() called with message of length 433
D 2023-11-28T11:52:30.785Z | subchannel_call | [5567] sending data chunk of length 433
D 2023-11-28T11:52:31.297Z | subchannel_call | [5567] end() called
D 2023-11-28T11:52:31.297Z | subchannel_call | [5567] calling end() on HTTP/2 stream
D 2023-11-28T11:52:35.382Z | subchannel_call | [5567] Node error event: message=Session closed with error code 2 code=ERR_HTTP2_SESSION_ERROR errno=Unknown system error undefined syscall=undefined
D 2023-11-28T11:52:35.393Z | subchannel_call | [5567] HTTP/2 stream closed with code 2
D 2023-11-28T11:52:35.393Z | subchannel_call | [5567] ended with status: code=13 details="Received RST_STREAM with code 2 triggered by internal client error: Session closed with error code 2"

Right before the calls start triggering the ERR_HTTP2_SESSION_ERROR , i found the following a log explaining the client got closed by GOAWAY with code 2. 0.0.0.0:3000 connection closed by GOAWAY with code 2.

D 2023-11-28T11:52:35.354Z | resolving_call | [6070] Received status
D 2023-11-28T11:52:35.354Z | resolving_call | [6070] ended with status: code=0 details="OK"
D 2023-11-28T11:52:35.354Z | subchannel_call | [6202] HTTP/2 stream closed with code 0
D 2023-11-28T11:52:35.354Z | subchannel_call | [6424] Received server trailers:
		grpc-status: 0
		grpc-message: OK

D 2023-11-28T11:52:35.354Z | subchannel_call | [6424] received status code 0 from server
D 2023-11-28T11:52:35.354Z | subchannel_call | [6424] received status details string "OK" from server
D 2023-11-28T11:52:35.354Z | subchannel_call | [6424] close http2 stream with code 0
D 2023-11-28T11:52:35.354Z | subchannel_call | [6424] ended with status: code=0 details="OK"
D 2023-11-28T11:52:35.354Z | load_balancing_call | [6338] Received status
D 2023-11-28T11:52:35.354Z | load_balancing_call | [6338] ended with status: code=0 details="OK"
D 2023-11-28T11:52:35.354Z | retrying_call | [6337] Received status from child [6338]
D 2023-11-28T11:52:35.354Z | retrying_call | [6337] state=COMMITTED handling status with progress PROCESSED from child [6338] in state ACTIVE
D 2023-11-28T11:52:35.354Z | retrying_call | [6337] ended with status: code=0 details="OK"
D 2023-11-28T11:52:35.354Z | resolving_call | [6247] Received status
D 2023-11-28T11:52:35.354Z | resolving_call | [6247] ended with status: code=0 details="OK"
D 2023-11-28T11:52:35.354Z | subchannel_call | [6424] HTTP/2 stream closed with code 0
D 2023-11-28T11:52:35.355Z | transport | (12) 0.0.0.0:3000 connection closed by GOAWAY with code 2
D 2023-11-28T11:52:35.355Z | subchannel | (11) 0.0.0.0:3000 READY -> IDLE
D 2023-11-28T11:52:35.355Z | subchannel_refcount | (11) 0.0.0.0:3000 refcount 2 -> 1
D 2023-11-28T11:52:35.355Z | pick_first | READY -> IDLE
D 2023-11-28T11:52:35.355Z | resolving_load_balancer | dns:0.0.0.0:3000 READY -> IDLE
D 2023-11-28T11:52:35.355Z | connectivity_state | (1) dns:0.0.0.0:3000 READY -> IDLE

You can find the full logs in the attached file:
grpcLogs.txt

This error happened after upgrading our NodeJS project to version 20.10.0. This problem is only replicable when there are a ''large'' (2361 calls from the logs) amount of gRPC calls triggered. When there is a reduced amout of calls, I couldn't replicate the error.

Moreover, the ERR_HTTP2_SESSION_ERROR is not replicable with lower versions Node versions ( issue didn't appear when runnin on node 20.5.1 ).

Reproduction steps

Use Nodejs 20.10.0 and install grpc/grpc-js 1.9.12.
Run multiple gRPC calls with the same client asynchornously. Issue is also replicable with other older versions of
grpc/grpc-js such as 1.8.14 .

Environment

  • OS name, version and architecture: Arch Linux amd64
  • Node version 20.10.0/20.9.0
  • Node installation method: nvm
  • Package name and version: 1.9.12

If your code works in one version of Node, and doesn't work in a later version of Node, my first assumption is that this is a regression in Node itself. Since log says that the session ended with a GOAWAY with error code 2 (internal error), that implies that it received that error from the server, though it is possible that the runtime synthesized that GOAWAY on the client. If it did come from the server, then maybe the client behaves differently in Node 20 vs earlier versions of Node that causes the server error.

I am facing exactly the same error in my application that involves a high degree of concurrency.
In my case, I see exactly the same error/log output with Node version 18.19.0 and Node 20.10.0.
It works if I fix the Node version at 18.16.0 - I did not try the versions between 18.16.0 and 18.19.0, but the root cause of the problem must be in a code change between any of those versions I suppose.

I have investigated the traffic with Wireshark and the server sends a HTTP2 GOAWAY with Code 2. The @grpc/grpc-js client then throws the mentioned exception and in my case does not retry the request.
I have tried @grpc/grpc-js versions 1.7.3. and 1.9.13, the behavior is the same. Behavior is the same on Windows 11, Ubuntu 20.04, and Ubuntu 22.04. It does not matter if the client manages the channel on its own or if I override the channel for all concurrent requests to the same one.

My logs are identical to the one of @AlbertLlebaria. Please let me know when there is something else that I can provide to further investigate the issue. Thanks a lot! :)

I am also facing the same issue in my application and mostly during peak usage. I am getting the same exception which causing context cancellation for other upstream services. I am on node version 20.10.0 and using grpc-js version 1.9.13.

Please let me know if I can provide any more context or help in investigation of the issue. Thank you.

Practically speaking, I think the only real impact of this bug is a slightly confusing error description. The logs show that the client received a GOAWAY with code 2. When Node receives that, it also ends any open streams on that connection with code 2. gRPC translates the HTTP2 code 2 into the gRPC status code 13, which is what we are seeing here. The confusing part is that this should be reported as an internal server error, but the gRPC client is interpreting it as an internal client error.