Received RST_STREAM with code 2 triggered by internal client error: Session closed with error code 2 ERR_HTTP2_SESSION_ERROR
mcaruso85 opened this issue · 34 comments
Problem description
What I'm doing?
I have a basic grpc js client and a grpc python server.
The grpc client it does 60 iterations, in each iteration it fires 300 grpc requests in parallel to the grpc server. The server processing is pretty simple, it sleeps 5 seconds and it respond success.
What I expected to happen?
I expect to finish all 60 iterations.
What is actually happening?
The script does not reach the 60 iterations. It starts to complete iterations with all responses success, but approx at the iteration 9 It fails with the following error:
grpc-client {"group":"mcaruso-poc:grpc-client","error":{"kind":"Error","message":"Error: 13 INTERNAL: Received RST_STREAM with code 2 triggered by internal client error: Session closed with error code 2",
"stack":"Error: 13 INTERNAL: Received RST_STREAM with code 2 triggered by internal client error: Session closed with error code 2\n
at callErrorFromStatus (/app/node _modules/@grpc/grpc-js/build/src/call.js:31:19)\n
at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client.js:192:76)\n
at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)\n
at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)\n
at /app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:94:78\n
at process.processTicksAndRejections (node:internal/process/task_queues:77:11)\nfor call at\n
at ServiceClientImpl.makeUnaryRequest (/app/node_modules/@grpc/grpc-js/build/src/client.js:160:32)\n
at ServiceClientImpl.GetBookList (/app/node_modules/@grpc/grpc-js/build/src/make-client.js:105:19)\n
at /app/inventoryServiceClient.js:39:28\n
at new Promise (<anonymous>)\n
at Client.getBookList (/app/inventoryServiceClient.js:37:12)\n
at /app/index.js:28:9\n
at new Promise (<anonymous>)\n
at GetBookList (/app/index.js:26:12)\n
at /app/node_modules/async/dist/async.js:151:38\n
at /app/node_modules/async/dist/async.js:2588:44\n
at replenish (/app/node_modules/async/dist/async.js:446:21)\n
at /app/node_modules/async/dist/async.js:451:13\n
at Object.eachLimit$1 (/app/node_modules/async/dist/async.js:2718:34)\n
at Object.awaitable (/app/node_modules/async/dist/async.js:211:32)\n
at /app/index.js:48:15\n
at new Promise (<anonymous>)","code":13},
"message":"eachSeries ERROR","status":"Error","timestamp":"2024-01-22T02:48:17.546Z"}
grpc-client HTTP2 12: Http2Session client: finishSessionClose socket end undefined Error [ERR_HTTP2_SESSION_ERROR]: Session closed with error code 2
grpc-client at new NodeError (node:internal/errors:399:5)
grpc-client at Http2Session.onGoawayData (node:internal/http2/core:687:21)
grpc-client at Http2Session.callbackTrampoline (node:internal/async_hooks:130:17) {
grpc-client code: 'ERR_HTTP2_SESSION_ERROR'
grpc-client }
grpc-client D 2024-01-22T02:48:17.548Z | transport | (3) 172.20.93.63:8081 connection closed with error Session closed with error code 2
grpc-client D 2024-01-22T02:48:17.548Z | transport | (3) 172.20.93.63:8081 session closed
Environment
-
Docker image: node:18.15.0-slim
-
Package name and version: "@grpc/grpc-js": "1.8.21"
-
Eks 1.25 - Istio 1.19.4
-
This error is also happening with node 18.16.0, 20.11.0, 21.6.0, and also with grpc-js 1.9.14
Additional context
If I do 100 requests in parallel, it finishes all the 60 iterations, the problem starts when I increase the number of requests in parallel. If I do 600 requests in parallel instead of 300, it throws the error even earlier in the iteration 4 approx. If I do 1000 it fails with the same error in the iteration 1.
This message means that the server abruptly closed the connection with error code 2, which means "internal error". This in turn caused the requests to fail. The client is just reporting the error from the server.
My guess is that the server gets overloaded by the number of concurrent requests you are sending.
@murgatroid99 Hi, the problem I can see it comes from the client, because I did the same tests having the server deployed in kubernetes with several pods with the capacity to take the high load. I see the server replying to all the requests, but suddenly the client shutdown with this error. When I have a concurrency problem, I receive the error in the client "RPC limit exceeded", but in this case I always get a response, never that error. It looks like the problem is when the client increase the amount of requests it sends.
First, before we investigate this further, you should verify that this error still occurs on the latest @grpc/grpc-js 1.9.x version.
The error message says that the HTTP/2 session ended with error code 2. The gRPC client code is not deliberately ending sessions with that code, so the default assumption is that that error comes from the server/proxy, but it is possible that the Node runtime on the client is generating that error. This can be verified with a tcpdump of the traffic the client is sending and receiving. That will show definitively whether or not the server sent that error.
@murgatroid99 Yes, this error is also happening with node 18.16.0, 20.11.0, 21.6.0, and also with grpc-js 1.9.14
@murgatroid99 Here I have more logs:
The envoy side car in the client shows:
2024-01-27T07:18:14.817567Z trace envoy connection external/envoy/source/common/network/connection_impl.cc:615 [Tags: "ConnectionId":"7"] read ready. dispatch_buffered_data=0 thread=25
2024-01-27T07:18:14.817567Z trace envoy connection external/envoy/source/common/network/raw_buffer_socket.cc:24 [Tags: "ConnectionId":"7"] read returns: 26 thread=25
2024-01-27T07:18:14.817567Z trace envoy connection external/envoy/source/common/network/raw_buffer_socket.cc:38 [Tags: "ConnectionId":"7"] read error: Resource temporarily unavailable thread=25
2024-01-27T07:18:14.817867Z debug envoy http2 external/envoy/source/common/http/http2/codec_impl.cc:1208 [Tags: "ConnectionId":"7"] sent goaway code=2 thread=25
2024-01-27T07:18:14.817867Z debug envoy http external/envoy/source/common/http/conn_manager_impl.cc:427 [Tags: "ConnectionId":"7"] dispatch error: The user callback function failed thread=25
2024-01-27T07:18:14.817867Z debug envoy http external/envoy/source/common/http/conn_manager_impl.cc:1889 [Tags: "ConnectionId":"7","StreamId":"7112271742612094099"] stream reset: reset reason: connection termination, response details: - thread=25
Which looks like is saying the client is sending the goway due to a callback function failed and proceed to reset the stream
Then in the application I see these relevant lines:
2024-01-27T07:18:14.951034047Z stderr F HTTP2 18: Http2Session client: goaway 2 received [last stream id: 3999]
2024-01-27T07:18:14.951037006Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | transport | (3) 10.96.13.98:8081 connection closed by GOAWAY with code 2 and data undefined
2024-01-27T07:18:14.951041047Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | subchannel | (2) 10.96.13.98:8081 READY -> IDLE
2024-01-27T07:18:14.951043381Z stderr F HTTP2 18: Http2Session client: marking session closed
2024-01-27T07:18:14.951045672Z stderr F HTTP2 18: Http2Session client: submitting goaway
2024-01-27T07:18:14.951047214Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | subchannel_refcount | (2) 10.96.13.98:8081 refcount 2 -> 1
2024-01-27T07:18:14.951049006Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | pick_first | READY -> IDLE
2024-01-27T07:18:14.951050881Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | resolving_load_balancer | dns:grpc-server.mcaruso-poc.svc.cluster.local:8081 READY -> IDLE
2024-01-27T07:18:14.951052756Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | connectivity_state | (1) dns:grpc-server.mcaruso-poc.svc.cluster.local:8081 READY -> IDLE
2024-01-27T07:18:14.951054797Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | dns_resolver | resolution update delayed by "min time between resolutions" rate limit
2024-01-27T07:18:14.951056631Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | resolving_load_balancer | dns:grpc-server.mcaruso-poc.svc.cluster.local:8081 IDLE -> CONNECTING
2024-01-27T07:18:14.951058131Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | connectivity_state | (1) dns:grpc-server.mcaruso-poc.svc.cluster.local:8081 IDLE -> CONNECTING
2024-01-27T07:18:14.951059756Z stderr F HTTP2 18: Http2Session client: destroying
2024-01-27T07:18:14.951061297Z stderr F HTTP2 18: Http2Session client: start closing/destroying Error [ERR_HTTP2_SESSION_ERROR]: Session closed with error code 2
2024-01-27T07:18:14.951063089Z stderr F at Http2Session.onGoawayData (node:internal/http2/core:690:21)
2024-01-27T07:18:14.951064881Z stderr F at Http2Session.callbackTrampoline (node:internal/async_hooks:130:17) {
2024-01-27T07:18:14.951066714Z stderr F code: 'ERR_HTTP2_SESSION_ERROR'
2024-01-27T07:18:14.951068631Z stderr F }
2024-01-27T07:18:14.951070172Z stderr F HTTP2 18: Http2Stream 2025 [Http2Session client]: destroying stream
2024-01-27T07:18:14.952236297Z stderr F HTTP2 18: Http2Session client: finishSessionClose
2024-01-27T07:18:14.952238089Z stderr F D 2024-01-27T07:18:14.875Z | v1.9.14 18 | subchannel_call | [7012] Node error event: message=Session closed with error code 2 code=ERR_HTTP2_SESSION_ERROR errno=Unknown system error undefined syscall=undefined
2024-01-27T07:18:14.964811172Z stderr F D 2024-01-27T07:18:14.901Z | v1.9.14 18 | resolving_call | [4999] ended with status: code=13 details="Received RST_STREAM with code 2 triggered by internal client error: Session closed with error code 2"
2024-01-27T07:18:14.964813089Z stderr F HTTP2 18: Http2Session client: finishSessionClose socket end null Error [ERR_HTTP2_SESSION_ERROR]: Session closed with error code 2
2024-01-27T07:18:14.964815214Z stderr F at Http2Session.onGoawayData (node:internal/http2/core:690:21)
2024-01-27T07:18:14.964816714Z stderr F at Http2Session.callbackTrampoline (node:internal/async_hooks:130:17) {
2024-01-27T07:18:14.964818381Z stderr F code: 'ERR_HTTP2_SESSION_ERROR'
2024-01-27T07:18:14.964820172Z stderr F }
2024-01-27T07:18:14.964821631Z stderr F D 2024-01-27T07:18:14.934Z | v1.9.14 18 | transport | (3) 10.96.13.98:8081 connection closed with error Session closed with error code 2
2024-01-27T07:18:14.964823172Z stderr F D 2024-01-27T07:18:14.934Z | v1.9.14 18 | transport | (3) 10.96.13.98:8081 session closed
2024-01-27T07:18:14.980607506Z stdout F 2024-01-27T07:18:14Z scuttle: Received signal 'child exited', passing to child
2024-01-27T07:18:14.980790964Z stdout F 2024-01-27T07:18:14Z scuttle: Received signal 'urgent I/O condition', ignoring
Where I see also here the line: "Http2Session client: submitting goaway" which looks like show the client is sending the goaway confirming the envoy side car log. This log line comes from:
https://github.com/nodejs/node/blob/f38a9a59182027493852478b8a3abd5d7d5d1726/lib/internal/http2/core.js#L853
Which ends up calling this function: https://github.com/nodejs/node/blob/f38a9a59182027493852478b8a3abd5d7d5d1726/lib/internal/http2/core.js#L1508
which clarifies "Submits a GOAWAY frame to be sent to the remote peer."
And the relevant envoy logs from the client and server I see the following (there are a lot of requests like these):
ENVOY CLIENT LOG - Request: 7448123a-6637-4550-80d0-ea519c22cfa0
2024-01-27T07:18:14.861767964Z stdout F {"path":"/rpc.bookshop.v1.InventoryService/GetBookList","authority":"grpc-server.mcaruso-poc.svc.cluster.local:8081","downstream_local_address":"10.96.13.98:8081","upstream_host":"10.244.1.35:8081","start_time":"2024-01-27T07:18:09.524Z","downstream_peer_uri_san":null,"request_id":"7448123a-6637-4550-80d0-ea519c22cfa0","dd.trace_id":null,"upstream_transport_failure_reason":null,"upstream_service_time":null,"requested_server_name":null,"response_code_details":"codec_error:The_user_callback_function_failed","bytes_received":5,"route_name":"default","response_flags":"DPE","upstream_cluster":"outbound|8081||grpc-server.mcaruso-poc.svc.cluster.local","duration":5333,"user_agent":"grpc-node-js/1.9.14","x_forwarded_for":null,"response_code":0,"x_envoy_external_address":null,"x_forwarded_proto":"http","connection_termination_details":null,"protocol":"HTTP/2","upstream_local_address":"10.244.1.36:56188","x_forwarded_client_cert":null,"downstream_remote_address":"10.244.1.36:34518","method":"POST","bytes_sent":0}
The client error seems to say there is a codec error. The envoy flag DPE means "The downstream request had an HTTP protocol error", and because this is the client envoy, the downstream is the nodejs application client container using the grpc-js having the protocol error.
ENVOY SERVER LOG - Request: 7448123a-6637-4550-80d0-ea519c22cfa0
2024-01-27T07:18:14.842076464Z stdout F {"upstream_service_time":"5175","x_envoy_external_address":null,"downstream_peer_uri_san":"spiffe://cluster.local/ns/mcaruso-poc/sa/grpc-client","dd.trace_id":null,"protocol":"HTTP/2","route_name":"default","response_code_details":"via_upstream","x_forwarded_for":null,"response_flags":"-","duration":5227,"requested_server_name":"outbound_.8081_._.grpc-server.mcaruso-poc.svc.cluster.local","start_time":"2024-01-27T07:18:09.599Z","path":"/rpc.bookshop.v1.InventoryService/GetBookList","connection_termination_details":null,"response_code":200,"user_agent":"grpc-node-js/1.9.14","downstream_remote_address":"10.244.1.36:56188","bytes_received":5,"upstream_cluster":"inbound|8081||","downstream_local_address":"10.244.1.35:8081","method":"POST","upstream_transport_failure_reason":null,"upstream_local_address":"127.0.0.6:42447","x_forwarded_client_cert":"By=spiffe://cluster.local/ns/mcaruso-poc/sa/grpc-server;Hash=31094800fb20c490cceba2bd1ea0b3ed257f16934f4d2fc8858b32d79ad90e9e;Subject="";URI=spiffe://cluster.local/ns/mcaruso-poc/sa/grpc-client","bytes_sent":106,"upstream_host":"10.244.1.35:8081","request_id":"7448123a-6637-4550-80d0-ea519c22cfa0","authority":"grpc-server.mcaruso-poc.svc.cluster.local:8081","x_forwarded_proto":"http"}
ENVOY CLIENT LOG - Request 9664814f-e11d-43e4-ba61-df8293f63552
2024-01-27T07:18:14.825414881Z stdout F {"authority":"grpc-server.mcaruso-poc.svc.cluster.local:8081","protocol":"HTTP/2","path":"/rpc.bookshop.v1.InventoryService/GetBookList","dd.trace_id":null,"route_name":"default","response_code_details":"codec_error:The_user_callback_function_failed","bytes_received":5,"requested_server_name":null,"upstream_cluster":"outbound|8081||grpc-server.mcaruso-poc.svc.cluster.local","start_time":"2024-01-27T07:18:09.576Z","downstream_peer_uri_san":null,"user_agent":"grpc-node-js/1.9.14","x_forwarded_proto":"http","bytes_sent":0,"x_envoy_external_address":null,"request_id":"9664814f-e11d-43e4-ba61-df8293f63552","x_forwarded_for":null,"upstream_transport_failure_reason":null,"method":"POST","response_flags":"DPE","duration":5243,"x_forwarded_client_cert":null,"response_code":0,"upstream_service_time":null,"upstream_host":"10.244.1.35:8081","downstream_local_address":"10.96.13.98:8081","downstream_remote_address":"10.244.1.36:34518","upstream_local_address":"10.244.1.36:56188","connection_termination_details":null}
ENVOY SERVER LOG- Request 9664814f-e11d-43e4-ba61-df8293f63552
2024-01-27T07:18:14.894895339Z stdout F {"connection_termination_details":null,"method":"POST","request_id":"9664814f-e11d-43e4-ba61-df8293f63552","upstream_service_time":null,"dd.trace_id":null,"upstream_host":"10.244.1.35:8081","bytes_received":5,"route_name":"default","requested_server_name":"outbound_.8081_._.grpc-server.mcaruso-poc.svc.cluster.local","x_forwarded_proto":"http","bytes_sent":0,"downstream_remote_address":"10.244.1.36:56188","upstream_transport_failure_reason":null,"start_time":"2024-01-27T07:18:09.641Z","response_code":0,"upstream_local_address":"127.0.0.6:42447","x_forwarded_for":null,"response_flags":"-","response_code_details":"http2.remote_reset","x_forwarded_client_cert":"By=spiffe://cluster.local/ns/mcaruso-poc/sa/grpc-server;Hash=31094800fb20c490cceba2bd1ea0b3ed257f16934f4d2fc8858b32d79ad90e9e;Subject="";URI=spiffe://cluster.local/ns/mcaruso-poc/sa/grpc-client","x_envoy_external_address":null,"downstream_peer_uri_san":"spiffe://cluster.local/ns/mcaruso-poc/sa/grpc-client","downstream_local_address":"10.244.1.35:8081","upstream_cluster":"inbound|8081||","path":"/rpc.bookshop.v1.InventoryService/GetBookList","protocol":"HTTP/2","user_agent":"grpc-node-js/1.9.14","duration":5251,"authority":"grpc-server.mcaruso-poc.svc.cluster.local:8081"}
Here we see in the server response that the remote client reseted the stream.
As I commented on the corresponding Istio issue, the first line of the client log shows that the client received a goaway before it sent one. So the primary cause of the connection closing is the client receiving a goaway, not sending one.
@murgatroid99 That is not the issue, we can see there that the envoy in the client side is generating the goaway:
2024-01-27T07:18:14.817867Z debug envoy http2 external/envoy/source/common/http/http2/codec_impl.cc:1208 [Tags: "ConnectionId":"7"] sent goaway code=2 thread=25
Which makes me think the envoy side car in the client side is telling the client application to goaway.
But the goaway here is not the problem, the goaway happens, because there are a bunch of requests with errors. I had thousands of requests with the same error in the envoy client side: "response_code_details":"codec_error:The_user_callback_function_failed",,"response_flags":"DPE","
`
For me is pretty clear that there is a codec error. The envoy flag DPE means "The downstream request had an HTTP protocol error", and because this is the client envoy, the downstream is the nodejs application client container using the grpc-js having the protocol error.
The goaway is the end result of this, becasue there are several errors, then the client it proceeds to reset all the streams. And we now is the client resetting the streams, because the envoy server side car shows :"response_code_details":"http2.remote_reset",
We need to focus on the codec error, not in the goaway.
The client log shows that the client process received a goaway and the cited error message indicates that the requests failed because the client process received that goaway. Do you have any logs that show that any requests failed on the client before the client received that goaway?
@murgatroid99
Yes, in fact if you see the timeline order is the following,
First I see a lot of requests like this in envoy client side where they are ok:
2024-01-27T07:18:14.817417547Z stdout F {"bytes_received":5,"connection_termination_details":null,"downstream_peer_uri_san":null,"upstream_local_address":"10.244.1.36:56188","method":"POST","x_envoy_external_address":null,"dd.trace_id":null,"downstream_local_address":"10.96.13.98:8081","upstream_host":"10.244.1.35:8081","requested_server_name":null,"start_time":"2024-01-27T07:18:09.537Z","path":"/rpc.bookshop.v1.InventoryService/GetBookList","upstream_cluster":"outbound|8081||grpc-server.mcaruso-poc.svc.cluster.local","request_id":"4d2ba9ed-e3dd-4cd2-8ef4-87369be87a99","protocol":"HTTP/2","downstream_remote_address":"10.244.1.36:34518","x_forwarded_client_cert":null,"route_name":"default","user_agent":"grpc-node-js/1.9.14","x_forwarded_proto":"http","response_code":200,"upstream_service_time":"5215","bytes_sent":106,"duration":5269,"authority":"grpc-server.mcaruso-poc.svc.cluster.local:8081","response_flags":"-","x_forwarded_for":null,"upstream_transport_failure_reason":null,"response_code_details":"via_upstream"}
Then, at some point after a lot of successfull requests like the one above, I see the goaway is sent by the envoy client side:
2024-01-27T07:18:14.817867Z debug envoy http2 external/envoy/source/common/http/http2/codec_impl.cc:1208 [Tags: "ConnectionId":"7"] sent goaway code=2 thread=25
And after that I see a lot of requests in the envoy client side with the codec error like these:
2024-01-27T07:18:14.825341714Z stdout F {"bytes_sent":0,"response_flags":"DPE","user_agent":"grpc-node-js/1.9.14","upstream_service_time":null,"protocol":"HTTP/2","authority":"grpc-server.mcaruso-poc.svc.cluster.local:8081","duration":5240,"upstream_host":"10.244.1.35:8081","x_envoy_external_address":null,"method":"POST","downstream_peer_uri_san":null,"bytes_received":5,"x_forwarded_for":null,"dd.trace_id":null,"downstream_local_address":"10.96.13.98:8081","start_time":"2024-01-27T07:18:09.577Z","upstream_local_address":"10.244.1.36:56188","requested_server_name":null,"connection_termination_details":null,"response_code_details":"codec_error:The_user_callback_function_failed","upstream_cluster":"outbound|8081||grpc-server.mcaruso-poc.svc.cluster.local","downstream_remote_address":"10.244.1.36:34518","upstream_transport_failure_reason":null,"path":"/rpc.bookshop.v1.InventoryService/GetBookList","response_code":0,"request_id":"23cf4d24-9caa-4d24-b3ca-a136b81ae46d","route_name":"default","x_forwarded_proto":"http","x_forwarded_client_cert":null}
And after that, the client app receives the goaway, and closes the connection (here the app container logs):
2024-01-27T07:18:14.951034047Z stderr F HTTP2 18: Http2Session client: goaway 2 received [last stream id: 3999]
2024-01-27T07:18:14.951037006Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | transport | (3) 10.96.13.98:8081 connection closed by GOAWAY with code 2 and data undefined
2024-01-27T07:18:14.951043381Z stderr F HTTP2 18: Http2Session client: marking session closed
2024-01-27T07:18:14.951045672Z stderr F HTTP2 18: Http2Session client: submitting goaway
The server logs doesnt show anything about a goaway, this all happens inside the pod of the client, between the envoy and the app container.
This is the repository for the @grpc/grpc-js
library. What I care about here is the behavior of that library. So far, I don't see any evidence of that library misbehaving. The client application logs you have showed so far show the client application receiving a goaway with code 2 and then reporting the corresponding error correctly.
Ok, makes sense. If I try this with Istio 1.18.5, this does not happen. It only happens with Istio 1.19.X and up.
The only thing that calls my attention in the grpc-js library is that when I search this codec_error: The_user_callback_function_failed, I get this:
If you are encountering the codec_error: The_user_callback_function_failed message in the context of a Node.js application using gRPC and Envoy, it suggests that there is an issue with the callback function related to gRPC processing. Here are some steps you can take to troubleshoot and address the issue:
Check gRPC Callback Functions:
Examine your gRPC callback functions in your Node.js application. This includes functions related to handling requests and responses. Look for any exceptions, errors, or unexpected behavior within these functions.
And I see in the client node app erroring in the callback function:
2024-01-27T07:18:14.951058131Z stderr F D 2024-01-27T07:18:14.866Z | v1.9.14 18 | connectivity_state | (1) dns:grpc-server.mcaruso-poc.svc.cluster.local:8081 IDLE -> CONNECTING
2024-01-27T07:18:14.951059756Z stderr F HTTP2 18: Http2Session client: destroying
2024-01-27T07:18:14.951061297Z stderr F HTTP2 18: Http2Session client: start closing/destroying Error [ERR_HTTP2_SESSION_ERROR]: Session closed with error code 2
2024-01-27T07:18:14.951063089Z stderr F at Http2Session.onGoawayData (node:internal/http2/core:690:21)
2024-01-27T07:18:14.951064881Z stderr F at Http2Session.callbackTrampoline (node:internal/async_hooks:130:17) {
2024-01-27T07:18:14.951066714Z stderr F code: 'ERR_HTTP2_SESSION_ERROR'
2024-01-27T07:18:14.951068631Z stderr F }
I don't think there is any relation between those things. The Node log there shows the code providing an error to a callback, not an error occurring in a callback.
The reason is this silly rate limiter in node.js, 50121
#define NGHTTP2_DEFAULT_STREAM_RESET_BURST 1000
#define NGHTTP2_DEFAULT_STREAM_RESET_RATE 33
@bbit8 I don't see any relation between that rate limit and this bug. This bug is about requests closing because of GOAWAYs, not any large volume of RST_STREAMs.
@murgatroid99 In fact is the opposite of what you say, I see a tons of RST_STREAMs. I think is related to what @bbit8 is mentioning.
v1.9.14 18 | subchannel_call | [7722] HTTP/2 stream closed with code 2
v1.9.14 18 | subchannel_call | [7722] ended with status: code=13 details="Received RST_STREAM with code 2 triggered by internal client error: Session closed with error code 2"
Those lines are everywhere. The goaway happens because there was an error previously and it is related to the RST_STREAMs.
The goaway log happens after I see a huge volume of RST STREAM errors.
The GOAWAY is a prevention of envoy side car in istio, when it receives a lot of RST_STEAM from the grpc client, it consider that it is an attack, so it sends a goway to the client. But the initial problem here is why the client is sending a lot of RST_STREAMS, and it is because there is an error in the client side, it might be the grps-js or the node js, or what @bbit8 is mentioning.
This is the property of envoy that do this "envoy.restart_features.send_goaway_for_premature_rst_streams" when it sees a lot of premature reset streams, it sends a go away. And this is what is happening here.
To be clear, those logs do not correspond to actual RST_STREAM frames on the wire. They are synthetic RST_FRAME events generated by Node that correspond to the GOAWAY.
@mcaruso85 @murgatroid99 I didn't dive deep. I just decreased node.js version from 18.18.2 to 18.18.1. Problem Solved! Something wrong with the rate limiter in nghttp2.
For the record, the purpose of that linked patch in Node.js was to fix an HTTP/2 protocol vulnerability. I would not advise downgrading below that patch.
@bbit8 the original issue reports that this problem occurs on Node 18.16.0, so if your problem was fixed by downgrading that way, it is likely different than the error originally reported here.
This is a very general error: a connection closed because of an internal server error. There are probably a variety of causes that result in the same error.
This problem may be related to #2625. We are trying out a fix for that in grpc-js 1.10.4, so I recommend checking that out.
@murgatroid99 agree, initial testing shows that v1.10.4
resolves this for us (presumably #2695). I'll report back once we've verified more.
We can also resolve it by switching envoy to use the oghttp2
backend (theory being that nghttp2 has a new RST_STREAM rate limiter, and envoy doesn't expose configuration points). But oghttp2
is not ready for prime-time in envoy 1.28, so we're very happy to find a fix for the client side.
@murgatroid99 my initial tests shows that 1.10.4 also resolves this for us. I'll report back too once I have it in production. Thanks for your help !
This is a very general error. It says "Internal server error" in the text because one of the causes is an internal server error. I don't know why you're still seeing it, but the cause of your errors may be entirely unrelated to the one the patch in 1.10.4 was trying to address.
I am also still receiving 13 INTERNAL: Received RST_STREAM with code 2 triggered by internal client error: Session closed with error code 2"
on 1.10.4 under high-load scenarios. (i.e. 40k+ request for streams)