grpc/grpc-node

not sure why i'm seeing grpc blocking the event loop so often

earonesty opened this issue · 4 comments

Problem description

i tried to turn on event loop block detection, and all i see are grpc calls (we use google pubsub and other grpc-using tools), which doesn't make sense to me, since i thought during the call the system should yield

Reproduction steps

Use grpc on a busy server (we use google secrets manager, google pubsub, etc.)

Environment

  • Linux Ubuntu 18.04 amd64
  • Node 18
  • nvm/yarn

Additional context

Examples:

7: "    at /app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:192:22"
8: "    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"
message: "Event loop blocked for 600.4265230000019ms, operation started here:"
        "4": "    at Subchannel.startConnecting (/app/node_modules/@grpc/grpc-js/build/src/subchannel.js:278:17)",
        "5": "    at ChannelSubchannelWrapper.startConnecting (/app/node_modules/@grpc/grpc-js/build/src/subchannel-interface.js:34:20)",
        "6": "    at PickFirstLoadBalancer.exitIdle (/app/node_modules/@grpc/grpc-js/build/src/load-balancer-pick-first.js:341:24)",
        "7": "    at ChildLoadBalancerHandler.exitIdle (/app/node_modules/@grpc/grpc-js/build/src/load-balancer-child-handler.js:112:31)",
        "8": "    at ResolvingLoadBalancer.exitIdle (/app/node_modules/@grpc/grpc-js/build/src/resolving-load-balancer.js:235:32)",
        "9": "    at InternalChannel.getConfig (/app/node_modules/@grpc/grpc-js/build/src/internal-channel.js:313:36)",
        "10": "    at ResolvingCall.getConfig (/app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:121:43)",
        "11": "    at /app/node_modules/@grpc/grpc-js/build/src/internal-channel.js:200:26",
        "12": "    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)",
        "message": "Event loop blocked for 99.32906599998473ms, operation started here:"

basically most of my event loop is blocked on grpc stuff. is there some configuration or setting to make it properly yield?

this is a common one... blocking for 600ms

        "1": "    at TLSWrap.emitInitNative (node:internal/async_hooks:200:43)",
        "2": "    at TLSSocket._wrapHandle (node:_tls_wrap:696:24)",
        "3": "    at new TLSSocket (node:_tls_wrap:577:18)",
        "4": "    at Object.connect (node:_tls_wrap:1752:19)",
        "5": "    at Object.connect (node:internal/http2/core:3298:22)",
        "6": "    at /app/node_modules/@grpc/grpc-js/build/src/transport.js:504:35",
        "7": "    at new Promise (<anonymous>)",
        "8": "    at Http2SubchannelConnector.createSession (/app/node_modules/@grpc/grpc-js/build/src/transport.js:417:16)",
        "9": "    at /app/node_modules/@grpc/grpc-js/build/src/transport.js:557:112",
        "10": "    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)",
        "message": "Event loop blocked for 699.5128900003433ms, operation started here:"

That last example shows the creation of a TLS socket. It's probably doing some encryption work, which takes time. This should only happen infrequently: one of the basic design principles of gRPC is that you establish a connection once and then use it to make several requests, so the time taken to establish the connection is amortized over all of the requests.

The other examples seem to have incomplete stack traces, so I don't think I can evaluate what happened with them.

i can show the full traces. the TLS establishment one happens a lot. i have a 10 minute log with a single instance running on google cloud. it happens 5000 times.

here's a full trace 400 non-io milliseconds to "sendMessageOnChild".

      "data": [
        {
          "message": "Event loop blocked for 399.0506949999332ms, operation started here:"
        },
        [
          "    at AsyncHook.init (/app/node_modules/blocked-at/index.js:31:11)",
          "    at emitInitNative (node:internal/async_hooks:200:43)",
          "    at emitInitScript (node:internal/async_hooks:503:3)",
          "    at promiseInitHook (node:internal/async_hooks:322:3)",
          "    at promiseInitHookWithDestroyTracking (node:internal/async_hooks:326:3)",
          "    at Promise.then (<anonymous>)",
          "    at ResolvingCall.sendMessageOnChild (/app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:104:99)",
          "    at /app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:192:22"
        ]

heck here's the full log:

https://text.is/W88JY

you can see almost all the block-logs are "grpc doing stuff". not always TLS. not sure why TLS needs so much CPU with grpc, but not with our https:// rest requests - which never block.

I don't entirely understand how to interpret these logs. It's not clear what exactly qualifies as "blocking" the event loop. If it's any time spent in JS code, why are these stack traces called out specifically? And what exactly in the stack trace is the blocking time attributed to?

Also, if you have 5000 instances of blocking 400-600ms in 10 minutes, that's over half an hour of blocking total. I assume that means that there are multiple separate client instances. How many times does each client experience this in a 10 minute period?