googleapis/nodejs-pubsub

Getting ENODATA and EADDRNOTAVAIL from PubSub after sending events

LukasKnuth opened this issue · 4 comments

Environment details

  • OS: x86_64 GNU/Linux 5.10.176+ (GKE node)
  • Node.js version: 18.16.0
  • npm version: 9.8.0
  • @google-cloud/pubsub version: 4.0.6

Steps to reproduce

  1. Send an Event through Topic.publishMessage({json: event})
  2. The message is sent as expected
  3. We get the errors above in DataDog traces

Additional information

  • This application only publishes events, it does NOT have any event listeners

The Trace

  1. The trace shows that the code path to publish the event sends the event.
  2. We see pubsub.request for the correct/expected topic
  3. Next is grpc.client to the /google.pubsub.v1.Publisher/Publish path
  4. Within this call comes the first error in dns.resolve
Error: queryTxt ENODATA pubsub.googleapis.com
    at QueryReqWrap.onresolve [as oncomplete] (node:internal/dns/callback_resolver:47:19)
    at QueryReqWrap.callbackTrampoline (node:internal/async_hooks:130:17)
  1. On that same code path we get the next error in tcp.connect
Error: connect EADDRNOTAVAIL 2a00:1450:4001:80f::200a:443 - Local (:::0)
    at internalConnect (node:net:1100:16)
    at defaultTriggerAsyncIdScope (node:internal/async_hooks:463:18)
    at node:net:1306:9
    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)
  1. Right after the connection attempt to the IPv6 address, I see a successful attempt to 142.250.74.202:443
  2. I presume that path is asynchronous because on the same path we get repeated tcp.connect attempts around every four minutes. They all try to connect to an IPv6 address I don't recognize (like in the above error).

Here are some screenshots of the flame graph:

image

Zoomed out to max to see the repeated connection attempts
image


This is spamming our logs and causing false positives in the error reporting. The sending of events works as expected though. Would love to sort this out.

We are having the same issue as well (but with subscriptions) we had http2 debugging and grpc tracing on at the time of the incident.

In our case no more messages are being pulled.
Acks and nacks aren't delivered (we are not sure about that) :/


ERROR 2023-10-20T20:05:48.449431624Z D 2023-10-20T20:05:48.449Z | subchannel | (85) 2a00:1450:4001:80e::200a:443 TRANSIENT_FAILURE -> CONNECTING
ERROR 2023-10-20T20:05:48.450167540Z D 2023-10-20T20:05:48.450Z | subchannel | (85) 2a00:1450:4001:80e::200a:443 creating HTTP/2 session
ERROR 2023-10-20T20:05:48.451373185Z HTTP2 1: Http2Session client: created
ERROR 2023-10-20T20:05:48.452870711Z HTTP2 1: Http2Session <invalid>: socket error [connect EADDRNOTAVAIL 2a00:1450:4001:80e::200a:443 - Local (:::0)]
ERROR 2023-10-20T20:05:48.453065069Z HTTP2 1: Http2Session client: destroying
ERROR 2023-10-20T20:05:48.455429453Z HTTP2 1: Http2Session client: start closing/destroying Error: connect EADDRNOTAVAIL 2a00:1450:4001:80e::200a:443 - Local (:::0) at __node_internal_captureLargerStackTrace (node:internal/errors:496:5) at __node_internal_exceptionWithHostPort (node:internal/errors:671:12) at internalConnect (node:net:1100:16) at defaultTriggerAsyncIdScope (node:internal/async_hooks:463:18) at node:net:1306:9 at processTicksAndRejections (node:internal/process/task_queues:77:11) at runNextTicks (node:internal/process/task_queues:64:3) at listOnTimeout (node:internal/timers:538:9) at process.processTimers (node:internal/timers:512:7) {
ERROR 2023-10-20T20:05:48.455649536Z errno: -99,
ERROR 2023-10-20T20:05:48.455655139Z code: 'EADDRNOTAVAIL',
ERROR 2023-10-20T20:05:48.455660692Z syscall: 'connect',
ERROR 2023-10-20T20:05:48.455666785Z address: '2a00:1450:4001:80e::200a',
ERROR 2023-10-20T20:05:48.455671971Z port: 443
ERROR 2023-10-20T20:05:48.455685973Z }
ERROR 2023-10-20T20:05:48.455985840Z HTTP2 1: Http2Session client: finishSessionClose
ERROR 2023-10-20T20:05:48.456237292Z D 2023-10-20T20:05:48.456Z | subchannel | (85) 2a00:1450:4001:80e::200a:443 connection closed with error connect EADDRNOTAVAIL 2a00:1450:4001:80e::200a:443 - Local (:::0)
ERROR 2023-10-20T20:05:48.456285701Z D 2023-10-20T20:05:48.456Z | subchannel | (85) 2a00:1450:4001:80e::200a:443 connection closed
ERROR 2023-10-20T20:05:48.456303572Z D 2023-10-20T20:05:48.456Z | subchannel | (85) 2a00:1450:4001:80e::200a:443 CONNECTING -> TRANSIENT_FAILURE
ERROR 2023-10-20T20:05:48.456534614Z D 2023-10-20T20:05:48.456Z | subchannel | (85) 2a00:1450:4001:80e::200a:443 TRANSIENT_FAILURE -> IDLE
ERROR 2023-10-20T20:05:48.456552437Z D 2023-10-20T20:05:48.456Z | subchannel | (86) 2a00:1450:4001:80f::200a:443 TRANSIENT_FAILURE -> CONNECTING
ERROR 2023-10-20T20:05:48.456703562Z D 2023-10-20T20:05:48.456Z | subchannel | (86) 2a00:1450:4001:80f::200a:443 creating HTTP/2 session
ERROR 2023-10-20T20:05:48.457291793Z HTTP2 1: Http2Session client: created
ERROR 2023-10-20T20:05:48.457883349Z HTTP2 1: Http2Session <invalid>: socket error [connect EADDRNOTAVAIL 2a00:1450:4001:80f::200a:443 - Local (:::0)]
ERROR 2023-10-20T20:05:48.458050683Z HTTP2 1: Http2Session client: destroying
ERROR 2023-10-20T20:05:48.458296839Z HTTP2 1: Http2Session client: start closing/destroying Error: connect EADDRNOTAVAIL 2a00:1450:4001:80f::200a:443 - Local (:::0) at __node_internal_captureLargerStackTrace (node:internal/errors:496:5) at __node_internal_exceptionWithHostPort (node:internal/errors:671:12) at internalConnect (node:net:1100:16) at defaultTriggerAsyncIdScope (node:internal/async_hooks:463:18) at node:net:1306:9 at processTicksAndRejections (node:internal/process/task_queues:77:11) at runNextTicks (node:internal/process/task_queues:64:3) at listOnTimeout (node:internal/timers:538:9) at process.processTimers (node:internal/timers:512:7) {
ERROR 2023-10-20T20:05:48.458370179Z errno: -99,
ERROR 2023-10-20T20:05:48.458375751Z code: 'EADDRNOTAVAIL',
ERROR 2023-10-20T20:05:48.458381411Z syscall: 'connect',
ERROR 2023-10-20T20:05:48.458386972Z address: '2a00:1450:4001:80f::200a',
ERROR 2023-10-20T20:05:48.458392521Z port: 443
ERROR 2023-10-20T20:05:48.458397787Z }
ERROR 2023-10-20T20:05:48.458540773Z HTTP2 1: Http2Session client: finishSessionClose
ERROR 2023-10-20T20:05:48.458732527Z D 2023-10-20T20:05:48.458Z | subchannel | (86) 2a00:1450:4001:80f::200a:443 connection closed with error connect EADDRNOTAVAIL 2a00:1450:4001:80f::200a:443 - Local (:::0)
ERROR 2023-10-20T20:05:48.458858364Z D 2023-10-20T20:05:48.458Z | subchannel | (86) 2a00:1450:4001:80f::200a:443 connection closed
ERROR 2023-10-20T20:05:48.459002652Z D 2023-10-20T20:05:48.458Z | subchannel | (86) 2a00:1450:4001:80f::200a:443 CONNECTING -> TRANSIENT_FAILURE
ERROR 2023-10-20T20:05:48.459227436Z D 2023-10-20T20:05:48.459Z | subchannel | (86) 2a00:1450:4001:80f::200a:443 TRANSIENT_FAILURE -> IDLE
ERROR 2023-10-20T20:05:48.459422298Z D 2023-10-20T20:05:48.459Z | subchannel | (87) 2a00:1450:4001:810::200a:443 TRANSIENT_FAILURE -> CONNECTING
ERROR 2023-10-20T20:05:48.459660232Z D 2023-10-20T20:05:48.459Z | subchannel | (87) 2a00:1450:4001:810::200a:443 creating HTTP/2 session
ERROR 2023-10-20T20:05:48.460359637Z HTTP2 1: Http2Session client: created
ERROR 2023-10-20T20:05:48.461176918Z HTTP2 1: Http2Session <invalid>: socket error [connect EADDRNOTAVAIL 2a00:1450:4001:810::200a:443 - Local (:::0)]
ERROR 2023-10-20T20:05:48.461356881Z HTTP2 1: Http2Session client: destroying
ERROR 2023-10-20T20:05:48.461647123Z HTTP2 1: Http2Session client: start closing/destroying Error: connect EADDRNOTAVAIL 2a00:1450:4001:810::200a:443 - Local (:::0) at __node_internal_captureLargerStackTrace (node:internal/errors:496:5) at __node_internal_exceptionWithHostPort (node:internal/errors:671:12) at internalConnect (node:net:1100:16) at defaultTriggerAsyncIdScope (node:internal/async_hooks:463:18) at node:net:1306:9 at processTicksAndRejections (node:internal/process/task_queues:77:11) at runNextTicks (node:internal/process/task_queues:64:3) at listOnTimeout (node:internal/timers:538:9) at process.processTimers (node:internal/timers:512:7) {
ERROR 2023-10-20T20:05:48.461748549Z errno: -99,
ERROR 2023-10-20T20:05:48.461754307Z code: 'EADDRNOTAVAIL',
ERROR 2023-10-20T20:05:48.461791167Z syscall: 'connect',
ERROR 2023-10-20T20:05:48.461797891Z address: '2a00:1450:4001:810::200a',
ERROR 2023-10-20T20:05:48.461803923Z port: 443
ERROR 2023-10-20T20:05:48.461809972Z }
ERROR 2023-10-20T20:05:48.461952928Z HTTP2 1: Http2Session client: finishSessionClose
ERROR 2023-10-20T20:05:48.462147565Z D 2023-10-20T20:05:48.462Z | subchannel | (87) 2a00:1450:4001:810::200a:443 connection closed with error connect EADDRNOTAVAIL 2a00:1450:4001:810::200a:443 - Local (:::0)
ERROR 2023-10-20T20:05:48.462308185Z D 2023-10-20T20:05:48.462Z | subchannel | (87) 2a00:1450:4001:810::200a:443 connection closed
ERROR 2023-10-20T20:05:48.462505629Z D 2023-10-20T20:05:48.462Z | subchannel | (87) 2a00:1450:4001:810::200a:443 CONNECTING -> TRANSIENT_FAILURE
ERROR 2023-10-20T20:05:48.462759040Z D 2023-10-20T20:05:48.462Z | subchannel | (87) 2a00:1450:4001:810::200a:443 TRANSIENT_FAILURE -> IDLE
ERROR 2023-10-20T20:05:48.462986925Z D 2023-10-20T20:05:48.462Z | subchannel | (88) 2a00:1450:4001:811::200a:443 TRANSIENT_FAILURE -> CONNECTING
ERROR 2023-10-20T20:05:48.463202698Z D 2023-10-20T20:05:48.463Z | subchannel | (88) 2a00:1450:4001:811::200a:443 creating HTTP/2 session
ERROR 2023-10-20T20:05:48.463943090Z HTTP2 1: Http2Session client: created
ERROR 2023-10-20T20:05:48.464716175Z HTTP2 1: Http2Session <invalid>: socket error [connect EADDRNOTAVAIL 2a00:1450:4001:811::200a:443 - Local (:::0)]
ERROR 2023-10-20T20:05:48.464895203Z HTTP2 1: Http2Session client: destroying
ERROR 2023-10-20T20:05:48.465142845Z HTTP2 1: Http2Session client: start closing/destroying Error: connect EADDRNOTAVAIL 2a00:1450:4001:811::200a:443 - Local (:::0) at __node_internal_captureLargerStackTrace (node:internal/errors:496:5) at __node_internal_exceptionWithHostPort (node:internal/errors:671:12) at internalConnect (node:net:1100:16) at defaultTriggerAsyncIdScope (node:internal/async_hooks:463:18) at node:net:1306:9 at processTicksAndRejections (node:internal/process/task_queues:77:11) at runNextTicks (node:internal/process/task_queues:64:3) at listOnTimeout (node:internal/timers:538:9) at process.processTimers (node:internal/timers:512:7) {
ERROR 2023-10-20T20:05:48.465231833Z errno: -99,
ERROR 2023-10-20T20:05:48.465238476Z code: 'EADDRNOTAVAIL',
ERROR 2023-10-20T20:05:48.465244753Z syscall: 'connect',
ERROR 2023-10-20T20:05:48.465250884Z address: '2a00:1450:4001:811::200a',
ERROR 2023-10-20T20:05:48.465256932Z port: 443
ERROR 2023-10-20T20:05:48.465262473Z }
ERROR 2023-10-20T20:05:48.465483218Z HTTP2 1: Http2Session client: finishSessionClose
ERROR 2023-10-20T20:05:48.465690270Z D 2023-10-20T20:05:48.465Z | subchannel | (88) 2a00:1450:4001:811::200a:443 connection closed with error connect EADDRNOTAVAIL 2a00:1450:4001:811::200a:443 - Local (:::0)
ERROR 2023-10-20T20:05:48.465884563Z D 2023-10-20T20:05:48.465Z | subchannel | (88) 2a00:1450:4001:811::200a:443 connection closed
ERROR 2023-10-20T20:05:48.465998695Z D 2023-10-20T20:05:48.465Z | subchannel | (88) 2a00:1450:4001:811::200a:443 CONNECTING -> TRANSIENT_FAILURE
ERROR 2023-10-20T20:05:48.466187245Z D 2023-10-20T20:05:48.466Z | subchannel | (88) 2a00:1450:4001:811::200a:443 TRANSIENT_FAILURE -> IDLE

I'd advise both of you to open up a support ticket, because this looks less like a library bug and more like an issue with the network path between you and the CPS services.

Can you point us in the right direction where to look for the problem?
We started to get the disruptions at 20:30 last Tuesday without changing our code.
We downgraded to node 16 and the problem was gone. So we belive something changed in the underlying nodejs http2 library (nghttp2) which now causes the reconnect not to work.

@John98Zakaria There was apparently a service-side change that activated a bug in Node 18's HTTP/2 stack, and that has since been rolled back. Node 18.18.2 is reported to be in the clear if you need 18.