pinojs/pino

Pino stopped writing logs

Closed this issue · 23 comments

We are experiencing exactly the same problem as the author of the other issue.
See #1829 for more details, screenshots, and so on.

We contacted the author of the related thread to see if he was able to fix the problem. The advice from @mcollina helped fix the memory leak, but the logs still disappear and do not come back until the application is restarted.

So he closed his issue, but the problem still exists.

A little background: we have several microservices on different Node versions (20.4, 20.7), and they are all experiencing the problem.

We use the pino logger and also nestjs-pino:
"nestjs-pino": "3.5.0",
"pino-logger": "8.14.1",
"pino-http": "8.3.3"

I would like to add that we have been using this logger for several years and have not experienced any problems. The only things that have changed are that some time ago we upgraded to Node 20 and moved to TypeScript 5.

pino-logger is a scam library that should not be used. Use pino.


This looks like a bug in Node.js more than Pino. Try setting UV_USE_IO_URING=0 in your environment variables, as it might be necessary.

I would really need a way to reproduce this problem without:

  1. Nest.js
  2. Kubernetes

Sorry, we are using pino, not pino-logger. It was a typo.

We also think the problem is in Node.js, because we do not see any trouble with Node.js 18. Thanks for the suggestion, we will try UV_USE_IO_URING=0 and report back.
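A minimal sketch of a startup guard we could add (our own addition, nothing pino requires): libuv reads UV_USE_IO_URING when the Node process starts, so the variable has to be set in the pod or container environment before launch, and application code can only verify that it is present:

  // Hypothetical check, run early in the entry point. libuv has already read
  // UV_USE_IO_URING by this time, so this only reports on the configuration.
  if (process.env.UV_USE_IO_URING !== '0') {
    console.warn('UV_USE_IO_URING is not set to 0; io_uring may still be enabled');
  }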

After 2 weeks of everything working as intended, we ran into the issue again.
This time we had a console.error handler attached to SonicBoom for the drop event and some other events:

  import SonicBoom from 'sonic-boom';
  import { stdout } from 'node:process';

  // 1 MiB; maxLength caps SonicBoom's internal buffer, and writes that would
  // exceed it are discarded and emit a 'drop' event.
  const mebibyteInBytes = 1024 * 1024;

  const stdoutStream = new SonicBoom({
    fd: stdout.fd,
    maxLength: 16 * mebibyteInBytes,
  });

  const onLogFailEventHandler =
    (event: string) => (error: (Error & { code: string }) | unknown) => {
      const errCodeStr =
        error instanceof Error && 'code' in error
          ? `, error code: ${error.code}`
          : '';

      console.error(
        `Encountered an error while trying to log (event: ${event}${errCodeStr})`,
        error,
      );
    };

  stdoutStream.on('drop', onLogFailEventHandler('drop'));
  // ... some other events

The other events didn't print, but the drop event is being printed a lot, basically at the same frequency as our logs (as expected).
The interesting thing is that when our drop event prints, a log line from pino is printed right after it:

[screenshot]
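For context, a simplified sketch of one way the stream above can be handed to pino as its destination (reusing stdoutStream from the snippet; not the exact code from our service):

  import pino from 'pino';

  // pino writes through the same SonicBoom instance, so any write the stream
  // discards shows up via the 'drop' handler registered above.
  const logger = pino({ level: 'info' }, stdoutStream);
  logger.info('pino wired to the SonicBoom stdout stream');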

Will try using the UV_USE_IO_URING flag, and see how it goes.

Facing the same issue here, also after upgrading to Node 20.

@habiiev Have you run into this problem again after setting the UV_USE_IO_URING flag?

@chrismeyers We haven't run into the issue yet; everything is working fine at the moment.

I ran into the same issue for an app running on Node 20.10.0 in Kubernetes.
UV_USE_IO_URING=0 so far seems to fix the issue.
Additionally, setting this flag has reduced the app's CPU usage by a huge amount (60% or so; the app is basically only doing network I/O). I haven't profiled the exact cause of this (yet).

I've seen this same problem in one of our services running pino and Node 20. One of the running Kubernetes pods stops logging and its memory keeps building up.

We've also been getting this issue on our production system; we're debating whether to revert to Node 18 or set UV_USE_IO_URING=0.

latest Node v18 also has io_uring enabled.

This is relevant: nodejs/node#50717.

We are facing the same problem on Node 18.18 and Node 20.10.
The pod's logs stop, and memory keeps increasing until the pod eventually dies.

@asadtariq96 could you confirm if setting UV_USE_IO_URING=0 fixes it?

Reporting back on this. We haven't seen any crashes after 4 weeks with UV_USE_IO_URING=0 set. Before, the app was crashing after 2-3 weeks.


@mcollina I can get back to you after running some load tests with UV_USE_IO_URING=0
Meanwhile, here are some heap snapshots. The asJson function seems to be the culprit. Is the following heap snapshot useful in any way?

[heap snapshot screenshot]

That function generates the log lines that accumulate in the buffer. The buffer is not flushed, due to nodejs/node#50717, under certain conditions on some (possibly old) Docker/OS combinations.
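As a side note, a synchronous pino destination takes the asynchronous flush path out of the equation while debugging; a minimal sketch, assuming stdout is the target and that fully synchronous writes are acceptable for the workload:

  import pino from 'pino';

  // Synchronous destination: each line is written to stdout before the call
  // returns, so log lines cannot pile up in an unflushed internal buffer.
  const logger = pino(pino.destination({ sync: true }));
  logger.info('synchronous logging enabled');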

Here's our config (if it helps)

node:20.11.0-bookworm with jemalloc, running on GCP's GKE with the COS_CONTAINERD OS

We ran some load tests with UV_USE_IO_URING=0 and that seems to have fixed the problem 🚀

Hello everyone. libuv dev here; I'd like to take a look at this. Can somebody provide instructions (the simpler the better) on how to reproduce the issue? Thanks

Hi @santigimeno, I encountered the same issue. To reproduce it you might need a service with 10 replicas, running in Kubernetes, handling some 300-400 requests per second (so about 30-40 requests per second on each replica). Our service was running stably on Node 18, and we noticed the degradation when we attempted to upgrade to Node 20.

What we observed was significantly higher CPU usage right out of the gate. Then we noticed that a random pod would stop processing requests (or so we initially thought; it turns out it just stops logging), would start consuming memory (while its CPU use was way lower than the other pods'), and then in about 5 or 6 hours would run out of memory and crash with OOM (the "Allocation failed - JavaScript heap out of memory" Node error). The issue would happen to maybe 2 or 3 pods per day.

Applying UV_USE_IO_URING=0 seems to have fixed the problem. I will let you know in 3 or 4 days' time whether the issue is gone.

After applying UV_USE_IO_URING=0 to the pod environment, we immediately saw the "system CPU" metric drop to a low level, the same as before on Node 18. So this immediately shows there is some regression that causes significantly higher CPU usage when UV_USE_IO_URING=0 is not set.

P.S. I am really not sure the io_uring feature is worth it for us. The significantly increased CPU consumption and the risky memory behavior on default Node 20 suggest to me that the "feature" should not be on by default: from my reading, it only works on a specific combination of kernel version + Docker version + io_uring support, i.e. if you are lucky, and it crashes or runs away on other combinations. If it were my call, io_uring would be off by default and left as an opt-in for folks who see an actual benefit from it.

I think I have a similar issue here,

using node:20.11.1

I'm not on Kubernetes but on AWS ECS Fargate, with only 2 replicas and about 1500 HTTP requests per minute.

I also recently moved from Node 18 to Node 20. This issue started to appear after some usage of AsyncLocalStorage, but the behaviour is the same:

  • Logs stop being sent
  • Memory starts increasing rapidly at some random point in time
  • It seems correlated with the number of requests, in any case

Our current workaround is to reboot instances regularly 👈 sorry for that

Below, the blue spikes to 0 are the actual reboots of our instances

[graph]

And here is my log count with all replicas taken into account, so it does not drop totally to 0 but is divided by 2, as if one of the instances is no longer sending logs.

[graph]

[UPDATE 02/16 10:17]

Using node:20.11.1 👈 Wrong, I'm actually using 2.7.0. I mixed up the versions.
Now testing with 20.11.1

@cometurrata your graphs are showing a long period of time, while Node v20.11.1 was released yesterday. I'm confident that Node v20.11.1 should not have the problem mentioned in this issue, because it disables io_uring by default.

Please open a fresh issue with fresh data and more details on your pino setup.

Closing this as latest Node.js has io_uring disabled by default.

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.