Improve debug logging performance

Question

Improve debug logging performance

Opened this issue 2 months ago · 7 comments

This flame graph taken by @kantai shows debug logging has a large impact on node performance.

33% of the sample is in make_tenure_bitvector, but almost entirely invoking the log commands

I've also observed that RPC endpoints return much quicker when the node does not have debug logging enabled; this can have a significant impact if a stacks node with debug logs enabled is broadcasting to a signer or other downstream observer.

One idea from @hstove that can alleviate this is to switch to async logging via slog_async, however this would need to be researched and weighed against any other suggestions, as it may result in not getting critical logs in the event of a panic under the right conditions.

Answer 1 · 2024-11-08T22:34:37.000Z

One idea from @hstove that can alleviate this is to switch to async logging via slog_async, however this would need to be researched and weighed against any other suggestions, as it may result in not getting critical logs in the event of a panic under the right conditions.

Yep -- I think if we want to avoid getting aggressive about reducing the debug log volume, the logger will need to support an async mode.

I think if we're opening the "async logging" can of worms, though, it would probably be worthwhile to move to the tracing framework (https://github.com/tokio-rs/tracing). I know that tokio is controversial for some folks (@jcnelson for one), but tracing is a very well supported logging system, and it seems like one that many rust projects are converging around (in fact, slog's readme now asks users to strongly consider tracing instead... https://github.com/slog-rs/slog?tab=readme-ov-file#you-might-consider-using-tracing-instead)

Answer 2 · 2024-11-09T04:33:48.000Z

As an aside, I am very interested in getting tracing support added to the stacks node (and potentially signer node) as well. If tokio's tracing lib were used to generate spans (as part of a separate issue), it may make sense to also look into it for logging as well since it may be cleaner and easier than mixing libs.

Answer 3 · 2024-11-10T03:30:46.000Z

It seems that 33% of the flame graph is just log output from InvGenerator::make_tenure_bitvector().

Perhaps for now, we can move the inner loop of that to trace!-level.

@kantai Yep -- I think if we want to avoid getting aggressive about reducing the debug log volume,

I'm 100% down to substantially reduce the debug log volume. Some of the things we log in the networking stack really need to be trace!, if nothing else.

Down the road, I'm in favor of a combination of @CharlieC3's request to have per-module or per-aspect logging, and my request to have an RPC interface for toggling logging, so we can selectively activate subsystem logging when diagnosing a problem.

@CharlieC3 as it may result in not getting critical logs in the event of a panic under the right conditions.

This right here is the reason I'm not in favor of asynchronous logging, regardless of the concurrency paradigm used. We want to see crashes or other debug-worthy behaviors reported as they happen, so we shouldn't introduce an arbitrary delay between when the offending logic executes and when the log data is emitted. That is, after all, why we log things.

Answer 4 · 2024-11-11T16:55:16.000Z

It came up on the sprint call today to consider the possibility of using tracing instead of slog.

It is my understanding that the value proposition of tracing over slog is that it has multiple backends. My feedback to this is that this is irrelevant, since ingesting log output into one or more backend services is not only well outside the node's responsibility, but also been a solved problem since well before any of us were born and requires negligible work on our end. All the user needs to do is pipe the log output from the Stacks node into whatever service needs to ingest it. All we need to do is inform users of this, and perhaps provide an example or two if anyone asks.

Given the heterogeneity of node users, there isn't a one-size-fits-all solution to log ingestion. Home users have different needs from developers, who in turn have different needs from exchanges (and each exchange likely has their own internal tooling). It's not up to us to provide solutions to log ingestion, and trying to do so only makes our lives harder in the form of more object code to compile, more threads to trace, and more service tickets to deal with incompatibility issues which may arise. Instead, it's the node's job to give them the log data in an ingestible format (e.g. lines of text, JSON objects over stderr, etc), so it can be sent to the log-processing back-end.

Answer 5 · 2024-11-11T16:56:56.000Z

If folks feel strongly that we have to help users with log ingestion, then that's all well and good. Let's create one or more log ingestion binaries into which the node's output can be piped. And, let's do so in a separate repository, and let's treat this as a "green field" out-of-scope project that can be taken on by 3rd parties.

Answer 6 · 2024-11-11T17:43:58.000Z

@jcnelson For posterity, do you recall which backends from the tracing lib were being discussed? If it were regarding backends which ship logs, like tracing-loki, then I have to agree. It's often safer, easier, and more performant to configure an external log collector which ships these logs rather than doing it in code.

Answer 7 · 2024-11-11T17:55:46.000Z

I don't think any backends in particular were being discussed -- I don't think it got that far. The conversation stopped at the conceptual level of why there are distinct frontends and backends in the tracing crate.