nodejs/diagnostics

[async_hooks] stable API - tracking issue

AndreasMadsen opened this issue · 32 comments

I posted a requirements list for getting async_hooks to stable a while ago (nodejs/node#14717 (comment)). This is a very similar list.

Internal Technical

Internal Non-Technical

  • Specify whether a triggerAsyncId change will be semver-major, semver-minor, or semver-patch.
  • Identify the "tolerance level" of the perf impact of async_hooks, especially regarding promises.
  • Is Async Hooks exposing the right logical model? (issue: #143)

External Non-Technical

After Stable

/cc @nodejs/async_hooks @nodejs/diagnostics

/cc @jasnell who sometimes asks me about this

@AndreasMadsen how about adding some external criteria, like the ones TC39 or N-API use (nodejs/node#14532)?

  1. a major APM vendor that uses async_hooks (N/Solid)
  2. npm modules with at least 5000 downloads/month based on async_hooks (https://www.npmjs.com/package/trace & https://www.npmjs.com/package/cls-hooked)

@refack added your 1. and 2. to the list. I'm not sure what you mean by "TC39 uses". Regarding N-API, I don't see why that should be a requirement: async_hooks has its own native embedder API outside N-API, so we can mark that as stable without N-API. In fact, N-API shouldn't be marked stable before the async_hooks native embedder API is, since N-API depends on it. But async_hooks doesn't depend on N-API.

Thanks. I just referenced TC39 and N-API as processes that use exit criteria that are independent of the development process, but take into account ecosystem adoption. So I agree that stability of N-API and async_hooks are independent (except for the embedder API)

FWIW, we plan to start using Async Hooks in the Sqreen Agent in the upcoming weeks.

kjin commented

PR to add async_hooks in Stackdriver Trace: googleapis/cloud-trace-nodejs#538

Updated "Deprecate setTriggerId"

Added:

Based on the discussion in the benchmarking WG meeting yesterday, I set up test cases to run the promise-heavy Bluebird and Wikipedia benchmarks (used by the V8 project) with and without async_hooks to answer the first question in this thread. The slowdown is pretty significant even with just an empty init hook (sketched below).
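For reference, "an empty init hook" here means roughly the following, a minimal sketch rather than the exact benchmark harness:

const async_hooks = require('async_hooks');

// Register a hook with only an empty init callback; enabling it is enough
// to turn on the per-async-resource bookkeeping whose cost we measure.
const hook = async_hooks.createHook({
  init(asyncId, type, triggerAsyncId, resource) {
    // intentionally empty
  }
});
hook.enable();

// ...then run the Bluebird / Wikipedia promise benchmarks as usual...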

@mhdawson suggested kicking off some discussion on the performance via nodejs/benchmarking#188 and bubbling the issue up to make sure we consider the performance aspect before async_hooks comes out of EXPERIMENTAL.

FYI, I added an item to the list above to identify the perf impact we're willing to tolerate from async_hooks. Current benchmark data cited in nodejs/benchmarking#181 shows a ~2x-3x slowdown.

bnb commented

Any thoughts on whether this will come out of Experimental before v10 launches? Looking at the remaining items, I can't quite tell if there are still significant barriers, because I'm not familiar enough with the subject 🤔

We haven't run into any blockers in the New Relic agent.

The primary concern we've come up against is that extending the lifetime of a promise leads to drastically increased memory usage. This only seems to be an issue with immediately resolved promises (e.g. Promise.resolve()), since those don't emit an after event and have to be cleared on destroy. It only causes a problem when there is an existing promise leak, and it just requires us to be more diligent about finding and eliminating leaks in the libraries we interact with. I don't think this issue affects the shape of the API, so it's definitely not a blocker.
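To illustrate the pattern (a hypothetical sketch, not our actual agent code): context is keyed by asyncId and normally released in the after hook, but an immediately resolved promise may only ever reach destroy, which fires when it is garbage collected, so its entry lives that much longer.

const async_hooks = require('async_hooks');

// Hypothetical sketch of the cleanup pattern described above.
const contexts = new Map();

async_hooks.createHook({
  init(asyncId, type, triggerAsyncId) {
    if (type === 'PROMISE') {
      contexts.set(asyncId, { triggerAsyncId });
    }
  },
  after(asyncId) {
    // Normal cleanup path; never fires for a promise with no continuation,
    // e.g. a bare Promise.resolve().
    contexts.delete(asyncId);
  },
  destroy(asyncId) {
    // Fallback cleanup; runs only once the promise is garbage collected,
    // which is what extends the effective lifetime of the entry.
    contexts.delete(asyncId);
  }
}).enable();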

We currently only instrument core promises using async hooks, though the data exposed through the hooks should be enough to move over the rest of the core instrumentation when the feature is fully released.

@lykkin do you have any data on real-world performance impact async-hooks adds to applications? Performance seems to be the main blocker at this point, and it would be good to learn what kind of impact you observe.

Unfortunately, it's really difficult for us to get exact numbers from users for the overhead caused by each of our components. As a proxy, we have a couple of benchmarks we run. (Note: all tests were run against 9.11.1.)

One holistic benchmark we run uses acmeair, where we saturate the app for ~30 seconds to measure the throughput delta our agent incurs. Running this test with a set of no-op hooks (i.e. all hooks are defined but don't do anything; see the sketch below) shows a small but non-negligible amount of overhead: 1854 requests/sec -> 1750 requests/sec.
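A set of no-op hooks amounts to something like this (illustrative only, not our agent code):

const async_hooks = require('async_hooks');

// Every callback is defined but does nothing, so any overhead measured is
// purely the cost of having the hooks enabled.
async_hooks.createHook({
  init() {},
  before() {},
  after() {},
  destroy() {},
  promiseResolve() {}
}).enable();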

One thing to note is that the above application doesn't use many promises, so to evaluate the overhead of our monkey-patched promise instrumentation against the async hooks instrumentation, we built a few promise-based microbenchmarks. Running these tests with the no-op hooks, we observe the following baseline overhead:

Without hooks:

forkedTest x 7,197 ops/sec ±0.82% (74 runs sampled)
longTest x 14,377 ops/sec ±0.50% (74 runs sampled)
longTestWithCatches x 14,481 ops/sec ±0.54% (75 runs sampled)
longThrowToEnd x 16,735 ops/sec ±0.68% (77 runs sampled)
promiseConstructor x 28,361 ops/sec ±1.39% (74 runs sampled)
promiseReturningPromise x 6,755 ops/sec ±0.89% (74 runs sampled)
thenReturningPromise x 2,034 ops/sec ±0.63% (76 runs sampled)
promiseConstructorThrow x 344 ops/sec ±3.37% (62 runs sampled)

With no-op hooks:

forkedTest x 1,189 ops/sec ±0.91% (74 runs sampled)
longTest x 2,378 ops/sec ±1.00% (71 runs sampled)
longTestWithCatches x 2,387 ops/sec ±1.11% (71 runs sampled)
longThrowToEnd x 2,424 ops/sec ±0.88% (70 runs sampled)
promiseConstructor x 4,159 ops/sec ±1.11% (61 runs sampled)
promiseReturningPromise x 1,173 ops/sec ±1.98% (74 runs sampled)
thenReturningPromise x 779 ops/sec ±2.08% (70 runs sampled)
promiseConstructorThrow x 315 ops/sec ±3.16% (61 runs sampled)

These microbenchmarks show drastically degraded performance around promise usage. That said, for our use case it still ends up being more performant than our monkey-patched version of promise instrumentation. In general, instrumenting native promises is going to be relatively heavy, since the promise operations themselves are computationally light and any per-promise hook work is proportionally large. I don't think this should be a blocker, but it's a definite pain point.

Hi!

First of all, I want to say many thanks for making all of this happen!
It's a really nice API, and it works well enough to build some really cool stuff!

I'm sorry to ask here, but I've run into an issue that I unfortunately can't solve by myself. Although everything with async contexts seems to be working well, there is a problem with sync execution instead.

More description here: Issue #249

@ofrobots are you still able to champion this?

Unfortunately, no. It would be great if someone else has the drive to move async_hooks forward.

Should this remain open? [I am trying to chase dormant issues to closure.]

Is anyone still working on this?

Since async_hooks is still unstable, I would keep this open until it's considered stable.

Hey there! 👋 I work for AppSignal and we're currently working on a Node.js integration. While I haven't used it in our integration just yet, async_hooks is something I personally would like to see move forward. How can we best help with the effort to progress this?

@xadamy - thanks. One thing I would suggest is joining our bi-weekly diagnostics working group meeting and giving a briefing on goals, engagement model, your availability, etc. We will have one the Wednesday after next. Makes sense?

@gireeshpunathil Sounds good, I'll make sure I'm there next Wednesday! Thank you!

@gireeshpunathil Apologies, did you mean today or next Wednesday for this?

Hey @xadamy, sorry for jumping in here. The next Diagnostics WG meeting is currently planned for Wednesday, 20 Nov, 19:30 - 20:30 CEST :)

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

I figure this shouldn't go stale since it's a long-lived tracking issue?

Hey team, sorry it has taken me so long to follow up. Last time we spoke, I believe you asked me to experiment with the async_hooks API more. We now have an async_hooks implementation in production in our Node.js integration, although it is not much different (if at all) from Cloud Trace's implementation.

I have some thoughts and feedback on the API that it would be great to sync with you all on at some point (if you have time), but it would be good to know if there's anything further you'd like me to do to move this forward?

@xadamy maybe you can plan to come to the next diagnostics meeting and give the team a run-through of your thoughts.

Closing in favor of #437.