nodejs/node

Integrate C++ AsyncHooks Embedder API with native abstraction

Closed this issue · 46 comments

  • Version: v8.x
  • Platform: all
  • Subsystem: n-api, nan, async_hooks

The AsyncHooks Embedder API has now been merged, we need to integrate this into N-API and NAN such that userland add-ons can inform async_hooks about the context.

I'm not very familiar with either APIs, but NAN is the API I know the best, so I will explain it from that perspective.

AsyncHooks allows userland to get notified about all asynchronous event and understand what caused the asynchronous job to be tasked. This requires 4 events to be emitted:

  • init: emitted with the asynchronous job is created (called a resource). EmitAsyncInit emits this.
  • before, after: emitted with the asynchronous job calls back, this can happen multiple times. MakeCallback now emits these when two additional parameters are passed (async_id and trigger_id).
  • destroy: emitted when the resource can't call back anymore. EmitAsyncDestroy emits this.

there is also a high-level API, a C++ class called AsyncResource but I suspect this isn't useful for NAN or N-API.

In terms of NAN I think there is almost a 1 to 1 mapping between Nan::Callback and the AsyncHooks API. I believe the following changes should be made:

This is very similar to the AsyncResource class. It is not clear what the resource should be as Callback::Callback does not take such a parameter.

I believe @mhdawson said during a diagnostics meeting that if NAN required changes then likely N-API would need changes too.

/cc @mhdawson @addaleax @trevnorris @nodejs/diagnostics @nodejs/n-api @nodejs/addon-api @nodejs/nan (@kkoopa)

I can’t really speak for NAN; my best guess is that it will want to build its own thing similar to AsyncResource.

As for N-API: I’m not sure. We could just proxy the low-level API to C and add a napi_make_callback variant that takes id and trigger_id parameters, but that might require committing to the native async_hooks API as it currently is, i.e. make that non-experimental?

One thought I’ve had is that we could add a CallbackScope class that basically does what MakeCallback does but without necessarily jumping into JS; N-API could use that to make async worker’s complete callbacks automagically run in the correct async context. That wouldn’t require any API changes, too.

I need to familiarize myself more with this, but initially I think @addaleax is right in that it would be better to make a new class for this (new) concept. Nan::Callback has always been a convenience class which is mostly around for legacy reasons (legacy within NAN). It was useful in the Node 0.10 -> 0.12 transition with the Persistent changes to reduce leaky and broken code. However, making it do something new seems like it could backfire in unforseen ways.

Another question is that of compatibility. How far back can this be implemented without patching node.js?
All the way down to 0.10? NAN only offers functionality that works on all supported versions, but supports pick-and-mix with raw V8, libuv and anything else. That way one can write the bulk of code using NAN's abstractions and use, say, the V8 API for Futures if it will only run on versions of node that supports V8 Futures.

However, making it do something new seems like it could backfire in unforseen ways.

I'm not exactly sure what "new" covers, I see this as adding annotation to existing functionality. My thoughts behind annotating Nan::Callback is that it would make the userland adoption transparent, but if very few are using Nan::Callback then that will of cause not be the case.

Adoption by userland is critical in this case, because if just one native addon model is missing the annotation then the usefulness of AsyncHooks is dramatically decreased.

Another question is that of compatibility. How far back can this be implemented without patching node.js?

Currently, AsyncHooks will only be in node v8.x, we may backport it to v6.x and perhaps even v4.x. As said the AsyncHooks Embedder API should be seen as annotation, thus for node versions where the AsyncHooks Embedder API is not supported that annotation simply won't be added.

Adoption by userland is critical in this case, because if just one native addon model is missing the annotation then the usefulness of AsyncHooks is dramatically decreased.

I agree – once async_hooks is no longer experimental, we should probably deprecate the old functionality.

we may backport it to v6.x and perhaps even v4.x

I’m pretty sure we won’t backport to v4.x?

I do not know what the explicit usage rate of Nan::Callback is. It is however implicitly used in Nan::AsyncWorker, which most addons use for asynchronous functions. Holdouts are mostly projects that have been around for a long time and simply updated their existing, working baton code at some point.

I am not sure what "new" covers either, which is why I need to read up on what async hooks are supposed to do before I can make an informed statement.

However, the point of NAN is to offer a set of functionality that gives the same observable behavior on all supported versions of Node.js. There should not be anything that only works on specific versions, since that would defeat the point of having a version-independence abstraction layer. An end-user should be able to assume that all code does what it should on all supported versions. This also includes not having functionality that becomes a no-op on certain versions, since that hides that the code does not work as intended.

That being said, if all this does is adding some extra information which can be useful in some cases, there might be a way to add some of it to NAN so that code running on newer versions of node.js does something in a better or more efficient manner. For instance, if async hooks subsumes an older API which may then be deprecated, the NAN code for all newer versions that support this functionality could be made to use it instead of the older functionality, since that would fall within implementation, not version-independent functionality.

To make this a bit clearer: We can think of two aspects of NAN: One is the external, documented API. Only functionality which works on all supported versions goes here. The other is the internal implementation (actually there are several of them depending on node version), which has greater freedom. Here, the only restriction is: Given a program in state S which calls a part of NAN, at return from NAN, the program is in state S', regardless of which version-dependent implementation was used.
This is a truth with modification, but we try to keep it as close as possible. In practice, we may end up in a state S'' which is a superset of S'. While not ideal, this is sometimes the only practical way. What is not allowed is to sometimes end up in a state S''' which is a strict subset of S'.

It is however implicitly used in Nan::AsyncWorker, which most addons use for asynchronous functions.

I think it should be easy enough to add support for everything we need to AsyncWorker transparently, because N-API is basically going to need the same things anyway.

There should not be anything that only works on specific versions, since that would defeat the point of having a version-independence abstraction layer.

You mean, that works from the addon writer’s perspective, right? It would be okay if Nan had code that only worked well with the versions of Node to which async_hooks has been ported, correct?

Also: Would it help if Node had a #define that indicated the presence of the C++ embedder API?

Yes, from the addon writer's perspective, if a (documented) function appears in NAN, the contract states that it works the same on all versions of Node.js. Suppose a function in NAN returns an object with a set of keys and values. Those keys and values should be the same on all versions of Node. In some specific instances it might be allowable to allow more keys and values to appear, but then they fall out of NAN's blessing, not documented by NAN and are used at your own risk (usually because we would not take active steps to remove kv-pairs if that returned object was obtained from, say, a V8 function).

Sniffing specific versions of Node from NODE_MODULE_VERSION and NODE_VERSION is not too hard, there are only ever problems when nonstandard implementations which are not Node but claim to be so enter the picture, and I suspect a special define could end up meaningless there too...

From what I now have gleaned about what the async hooks are supposed to do, it seems that all NAN needs to do is change implementations of relevant existing APIs for versions of Node that support async hooks so that async hooks transparently work, since the end user may want to use async hooks while running on newer versions of Node, while still using NAN. Since no new external APIs related to async hooks need to be added, that should be perfectly fine. It is a matter of transparency, or not getting in the way.

@matthewloring and I had a discussion yesterday about exposing APIs for async hooks in N-API, and I've started looking into this further. Before proposing a design, I have a few questions:

  • What is the reason for the async_context structure? It looks like it would be simpler for EmitAsyncInit() to just return the execution async ID, since the trigger ID is provided by the caller. Or could there be other kinds of async IDs that might be added to the async_context structure in the future? If the structure is likely to change, we should consider how such a change can be handled without breaking N-API's ABI stability.

  • The async_id type is a typedef of double. Does that numeric value need to be exposed to users (maybe for logging/diagnostics) or can the type be completely opaque?

  • Does N-API need to continue supporting domains? A possible N-API design for async hooks might involve only exposing the "primitive" methods for creating/getting async IDs and emitting async hooks events, without actually using the node::AsyncWrap class that currently includes domain integration. (The Napi::AsyncWorker class in node-addon-api would then help with calling the N-API async hooks C functions at the appropriate times.)

Does that numeric value need to be exposed to users (maybe for logging/diagnostics) or can the type be completely opaque?

It's only a double b/c that's the only numeric type that JS supports, and the async id is assigned as an object property for quick lookup. Can you explain what you mean by "opaque" more?

A possible N-API design for async hooks might involve only exposing the "primitive" methods for creating/getting async IDs and emitting async hooks events,

That API exists in in JS and would be possible in C++, but @AndreasMadsen is more knowledgeable as to whether there would be any issues doing this.

It looks like it would be simpler for EmitAsyncInit() to just return the execution async ID, since the trigger ID is provided by the caller.

It is provided by default but the usual scenario is going to be that the trigger ID is inferred from context.

The async_id type is a typedef of double. Does that numeric value need to be exposed to users (maybe for logging/diagnostics) or can the type be completely opaque?

The typedef is a somewhat intentional step towards making it opaque, yes. It might still be good to be able to log the value, or to pass it to JS and have it comparable to the values JS sees.

Does N-API need to continue supporting domains?

Maybe/probably? It should not be hard, and while it would be really nice to have domains implementable in terms of async_hooks, we’re not there yet.

A possible N-API design for async hooks might involve only exposing the "primitive" methods for creating/getting async IDs and emitting async hooks events,

That API exists in in JS and would be possible in C++, but @AndreasMadsen is more knowledgeable as to whether there would be any issues doing this.

@trevnorris Not sure what you mean, the C++ API for this already exists.

Imo, the main thing that N-APi should focus on for now is enabling async_hooks and domains for the napi_async_work functions; that shouldn’t be hard and doesn’t even require exposing new native APIs.

It looks like it would be simpler for EmitAsyncInit() to just return the execution async ID, since the trigger ID is provided by the caller.

@addaleax is right. I want to add that there is no API for getting the default value, other than EmitAsyncInit(). It also prevents some easy to make bugs. I think we would like to do the same in the JS Embedder API but unfortunately tubles/structs are too expensive in JS.

The async_id type is a typedef of double. Does that numeric value need to be exposed to users (maybe for logging/diagnostics) or can the type be completely opaque?

I have sometimes abused the fact that the double is increasing with time, which makes it an easy sortkey and thus I can use some clever graph hacks. But it should be opaque. Perhaps, at some point in the future we will use int64 instead.

A possible N-API design for async hooks might involve only exposing the "primitive" methods for creating/getting async IDs and emitting async hooks events,

That API exists in in JS and would be possible in C++, but @AndreasMadsen is more knowledgeable as to whether there would be any issues doing this.

I'm flattered, but I will mostly defer to @addaleax as she wrote that code. In the future I think we would like to rewrite domains to use async_hooks. But it is unclear right now if that would be too breaking for users. I don't think it will be too breaking but it is hard to tell for sure.

In the future I think we would like to rewrite domains to use async_hooks. But it is unclear right now if that would be too breaking for users. I don't think it will be too breaking but it is hard to tell for sure.

By the way, I tried that a while back, and… the tricky bit isn’t implementing domains on top of async_hooks, the tricky bit is going to be that Node’s unhandled exception handling is a horrible mess of spaghetti code. We should completely re-write that, but that’s going to be a breaking change in itself.

@addaleax

Not sure what you mean, the C++ API for this already exists.

Nevermind me. Was thinking of something else. It's what I get for trying to respond to issues while in a meeting.

@AndreasMadsen

In the future I think we would like to rewrite domains to use async_hooks. But it is unclear right now if that would be too breaking for users. I don't think it will be too breaking but it is h

When AsyncListener was first introduced I spend a month trying to do this. Got it ~90% the way there, but do to some fundamental differences the last 10% was impossible.

My initial work on this is at nodejs/node-v0.x-archive@bc39bdd but after the month of effort getting the async error handling down correctly I decided to give up. Unfortunately the commits for those changes were lost when I deleted my joyent/node folder.

It's only a double b/c that's the only numeric type that JS supports, and the async id is assigned as an object property for quick lookup.

I don't understand that reasoning. It could easily be treated as an integer in the native API. While working on performance optimizations for N-API, I found that creating a V8 number value is somewhat faster using v8::Integer::New(int32) compared to v8::Number::New(double), even though the latter will eventually still create an optimized integer representation when the double value has no fractional component.

Can you explain what you mean by "opaque" more?

A pointer to an undefined structure. N-API uses opaque pointers for some things so that the actual type or structure of a value is hidden from user code and can change without breaking ABI stability.

Imo, the main thing that N-APi should focus on for now is enabling async_hooks and domains for the napi_async_work functions; that shouldn't be hard and doesn't even require exposing new native APIs.

OK, that should be relatively easy to do. I still think eventually we will need to expose more of the async hooks APIs directly, but that can be lower priority.

Imo, the main thing that N-APi should focus on for now is enabling async_hooks and domains for the napi_async_work functions; that shouldn't be hard and doesn't even require exposing new native APIs.

After studying this more, I don't think that is the right approach. The current N-API async functions (napi_create_async_work and friends) are adapters for libuv APIs (uv_queue_work etc.), not Node APIs. I think hiding async hooks behind the same abstraction would limit flexibility. The current async support in N-API is known to be incomplete: advanced async scenarios (such as background threads) may require direct calls to libuv. Future N-API async APIs are likely to be lower level, and combining those with async hooks probably won't make sense. So I think it will be better to keep the async hooks separate, at least in the C API. Then the C++ wrapper (and user code) will have more flexibility to compose N-API async hooks APIs, N-API async work APIs, libuv work queue APIs, and/or any other async code in different ways.

Oh and as for domains... after studying the code more I realized that won't be a problem, since the N-API implementation (using either approach) would certainly use AsyncWrap::MakeCallback(), which already handles domains.

@jasongin

I don't understand that reasoning. It could easily be treated as an integer in the native API.

Reasons:

  1. A 32 bit id is too small, so must use something larger.
  2. The id is assigned as an object property, and it helps to prevent any unexpected side-effects by treating them as two different types (e.g. int64_t in C++ and double in JS)
  3. The id can fit in a Float64Arrray, which makes operations on it between C++ and JS much faster.

I found that creating a V8 number value is somewhat faster using v8::Integer::New(int32) compared to v8::Number::New(double)

This is true, except as stated above int32 isn't large enough. We'll quickly run out of ids at that size. Also, in core we never need to do Number::New() in C++. Even if we did, it's still a net gain for performance to use a double because it's then possible to write the values to a Float64Array.

A pointer to an undefined structure. N-API uses opaque pointers for some things so that the actual type or structure of a value is hidden from user code and can change without breaking ABI stability.

As far as core is concerned, and I believe for some implementors, it is a disadvantage to not know what the type is. Being able to wrap the ids in a Typed Array is very useful for both performance and making cross communication between C++ and JS easier.

it is a disadvantage to not know what the [async id] type is

That is what I had assumed, thanks for confirming. It still seems strange to me as a double in the C++ API though... and @AndreasMadsen suggested above it might be int64_t in the future?

The challenge now is we need to ensure that we have an async hooks interface that is ABI-stable, at least by the time N-API exits experimental status which is expected within a couple months. (Even node major versions must not break N-API ABI-stability. We may add new APIs and deprecate old ones if necessary, but never make breaking changes.)

Even if the async hooks feature is still technically experimental for longer than N-API, can we settle on a final public interface for it earlier than that? (@refack and I discussed this some at the VM summit on Friday.)

A non-opaque N-API async ID and context would look like this:

typedef double napi_async_id;

typedef struct {
  napi_async_id async_id;
  napi_async_id trigger_async_id;
} napi_async_context;

If we define it that way, we can never add fields to or otherwise change the definition of the napi_async_context structure. But it is the most straightforward and performant approach.

Alternatively, we can make the context structure opaque; then have accessors for retrieving fields from the structure, something like:

napi_status napi_get_execution_async_id(napi_env env,
                                        napi_async_context async_context,
                                        napi_async_id* result);
napi_status napi_get_trigger_async_id(napi_env env,
                                      napi_async_context async_context,
                                      napi_async_id* result);

Then it would be OK to add fields to the structure without breaking ABI stability. But the accessors would have some performance cost, and the context structure itself would have to be allocated on the heap instead of on the stack.

Also, the reason it's better to lock down the async hooks interface in N-API before N-API exits experimental status is so that an async context parameter can be included in the napi_make_callback() API:

napi_status napi_make_callback(napi_env env,
                               napi_async_context async_context,
                               napi_value recv,
                               napi_value func,
                               size_t argc,
                               const napi_value* argv,
                               napi_value* result);

If the async context isn't included in that API, then native addons will be developed that call napi_make_callback() without any async context, breaking the async context chain for everyone else. Sure we could add a napi_make_async_callback() API later, and deprecate the old one... but that would be unfortunate.

@trevnorris @AndreasMadsen @addaleax Can you comment on the two options I defined above for exposing async context via N-API? I would like to implement this soon.

I don't really have an opinion. ABI stability is not something I have a lot of experience with. As for exposing the specific type, I think we should just leave it unspecified. I very much doubt that there is a good use case for knowing it and we can always document it later.

@jasongin Do you think a opaque value without getters would work for you? I am not sure we really need the latter (and if so, we can always revisit), and we might be able to get around the heap allocation restriction with clever (read: dark magic) tricks.

An opaque napi_async_context without getters could work. But that seems to go against what you said above:

it is a disadvantage to not know what the [async id] type is

We can always add the getters later if and when they are needed. The only concern then is the small performance cost of heap allocation and getters, compared to the non-opaque context structure. The getters could never be inlined, since they are called through the ABI. Do we expect native modules might need to repeatedly get async IDs in performance-sensitive scenarios?

we might be able to get around the heap allocation restriction with clever (read: dark magic) tricks

I'm curious what you had in mind, because I don't see a way around it.

we might be able to get around the heap allocation restriction with clever (read: dark magic) tricks

I'm curious what you had in mind, because I don't see a way around it.

Using a tagged pointer with 2×31 bit fields for the values on 64-bit platforms. That’s going to work pretty well for a lot of applications.

Do we expect native modules might need to repeatedly get async IDs in performance-sensitive scenarios?

I don’t really see why they would need access at all, tbh. Passing along the async_context should suffice for 99.9 % of native addons, I think?

Using a tagged pointer with 2x31 bit fields for the values on 64-bit platforms.

How? The async context structure consists of two 64-bit (double) async IDs.

@jasongin It’s still reasonable to assume that for a long time, the IDs will fall into the [0..2**31] range. I know it won’t hold true forever, and depending on the application that may be faster or slower, but my estimate would be that it would be worth it.

OK, I had missed that the optimization could be conditional on the IDs being < 2**31.

I've mostly implemented the napi_async_context at jasongin@c5761a9. I still need to add more tests, and maybe wait until #14697 lands.

I'm reopening as the NAN interface is still being considered.

@addaleax mind adding a comment on why this should be closed?

@AndreasMadsen Sorry – didn’t notice this happened. This was because I synced ayojs/ayo with Node’s master, which contained 53de2e1 that had a Fixes: tag for this issue even though it shouldn’t have one…

Ah, I see. That can happen :)

@ofrobots volunteered to look into this & see if we can move the NaN support for async hooks forward.

Still WIP, but here's my work on Nan support so far, incase people want to provide early feedback: ofrobots/nan@66b34ce. With this any Nan based modules using AsyncWorker will get automatic context propagation (yay!).

I am still trying to figure out how to deal with AsyncProgressWorker. It seems that HandleProgressCallback is pure virtual so users must be calling Callback::Call directly. IMO context should be bound on the worker rather than on Nan::Callback as it is possible for users to create a Callback as re-use the handle across multiple async contexts.

I don't see an obvious, non semver-major, way to make AsyncProgressWorker code context-correct™ by construction. Ideas are welcome. Let me know if any of my assumptions / inferences are incorrect.

/cc @bnoordhuis

I agree that context should be on the worker and not on the callback. While a non-breaking solution would be preferrable, a well-managed breaking change can still be dealt with. I believe the majority of addons do not currently use AsyncProgressWorker, so the impact of a breaking change should be fairly limited. It might be possible to have two implementations for a while, with the old one marked as deprecated and then removing it in a major release.

I have done some thinking on this, and have a plan for what I would like to proceed with it. I am presenting it here for feedback before proceeding.

The driving motivation here is to make it easy for native module authors to write modules correctly propagate async context. Correctly keeping track of async context is important for program understandability and debugging. A single ill-behaving module can break debugging tools, so it is important that we give the right tools and APIs to Nan and N-API module authors to be able to do things correctly. Writing correct code should be straightforward, and writing context-breaking code should require intention.

N-API has already solved this problem (thanks @jasongin, 0c258bd), so we are now left with only Nan.

A secondary goal might be that we would like native module authors to migrate to N-API rather than continuing to use Nan. I believe it would be good for the Node project if more native modules were written in N-API Nan. The question becomes how much of the new async resource API should be provided in Nan. Do we simply deprecate/remove “dangerous” APIs in Node.js and Nan and ask users to migrate to N-API? Or do we port over the AsyncResource API (knowing that this might decrease the incentive for people to switch to N-API).

I believe we should do the latter; switching to N-API is a lot of work for most module authors; let us not make that a barrier to making Node.js more debuggable.

So here is a proposal/plan:

  1. Modify Nan’s AsyncWorker class, as in commit (ofrobots/nan@66b34ce) to automatically propagate context.
  2. Add async_context and async_resource as concepts to Nan. On old versions of Node, these would be opaque no-op values.
  3. Provide context accepting variants of Nan::MakeCallback and Nan::Callback::Call
  4. Deprecate the legacy Node.js’ node::MakeCallback APIs that are context oblivious. The context accepting alternative node::MakeCallback APIs have been available since Node 8. I would like this to do a docs-only deprecation in time for Node 10, runtime deprecation in 11 and hopefully we can remove them by Node 12.
  5. Remove the context-oblivious versions of Nan::Callback::Call, Nan::MakeCallback in a semver major release of Nan. Some semver major changes will also be needed in the AsyncWorker class hierarchy, but I expect this be an easy migration to users.

The main thing time sensitive is the docs-only deprecation of legacy node::MakeCallback in time for Node 10.

I’d love to hear people’s feedback on this before I start executing on this.

@ofrobots I fully agree; and to add to your thoughts on Nan vs N-API, N-API is never going to be a full replacement for native addons that use the raw V8 & Node C++ APIs, so not fixing this in Nan is not really an option.

Qard commented

My thinking is that we should update Nan, but message it as a maintenance update and deprecate Nan in favor of N-API. We could do docs deprecation, then warning message, then removal, to give people plenty of time to switch while still encouraging them to do so without putting pressure on it to happen immediately.

Ha. So, I can't find where node::MakeCallback is documented. Not sure how we can doc-deprecate an undocumented thing, lol. I guess a compile-time warning is like a doc-deprecation?

So, I can't find where node::MakeCallback is documented.

It isn’t documented outside what’s in the node.h header and the Nan docs. It’s a terrible name anyway, and I’d generally recommend always using CallbackScope instead for clarity.

I guess a compile-time warning is like a doc-deprecation?

Yeah, that is a better system for deprecating anyway, because unlike doc-only deprecations in Node this is something that the developer of a module gets to see, not the one who runs the code and might have to go complain to the developer. 👍

First Nan PR: nodejs/nan#729

Deprecation of legacy node::MakeCallback node.h APIs: #18632

Nan::AsyncWorker and Nan::Callback support: nodejs/nan#731

With the nan changes landed, I believe this is ready to be closed now. Please reopen if anyone disagrees.

Thanks @ofrobots for driving this. :)