open-telemetry/opentelemetry-js

Add or change resource detection to be synchronous

Closed this issue · 13 comments

svrnm commented

Is your feature request related to a problem? Please describe.

As of now resource detection is happening asynchronous, this means that spans that are generated early at application start might be lost (e.g. lambdas) or the whole application needs to be wait for the detection promise to fulfill, something like

sdk.start().then(() => {
  /* now we can run our application */
})

This is especially problematic when the SDK is loaded in a "don't touch my code" manner, e.g. via node --require tracing.js, as described in the Getting Started, used by the operator for autoinstrumentation or suggested at #2891: there is no way to "wait" for resource detection to finish before the application runs

Describe the solution you'd like

via #2891 (comment) from @dyladan:

This would require us to introduce a new SyncResourceDetector and probably change the existing resource detectors to implement it instead of the async version that currently exists.

I had the same issue and ended up implementing a sync api for the detectors and implemented a few sync detectors with the api.
Code can be found here

I can upstream it to the core repo if there is interest.

This will mean that users will have to choose between the following alternative:

  • Using only sync detectors and auto-instrument (--require)
  • Use sync + async detectors and modify the service code to start only after OTEL async detectors are resolved.
  • Initialize OTEL when resource detectors are resolved and live with no telemetry on service startup. This means that operations invoked on startup will probably not be collected (like database "connect", api request to get config etc).
  • Start OTEL with resource attributes from sync detectors and add additional attributes later when they are resolved async. This has been suggested in the past, could be implemented easily in JS but conflicts with the spec (resource is immutable). A suggested alternative was to keep the resource immutable but replace it with another immutable resource that contains the async attributes when they are known. This can somehow solve the conflict with the spec but might mess up the backends if they use the resource to derive some behavior (like identifying unique instances based on the resource).

Options 1, 3, 4 can probably be used with (--require). Maybe we can have some environment variable to instruct the register script on which option to choose from.

What about packing the resource into a Promise/Future? This one can be created sync and it's immutable.
Whoever needs resource has to await it. As far as I know that's only exporters for now.

svrnm commented

I can upstream it to the core repo if there is interest.

This would be a good starting point for sure, nobody needs to reinvent the wheel!

Using only sync detectors and auto-instrument (--require)

What would be the issue with having only sync detectors?

  • Slow down of application startup? How significant is this impact, are we talking about tens of milliseconds or seconds? I assume this can also depend on the resource detection (like a API is called and that API for whatever reason is slow and taking 5 seconds to respond?)
  • Are there any other disadvantages?

Initialize OTEL when resource detectors are resolved and live with no telemetry on service startup. This means that operations invoked on startup will probably not be collected (like database "connect", api request to get config etc).

As far as I understand, that's what we have right now, looks like it's accepted as "OK" behaviour but having some issues (you loose some insights on application startup, might not work well with shortlived processes, lambdas ,...)

Start OTEL with resource attributes from sync detectors and add additional attributes later when they are resolved async. This has been suggested in the past, could be implemented easily in JS but conflicts with the spec

Looks like a good solution, except the conflict with the spec. How are other language SIGs implementing that? I think they should have similar issues with resource detection on startup? I can not imagine that this is unique to JavaScript.

This can somehow solve the conflict with the spec but might mess up the backends if they use the resource to derive some behavior (like identifying unique instances based on the resource).

Also, this sounds like something that's not unique to javascript?

What would be the issue with having only sync detectors?

  • Slow down of application startup? How significant is this impact, are we talking about tens of milliseconds or seconds? I assume this can also depend on the resource detection (like a API is called and that API for whatever reason is slow and taking 5 seconds to respond?)
  • Are there any other disadvantages?

It is not always possible. If the detect function involves some API call then it cannot block in JS and has to be awaited. As far as I understand it, This would mean that cloud providers' detectors will not be able to be used when sync resource detection is necessary.

א

As far as I understand, that's what we have right now

Where do we have it?

Looks like a good solution, except the conflict with the spec. How are other language SIGs implementing that? I think they should have similar issues with resource detection on startup? I can not imagine that this is unique to JavaScript.

I only investigated it in ruby, which blocks the thread until resources are resolved. This is not possible in JS.

Also, this sounds like something that's not unique to javascript?

Not sure if it's implemented in any other SDK. The RUM SIG need it to store things like session id in the resource which can change over the lifetime of the application. We talked about adding it in the SIG last week, but there was no progress yet. I have a feeling it will mess up backend logic.

What about packing the resource into a Promise/Future? This one can be created sync and it's immutable. Whoever needs resource has to await it. As far as I know that's only exporters for now.

The resource is currently part of the TracerProvider and ReadableSpan interfaces, so it will need to be changed there as well which is a big breaking change.
At aspecto we wrote a SpanProcessor that implements some logic based on the resource which is invoked at onEnd, thus can not trivially await the resource.

If we can make it work, I think this is the best option.

What would be the issue with having only sync detectors?

  • Slow down of application startup? How significant is this impact, are we talking about tens of milliseconds or seconds? I assume this can also depend on the resource detection (like a API is called and that API for whatever reason is slow and taking 5 seconds to respond?)
  • Are there any other disadvantages?

It is not always possible. If the detect function involves some API call then it cannot block in JS and has to be awaited. As far as I understand it, This would mean that cloud providers' detectors will not be able to be used when sync resource detection is necessary.

One example is GCP where it is required to call the gcp metadata server to get some resource attributes.

As far as I understand, that's what we have right now

Where do we have it?

Looks like a good solution, except the conflict with the spec. How are other language SIGs implementing that? I think they should have similar issues with resource detection on startup? I can not imagine that this is unique to JavaScript.

I only investigated it in ruby, which blocks the thread until resources are resolved. This is not possible in JS.

At the SIG meeting we talked briefly about this. At least ruby, java, python all simply block startup.

Also, this sounds like something that's not unique to javascript?

Not sure if it's implemented in any other SDK. The RUM SIG need it to store things like session id in the resource which can change over the lifetime of the application. We talked about adding it in the SIG last week, but there was no progress yet. I have a feeling it will mess up backend logic.

Most other SIGs are able to block execution. Those that aren't (I think erlang?) just decided to use sync only. We are the only one that has an async startup process.

What about packing the resource into a Promise/Future? This one can be created sync and it's immutable.
Whoever needs resource has to await it. As far as I know that's only exporters for now.

I have an idea here. Rather than packing the whole Resource into a promise, we can add a promise to the Resource for just the async attributes. Synchronous detectors can be set in the attributes when the Resource is created. This way there is no breaking change, SpanProcessors can continue to access the synchronously detected attributes on the resource. Exporters can await the resource detectors if they want to.

Working on a prototype for this in #2933

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

I was thinking that for node 19+, instead of node -r, one could use node --import (import an ESM module), and use a SDK initialization with a top-level await.
The rest of the application code can remain a commonJS module, thereby allowing auto-instrumentation.

In other words:

  • Blocking SDK initialization
  • Resources are fetched before the application code starts
  • No change in application code needed

Sadly, Node19 is not available for AWS lambdas, so I guess I have to go down another route.

From what I understand, if the SDK initialization is async, then at least the auto-instrumentation needs to be loaded synchronously before the first async call ? Is that correct ? If so, the documentation seems to be lacking

That can probably work for 19+ but we won't be able to rely on it as the "default" solution for a long time.

From what I understand, if the SDK initialization is async, then at least the auto-instrumentation needs to be loaded synchronously before the first async call ? Is that correct ? If so, the documentation seems to be lacking

Yes that is correct. Any updates to the documentation would be welcome.

Actually there is no need to wait with instrumentation creation until SDK has finished resource detection (see #3146). So maybe we should fix this instead of documenting the bug.

#2933 was auto closed. I think there were a few outstanding issues but where do we stand overall on this approach? Do we still need this?