GoogleCloudPlatform/opentelemetry-operations-js

Timestamps with decimals cause metric forwarding to get stuck

Closed this issue · 13 comments

What version of OpenTelemetry are you using?

"@opentelemetry/api": "1.4.1",
"@opentelemetry/core": "1.15.2",
"@opentelemetry/resources": "1.15.2",
"@opentelemetry/sdk-metrics": "1.15.2",

What version of Node are you using?

v16.14.2

What did you do?

Very occasionally, metrics created with OpenTelemetry end up with a decimal part to their timestamps. This is presumably due to numeric precision issues causing integer math to create a floating point number. Attempting to process metrics with floating point timestamps causes a RangeError exception to be thrown. The exception isn't caught until somewhere well up the stack. A side effect is that the list of pending metrics is not cleared out, resulting in this error being hit every time the system tries to forward the list of pending metrics. The following is an example of a metric seen when this problem was encountered:

    "attributes":
    {
    },
    "startTime":
    [
        1695096840,
        534000118.00000006
    ],
    "endTime":
    [
        1695096846,
        89824450
    ],
    "value": 22

The function that transforms the metrics should:

  1. Round timestamps to prevent inadvertent decimal values from creating problems
  2. Catch any exceptions to prevent them from bubbling up the stack and bypassing clean-up code. There is already a precedent in the function for discarding invalid metrics. This would extend it to discard any metrics where an exception was thrown while it was being processed.

I'll submit a PR shortly with the above fixes.

@sethrwebster do you have a repro for making the OpenTelemetry SDK generate non-integer numbers? This sounds like a bug upstream, I think consumers should safely be expecting int since process.hrtime.bigint() exists

I was wondering if it's possibly related to open-telemetry/opentelemetry-js#4014 but I don't think so given that is converting from HrTime.

Attempting to process metrics with floating point timestamps causes a RangeError exception to be thrown. The exception isn't caught until somewhere well up the stack.

Could you explain a bit more, what is processing these metrics? A stack trace would be helpful

Hi, thanks for taking a look at this. I think you are correct that this is a bug upstream. But because the underlying HrTime type allows invalid data, it seemed useful to prevent the issue here rather than rely solely on finding all places HrTime objects are created and making sure they are done so properly (though clients supplying invalid data should also be fixed). Unfortunately, HrTime is defined as a type and not a class, so I don't know any way to enforce constraints (like the values being integers) when it is created.

In our case, we are using metric objects such as Histograms to store the metric values. We don't supply any timestamps, so those are being generated somewhere in the opentelemetry-js library.

I believe they're only created with the code I linked above, which is using Math.round() and Math.trunc() to ensure ints

I'll admit I'm at a bit of a loss on how to proceed. Our code generating the metrics doesn't create the timestamps, and all the places I've looked in the opentelemtry-js library where metrics are recorded appear to be properly using the millisToHrTime function. I'm not familiar with this codebase, so I'm sure I've missed code paths though. Regardless, we are definitely seeing examples of metrics that contain a non-integer nanosecond value and it is causing the metrics processing to get stuck. The stack trace isn't very informative, as it seems the place the error is thrown is just exporting a set of previously generated metrics, so it doesn't indicate where or by whom the metrics were created:

Invalid time value
{"stack":"Error: PeriodicExportingMetricReader: metrics export failed (error RangeError: Invalid time value)
    at PeriodicExportingMetricReader._runOnce (/app/node_modules/@opentelemetry/sdk-metrics/src/export/PeriodicExportingMetricReader.ts:91:13)
    at runNextTicks (node:internal/process/task_queues:60:5)
    at listOnTimeout (node:internal/timers:538:9)
    at processTimers (node:internal/timers:512:7)
    at Timeout._onTimeout (/app/node_modules/@opentelemetry/sdk-metrics/src/export/PeriodicExportingMetricReader.ts:101:9)","message":"PeriodicExportingMetricReader: metrics export failed (error RangeError: Invalid time value)","name":"Error"}

Any thoughts on how to resolve this issue would be greatly appreciated.

How often are you seeing the SDK produce these non-integers? If you could repro it and open an issue in upstream open-telemetry/opentelemetry-js repo that would be awesome.

That stack trace is useful, I agree it's probably coming from new PreciseDate(...).toIsoString() since I can repro that:

$ node
Welcome to Node.js v18.13.0.
Type ".help" for more information.
> const {PreciseDate} = require('@google-cloud/precise-date');
undefined
> let d = new PreciseDate([1695096840, 534000118.00000006]);
undefined
> d
PreciseDate Invalid Date { _micros: 0, _nanos: 6e-8 }
> d.toISOString()
Uncaught RangeError: Invalid time value
    at PreciseDate.toISOString (<anonymous>)
    at PreciseDate.toISOString (/usr/local/google/home/aaronabbott/repo/opentelemetry-operations-js/node_modules/@google-cloud/precise-date/build/src/index.js:297:22)
>

A side effect is that the list of pending metrics is not cleared out, resulting in this error being hit every time the system tries to forward the list of pending metrics.

That definitely seems like a bug outside of the specific issue but I'm confused how that's happening because we don't buffer/retry failed exports. What do you mean by pending metrics? Afaik the exception would bubble out of this function:

...createTimeSeries(metric, resource, this._metricPrefix)

Causing the promise to reject and do this

resultCallback({code: ExportResultCode.FAILED, error: err});

I think what's actually happening is that the bad timestamp is the start timestamp, from your original bug report:

    "startTime":
    [
        1695096840,
        534000118.00000006
    ],

OTel will keep reporting this same start timestamp every export cycle (as the metrics are CUMULATIVE) which seems like the actual cause of you seeing the issue repeatedly. So I don't think the try/catch you added would help

places I've looked in the opentelemtry-js library where metrics are recorded appear to be properly using the millisToHrTime function. I'm not familiar with this codebase, so I'm sure I've missed code paths though.

One other thing I noticed. The hrTime() function calls addHrTimes() which is doing some arithmetic. I wonder if that's where the floating point error is introduced.

That makes sense that the problem is the same start time being used on each call. Looking at addHrTimes, it is only doing addition and subtraction. Is it possible to end up with non-integer values when doing addition or subtraction starting with integers in Javascript? I didn't think so, but could be wrong.

One other weird thing, I noticed the timestamps in your example are beyond millisecond precision [1695096846, 89824450] while OpenTelemetry metric SDK seems to always give millisecond precision timestamps like [ 1695671283, 798000000 ] when I tested it out. I tracked this change down to open-telemetry/opentelemetry-js#3514 which was released in SDK version 1.9.1. (the changelog is wrong if you look at the code at the release tags)

Are you sure about the dependency versions you mentioned above?

"@opentelemetry/api": "1.4.1",
"@opentelemetry/core": "1.15.2",
"@opentelemetry/resources": "1.15.2",
"@opentelemetry/sdk-metrics": "1.15.2",

I was able to repro the problem with @opentelemetry/core@1.8.0, here's a branch https://github.com/GoogleCloudPlatform/opentelemetry-operations-js/compare/main...aabmass:opentelemetry-operations-js:hrtime-repro?expand=1. Cd that directory, npm install, and run the repro.sh script and possibly wait ~30s for it to encounter the issue.

[ 1695674335, 751246217 ] 249.5499540567398
[ 1695674335, 751262637 ] 249.56618404388428
[ 1695674335, 751285377 ] 249.5889140367508
[ 1695674335, 751309937 ] 249.6135540008545
[ 1695674335, 751332797 ] 249.63635408878326
[ 1695674335, 751354687 ] 249.65818405151367
[ 1695674335, 751377217 ] 249.68073403835297
[ 1695674335, 751397727 ] 249.70124399662018
[ 1695674335, 751419097 ] 249.72259402275085
[ 1695674335, 751438417 ] 249.74190402030945
[ 1695674335, 751460227 ] 249.76378405094147
[ 1695674335, 751483757 ] 249.78736400604248
[ 1695674335, 751506607 ] 249.8101440668106
[ 1695674335, 751529427 ] 249.83304405212402
[ 1695674335, 751551937 ] 249.8555040359497
[ 1695674335, 751573487 ] 249.87701404094696
[ 1695674335, 751595077 ] 249.89872407913208
[ 1695674335, 751617277 ] 249.92093408107758
[ 1695674335, 751638907 ] 249.9424340724945
[ 1695674335, 751661077 ] 249.9645940065384
[ 1695674335, 751700917.0000001 ] 250.00472402572632
/usr/local/google/home/aaronabbott/repo/opentelemetry-operations-js/hrtime-repro/index.js:5
    throw new RangeError(
    ^

RangeError: Got non-int hrTime: 1695674335,751700917.0000001, timeorigin=1695674335501.697, now=250.0495640039444
    at check (/usr/local/google/home/aaronabbott/repo/opentelemetry-operations-js/hrtime-repro/index.js:5:11)
    at Object.<anonymous> (/usr/local/google/home/aaronabbott/repo/opentelemetry-operations-js/hrtime-repro/index.js:16:3)
    at Module._compile (node:internal/modules/cjs/loader:1218:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1272:10)
    at Module.load (node:internal/modules/cjs/loader:1081:32)
    at Module._load (node:internal/modules/cjs/loader:922:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:81:12)
    at node:internal/main/run_main_module:23:47

Node.js v18.13.0

The setup restarts the node script after 10,000 attempts to get a new performance.timeOrigin which seems necessary to repro the bug. The other really bizarre thing is that it always happens right after performance.now() passes 250, but not every time. If you add some logging to the actual hrTime() function you may be able to figure out the problem. But it would be better to upgrade your dependencies if possible because it seems you're on an older version that obsoleted the issue

Hi, I think you are exactly correct. We are on a recent version of opentelemetry-operations-js, but an older version of opentelemetry-js. I think I accidentally grabbed the versions out of the the opentelemetry-operations-js package.json instead of ours when filling out the bug report because I had too many files open and that confused the matter further. I apologize for wasting your time on this, but really appreciate the help in getting to the bottom of the issue.

If it is useful to you, I can remove the extra Math.round from my PR but leave the try/catch as a guard against any future errors that show up, or we can just close the PR if you'd rather let exceptions go up the stack and be dealt with by the caller. Regardless, this issue can be closed as it appears upgrading to a recent version of opentelemetry-js looks to be the answer.

No worries, glad we figured it out.

Wraps a try/catch around the metric processing to make sure any other errors don't cause the invalid metric to be left in place, blocking all future metric processing.

I think we determined that invalid metrics won't block future metrics as there is no retry happening. In that case, I'd rather the exceptions go up the stack. If that's OK with you, can you close this issue/PR out?