OpenTelemetry reporting to GCM does not work in Cloud Functions - errors due to no NODE_ID attribute in TimeSeries
When using OpenTelemetry in Cloud Functions with the GCM exporter, no node or instance ID is included in the resource attributes.
This leads to errors when two or more instances of the same Cloud Function export metrics to GCM, because GCM cannot tell apart the incoming CreateTimeSeries requests from the different instances.
(This same problem will also affect Cloud Run for the same reason)
Errors occur when these two (or more) function instances send CreateTimeSeries requests within 5 seconds of each other, leading to:
Send TimeSeries failed: One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.
These instances may also send CreateTimeSeries requests with different or overlapping time series for the same metric, leading to:
One or more TimeSeries could not be written: Field timeSeries[1].points[0].interval.start_time had an invalid value of "xxxx": The start time must be before the end time (xxxx) for the non-gauge metric
Send TimeSeries failed: One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older start time than the most recent point
Some debugging later...
Cloud Functions do have an instance ID that can be used to distinguish between multiple instances of the same function. This is detected and exported by the GcpDetector as the faas.id resource attribute (shouldn't it be FAAS_INSTANCE?).
However, when the TimeSeries is created and the resource attributes are mapped onto it, this attribute is not used: only the region and function name (not the instance) are used.
This means that all CreateTimeSeries requests from multiple instances of the same Cloud Function will be seen as coming from the 'same' instance, and will conflict, as seen in the errors above.
Also affects Cloud Run and K8s workloads when exporting directly.
As mentioned, this problem will affect Cloud Run for the same reason: no identifier for the individual instance of a Cloud Run workload is exported.
The same problem also occurred in K8s workloads, and it was more complicated to resolve. While the pod name is exported (when present), it is not detected automatically, so I had to pass the pod name in to my workload as an environment variable and set the resource manually.
(In a K8s workload using the OpenTelemetry Collector, the collector can determine and add the pod name itself, so this only affects workloads that export directly to GCM.)
Suffice it to say, understanding these issues and creating workarounds took a considerable amount of investigation and debugging, as none of this is at all well documented.
Workaround for Cloud Functions
I have solved this with a workaround that overrides the resource attributes so that, when the TimeSeries is created, different Cloud Function instances are seen as distinct by GCM:
- setting the platform to GENERIC_TASK
- setting SERVICE_INSTANCE_ID to the FAAS_ID
With this workaround, metrics are reported and aggregated correctly, and no errors occur when sending TimeSeries.
// Workaround: when running under Cloud Functions (FUNCTION_TARGET is set),
// override the platform to generic_task and copy the detected faas.id into
// service.instance.id, so each instance maps to a distinct TimeSeries resource.
if (process.env.FUNCTION_TARGET) {
  RESOURCE_ATTRIBUTES[Semconv.CLOUD_PLATFORM] = 'generic_task';
  if (gcpResources.attributes[Semconv.FAAS_ID]?.toString()) {
    RESOURCE_ATTRIBUTES[Semconv.SERVICE_INSTANCE_ID] =
        gcpResources.attributes[Semconv.FAAS_ID].toString();
  } else {
    logger.warn('WARNING: running under Cloud Functions, but FAAS_ID ' +
        'resource attribute is not set. ' +
        'This may lead to Send TimeSeries errors');
  }
}
What version of OpenTelemetry are you using?
Latest:
"@google-cloud/opentelemetry-cloud-monitoring-exporter": "^0.17.0",
"@opentelemetry/api": "^1.8.0",
"@opentelemetry/sdk-metrics": "^1.22.0",
"@opentelemetry/sdk-node": "^0.49.1",
What version of Node are you using?
v18.19.1 lts/hydrogen
v20.11.1 lts/iron
What did you do?
A simple Cloud Function that exports metrics in response to requests, using a PeriodicExportingMetricReader with the Google Cloud Monitoring exporter, running multiple instances of the same function in parallel.
What did you expect to see?
Using a simple setup of OpenTelemetry with GCM, following the examples, metrics should be exported reliably without any errors.
const {MeterProvider, PeriodicExportingMetricReader} =
    require('@opentelemetry/sdk-metrics');
const {Resource} = require('@opentelemetry/resources');
const otelGcpExporter =
    require('@google-cloud/opentelemetry-cloud-monitoring-exporter');
const {GcpDetectorSync} = require('@google-cloud/opentelemetry-resource-util');
const {SemanticResourceAttributes: Semconv} =
    require('@opentelemetry/semantic-conventions');

// Initialization code
const RESOURCE_ATTRIBUTES = {
  [Semconv.SERVICE_NAMESPACE]: 'my_org',
  [Semconv.SERVICE_NAME]: 'my_service',
  [Semconv.SERVICE_VERSION]: '1.0',
};
const COUNTERS_PREFIX = 'my_org/my_service/';

// Detect GCP resource attributes (project, region, faas.id, ...) and wait
// for the async detection to complete.
const gcpResources = new GcpDetectorSync().detect();
await gcpResources.waitForAsyncAttributes();
logger.debug('got GCP resources %o', gcpResources);

const meterProvider = new MeterProvider({
  resource: new Resource(RESOURCE_ATTRIBUTES).merge(gcpResources),
});
meterProvider.addMetricReader(new PeriodicExportingMetricReader({
  exportIntervalMillis: 30_000,
  exportTimeoutMillis: 30_000,
  exporter: new otelGcpExporter.MetricExporter(),
}));
const meter = meterProvider.getMeter(COUNTERS_PREFIX);
const counter = meter.createCounter(COUNTERS_PREFIX + 'counter',
    {description: 'my test counter'});
// Runtime code:
counter.add(1, {counter_attribute: 'value'});
What did you see instead?
Multiple errors, metrics failing to be exported, and therefore values missed.
Send TimeSeries failed: One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.
One or more TimeSeries could not be written: Field timeSeries[1].points[0].interval.start_time had an invalid value of "xxxx": The start time must be before the end time (xxxx) for the non-gauge metric
Send TimeSeries failed: One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older start time than the most recent point
I would also note that the last public release is 0.17 from June last year, and there have been many changes since then...
The bug report above is very thorough so I think the action items are clear:
- Update the detector to use FAAS_INSTANCE instead of faas.id
- Update the metric exporter to use the FAAS_INSTANCE
- Update docs explaining how to set k8s resource attributes (see note below)
The same problem also occurred in K8s workloads, and it was more complicated to resolve. While the pod name is exported (when present), it is not detected automatically, so I had to pass the pod name in to my workload as an environment variable and set the resource manually.
Unfortunately this is expected. As you mentioned, it's not well documented here, but K8s workloads should use the Downward API to pass in the pod name and namespace, and manually set the container name. This is documented for Go here
Cloud Functions do have an instance ID that can be used to distinguish between multiple instances of the same function. This is detected and exported by the GcpDetector as the faas.id resource attribute (shouldn't it be FAAS_INSTANCE?).
However, when the TimeSeries is created and the resource attributes are mapped onto it, this attribute is not used: only the region and function name (not the instance) are used.
Huh weird, this should have been working based on the code linked at the time this issue was created. We fixed most of this in #600 and #643 but still need to do a release like you mentioned. I'll verify it's working and make a release.
Cloud Functions do have an instance ID that can be used to distinguish between multiple instances of the same function. This is detected and exported by the GcpDetector as the faas.id resource attribute (shouldn't it be FAAS_INSTANCE?).
However, when the TimeSeries is created and the resource attributes are mapped onto it, this attribute is not used: only the region and function name (not the instance) are used.
Huh weird, this should have been working based on the code linked at the time this issue was created. We fixed most of this in #600 and #643 but still need to do a release like you mentioned. I'll verify it's working and make a release.
At main/HEAD, it is still not correct. The mapping for cloud functions set up in opentelemetry-resource-util/src/index.ts#L130 only copies REGION and FUNCTION_NAME into the TimeSeries, not FAAS_INSTANCE:
[CLOUD_FUNCTION]: {
  [REGION]: {otelKeys: [SemanticResourceAttributes.CLOUD_REGION]},
  [FUNCTION_NAME]: {otelKeys: [SemanticResourceAttributes.FAAS_NAME]},
},
I have to manually set the SERVICE_INSTANCE_ID resource attribute for it to be handled correctly by Cloud Monitoring:
if (gcpResources.attributes[Semconv.SEMRESATTRS_FAAS_ID]?.toString()) {
  RESOURCE_ATTRIBUTES[Semconv.SEMRESATTRS_SERVICE_INSTANCE_ID] =
      gcpResources.attributes[Semconv.SEMRESATTRS_FAAS_ID].toString();
}
...
Thanks for the reply, and sorry again for not looking at this sooner.
At main/HEAD, it is still not correct. The mapping for cloud functions set up in opentelemetry-resource-util/src/index.ts#L130 only copies REGION and FUNCTION_NAME into the TimeSeries, not FAAS_INSTANCE
The cloud functions and cloud run resources are not writeable for custom metrics, so that mapping shouldn't be relevant. Instead it should be writing to generic_task, using this mapping for task_id, which uses either service.instance.id or faas.instance:
opentelemetry-operations-js/packages/opentelemetry-resource-util/src/index.ts, lines 160 to 166 at be4ae61
Something must have changed between 0.17 and main then, because the data sent by a simple GCF metrics test code in 0.17:
- uses GENERIC_NODE
- only includes "location" and "namespace" from resource attributes
(Sample code: https://github.com/nielm/counters-test. Note that there is workaround code in index.js, which is enabled by an environment variable.)
// RESOURCE_ATTRIBUTES
{
  'service.namespace': 'nielm',
  'service.name': 'counters-test',
  'service.version': '1.0.0',
  'cloud.provider': 'gcp',
  'cloud.account.id': 'xxxxxxxxxxx',
  'cloud.platform': 'gcp_cloud_functions',
  'faas.name': 'counters-test',
  'faas.version': '2',
  'faas.id': '00f46b92856fc7f381fe1f3b1fc26dd38d4da32a74f71a070477a4cefce3bb7ebbfc6db25d45ee01d4020dfd6c32bcd1b26691e8903a33dc5aa1387f12bc2cec6ca5f5',
  'cloud.region': 'us-central1'
}
// EXPORTED METRIC, AS SENT TO GCM
{
  metric: {
    type: 'custom.googleapis.com/nielm-test/background-counter',
    labels: {}
  },
  resource: {
    type: 'generic_node',
    labels: { location: 'us-central1', namespace: 'nielm', node_id: '' }
  },
  metricKind: 'CUMULATIVE',
  valueType: 'DOUBLE',
  points: [
    {
      value: { doubleValue: 273 },
      interval: {
        startTime: '2024-05-03T13:07:29.115000000Z',
        endTime: '2024-05-03T13:17:45.819000000Z'
      }
    }
  ]
}
Yes there are tons of changes since the last release, apologies for that. Let me just quickly test your sample code at main/HEAD.
I bundled in packed tarballs of the packages in this repo, commented out the [Semconv.SEMRESATTRS_SERVICE_NAME]: 'counters-test'
label, and deployed your sample on both gen1 and gen2. It is setting the instance ID correctly now, and I can see separate time series from each instance.