GoogleCloudPlatform/opentelemetry-operations-js

OpenTelemetry reporting to GCM does not work in Cloud Functions - errors due to no node_id attribute in TimeSeries

Closed this issue · 11 comments

When using OpenTelemetry in Cloud Functions with the GCM exporter, no node ID is included in the resource attributes.

This leads to errors when two or more instances of the same Cloud Function export metrics to GCM, because GCM cannot tell apart the incoming CreateTimeSeries requests from the different instances of the function.

(This same problem will also affect Cloud Run for the same reason)

Errors occur when these two (or more) function instances send CreateTimeSeries requests within 5 seconds of each other, leading to the error:

Send TimeSeries failed: One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.

These two (or more) instances may also send CreateTimeSeries requests with different or overlapping time series for the same metric, leading to the errors:

One or more TimeSeries could not be written: Field timeSeries[1].points[0].interval.start_time had an invalid value of "xxxx": The start time must be before the end time (xxxx) for the non-gauge metric

Send TimeSeries failed: One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older start time than the most recent point

Some debugging later...

Cloud Functions do have an instance ID that can be used to distinguish between multiple instances of the same function. This is detected and exported by the GcpDetector as the faas.id resource attribute (shouldn't it be FAAS_INSTANCE?).

However, when the TimeSeries is created and the resource attributes are mapped onto it, this attribute is not used: only the region and function name (not the instance) are included.

This means that all CreateTimeSeries requests from multiple instances of the same Cloud Function will be seen as coming from the 'same' instance, and will conflict, producing the errors above.
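Concretely (an illustrative sketch, using the generic_node resource type and blank node_id shown in the exported data later in this report): every instance maps to an identical monitored resource, and GCM identifies a time series by the combination of metric type, metric labels, and monitored resource, so points from all instances collide in one series.

```javascript
// Illustrative only: the monitored resource that two separate Cloud
// Function instances end up attaching to their CreateTimeSeries requests.
// With node_id blank, the two instances are indistinguishable to GCM.
const resourceFromInstanceA = {
  type: 'generic_node',
  labels: {location: 'us-central1', namespace: 'my_org', node_id: ''},
};
const resourceFromInstanceB = {
  type: 'generic_node',
  labels: {location: 'us-central1', namespace: 'my_org', node_id: ''},
};
// GCM keys time series on (metric type, metric labels, resource), so the
// cumulative points from both instances land in the same series and conflict.
```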

Also affects Cloud Run and K8s workloads when exporting directly.

As mentioned, this problem will also affect Cloud Run for the same reason: no identifier for the individual instance of a Cloud Run workload is exported.

And the same problem also occurred in K8s workloads, which was more complicated to resolve. While the pod name is exported if present, it is not detected automatically, so I had to pass the pod name into my code as an environment variable and set the resource manually.

(In a K8s workload using the OpenTelemetry Collector, the collector can determine and add the pod name itself, so this only affects workloads that export directly to GCM.)

Suffice it to say that understanding these issues and creating workarounds took a considerable amount of investigation and debugging, as none of this is well documented.

Workaround for Cloud Functions

I have solved this with a workaround that overrides the resource attributes so that, when the TimeSeries is created, different CF instances are seen as distinct by GCM:

  • setting the platform to GENERIC_TASK
  • setting SERVICE_INSTANCE_ID to the FAAS_ID

With this workaround, metrics are reported and aggregated correctly, and no errors occur when sending TimeSeries.

    if (process.env.FUNCTION_TARGET) {
      RESOURCE_ATTRIBUTES[Semconv.CLOUD_PLATFORM] = 'generic_task';
      if (gcpResources.attributes[Semconv.FAAS_ID]?.toString()) {
        RESOURCE_ATTRIBUTES[Semconv.SERVICE_INSTANCE_ID] =
            gcpResources.attributes[Semconv.FAAS_ID].toString();
      } else {
        logger.warn('WARNING: running under Cloud Functions, but FAAS_ID ' +
        'resource attribute is not set. ' +
        'This may lead to Send TimeSeries errors');
      }
    }

What version of OpenTelemetry are you using?

Latest:

        "@google-cloud/opentelemetry-cloud-monitoring-exporter": "^0.17.0",
        "@opentelemetry/api": "^1.8.0",
        "@opentelemetry/sdk-metrics": "^1.22.0",
        "@opentelemetry/sdk-node": "^0.49.1",

What version of Node are you using?

   v18.19.1 lts/hydrogen 
   v20.11.1 lts/iron 

What did you do?

A simple Cloud Function that exports metrics in response to requests, using a PeriodicExportingMetricReader with the Google Cloud Monitoring exporter, running multiple instances of the same function in parallel.

What did you expect to see?

Using a simple setup of OpenTelemetry with GCM, following the examples, metrics should be exported reliably without any errors.

const {MeterProvider, PeriodicExportingMetricReader} =
  require('@opentelemetry/sdk-metrics');
const {Resource} = require('@opentelemetry/resources');
const otelGcpExporter =
  require('@google-cloud/opentelemetry-cloud-monitoring-exporter');
const {GcpDetectorSync} = require('@google-cloud/opentelemetry-resource-util');
const {SemanticResourceAttributes: Semconv} =
  require('@opentelemetry/semantic-conventions');
  
// Initialization code
    const RESOURCE_ATTRIBUTES = {
      [Semconv.SERVICE_NAMESPACE]: 'my_org',
      [Semconv.SERVICE_NAME]: 'my_service',
      [Semconv.SERVICE_VERSION]: '1.0',
    };

    const COUNTERS_PREFIX = 'my_org/my_service/';
      
    const gcpResources = new GcpDetectorSync().detect();
    await gcpResources.waitForAsyncAttributes();
    logger.debug('got GCP resources %o', gcpResources);

    const meterProvider = new MeterProvider({
      resource: new Resource(RESOURCE_ATTRIBUTES)
          .merge(gcpResources),
    });

    meterProvider.addMetricReader(new PeriodicExportingMetricReader({
      exportIntervalMillis: 30_000,
      exportTimeoutMillis: 30_000,
      exporter: new otelGcpExporter.MetricExporter(),
    }));

    const meter = meterProvider.getMeter(COUNTERS_PREFIX);
    const counter = meter.createCounter(COUNTERS_PREFIX + 'counter',
        {description: 'my test counter'});
 
 // Runtime code: 
     counter.add(1, {counter_attribute: 'value'});
 

What did you see instead?

Multiple errors, metrics failing to be exported, and therefore values missed.

Send TimeSeries failed: One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.

One or more TimeSeries could not be written: Field timeSeries[1].points[0].interval.start_time had an invalid value of "xxxx": The start time must be before the end time (xxxx) for the non-gauge metric

Send TimeSeries failed: One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older start time than the most recent point

One note: I inspected the resource mapping code in Java, Go, and Python, and they all set FAAS_INSTANCE to the CF instance ID and add it to the TimeSeries by using a generic_task resource, setting the TASK_ID to the FAAS_INSTANCE.

I would also note that the last public release is 0.17 from Jun last year, and there have been many changes since then...

Thanks for the detailed report. I'm going to look into it in the coming days. It's probably related to #655.

I would also note that the last public release is 0.17 from Jun last year, and there have been many changes since then...

Will definitely make a release once this is fixed.

The bug report above is very thorough so I think the action items are clear:

  • Update the detector to use FAAS_INSTANCE instead of faas.id
  • Update the metric exporter to use the FAAS_INSTANCE
  • Update docs explaining how to set k8s resource attributes (see note below)

And the same problem also occurred in K8s workloads, which was more complicated to resolve. While the pod name is exported if present, it is not detected automatically, so I had to pass the pod name into my code as an environment variable and set the resource manually.

Unfortunately this is expected. As you mentioned, it's not well documented here, but K8s workloads should use the Downward API to pass in the pod name and namespace, and manually set the container name. This is documented for Go here
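A minimal Node.js sketch of that approach (the environment variable names POD_NAME and POD_NAMESPACE are assumptions and must match the Downward API wiring in your pod spec; the container name is a placeholder, since it cannot be injected and has to be set manually):

```javascript
// Resource attributes for a k8s workload, built from env vars injected
// via the Kubernetes Downward API (fieldRef: metadata.name and
// metadata.namespace in the pod spec). The keys follow the OpenTelemetry
// semantic conventions; pass this object to `new Resource(...)` and merge
// it with the GCP-detected resource.
const k8sResourceAttributes = {
  'k8s.pod.name': process.env.POD_NAME,
  'k8s.namespace.name': process.env.POD_NAMESPACE,
  'k8s.container.name': 'my-container', // placeholder: set to your container
};
```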

Cloud Functions do have an instance ID that can be used to distinguish between multiple instances of the same function. This is detected and exported by the GcpDetector as the faas.id resource attribute (shouldn't it be FAAS_INSTANCE?).

However, when the TimeSeries is created and the resource attributes are mapped onto it, this attribute is not used: only the region and function name (not the instance) are included.

Huh weird, this should have been working based on the code linked at the time this issue was created. We fixed most of this in #600 and #643 but still need to do a release like you mentioned. I'll verify it's working and make a release.

At main/HEAD, it is still not correct. The mapping for Cloud Functions set up in opentelemetry-resource-util/src/index.ts#L130 only copies REGION and FUNCTION_NAME into the TimeSeries, not FAAS_INSTANCE:

  [CLOUD_FUNCTION]: {
    [REGION]: {otelKeys: [SemanticResourceAttributes.CLOUD_REGION]},
    [FUNCTION_NAME]: {otelKeys: [SemanticResourceAttributes.FAAS_NAME]},
  },

I have to manually set the SERVICE_INSTANCE_ID resource attribute for it to be handled correctly by cloud monitoring.

See: https://github.com/cloudspannerecosystem/autoscaler/blob/main/src/autoscaler-common/counters_base.js#L267 :

      if (gcpResources.attributes[Semconv.SEMRESATTRS_FAAS_ID]?.toString()) {
        RESOURCE_ATTRIBUTES[Semconv.SEMRESATTRS_SERVICE_INSTANCE_ID] =
          gcpResources.attributes[Semconv.SEMRESATTRS_FAAS_ID].toString();
      }...

Thanks for the reply, and sorry again for not looking at this sooner.

At main/HEAD, it is still not correct. The mapping for Cloud Functions set up in opentelemetry-resource-util/src/index.ts#L130 only copies REGION and FUNCTION_NAME into the TimeSeries, not FAAS_INSTANCE

The Cloud Functions and Cloud Run monitored resources are not writeable for custom metrics, so that mapping shouldn't be relevant. Instead the exporter should be writing to generic_task, using this mapping for task_id, which takes either service.instance.id or faas.instance:

  [TASK_ID]: {
    otelKeys: [
      SemanticResourceAttributes.SERVICE_INSTANCE_ID,
      SemanticResourceAttributes.FAAS_INSTANCE,
    ],
  },
},
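That is, task_id is resolved from the first of those otelKeys present on the resource. Roughly (a sketch of the fallback behaviour, not the actual exporter code):

```javascript
// Sketch: how the generic_task task_id label is chosen, preferring
// service.instance.id and falling back to faas.instance.
function resolveTaskId(resourceAttributes) {
  return (
    resourceAttributes['service.instance.id'] ??
    resourceAttributes['faas.instance'] ??
    ''
  );
}
```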

Just tested it out at main/HEAD and things look correct:

[screenshot]

You can add a non-blank namespace by setting service.namespace

Something must have changed between 0.17 and main then, because the data sent by a simple GCF metrics test code in 0.17:

  • uses GENERIC_NODE
  • only includes "location" and "namespace" from resource attributes

(Sample code: https://github.com/nielm/counters-test. Note that there is workaround code in index.js, enabled by an environment variable.)

[screenshot]

// RESOURCE_ATTRIBUTES
{
  'service.namespace': 'nielm',
  'service.name': 'counters-test',
  'service.version': '1.0.0',
  'cloud.provider': 'gcp',
  'cloud.account.id': 'xxxxxxxxxxx',
  'cloud.platform': 'gcp_cloud_functions',
  'faas.name': 'counters-test',
  'faas.version': '2',
  'faas.id': '00f46b92856fc7f381fe1f3b1fc26dd38d4da32a74f71a070477a4cefce3bb7ebbfc6db25d45ee01d4020dfd6c32bcd1b26691e8903a33dc5aa1387f12bc2cec6ca5f5',
  'cloud.region': 'us-central1'
}

// EXPORTED METRIC, AS SENT TO GCM
{
  metric: {
    type: 'custom.googleapis.com/nielm-test/background-counter',
    labels: {}
  },
  resource: {
    type: 'generic_node',
    labels: { location: 'us-central1', namespace: 'nielm', node_id: '' }
  },
  metricKind: 'CUMULATIVE',
  valueType: 'DOUBLE',
  points: [
    {
      value: { doubleValue: 273 },
      interval: {
        startTime: '2024-05-03T13:07:29.115000000Z',
        endTime: '2024-05-03T13:17:45.819000000Z'
      }
    }
  ]
}

Yes there are tons of changes since the last release, apologies for that. Let me just quickly test your sample code at main/HEAD.

I bundled in packed tarballs of the packages in this repo, commented out the [Semconv.SEMRESATTRS_SERVICE_NAME]: 'counters-test' label, and deployed your sample on both gen1 and gen2. It is setting the instance ID correctly now, and I can see separate time series from each instance.

[screenshots]

And the CreateTimeSeries() requests are all successful:
[screenshot]