open-telemetry/opentelemetry-ruby

Metrics feedback


We want your feedback on the opentelemetry-metrics-sdk and opentelemetry-metrics-api gems! Understanding how these libraries are used by the community will help us prioritize future work on Metrics.


Hi there 👋

I've started a prototype implementation of metrics in a Rails app, and I was able to get some metrics exported from the SDK example.

However, from the README it is not clear where and when I should pull the metrics and shut down the provider.
These two lines:

OpenTelemetry.meter_provider.metric_readers.each(&:pull)
OpenTelemetry.meter_provider.shutdown

What we'd like to do is increment a counter metric each time a specific controller action is called. I can do this with a standard before_action filter, but from there I don't understand when to pull the metrics from the reader. Any advice?

Hi @chloe-meister! Thanks for reaching out!

The README provides an example for a script that would run once, pull the metric, and exit the program.

If you're open to it, I think a periodic metric reader is a better fit for a Rails application. It'll collect the metrics and pass them to the exporter on a set interval. The process will live as long as the Rails application does, or until OpenTelemetry.meter_provider.shutdown is manually called.

Assuming you're using the OTLP exporter, rather than calling:

otlp_metric_exporter = OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new
OpenTelemetry.meter_provider.add_metric_reader(otlp_metric_exporter)

You would call something like:

# config/initializers/opentelemetry.rb
reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(
  exporter: OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new,
  export_interval_millis: 3000,
  export_timeout_millis: 10000
)

OpenTelemetry.meter_provider.add_metric_reader(reader)

You can create the meter and the counter the same way as in the example.
If you prefer the console exporter, just swap OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter with OpenTelemetry::SDK::Metrics::Export::ConsoleMetricPullExporter.
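To tie this back to your before_action use case, a minimal sketch might look like the following (the meter name, counter name, and controller are placeholders):

# config/initializers/opentelemetry.rb (after adding the reader)
requests_meter = OpenTelemetry.meter_provider.meter('MY_APP_REQUESTS')
REQUEST_COUNTER = requests_meter.create_counter(
  'controller_action_calls',
  unit: 'call',
  description: 'number of times a specific controller action was called'
)

# app/controllers/my_controller.rb
class MyController < ApplicationController
  before_action :count_action_call, only: [:show]

  private

  def count_action_call
    # No need to pull from the reader here; the periodic reader exports
    # the accumulated counts on its own interval.
    REQUEST_COUNTER.add(1, attributes: { 'action' => "#{controller_name}##{action_name}" })
  end
end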

With the reader arguments above (export_interval_millis: 3000, export_timeout_millis: 10000), your metrics would be exported every three seconds with a 10-second timeout.

Keep in mind that if you're using the OTLP exporter, you also need to install the opentelemetry-exporter-otlp-metrics gem.
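For reference, a Gemfile for this setup might include something like the following (versions omitted; adjust to your needs):

# Gemfile
gem 'opentelemetry-sdk'
gem 'opentelemetry-metrics-sdk'
gem 'opentelemetry-exporter-otlp-metrics'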

Hi @kaylareopelle! Thank you very much for the detailed answer. This helps a lot 🙏

Hi @kaylareopelle! Thanks to your guidance I managed to get something exported. We use Grafana, and I can see something matching my meter provider's attributes. However, there doesn't seem to be any metric data attached: no matter how the @data_points in the @metric_store object are updated, none of that information reaches Grafana.

I wonder if you could potentially have an idea of what could be wrong here?

Below you can find my setup:

# config/initializers/open_telemetry.rb

OpenTelemetry::SDK.configure do |config|
  config.service_name = ENV.fetch('SERVICE_NAME', 'unknown')

  group = { 'service.group' => ENV.fetch('SERVICE_GROUP', 'unknown') }
  config.resource = OpenTelemetry::SDK::Resources::Resource.create(group)
end

otlp_metric_exporter = OpenTelemetry::Exporter::OTLP::Metrics::MetricsExporter.new
reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(
  exporter: otlp_metric_exporter,
  export_interval_millis: 15_000,
  export_timeout_millis: 30_000
)
OpenTelemetry.meter_provider.add_metric_reader(reader)

queue_latency_meter = OpenTelemetry.meter_provider.meter('QUEUE_LATENCY')
QueueLatency = queue_latency_meter.create_histogram('queue_latency', unit: 'job_latency', description: 'cumulative latency for all queues')

I use this in a background job, collecting this information:

QueueLatency.record(queue.latency, attributes: { queue_name: queue.name })

Hi @chloe-meister! Thanks for sharing these details! Sorry things aren't working as expected.

Everything you've shared looks good to me. I copied your setup and created a simple Rails app using Sidekiq for the background jobs and was able to record data points to the QueueLatency histogram and export them via OTLP.
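For reference, the job in my reproduction app looked roughly like this (the class name and scheduling are simplified):

# app/sidekiq/queue_latency_job.rb (simplified reproduction job)
class QueueLatencyJob
  include Sidekiq::Job

  def perform
    Sidekiq::Queue.all.each do |queue|
      # one data point per queue; the periodic reader exports them on its interval
      QueueLatency.record(queue.latency, attributes: { queue_name: queue.name })
    end
  end
end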

I was surprised that two resources were created: one for the Rails process and one for the Sidekiq process. The QueueLatency metric on the Rails process never had any data points. Instead, they were all recorded in the Sidekiq process.

I haven't worked with Grafana before, but I can do some testing with that observability backend next week. Perhaps there's something that environment needs that we're missing.

Earlier this week, we released a fix for a bug in the metrics SDK: for users who set the OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE environment variable, a valid aggregation_temporality was never set, which prevented some observability backends from ingesting the metrics. Are you setting that environment variable? If you're not on the latest version already, updating may help.

It might also be helpful to see the raw data for the metric you're sending, either by creating a periodic metric reader with a console exporter, or by setting up a collector with the debug exporter set to detailed verbosity.

Here's how to set up the console exporter:

# config/initializers/opentelemetry.rb

# ... 

console_metric_exporter = OpenTelemetry::SDK::Metrics::Export::ConsoleMetricPullExporter.new
console_reader = OpenTelemetry::SDK::Metrics::Export::PeriodicMetricReader.new(
  exporter: console_metric_exporter,
  export_interval_millis: 15_000,
  export_timeout_millis: 30_000
)

OpenTelemetry.meter_provider.add_metric_reader(console_reader)
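And here's a minimal sketch of a collector config with the debug exporter set to detailed verbosity (the OTLP/HTTP receiver and port are assumptions; adjust them to your environment):

# otel-collector-config.yaml (sketch)
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug]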

Either of these options should provide a plain-text output of the data being sent over the wire. Would you be able to update your application to generate this output and paste it (with sensitive information removed) on this issue? Seeing the structure of your data might help us pinpoint any malformed or missing elements that could be preventing the data points from being ingested.

Hello @kaylareopelle! I am a co-worker of @chloe-meister.
First of all, thanks so much for the help and the guidance provided so far 😄

We did not set the OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE env var, so I think that should not be a problem.

We followed your advice and used the ConsoleMetricPullExporter, which helped us see the raw data points for the metric.
Here are some screenshots I took from it:
[screenshot: otl_data_debug]
[screenshot: otl_data_debug2]

Right after that, I also get a line saying Flushed 2 metrics, since I'm also sending a second set of histogram metrics.

Just a bit of extra information: there is some data that does reach Grafana and that we can see there, namely the attributes shown here:

[screenshot from 2024-10-28 12-16-41]

Thank you for all the help provided!

Hi @eduAntequera, thank you for these screenshots! Everything looks good in this context, so for now it does not seem like the problem is in the metrics SDK. This is puzzling!

For the next phase of debugging, I think we need to focus on the OTLP exporter specifically. To do that, I recommend we move to the collector setup with a debug exporter set to detailed verbosity.

Could you give that a try and send me the output for a payload that isn't showing up in Grafana as expected?

Here's some example output from my reproduction app:

Resource SchemaURL: https://opentelemetry.io/schemas/1.6.1
Resource attributes:
     -> service.name: Str(unknown)
     -> process.pid: Int(2732)
     -> process.command: Str(<redacted>/bin/sidekiq)
     -> process.runtime.name: Str(ruby)
     -> process.runtime.version: Str(3.3.1)
     -> process.runtime.description: Str(ruby 3.3.1 (2024-04-23 revision c56cd86388) [x86_64-darwin23])
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.language: Str(ruby)
     -> telemetry.sdk.version: Str(1.5.0)
     -> service.group: Str(unknown)
     -> host.name: Str(ce5b8509a987)
     -> os.type: Str(linux)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope QUEUE_LATENCY 
Metric #0
Descriptor:
     -> Name: queue_latency
     -> Description: cumulative latency for all queues
     -> Unit: job_latency
     -> DataType: Histogram
     -> AggregationTemporality: Delta
HistogramDataPoints #0
Data point attributes:
     -> queue_name: Str(queue.name)
StartTimestamp: 2024-10-29 22:23:54.933843 +0000 UTC
Timestamp: 2024-10-29 22:24:54.942435 +0000 UTC
Count: 9
Sum: 37.000000
Min: 0.000000
Max: 9.000000
ExplicitBounds #0: 0.000000
ExplicitBounds #1: 5.000000
ExplicitBounds #2: 10.000000
ExplicitBounds #3: 25.000000
ExplicitBounds #4: 50.000000
ExplicitBounds #5: 75.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 1000.000000
Buckets #0, Count: 1
Buckets #1, Count: 5
Buckets #2, Count: 3
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 0
Buckets #9, Count: 0
Buckets #10, Count: 0

Hi @kaylareopelle, sorry for the delay and thanks for the guidance.
Here are some screenshots as requested
[screenshot from 2024-10-31 17-15-53]
[screenshot from 2024-10-31 17-16-37]
[screenshot from 2024-10-31 17-17-08]

I hope they help

Hi @eduAntequera! Thanks for the screenshots! I'm about to take some time off through next Monday, so perhaps someone else can jump in here to help in the meantime.

These screenshots helped me realize that both exporters are recording the metric only once, and it seems to have a value of 0.0. Is that accurate? For these samples, how many times has QueueLatency.record(queue.latency, attributes: { queue_name: queue.name }) been called? What are some values you expect queue.latency to have recorded?

Hi @kaylareopelle, thanks for the reply!
I checked, and 0.0 could indeed be a value, so I changed the code to generate random values instead and will update here with screenshots showing values other than 0.0. I also decided to change the metric names, since I was seeing a metric in Grafana for queue_size but none for queue_latency, which made me think the names were colliding. So I changed them to sidekiq_queue_size and sidekiq_queue_latency.
Even with these names and the randomly generated values, the latency metric still did not appear in Grafana (the queue_size metric still appeared).
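Roughly, the debugging change looks like this, simplified (the histogram definition lives in the initializer; the record call happens inside the background job, where queue is a Sidekiq queue):

# config/initializers/open_telemetry.rb (simplified)
SidekiqQueueLatency = queue_latency_meter.create_histogram(
  'sidekiq_queue_latency',
  unit: 'job_latency',
  description: 'cumulative latency for all queues'
)

# in the background job: record a random value instead of queue.latency while debugging
SidekiqQueueLatency.record(rand(0.0..100.0), attributes: { queue_name: queue.name })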

Here are some of the results when sending randomly generated values:
[screenshot from 2024-11-06 10-50-07]
[screenshot from 2024-11-06 10-50-21]
[screenshot from 2024-11-06 10-50-45]

Thanks for all the help, and hope you enjoy your time off 😄

Hi @eduAntequera! Thank you for this test and for your patience. I took another look at the output and unfortunately I'm not seeing any issues with it.

This might be an issue related to Grafana's ingest process rather than the OpenTelemetry Ruby metrics gems. I reached out on the CNCF Slack to someone who works for Grafana and they're going to take a look at the issue. If you're able to open a support ticket with a representative from the company, that may also be a good next step.

If they have anything they'd like us to test or change with the metrics code to help with troubleshooting, let us know!

@kaylareopelle Thank you so much for looking into it and contacting them 🙇
We will let you know if we get any news :)
Sorry for the late reply; the notification about your reply got lost in my inbox 😓