grafana/docker-otel-lgtm

Document settings required to populate Dashboards.

yeroc opened this issue · 19 comments

I've been trying out this Docker image as someone new to both the Grafana products and OpenTelemetry in general, so I believe I'm your target audience. That said, I'm struggling to get any of the three sample dashboards to populate with metrics from my own application using the OpenTelemetry Java Agent. I can confirm instrumentation is working because I can see some metrics in Explore, and Traces and Logs are populated as well, but all the metrics dashboards remain obstinately blank.

Here are my settings for Java Agent 2.2.0:

# Settings for the opentelemetry java agent

# in ms...
otel.bsp.schedule.delay=5000
otel.metric.export.interval=5000

otel.exporter.otlp.metrics.default.histogram.aggregation=base2_exponential_bucket_histogram

# capture enduser info...
otel.instrumentation.common.enduser.enabled=true
otel.instrumentation.common.enduser.id.enabled=true

otel.instrumentation.common.peer-service-mapping=foo-host:8082=foo-service

otel.semconv-stability.opt-in=http

otel.resource.attributes=service.version=HEAD-SNAPSHOT
otel.service.name=my-service

The above settings are trying to get the JVM and RED (native histograms) dashboards to populate.

General feedback:

  • Having two RED Metrics dashboards is confusing. Strongly recommend a single RED dashboard using the metrics you view to be the best / future-proof path for someone new to OpenTelemetry. After a bunch of Googling I think that might be the "native histogram" one but I'm still not sure!
  • Please document the settings required to populate the dashboards. Yes, I did see the description at the top of the two RED Metrics dashboards, but it would be great to add the property settings (in addition to the environment variables) or simply a link to a more detailed page in this GitHub project. The JVM Metrics dashboard has no instructions at all, so I have no idea how to get it to populate (in the Explore view I can see that some jvm_* metrics are populated, so I dunno?!).
  • Consider enabling GitHub Discussions on this project (this issue would be more appropriate as a discussion post, but I can't figure out where else to post feedback!)
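For readers comparing the property names in the config above with the environment variable form used elsewhere in this thread, here is a minimal sketch of the naming convention the OpenTelemetry Java agent documents: uppercase the property name and replace dots and hyphens with underscores. The function name `property_to_env_var` is hypothetical and only for illustration.

```python
import re


def property_to_env_var(prop: str) -> str:
    """Convert a Java system property name to its environment variable form.

    Dots and hyphens become underscores, then everything is uppercased.
    """
    return re.sub(r"[.-]", "_", prop).upper()


# Property names taken from the agent settings quoted in this issue.
for prop in (
    "otel.exporter.otlp.metrics.default.histogram.aggregation",
    "otel.instrumentation.common.peer-service-mapping",
    "otel.semconv-stability.opt-in",
):
    print(f"{prop} -> {property_to_env_var(prop)}")
```

So the property on the histogram aggregation line of the config corresponds to the OTEL_EXPORTER_OTLP_METRICS_DEFAULT_HISTOGRAM_AGGREGATION variable mentioned in the dashboard instructions.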

Hope this doesn't come across as overly negative. It's pretty awesome to be able to spin up a product suite supporting metrics, traces and logs with a single command!

I think this might have to do with the collection names. Can you please double-check that the generated collection names match? There's a temporary duality around adding or not adding the unit when OTel metrics are converted to Prometheus.

@grcevski Thanks for responding. Is "collection name" a Grafana or Prometheus term? I tried Googling but I'm failing to turn up exactly what you're referring to here. Is that equivalent to the metric name? Or something else?

Sorry, I meant the Prometheus series name. I apologize for the confusion.

@grcevski If I'm understanding correctly, here's the list of metric series names populated:

"http_client_request_duration_seconds",
"http_server_request_duration_seconds",
"jvm_class_count",
"jvm_class_loaded_total",
"jvm_class_unloaded_total",
"jvm_cpu_count",
"jvm_cpu_recent_utilization_ratio",
"jvm_cpu_time_seconds_total",
"jvm_gc_duration_seconds",
"jvm_memory_committed_bytes",
"jvm_memory_limit_bytes",
"jvm_memory_used_after_last_gc_bytes",
"jvm_memory_used_bytes",
"jvm_thread_count",
"otelcol_exporter_queue_capacity",
"otelcol_exporter_queue_size",
"otelcol_exporter_send_failed_log_records_total",
"otelcol_exporter_send_failed_metric_points_total",
"otelcol_exporter_send_failed_spans_total",
"otelcol_exporter_sent_log_records_total",
"otelcol_exporter_sent_metric_points_total",
"otelcol_exporter_sent_spans_total",
"otelcol_http_server_duration_bucket",
"otelcol_http_server_duration_count",
"otelcol_http_server_duration_sum",
"otelcol_http_server_request_content_length_total",
"otelcol_http_server_response_content_length_total",
"otelcol_process_cpu_seconds_total",
"otelcol_process_memory_rss",
"otelcol_process_runtime_alloc_bytes_total",
"otelcol_process_runtime_heap_alloc_bytes",
"otelcol_process_runtime_total_sys_memory_bytes",
"otelcol_process_uptime_total",
"otelcol_processor_batch_batch_send_size_bucket",
"otelcol_processor_batch_batch_send_size_count",
"otelcol_processor_batch_batch_send_size_sum",
"otelcol_processor_batch_metadata_cardinality",
"otelcol_processor_batch_timeout_trigger_send_total",
"otelcol_receiver_accepted_log_records_total",
"otelcol_receiver_accepted_metric_points_total",
"otelcol_receiver_accepted_spans_total",
"otelcol_receiver_refused_log_records_total",
"otelcol_receiver_refused_metric_points_total",
"otelcol_receiver_refused_spans_total",
"otlp_exporter_exported_total",
"otlp_exporter_seen_total",
"processedLogs_total",
"processedSpans_total",
"queueSize_ratio",
"scrape_duration_seconds",
"scrape_samples_post_metric_relabeling",
"scrape_samples_scraped",
"scrape_series_added",
"target_info",
"traces_service_graph_request_client_seconds_bucket",
"traces_service_graph_request_client_seconds_count",
"traces_service_graph_request_client_seconds_sum",
"traces_service_graph_request_server_seconds_bucket",
"traces_service_graph_request_server_seconds_count",
"traces_service_graph_request_server_seconds_sum",
"traces_service_graph_request_total",
"up"

Not sure what I should be matching these up against? Is this related to this Grafana blog post and this OpenTelemetry Collector document that mentions Prometheus Normalization?
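To make the naming question concrete, here is a simplified sketch of how an OpenTelemetry metric name could end up as one of the Prometheus series names listed above once unit suffixes are added. It is not the Collector's actual implementation: the function `prometheus_name`, the small unit table, and the instrument kinds passed in are assumptions for illustration, and the real normalization has more rules.

```python
# Assumed subset of unit-to-suffix mappings; the real table is larger.
UNIT_SUFFIXES = {"s": "seconds", "ms": "milliseconds", "By": "bytes"}


def prometheus_name(otel_name: str, unit: str, kind: str) -> str:
    """Approximate the Prometheus series name for an OTel metric.

    kind is one of "counter", "updowncounter", "gauge", "histogram".
    """
    name = otel_name.replace(".", "_")
    suffix = UNIT_SUFFIXES.get(unit)
    # A dimensionless gauge (unit "1") is treated as a ratio.
    if unit == "1" and kind == "gauge":
        suffix = "ratio"
    if suffix and not name.endswith("_" + suffix):
        name += "_" + suffix
    # Monotonic counters get a _total suffix.
    if kind == "counter" and not name.endswith("_total"):
        name += "_total"
    return name


# OTel names, units, and kinds below are assumed from the semantic conventions.
print(prometheus_name("http.server.request.duration", "s", "histogram"))
print(prometheus_name("jvm.cpu.recent_utilization", "1", "gauge"))
print(prometheus_name("jvm.class.loaded", "{class}", "counter"))
```

With suffixing turned off, the same metrics would appear without the `_seconds`, `_ratio`, or `_total` endings, which is why a dashboard query written for one convention can come back empty under the other.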

Hm, interesting, you don't see http_server_request_duration_seconds_bucket and http_server_request_duration_seconds_count?

Did you install the docker/grafana-dashboard-red-metrics-classic.json or docker/grafana-dashboard-red-metrics-native.json? Based on the metric series names I think you need to use docker/grafana-dashboard-red-metrics-native.json.
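The reasoning here can be sketched as a small check over the series names: classic (explicit bucket) histograms show up as separate `_bucket`, `_count`, and `_sum` series, while a native histogram appears under the base name only. The function `guess_red_dashboard` and its return labels are hypothetical; this is a heuristic based on the discussion in this thread, not something the image ships.

```python
def guess_red_dashboard(series_names, base="http_server_request_duration_seconds"):
    """Guess which RED dashboard matches the series present in Prometheus.

    Returns "classic", "native", or "none".
    """
    names = set(series_names)
    # Classic histograms are exposed as <base>_bucket, <base>_count, <base>_sum.
    if any(base + suffix in names for suffix in ("_bucket", "_count", "_sum")):
        return "classic"
    # Native histograms are stored as a single series under the base name.
    if base in names:
        return "native"
    return "none"


# A few of the series names reported earlier in this issue.
reported = [
    "http_client_request_duration_seconds",
    "http_server_request_duration_seconds",
    "jvm_class_count",
]
print(guess_red_dashboard(reported))
```

For the names reported above this suggests the native dashboard (docker/grafana-dashboard-red-metrics-native.json).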

@grcevski I'm using the docker container as published to Docker Hub via docker run -p 3000:3000 -p 4317:4317 -p 4318:4318 --rm -ti grafana/otel-lgtm as per the Grafana Labs blog post. I haven't cloned this repo and customized anything, thus my comment above about the confusion between the two different predefined RED dashboards that are visible in the Grafana UI. All three dashboards show No Data.

Ah, I see. We should possibly expand the documentation to mention the other dashboards. Are there any other dashboards available? For the metric series names you have, you need to use the 'Native Prometheus Dashboard'.

@grcevski Which other dashboards are you referring to? I see three dashboards by default:
[screenshot: dashboard list]

It looks like these correspond to the dashboard definitions in the docker/grafana-dashboard-*.json files in this repo. Are there others?

OK, great, so the "RED Metrics (native histogram)" should work if you have data in "http_server_request_duration_seconds". Is it also empty?
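One way to answer "is it also empty?" outside the Grafana UI is to ask Prometheus directly through its HTTP query API (`/api/v1/query`). The sketch below assumes Prometheus is reachable at http://localhost:9090, which is an assumption: the `docker run` command quoted in this thread publishes only ports 3000, 4317, and 4318, so an extra port mapping would be needed. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus inside the container is published on this address.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def series_count(response_body: str) -> int:
    """Extract the value of a count(...) query from the JSON response body."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    # count() over no series yields an empty result vector.
    return int(float(result[0]["value"][1])) if result else 0


def count_series(metric: str, base_url: str = PROMETHEUS_URL) -> int:
    """Return how many series currently exist for the given metric name."""
    url = build_query_url(f"count({metric})", base_url)
    with urllib.request.urlopen(url, timeout=5) as resp:
        return series_count(resp.read().decode("utf-8"))
```

Calling `count_series("http_server_request_duration_seconds")` against a running instance would return 0 if the series is empty and a positive number otherwise.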

@grcevski I seem to have data:
[screenshot: metric data]

but nothing shows up on the dashboard:
[screenshot: empty dashboard]

I'm trying to understand where the docs can be improved.

I've just followed the proposed steps for native histograms - please let me know where it didn't work:

  1. start LGTM
  2. go to native dashboard
  3. see instructions (screenshot)
  4. adjust run.sh, uncommenting export OTEL_EXPORTER_OTLP_METRICS_DEFAULT_HISTOGRAM_AGGREGATION=base2_exponential_bucket_histogram
  5. run run-example.sh and generate-traffic.sh
  6. native dashboard shows data (second screenshot)

[screenshot: instructions shown on the native dashboard]

[screenshot: native dashboard showing data]

@zeitlinger Sorry, if the intent is for this container to only work with the sample you provided you can go ahead and close this ticket. I'm feeding data in from my own application via the OpenTelemetry Java Agent. I still think it's confusing to have two RED dashboards but maybe that makes sense to OTel experts?

> I still think it's confusing to have two RED dashboards

Only one of the RED dashboards can work, depending on how you send the data (controlled by OTEL_EXPORTER_OTLP_METRICS_DEFAULT_HISTOGRAM_AGGREGATION).
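As a rough illustration of that rule, the sketch below reads agent settings in the properties format used at the top of this issue and reports which RED dashboard file would be expected to show data. The function names are hypothetical, and treating `explicit_bucket_histogram` as the default when the setting is absent is an assumption based on the OpenTelemetry SDK configuration documentation.

```python
AGGREGATION_KEY = "otel.exporter.otlp.metrics.default.histogram.aggregation"


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def expected_red_dashboard(props: dict) -> str:
    """Return the dashboard definition expected to match the histogram setting."""
    # Assumed default when the setting is not present.
    aggregation = props.get(AGGREGATION_KEY, "explicit_bucket_histogram").lower()
    if aggregation == "base2_exponential_bucket_histogram":
        return "docker/grafana-dashboard-red-metrics-native.json"
    return "docker/grafana-dashboard-red-metrics-classic.json"


settings = """
# in ms...
otel.metric.export.interval=5000
otel.exporter.otlp.metrics.default.histogram.aggregation=base2_exponential_bucket_histogram
"""
print(expected_red_dashboard(parse_properties(settings)))
```

This only covers the RED dashboards; it says nothing about what the JVM dashboard needs.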

I'm happy to improve the docs if you have a suggestion 😄

@zeitlinger For me none of the three dashboards are working with Java Agent 2.2.0 per notes above. Not sure what I'm doing wrong. I'd suggest adding docs for whatever is required for the JVM Dashboard to show information.

> @zeitlinger For me none of the three dashboards are working with Java Agent 2.2.0 per notes above.

Are you using the included example app or your own? If the latter, can you point to a repo or describe the steps to reproduce?

@zeitlinger My own application. The application isn't public, so I can't point you to a repo. I included the Java Agent config in the ticket summary. Let me know what additional details you'd need. Like I said, even the JVM dashboard doesn't display anything, but when I explore Metrics, Logs and Traces I do see information, so I know the agent is properly activated and feeding data over.

> @zeitlinger My own application.

Can you try to reproduce the issue with the java app?

  • Maybe just change some settings in run.sh and create a draft PR
  • If necessary, also copy the relevant parts of your app into the draft PR