Detecting broken subscriptions
jhujhiti opened this issue · 8 comments
Hello again!
As you've no doubt figured out by my third issue, I'm working on building a monitoring system using gnmic as gNMI ingest and Prometheus as a metrics backend. The entire thing has been working great and has been very intuitive, but one thing I'm stuck on is how to detect when devices are down or otherwise not reporting sampled paths as they should by using the gNMI stream itself. My initial reaction to the question was simply to query Prometheus for a metric/path guaranteed to be reported by all devices, but unfortunately that's actually unworkable because I have no way to iterate all expected devices in Prometheus and because I would lack the context at that point to know if a device was even supposed to be reporting in the first place (maybe it's been removed from the gNMIc configuration entirely). I'm curious if there is any advice on how to proceed.
One idea I had would be to generate some sort of keepalive event per subscription per target. For sampled subscriptions, we could generate one such event per sample-interval as long as one message was received from the target. For my particular use case, this metric would then propagate into Prometheus, where I would be able to notice the lack of a timeseries for this keepalive metric as a broken subscription. For subscriptions with heartbeats enabled, this could even be a "time since last heartbeat".
Another idea might be to export the list of targets a given gNMIc instance is handling (not just has configured, since I'm using clustering) with the Prometheus client, so that I could identify targets with no active timeseries using a simple boolean AND in the PromQL query (roughly the shape sketched below).
I noticed that there are several places in the configuration where I'm able to add enable-metrics flags, including the file loader (which I happen to be using, and which I would expect to have the data I'm looking for), but I haven't been able to figure out how to get these metrics emitted, either on my configured Prometheus output or on a dedicated one for gNMIc's own exported metrics. I'm not sure if I'm doing something wrong or if those flags don't have an implementation behind them yet?
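To make that boolean-AND idea concrete, a rough PromQL sketch could look like the following. Both metric names here are hypothetical (nothing gnmic exports today), and it assumes the exported target-list gauge would carry the same source label as the output metrics:
# hypothetical gauge: one series per target this gnmic instance is handling
gnmic_handled_target
  # drop any target that produced at least one sampled series recently
  unless on (source)
    count by (source) ({__name__=~"interface_.*"})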
Thanks
Hi,
> I noticed that there are several places in the configuration where I'm able to add enable-metrics flags, including the file loader (which I happen to be using and might expect to have the data I'm looking for), but I haven't been able to figure out how to get these metrics to be emitted, either on my configured Prometheus output or a dedicated one for gNMIc's own exported metrics. I'm not sure if I'm doing something wrong or if those flags don't have an implementation behind them yet?
I will look into this and get back to you.
The metrics supported by file_loader right now are:
Gauges:
- number_of_loaded_targets
- number_of_deleted_targets
- file_read_duration_ns
Counters:
- number_of_failed_file_reads
- number_of_file_read_attempts_total
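For illustration, once these are exposed, a minimal alert on failing file reads could look like the sketch below (the exact exported metric name, including any prefix, is an assumption here):
groups:
  - name: gnmic-loader
    rules:
      - alert: GnmicFileLoaderReadFailures
        # metric name taken from the list above; the real exported name may carry a prefix
        expr: increase(number_of_failed_file_reads[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: gnmic file loader failed to read the targets file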
> One idea I had would be to generate some sort of keepalive event per subscription per target. For sampled subscriptions, we could generate one such event per sample-interval as long as one message was received from the target. For my particular use case, this metric would then propagate into Prometheus, where I would be able to notice the lack of a timeseries for this keepalive metric as a broken subscription. For subscriptions with heartbeats enabled, this could even be a "time since last heartbeat".
I think I can add a few gNMIc internal metrics:
Gauges:
- number of targets in config
- number of gNMI clients
Counters:
- number of messages received per target per subscription
These should be enough to detect if a target is not sending messages for a certain subscription.
Let me know what you think and if you would like to see some additional metrics added.
This actually sounds like a great idea. I can do the alerting I need simply on sum(rate(message_count[1m])) by (source, subscription) == 0.
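Wrapped in a Prometheus alerting rule, that would look roughly like the sketch below (message_count is just a placeholder; substitute whatever counter name gnmic ends up exporting, and size the rate window to comfortably cover the sample-interval):
groups:
  - name: gnmic-subscriptions
    rules:
      - alert: GnmicSubscriptionStalled
        # placeholder metric name; the window should span a few sample-intervals
        expr: sum by (source, subscription) (rate(message_count[5m])) == 0
        for: 5m
        annotations:
          summary: "No messages received for {{ $labels.subscription }} on {{ $labels.source }}"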
I don't think I need the gauges at all for my use case here, but they're a good idea regardless. I would also be interested in tracking:
- number of locked targets when clustered (gauge)
- number of connection attempts (by failure and success) per target per subscription (counter)
I'm sure I will come up with more as I deploy this into production.
Hi @jhujhiti,
With v0.25.0-beta, I believe I fixed the metrics generation for the loaders.
I also added a few new metrics:
- Cluster leader indication
- Number of locked targets
- Number of received SubscribeResponse messages per target per subscription
# HELP gnmic_cluster_is_leader Has value 1 if this gnmic instance is the cluster leader, 0 otherwise
# TYPE gnmic_cluster_is_leader gauge
gnmic_cluster_is_leader 0
# HELP gnmic_cluster_number_of_locked_targets number of locked targets
# TYPE gnmic_cluster_number_of_locked_targets gauge
gnmic_cluster_number_of_locked_targets 3
# HELP gnmic_subscribe_number_of_received_subscribe_response_messages_total Total number of received subscribe response messages
# TYPE gnmic_subscribe_number_of_received_subscribe_response_messages_total counter
gnmic_subscribe_number_of_received_subscribe_response_messages_total{source="clab-metrics-srl2",subscription="sub1"} 32
gnmic_subscribe_number_of_received_subscribe_response_messages_total{source="clab-metrics-srl2",subscription="sub2"} 7
gnmic_subscribe_number_of_received_subscribe_response_messages_total{source="clab-metrics-srl4",subscription="sub1"} 28
gnmic_subscribe_number_of_received_subscribe_response_messages_total{source="clab-metrics-srl4",subscription="sub2"} 7
gnmic_subscribe_number_of_received_subscribe_response_messages_total{source="clab-metrics-srl6",subscription="sub1"} 31
gnmic_subscribe_number_of_received_subscribe_response_messages_total{source="clab-metrics-srl6",subscription="sub2"} 7
Note that the number of locked targets is retrieved from Consul periodically (every 10s).
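With that counter in place, the detection query sketched earlier can simply use the real metric name, for example:
sum by (source, subscription)
  (rate(gnmic_subscribe_number_of_received_subscribe_response_messages_total[5m])) == 0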
- gRPC client metrics are now enabled:
# HELP grpc_client_msg_received_total Total number of RPC stream messages received by the client.
# TYPE grpc_client_msg_received_total counter
grpc_client_msg_received_total{grpc_method="Subscribe",grpc_service="gnmi.gNMI",grpc_type="bidi_stream"} 111
# HELP grpc_client_msg_sent_total Total number of gRPC stream messages sent by the client.
# TYPE grpc_client_msg_sent_total counter
grpc_client_msg_sent_total{grpc_method="Subscribe",grpc_service="gnmi.gNMI",grpc_type="bidi_stream"} 6
# HELP grpc_client_started_total Total number of RPCs started on the client.
# TYPE grpc_client_started_total counter
grpc_client_started_total{grpc_method="Subscribe",grpc_service="gnmi.gNMI",grpc_type="bidi_stream"} 6
That shows the total number of messages received/sent as well as the number of clients started.
For the number of retries per target per subscription, I might need to do some refactoring before adding the metric.
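In the meantime, the aggregate gRPC client counters can serve as a rough proxy: Subscribe RPCs started beyond the initial connections suggest subscriptions being re-established, even though the counter is not broken down per target. For example:
# new Subscribe RPCs started over the last hour, across all targets
increase(grpc_client_started_total{grpc_method="Subscribe",grpc_service="gnmi.gNMI"}[1h])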
Hi @karimra
I set out to test the beta, but I'm still unable to get the /metrics endpoint to emit anything related to gNMIc itself, just the subscribed target metrics. I stripped my config down to the bare minimum:
skip-verify: true
log: true
debug: true
loader:
  type: file
  path: ./targets.yaml
  interval: 60s
  enable-metrics: true
outputs:
  prometheus:
    type: prometheus
    listen: 127.0.0.1:9804
subscriptions:
  interfaces:
    encoding: JSON_IETF
    mode: STREAM
    paths:
      - /interface[name=mgmt0]/statistics
    sample-interval: 60s
    stream-mode: SAMPLE
And I can see 2022/04/15 16:13:57.220338 /home/runner/work/gnmic/gnmic/app/app.go:272: [gnmic] loader/enable-metrics='true'(bool) in the debug output, which looks right, but http://127.0.0.1:9804/metrics is still not serving the loader metrics in my testing (x86-64 Darwin release). My real config on Linux with the API and clustering enabled (enable-metrics: true in the clustering section as well) shows the same behavior: nothing but the metrics collected from the targets. It seems like I must be doing something wrong, but I'm just not seeing it...
Oh I see, I think this is not clearly stated in the docs, but the internal metrics are not served by the Prometheus output server; they are available on the API server, so you should be scraping http://api_addr:7890/metrics. You also need to set enable-metrics: true under the api-server section.
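A minimal sketch of that config (check the api-server docs for the exact option names in your version; 7890 is the port used above):
api-server:
  # option names here are illustrative; enable-metrics is the important one
  address: :7890
  enable-metrics: true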
Oh, of course, I should have thought to try that. gnmic_subscribe_number_of_received_subscribe_response_messages_total is exactly what I needed to detect broken subscriptions, and the others will be useful as well. Thanks again!
The reasoning behind putting the internal metrics under the api-server is that a user might not use Prometheus as an output for the target metrics but still want to get the internal metrics.
I think the internal metrics deserve their own page in the docs now, will work on that.
Yeah, that reasoning makes sense. It also helps in case there's a problem scraping metrics, since the output metrics endpoint can get quite large. Best to have the "metadata" on its own endpoint anyway.
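For completeness, a minimal Prometheus scrape sketch for the two endpoints discussed here (addresses taken from the examples above; adjust to your deployment):
scrape_configs:
  # target metrics served by the gnmic prometheus output
  - job_name: gnmic-output
    static_configs:
      - targets: ["127.0.0.1:9804"]
  # gnmic internal metrics served by the api-server
  - job_name: gnmic-internal
    static_configs:
      - targets: ["127.0.0.1:7890"]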