fastly/fastly-exporter

Add discovery/scrape-time service selection

SuperQ opened this issue · 20 comments

With Prometheus 2.28, there is now a generic http service discovery.

The exporter could expose an endpoint that lists all of the available services so that they can be scraped independently. This improves ingestion performance by spreading the scrapes out over time and allowing Prometheus to ingest the data in parallel across multiple targets.

On the Prometheus side, you would configure the job like this:

scrape_configs:
- job_name: fastly
  metrics_path: /fastly
  http_sd_configs:
  - url: http://fastly-exporter:8080/sd
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: fastly-exporter:8080

The /sd endpoint would output JSON like this:

[
  { 
    "targets": [
      "<Service ID 1>",
      "<Service ID 2>",
      "<Service ID 3>",
      "<Service ID ...>"
    ]
  }
]

The relabel_configs would then produce exporter URLs like /fastly?target=<Service ID 1>.
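For illustration, a minimal Go sketch of what such an /sd handler could look like, assuming the exporter already knows the list of service IDs it is polling; targetGroup and serviceDiscoveryHandler are placeholder names, not the exporter's actual code.

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// targetGroup matches the JSON shape Prometheus http_sd_configs expects.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels,omitempty"`
}

// serviceDiscoveryHandler serves the currently known service IDs as one target group.
func serviceDiscoveryHandler(serviceIDs func() []string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode([]targetGroup{{Targets: serviceIDs()}})
	})
}

func main() {
	// Placeholder list; the real exporter would return whatever it currently polls.
	ids := func() []string { return []string{"<Service ID 1>", "<Service ID 2>"} }
	http.Handle("/sd", serviceDiscoveryHandler(ids))
	log.Fatal(http.ListenAndServe(":8080", nil))
}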

This is meant as an additional improvement on #31. Possibly a replacement, depending on where the bottlenecks are.

CC @neufeldtech

Super cool.

So wait, given /sd returns

[
  { "targets": ["A", "B", "C"] }
]

there's no absolute or explicit URL mapping to each of A, B, and C — I would just assert in the exporter (or a flag or something) that they map to e.g. /service/{A, B, C} which the scraping Prometheus would have to be configured to match a priori. Is that right?

Yes, it needs to be re-mapped by a relabel config in Prometheus no matter what. Discovery injects each target list item into the __address__ meta-label; Prometheus doesn't treat those as URLs, it treats them as host:port pairs.

@SuperQ The bottleneck is parsing the JSON from the Fastly API. By default, the exporter fetches all services available to the provided token on startup, and launches a goroutine per service to poll the API and update the relevant metrics. The -service* flags are used to control the set of services we poll, so I think that means we need to keep them. Though maybe I could play some trick — could I lazy-launch the polling goroutine for a given service only once the exporter received a request for that service? It would mean the first scrape either had a 1s+ delay, or returned an empty response; would either of those be acceptable?

This is why readiness checks are a thing. Discovery can just wait for the readiness endpoint to say OK.

Discovery can just wait for the readiness endpoint to say OK.

Not sure how that would work. Is readiness per-SD-target? I'm suggesting that the exporter not poll a service ID unless/until it receives a scrape request for that service ID from Prometheus.

The exporter shouldn't expose a newly added service to the discovery output until it has a valid API connection.

Right, but the only meaningful optimization is to try to avoid polling the API and parsing the large JSON responses if we can somehow. For example, if the user starts the exporter with a configuration that makes 10 services "visible" but only has Prometheus scrape 2 of them, it would be ideal if we didn't poll/parse/present metrics for the other 8. The only way I can think of to accomplish that is to "lazy load" the per-service polling goroutines. Is that feasible? Sounds like no. That's fine.

I don't think that's feasible. I don't see a lot of users doing selective filtering without configuring the -service flag.

Is there a canonical example of an exporter that supports this new capability?

Not that I'm aware of, this would be the first one.

So this is an interesting design conundrum. I think the right way to think about this is that we're adding /metrics endpoints that are backed by a kind of pseudo-registry: something that takes the main registry and filters based on a label. I guess you could do that by wrapping the Registry with something that takes a service ID and yields a custom Gatherer? Or am I looking at this the wrong way?
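Roughly something like this, as a sketch of the pseudo-registry idea. It assumes the exporter's metrics carry a service_id label and that the per-service endpoint takes a target query parameter; none of these names come from the exporter itself.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	dto "github.com/prometheus/client_model/go"
)

// serviceLabel is an assumed label name; the real exporter may use a different one.
const serviceLabel = "service_id"

// filteringGatherer wraps another Gatherer and keeps only the metrics whose
// serviceLabel matches the requested service ID.
type filteringGatherer struct {
	next      prometheus.Gatherer
	serviceID string
}

func (g filteringGatherer) Gather() ([]*dto.MetricFamily, error) {
	mfs, err := g.next.Gather()
	if err != nil {
		return nil, err
	}
	var out []*dto.MetricFamily
	for _, mf := range mfs {
		var kept []*dto.Metric
		for _, m := range mf.Metric {
			for _, lp := range m.Label {
				if lp.GetName() == serviceLabel && lp.GetValue() == g.serviceID {
					kept = append(kept, m)
					break
				}
			}
		}
		if len(kept) > 0 {
			out = append(out, &dto.MetricFamily{
				Name:   mf.Name,
				Help:   mf.Help,
				Type:   mf.Type,
				Metric: kept,
			})
		}
	}
	return out, nil
}

func main() {
	// Serve /fastly?target=<service ID> by filtering the shared registry.
	http.Handle("/fastly", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		g := filteringGatherer{next: prometheus.DefaultGatherer, serviceID: r.URL.Query().Get("target")}
		promhttp.HandlerFor(g, promhttp.HandlerOpts{}).ServeHTTP(w, r)
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}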

Ah! Huh.

So the other angle would be something like maintaining a separate gen.Metrics/prometheus.Registry for each service. That makes per-service /metrics endpoints easy. Could you then just use a prometheus.Gatherers to abstract over all of them to serve the existing /metrics endpoint? Would that be efficient/effective?
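As a sketch, assuming each service's collectors can be registered into their own registry (the service IDs and endpoint wiring below are placeholders):

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	serviceIDs := []string{"<Service ID 1>", "<Service ID 2>"} // placeholders

	// One registry per service; each service's metrics get registered into its own.
	perService := map[string]*prometheus.Registry{}
	for _, id := range serviceIDs {
		reg := prometheus.NewRegistry()
		// ... register this service's collectors with reg here ...
		perService[id] = reg
	}

	http.Handle("/fastly", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Per-service view when a target is given.
		if reg, ok := perService[r.URL.Query().Get("target")]; ok {
			promhttp.HandlerFor(reg, promhttp.HandlerOpts{}).ServeHTTP(w, r)
			return
		}
		// Otherwise serve the combined view of every service via Gatherers.
		var all prometheus.Gatherers
		for _, reg := range perService {
			all = append(all, reg)
		}
		promhttp.HandlerFor(all, promhttp.HandlerOpts{}).ServeHTTP(w, r)
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}

prometheus.Gatherers merges the metric families from each registry, which works as long as every registry labels its series with a distinct service ID.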

Yes, setting up a separate registry per-service makes the most sense to me.

Although, typically, an exporter like this would manage the data internally. You implement your own prometheus.Collector, whose Collect(ch) method sends the data to Prometheus via the interface. We do something like this in the node_exporter.
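For reference, the Collector pattern looks roughly like this; the metric name, label, and value are illustrative, not the exporter's actual metrics.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// serviceCollector reports metrics for one service from the exporter's internal
// state. The metric name and value below are illustrative placeholders.
type serviceCollector struct {
	serviceID string
	requests  *prometheus.Desc
}

func newServiceCollector(id string) *serviceCollector {
	return &serviceCollector{
		serviceID: id,
		requests: prometheus.NewDesc(
			"fastly_rt_requests_total",          // assumed metric name
			"Total requests for the service.",   // help text
			nil,                                 // no variable labels
			prometheus.Labels{"service_id": id}, // constant per-service label
		),
	}
}

func (c *serviceCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.requests
}

func (c *serviceCollector) Collect(ch chan<- prometheus.Metric) {
	// In the real exporter this would read the latest data polled from the
	// Fastly real-time API; here it emits a placeholder value.
	ch <- prometheus.MustNewConstMetric(c.requests, prometheus.CounterValue, 42)
}

func main() {
	prometheus.MustRegister(newServiceCollector("<Service ID 1>"))
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}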

Ah, @SuperQ, what would the Prometheus config look like for this? Especially whatever relabel config would be necessary.

The scrape config and relabel is in the issue summary.

I tested this out locally and it looks really good. It even works well with modulo sharding, so you get the best of both worlds: horizontal sharding and automatic discovery.

the issue summary

Thank you 🤦