TTL for pushed metrics?
Rotwang opened this issue · 12 comments
Hi, it appears that the Pushgateway doesn't support any form of TTL for pushed metrics. Yes, I've seen this link: https://prometheus.io/docs/practices/pushing/. However, a cache should be invalidated under the right circumstances, and I think that introducing a TTL (e.g. via a "meta" label like 'push_gateway_ttl_seconds') could help remove stale cached metrics.
Dupe of #19
For the rare times you need to delete a group, you can do so by hand.
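For reference, deleting a group by hand is a single HTTP DELETE against the grouping key's path. A minimal sketch using only the Python standard library (the host, job, and instance names are placeholders, not from this thread):

```python
import urllib.request

def delete_group_request(base_url, job, instance):
    """Build a DELETE request that removes all metrics pushed under this grouping key."""
    url = f"{base_url}/metrics/job/{job}/instance/{instance}"
    return urllib.request.Request(url, method="DELETE")

# Hypothetical host and grouping key:
req = delete_group_request("http://pushgateway:9091", "batch_a", "worker-1")
# urllib.request.urlopen(req)  # uncomment to actually send the request
```

The same thing can be done with `curl -X DELETE` or with `delete_from_gateway` from the official `prometheus_client` library.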
I'm wondering why this is an anti-pattern. Say I have batch job 'A' that I run periodically. A month later this job is (a) no longer required, or (b) converted to a daemon with its own exporter. In case (b) I now have a duplicate metric (from both the daemon's exporter and the Pushgateway). In case (a) I have a stale metric, i.e. information that is no longer valid.
I don't want to micromanage my metrics either and DELETE them from the Pushgateway by hand. It would be much easier to just attach a TTL to the metric I'm sending.
#19 documents the conclusion back then. If you want to bring forward new evidence that justifies re-opening the discussion, please do so on the prometheus-developers mailing list.
Sorry to keep dredging this up, but I would like the dev team to explain the best practice for this situation:
- We have an upload area where files are delivered.
- Every time a file is noticed, a Kubernetes Job is spun up to ingest the file.
- The Job pushes file-level metrics (e.g. the number of successfully processed records) to the Pushgateway when it completes. The push endpoint is http://pushgateway/job/ingest/instance/HOSTNAME_OF_POD.
- We want to track the number of incoming records per hour across all Jobs using Prometheus (each incoming file has a varying number of records).
- ~Hundreds of files arrive per hour.
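The per-file push described above is just a PUT of text exposition format to the grouping-key URL. A sketch with the standard library, mirroring the endpoint shape from this thread (the metric name and hostname are illustrative; note that a PUT replaces all metrics in the group, while a POST only replaces metrics with the same names):

```python
import urllib.request

def push_records_request(base_url, pod_hostname, records):
    """Build a PUT that pushes one sample under this pod's grouping key."""
    url = f"{base_url}/job/ingest/instance/{pod_hostname}"
    body = f"ingest_records_processed_total {records}\n".encode()
    return urllib.request.Request(url, data=body, method="PUT")

# Hypothetical pod hostname and record count:
req = push_records_request("http://pushgateway", "pod-abc123", 4217)
# urllib.request.urlopen(req)  # send when a gateway is reachable
```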
This works well, but queries get slower as the number of unique hostnames grows, because the Pushgateway permanently remembers every unique instance name. Given how the Pushgateway is designed, we seem to have these choices:
- Use a less unique instance name than HOSTNAME_OF_POD. But then, if n files for that instance name are processed within the same scrape interval (the files are often quite small), all but one of the metrics for that interval would be lost and we would under-report.
- Keep sending the hostname as the instance value, but delete metrics for completed ingestion processes after a given amount of time. How long would that need to be to keep our graphs complete? It sounds like a few multiples of the scrape interval should be fine.
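The second option can be automated without any TTL support in the Pushgateway itself: the gateway exposes a `push_time_seconds` sample per grouping key on its own /metrics endpoint, so a periodic cleanup job can find groups older than a cutoff and DELETE them. A rough, offline-testable sketch of the "find stale groups" part (the sample text and cutoffs are made up):

```python
import re
import time

# Matches e.g.: push_time_seconds{instance="pod-1",job="ingest"} 1000
PUSH_TIME = re.compile(r'^push_time_seconds\{(.*)\}\s+([0-9.e+-]+)', re.M)

def stale_groups(metrics_text, max_age_s, now=None):
    """Return label sets of grouping keys whose last push is older than max_age_s."""
    now = time.time() if now is None else now
    return [labels for labels, ts in PUSH_TIME.findall(metrics_text)
            if now - float(ts) > max_age_s]

sample = ('push_time_seconds{instance="pod-1",job="ingest"} 1000\n'
          'push_time_seconds{instance="pod-2",job="ingest"} 5000\n')
# With "now" pinned to 6000 and a 2000s cutoff, only pod-1 is stale:
print(stale_groups(sample, max_age_s=2000, now=6000))
```

Each returned label set could then be turned into a DELETE against /metrics/job/... as shown in the API docs.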
I think this is a common use case, and that the official documentation should describe what the best practice for this sort of 'ephemeral producer' use case is.
I'm sure the Prometheus community is happy to discuss your use case. But an already closed GitHub issue is not the right place. Could you post to the prometheus-users mailing list where the discussion is accessible for everybody so that more people can benefit from it?
would like to see a TTL feature too
I'd also like to see a TTL feature. Having to manually remove stale groups is painful and for me it's not a 'rare time'.
@yumpy maybe you want to look at this fork
https://github.com/pkcakeout/pushgateway
Would also like to see a TTL feature, in a stable version.
I would also like this feature. Prometheus is frequently deployed in container infrastructures, where all jobs are ephemeral. This is doubly the case with the push gateway, which is designed for ephemeral jobs.
Furthermore, mailing lists are where discussions go to die. Email chains fork, they're often only visible to a small group, and they're frequently lost. Github persists context and conversation across years, as this issue shows.
Garbage-collecting pushed metrics perfectly is nearly impossible in the Prometheus model, because it's hard to know when a metric is no longer relevant. Fortunately, most of us don't need perfect: we need good enough. And stale-metric deletion is good enough for most use cases.
I have the same problem. I run hundreds of jobs every day and need to monitor their status, but the metric is still there hours after a job finishes. New jobs arrive continuously, so the Pushgateway keeps accumulating entries.
So I really need a way to expire entries.
Meanwhile, since these are ephemeral values, I guess I can delete all of a job's old entries before pushing new ones.
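That delete-before-push workaround is two requests against the same grouping-key URL. A minimal stdlib sketch (the host, job, and metric body are placeholders; a PUT alone already replaces all metrics in the group, so the explicit DELETE mainly helps clear groups that will never be pushed again):

```python
import urllib.request

def replacement_requests(base_url, job, instance, body):
    """Build a DELETE that wipes the grouping key, then a PUT with fresh samples."""
    group = f"{base_url}/metrics/job/{job}/instance/{instance}"
    return [urllib.request.Request(group, method="DELETE"),
            urllib.request.Request(group, data=body, method="PUT")]

# Hypothetical grouping key and sample:
reqs = replacement_requests("http://pushgateway:9091", "daily_job", "run",
                            b"job_last_records 42\n")
# for r in reqs: urllib.request.urlopen(r)  # send when the gateway is up
```

The official `prometheus_client` library covers the same ground with `delete_from_gateway` followed by `push_to_gateway`.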
Could you please take #117 (comment) into account? Really, folks, you are using the wrong forum to express your concerns.