banzaicloud/spark-metrics

Repetitions of last metric value

g1thubhub opened this issue · 2 comments

Hello @stoader,
I have a general question: after a Spark application ends, the metric in the Pushgateway becomes “stale” because no more updates are pushed. This results in the following problem:

Problem

The last metric value is scraped repeatedly and only stops being plotted when the gateway server shuts down, which means the cluster has to terminate. This is apparently by design; users on the mailing list have asked for a way to avoid this behaviour, e.g. https://groups.google.com/g/prometheus-users/c/uGYUQhQAdOE/m/0ICfNNHaAQAJ
There seems to be no way around this: the Pushgateway authors explicitly decided against implementing something like a metric “timeout”.

Have you observed this as well, and do you know a way to solve it? Unfortunately, the pull-based approach does not seem to work with multiple executors per node on a YARN cluster.

That's due to how Pushgateway works: it keeps the last value for a metric key forever. The only solution I see is to use a custom-built Pushgateway that is compatible with the upstream one but adds metric TTL capabilities (e.g. https://github.com/dinumathai/pushgateway).
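
A partial workaround with the stock Pushgateway is to delete the application's metric group explicitly when the job finishes, so Prometheus stops seeing the stale values. Below is a minimal sketch using the official prometheus_client Python package; the gateway address localhost:9091, the job name spark_app, and the gauge are placeholder assumptions, not part of spark-metrics:

```python
from prometheus_client import (
    CollectorRegistry,
    Gauge,
    push_to_gateway,
    delete_from_gateway,
)

GATEWAY = "localhost:9091"  # assumed Pushgateway address
JOB = "spark_app"           # hypothetical job label

registry = CollectorRegistry()
stage_duration = Gauge(
    "spark_stage_duration_seconds",  # hypothetical metric name
    "Duration of the last completed stage",
    registry=registry,
)

# While the application runs, push updates as usual.
stage_duration.set(42.0)
push_to_gateway(GATEWAY, job=JOB, registry=registry)

# On shutdown (e.g. from a listener's onApplicationEnd hook), delete the
# whole metric group so the last value is not scraped forever.
delete_from_gateway(GATEWAY, job=JOB)
```

The caveat is that this only helps when the shutdown hook actually runs; if the application is killed, the metrics still linger, which is why a TTL-capable fork is attractive.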

Update: I have implemented a push- and pull-based approach on top of VictoriaMetrics in this project: https://github.com/xonai-computing/xonai-dashboard

It is 100% PromQL-compatible, and the Grafana Prometheus plugin works with it, as does the Prometheus Python client.
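
To illustrate the push path: VictoriaMetrics can ingest the Prometheus text exposition format directly via its /api/v1/import/prometheus endpoint, so the stale-value repetition cannot occur; each push is stored as a sample at ingestion time, and the series simply ends once pushes stop. A minimal sketch, assuming a single-node VictoriaMetrics instance on localhost:8428 and a hypothetical gauge name:

```python
import requests
from prometheus_client import CollectorRegistry, Gauge, generate_latest

registry = CollectorRegistry()
active_executors = Gauge(
    "spark_active_executors",  # hypothetical metric name
    "Number of active executors",
    registry=registry,
)
active_executors.set(8)

# POST the serialized Prometheus text format straight into VictoriaMetrics.
resp = requests.post(
    "http://localhost:8428/api/v1/import/prometheus",  # assumed address
    data=generate_latest(registry),
)
resp.raise_for_status()
```

Once ingested, the samples can be queried with PromQL from Grafana like any other series.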