VictoriaMetrics/metrics

vmagent loses some metrics because it doesn't push them on shutdown

Opened this issue · 5 comments

Description

We noticed that some metrics occasionally are not pushed. After some debugging, we found out that this only happens when vmagent runs for a short period of time and cannot push all the metrics, because some of them were created between the last scrape and shutdown.
The metrics appear in the input file, but they are never sent to the -remoteWrite.url endpoint.

A possible solution might be to change the code here

metrics/push.go

Lines 236 to 242 in fdfd428

		case <-stopCh:
			if wg != nil {
				wg.Done()
			}
			return
		}
	}

to push the metrics on shutdown
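
A minimal sketch of what the changed loop could look like; here doPush() is a hypothetical stand-in for the push logic the goroutine in metrics/push.go already runs on every tick, not an actual function name from the library:

	for {
		select {
		case <-ticker.C:
			doPush()
		case <-stopCh:
			// Do one final push so that metrics created between the
			// last tick and shutdown are not lost.
			doPush()
			if wg != nil {
				wg.Done()
			}
			return
		}
	}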

To reproduce

Use vmagent in an environment with a short life cycle.

Version

vmagent-20230313-021802-tags-v1.89.1-0-g388d6ee16
But it doesn't really matter, since the same problem exists even in the latest version of vmagent.

Maybe there could be some way to signal to vmagent that we are about to shut down, so that it performs a final scrape and pushes the metrics?

The use case is a process with a variable lifetime that stores metrics in a file; vmagent scrapes that file periodically and pushes the metrics to a remote write URL. Once that process shuts down, we then shut down vmagent, but we expect vmagent to scrape the metrics file one last time and perform a final push.

The problem reproduces when the process that produces the metrics turns out to be short-lived (e.g. it fails fast, but still produces some useful metrics). In this case it is likely that vmagent's scrape interval does not coincide with the moment the process wrote out its metrics, so vmagent never sees or pushes them.

EDITED FOR CLARITY.
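
For context, here is a minimal sketch of the producing side of this setup, assuming the short-lived process uses the VictoriaMetrics/metrics Go library to write its metrics file; the metric name and file path are made up for illustration:

	package main

	import (
		"log"
		"os"

		"github.com/VictoriaMetrics/metrics"
	)

	func main() {
		// Record some metrics during the (possibly very short) lifetime of the job.
		metrics.GetOrCreateCounter(`job_runs_total{job="example"}`).Inc()

		// Dump all registered metrics in Prometheus text format to the file
		// that vmagent is configured to scrape.
		f, err := os.Create("/var/lib/example/metrics.prom")
		if err != nil {
			log.Fatalf("cannot create metrics file: %s", err)
		}
		defer f.Close()
		metrics.WritePrometheus(f, false)
	}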

@andrii-dovzhenko and I work in the same company. I thought I addressed this issue several months ago by making sure vmagent is sent a SIGINT signal before we shut down the telemetry infrastructure, so that any buffered telemetry is flushed to the configured remote endpoint.

See here:

https://github.com/elastio/elastio/blob/edb2ae7795849fca523a674d2296bb498ff2cf44/docker/elastio-red-stack-base/supervised.sh#L60-L64

(This link is to an internal Elastio repo; sorry for this. Basically it's a script that sends SIGINT to vmagent before we shut down the rest of the telemetry infrastructure.)

The stop signal is configured in the respective vmagent and promtail config files as INT, meaning SIGINT. AFAIK, that signal should force vmagent and promtail to flush their buffers and then exit. These two are stopped explicitly so that nifmet and naflog are still running to receive their flushed results.

It sounds like perhaps this isn't working.

Is it expected behavior that vmagent flushes its buffers and writes all metrics to the configured server in response to a SIGINT signal?
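
For reference, this is the kind of shutdown behavior being asked about; a minimal generic Go sketch (not vmagent's actual code), where flushBuffers() is a hypothetical placeholder for the final scrape-and-push step:

	package main

	import (
		"log"
		"os"
		"os/signal"
		"syscall"
	)

	func flushBuffers() {
		// Hypothetical placeholder: scrape targets one last time and push
		// anything still buffered to the remote write endpoint.
	}

	func main() {
		sigCh := make(chan os.Signal, 1)
		signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)

		// ... normal scrape/push loops would run here ...

		sig := <-sigCh
		log.Printf("received %s, flushing buffered metrics before exit", sig)
		flushBuffers()
	}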

Link is 404 (looks like this is a private repo 👀)

Sorry @cristaloleg that is indeed a private repo. That comment was directed to @andrii-dovzhenko. I'll reword the comment to make it more clear.

@anelson, the order of stopping the services is correct. The problem is that vmagent does not flush its buffer on SIGINT, as the code snippet in the issue description shows, and this is not fixed in vmagent yet. As a result, nifmet doesn't receive the metrics that were created after the last scrape performed by vmagent.