vmagent loses some metrics because it doesn't push them on shutdown
Opened this issue · 5 comments
Description
We noticed that some metrics are randomly not pushed. After some debugging, we found that this happens only when vmagent runs for a short period of time and cannot push all the metrics, because some of them were created between the last scrape and shutdown. The metrics appear in the input file, but they are never sent to the -remoteWrite.url endpoint.
A possible solution might be to change the code here (lines 236 to 242 in fdfd428) to push the metrics on shutdown.
To reproduce
Run vmagent in an environment with a short life cycle.
Version
vmagent-20230313-021802-tags-v1.89.1-0-g388d6ee16
The exact version doesn't really matter, though, since the same problem exists even in the latest version of vmagent.
Maybe there could be some way to signal to vmagent that we are about to shut down, so that it performs a final scrape and pushes the metrics?
The use case is a process of variable life length that stores metrics in a file; vmagent scrapes that file periodically and pushes the metrics to a remote write URL. Once that process shuts down, we shut down vmagent, but we expect vmagent to scrape the metrics file one last time and perform a final push.
The problem reproduces when the process producing the metrics turns out to be short-lived (e.g. it fails fast, but still produces some useful metrics). In that case, vmagent's scrape interval is likely not to coincide with the time when the process wrote its metrics, so vmagent never sees or pushes them.
EDITED FOR CLARITY.
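To illustrate what "flush on shutdown" could look like, here is a minimal Go sketch of the requested behavior: buffered samples are pushed one last time when SIGINT arrives. The `pendingSamples` type and all names here are purely illustrative and are not vmagent's actual internals.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// pendingSamples stands in for an agent's in-memory remote-write buffer
// (hypothetical; vmagent's real buffering is more involved).
type pendingSamples struct {
	mu      sync.Mutex
	samples []string
}

func (p *pendingSamples) add(s string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.samples = append(p.samples, s)
}

// flush drains the buffer, simulating a final push to -remoteWrite.url,
// and returns the number of samples that were pushed.
func (p *pendingSamples) flush() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	n := len(p.samples)
	p.samples = p.samples[:0]
	return n
}

func main() {
	buf := &pendingSamples{}
	// A sample created between the last scrape and shutdown.
	buf.add("job_duration_seconds 12.5")

	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)

	// Simulate an external supervisor sending SIGINT to this process.
	go syscall.Kill(os.Getpid(), syscall.SIGINT)

	<-sigCh
	// On shutdown: perform one last push before exiting,
	// instead of dropping whatever is still buffered.
	fmt.Printf("shutdown: flushed %d buffered sample(s)\n", buf.flush())
}
```

The point of the sketch is only the ordering: the signal handler must run the final push before process exit, which is exactly what the linked code snippet does not do today.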
@andrii-dovzhenko and I work in the same company. I thought I had addressed this issue several months ago by making sure vmagent is sent a SIGINT signal before we shut down the telemetry infrastructure, so that any buffered telemetry is flushed to the configured remote endpoint.
See here:
(This link is to an internal Elastio repo; sorry for this. Basically it's a script that sends SIGINT to vmagent before we shut down the rest of the telemetry infrastructure.)
The stop signal is configured in the respective vmagent and promtail config files as INT, meaning SIGINT. AFAIK, that signal should force vmagent and promtail to flush their buffers and then exit. These two are stopped explicitly so that nifmet and naflog are still running to receive their flushed results.
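For reference, the shutdown ordering described above can be enforced with a small wrapper like the following sketch. The `stop_and_wait` helper is hypothetical (it is not the actual Elastio script), and the service invocations are commented-out placeholders since the service names come from our internal setup.

```shell
#!/bin/sh
# Hypothetical helper: send a signal to a process and block until it has
# actually exited, so any buffered data it flushes on shutdown is delivered
# before dependent services are stopped.
stop_and_wait() {
  sig="$1"; pid="$2"
  kill -s "$sig" "$pid" 2>/dev/null
  # Poll until the pid is gone; kill -0 only checks existence.
  while kill -0 "$pid" 2>/dev/null; do
    sleep 0.1
  done
}

# Shutdown order described in this thread: scrapers first, sinks last.
# (Commented out here because vmagent/promtail/nifmet/naflog are not running.)
# stop_and_wait INT "$(pidof vmagent)"
# stop_and_wait INT "$(pidof promtail)"
# systemctl stop nifmet naflog
```

Of course, this only helps if vmagent actually flushes its buffers in response to SIGINT, which is the question at hand.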
It sounds like perhaps this isn't working.
Is it expected behavior that vmagent flushes its buffers and writes all metrics to the configured server in response to a SIGINT signal?
Link is 404 (looks like this is a private repo).
Sorry @cristaloleg that is indeed a private repo. That comment was directed to @andrii-dovzhenko. I'll reword the comment to make it more clear.
@anelson, the order of stopping the services is correct. The problem is that vmagent does not flush its buffer on SIGINT, as we can see in the code snippet I left in the issue description, and this is not yet fixed in vmagent. As a result, nifmet doesn't receive the metrics that were created after the last scrape performed by vmagent.