trustpilot/beat-exporter

filebeat_up metric issue

Hi guys, first of all thanks for the great work and support that you've put into this project!

I want to report an issue I noticed when deploying beat-exporter as a sidecar container in a pod next to filebeat in a Kubernetes environment:

If the filebeat pod, which exports its metrics on port 5066, is in a failed state other than CrashLoopBackOff, filebeat_up returns 0, which is the expected behavior, and everything works fine. If the filebeat pod enters CrashLoopBackOff, however, beat-exporter doesn't register anything related to the pod, so filebeat_up and all other metrics for this particular pod are absent.
beat-exporter logs while the filebeat pod is in CrashLoopBackOff:

{"level":"error","message":"Could not load beat type, with error: Get http://localhost:5066: dial tcp 127.0.0.1:5066: connect: connection refused, retrying in 1s","time":"2021-01-05T09:58:45Z"}
{"level":"error","message":"Could not load beat type, with error: Get http://localhost:5066: dial tcp 127.0.0.1:5066: connect: connection refused, retrying in 1s","time":"2021-01-05T09:58:46Z"}

And here is the case where the pod is not in CrashLoopBackOff / Error but in a different failed state, and filebeat_up is correctly evaluated as 0:

{"level":"error","message":"Failed getting /stats endpoint of target: Get http://localhost:5066/stats: dial tcp 127.0.0.1:5066: connect: connection refused","time":"2021-01-05T09:59:04Z"}
{"level":"error","message":"Could not fetch stats endpoint of target: http://localhost:5066","time":"2021-01-05T09:59:25Z"}
{"level":"error","message":"Failed getting /stats endpoint of target: Get http://localhost:5066/stats: dial tcp 127.0.0.1:5066: connect: connection refused","time":"2021-01-05T09:59:25Z"}
{"level":"error","message":"Could not fetch stats endpoint of target: http://localhost:5066","time":"2021-01-05T09:59:34Z"}

The issue here is that in the first case ☝️ your beat never reached the "ready" state, so beat-exporter doesn't know what type of beat to expect. The second case looks like the beat crashed after previously being healthy: beat-exporter managed to get the beat type and initialize itself against it, and it then returns 0 once the beat has crashed and is no longer reachable.

I'm referring to the initialization loop at https://github.com/trustpilot/beat-exporter/blob/master/main.go#L93. In the first case beat-exporter is stuck in this loop; in the second case it is past that loop and in the main "proxy" loop.
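
To make the two phases concrete, here is a minimal, self-contained sketch of the behavior described above. It is not the actual beat-exporter code: `probe`, `waitForBeatType`, `upValue` and `errRefused` are hypothetical names invented for illustration, the probe stands in for the HTTP call to http://localhost:5066, and the retry loop is bounded here only so the example terminates (the logs above suggest the real loop retries every 1s).

```go
package main

import (
	"errors"
	"fmt"
)

// errRefused is a hypothetical stand-in for the "connection refused" error in the logs.
var errRefused = errors.New("connection refused")

// probe is a hypothetical stand-in for an HTTP GET against the beat's
// monitoring endpoint; it returns the beat type (e.g. "filebeat") on success.
type probe func() (string, error)

// waitForBeatType mimics the initialization phase: it retries until the beat
// reports its type. Until it succeeds, the exporter does not know which beat
// it is fronting, so no collector is registered and no *_up metric exists.
// Assumption: bounded by maxAttempts only to keep the sketch finite.
func waitForBeatType(p probe, maxAttempts int) (string, error) {
	var lastErr error
	for i := 0; i < maxAttempts; i++ {
		beatType, err := p()
		if err == nil {
			return beatType, nil
		}
		lastErr = err
	}
	return "", fmt.Errorf("beat type unknown after %d attempts: %w", maxAttempts, lastErr)
}

// upValue mimics the scrape phase, which is only reached after initialization:
// an unreachable beat is reported as 0, a reachable one as 1.
func upValue(p probe) float64 {
	if _, err := p(); err != nil {
		return 0
	}
	return 1
}

func main() {
	down := func() (string, error) { return "", errRefused }
	healthy := func() (string, error) { return "filebeat", nil }

	// Case 1: the beat never became ready, so initialization never completes.
	if _, err := waitForBeatType(down, 3); err != nil {
		fmt.Println("no metrics registered:", err)
	}

	// Case 2: the beat was healthy at startup and crashed later.
	if beatType, err := waitForBeatType(healthy, 3); err == nil {
		fmt.Printf("%s_up = %v\n", beatType, upValue(down))
	}
}
```

In this sketch the metric can only be emitted once `waitForBeatType` has returned a type, which is why a beat that never becomes ready yields an absent metric rather than a 0.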