buildkite/buildkite-agent-metrics

Bug? prometheus metrics around agent counts are mistaken

petemounce opened this issue · 6 comments

Summary

I'm running buildkite-metrics to export prometheus metrics.

I expected buildkite_queues_total_agent_count to mean "agents currently alive and able to service a given queue", but instead it appears to mean "agents seen at some point that could service that queue".
I expected buildkite_queues_idle_agent_count to mean "agents currently idle and able to service a given queue", but instead it also appears to mean "agents seen at some point that could service that queue".

I'm currently running 40x agents across 5x nodes, with 3x unique queues (v-a2e4f94d1f6ad97f-s---u-1526206078, v-a70385275979e38b-----u-1525350118, v-d7ec781b032306fb-------1525946537). I think the metrics for every queue other than the ones I currently have agents running for should be 0.

Observed

# HELP buildkite_queues_idle_agent_count Buildkite Queues: IdleAgentCount
# TYPE buildkite_queues_idle_agent_count gauge
buildkite_queues_idle_agent_count{queue="v-0a61fa7cc9b4f591-----u-1525958964"} 8
buildkite_queues_idle_agent_count{queue="v-0f372a98763ad513-----u-1526174313"} 16
buildkite_queues_idle_agent_count{queue="v-20945a8a78dc170d-----u-1526046098"} 8
buildkite_queues_idle_agent_count{queue="v-5fe71c3ead68152f-----u-1525803178"} 8
buildkite_queues_idle_agent_count{queue="v-6ced0e2c9da7d0de-----u-1525866697"} 8
buildkite_queues_idle_agent_count{queue="v-994e5333d056799b-------1525944473"} 8
buildkite_queues_idle_agent_count{queue="v-a2e4f94d1f6ad97f---m-u-1526204764"} 8
buildkite_queues_idle_agent_count{queue="v-a2e4f94d1f6ad97f---m-u-1526205438"} 8
buildkite_queues_idle_agent_count{queue="v-a2e4f94d1f6ad97f-s---u-1526206078"} 20
buildkite_queues_idle_agent_count{queue="v-a70385275979e38b-----u-1525350118"} 8
buildkite_queues_idle_agent_count{queue="v-b00f1734381cecdd-----u-1525956753"} 16
buildkite_queues_idle_agent_count{queue="v-cee2dd74a40bfdc6-----u-1525963268"} 8
buildkite_queues_idle_agent_count{queue="v-d751547d853ce5b9-----u-1526044727"} 8
buildkite_queues_idle_agent_count{queue="v-d7ec781b032306fb-------1525946537"} 8
buildkite_queues_idle_agent_count{queue="v-e533a9b4a7b2fbbe-----u-1525971036"} 8
buildkite_queues_idle_agent_count{queue="v-f319742435b331ff-----u-1525804355"} 8
# HELP buildkite_queues_total_agent_count Buildkite Queues: TotalAgentCount
# TYPE buildkite_queues_total_agent_count gauge
buildkite_queues_total_agent_count{queue="v-0a61fa7cc9b4f591-----u-1525958964"} 8
buildkite_queues_total_agent_count{queue="v-0f372a98763ad513-----u-1526174313"} 16
buildkite_queues_total_agent_count{queue="v-20945a8a78dc170d-----u-1526046098"} 8
buildkite_queues_total_agent_count{queue="v-5fe71c3ead68152f-----u-1525803178"} 8
buildkite_queues_total_agent_count{queue="v-6ced0e2c9da7d0de-----u-1525866697"} 8
buildkite_queues_total_agent_count{queue="v-994e5333d056799b-------1525944473"} 8
buildkite_queues_total_agent_count{queue="v-a2e4f94d1f6ad97f---m-u-1526204764"} 8
buildkite_queues_total_agent_count{queue="v-a2e4f94d1f6ad97f---m-u-1526205438"} 8
buildkite_queues_total_agent_count{queue="v-a2e4f94d1f6ad97f-s---u-1526206078"} 24
buildkite_queues_total_agent_count{queue="v-a70385275979e38b-----u-1525350118"} 8
buildkite_queues_total_agent_count{queue="v-b00f1734381cecdd-----u-1525956753"} 16
buildkite_queues_total_agent_count{queue="v-cee2dd74a40bfdc6-----u-1525963268"} 8
buildkite_queues_total_agent_count{queue="v-d751547d853ce5b9-----u-1526044727"} 8
buildkite_queues_total_agent_count{queue="v-d7ec781b032306fb-------1525946537"} 8
buildkite_queues_total_agent_count{queue="v-e533a9b4a7b2fbbe-----u-1525971036"} 8
buildkite_queues_total_agent_count{queue="v-f319742435b331ff-----u-1525804355"} 8

Expected

# HELP buildkite_queues_idle_agent_count Buildkite Queues: IdleAgentCount
# TYPE buildkite_queues_idle_agent_count gauge
buildkite_queues_idle_agent_count{queue="v-0a61fa7cc9b4f591-----u-1525958964"} 0
buildkite_queues_idle_agent_count{queue="v-0f372a98763ad513-----u-1526174313"} 0
buildkite_queues_idle_agent_count{queue="v-20945a8a78dc170d-----u-1526046098"} 0
buildkite_queues_idle_agent_count{queue="v-5fe71c3ead68152f-----u-1525803178"} 0
buildkite_queues_idle_agent_count{queue="v-6ced0e2c9da7d0de-----u-1525866697"} 0
buildkite_queues_idle_agent_count{queue="v-994e5333d056799b-------1525944473"} 0
buildkite_queues_idle_agent_count{queue="v-a2e4f94d1f6ad97f---m-u-1526204764"} 0
buildkite_queues_idle_agent_count{queue="v-a2e4f94d1f6ad97f---m-u-1526205438"} 0
buildkite_queues_idle_agent_count{queue="v-a2e4f94d1f6ad97f-s---u-1526206078"} 20
buildkite_queues_idle_agent_count{queue="v-a70385275979e38b-----u-1525350118"} 8
buildkite_queues_idle_agent_count{queue="v-b00f1734381cecdd-----u-1525956753"} 0
buildkite_queues_idle_agent_count{queue="v-cee2dd74a40bfdc6-----u-1525963268"} 0
buildkite_queues_idle_agent_count{queue="v-d751547d853ce5b9-----u-1526044727"} 0
buildkite_queues_idle_agent_count{queue="v-d7ec781b032306fb-------1525946537"} 8
buildkite_queues_idle_agent_count{queue="v-e533a9b4a7b2fbbe-----u-1525971036"} 0
buildkite_queues_idle_agent_count{queue="v-f319742435b331ff-----u-1525804355"} 0
# HELP buildkite_queues_total_agent_count Buildkite Queues: TotalAgentCount
# TYPE buildkite_queues_total_agent_count gauge
buildkite_queues_total_agent_count{queue="v-0a61fa7cc9b4f591-----u-1525958964"} 0
buildkite_queues_total_agent_count{queue="v-0f372a98763ad513-----u-1526174313"} 0
buildkite_queues_total_agent_count{queue="v-20945a8a78dc170d-----u-1526046098"} 0
buildkite_queues_total_agent_count{queue="v-5fe71c3ead68152f-----u-1525803178"} 0
buildkite_queues_total_agent_count{queue="v-6ced0e2c9da7d0de-----u-1525866697"} 0
buildkite_queues_total_agent_count{queue="v-994e5333d056799b-------1525944473"} 0
buildkite_queues_total_agent_count{queue="v-a2e4f94d1f6ad97f---m-u-1526204764"} 0
buildkite_queues_total_agent_count{queue="v-a2e4f94d1f6ad97f---m-u-1526205438"} 0
buildkite_queues_total_agent_count{queue="v-a2e4f94d1f6ad97f-s---u-1526206078"} 24
buildkite_queues_total_agent_count{queue="v-a70385275979e38b-----u-1525350118"} 8
buildkite_queues_total_agent_count{queue="v-b00f1734381cecdd-----u-1525956753"} 0
buildkite_queues_total_agent_count{queue="v-cee2dd74a40bfdc6-----u-1525963268"} 0
buildkite_queues_total_agent_count{queue="v-d751547d853ce5b9-----u-1526044727"} 0
buildkite_queues_total_agent_count{queue="v-d7ec781b032306fb-------1525946537"} 8
buildkite_queues_total_agent_count{queue="v-e533a9b4a7b2fbbe-----u-1525971036"} 0
buildkite_queues_total_agent_count{queue="v-f319742435b331ff-----u-1525804355"} 0
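
A quick way to sanity-check output like this is to sum the per-queue values and compare against the number of agents actually running. The sketch below is only an illustration: `sum_by_metric` is a hypothetical helper, and the regular expression handles just the single-label `queue="..."` samples shown in this issue, not the full Prometheus text format. With 40 agents running, the Expected totals add up to 40, whereas the Observed totals above add up to 160.

```python
import re
from collections import defaultdict

# Matches simple samples of the form: name{queue="..."} value
# (only the single-label shape that appears in this issue).
SAMPLE_RE = re.compile(r'^(\w+)\{queue="([^"]*)"\}\s+(\S+)$')


def sum_by_metric(exposition: str) -> dict:
    """Sum sample values per metric name, skipping # HELP / # TYPE lines."""
    totals = defaultdict(float)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, _queue, value = match.groups()
            totals[name] += float(value)
    return dict(totals)


# The three non-zero rows from the Expected output, plus one zeroed queue.
EXPECTED_SAMPLE = """\
# TYPE buildkite_queues_total_agent_count gauge
buildkite_queues_total_agent_count{queue="v-0a61fa7cc9b4f591-----u-1525958964"} 0
buildkite_queues_total_agent_count{queue="v-a2e4f94d1f6ad97f-s---u-1526206078"} 24
buildkite_queues_total_agent_count{queue="v-a70385275979e38b-----u-1525350118"} 8
buildkite_queues_total_agent_count{queue="v-d7ec781b032306fb-------1525946537"} 8
"""

if __name__ == "__main__":
    print(sum_by_metric(EXPECTED_SAMPLE))
```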

Other

It would be convenient to export a metric that is the sha / tag that buildkite-metrics binary was built from. This would make it easy to check. As it is, I'm running the v3.0.0 release and know that via my dockerfile.
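
One common Prometheus convention for this is a constant gauge with value 1 whose labels carry the version details. The sketch below only renders the text such a metric might produce; the metric name `buildkite_metrics_build_info`, the label names, and the commit value are all hypothetical, not something buildkite-metrics exports today.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline for the Prometheus text format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def build_info_lines(metric: str, version: str, commit: str) -> list:
    """Render a constant gauge (value 1) whose labels identify the build."""
    labels = 'version="%s",commit="%s"' % (
        escape_label_value(version),
        escape_label_value(commit),
    )
    return [
        "# HELP %s Build information for the running binary" % metric,
        "# TYPE %s gauge" % metric,
        "%s{%s} 1" % (metric, labels),
    ]


if __name__ == "__main__":
    # "abc1234" is a placeholder commit, not a real sha from the repository.
    for line in build_info_lines("buildkite_metrics_build_info", "3.0.0", "abc1234"):
        print(line)
```

Putting the version in labels rather than in the value keeps the sample numeric and makes it easy to join against other series in queries.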

lox commented

Sorry for the slow reply, was reminded of this by a support ticket. We're investigating!

@petemounce do you know if the numbers eventually show up correctly? We're wondering whether your agents are being terminated with a SIGKILL or similar, which doesn't give the agent a chance to de-register with the Agent API, so the metrics won't accurately represent "available agents". Agents are eventually marked as lost, but unfortunately we can't do that within a very short time frame because of possible network failures, downtime, etc.
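
To illustrate the difference between the two termination paths, here is a minimal sketch assuming a POSIX system. The `Agent` class and its `deregister` method are stand-ins invented for the example, not the real agent or Agent API. A process can install a handler for SIGTERM and clean up, but the operating system does not allow SIGKILL to be trapped at all.

```python
import os
import signal


class Agent:
    """Stand-in for an agent process; deregister() is a placeholder, not a real API call."""

    def __init__(self):
        self.registered = True

    def deregister(self):
        self.registered = False


def install_graceful_shutdown(agent: Agent) -> None:
    """On SIGTERM, give the agent a chance to de-register before exiting."""

    def handle_sigterm(signum, frame):
        agent.deregister()

    signal.signal(signal.SIGTERM, handle_sigterm)


def can_trap_sigkill() -> bool:
    """Try to install a SIGKILL handler; the OS rejects this on POSIX systems."""
    try:
        signal.signal(signal.SIGKILL, lambda signum, frame: None)
    except (OSError, ValueError):
        return False
    return True


if __name__ == "__main__":
    agent = Agent()
    install_graceful_shutdown(agent)
    os.kill(os.getpid(), signal.SIGTERM)  # simulate a polite shutdown request
    print("registered after SIGTERM:", agent.registered)
    print("SIGKILL can be trapped:", can_trap_sigkill())
```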

lox commented

Ok, I can see the bug. It looks like we need to be sending a zero value for previous gauge entries that aren't seen in the current collector results.
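
The idea can be sketched roughly as follows. This is not the actual change in buildkite-metrics, just an illustration under the assumption that each collection yields a mapping of queue name to value; `GaugeTracker` and `update` are hypothetical names.

```python
class GaugeTracker:
    """Remembers which queues were exported so stale ones can be reset to 0."""

    def __init__(self):
        self._seen = set()

    def update(self, current: dict) -> dict:
        """Return the values to export: current results, plus 0 for any
        previously seen queue that is missing from the current results."""
        exported = dict(current)
        for queue in self._seen - set(current):
            exported[queue] = 0
        self._seen.update(current)
        return exported


if __name__ == "__main__":
    tracker = GaugeTracker()
    # First collection: two queues have agents.
    print(tracker.update({
        "v-a70385275979e38b-----u-1525350118": 8,
        "v-b00f1734381cecdd-----u-1525956753": 16,
    }))
    # Second collection: one queue has disappeared, so it is exported as 0.
    print(tracker.update({"v-a70385275979e38b-----u-1525350118": 8}))
```

In this sketch a queue that has gone away keeps being exported as 0 indefinitely, which matches the Expected output in the issue. Removing the series after some time would be an alternative design.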

lox commented

> It would be convenient to export a metric that is the sha / tag that buildkite-metrics binary was built from. This would make it easy to check. As it is, I'm running the v3.0.0 release and know that via my dockerfile.

That's a good idea @petemounce!

petemounce commented

@toolmantim it's very possible agents are terminating via SIGKILL or similar - the most likely case is that when the GCE instance group is scaled down, there's no grace period to allow termination of the agents.

The instance group is not part of an HTTP backend service, so as I understand the docs there's no opportunity for connection draining. That means I don't think I can get a grace period plus an event hook to terminate the agent processes gracefully that way.

Since posting this ticket, the number of queues has reduced from that shown to just 2, so I think some reaping is going on at some stage at your end. The brutal disconnections causing this are understandable.

lox commented

Fixed with #45, will release in 3.0.1