cloudfoundry/bosh

Resurrector not resurrecting unresponsive agent.

simonjohansson opened this issue · 7 comments

Describe the bug
If I delete a VM in GCP it will be recreated. However, if an agent goes into the "unresponsive agent" state, nothing happens (I've waited up to 20 minutes) and I have to manually run bosh cck to recreate the VM.

To Reproduce

  1. Deploy bosh with
...
hm:
  resurrector_enabled: true
  2. Upload the resurrection config and run update-resurrection for good measure (a quick way to verify this is sketched right after these steps)
$ cat bosh-resurrection.yml
rules:
- enabled: true
$ bosh update-config --type resurrection --name resurrect-all bosh-resurrection.yml
$ bosh update-resurrection on
  3. Put an instance in a bad state
$ bosh vms
...
main/e81f2df2-6c3c-487f-9094-d5d4383092ae                         running        z1  10.80.205.35
....
$ bosh ssh main/e81f2df2-6c3c-487f-9094-d5d4383092ae -d concourse
main/e8... $ sudo rm /var/vcap/bosh/bin/monit-actual
main/e8... $ pgrep monit-actual | sudo xargs kill
main/e8... $ ^D
$ bosh vms
...
main/e81f2df2-6c3c-487f-9094-d5d4383092ae                         unresponsive agent
...

At this point I would expect that hm runs, figures out that the instance is borked, and recreates it.

  4. Wait a while...
$ sleep 600
$ bosh tasks -a --recent=1000 | egrep "hm|scan"
# nothing.. :(
  5. Check the logs.
$ ssh jumpbox@director_ip
bosh $ cat /var/vcap/sys/log/health_monitor/health_monitor.log
...
I, [2023-09-19T11:53:53.501143 #7]  INFO : Managing 1 deployment, 31 agents
I, [2023-09-19T11:53:53.501253 #7]  INFO : Agent heartbeats received = 8669153
I, [2023-09-19T11:53:53.512012 #7]  INFO : Found deployment 'concourse'
I, [2023-09-19T11:53:53.639371 #7]  INFO : Adding agent 1d07d230-67a4-4dd8-8d67-64ae51114ab9 (main/e81f2df2-6c3c-487f-9094-d5d4383092ae) to concourse...
....
I, [2023-09-19T11:54:01.701086 #7]  INFO : Resurrection config remains the same
I, [2023-09-19T11:54:22.735327 #7]  INFO : Analyzing agents...
I, [2023-09-19T11:54:22.735442 #7]  INFO : Skipping analyzing agents for locked deployment concourse
I, [2023-09-19T11:54:22.735482 #7]  INFO : Analyzed 0 agents, took 3.9149e-05 seconds
I, [2023-09-19T11:54:22.735549 #7]  INFO : Analyzing instances...
I, [2023-09-19T11:54:22.735574 #7]  INFO : Skipping analyzing instances for locked deployment concourse
I, [2023-09-19T11:54:22.735595 #7]  INFO : Analyzed 0 instances, took 1.935e-05 seconds
I, [2023-09-19T11:54:25.426598 #7]  INFO : [ALERT] Alert @ 2023-09-19 11:54:25 UTC, severity 1: process is not running
I, [2023-09-19T11:54:25.426780 #7]  INFO : (Event logger) notifying director about event: Alert @ 2023-09-19 11:54:25 UTC, severity 1: process is not running
...
bosh $ sudo grep "main/e81f2df2-6c3c-487f-9094-d5d4383092ae" /var/vcap/sys/log/director/* | grep "2023-09-19T11:"
/var/vcap/sys/log/director/audit_worker_1.log:I, [2023-09-19T11:47:05.115477 #1683124] [task:2094629]  INFO -- DirectorAudit: {"id":31711,"parent_id":null,"user":"admin","timestamp":"2023-09-19 11:47:04 UTC","action":"cleanup ssh","object_type":"instance","object_name":"main/e81f2df2-6c3c-487f-9094-d5d4383092ae","error":"Timed out sending 'ssh' to instance: 'main/e81f2df2-6c3c-487f-9094-d5d4383092ae', agent-id: '1d07d230-67a4-4dd8-8d67-64ae51114ab9' after 45 seconds","task":"2094629","deployment":"concourse","instance":"main/e81f2df2-6c3c-487f-9094-d5d4383092ae","context_json":"{\"user\":\"^bosh_2a9f984570e945b\"}"}
/var/vcap/sys/log/director/audit_worker_2.log:I, [2023-09-19T11:43:40.733399 #1682701] [task:2094617]  INFO -- DirectorAudit: {"id":31710,"parent_id":null,"user":"admin","timestamp":"2023-09-19 11:43:40 UTC","action":"setup ssh","object_type":"instance","object_name":"main/e81f2df2-6c3c-487f-9094-d5d4383092ae","error":null,"task":"2094617","deployment":"concourse","instance":"main/e81f2df2-6c3c-487f-9094-d5d4383092ae","context_json":"{\"user\":\"bosh_2a9f984570e945b\"}"}
...
  6. Run a cck to fix my problem
$ bosh cck -d concourse
...
Task 2094704 | 12:03:28 | Scanning 30 VMs: Checking VM states (00:00:11)
Task 2094704 | 12:03:39 | Scanning 30 VMs: 29 OK, 1 unresponsive, 0 missing, 0 unbound (00:00:00)
...
#   Type                Description
27  unresponsive_agent  VM for 'main/e81f2df2-6c3c-487f-9094-d5d4383092ae (0)' with cloud ID 'vm-1e1dbef2-b977-4aad-442b-9a54d350673d' is not responding.

1 problems

1: Skip for now
2: Reboot VM
3: Recreate VM without waiting for processes to start
4: Recreate VM and wait for processes to start
5: Delete VM
6: Delete VM reference (forceful; may need to manually delete VM from the Cloud to avoid IP conflicts)
VM for 'main/e81f2df2-6c3c-487f-9094-d5d4383092ae (0)' with cloud ID 'vm-1e1dbef2-b977-4aad-442b-9a54d350673d' is not responding. (1): 3

Continue? [yN]: y


Task 2094707

Task 2094707 | 12:03:50 | Applying problem resolutions: VM for 'main/e81f2df2-6c3c-487f-9094-d5d4383092ae (0)' with cloud ID 'vm-1e1dbef2-b977-4aad-442b-9a54d350673d' is not responding. (unresponsive_agent 1): Recreate VM without waiting for processes to start
  7. All is well again.
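
As a quick sanity check on step 2 (referenced there), the applied resurrection config can be read back from the director. A minimal check, assuming a CLI recent enough to support generic configs:
$ bosh configs --type=resurrection
$ bosh config --type=resurrection --name=resurrect-all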

Versions (please complete the following information):

  • Infrastructure: GCP
  • BOSH version: 277.3.0

Killing monit does not kill the bosh agent process. Try these steps instead: https://github.com/cloudfoundry/os-conf-release/blob/master/jobs/disable_agent/templates/post-start#L9-L13
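
For context, a minimal sketch of what "stopping the agent" can look like on a Linux stemcell, assuming the bosh-agent is managed by runit under the service name agent (the linked post-start script is the authoritative version):
main/e8... $ sudo sv stop agent     # assumption: the runit service is named 'agent'
main/e8... $ sudo sv status agent   # should now report the service as down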

@rkoster I'm confused as to why a manual bosh cck fixes the VM while the resurrector/hm does not in this instance. We've had instances where the 5m load average has shot up to ~500 and the VM/agent becomes unresponsive, but bosh does nothing.

In those cases I can either delete the VM from GCP or do a bosh cck and have it sorted for me. I would rather not have to manually delete VMs or cck in those instances.

The health monitor is not supposed to act on "I, [2023-09-19T11:54:25.426598 #7] INFO : [ALERT] Alert @ 2023-09-19 11:54:25 UTC, severity 1: process is not running"; monit is responsible for restarting processes.
The problem in this experiment is that you killed monit and removed its binary, so it cannot be restarted. The agent will keep sending heartbeats as normal.

When invoking cck, the director sends a get_state message via NATS and waits for a response. The agent calls out to monit, which no longer exists, and is probably not able to respond in time, causing the agent to be shown as unresponsive.
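
As an aside (not from the thread itself): to see what that get_state scan reports without being prompted to resolve anything, the CLI has a report-only mode, assuming your version supports the --report flag:
$ bosh cck -d concourse --report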

In short, the agent has not been optimized to deal with monit being unavailable for an extended period, because that should never happen under normal operation.

I would recommend running the experiment again, but this time stopping the agent instead of deleting monit.
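
If the experiment is re-run that way, heartbeats should stop and the resurrector should kick off its own scan-and-fix task; assuming those tasks show up under a "scan and fix" description, a grep like the one in step 4 should eventually find one:
$ bosh tasks -a --recent=1000 | grep -i "scan and fix"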

@rkoster sure, it was just a contrived example. But how should BOSH behave when it reports the instance as having an unresponsive agent, the last known load is sky high, and I fail to SSH into the instance?

I would expect BOSH to recreate the instance for me, and I seem to recall this was the case before we upgraded past v267.2.0. The docs[1] also suggest that BOSH will "replace a VM if the Agent on that VM is not responding to commands". If I'm unable to bosh ssh, would that not be equivalent to not responding to commands?

Thanks!

Edit:
I'm having a hard time replicating the high-load/unresponsive-agent situation. It seems to be related to specific Concourse pipelines running specific dind compose integration tests in specific combinations, with some xvfb + chrome sprinkled on top. You know how it is..

[1] https://bosh.io/docs/resurrector/

According to @jpalermo at last week's (or the week before's) working group meeting, the director has a shorter timeout when doing a get_state than the health_monitor does. This means that bosh vms might show agents as unresponsive while the health_monitor is still receiving heartbeats from those agents.

To diagnose this problem, you should focus on the health monitor logs, which tell you about the heartbeats it receives.
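
For example, reusing the agent ID and log path from the report above, something along these lines shows whether the health monitor still tracks the agent and whether the aggregate heartbeat counter keeps climbing (individual heartbeats may only appear at debug log level):
bosh $ sudo grep "1d07d230-67a4-4dd8-8d67-64ae51114ab9" /var/vcap/sys/log/health_monitor/health_monitor.log | tail -5
bosh $ sudo grep "Agent heartbeats received" /var/vcap/sys/log/health_monitor/health_monitor.log | tail -5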

Yeah, the docs are somewhat correct, but not 100% clear. Really, bosh recreates a VM when the agent on that VM stops sending heartbeats. Heartbeats are sent every 45 seconds or so and require very few resources to produce.

It's possible for a VM to be able to send heartbeats but still be failing in other ways. Bosh has no current functionality to detect and address situations like this, mostly because it's an impossible problem to solve accurately: a high resource load may be fine for one job but a sign of a problem for another.

Ultimately, the health monitor and resurrector are designed to address VMs that have gone into a bad state due to some IaaS failure, not because of load problems with their workload.
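
For the load-based case described earlier in the thread, one operational option (a sketch of a manual workflow, not a resurrector feature) is to watch vitals, which include per-VM load averages, and decide yourself when a cck is warranted:
$ bosh -d concourse vms --vitals   # load averages per VM
$ bosh -d concourse cck            # resolve interactively, as in the repro above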

beyhan commented

Closing because the question has been clarified; please reopen if you think there is still an open issue.