Pod rescheduling fails to recover the unit.
zmraul commented
Steps to reproduce
juju add-model zk
juju deploy zookeeper-k8s --channel 3/edge -n 3
# wait for idle
kubectl delete pod zookeeper-k8s-0 -n zk
Expected behavior
The deleted pod rejoins the cluster without failures.
Actual behavior
The unit goes into error:
Unit              Workload  Agent  Address       Ports  Message
zookeeper-k8s/0   error     idle   10.1.146.158         hook failed: "restart-relation-changed"
zookeeper-k8s/1*  active    idle   10.1.146.178
zookeeper-k8s/2   active    idle   10.1.146.162
Log output
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/./src/charm.py", line 471, in <module>
main(ZooKeeperK8sCharm)
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/venv/ops/main.py", line 441, in main
_emit_charm_event(charm, dispatcher.event_name)
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
event_to_emit.emit(*args, **kwargs)
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/venv/ops/framework.py", line 354, in emit
framework._emit(event)
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/venv/ops/framework.py", line 830, in _emit
self._reemit(event_path)
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/venv/ops/framework.py", line 919, in _reemit
custom_handler(event)
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/lib/charms/rolling_ops/v0/rollingops.py", line 327, in _on_relation_changed
self.charm.on[self.name].run_with_lock.emit()
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/venv/ops/framework.py", line 354, in emit
framework._emit(event)
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/venv/ops/framework.py", line 830, in _emit
self._reemit(event_path)
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/venv/ops/framework.py", line 919, in _reemit
custom_handler(event)
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/lib/charms/rolling_ops/v0/rollingops.py", line 385, in _on_run_with_lock
self._callback(event)
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/./src/charm.py", line 201, in _restart
self.container.restart(CONTAINER)
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/venv/ops/model.py", line 1902, in restart
self._pebble.start_services(service_names)
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/venv/ops/pebble.py", line 1598, in start_services
return self._services_action('start', services, timeout, delay)
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/venv/ops/pebble.py", line 1654, in _services_action
resp = self._request('POST', '/v1/services', body=body)
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/venv/ops/pebble.py", line 1458, in _request
response = self._request_raw(method, path, query, headers, data)
File "/var/lib/juju/agents/unit-zookeeper-k8s-0/charm/venv/ops/pebble.py", line 1502, in _request_raw
raise APIError(body, code, status, message)
ops.pebble.APIError: cannot start services: service "zookeeper" does not exist
Additional information
This happens because the unit's peer relation data still has its state flag set to started:
│ unit data │ ╭─ zookeeper-k8s/zookeeper-k8s/0 ─╮
│ │ │ │
│ │ │ quorum default - non-ssl │
│ │ │ state started │
│ │ ╰─────────────────────────────────╯
When events are then triggered, the cluster looks healthy from the charm's point of view, so the charm never re-creates the Pebble layer in the fresh container, and the subsequent restart fails because the "zookeeper" service was never defined there.
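To make the gap concrete, here is a defensive variant of the restart callback, sketched against the ops Pebble API (not the charm's actual code; _zookeeper_layer is a hypothetical helper returning the charm's layer definition):

def _restart(self, event):
    container = self.unit.get_container("zookeeper")
    # After a reschedule the workload container is brand new, so the
    # "zookeeper" service only exists if a layer was pushed to this
    # particular Pebble instance.
    if "zookeeper" not in container.get_plan().services:
        # Hypothetical helper returning the charm's Pebble layer definition.
        container.add_layer("zookeeper", self._zookeeper_layer(), combine=True)
    container.restart("zookeeper")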
A minimal fix looks something like:
def _on_upgrade(self, event):
    # Clear the flags so the charm's start logic runs again and
    # re-creates the Pebble layer instead of assuming the unit is started.
    self.unit_peer_data.update({"state": "", "quorum": ""})
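For context, a sketch of how that handler could be wired up; which event to observe is an assumption here (the handler's name suggests upgrade-charm, and the container's pebble-ready event, which fires again once the rescheduled pod's Pebble comes up, is another candidate for the same reset):

from ops.charm import CharmBase

class ZooKeeperK8sCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # Assumed wiring, inferred from the handler's name.
        self.framework.observe(self.on.upgrade_charm, self._on_upgrade)

    @property
    def unit_peer_data(self):
        # Hypothetical accessor; "cluster" stands in for the charm's
        # actual peer relation name.
        return self.model.get_relation("cluster").data[self.unit]

    def _on_upgrade(self, event):
        self.unit_peer_data.update({"state": "", "quorum": ""})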