canonical/mysql-router-operator

Install hook failure due to snap charmed-mysql already installed

wolsen opened this issue · 9 comments

wolsen commented

Please note that I am submitting this bug report after analyzing the code and viewing log snippets from an issue reported against the slurm-bundles repository. I do not have the full log of the recreate steps, but reading the code is enough to conjecture this sequence of events, and I have been able to confirm it in a contrived recreate scenario. The scenario is included below.

Steps to reproduce

  1. juju deploy --base ubuntu@22.04 slurmd --channel edge
  2. juju deploy --base ubuntu@22.04 mysql-router --channel dpe/edge
  3. juju relate slurmd mysql-router
  4. juju debug-hooks mysql-router/0 install
  5. Raise an exception at the end of snap.py's install function (i.e., do not exit cleanly). I used:
    raise Exception("Intentionally forcing a failure here")
  6. exit the debug-hooks session with a non-zero exit code
  7. juju resolved mysql-router/0

You will note that the retried hook then fails with Exception: Multiple charmed-mysql snap installs not supported on one machine.
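
For context, here is a minimal sketch of the guard in src/snap.py that produces this error. Only _SNAP_NAME, _REVISION, the hold() call, and the exception message are confirmed by the logs and traceback below; the SnapCache usage via charms.operator_libs_linux.v2.snap is an assumption:

# Hedged reconstruction of the non-idempotent guard in src/snap.py.
from charms.operator_libs_linux.v2 import snap as snap_lib

_SNAP_NAME = "charmed-mysql"
_REVISION = "69"

def install(*, unit) -> None:
    charmed_mysql = snap_lib.SnapCache()[_SNAP_NAME]
    if charmed_mysql.present:
        # Raised even when this same charm installed the snap on a
        # previous (failed) run of the install hook, so a retry of the
        # hook can never succeed.
        raise Exception(f"Multiple {_SNAP_NAME} snap installs not supported on one machine")
    charmed_mysql.ensure(snap_lib.SnapState.Present, revision=_REVISION)
    charmed_mysql.hold()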

Expected behavior

The install hook should be re-runnable after an initial failure, once the error has been resolved.

Actual behavior

The install hook fails with Exception: Multiple charmed-mysql snap installs not supported on one machine

Versions

Operating system: Ubuntu 22.04.3 LTS

Juju CLI: 3.1.6-genericlinux-amd64

Juju agent: 3.3-beta1

mysql charm revision: n/a
mysql-router charm revision: dpe/edge revision 119

LXD: 5.19

Log output

Juju debug log:

unit-mysql-router-0: 16:52:33 INFO juju Starting unit workers for "mysql-router/0"
unit-mysql-router-0: 16:52:33 INFO juju.worker.apicaller [9b6cfe] "unit-mysql-router-0" successfully connected to "10.167.81.6:17070"
unit-mysql-router-0: 16:52:33 INFO juju.worker.apicaller [9b6cfe] password changed for "unit-mysql-router-0"
unit-mysql-router-0: 16:52:33 INFO juju.worker.apicaller [9b6cfe] "unit-mysql-router-0" successfully connected to "10.167.81.6:17070"
unit-mysql-router-0: 16:52:33 INFO juju.worker.upgrader no waiter, upgrader is done
unit-mysql-router-0: 16:52:33 INFO juju.worker.migrationminion migration phase is now: NONE
unit-mysql-router-0: 16:52:33 INFO juju.worker.logger logger worker started
unit-mysql-router-0: 16:52:33 ERROR juju.worker.meterstatus error running "meter-status-changed": charm missing from disk
unit-mysql-router-0: 16:52:33 INFO juju.worker.uniter unit "mysql-router/0" started
unit-mysql-router-0: 16:52:33 INFO juju.worker.uniter resuming charm install
unit-mysql-router-0: 16:52:33 INFO juju.worker.uniter.charm downloading ch:amd64/jammy/mysql-router-119 from API server
unit-mysql-router-0: 16:52:36 INFO juju.worker.uniter hooks are retried true
unit-mysql-router-0: 16:52:36 INFO juju.worker.uniter.storage initial storage attachments ready
unit-mysql-router-0: 16:52:36 INFO juju.worker.uniter found queued "install" hook
unit-mysql-router-0: 16:52:36 INFO juju.worker.uniter.runner executing install via debug-hooks; hook dispatching script: dispatch
unit-mysql-router-0: 16:53:39 DEBUG unit.mysql-router/0.juju-log ops 2.6.0 up and running.
unit-mysql-router-0: 16:53:39 INFO unit.mysql-router/0.juju-log Running legacy hooks/install.
unit-mysql-router-0: 16:53:40 DEBUG unit.mysql-router/0.juju-log ops 2.6.0 up and running.
unit-mysql-router-0: 16:53:40 DEBUG unit.mysql-router/0.juju-log Charm called itself via hooks/install.
unit-mysql-router-0: 16:53:40 DEBUG unit.mysql-router/0.juju-log Legacy hooks/install exited with status 0.
unit-mysql-router-0: 16:53:40 DEBUG unit.mysql-router/0.juju-log Using local storage: not a Kubernetes podspec charm
unit-mysql-router-0: 16:53:40 DEBUG unit.mysql-router/0.juju-log Initializing SQLite local storage: /var/lib/juju/agents/unit-mysql-router-0/charm/.unit-state.db.
unit-mysql-router-0: 16:53:40 DEBUG unit.mysql-router/0.juju-log Emitting Juju event install.
unit-mysql-router-0: 16:53:40 DEBUG unit.mysql-router/0.juju-log Installing _SNAP_NAME='charmed-mysql', _REVISION='69'
unit-mysql-router-0: 16:54:29 ERROR unit.mysql-router/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-router-0/charm/./src/machine_charm.py", line 73, in <module>
    ops.main.main(MachineSubordinateRouterCharm)
  File "/var/lib/juju/agents/unit-mysql-router-0/charm/venv/ops/main.py", line 441, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-mysql-router-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-mysql-router-0/charm/venv/ops/framework.py", line 344, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-mysql-router-0/charm/venv/ops/framework.py", line 841, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-mysql-router-0/charm/venv/ops/framework.py", line 930, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-mysql-router-0/charm/./src/machine_charm.py", line 65, in _on_install
    snap.install(unit=self.unit)
  File "/var/lib/juju/agents/unit-mysql-router-0/charm/src/snap.py", line 49, in install
    raise Exception("Intentionally forcing a failure here.")
Exception: Intentionally forcing a failure here.
unit-mysql-router-0: 16:54:36 ERROR juju.worker.uniter.operation hook "install" (via hook dispatching script: dispatch) failed: exit status 1
unit-mysql-router-0: 16:54:36 INFO juju.worker.uniter awaiting error resolution for "install" hook
unit-mysql-router-0: 16:54:41 INFO juju.worker.uniter awaiting error resolution for "install" hook
unit-mysql-router-0: 16:54:42 DEBUG unit.mysql-router/0.juju-log ops 2.6.0 up and running.
unit-mysql-router-0: 16:54:42 INFO unit.mysql-router/0.juju-log Running legacy hooks/install.
unit-mysql-router-0: 16:54:42 DEBUG unit.mysql-router/0.juju-log ops 2.6.0 up and running.
unit-mysql-router-0: 16:54:42 DEBUG unit.mysql-router/0.juju-log Charm called itself via hooks/install.
unit-mysql-router-0: 16:54:42 DEBUG unit.mysql-router/0.juju-log Legacy hooks/install exited with status 0.
unit-mysql-router-0: 16:54:42 DEBUG unit.mysql-router/0.juju-log Re-emitting deferred event <InstallEvent via MachineSubordinateRouterCharm/on/install[1]>.
unit-mysql-router-0: 16:54:42 ERROR unit.mysql-router/0.juju-log charmed-mysql snap already installed on machine. Installation aborted
unit-mysql-router-0: 16:54:42 ERROR unit.mysql-router/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-router-0/charm/./src/machine_charm.py", line 73, in <module>
    ops.main.main(MachineSubordinateRouterCharm)
  File "/var/lib/juju/agents/unit-mysql-router-0/charm/venv/ops/main.py", line 439, in main
    framework.reemit()
  File "/var/lib/juju/agents/unit-mysql-router-0/charm/venv/ops/framework.py", line 851, in reemit
    self._reemit()
  File "/var/lib/juju/agents/unit-mysql-router-0/charm/venv/ops/framework.py", line 930, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-mysql-router-0/charm/./src/machine_charm.py", line 65, in _on_install
    snap.install(unit=self.unit)
  File "/var/lib/juju/agents/unit-mysql-router-0/charm/src/snap.py", line 30, in install
    raise Exception(f"Multiple {_SNAP_NAME} snap installs not supported on one machine")
Exception: Multiple charmed-mysql snap installs not supported on one machine
unit-mysql-router-0: 16:54:42 ERROR juju.worker.uniter.operation hook "install" (via hook dispatching script: dispatch) failed: exit status 1
unit-mysql-router-0: 16:54:42 INFO juju.worker.uniter awaiting error resolution for "install" hook
unit-mysql-router-0: 16:54:52 INFO juju.worker.uniter awaiting error resolution for "install" hook

Additional context

Provided at top of bug report.

Thank you for the bug report!

Out of curiosity, do you know what error in the install hook caused this issue? Given that the install hook only runs these two commands, I'm wondering what could've failed after the snap successfully installed:

def _on_install(self, _) -> None:
    snap.install(unit=self.unit)
    self.unit.set_workload_version(self.get_workload(event=None).version)

As you've shown, we can improve the design of the snap check. At the same time, I'm wary of adding handling to the charm for unexpected, uncaught exceptions; I'd rather the charm fail early and loudly instead of silently, especially in the install hook, when the workload isn't yet serving clients.

That said, I think we should reconsider the design of the snap check. Deferring to @taurus-forever regarding priority for that

@carlcsaposs-canonical I believe Billy is reporting this real-world case to us: charmed-hpc/slurm-bundles#38 (comment)

Q: why has mysql-router failed? From this point of view, the charm should recover on hook retry (after networking is restored).
(screenshot attached)

Billy, feel free to correct me here.

@taurus-forever what I'm trying to understand is why the install hook failed the first time

@taurus-forever If the install hook fails because of networking, the snap will not be installed—so subsequent retries of the install hook should not fail with this exception

wolsen commented

It's not entirely clear why the install hook failed; I don't have the full log set, and the deployment is no longer around to gather the data. However, the code itself clearly recognizes that installation can fail for various reasons.

Specifically, this loop acknowledges that a timeout may occur during the installation (e.g. low bandwidth, high-iowait periods, etc.). The charm sets maintenance status and then re-raises the exception, which causes the install hook to fail. However, the asynchronous task started by snapd is not stopped by the charm (and I don't think it can be). This means there's a window where the install hook fails but the installation actually succeeds (it's just slow). Juju will re-run the install hook and must complete it before moving on to other hooks in the queue. That re-execution will always fail if the snap completed installation in the background.
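
For illustration, the pattern described above looks roughly like this (a sketch with assumed names and retry count, not the charm's exact code):

# Sketch of the retry-on-timeout pattern; names here are assumptions.
# snapd keeps its install "change" running even after the client call
# times out, so a failed hook does not imply a failed installation.
import ops
from charms.operator_libs_linux.v2 import snap as snap_lib

def install(*, unit: ops.Unit) -> None:
    charmed_mysql = snap_lib.SnapCache()["charmed-mysql"]
    for attempt in range(3):
        try:
            charmed_mysql.ensure(snap_lib.SnapState.Present, revision="69")
            return
        except snap_lib.SnapError:
            # e.g. a client-side timeout: the snapd change keeps running
            # in the background even though this call raised.
            unit.status = ops.MaintenanceStatus("installing snap")
            if attempt == 2:
                raise  # fails the hook; the background install may still finish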

Additionally, the invocation of _snap.hold() can also fail transiently (snapd crash, socket closed, OOM killer, etc.); that will likewise raise an unhandled exception and cause the hook to fail.

The only ways out of this are manual intervention or destroying the application and re-deploying, which is not an ideal experience (but sometimes necessary). In this case, it is actually possible to verify whether the installed snap is the correct version and handle the error condition gracefully.
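
A hedged sketch of that graceful handling, assuming the charms.operator_libs_linux.v2.snap API (the present and revision properties) and the same names as above:

# Sketch: make the install idempotent by checking the installed revision.
from charms.operator_libs_linux.v2 import snap as snap_lib

_SNAP_NAME = "charmed-mysql"
_REVISION = "69"

def install(*, unit) -> None:
    charmed_mysql = snap_lib.SnapCache()[_SNAP_NAME]
    if charmed_mysql.present:
        if charmed_mysql.revision == _REVISION:
            # A previous (failed) run of this hook finished the install in
            # the background; treat the retry as a no-op instead of failing.
            return
        raise Exception(f"Multiple {_SNAP_NAME} snap installs not supported on one machine")
    charmed_mysql.ensure(snap_lib.SnapState.Present, revision=_REVISION)
    charmed_mysql.hold()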

Edit: confirming that this is indeed being reported from the bug mentioned above by @taurus-forever.

However, the asynchronous task started by snapd is not stopped by the charm (and I don't think it can be). This means there's a window where the install hook fails but the installation actually succeeds (it's just slow).

Oh, I didn't realize this—thank you! Definitely something we need to fix then, since it sounds like there's a significant chance of this happening

In this case, it is actually possible to verify whether the installed snap is the correct version and handle the error condition gracefully.

This isn't quite enough—even if the snap is the same version as we're expecting, we need to verify that it's not in use by another charm on the same machine (e.g. mysql server) so that if we update the snap later we aren't restarting mysql server.

We can probably use an approach similar to the one we're using on the server charm: writing a file with the name of the charm that installed the snap, as sketched below.
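
As a rough illustration of that marker-file idea (the path and function name here are hypothetical):

# Hypothetical sketch of a marker-file ownership check; the path and
# helper name are made up for illustration.
import pathlib

_INSTALLED_BY = pathlib.Path("/var/snap/charmed-mysql/common/installed_by_charm")

def check_or_claim(charm_name: str) -> None:
    # Call after a successful install (the snap's data dir must exist).
    if _INSTALLED_BY.exists():
        owner = _INSTALLED_BY.read_text().strip()
        if owner != charm_name:
            # Another charm (e.g. mysql server) owns the snap; refreshing
            # it from this charm could restart that workload.
            raise Exception(f"charmed-mysql snap already installed by {owner}")
        return  # this charm installed it on a previous hook run
    _INSTALLED_BY.write_text(charm_name)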

Should be fixed by #92

I'm wondering if the original issue might've been caused by something else (and this is just a symptom), or if it was caused by install being fired by Juju twice

Fix merged; it will be available in dpe/edge soon.

Note that dpe/edge currently contains known issues with secrets and relation-broken handling. More details: #82 (comment)