Install hook failure due to snap charmed-mysql already installed
wolsen opened this issue · 9 comments
Please note that I am submitting this bug report upon analyzing the code and viewing log snippets from an issue reported against the slurm-bundles repository. I do not have full the full log of recreate steps, but reading the code is enough to conjecture this sequence of events and I have been able to confirm this in a contrived recreate scenario. The scenario is included below.
Steps to reproduce
- juju deploy --base ubuntu@22.04 slurmd --channel edge
- juju deploy --base ubuntu@22.04 mysql-router --channel dpe/edge
- juju relate slurmd mysql-router
- juju debug-hooks mysql-router/0 install
- Raise an exception at the end of the snap.py's
install
function (e.g. do not exit cleanly). I used:
raise Exception("Intentionally forcing a failure here")
- exit the debug-hooks session with a non-zero exit code
- juju resolved mysql-router/0
You will note that the failure then results in Exception: Multiple charmed-mysql snap installs not supported on one machine
.
Expected behavior
The install hook should be able to be re-run on initial failure after the error has been resolved.
Actual behavior
The install hook fails with Exception: Multiple charmed-mysql snap installs not supported on one machine
Versions
Operating system: Ubuntu 22.04.3 LTS
Juju CLI: 3.1.6-genericlinux-amd64
Juju agent: 3.3-beta1
mysql charm revision: n/a
mysql-router charm revision: dpe/edge revision 119
LXD: 5.19
Log output
Juju debug log:
unit-mysql-router-0: 16:52:33 INFO juju Starting unit workers for "mysql-router/0"
unit-mysql-router-0: 16:52:33 INFO juju.worker.apicaller [9b6cfe] "unit-mysql-router-0" successfully connected to "10.167.81.6:17070"
unit-mysql-router-0: 16:52:33 INFO juju.worker.apicaller [9b6cfe] password changed for "unit-mysql-router-0"
unit-mysql-router-0: 16:52:33 INFO juju.worker.apicaller [9b6cfe] "unit-mysql-router-0" successfully connected to "10.167.81.6:17070"
unit-mysql-router-0: 16:52:33 INFO juju.worker.upgrader no waiter, upgrader is done
unit-mysql-router-0: 16:52:33 INFO juju.worker.migrationminion migration phase is now: NONE
unit-mysql-router-0: 16:52:33 INFO juju.worker.logger logger worker started
unit-mysql-router-0: 16:52:33 ERROR juju.worker.meterstatus error running "meter-status-changed": charm missing from disk
unit-mysql-router-0: 16:52:33 INFO juju.worker.uniter unit "mysql-router/0" started
unit-mysql-router-0: 16:52:33 INFO juju.worker.uniter resuming charm install
unit-mysql-router-0: 16:52:33 INFO juju.worker.uniter.charm downloading ch:amd64/jammy/mysql-router-119 from API server
unit-mysql-router-0: 16:52:36 INFO juju.worker.uniter hooks are retried true
unit-mysql-router-0: 16:52:36 INFO juju.worker.uniter.storage initial storage attachments ready
unit-mysql-router-0: 16:52:36 INFO juju.worker.uniter found queued "install" hook
unit-mysql-router-0: 16:52:36 INFO juju.worker.uniter.runner executing install via debug-hooks; hook dispatching script: dispatch
unit-mysql-router-0: 16:53:39 DEBUG unit.mysql-router/0.juju-log ops 2.6.0 up and running.
unit-mysql-router-0: 16:53:39 INFO unit.mysql-router/0.juju-log Running legacy hooks/install.
unit-mysql-router-0: 16:53:40 DEBUG unit.mysql-router/0.juju-log ops 2.6.0 up and running.
unit-mysql-router-0: 16:53:40 DEBUG unit.mysql-router/0.juju-log Charm called itself via hooks/install.
unit-mysql-router-0: 16:53:40 DEBUG unit.mysql-router/0.juju-log Legacy hooks/install exited with status 0.
unit-mysql-router-0: 16:53:40 DEBUG unit.mysql-router/0.juju-log Using local storage: not a Kubernetes podspec charm
unit-mysql-router-0: 16:53:40 DEBUG unit.mysql-router/0.juju-log Initializing SQLite local storage: /var/lib/juju/agents/unit-mysql-router-0/charm/.unit-state.db.
unit-mysql-router-0: 16:53:40 DEBUG unit.mysql-router/0.juju-log Emitting Juju event install.
unit-mysql-router-0: 16:53:40 DEBUG unit.mysql-router/0.juju-log Installing _SNAP_NAME='charmed-mysql', _REVISION='69'
unit-mysql-router-0: 16:54:29 ERROR unit.mysql-router/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-mysql-router-0/charm/./src/machine_charm.py", line 73, in <module>
ops.main.main(MachineSubordinateRouterCharm)
File "/var/lib/juju/agents/unit-mysql-router-0/charm/venv/ops/main.py", line 441, in main
_emit_charm_event(charm, dispatcher.event_name)
File "/var/lib/juju/agents/unit-mysql-router-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
event_to_emit.emit(*args, **kwargs)
File "/var/lib/juju/agents/unit-mysql-router-0/charm/venv/ops/framework.py", line 344, in emit
framework._emit(event)
File "/var/lib/juju/agents/unit-mysql-router-0/charm/venv/ops/framework.py", line 841, in _emit
self._reemit(event_path)
File "/var/lib/juju/agents/unit-mysql-router-0/charm/venv/ops/framework.py", line 930, in _reemit
custom_handler(event)
File "/var/lib/juju/agents/unit-mysql-router-0/charm/./src/machine_charm.py", line 65, in _on_install
snap.install(unit=self.unit)
File "/var/lib/juju/agents/unit-mysql-router-0/charm/src/snap.py", line 49, in install
raise Exception("Intentionally forcing a failure here.")
Exception: Intentionally forcing a failure here.
unit-mysql-router-0: 16:54:36 ERROR juju.worker.uniter.operation hook "install" (via hook dispatching script: dispatch) failed: exit status 1
unit-mysql-router-0: 16:54:36 INFO juju.worker.uniter awaiting error resolution for "install" hook
unit-mysql-router-0: 16:54:41 INFO juju.worker.uniter awaiting error resolution for "install" hook
unit-mysql-router-0: 16:54:42 DEBUG unit.mysql-router/0.juju-log ops 2.6.0 up and running.
unit-mysql-router-0: 16:54:42 INFO unit.mysql-router/0.juju-log Running legacy hooks/install.
unit-mysql-router-0: 16:54:42 DEBUG unit.mysql-router/0.juju-log ops 2.6.0 up and running.
unit-mysql-router-0: 16:54:42 DEBUG unit.mysql-router/0.juju-log Charm called itself via hooks/install.
unit-mysql-router-0: 16:54:42 DEBUG unit.mysql-router/0.juju-log Legacy hooks/install exited with status 0.
unit-mysql-router-0: 16:54:42 DEBUG unit.mysql-router/0.juju-log Re-emitting deferred event <InstallEvent via MachineSubordinateRouterCharm/on/install[1]>.
unit-mysql-router-0: 16:54:42 ERROR unit.mysql-router/0.juju-log charmed-mysql snap already installed on machine. Installation aborted
unit-mysql-router-0: 16:54:42 ERROR unit.mysql-router/0.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
File "/var/lib/juju/agents/unit-mysql-router-0/charm/./src/machine_charm.py", line 73, in <module>
ops.main.main(MachineSubordinateRouterCharm)
File "/var/lib/juju/agents/unit-mysql-router-0/charm/venv/ops/main.py", line 439, in main
framework.reemit()
File "/var/lib/juju/agents/unit-mysql-router-0/charm/venv/ops/framework.py", line 851, in reemit
self._reemit()
File "/var/lib/juju/agents/unit-mysql-router-0/charm/venv/ops/framework.py", line 930, in _reemit
custom_handler(event)
File "/var/lib/juju/agents/unit-mysql-router-0/charm/./src/machine_charm.py", line 65, in _on_install
snap.install(unit=self.unit)
File "/var/lib/juju/agents/unit-mysql-router-0/charm/src/snap.py", line 30, in install
raise Exception(f"Multiple {_SNAP_NAME} snap installs not supported on one machine")
Exception: Multiple charmed-mysql snap installs not supported on one machine
unit-mysql-router-0: 16:54:42 ERROR juju.worker.uniter.operation hook "install" (via hook dispatching script: dispatch) failed: exit status 1
unit-mysql-router-0: 16:54:42 INFO juju.worker.uniter awaiting error resolution for "install" hook
unit-mysql-router-0: 16:54:52 INFO juju.worker.uniter awaiting error resolution for "install" hook
Additional context
Provided at top of bug report.
Thank you for the bug report!
Out of curiosity, do you know what error in the install hook caused this issue? Given that the install hook only runs these two commands I'm wondering what could've failed after the snap successfully installed
mysql-router-operator/src/machine_charm.py
Lines 64 to 66 in e832715
As you've shown, we can improve the design of snap check. At the same time, I'm wary of adding handling to the charm for unexpected, uncaught exceptions—I'd rather the charm fail early and loudly instead of silently, especially in the install
hook when the workload isn't serving clients
That said, I think we should reconsider the design of the snap check. Deferring to @taurus-forever regarding priority for that
@carlcsaposs-canonical I believe Billy is reporting us this real-world case: charmed-hpc/slurm-bundles#38 (comment)
description: charmed-hpc/slurm-bundles#38 (comment)
Q: why mysql-router has failed? From this point of view, the charm should recover on hook retry (after networking restoring).
Billy, feel free to correct me here.
@taurus-forever what I'm trying to understand is why the install hook failed the first time
@taurus-forever If the install
hook fails because of networking, the snap will not be installed—so subsequent retries of the install hook should not fail with this exception
It's not entirely clear why the install hook failed and I don't have the full log set and the deployment is no longer around to get the data. However, the code in and of itself clearly recognizes that it can fail for various reasons.
Specifically, this loop here is acknowledging that there may be a timeout condition that occurs during the installation (e.g. low bandwidth, high iowait periods, etc). The charm will set the maintenance status and then re-raise the exception, which causes the install hook to fail. However, the asynchronous task that was started by the snap is not stopped by the charm (and I don't think it can be). This means there's a period where the install hook fails, but the installation actually succeeds (its just slow). The install hook will be re-run by Juju and must complete this hook before moving onto other hooks in the queue. The re-execution of the hook will always fail if the snap completed installation in the background.
Additionally, it's also true the the invocation of _snap.hold()
will fail (snapd crash, socket closed, oom killer, etc) in a manner that is not a persistent failure. That will also raise an unhandled exception that will cause the hook to fail.
The only way out of this is either via manual intervention or destroying the application and re-deploying which is not an ideal experience (but sometimes necessary). In this case, it is actually possible to verify whether or not the snap installed is the correct version of the snap or not and handle the error condition gracefully.
Edit to confirm that this is indeed being reported from the above bug mentioned by @taurus-forever.
However, the asynchronous task that was started by the snap is not stopped by the charm (and I don't think it can be). This means there's a period where the install hook fails, but the installation actually succeeds (its just slow).
Oh, I didn't realize this—thank you! Definitely something we need to fix then, since it sounds like there's a significant chance of this happening
In this case, it is actually possible to verify whether or not the snap installed is the correct version of the snap or not and handle the error condition gracefully.
This isn't quite enough—even if the snap is the same version as we're expecting, we need to verify that it's not in use by another charm on the same machine (e.g. mysql server) so that if we update the snap later we aren't restarting mysql server.
We can probably use a similar approach that we're using on server by writing a file with the name of the charm that installed the snap
Should be fixed by #92
I'm wondering if the original issue might've been caused by something else (and this is just a symptom), or if it was caused by install
being fired by Juju twice
Fix merged, will be available in dpe/edge soon
Note that dpe/edge currently contains known issues with secrets & relation broken. More details: #82 (comment)