Helm operator gets stuck if an install/upgrade is aborted due to an unexpected exit of the operator
If I kill the operator while a Helm install is in progress, it is not able to recover afterwards, because it always receives an error due to the existing lock.
Hmm. I wasn't aware that helm created a lock during installs (and I assume other operations that affect release data?).
Could you share the error you were getting and/or any other breadcrumbs? Assuming the lock is some sort of kube object helm creates, I wonder if we could inject a label or other identifier that could help us identify it later as a lock we created (vs the helm CLI or another client) and try to recover.
Hey @joelanford ! Thanks for your answer! :)
Helm does update its release secret status to `pending-upgrade`:
❯ helm get all <release-name> | head
NAME: stackrox-secured-cluster-services
LAST DEPLOYED: Tue Jun 15 18:47:11 2021
NAMESPACE: stackrox
STATUS: pending-upgrade
REVISION: 8
TEST SUITE: None
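
For what it's worth, here is a minimal sketch (not the operator's actual code) of how that pending state could be detected from Go, assuming the Helm v3 SDK's `action` and `storage` packages and the release/namespace names from the output above:

```go
package main

import (
	"fmt"
	"log"
	"os"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/cli"
)

func main() {
	settings := cli.New()

	// Initialize the Helm action configuration against the "stackrox"
	// namespace, using the same storage driver Helm itself uses.
	cfg := new(action.Configuration)
	if err := cfg.Init(settings.RESTClientGetter(), "stackrox", os.Getenv("HELM_DRIVER"), log.Printf); err != nil {
		log.Fatal(err)
	}

	// Fetch the latest revision of the release from Helm's release storage.
	rel, err := cfg.Releases.Last("stackrox-secured-cluster-services")
	if err != nil {
		log.Fatal(err)
	}

	// IsPending() is true for pending-install, pending-upgrade, and pending-rollback.
	if rel.Info.Status.IsPending() {
		fmt.Printf("release %s revision %d is stuck in %s\n", rel.Name, rel.Version, rel.Info.Status)
	}
}
```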
I even found several issues in Helm like this one, but so far it looks like the only workarounds are:
- deleting the secret
- updating the status of the release manually
- doing a rollback (a minimal sketch of this follows the list)
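
As an illustration of the rollback workaround (reusing the `cfg` from the snippet above; this is just a sketch, not a proposed fix):

```go
// Roll back to the previous revision; Version == 0 tells Helm to pick
// the revision right before the current (pending) one.
rollback := action.NewRollback(cfg)
rollback.Version = 0
if err := rollback.Run("stackrox-secured-cluster-services"); err != nil {
	log.Fatal(err)
}
```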
I'll try to fix this today and add you to the PR; it would be really useful to have your feedback on this.
Ah! I'm pretty sure (though not positive) that we inject an owner reference into the release secret. If so, that would help us identify release secrets we create. Whatever solution we choose, I think it should include a check that we will only try to automatically resolve it if we see that we are the only interested party to the release.
That would avoid a situation where the operator suddenly inherits and potentially stomps on a release when a CR is created for an existing release.
Action items / open questions:
- Check for the owner reference on Helm secrets and config maps so that pending states are only resolved for operator-owned resources (see the sketch after this list)
- Should resources that are not owned by the operator also recover from pending states?
- Add tests
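
For the owner-reference check, a rough sketch of what the lookup on a Helm release secret could look like. The secret name follows Helm's `sh.helm.release.v1.<name>.v<revision>` convention; the CR UID and kubeconfig wiring are placeholders for illustration:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// ownedByCR reports whether the given Helm release secret carries an owner
// reference pointing at the custom resource with the given UID. Only then
// would the operator consider auto-recovering a pending release.
func ownedByCR(ctx context.Context, client kubernetes.Interface, namespace, secretName, crUID string) (bool, error) {
	secret, err := client.CoreV1().Secrets(namespace).Get(ctx, secretName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, ref := range secret.OwnerReferences {
		if string(ref.UID) == crUID {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Placeholder wiring: build a client from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	owned, err := ownedByCR(context.Background(), client, "stackrox",
		"sh.helm.release.v1.stackrox-secured-cluster-services.v8", "<cr-uid>")
	if err != nil {
		panic(err)
	}
	fmt.Println("operator owns this release secret:", owned)
}
```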