k3s-io/helm-controller

Controller deletes the release during the upgrade process without warning

mtparet opened this issue · 15 comments

Hello,

We experienced a release being deleted during an upgrade. Is this expected?

elastic/cloud-on-k8s#4734 (comment)

kubectl logs -f helm-install-elastic-operator-jzwnf -n elastic-system 
CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
set +v -x
+ cp /var/run/secrets/kubernetes.io/serviceaccount/ca.crt /usr/local/share/ca-certificates/
+ update-ca-certificates
WARNING: ca-certificates.crt does not contain exactly one certificate or CRL: skipping
+ [[ '' != \t\r\u\e ]]
+ export HELM_HOST=127.0.0.1:44134
+ HELM_HOST=127.0.0.1:44134
+ helm_v2 init --skip-refresh --client-only --stable-repo-url https://charts.helm.sh/stable/
+ tiller --listen=127.0.0.1:44134 --storage=secret
Creating /root/.helm 
Creating /root/.helm/repository 
Creating /root/.helm/repository/cache 
Creating /root/.helm/repository/local 
Creating /root/.helm/plugins 
Creating /root/.helm/starters 
Creating /root/.helm/cache/archive 
Creating /root/.helm/repository/repositories.yaml 
Adding stable repo with URL: https://charts.helm.sh/stable/ 
Adding local repo with URL: http://127.0.0.1:8879/charts 
$HELM_HOME has been configured at /root/.helm.
Not installing Tiller due to 'client-only' flag having been set
++ jq -r '.Releases | length'
++ helm_v2 ls --all '^elastic-operator$' --output json
[main] 2021/08/04 12:19:18 Starting Tiller v2.17.0 (tls=false)
[main] 2021/08/04 12:19:18 GRPC listening on 127.0.0.1:44134
[main] 2021/08/04 12:19:18 Probes listening on :44135
[main] 2021/08/04 12:19:18 Storage driver is Secret
[main] 2021/08/04 12:19:18 Max history per release is 0
[storage] 2021/08/04 12:19:18 listing all releases with filter
+ V2_CHART_EXISTS=
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ [[ -n '' ]]
+ shopt -s nullglob
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/elastic-operator.tgz.base64
+ CHART_PATH=/elastic-operator.tgz
+ [[ ! -f /chart/elastic-operator.tgz.base64 ]]
+ return
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ eck-operator == stable/* ]]
+ [[ -n https://helm.elastic.co ]]
+ helm_v3 repo add elastic-operator https://helm.elastic.co
"elastic-operator" has been added to your repositories
+ helm_v3 repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "elastic-operator" chart repository
Update Complete. ⎈Happy Helming!⎈
+ helm_update install --namespace elastic-system --repo https://helm.elastic.co --version 1.7.0
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ tr '[:upper:]' '[:lower:]'
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
++ helm_v3 ls --all -f '^elastic-operator$' --namespace elastic-system --output json
+ LINE=1.7.0,failed
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ [[ install = \d\e\l\e\t\e ]]
+ [[ 1.7.0 =~ ^(|null)$ ]]
+ [[ failed =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ failed == \d\e\p\l\o\y\e\d ]]
+ [[ failed =~ ^(deleted|failed|null|unknown)$ ]]
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ helm_v3 uninstall elastic-operator --namespace elastic-system
release "elastic-operator" uninstalled
+ echo Deleted
+ helm_v3 install --namespace elastic-system --repo https://helm.elastic.co --version 1.7.0 elastic-operator eck-operator
Deleted
NAME: elastic-operator
LAST DEPLOYED: Wed Aug  4 12:19:23 2021
NAMESPACE: elastic-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Inspect the operator logs by running the following command:
   kubectl logs -n elastic-system sts/elastic-operator
+ exit

++ helm_v3 ls --all -f '^elastic-operator$' --namespace elastic-system --output json
+ LINE=1.7.0,failed

The release was in a failed state. Failed releases cannot be upgraded, so the helm controller handles this by uninstalling the broken release, and installing it again to a known-good state. For information on why the release was in a failed state, you would need to look at logs from the failed job pod.
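
For reference, here is a condensed sketch of the branch visible in the trace above. This is not the actual klipper-helm entry script, just the decision path the trace shows; helm_v3 is the script's wrapper around the Helm v3 binary, and the names come from the elastic-operator example:

# Read "app_version,status" of the release, lower-cased (as in the trace above)
LINE=$(helm_v3 ls --all -f '^elastic-operator$' --namespace elastic-system --output json \
  | jq -r '"\(.[0].app_version),\(.[0].status)"' | tr '[:upper:]' '[:lower:]')
IFS=, read -r INSTALLED_VERSION STATUS _ <<< "${LINE}"

if [[ ${STATUS} =~ ^(deleted|failed|null|unknown)$ ]]; then
  # A failed release cannot be upgraded in place, so it is removed first...
  helm_v3 uninstall elastic-operator --namespace elastic-system
fi
# ...and then installed again from scratch.
helm_v3 install --namespace elastic-system --repo https://helm.elastic.co --version 1.7.0 elastic-operator eck-operator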

How can I check the logs of the failed pod if it gets deleted by the controller so quickly?

This behavior is fine for stateless things, but I wouldn't recommend using the controller for things like rook-ceph, where reinstalling causes data loss.

Yes, you'd need to catch it pretty quickly. If this is something you run into with some regularity, it is probably worth trying to reproduce the error in a test environment so that you can figure out what's causing the chart upgrade or install to fail initially.

We could perhaps consider an enhancement to the HelmChart CR spec to support enabling different behavior if the release is in a failed state - for example: rolling back, retrying the upgrade as-is, or simply leaving it failed for an administrator to address manually?
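
Purely as an illustration of what that could look like, here is a hypothetical sketch of the failed-release branch honoring a configurable policy. None of these names (FAILURE_POLICY, the policy values, NAME, NAMESPACE) exist in the current script:

case "${FAILURE_POLICY:-reinstall}" in
  reinstall)
    helm_v3 uninstall "${NAME}" --namespace "${NAMESPACE}"  # current behavior: wipe and reinstall
    ;;
  rollback)
    helm_v3 rollback "${NAME}" --namespace "${NAMESPACE}"   # return to the previous revision
    ;;
  retry)
    ;;  # fall through and simply re-run the upgrade as-is
  abort)
    echo "release ${NAME} is in a failed state; leaving it for manual intervention" >&2
    exit 1
    ;;
esac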

In the past we've attempted to avoid adding too much complexity to the helm controller since it's sort of "just enough" to get the basic cluster infra installed via charts; it's definitely not intended to be a complete replacement for more complicated deployment tools.

Sounds like a good idea to me. I suggest making the safe option, which does not delete the release, the default.

I have had some bad experiences with this controller and data loss after a few weeks of uptime, luckily in a test environment.

Given the way we use the Helm controller to deploy core cluster infrastructure whose complete configuration is managed by the chart, it's usually best for us as distribution maintainers to default to behavior that gets the chart installed in the desired configuration at any cost. I can see the value in allowing users to disable that for their own charts, though.

We have seen more or less the same issue during our upgrade of RKE2 from 1.20.8 to 1.20.12:

CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
set +v -x
+ [[ '' != \t\r\u\e ]]
+ export HELM_HOST=127.0.0.1:44134
+ HELM_HOST=127.0.0.1:44134
+ helm_v2 init --skip-refresh --client-only --stable-repo-url https://charts.helm.sh/stable/
+ tiller --listen=127.0.0.1:44134 --storage=secret
[main] 2021/12/16 08:54:01 Starting Tiller v2.17.0 (tls=false)
[main] 2021/12/16 08:54:01 GRPC listening on 127.0.0.1:44134
[main] 2021/12/16 08:54:01 Probes listening on :44135
[main] 2021/12/16 08:54:01 Storage driver is Secret
[main] 2021/12/16 08:54:01 Max history per release is 0
Creating /home/klipper-helm/.helm
Creating /home/klipper-helm/.helm/repository
Creating /home/klipper-helm/.helm/repository/cache
Creating /home/klipper-helm/.helm/repository/local
Creating /home/klipper-helm/.helm/plugins
Creating /home/klipper-helm/.helm/starters
Creating /home/klipper-helm/.helm/cache/archive
Creating /home/klipper-helm/.helm/repository/repositories.yaml
Adding stable repo with URL: https://charts.helm.sh/stable/
Adding local repo with URL: http://127.0.0.1:8879/charts
$HELM_HOME has been configured at /home/klipper-helm/.helm.
Not installing Tiller due to 'client-only' flag having been set
++ timeout -s KILL 30 helm_v2 ls --all '^rke2-ingress-nginx$' --output json
++ jq -r '.Releases | length'
[storage] 2021/12/16 08:54:01 listing all releases with filter
+ V2_CHART_EXISTS=
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ [[ -n '' ]]
+ shopt -s nullglob
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/rke2-ingress-nginx.tgz.base64
+ CHART_PATH=/tmp/rke2-ingress-nginx.tgz
+ [[ ! -f /chart/rke2-ingress-nginx.tgz.base64 ]]
+ base64 -d /chart/rke2-ingress-nginx.tgz.base64
+ CHART=/tmp/rke2-ingress-nginx.tgz
+ set +e
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ /tmp/rke2-ingress-nginx.tgz == stable/* ]]
+ [[ -n '' ]]
+ helm_update install --set-string global.clusterCIDR=10.253.0.0/17 --set-string global.clusterDNS=10.253.128.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.253.128.0/17 --set-string global.systemDefaultRegistry=
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ helm_v3 ls --all -f '^rke2-ingress-nginx$' --namespace kube-system --output json
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
++ tr '[:upper:]' '[:lower:]'
+ LINE=1.0.2,failed
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ for VALUES_FILE in /config/*.yaml
+ VALUES=' --values /config/values-10_HelmChartConfig.yaml'
+ [[ install = \d\e\l\e\t\e ]]
+ [[ 1.0.2 =~ ^(|null)$ ]]
+ [[ failed =~ ^(pending-install|pending-upgrade|pending-rollback)$ ]]
+ [[ failed == \d\e\p\l\o\y\e\d ]]
+ [[ failed =~ ^(deleted|failed|null|unknown)$ ]]
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ helm_v3 uninstall rke2-ingress-nginx --namespace kube-system
release "rke2-ingress-nginx" uninstalled
+ echo Deleted
Deleted
+ helm_v3 install --set-string global.clusterCIDR=10.253.0.0/17 --set-string global.clusterDNS=10.253.128.10 --set-string global.clusterDomain=cluster.local --set-string global.rke2DataDir=/var/lib/rancher/rke2 --set-string global.serviceCIDR=10.253.128.0/17 --set-string global.systemDefaultRegistry= rke2-ingress-nginx /tmp/rke2-ingress-nginx.tgz --values /config/values-10_HelmChartConfig.yaml

In our case this hit ingress-nginx (causing downtime) and longhorn (causing data loss).

Is this the same problem?

That's the same issue I had. I think it's best to have the safe option as the default, and the destructive one as opt-in.

As an alternative, I suggest putting some damn big warning signs somewhere: do not use this controller with any deployment that requires any kind of persistence.

I am sure our deployments had been installed and were running perfectly well, so I'm not sure why or where a failure was detected and then handled "wrong".

We tried to reproduce this issue, and it seems we might have found at least one situation where an "uninstall" happens when it should not:

If the helm upgrade for a given chart/deployment fails, e.g. due to hitting a timeout, the release is left in the "pending-upgrade" status.

When the process gets restarted, it detects "pending-upgrade", sets the release to "failed", and re-runs the helm upgrade.

If another failure happens here, e.g.:

Retrying upgrade of longhorn
Error: UPGRADE FAILED: post-upgrade hooks failed: timed out waiting for the condition

then in the next re-run the release is in the failed state, and this triggers the uninstall/reinstall.

Our assumption is that this is related to this change of the status from "pending-upgrade" to "failed":
https://github.com/k3s-io/klipper-helm/blob/9a6c7e2ce9f7bf0bafb5d803bb91797e2470b918/entry#L39

Could you verify if this observation is correct?

In general, could we disable the "uninstall / install" behavior completely in case of errors?
(https://github.com/k3s-io/klipper-helm/blob/9a6c7e2ce9f7bf0bafb5d803bb91797e2470b918/entry#L68)

  1. It is a really bad situation if, for whatever reason, something installed with persistent data gets uninstalled, as this causes data loss (e.g. with longhorn in our case, but also with rook-ceph, elastic, ...).
  2. An automated uninstall/install instead of an upgrade causes downtime (ingress-nginx in our case).

For now I have solved this by switching to the flux helm-controller, which does not suffer from this issue: https://github.com/fluxcd/helm-controller.

For now I have solved this by switching to the flux helm-controller, which does not suffer from this issue: https://github.com/fluxcd/helm-controller.

So you deploy this "in addition" to the one we have in rke2/k3s?

Retrying upgrade of longhorn
Error: UPGRADE FAILED: post-upgrade hooks failed: timed out waiting for the condition

OK, so the helm upgrade operation is timing out. It's taking longer than helm gives it by default. Assuming there's not something else broken, the correct fix for this is probably just to raise the timeout. I am also assuming that Longhorn itself supports upgrading by just upgrading the chart without any additional preparation.

I see that we didn't get this field documented in the HelmChart spec in the RKE2 docs, but there is a `timeout` field available that takes a duration string (`30m` or the like):
https://github.com/k3s-io/helm-controller/blob/master/pkg/apis/helm.cattle.io/v1/types.go#L30

This is passed through directly to the helm command's `--timeout` argument:
https://github.com/k3s-io/klipper-helm/blob/master/entry#L152
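
For example, on a HelmChart object you manage yourself, raising the timeout would look something like the following. The chart names are taken from the elastic-operator example above, and the namespace placement is an assumption, so adjust it to wherever your HelmChart actually lives:

kubectl apply -f - <<'EOF'
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: elastic-operator
  namespace: elastic-system
spec:
  chart: eck-operator
  repo: https://helm.elastic.co
  version: 1.7.0
  timeout: 30m
EOF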

The "timeout" situation is just one case that we could reproduce; we are not sure whether there are other situations.

We believe that adjusting the timeout is not a valid workaround, because we never know how much time an upgrade in a given environment might need. We must ensure, with 100% certainty, that there is no automatic uninstall, as an uninstall means downtime and data loss.

The way we reproduced the root cause might be only one of the situations where this "uninstall" happens; we have also seen it with the ingress-nginx bundled with RKE2. Our assumption is that any update/upgrade might cause problems in a given cluster, or on the servers/agents in the cluster, etc. So to be on the safe side, we might have to set the timeout to "never time out". How can we configure this as the default for everything handled by the RKE2-integrated helm-controller?

For a real solution, could the "uninstall / install" behavior be removed from the helm-controller completely, to protect against downtime and data loss? Or what would be the reason to allow uninstall/install in a production environment?

Tracking this issue since I'm experiencing the same thing with rke2 + rancher-istio. Will test out that timeout field and report back. We're also managing rook-ceph, so further prevention of downtime/data loss is still necessary.