k3s-io/helm-controller

Bool in `HelmChartSpec.items.spec.set` can stall controller queue processing

brandond opened this issue · 8 comments

From @Stargeras in rancher/rke2#3849:

I updated a set in one of my manifests to have a boolean value (true). It's weird that it affected updates to all of the HelmChart resources and not just the one with the boolean.
I do not see the Job getting updated when the HelmChart resource is updated. I don't see any changes in the kube-controller-manager pod logs. I do see something interesting in the rke2 server logs though. I see the event for the updated manifest but then I see the following error:

Jan 26 16:15:45 pserver24 rke2[30949]: I0126 16:15:45.104127 30949 event.go:294] "Event occurred" object="kube-system/rancher-chart" fieldPath="" kind="Addon" apiVersion="k3s.cattle.io/v1" type="Normal" reason="AppliedManifest" message="Applied manifest at "/var/lib/rancher/rke2/server/manifests/rancher-chart.yaml""
Jan 26 16:16:05 pserver24 rke2[30949]: W0126 16:16:05.361337 30949 reflector.go:324] k8s.io/client-go@v1.24.9-k3s1/tools/cache/reflector.go:167: failed to list *v1.HelmChart: json: cannot unmarshal bool into Go struct field HelmChartSpec.items.spec.set of type int32
Jan 26 16:16:05 pserver24 rke2[30949]: E0126 16:16:05.361389 30949 reflector.go:138] k8s.io/client-go@v1.24.9-k3s1/tools/cache/reflector.go:167: Failed to watch *v1.HelmChart: failed to list *v1.HelmChart: json: cannot unmarshal bool into Go struct field HelmChartSpec.items.spec.set of type int32

Does anyone now how do unstall the controller? It can't do any other thing because of the stalling. Annoying. How can't this be fixed in such a long time?

Uninstall it from what?

It is not fixed because it is triggered by bad input from the user; the fix is to correct your manifest so that it does not have invalid field content. Adding an admission webhook to validate resources when they are created is more work than we current want to take on.

@brandond you are right, this is an issue from the user. Nevertheless, as far as I remember, the problem is that the whole HelmChart processing stop when such « syntax error » occurs, for any HelmChart. And the error message is accessible only on the /var/log/k3s.log.

For a user, it would be helpful if the error is reported on the HelmChart itself, via the Status or an event.

The error itself that stalls the queue is deep inside the Kubernetes code that handles unmarshalling the list/watch event from the channel; I suspect that fixing it on that side would be quite involved. The proper fix is probably to validate the resources before they're admitted to the cluster but that is also a lot of work.

it would be helpful if the error is reported on the HelmChart itself, via the Status or an event.

Yeah but we can't do that, as the resource is invalid. It couldn't be decoded so we don't know which one it is to attach an event to it.

Honestly, just having the controller generate CRDs a proper OpenAPI field spec instead of using x-kubernetes-preserve-unknown-fields: true would probably be the correct way to handle this.

It looks like the standalone controller already did that, it's just K3s and RKE2 that register the CRDs without schema. I have a PR open to fix that.

Closing here, as the standalone controller already enables schema validation via its CRDs. It is just that K3s/RKE2 do not create the CRDs with schema, so they allow invalid resources to be created. I have opened an issue and PR to address that.

Thanks @brandond for this analysis and report.