openshift/machine-config-operator

Keepalived chk_default_ingress error causing ingress operator failure during upgrade from 4.7 to 4.8

mkaspar opened this issue · 7 comments

During an upgrade from 4.7.0-0.okd-2021-09-19-013247 to 4.8.0-0.okd-2021-10-24-061736 (running on vSphere) we ran into an error with the ingress operator that caused the upgrade to fail with:
Unable to apply 4.8.0-0.okd-2021-10-24-061736: wait has exceeded 40 minutes for these operators: ingress
The operator log showed repeating errors like:

2021-12-25T19:58:13.679Z	ERROR	operator.canary_controller	wait/wait.go:155	error performing canary route check	{"error": "error sending canary HTTP request to \"canary-openshift-ingress-canary.apps.okd4.domain.x\": Get \"https://canary-openshift-ingress-canary.apps.okd4.domain.x\": dial tcp 172.26.125.129:443: connect: connection refused"}
2021-12-25T19:58:18.781Z	INFO	operator.ingress_controller	controller/controller.go:298	reconciling	{"request": "openshift-ingress-operator/default"}
2021-12-25T19:58:18.945Z	ERROR	operator.ingress_controller	controller/controller.go:298	got retryable error; requeueing	{"after": "1m0s", "error": "IngressController is degraded: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
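
For reference, the connection-refused symptom can also be reproduced from outside the operator with a plain curl against the canary route (hostname taken from the log lines above):

# Probe the canary route that the operator's canary check uses
curl -kv https://canary-openshift-ingress-canary.apps.okd4.domain.x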

The router pods and other components seemed to run just fine, but the external .Cluster.IngressVIP was inaccessible over both HTTP and HTTPS. Further investigation revealed that this was caused by the .Cluster.IngressVIP being assigned to nodes other than the ones running the router pods.
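
The mismatch is easy to confirm by hand; a rough sketch (the node name is a placeholder, the VIP is the one from the logs above):

# Nodes running the default router pods
oc -n openshift-ingress get pods -o wide

# Check each candidate node for the ingress VIP (172.26.125.129 here)
oc debug node/<node-name> -- chroot /host ip -4 addr show | grep 172.26.125.129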
The reason for this was that we had modified the keepalived configuration on 2 of our nodes to host additional VIPs (a workaround for okd-project/okd#572), and the 4.7 keepalived configuration ({{ .Cluster.Name }}_INGRESS priority set to 40) conflicted with the new priority setting of 20 combined with the error in chk_default_ingress. Keepalived only adds a check's weight (here 50) to an instance's priority while the tracked script succeeds, so with chk_default_ingress broken that boost was never applied on the router nodes. The combination of these two problems meant the special nodes ended up with the same keepalived priority (40) as the nodes running the router (openshift-ingress/router-default-*) pods, which led to the VIP misassignment.
I think the problem is in /etc/kubernetes/static-pod-resources/keepalived/keepalived.conf.tmpl, in this section:

vrrp_script chk_default_ingress {
    script "/usr/bin/timeout 4.9 /host/bin/oc --kubeconfig /var/lib/kubelet/kubeconfig get ep -n openshift-ingress route
r-internal-default -o yaml  | grep 'ip:' | grep {{.NonVirtualIP}} "
    interval 5
    weight 50
}

where there is a C2 A0 (non-breaking space) byte before the pipe character in the -o yaml | grep pipeline.
This makes the script fail with return code 7, so the priority weight was not added to the correct keepalived instances.
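
The stray byte is invisible in most editors; a quick way to confirm it and to re-run the check by hand is sketched below (requires GNU grep with -P support; the /host/bin/oc path is the keepalived container's view of the host, and <non-virtual-ip> is a placeholder for the node IP that the template substitutes for {{.NonVirtualIP}}):

# Look for UTF-8 non-breaking spaces (bytes 0xC2 0xA0) in the template
grep -nP '\xc2\xa0' /etc/kubernetes/static-pod-resources/keepalived/keepalived.conf.tmpl

# Re-typed with a plain space before the pipe, the check should exit 0 on a
# node that runs a router pod
/usr/bin/timeout 4.9 /host/bin/oc --kubeconfig /var/lib/kubelet/kubeconfig get ep -n openshift-ingress router-internal-default -o yaml | grep 'ip:' | grep <non-virtual-ip>
echo $?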

Steps to reproduce the issue:

  1. Run OKD 4.7.0-0.okd-2021-09-19-013247
  2. Let the cluster update to 4.8.0-0.okd-2021-10-24-061736
  3. Wait for update to fail

Describe the results you received:

Unable to apply 4.8.0-0.okd-2021-10-24-061736: wait has exceeded 40 minutes for these operators: ingress

Describe the results you expected:

Upgraded cluster

Since this is related to an existing OKD issue (okd-project/okd#572), we probably also want to

/assign @vrutkovs

Let's track this in openshift/okd. If it's an issue and not a support request, we'd need to ensure it's still reproducible in 4.9.

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-bot: Closing this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.