Keepalived chk_default_ingress error causing ingress operator failure during upgrade from 4.7 to 4.8
mkaspar opened this issue · 7 comments
During the upgrade from 4.7.0-0.okd-2021-09-19-013247 to 4.8.0-0.okd-2021-10-24-061736 (running on vSphere) we ran into an error with the ingress operator that caused the upgrade to fail with:
Unable to apply 4.8.0-0.okd-2021-10-24-061736: wait has exceeded 40 minutes for these operators: ingress
The operator log showed repeating errors like:
2021-12-25T19:58:13.679Z ERROR operator.canary_controller wait/wait.go:155 error performing canary route check {"error": "error sending canary HTTP request to \"canary-openshift-ingress-canary.apps.okd4.domain.x\": Get \"https://canary-openshift-ingress-canary.apps.okd4.domain.x\": dial tcp 172.26.125.129:443: connect: connection refused"}
2021-12-25T19:58:18.781Z INFO operator.ingress_controller controller/controller.go:298 reconciling {"request": "openshift-ingress-operator/default"}
2021-12-25T19:58:18.945Z ERROR operator.ingress_controller controller/controller.go:298 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
The router pods and other components seemed to run just fine, but the external .Cluster.IngressVIP was inaccessible over both HTTP and HTTPS. Further investigation revealed that this was caused by the .Cluster.IngressVIP being assigned to nodes other than the ones running the router pods.
The reason for this was that we had modified the keepalived configuration on 2 of our nodes to host additional VIPs (a workaround for okd-project/okd#572), and the 4.7 keepalived configuration ({{ .Cluster.Name }}_INGRESS priority set to 40) conflicted with the new priority setting of 20, combined with the error in chk_default_ingress. The result of these two problems was that the special nodes had the same keepalived priority (40) as the nodes running the router (openshift-ingress/router-default-*) pods, which led to the VIP misassignment.
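To make the priority interaction concrete, here is a minimal sketch. It assumes the usual keepalived behaviour for a tracked script with a positive weight (the weight is added to the instance priority only while the script succeeds) and uses the numbers from this report: base priority 20 in 4.8, priority 40 on the manually modified nodes, and the weight of 50 from the chk_default_ingress script in the template. The node names are hypothetical, and other tracked scripts and VRRP tie-breaking are not modelled.

```python
def effective_priority(base: int, weight: int, check_ok: bool) -> int:
    """Simplified model: a positive weight is added only while the check succeeds."""
    if weight <= 0:
        raise ValueError("this sketch only models positive weights")
    return base + weight if check_ok else base


def vip_holder(priorities: dict) -> str:
    """Return the node with the highest priority (ties are not modelled)."""
    return max(priorities, key=priorities.get)


# Hypothetical node names; the numbers are taken from the issue text.
broken = {
    "router-node": effective_priority(20, 50, check_ok=False),  # check script fails
    "custom-vip-node": 40,                                       # old 4.7 priority
}
healthy = {
    "router-node": effective_priority(20, 50, check_ok=True),   # check script works
    "custom-vip-node": 40,
}

print(vip_holder(broken), broken)
print(vip_holder(healthy), healthy)
```

Under these assumptions, a failing check leaves the router node below the modified node, which would match the VIP landing on a node without a router pod.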
I think the problem is in the /etc/kubernetes/static-pod-resources/keepalived/keepalived.conf.tmpl in this section:
vrrp_script chk_default_ingress {
script "/usr/bin/timeout 4.9 /host/bin/oc --kubeconfig /var/lib/kubelet/kubeconfig get ep -n openshift-ingress router-internal-default -o yaml | grep 'ip:' | grep {{.NonVirtualIP}} "
interval 5
weight 50
}
where there is a C2 A0 byte sequence (the UTF-8 encoding of a non-breaking space) before the pipe character in the -o yaml | grep.
This causes the script to fail with return code 7, so the priority is not added to the correct keepalived instances.
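Because the non-breaking space is invisible in most editors, a byte-level scan is one way to confirm it. The following is a small sketch, not part of any OpenShift tooling; the helper names find_nbsp and scan_file are made up for illustration.

```python
from __future__ import annotations

from pathlib import Path

NBSP_UTF8 = b"\xc2\xa0"  # UTF-8 encoding of U+00A0 (non-breaking space)


def find_nbsp(data: bytes) -> list[tuple[int, int]]:
    """Return (line_number, byte_offset_in_line) for every NBSP in data."""
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        start = 0
        while True:
            pos = line.find(NBSP_UTF8, start)
            if pos == -1:
                break
            hits.append((lineno, pos))
            start = pos + len(NBSP_UTF8)
    return hits


def scan_file(path: str) -> list[tuple[int, int]]:
    """Read the file as raw bytes so no decoding step hides the sequence."""
    return find_nbsp(Path(path).read_bytes())
```

On an affected node this could be run as scan_file("/etc/kubernetes/static-pod-resources/keepalived/keepalived.conf.tmpl"); an empty list would mean no C2 A0 sequence was found in that file.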
Steps to reproduce the issue:
- Run OKD 4.7.0-0.okd-2021-09-19-013247
- Let cluster update to 4.8.0-0.okd-2021-10-24-061736
- Wait for update to fail
Describe the results you received:
Unable to apply 4.8.0-0.okd-2021-10-24-061736: wait has exceeded 40 minutes for these operators: ingress
Describe the results you expected:
Upgraded cluster
This is found in the on-prem templates:
https://github.com/openshift/machine-config-operator/blob/115eb3a71d871aa8a8d80cb4e3613dbaae9a4bcd/templates/worker/00-worker/on-prem/files/keepalived-keepalived.yaml
So passing to @yboaron @rvanderp3 to take a look, though they may prefer a proper Bugzilla with a must-gather from your cluster to investigate.
Since this is related to an existing OKD issue (okd-project/okd#572), also probably want to
/assign @vrutkovs
Let's track this in openshift/okd. If it's an issue and not a support request, we'd need to ensure it's still reproducible in 4.9.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
@openshift-bot: Closing this issue.