taint phase retry loop exits on temporary connectivity failures
knisbet opened this issue
Description
What happened:
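Kubernetes API calls made during the taint phase are wrapped in a retry loop, so temporary failures such as connectivity problems should be retried several times. This doesn't appear to be the case: in the logs below, the dial to leader.telekube.local:6443 and the "All attempts failed" warning are both logged at 22:33:21, and the phase fails on the first "connection refused" error.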
What you expected to happen:
Several attempts should be made to connect to the API before the phase/upgrade is marked as failed.
How to reproduce it (as minimally and precisely as possible):
Unknown; this has been happening to this customer intermittently for some time.
This code appears to be the culprit: gravity/lib/kubernetes/errors.go, lines 27 to 38 at commit f111aeb.
Environment
- Gravity version: reported on 6.1
Relevant Debug Logs
2021-09-21T22:33:21Z INFO Executing phase. phase:/masters/node1/taint utils/logging.go:103
2021-09-21T22:33:21Z INFO Taint Server(AdvertiseIP=10.0.0.136, Hostname=node1, Role=master, ClusterRole=master). phase:/masters/node1/taint utils/logging.go:103
2021-09-21T22:33:21Z DEBU Dial. addr:leader.telekube.local:6443 network:tcp utils/logging.go:103
2021-09-21T22:33:21Z DEBU Resolve leader.telekube.local took 295.08µs. utils/logging.go:103
2021-09-21T22:33:21Z DEBU Resolved leader.telekube.local to 10.0.0.135. utils/logging.go:103
2021-09-21T22:33:21Z DEBU Dial. host-port:10.0.0.135:6443 utils/logging.go:103
2021-09-21T22:33:21Z WARN All attempts failed. error:[Get https://leader.telekube.local:6443/api/v1/nodes/10.0.0.136: dial tcp 10.0.0.135:6443: connect: connection refused] utils/logging.go:103
2021-09-21T22:33:21Z ERRO Phase execution failed. error:[
ERROR REPORT:
Original Error: *url.Error Get https://leader.telekube.local:6443/api/v1/nodes/10.0.0.136: dial tcp 10.0.0.135:6443: connect: connection refused
Stack Trace:
/gopath/src/github.com/gravitational/gravity/lib/utils/retry.go:247 github.com/gravitational/gravity/lib/utils.RetryWithInterval
/gopath/src/github.com/gravitational/gravity/lib/kubernetes/nodes.go:256 github.com/gravitational/gravity/lib/kubernetes.Retry
/gopath/src/github.com/gravitational/gravity/lib/kubernetes/nodes.go:84 github.com/gravitational/gravity/lib/kubernetes.UpdateTaints
/gopath/src/github.com/gravitational/gravity/lib/update/cluster/phases/kubernetes.go:281 github.com/gravitational/gravity/lib/update/cluster/phases.taint
/gopath/src/github.com/gravitational/gravity/lib/update/cluster/phases/kubernetes.go:59 github.com/gravitational/gravity/lib/update/cluster/phases.(*phaseTaint).Execute
/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:512 github.com/gravitational/gravity/lib/fsm.(*FSM).executeOnePhase
/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:444 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhaseLocally
/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:404 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhase
/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:246 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePhase
/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:455 github.com/gravitational/gravity/lib/fsm.(*FSM).executeSubphasesSequentially
/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:449 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhaseLocally
/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:376 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhase
/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:246 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePhase
/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:455 github.com/gravitational/gravity/lib/fsm.(*FSM).executeSubphasesSequentially
/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:449 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhaseLocally
/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:376 github.com/gravitational/gravity/lib/fsm.(*FSM).executePhase
/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:246 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePhase
/gopath/src/github.com/gravitational/gravity/lib/fsm/fsm.go:175 github.com/gravitational/gravity/lib/fsm.(*FSM).ExecutePlan
/gopath/src/github.com/gravitational/gravity/lib/update/updater.go:217 github.com/gravitational/gravity/lib/update.(*Updater).executePlan
/gopath/src/github.com/gravitational/gravity/lib/update/updater.go:62 github.com/gravitational/gravity/lib/update.(*Updater).Run.func1
/go/src/runtime/asm_amd64.s:1337 runtime.goexit
User Message: failed to add taint {gravitational.io/runlevel system NoExecute <nil>} to node "10.0.0.136"
Get https://leader.telekube.local:6443/api/v1/nodes/10.0.0.136: dial tcp 10.0.0.135:6443: connect: connection refused] phase:/masters/dsp-rwhit5-upgrade-1/taint utils/logging.go:103
2021-09-21T22:33:21Z DEBU [FSM:UPDAT] Apply. change:StateChange(Phase=/masters/node1/taint, State=failed, Error=failed to add taint {gravitational.io/runlevel system NoExecute <nil>} to node "10.0.0.136"
Get https://leader.telekube.local:6443/api/v1/nodes/10.0.0.136: dial tcp 10.0.0.135:6443: connect: connection refused) utils/logging.go:103