coreos/container-linux-update-operator

Shut down update-agent cleanly; it panics on reboot

dghubble opened this issue · 3 comments

When the host is shutting down, update-agent panics. This isn't a high priority since the host is being shut down anyway, but users will see the panic in the logs and worry.

I0622 20:28:31.776431       1 agent.go:68] Updating status
I0622 20:28:31.776467       1 agent.go:78] Indicating a reboot is needed
I0622 20:28:45.169911       1 agent.go:163] Setting annotations map[string]string{"container-linux-update.v1.coreos.com/reboot-in-progress":"true"}
I0622 20:28:45.190260       1 agent.go:175] Marking node as unschedulable
I0622 20:28:45.208388       1 agent.go:180] Getting pod list for deletion
I0622 20:28:45.228146       1 agent.go:189] Deleting 0 pods
I0622 20:28:45.228252       1 agent.go:199] Node drained, rebooting
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x14252c7]

goroutine 31 [running]:
github.com/coreos/container-linux-update-operator/pkg/updateengine.(*Client).ReceiveStatuses(0xc4202337e0, 0xc420321560, 0xc4203215c0)
        /go/src/github.com/coreos/container-linux-update-operator/pkg/updateengine/client.go:83 +0xf7
created by github.com/coreos/container-linux-update-operator/pkg/agent.(*Klocksmith).watchUpdateStatus
        /go/src/github.com/coreos/container-linux-update-operator/pkg/agent/agent.go:217 +0x138

#107 partially addresses this issue: it fixes this exact error message. The underlying problem is that goroutines spun off by the agent are not stopped properly when execution completes, and #107 alone does not fix that.

Once #111 merges, we plan to add traps for shutdown signals in both the update-operator and update-agent that will close the top-level stop channel and should cleanly stop all goroutines and sleeps that take the channel.

Closing, I don't use or maintain this project anymore. ❤️