Certificate rotation is erroring out
kaman05010 opened this issue · 6 comments
Certificate rotation is erroring out with the error below:
DEBU[0000] Resolving tenantID for
DEBU[0007] Already registered for "Microsoft.Compute"
DEBU[0007] Already registered for "Microsoft.Storage"
DEBU[0007] Already registered for "Microsoft.Network"
INFO[0007] Backing up artifacts to directory /root/_output//_rotate_certs_backup
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/apimodel.json
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/azuredeploy.json
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/azuredeploy.parameters.json
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/kubeconfig/kubeconfig.southindia.json
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/ca.key
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/ca.crt
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/apiserver.key
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/apiserver.crt
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/client.key
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/client.crt
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/kubectlClient.key
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/kubectlClient.crt
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/etcdserver.key
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/etcdserver.crt
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/etcdclient.key
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/etcdclient.crt
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/etcdpeer0.key
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/etcdpeer0.crt
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/etcdpeer1.key
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/etcdpeer1.crt
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/etcdpeer2.key
DEBU[0007] output: wrote /root/_output//_rotate_certs_backup/etcdpeer2.crt
INFO[0007] Generating new certificates
DEBU[0031] pki: PKI asset creation took 22.903501399s
INFO[0031] Writing artifacts to output directory /root/_output//_rotate_certs_output
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/apimodel.json
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/azuredeploy.json
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/azuredeploy.parameters.json
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/kubeconfig/kubeconfig.southindia.json
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/ca.key
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/ca.crt
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/apiserver.key
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/apiserver.crt
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/client.key
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/client.crt
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/kubectlClient.key
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/kubectlClient.crt
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/etcdserver.key
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/etcdserver.crt
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/etcdclient.key
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/etcdclient.crt
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/etcdpeer0.key
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/etcdpeer0.crt
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/etcdpeer1.key
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/etcdpeer1.crt
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/etcdpeer2.key
DEBU[0031] output: wrote /root/_output//_rotate_certs_output/etcdpeer2.crt
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1803406]
goroutine 1 [running]:
github.com/Azure/aks-engine/cmd.(*rotateCertsCmd).run.func1(0x0)
/go/src/github.com/Azure/aks-engine/cmd/rotate_certs.go:259 +0x26
github.com/Azure/aks-engine/cmd.(*rotateCertsCmd).run(0xc00033e240, 0x1d95660, 0xc000410018)
/go/src/github.com/Azure/aks-engine/cmd/rotate_certs.go:264 +0x452
github.com/Azure/aks-engine/cmd.newRotateCertsCmd.func1(0xc000504a00, 0xc0005200e0, 0x0, 0xe, 0x0, 0x0)
/go/src/github.com/Azure/aks-engine/cmd/rotate_certs.go:111 +0xb7
github.com/spf13/cobra.(*Command).execute(0xc000504a00, 0xc000520000, 0xe, 0xe, 0xc000504a00, 0xc000520000)
/go/src/github.com/Azure/aks-engine/vendor/github.com/spf13/cobra/command.go:762 +0x472
github.com/spf13/cobra.(*Command).ExecuteC(0xc000340780, 0x1d962c0, 0xc0000a2008, 0xc00008a058)
/go/src/github.com/Azure/aks-engine/vendor/github.com/spf13/cobra/command.go:852 +0x2fe
github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/Azure/aks-engine/vendor/github.com/spf13/cobra/command.go:800
main.main()
Below is the state of the kube-system pods:
azure-ip-masq-agent-8ln4c 1/1 Running 1 137m
azure-ip-masq-agent-8wxqr 1/1 Running 0 137m
azure-ip-masq-agent-f6v7z 0/1 Pending 0 136m
azure-ip-masq-agent-gvrwx 0/1 Pending 0 136m
azure-ip-masq-agent-j876l 0/1 Pending 0 136m
azure-ip-masq-agent-jlxdl 0/1 Pending 0 135m
azure-ip-masq-agent-jntld 0/1 Pending 0 136m
azure-ip-masq-agent-lg9f9 0/1 Pending 0 136m
azure-ip-masq-agent-pgbbc 0/1 Pending 0 136m
azure-ip-masq-agent-qmqbt 1/1 Running 0 137m
azure-ip-masq-agent-qwdwd 0/1 Pending 0 136m
azure-ip-masq-agent-x6kdl 0/1 Pending 0 136m
coredns-6dcff98ff4-p4zs9 1/1 Running 0 58d
coredns-autoscaler-8689c554f9-tszmj 1/1 Running 0 58d
heapster-6f6cbcfcf6-b9fk5 0/2 Pending 0 4h19m
kube-addon-manager-k8s-master-27469183-0 1/1 Running 3 58d
kube-addon-manager-k8s-master-27469183-1 1/1 Running 1 58d
kube-addon-manager-k8s-master-27469183-2 1/1 Running 2 58d
kube-apiserver-k8s-master-27469183-0 1/1 Running 3 3h5m
kube-apiserver-k8s-master-27469183-1 1/1 Running 1 104m
kube-apiserver-k8s-master-27469183-2 1/1 Running 2 104m
kube-controller-manager-k8s-master-27469183-0 1/1 Running 4 4h13m
kube-controller-manager-k8s-master-27469183-1 1/1 Running 2 3h10m
kube-controller-manager-k8s-master-27469183-2 1/1 Running 2 58d
kube-proxy-2swbv 1/1 Running 0 58d
kube-proxy-65skk 1/1 Running 2 58d
kube-proxy-98c68 1/1 Running 1 58d
kube-proxy-bh92q 1/1 Running 3 58d
kube-proxy-bw65c 1/1 Running 0 58d
kube-proxy-cxtkm 1/1 Running 0 58d
kube-proxy-fgbd8 1/1 Running 0 58d
kube-proxy-hlzz8 1/1 Running 0 58d
kube-proxy-k5t6r 1/1 Running 1 58d
kube-proxy-l5qxs 1/1 Running 0 58d
kube-proxy-r2sh9 1/1 Running 0 58d
kube-proxy-skc7t 1/1 Running 0 58d
kube-scheduler-k8s-master-27469183-0 1/1 Running 4 58d
kube-scheduler-k8s-master-27469183-1 1/1 Running 2 58d
kube-scheduler-k8s-master-27469183-2 1/1 Running 4 58d
metrics-server-5f88cd68f9-9dv4l 0/1 Pending 0 3h10m
The only certificate that has expired is the kube-proxy certificate:
for i in $(ls /etc/kubernetes/certs/*.crt); do openssl x509 -noout -text -in $i |grep "Not After" ; done
Not After : Jan 7 00:26:21 2052 GMT
Not After : Jan 7 00:26:19 2052 GMT
Not After : Jan 7 00:26:21 2052 GMT
Not After : Jan 7 00:26:21 2052 GMT
Not After : Jan 7 00:26:21 2052 GMT
Not After : Jan 7 00:26:21 2052 GMT
Not After : Jan 7 00:26:21 2052 GMT
Not After : Jan 7 00:26:21 2052 GMT
Not After : Nov 12 03:22:54 2041 GMT
Not After : Aug 28 09:59:38 2024 GMT
Not After : Aug 28 09:59:38 2021 GMT
State of the k8s cluster; the masters have lost access to the nodes:
k8s-master-27469183-0 Ready master 58d v1.18.20 10.22.0.50 Ubuntu 18.04.5 LTS 5.4.0-1065-azure docker://19.3.14
k8s-master-27469183-1 Ready master 58d v1.18.20 10.22.0.51 Ubuntu 18.04.5 LTS 5.4.0-1065-azure docker://19.3.14
k8s-master-27469183-2 Ready master 58d v1.18.20 10.22.0.52 Ubuntu 18.04.5 LTS 5.4.0-1065-azure docker://19.3.14
k8s-node-27469183-vmss00001m NotReady agent 58d v1.18.20 10.22.0.44 Ubuntu 18.04.5 LTS 5.4.0-1049-azure docker://19.3.14
k8s-node-27469183-vmss00001n NotReady agent 58d v1.18.20 10.22.0.47 Ubuntu 18.04.5 LTS 5.4.0-1049-azure docker://19.3.14
k8s-node-27469183-vmss00001o NotReady agent 58d v1.18.20 10.22.0.21 Ubuntu 18.04.5 LTS 5.4.0-1049-azure docker://19.3.14
k8s-node-27469183-vmss00001p NotReady agent 58d v1.18.20 10.22.0.26 Ubuntu 18.04.5 LTS 5.4.0-1049-azure docker://19.3.14
k8s-node-27469183-vmss00001q NotReady agent 58d v1.18.20 10.22.0.43 Ubuntu 18.04.5 LTS 5.4.0-1049-azure docker://19.3.14
k8s-node-27469183-vmss00001r NotReady agent 58d v1.18.20 10.22.0.28 Ubuntu 18.04.5 LTS 5.4.0-1049-azure docker://19.3.14
k8s-node-27469183-vmss00001s NotReady agent 58d v1.18.20 10.22.0.20 Ubuntu 18.04.5 LTS 5.4.0-1065-azure docker://19.3.14
k8s-node-27469183-vmss00001t NotReady agent 58d v1.18.20 10.22.0.22 Ubuntu 18.04.5 LTS 5.4.0-1049-azure docker://19.3.14
k8s-node-27469183-vmss00001u NotReady agent 58d v1.18.20 10.22.0.23 Ubuntu 18.04.5 LTS 5.4.0-1049-azure docker://19.3.14
The status of the etcd cluster looks fine:
member 8cbd83a9c3d8c4e is healthy: got healthy result from
member 4d6f0b6663a5c56a is healthy: got healthy result from
member 630935a1a1e3ead2 is healthy: got healthy result from
cluster is healthy
Command used to rotate the certificates:
aks-engine rotate-certs --api-model <> --location <> --resource-group <> --subscription-id <> --auth-method=cli --linux-ssh-private-key /root/ --ssh-host 10.x.x.x --debug
AKS Engine version
0.65.1
Kubernetes version
1.18.20
Please recommend a workaround for this issue.
👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.
Looking at the code:
goroutine 1 [running]:
github.com/Azure/aks-engine/cmd.(*rotateCertsCmd).run.func1(0x0)
/go/src/github.com/Azure/aks-engine/cmd/rotate_certs.go:259 +0x26
refers to resumeClusterAutoscaler, even though this add-on is set to false in apimodel.json.
cc @jadarsie
Regardless of whether the autoscaler is enabled, the code needs to be fixed.
aks-engine/cmd/rotate_certs.go
Lines 257 to 262 in 8f41133
PauseClusterAutoscaler(...) (func() error, error)
in several return cases returns only the error, not the func; the returned func is nil on every error path. This is problematic because resumeClusterAutoscaler is called in a defer without checking whether it is nil. As a result, almost every error case of PauseClusterAutoscaler causes a nil pointer dereference.
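A minimal sketch of the failure mode, using hypothetical function names that mirror PauseClusterAutoscaler (not the actual aks-engine code); the nil guard inside the deferred closure is one way to avoid the SIGSEGV:

```go
package main

import (
	"errors"
	"fmt"
)

// pauseClusterAutoscaler mimics the problematic signature: on error paths
// it returns a nil resume func alongside the error.
func pauseClusterAutoscaler(fail bool) (func() error, error) {
	if fail {
		return nil, errors.New("autoscaler not found") // resume func is nil here
	}
	return func() error { return nil }, nil
}

func run() (err error) {
	resume, err := pauseClusterAutoscaler(true)
	// Deferring resume() unconditionally panics when resume is nil:
	//   defer resume() // runtime error: invalid memory address or nil pointer dereference
	// Guarding the deferred call avoids the crash:
	defer func() {
		if resume != nil {
			if rerr := resume(); rerr != nil && err == nil {
				err = rerr
			}
		}
	}()
	return err
}

func main() {
	fmt.Println(run()) // prints the pause error instead of panicking
}
```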
I'll take a moment to look at this.