pingcap/tidb-operator

Webhook certificate expires one year after the API server starts

Smityz opened this issue · 4 comments

Bug Report

What version of Kubernetes are you using?

v1.22

What version of TiDB Operator are you using?

v1.4.4

What did you do?
After running stably for several months, the operator suddenly started reporting errors continuously and could not complete sync. After we disabled the webhook, the operator returned to normal.
Related error log:

E0112 17:49:03.476708       1 tidb_cluster_controller.go:133] TidbCluster: x sync failed Internal error occurred: failed calling webhook "defaulting.admission.tidb.pingcap.com": failed to call webhook: Post "https://kubernetes.default.svc:443/apis/admission.tidb.pingcap.com/v1alpha1/pingcapresourcemutations?timeout=10s": x509: certificate has expired or is not yet valid: current time 2024-01-12T17:49:03+08:00 is after 2024-01-10T09:50:21Z, requeuing
E0112 17:49:03.859792       1 tidbcluster_control.go:90] failed to update TidbCluster: [x], error: Internal error occurred: failed calling webhook "defaulting.admission.tidb.pingcap.com": failed to call webhook: Post "https://kubernetes.default.svc:443/apis/admission.tidb.pingcap.com/v1alpha1/pingcapresourcemutations?timeout=10s": x509: certificate has expired or is not yet valid: current time 2024-01-12T17:49:03+08:00 is after 2024-01-10T09:50:21Z

We suspect this is related to the self-signed certificate mechanism of the API server, because the expiration time of the certificate happens to be one year after the API server started. We also found a related bug here: openshift/generic-admission-server#33

As openshift/generic-admission-server#33 (comment) says, k8s.io/apiserver has supported reloading the serving certs since k8s 1.18.

TiDB Operator v1.4.4 uses v1.19 of K8s (https://github.com/pingcap/tidb-operator/blob/v1.4.4/go.mod#L65), and this version of generic-admission-server also uses k8s v1.19 (https://github.com/openshift/generic-admission-server/blob/da96454c926de350e52f6c7a6ee86af49ee96b00/go.mod), so it should reload the certs.

Did your cert simply expire, or was it renewed after it had expired?

It's not the cert of the tidb-webhook that expired, but the CA of "kubernetes.default.svc" in the k8s apiserver.

That's because the call flow of TiDB CRD admission is
k8s apiserver -> apiservice (kubernetes.default.svc) -> tidb webhook pod
i.e.
k8s apiserver -> k8s apiserver (kubernetes.default.svc) -> tidb webhook pod

When a k8s apiserver runs for more than one year without restarting, the CA of kubernetes.default.svc held in the k8s apiserver's memory expires.
As a result, calls from the k8s apiserver to itself start failing after a year in this case.

By default, the CA of kubernetes.default.svc in the k8s apiserver's memory is self-signed with a one-year validity when the k8s apiserver starts.

@Smityz is this caused as iPenx said? Have you resolved it?

Yes, we are on the same team. We ended up disabling the webhook, but I think it's a common problem and it needs to be solved.