aws-encryption-provider occasionally fails health check and restarts
d-m opened this issue · 11 comments
What happened:
Occasionally, the aws-encryption-provider will fail its health check, resulting in the following logs in the associated kube-apiserver pod:
$ kubectl logs -n kube-system kube-apiserver -c healthcheck
I0519 15:35:31.861185 1 main.go:178] listening on :3990
I0519 15:44:34.766218 1 main.go:128] proxied to GET https://127.0.0.1/healthz: 500 Internal Server Error
and in /var/log/kube-apiserver.log:
I0605 04:15:55.737313 1 healthz.go:244] kms-provider-0 check failed: healthz
[-]kms-provider-0 failed: failed to perform encrypt section of the healthz check for KMS Provider aws-encryption-provider, error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0605 04:15:57.054853 1 healthz.go:244] kms-provider-0 check failed: healthz
[-]kms-provider-0 failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
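For context (not part of the original report): the log above says the "encrypt section" of the plugin's healthz check timed out, i.e. a KMS Encrypt call could not complete before the caller's context deadline expired. Below is a minimal Go sketch of that general pattern, assuming aws-sdk-go v1 and a placeholder key alias; it is an illustration of how a slow or throttled KMS response turns into a DeadlineExceeded error, not the plugin's actual code.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/kms"
)

func main() {
	// Region taken from the logs in this issue; the key alias is a placeholder.
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	svc := kms.New(sess)

	// kube-apiserver gives each KMS provider call a short deadline (3s by default
	// in the EncryptionConfiguration). A throttled or slow KMS response exceeds it.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	_, err := svc.EncryptWithContext(ctx, &kms.EncryptInput{
		KeyId:     aws.String("alias/placeholder-key"), // hypothetical key, not from the issue
		Plaintext: []byte("healthcheck"),
	})
	if err != nil {
		// When the context expires first, the error is of the DeadlineExceeded kind,
		// which kube-apiserver then surfaces as a failed kms-provider-0 check.
		log.Printf("healthz-style encrypt failed: %v", err)
	}
}
```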
This sometimes results in a restart of the aws-encryption-provider. When that happens, the Unix domain socket isn't cleaned up, and the aws-encryption-provider logs the following when it tries to start back up:
{"level":"info","timestamp":"2021-06-10T13:33:47.256Z","caller":"server/main.go:62","message":"creating kms server","healthz-path":"/healthz","healthz-port":":8083","region":"us-east-1","listen-address":"/srv/kubernetes/aws-encryption-provider/socket.sock","kms-endpoint":"","qps-limit":0,"burst-limit":0}
{"level":"info","timestamp":"2021-06-10T13:33:47.257Z","caller":"plugin/plugin.go:253","message":"registering the kms plugin with grpc server"}
{"level":"info","timestamp":"2021-06-10T13:33:47.257Z","caller":"plugin/plugin.go:106","message":"starting health check routine","period":"30s"}
{"level":"info","timestamp":"2021-06-10T13:33:47.257Z","caller":"server/main.go:98","message":"Healthchecks server started","port":":8083"}
{"level":"info","timestamp":"2021-06-10T13:33:47.257Z","caller":"server/main.go:99","message":"Plugin server started","port":"/srv/kubernetes/aws-encryption-provider/socket.sock"}
{"level":"fatal","timestamp":"2021-06-10T13:33:47.258Z","caller":"server/main.go:94","message":"Failed to start server","error":"failed to create listener: listen unix /srv/kubernetes/aws-encryption-provider/socket.sock: bind: address already in use","stacktrace":"main.main.func2\n\t/go/src/sigs.k8s.io/aws-encryption-provider/cmd/server/main.go:94"}
When this happens, the kube-apiserver on the same node goes into a restart loop, causing an API outage.
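One possible direction for the stale-socket half of the problem is sketched below. This is not the project's actual code and the helper name is hypothetical; it just shows the usual pattern of probing a leftover Unix socket and removing it when nothing is listening, so that a restart after a crash doesn't fail with "bind: address already in use".

```go
package main

import (
	"log"
	"net"
	"os"
	"time"
)

// listenUnixCleaningStale is a hypothetical helper for illustration only:
// if the socket file exists but no process answers on it, treat it as a
// stale leftover from a previous instance, remove it, and bind again.
func listenUnixCleaningStale(path string) (net.Listener, error) {
	if _, err := os.Stat(path); err == nil {
		conn, err := net.DialTimeout("unix", path, time.Second)
		if err == nil {
			// Someone is still listening; binding would legitimately fail.
			conn.Close()
			return nil, os.ErrExist
		}
		if err := os.Remove(path); err != nil {
			return nil, err
		}
	}
	return net.Listen("unix", path)
}

func main() {
	// Socket path taken from the logs in this issue.
	l, err := listenUnixCleaningStale("/srv/kubernetes/aws-encryption-provider/socket.sock")
	if err != nil {
		log.Fatalf("failed to create listener: %v", err)
	}
	defer l.Close()
	log.Printf("listening on %s", l.Addr())
}
```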
What you expected to happen:
The aws-encryption-provider remains running and continues to pass its health check.
How to reproduce it (as minimally and precisely as possible):
The issue seems to be intermittent.
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
  Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:28:42Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
  Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:19:55Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. cat /etc/os-release):
  NAME="Ubuntu" VERSION="18.04.5 LTS (Bionic Beaver)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 18.04.5 LTS" VERSION_ID="18.04" HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" VERSION_CODENAME=bionic UBUNTU_CODENAME=bionic
- Kernel (e.g. uname -a): 5.4.0-1035-aws #37~18.04.1-Ubuntu SMP Wed Jan 6 22:31:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: kops
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
Any update on this issue?
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.