aws-encryption-provider occasionally fails health check and restarts
d-m opened this issue · 11 comments
What happened:
Occasionally, the aws-encryption-provider will fail its health check, resulting in the following logs in the associated kube-apiserver pod:
$ kubectl logs -n kube-system kube-apiserver -c healthcheck
I0519 15:35:31.861185 1 main.go:178] listening on :3990
I0519 15:44:34.766218 1 main.go:128] proxied to GET https://127.0.0.1/healthz: 500 Internal Server Error
and in /var/log/kube-apiserver.log:
I0605 04:15:55.737313 1 healthz.go:244] kms-provider-0 check failed: healthz
[-]kms-provider-0 failed: failed to perform encrypt section of the healthz check for KMS Provider aws-encryption-provider, error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0605 04:15:57.054853 1 healthz.go:244] kms-provider-0 check failed: healthz
[-]kms-provider-0 failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
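For context (not part of the original report): the log above says the "encrypt section" of the plugin's healthz check timed out, i.e. a KMS Encrypt call could not complete before the caller's context deadline expired. Below is a minimal Go sketch of that general pattern, assuming aws-sdk-go v1 and a placeholder key alias; it is an illustration of how a slow or throttled KMS response turns into a DeadlineExceeded error, not the plugin's actual code.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/kms"
)

func main() {
	// Region taken from the logs in this issue; the key alias is a placeholder.
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
	svc := kms.New(sess)

	// kube-apiserver gives each KMS provider call a short deadline (3s by default
	// in the EncryptionConfiguration). A throttled or slow KMS response exceeds it.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	_, err := svc.EncryptWithContext(ctx, &kms.EncryptInput{
		KeyId:     aws.String("alias/placeholder-key"), // hypothetical key, not from the issue
		Plaintext: []byte("healthcheck"),
	})
	if err != nil {
		// When the context expires first, the error is of the DeadlineExceeded kind,
		// which kube-apiserver then surfaces as a failed kms-provider-0 check.
		log.Printf("healthz-style encrypt failed: %v", err)
	}
}
```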
This sometimes results in a restart of the aws-encryption-provider. When that happens, the Unix domain socket isn't cleaned up, and the aws-encryption-provider logs the following when it tries to start back up:
{"level":"info","timestamp":"2021-06-10T13:33:47.256Z","caller":"server/main.go:62","message":"creating kms server","healthz-path":"/healthz","healthz-port":":8083","region":"us-east-1","listen-address":"/srv/kubernetes/aws-encryption-provider/socket.sock","kms-endpoint":"","qps-limit":0,"burst-limit":0}
{"level":"info","timestamp":"2021-06-10T13:33:47.257Z","caller":"plugin/plugin.go:253","message":"registering the kms plugin with grpc server"}
{"level":"info","timestamp":"2021-06-10T13:33:47.257Z","caller":"plugin/plugin.go:106","message":"starting health check routine","period":"30s"}
{"level":"info","timestamp":"2021-06-10T13:33:47.257Z","caller":"server/main.go:98","message":"Healthchecks server started","port":":8083"}
{"level":"info","timestamp":"2021-06-10T13:33:47.257Z","caller":"server/main.go:99","message":"Plugin server started","port":"/srv/kubernetes/aws-encryption-provider/socket.sock"}
{"level":"fatal","timestamp":"2021-06-10T13:33:47.258Z","caller":"server/main.go:94","message":"Failed to start server","error":"failed to create listener: listen unix /srv/kubernetes/aws-encryption-provider/socket.sock: bind: address already in use","stacktrace":"main.main.func2\n\t/go/src/sigs.k8s.io/aws-encryption-provider/cmd/server/main.go:94"}
When this happens, the kube-apiserver on the same node goes into a restart loop, causing an API outage.
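One possible direction for the stale-socket half of the problem is sketched below. This is not the project's actual code and the helper name is hypothetical; it just shows the usual pattern of probing a leftover Unix socket and removing it when nothing is listening, so that a restart after a crash doesn't fail with "bind: address already in use".

```go
package main

import (
	"log"
	"net"
	"os"
	"time"
)

// listenUnixCleaningStale is a hypothetical helper for illustration only:
// if the socket file exists but no process answers on it, treat it as a
// stale leftover from a previous instance, remove it, and bind again.
func listenUnixCleaningStale(path string) (net.Listener, error) {
	if _, err := os.Stat(path); err == nil {
		conn, err := net.DialTimeout("unix", path, time.Second)
		if err == nil {
			// Someone is still listening; binding would legitimately fail.
			conn.Close()
			return nil, os.ErrExist
		}
		if err := os.Remove(path); err != nil {
			return nil, err
		}
	}
	return net.Listen("unix", path)
}

func main() {
	// Socket path taken from the logs in this issue.
	l, err := listenUnixCleaningStale("/srv/kubernetes/aws-encryption-provider/socket.sock")
	if err != nil {
		log.Fatalf("failed to create listener: %v", err)
	}
	defer l.Close()
	log.Printf("listening on %s", l.Addr())
}
```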
What you expected to happen:
The aws-encryption-provider remains running and continues to pass its health check.
How to reproduce it (as minimally and precisely as possible):
The issue seems to be intermittent.
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
  Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:28:42Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
  Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:19:55Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. cat /etc/os-release):
  NAME="Ubuntu" VERSION="18.04.5 LTS (Bionic Beaver)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 18.04.5 LTS" VERSION_ID="18.04" HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" VERSION_CODENAME=bionic UBUNTU_CODENAME=bionic
- Kernel (e.g. uname -a): 5.4.0-1035-aws #37~18.04.1-Ubuntu SMP Wed Jan 6 22:31:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: kops
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
Any update on this issue?
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.