kubernetes-sigs/aws-encryption-provider

aws-encryption-provider occasionally fails health check and restarts

d-m opened this issue · 11 comments

d-m commented

What happened:

Occasionally, the aws-encryption-provider will fail its health check, resulting in the following logs in the associated kube-apiserver pod:

$ kubectl logs -n kube-system kube-apiserver -c healthcheck
I0519 15:35:31.861185       1 main.go:178] listening on :3990
I0519 15:44:34.766218       1 main.go:128] proxied to GET https://127.0.0.1/healthz: 500 Internal Server Error

and in /var/log/kube-apiserver.log:

I0605 04:15:55.737313       1 healthz.go:244] kms-provider-0 check failed: healthz
[-]kms-provider-0 failed: failed to perform encrypt section of the healthz check for KMS Provider aws-encryption-provider, error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0605 04:15:57.054853       1 healthz.go:244] kms-provider-0 check failed: healthz
[-]kms-provider-0 failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
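
For reference, the failing check can be narrowed down while this is happening; a rough sketch (the provider's /healthz path and :8083 port are taken from its startup log further below, so adjust if your flags differ):

$ kubectl get --raw='/healthz?verbose'
# lists each named apiserver check; kms-provider-0 shows as [-] while the problem occurs
$ curl -i http://127.0.0.1:8083/healthz
# run on the affected control-plane node to hit the provider's own health endpoint directly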

These health check failures sometimes result in a restart of the aws-encryption-provider. When that happens, the Unix socket file isn't cleaned up, and the aws-encryption-provider logs the following when it tries to start back up:

{"level":"info","timestamp":"2021-06-10T13:33:47.256Z","caller":"server/main.go:62","message":"creating kms server","healthz-path":"/healthz","healthz-port":":8083","region":"us-east-1","listen-address":"/srv/kubernetes/aws-encryption-provider/socket.sock","kms-endpoint":"","qps-limit":0,"burst-limit":0}
{"level":"info","timestamp":"2021-06-10T13:33:47.257Z","caller":"plugin/plugin.go:253","message":"registering the kms plugin with grpc server"}
{"level":"info","timestamp":"2021-06-10T13:33:47.257Z","caller":"plugin/plugin.go:106","message":"starting health check routine","period":"30s"}
{"level":"info","timestamp":"2021-06-10T13:33:47.257Z","caller":"server/main.go:98","message":"Healthchecks server started","port":":8083"}
{"level":"info","timestamp":"2021-06-10T13:33:47.257Z","caller":"server/main.go:99","message":"Plugin server started","port":"/srv/kubernetes/aws-encryption-provider/socket.sock"}
{"level":"fatal","timestamp":"2021-06-10T13:33:47.258Z","caller":"server/main.go:94","message":"Failed to start server","error":"failed to create listener: listen unix /srv/kubernetes/aws-encryption-provider/socket.sock: bind: address already in use","stacktrace":"main.main.func2\n\t/go/src/sigs.k8s.io/aws-encryption-provider/cmd/server/main.go:94"}

When this happens, the kube-apiserver on the same node goes into a restart loop, causing an API outage.
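
As a manual mitigation (not a fix), deleting the orphaned socket file on the node lets the provider bind again on its next restart, which in turn clears the kube-apiserver crash loop. A preventive variant is to remove any stale socket immediately before the provider starts; the wrapper below is illustrative only, and the placeholder stands in for whatever command and flags the manifest already uses:

# one-off recovery on the affected control-plane node
$ sudo rm -f /srv/kubernetes/aws-encryption-provider/socket.sock
# preventive sketch: clear a stale socket right before launching the provider
$ sh -c 'rm -f /srv/kubernetes/aws-encryption-provider/socket.sock && exec <aws-encryption-provider command from the manifest>'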

What you expected to happen:

aws-encryption-provider remains running

How to reproduce it (as minimally and precisely as possible):

The issue seems to be intermittent.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:28:42Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.6", GitCommit:"8a62859e515889f07e3e3be6a1080413f17cf2c3", GitTreeState:"clean", BuildDate:"2021-04-15T03:19:55Z", GoVersion:"go1.15.10", Compiler:"gc", Platform:"linux/amd64"}
    
  • OS (e.g: cat /etc/os-release):
    NAME="Ubuntu"
    VERSION="18.04.5 LTS (Bionic Beaver)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 18.04.5 LTS"
    VERSION_ID="18.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=bionic
    UBUNTU_CODENAME=bionic
    
  • Kernel (e.g. uname -a): 5.4.0-1035-aws #37~18.04.1-Ubuntu SMP Wed Jan 6 22:31:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kops

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

d-m commented

/remove-lifecycle rotten

Any update on this issue?

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

/remove-lifecycle rotten

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.