Kong/charts

"kong stop" in wait-for-db command prevents init container from exiting gracefully, suggest kong quit"

mitrousis opened this issue · 2 comments

Using "kong stop" in this command exits the container, but not gracefully. While the follow-on proxy container is still healthy, the wait-for-db container continues to restart, creating a lot of concerning-looking logs, such as:

Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to [::]:8002 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to 0.0.0.0:8003 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to 0.0.0.0:8446 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to 0.0.0.0:8004 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to 0.0.0.0:8447 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to 0.0.0.0:8001 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to [::]:8001 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to 0.0.0.0:8100 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to [::]:8100 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to 0.0.0.0:8005 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to [::]:8005 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to 0.0.0.0:8006 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [notice] 1173795#0: try again to bind() after 500ms
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to 0.0.0.0:8002 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to [::]:8002 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to 0.0.0.0:8003 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to 0.0.0.0:8446 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to 0.0.0.0:8004 failed (98: Address already in use)
Mon, Feb 5 2024 11:17:31 pm | 2024/02/06 07:17:22 [emerg] 1173795#0: bind() to 0.0.0.0:8447 failed (98: Address already in use)

Suggest a change similar to:

 until kong start; do echo 'waiting for db'; sleep 5; done; kong quit --wait=5

Additionally, there should be a timeout to fail the init reasonably quickly since DB avail is unrecoverable.

Can you confirm your chart and Kong image versions? Do you have logs from the wait-for-db container prior to it attempting the conflicting binds and Pod describe info? I'm not able to replicate this with a basic instance, so I suspect there's something more going on.

In any case, I'd recommend just setting waitImage.enabled=false if it's causing trouble. That init container is of pretty limited use, since the main app container will hit the same DB unavailable failure state. wait-for-db is mostly in place to avoid the Kong Pods hitting CrashLoopBackOff during initial install while waiting for initial migrations.

With the current 2.37.0 release and the example Postgres-bundled config:

helm install ana kong/kong -f https://raw.githubusercontent.com/Kong/charts/main/charts/kong/example-values/minimal-kong-standalone.yaml

kubectl describe pod PODNAME gives me:

  wait-for-db:
    Container ID:    containerd://9f0842f70707888d2daf34115ac549d01acf3023489c1350f05fdf03a0d19d7c
    Image:           kong:3.5
    Image ID:        docker.io/library/kong@sha256:ba218fe5832d5fd2c2d915df9016d59e17af1ef58dcd7edbf4744cc480c55793
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    Args:
      /bin/bash
      -c
      export KONG_NGINX_DAEMON=on KONG_PREFIX=`mktemp -d` KONG_KEYRING_ENABLED=off; until kong start; do echo 'waiting for db'; sleep 1; done; kong stop
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 21 Feb 2024 15:00:25 -0800
      Finished:     Wed, 21 Feb 2024 15:00:49 -0800
    Ready:          True

The described behavior is also pretty odd. AFAIK you should never reach a state where both the application and init containers are running simultaneously: failed init containers block the start of application containers, and successful init containers do not respawn, since their restart policy is OnFailure when the Pod-level policy is Always.

If it's actually failing (I edited the args to run false to force it), you should get an Init:Error eventually transitioning to Init:CrashLoopBackOff, with the app containers never starting at all:

$ kubectl get po
ana-kong-7464647b7b-szldd        0/1     Init:Error   1 (6s ago)   7s
$ sleep 80
$ kubectl get po
ana-kong-7464647b7b-szldd        0/1     Init:CrashLoopBackOff   5 (102s ago)   4m39s
  wait-for-db:
    Container ID:    containerd://db988d99cad4f82efec21334304fcb7bf936fc9017be7fe1bc4d80dfcd9c60fe
    Image:           kong:3.5
    Image ID:        docker.io/library/kong@sha256:ba218fe5832d5fd2c2d915df9016d59e17af1ef58dcd7edbf4744cc480c55793
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    Args:
      false
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 21 Feb 2024 15:46:59 -0800
      Finished:     Wed, 21 Feb 2024 15:46:59 -0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 21 Feb 2024 15:46:58 -0800
      Finished:     Wed, 21 Feb 2024 15:46:58 -0800
    Ready:          False

The basic rules around init container management stipulate that they should never be running concurrently with anything else, so if they are that'd arguably be a Kubernetes-level bug. I was unable to find any reports elsewhere matching that description, however.

It's worth noting that the exit code is technically from the bash -c (and then https://github.com/Kong/docker-kong/blob/master/docker-entrypoint.sh below that), so this is maybe something else in the scripts returning the exit code.

We're using Kong helm charts 2.19.0, and image kong-gateway:3.4.1.1-rhel. However, I no longer see the issue (now seeing success using the original chart version) and can't replicate to get the original error logs. I only have this screenshot from Rancher:

image

Given there are a variety of teams working on our infra, I won't be able to narrow this down to any specific change/fix. Feel free to close this as unreproducible from here. Thanks for having a look.