Impossible to cluster when having readiness gates on port 8080

Question

c-datculescu opened this issue 8 months ago · 3 comments

Describe the bug
When clustering is used together with readiness gates (AWS ALB readiness gates), the pods can never start: no pod becomes available until the Service endpoints have been populated, but the endpoints are never populated until the readiness gate passes. This circular dependency means a pod never becomes fully started.

To Reproduce
Steps to reproduce the behavior:

1. Use EKS.
2. Deploy kube-httpcache with at least 2 pods.
3. Look at the kube-httpcache logs. An error message like the following appears:

W0308 14:41:30.853956       1 endpoints_watch.go:66] service 'some_random_service' has no endpoints

Expected behavior
I would expect to be able to cluster the pods.

Environment:

Kubernetes version: [e.g. 1.26]
kube-httpcache version: [e.g. v0.7]

Configuration

Additional context

Answer 1 · 2024-04-15T19:18:59.000Z

I'm having the same issue. It's a catch-22.

Answer 2 · 2024-07-11T14:39:27.000Z

I solved it with a custom readiness check script that always returns positive on the first check and afterwards reports positive only if the cached site is available on 127.0.0.1:8080. But it's a dirty hack, of course.
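For anyone who wants to try the same workaround, here is a minimal sketch of such a check in Python. It is not the script I actually run: the marker file path, the function names, and the choice to treat any 2xx/3xx response as "available" are all assumptions made for illustration.

```python
import os
import urllib.error
import urllib.request

# Hypothetical marker file; it only needs to survive between probe runs
# inside the same container.
MARKER = "/tmp/readiness-first-check-done"
# Address taken from the description above.
URL = "http://127.0.0.1:8080/"


def cache_is_up(url=URL, timeout=2.0):
    """Return True if the cache answers on `url` with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def check(marker=MARKER, probe=cache_is_up):
    """First call always passes; later calls pass only if `probe()` does."""
    if not os.path.exists(marker):
        # First invocation: remember that it happened and report ready.
        with open(marker, "w"):
            pass
        return True
    return probe()


def main():
    """Exit code for an exec-style probe: 0 means ready, 1 means not ready."""
    return 0 if check() else 1
```

An exec readiness probe could run a small wrapper that calls `sys.exit(main())`. The obvious weakness is that the pod reports ready once before the cache is really serving, which is why it remains a hack.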
What exactly does the frontend watch do? What happens when I turn it off? Is it related to distributing signals, e.g. PURGE?

Answer 3 · 2024-07-24T13:44:52.000Z

I have the same issue. It helps to have a Service with .spec.publishNotReadyAddresses=true, but then another problem appears.
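To illustrate why publishing not-ready addresses breaks the loop from the original report, here is a toy model in Python. It is only my reading of the report, not kube-httpcache's actual code: the function name and the three boolean states are invented for this sketch, and each loop iteration stands for one round of "everything gets re-evaluated".

```python
def pod_becomes_ready(publish_not_ready_addresses: bool, rounds: int = 10) -> bool:
    """Toy model of the startup loop; returns True if the gate ever passes."""
    serving_8080 = False  # the cache answers the ALB health check on port 8080
    gate_passed = False   # the ALB readiness gate condition on the pod

    for _ in range(rounds):
        # The pod counts as ready only once its readiness gate has passed.
        pod_ready = gate_passed
        # The Service lists the pod if it is ready, or unconditionally when
        # .spec.publishNotReadyAddresses is true.
        listed_in_endpoints = pod_ready or publish_not_ready_addresses
        # Assumption from the report: the cache starts serving only after
        # the endpoints have been populated.
        serving_8080 = serving_8080 or listed_in_endpoints
        # The gate can pass only if the health check on 8080 succeeds.
        gate_passed = gate_passed or serving_8080

    return gate_passed
```

With `publish_not_ready_addresses=False` none of the three states can ever flip, so the function returns `False` no matter how many rounds run. With `True`, the pod is listed immediately and the chain resolves.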

When Pods are added (by scaling a Deployment or StatefulSet up), there is a race condition in pkg/watcher/endpoints_watch.go:89.

Pods are added to the Service before they are necessarily ready in all conditions, and the check on that line discards such a Pod's address from the list. After the next event is received (after scaling up again, for example), the skipped Pod is included (assuming it is ready by then), but the newest Pod hits the same race condition and will probably be missed as well.
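The effect can be reproduced with a small Python model. This is a simplification and not the Go code in endpoints_watch.go: the function name is made up, and it assumes that a watch event fires when a Pod is added to the Service but that no further event arrives when that Pod later becomes ready.

```python
def lists_after_scale_ups(pod_names):
    """Return the endpoint list the watcher would build after each scale-up.

    Assumed model: every new Pod triggers one event at the moment it is
    added, while it is still not ready, and the watcher keeps only Pods
    that are ready at that moment.
    """
    published = []  # addresses the Service exposes, ready or not
    ready = set()   # Pods whose conditions are all true
    snapshots = []

    for name in pod_names:
        published.append(name)
        # The event fires now; `name` is published but not ready yet,
        # so the condition check drops it.
        snapshots.append([p for p in published if p in ready])
        # The Pod becomes ready afterwards, without a new event.
        ready.add(name)

    return snapshots
```

For three scale-ups, `lists_after_scale_ups(["pod-0", "pod-1", "pod-2"])` yields `[[], ["pod-0"], ["pod-0", "pod-1"]]`: the list always lags one Pod behind.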

I would suggest adding command-line options to disable this check and always include all frontend or backend endpoints (depending on which option is set):

--no-frontend-condition-check
--no-backend-condition-check
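Roughly, the options would gate the check as in the Python sketch below. The two flags do not exist in kube-httpcache today; they are only the proposal above, and the parser, the helper name, and the `(address, is_ready)` tuple shape are hypothetical. A real change would be made in the project's Go code.

```python
import argparse


def build_parser():
    """Parser exposing only the two proposed (hypothetical) options."""
    parser = argparse.ArgumentParser(prog="kube-httpcache-sketch")
    parser.add_argument(
        "--no-frontend-condition-check",
        action="store_true",
        help="include all frontend endpoints regardless of Pod conditions",
    )
    parser.add_argument(
        "--no-backend-condition-check",
        action="store_true",
        help="include all backend endpoints regardless of Pod conditions",
    )
    return parser


def select_endpoints(pods, skip_condition_check):
    """Filter `pods`, a list of (address, is_ready) tuples.

    With the check skipped every address is kept; otherwise only the
    addresses of ready Pods are kept.
    """
    if skip_condition_check:
        return [addr for addr, _ in pods]
    return [addr for addr, is_ready in pods if is_ready]
```

For example, after parsing `--no-frontend-condition-check`, the frontend list would be built with `select_endpoints(pods, args.no_frontend_condition_check)`, while the backend list would still be filtered by readiness.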

I could prepare a pull request with those CLI options if this solution is acceptable.