Pods failing their health check after upgrading to operator 1.1.2
denniskorbginski opened this issue · 3 comments
Background info for context
Earlier today, I set up the dragonfly operator based on this manifest, which deployed version 1.1.1. I then applied the manifest below to run dragonfly; it is based on the sample linked in the docs. I added the proactor_threads argument to prevent the containers from exiting with the error message "There are 4 threads, so 1.00GiB are required. Exiting..." (with 2 threads, presumably only about 512MiB is required, which fits within my 750Mi limit). I'm not sure whether that's related to my issue in any way, but since it's the only change I made to the sample, I thought it was worth mentioning. With this setup, the pods ran fine for at least an hour or two, until I noticed that you had pushed a new version of the operator manifest.
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  labels:
    app.kubernetes.io/name: dragonfly
    app.kubernetes.io/instance: dragonfly-sample
    app.kubernetes.io/part-of: dragonfly-operator
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: dragonfly-operator
  name: dragonfly
spec:
  replicas: 2
  args:
    - "--proactor_threads=2"
  resources:
    requests:
      cpu: 500m
      memory: 500Mi
    limits:
      cpu: 600m
      memory: 750Mi
My actual issue
After updating the operator with the new manifest, the dragonfly pods were replaced; they passed their health checks for a moment, but then started failing them. Looking at the healthcheck script, I saw that it tries to autodetect the correct port. I ran netstat in the container: at first it shows the expected ports and the healthcheck succeeds:
$ netstat -tulnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:9999 0.0.0.0:* LISTEN 1/dragonfly
tcp 0 0 0.0.0.0:6379 0.0.0.0:* LISTEN 1/dragonfly
After a few seconds, when the healthcheck begins to fail, there is another entry listed:
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:9999 0.0.0.0:* LISTEN 1/dragonfly
tcp 0 0 0.0.0.0:6379 0.0.0.0:* LISTEN 1/dragonfly
udp 0 0 127.0.0.1:8125 0.0.0.0:* -
udp6 0 0 ::1:8125 :::* -
The healthcheck then tries to run against port 8125 and fails. I'm not sure whether this is specific to my setup or a general problem. It was easy enough to work around the issue by setting the HEALTHCHECK_PORT env var to 6379.
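For reference, a minimal sketch of that workaround in the Dragonfly spec (I'm assuming here that the CRD exposes a spec.env field that is passed through to the dragonfly container unchanged; check the CRD of your operator version):

spec:
  replicas: 2
  args:
    - "--proactor_threads=2"
  env:
    - name: HEALTHCHECK_PORT
      value: "6379"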
Even if this isn't just me, it hardly feels like a bug in the operator, but maybe the healthcheck script could be improved to handle this case better? Or should the sample manifest and the docs be updated to include the HEALTHCHECK_PORT env var?
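For illustration only, here is a rough sketch of what a more defensive autodetection could look like. This is not the operator's actual script; it assumes redis-cli is available in the image and that dragonfly runs as PID 1 (as it does in my pods):

#!/bin/sh
# Honor an explicit override first.
if [ -z "$HEALTHCHECK_PORT" ]; then
  # Only consider TCP LISTEN sockets owned by the dragonfly process (PID 1);
  # UDP sockets such as a statsd agent on 8125 are ignored entirely.
  HEALTHCHECK_PORT=$(netstat -tlnp 2>/dev/null \
    | awk '$NF == "1/dragonfly" { n = split($4, a, ":"); print a[n] }' \
    | sort -n | head -n 1)
fi
# Fall back to the default port if nothing was found.
redis-cli -p "${HEALTHCHECK_PORT:-6379}" ping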
Hi @denniskorbginski, I couldn't reproduce it (i.e. the UDP port opening) with the config you shared. I think some monitoring/load-balancer setup in your environment (targeting the pods) opens that port. The reason it only started failing after you switched to 1.1.2 may be that 6379 happened to be the last entry before, so the healthcheck script didn't fail. But I agree with your finding.
The best approach here would be for the operator to set the env var when creating the StatefulSet, so that you don't have to set it explicitly. Will fix, thanks!
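Concretely (just a sketch of the direction, not the final implementation), the dragonfly container in the generated StatefulSet would then carry something like:

env:
  - name: HEALTHCHECK_PORT
    value: "6379"   # or whichever port the Dragonfly resource is configured with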
It seems the port was hardcoded before the v1.16 release, which is why it didn't cause the issue earlier - dragonflydb/dragonfly#2841 (comment). (v1.1.2 uses dragonfly v1.16 by default.)
Hey, with the port in question being 8125, I agree it's very likely that this is caused by a monitoring agent running in my cluster. Maybe I was just lucky with the order of entries returned by netstat before the update.
Anyway, thank you for looking into this 😊
Edit: your other response just came in as I was submitting this comment 😅 great, this explains it.