IBM/varnish-operator

Container restart loses connectivity to backends

Closed this issue · 5 comments

When Kubernetes restarts the container due to a liveness probe failure, the container comes back with zero backends.

varnishadm backend.list
Backend name   Admin      Probe    Health     Last change
boot.default   healthy    0/0      healthy    Sun, 08 May 2022 02:30:49 GMT

I confirmed that /etc/varnish/backends.vcl is still populated correctly and other pods can still connect to the backends without a problem. Deleting the pod "fixes" it.
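For anyone hitting the same symptom, one quick way to confirm the bad state is to count the backends other than the built-in `boot.default`. This is a sketch assuming the `backend.list` output format shown above; the `sample` variable stands in for live `varnishadm` output, which in a pod you would pipe in directly.

```shell
# Count backends other than the built-in default. The captured output above
# stands in for live output; inside an affected pod you would instead run:
#   varnishadm backend.list | tail -n +2 | grep -cv '^boot\.default'
sample='Backend name   Admin      Probe    Health     Last change
boot.default   healthy    0/0      healthy    Sun, 08 May 2022 02:30:49 GMT'

count=$(printf '%s\n' "$sample" | tail -n +2 | grep -cv '^boot\.default')
echo "non-default backends: $count"
```

A result of 0 here matches the broken state described in this issue, while a healthy pod should report one entry per matching backend pod plus the director.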

Here is our VarnishCluster manifest for context.

apiVersion: caching.ibm.com/v1alpha1
kind: VarnishCluster
metadata:
  labels:
    operator: varnish
  name: pcms-api
spec:
  backend:
    port: 80
    selector:
      app: pcms
      component: web
      purpose: api
  replicas: 3
  service:
    annotations:
      prometheus.io/path: /metrics
      prometheus.io/port: "9131"
      prometheus.io/scrape-only: "true"
    port: 80
  varnish:
    args:
    - -p
    - http_max_hdr=256
    - -p
    - http_resp_hdr_len=256k
    - -p
    - http_resp_size=1024k
    - -p
    - workspace_backend=256k
    - -s
    - malloc,756M
    resources:
      limits:
        cpu: 500m
        memory: 1028Mi
      requests:
        cpu: 500m
        memory: 1028Mi
  vcl:
    configMapName: pcms-varnishcluster
    entrypointFileName: default.vcl
cin commented

Thanks for reporting the issue. We'll take a look. FYI: @tomashibm

cin commented

Was able to reproduce it with /sbin/killall5, which killed all the processes and restarted the containers but still caused the condition mentioned above. Deleting the pod fixed it, but I'm going to try forcing an update instead.

cin commented

There's no vcl either...

varnish> vcl.list
200
active   auto    warm         0    boot

EDIT: It is possible to fix without deleting the pod by manually loading the VCL.

varnish> vcl.load test /etc/varnish/entrypoint.vcl
200
VCL compiled.

varnish> vcl.use test
200
VCL 'test' now active
varnish> vcl.list
200
available   auto    warm         0    boot
active      auto    warm         0    test

varnish> backend.list
200
Backend name                Admin    Probe  Health   Last change
test.nginx-7848d4b86f-fkw7w healthy  0/0    healthy  Mon, 09 May 2022 17:05:22 GMT
test.nginx-7848d4b86f-npc52 healthy  0/0    healthy  Mon, 09 May 2022 17:05:22 GMT
test.container_rr           probe    2/2    healthy  Mon, 09 May 2022 17:05:22 GMT
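The interactive steps above can also be run from outside the pod with non-interactive `varnishadm` calls. This is a sketch only: the pod name `pcms-api-0`, the container name `varnish`, and the VCL label `recovered` are assumptions based on the manifest above, not names the operator guarantees.

```shell
# Hypothetical one-shot recovery without deleting the pod; adjust the pod
# and container names to match your cluster.
kubectl exec pcms-api-0 -c varnish -- varnishadm vcl.load recovered /etc/varnish/entrypoint.vcl
kubectl exec pcms-api-0 -c varnish -- varnishadm vcl.use recovered
```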

@cin Thank you for the follow-up. If manual intervention is needed, it is easier for my team to just delete the pod.

I wonder if it is restarting and just falling back to the system default VCL, /etc/default/varnish.
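One way to check that theory from inside the container is `vcl.show`, which prints the source that a loaded configuration was compiled from; with `-v` it also prints included files, so it should reveal whether the `boot` config came from the entrypoint (and its backends include) or from some default VCL.

```shell
# Print the source of the boot configuration, including any files pulled in
# via include statements (e.g. backends.vcl).
varnishadm vcl.show -v boot
```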

The fix is released in the latest version, 0.31.0.