bloomberg/goldpinger

Readiness probe failed: Get "http://172.16.1.4:8080/healthz": dial tcp 172.16.1.4:8080: i/o timeout (Client.Timeout exceeded while awaiting headers)

pjohnst5 opened this issue · 7 comments

Describe the bug
A goldpinger pod gets stuck in CrashLoopBackOff, reporting that the readiness probe failed due to a timeout.

This may not be a bug per se.

But I am able to create pods (we'll call these "toolbox" pods) in the same network as the goldpinger pods.

The toolbox pods can listen on port 8085, and they can also serve up traffic to other toolbox pods.

The question is: why might the goldpinger pods not be able to serve up traffic? Or at least, why would they get blocked while doing so?
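
For reference, a minimal sketch of the kind of pod-to-pod check described above, assuming netcat is available in the toolbox image (pod names and IPs are taken from the listings further down; toolbox2's built-in TCP listener is already bound to 8085):

# From toolbox1, open a TCP connection to toolbox2's pod IP on 8085; the
# connection should show up in toolbox2's logs as "[TCP] Received Connection"
kubectl exec -it toolbox1 -- sh -c 'echo hello | nc -w 2 172.16.1.7 8085'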

Goldpinger describe:

k describe po k8s-podpool-25044293-0-pod-0
Name:         k8s-podpool-25044293-0-pod-0
Namespace:    default
Priority:     0
Node:         k8s-podpool-25044293-0/10.240.0.4
Start Time:   Mon, 28 Mar 2022 13:44:30 -0700
Labels:       app=goldpinger
Annotations:  kubernetes.io/psp: privileged
Status:       Running
IP:           172.16.1.4
IPs:
  IP:  172.16.1.4
Containers:
  goldpinger:
    Container ID:   containerd://cf57961601d1733cedaaef946371e01c3d587441126792643af791e615eebde4
    Image:          docker.io/bloomberg/goldpinger:v3.0.0
    Image ID:       docker.io/bloomberg/goldpinger@sha256:6314bae32e0ec5e993369d6114cf4ade3eb055214ba80d0463d79435c7b2c7eb
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 28 Mar 2022 13:57:47 -0700
      Finished:     Mon, 28 Mar 2022 13:58:06 -0700
    Ready:          False
    Restart Count:  9
    Limits:
      memory:  80Mi
    Requests:
      cpu:      1m
      memory:   40Mi
    Liveness:   http-get http://:8080/healthz delay=5s timeout=1s period=5s #success=1 #failure=3
    Readiness:  http-get http://:8080/healthz delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      HOST:              0.0.0.0
      PORT:              8080
      HOSTNAME:           (v1:spec.nodeName)
      POD_IP:             (v1:status.podIP)
      HOSTS_TO_RESOLVE:  1.1.1.1 8.8.8.8 www.bing.com
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cd5nc (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-cd5nc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              dncnode=k8s-podpool-25044293-0
Tolerations:                 node.kubernetes.io/not-ready:NoExecute for 300s
                             node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                 From                             Message
  ----     ------                  ----                ----                             -------
  Normal   Scheduled               <unknown>                                            Successfully assigned default/k8s-podpool-25044293-0-pod-0 to k8s-podpool-25044293-0
  Warning  FailedCreatePodSandBox  16m                 kubelet, k8s-podpool-25044293-0  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b4f6176516b42c0f9f7e8e1c605c7962998d3a01ba2e88cbd88f5c8170394c92": GetMultiTenancyCNIResult failed:GetContainerNetworkConfiguration failed:[Azure cnsclient] Code: 18 , Error: NetworkContainer doesn't exist.
  Normal   Started                 15m (x2 over 15m)   kubelet, k8s-podpool-25044293-0  Started container goldpinger
  Warning  Unhealthy               15m (x2 over 15m)   kubelet, k8s-podpool-25044293-0  Readiness probe failed: Get "http://172.16.1.4:8080/healthz": dial tcp 172.16.1.4:8080: i/o timeout (Client.Timeout exceeded while awaiting headers)
  Normal   Killing                 15m (x2 over 15m)   kubelet, k8s-podpool-25044293-0  Container goldpinger failed liveness probe, will be restarted
  Warning  Unhealthy               15m (x6 over 15m)   kubelet, k8s-podpool-25044293-0  Liveness probe failed: Get "http://172.16.1.4:8080/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy               15m (x6 over 15m)   kubelet, k8s-podpool-25044293-0  Readiness probe failed: Get "http://172.16.1.4:8080/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Created                 15m (x3 over 15m)   kubelet, k8s-podpool-25044293-0  Created container goldpinger
  Normal   Pulled                  15m (x3 over 15m)   kubelet, k8s-podpool-25044293-0  Container image "docker.io/bloomberg/goldpinger:v3.0.0" already present on machine
  Warning  BackOff                 65s (x59 over 14m)  kubelet, k8s-podpool-25044293-0  Back-off restarting failed container

Goldpinger logs

k logs k8s-podpool-25044293-0-pod-0
{"level":"info","ts":1648504183.5840726,"caller":"goldpinger/main.go:63","msg":"Goldpinger","version":"v3.0.0","build":"Wed Apr  8 12:36:28 UTC 2020"}
{"level":"info","ts":1648504183.5996177,"caller":"goldpinger/main.go:102","msg":"Kubeconfig not specified, trying to use in cluster config"}
{"level":"info","ts":1648504183.6012673,"caller":"goldpinger/main.go:127","msg":"--ping-number set to 0: pinging all pods"}
{"level":"info","ts":1648504183.602398,"caller":"restapi/configure_goldpinger.go:132","msg":"Added the static middleware"}
{"level":"info","ts":1648504183.60244,"caller":"restapi/configure_goldpinger.go:149","msg":"Added the prometheus middleware"}

Goldpinger pod yaml

 k get po -oyaml k8s-podpool-25044293-0-pod-0
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: privileged
  creationTimestamp: "2022-03-28T20:44:30Z"
  labels:
    app: goldpinger
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:app: {}
      f:spec:
        f:affinity:
          .: {}
          f:podAntiAffinity:
            .: {}
            f:preferredDuringSchedulingIgnoredDuringExecution: {}
        f:containers:
          k:{"name":"goldpinger"}:
            .: {}
            f:env:
              .: {}
              k:{"name":"HOST"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"HOSTNAME"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:fieldRef: {}
              k:{"name":"HOSTS_TO_RESOLVE"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"POD_IP"}:
                .: {}
                f:name: {}
                f:valueFrom:
                  .: {}
                  f:fieldRef: {}
              k:{"name":"PORT"}:
                .: {}
                f:name: {}
                f:value: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:livenessProbe:
              .: {}
              f:failureThreshold: {}
              f:httpGet:
                .: {}
                f:path: {}
                f:port: {}
                f:scheme: {}
              f:initialDelaySeconds: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:name: {}
            f:ports:
              .: {}
              k:{"containerPort":8080,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:name: {}
                f:protocol: {}
            f:readinessProbe:
              .: {}
              f:failureThreshold: {}
              f:httpGet:
                .: {}
                f:path: {}
                f:port: {}
                f:scheme: {}
              f:initialDelaySeconds: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:resources:
              .: {}
              f:limits:
                .: {}
                f:memory: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
            f:securityContext:
              .: {}
              f:allowPrivilegeEscalation: {}
              f:readOnlyRootFilesystem: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:nodeSelector: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext:
          .: {}
          f:fsGroup: {}
          f:runAsNonRoot: {}
          f:runAsUser: {}
        f:serviceAccount: {}
        f:serviceAccountName: {}
        f:terminationGracePeriodSeconds: {}
    manager: __debug_bin
    operation: Update
    time: "2022-03-28T20:44:30Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"ContainersReady"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
          k:{"type":"Initialized"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Ready"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
        f:containerStatuses: {}
        f:hostIP: {}
        f:phase: {}
        f:podIP: {}
        f:podIPs:
          .: {}
          k:{"ip":"172.16.1.4"}:
            .: {}
            f:ip: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    subresource: status
    time: "2022-03-28T20:44:44Z"
  name: k8s-podpool-25044293-0-pod-0
  namespace: default
  resourceVersion: "910397"
  uid: afd422d4-b742-48c9-887e-0321c421dc78
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - goldpinger
          topologyKey: kubernetes.io/hostname
        weight: 100
  containers:
  - env:
    - name: HOST
      value: 0.0.0.0
    - name: PORT
      value: "8080"
    - name: HOSTNAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: HOSTS_TO_RESOLVE
      value: 1.1.1.1 8.8.8.8 www.bing.com
    image: docker.io/bloomberg/goldpinger:v3.0.0
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    name: goldpinger
    ports:
    - containerPort: 8080
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        memory: 80Mi
      requests:
        cpu: 1m
        memory: 40Mi
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-cd5nc
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: k8s-podpool-25044293-0
  nodeSelector:
    dncnode: k8s-podpool-25044293-0
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccount: goldpinger-serviceaccount
  serviceAccountName: goldpinger-serviceaccount
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-cd5nc
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-03-28T20:44:30Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-03-28T20:44:30Z"
    message: 'containers with unready status: [goldpinger]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-03-28T20:44:30Z"
    message: 'containers with unready status: [goldpinger]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-03-28T20:44:30Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://4e4a3883c9b0b7e472341069ae896b320bafb7c11410b03cbed8b8ecfb7660db
    image: docker.io/bloomberg/goldpinger:v3.0.0
    imageID: docker.io/bloomberg/goldpinger@sha256:6314bae32e0ec5e993369d6114cf4ade3eb055214ba80d0463d79435c7b2c7eb
    lastState:
      terminated:
        containerID: containerd://4e4a3883c9b0b7e472341069ae896b320bafb7c11410b03cbed8b8ecfb7660db
        exitCode: 2
        finishedAt: "2022-03-28T22:01:26Z"
        reason: Error
        startedAt: "2022-03-28T22:01:09Z"
    name: goldpinger
    ready: false
    restartCount: 31
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=goldpinger pod=k8s-podpool-25044293-0-pod-0_default(afd422d4-b742-48c9-887e-0321c421dc78)
        reason: CrashLoopBackOff
  hostIP: 10.240.0.4
  phase: Running
  podIP: 172.16.1.4
  podIPs:
  - ip: 172.16.1.4
  qosClass: Burstable
  startTime: "2022-03-28T20:44:30Z"

Toolbox pod describe:

k describe po toolbox2
Name:         toolbox2
Namespace:    default
Priority:     0
Node:         k8s-podpool-25044293-1/10.240.0.64
Start Time:   Mon, 28 Mar 2022 14:46:31 -0700
Labels:       app=toolbox
Annotations:  kubernetes.io/psp: privileged
Status:       Running
IP:           172.16.1.7
IPs:
  IP:  172.16.1.7
Containers:
  toolbox:
    Container ID:   containerd://257f4c29c8f8f559762ae81bd4babea0a4f6bfd06cc0125dc147d6596b60cd9f
    Image:          matmerr/toolbox:lite
    Image ID:       docker.io/matmerr/toolbox@sha256:e47201c473eb0cfdeccd381a696492ac2f8f2bae62b2d92dc8832bffa341502c
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Mon, 28 Mar 2022 14:46:32 -0700
    Ready:          True
    Restart Count:  0
    Environment:
      TCP_PORT:   8085
      UDP_PORT:   8086
      HTTP_PORT:  8087
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4tcqp (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-4tcqp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              dncnode=k8s-podpool-25044293-1
Tolerations:                 node.kubernetes.io/not-ready:NoExecute for 300s
                             node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age        From                             Message
  ----    ------     ----       ----                             -------
  Normal  Scheduled  <unknown>                                   Successfully assigned default/toolbox2 to k8s-podpool-25044293-1
  Normal  Pulled     13m        kubelet, k8s-podpool-25044293-1  Container image "matmerr/toolbox:lite" already present on machine
  Normal  Created    13m        kubelet, k8s-podpool-25044293-1  Created container toolbox
  Normal  Started    13m        kubelet, k8s-podpool-25044293-1  Started container toolbox

Toolbox logs:

 k logs toolbox2
[HTTP] Listening on 8087
[UDP] Listening on [::]:8086
[TCP] Listening on [::]:8085
[TCP] Received Connection from 172.16.1.6:49240
[TCP] Received Connection from 172.16.1.6:49616

kubectl get po -owide

NAME                           READY   STATUS             RESTARTS         AGE   IP            NODE                     NOMINATED NODE   READINESS GATES
azure-cni-manager-snrc2        1/1     Running            0                78m   10.240.0.64   k8s-podpool-25044293-1   <none>           <none>
azure-cni-manager-zsxv6        1/1     Running            0                78m   10.240.0.4    k8s-podpool-25044293-0   <none>           <none>
azure-cns-765kd                1/1     Running            0                78m   10.240.0.4    k8s-podpool-25044293-0   <none>           <none>
azure-cns-twhhv                1/1     Running            0                78m   10.240.0.64   k8s-podpool-25044293-1   <none>           <none>
dnc-788c667694-x2bwj           1/1     Running            0                76m   10.240.0.94   k8s-dncpool-25044293-0   <none>           <none>
k8s-podpool-25044293-0-pod-0   0/1     CrashLoopBackOff   29 (4m46s ago)   76m   172.16.1.4    k8s-podpool-25044293-0   <none>           <none>
k8s-podpool-25044293-1-pod-0   0/1     CrashLoopBackOff   29 (4m21s ago)   76m   172.16.1.5    k8s-podpool-25044293-1   <none>           <none>
toolbox1                       1/1     Running            0                65m   172.16.1.6    k8s-podpool-25044293-0   <none>           <none>
toolbox2                       1/1     Running            0                14m   172.16.1.7    k8s-podpool-25044293-1   <none>           <none>

Trying to connect to the goldpinger readiness probe endpoint from the toolbox1 pod:

curl -v "http://172.16.1.4:8080/healthz"
*   Trying 172.16.1.4:8080...
* TCP_NODELAY set
* connect to 172.16.1.4 port 8080 failed: Connection refused
* Failed to connect to 172.16.1.4 port 8080: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 172.16.1.4 port 8080: Connection refused

I should mention that I am able to ping the goldpinger pods just fine:

toolbox1:/app# ping 172.16.1.4
PING 172.16.1.4 (172.16.1.4) 56(84) bytes of data.
64 bytes from 172.16.1.4: icmp_seq=1 ttl=64 time=2.15 ms
64 bytes from 172.16.1.4: icmp_seq=2 ttl=64 time=0.820 ms
64 bytes from 172.16.1.4: icmp_seq=3 ttl=64 time=1.63 ms

This looks like a pretty standard Goldpinger yaml, including the HOST env var set to 0.0.0.0, so I don't see anything obvious that should prevent connectivity.

I noticed from your logs that you're running with azure-cni, so I'm wondering if perhaps you have a default-deny kind of network policy in place? Just guessing here.
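
For example (a generic check, not specific to this cluster; whether policies are enforced at all depends on the CNI setup), you could list any policies in play and look for a default-deny ingress policy:

kubectl get networkpolicy --all-namespaces

# A default-deny ingress policy, if one exists, would look roughly like this
# (hypothetical example manifest):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: default
spec:
  podSelector: {}
  policyTypes:
  - Ingress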

Perhaps you could also try setting a health check on your toolbox pod to confirm that probes work there.
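
A sketch of what that could look like on the toolbox container, reusing the goldpinger probe settings and pointing at the toolbox HTTP listener on 8087 (the path is an assumption; adjust it to whatever the toolbox HTTP server actually serves):

# Hypothetical readiness probe for the toolbox container
readinessProbe:
  httpGet:
    path: /
    port: 8087
    scheme: HTTP
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 3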

Is there a way to exec into the goldpinger pod to look around? I'm just wondering if it's listening on 8080 as expected.
For toolbox pods, I'm able to exec into them. I can try a health check too

Okay, I confirmed with an nginx pod; the nginx pod replies OK on port 80:

kubectl get po -owide
NAME                           READY   STATUS             RESTARTS          AGE    IP            NODE                     NOMINATED NODE   READINESS GATES
azure-cni-manager-snrc2        1/1     Running            0                 21h    10.240.0.64   k8s-podpool-25044293-1   <none>           <none>
azure-cni-manager-zsxv6        1/1     Running            0                 21h    10.240.0.4    k8s-podpool-25044293-0   <none>           <none>
azure-cns-765kd                1/1     Running            0                 21h    10.240.0.4    k8s-podpool-25044293-0   <none>           <none>
azure-cns-twhhv                1/1     Running            0                 21h    10.240.0.64   k8s-podpool-25044293-1   <none>           <none>
dnc-788c667694-x2bwj           1/1     Running            0                 21h    10.240.0.94   k8s-dncpool-25044293-0   <none>           <none>
k8s-podpool-25044293-0-pod-0   0/1     CrashLoopBackOff   444 (31s ago)     21h    172.16.1.4    k8s-podpool-25044293-0   <none>           <none>
k8s-podpool-25044293-1-pod-0   0/1     CrashLoopBackOff   442 (3m16s ago)   21h    172.16.1.5    k8s-podpool-25044293-1   <none>           <none>
nginx1                         1/1     Running            0                 112s   172.16.1.8    k8s-podpool-25044293-0   <none>           <none>
toolbox1                       1/1     Running            0                 20h    172.16.1.6    k8s-podpool-25044293-0   <none>           <none>
toolbox2                       1/1     Running            0                 20h    172.16.1.7    k8s-podpool-25044293-1   <none>           <none>
toolbox1:/app# curl "http://172.16.1.8:80"
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

Is there a way to exec into the goldpinger pod to look around? I'm just wondering if it's listening on 8080 as expected. For toolbox pods, I'm able to exec into them. I can try a health check too

You can exec into it, but other than the goldpinger binary, there isn't anything there to execute. If you have access to the underlying host, you can investigate the process from there.
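
For example, from the node you can enter the container's network namespace while it is up between restarts (a sketch; it assumes crictl, nsenter, and ss are available on the node, which is typical for a containerd node but not guaranteed):

# Find the goldpinger container and its PID
crictl ps --name goldpinger -q
crictl inspect <container-id> | grep -i '"pid"'

# Enter its network namespace and check what is listening
nsenter -t <pid> -n ss -lnt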

If you remove the readiness/health probes temporarily, will you be able to reach the pods from your toolbox, like you did with nginx? You were getting Connection refused, but I suspect that's because Goldpinger was already in CrashLoopBackOff.
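
Probe fields can't be edited on a running pod, so the temporary test means recreating the pod from a manifest with the probe blocks removed, roughly like this (a sketch; the filename is hypothetical, and the pod IP may change after recreation):

# In the original goldpinger pod manifest, delete the livenessProbe: and
# readinessProbe: blocks, then recreate the pod:
kubectl delete pod k8s-podpool-25044293-0-pod-0
kubectl apply -f goldpinger-pod-noprobes.yaml

# With no probes the container should stay up, so the earlier curl from
# toolbox1 can be retried against the (possibly new) pod IP.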

Hmm yeah @seeker89, now, after removing the readiness/health probes from the goldpinger pod, I am able to curl the /healthz endpoint o_0

(screenshot: curl to the goldpinger /healthz endpoint now succeeding)

Ah, it's because the goldpinger pod is on a different network than the kubelet.
So the kubelet (which runs the HTTP readiness/liveness probes from the node) can't reach the goldpinger pod, but since the toolbox pods are on the same network as the goldpinger pods, the toolbox pods can talk to them just fine.
Thanks @seeker89!
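
For anyone hitting the same symptom: the diagnosis can be confirmed directly from the node, since the kubelet runs httpGet probes from the node's network namespace (a sketch; it assumes SSH access to the node):

# On the node k8s-podpool-25044293-0 (10.240.0.4):
curl -v --max-time 2 http://172.16.1.4:8080/healthz

# If the node has no route into the pod network, this times out the same way
# the kubelet probes do, even though a curl from another pod on the
# 172.16.1.x network succeeds once the container stays up.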