goldpinger UI fails if a node is in a bad state.
kevinkim9264 opened this issue · 16 comments
Describe the bug
When I disable a node (e.g. manually stop a VM, kill the flanneld service, etc.), I expect to see a red mark on that particular node in the goldpinger UI. However, the entire goldpinger UI fails. This happens even if I port-forward to goldpinger pods on healthy nodes.
Is this expected behavior? If so, doesn't it defeat the purpose of using the goldpinger UI when a node is in a bad state?
To Reproduce
Steps to reproduce the behavior:
- Deploy goldpinger as a DaemonSet
- Make a node unhealthy (stop the VM, disable the flanneld service, etc.)
- Try to load the goldpinger UI
Expected behavior
The goldpinger UI loads and shows red marks on the network paths to the affected node.
Screenshots
When accessing via pod port-forwarding: (screenshot)
Logs from a still-working goldpinger pod: (screenshot)
Environment (please complete the following information):
- Operating System and Version: [e.g. Ubuntu Linux 18.04]
- Browser (if applicable): tried in Chrome and Safari
Additional context
**Deployment configuration**
```jsonnet
{
  apiVersion: "apps/v1",
  kind: "DaemonSet",
  metadata: {
    name: "goldpinger",
    namespace: "goldpinger",
    labels: {
      app: "goldpinger",
    },
  },
  spec: {
    updateStrategy: {
      type: "RollingUpdate",
    },
    selector: {
      matchLabels: {
        app: "goldpinger",
      },
    },
    template: {
      metadata: {
        labels: {
          app: "goldpinger",
        },
      },
      spec: {
        serviceAccount: "goldpinger",
        containers: [
          {
            name: "goldpinger",
            env: [
              {
                name: "HOST",
                value: "0.0.0.0",
              },
              {
                name: "PORT",
                value: "80",
              },
              {
                name: "HOSTNAME",
                valueFrom: {
                  fieldRef: {
                    fieldPath: "spec.nodeName",
                  },
                },
              },
            ],
            image: "docker.io/bloomberg/goldpinger:2.0.0",
            ports: [
              {
                containerPort: 80,
                name: "http",
              },
            ],
          },
        ],
      },
    },
  },
}
```
For more info: this is the case where the ping hangs instead of immediately returning an error. Could that be the reason why the UI just hangs and doesn't render?
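Speculating about the mechanism here (I haven't traced this path in the goldpinger source): when the VM is stopped, TCP SYNs to its pod IP are typically dropped rather than rejected, so an HTTP GET with no deadline blocks until the kernel gives up, while a client-side context deadline turns the hang into a fast "context deadline exceeded" error. A minimal Go sketch of the bounded version; the 5-second timeout and the IP are made up:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// checkPeer calls a peer's /check endpoint with a hard deadline.
// Against a stopped VM, a GET without a deadline can block for minutes,
// because the SYN packets are silently dropped, not rejected.
func checkPeer(podIP string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	url := fmt.Sprintf("http://%s:80/check", podIP)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// On a dead node this surfaces as "... context deadline exceeded".
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	if err := checkPeer("10.2.117.162"); err != nil {
		fmt.Println("peer unreachable:", err)
	}
}
```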
This looks like a regression. Will try to reproduce and report back.
Could you paste a sample output from a call to /check_all? That would help debugging. Thanks!
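For reference, that output can be grabbed by port-forwarding to any healthy goldpinger pod and issuing a plain GET against /check_all. A minimal Go sketch; the local port 8080 is an assumption on my part:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Assumes a running: kubectl port-forward <goldpinger-pod> 8080:80
	resp, err := http.Get("http://localhost:8080/check_all")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```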
OK, I could reproduce the behaviour. Would you like to take #68 for a spin and confirm it works in your case?
@seeker89 thank you! will try it and let you know
@seeker89 unfortunately the problem persists. The behavior is a little different, though: instead of throwing a weird error, it just hangs for some time and then shows red dots:
/check_all would show this result after some time:
```json
{"hosts":[{"hostIP":"10.120.252.11","podIP":"10.2.97.146"},{"hostIP":"10.120.252.13","podIP":"10.2.96.83"},{"hostIP":"10.120.252.12","podIP":"10.2.66.71"},{"hostIP":"10.120.252.10","podIP":"10.2.117.162"},{"hostIP":"10.120.252.14","podIP":"10.2.89.70"},{"hostIP":"10.120.252.15","podIP":"10.2.2.91"},{"hostIP":"10.120.252.18","podIP":"10.2.123.96"},{"hostIP":"10.120.252.17","podIP":"10.2.92.81"},{"hostIP":"10.120.252.16","podIP":"10.2.54.83"}],"responses":{"10.2.117.162":{"HostIP":"10.120.252.10","OK":false,"error":"Get http://10.2.117.162:80/check: context deadline exceeded"},"10.2.123.96":{"HostIP":"10.120.252.18","OK":false,"error":"Get http://10.2.123.96:80/check: context deadline exceeded"},"10.2.2.91":{"HostIP":"10.120.252.15","OK":false,"error":"Get http://10.2.2.91:80/check: context deadline exceeded"},"10.2.54.83":{"HostIP":"10.120.252.16","OK":false,"error":"Get http://10.2.54.83:80/check: context deadline exceeded"},"10.2.66.71":{"HostIP":"10.120.252.12","OK":false,"error":"Get http://10.2.66.71:80/check: context deadline exceeded"},"10.2.89.70":{"HostIP":"10.120.252.14","OK":false,"error":"Get http://10.2.89.70:80/check: context deadline exceeded"},"10.2.92.81":{"HostIP":"10.120.252.17","OK":false,"error":"Get http://10.2.92.81:80/check: context deadline exceeded"},"10.2.96.83":{"HostIP":"10.120.252.13","OK":false,"error":"Get http://10.2.96.83:80/check: context deadline exceeded"},"10.2.97.146":{"HostIP":"10.120.252.11","OK":false,"error":"Get http://10.2.97.146:80/check: context deadline exceeded"}}}
```
OK, so it works properly if a pod is in a bad state (i.e. manually putting a goldpinger pod into an invalid state). However, if a node's networking is messed up (i.e. the ping hangs instead of returning an error right away), then the goldpinger UI hangs. You can easily reproduce this by stopping the underlying virtual machine.
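One way the UI could degrade gracefully with a response like the one above is to render each entry in responses independently, marking anything with OK == false (or missing entirely) as red instead of failing the whole page. A hedged sketch of that decoding in Go; the struct names are my own invention and simply mirror the JSON shape quoted above, not goldpinger's actual types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical types that mirror the /check_all JSON shown above.
type CheckAll struct {
	Hosts     []Host                `json:"hosts"`
	Responses map[string]PeerResult `json:"responses"`
}

type Host struct {
	HostIP string `json:"hostIP"`
	PodIP  string `json:"podIP"`
}

type PeerResult struct {
	HostIP string `json:"HostIP"`
	OK     bool   `json:"OK"`
	Error  string `json:"error,omitempty"`
}

func main() {
	// A trimmed-down example payload in the same shape as above.
	raw := []byte(`{"hosts":[{"hostIP":"10.120.252.10","podIP":"10.2.117.162"}],
		"responses":{"10.2.117.162":{"HostIP":"10.120.252.10","OK":false,
		"error":"Get http://10.2.117.162:80/check: context deadline exceeded"}}}`)

	var res CheckAll
	if err := json.Unmarshal(raw, &res); err != nil {
		panic(err)
	}
	// Render every known host; a failure or a missing entry becomes a
	// red mark (with an empty error string if the entry was missing)
	// rather than an error for the whole page.
	for _, h := range res.Hosts {
		if r, ok := res.Responses[h.PodIP]; !ok || !r.OK {
			fmt.Printf("RED   %s (%s): %s\n", h.PodIP, h.HostIP, r.Error)
		} else {
			fmt.Printf("GREEN %s (%s)\n", h.PodIP, h.HostIP)
		}
	}
}
```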
Thanks for taking it for a spin. In that case we should tweak the display logic for when there is not enough data. I should be able to sit down with it next week. Cheers!
@seeker89 thank you again for the quick response! I will look forward to the fix :)
@seeker89 did you get any chance to fix this issue? thank you
Sorry, haven't gotten to that yet.
Hi, any update on this? Maybe I can help. I haven't got any experience with this, but why not learn.
@kristoflemmens that would be great and much appreciated. I haven't been able to find time since November!
Hi, any update?