goldpinger UI fails if a node is in a bad state.
kevinkim9264 opened this issue · 16 comments
Describe the bug
When I disable a node (e.g. manually stop a VM, kill the flanneld service, etc.), I expect to see a red mark on that particular node in the goldpinger UI. However, the entire goldpinger UI fails. This happens even if I port-forward to goldpinger pods on healthy nodes.
Is this expected behavior? If so, doesn't it defeat the purpose of using the goldpinger UI when a node is in a bad state?
To Reproduce
Steps to reproduce the behavior:
- Deploy goldpinger as a DaemonSet
- Make a node unhealthy (stop the VM, disable the flanneld service, etc.)
- Try to load the goldpinger UI
Expected behavior
The goldpinger UI loads and shows red marks on the network paths to the affected node.
Screenshots
When accessing via pod port-forwarding: (screenshot)
Logs from a still-working goldpinger pod: (screenshot)
Environment (please complete the following information):
- Operating System and Version: [e.g. Ubuntu Linux 18.04]
- Browser (if applicable): tried in Chrome and Safari
Additional context
**Deployment configuration**
```jsonnet
{
  apiVersion: "apps/v1",
  kind: "DaemonSet",
  metadata: {
    name: "goldpinger",
    namespace: "goldpinger",
    labels: {
      app: "goldpinger",
    },
  },
  spec: {
    updateStrategy: {
      type: "RollingUpdate",
    },
    selector: {
      matchLabels: {
        app: "goldpinger",
      },
    },
    template: {
      metadata: {
        labels: {
          app: "goldpinger",
        },
      },
      spec: {
        serviceAccount: "goldpinger",
        containers: [
          {
            name: "goldpinger",
            env: [
              {
                name: "HOST",
                value: "0.0.0.0",
              },
              {
                name: "PORT",
                value: "80",
              },
              {
                name: "HOSTNAME",
                valueFrom: {
                  fieldRef: {
                    fieldPath: "spec.nodeName",
                  },
                },
              },
            ],
            image: "docker.io/bloomberg/goldpinger:2.0.0",
            ports: [
              {
                containerPort: 80,
                name: "http",
              },
            ],
          },
        ],
      },
    },
  },
}
```
For more info: this is the case where the ping hangs instead of immediately returning an error. Could that be the reason why the UI just hangs and doesn't render?
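Speculating about the mechanism here (I haven't traced this path in the goldpinger source): when the VM is stopped, TCP SYNs to its pod IP are typically dropped rather than rejected, so an HTTP GET with no deadline blocks until the kernel gives up, while a client-side context deadline turns the hang into a fast "context deadline exceeded" error. A minimal Go sketch of the bounded version; the 5-second timeout and the IP are made up:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// checkPeer calls a peer's /check endpoint with a hard deadline.
// Against a stopped VM, a GET without a deadline can block for minutes,
// because the SYN packets are silently dropped, not rejected.
func checkPeer(podIP string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	url := fmt.Sprintf("http://%s:80/check", podIP)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// On a dead node this surfaces as "... context deadline exceeded".
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	if err := checkPeer("10.2.117.162"); err != nil {
		fmt.Println("peer unreachable:", err)
	}
}
```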
This looks like a regression. Will try to reproduce and report back.
Could you paste a sample output from a call to /check_all? That would help debugging. Thanks!
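For reference, that output can be grabbed by port-forwarding to any healthy goldpinger pod and issuing a plain GET against /check_all. A minimal Go sketch; the local port 8080 is an assumption on my part:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Assumes a running: kubectl port-forward <goldpinger-pod> 8080:80
	resp, err := http.Get("http://localhost:8080/check_all")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```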
OK, I could reproduce the behaviour. Would you like to take #68 for a spin and confirm it works in your case?
@seeker89 thank you! will try it and let you know
@seeker89 unfortunately the problem persists. The behavior is a little different, though: instead of throwing a weird error, it just hangs for some time and then shows red dots:
/check_all would show this result after some time:
```json
{"hosts":[{"hostIP":"10.120.252.11","podIP":"10.2.97.146"},{"hostIP":"10.120.252.13","podIP":"10.2.96.83"},{"hostIP":"10.120.252.12","podIP":"10.2.66.71"},{"hostIP":"10.120.252.10","podIP":"10.2.117.162"},{"hostIP":"10.120.252.14","podIP":"10.2.89.70"},{"hostIP":"10.120.252.15","podIP":"10.2.2.91"},{"hostIP":"10.120.252.18","podIP":"10.2.123.96"},{"hostIP":"10.120.252.17","podIP":"10.2.92.81"},{"hostIP":"10.120.252.16","podIP":"10.2.54.83"}],"responses":{"10.2.117.162":{"HostIP":"10.120.252.10","OK":false,"error":"Get http://10.2.117.162:80/check: context deadline exceeded"},"10.2.123.96":{"HostIP":"10.120.252.18","OK":false,"error":"Get http://10.2.123.96:80/check: context deadline exceeded"},"10.2.2.91":{"HostIP":"10.120.252.15","OK":false,"error":"Get http://10.2.2.91:80/check: context deadline exceeded"},"10.2.54.83":{"HostIP":"10.120.252.16","OK":false,"error":"Get http://10.2.54.83:80/check: context deadline exceeded"},"10.2.66.71":{"HostIP":"10.120.252.12","OK":false,"error":"Get http://10.2.66.71:80/check: context deadline exceeded"},"10.2.89.70":{"HostIP":"10.120.252.14","OK":false,"error":"Get http://10.2.89.70:80/check: context deadline exceeded"},"10.2.92.81":{"HostIP":"10.120.252.17","OK":false,"error":"Get http://10.2.92.81:80/check: context deadline exceeded"},"10.2.96.83":{"HostIP":"10.120.252.13","OK":false,"error":"Get http://10.2.96.83:80/check: context deadline exceeded"},"10.2.97.146":{"HostIP":"10.120.252.11","OK":false,"error":"Get http://10.2.97.146:80/check: context deadline exceeded"}}}
```
OK, so it works properly if a pod is in a bad state (i.e. manually putting a goldpinger pod into an invalid state). However, if a node's networking is messed up (i.e. the ping hangs instead of returning an error right away), then the goldpinger UI hangs. You can easily reproduce this by stopping the underlying virtual machine.
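One way the UI could degrade gracefully with a response like the one above is to render each entry in responses independently, marking anything with OK == false (or missing entirely) as red instead of failing the whole page. A hedged sketch of that decoding in Go; the struct names are my own invention and simply mirror the JSON shape quoted above, not goldpinger's actual types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical types that mirror the /check_all JSON shown above.
type CheckAll struct {
	Hosts     []Host                `json:"hosts"`
	Responses map[string]PeerResult `json:"responses"`
}

type Host struct {
	HostIP string `json:"hostIP"`
	PodIP  string `json:"podIP"`
}

type PeerResult struct {
	HostIP string `json:"HostIP"`
	OK     bool   `json:"OK"`
	Error  string `json:"error,omitempty"`
}

func main() {
	// A trimmed-down example payload in the same shape as above.
	raw := []byte(`{"hosts":[{"hostIP":"10.120.252.10","podIP":"10.2.117.162"}],
		"responses":{"10.2.117.162":{"HostIP":"10.120.252.10","OK":false,
		"error":"Get http://10.2.117.162:80/check: context deadline exceeded"}}}`)

	var res CheckAll
	if err := json.Unmarshal(raw, &res); err != nil {
		panic(err)
	}
	// Render every known host; a failure or a missing entry becomes a
	// red mark (with an empty error string if the entry was missing)
	// rather than an error for the whole page.
	for _, h := range res.Hosts {
		if r, ok := res.Responses[h.PodIP]; !ok || !r.OK {
			fmt.Printf("RED   %s (%s): %s\n", h.PodIP, h.HostIP, r.Error)
		} else {
			fmt.Printf("GREEN %s (%s)\n", h.PodIP, h.HostIP)
		}
	}
}
```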
Thanks for taking it for a spin. In that case we should tweak the display logic for when there is not enough data. I should be able to sit down with it next week. Cheers!
@seeker89 thank you again for the quick response! I will look forward to the fix :)
@seeker89 did you get any chance to fix this issue? thank you
Sorry, haven't gotten to that yet.
Hi, any update on this? Maybe I can help. I haven't got any experience with this, but why not learn.
@kristoflemmens that would be great and much appreciated. I haven't been able to find time since November!
Hi, any update?