Some problems with Diagnosis
- Capture:
  When capturing in pods like kubeskoop's "pod controller", I get this error: "failed to run command: pod not found on nettop cache".
  When capturing in pods like kubeskoop's "pod agent", I get the error "failed to upload task result", and the "pod controller" seems to crash and restart.
- Connectivity Diagnosis:
  How can I pass arguments like "--network-plugin" when auto-detection fails in the web console?
Hi, thanks for reporting this issue!
- It seems the controller is not working properly. Could you please provide the controller's error log?
- You can use diagnose.extraArgs in the controller config to pass custom arguments to diagnose. The controller config is in a ConfigMap named controller-config or kubeskoop-controller-config. For example:
logLevel: debug
server:
  httpPort: 10264
  agentPort: 10263
controller:
  namespace: kubeskoop
  prometheus: ""
  loki: ""
  database:
    type: sqlite3
  diagnose:
    # Set network plugin to flannel.
    extraArgs:
      - --network-plugin=flannel
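To illustrate what extraArgs does: each entry is appended verbatim to the diagnose command line that the controller builds. Below is a generic sketch of that pattern in Go; the binary name and base arguments are placeholders, not kubeskoop's actual invocation.

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Placeholder base invocation; the real controller assembles its own
	// arguments (source, destination, and so on).
	baseArgs := []string{"diagnose"}

	// Entries taken from diagnose.extraArgs in the controller ConfigMap.
	extraArgs := []string{"--network-plugin=flannel"}

	// Each configured string is appended as-is to the command line.
	// "skoop" here is a placeholder binary name, not the controller's real path.
	cmd := exec.Command("skoop", append(baseArgs, extraArgs...)...)
	fmt.Println(cmd.String())
}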
Sorry for the late reply. Could you please provide the detailed log as text?
For the pod crashing scenario, you can fetch the logs of the crashed pod by adding -p to kubectl logs:
kubectl logs -n kubeskoop <controller pod name> -p
- Capture controller-pod:
controller-pod logs:
[GIN] 2024/08/02 - 02:44:20 | 200 | 284.332µs | 10.42.0.1 | GET "/pods"
[GIN] 2024/08/02 - 02:44:37 | 200 | 3.723619ms | 10.42.0.1 | POST "/capture"
I0802 02:44:37.710384 1 agent.go:37] send task to agent node_name:"sqaobxdn011161204144.sa128" type:Capture type:Ping: server:{} task:{id:"2" capture:{pod:{name:"controller-78c945f8cc-7w4mt" namespace:"kubeskoop"} node:{name:"sqaobxdn011161204144.sa128"} capture_type:"Pod" capture_duration_seconds:10}}
I0802 02:44:37.712404 1 capture.go:252] store capture result for 2, failed to run command: pod not found on nettop cache
[GIN] 2024/08/02 - 02:44:37 | 200 | 52.019µs | 10.42.0.1 | GET "/captures"
agent-pod logs:
time="2024-08-02T02:44:30Z" level=info msg="cache process finished, latency: 0.000571s"
time="2024-08-02T02:44:30Z" level=debug msg="loki ingester send of maxwait"
time="2024-08-02T02:44:35Z" level=debug msg="loki ingester send of maxwait"
time="2024-08-02T02:44:37Z" level=info msg="process task: server:{} task:{id:\"2\" capture:{pod:{name:\"controller-78c945f8cc-7w4mt\" namespace:\"kubeskoop\"} node:{name:\"sqaobxdn011161204144.sa128\"} capture_type:\"Pod\" capture_duration_seconds:10}}"
time="2024-08-02T02:44:37Z" level=error msg="failed to run command: pod not found on nettop cache"
time="2024-08-02T02:44:40Z" level=debug msg="finished cache process"
time="2024-08-02T02:44:40Z" level=info msg="cache process finished, latency: 0.000601s"
time="2024-08-02T02:44:40Z" level=debug msg="loki ingester send of maxwait"
- Capture agent-pod:
controller-pod logs:
$ kubectl logs -n kubeskoop controller-78c945f8cc-7w4mt
...
[GIN-debug] PUT /config --> github.com/alibaba/kubeskoop/pkg/controller/cmd.(*Server).UpdateExporterConfig-fm (3 handlers)
[GIN-debug] [WARNING] You trusted all proxies, this is NOT safe. We recommend you to set a value.
Please check https://pkg.go.dev/github.com/gin-gonic/gin#readme-don-t-trust-all-proxies for details.
[GIN-debug] Listening and serving HTTP on 0.0.0.0:10264
[GIN] 2024/08/02 - 02:49:08 | 200 | 172.503µs | 10.42.0.1 | GET "/captures"
I0802 02:49:08.641722 1 agent.go:31] watch tasks for filter: node_name:"sqaobxdn011161204144.sa128" type:Capture type:Ping
$ kubectl logs -n kubeskoop controller-78c945f8cc-7w4mt -p
...
[GIN] 2024/08/02 - 02:44:37 | 200 | 52.019µs | 10.42.0.1 | GET "/captures"
[GIN] 2024/08/02 - 02:48:48 | 200 | 454.246µs | 10.42.0.1 | GET "/pods"
[GIN] 2024/08/02 - 02:48:53 | 200 | 3.920268ms | 10.42.0.1 | POST "/capture"
I0802 02:48:53.129156 1 agent.go:37] send task to agent node_name:"sqaobxdn011161204144.sa128" type:Capture type:Ping: server:{} task:{id:"3" capture:{pod:{name:"agent-dm5qr" namespace:"kubeskoop" hostNetwork:true} node:{name:"sqaobxdn011161204144.sa128"} capture_type:"Pod" capture_duration_seconds:10}}
[GIN] 2024/08/02 - 02:48:53 | 200 | 79.636µs | 10.42.0.1 | GET "/captures"
[GIN] 2024/08/02 - 02:48:56 | 200 | 170.669µs | 10.42.0.1 | GET "/captures"
[GIN] 2024/08/02 - 02:48:59 | 200 | 42.782µs | 10.42.0.1 | GET "/captures"
[GIN] 2024/08/02 - 02:49:02 | 200 | 46.882µs | 10.42.0.1 | GET "/captures"
[GIN] 2024/08/02 - 02:49:05 | 200 | 51.859µs | 10.42.0.1 | GET "/captures"
agent-pod logs:
...
time="2024-08-02T02:48:53Z" level=info msg="process task: server:{} task:{id:\"3\" capture:{pod:{name:\"agent-dm5qr\" namespace:\"kubeskoop\" hostNetwork:true} node:{name:\"sqaobxdn011161204144.sa128\"} capture_type:\"Pod\" capture_duration_seconds:10}}"
time="2024-08-02T02:48:53Z" level=info msg="start capture: [{ tcpdump -i any -C 100 -w /tmp/3_sqaobxdn011161204144.sa128_host.pcap /tmp/3_sqaobxdn011161204144.sa128_host.pcap 10000000000}]"
time="2024-08-02T02:48:55Z" level=debug msg="loki ingester send of maxwait"
time="2024-08-02T02:49:00Z" level=debug msg="finished cache process"
time="2024-08-02T02:49:00Z" level=info msg="cache process finished, latency: 0.000588s"
time="2024-08-02T02:49:00Z" level=debug msg="loki ingester send of maxwait"
time="2024-08-02T02:49:05Z" level=debug msg="loki ingester send of maxwait"
time="2024-08-02T02:49:06Z" level=error msg="ipcache failed to receive task: rpc error: code = Unavailable desc = error reading from server: EOF"
time="2024-08-02T02:49:06Z" level=info msg="connecting to controller."
time="2024-08-02T02:49:06Z" level=error msg="failed to receive task: rpc error: code = Unavailable desc = error reading from server: EOF"
time="2024-08-02T02:49:06Z" level=info msg="connecting to controller."
time="2024-08-02T02:49:06Z" level=error msg="failed to upload task result: rpc error: code = Unavailable desc = error reading from server: EOF"
time="2024-08-02T02:49:06Z" level=error msg="failed to watch task: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.43.134.4:10263: connect: connection refused\""
time="2024-08-02T02:49:06Z" level=warning msg="retry watching task after 2 seconds"
time="2024-08-02T02:49:06Z" level=error msg="failed to watch ipcache: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.43.134.4:10263: connect: connection refused\""
time="2024-08-02T02:49:06Z" level=warning msg="retry watching ipcache after 2 seconds"
time="2024-08-02T02:49:08Z" level=info msg="connecting to controller."
time="2024-08-02T02:49:08Z" level=info msg="connecting to controller."
time="2024-08-02T02:49:08Z" level=info msg="controller connected."
time="2024-08-02T02:49:08Z" level=error msg="ipcache failed to receive task: rpc error: code = InvalidArgument desc = invalid period"
time="2024-08-02T02:49:08Z" level=info msg="connecting to controller."
time="2024-08-02T02:49:08Z" level=warning msg="list ipcache"
time="2024-08-02T02:49:08Z" level=info msg="current period:88f6bca0-46d0-4653-9338-8d3a49aaf77d, revision: 0"
time="2024-08-02T02:49:08Z" level=debug msg="ip cache changed op=0,info=[ip=11.161.204.144,type=node,node=sqaobxdn011161204144.sa128,pod=,namespace=]"
time="2024-08-02T02:49:08Z" level=debug msg="ip cache changed op=0,info=[ip=10.42.0.105,type=pod,node=,pod=coredns-6799fbcd5-pl8mn,namespace=kube-system]"
time="2024-08-02T02:49:08Z" level=debug msg="ip cache changed op=0,info=[ip=10.42.0.120,type=pod,node=,pod=local-path-provisioner-6c86858495-h2v9c,namespace=kube-system]"
time="2024-08-02T02:49:08Z" level=debug msg="ip cache changed op=0,info=[ip=10.42.0.108,type=pod,node=,pod=metrics-server-54fd9b65b-9kgrj,namespace=kube-system]"
time="2024-08-02T02:49:08Z" level=debug msg="ip cache changed op=0,info=[ip=10.42.0.116,type=pod,node=,pod=svclb-traefik-71ce8185-vq7jn,namespace=kube-system]"
time="2024-08-02T02:49:08Z" level=debug msg="ip cache changed op=0,info=[ip=10.42.0.112,type=pod,node=,pod=traefik-7d5f6474df-52wv5,namespace=kube-system]"
time="2024-08-02T02:49:08Z" level=debug msg="ip cache changed op=0,info=[ip=10.42.0.113,type=pod,node=,pod=controller-78c945f8cc-7w4mt,namespace=kubeskoop]"
time="2024-08-02T02:49:08Z" level=debug msg="ip cache changed op=0,info=[ip=10.42.0.110,type=pod,node=,pod=webconsole-6d64c54cc4-m8n8x,namespace=kubeskoop]"
...
I don't see an obvious error in this log. Could you please check whether there is any Go panic message in the controller pod?
I'll also try to reproduce it on my local machine.
Thank you!
I'm wondering where to find these panic messages in the controller pod. Could you please give some suggestions?
Usually, when the controller pod crashes and restarts, the cause is a Go program panic.
The detailed info (reason and call stack) can be found at the end of the output of kubectl logs -p.
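For reference, a Go panic leaves a recognizable trace. The tiny, self-contained example below (unrelated to kubeskoop) shows the kind of output to look for at the end of the previous container's log.

package main

// Running this program prints something like:
//   panic: runtime error: index out of range [3] with length 3
//   goroutine 1 [running]:
//   main.main()
// A "panic: ..." line followed by goroutine stacks at the end of the
// `kubectl logs -p` output is the signature of a crashed Go program.
func main() {
	s := []int{1, 2, 3}
	_ = s[3] // out-of-range access triggers a runtime panic
}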
I found that the nettop cache always stores nothing, because this call in exporter/nettop/cache.go always returns an empty list:
sandboxList, err := criClient.ListPodSandbox(&v1.PodSandboxFilter{
State: &v1.PodSandboxStateValue{
State: v1.PodSandboxState_SANDBOX_READY,
},
})
Do you know how to figure this out? Or is there maybe something wrong with my environment? Thank you very much!
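For reference, the same ListPodSandbox call can be issued directly against the CRI endpoint to check whether the runtime itself reports any ready sandboxes. This is a minimal standalone probe, not kubeskoop code; the socket path is an assumption and should be adjusted to whatever endpoint the agent actually mounts.

// criprobe issues the same ListPodSandbox call directly against a CRI
// endpoint, to check whether the runtime itself reports ready pod sandboxes.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	v1 "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	// Assumption: adjust this to the runtime socket the agent actually mounts
	// (for example containerd's or cri-dockerd's socket).
	const endpoint = "unix:///run/cri-dockerd.sock"

	conn, err := grpc.Dial(endpoint, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial CRI endpoint: %v", err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Same filter as in exporter/nettop/cache.go: only sandboxes in READY state.
	resp, err := v1.NewRuntimeServiceClient(conn).ListPodSandbox(ctx, &v1.ListPodSandboxRequest{
		Filter: &v1.PodSandboxFilter{
			State: &v1.PodSandboxStateValue{State: v1.PodSandboxState_SANDBOX_READY},
		},
	})
	if err != nil {
		log.Fatalf("ListPodSandbox: %v", err)
	}
	for _, sb := range resp.Items {
		fmt.Printf("%s/%s\n", sb.Metadata.Namespace, sb.Metadata.Name)
	}
	fmt.Printf("%d ready sandboxes\n", len(resp.Items))
}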
Hi, thanks for the information!
Maybe the cause is that the agent can't list pods on nodes properly.
What container runtime are you using, and which version is it?
I'm using Docker Engine and cri-dockerd, as follows:
$ tr \\0 ' ' < /proc/"$(pgrep kubelet)"/cmdline
/usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=unix:///run/cri-dockerd.sock --cgroup-driver=systemd --container-runtime-endpoint=/run/cri-dockerd.sock
My k8s version is:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-08-23T17:44:59Z", GoVersion:"go1.19", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-08-23T17:38:15Z", GoVersion:"go1.19", Compiler:"gc", Platform:"linux/amd64"}
My Docker version is:
Server: Docker Engine - Community
Engine:
Version: 26.1.3
API version: 1.45 (minimum version 1.24)
Go version: go1.21.10
Git commit: 8e96db1
Built: Thu May 16 08:33:34 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.32
GitCommit: 8b3b7ca2e5ce38e8f31a34f35b2b68ceb8470d89
runc:
Version: 1.1.12
GitCommit: v1.1.12-0-g51d5e94
docker-init:
Version: 0.19.0
GitCommit: de40ad0
My cri-dockerd version is:
cri-dockerd 0.3.6 (877dc6a4)