coroot/coroot-node-agent

The metrics interface of some nodes cannot respond normally

wenhuwang opened this issue · 7 comments

Env

# k get node -owide
NAME          STATUS   ROLES    AGE      VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
10.165.6.25   Ready    node     584d     v1.19.4   10.165.6.25   <none>        Ubuntu 18.04.6 LTS      5.4.187-0504187-generic       docker://19.3.13
10.165.6.26   Ready    node     581d     v1.19.4   10.165.6.26   <none>        CentOS Linux 7 (Core)   5.4.243-1.el7.elrepo.x86_64   docker://19.3.13
10.165.6.27   Ready    node     581d     v1.19.4   10.165.6.27   <none>        CentOS Linux 7 (Core)   5.4.243-1.el7.elrepo.x86_64   docker://19.3.13
10.165.8.23   Ready    node     109d     v1.19.4   10.165.8.23   <none>        CentOS Linux 7 (Core)   5.4.243-1.el7.elrepo.x86_64   docker://19.3.13
....

# helm -n coroot list
NAME  	NAMESPACE	REVISION	UPDATED                             	STATUS  	CHART       	APP VERSION
coroot	coroot   	1       	2023-10-25 15:48:12.919318 +0800 CST	deployed	coroot-0.5.1	0.21.0

# k -n coroot get pods -owide | grep coroot-node-agent
coroot-node-agent-249ws                           1/1     Running   0          36m     10.165.208.69    10.165.6.27   <none>           <none>
coroot-node-agent-6bxlb                           1/1     Running   0          4h27m   10.165.204.252   10.165.8.23   <none>           <none>
coroot-node-agent-tfhdw                           1/1     Running   6          4h27m   10.165.210.2     10.165.6.26   <none>           <none>
coroot-node-agent-89xqp                           1/1     Running   7          4h26m   10.165.202.98    10.165.6.25   <none>           <none>
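
In the listing above, two agents have restart counts of 6 and 7. As a rough sketch, the filter below pulls out the restarted pods and the nodes they run on. The function name restarted_pods is hypothetical. It assumes the column order shown above (NAME, READY, STATUS, RESTARTS, AGE, IP, NODE); newer kubectl versions may print extra text in the RESTARTS column, which would shift the later fields.

```shell
# Hypothetical helper: reads `kubectl get pods -owide` output on stdin and
# prints "<pod> <restarts> <node>" for every pod that restarted at least once.
# Assumes RESTARTS is column 4 and NODE is column 7, as in the listing above.
restarted_pods() {
    awk '$4 ~ /^[0-9]+$/ && $4 > 0 { print $1, $4, $7 }'
}

# Usage:
#   kubectl -n coroot get pods -owide | grep coroot-node-agent | restarted_pods
```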

Description

On some nodes, coroot-node-agent is reported as down.

The CPU profile shows that the netlink.AddrList function accounts for more than 70% of the CPU time.

CPU usage of the abnormal coroot-node-agent pods is about 2.5 cores, while normal pods use about 0.2 cores.

All nodes have similar configurations and pod counts. Please help me troubleshoot the problem.

@wenhuwang, could you please share the CPU and Memory profiles of an affected agent?

curl -o mem_profile.tgz 'http://<agent_ip>:<agent_port>/debug/pprof/heap'
curl -o cpu_profile.tgz 'http://<agent_ip>:<agent_port>/debug/pprof/profile?seconds=60'
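
The two endpoints above are the standard Go net/http/pprof paths. As a minimal sketch, assuming the agent also serves the standard goroutine endpoint under the same prefix, the following shell helpers collect all three profiles in one step. The function names pprof_url and fetch_profiles are hypothetical, and the output file names simply follow the ones used in this thread.

```shell
# Hypothetical helper: print the pprof URL for an agent.
# $1 = agent IP, $2 = agent port, $3 = profile kind, $4 = optional query string
pprof_url() {
    if [ -n "${4:-}" ]; then
        printf 'http://%s:%s/debug/pprof/%s?%s\n' "$1" "$2" "$3" "$4"
    else
        printf 'http://%s:%s/debug/pprof/%s\n' "$1" "$2" "$3"
    fi
}

# Hypothetical helper: download heap, 60-second CPU, and goroutine profiles
# into the current directory. Stops at the first failed download.
# Assumption: the goroutine endpoint is exposed alongside heap and profile.
fetch_profiles() {
    curl -sS -o mem_profile.tgz "$(pprof_url "$1" "$2" heap)" &&
    curl -sS -o cpu_profile.tgz "$(pprof_url "$1" "$2" profile seconds=60)" &&
    curl -sS -o goroutine_profile.tgz "$(pprof_url "$1" "$2" goroutine)"
}

# Usage (substitute a real agent address):
#   fetch_profiles <agent_ip> <agent_port>
```

If a Go toolchain is available, the downloaded files can then be summarized with `go tool pprof -top cpu_profile.tgz`.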

goroutine_profile.tgz
Could this be a goroutine leak?

@wenhuwang,
Could you please update coroot-node-agent to the latest version, 1.14.1 (the helm chart has also been updated), and verify if this issue has been resolved?

OK, I will try.

@apetruhin, this issue was resolved after upgrading to the latest version. Thank you.

@wenhuwang, thank you for providing such comprehensive details on the issue.