The metrics interface of some nodes cannot respond normally

Question

wenhuwang opened this issue a year ago · 7 comments

Env

# k get node -owide
NAME          STATUS   ROLES    AGE      VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION                CONTAINER-RUNTIME
10.165.6.25   Ready    node     584d     v1.19.4   10.165.6.25   <none>        Ubuntu 18.04.6 LTS      5.4.187-0504187-generic       docker://19.3.13
10.165.6.26   Ready    node     581d     v1.19.4   10.165.6.26   <none>        CentOS Linux 7 (Core)   5.4.243-1.el7.elrepo.x86_64   docker://19.3.13
10.165.6.27   Ready    node     581d     v1.19.4   10.165.6.27   <none>        CentOS Linux 7 (Core)   5.4.243-1.el7.elrepo.x86_64   docker://19.3.13
10.165.8.23   Ready    node     109d     v1.19.4   10.165.8.23   <none>        CentOS Linux 7 (Core)   5.4.243-1.el7.elrepo.x86_64   docker://19.3.13
....

# helm -n coroot list
NAME  	NAMESPACE	REVISION	UPDATED                             	STATUS  	CHART       	APP VERSION
coroot	coroot   	1       	2023-10-25 15:48:12.919318 +0800 CST	deployed	coroot-0.5.1	0.21.0

# k -n coroot get pods -owide | grep coroot-node-agent
coroot-node-agent-249ws                           1/1     Running   0          36m     10.165.208.69    10.165.6.27   <none>           <none>
coroot-node-agent-6bxlb                           1/1     Running   0          4h27m   10.165.204.252   10.165.8.23   <none>           <none>
coroot-node-agent-tfhdw                           1/1     Running   6          4h27m   10.165.210.2     10.165.6.26   <none>           <none>
coroot-node-agent-89xqp                           1/1     Running   7          4h26m   10.165.202.98    10.165.6.25   <none>           <none>

Description

On some nodes, the coroot-node-agent is down and its metrics interface does not respond normally.

The CPU profile shows that the netlink.AddrList function accounts for more than 70% of the CPU time.

The abnormal coroot-node-agent pods use about 2.5 CPU cores, while normal pods use about 0.2 cores.

All nodes have similar configurations and pod counts. Please help me troubleshoot the problem.

Answer 1 · 2023-10-31T09:06:09.000Z

@wenhuwang, could you please share the CPU and Memory profiles of an affected agent?

curl -o mem_profile.tgz 'http://<agent_ip>:<agent_port>/debug/pprof/heap'
curl -o cpu_profile.tgz 'http://<agent_ip>:<agent_port>/debug/pprof/profile?seconds=60'
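
If the agent is slow or unresponsive, a plain curl can hang. The helpers below are a minimal sketch of the same collection step with a timeout added. The function names pprof_url and fetch_profile and the .pb.gz output names are made up for this example, and the goroutine endpoint is assumed to be exposed next to the heap and CPU ones (it is the standard Go pprof path).

```shell
# Sketch: helpers for collecting pprof profiles from one agent.
# Arguments: <agent_ip> <agent_port> <kind> [seconds]; kind is heap, goroutine, or cpu.

# Print the pprof URL for the requested profile kind.
pprof_url() {
  case "$3" in
    heap|goroutine)
      printf 'http://%s:%s/debug/pprof/%s\n' "$1" "$2" "$3" ;;
    cpu)
      # The CPU profile samples for the given number of seconds (default 60).
      printf 'http://%s:%s/debug/pprof/profile?seconds=%s\n' "$1" "$2" "${4:-60}" ;;
    *)
      echo "unknown profile kind: $3" >&2
      return 1 ;;
  esac
}

# Download one profile to <kind>_profile.pb.gz.
# --max-time stops curl from waiting forever on an unresponsive agent;
# the 30 extra seconds leave room for the CPU sampling window to finish.
fetch_profile() {
  url=$(pprof_url "$1" "$2" "$3" "${4:-60}") || return 1
  curl --fail --silent --show-error \
    --max-time $(( ${4:-60} + 30 )) \
    -o "${3}_profile.pb.gz" "$url"
}
```

For example, `fetch_profile <agent_ip> <agent_port> cpu 60` saves cpu_profile.pb.gz, and `go tool pprof -top cpu_profile.pb.gz` then lists the functions that consumed the most CPU time.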

Answer 2 · 2023-10-31T09:17:33.000Z

cpu_profile.tgz
mem_profile.tgz
@apetruhin

Answer 3 · 2023-10-31T09:29:12.000Z

goroutine_profile.tgz
Could this be a goroutine leak?

Answer 4 · 2023-10-31T12:23:00.000Z

@wenhuwang,
Could you please update coroot-node-agent to the latest version, 1.14.1 (the helm chart has also been updated), and verify if this issue has been resolved?

Answer 5 · 2023-11-01T05:11:47.000Z

OK, I will try.

Answer 6 · 2023-11-01T11:34:05.000Z

@apetruhin this issue has been resolved after upgrading to the latest version, thank you.

Answer 7 · 2023-11-01T11:51:17.000Z

@wenhuwang, thank you for providing such comprehensive details on the issue.