The metrics interface of some nodes cannot respond normally
wenhuwang opened this issue · 7 comments
Env
# k get node -owide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
10.165.6.25 Ready node 584d v1.19.4 10.165.6.25 <none> Ubuntu 18.04.6 LTS 5.4.187-0504187-generic docker://19.3.13
10.165.6.26 Ready node 581d v1.19.4 10.165.6.26 <none> CentOS Linux 7 (Core) 5.4.243-1.el7.elrepo.x86_64 docker://19.3.13
10.165.6.27 Ready node 581d v1.19.4 10.165.6.27 <none> CentOS Linux 7 (Core) 5.4.243-1.el7.elrepo.x86_64 docker://19.3.13
10.165.8.23 Ready node 109d v1.19.4 10.165.8.23 <none> CentOS Linux 7 (Core) 5.4.243-1.el7.elrepo.x86_64 docker://19.3.13
....
# helm -n coroot list
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
coroot coroot 1 2023-10-25 15:48:12.919318 +0800 CST deployed coroot-0.5.1 0.21.0
# k -n coroot get pods -owide | grep coroot-node-agent
coroot-node-agent-249ws 1/1 Running 0 36m 10.165.208.69 10.165.6.27 <none> <none>
coroot-node-agent-6bxlb 1/1 Running 0 4h27m 10.165.204.252 10.165.8.23 <none> <none>
coroot-node-agent-tfhdw 1/1 Running 6 4h27m 10.165.210.2 10.165.6.26 <none> <none>
coroot-node-agent-89xqp 1/1 Running 7 4h26m 10.165.202.98 10.165.6.25 <none> <none>
Description
On some nodes, the coroot-node-agent status is down.
![image](https://private-user-images.githubusercontent.com/16423072/279296001-23ab2975-5990-4cda-bd73-5e6538122850.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTY0MTg4MjIsIm5iZiI6MTcxNjQxODUyMiwicGF0aCI6Ii8xNjQyMzA3Mi8yNzkyOTYwMDEtMjNhYjI5NzUtNTk5MC00Y2RhLWJkNzMtNWU2NTM4MTIyODUwLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA1MjIlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNTIyVDIyNTUyMlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWZmYjJhZjZkNjJkN2QxMDYxMGQyZDdjNGMzOGUzYWQ3ODk3NTUyZWRkMTkwNDE2YmE5YTI1ZjE1YjdjMmQ4OTQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.28Zx4DQNOB8ZWReFfqbsfFmS_4YCAnRQd2pZN-QM-Cs)
The CPU profile shows that the netlink.AddrList function takes up more than 70% of the CPU time.
The abnormal coroot-node-agent pods use about 2.5 CPU cores, while a normal pod uses about 0.2 cores.
All node configurations and pod counts are similar. Please help me troubleshoot the problem.
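The pod listing in the Env section shows restart counts of 6 and 7 on two of the agents, so filtering on restarts may be a useful first step for narrowing down which agents to profile. A minimal sketch, assuming the `kubectl get pods -owide` column layout shown in this issue (newer kubectl versions may print extra text in the RESTARTS column); the function name `restarted_agents` is hypothetical:

```shell
# Print name, restart count, and node of coroot-node-agent pods that have restarted.
# Assumed column order (as in this issue): NAME READY STATUS RESTARTS AGE IP NODE ...
restarted_agents() {
  awk '$1 ~ /^coroot-node-agent/ && $4 > 0 { print $1, $4, $7 }'
}
```

It would be used as `kubectl -n coroot get pods -owide | restarted_agents`. A restart count alone does not prove that a pod is the one with high CPU usage; it only suggests where to look first.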
@wenhuwang, could you please share the CPU and Memory profiles of an affected agent?
curl -o mem_profile.tgz 'http://<agent_ip>:<agent_port>/debug/pprof/heap'
curl -o cpu_profile.tgz 'http://<agent_ip>:<agent_port>/debug/pprof/profile?seconds=60'
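The two commands above can be wrapped in small helpers, followed by a quick look at the hottest functions. This is a minimal sketch, not the maintainers' tooling: the function names `pprof_url`, `fetch_profiles`, and `top_cpu` are hypothetical, and `top_cpu` assumes a local Go toolchain so that `go tool pprof` is available.

```shell
# Sketch only: the agent IP and port must be supplied by the caller.

# Build the URL of a pprof endpoint on an agent.
pprof_url() {
  # $1 = agent IP, $2 = agent port, $3 = endpoint path with optional query
  printf 'http://%s:%s/debug/pprof/%s\n' "$1" "$2" "$3"
}

# Download the heap profile and a 60-second CPU profile from one agent.
fetch_profiles() {
  curl -sS -o mem_profile.tgz "$(pprof_url "$1" "$2" heap)" &&
  curl -sS -o cpu_profile.tgz "$(pprof_url "$1" "$2" 'profile?seconds=60')"
}

# List functions sorted by CPU time in the downloaded CPU profile.
top_cpu() {
  go tool pprof -top cpu_profile.tgz
}
```

For example, `fetch_profiles "$AGENT_IP" "$AGENT_PORT" && top_cpu` would download both profiles and list functions by CPU time, which is where a hot spot such as `netlink.AddrList` would show up.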
goroutine_profile.tgz
Could this be a goroutine leak?
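One way to test the leak hypothesis is to sample the goroutine count a few times and see whether it keeps growing. The sketch below assumes the agent serves the standard Go `net/http/pprof` goroutine endpoint next to the heap and CPU endpoints used above; the text form of that profile (`debug=1`) starts with a line such as `goroutine profile: total 123`. The function names are hypothetical.

```shell
# Extract the goroutine count from the header line of a debug=1 goroutine dump,
# which has the form: "goroutine profile: total <N>".
goroutine_total() {
  awk '/^goroutine profile: total/ { print $4; exit }'
}

# Print the current goroutine count of one agent.
agent_goroutines() {
  # $1 = agent IP, $2 = agent port
  curl -sS "http://$1:$2/debug/pprof/goroutine?debug=1" | goroutine_total
}
```

Running `agent_goroutines "$AGENT_IP" "$AGENT_PORT"` every few minutes on an affected and a healthy agent gives a rough comparison. A count that only ever increases on the affected agent would be consistent with a leak, though it would not confirm one.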
@wenhuwang, could you please update coroot-node-agent to the latest version, 1.14.1 (the helm chart has also been updated), and verify if this issue has been resolved?
OK, I will try.
@apetruhin, this issue has been resolved after upgrading to the latest version. Thank you.
@wenhuwang, thank you for providing such comprehensive details on the issue.