NetApp/trident

Could not update Trident controller with node registration. Slow Trident CSI controller.

dmpo1 opened this issue · 0 comments

dmpo1 commented

Hi guys, I have a rather old version of Trident (21.10.0) and Kubernetes 1.23.10, but it was working well up until yesterday, when it stopped :)

We have a Kubernetes cluster with Trident and a NetApp ONTAP system used as NFS storage.
The problem is that after a restart of the trident-csi daemonset pods, it takes minutes or hours for them to fully start.
The pods can't register their nodes with the controller (which I've tried restarting as well).
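
For reference, this is roughly how I've been checking the registration state (assuming the TridentNode custom resources reflect what the controller has registered; the pod label below is how my install is labelled and may differ):

# Node daemonset pods and their readiness
kubectl get pods -n trident -l app=node.csi.trident.netapp.io -o wide

# Nodes the controller has registered (TridentNode custom resources)
kubectl get tridentnodes -n trident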

time="2023-10-11T10:01:24Z" level=debug msg="\n>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\nPUT https://172.17.51.66:34571/trident/v1/node/xxxx\nHeaders: map[Content-Type:[application/json] X-Request-Id:[24dcb56a-947f-4648-ae73-55ff76a23e75]]\nBody: {\n  \"name\": \"xxxx\",\n  \"ips\": [\n    \"x.x.x.x\",\n    \"x.x.x.x\",\n    \"172.16.1.0\",\n    \"172.17.0.1\"\n  ],\n  \"nodePrep\": {\n    \"enabled\": false\n  }\n}\n--------------------------------------------------------------------------------" requestID=24dcb56a-947f-4648-ae73-55ff76a23e75 requestSource=Internal
time="2023-10-11T10:01:54Z" level=warning msg="Could not update Trident controller with node registration, will retry." error="could not log into the Trident CSI Controller: error communicating with Trident CSI Controller; Put \"https://172.17.51.66:34571/trident/v1/node/xxxx\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" increment=4.763368615s requestID=24dcb56a-947f-4648-ae73-55ff76a23e75 requestSource=Internal

In the debug logs of the trident-main container in the controller pod, I can see that it takes an enormous amount of time to complete requests. Here, for example, it took almost 2 minutes and returned a 400 status:

time="2023-10-11T10:03:21Z" level=debug msg="REST API call complete." duration=1m57.453167471s method=PUT requestID=24dcb56a-947f-4648-ae73-55ff76a23e75 requestSource=REST route=AddOrUpdateNode status_code=400 uri=/trident/v1/node/xxx

But after a while some pods (the example below is a different pod/node) managed to register with the same controller; here the controller processed the request in about 25 seconds:

time="2023-10-11T12:16:33Z" level=info msg="Added a new node." handler=AddOrUpdateNode node=yyy requestID=b8c4813b-069d-48f9-a005-03f5da61a061 requestSource=REST
time="2023-10-11T12:16:33Z" level=debug msg="REST API call complete." duration=25.81268769s method=PUT requestID=b8c4813b-069d-48f9-a005-03f5da61a061 requestSource=REST route=AddOrUpdateNode status_code=201 uri=/trident/v1/node/yyy

Also, if I run tridentctl get backends -n trident, the command hangs forever. At the same time I can list backend details using kubectl get tridentbackend -n trident.
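
To try to separate tridentctl and the network path from the controller itself, I've been timing both paths and also calling the REST API from inside the controller pod. The 127.0.0.1:8000 address is my assumption about where the controller's plain-HTTP REST endpoint listens inside the pod, so treat this as a sketch:

# Compare the controller REST path with the direct CRD path
time tridentctl get backend -n trident
time kubectl get tridentbackend -n trident

# Call the REST API from inside the controller pod, bypassing the service
kubectl exec -n trident deploy/trident-csi -c trident-main -- \
  tridentctl get backend --server 127.0.0.1:8000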

I don't see any performance issues on the node where the controller pod is running. The etcd database is relatively busy (around 300 MB in size), but the control plane nodes have plenty of CPU and memory resources.
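
Since, as far as I understand, the controller persists nodes and backends as CRDs through the Kubernetes API server, I've also been timing plain API reads to rule out API server / etcd latency; a rough check:

# Time reads that go through the API server to etcd
time kubectl get tridentnodes -n trident
time kubectl get tridentbackend -n trident

# API server health detail
kubectl get --raw='/readyz?verbose'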

Does anyone know what could cause such slowness in the controller (and the 400 status)?

Thanks a lot!