Mellanox/k8s-rdma-shared-dev-plugin

RDMA resources change to 0 after kubelet restart, and will not be updated again

miaojianwei opened this issue

What happened?
After kubelet was restarted, the Allocatable rdma/mlnx_shared in Node.Status dropped to 0 and was never updated again.

What did you expect to happen?
RDMA resources should be updated back to 1 after the kubelet restart.

Versions
device plugin: v1.2.1
kubelet: v1.19.8
Linux version: 3.10.0-1127.18.2.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) )

How can we reproduce it (as minimally and precisely as possible)?
Just restart kubelet:

root@51QD0W2 ~# kks describe node | grep "rdma/mlnx_shared:"
rdma/mlnx_shared: 1
rdma/mlnx_shared: 1
root@51QD0W2 ~# systemctl restart kubelet
root@51QD0W2 ~# kks describe node | grep "rdma/mlnx_shared:"
rdma/mlnx_shared: 0
rdma/mlnx_shared: 0

Device plugin logs

2022/01/19 11:15:55 Starting K8s RDMA Shared Device Plugin version= master
Using Kubelet Plugin Registry Mode
...
2022/01/19 11:15:55 All servers started.
2022/01/19 11:15:55 Listening for term signals
2022/01/19 11:15:55 Starting OS watcher.
2022/01/19 11:15:55 Updating "rdma/mlnx_shared" devices
2022/01/19 11:15:55 mlnx_shared.sock gets registered successfully at Kubelet
2022/01/19 11:15:55 exposing "1" devices
2022/01/19 11:16:16 mlnx_shared.sock gets registered successfully at Kubelet
2022/01/19 11:16:16 Updating "rdma/mlnx_shared" devices
2022/01/19 11:16:16 error: failed to update "rdma/mlnx_shared" resouces: rpc error: code = Unavailable desc = transport is closing
2022/01/19 11:16:45 discovering host network devices
2022/01/19 11:16:45 DiscoverHostDevices(): device found: 0000:19:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2022/01/19 11:16:45 DiscoverHostDevices(): device found: 0000:19:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2022/01/19 11:16:45 no changes to devices for "rdma/mlnx_shared"
2022/01/19 11:16:45 exposing "1" devices
...

These logs show that after the kubelet restart, the device plugin failed to send the devices through the ListAndWatch stream. Since the devices do not change afterwards, the device plugin never sends another update event to ListAndWatch(), so the resource count stays at 0.
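For context, the ListAndWatch loop works roughly like this (a simplified sketch, not the exact upstream source; the struct fields and import path are assumptions based on the log messages):

```go
import (
	"log"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

type resourceServer struct {
	resourceName   string
	devs           []*pluginapi.Device
	updateResource chan bool
	stopWatcher    chan bool
}

func (rs *resourceServer) ListAndWatch(
	_ *pluginapi.Empty, stream pluginapi.DevicePlugin_ListAndWatchServer) error {
	// Queue an initial update so kubelet learns about the devices.
	rs.updateResource <- true
	for {
		select {
		case <-rs.stopWatcher:
			return nil
		case <-rs.updateResource:
			resp := &pluginapi.ListAndWatchResponse{Devices: rs.devs}
			if err := stream.Send(resp); err != nil {
				// The error is only logged; the loop keeps waiting for the
				// next update even though this stream is already dead.
				log.Printf("error: failed to update %q resources: %v",
					rs.resourceName, err)
			}
		}
	}
}
```

A Send() failure here is swallowed, which is why the 0 on the kubelet side is never corrected.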

Debugging
After adding an id to ListAndWatch() and some more debug logging, I got this:

2022/01/19 11:35:44 Starting K8s RDMA Shared Device Plugin version= master
Using Kubelet Plugin Registry Mode
...
2022/01/19 11:35:44 All servers started.
2022/01/19 11:35:44 Listening for term signals
2022/01/19 11:35:44 Starting OS watcher.
2022/01/19 11:35:44 mlnx_shared.sock gets registered successfully at Kubelet
2022/01/19 11:35:44 xxxx ListAndWatch is involed with id: 81
2022/01/19 11:35:44 Updating "rdma/mlnx_shared" devices
2022/01/19 11:35:44 xxxxxx ListAndWatch updating with id: 81
2022/01/19 11:35:44 exposing "1" devices
2022/01/19 11:36:06 mlnx_shared.sock gets registered successfully at Kubelet
2022/01/19 11:36:06 xxxx ListAndWatch is involed with id: 87
2022/01/19 11:36:06 Updating "rdma/mlnx_shared" devices
2022/01/19 11:36:06 xxxxxx ListAndWatch updating with id: 81
2022/01/19 11:36:06 error: failed to update "rdma/mlnx_shared" resouces: rpc error: code = Unavailable desc = transport is closing
2022/01/19 11:36:34 discovering host network devices
2022/01/19 11:36:34 DiscoverHostDevices(): device found: 0000:19:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2022/01/19 11:36:34 DiscoverHostDevices(): device found: 0000:19:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
2022/01/19 11:36:34 no changes to devices for "rdma/mlnx_shared"
2022/01/19 11:36:34 exposing "1" devices
...

This shows that before the kubelet restart the ListAndWatch stream id was 81. After the restart, the new ListAndWatch stream has id 87, but the update event was received by the old ListAndWatch stream (id 81). That old stream had already been closed, so stream.Send() failed.
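The root cause is that both the old and the new ListAndWatch goroutines receive from the same update channel, and Go delivers a send to whichever receiver the runtime happens to pick. A minimal standalone demonstration (not the plugin code):

```go
package main

import (
	"fmt"
	"time"
)

// Two goroutines receive from the same channel; a single send is
// delivered to whichever receiver the runtime picks. After a kubelet
// restart, that receiver can be the old, already-broken stream.
func listAndWatch(id int, update <-chan bool) {
	for range update {
		fmt.Printf("ListAndWatch %d got the update\n", id)
	}
}

func main() {
	update := make(chan bool)
	go listAndWatch(81, update) // old stream's goroutine, still running
	go listAndWatch(87, update) // new stream's goroutine after the restart
	update <- true              // may land on either goroutine
	time.Sleep(100 * time.Millisecond)
}
```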

Solution
When stream.Send() fails, return an error to stop the old ListAndWatch() and pass the update event on to the new ListAndWatch() for processing; this resolves the bug (see the sketch below).
Also, the rs.updateResource <- true at the beginning of the new ListAndWatch() may block, because UpdateDevices() or the old ListAndWatch() may also write to rs.updateResource while kubelet is restarting. So it is better not to use rs.updateResource <- true at the start of ListAndWatch(); use stream.Send() instead, to expose the devices to kubelet directly.
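Put together, the fix looks roughly like this (a sketch of the idea only, reusing the hypothetical types from the first sketch; the real change is in the PR, and the re-queueing goroutine is just one possible way to hand the event to the new stream):

```go
func (rs *resourceServer) ListAndWatch(
	_ *pluginapi.Empty, stream pluginapi.DevicePlugin_ListAndWatchServer) error {
	// Expose the devices to kubelet directly instead of writing to
	// rs.updateResource, which could block if UpdateDevices() or the old
	// ListAndWatch() writes to the channel at the same time.
	if err := stream.Send(&pluginapi.ListAndWatchResponse{Devices: rs.devs}); err != nil {
		return err
	}
	for {
		select {
		case <-rs.stopWatcher:
			return nil
		case <-rs.updateResource:
			resp := &pluginapi.ListAndWatchResponse{Devices: rs.devs}
			if err := stream.Send(resp); err != nil {
				// Re-queue the event so the new ListAndWatch stream can
				// process it, then return the error so this stale handler
				// terminates instead of silently looping.
				go func() { rs.updateResource <- true }()
				return err
			}
		}
	}
}
```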

PR
I have also submitted a PR to solve this bug: #46