Server is not restarted after restarting kubelet
Hi, we deploy k8s-rdma-shared-dev-plugin (artprod.dev.bloomberg.com/ds/yweng14/nvidia/cloud-native/k8s-rdma-shared-dev-plugin:v1.3.2) on our clusters and find that after kubelet is restarted, the socket (ib.sock) is not recreated. We added some debug prints to the Restart() function and found that it blocks at rs.stop <- true:
// Restart restart plugin server
func (rs *resourceServer) Restart() error {
	log.Printf("restarting %s device plugin server...", rs.resourceName)
	if rs.rsConnector == nil {
		fmt.Println("HPC Test line 225 rs.rsConnector is nil")
	}
	if rs.rsConnector.GetServer() == nil {
		fmt.Println("HPC test line 228 rs.rsConnector.GetServer() is nil")
	}
	if rs.rsConnector == nil || rs.rsConnector.GetServer() == nil {
		return fmt.Errorf("grpc server instance not found for %s", rs.resourceName)
	}
	fmt.Println("HPC test line 235 start stop connector and delete server")
	rs.rsConnector.Stop()
	fmt.Println("HPC test line 238 succeeds to stop rsConnector")
	rs.rsConnector.DeleteServer()
	fmt.Println("HPC test line 240 succeeds to delete server")
	// Send terminate signal to ListAndWatch()
	rs.stop <- true
	fmt.Println("HPC test line 245 start resource server")
	return rs.Start()
}
In our log we also see
2023/08/10 19:52:47 ListAndWatch stream close: context canceled
I think it is blocking here
https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/blob/master/pkg/resources/server.go#L294-L299
When kubelet is restarted, the stream context is canceled. ListAndWatch then prints the message above and returns nil, so nothing is left to receive from the stop channel and the send rs.stop <- true in Restart() blocks.
After we removed the context check (lines 294-299), the restart no longer blocked and ib.sock was recreated. We also checked how the NVIDIA GPU device plugin implements ListAndWatch; it does not check context.Done().
I think issue #74 may also be related to this one.
What K8s version are you using?
Does the following path exist on your system: /var/lib/kubelet/plugins_registry?
Can you provide the device plugin logs?
Hi @adrianchiris
Q: What K8s version are you using?
A: We use 1.23.
Q: Does the following path exist on your system: /var/lib/kubelet/plugins_registry?
A: Our kubelet path is /var/lib/kubelet, but we don't have /var/lib/kubelet/plugins_registry.
$ ls /var/lib/kubelet
device-plugins pki
$ ls /var/lib/kubelet/device-plugins
DEPRECATION ib.sock kubelet_internal_checkpoint kubelet.sock nvidia.sock
Q: Can you provide the device plugin logs?
A: When we start rdma-device-plugin, we see logs like the following, and ib.sock is created.
2023/08/14 02:33:17 starting rdma/ib device plugin endpoint at: ib.sock
2023/08/14 02:33:17 rdma/ib device plugin endpoint started serving
2023/08/14 02:33:17 All servers started.
2023/08/14 02:33:17 Listening for term signals
2023/08/14 02:33:17 Starting OS watcher.
2023/08/14 02:33:17 Updating "rdma/ib" devices
2023/08/14 02:33:17 exposing "1000" devices
Then I manually restart kubelet:
sudo systemctl restart kubelet
The logs are the following. The server is not restarted, and ib.sock is not recreated.
2023/08/14 02:34:17 discovering host network devices
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:0b:00.0 02 Intel Corporation Ethernet Controller 10G X550T
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:18:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:29:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:29:00.1 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:40:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:4f:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:5e:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:82:00.0 02 Intel Corporation Ethernet Controller E810-C for QSFP
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:82:00.1 02 Intel Corporation Ethernet Controller E810-C for QSFP
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:9a:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:aa:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:aa:00.1 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:c0:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:ce:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 DiscoverHostDevices(): device found: 0000:dc:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:34:17 error creating new device: "missing RDMA device spec for device 0000:0b:00.0, RDMA device \"issm\" not found"
2023/08/14 02:34:17 no changes to devices for "rdma/ib"
2023/08/14 02:34:17 exposing "1000" devices
2023/08/14 02:35:17 discovering host network devices
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:0b:00.0 02 Intel Corporation Ethernet Controller 10G X550T
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:18:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:29:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:29:00.1 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:40:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:4f:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:5e:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:82:00.0 02 Intel Corporation Ethernet Controller E810-C for QSFP
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:82:00.1 02 Intel Corporation Ethernet Controller E810-C for QSFP
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:9a:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:aa:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:aa:00.1 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:c0:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:ce:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 DiscoverHostDevices(): device found: 0000:dc:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
2023/08/14 02:35:17 error creating new device: "missing RDMA device spec for device 0000:0b:00.0, RDMA device \"issm\" not found"
2023/08/14 02:35:17 no changes to devices for "rdma/ib"
2023/08/14 02:35:17 exposing "1000" devices
Ack, so it will use the old way to register with kubelet and write resource sockets.
Please check #82; it should solve the issue.
Hi @adrianchiris, thank you very much for helping fix this issue. The PR is merged; could we have a new release?
v1.4.0 is out, please check :)