Node driver occasionally fails to register on Windows
andyzhangx opened this issue · 13 comments
On Windows, we found that node-driver-registrar occasionally hangs forever on the statement below; after restarting the registrar manually, it registers correctly. Question: is there any way to add a timeout and retry around this statement?
node-driver-registrar/cmd/csi-node-driver-registrar/node_register.go
Lines 63 to 64 in d79871c
- node-driver-registrar hangs:
I0429 19:12:33.856777 5500 main.go:113] Version: v2.1.0-0-g80d42f24
E0429 19:12:35.055779 5500 processstarttime.go:43] Could not get process start time, could not read /proc: CreateFile /proc: The system cannot find the file specified.
E0429 19:12:35.055779 5500 processstarttime.go:43] Could not get process start time, could not read /proc: CreateFile /proc: The system cannot find the file specified.
I0429 19:12:35.055779 5500 main.go:137] Attempting to open a gRPC connection with: "unix://C:\\\\csi\\\\csi.sock"
I0429 19:12:35.055779 5500 connection.go:153] Connecting to unix://C:\\csi\\csi.sock
I0429 19:12:44.501633 5500 main.go:144] Calling CSI driver to discover driver name
I0429 19:12:44.511914 5500 main.go:154] CSI driver name: "disk.csi.azure.com"
I0429 19:12:44.512939 5500 node_register.go:52] Starting Registration Server at: /registration/disk.csi.azure.com-reg.sock
I0429 19:12:44.523926 5500 node_register.go:61] Registration Server started at: /registration/disk.csi.azure.com-reg.sock
I0429 19:12:44.524941 5500 node_register.go:83] Skipping healthz server because HTTP endpoint is set to: ""
- After restarting the driver (re-running node-driver-registrar), it works:
I0429 19:12:34.188716 5480 main.go:113] Version: v2.1.0-0-g80d42f24
E0429 19:12:35.839739 5480 processstarttime.go:43] Could not get process start time, could not read /proc: CreateFile /proc: The system cannot find the file specified.
E0429 19:12:35.840735 5480 processstarttime.go:43] Could not get process start time, could not read /proc: CreateFile /proc: The system cannot find the file specified.
I0429 19:12:35.840735 5480 main.go:137] Attempting to open a gRPC connection with: "unix://C:\\\\csi\\\\csi.sock"
I0429 19:12:35.840735 5480 connection.go:153] Connecting to unix://C:\\csi\\csi.sock
I0429 19:12:45.261075 5480 main.go:144] Calling CSI driver to discover driver name
I0429 19:12:45.264589 5480 main.go:154] CSI driver name: "disk.csi.azure.com"
I0429 19:12:45.264589 5480 node_register.go:52] Starting Registration Server at: /registration/disk.csi.azure.com-reg.sock
I0429 19:12:45.266595 5480 node_register.go:61] Registration Server started at: /registration/disk.csi.azure.com-reg.sock
I0429 19:12:45.268093 5480 node_register.go:83] Skipping healthz server because HTTP endpoint is set to: ""
I0429 19:12:45.616080 5480 main.go:80] Received GetInfo call: &InfoRequest{}
I0429 19:12:47.292740 5480 main.go:90] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
/kind bug
/sig windows
Following up on my comments in #149: after reading how the kubelet plugin registration works, once the gRPC server is started we should get a notification from the kubelet through the callback.
If we don't get the callback, then as a workaround we can force a container restart. I tried a similar approach in kubernetes-csi/external-snapshotter#504; the idea is to crash the container if we don't reach a valid state after some seconds, forcing a restart. We could race a timer against the callback; in pseudocode it'd be something like:
// Registered is set to true once the driver receives an ack from the kubelet.
var Registered = false

func (e registrationServer) GetInfo(ctx context.Context, req *registerapi.InfoRequest) (*registerapi.PluginInfo, error) {
	klog.Infof("Received GetInfo call: %+v", req)
	Registered = true
	return &registerapi.PluginInfo{
		Type:              registerapi.CSIPlugin,
		Name:              e.driverName,
		Endpoint:          e.endpoint,
		SupportedVersions: e.version,
	}, nil
}

// ...

func nodeRegister(csiDriverName, httpEndpoint string) {
	// Registers the kubelet plugin watcher API.
	registerapi.RegisterRegistrationServer(grpcServer, registrar)
	go func() {
		time.Sleep(time.Second) // or a timer with exponential backoff
		if !Registered {
			os.Exit(1)
		}
	}()
	go healthzServer(socketPath, httpEndpoint)
	go removeRegSocket(csiDriverName)
	// Starts the service.
	if err := grpcServer.Serve(lis); err != nil {
		klog.Errorf("Registration Server stopped serving: %v", err)
		os.Exit(1)
	}
	// If the gRPC server is gracefully shut down, exit.
	os.Exit(0)
}
In the external-snapshotter PR I also made changes to the manifests to make sure that a pod is only considered ready after some time. This approach also worked in the snapshotter because it's a Deployment and the registrar is the only container.
Another idea is to use the HTTP endpoint already exposed together with a liveness probe: that endpoint would check the value of Registered to determine whether the pod is in a healthy state, and if not, the container would get restarted.
These ideas are workarounds, though; the fact that we sometimes get a callback and sometimes don't is something we should ask about in sig-windows.
cc @lizhuqi
+1 to asking in sig-windows. I will ping sig-windows in Slack to take a look.
I would vote for a forced restart if the registrar does not get a response from the kubelet.
We just upgraded our e2e tests to run against 1.21 Windows for secrets-store-csi-driver, and most of the e2e runs are failing with this error. There is no InfoRequest from the kubelet, and only on Windows.
sig-windows asked us to check the output of sc query afunix. If you see this error again, please run this command on your nodes.
When is the next node-driver-registrar release planned?
Within the next two weeks.
https://github.com/kubernetes-csi/node-driver-registrar/releases/tag/v2.3.0 is already published.
I have tried --mode=kubelet-registration-probe with v2.3.0, and it fails on both Linux and Windows:
F0814 00:55:19.223354 1 main.go:157] Kubelet plugin registration hasn't succeeded yet, file=/var/lib/kubelet/plugins/file.csi.azure.com/registration doesn't exist.
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc00000e001, 0xc0000ca600, 0xaa, 0xfe)
/mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
k8s.io/klog/v2.(*loggingT).output(0xf9d840, 0xc000000003, 0x0, 0x0, 0xc00016d030, 0x0, 0xcd3ca3, 0x7, 0x9d, 0x0)
/mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:975 +0x1e5
k8s.io/klog/v2.(*loggingT).printf(0xf9d840, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0xb32465, 0x48, 0xc0002626e0, 0x1, ...)
/mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:753 +0x19a
k8s.io/klog/v2.Fatalf(...)
/mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:1514
main.main()
/mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/cmd/csi-node-driver-registrar/main.go:157 +0xfc6
goroutine 7 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0xf9d840)
/mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:1169 +0x8b
created by k8s.io/klog/v2.init.0
/mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:420 +0xdf
@andyzhangx I wrote how to use it here: https://github.com/kubernetes-csi/node-driver-registrar/blob/master/README.md#health-check-with-an-exec-probe, what's your setup?
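For reference, the README's exec-probe setup wires the new mode into a liveness probe on the registrar container. Roughly (the registration path below is an example for the Azure file driver discussed in this thread, and the delay/timeout values are illustrative, not prescribed):

```yaml
livenessProbe:
  exec:
    command:
    - /csi-node-driver-registrar
    - --kubelet-registration-path=/var/lib/kubelet/plugins/file.csi.azure.com/csi.sock
    - --mode=kubelet-registration-probe
  initialDelaySeconds: 30
  timeoutSeconds: 15
```

The probe exits non-zero while the registration file hasn't been created yet, so the kubelet restarts the container until registration succeeds.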
@mauriciopoppe thanks, it works well with that config!