kubernetes-csi/node-driver-registrar

Node driver occasionally fails to register on Windows

andyzhangx opened this issue · 13 comments

On Windows, we found that node-driver-registrar occasionally hangs forever on the statement below; after restarting the registrar manually, it registers correctly. Question: is there any way to add a timeout and retry around this statement?

// Registers kubelet plugin watcher api.
registerapi.RegisterRegistrationServer(grpcServer, registrar)

  • node-driver-registrar hangs
I0429 19:12:33.856777    5500 main.go:113] Version: v2.1.0-0-g80d42f24
E0429 19:12:35.055779    5500 processstarttime.go:43] Could not get process start time, could not read /proc: CreateFile /proc: The system cannot find the file specified.
E0429 19:12:35.055779    5500 processstarttime.go:43] Could not get process start time, could not read /proc: CreateFile /proc: The system cannot find the file specified.
I0429 19:12:35.055779    5500 main.go:137] Attempting to open a gRPC connection with: "unix://C:\\\\csi\\\\csi.sock"
I0429 19:12:35.055779    5500 connection.go:153] Connecting to unix://C:\\csi\\csi.sock
I0429 19:12:44.501633    5500 main.go:144] Calling CSI driver to discover driver name
I0429 19:12:44.511914    5500 main.go:154] CSI driver name: "disk.csi.azure.com"
I0429 19:12:44.512939    5500 node_register.go:52] Starting Registration Server at: /registration/disk.csi.azure.com-reg.sock
I0429 19:12:44.523926    5500 node_register.go:61] Registration Server started at: /registration/disk.csi.azure.com-reg.sock
I0429 19:12:44.524941    5500 node_register.go:83] Skipping healthz server because HTTP endpoint is set to: ""
  • After restarting the driver (re-running node-driver-registrar), it works
I0429 19:12:34.188716    5480 main.go:113] Version: v2.1.0-0-g80d42f24
E0429 19:12:35.839739    5480 processstarttime.go:43] Could not get process start time, could not read /proc: CreateFile /proc: The system cannot find the file specified.
E0429 19:12:35.840735    5480 processstarttime.go:43] Could not get process start time, could not read /proc: CreateFile /proc: The system cannot find the file specified.
I0429 19:12:35.840735    5480 main.go:137] Attempting to open a gRPC connection with: "unix://C:\\\\csi\\\\csi.sock"
I0429 19:12:35.840735    5480 connection.go:153] Connecting to unix://C:\\csi\\csi.sock
I0429 19:12:45.261075    5480 main.go:144] Calling CSI driver to discover driver name
I0429 19:12:45.264589    5480 main.go:154] CSI driver name: "disk.csi.azure.com"
I0429 19:12:45.264589    5480 node_register.go:52] Starting Registration Server at: /registration/disk.csi.azure.com-reg.sock
I0429 19:12:45.266595    5480 node_register.go:61] Registration Server started at: /registration/disk.csi.azure.com-reg.sock
I0429 19:12:45.268093    5480 node_register.go:83] Skipping healthz server because HTTP endpoint is set to: ""
I0429 19:12:45.616080    5480 main.go:80] Received GetInfo call: &InfoRequest{}
I0429 19:12:47.292740    5480 main.go:90] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

/kind bug
/sig windows

Following up on my comments in #149: after reading how kubelet plugin registration works, once the gRPC server is started we should receive a notification from the kubelet through this callback:

func (e registrationServer) GetInfo(ctx context.Context, req *registerapi.InfoRequest) (*registerapi.PluginInfo, error) {

If we don't get the callback, then as a workaround we can force a container restart. I tried a similar approach in kubernetes-csi/external-snapshotter#504; the idea is to crash the container if we don't reach a valid state after some seconds, forcing a restart. We could use a timer racing against the registration; in pseudocode it'd be something like:

// Registered records whether the driver has received an ack from the kubelet
// (the GetInfo call). It is written by the gRPC handler and read by the
// watchdog goroutine below, so access it atomically (via sync/atomic).
var Registered int32

func (e registrationServer) GetInfo(ctx context.Context, req *registerapi.InfoRequest) (*registerapi.PluginInfo, error) {
	klog.Infof("Received GetInfo call: %+v", req)
	atomic.StoreInt32(&Registered, 1)
	return &registerapi.PluginInfo{
		Type:              registerapi.CSIPlugin,
		Name:              e.driverName,
		Endpoint:          e.endpoint,
		SupportedVersions: e.version,
	}, nil
}

// ...

func nodeRegister(csiDriverName, httpEndpoint string) {
	// Registers kubelet plugin watcher api.
	registerapi.RegisterRegistrationServer(grpcServer, registrar)

	// Watchdog: if the kubelet never calls GetInfo, exit so the container gets restarted.
	go func() {
		time.Sleep(time.Second) // or a timer with exponential backoff
		if atomic.LoadInt32(&Registered) == 0 {
			os.Exit(1)
		}
	}()

	go healthzServer(socketPath, httpEndpoint)
	go removeRegSocket(csiDriverName)
	// Starts service
	if err := grpcServer.Serve(lis); err != nil {
		klog.Errorf("Registration Server stopped serving: %v", err)
		os.Exit(1)
	}
	// If gRPC server is gracefully shut down, exit
	os.Exit(0)
}

In the external-snapshotter PR I also made changes to the manifests to make sure that a pod is only considered ready after some time. That approach worked in the snapshotter because it runs as a Deployment and the snapshotter is the only container in the pod.

Another idea is to use the HTTP endpoint that is already exposed together with a liveness probe: the endpoint would check the value of Registered to decide whether the plugin is healthy, and if it isn't, the kubelet would restart the container.
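
A minimal sketch of what that probe endpoint could look like, reusing the Registered flag from the pseudocode above; the handler, path, and port here are hypothetical illustrations, not the registrar's actual healthz implementation or flags:

package main

import (
	"net/http"
	"sync/atomic"

	"k8s.io/klog/v2"
)

// Registered is the same flag set by GetInfo in the pseudocode above.
var Registered int32

// serveRegistrationProbe exposes a /registered endpoint for a liveness probe:
// it returns 200 only after the kubelet has called GetInfo, so a pod that
// never completes registration eventually fails the probe and is restarted.
func serveRegistrationProbe(addr string) {
	mux := http.NewServeMux()
	mux.HandleFunc("/registered", func(w http.ResponseWriter, r *http.Request) {
		if atomic.LoadInt32(&Registered) == 0 {
			http.Error(w, "plugin not registered with kubelet yet", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	if err := http.ListenAndServe(addr, mux); err != nil {
		klog.Errorf("registration probe server stopped: %v", err)
	}
}

func main() {
	// In the real registrar this would run alongside nodeRegister; the address
	// here is just an arbitrary example.
	serveRegistrationProbe(":9809")
}

The DaemonSet manifest would then point a livenessProbe at that path; the exact port, path, and probe delays would be up to whoever wires this in.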

These ideas are workarounds, though; the fact that sometimes we get the callback and sometimes we don't is something we should ask sig-windows about.

cc @jingxu97 @lizhuqi

+1 to asking in sig-windows. I will ping sig-windows in Slack to take a look.

I would vote for forcing a restart if the registrar does not get a response from the kubelet.

We just upgraded our e2e tests for secrets-store-csi-driver to run against 1.21 on Windows, and most of the e2e runs are failing with this error. There is no InfoRequest from the kubelet, but only on Windows.

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_secrets-store-csi-driver/638/pull-secrets-store-csi-driver-e2e-windows/1416135481476780032/artifacts/2021-07-16T211148/node-driver-registrar.log

sig-windows asked us to check the output of sc query afunix; if you see this error again, please run that command on your nodes.

@msau42 Thank you for merging #152. When is the next node-driver-registrar release planned?

When is the next node-driver-registrar release planned?

Within the next two weeks.

I have tried --mode=kubelet-registration-probe with v2.3.0; it fails on both Linux and Windows:

F0814 00:55:19.223354       1 main.go:157] Kubelet plugin registration hasn't succeeded yet, file=/var/lib/kubelet/plugins/file.csi.azure.com/registration doesn't exist.
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc00000e001, 0xc0000ca600, 0xaa, 0xfe)
        /mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
k8s.io/klog/v2.(*loggingT).output(0xf9d840, 0xc000000003, 0x0, 0x0, 0xc00016d030, 0x0, 0xcd3ca3, 0x7, 0x9d, 0x0)
        /mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:975 +0x1e5
k8s.io/klog/v2.(*loggingT).printf(0xf9d840, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0xb32465, 0x48, 0xc0002626e0, 0x1, ...)
        /mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:753 +0x19a
k8s.io/klog/v2.Fatalf(...)
        /mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:1514
main.main()
        /mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/cmd/csi-node-driver-registrar/main.go:157 +0xfc6

goroutine 7 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0xf9d840)
        /mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:1169 +0x8b
created by k8s.io/klog/v2.init.0
        /mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:420 +0xdf