stub.go calls os.Exit(0) on lost connection
Opened this issue · 4 comments
If your NRI plugin timeouts, containerd will close the connection, and your NRI plugin os.Exit(0)
without a chance to log anything
Lines 519 to 528 in 53d3371
@klihub do you remember what was the logic behind this ? I don't think libraries should ever call os.Exit()
Also stub.Run
hides ttrpc.ErrServerClosed
Lines 436 to 450 in 53d3371
Maybe Run(ctx) should only return nil if we call Stop() and/or cancel the context ?
If your NRI plugin timeouts, containerd will close the connection, and your NRI plugin
os.Exit(0)
without a chance to log anything
Yes, but only if the plugin does not have an onClose()-handler.
Lines 519 to 528 in 53d3371
@klihub do you remember what was the logic behind this ? I don't think libraries should ever call os.Exit()
The logic was roughly: If the plugin has an onClose()
-handler it wants to get notified about the lost connection, so we gather it does know how to and wants to deal with the situation on its own. But if the plugin ignores lost connection (does not have an onClose()
-handler), we do an os.Exit()
.
Also
stub.Run
hidesttrpc.ErrServerClosed
Lines 436 to 450 in 53d3371
Maybe Run(ctx) should only return nil if we call Stop() and/or cancel the context ?
Hmm... I think we ignore ttrpc.ErrServerClosed
as a cheap way of differentiating between a failed Start()
and a successful run terminated by the runtime being shut down. But maybe we should do it differently, for instance by defining in stub a ErrFailedToStart = errors.New("failed to start")
then always return a fmt.Errorf("%w: %w", ErrFailedToStart, err)
for all errors from Start()
and return any other/later errors as such. Then one could differentiate between startup- and other runtime errors by checking it with errors.Is(err, stub.ErrFailedToStart)
, if that is really necessary.
The problem here is that you/the current code assumes lost connection is a normal shutdown.
For clean shutdown, containerd (nri adaptation) should do an rpc call to ask the plugin to shutdown itself, so we are sure ErrServerClosed means timeout or containerd crashing.
The problem here is that you/the current code assumes lost connection is a normal shutdown. For clean shutdown, containerd (nri adaptation) should do an rpc call to ask the plugin to shutdown itself, so we are sure ErrServerClosed means timeout or containerd crashing.
But is the Exit(0)
-logic really a problem ? If you don't want the stub to exit your plugin when the connection is lost, yet you are not interested in the fact itself, you can simply instantiate the stub with an extra stub.WithOnClose(func (){})
option.
Unless your plugin is the kind which does unsolicited updates, if you don't register an OnClose
-handler, you will never realize that you have lost the connection. So IMO, the least we must do in the stub is to exit runtime-launched plugins when the connection is lost since those cannot re-establish their connection. Instead of that, now we exit all plugins which have no OnClose handler.
About swallowing ErrServerClosed
in Run()
once the stub has been successfully Start()
ed, I think it should be safe to get rid of that. Just need to check if I rely on that behavior in some tests or integration tests and fix them if I do...