mercury-hpc/mercury

HG: safe mechanism to deregister an RPC while handles for that RPC are still in use

carns commented

Is your feature request related to a problem? Please describe.

Imagine a hypothetical scenario in which a service is periodically receiving a particular RPC type. The service then begins to shut down (without coordinating with clients) and deregisters that RPC as part of the shutdown process.

In this case, the service could have already begun executing handlers for the RPC, and those handlers will continue to execute despite deregistration. Margo includes a workaround for this (mochi-hpc/mochi-margo#170) that seems to cover most cases by simply checking whether the registered data associated with a given RPC is NULL when it is retrieved.

Describe the solution you'd like

It may be cleaner if Mercury had a way to avoid impacting existing handles on a given RPC ID when deregistering. For example, it could deny new RPCs on that ID immediately, but use reference counting to defer full deregistration until all in-flight handles associated with the ID are closed. There are probably other solutions; that's just one option.
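To make the reference-counting idea concrete, here is a minimal single-threaded sketch. It is not Mercury code: `struct rpc_entry` and every function name below are hypothetical, made up for illustration, and a real implementation would need atomics or a lock around the counter and the `accepting` flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical registry entry; NOT Mercury's actual internal structure. */
struct rpc_entry {
    void *data;               /* user data registered with the RPC */
    void (*free_cb)(void *);  /* optional destructor for data */
    unsigned int refcount;    /* 1 for the registration + 1 per open handle */
    bool accepting;           /* false once deregistration was requested */
};

void rpc_entry_init(struct rpc_entry *e, void *data, void (*free_cb)(void *))
{
    e->data = data;
    e->free_cb = free_cb;
    e->refcount = 1; /* reference held by the registration itself */
    e->accepting = true;
}

/* Drop one reference; release the registered data on the last one.
 * Returns true if this call performed the final teardown. */
bool rpc_entry_put(struct rpc_entry *e)
{
    assert(e->refcount > 0);
    if (--e->refcount > 0)
        return false;
    if (e->free_cb != NULL && e->data != NULL)
        e->free_cb(e->data);
    e->data = NULL;
    return true;
}

/* Called when a handle is created for an incoming RPC.
 * Fails immediately once deregistration has been requested. */
bool rpc_handle_open(struct rpc_entry *e)
{
    if (!e->accepting)
        return false;
    e->refcount++;
    return true;
}

/* Called when a handle is destroyed. */
bool rpc_handle_close(struct rpc_entry *e)
{
    return rpc_entry_put(e);
}

/* Deny new RPCs right away and drop the registration's reference.
 * Returns true only if no handles were in flight, so teardown was immediate;
 * otherwise the last rpc_handle_close() performs it. */
bool rpc_deregister(struct rpc_entry *e)
{
    if (!e->accepting)
        return false; /* already deregistered; do not drop the reference twice */
    e->accepting = false;
    return rpc_entry_put(e);
}
```

With this shape, a handler that opened its handle before `rpc_deregister()` can still read `data` until it closes the handle, while any attempt to open a new handle after deregistration fails.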

Describe alternatives you've considered

So far it seems like in-flight RPCs aren't particularly harmed unless they rely on registered data associated with the RPC, but we are still testing.

The Margo fix that Phil mentions is only part of the solution, as it just applies to some Margo boilerplate logic that runs before user RPC handler code. It looks like service RPC handlers themselves have to be careful not to assume they will be able to retrieve data registered with the RPC. Adding safety checks there is not a huge deal, but it would be nice if Mercury could provide some stricter guarantees on the lifetime of registered data for RPC handlers that are already executing.
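For reference, the kind of handler-side safety check I mean looks roughly like the sketch below. It deliberately does not use the Mercury or Margo APIs: `struct rpc_slot`, `slot_registered_data()`, `slot_deregister()`, and `handle_request()` are hypothetical stand-ins for "look up the data registered with this RPC ID", assuming the lookup returns NULL after deregistration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-service state that was registered with the RPC. */
struct service_state {
    int requests_served;
};

/* Hypothetical stand-in for the RPC registration table (one slot only). */
struct rpc_slot {
    unsigned long id;
    void *data; /* NULL once the RPC has been deregistered */
};

/* Stand-in lookup: returns the registered data, or NULL if the ID does not
 * match or the RPC was deregistered. */
void *slot_registered_data(const struct rpc_slot *slot, unsigned long id)
{
    assert(slot != NULL);
    return (slot->id == id) ? slot->data : NULL;
}

/* Stand-in deregistration: the registered data is no longer retrievable. */
void slot_deregister(struct rpc_slot *slot)
{
    assert(slot != NULL);
    slot->data = NULL;
}

enum handler_status { HANDLER_OK = 0, HANDLER_SHUTTING_DOWN = 1 };

/* Service handler body: never assume the registered data is still there. */
enum handler_status handle_request(const struct rpc_slot *slot, unsigned long id)
{
    struct service_state *st = slot_registered_data(slot, id);
    if (st == NULL)
        return HANDLER_SHUTTING_DOWN; /* RPC was deregistered under us */
    st->requests_served++;
    return HANDLER_OK;
}
```

Note that a check like this only detects deregistration that happened before the lookup. It does not keep the data alive if it is freed while the handler is still using it, which is why a lifetime guarantee from Mercury would be preferable.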

Has this problem been solved in Mercury 2.3.0?

No, this has not been implemented yet.