mtrudel/thousand_island

Feature request: Let the process outlive beyond the connection

farhadi opened this issue · 9 comments

I notice in the code that when a tcp_error or tcp_closed is received, the connection process is terminated.

It would be beneficial if instead of ignoring the return value from handle_close / handle_error callbacks, we could depend on the return value of these callbacks to determine whether to keep the process running or to terminate it.

This feature is particularly useful in implementing protocols like MQTT, where we aim to make the session resilient to connection interruptions. Let's say when a connection drops, we allow the process to remain alive for a reasonable amount of time. If the client reconnects, we can then assign the controlling process of the socket to the already existing process, saving us a significant amount of time and effort in persisting/initializing the state.

Changing the behavior to handle the return value of close/error callbacks would be a breaking change. However, I believe it is worth adding this feature for the version 2 milestone.

That sounds like a great idea! Did you have an idea for how a reconnect would happen in that case, or would you just want time to safe things away? If it helps your use case, you're able to take as much time as you need in those callbacks; there's no expectation that they return quickly

Blocking those callbacks is kind of a workaround for connection interruptions but still has the downside that we have to copy the state to the new connection process, also in some cases we might need to keep processing incoming messages and update the state accordingly while we are waiting for the client to reconnect, and in case of the MQTT protocol the session timeout is a configurable value by clients which might be even hours.

Sorry for the late reply; was on holiday last week.

The tricky thing here is establishing how the server process and the new client would find each other. Once the TCP connection is severed, it's obviously on the client to attempt to re-establish a connection. When that happens, there's nothing inherent in the connection to make it possible to reconnect it with the previous server process (at least, nothing that's low-level enough to be a Thousand Island concern). The same goes for any of the errors that get raised out of the TCP stack; they render the existing connection essentially closed and all of the above applies.

Am I missing something? Did you have something specific in mind?

Thanks for your reply.
I think Thousand Island should provide a way to let the connection process stay alive and also provide an api to change the controlling process of a connection to another process. The connection handover scenarios might be very protocol specific and should be up to the library user to decide how to implement it.
For example in MQTT, first the CONNECT packet is processed and based on the parameters provided by the client, server decides whether it should look up the client id in a registry to see if there is already a process for that client running and hand over the connection to that process.

I understand. I think there are two parts of this that I have somewhat different opinions on.

First, allowing the handler process to outlive the underlying connection. There's a of plumbing that would need to change to accommodate that (internally, we signal close/error via a {:stop,... tuple and do the handle_close/2, handle_error/3 calls inside terminate/2, when we've already committed to shutting down the GenServer. We'll have to move things around in order to get this to work, but it should be strictly additive and not a breaking change). At first blush, I think that this is actually a meaningful use case and something we ought to support.

Second, in terms of being able to transfer the controlling process of a connection between handler processes, I'm rather more reluctant to consider. Despite being an advanced feature, it leaves the door open to a ton of oddball corner cases (a handler process being the controlling process for multiple sockets at one time, what to do about the state of the socket after its' been shut down but the handler process is alive, how & where handshake and upgrades are handled, etc). It's definitely possible (and I don't think even that difficult) to accomplish, but it doesn't feel very OTP-like, which is one of Thousand Island's fundamental North Stars.

At the same time, Bandit is currently wrestling with some of the side effects of doing a lot of work in the handler process (specifically, handling a potentially large number of keepalive requests in a single handler function; see mtrudel/bandit#313). As a result of this, my thinking is changing towards trying to do less in a handler process, not more. The Thousand Island handler model has always been premised on the idea that it's a thin wrapper around a single connection, not necessarily a client (which may make any number of connections). I'm more than a little reluctant to change this abstraction.

For the first part I think it is possible to implement it in a backward compatible way by adding a configuration flag to take a different path to do the handle_close/2, and handle_error/3 calls. On the other hand the use case is very limited and I completely understand if you decide not to implement it.

For the second part, I agree that this is an advanced, low-level feature that if implemented should be used responsibly and with care, and I agree that for the scope of this library it might not be relevant.

Feel free to close this issue if you believe that the complexity introduced by implementing this feature will outweigh the benefits. We can always reopen it if we come up with better ideas.

I was actually thinking about this the other day; I think this is the perfect example of a case when you may want to consider using the escape hatch built into ThousandIsland.Handler (see 'When Handler Isn't Enough' at https://hexdocs.pm/thousand_island/ThousandIsland.Handler.html). Everything you're looking to accomplish is part of the handler behaviour's purview, and writing up your own copy that does what you're looking to do is likely pretty straightforward.

Now that I think about it, that's the best way to implement a feature like this without over engineering the standard features of the library.

Sounds good! I'll close the issue for hygiene but feel free to reopen it if you end up needing anything else!