magic-wormhole/magic-wormhole-mailbox-server

misclassifying successful connections as "pruned"?

warner opened this issue

The public server records a status for each mailbox that gets used (one per transfer). Over the last few months, about 76% are classified as "happy" (meaning both sides reported success in their CLOSE message), 21% as "pruny" (meaning the server deleted the mailbox before seeing both CLOSE messages because nobody had maintained a websocket connection to the mailbox for over 15-ish minutes), and 2.4% as "scary" (meaning at least one side signalled a PAKE mismatch in their CLOSE message).
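For context, here is a minimal sketch of how a breakdown like that could be computed from the server's usage records; the table and column names ("usage", "result") are assumptions for illustration, not necessarily the server's real schema.

```python
# Hypothetical query over the usage records; "usage" and "result" are
# assumed names, not necessarily the real schema.
import sqlite3

def mood_breakdown(db_path="usage.sqlite"):
    db = sqlite3.connect(db_path)
    rows = db.execute(
        "SELECT result, COUNT(*) FROM usage GROUP BY result"
    ).fetchall()
    total = sum(count for _, count in rows) or 1
    return {result: 100.0 * count / total for result, count in rows}

if __name__ == "__main__":
    for result, pct in sorted(mood_breakdown().items()):
        print(f"{result}: {pct:.1f}%")
```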

I think that 21% is artificially high. I recently noticed a database entry that showed a "happy" status for one client while the other client was still connected. That shouldn't happen with the current client code: once the transfer is complete, both sides should signal "happy" and then drop their websocket connections.

I'm thinking that magic-wormhole/magic-wormhole#272 is related, and that sometimes clients don't drop their connection when they should. The symptom would be that one of the two users sees a success message, but then their wormhole client doesn't exit. If they forget about it and just leave it running, the server would see this state. If they then close their laptop or drop off wifi, then even if the process eventually exits, the server never sees a FIN packet, and the connection won't close right away. Later, the websocket keepalive will probably be missed, and the socket will close. That will allow the mailbox to be pruned, and will result in a "pruny" status.
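To make that sequence concrete, here's a hedged sketch of the keepalive side of it. The autobahn options below do exist, but the specific values, and the assumption that the mailbox server configures them this way, are mine.

```python
# If the client vanishes without a FIN, only the websocket ping/pong
# keepalive ever reveals that the connection is dead. The options are real
# autobahn knobs; the values are made up for illustration.
from autobahn.twisted.websocket import WebSocketServerFactory

factory = WebSocketServerFactory("ws://127.0.0.1:4000/v1")
factory.setProtocolOptions(
    autoPingInterval=60,   # send a WS ping every 60 seconds
    autoPingTimeout=600,   # drop the connection if no pong arrives in time
)
# Once a pong is missed, autobahn tears the transport down, the server
# finally notices the disconnect, and after ~15 more idle minutes the
# mailbox is pruned and recorded as "pruny" even though the transfer
# itself succeeded.
```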

Fixing magic-wormhole/magic-wormhole#272 is the most important thing, of course. But it might also be useful to classify half-happy non-scary mailboxes as "happy" rather than "pruny", or maybe make a new category for them ("half-happy"?).
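Something like the following is what I have in mind for the reclassification; only the mood names come from the existing scheme, the function and its inputs are hypothetical.

```python
# Hypothetical reclassification of a mailbox at close/prune time.
# `close_moods` holds the moods from the CLOSE messages actually seen
# (zero, one, or two of them); `pruned` says whether we gave up waiting.
def classify(close_moods, pruned):
    if "scary" in close_moods:
        return "scary"        # any PAKE mismatch wins
    if len(close_moods) == 2:
        return "happy"        # both sides reported success
    if pruned and close_moods == ["happy"]:
        return "half-happy"   # one success seen; the other side just vanished
    return "pruny"            # pruned with no evidence of success
```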

Hm actually that 272 bug might not be related. No matter how the application closes the websocket (cleanly or otherwise), when the wormhole process exits, the TCP connection will go down. I can imagine abrupt partitions (disconnect from the network) that would make the TCP connection seem alive, but not application-level misbehavior that is nevertheless followed by process exit.

I need to look more closely at the server. I know I've seen mailbox TCP connections stay open much longer than they seem useful, but I haven't investigated whether the server thinks they're still open or not. Maybe an abrupt websocket shutdown lets the TCP connection drop without delivering the right connectionLost message on the server side (do we depend upon a WS-specific connectionLost somehow?). That could make the server think the connection is still around, somehow leading it to a "pruny" state.
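If it turns out the server really does depend on a WS-level close event, a fix might look something like this sketch: release the mailbox listener on either teardown path. The `_mailbox` attribute and `remove_listener()` are hypothetical stand-ins, not the server's actual API.

```python
from autobahn.twisted.websocket import WebSocketServerProtocol

class MailboxListenerProtocol(WebSocketServerProtocol):
    _mailbox = None   # set when the client claims/opens a mailbox (hypothetical)

    def _release(self):
        # Detach from the mailbox exactly once, whatever killed the
        # connection, so the prune logic sees an accurate listener count.
        if self._mailbox is not None:
            self._mailbox.remove_listener(self)
            self._mailbox = None

    def onClose(self, wasClean, code, reason):
        # websocket-level close (clean handshake or WS protocol failure)
        self._release()

    def connectionLost(self, reason):
        # transport-level drop (e.g. abrupt TCP reset with no WS close frame)
        self._release()
        WebSocketServerProtocol.connectionLost(self, reason)
```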