Quickly Drop clients not consuming messages
slominskir opened this issue · 2 comments
Currently we keep track of "dropped messages" for each web socket client. This is a counter that is incremented each time a CA update comes in and cannot be added to the client's web socket write queue because the queue is full (the queue currently holds 2000 messages). Right now the counter is purely informational. However, dropped messages are generally a symptom that the client is unresponsive and has disconnected abruptly without formally closing the web socket. Perhaps after, say, 1000 dropped messages (on top of the 2000 already in the queue) it is likely the client isn't coming back online anytime soon, and it should be forcefully dropped to save server resources.
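To make the proposal concrete, here is a minimal sketch of the drop-threshold idea. It is not the actual epics2web code: the class name `ClientWriteState`, its methods, and the use of `ArrayBlockingQueue` are assumptions for illustration. Only the 2000 queue size and the 1000 dropped-message limit come from this issue.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-client write state; names are illustrative, not epics2web's. */
class ClientWriteState {
    static final int DEFAULT_QUEUE_CAPACITY = 2000; // queue size quoted in this issue
    static final long DEFAULT_DROP_LIMIT = 1000;    // threshold proposed in this issue

    private final BlockingQueue<String> writeQueue;
    private final AtomicLong dropped = new AtomicLong();
    private final long dropLimit;

    ClientWriteState() {
        this(DEFAULT_QUEUE_CAPACITY, DEFAULT_DROP_LIMIT);
    }

    ClientWriteState(int queueCapacity, long dropLimit) {
        this.writeQueue = new ArrayBlockingQueue<>(queueCapacity);
        this.dropLimit = dropLimit;
    }

    /**
     * Tries to queue an update without blocking.
     * Returns false once the cumulative dropped count reaches the limit,
     * which is the caller's cue to close the client's session.
     */
    boolean enqueue(String message) {
        if (writeQueue.offer(message)) {
            return true; // queued normally
        }
        // Queue is full: count the drop and compare against the limit.
        return dropped.incrementAndGet() < dropLimit;
    }

    long getDroppedCount() {
        return dropped.get();
    }
}
```

The counter here is cumulative over the life of the connection. Counting only consecutive drops (resetting when the queue drains) would be more forgiving toward clients that are merely slow; which behavior is better is an open question.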
Another reason to drop the client: the write queue becomes stale fairly quickly as PV updates from minutes or even hours ago stored in the queue are basically worthless. If the client does come back online at a later date they should get fresh updates.
Note: the TCP network stack eventually drops unresponsive clients, but this can take a very long time and may waste a lot of server resources in the meantime.
On the flip side, it is possible that a client has requested too many high-frequency channels, so the client and server are simply overloaded and the write queue grows faster than web socket writes occur. In this case dropping the client would aid both server and client recovery, so it is probably fine too.
Another possibility is if a user is driving and enters a tunnel their phone may lose the Internet connection temporarily and will need to reconnect after exiting the tunnel. The epics2web client API will automatically attempt a reconnect so this should be fine. That is of course if the user hasn't flipped their car inside the tunnel because they were looking at their phone and not the road.
Using the epics2web console I'm looking at a client that has nearly 3 million dropped messages. The IP maps to a Verizon host and the user agent shows Mobile Safari (iPhone). We should probably add the date the client connected to the console, and perhaps even the date the client started dropping messages. Looks like it has been hours, maybe days, for this one client!
Another consideration is ping/pong "liveness" mechanisms. The web socket protocol has this mechanism, but it must be managed by API users (i.e. neither web browsers nor application servers will initiate a ping, though both will automatically respond to a ping with a pong). Each side of the communication channel must check whether the other side is still responding. The client browser cannot use the web socket protocol ping/pong because the browser API does not expose a "sendPing" method, so epics2web implements its own. However, epics2web currently does NOT periodically ping clients (clients ping it). Maybe epics2web should ping clients (using the protocol mechanism). Or, since we know clients must ping periodically, the server could just watch for pinging to stop. The purgeStaleClients code already exists, as does the code that updates last modified upon interaction. We need to decide whether to use it; if we do, session cleanup based on dropped messages may be unnecessary.
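As a rough illustration of the "watch if pinging stops" option, the sketch below records the last time each client was heard from and purges any client that has been silent longer than a timeout. This is not the existing purgeStaleClients code: the class `PingWatchdog`, its method names, and the string client IDs are all hypothetical, and timestamps are passed in explicitly so the logic is easy to exercise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical liveness tracker; a sketch, not epics2web's purgeStaleClients. */
class PingWatchdog {
    private final Map<String, Long> lastSeenMillis = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    PingWatchdog(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Call whenever a client pings or otherwise interacts. */
    void touch(String clientId, long nowMillis) {
        lastSeenMillis.put(clientId, nowMillis);
    }

    /** Removes and returns the IDs of clients silent for longer than the timeout. */
    List<String> purgeStale(long nowMillis) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeenMillis.entrySet()) {
            Long seen = entry.getValue();
            // remove(key, value) only succeeds if the client has not pinged
            // again since we read its timestamp.
            if (nowMillis - seen > timeoutMillis
                    && lastSeenMillis.remove(entry.getKey(), seen)) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }
}
```

A scheduled task could call `purgeStale(System.currentTimeMillis())` periodically and close the web socket session for each returned ID. The timeout would need to be comfortably longer than the client ping interval so that one delayed ping does not cause a disconnect.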