d-Rickyy-b/certstream-server-go

Slow websocket clients get stuck / disconnected

d-Rickyy-b opened this issue · 1 comments

When a client cannot keep up (see #28), the server should at least deliver certs at the maximum rate the client can handle. Currently, a bug in certstream-server-go's websocket code causes clients that can't keep up to be disconnected after some time.

Example

In a given time frame, the server-side websocket handler processed 1636 certificates.

In the same time frame, more than 2000 certs were skipped. That is totally fine; skipping is actually the solution to overloading the client.

The actual websocket client, which was heavily rate limited, processed only 107 certs in the same time frame.
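
Skipping certs for a slow client, as described above, is commonly done with a non-blocking channel send. The following is a minimal sketch of that idea, not the project's actual code: the `client` struct, the `skipped` counter and the `trySend` method are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// client is a hypothetical stand-in for a connected websocket client.
type client struct {
	broadcastChan chan []byte // buffered queue of pending messages
	skipped       int         // number of messages dropped for this client
}

// trySend enqueues message without blocking. If the client's queue is full,
// the message is dropped and counted as skipped instead of stalling the
// broadcaster for every other client.
func (c *client) trySend(message []byte) bool {
	select {
	case c.broadcastChan <- message:
		return true
	default:
		c.skipped++
		return false
	}
}

func main() {
	// Queue capacity of 2 and nobody reading: only two messages fit.
	c := &client{broadcastChan: make(chan []byte, 2)}
	for i := 0; i < 5; i++ {
		c.trySend([]byte(fmt.Sprintf("cert-%d", i)))
	}
	fmt.Printf("queued=%d skipped=%d\n", len(c.broadcastChan), c.skipped)
}
```

With this pattern a slow client only loses messages; it never slows down the broadcaster. The sketch is not safe for concurrent callers of `trySend`, because `skipped` is a plain int.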

Hypothesis

My current assumption is that the websocket write does not block; instead, the certificates are buffered before they are actually sent to the websocket. Depending on the server-side buffer size and the client's consumption rate, there might be a point where the write on the network happens too late (because the message was written to a buffer first and only much later to the network) and the write deadline is exceeded.

After that, broadcastHandler returns (line 34), but the websocket connection is not closed. Hence, the client gets no indication that something isn't working anymore.

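func (c *client) broadcastHandler() {
	for message := range c.broadcastChan {
		// Every message must be written within 5 seconds.
		c.conn.SetWriteDeadline(time.Now().Add(5 * time.Second)) //nolint:errcheck
		w, err := c.conn.NextWriter(websocket.TextMessage)
		if err != nil {
			// Returns without closing c.conn.
			return
		}
		w.Write(message) //nolint:errcheck
		// Close flushes the message to the network.
		if err := w.Close(); err != nil {
			// Returns without closing c.conn.
			return
		}
	}
	// Only reached when broadcastChan has been closed.
	_ = c.conn.WriteMessage(websocket.CloseMessage, []byte{})
}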
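
One way to avoid leaving the connection half-open is to close it on every exit path of the handler. The sketch below illustrates that idea only; it is not necessarily what the actual fix does. To stay free of dependencies it uses a hypothetical `wsConn` interface covering the `*websocket.Conn` methods involved (`SetWriteDeadline`, `WriteMessage`, `Close`), local copies of the message type constants, and a hypothetical `fakeConn` test double.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Values of websocket.TextMessage and websocket.CloseMessage in
// gorilla/websocket, redefined so the sketch compiles on its own.
const (
	textMessage  = 1
	closeMessage = 8
)

// defaultWriteTimeout mirrors the 5 second deadline of the handler above.
const defaultWriteTimeout = 5 * time.Second

// wsConn (hypothetical) lists the *websocket.Conn methods the handler needs.
type wsConn interface {
	SetWriteDeadline(t time.Time) error
	WriteMessage(messageType int, data []byte) error
	Close() error
}

type client struct {
	conn          wsConn
	broadcastChan chan []byte
}

// broadcastHandler writes queued messages until the channel is closed or a
// write fails. The deferred Close ensures the connection is closed on every
// return path, so the client notices that the stream has ended.
func (c *client) broadcastHandler(writeTimeout time.Duration) error {
	defer c.conn.Close()

	for message := range c.broadcastChan {
		if err := c.conn.SetWriteDeadline(time.Now().Add(writeTimeout)); err != nil {
			return fmt.Errorf("set write deadline: %w", err)
		}
		if err := c.conn.WriteMessage(textMessage, message); err != nil {
			return fmt.Errorf("write message: %w", err)
		}
	}
	// Channel closed by the server: send a close frame before closing.
	_ = c.conn.SetWriteDeadline(time.Now().Add(writeTimeout))
	return c.conn.WriteMessage(closeMessage, []byte{})
}

// fakeConn (hypothetical) fails every write after failAfter messages.
type fakeConn struct {
	written   int
	failAfter int
	closed    bool
}

func (f *fakeConn) SetWriteDeadline(time.Time) error { return nil }
func (f *fakeConn) Close() error                     { f.closed = true; return nil }
func (f *fakeConn) WriteMessage(int, []byte) error {
	if f.written >= f.failAfter {
		return errors.New("i/o timeout")
	}
	f.written++
	return nil
}

func main() {
	conn := &fakeConn{failAfter: 2}
	c := &client{conn: conn, broadcastChan: make(chan []byte, 3)}
	for i := 0; i < 3; i++ {
		c.broadcastChan <- []byte("cert")
	}
	close(c.broadcastChan)

	err := c.broadcastHandler(defaultWriteTimeout)
	fmt.Println("written:", conn.written, "closed:", conn.closed, "err:", err)
}
```

Returning the error instead of silently dropping it also lets the caller log why a client was disconnected.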

The clients being stuck issue was fixed in 8429ab5.

The problem with disconnected clients should be fixed in dc548b8.

Seems like the OS does weird things. A write deadline of 60 seconds prevents clients from being disconnected. I tested this by running a client with a limited capacity (100 certs/s) for a whole hour. With smaller values the client was already disconnected after a few seconds.
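
The interaction between a write deadline and a slow reader can be reproduced in isolation with the standard library. The sketch below is a simplification and rests on several assumptions: it uses `net.Pipe`, which is unbuffered, whereas a real TCP socket has OS-level send buffers, so it only shows the principle and not the exact behaviour observed above. The helper `writeWithDeadline` and the chosen durations are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// Illustrative durations, much shorter than the real 5 s / 60 s deadlines.
const (
	shortDeadline = 20 * time.Millisecond
	longDeadline  = 2 * time.Second
	readerDelay   = 200 * time.Millisecond
)

// writeWithDeadline writes one message to a peer that only starts reading
// after delay, and reports whether the write hit its deadline.
func writeWithDeadline(deadline, delay time.Duration) (bool, error) {
	server, peer := net.Pipe()
	defer server.Close()
	defer peer.Close()

	// Slow consumer: waits before draining the connection.
	go func() {
		time.Sleep(delay)
		buf := make([]byte, 64)
		peer.Read(buf) //nolint:errcheck
	}()

	if err := server.SetWriteDeadline(time.Now().Add(deadline)); err != nil {
		return false, err
	}
	_, err := server.Write([]byte("certificate update"))

	// A deadline error implements net.Error and reports Timeout() == true.
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true, nil
	}
	return false, err
}

func main() {
	short, _ := writeWithDeadline(shortDeadline, readerDelay)
	long, _ := writeWithDeadline(longDeadline, readerDelay)
	fmt.Println("short deadline timed out:", short)
	fmt.Println("long deadline timed out:", long)
}
```

The write only times out when the deadline is shorter than the time the reader needs to get to the message, which matches the observation that a larger deadline keeps a slow client connected.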