boazsegev/plezi

Server shuts down when one or more clients have a poor connection

Closed this issue · 7 comments

Switching between mobile and WiFi network access works fine. However, when I went to a specific area (where the internet connection is poor), I found that the server considers that a Slowloris attack, and I understand that, but why does the server go down if it receives too many "slowloris" connections?

WARNING: (facil.io) possible Slowloris attack from /tmp/facil-io-sock-11935
FATAL: (12073) Parent Process crash detected!
INFO: (12073) detected exit signal.
INFO: (12073) cleanup complete.
WARNING: Child worker (12073) shutdown. Respawning worker.
FATAL: (facil.io) unknown cluster connection error
       errno: No such file or directory
INFO: Server Detected exit signal.

This really shouldn't happen.

I'm assuming this only happens for WebSocket connections (the HTTP slowloris protection doesn't really print warnings, it just throttles client requests to avoid pipelining attacks)...?

Which version of iodine are you running?

Also, do you know if there's a way to re-create the issue locally?

Thanks!
Bo.

 * Iodine 0.7.21
 * Ruby 2.6.1
 * facil.io 0.7.0.beta8 (epoll)
 * 15 Workers X 7 Threads per worker.
 * Maximum 4080 open files / sockets per worker.
 * Master (root) process: 10366.

I don't have any method to recreate the issue :/

SlowLoris strives to maintain persistent connections, thus draining server resources. This happens automatically with WebSockets, which keep the connection alive by design.

Another question: how can I check whether a channel exists or not?

Hi @moxgeek

I'm still looking into this, however, just a few notes:

Iodine 0.7.21

I would recommend updating to iodine 0.7.23, although I don't think this will solve the issue.

how can I check whether a channel exists or not?

This is intentionally impossible, because the answer has a high likelihood of being wrong.

A channel might exist on another machine / worker. This is especially true when scaling horizontally using Redis.

Moreover, the channel might be lost (a user disconnected) a moment after testing for its existence, so the result is meaningless.

SlowLoris strives to maintain persistent connections, thus draining server resources.

There are more variations on the attack. One would attempt to consume server memory by requesting a lot of data (if possible) and acting as a slow client, keeping the outgoing buffer alive and the memory bloated. This type of attack could also affect WebSocket connections.

But I don't think this is what happened. I think the pub/sub system might have been overrun. I'm still looking into this.

Thanks for your feedback.
About my need for an "if channel exists" feature:
On the mobile side I can receive data from the socket. If the socket is disconnected, I use Firebase to start the WebSocket connection again, but I need to know which mobile user has been disconnected.
In my case, I identify each mobile client by channel (I put every user in their own channel).
If I could verify whether the user is connected or the channel exists, that would be great, because then I could send an FCM request if they are not connected.

Also, there are some cases where the client disconnects and the server does not call on_close (for example, when my app crashes).

FYI: I managed to find the source of the issue, but it will take me a couple of days or so to get to it.

As for verifying if the user is connected, there's no way (AFAIK) to safely do so in network applications. The best approach would be to send an ACK as part of the exchange. A missing ACK (say, after 10 seconds, for example) would be considered as if the message wasn't sent.

Hello Bo, thanks.
About the currently connected users, I use a Redis list; when a user connects, I push their id onto it:
REDIS.lpush("onlineClient","#{userid}")
and when they disconnect, I remove them from the list:
REDIS.lrem("onlineClient",-2,userid)
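
Roughly, these calls sit in the connection callbacks of my Plezi controller, something like this (a simplified sketch; REDIS is my configured Redis client, @userid is set when the connection is authenticated, and the controller name is just for illustration):

class ClientController
  def on_open
    # record the user as online as soon as the WebSocket opens
    REDIS.lpush("onlineClient", @userid)
  end

  def on_close
    # remove the user when the WebSocket closes; a count of 0 removes
    # every matching entry (this callback is the unreliable part, see below)
    REDIS.lrem("onlineClient", 0, @userid)
  end
end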

But this method is not reliable, because in some scenarios the user can disconnect without on_close being called on the server side. That happens often to me when my client crashes instead of closing cleanly.
I think your proposition (sending an ACK) is more reliable; however, I can't see how I can send and wait for a response in the same action using Plezi.

Hi @moxgeek,

I released an update to iodine that should solve the original issue (server crashes) and improve pub/sub memory usage in cluster mode.

As for on_close:

Consider that the on_close callback might be called after the client already reconnected.

For example, the on_close might be called due to a network error and timeout, 40 seconds after the client lost connection. The client, on the other hand, might have detected the error sooner and reconnected 30 seconds after the connection was lost.

Using ACK:

This would probably be a combination of a few messages:

  1. server sends message {type: :update, msg_id: XXX, timestamp: YYY, ...} to the client, using a counter for the message id or using a timestamp (less safe, but saves data in the database when saving the "last connected" client status); a sketch of this step follows the list.
  2. client sends ACK message {type: :ack, msg_id: XXX} to the server. All messages up to msg_id (or timestamp) are marked as delivered.
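
For step 1, generating the message id with a counter could look roughly like this (just a sketch; the Redis counter key, the "USER..." channel name, user_id and payload are placeholders for whatever your app already uses, and it assumes iodine's Iodine.publish(channel, message) pub/sub call):

require 'json'

# step 1 (sketch): publish an update with an incrementing, per-user message id
msg_id = REDIS.incr("msg_counter:#{user_id}")  # atomic counter kept in Redis
message = { type: :update, msg_id: msg_id, timestamp: Time.now.to_i, msg: payload }
Iodine.publish("USER#{user_id}", message.to_json)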

However, this might require the client to use an unoptimized subscription in order to add timers to messages. i.e.

client.subscribe("USER#{@user.id}") do |ch, msg|
  # schedule an ACK-timeout check for this message before forwarding it
  Iodine.run_after(5000) { "test code for ACK goes here" }
  client.write msg
end
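
The matching server side of step 2 could then live in the controller's on_message, roughly like this (again a sketch; mark_delivered_up_to is a hypothetical helper for however you record delivery in your database):

require 'json'

# step 2 (sketch): the client replies with {type: :ack, msg_id: XXX}
def on_message(data)
  msg = JSON.parse(data, symbolize_names: true)
  return unless msg[:type] == "ack"
  # everything up to msg[:msg_id] is now considered delivered
  mark_delivered_up_to(msg[:msg_id])
end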

Keep-Alive alternative:

Another alternative is for the client to send an "alive" message every 5 seconds or so. i.e.

  1. client sends {message: :alive, timestamp: XXX}
  2. server stores "last connected" in the database, which on_close might move a few seconds backwards (see the sketch after this list).
  3. client state is estimated (not necessarily accurate) from the latest "last connected" value.
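
A rough sketch of the server side of this variant (REDIS, @userid and the key names are again placeholders for your own setup):

require 'json'

# keep-alive (sketch): record "last connected" whenever an :alive message arrives
def on_message(data)
  msg = JSON.parse(data, symbolize_names: true)
  REDIS.hset("last_connected", @userid, Time.now.to_i) if msg[:message] == "alive"
end

def on_close
  # move "last connected" a few seconds backwards when the connection closes
  REDIS.hset("last_connected", @userid, Time.now.to_i - 5)
end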

PING interval:

Assuming the on_close is still in use, consider lowering the ping interval (see iodine -?).


I hope this information helps and that the update solves your issue. Please let me know.

Kindly,
Bo.