
Timeouts/connection closed errors with benchmark tests


I'm trying to set up an environment that can handle, for example, 5-7k connections. I've read #12, but I'm not able to reproduce that success story.

The socket/file limits are 500000 on the poxa machine. The Erlang and Elixir versions are as follows (install instructions from https://gist.github.com/rubencaro/6a28138a40e629b06470):

  • Erlang/OTP 20 [erts-9.3] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [hipe] [kernel-poll:false]
  • Elixir 1.6.4 (compiled with OTP 19)

There are no errors on that side (the console log contains only normal messages). But on the other side(s), where the benchmark is started, I get errors:

Running with n = 600
** (EXIT from #PID<0.74.0>) an exception was raised:
    ** (MatchError) no match of right hand side value: {:error, :closed}
    ~/deps/websocket_client/src/websocket_client.erl:150: :websocket_client.receive_handshake/3
    ~/deps/websocket_client/src/websocket_client.erl:137: :websocket_client.websocket_handshake/2
    ~/deps/websocket_client/src/websocket_client.erl:89: :websocket_client.ws_client_init/7
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3


Running with n = 400
** (EXIT from #PID<0.74.0>) an exception was raised:
    ** (MatchError) no match of right hand side value: {:error, :timeout}
        connect.exs:14: Worker.handle_info/2
        (stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
        (stdlib) gen_server.erl:686: :gen_server.handle_msg/6
        (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3

I think I'm missing something, and any help is appreciated.
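
The second trace points at connect.exs:14, where a pattern match apparently expects a success tuple and receives `{:error, :timeout}` instead. A minimal sketch of handling the error tuple and retrying with backoff rather than crashing the worker is shown here. `ConnectSketch`, `try_connect`, and `connect_fun` are hypothetical names; `connect_fun` stands in for whatever call the script makes on that line and is assumed to return `{:ok, value}` or `{:error, reason}`.

```elixir
defmodule ConnectSketch do
  # Hypothetical helper, not part of Poxa or websocket_client.
  # `connect_fun` is assumed to be a zero-arity function returning
  # {:ok, value} or {:error, reason}.
  def try_connect(connect_fun, retries \\ 3, backoff_ms \\ 100) do
    case connect_fun.() do
      {:ok, value} ->
        {:ok, value}

      # Retry only the two errors seen in the traces, doubling the wait each time.
      {:error, reason} when reason in [:timeout, :closed] and retries > 0 ->
        Process.sleep(backoff_ms)
        try_connect(connect_fun, retries - 1, backoff_ms * 2)

      # Any other error, or no retries left: report it to the caller.
      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

This does not explain why the server stops answering, but it lets the benchmark count failed workers instead of aborting the whole run on the first one.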

Hey @demuskov, thanks for opening this issue. I will give this benchmark another try this weekend.

Thanks, Eduardo!
Btw, I think the cowboy connection limits are somehow being exhausted. And if you try to look through the poxa web console, you eventually get an unresponsive poxa endpoint.

PS. Now I'm trying to redesign your benchmark for the following case:

  • Three or four independent subscription source machines with 500-1000 processes
  • One publisher that emits 10 messages

Thanks once more.

Hi! I've experimented with benchmark-based code on a few machines running Ubuntu 16 (poxa server) and Mint 18 (multiple poxa clients). The network settings on Ubuntu are tuned for a heavy-load web server. The Mint machines have the highest limits (fds: 500000).

Poxa on Ubuntu was started with several start options (daemon, console); the behavior was identical in all cases.

Two scripts:

  • publisher.exs - publishes 10 msgs to the "channel" (Ubuntu)
  • connect.exs - creates N processes and connects them to poxa (http), then tries to make N subscriptions (Mint)

Distributed env behavior:

  • connect.exs always connects to the poxa server (20, 2000 or 20000 connections) - that's good
  • websocket subscriptions are only very rarely established for more than 1000 processes in a single test run (if the subscription phase takes less than 6 secs, that's ok and all processes get their subscriptions; otherwise, see below)
  • very often the test run breaks with the error "closed" just 6 secs after the start:
	** (MatchError) no match of right hand side value: {:error, :closed}
    	~/poxa-original/deps/websocket_client/src/websocket_client.erl:150: :websocket_client.receive_handshake/3
    	~/poxa-original/deps/websocket_client/src/websocket_client.erl:137: :websocket_client.websocket_handshake/2
    	~/poxa-original/deps/websocket_client/src/websocket_client.erl:89: :websocket_client.ws_client_init/7
    	(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
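
One way to test whether the burst of simultaneous handshakes is what triggers the 6-second cutoff is to cap how many subscriptions are in flight at once. The sketch below uses `Task.async_stream/3` from the Elixir standard library for that. `RampSketch`, `run`, and `subscribe_fun` are hypothetical names; `subscribe_fun` is assumed to take the worker index, return `:ok` on success, and return an error tuple rather than raise on failure.

```elixir
defmodule RampSketch do
  # Hypothetical helper, not taken from the benchmark scripts.
  # Runs `subscribe_fun` for 1..n with at most `max_concurrency`
  # calls in flight, and tallies the outcomes.
  def run(n, subscribe_fun, max_concurrency \\ 100) do
    1..n
    |> Task.async_stream(subscribe_fun,
      max_concurrency: max_concurrency,
      timeout: 10_000,
      # A call that exceeds the timeout is killed and reported as {:exit, :timeout}.
      on_timeout: :kill_task
    )
    |> Enum.reduce(%{ok: 0, failed: 0}, fn
      {:ok, :ok}, acc -> Map.update!(acc, :ok, &(&1 + 1))
      _other, acc -> Map.update!(acc, :failed, &(&1 + 1))
    end)
  end
end
```

If a run with a small `max_concurrency` succeeds for the same N that fails when everything starts at once, that would point at a limit on pending handshakes rather than on the total number of connections.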

Consolidated env behavior (everything runs on Ubuntu):

  • if N < 1000 - everything goes ok
  • if N > 1500 - only 2 or 3 times did I get a fully functioning result; in the other cases (95%) processing stalls at around _i == 1122 (+/-) and a timeout is reported. The stack trace is exactly the same as in the distributed case, but the error is {:error, :timeout}
  • from that point on poxa still accepts connections, but subscriptions are not processed for any process, even for N=1, and a poxa restart is the only cure (it makes no difference whether you then run the test suite remotely or locally; poxa does not function again until it is restarted)


PS. Scripts below:


Ok, so we need to find a way to replicate the issues you are seeing. It could just be network slowness caused by the kernel (some resource limit?). TBH I never needed more than 10k connections with Poxa, so I never really tested beyond that. I can try to set up a DigitalOcean machine and try again. I remember having great success running Poxa on Linux but terrible results on OS X, for example.

I do not think that we are reaching system limits on the Ubuntu server. Consolidated env means that all processes, poxa and the test suite, ran on a single machine.

PS. Ubuntu server sysctl.conf:

net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.tcp_max_orphans = 65536
net.ipv4.tcp_fin_timeout = 1
net.ipv4.tcp_keepalive_time = 10
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 1
net.ipv4.tcp_max_syn_backlog = 16384
net.ipv4.tcp_synack_retries = 1
net.ipv4.tcp_mem = 50576   64768   98152
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_orphan_retries = 0
net.ipv4.tcp_syncookies = 0
net.ipv4.netfilter.ip_conntrack_max = 16777216
net.netfilter.nf_conntrack_max = 16777216
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.route.flush = 1
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.lo.rp_filter = 0
net.ipv4.conf.eth0.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.lo.accept_source_route = 0
net.ipv4.conf.eth0.accept_source_route = 0
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_syn_retries = 1
net.ipv4.tcp_rfc1337 = 1
net.ipv4.ip_forward = 0
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.icmp_echo_ignore_all = 1
net.ipv4.icmp_ignore_bogus_error_responses = 1
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 16384
net.core.rmem_default = 65536
net.core.wmem_default = 65536
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
fs.inotify.max_user_watches = 16777216
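
A sysctl.conf file lists intended values, and an entry can fail to apply if the running kernel does not have that key. A small sketch for comparing the file against the live values under /proc/sys follows, assuming a Linux host. `SysctlSketch` and its functions are hypothetical names introduced for illustration.

```elixir
defmodule SysctlSketch do
  # "net.core.somaxconn" -> "/proc/sys/net/core/somaxconn"
  def path(key), do: "/proc/sys/" <> String.replace(key, ".", "/")

  # Parses "key = value" lines, skipping blanks, comments, and lines without "=".
  def parse(conf) do
    conf
    |> String.split("\n", trim: true)
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&String.starts_with?(&1, "#"))
    |> Enum.filter(&String.contains?(&1, "="))
    |> Enum.map(fn line ->
      [key, value] = String.split(line, "=", parts: 2)
      {String.trim(key), normalize(value)}
    end)
  end

  # Collapses runs of whitespace so multi-value settings compare equal
  # regardless of whether they are separated by spaces or tabs.
  def normalize(value), do: value |> String.split() |> Enum.join(" ")

  # Reads the live value; returns {:ok, value} or {:error, reason}.
  def read(key) do
    case File.read(path(key)) do
      {:ok, raw} -> {:ok, normalize(raw)}
      {:error, reason} -> {:error, reason}
    end
  end

  # Returns {key, wanted, live} for every key whose live value differs
  # from the configured one. Keys that cannot be read appear with their error.
  def mismatches(conf) do
    conf
    |> parse()
    |> Enum.map(fn {key, wanted} -> {key, wanted, read(key)} end)
    |> Enum.reject(fn {_key, wanted, live} -> live == {:ok, wanted} end)
  end
end
```

For example, `"/etc/sysctl.conf" |> File.read!() |> SysctlSketch.mismatches()` returns the entries that are not in effect; an empty list means every readable key matches the file.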