dcaoyuan/spray-websocket

TCP kernel buffer sizes: CommandFailed for Tcp.Write

fommil opened this issue · 2 comments

/cc https://groups.google.com/d/msg/akka-user/4djcoUvcRu0/k0CT-Os_PE0J

Hi all,

We are using Spray IO (with the wandoulabs WebSockets layer) on really old RHEL5 boxes in our QA environments.

Bizarrely, our server beta release was working fine on one box, but failing to write messages on another, despite the kernels and software versions being identical.

Clients were able to connect to the server, but as soon as the server started to write to the socket, we got this sort of thing:

19 Mar 15 10:56:55.542 HttpServerConnectionakka://MDES/user/IO-UHTTP/listener-0/0 [ MDES-akka.actor.default-dispatcher-4] WARN - CommandFailed for Tcp.Write text frame: ...
19 Mar 15 10:56:55.543 HttpServerConnectionakka://MDES/user/IO-UHTTP/listener-0/0 [ MDES-akka.actor.default-dispatcher-4] WARN - event pipeline: dropped CommandFailed(Write(ByteString(),NoAck(null)))

The boxes are running "Red Hat Enterprise Linux Server release 5.8 (Tikanga)" with 2.6.18-308.el5 on x86_64 cores. We're using Scala 2.11.5, Java 1.6.0_40 and Akka 2.3.8 / Spray-IO 1.3.2.

We spotted that the kernel parameters differed between the boxes. This is the diff (the values from the working box):

net.core.rmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_default = 262144
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 4194304 16777216
net.ipv4.tcp_sack = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_wmem = 4096 4194304 16777216

and by using those parameters on all boxes the problems went away.

However, we have no control over setting these parameters on PROD boxes so we need a workaround that works without them.

But this is extremely concerning: why would kernel parameters cause the failure? Is this a bug in NIO, Spray IO, or Spray-WebSockets? Wandoulabs aren't doing anything unusual, as you can see in https://github.com/wandoulabs/spray-websocket/blob/master/spray-websocket/src/main/scala/spray/can/server/UpgradableHttpListener.scala

Most importantly, we need a workaround... does anybody have any suggestions?

The current theory is that the default kernel buffer size is too low to accept the outbound WebSocket frames. On the failing boxes (which we can't change), this is

$ cat /proc/sys/net/ipv4/tcp_wmem
4096 16384 4194304

and our messages are a few KB of JSON each.
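
To make the theory concrete, here is a minimal sketch (plain Python over loopback TCP, not Spray or Akka code) of what a small send buffer does to a non-blocking writer whose peer is not draining the socket. The function name and its parameters are made up for illustration, and the 16384 default simply mirrors the middle value of tcp_wmem on the failing boxes.

```python
import socket


def bytes_accepted_before_full(sndbuf=16384, chunk=4096, limit=64 * 1024 * 1024):
    """Write to a non-blocking TCP socket until the kernel refuses more data.

    Hypothetical helper: returns how many bytes the kernel accepted before
    send() would have blocked. The peer never reads, so its receive buffer
    fills up as well.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Request a small send buffer before connecting (the kernel may adjust it).
    client.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, sndbuf)
    client.connect(listener.getsockname())
    peer, _ = listener.accept()
    client.setblocking(False)

    sent = 0
    payload = b"x" * chunk
    try:
        while sent < limit:
            try:
                sent += client.send(payload)  # may accept only part of payload
            except BlockingIOError:
                break  # kernel buffers are full: the write cannot proceed now
    finally:
        client.close()
        peer.close()
        listener.close()
    return sent


if __name__ == "__main__":
    print(bytes_accepted_before_full())
```

The exact number printed depends on the kernel (Linux, for instance, doubles the requested SO_SNDBUF value, and the count includes whatever the peer's receive buffer absorbs). The point is only that a non-blocking writer hits a hard limit quickly when the buffers are small, and a layer that refuses new writes while an earlier one is still pending would then report failures like the ones in our logs.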

Best regards,
Sam

btw, using backpressure solved this.
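
For anyone finding this later, the idea behind ack-based backpressure can be sketched as follows. This is a language-neutral model in Python, not the code we actually run: AckedWriter, write, and on_ack are hypothetical names, and send stands in for whatever hands a frame to the connection. In Akka IO the equivalent would be sending a Tcp.Write with an ack event and only sending the next one when that ack comes back.

```python
from collections import deque


class AckedWriter:
    """Keep at most one write in flight; queue the rest until it is acked."""

    def __init__(self, send):
        self._send = send          # callable that hands one frame to the socket layer
        self._pending = deque()    # frames waiting for the in-flight write to finish
        self._in_flight = False

    def write(self, frame):
        if self._in_flight:
            # A previous write has not been acknowledged yet: buffer locally
            # instead of pushing another write at the connection.
            self._pending.append(frame)
        else:
            self._in_flight = True
            self._send(frame)

    def on_ack(self):
        # Called when the connection confirms the in-flight write completed.
        if self._pending:
            self._send(self._pending.popleft())
        else:
            self._in_flight = False


if __name__ == "__main__":
    sent = []
    writer = AckedWriter(sent.append)
    for frame in ("a", "b", "c"):
        writer.write(frame)
    print(sent)      # only the first frame has been handed over so far
    writer.on_ack()
    print(sent)      # the second frame follows after the ack
```

The trade-off is that the queue lives in the application rather than the kernel, so a real implementation would also want a bound on it (drop, close, or stop reading when it grows too large).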

Seems this is not solved after all, but I'm closing this to track it under the new ticket, which I think is a better description of the problem.