xkonni/raspberry-remote

Daemon randomly crashes - Too many open files

LucaBernstein opened this issue · 6 comments

Though I already commented on #28 I thought it would create more attention if I instead opened a new issue.

My issue is that the raspberry-remote daemon crashes at least once a day.

I found the following in the logs from journalctl, stating the following:

Oct 15 12:07:02 some-123-wks daemon[20908]: message: 111111160
Oct 15 12:07:02 some-123-wks daemon[20908]: nSys: 1
Oct 15 12:07:02 some-123-wks daemon[20908]: nGroup: 11111
Oct 15 12:07:02 some-123-wks daemon[20908]: nSwitchNumber: 16
Oct 15 12:37:03 some-123-wks daemon[20908]: nActERROR on accept: Too many open files
Oct 15 12:37:03 some-123-wks daemon[20908]: ion: 0
Oct 15 12:37:03 some-123-wks daemon[20908]: nAddr: 1008
Oct 15 12:37:03 some-123-wks daemon[20908]: nPlugs: 1280
Oct 15 12:37:03 some-123-wks daemon[20908]: message: 111111160
[...]
Oct 15 12:37:03 some-123-wks daemon[20908]: message: 111111160
Oct 15 12:37:03 some-123-wks daemon[20908]: nSys: 1
Oct 15 12:37:03 some-123-wks daemon[20908]: nGroup: 11111
Oct 15 12:37:03 some-123-wks daemon[20908]: nSwitchNumber: 16
Oct 15 12:37:03 some-123-wks daemon[20908]: nAction: 0
Oct 15 12:37:03 some-123-wks daemon[20908]: nAddr: 1008
Oct 15 12:37:03 some-123-wks daemon[20908]: nPlugs: 1280
Oct 15 12:37:03 some-123-wks systemd[1]: bluelight_daemon.service: Main process exited, code=exited, status=1/FAILURE
Oct 15 12:37:03 some-123-wks systemd[1]: bluelight_daemon.service: Failed with result 'exit-code'.

When I restart the service so that it is running and send a command to the port with echo -en "111111160"|nc -w 1 some-123-wks 11337 no log is created.

Have you any idea what creates those logs instead?
Update: They seem to be added in batch every now and then.

Thanks in advance!

Luca

Made some progress: On the daemon's server side, the ports seem to be kept open:

daemon  31540 root    6u  IPv4   12955367      0t0      TCP some-123-wks:11337->10.1.50.7:35728 (CLOSE_WAIT)
daemon  31540 root    7u  IPv4   12955368      0t0      TCP some-123-wks:11337->10.1.50.7:35958 (CLOSE_WAIT)
daemon  31540 root    8u  IPv4   12955369      0t0      TCP some-123-wks:11337->10.1.50.7:36276 (CLOSE_WAIT)
daemon  31540 root    9u  IPv4   12955385      0t0      TCP some-123-wks:11337->10.1.50.7:36454 (CLOSE_WAIT)
daemon  31540 root   10u  IPv4   12955386      0t0      TCP some-123-wks:11337->10.1.50.7:36712 (CLOSE_WAIT)
daemon  31540 root   11u  IPv4   12955387      0t0      TCP some-123-wks:11337->10.1.50.7:36714 (CLOSE_WAIT)
daemon  31540 root   12u  IPv4   12955388      0t0      TCP some-123-wks:11337->10.1.50.7:36716 (CLOSE_WAIT)
daemon  31540 root   13u  IPv4   12955398      0t0      TCP some-123-wks:11337->10.1.50.7:36718 (CLOSE_WAIT)
daemon  31540 root   14u  IPv4   12955399      0t0      TCP some-123-wks:11337->10.1.50.7:36720 (CLOSE_WAIT)
daemon  31540 root   15u  IPv4   12955400      0t0      TCP some-123-wks:11337->10.1.50.7:36722 (CLOSE_WAIT)
daemon  31540 root   16u  IPv4   12955401      0t0      TCP some-123-wks:11337->10.1.50.7:36724 (CLOSE_WAIT)
daemon  31540 root   17u  IPv4   12955402      0t0      TCP some-123-wks:11337->10.1.50.7:36828 (CLOSE_WAIT)
daemon  31540 root   18u  IPv4   12955519      0t0      TCP some-123-wks:11337->10.1.50.7:36856 (CLOSE_WAIT)
daemon  31540 root   19u  IPv4   12955520      0t0      TCP some-123-wks:11337->10.1.50.7:36864 (CLOSE_WAIT)
daemon  31540 root   20u  IPv4   12955521      0t0      TCP some-123-wks:11337->10.1.50.7:36866 (CLOSE_WAIT)
daemon  31540 root   21u  IPv4   12955522      0t0      TCP some-123-wks:11337->10.1.50.7:36868 (CLOSE_WAIT)
daemon  31540 root   22u  IPv4   12955523      0t0      TCP some-123-wks:11337->10.1.50.7:36872 (CLOSE_WAIT)
daemon  31540 root   23u  IPv4   12955550      0t0      TCP some-123-wks:11337->10.1.50.7:37060 (CLOSE_WAIT)
daemon  31540 root   24u  IPv4   12955551      0t0      TCP some-123-wks:11337->10.1.50.7:37280 (CLOSE_WAIT)
daemon  31540 root   25u  IPv4   12955552      0t0      TCP some-123-wks:11337->10.1.50.7:37438 (CLOSE_WAIT)
daemon  31540 root   26u  IPv4   12955564      0t0      TCP some-123-wks:11337->10.1.50.7:37724 (CLOSE_WAIT)
daemon  31540 root   27u  IPv4   12955647      0t0      TCP some-123-wks:11337->10.1.50.7:37898 (CLOSE_WAIT)
daemon  31540 root   28u  IPv4   12955648      0t0      TCP some-123-wks:11337->10.1.50.7:38116 (CLOSE_WAIT)
daemon  31540 root   29u  IPv4   12962817      0t0      TCP some-123-wks:11337->10.1.50.7:38282 (CLOSE_WAIT)
daemon  31540 root   30u  IPv4   12962818      0t0      TCP some-123-wks:11337->10.1.50.7:38492 (CLOSE_WAIT)
daemon  31540 root   31u  IPv4   12962819      0t0      TCP some-123-wks:11337->10.1.50.7:38734 (CLOSE_WAIT)
daemon  31540 root   32u  IPv4   12962820      0t0      TCP some-123-wks:11337->10.1.50.7:38974 (CLOSE_WAIT)
daemon  31540 root   33u  IPv4   12962938      0t0      TCP some-123-wks:11337->10.1.50.7:39096 (CLOSE_WAIT)
daemon  31540 root   34u  IPv4   12962956      0t0      TCP some-123-wks:11337->10.1.50.7:39300 (CLOSE_WAIT)
daemon  31540 root   35u  IPv4   12962957      0t0      TCP some-123-wks:11337->10.1.50.7:39564 (CLOSE_WAIT)
daemon  31540 root   36u  IPv4   12962958      0t0      TCP some-123-wks:11337->10.1.50.7:39786 (CLOSE_WAIT)
daemon  31540 root   37u  IPv4   12962959      0t0      TCP some-123-wks:11337->10.1.50.7:39938 (CLOSE_WAIT)
daemon  31540 root   38u  IPv4   12962960      0t0      TCP some-123-wks:11337->10.1.50.7:40130 (CLOSE_WAIT)
daemon  31540 root   39u  IPv4   12962961      0t0      TCP some-123-wks:11337->10.1.50.7:40418 (CLOSE_WAIT)
daemon  31540 root   40u  IPv4   12962962      0t0      TCP some-123-wks:11337->10.1.50.7:40620 (CLOSE_WAIT)
daemon  31540 root   41u  IPv4   12962963      0t0      TCP some-123-wks:11337->10.1.50.7:40816 (CLOSE_WAIT)
daemon  31540 root   42u  IPv4   12962969      0t0      TCP some-123-wks:11337->10.1.50.7:41008 (CLOSE_WAIT)
daemon  31540 root   43u  IPv4   12962970      0t0      TCP some-123-wks:11337->10.1.50.7:41232 (CLOSE_WAIT)
daemon  31540 root   44u  IPv4   12962971      0t0      TCP some-123-wks:11337->10.1.50.7:41490 (CLOSE_WAIT)
daemon  31540 root   45u  IPv4   12962981      0t0      TCP some-123-wks:11337->10.1.50.7:41628 (CLOSE_WAIT)
daemon  31540 root   46u  IPv4   12962982      0t0      TCP some-123-wks:11337->10.1.50.7:41888 (CLOSE_WAIT)
daemon  31540 root   47u  IPv4   12962983      0t0      TCP some-123-wks:11337->10.1.50.7:42036 (CLOSE_WAIT)
daemon  31540 root   48u  IPv4   12962984      0t0      TCP some-123-wks:11337->10.1.50.7:42274 (CLOSE_WAIT)
daemon  31540 root   49u  IPv4   12962985      0t0      TCP some-123-wks:11337->10.1.50.7:42454 (CLOSE_WAIT)
daemon  31540 root   50u  IPv4   12962986      0t0      TCP some-123-wks:11337->10.1.50.7:42670 (CLOSE_WAIT)
daemon  31540 root   51u  IPv4   12962987      0t0      TCP some-123-wks:11337->10.1.50.7:42902 (CLOSE_WAIT)
daemon  31540 root   52u  IPv4   12962988      0t0      TCP some-123-wks:11337->10.1.50.7:43128 (CLOSE_WAIT)
daemon  31540 root   53u  IPv4   12962989      0t0      TCP some-123-wks:11337->10.1.50.7:43298 (CLOSE_WAIT)
daemon  31540 root   54u  IPv4   12962990      0t0      TCP some-123-wks:11337->10.1.50.7:43480 (CLOSE_WAIT)
daemon  31540 root   55u  IPv4   12962998      0t0      TCP some-123-wks:11337->10.1.50.7:43724 (CLOSE_WAIT)
daemon  31540 root   56u  IPv4   12963008      0t0      TCP some-123-wks:11337->10.1.50.7:44016 (CLOSE_WAIT)
daemon  31540 root   57u  IPv4   12963096      0t0      TCP some-123-wks:11337->10.1.50.7:44210 (CLOSE_WAIT)
[...]

Ports seem to be openend with every request but not closed. This leads to an out-of-ports error at some point, thus crashing the daemon.

Now the remaining question is: Why is the daemmon keeping the ports open (CLOSE_WAIT) if netcat is configured wo timeout after 1s (-w 1)

Citing the Red Hat docs: (SRC)

CLOSE_WAIT - Indicates that the server has received the first FIN signal from the client and the connection is in the process of being closed. This means the socket is waiting for the application to execute close(). A socket can be in CLOSE_WAIT state indefinitely until the application closes it. Faulty scenarios would be like a file descriptor leak: server not executing close() on sockets leading to pile up of CLOSE_WAIT sockets.

And: "your server isn't closing the accepted socket when it reads end-of-stream."

As of my current research: The daemon seems to not close the connections on EOF / FIN

My mitigation in the meantime will be: Restart daemon service at midnight to free all sockets kept open. Though I assume that to solve this issue we would have to look into the code.

please try the dev branch, tried to (blindly, dont have the hardware anymore) fix the open sockets issue.

Thank you @xkonni for your quick response. I tried the dev branch, unfortunately the connections still pile up in CLOSE_WAIT.
Additionally the response seems to have changed as on master branch I am getting 0 and on dev branch I get 2 as response from this command: echo -en "111111160"|nc -w 1 some-123-wks 11337