Daemon randomly crashes - Too many open files
LucaBernstein opened this issue · 6 comments
Though I already commented on #28 I thought it would create more attention if I instead opened a new issue.
My issue is that the raspberry-remote daemon crashes at least once a day.
I found the following in the logs from journalctl, stating the following:
Oct 15 12:07:02 some-123-wks daemon[20908]: message: 111111160
Oct 15 12:07:02 some-123-wks daemon[20908]: nSys: 1
Oct 15 12:07:02 some-123-wks daemon[20908]: nGroup: 11111
Oct 15 12:07:02 some-123-wks daemon[20908]: nSwitchNumber: 16
Oct 15 12:37:03 some-123-wks daemon[20908]: nActERROR on accept: Too many open files
Oct 15 12:37:03 some-123-wks daemon[20908]: ion: 0
Oct 15 12:37:03 some-123-wks daemon[20908]: nAddr: 1008
Oct 15 12:37:03 some-123-wks daemon[20908]: nPlugs: 1280
Oct 15 12:37:03 some-123-wks daemon[20908]: message: 111111160
[...]
Oct 15 12:37:03 some-123-wks daemon[20908]: message: 111111160
Oct 15 12:37:03 some-123-wks daemon[20908]: nSys: 1
Oct 15 12:37:03 some-123-wks daemon[20908]: nGroup: 11111
Oct 15 12:37:03 some-123-wks daemon[20908]: nSwitchNumber: 16
Oct 15 12:37:03 some-123-wks daemon[20908]: nAction: 0
Oct 15 12:37:03 some-123-wks daemon[20908]: nAddr: 1008
Oct 15 12:37:03 some-123-wks daemon[20908]: nPlugs: 1280
Oct 15 12:37:03 some-123-wks systemd[1]: bluelight_daemon.service: Main process exited, code=exited, status=1/FAILURE
Oct 15 12:37:03 some-123-wks systemd[1]: bluelight_daemon.service: Failed with result 'exit-code'.
When I restart the service so that it is running and send a command to the port with echo -en "111111160"|nc -w 1 some-123-wks 11337
no log is created.
Have you any idea what creates those logs instead?
Update: They seem to be added in batch every now and then.
Thanks in advance!
Luca
Made some progress: On the daemon's server side, the ports seem to be kept open:
daemon 31540 root 6u IPv4 12955367 0t0 TCP some-123-wks:11337->10.1.50.7:35728 (CLOSE_WAIT)
daemon 31540 root 7u IPv4 12955368 0t0 TCP some-123-wks:11337->10.1.50.7:35958 (CLOSE_WAIT)
daemon 31540 root 8u IPv4 12955369 0t0 TCP some-123-wks:11337->10.1.50.7:36276 (CLOSE_WAIT)
daemon 31540 root 9u IPv4 12955385 0t0 TCP some-123-wks:11337->10.1.50.7:36454 (CLOSE_WAIT)
daemon 31540 root 10u IPv4 12955386 0t0 TCP some-123-wks:11337->10.1.50.7:36712 (CLOSE_WAIT)
daemon 31540 root 11u IPv4 12955387 0t0 TCP some-123-wks:11337->10.1.50.7:36714 (CLOSE_WAIT)
daemon 31540 root 12u IPv4 12955388 0t0 TCP some-123-wks:11337->10.1.50.7:36716 (CLOSE_WAIT)
daemon 31540 root 13u IPv4 12955398 0t0 TCP some-123-wks:11337->10.1.50.7:36718 (CLOSE_WAIT)
daemon 31540 root 14u IPv4 12955399 0t0 TCP some-123-wks:11337->10.1.50.7:36720 (CLOSE_WAIT)
daemon 31540 root 15u IPv4 12955400 0t0 TCP some-123-wks:11337->10.1.50.7:36722 (CLOSE_WAIT)
daemon 31540 root 16u IPv4 12955401 0t0 TCP some-123-wks:11337->10.1.50.7:36724 (CLOSE_WAIT)
daemon 31540 root 17u IPv4 12955402 0t0 TCP some-123-wks:11337->10.1.50.7:36828 (CLOSE_WAIT)
daemon 31540 root 18u IPv4 12955519 0t0 TCP some-123-wks:11337->10.1.50.7:36856 (CLOSE_WAIT)
daemon 31540 root 19u IPv4 12955520 0t0 TCP some-123-wks:11337->10.1.50.7:36864 (CLOSE_WAIT)
daemon 31540 root 20u IPv4 12955521 0t0 TCP some-123-wks:11337->10.1.50.7:36866 (CLOSE_WAIT)
daemon 31540 root 21u IPv4 12955522 0t0 TCP some-123-wks:11337->10.1.50.7:36868 (CLOSE_WAIT)
daemon 31540 root 22u IPv4 12955523 0t0 TCP some-123-wks:11337->10.1.50.7:36872 (CLOSE_WAIT)
daemon 31540 root 23u IPv4 12955550 0t0 TCP some-123-wks:11337->10.1.50.7:37060 (CLOSE_WAIT)
daemon 31540 root 24u IPv4 12955551 0t0 TCP some-123-wks:11337->10.1.50.7:37280 (CLOSE_WAIT)
daemon 31540 root 25u IPv4 12955552 0t0 TCP some-123-wks:11337->10.1.50.7:37438 (CLOSE_WAIT)
daemon 31540 root 26u IPv4 12955564 0t0 TCP some-123-wks:11337->10.1.50.7:37724 (CLOSE_WAIT)
daemon 31540 root 27u IPv4 12955647 0t0 TCP some-123-wks:11337->10.1.50.7:37898 (CLOSE_WAIT)
daemon 31540 root 28u IPv4 12955648 0t0 TCP some-123-wks:11337->10.1.50.7:38116 (CLOSE_WAIT)
daemon 31540 root 29u IPv4 12962817 0t0 TCP some-123-wks:11337->10.1.50.7:38282 (CLOSE_WAIT)
daemon 31540 root 30u IPv4 12962818 0t0 TCP some-123-wks:11337->10.1.50.7:38492 (CLOSE_WAIT)
daemon 31540 root 31u IPv4 12962819 0t0 TCP some-123-wks:11337->10.1.50.7:38734 (CLOSE_WAIT)
daemon 31540 root 32u IPv4 12962820 0t0 TCP some-123-wks:11337->10.1.50.7:38974 (CLOSE_WAIT)
daemon 31540 root 33u IPv4 12962938 0t0 TCP some-123-wks:11337->10.1.50.7:39096 (CLOSE_WAIT)
daemon 31540 root 34u IPv4 12962956 0t0 TCP some-123-wks:11337->10.1.50.7:39300 (CLOSE_WAIT)
daemon 31540 root 35u IPv4 12962957 0t0 TCP some-123-wks:11337->10.1.50.7:39564 (CLOSE_WAIT)
daemon 31540 root 36u IPv4 12962958 0t0 TCP some-123-wks:11337->10.1.50.7:39786 (CLOSE_WAIT)
daemon 31540 root 37u IPv4 12962959 0t0 TCP some-123-wks:11337->10.1.50.7:39938 (CLOSE_WAIT)
daemon 31540 root 38u IPv4 12962960 0t0 TCP some-123-wks:11337->10.1.50.7:40130 (CLOSE_WAIT)
daemon 31540 root 39u IPv4 12962961 0t0 TCP some-123-wks:11337->10.1.50.7:40418 (CLOSE_WAIT)
daemon 31540 root 40u IPv4 12962962 0t0 TCP some-123-wks:11337->10.1.50.7:40620 (CLOSE_WAIT)
daemon 31540 root 41u IPv4 12962963 0t0 TCP some-123-wks:11337->10.1.50.7:40816 (CLOSE_WAIT)
daemon 31540 root 42u IPv4 12962969 0t0 TCP some-123-wks:11337->10.1.50.7:41008 (CLOSE_WAIT)
daemon 31540 root 43u IPv4 12962970 0t0 TCP some-123-wks:11337->10.1.50.7:41232 (CLOSE_WAIT)
daemon 31540 root 44u IPv4 12962971 0t0 TCP some-123-wks:11337->10.1.50.7:41490 (CLOSE_WAIT)
daemon 31540 root 45u IPv4 12962981 0t0 TCP some-123-wks:11337->10.1.50.7:41628 (CLOSE_WAIT)
daemon 31540 root 46u IPv4 12962982 0t0 TCP some-123-wks:11337->10.1.50.7:41888 (CLOSE_WAIT)
daemon 31540 root 47u IPv4 12962983 0t0 TCP some-123-wks:11337->10.1.50.7:42036 (CLOSE_WAIT)
daemon 31540 root 48u IPv4 12962984 0t0 TCP some-123-wks:11337->10.1.50.7:42274 (CLOSE_WAIT)
daemon 31540 root 49u IPv4 12962985 0t0 TCP some-123-wks:11337->10.1.50.7:42454 (CLOSE_WAIT)
daemon 31540 root 50u IPv4 12962986 0t0 TCP some-123-wks:11337->10.1.50.7:42670 (CLOSE_WAIT)
daemon 31540 root 51u IPv4 12962987 0t0 TCP some-123-wks:11337->10.1.50.7:42902 (CLOSE_WAIT)
daemon 31540 root 52u IPv4 12962988 0t0 TCP some-123-wks:11337->10.1.50.7:43128 (CLOSE_WAIT)
daemon 31540 root 53u IPv4 12962989 0t0 TCP some-123-wks:11337->10.1.50.7:43298 (CLOSE_WAIT)
daemon 31540 root 54u IPv4 12962990 0t0 TCP some-123-wks:11337->10.1.50.7:43480 (CLOSE_WAIT)
daemon 31540 root 55u IPv4 12962998 0t0 TCP some-123-wks:11337->10.1.50.7:43724 (CLOSE_WAIT)
daemon 31540 root 56u IPv4 12963008 0t0 TCP some-123-wks:11337->10.1.50.7:44016 (CLOSE_WAIT)
daemon 31540 root 57u IPv4 12963096 0t0 TCP some-123-wks:11337->10.1.50.7:44210 (CLOSE_WAIT)
[...]
Ports seem to be openend with every request but not closed. This leads to an out-of-ports error at some point, thus crashing the daemon.
Now the remaining question is: Why is the daemmon keeping the ports open (CLOSE_WAIT) if netcat is configured wo timeout after 1s (-w 1
)
Citing the Red Hat docs: (SRC)
CLOSE_WAIT - Indicates that the server has received the first FIN signal from the client and the connection is in the process of being closed. This means the socket is waiting for the application to execute close(). A socket can be in CLOSE_WAIT state indefinitely until the application closes it. Faulty scenarios would be like a file descriptor leak: server not executing close() on sockets leading to pile up of CLOSE_WAIT sockets.
And: "your server isn't closing the accepted socket when it reads end-of-stream."
As of my current research: The daemon seems to not close the connections on EOF / FIN
My mitigation in the meantime will be: Restart daemon service at midnight to free all sockets kept open. Though I assume that to solve this issue we would have to look into the code.
please try the dev branch, tried to (blindly, dont have the hardware anymore) fix the open sockets issue.
Thank you @xkonni for your quick response. I tried the dev branch, unfortunately the connections still pile up in CLOSE_WAIT.
Additionally the response seems to have changed as on master branch I am getting 0
and on dev branch I get 2
as response from this command: echo -en "111111160"|nc -w 1 some-123-wks 11337