ivykis-0.41 and syslog-ng-3.9.1 hangs on boot on RHEL7
Closed this issue · 14 comments
Hello,
We use pkgsrc across our platforms to keep some system software at consistent versions. One of these packages is syslog-ng. After preparing the packages for RHEL7, the daemon would start and work fine but would hang on boot. It would write some messages to the log and get stuck waiting. If I reduced the number of services starting, it was less likely to hang. When I attach strace to the process it starts working again (apparently I interrupt the epoll that it's stuck on).
I tried older versions and found the problem came up between 0.39 and 0.40. I did a bisect and found it was introduced by 56397c0. Is this a known issue?
I will stick to 0.39 for now.
Let me know if there is any other information I should provide.
Hi!
Thank you for the report. The commit you referenced adds timerfd() support, but I'm not aware of any issues with that code. To clarify, does it only ever hang on boot?
Also, to double-check this, could you try running syslog-ng with a new (affected) version of ivykis but with the IV_EXCLUDE_POLL_METHOD environment variable set to "epoll-timerfd" ?
I haven't been able to get syslog-ng stuck other than on boot. And even then, there's still a chance it works, so it's a mediocre test case. :-/
I reinstalled the affected version, verified it would still hang on boot, then added the environment variable you mentioned. I haven't been able to get it to hang yet after a dozen reboots (usually only takes one or two).
Thank you very much for these tests -- that seems to confirm that the timerfd() support code is indeed the problem.
I went over this commit a few times, but I don't see where the issue is. :( This means that your test case is the best shot we have at reproducing this for now. If I send you a test patch adding some printf() calls, would you be able to test that out?
Sure, happy to help.
Could you try the attached patch? It adds a bunch of debug printf()ing to stderr, which will hopefully catch what's going on. It can produce a lot of output, so you might have to redirect stderr somewhere, but you said that it happens upon boot, so hopefully it won't generate too much output before the bug hits. Thanks in advance! And sorry for the inconvenience.
The systemd journal caught all the output, so this was fairly easy to collect. Let me know if anything seems missing.
-- Logs begin at Tue 2017-04-25 13:29:52 PDT, end at Tue 2017-04-25 13:36:25 PDT. --
Apr 25 13:29:54 rhel7 systemd[1]: Starting System Logger daemon...
Apr 25 13:29:54 rhel7 systemd[1]: Started System Logger daemon.
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 0 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 29998 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 0 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 29997 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 29997 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 29997 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: setting poll timeout timer to 33.886245271, curtime (cached) is 3.890493518
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 10000 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: revalidating time, to 3.901675673
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 0 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: revalidating time, to 3.901802300
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 0 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: revalidating time, to 3.901925819
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 0 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: revalidating time, to 3.902121460
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 0 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: setting poll timeout timer to 0.0, curtime (cached) is 3.902121460
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: revalidating time, to 3.903202307
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 0 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: revalidating time, to 3.903284657
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 0 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: revalidating time, to 3.903368432
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 0 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: revalidating time, to 3.903460955
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for 0 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:29:54 rhel7 syslog-ng[681]: setting poll timeout timer to 0.0, curtime (cached) is 3.903460955
Apr 25 13:29:54 rhel7 syslog-ng[681]: polling for -1 msec
Apr 25 13:30:24 rhel7 syslog-ng[681]: polling for 30000 msec
Apr 25 13:30:54 rhel7 syslog-ng[681]: polling for 30000 msec
Apr 25 13:31:24 rhel7 syslog-ng[681]: polling for 30000 msec
Apr 25 13:31:54 rhel7 syslog-ng[681]: polling for 30000 msec
Apr 25 13:32:24 rhel7 syslog-ng[681]: polling for 30000 msec
Apr 25 13:32:54 rhel7 syslog-ng[681]: polling for 30000 msec
Apr 25 13:33:24 rhel7 syslog-ng[681]: polling for 30000 msec
Apr 25 13:33:54 rhel7 syslog-ng[681]: polling for 30000 msec
Apr 25 13:34:24 rhel7 syslog-ng[681]: polling for 29796 msec
Apr 25 13:34:54 rhel7 syslog-ng[681]: polling for 175 msec
Apr 25 13:34:54 rhel7 syslog-ng[681]: polling for 30000 msec
Apr 25 13:35:25 rhel7 syslog-ng[681]: polling for 30000 msec
Apr 25 13:35:55 rhel7 syslog-ng[681]: polling for 30000 msec
Apr 25 13:36:25 rhel7 syslog-ng[681]: polling for 30000 msec
Hmmmmmmmmmmm.
Can you try if this patch makes the hangs go away?
diff --git a/src/iv_fd_epoll.c b/src/iv_fd_epoll.c
index 5517d19..c6b2285 100644
--- a/src/iv_fd_epoll.c
+++ b/src/iv_fd_epoll.c
@@ -388,6 +388,8 @@ iv_fd_epoll_timerfd_set_poll_timeout(struct iv_state *st,
val.it_interval.tv_sec = 0;
val.it_interval.tv_nsec = 0;
val.it_value = *abs;
-
if (val.it_value.tv_sec == 0 && val.it_value.tv_nsec == 0)
-
val.it_value.tv_nsec = 1; ret = timerfd_settime(st->u.epoll.timer_fd, TFD_TIMER_ABSTIME, &val, NULL);
That patch seems to have gotten mangled, I've attached a copy as well:
That patch seems to fix it, after a dozen or so reboots.
Thanks for testing!
This seems like a bug in syslog-ng -- it's registering a timer with a timeout of tv_sec = 0, tv_nsec = 0. ivykis passes this timeout into timerfd_settime() verbatim, and that could be argued to be a bug in ivykis, as timerfd_settime() interprets setting a timeout to an absolute time of zero as a request to disable the timeout, and as a result, no more timer handlers are ever run.
I'll publish a new ivykis version with a workaround for this, and I'll raise this with the syslog-ng people as well, because they probably should not be registering a timeout at absolute time zero.
syslog-ng bug report here:
http://lists.balabit.hu/pipermail/syslog-ng/2017-April/023704.html
Cool! Thanks for figuring it out.
Thank you for the report and the debug help. :)
I've pushed a fix (and a test case) to: https://github.com/buytenh/ivykis/commits/master
Solaris's port_create(3C) poll method was affected by this issue as well, and I've pushed a fix for that as well, and verified that it fixes the issue on Solaris 11.3.
I'll push out a new ivykis release in a few days.
This is fixed in ivykis 0.42.