nagios 4.4.2 reload breaks livestatus module
Hi Team!
After upgrading from 4.3.2 to 4.4.2 I've hit a bug when reloading Nagios Core:
mk-livestatus socket stops responding.
How to reproduce
livesocket="/usr/local/nagios/var/rw/live"
unixcat="$(which unixcat)"
echo "GET status"|$unixcat $livesocket
#works as expected
/etc/init.d/nagios reload
echo "GET status"|$unixcat $livesocket
#command stalled
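Since the second query hangs indefinitely, it can help to put a deadline on it when scripting the reproduction. This is only a sketch: it assumes `timeout` from GNU coreutils is available, and the function name `livestatus_alive`, the `UNIXCAT` override variable and the 5 second default are made up for illustration.

```shell
#!/bin/sh
# Query a Livestatus socket, but give up after a deadline instead of hanging.
# Hypothetical helper: UNIXCAT and the 5 second default are assumptions.
livestatus_alive() {
    socket="$1"
    deadline="${2:-5}"
    # timeout(1) kills the query and returns non-zero if the deadline expires
    if echo "GET status" | timeout "$deadline" "${UNIXCAT:-unixcat}" "$socket" >/dev/null 2>&1; then
        echo "responding"
    else
        echo "not responding"
        return 1
    fi
}
```

Calling `livestatus_alive /usr/local/nagios/var/rw/live` before and after `/etc/init.d/nagios reload` should then print "responding" and "not responding" on an affected 4.4.2 build.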
Debug
I think this happens because, since 0e1b0f1, cleanup() has been removed from the shutdown/restart path, so neb_unload_all_modules() is no longer called on reload (kill -HUP or the RESTART_PROGRAM command),
while neb_load_all_modules() is still called in the main loop, so the modules are loaded again on every reload.
Workaround
So I've made a patch to get livestatus working after service nagios reload:
--- a/base/nagios.c 2018-08-24 01:30:00.000000000 +0300
+++ b/base/nagios.c 2018-08-27 23:56:33.000000000 +0300
@@ -866,6 +866,11 @@
broker_program_state(NEBTYPE_PROCESS_SHUTDOWN, NEBFLAG_USER_INITIATED, NEBATTR_SHUTDOWN_NORMAL, NULL);
else if(sigrestart == TRUE)
broker_program_state(NEBTYPE_PROCESS_RESTART, NEBFLAG_USER_INITIATED, NEBATTR_RESTART_NORMAL, NULL);
+
+ neb_free_callback_list();
+ neb_unload_all_modules(NEBMODULE_FORCE_UNLOAD, (sigshutdown == TRUE) ? NEBMODULE_NEB_SHUTDOWN : NEBMODULE_NEB_RESTART);
+ neb_free_module_list();
+ neb_deinit_modules();
#endif
/* save service and host state information */
@hedenface please take a look. I'm not sure this is the correct solution, but I believe modules should not be loaded twice on reload.
4.3.2 log before upgrade, no bugs
[1535396308] Caught SIGHUP, restarting...
[1535396309] Event broker module 'NERD' deinitialized successfully.
[1535396309] livestatus: Socket thread has terminated
[1535396310] Event broker module '/usr/local/lib/mk-livestatus/livestatus.o' deinitialized successfully.
[1535396310] Nagios 4.3.2 starting... (PID=28611)
[1535396310] Local time is Mon Aug 27 21:58:30 MSK 2018
[1535396310] LOG VERSION: 2.0
[1535396310] qh: Socket '/usr/local/nagios/var/rw/nagios.qh' successfully initialized
[1535396310] qh: core query handler registered
[1535396310] nerd: Channel hostchecks registered successfully
[1535396310] nerd: Channel servicechecks registered successfully
[1535396310] nerd: Channel opathchecks registered successfully
[1535396310] nerd: Fully initialized and ready to rock!
[1535396310] wproc: Successfully registered manager as @wproc with query handler
[1535396310] wproc: Registry request: name=Core Worker 1658;pid=1658
[1535396310] wproc: Registry request: name=Core Worker 1659;pid=1659
[1535396310] wproc: Registry request: name=Core Worker 1660;pid=1660
[1535396310] livestatus: Livestatus 1.2.4 by Mathias Kettner. Socket: '/usr/local/nagios/var/rw/live'
[1535396310] livestatus: Please visit us at http://mathias-kettner.de/
[1535396310] livestatus: Hint: please try out OMD - the Open Monitoring Distribution
[1535396310] livestatus: Please visit OMD at http://omdistro.org
[1535396310] livestatus: Opened UNIX socket /usr/local/nagios/var/rw/live
[1535396310] livestatus: Your event_broker_options are sufficient for livestatus..
[1535396310] livestatus: Warning: environment_macros are enabled. This might decrease the overall nagios performance
[1535396310] livestatus: Finished initialization. Further log messages go to /usr/local/nagios/var/livestatus.log
[1535396310] Event broker module '/usr/local/lib/mk-livestatus/livestatus.o' initialized successfully.
[1535396310] TIMEPERIOD TRANSITION: 24x7;-1;1
4.4.2 log before patch, bug with livestatus not responding after nagios reload
[1535383971] Caught SIGHUP, restarting...
[1535383972] Nagios 4.4.2 starting... (PID=5378)
[1535383972] Local time is Mon Aug 27 18:32:52 MSK 2018
[1535383972] LOG VERSION: 2.0
[1535383972] qh: Socket '/usr/local/nagios/var/rw/nagios.qh' successfully initialized
[1535383972] qh: core query handler registered
[1535383972] qh: echo service query handler registered
[1535383972] qh: help for the query handler registered
[1535383972] wproc: Successfully registered manager as @wproc with query handler
[1535383972] wproc: Registry request: name=Core Worker 14868;pid=14868
[1535383972] wproc: Registry request: name=Core Worker 14869;pid=14869
[1535383972] wproc: Registry request: name=Core Worker 14870;pid=14870
[1535383972] livestatus: Livestatus 1.2.4 by Mathias Kettner. Socket: '/usr/local/nagios/var/rw/live'
[1535383972] livestatus: Please visit us at http://mathias-kettner.de/
[1535383972] livestatus: Hint: please try out OMD - the Open Monitoring Distribution
[1535383972] livestatus: Please visit OMD at http://omdistro.org
[1535383972] livestatus: Removed old left over socket file /usr/local/nagios/var/rw/live
[1535383972] livestatus: Opened UNIX socket /usr/local/nagios/var/rw/live
[1535383972] livestatus: Your event_broker_options are sufficient for livestatus..
[1535383972] livestatus: Warning: environment_macros are enabled. This might decrease the overall nagios performance
[1535383972] livestatus: Finished initialization. Further log messages go to /usr/local/nagios/var/livestatus.log
[1535383972] Event broker module '/usr/local/lib/mk-livestatus/livestatus.o' initialized successfully.
[1535383973] TIMEPERIOD TRANSITION: 24x7;-1;1
4.4.2 log after patch, fixed
[1535403536] Caught SIGHUP, restarting...
[1535403537] livestatus: Socket thread has terminated
[1535403537] Event broker module '/usr/local/lib/mk-livestatus/livestatus.o' deinitialized successfully.
[1535403537] Nagios 4.4.2 starting... (PID=26668)
[1535403537] Local time is Mon Aug 27 23:58:57 MSK 2018
[1535403537] LOG VERSION: 2.0
[1535403537] qh: Socket '/usr/local/nagios/var/rw/nagios.qh' successfully initialized
[1535403537] qh: core query handler registered
[1535403537] qh: echo service query handler registered
[1535403537] qh: help for the query handler registered
[1535403537] wproc: Successfully registered manager as @wproc with query handler
[1535403537] wproc: Registry request: name=Core Worker 29707;pid=29707
[1535403537] wproc: Registry request: name=Core Worker 29708;pid=29708
[1535403537] wproc: Registry request: name=Core Worker 29709;pid=29709
[1535403537] livestatus: Livestatus 1.2.4 by Mathias Kettner. Socket: '/usr/local/nagios/var/rw/live'
[1535403537] livestatus: Please visit us at http://mathias-kettner.de/
[1535403537] livestatus: Hint: please try out OMD - the Open Monitoring Distribution
[1535403537] livestatus: Please visit OMD at http://omdistro.org
[1535403537] livestatus: Warning: environment_macros are enabled. This might decrease the overall nagios performance
[1535403537] livestatus: Finished initialization. Further log messages go to /usr/local/nagios/var/livestatus.log
[1535403537] Event broker module '/usr/local/lib/mk-livestatus/livestatus.o' initialized successfully.
[1535403539] TIMEPERIOD TRANSITION: 24x7;-1;1
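The three logs above differ in one telling way: whether an "Event broker module ... deinitialized successfully" line appears between "Caught SIGHUP, restarting..." and the next "starting..." line. A rough way to check that for the most recent reload is sketched below. The function name `reload_unloaded_modules` is hypothetical, and the patterns are taken only from the log lines quoted in this issue, so other versions may word them differently.

```shell
#!/bin/sh
# Report whether the most recent reload in a Nagios log unloaded its broker
# modules. Hypothetical helper; patterns are based on the logs quoted above.
reload_unloaded_modules() {
    awk '
        # A reload begins: reset the per-reload flag
        /Caught SIGHUP, restarting/ { in_reload = 1; deinit = 0; seen = 1; next }
        # A module was deinitialized before the core started again
        in_reload && /Event broker module .* deinitialized successfully/ { deinit = 1 }
        # The core is starting again: remember the result for this reload
        in_reload && /Nagios .* starting\.\.\. \(PID=/ { in_reload = 0; last = deinit }
        END {
            if (!seen) { print "no reload found"; exit 2 }
            print (last ? "modules unloaded" : "modules NOT unloaded")
            exit (last ? 0 : 1)
        }
    ' "$1"
}
```

Run it against the main Nagios log, for example `reload_unloaded_modules /usr/local/nagios/var/nagios.log` (that path is an assumption, adjust it to your log_file setting).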
Nagios compiled with ./configure --enable-event-broker --with-iobroker=epoll
on debian 2.6.26-686.
__
BR,
Alexey
Is there a solution, or do we have to patch nagios.c ourselves? I moved from 4.1.1 to 4.4.2 and I'm seeing the event broker problem as well as duplicate hostnames in the CGI maps... :-(
Any news?
FYI
Here is what I got after sending HUP to nagios-4.4.2 with the proposed patch (livestatus-1.4.0p31, mod_gearman-3.0.6):
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/nagios -d /etc/nagios/nagios.cfg'.
Program terminated with signal 11, Segmentation fault.
#0 pthread_join (threadid=140464895325952, thread_return=0x0) at pthread_join.c:47
47 if (INVALID_NOT_TERMINATED_TD_P (pd))
Missing separate debuginfos, use: debuginfo-install gearmand-0.33-6.x86_64
(gdb) bt
#0 pthread_join (threadid=140464895325952, thread_return=0x0) at pthread_join.c:47
#1 0x00007fc08a01b433 in terminate_threads() () from /usr/lib64/check_mk/livestatus.o
#2 0x00007fc08a02189e in nebmodule_deinit () from /usr/lib64/check_mk/livestatus.o
#3 0x0000000000416958 in neb_unload_module ()
#4 0x0000000100000002 in ?? ()
#5 0x0000000002059e60 in ?? ()
#6 0x00007fc08a0216f0 in ?? () from /usr/lib64/check_mk/livestatus.o
#7 0x0000000000000000 in ?? ()
Hi, @amuhametov!
How did you apply the patch?
Can you cd to the Nagios source directory and show the output of this command:
grep -A6 NEBTYPE_PROCESS_RESTART base/nagios.c
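The same check can be scripted so that it gives a yes/no answer. This is a minimal sketch that assumes the patch from this issue was applied as posted, with the unload call within a few lines of the NEBTYPE_PROCESS_RESTART broker call. The function name `patch_applied` is hypothetical.

```shell
#!/bin/sh
# Check whether base/nagios.c unloads NEB modules right after the
# restart broker call. Hypothetical helper for the patch in this issue.
patch_applied() {
    src="${1:-base/nagios.c}"
    # Look at the 6 lines following NEBTYPE_PROCESS_RESTART for the unload call
    if grep -A6 'NEBTYPE_PROCESS_RESTART' "$src" | grep -q 'neb_unload_all_modules'; then
        echo "patched"
    else
        echo "unpatched"
        return 1
    fi
}
```

Run `patch_applied` from the top of the source tree, or pass the path to nagios.c as the first argument.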
I've compiled standalone mk-livestatus 1.5.0p7 and 1.4.0p31 with ./configure --with-nagios4
and was not able to reproduce the SIGSEGV on reload:
[1542675331] Caught SIGHUP, restarting...
[1542675331] livestatus: deinitializing
[1542675331] livestatus: waiting for main to terminate...
[1542675332] livestatus: waiting for client threads to terminate...
[1542675332] livestatus: could not join thread main
[1542675332] livestatus: main thread + 10 client threads have finished
[1542675332] Event broker module '/usr/local/lib/mk-livestatus/livestatus.o' deinitialized successfully.
[1542675332] Nagios 4.4.2 starting... (PID=31047)
[1542675332] Local time is Tue Nov 20 03:55:32 MSK 2018
[1542675332] LOG VERSION: 2.0
[1542675332] qh: Socket '/usr/local/nagios/var/rw/nagios.qh' successfully initialized
[1542675332] qh: core query handler registered
[1542675332] qh: echo service query handler registered
[1542675332] qh: help for the query handler registered
[1542675332] wproc: Successfully registered manager as @wproc with query handler
[1542675332] wproc: Registry request: name=Core Worker 2767;pid=2767
[1542675332] wproc: Registry request: name=Core Worker 2769;pid=2769
[1542675332] wproc: Registry request: name=Core Worker 2768;pid=2768
[1542675332] livestatus: fl_socket_path=[/usr/local/nagios/var/rw/live], fl_mkeventd_socket_path=[/usr/local/nagios/var/rw/mkeventd/status]
[1542675332] livestatus: Livestatus 1.4.0p31 by Mathias Kettner. Socket: '/usr/local/nagios/var/rw/live'
[1542675332] livestatus: Please visit us at http://mathias-kettner.de/
[1542675332] livestatus: Hint: Please try out OMD - the Open Monitoring Distribution
[1542675332] livestatus: Please visit OMD at http://omdistro.org
[1542675332] livestatus: opened UNIX socket at /usr/local/nagios/var/rw/live
[1542675332] livestatus: your event_broker_options are sufficient for livestatus..
[1542675332] livestatus: environment_macros are enabled, this might decrease the overall nagios performance
[1542675332] livestatus: finished initialization, further log messages go to /usr/local/nagios/var/livestatus.log
[1542675332] Event broker module '/usr/local/lib/mk-livestatus/livestatus.o' initialized successfully.
[1542675332] livestatus: TIMEPERIOD TRANSITION: 24x7;-1;1
[1542675332] livestatus: TIMEPERIOD TRANSITION: none;-1;0
[1542675332] livestatus: starting main thread and 10 client threads
[1542675332] livestatus: default stack size is 8388608
[1542675332] livestatus: setting thread stack size to 1048576
@dvoryanchikov I can't reproduce it anymore since all nagios-related packages have been reinstalled. It now seems to work fine, except that sometimes I get this "could not join thread main" message:
[1542874295] Event broker module '/usr/lib64/check_mk/livestatus.o' deinitialized successfully.
[1542874300] livestatus: setting maximum response size to 419430400 bytes (400 MB)
[1542874300] livestatus: setting number of client threads to 100
[1542874300] livestatus: setting size of thread stacks to 4194304
[1542874300] livestatus: fl_socket_path=[/var/spool/nagios/live], fl_mkeventd_socket_path=[/var/spool/nagios/mkeventd/status]
[1542874300] livestatus: Livestatus 1.4.0p31 by Mathias Kettner. Socket: '/var/spool/nagios/live'
[1542874300] livestatus: Please visit us at http://mathias-kettner.de/
[1542874300] livestatus: Hint: Please try out OMD - the Open Monitoring Distribution
[1542874300] livestatus: Please visit OMD at http://omdistro.org
[1542874300] livestatus: opened UNIX socket at /var/spool/nagios/live
[1542874300] livestatus: your event_broker_options are sufficient for livestatus..
[1542874300] livestatus: finished initialization, further log messages go to /var/log/nagios/livestatus.log
[1542874300] Event broker module '/usr/lib64/check_mk/livestatus.o' initialized successfully.
[1542874320] livestatus: TIMEPERIOD TRANSITION: 24x7;-1;1
[1542874320] livestatus: TIMEPERIOD TRANSITION: cluster;-1;1
[1542874320] livestatus: TIMEPERIOD TRANSITION: holidays;-1;0
[1542874320] livestatus: TIMEPERIOD TRANSITION: night;-1;0
[1542874320] livestatus: TIMEPERIOD TRANSITION: not_at_night;-1;1
[1542874320] livestatus: TIMEPERIOD TRANSITION: oscar-searcherd;-1;1
[1542874320] livestatus: TIMEPERIOD TRANSITION: silent;-1;0
[1542874320] livestatus: TIMEPERIOD TRANSITION: work_hours;-1;1
[1542874320] livestatus: starting main thread and 100 client threads
[1542874320] livestatus: default stack size is 8388608
[1542874320] livestatus: setting thread stack size to 4194304
[1542874654] livestatus: deinitializing
[1542874654] livestatus: waiting for main to terminate...
[1542874655] livestatus: waiting for client threads to terminate...
[1542874655] livestatus: could not join thread main
[1542874655] livestatus: main thread + 100 client threads have finished
[1542874655] Event broker module '/usr/lib64/check_mk/livestatus.o' deinitialized successfully.
The original issue here (the old NEB module not being cleaned up) should be fixed by f082ab2, which is in the maint branch. Can you test this against maint, @dvoryanchikov?
Tested this internally and it seems to be working. It essentially does what Core 4.3.x did before the changes to cleanup() in 4.4.x.
This should be fixed in 4.4.3