naemon/naemon-livestatus

g_tree_foreach: assertion `tree != NULL' failed and Naemon fails with segfault

riton opened this issue · 10 comments

riton commented

We're experiencing crash with the latest version of naemon and thruk when using the /thruk/cgi-bin/showlog.cgi CGI by clicking on the Event Log button in Thruk interface.

Our configuration holds

Checking objects...
        Checked 5212 services.
        Checked 1548 hosts.
        Checked 79 contacts.
        Checked 302 host groups.
        Checked 310 service groups.
        Checked 85 contact groups.
        Checked 124 commands.
        Checked 4 time periods.
        Checked 0 host escalations.
        Checked 0 service escalations.

After investigations, the crash is caused by this LiveStatus query:

GET log                                          
Columns: class time type state host_name service_description plugin_output message options state_type contact_name
Filter: time >= 1540854000
Filter: time <= 1540940400
And: 2
Limit: 100
Sort: time desc
OutputFormat: wrapped_json
ResponseHeader: fixed16

with a log files that has the following number of lines

root@naemon-server ~ $ wc -l /var/log/naemon/archives/*
  1235420 /var/log/naemon/archives/naemon.log-20181005
   593846 /var/log/naemon/archives/naemon.log-20181027
  1468552 /var/log/naemon/archives/naemon.log-20181028
  1151668 /var/log/naemon/archives/naemon.log-20181029
   582285 /var/log/naemon/archives/naemon.log-20181030
   959542 /var/log/naemon/archives/naemon.log-20181031
  5991313 total

When we issue the Livestatus query above, Livestatus outputs messages like

0;0;;;;0;;;;;;;;;;0;;;;;;0;0;0;0;0;;;;;0;0;0;0;;;0;;;;0;0;0.0000000000e+00;0;;;0;0;;;;;;0;0;;;;;;;;0;0.0000000000e+00;;0.0000000000e+00;0;;0;0;0.0000000000e+00;0;;;;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0.0000000000e+00;;0.0000000000e+00;0;0;;;0;0;0;;;;;0.0000000000e+00;;0;0;0;0;0;0;0;0;0;0;0;0;0;0;;0;0.0000000000e+00;;;0;0;0.0000000000e+00;0;;;;0;0.0000000000e+00;0;0;;0;0;0;0.0000000000e+00;0.0000000000e+00;0.0000000000e+00;0;0;0;;;0;;0;0.0000000000e+00;0;;;0;0;;;;;0;0;;;;;;;;;0;0.0000000000e+00;0.0000000000e+00;0;;0;0.0000000000e+00;0;;;;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0.0000000000e+00;;0.0000000000e+00;0;0;;0;0;0;;;;;0.0000000000e+00;;0;0;0;0;0.0000000000e+00;;;0;0;0.0000000000e+00;0;0;0.0000000000e+00;0;0;;739891;[1540921434] g_tree_foreach: assertion 'tree != NULL' failed;assertion 'tree != NULL' failed;;;0;;1540921434;g_tree_foreach: assertion 'tree != NULL' failed

and very often (every time ?) naemon dies.

After applying the patch specified in #29 , it seems the g_tree_foreach: assertion 'tree != NULL' failed still appears but this time, naemon seem to stay alive. We can see the g_tree_foreach: assertion 'tree != NULL' failed messages in Thruk interface like shown in the following screenshot:

naemon_logs

This issue is related to #28 but the proposed patch only fixes the naemon crash problem (which is a good thing 👍 ) but the problem is only mitigated here since the g_tree_foreach: assertion 'tree != NULL' failed always shows up.

Maybe that this issue is also related to #39 (cause the Query also contains a Sort filter), maybe it's not.

sni commented

If you can reproduce the crash, could you create a backtrace with gdb? That would help, since i still cannot reproduce the issue.

riton commented

Here is what I can get using gdb:

root@naemon-host ~ $ rpm -qa '*naemon*'
naemon-devel-1.0.8-0.x86_64
naemon-livestatus-1.0.8-0.x86_64
naemon-core-1.0.8-0.x86_64
naemon-thruk-1.0.8-0.noarch
naemon-1.0.8-0.noarch
libnaemon-1.0.8-0.x86_64
naemon-core-dbg-1.0.8-0.x86_64
naemon@naemon-host$ gdb /usr/bin/naemon-dbg
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
[...]
Reading symbols from /usr/bin/naemon-dbg...Reading symbols from /usr/bin/naemon-dbg...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Missing separate debuginfos, use: debuginfo-install naemon-core-dbg-1.0.8-0.x86_64

(gdb) run /etc/naemon/naemon.cfg
Starting program: /usr/bin/naemon-dbg /etc/naemon/naemon.cfg
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[...]
[New Thread 0x7ffff7f79700 (LWP 8311)]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7f57700 (LWP 8227)]
0x00007ffff7092faf in __strlen_sse42 () from /lib64/libc.so.6
(gdb) backtrace 
#0  0x00007ffff7092faf in __strlen_sse42 () from /lib64/libc.so.6
#1  0x00007ffff5ac8f59 in Query::outputString(char const*) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#2  0x00007ffff5ac8c33 in Query::printRow(void*) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#3  0x00007ffff5acc602 in Query::finish() () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#4  0x00007ffff5acf407 in Store::answerGetRequest(InputBuffer*, OutputBuffer*, char const*) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#5  0x00007ffff5acf72b in Store::answerRequest(InputBuffer*, OutputBuffer*) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#6  0x00007ffff5aceab9 in store_answer_request () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#7  0x00007ffff5b06e93 in client_thread () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#8  0x00007ffff6adfe25 in start_thread () from /lib64/libpthread.so.0
#9  0x00007ffff704e34d in clone () from /lib64/libc.so.6

Here is a trace I got when two simultaneous queries were made to Livestatus (the query is the one I detailed in my previous post).

riton commented

Another one which is slightly different

[...]
[New Thread 0x7ffff7f57700 (LWP 13133)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7f68700 (LWP 13112)]
0x00007ffff5ac5858 in OffsetStringColumn::getValue(void*) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
(gdb) backtrace
#0  0x00007ffff5ac5858 in OffsetStringColumn::getValue(void*) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#1  0x00007ffff5ad0b03 in StringColumn::compare(void*, void*, Query*) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#2  0x00007ffff5b0a172 in RowSortedSet::compare(void*, void*) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#3  0x00007ffff5b0a273 in RowSortedSet::extract() () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#4  0x00007ffff5b0a3c7 in RowSortedSet::insert(void*, int) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#5  0x00007ffff5acc228 in Query::processDataset(void*) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#6  0x00007ffff5afd745 in Logfile::answerQueryReverse(Query*, TableLog*, long, long, unsigned int) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#7  0x00007ffff5b02308 in TableLog::answerQuery(Query*) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#8  0x00007ffff5acf3ff in Store::answerGetRequest(InputBuffer*, OutputBuffer*, char const*) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#9  0x00007ffff5acf72b in Store::answerRequest(InputBuffer*, OutputBuffer*) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#10 0x00007ffff5aceab9 in store_answer_request () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#11 0x00007ffff5b06e93 in client_thread () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#12 0x00007ffff6adfe25 in start_thread () from /lib64/libpthread.so.0
#13 0x00007ffff704e34d in clone () from /lib64/libc.so.6

The exact same requests were sent to Livestatus to trigger the SIGSEGV.

[Thinking this is the same as my issue] I am also getting crashes when looking back over periods of time for events.
Naemon: ERROR: failed to connect - failed to connect to /var/cache/naemon/live: Connection refused.. (/var/cache/naemon/live)

In reference to my above, can someone tell me where the debug commands are for Ubuntu 16.04, and what I need to gather a not collected trace/debugs out of Naemon before. This is a critical system for our company so keen to get it fixed.

sni commented

Should be better soon, i removed the sort header from log livestatus queries in Thruk: sni/Thruk@d9d98ca

i will keep this issue open, since the real cause has not been solved yet.

This #29 fixed the same problem I was having here #28 with event log.

sni commented

#29 does not fix anything, it simply disables the cache.

As far as I remembered debugging this, the LogEntry array gets corrupted after this function is called a second time.
I did not know it would disable cache, but it is a work around for the segfault that stops naemon.

sni commented

i think this has been fixed with #103
please let me know if it still happens.