naemon/naemon-livestatus

Livestatus query containing sort causes naemon process SIGABRT

alexharpin opened this issue · 9 comments

After upgrading from naemon 1.0.7 and Thruk 2.19 (a long-overdue upgrade on my part), the naemon process terminates with a SIGABRT when accessing status information from Thruk. Some sections display the information as expected, while others report "connection refused" for the livestatus socket (obviously due to the crash above). I tracked the issue down to livestatus being sent a query containing a sort option. I captured the query sent just before the crash: removing the sort options and sending it again (unixcat to the livestatus socket) returned data, while adding the sort options back in crashed the backend process. I am not sure whether this is a livestatus or naemon-core issue.

Current versions

Naemon = 1.0.8
Thruk = 2.22

Query that causes the crash (not limited to this one; any query with a sort seems to be affected)

GET services
Columns: accept_passive_checks acknowledged action_url action_url_expanded active_checks_enabled check_command check_interval check_options check_period check_type checks_enabled comments current_attempt current_notification_number description event_handler event_handler_enabled custom_variable_names custom_variable_values execution_time first_notification_delay flap_detection_enabled groups has_been_checked high_flap_threshold host_acknowledged host_action_url_expanded host_active_checks_enabled host_address host_alias host_checks_enabled host_check_type host_latency host_plugin_output host_perf_data host_current_attempt host_check_command host_comments host_groups host_has_been_checked host_icon_image_expanded host_icon_image_alt host_is_executing host_is_flapping host_name host_notes_url_expanded host_notifications_enabled host_scheduled_downtime_depth host_state host_accept_passive_checks host_last_state_change icon_image icon_image_alt icon_image_expanded is_executing is_flapping last_check last_notification last_state_change latency low_flap_threshold max_check_attempts next_check notes notes_expanded notes_url notes_url_expanded notification_interval notification_period notifications_enabled obsess_over_service percent_state_change perf_data plugin_output process_performance_data retry_interval scheduled_downtime_depth state state_type modified_attributes_list last_time_critical last_time_ok last_time_unknown last_time_warning display_name host_display_name host_custom_variable_names host_custom_variable_values in_check_period in_notification_period host_parents long_plugin_output
Filter: host_name = ..*
Sort: host_name asc
Sort: description asc

OutputFormat: wrapped_json
ResponseHeader: fixed16

The same query without the sort(s) works fine.

GET services
Columns: accept_passive_checks acknowledged action_url action_url_expanded active_checks_enabled check_command check_interval check_options check_period check_type checks_enabled comments current_attempt current_notification_number description event_handler event_handler_enabled custom_variable_names custom_variable_values execution_time first_notification_delay flap_detection_enabled groups has_been_checked high_flap_threshold host_acknowledged host_action_url_expanded host_active_checks_enabled host_address host_alias host_checks_enabled host_check_type host_latency host_plugin_output host_perf_data host_current_attempt host_check_command host_comments host_groups host_has_been_checked host_icon_image_expanded host_icon_image_alt host_is_executing host_is_flapping host_name host_notes_url_expanded host_notifications_enabled host_scheduled_downtime_depth host_state host_accept_passive_checks host_last_state_change icon_image icon_image_alt icon_image_expanded is_executing is_flapping last_check last_notification last_state_change latency low_flap_threshold max_check_attempts next_check notes notes_expanded notes_url notes_url_expanded notification_interval notification_period notifications_enabled obsess_over_service percent_state_change perf_data plugin_output process_performance_data retry_interval scheduled_downtime_depth state state_type modified_attributes_list last_time_critical last_time_ok last_time_unknown last_time_warning display_name host_display_name host_custom_variable_names host_custom_variable_values in_check_period in_notification_period host_parents long_plugin_output
Filter: host_name = ..*
OutputFormat: wrapped_json
ResponseHeader: fixed16
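
For anyone who wants to repeat the comparison above without unixcat, here is a minimal Python sketch of the same test: send a query once with its `Sort:` headers and once without. The helper names (`strip_sort_headers`, `parse_fixed16`, `send_query`) and the socket path in the usage note are assumptions made for illustration, not anything taken from Thruk or naemon-livestatus. The header parsing assumes the usual `fixed16` layout (3-digit status code, a space, the body length padded to 11 characters, then a newline).

```python
import socket


def strip_sort_headers(query: str) -> str:
    """Return the query without its 'Sort:' headers, ending in a blank line."""
    kept = [
        line
        for line in query.splitlines()
        if line.strip() and not line.startswith("Sort:")
    ]
    # A blank line terminates a livestatus request.
    return "\n".join(kept) + "\n\n"


def parse_fixed16(header: bytes):
    """Split a 16-byte fixed16 response header into (status, body_length)."""
    if len(header) != 16:
        raise ValueError("fixed16 header must be exactly 16 bytes")
    text = header.decode("ascii")
    return int(text[:3]), int(text[3:15])


def send_query(socket_path: str, query: str):
    """Send one query to a livestatus unix socket; return (status, body)."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal that the request is complete
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    raw = b"".join(chunks)
    status, length = parse_fixed16(raw[:16])
    return status, raw[16:16 + length]
```

Usage would look something like `send_query("/var/cache/naemon/live", strip_sort_headers(captured_query))`, where the socket path is a guess at a typical install and `captured_query` is the text captured from Thruk. Only run the unmodified (sorted) query against a test instance, since on an affected setup it is expected to take the naemon process down.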

I got a similar issue:
"View History For This Host" in Thruk with a larger time frame crashes Naemon. 1 week was enough for a server with many changes in the logfile. The largest json result without crash was 14855 bytes in size.

When the "Sort:" line is removed from the query it works.

Naemon generates a segmentation fault here:

int chars_left = strlen(r);

The pointer address "0x6d6f5a006e616c2e" is a part of the query answer: ".lan\x00Zom"
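
To show how the pointer value maps onto that text, here is a short Python check. It assumes a little-endian machine (e.g. x86-64), so the lowest byte of the 64-bit value is the first byte in memory; the variable names are only for illustration.

```python
# The bogus pointer value from the segfault, as a 64-bit integer.
pointer = 0x6D6F5A006E616C2E

# Lay the value out as it would sit in memory on a little-endian machine
# (assumption: x86-64), lowest byte first.
raw = pointer.to_bytes(8, byteorder="little")

print(raw)  # b'.lan\x00Zom'
```

In other words, the "pointer" is really eight bytes of string data (the end of one NUL-terminated string followed by the start of the next), which suggests the slot that should hold a `char *` is being read from memory that contains result text instead.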

naemon-core 1.0.8
naemon-livestatus 1.0.8
thruk 2.22

backtrace.log
query.log

Can anyone tell me how this is progressing? I am having to tell my users not to run searches, and it is making the platform look kinda rubbish...

sni commented

well, i could remove the sort from the query in thruk till this issue is resolved.

Many thanks for your reply. I don't really know how removing the sort from the query would impact anything else in the system, but if it's an acceptable workaround for everyone then I am happy with it. If you can also tell me how to run traces/debugs for Naemon on Ubuntu 16.04, that would be great, as I can send that information in as well.

It's just that Naemon has built up such a good reputation in the company that I've kinda got protective about keeping it up there. As always, thanks to everyone who makes it such a good product.

sni commented

Should be better soon, i removed the sort header from log livestatus queries in Thruk: sni/Thruk@d9d98ca

i will keep this issue open, since the real cause has not been solved yet.

Could this be related to #73 ?

I just started getting this crash today, while migrating to CentOS 8
This query
GET hosts
Columns: host_name
Stats: host_state = 1
Stats: childs !=
StatsAnd: 2
OutputFormat: json
ResponseHeader: fixed16
ColumnHeaders: on

generates this crash in gdb:
(gdb) where
#0 0x00007ffff6efe70f in raise () from /lib64/libc.so.6
#1 0x00007ffff6ee8b25 in abort () from /lib64/libc.so.6
#2 0x00007ffff502bb48 in std::__replacement_assert(char const*, int, char const*, char const*) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#3 0x00007ffff50686f1 in RowSortedSet::extract() () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#4 0x00007ffff502b56d in Query::finish() () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#5 0x00007ffff502df45 in Store::answerGetRequest(InputBuffer*, OutputBuffer*, char const*) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#6 0x00007ffff502e243 in Store::answerRequest(InputBuffer*, OutputBuffer*) () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#7 0x00007ffff502d5ed in store_answer_request () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#8 0x00007ffff50642b7 in client_thread () from /usr/lib64/naemon/naemon-livestatus/livestatus.so
#9 0x00007ffff66822de in start_thread () from /lib64/libpthread.so.0
#10 0x00007ffff6fc2e83 in clone () from /lib64/libc.so.6

The same versions running on CentOS 6 do not crash with the same queries.
It crashed both with my own compiled naemon daemon and livestatus and with the CentOS 8 RPMs from naemon.org.

The patch in #73 fixed the problem above; I just tested it on CentOS 8.

sni commented

great, so we can close this one as well