codership/mysql-wsrep

Galera Cluster crashing after mysqld got signal 11

mhunger opened this issue · 2 comments

We recently set up a new 3-node Galera cluster. We are using the following config:

wsrep_osu_method	TOI
wsrep_sr_store	table
wsrep_auto_increment_control	ON
wsrep_causal_reads	OFF
wsrep_certification_rules	strict
wsrep_certify_nonpk	ON
wsrep_cluster_address	gcomm://xxxx.crewmeister.com:12011,xxxx.crewmeister.com:12011,xxxx.crewmeister.com:12011
wsrep_cluster_name	db-cluster-prod-pod02
wsrep_convert_lock_to_trx	OFF
wsrep_data_home_dir	/var/lib/mysql/
wsrep_dbug_option	
wsrep_debug	NONE
wsrep_desync	OFF
wsrep_dirty_reads	OFF
wsrep_drupal_282555_workaround	OFF
wsrep_forced_binlog_format	NONE
wsrep_gtid_domain_id	0
wsrep_gtid_mode	OFF
wsrep_ignore_apply_errors	7
wsrep_load_data_splitting	OFF
wsrep_log_conflicts	OFF
wsrep_max_ws_rows	0
wsrep_max_ws_size	2147483647
wsrep_mysql_replication_bundle	0
wsrep_node_address	xxxx.crewmeister.com:12011
wsrep_node_incoming_address	AUTO
wsrep_node_name	xxxx.crewmeister.com
wsrep_notify_cmd	
wsrep_on	ON
wsrep_patch_version	wsrep_26.22
wsrep_provider	/usr/lib/libgalera_smm.so
wsrep_provider_options	base_dir = /var/lib/mysql/; base_host = xxxx.crewmeister.com; base_port = 12011; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.causal_keepalive_period = PT1S; evs.debug_log_mask = 0x1; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.info_log_mask = 0; evs.install_timeout = PT7.5S; evs.join_retrans_period = PT1S; evs.keepalive_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.use_aggregate = true; evs.user_send_window = 2; evs.version = 1; evs.view_forget_timeout = P1D; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://0.0.0.0:12011; gmcast.mcast_addr = ; gmcast.mcast_ttl = 1; gmcast.peer_timeout = PT3S; gmcast.segment = 0; gmcast.time_wait = PT5S; gmcast.version = 0; ist.recv_addr = xxxx.crewmeister.com:12012; ist.recv_bind = 172.23.0.2:12012; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.linger = PT20S; pc.npvo = false; pc.recovery = true; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = PT30S; pc.weight = 1; protonet.backend = asio; protonet.version = 0; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.max_ws_size = 2147483647; repl.proto_max = 10; socket.checksum = 2; socket.recv_buf_size = auto; socket.send_buf_size = auto; 
wsrep_recover	OFF
wsrep_reject_queries	NONE
wsrep_replicate_myisam	OFF
wsrep_restart_slave	OFF
wsrep_retry_autocommit	1
wsrep_slave_fk_checks	ON
wsrep_slave_uk_checks	OFF
wsrep_slave_threads	1
wsrep_sst_auth	********
wsrep_sst_donor	xxxx.crewmeister.com
wsrep_sst_donor_rejects_queries	OFF
wsrep_sst_method	mariabackup
wsrep_sst_receive_address	xxxx.crewmeister.com:12013
wsrep_start_position	00000000-0000-0000-0000-000000000000:-1
wsrep_strict_ddl	OFF
wsrep_sync_wait	0
wsrep_trx_fragment_size	0
wsrep_trx_fragment_unit	bytes
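
The `wsrep_provider_options` value above is one long semicolon-separated string, which makes individual settings hard to compare between nodes. A minimal sketch of splitting it into key/value pairs follows; the function name `parse_provider_options` is hypothetical, and the sample is a shortened excerpt of the value shown above.

```python
def parse_provider_options(raw: str) -> dict[str, str]:
    """Split a wsrep_provider_options string into a {key: value} dict."""
    options = {}
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the trailing ';'
        # Some options have an empty value, e.g. "gcomm.thread_prio = "
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    return options


# Shortened excerpt of the provider options listed in this issue.
sample = (
    "evs.inactive_timeout = PT15S; evs.suspect_timeout = PT5S; "
    "gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_limit = 16; "
)
opts = parse_provider_options(sample)
for name in sorted(opts):
    print(f"{name} = {opts[name]!r}")
```

Running the same function on the string from each node and comparing the resulting dicts is one way to spot per-node differences.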

The cluster crashes every day, and not during peak load times. According to the logs, the crash seems to be caused by the following error:

210715 14:08:51 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.5.8-MariaDB-1:10.5.8+maria~focal-log
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=64
max_threads=302
thread_count=66
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 795868 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7efafc05aea8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7efb836d1d58 thread_stack 0x49000
mysqld(my_print_stacktrace+0x32)[0x559e6b2c9692]
Printing to addr2line failed
mysqld(handle_fatal_signal+0x485)[0x559e6ad20e45]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7efcbf6563c0]
mysqld(_ZNK19rpl_sql_thread_info22cached_charset_compareEPc+0x6)[0x559e6ac0ea76]
mysqld(_ZN15Query_log_event14do_apply_eventEP14rpl_group_infoPKcj+0xa04)[0x559e6ae43dd4]
mysqld(_Z18wsrep_apply_eventsP3THDP14Relay_log_infoPKvm+0x1e9)[0x559e6aff2e79]
mysqld(_ZN22Wsrep_replayer_service15apply_write_setERKN5wsrep7ws_metaERKNS0_12const_bufferERNS0_14mutable_bufferE+0xac)[0x559e6afdaa4c]
mysqld(+0xf94270)[0x559e6b34c270]
mysqld(_ZN5wsrep12server_state8on_applyERNS_21high_priority_serviceERKNS_9ws_handleERKNS_7ws_metaERKNS_12const_bufferE+0xc1)[0x559e6b34d2c1]
mysqld(+0xfa4a3c)[0x559e6b35ca3c]
/usr/lib/libgalera_smm.so(+0x1b6de5)[0x7efcbedfade5]
/usr/lib/libgalera_smm.so(+0x202be8)[0x7efcbee46be8]
/usr/lib/libgalera_smm.so(+0x21d563)[0x7efcbee61563]
mysqld(_ZN5wsrep18wsrep_provider_v266replayERKNS_9ws_handleEPNS_21high_priority_serviceE+0x2d)[0x559e6b35d27d]
mysqld(_ZN20Wsrep_client_service6replayEv+0x102)[0x559e6afda562]
mysqld(_ZN5wsrep11transaction6replayERNS_11unique_lockINS_5mutexEEE+0x8a)[0x559e6b35720a]
mysqld(_ZN5wsrep11transaction15after_statementEv+0xe7)[0x559e6b359557]
mysqld(_ZN5wsrep12client_state15after_statementEv+0xaf)[0x559e6b3436df]
mysqld(+0x74e93c)[0x559e6ab0693c]
mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x2e54)[0x559e6ab144f4]
mysqld(_Z10do_commandP3THD+0x116)[0x559e6ab14d76]
mysqld(_Z24do_handle_one_connectionP7CONNECTb+0x411)[0x559e6ac19131]
mysqld(handle_one_connection+0x5d)[0x559e6ac195ad]
mysqld(+0xbbe266)[0x559e6af76266]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7efcbf64a609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7efcbf239293]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7efafc047f12): SAVEPOINT `44dz61a6f62c947aaf3abz00a155d359`

Connection ID (thread ID): 1
Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off

The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            1048576              1048576              files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       125521               125521               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: core
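
Because `addr2line` failed, the backtrace above only shows raw frames with mangled C++ names. As a rough aid for reading it, the sketch below splits each frame into module, symbol, offset and address. The regular expression and the helper name `parse_frames` are assumptions based on the frame layout seen in this log, not an official format definition.

```python
import re

# Frame layout as it appears in the log: module(symbol+0xOFFSET)[0xADDRESS]
# The symbol part is empty when no symbol could be resolved.
FRAME_RE = re.compile(
    r"^(?P<module>[^(\s]+)"
    r"\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-f]+)\)"
    r"\[(?P<addr>0x[0-9a-f]+)\]$"
)


def parse_frames(lines):
    """Return a list of dicts, one per line that looks like a stack frame."""
    frames = []
    for line in lines:
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.groupdict())
    return frames


# A few lines copied from the backtrace in this issue.
sample = [
    "mysqld(handle_fatal_signal+0x485)[0x559e6ad20e45]",
    "/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7efcbf6563c0]",
    "mysqld(_ZNK19rpl_sql_thread_info22cached_charset_compareEPc+0x6)[0x559e6ac0ea76]",
    "Printing to addr2line failed",
    "/usr/lib/libgalera_smm.so(+0x1b6de5)[0x7efcbedfade5]",
]
frames = parse_frames(sample)
for frame in frames:
    print(frame["module"], frame["symbol"] or "<no symbol>", frame["offset"])
```

The extracted symbols can then be passed to a demangler such as `c++filt`. For example, the first frame after the signal handler, `_ZNK19rpl_sql_thread_info22cached_charset_compareEPc`, corresponds to `rpl_sql_thread_info::cached_charset_compare(char*) const`.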

All system metrics look fine: memory, CPU, and disk space.

What is the binlog_format variable set to? Note that only the ROW format is supported.

It is ROW on all nodes.