codership/galera

GCache::RingBuffer initial scan dies at 0.0%

businessbean opened this issue · 3 comments

  • Ubuntu 20.04 plus MariaDB Galera packages from the mariadb.org repo
  • MariaDB 10.5.13 and Galera 26.4.8
  • gcache file in the MariaDB data partition

With ProxySQL in front of the MariaDB cluster routing all write traffic to the first database node, I started a database benchmark.

./dbbench mysql --iter 524288 --threads 16 --conns 8 --host mariadb-g-frontend.database.svc.cluster.local --user root --pass pass

After some time the cluster died, and the first data node cannot restart because the bootstrap fails during the gcache file scan. The disks still have plenty of free space, but the gcache seems to have been corrupted by the high load during the benchmark. I later updated to MariaDB 10.5.17 and Galera 26.4.12, but that version cannot read the gcache file either. I can simply delete the file to bring the (test) cluster up again, but it would be good to be able to validate the gcache before bootstrapping, in order to decide whether deleting it is necessary. It would also be good if MariaDB handled the problem gracefully.
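As a rough illustration of what a pre-bootstrap check could look at, here is a minimal sketch that parses the preamble fields Galera prints in the startup log (Version, UUID, Seqno, Offset, Synced). The function names are hypothetical, and the rule in `needs_full_scan` is an assumption inferred from the log in this report, not taken from the Galera source. It inspects only the preamble text, not the buffer contents, so it cannot detect corruption by itself.

```python
def parse_preamble(text):
    """Parse 'Key: value' lines into a dict with lowercase keys."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields


def needs_full_scan(fields):
    # Assumption: a preamble that is not marked synced, or that has no
    # valid offset, makes Galera rescan the whole ring buffer on startup.
    return fields.get("synced") != "1" or int(fields.get("offset", "-1")) < 0


# Preamble text as printed in the startup log of this report.
sample = """Version: 2
UUID: cfaa8cb8-1f9d-11ed-8d5e-6fd03ac1bd39
Seqno: -1 - -1
Offset: -1
Synced: 0"""

fields = parse_preamble(sample)
print(fields["uuid"], needs_full_scan(fields))
```

Feeding it the preamble read from the actual gcache file would require knowing where and how the preamble is stored, which this sketch deliberately leaves out.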

mariadbd --defaults-file=/opt/mariadb/etc/my.cnf --basedir=/usr --wsrep-new-cluster

2022-08-31  8:57:18 0 [Note] mariadbd (mysqld 10.5.17-MariaDB-1:10.5.17+maria~ubu2004-log) starting as process 27 ...
2022-08-31  8:57:18 0 [Note] WSREP: Loading provider /usr/lib/libgalera_smm.so initial position: 00000000-0000-0000-0000-000000000000:-1
2022-08-31  8:57:18 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
2022-08-31  8:57:18 0 [Note] WSREP: wsrep_load(): Galera 26.4.12(r1eac5b64) by Codership Oy <info@codership.com> loaded successfully.
2022-08-31  8:57:18 0 [Note] WSREP: CRC-32C: using 64-bit x86 acceleration.
2022-08-31  8:57:18 0 [Note] WSREP: /home/buildbot/buildbot/build/galera/src/saved_state.cpp:SavedState():116: Found saved state: cfaa8cb8-1f9d-11ed-8d5e-6fd03ac1bd39:1204230, safe_to_bootstrap: 1
2022-08-31  8:57:18 0 [Note] WSREP: /home/buildbot/buildbot/build/gcache/src/gcache_rb_store.cpp:open_preamble():652: GCache DEBUG: opened preamble:
Version: 2
UUID: cfaa8cb8-1f9d-11ed-8d5e-6fd03ac1bd39
Seqno: -1 - -1
Offset: -1
Synced: 0
2022-08-31  8:57:18 0 [Note] WSREP: /home/buildbot/buildbot/build/gcache/src/gcache_rb_store.cpp:open_preamble():663: Recovering GCache ring buffer: version: 2, UUID: cfaa8cb8-1f9d-11ed-8d5e-6fd03ac1bd39, offset: -1
2022-08-31  8:57:18 0 [Note] WSREP: /home/buildbot/buildbot/build/galerautils/src/gu_progress.hpp:log():52: GCache::RingBuffer initial scan...  0.0% (        0/134217752 bytes) complete.
Killed

echo $?
137
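For readers unfamiliar with the shell convention: an exit status above 128 usually means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9, SIGKILL. That matches the "Killed" line and suggests the process was killed from outside (for example by the kernel OOM killer or a container memory limit, which is a guess and not confirmed by the log) rather than aborting on its own. A small sketch of the decoding, with a hypothetical helper name:

```python
import signal


def describe_exit_status(status):
    """Translate a shell exit status into a human-readable description."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with code {status}"


print(describe_exit_status(137))
```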

grastate.dat:

# GALERA saved state
version: 2.1
uuid:    cfaa8cb8-1f9d-11ed-8d5e-6fd03ac1bd39
seqno:   1204230
safe_to_bootstrap: 1

my.cnf:

[mysqld]
# folders
plugin-dir=/usr/lib/mysql/plugin
datadir=/opt/mariadb/data
tmpdir=/opt/mariadb/tmp
ignore-db-dirs=lost+found
ignore-db-dirs=seqno
# performance monitoring
performance_schema=ON
performance-schema-instrument='stage/%=ON'
performance-schema-consumer-events-stages-current=ON
performance-schema-consumer-events-stages-history=ON
performance-schema-consumer-events-stages-history-long=ON

# process
pid-file=/opt/mariadb/run/mariadbd.pid
socket=/opt/mariadb/run/mariadbd.sock

[mysql_upgrade]
socket=/opt/mariadb/run/mariadbd.sock

[client]
socket=/opt/mariadb/run/mariadbd.sock

[client-server]
socket=/opt/mariadb/run/mariadbd.sock

[mariadb]
plugin_load_add = query_response_time #https://mariadb.com/kb/en/query-response-time-plugin/

# include additional configs
!includedir /opt/mariadb/etc/conf.d

conf.d/my.cnf:

[mariadb]
wsrep-provider=/usr/lib/libgalera_smm.so
binlog_format=ROW
log-bin=/opt/mariadb/log/mysql-bin.log
expire_logs_days=1
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
wsrep-cluster-name=eu-de-1.nova
wsrep_cluster_address=gcomm://mariadb-g-0.database.svc.cluster.local:4567,mariadb-g-1.database.svc.cluster.local:4567,mariadb-g-2.database.svc.cluster.local:4567,mariadb-g-backend.database.svc.cluster.local:4567
wsrep_provider_options=cert.log_conflicts=ON;debug=YES;gcache.recover=yes;ist.recv_addr=10.60.3.3:4568;pc.recovery=FALSE;pc.wait_prim_timeout=PT60S;pc.weight=4
wsrep_node_address=10.60.3.3
wsrep_node_name=mariadb-g-0
wsrep-on=1
wsrep_log_conflicts=ON
wsrep_slave_threads=16
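To make the relevant provider settings easier to see, the following sketch splits the `wsrep_provider_options` string above into key/value pairs. The helper name is hypothetical. It shows that `gcache.recover=yes` is set, which is what requests the ring buffer recovery scan at startup, and that `gcache.size` is not set explicitly; the 134217752 bytes reported in the log are consistent with a default of about 128 MiB, though that reading is an assumption.

```python
def parse_provider_options(raw):
    """Split a semicolon-separated 'key=value' option string into a dict."""
    options = {}
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    return options


raw = (
    "cert.log_conflicts=ON;debug=YES;gcache.recover=yes;"
    "ist.recv_addr=10.60.3.3:4568;pc.recovery=FALSE;"
    "pc.wait_prim_timeout=PT60S;pc.weight=4"
)
opts = parse_provider_options(raw)
print(opts["gcache.recover"], "gcache.size" in opts)
```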

PR #608 does not seem to fix the problem, because the scan also fails with Galera 26.4.12.