mariadb-operator/mariadb-operator

[Bug] Adding Node to Galera cluster fails

Closed this issue · 4 comments

Documentation

Describe the bug
When scaling up an existing cluster that is running Galera, the node fails to join and then goes into a crash loop.

Expected behaviour

The new node performs a state transfer (SST), joins the cluster, and reaches the Synced state without crash looping.
Steps to reproduce the bug

  1. Create a new Galera cluster
  2. Add some data to it
  3. Scale it up, or delete the PVCs associated with one of the nodes
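The steps above can be sketched with kubectl; this is a dry-run style sketch that only prints the commands, using the resource name and namespace from the manifest later in this issue (the PVC name `storage-mariadb-fixed-2` is an assumption based on typical StatefulSet volume naming — adjust for your cluster before running):

```shell
# Names taken from this report's manifest; adjust to your cluster.
MARIADB_NAME=mariadb-fixed
NAMESPACE=mariadb

# Step 3, option A: scale the MariaDB CR up from 3 to 4 replicas.
SCALE_CMD="kubectl patch mariadb $MARIADB_NAME -n $NAMESPACE --type=merge -p '{\"spec\":{\"replicas\":4}}'"

# Step 3, option B: wipe one node's storage (PVC name is an assumption).
PVC_CMD="kubectl delete pvc storage-$MARIADB_NAME-2 -n $NAMESPACE"

echo "$SCALE_CMD"
echo "$PVC_CMD"
```

Once the names match your cluster, run the printed commands directly (or pipe them through `sh`).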
New node in Galera cluster just crashing
2024-04-23 19:48:02+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:11.3.2+maria~ubu2204 started.
2024-04-23 19:48:02+00:00 [Warn] [Entrypoint]: /sys/fs/cgroup///memory.pressure not writable, functionality unavailable to MariaDB
2024-04-23 19:48:02+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2024-04-23 19:48:02+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:11.3.2+maria~ubu2204 started.
2024-04-23 19:48:03+00:00 [Note] [Entrypoint]: MariaDB upgrade information missing, assuming required
2024-04-23 19:48:03+00:00 [Note] [Entrypoint]: MariaDB upgrade (mariadb-upgrade or creating healthcheck users) required, but skipped due to $MARIADB_AUTO_UPGRADE setting
2024-04-23 19:48:03 0 [Note] Starting MariaDB 11.3.2-MariaDB-1:11.3.2+maria~ubu2204 source revision 068a6819eb63bcb01fdfa037c9bf3bf63c33ee42 as process 1
2024-04-23 19:48:03 0 [Note] WSREP: Loading provider /usr/lib/galera/libgalera_smm.so initial position: 00000000-0000-0000-0000-000000000000:-1
2024-04-23 19:48:03 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
2024-04-23 19:48:03 0 [Note] WSREP: wsrep_load(): Galera 26.4.16(r7dce5149) by Codership Oy <info@codership.com> loaded successfully.
2024-04-23 19:48:03 0 [Note] WSREP: Initializing allowlist service v1
2024-04-23 19:48:03 0 [Note] WSREP: Initializing event service v1
2024-04-23 19:48:03 0 [Note] WSREP: CRC-32C: using 64-bit x86 acceleration.
2024-04-23 19:48:03 0 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1, safe_to_bootstrap: 0
2024-04-23 19:48:03 0 [Note] WSREP: GCache DEBUG: opened preamble:
Version: 2
UUID: 003b272f-01a9-11ef-beba-026f3aace78f
Seqno: -1 - -1
Offset: -1
Synced: 0
2024-04-23 19:48:03 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: 003b272f-01a9-11ef-beba-026f3aace78f, offset: -1
2024-04-23 19:48:03 0 [Note] WSREP: GCache::RingBuffer initial scan...  0.0% (        0/134217752 bytes) complete.
2024-04-23 19:48:03 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (134217752/134217752 bytes) complete.
2024-04-23 19:48:03 0 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 675-1409
2024-04-23 19:48:03 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...  0.0% (        0/132302552 bytes) complete.
2024-04-23 19:48:03 0 [Note] WSREP: Recovering GCache ring buffer: found 7/742 locked buffers
2024-04-23 19:48:03 0 [Note] WSREP: Recovering GCache ring buffer: free space: 1916680/134217728
2024-04-23 19:48:03 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (132302552/132302552 bytes) complete.
2024-04-23 19:48:03 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 10.233.91.138; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.keep_plaintext_size = 128M; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.fc_single_primary = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0
2024-04-23 19:48:03 0 [Note] WSREP: Start replication
2024-04-23 19:48:03 0 [Note] WSREP: Connecting with bootstrap option: 0
2024-04-23 19:48:03 0 [Note] WSREP: Setting GCS initial position to 00000000-0000-0000-0000-000000000000:-1
2024-04-23 19:48:03 0 [Note] WSREP: protonet asio version 0
2024-04-23 19:48:03 0 [Note] WSREP: Using CRC-32C for message checksums.
2024-04-23 19:48:03 0 [Note] WSREP: backend: asio
2024-04-23 19:48:03 0 [Note] WSREP: gcomm thread scheduling priority set to other:0 
2024-04-23 19:48:03 0 [Note] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
2024-04-23 19:48:03 0 [Note] WSREP: restore pc from disk failed
2024-04-23 19:48:03 0 [Note] WSREP: GMCast version 0
2024-04-23 19:48:03 0 [Note] WSREP: (65c7b769-a4b9, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2024-04-23 19:48:03 0 [Note] WSREP: (65c7b769-a4b9, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2024-04-23 19:48:03 0 [Note] WSREP: EVS version 1
2024-04-23 19:48:03 0 [Note] WSREP: gcomm: connecting to group 'mariadb-operator', peer 'mariadb-gnew-0.mariadb-gnew-internal.mariadb.svc.newcluster.local:,mariadb-gnew-1.mariadb-gnew-internal.mariadb.svc.newcluster.local:,mariadb-gnew-2.mariadb-gnew-internal.mariadb.svc.newcluster.local:'
2024-04-23 19:48:03 0 [Note] WSREP: (65c7b769-a4b9, 'tcp://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address tcp://10.233.91.138:4567
2024-04-23 19:48:03 0 [Note] WSREP: (65c7b769-a4b9, 'tcp://0.0.0.0:4567') connection established to 1f792853-896d tcp://10.233.101.184:4567
2024-04-23 19:48:03 0 [Note] WSREP: (65c7b769-a4b9, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
2024-04-23 19:48:04 0 [Note] WSREP: EVS version upgrade 0 -> 1
2024-04-23 19:48:04 0 [Note] WSREP: declaring 1f792853-896d at tcp://10.233.101.184:4567 stable
2024-04-23 19:48:04 0 [Note] WSREP: PC protocol upgrade 0 -> 1
2024-04-23 19:48:04 0 [Note] WSREP: Node 1f792853-896d state prim
2024-04-23 19:48:04 0 [Note] WSREP: view(view_id(PRIM,1f792853-896d,22) memb {
	1f792853-896d,0
	65c7b769-a4b9,0
} joined {
} left {
} partitioned {
})
2024-04-23 19:48:04 0 [Note] WSREP: save pc into disk
2024-04-23 19:48:04 0 [Note] WSREP: discarding pending addr without UUID: tcp://10.233.69.10:4567
2024-04-23 19:48:04 0 [Note] WSREP: gcomm: connected
2024-04-23 19:48:04 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2024-04-23 19:48:04 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2024-04-23 19:48:04 0 [Note] WSREP: Opened channel 'mariadb-operator'
2024-04-23 19:48:04 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2024-04-23 19:48:04 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2024-04-23 19:48:04 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 66165298-01aa-11ef-91b2-43347ae01c3b
2024-04-23 19:48:04 0 [Note] WSREP: STATE EXCHANGE: got state msg: 66165298-01aa-11ef-91b2-43347ae01c3b from 0 (mariadb-gnew-2)
2024-04-23 19:48:04 0 [Note] WSREP: Initializing config service v1
2024-04-23 19:48:04 1 [Note] WSREP: Starting rollbacker thread 1
2024-04-23 19:48:04 2 [Note] WSREP: Starting applier thread 2
2024-04-23 19:48:04 0 [Note] WSREP: STATE EXCHANGE: got state msg: 66165298-01aa-11ef-91b2-43347ae01c3b from 1 (mariadb-gnew-1)
2024-04-23 19:48:04 0 [Note] WSREP: Quorum results:
	version    = 6,
	component  = PRIMARY,
	conf_id    = 20,
	members    = 1/2 (joined/total),
	act_id     = 1416,
	last_appl. = 1284,
	protocols  = 2/10/4 (gcs/repl/appl),
	vote policy= 0,
	group UUID = 003b272f-01a9-11ef-beba-026f3aace78f
2024-04-23 19:48:04 0 [Note] WSREP: Flow-control interval: [23, 23]
2024-04-23 19:48:04 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 1417)
2024-04-23 19:48:04 0 [Note] WSREP: Deinitializing config service v1
2024-04-23 19:48:04 2 [Note] WSREP: ####### processing CC 1417, local, ordered
2024-04-23 19:48:04 2 [Note] WSREP: Process first view: 003b272f-01a9-11ef-beba-026f3aace78f my uuid: 65c7b769-01aa-11ef-a4b9-4fcbab8908cb
2024-04-23 19:48:04 2 [Note] WSREP: Server mariadb-gnew-1 connected to cluster at position 003b272f-01a9-11ef-beba-026f3aace78f:1417 with ID 65c7b769-01aa-11ef-a4b9-4fcbab8908cb
2024-04-23 19:48:04 2 [Note] WSREP: Server status change disconnected -> connected
2024-04-23 19:48:04 2 [Note] WSREP: ####### My UUID: 65c7b769-01aa-11ef-a4b9-4fcbab8908cb
2024-04-23 19:48:04 2 [Note] WSREP: Cert index reset to 00000000-0000-0000-0000-000000000000:-1 (proto: 10), state transfer needed: yes
2024-04-23 19:48:04 0 [Note] WSREP: Service thread queue flushed.
2024-04-23 19:48:04 2 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: -1
2024-04-23 19:48:04 2 [Note] WSREP: State transfer required: 
	Group state: 003b272f-01a9-11ef-beba-026f3aace78f:1417
	Local state: 00000000-0000-0000-0000-000000000000:-1
2024-04-23 19:48:04 2 [Note] WSREP: Server status change connected -> joiner
2024-04-23 19:48:04 0 [Note] WSREP: Joiner monitor thread started to monitor
2024-04-23 19:48:04 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '10.233.91.138' --datadir '/var/lib/mysql/' --parent 1 --progress 0'
WSREP_SST: [INFO] mariabackup SST started on joiner (20240423 19:48:04.639)
WSREP_SST: [INFO] SSL configuration: CA='', CAPATH='', CERT='', KEY='', MODE='DISABLED', encrypt='0' (20240423 19:48:04.791)
WSREP_SST: [INFO] Progress reporting tool pv not found in path: /usr//bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/sbin:/usr/bin:/sbin:/bin (20240423 19:48:05.150)
WSREP_SST: [INFO] Disabling all progress/rate-limiting (20240423 19:48:05.154)
WSREP_SST: [INFO] Streaming with mbstream (20240423 19:48:05.187)
WSREP_SST: [INFO] Using socat as streamer (20240423 19:48:05.191)
WSREP_SST: [INFO] Stale sst_in_progress file: /var/lib/mysql/sst_in_progress (20240423 19:48:05.197)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20240423 19:48:05.252)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20240423 19:48:06.283)
2024-04-23 19:48:07 0 [Note] WSREP: (65c7b769-a4b9, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20240423 19:48:07.315)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20240423 19:48:08.350)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20240423 19:48:09.381)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20240423 19:48:10.423)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20240423 19:48:11.458)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20240423 19:48:12.500)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20240423 19:48:13.536)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20240423 19:48:14.574)
WSREP_SST: [ERROR] previous SST script still running. (20240423 19:48:14.584)
2024-04-23 19:48:14 0 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_mariabackup --role 'joiner' --address '10.233.91.138' --datadir '/var/lib/mysql/' --parent 1 --progress 0
	Read: '(null)'
2024-04-23 19:48:14 0 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'joiner' --address '10.233.91.138' --datadir '/var/lib/mysql/' --parent 1 --progress 0: 114 (Operation already in progress)
2024-04-23 19:48:14 2 [ERROR] WSREP: Failed to prepare for 'mariabackup' SST. Unrecoverable.
2024-04-23 19:48:14 2 [ERROR] WSREP: SST request callback failed. This is unrecoverable, restart required.
2024-04-23 19:48:14 2 [Note] WSREP: ReplicatorSMM::abort()
2024-04-23 19:48:14 2 [Note] WSREP: Closing send monitor...
2024-04-23 19:48:14 2 [Note] WSREP: Closed send monitor.
2024-04-23 19:48:14 2 [Note] WSREP: gcomm: terminating thread
2024-04-23 19:48:14 2 [Note] WSREP: gcomm: joining thread
2024-04-23 19:48:14 2 [Note] WSREP: gcomm: closing backend
2024-04-23 19:48:15 2 [Note] WSREP: view(view_id(NON_PRIM,1f792853-896d,22) memb {
	65c7b769-a4b9,0
} joined {
} left {
} partitioned {
	1f792853-896d,0
})
2024-04-23 19:48:15 2 [Note] WSREP: PC protocol downgrade 1 -> 0
2024-04-23 19:48:15 2 [Note] WSREP: view((empty))
2024-04-23 19:48:15 2 [Note] WSREP: gcomm: closed
2024-04-23 19:48:15 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2024-04-23 19:48:15 0 [Note] WSREP: Flow-control interval: [16, 16]
2024-04-23 19:48:15 0 [Note] WSREP: Received NON-PRIMARY.
2024-04-23 19:48:15 0 [Note] WSREP: Shifting PRIMARY -> OPEN (TO: 1417)
2024-04-23 19:48:15 0 [Note] WSREP: New SELF-LEAVE.
2024-04-23 19:48:15 0 [Note] WSREP: Flow-control interval: [0, 0]
2024-04-23 19:48:15 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2024-04-23 19:48:15 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 1417)
2024-04-23 19:48:15 0 [Note] WSREP: RECV thread exiting 0: Success
2024-04-23 19:48:15 2 [Note] WSREP: recv_thread() joined.
2024-04-23 19:48:15 2 [Note] WSREP: Closing replication queue.
2024-04-23 19:48:15 2 [Note] WSREP: Closing slave action queue.
2024-04-23 19:48:15 2 [Note] WSREP: mariadbd: Terminated.
240423 19:48:15 [ERROR] mysqld got signal 11 ;
Sorry, we probably made a mistake, and this is a bug.

Your assistance in bug reporting will enable us to fix this for the next release.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.

Server version: 11.3.2-MariaDB-1:11.3.2+maria~ubu2204 source revision: 068a6819eb63bcb01fdfa037c9bf3bf63c33ee42
key_buffer_size=0
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=3
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 336992 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f6f5c000c68
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f6f81425c68 thread_stack 0x49000
Printing to addr2line failed
mariadbd(my_print_stacktrace+0x32)[0x55a1a86358a2]
mariadbd(handle_fatal_signal+0x478)[0x55a1a8106488]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f6f83bef520]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x178)[0x7f6f83bd5898]
/usr/lib/galera/libgalera_smm.so(+0x157602)[0x7f6f83671602]
/usr/lib/galera/libgalera_smm.so(+0x700e1)[0x7f6f8358a0e1]
/usr/lib/galera/libgalera_smm.so(+0x6cc94)[0x7f6f83586c94]
/usr/lib/galera/libgalera_smm.so(+0x8b311)[0x7f6f835a5311]
/usr/lib/galera/libgalera_smm.so(+0x604a0)[0x7f6f8357a4a0]
/usr/lib/galera/libgalera_smm.so(+0x48261)[0x7f6f83562261]
mariadbd(_ZN5wsrep18wsrep_provider_v2611run_applierEPNS_21high_priority_serviceE+0x12)[0x55a1a86f5592]
mariadbd(+0xd93e31)[0x55a1a83c5e31]
mariadbd(_Z15start_wsrep_THDPv+0x26b)[0x55a1a83b3a7b]
mariadbd(+0xd05f86)[0x55a1a8337f86]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f6f83c41ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7f6f83cd3850]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0): (null)
Connection ID (thread ID): 2
Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off,hash_join_cardinality=on,cset_narrowing=off,sargable_casefold=on

The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/ contains
information that should help you find out what is causing the crash.

We think the query pointer is invalid, but we will try to print it anyway. 
Query: 

Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    0                    bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             unlimited            unlimited            processes 
Max open files            65535                65535                files     
Max locked memory         unlimited            unlimited            bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       192953               192953               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: core

Kernel version: Linux version 5.10.0-28-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.209-2 (2024-01-31)

I'm not sure if I'm doing something insanely wrong, or if my MariaDB cluster is just broken, lol.

Here's my deployment YAML again:

Deployment YAML
apiVersion: k8s.mariadb.com/v1alpha1
kind: MariaDB
metadata:
  name: mariadb-fixed
  namespace: mariadb
  annotations:
    argocd.argoproj.io/compare-options: IgnoreExtraneous
    argocd.argoproj.io/sync-options: Prune=false
spec:
  rootPasswordSecretKeyRef:
    name: mariadb-creds
    key: root-password

  podSecurityContext:
    runAsUser: 0

  storage:
    size: 30Gi
    storageClassName: local-path
    resizeInUseVolumes: true
    waitForVolumeResize: true
    volumeClaimTemplate:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 30Gi
      storageClassName: local-path

  image: mariadb:11.3.2
  replicas: 3

  galera:
    enabled: true
    primary:
      automaticFailover: true
    replicaThreads: 1
    agent:
      image: ghcr.io/mariadb-operator/mariadb-operator:v0.0.27
      port: 5555
      kubernetesAuth:
        enabled: true
      gracefulShutdownTimeout: 1s
    recovery:
      enabled: true
      minClusterSize: 40%
      clusterHealthyTimeout: 30s
      clusterBootstrapTimeout: 10m0s
      podRecoveryTimeout: 3m0s
      podSyncTimeout: 3m0s
    initContainer:
      image: ghcr.io/mariadb-operator/mariadb-operator:v0.0.27
    initJob:
      labels:
        sidecar.istio.io/inject: "false"
    config:
      reuseStorageVolume: false
      volumeClaimTemplate:
        resources:
          requests:
            storage: 300Mi
        accessModes:
          - ReadWriteOnce

  service:
    type: LoadBalancer
    annotations:
      metallb.universe.tf/ip-allocated-from-pool: first-pool
      metallb.universe.tf/loadBalancerIPs: 10.11.0.30
  connection:
    secretName: mariadb-fixed-conn
    secretTemplate:
      key: dsn

  primaryService:
    type: LoadBalancer
    annotations:
      metallb.universe.tf/ip-allocated-from-pool: first-pool
      metallb.universe.tf/loadBalancerIPs: 10.11.0.29
  primaryConnection:
    secretName: mariadb-fixed-conn-primary
    secretTemplate:
      key: dsn

  secondaryService:
    type: LoadBalancer
    annotations:
      metallb.universe.tf/ip-allocated-from-pool: first-pool
      metallb.universe.tf/loadBalancerIPs: 10.11.0.28
  secondaryConnection:
    secretName: mariadb-fixed-conn-secondary
    secretTemplate:
      key: dsn

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: "kubernetes.io/hostname"

  tolerations:
    - key: "mariadb.mmontes.io/ha"
      operator: "Exists"
      effect: "NoSchedule"

  updateStrategy:
    type: RollingUpdate

  myCnf: |
    [mariadb]
    bind-address=*
    default_storage_engine=InnoDB
    binlog_format=row
    innodb_autoinc_lock_mode=2
    max_allowed_packet=256M

  resources:
    requests:
      cpu: 300m
      memory: 256Mi
    limits:
      memory: 1Gi

  metrics:
    enabled: true
---
apiVersion: k8s.mariadb.com/v1alpha1
kind: Backup
metadata:
  name: mariadb-fixed-backup-scheduled
  namespace: mariadb
spec:
  mariaDbRef:
    name: mariadb-fixed
  schedule:
    cron: "0 */12 * * *" # every 12 hours
    suspend: false
  maxRetention: 1440h # 60 days
  storage:
    s3:
      bucket: mysql-backups
      endpoint: minio.minio.svc.newcluster.local:9000
      region: us-east-1
      accessKeyIdSecretKeyRef:
        name: minio-creds
        key: MINIO_ACCESS_KEY
      secretAccessKeySecretKeyRef:
        name: minio-creds
        key: MINIO_SECRET_KEY
      tls:
        enabled: false
  args:
    - --single-transaction
    - --all-databases
  logLevel: info
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 300m
      memory: 512Mi

Hey there @perfectra1n!

I've just tested this locally and managed to upscale a Galera cluster with 1GB of data generated with sysbench.

[INFO] previous SST is not completed, waiting for it to exit (20240423 19:48:14.574)

The issue is that your node has a pending SST, and it will keep restarting until it succeeds. The operator doesn't handle this situation yet, but it is on our radar.

You can cancel the SST by:

  • Exec into the Pod and delete /var/lib/mysql/wsrep_sst.pid
  • Restart the Pod
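The two steps above can be sketched as follows; this prints the commands rather than running them, and the pod name `mariadb-gnew-1` is an example taken from the logs in this issue (substitute the pod that is actually stuck):

```shell
# Names are examples from this thread; adjust to the stuck joiner pod.
NAMESPACE=mariadb
POD=mariadb-gnew-1

# Step 1: remove the stale SST pid file from the data directory.
EXEC_CMD="kubectl exec -n $NAMESPACE $POD -c mariadb -- rm -f /var/lib/mysql/wsrep_sst.pid"

# Step 2: restart the pod so the SST is retried cleanly.
DELETE_CMD="kubectl delete pod -n $NAMESPACE $POD"

echo "$EXEC_CMD"
echo "$DELETE_CMD"
```

Run the printed commands directly once the pod name matches; after the pod restarts, the joiner should be able to start a fresh SST.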

If that works, let's close this issue and track everything related to SST recovery in #425.

Thanks!

Gotcha, well I'll be sure to give those a shot next time I use Galera. For now I've just scaled my DB down to 1 node so that it at least works for the time being.

This issue is stale because it has been open 30 days with no activity.

This issue was closed because it has been stalled for 10 days with no activity.