Yolean/kubernetes-mysql-cluster

Manual recovery after "gcs connection failed" with one statefulset pod gone

Opened this issue · 2 comments

On preemtible nodes we had one instance of manual recovery, after #30. There was no mariadb-1 pod, and -0 and -2 stayed crashlooping. The were past init but the mariadb containers exited after:

2020-05-27  6:10:26 140050350966464 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50176S), skipping check
2020-05-27  6:10:55 140050350966464 [Note] WSREP: view((empty))
2020-05-27  6:10:55 140050350966464 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
	 at gcomm/src/pc.cpp:connect():158
2020-05-27  6:10:55 140050350966464 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():209: Failed to open backend connection: -110 (Connection timed out)
2020-05-27  6:10:55 140050350966464 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1458: Failed to open channel 'my_wsrep_cluster' at 'gcomm://mariadb-0.mariadb,mariadb-1.mariadb,mariadb-2.mariadb': -110 (Connection timed out)
2020-05-27  6:10:55 140050350966464 [ERROR] WSREP: gcs connect failed: Connection timed out
2020-05-27  6:10:55 140050350966464 [ERROR] WSREP: wsrep::connect(gcomm://mariadb-0.mariadb,mariadb-1.mariadb,mariadb-2.mariadb) failed: 7
2020-05-27  6:10:55 140050350966464 [ERROR] Aborting

This could be a case for switching from OrderedReady to Parrallel.

The solution was to scale down to zero and then back up to three again. Oddly the pods wouldn't go away at scale to zero, so I had to manually delete mariadb-2. Is that the expected behavior for OrderedReady?

We've since had a similar failure (have a look at the logs from 7:43:

mariadb-0                                   1/2     Running   0          24h
mariadb-2                                   2/2     Running   1          11h

# logs mariadb-0
2020-06-03 21:24:47 139899381815040 [Warning] WSREP: a0e77878 conflicting prims: my prim: view_id(PRIM,a0e77878,1) other prim: view_id(PRIM,da4a9e7f,88)
2020-06-03 21:24:47 139899381815040 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
...
2020-06-03 21:24:47 139899381815040 [ERROR] WSREP: exception from gcomm, backend must be restarted: a0e77878 aborting due to conflicting prims: older overrides (FATAL)
	 at gcomm/src/pc_proto.cpp:handle_state():1001
...
2020-06-03 21:24:47 139899298641664 [Note] WSREP: RECV thread exiting -103: Software caused connection abort
2020-06-03 21:24:47 139899680020224 [Note] WSREP: New cluster view: global state: 00000000-0000-0000-0000-000000000000:0, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version -1
2020-06-03 21:24:47 139899680020224 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-06-03 21:24:47 139899680020224 [Note] WSREP: applier thread exiting (code:6)

# logs mariadb-2
2020-06-03 21:25:02 140233294333632 [Note] WSREP: Node da4a9e7f state prim
2020-06-03 21:25:02 140233113806592 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
2020-06-03 21:25:02 140233292687104 [Note] WSREP: New cluster view: global state: d1704a99-a17e-11ea-86df-d368d39f6188:24, view# 88: Primary, number of nodes: 2, my index: 0, protocol version 3
...
2020-06-04  7:43:26 140233292687104 [Note] WSREP: New cluster view: global state: d1704a99-a17e-11ea-86df-d368d39f6188:33, view# 89: Primary, number of nodes: 1, my index: 0, protocol version 3

Full logs: kubernetes-mysql-cluster-issue-31-logs.zip

Had a similar/identical issue again, mariadb-0 1/2 ready, the other two pods 2/2, logs for mariadb-0:

2020-07-22 04:16:52+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 1:10.2.32+maria~bionic started.
2020-07-22 04:16:53+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2020-07-22 04:16:53+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 1:10.2.32+maria~bionic started.
2020-07-22  4:16:53 140630079710912 [Note] mysqld (mysqld 10.2.32-MariaDB-1:10.2.32+maria~bionic) starting as process 1 ...
2020-07-22  4:16:53 140630079710912 [Note] WSREP: Read nil XID from storage engines, skipping position init
2020-07-22  4:16:53 140630079710912 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
2020-07-22  4:16:53 140630079710912 [Note] WSREP: wsrep_load(): Galera 25.3.29(r3902) by Codership Oy <info@codership.com> loaded successfully.
2020-07-22  4:16:53 140630079710912 [Note] WSREP: CRC-32C: using hardware acceleration.
2020-07-22  4:16:53 140630079710912 [Note] WSREP: Found saved state: 35de4dd1-ca3e-11ea-9217-0b5c2293d6b7:12, safe_to_bootstrap: 1
2020-07-22  4:16:53 140630079710912 [Note] WSREP: Passing config to GCS: base_dir = /data/db/; base_host = 10.12.4.13; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /data/db/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /data/db//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = fa...
2020-07-22  4:16:54 140630079710912 [Note] WSREP: GCache history reset: 35de4dd1-ca3e-11ea-9217-0b5c2293d6b7:0 -> 35de4dd1-ca3e-11ea-9217-0b5c2293d6b7:12
2020-07-22  4:16:54 140630079710912 [Note] WSREP: Assign initial position for certification: 12, protocol version: -1
2020-07-22  4:16:54 140630079710912 [Note] WSREP: wsrep_sst_grab()
2020-07-22  4:16:54 140630079710912 [Note] WSREP: Start replication
2020-07-22  4:16:54 140630079710912 [Note] WSREP: 'wsrep-new-cluster' option used, bootstrapping the cluster
2020-07-22  4:16:54 140630079710912 [Note] WSREP: Setting initial position to 35de4dd1-ca3e-11ea-9217-0b5c2293d6b7:12
2020-07-22  4:16:54 140630079710912 [Note] WSREP: protonet asio version 0
2020-07-22  4:16:54 140630079710912 [Note] WSREP: Using CRC-32C for message checksums.
2020-07-22  4:16:54 140630079710912 [Note] WSREP: backend: asio
2020-07-22  4:16:54 140630079710912 [Note] WSREP: gcomm thread scheduling priority set to other:0 
2020-07-22  4:16:54 140630079710912 [Warning] WSREP: access file(/data/db//gvwstate.dat) failed(No such file or directory)
2020-07-22  4:16:54 140630079710912 [Note] WSREP: restore pc from disk failed
2020-07-22  4:16:54 140630079710912 [Note] WSREP: GMCast version 0
2020-07-22  4:16:54 140630079710912 [Warning] WSREP: Failed to resolve tcp://mariadb-0.mariadb:4567
2020-07-22  4:16:59 140630079710912 [Warning] WSREP: Failed to resolve tcp://mariadb-1.mariadb:4567
2020-07-22  4:17:09 140630079710912 [Warning] WSREP: Failed to resolve tcp://mariadb-2.mariadb:4567
2020-07-22  4:17:09 140630079710912 [Note] WSREP: (2ca50f1c, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2020-07-22  4:17:09 140630079710912 [Note] WSREP: (2ca50f1c, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2020-07-22  4:17:09 140630079710912 [Note] WSREP: EVS version 0
2020-07-22  4:17:09 140630079710912 [Note] WSREP: gcomm: bootstrapping new group 'my_wsrep_cluster'
2020-07-22  4:17:09 140630079710912 [Note] WSREP: start_prim is enabled, turn off pc_recovery
2020-07-22  4:17:09 140630079710912 [Note] WSREP: Node 2ca50f1c state prim
2020-07-22  4:17:09 140630079710912 [Note] WSREP: view(view_id(PRIM,2ca50f1c,1) memb {
	2ca50f1c,0
} joined {
} left {
} partitioned {
})
2020-07-22  4:17:09 140630079710912 [Note] WSREP: save pc into disk
2020-07-22  4:17:09 140630079710912 [Note] WSREP: gcomm: connected
2020-07-22  4:17:09 140630079710912 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2020-07-22  4:17:09 140630079710912 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2020-07-22  4:17:09 140630079710912 [Note] WSREP: Opened channel 'my_wsrep_cluster'
2020-07-22  4:17:09 140630079710912 [Note] WSREP: Waiting for SST to complete.
2020-07-22  4:17:09 140629696616192 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
2020-07-22  4:17:09 140629696616192 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 3598cf7f-cbd2-11ea-a089-aaa5f521af4b
2020-07-22  4:17:09 140629696616192 [Note] WSREP: STATE EXCHANGE: sent state msg: 3598cf7f-cbd2-11ea-a089-aaa5f521af4b
2020-07-22  4:17:09 140629696616192 [Note] WSREP: STATE EXCHANGE: got state msg: 3598cf7f-cbd2-11ea-a089-aaa5f521af4b from 0 (mariadb-0)
2020-07-22  4:17:09 140629696616192 [Note] WSREP: Quorum results:
	version    = 6,
	component  = PRIMARY,
	conf_id    = 0,
	members    = 1/1 (joined/total),
	act_id     = 12,
	last_appl. = -1,
	protocols  = 0/9/3 (gcs/repl/appl),
	group UUID = 35de4dd1-ca3e-11ea-9217-0b5c2293d6b7
2020-07-22  4:17:09 140629696616192 [Note] WSREP: Flow-control interval: [16, 16]
2020-07-22  4:17:09 140629696616192 [Note] WSREP: Trying to continue unpaused monitor
2020-07-22  4:17:09 140629696616192 [Note] WSREP: Restored state OPEN -> JOINED (12)
2020-07-22  4:17:09 140629696616192 [Note] WSREP: Member 0.0 (mariadb-0) synced with group.
2020-07-22  4:17:09 140629696616192 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 12)
2020-07-22  4:17:09 140630078064384 [Note] WSREP: New cluster view: global state: 35de4dd1-ca3e-11ea-9217-0b5c2293d6b7:12, view# 1: Primary, number of nodes: 1, my index: 0, protocol version 3
2020-07-22  4:17:09 140630079710912 [Note] WSREP: SST complete, seqno: 12
2020-07-22  4:17:09 140630079710912 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2020-07-22  4:17:09 140630079710912 [Note] InnoDB: Uses event mutexes
2020-07-22  4:17:09 140630079710912 [Note] InnoDB: Compressed tables use zlib 1.2.11
2020-07-22  4:17:09 140630079710912 [Note] InnoDB: Using Linux native AIO
2020-07-22  4:17:09 140630079710912 [Note] InnoDB: Number of pools: 1
2020-07-22  4:17:09 140630079710912 [Note] InnoDB: Using SSE2 crc32 instructions
2020-07-22  4:17:09 140630079710912 [Note] InnoDB: Initializing buffer pool, total size = 256M, instances = 1, chunk size = 128M
2020-07-22  4:17:09 140630079710912 [Note] InnoDB: Completed initialization of buffer pool
2020-07-22  4:17:09 140628882306816 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2020-07-22  4:17:09 140630079710912 [Note] InnoDB: Highest supported file format is Barracuda.
2020-07-22  4:17:09 140630079710912 [Note] InnoDB: 128 out of 128 rollback segments are active.
2020-07-22  4:17:09 140630079710912 [Note] InnoDB: Creating shared tablespace for temporary tables
2020-07-22  4:17:09 140630079710912 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2020-07-22  4:17:09 140630079710912 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2020-07-22  4:17:09 140630079710912 [Note] InnoDB: Waiting for purge to start
2020-07-22  4:17:09 140630079710912 [Note] InnoDB: 5.7.30 started; log sequence number 1618260
2020-07-22  4:17:09 140628721637120 [Note] InnoDB: Loading buffer pool(s) from /data/db/ib_buffer_pool
2020-07-22  4:17:09 140630079710912 [Note] Plugin 'FEEDBACK' is disabled.
2020-07-22  4:17:09 140628721637120 [Note] InnoDB: Buffer pool(s) load completed at 200722  4:17:09
2020-07-22  4:17:09 140630079710912 [Note] Server socket created on IP: '0.0.0.0'.
2020-07-22  4:17:09 140630079710912 [Note] Reading of all Master_info entries succeeded
2020-07-22  4:17:09 140630079710912 [Note] Added new Master_info '' to hash table
2020-07-22  4:17:09 140630079710912 [Note] mysqld: ready for connections.
Version: '10.2.32-MariaDB-1:10.2.32+maria~bionic'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  mariadb.org binary distribution
2020-07-22  4:17:09 140630078064384 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-07-22  4:17:09 140630078064384 [Note] WSREP: REPL Protocols: 9 (4, 2)
2020-07-22  4:17:09 140630078064384 [Note] WSREP: Assign initial position for certification: 12, protocol version: 4
2020-07-22  4:17:09 140629947614976 [Note] WSREP: Service thread queue flushed.
2020-07-22  4:17:09 140630078064384 [Note] WSREP: GCache history reset: 35de4dd1-ca3e-11ea-9217-0b5c2293d6b7:0 -> 35de4dd1-ca3e-11ea-9217-0b5c2293d6b7:12
2020-07-22  4:17:09 140630078064384 [Note] WSREP: Synchronized with group, ready for connections
2020-07-22  4:17:09 140630078064384 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-07-22  4:31:17 140629647947520 [Warning] IP address '10.12.3.16' could not be resolved: Name or service not known
2020-07-22  9:01:36 140629711513344 [Note] WSREP: (2ca50f1c, 'tcp://0.0.0.0:4567') connection established to f2cb7e80 tcp://10.12.2.19:4567
2020-07-22  9:01:36 140629711513344 [Note] WSREP: (2ca50f1c, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
2020-07-22  9:01:37 140629711513344 [Note] WSREP: (2ca50f1c, 'tcp://0.0.0.0:4567') connection established to 7efc2e5a tcp://10.12.3.20:4567
2020-07-22  9:01:37 140629711513344 [Note] WSREP: declaring 7efc2e5a at tcp://10.12.3.20:4567 stable
2020-07-22  9:01:37 140629711513344 [Note] WSREP: declaring f2cb7e80 at tcp://10.12.2.19:4567 stable
2020-07-22  9:01:37 140629711513344 [Warning] WSREP: 2ca50f1c conflicting prims: my prim: view_id(PRIM,2ca50f1c,1) other prim: view_id(PRIM,7efc2e5a,21)
2020-07-22  9:01:37 140629711513344 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
pc::Proto{uuid=2ca50f1c,start_prim=1,npvo=0,ignore_sb=0,ignore_quorum=0,state=1,last_sent_seq=3,checksum=0,instances=
	2ca50f1c,prim=1,un=0,last_seq=3,last_prim=view_id(PRIM,2ca50f1c,1),to_seq=2,weight=1,segment=0
,state_msgs=
	2ca50f1c,pcmsg{ type=STATE, seq=0, flags= 0, node_map {	2ca50f1c,prim=1,un=0,last_seq=3,last_prim=view_id(PRIM,2ca50f1c,1),to_seq=2,weight=1,segment=0
}}
,current_view=view(view_id(REG,2ca50f1c,22) memb {
	2ca50f1c,0
	7efc2e5a,0
	f2cb7e80,0
} joined {
	7efc2e5a,0
	f2cb7e80,0
} left {
} partitioned {
}),pc_view=view(view_id(PRIM,2ca50f1c,1) memb {
	2ca50f1c,0
} joined {
} left {
} partitioned {
}),mtu=32636}
2020-07-22  9:01:37 140629711513344 [Note] WSREP: {v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=7efc2e5a,srcvid=view_id(REG,2ca50f1c,22),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=440865,nl=(
)
} 64
2020-07-22  9:01:37 140629711513344 [ERROR] WSREP: exception caused by message: {v=0,t=3,ut=255,o=1,s=0,sr=-1,as=0,f=4,src=7efc2e5a,srcvid=view_id(REG,2ca50f1c,22),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=440867,nl=(
)
}
 state after handling message: evs::proto(evs::proto(2ca50f1c, OPERATIONAL, view_id(REG,2ca50f1c,22)), OPERATIONAL) {
current_view=view(view_id(REG,2ca50f1c,22) memb {
	2ca50f1c,0
	7efc2e5a,0
	f2cb7e80,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[1,0],safe_seq=0} node: {idx=1,range=[1,0],safe_seq=0} node: {idx=2,range=[1,0],safe_seq=0} },
fifo_seq=17083,
last_sent=0,
known:
2ca50f1c at 
{o=1,s=0,i=1,fs=-1,}
7efc2e5a at tcp://10.12.3.20:4567
{o=1,s=0,i=1,fs=440867,}
f2cb7e80 at tcp://10.12.2.19:4567
{o=1,s=0,i=1,fs=9,}
 }2020-07-22  9:01:37 140629711513344 [ERROR] WSREP: exception from gcomm, backend must be restarted: 2ca50f1c aborting due to conflicting prims: older overrides (FATAL)
	 at gcomm/src/pc_proto.cpp:handle_state():1021
2020-07-22  9:01:37 140629711513344 [Note] WSREP: gcomm: terminating thread
2020-07-22  9:01:37 140629711513344 [Note] WSREP: gcomm: joining thread
2020-07-22  9:01:37 140629711513344 [Note] WSREP: gcomm: closing backend
2020-07-22  9:01:37 140629711513344 [Note] WSREP: Forced PC close
2020-07-22  9:01:37 140629711513344 [Warning] WSREP: discarding 2 messages from message index
2020-07-22  9:01:37 140629711513344 [Note] WSREP: gcomm: closed
2020-07-22  9:01:37 140629696616192 [Note] WSREP: Received self-leave message.
2020-07-22  9:01:37 140629696616192 [Note] WSREP: comp msg error in core 103
2020-07-22  9:01:37 140629696616192 [Note] WSREP: Closing send monitor...
2020-07-22  9:01:37 140629696616192 [Note] WSREP: Closed send monitor.
2020-07-22  9:01:37 140629696616192 [Note] WSREP: Closing replication queue.
2020-07-22  9:01:37 140629696616192 [Note] WSREP: Closing slave action queue.
2020-07-22  9:01:37 140629696616192 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 12)
2020-07-22  9:01:37 140629696616192 [Note] WSREP: RECV thread exiting -103: Software caused connection abort
2020-07-22  9:01:37 140630078064384 [Note] WSREP: New cluster view: global state: 00000000-0000-0000-0000-000000000000:0, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version -1
2020-07-22  9:01:37 140630078064384 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-07-22  9:01:37 140630078064384 [Note] WSREP: applier thread exiting (code:6)