[Bug] Repeated log messages of "Deleted Master_info file"
Closed this issue · 3 comments
Documentation
- I acknowledge that I have read the relevant documentation.
Describe the bug
I am currently deploying a MariaDB cluster of 3 database instances and 2 MaxScale instances using MariaDB Operator. For testing purpose, I brought down the master database by running:
kubectl -n mariadb-cluster delete pod mariadb-cluster-0
After that, since I've set auto_failover: "true"
, failover happened as expected and mariadb-cluster-1
was promoted to the new master. However, as soon as mariadb-cluster-0
was up again and tried connecting to the new master mariadb-cluster-1
, there were tons of log messages "Deleted Master_info file" in pod mariadb-cluster-1
and they kept showing up every second:
2024-05-09 21:20:23+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:11.2.2+maria~ubu2204 started.
2024-05-09 21:20:23+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2024-05-09 21:20:23+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:11.2.2+maria~ubu2204 started.
2024-05-09 21:20:23+00:00 [Note] [Entrypoint]: MariaDB upgrade not required
2024-05-09 21:20:23 0 [Note] Starting MariaDB 11.2.2-MariaDB-1:11.2.2+maria~ubu2204-log source revision 929532a9426d085111c24c63de9c23cc54382259 as process 1
2024-05-09 21:20:23 0 [Note] Initializing built-in plugins
2024-05-09 21:20:23 0 [Note] Initializing plugins specified on the command line
2024-05-09 21:20:23 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2024-05-09 21:20:23 0 [Note] InnoDB: Number of transaction pools: 1
2024-05-09 21:20:23 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2024-05-09 21:20:23 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
2024-05-09 21:20:23 0 [Note] InnoDB: Initializing buffer pool, total size = 128.000MiB, chunk size = 2.000MiB
2024-05-09 21:20:23 0 [Note] InnoDB: Completed initialization of buffer pool
2024-05-09 21:20:23 0 [Note] InnoDB: File system buffers for log disabled (block size=512 bytes)
2024-05-09 21:20:23 0 [Note] InnoDB: End of log at LSN=88057668
2024-05-09 21:20:24 0 [Note] InnoDB: Opened 3 undo tablespaces
2024-05-09 21:20:24 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
2024-05-09 21:20:24 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2024-05-09 21:20:24 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2024-05-09 21:20:24 0 [Note] InnoDB: log sequence number 88057668; transaction id 2637
2024-05-09 21:20:24 0 [Note] Plugin 'FEEDBACK' is disabled.
2024-05-09 21:20:24 0 [Note] Plugin 'wsrep-provider' is disabled.
2024-05-09 21:20:24 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2024-05-09 21:20:24 0 [Note] Initializing installed plugins
2024-05-09 21:20:24 0 [Note] Server socket created on IP: '0.0.0.0'.
2024-05-09 21:20:24 0 [Note] Server socket created on IP: '::'.
2024-05-09 21:20:24 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
2024-05-09 21:20:24 0 [Note] Reading of all Master_info entries succeeded
2024-05-09 21:20:24 0 [Note] Added new Master_info '' to hash table
2024-05-09 21:20:24 0 [Note] mariadbd: ready for connections.
Version: '11.2.2-MariaDB-1:11.2.2+maria~ubu2204-log' socket: '/run/mysqld/mysqld.sock' port: 3306 mariadb.org binary distribution
2024-05-09 21:20:24 0 [Note] InnoDB: Buffer pool(s) load completed at 240509 21:20:24
2024-05-09 21:20:31 7 [Note] Start binlog_dump to slave_server(12), pos(mariadb-cluster-bin.000001, 75506082), using_gtid(1), gtid('0-10-1132')
2024-05-09 21:20:31 7 [Note] Start semi-sync binlog_dump to slave (server_id: 12), pos(./mariadb-cluster-bin.000002, 4)
2024-05-09 21:22:15 31 [Note] Start binlog_dump to slave_server(10), pos(, 4), using_gtid(1), gtid('0-10-1132')
2024-05-09 21:23:00 233 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:00 233 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:00 232 [Warning] Aborted connection 31 to db: 'unconnected' user: 'maxscale-monitor' host: '10.0.3.115' (KILLED)
2024-05-09 21:23:00 232 [Note] Start binlog_dump to slave_server(10), pos(, 4), using_gtid(1), gtid('0-11-1288')
2024-05-09 21:23:00 232 [Note] Start semi-sync binlog_dump to slave (server_id: 10), pos(./mariadb-cluster-bin.000002, 4)
2024-05-09 21:23:01 237 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:01 237 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:01 242 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:01 242 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:01 254 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:01 254 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:01 262 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:01 262 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:01 264 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:01 264 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:01 267 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:01 267 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:01 269 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:01 269 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:01 272 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:01 272 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:01 274 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
Also probably worth to mention that after mariadb-cluster-0
was brought up, it had 2 replication connections to mariadb-cluster-1
:
MariaDB [(none)]> SHOW ALL SLAVES STATUS\G
*************************** 1. row ***************************
Connection_name:
Slave_SQL_State:
Slave_IO_State:
Master_Host: mariadb-cluster-1.mariadb-cluster-internal.mariadb-cluster.svc.cluster.local
Master_User: maxscale-monitor
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mariadb-cluster-bin.000001
Read_Master_Log_Pos: 332323
Relay_Log_File: mariadb-cluster-relay-bin.000002
Relay_Log_Pos: 66140
Relay_Master_Log_File: mariadb-cluster-bin.000001
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Rewrite_DB:
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 332323
Relay_Log_Space: 66459
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 2003
Last_IO_Error: error reconnecting to master 'maxscale-monitor@mariadb-cluster-1.mariadb-cluster-internal.mariadb-cluster.svc.cluster.local:3306' - retry-time: 60 maximum-retries: 100000 message: Can't connect to server on 'mariadb-cluster-1.mariadb-cluster-internal.mariadb-cluster.svc.cluster.local' (111 "Connection refused")
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 11
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Current_Pos
Gtid_IO_Pos: 0-11-415
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 2
Retried_transactions: 0
Max_relay_log_size: 1073741824
Executed_log_entries: 18
Slave_received_heartbeats: 0
Slave_heartbeat_period: 30.000
Gtid_Slave_Pos: 0-11-579
*************************** 2. row ***************************
Connection_name: mariadb-operator
Slave_SQL_State: Slave has read all relay log; waiting for more updates
Slave_IO_State: Waiting for master to send event
Master_Host: mariadb-cluster-1.mariadb-cluster-internal.mariadb-cluster.svc.cluster.local
Master_User: repl
Master_Port: 3306
Connect_Retry: 10
Master_Log_File: mariadb-cluster-bin.000002
Read_Master_Log_Pos: 27121
Relay_Log_File: mariadb-cluster-relay-bin-mariadb@002doperator.000002
Relay_Log_Pos: 27430
Relay_Master_Log_File: mariadb-cluster-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Rewrite_DB:
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 27121
Relay_Log_Space: 27770
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 11
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Current_Pos
Gtid_IO_Pos: 0-11-579
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Slave_DDL_Groups: 86
Slave_Non_Transactional_Groups: 78
Slave_Transactional_Groups: 0
Retried_transactions: 0
Max_relay_log_size: 1073741824
Executed_log_entries: 335
Slave_received_heartbeats: 0
Slave_heartbeat_period: 30.000
Gtid_Slave_Pos: 0-11-579
2 rows in set (0.000 sec)
Expected behaviour
Deleted Master_info file
is only printed once, not every second. It's not a breaking bug but I imagine if the cluster is live for a long time, its log will fill up the storage. It also makes it harder to notice other meaningful log messages.
Steps to reproduce the bug
- Start a cluster of 3 databases and 2 maxscale instances with automatic failover.
- Bring the primary pod down.
- After the pod is up and connected to a new primary pod, observe the log in the new primary pod.
Environment details:
- Kubernetes version: v1.25
- mariadb-operator version: v0.27
- Install method: helm
- Install flavor: custom
Update: I let the cluster run overnight and it seems themariadb-cluster-0
got restarted a lot of times:
kubectl -n mariadb-cluster get pods
NAME READY STATUS RESTARTS AGE
mariadb-cluster-0 0/1 CrashLoopBackOff 187 (2m57s ago) 16h
mariadb-cluster-1 1/1 Running 1 (16h ago) 16h
mariadb-cluster-2 1/1 Running 0 16h
mariadb-operator-66864b5fd6-nv6qr 1/1 Running 1 (16h ago) 17h
mariadb-operator-66864b5fd6-xf2hr 1/1 Running 0 17h
mariadb-operator-webhook-5859dd96f9-c29vc 1/1 Running 1 (16h ago) 17h
maxscale-0 1/1 Running 0 16h
maxscale-1 1/1 Running 0 16h
Log from mariadb-cluster-1
:
2024-05-10 14:08:35 828280 [Note] Start binlog_dump to slave_server(10), pos(mariadb-cluster-bin.000002, 368), using_gtid(1), gtid('0-11-10680')
2024-05-10 14:08:35 828282 [Warning] Aborted connection 828280 to db: 'unconnected' user: 'repl' host: '10.0.3.204' (KILLED)
2024-05-10 14:08:35 828282 [Note] Start binlog_dump to slave_server(10), pos(mariadb-cluster-bin.000002, 368), using_gtid(1), gtid('0-11-10680')
2024-05-10 14:08:35 828284 [Warning] Aborted connection 828282 to db: 'unconnected' user: 'maxscale-monitor' host: '10.0.3.204' (KILLED)
2024-05-10 14:08:35 828284 [Note] Start binlog_dump to slave_server(10), pos(mariadb-cluster-bin.000002, 368), using_gtid(1), gtid('0-11-10680')
2024-05-10 14:08:35 828284 [Warning] Aborted connection 828284 to db: 'unconnected' user: 'repl' host: '10.0.3.204' (Got an error writing communication packets)
2024-05-10 14:08:35 828292 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-10 14:08:35 828292 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-10 14:08:36 828297 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-10 14:08:36 828297 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-10 14:08:36 828299 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-10 14:08:36 828299 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-10 14:08:36 828302 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-10 14:08:36 828302 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-10 14:08:36 828305 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-10 14:08:36 828305 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-10 14:08:36 828310 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
Please ignore. Other logs in mariadb-opeartor
pods:
{"level":"info","ts":1715370575.2262561,"logger":"replication","msg":"Configuring primary","controller":"mariadb","controllerGroup":"k8s.mariadb.com","controllerKind":"MariaDB","MariaDB":{"name":"mariadb-cluster","namespace":"mariadb-cluster"},"namespace":"mariadb-cluster","name":"mariadb-cluster","reconcileID":"deb1c875-0737-42fe-93fc-d9b6b0a597b4","pod":"mariadb-cluster-1"}
{"level":"error","ts":1715370575.2471206,"msg":"Reconciler error","controller":"mariadb","controllerGroup":"k8s.mariadb.com","controllerKind":"MariaDB","MariaDB":{"name":"mariadb-cluster","namespace":"mariadb-cluster"},"namespace":"mariadb-cluster","name":"mariadb-cluster","reconcileID":"deb1c875-0737-42fe-93fc-d9b6b0a597b4","error":"error reconciling Replication: 1 error occurred:\n\t* error configuring replication in Pod 'mariadb-cluster-1': error resetting slave position: Error 1948: Specified value for @@gtid_slave_pos contains no value for replication domain 0. This conflicts with the binary log which contains GTID 0-11-139. If MASTER_GTID_POS=CURRENT_POS is used, the binlog position will override the new value of @@gtid_slave_pos\n\n","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":1715370575.3540788,"logger":"replication","msg":"Configuring primary","controller":"mariadb","controllerGroup":"k8s.mariadb.com","controllerKind":"MariaDB","MariaDB":{"name":"mariadb-cluster","namespace":"mariadb-cluster"},"namespace":"mariadb-cluster","name":"mariadb-cluster","reconcileID":"f8830b29-46f5-4c1e-8725-06abb160de31","pod":"mariadb-cluster-1"}
{"level":"error","ts":1715370575.3769333,"msg":"Reconciler error","controller":"mariadb","controllerGroup":"k8s.mariadb.com","controllerKind":"MariaDB","MariaDB":{"name":"mariadb-cluster","namespace":"mariadb-cluster"},"namespace":"mariadb-cluster","name":"mariadb-cluster","reconcileID":"f8830b29-46f5-4c1e-8725-06abb160de31","error":"error reconciling Replication: 1 error occurred:\n\t* error configuring replication in Pod 'mariadb-cluster-1': error resetting slave position: Error 1948: Specified value for @@gtid_slave_pos contains no value for replication domain 0. This conflicts with the binary log which contains GTID 0-11-139. If MASTER_GTID_POS=CURRENT_POS is used, the binlog position will override the new value of @@gtid_slave_pos\n\n","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":1715370575.654521,"logger":"replication","msg":"Configuring primary","controller":"mariadb","controllerGroup":"k8s.mariadb.com","controllerKind":"MariaDB","MariaDB":{"name":"mariadb-cluster","namespace":"mariadb-cluster"},"namespace":"mariadb-cluster","name":"mariadb-cluster","reconcileID":"8845c5b8-9611-45da-a486-054316395afa","pod":"mariadb-cluster-1"}
It seems mariadb-operator
trying to reset gtid_slave_pos
. However, I have gtid_strict_mode
enabled so it fails and I assume it keeps retrying, thus it's stuck in a loop. I've turned off gtid_strict_mode
and so far the error has not happened again. Will close this for now.
Hey there @HuyVo2112 ! Thanks a lot for the detailed explanation, I'm glad that you sorted out the issue by yourself. Some comments:
as soon as mariadb-cluster-0 was up again and tried connecting to the new master mariadb-cluster-1,
By default, the mariadbmon
monitor used for replication defaults auto_rejoin
to true
, which is the reason why reconnecting to the existing cluster was attempted when mariadb-cluster-0
came back as replica:
You may disable this by turning it off:
apiVersion: k8s.mariadb.com/v1alpha1
kind: MaxScale
metadata:
name: maxscale-repl
spec:
monitor:
interval: 2s
cooperativeMonitoring: majority_of_all
params:
auto_rejoin: "false"
It seems mariadb-operator trying to reset gtid_slave_pos. However, I have gtid_strict_mode enabled so it fails and I assume it keeps retrying, thus it's stuck in a loop. I've turned off gtid_strict_mode and so far the error has not happened again. Will close this for now.
Yes, as part of the replica reconciliation, the operator manipulates gtid_slave_pos
, gtid_strict_mode
kindof clashes with this.