mariadb-operator/mariadb-operator

[Bug] Repeated log messages of "Deleted Master_info file"

Closed this issue · 3 comments

Documentation

Describe the bug
I am currently deploying a MariaDB cluster of 3 database instances and 2 MaxScale instances using MariaDB Operator. For testing purpose, I brought down the master database by running:

kubectl -n mariadb-cluster delete pod mariadb-cluster-0

After that, since I've set auto_failover: "true", failover happened as expected and mariadb-cluster-1 was promoted to the new master. However, as soon as mariadb-cluster-0 was up again and tried connecting to the new master mariadb-cluster-1, there were tons of log messages "Deleted Master_info file" in pod mariadb-cluster-1 and they kept showing up every second:

2024-05-09 21:20:23+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:11.2.2+maria~ubu2204 started.
2024-05-09 21:20:23+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2024-05-09 21:20:23+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:11.2.2+maria~ubu2204 started.
2024-05-09 21:20:23+00:00 [Note] [Entrypoint]: MariaDB upgrade not required
2024-05-09 21:20:23 0 [Note] Starting MariaDB 11.2.2-MariaDB-1:11.2.2+maria~ubu2204-log source revision 929532a9426d085111c24c63de9c23cc54382259 as process 1
2024-05-09 21:20:23 0 [Note] Initializing built-in plugins
2024-05-09 21:20:23 0 [Note] Initializing plugins specified on the command line
2024-05-09 21:20:23 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2024-05-09 21:20:23 0 [Note] InnoDB: Number of transaction pools: 1
2024-05-09 21:20:23 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2024-05-09 21:20:23 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
2024-05-09 21:20:23 0 [Note] InnoDB: Initializing buffer pool, total size = 128.000MiB, chunk size = 2.000MiB
2024-05-09 21:20:23 0 [Note] InnoDB: Completed initialization of buffer pool
2024-05-09 21:20:23 0 [Note] InnoDB: File system buffers for log disabled (block size=512 bytes)
2024-05-09 21:20:23 0 [Note] InnoDB: End of log at LSN=88057668
2024-05-09 21:20:24 0 [Note] InnoDB: Opened 3 undo tablespaces
2024-05-09 21:20:24 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
2024-05-09 21:20:24 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2024-05-09 21:20:24 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2024-05-09 21:20:24 0 [Note] InnoDB: log sequence number 88057668; transaction id 2637
2024-05-09 21:20:24 0 [Note] Plugin 'FEEDBACK' is disabled.
2024-05-09 21:20:24 0 [Note] Plugin 'wsrep-provider' is disabled.
2024-05-09 21:20:24 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2024-05-09 21:20:24 0 [Note] Initializing installed plugins
2024-05-09 21:20:24 0 [Note] Server socket created on IP: '0.0.0.0'.
2024-05-09 21:20:24 0 [Note] Server socket created on IP: '::'.
2024-05-09 21:20:24 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
2024-05-09 21:20:24 0 [Note] Reading of all Master_info entries succeeded
2024-05-09 21:20:24 0 [Note] Added new Master_info '' to hash table
2024-05-09 21:20:24 0 [Note] mariadbd: ready for connections.
Version: '11.2.2-MariaDB-1:11.2.2+maria~ubu2204-log'  socket: '/run/mysqld/mysqld.sock'  port: 3306  mariadb.org binary distribution
2024-05-09 21:20:24 0 [Note] InnoDB: Buffer pool(s) load completed at 240509 21:20:24
2024-05-09 21:20:31 7 [Note] Start binlog_dump to slave_server(12), pos(mariadb-cluster-bin.000001, 75506082), using_gtid(1), gtid('0-10-1132')
2024-05-09 21:20:31 7 [Note] Start semi-sync binlog_dump to slave (server_id: 12), pos(./mariadb-cluster-bin.000002, 4)
2024-05-09 21:22:15 31 [Note] Start binlog_dump to slave_server(10), pos(, 4), using_gtid(1), gtid('0-10-1132')
2024-05-09 21:23:00 233 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:00 233 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:00 232 [Warning] Aborted connection 31 to db: 'unconnected' user: 'maxscale-monitor' host: '10.0.3.115' (KILLED)
2024-05-09 21:23:00 232 [Note] Start binlog_dump to slave_server(10), pos(, 4), using_gtid(1), gtid('0-11-1288')
2024-05-09 21:23:00 232 [Note] Start semi-sync binlog_dump to slave (server_id: 10), pos(./mariadb-cluster-bin.000002, 4)
2024-05-09 21:23:01 237 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:01 237 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:01 242 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:01 242 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:01 254 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:01 254 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:01 262 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:01 262 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:01 264 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:01 264 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:01 267 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:01 267 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:01 269 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:01 269 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:01 272 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-09 21:23:01 272 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-09 21:23:01 274 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.

Also probably worth to mention that after mariadb-cluster-0 was brought up, it had 2 replication connections to mariadb-cluster-1:

MariaDB [(none)]> SHOW ALL SLAVES STATUS\G
*************************** 1. row ***************************
               Connection_name: 
               Slave_SQL_State: 
                Slave_IO_State: 
                   Master_Host: mariadb-cluster-1.mariadb-cluster-internal.mariadb-cluster.svc.cluster.local
                   Master_User: maxscale-monitor
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: mariadb-cluster-bin.000001
           Read_Master_Log_Pos: 332323
                Relay_Log_File: mariadb-cluster-relay-bin.000002
                 Relay_Log_Pos: 66140
         Relay_Master_Log_File: mariadb-cluster-bin.000001
              Slave_IO_Running: No
             Slave_SQL_Running: No
          Replicate_Rewrite_DB: 
               Replicate_Do_DB: 
           Replicate_Ignore_DB: 
            Replicate_Do_Table: 
        Replicate_Ignore_Table: 
       Replicate_Wild_Do_Table: 
   Replicate_Wild_Ignore_Table: 
                    Last_Errno: 0
                    Last_Error: 
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 332323
               Relay_Log_Space: 66459
               Until_Condition: None
                Until_Log_File: 
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File: 
            Master_SSL_CA_Path: 
               Master_SSL_Cert: 
             Master_SSL_Cipher: 
                Master_SSL_Key: 
         Seconds_Behind_Master: NULL
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 2003
                 Last_IO_Error: error reconnecting to master 'maxscale-monitor@mariadb-cluster-1.mariadb-cluster-internal.mariadb-cluster.svc.cluster.local:3306' - retry-time: 60  maximum-retries: 100000  message: Can't connect to server on 'mariadb-cluster-1.mariadb-cluster-internal.mariadb-cluster.svc.cluster.local' (111 "Connection refused")
                Last_SQL_Errno: 0
                Last_SQL_Error: 
   Replicate_Ignore_Server_Ids: 
              Master_Server_Id: 11
                Master_SSL_Crl: 
            Master_SSL_Crlpath: 
                    Using_Gtid: Current_Pos
                   Gtid_IO_Pos: 0-11-415
       Replicate_Do_Domain_Ids: 
   Replicate_Ignore_Domain_Ids: 
                 Parallel_Mode: optimistic
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: 
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 2
          Retried_transactions: 0
            Max_relay_log_size: 1073741824
          Executed_log_entries: 18
     Slave_received_heartbeats: 0
        Slave_heartbeat_period: 30.000
                Gtid_Slave_Pos: 0-11-579
*************************** 2. row ***************************
               Connection_name: mariadb-operator
               Slave_SQL_State: Slave has read all relay log; waiting for more updates
                Slave_IO_State: Waiting for master to send event
                   Master_Host: mariadb-cluster-1.mariadb-cluster-internal.mariadb-cluster.svc.cluster.local
                   Master_User: repl
                   Master_Port: 3306
                 Connect_Retry: 10
               Master_Log_File: mariadb-cluster-bin.000002
           Read_Master_Log_Pos: 27121
                Relay_Log_File: mariadb-cluster-relay-bin-mariadb@002doperator.000002
                 Relay_Log_Pos: 27430
         Relay_Master_Log_File: mariadb-cluster-bin.000002
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
          Replicate_Rewrite_DB: 
               Replicate_Do_DB: 
           Replicate_Ignore_DB: 
            Replicate_Do_Table: 
        Replicate_Ignore_Table: 
       Replicate_Wild_Do_Table: 
   Replicate_Wild_Ignore_Table: 
                    Last_Errno: 0
                    Last_Error: 
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 27121
               Relay_Log_Space: 27770
               Until_Condition: None
                Until_Log_File: 
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File: 
            Master_SSL_CA_Path: 
               Master_SSL_Cert: 
             Master_SSL_Cipher: 
                Master_SSL_Key: 
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error: 
                Last_SQL_Errno: 0
                Last_SQL_Error: 
   Replicate_Ignore_Server_Ids: 
              Master_Server_Id: 11
                Master_SSL_Crl: 
            Master_SSL_Crlpath: 
                    Using_Gtid: Current_Pos
                   Gtid_IO_Pos: 0-11-579
       Replicate_Do_Domain_Ids: 
   Replicate_Ignore_Domain_Ids: 
                 Parallel_Mode: optimistic
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
              Slave_DDL_Groups: 86
Slave_Non_Transactional_Groups: 78
    Slave_Transactional_Groups: 0
          Retried_transactions: 0
            Max_relay_log_size: 1073741824
          Executed_log_entries: 335
     Slave_received_heartbeats: 0
        Slave_heartbeat_period: 30.000
                Gtid_Slave_Pos: 0-11-579
2 rows in set (0.000 sec)

Expected behaviour
Deleted Master_info file is only printed once, not every second. It's not a breaking bug but I imagine if the cluster is live for a long time, its log will fill up the storage. It also makes it harder to notice other meaningful log messages.

Steps to reproduce the bug

  1. Start a cluster of 3 databases and 2 maxscale instances with automatic failover.
  2. Bring the primary pod down.
  3. After the pod is up and connected to a new primary pod, observe the log in the new primary pod.

Environment details:

  • Kubernetes version: v1.25
  • mariadb-operator version: v0.27
  • Install method: helm
  • Install flavor: custom

Update: I let the cluster run overnight and it seems themariadb-cluster-0 got restarted a lot of times:

kubectl -n mariadb-cluster get pods  
NAME                                                              READY   STATUS             RESTARTS          AGE
mariadb-cluster-0                                                 0/1     CrashLoopBackOff   187 (2m57s ago)   16h
mariadb-cluster-1                                                 1/1     Running            1 (16h ago)       16h
mariadb-cluster-2                                                 1/1     Running            0                 16h
mariadb-operator-66864b5fd6-nv6qr                                 1/1     Running            1 (16h ago)       17h
mariadb-operator-66864b5fd6-xf2hr                                 1/1     Running            0                 17h
mariadb-operator-webhook-5859dd96f9-c29vc                         1/1     Running            1 (16h ago)       17h
maxscale-0                                                        1/1     Running            0                 16h
maxscale-1                                                        1/1     Running            0                 16h

Log from mariadb-cluster-1:

2024-05-10 14:08:35 828280 [Note] Start binlog_dump to slave_server(10), pos(mariadb-cluster-bin.000002, 368), using_gtid(1), gtid('0-11-10680')
2024-05-10 14:08:35 828282 [Warning] Aborted connection 828280 to db: 'unconnected' user: 'repl' host: '10.0.3.204' (KILLED)
2024-05-10 14:08:35 828282 [Note] Start binlog_dump to slave_server(10), pos(mariadb-cluster-bin.000002, 368), using_gtid(1), gtid('0-11-10680')
2024-05-10 14:08:35 828284 [Warning] Aborted connection 828282 to db: 'unconnected' user: 'maxscale-monitor' host: '10.0.3.204' (KILLED)
2024-05-10 14:08:35 828284 [Note] Start binlog_dump to slave_server(10), pos(mariadb-cluster-bin.000002, 368), using_gtid(1), gtid('0-11-10680')
2024-05-10 14:08:35 828284 [Warning] Aborted connection 828284 to db: 'unconnected' user: 'repl' host: '10.0.3.204' (Got an error writing communication packets)
2024-05-10 14:08:35 828292 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-10 14:08:35 828292 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-10 14:08:36 828297 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-10 14:08:36 828297 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-10 14:08:36 828299 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-10 14:08:36 828299 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-10 14:08:36 828302 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-10 14:08:36 828302 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-10 14:08:36 828305 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.
2024-05-10 14:08:36 828305 [Note] Deleted Master_info file '/var/lib/mysql/relay-log.info'.
2024-05-10 14:08:36 828310 [Note] Deleted Master_info file '/var/lib/mysql/master.info'.

Please ignore. Other logs in mariadb-opeartor pods:

{"level":"info","ts":1715370575.2262561,"logger":"replication","msg":"Configuring primary","controller":"mariadb","controllerGroup":"k8s.mariadb.com","controllerKind":"MariaDB","MariaDB":{"name":"mariadb-cluster","namespace":"mariadb-cluster"},"namespace":"mariadb-cluster","name":"mariadb-cluster","reconcileID":"deb1c875-0737-42fe-93fc-d9b6b0a597b4","pod":"mariadb-cluster-1"}
{"level":"error","ts":1715370575.2471206,"msg":"Reconciler error","controller":"mariadb","controllerGroup":"k8s.mariadb.com","controllerKind":"MariaDB","MariaDB":{"name":"mariadb-cluster","namespace":"mariadb-cluster"},"namespace":"mariadb-cluster","name":"mariadb-cluster","reconcileID":"deb1c875-0737-42fe-93fc-d9b6b0a597b4","error":"error reconciling Replication: 1 error occurred:\n\t* error configuring replication in Pod 'mariadb-cluster-1': error resetting slave position: Error 1948: Specified value for @@gtid_slave_pos contains no value for replication domain 0. This conflicts with the binary log which contains GTID 0-11-139. If MASTER_GTID_POS=CURRENT_POS is used, the binlog position will override the new value of @@gtid_slave_pos\n\n","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":1715370575.3540788,"logger":"replication","msg":"Configuring primary","controller":"mariadb","controllerGroup":"k8s.mariadb.com","controllerKind":"MariaDB","MariaDB":{"name":"mariadb-cluster","namespace":"mariadb-cluster"},"namespace":"mariadb-cluster","name":"mariadb-cluster","reconcileID":"f8830b29-46f5-4c1e-8725-06abb160de31","pod":"mariadb-cluster-1"}
{"level":"error","ts":1715370575.3769333,"msg":"Reconciler error","controller":"mariadb","controllerGroup":"k8s.mariadb.com","controllerKind":"MariaDB","MariaDB":{"name":"mariadb-cluster","namespace":"mariadb-cluster"},"namespace":"mariadb-cluster","name":"mariadb-cluster","reconcileID":"f8830b29-46f5-4c1e-8725-06abb160de31","error":"error reconciling Replication: 1 error occurred:\n\t* error configuring replication in Pod 'mariadb-cluster-1': error resetting slave position: Error 1948: Specified value for @@gtid_slave_pos contains no value for replication domain 0. This conflicts with the binary log which contains GTID 0-11-139. If MASTER_GTID_POS=CURRENT_POS is used, the binlog position will override the new value of @@gtid_slave_pos\n\n","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":1715370575.654521,"logger":"replication","msg":"Configuring primary","controller":"mariadb","controllerGroup":"k8s.mariadb.com","controllerKind":"MariaDB","MariaDB":{"name":"mariadb-cluster","namespace":"mariadb-cluster"},"namespace":"mariadb-cluster","name":"mariadb-cluster","reconcileID":"8845c5b8-9611-45da-a486-054316395afa","pod":"mariadb-cluster-1"}

It seems mariadb-operator trying to reset gtid_slave_pos. However, I have gtid_strict_mode enabled so it fails and I assume it keeps retrying, thus it's stuck in a loop. I've turned off gtid_strict_mode and so far the error has not happened again. Will close this for now.

Hey there @HuyVo2112 ! Thanks a lot for the detailed explanation, I'm glad that you sorted out the issue by yourself. Some comments:

as soon as mariadb-cluster-0 was up again and tried connecting to the new master mariadb-cluster-1,

By default, the mariadbmon monitor used for replication defaults auto_rejoin to true, which is the reason why reconnecting to the existing cluster was attempted when mariadb-cluster-0 came back as replica:

You may disable this by turning it off:

apiVersion: k8s.mariadb.com/v1alpha1
kind: MaxScale
metadata:
  name: maxscale-repl
spec:
    monitor:
       interval: 2s
       cooperativeMonitoring: majority_of_all
       params:
          auto_rejoin: "false"

It seems mariadb-operator trying to reset gtid_slave_pos. However, I have gtid_strict_mode enabled so it fails and I assume it keeps retrying, thus it's stuck in a loop. I've turned off gtid_strict_mode and so far the error has not happened again. Will close this for now.

Yes, as part of the replica reconciliation, the operator manipulates gtid_slave_pos, gtid_strict_mode kindof clashes with this.