redsnapper8t8/mysql-master-ha

Failover/switchover not reliable with GTIDs

Opened this issue · 4 comments

What steps will reproduce the problem?
1. Set up 1 master and 2 slaves with MySQL 5.6 and GTIDs enabled. For instance, 
master on port 13001 and slaves on port 13002 and 13003.

2. Create an errant transaction on 13002:
mysql> create database mydb;

Then purge this transaction from the binary logs:
mysql> flush logs;
mysql> purge binary logs to 'mysql-bin.000012';

3. Configure MHA 0.56
[...]
[server1]
hostname = 127.0.0.1
port = 13001

[server2]
hostname = 127.0.0.1
port = 13002

[server3]
hostname = 127.0.0.1
port = 13003

4. Now ask MHA to make 13002 the new master:

$ masterha_master_switch --conf=/etc/mha.conf --master_state=alive 
--orig_master_is_new_slave --interactive=0

What is the expected output? What do you see instead?
Now the switchover never completes and replication is broken:

[...]
From:
127.0.0.1(127.0.0.1:13001) (current master)
 +--127.0.0.1(127.0.0.1:13002)
 +--127.0.0.1(127.0.0.1:13003)

To:
127.0.0.1(127.0.0.1:13002) (new master)
 +--127.0.0.1(127.0.0.1:13003)
 +--127.0.0.1(127.0.0.1:13001)
[...]
Thu May  8 11:05:49 2014 - [info] Getting new master's binlog name and 
position..
Thu May  8 11:05:49 2014 - [info]  mysql-bin.000012:231
Thu May  8 11:05:49 2014 - [info]  All other slaves should start replication 
from here. Statement should be: CHANGE MASTER TO MASTER_HOST='127.0.0.1', 
MASTER_PORT=13002, MASTER_AUTO_POSITION=1, MASTER_USER='repl', 
MASTER_PASSWORD='xxx';
Thu May  8 11:05:49 2014 - [info] 
Thu May  8 11:05:49 2014 - [info] * Switching slaves in parallel..
Thu May  8 11:05:49 2014 - [info] 
Thu May  8 11:05:49 2014 - [info] -- Slave switch on host 
127.0.0.1(127.0.0.1:13003) started, pid: 1277
Thu May  8 11:05:49 2014 - [info]
Thu May  8 11:15:50 2014 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, 
ln784] Slave could not be started on 127.0.0.1(127.0.0.1:13003)! Check slave 
status.
Thu May  8 11:15:50 2014 - [info] Log messages from 127.0.0.1 ...
Thu May  8 11:15:50 2014 - [info] 
Thu May  8 11:05:49 2014 - [info]  Waiting to execute all relay logs on 
127.0.0.1(127.0.0.1:13003)..
Thu May  8 11:05:49 2014 - [info]  master_pos_wait(mysql-bin.000013:639) 
completed on 127.0.0.1(127.0.0.1:13003). Executed 0 events.
Thu May  8 11:05:49 2014 - [info]   done.
Thu May  8 11:05:49 2014 - [info]  Resetting slave 127.0.0.1(127.0.0.1:13003) 
and starting replication from the new master 127.0.0.1(127.0.0.1:13002)..
Thu May  8 11:05:49 2014 - [info]  Executed CHANGE MASTER.

# Replication on 13003
mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: 127.0.0.1
                  Master_User: repl
                  Master_Port: 13002
                Connect_Retry: 60
              Master_Log_File: 
          Read_Master_Log_Pos: 4
               Relay_Log_File: centos2-relay-bin.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: 
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 151
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 13002
                  Master_UUID: 8e3648e4-bc14-11e3-8d4c-0800272864ba
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 140508 11:05:49
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 8e349184-bc14-11e3-8d4c-0800272864ba:1-26,
8e3648e4-bc14-11e3-8d4c-0800272864ba:1-5
                Auto_Position: 1

Then after a while, it will try to make 13001 a slave of 13002, which will also 
fail:
Thu May  8 11:15:50 2014 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, 
ln862] Starting slave IO/SQL thread on 127.0.0.1(127.0.0.1:13003) failed!
Thu May  8 11:15:50 2014 - [info] End of log messages from 127.0.0.1 ...
Thu May  8 11:15:50 2014 - [info] 
Thu May  8 11:15:50 2014 - [info] Unlocking all tables on the orig master:
Thu May  8 11:15:50 2014 - [info] Executing UNLOCK TABLES..
Thu May  8 11:15:50 2014 - [info]  ok.
Thu May  8 11:15:50 2014 - [info] Starting orig master as a new slave..
Thu May  8 11:15:50 2014 - [info]  Resetting slave 127.0.0.1(127.0.0.1:13001) 
and starting replication from the new master 127.0.0.1(127.0.0.1:13002)..
Thu May  8 11:15:50 2014 - [info]  Executed CHANGE MASTER.

# Replication on 13001
mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: 127.0.0.1
                  Master_User: repl
                  Master_Port: 13002
                Connect_Retry: 60
              Master_Log_File: 
          Read_Master_Log_Pos: 4
               Relay_Log_File: centos2-relay-bin.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: 
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 151
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 13002
                  Master_UUID: 8e3648e4-bc14-11e3-8d4c-0800272864ba
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 140508 11:15:50
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 8e349184-bc14-11e3-8d4c-0800272864ba:1-26,
8e3648e4-bc14-11e3-8d4c-0800272864ba:1-5
                Auto_Position: 1

and then the script exits with an error:
Thu May  8 11:25:51 2014 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, 
ln784] Slave could not be started on 127.0.0.1(127.0.0.1:13001)! Check slave 
status.
Thu May  8 11:25:51 2014 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, 
ln862] Starting slave IO/SQL thread on 127.0.0.1(127.0.0.1:13001) failed!
Thu May  8 11:25:51 2014 - 
[error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln573]  Failed!
Thu May  8 11:25:51 2014 - 
[error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln602] Switching 
master to 127.0.0.1(127.0.0.1:13002) done, but switching slaves partially 
failed.

What version of the product are you using? On what operating system?
This was seen with MHA 0.56 on CentOS 6.5, and with MySQL 5.6.15.

Please provide any additional information below.
I think the script should check if the new master has any errant transaction 
before actually doing the switchover/failover. If such a transaction is found, 
the script should exit with an error as there is a potential risk of breaking 
replication and not being able to do the failover.

Original issue reported on code.google.com by stephane...@percona.com on 8 May 2014 at 9:29

Original comment by Yoshinor...@gmail.com on 8 May 2014 at 5:59

  • Changed state: Accepted
Any updates on this?

Original comment by ege...@indeed.com on 11 Sep 2014 at 6:01

Hi All,

I suspect that's not entirely MasterHA fault - it seems you just hit errant 
transactions - 
http://www.percona.com/blog/2014/05/19/errant-transactions-major-hurdle-for-gtid
-based-failover-in-mysql-5-6/
If you are using GTID you do not need to write to slave - period. Any writes 
will create new transaction on slave which is absent on master and after log 
purging - boom! - your failover failed.
We faced with same problem because of pt-heartbeat with SUPER privileges. You 
need to be extremely cautious with SUPER users on slaves or apply some patch 
for MySQL for "super-read-only" mode when even SUPER user not able to write to 
read-only slave.
Or you can check for errant transactions before failover. Or both.

Original comment by denis.zhdanov on 13 Nov 2014 at 9:16

PS: it will be great to have errant transaction check for masterha_check_repl 
(will try to make PR)

Original comment by denis.zhdanov on 13 Nov 2014 at 9:21