Failover/switchover not reliable with GTIDs
GoogleCodeExporter commented
What steps will reproduce the problem?
1. Set up 1 master and 2 slaves with MySQL 5.6 and GTIDs enabled. For instance,
master on port 13001 and slaves on ports 13002 and 13003.
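For reference, a minimal my.cnf sketch for such a setup (these option names
are the standard MySQL 5.6 GTID settings, not taken from the original report;
adjust port and server-id per instance):
[mysqld]
port                     = 13001   # 13002 / 13003 on the slaves
server-id                = 13001   # must be unique per instance
log-bin                  = mysql-bin
gtid-mode                = ON
enforce-gtid-consistency = ON      # required for gtid-mode in 5.6
log-slave-updates        = ON      # also required for gtid-mode in 5.6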
2. Create an errant transaction on 13002:
mysql> create database mydb;
Then purge this transaction from the binary logs:
mysql> flush logs;
mysql> purge binary logs to 'mysql-bin.000012';
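To see why this write is now errant, note that 13002's executed GTID set
contains a transaction no other server has, and the purge removed it from the
binary logs. A quick way to confirm (these statements are an illustration
added here, not part of the original report):
mysql> SELECT @@GLOBAL.gtid_executed;  -- on 13002: includes the local GTID
mysql> SELECT @@GLOBAL.gtid_purged;    -- on 13002: that GTID is now purged
mysql> SHOW BINARY LOGS;               -- the earlier binlog files are gone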
3. Configure MHA 0.56
[...]
[server1]
hostname = 127.0.0.1
port = 13001
[server2]
hostname = 127.0.0.1
port = 13002
[server3]
hostname = 127.0.0.1
port = 13003
4. Now ask MHA to make 13002 the new master:
$ masterha_master_switch --conf=/etc/mha.conf --master_state=alive --orig_master_is_new_slave --interactive=0
What is the expected output? What do you see instead?
The switchover never completes and replication is broken:
[...]
From:
127.0.0.1(127.0.0.1:13001) (current master)
+--127.0.0.1(127.0.0.1:13002)
+--127.0.0.1(127.0.0.1:13003)
To:
127.0.0.1(127.0.0.1:13002) (new master)
+--127.0.0.1(127.0.0.1:13003)
+--127.0.0.1(127.0.0.1:13001)
[...]
Thu May 8 11:05:49 2014 - [info] Getting new master's binlog name and
position..
Thu May 8 11:05:49 2014 - [info] mysql-bin.000012:231
Thu May 8 11:05:49 2014 - [info] All other slaves should start replication
from here. Statement should be: CHANGE MASTER TO MASTER_HOST='127.0.0.1',
MASTER_PORT=13002, MASTER_AUTO_POSITION=1, MASTER_USER='repl',
MASTER_PASSWORD='xxx';
Thu May 8 11:05:49 2014 - [info]
Thu May 8 11:05:49 2014 - [info] * Switching slaves in parallel..
Thu May 8 11:05:49 2014 - [info]
Thu May 8 11:05:49 2014 - [info] -- Slave switch on host
127.0.0.1(127.0.0.1:13003) started, pid: 1277
Thu May 8 11:05:49 2014 - [info]
Thu May 8 11:15:50 2014 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm,
ln784] Slave could not be started on 127.0.0.1(127.0.0.1:13003)! Check slave
status.
Thu May 8 11:15:50 2014 - [info] Log messages from 127.0.0.1 ...
Thu May 8 11:15:50 2014 - [info]
Thu May 8 11:05:49 2014 - [info] Waiting to execute all relay logs on
127.0.0.1(127.0.0.1:13003)..
Thu May 8 11:05:49 2014 - [info] master_pos_wait(mysql-bin.000013:639)
completed on 127.0.0.1(127.0.0.1:13003). Executed 0 events.
Thu May 8 11:05:49 2014 - [info] done.
Thu May 8 11:05:49 2014 - [info] Resetting slave 127.0.0.1(127.0.0.1:13003)
and starting replication from the new master 127.0.0.1(127.0.0.1:13002)..
Thu May 8 11:05:49 2014 - [info] Executed CHANGE MASTER.
# Replication on 13003
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 127.0.0.1
Master_User: repl
Master_Port: 13002
Connect_Retry: 60
Master_Log_File:
Read_Master_Log_Pos: 4
Relay_Log_File: centos2-relay-bin.000001
Relay_Log_Pos: 4
Relay_Master_Log_File:
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 0
Relay_Log_Space: 151
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 13002
Master_UUID: 8e3648e4-bc14-11e3-8d4c-0800272864ba
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp: 140508 11:05:49
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set: 8e349184-bc14-11e3-8d4c-0800272864ba:1-26,
8e3648e4-bc14-11e3-8d4c-0800272864ba:1-5
Auto_Position: 1
Then after a while, it will try to make 13001 a slave of 13002, which will also
fail:
Thu May 8 11:15:50 2014 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm,
ln862] Starting slave IO/SQL thread on 127.0.0.1(127.0.0.1:13003) failed!
Thu May 8 11:15:50 2014 - [info] End of log messages from 127.0.0.1 ...
Thu May 8 11:15:50 2014 - [info]
Thu May 8 11:15:50 2014 - [info] Unlocking all tables on the orig master:
Thu May 8 11:15:50 2014 - [info] Executing UNLOCK TABLES..
Thu May 8 11:15:50 2014 - [info] ok.
Thu May 8 11:15:50 2014 - [info] Starting orig master as a new slave..
Thu May 8 11:15:50 2014 - [info] Resetting slave 127.0.0.1(127.0.0.1:13001)
and starting replication from the new master 127.0.0.1(127.0.0.1:13002)..
Thu May 8 11:15:50 2014 - [info] Executed CHANGE MASTER.
# Replication on 13001
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 127.0.0.1
Master_User: repl
Master_Port: 13002
Connect_Retry: 60
Master_Log_File:
Read_Master_Log_Pos: 4
Relay_Log_File: centos2-relay-bin.000001
Relay_Log_Pos: 4
Relay_Master_Log_File:
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 0
Relay_Log_Space: 151
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 13002
Master_UUID: 8e3648e4-bc14-11e3-8d4c-0800272864ba
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp: 140508 11:15:50
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set: 8e349184-bc14-11e3-8d4c-0800272864ba:1-26,
8e3648e4-bc14-11e3-8d4c-0800272864ba:1-5
Auto_Position: 1
and then the script exits with an error:
Thu May 8 11:25:51 2014 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm,
ln784] Slave could not be started on 127.0.0.1(127.0.0.1:13001)! Check slave
status.
Thu May 8 11:25:51 2014 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm,
ln862] Starting slave IO/SQL thread on 127.0.0.1(127.0.0.1:13001) failed!
Thu May 8 11:25:51 2014 -
[error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln573] Failed!
Thu May 8 11:25:51 2014 -
[error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln602] Switching
master to 127.0.0.1(127.0.0.1:13002) done, but switching slaves partially
failed.
What version of the product are you using? On what operating system?
This was seen with MHA 0.56 on CentOS 6.5, and with MySQL 5.6.15.
Please provide any additional information below.
I think the script should check whether the new master has any errant
transactions before actually performing the switchover/failover. If such a
transaction is found, the script should exit with an error, as there is a
risk of breaking replication and being unable to complete the failover.
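A sketch of what such a check could look like using only the built-in GTID
functions (how it would be wired into MHA is left open; the paste-in value is
a placeholder):
-- Run on the current master (13001), pasting in the candidate master's
-- @@GLOBAL.gtid_executed as taken from 13002. A non-empty result means
-- the candidate has executed transactions the master never saw, i.e.
-- errant transactions, and the switchover should be aborted:
mysql> SELECT GTID_SUBTRACT('<gtid_executed from 13002>',
    ->                      @@GLOBAL.gtid_executed) AS errant_transactions;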
Original issue reported on code.google.com by stephane...@percona.com
on 8 May 2014 at 9:29
GoogleCodeExporter commented
Original comment by Yoshinor...@gmail.com
on 8 May 2014 at 5:59
- Changed state: Accepted
GoogleCodeExporter commented
Any updates on this?
Original comment by ege...@indeed.com
on 11 Sep 2014 at 6:01
GoogleCodeExporter commented
Hi All,
I suspect that's not entirely MasterHA's fault - it seems you just hit errant
transactions - see
http://www.percona.com/blog/2014/05/19/errant-transactions-major-hurdle-for-gtid-based-failover-in-mysql-5-6/
If you are using GTIDs, you should not write to a slave - period. Any write
creates a new transaction on the slave that is absent on the master, and
after log purging - boom! - your failover fails.
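For completeness (this fix is not part of the original comment): when such a
check finds an errant transaction, the usual remediation is to commit an
empty transaction with the same GTID on every server that is missing it, so
the GTID sets converge and auto-positioning stops requesting the purged
event:
-- On each server missing the errant GTID (here 13001 and 13003);
-- substitute the actual GTID found by the check:
mysql> SET GTID_NEXT = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx:N';
mysql> BEGIN; COMMIT;
mysql> SET GTID_NEXT = 'AUTOMATIC';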
We ran into the same problem because of pt-heartbeat running with SUPER
privileges. You need to be extremely cautious with SUPER users on slaves, or
apply a MySQL patch that adds a "super-read-only" mode, in which even a SUPER
user cannot write to a read-only slave.
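(For reference: at the time this meant a third-party patch; stock MySQL added
super_read_only in 5.7.8. On a server that has it:)
mysql> SET GLOBAL read_only = ON;
mysql> SET GLOBAL super_read_only = ON;  -- blocks writes even from SUPER users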
Or you can check for errant transactions before failover. Or both.
Original comment by denis.zhdanov
on 13 Nov 2014 at 9:16
GoogleCodeExporter commented
PS: it would be great to have an errant transaction check in
masterha_check_repl (will try to make a PR)
Original comment by denis.zhdanov
on 13 Nov 2014 at 9:21