fantasyni/mysql-master-ha

masterha_check_repl skips all slave checks on MySQL 5.6 with GTID enabled

Opened this issue · 3 comments

If you look at this line of code:

https://github.com/yoshinorim/mha4mysql-manager/blob/master/lib/MHA/MasterMonito
r.pm#L415

It looks to me that masterha_check_repl skips all slave checks when using MySQL 
5.6 with GTIDS enabled.

Some of the slave checks are valid even when using GTIDs. For example, it 
doesn't verify that the MySQL user specified in the MHA configuration can 
connect to the master and has replication client privileges.

I can gather more information, if necessary.

Original issue reported on code.google.com by Geoff.Mo...@gmail.com on 11 Feb 2015 at 2:12

As a followup, here is an example of a masterha_check_repl failure that occurs 
on MariaDB 10.0. This check would be skipped on MySQL 5.6 if GTIDs are enabled, 
and a failure would happen when failover happens:

Wed Feb 18 10:57:15 2015 - [info] Checking slave configurations..
Wed Feb 18 10:57:15 2015 - [info]  read_only=1 is not set on slave 
192.168.1.46(192.168.1.46:3306).
Wed Feb 18 10:57:15 2015 - [warning]  relay_log_purge=0 is not set on slave 
192.168.1.46(192.168.1.46:3306).
Wed Feb 18 10:57:15 2015 - [info] Checking replication filtering settings..
Wed Feb 18 10:57:15 2015 - [info]  binlog_do_db= , binlog_ignore_db= 
Wed Feb 18 10:57:15 2015 - [info]  Replication filtering check ok.
Wed Feb 18 10:57:15 2015 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, 
ln393] 192.168.1.46(192.168.1.46:3306): User repl does not exist or does not 
have REPLICATION SLAVE privilege! Other slaves can not start replication from 
this host.
Wed Feb 18 10:57:16 2015 - 
[error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln424] Error 
happened on checking configurations.  at 
/usr/share/perl5/vendor_perl/MHA/ServerManager.pm line 1403
Wed Feb 18 10:57:16 2015 - 
[error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln523] Error 
happened on monitoring servers.

Original comment by Geoff.Mo...@gmail.com on 18 Feb 2015 at 4:00

With GTID enabled, MHA skips some configurations that aren't used on GTID based 
failover (i.e. SSH connections). But MHA still checks necessary replication 
configurations. In your case, MHA should check repl_user on GTID based 
failover. Could you show whole error messages on GTID based failover?

Original comment by Yoshinor...@gmail.com on 19 Feb 2015 at 6:57

Here's an example.

Let's say I have two MySQL 5.6 hosts, described by this MHA configuration:

[gmontee@localhost ~]$ cat /etc/mha.cnf 
[server default]
user=mhauser
password=password
ssh_user=gmontee
repl_user=repl
repl_password=password

[server1]
hostname=192.168.1.42
ip=192.168.1.42

[server2]
hostname=192.168.1.43
ip=192.168.1.43

192.168.1.42 is the current master.

According to masterha_check_repl, everything is OK:

[gmontee@localhost ~]$ masterha_check_repl --conf=/etc/mha.cnf 
Thu Feb 19 14:46:35 2015 - [warning] Global configuration file 
/etc/masterha_default.cnf not found. Skipping.
Thu Feb 19 14:46:35 2015 - [info] Reading application default configuration 
from /etc/mha.cnf..
Thu Feb 19 14:46:35 2015 - [info] Reading server configuration from 
/etc/mha.cnf..
Thu Feb 19 14:46:35 2015 - [info] MHA::MasterMonitor version 0.56.
Thu Feb 19 14:46:35 2015 - [info] GTID failover mode = 1
Thu Feb 19 14:46:35 2015 - [info] Dead Servers:
Thu Feb 19 14:46:35 2015 - [info] Alive Servers:
Thu Feb 19 14:46:35 2015 - [info]   192.168.1.42(192.168.1.42:3306)
Thu Feb 19 14:46:35 2015 - [info]   192.168.1.43(192.168.1.43:3306)
Thu Feb 19 14:46:35 2015 - [info] Alive Slaves:
Thu Feb 19 14:46:35 2015 - [info]   192.168.1.43(192.168.1.43:3306)  
Version=5.6.23-log (oldest major version between slaves) log-bin:enabled
Thu Feb 19 14:46:35 2015 - [info]     GTID ON
Thu Feb 19 14:46:35 2015 - [info]     Replicating from 
192.168.1.42(192.168.1.42:3306)
Thu Feb 19 14:46:35 2015 - [info] Current Alive Master: 
192.168.1.42(192.168.1.42:3306)
Thu Feb 19 14:46:35 2015 - [info] Checking slave configurations..
Thu Feb 19 14:46:35 2015 - [info]  read_only=1 is not set on slave 
192.168.1.43(192.168.1.43:3306).
Thu Feb 19 14:46:35 2015 - [info] Checking replication filtering settings..
Thu Feb 19 14:46:35 2015 - [info]  binlog_do_db= , binlog_ignore_db= 
Thu Feb 19 14:46:35 2015 - [info]  Replication filtering check ok.
Thu Feb 19 14:46:35 2015 - [info] GTID (with auto-pos) is supported. Skipping 
all SSH and Node package checking.
Thu Feb 19 14:46:35 2015 - [info] Checking SSH publickey authentication 
settings on the current master..
Thu Feb 19 14:46:36 2015 - [info] HealthCheck: SSH to 192.168.1.42 is reachable.
Thu Feb 19 14:46:36 2015 - [info] 
192.168.1.42(192.168.1.42:3306) (current master)
 +--192.168.1.43(192.168.1.43:3306)

Thu Feb 19 14:46:36 2015 - [info] Checking replication health on 192.168.1.43..
Thu Feb 19 14:46:36 2015 - [info]  ok.
Thu Feb 19 14:46:36 2015 - [warning] master_ip_failover_script is not defined.
Thu Feb 19 14:46:36 2015 - [warning] shutdown_script is not defined.
Thu Feb 19 14:46:36 2015 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

However, the current master doesn't actually have REPLICATION SLAVE privileges 
on the other node. Executing on the current slave (192.168.1.43):

mysql> SHOW GRANTS FOR 'repl'@'192.168.1.42';
ERROR 1141 (42000): There is no such grant defined for user 'repl' on host 
'192.168.1.42'
mysql> SELECT User, Host FROM mysql.user WHERE User='repl';
+------+--------------+
| User | Host         |
+------+--------------+
| repl | 192.168.1.43 |
+------+--------------+
1 row in set (0.00 sec)

So we can try to failover manually:

[gmontee@localhost ~]$ masterha_master_switch --conf=/etc/mha.cnf 
--master_state=alive --new_master_host=192.168.1.43 --orig_master_is_new_slave
Thu Feb 19 14:55:01 2015 - [info] MHA::MasterRotate version 0.56.
Thu Feb 19 14:55:01 2015 - [info] Starting online master switch..
Thu Feb 19 14:55:01 2015 - [info] 
Thu Feb 19 14:55:01 2015 - [info] * Phase 1: Configuration Check Phase..
Thu Feb 19 14:55:01 2015 - [info] 
Thu Feb 19 14:55:01 2015 - [warning] Global configuration file 
/etc/masterha_default.cnf not found. Skipping.
Thu Feb 19 14:55:01 2015 - [info] Reading application default configuration 
from /etc/mha.cnf..
Thu Feb 19 14:55:01 2015 - [info] Reading server configuration from 
/etc/mha.cnf..
Thu Feb 19 14:55:01 2015 - [info] GTID failover mode = 1
Thu Feb 19 14:55:01 2015 - [info] Current Alive Master: 
192.168.1.42(192.168.1.42:3306)
Thu Feb 19 14:55:01 2015 - [info] Alive Slaves:
Thu Feb 19 14:55:01 2015 - [info]   192.168.1.43(192.168.1.43:3306)  
Version=5.6.23-log (oldest major version between slaves) log-bin:enabled
Thu Feb 19 14:55:01 2015 - [info]     GTID ON
Thu Feb 19 14:55:01 2015 - [info]     Replicating from 
192.168.1.42(192.168.1.42:3306)

It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before 
switching. Is it ok to execute on 192.168.1.42(192.168.1.42:3306)? (YES/no): YES
Thu Feb 19 14:55:03 2015 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. 
This may take long time..
Thu Feb 19 14:55:03 2015 - [info]  ok.
Thu Feb 19 14:55:03 2015 - [info] Checking MHA is not monitoring or doing 
failover..
Thu Feb 19 14:55:03 2015 - [info] Checking replication health on 192.168.1.43..
Thu Feb 19 14:55:03 2015 - [info]  ok.
Thu Feb 19 14:55:03 2015 - [info] 192.168.1.43 can be new master.
Thu Feb 19 14:55:03 2015 - [info] 
From:
192.168.1.42(192.168.1.42:3306) (current master)
 +--192.168.1.43(192.168.1.43:3306)

To:
192.168.1.43(192.168.1.43:3306) (new master)
 +--192.168.1.42(192.168.1.42:3306)

Starting master switch from 192.168.1.42(192.168.1.42:3306) to 
192.168.1.43(192.168.1.43:3306)? (yes/NO): yes
Thu Feb 19 14:55:13 2015 - [info] Checking whether 
192.168.1.43(192.168.1.43:3306) is ok for the new master..
Thu Feb 19 14:55:13 2015 - [info]  ok.
Thu Feb 19 14:55:13 2015 - [info] ** Phase 1: Configuration Check Phase 
completed.
Thu Feb 19 14:55:13 2015 - [info] 
Thu Feb 19 14:55:13 2015 - [info] * Phase 2: Rejecting updates Phase..
Thu Feb 19 14:55:13 2015 - [info] 
master_ip_online_change_script is not defined. If you do not disable writes on 
the current master manually, applications keep writing on the current master. 
Is it ok to proceed? (yes/NO): yes
Thu Feb 19 14:55:19 2015 - [info] Locking all tables on the orig master to 
reject updates from everybody (including root):
Thu Feb 19 14:55:19 2015 - [info] Executing FLUSH TABLES WITH READ LOCK..
Thu Feb 19 14:55:19 2015 - [info]  ok.
Thu Feb 19 14:55:19 2015 - [info] Orig master binlog:pos is 
mysqld-bin.000005:191.
Thu Feb 19 14:55:19 2015 - [info]  Waiting to execute all relay logs on 
192.168.1.43(192.168.1.43:3306)..
Thu Feb 19 14:55:19 2015 - [info]  master_pos_wait(mysqld-bin.000005:191) 
completed on 192.168.1.43(192.168.1.43:3306). Executed 0 events.
Thu Feb 19 14:55:19 2015 - [info]   done.
Thu Feb 19 14:55:19 2015 - [info] Getting new master's binlog name and 
position..
Thu Feb 19 14:55:19 2015 - [info]  mysqld-bin.000005:379
Thu Feb 19 14:55:19 2015 - [info]  All other slaves should start replication 
from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.1.43', 
MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', 
MASTER_PASSWORD='xxx';
Thu Feb 19 14:55:19 2015 - [info] 
Thu Feb 19 14:55:19 2015 - [info] * Switching slaves in parallel..
Thu Feb 19 14:55:19 2015 - [info] 
Thu Feb 19 14:55:19 2015 - [info] Unlocking all tables on the orig master:
Thu Feb 19 14:55:19 2015 - [info] Executing UNLOCK TABLES..
Thu Feb 19 14:55:19 2015 - [info]  ok.
Thu Feb 19 14:55:19 2015 - [info] Starting orig master as a new slave..
Thu Feb 19 14:55:19 2015 - [info]  Resetting slave 
192.168.1.42(192.168.1.42:3306) and starting replication from the new master 
192.168.1.43(192.168.1.43:3306)..
Thu Feb 19 14:55:19 2015 - [info]  Executed CHANGE MASTER.

It seems to hang when it gets to the CHANGE MASTER step. SHOW SLAVE STATUS on 
the new slave (old master):

mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
               Slave_IO_State: Connecting to master
                  Master_Host: 192.168.1.43
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: 
          Read_Master_Log_Pos: 4
               Relay_Log_File: mysqld-relay-bin.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: 
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 151
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1130
                Last_IO_Error: error connecting to master 'repl@192.168.1.43:3306' - retry-time: 60  retries: 1
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 0
                  Master_UUID: 
             Master_Info_File: /var/lib/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 150219 14:58:20
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 9f0bcc7f-b2ee-11e4-96fa-08002772fdec:1-4
                Auto_Position: 1
1 row in set (0.00 sec)

This status makes sense, since I already know the correct repl_user 
('repl'@'192.168.1.42') doesn't exist on the new master.

However, if I go back to the way things were to start with (192.168.1.42 as 
master, 192.168.1.43 as slave), and delete the 'repl'@'192.168.1.43' account 
from the slave, then masterha_check_repl reports a failure:

[gmontee@localhost ~]$ masterha_check_repl --conf=/etc/mha.cnf 
Thu Feb 19 15:02:28 2015 - [warning] Global configuration file 
/etc/masterha_default.cnf not found. Skipping.
Thu Feb 19 15:02:28 2015 - [info] Reading application default configuration 
from /etc/mha.cnf..
Thu Feb 19 15:02:28 2015 - [info] Reading server configuration from 
/etc/mha.cnf..
Thu Feb 19 15:02:28 2015 - [info] MHA::MasterMonitor version 0.56.
Thu Feb 19 15:02:28 2015 - [info] GTID failover mode = 1
Thu Feb 19 15:02:28 2015 - [info] Dead Servers:
Thu Feb 19 15:02:28 2015 - [info] Alive Servers:
Thu Feb 19 15:02:28 2015 - [info]   192.168.1.42(192.168.1.42:3306)
Thu Feb 19 15:02:28 2015 - [info]   192.168.1.43(192.168.1.43:3306)
Thu Feb 19 15:02:28 2015 - [info] Alive Slaves:
Thu Feb 19 15:02:28 2015 - [info]   192.168.1.43(192.168.1.43:3306)  
Version=5.6.23-log (oldest major version between slaves) log-bin:enabled
Thu Feb 19 15:02:28 2015 - [info]     GTID ON
Thu Feb 19 15:02:28 2015 - [info]     Replicating from 
192.168.1.42(192.168.1.42:3306)
Thu Feb 19 15:02:28 2015 - [info] Current Alive Master: 
192.168.1.42(192.168.1.42:3306)
Thu Feb 19 15:02:28 2015 - [info] Checking slave configurations..
Thu Feb 19 15:02:28 2015 - [info]  read_only=1 is not set on slave 
192.168.1.43(192.168.1.43:3306).
Thu Feb 19 15:02:28 2015 - [info] Checking replication filtering settings..
Thu Feb 19 15:02:28 2015 - [info]  binlog_do_db= , binlog_ignore_db= 
Thu Feb 19 15:02:28 2015 - [info]  Replication filtering check ok.
Thu Feb 19 15:02:28 2015 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, 
ln393] 192.168.1.43(192.168.1.43:3306): User repl does not exist or does not 
have REPLICATION SLAVE privilege! Other slaves can not start replication from 
this host.
Thu Feb 19 15:02:28 2015 - 
[error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln424] Error 
happened on checking configurations.  at 
/usr/share/perl5/vendor_perl/MHA/ServerManager.pm line 1403
Thu Feb 19 15:02:28 2015 - 
[error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln523] Error 
happened on monitoring servers.
Thu Feb 19 15:02:28 2015 - [info] Got exit code 1 (Not master dead).

MySQL Replication Health is NOT OK!

So I guess the root of the issue is that it doesn't look at User@Host 
combinations. It just looks at the User field.

Original comment by Geoff.Mo...@gmail.com on 19 Feb 2015 at 8:07