masterha_check_repl skips all slave checks on MySQL 5.6 with GTID enabled
Opened this issue · 3 comments
GoogleCodeExporter commented
If you look at this line of code:
https://github.com/yoshinorim/mha4mysql-manager/blob/master/lib/MHA/MasterMonito
r.pm#L415
It looks to me that masterha_check_repl skips all slave checks when using MySQL
5.6 with GTIDS enabled.
Some of the slave checks are valid even when using GTIDs. For example, it
doesn't verify that the MySQL user specified in the MHA configuration can
connect to the master and has replication client privileges.
I can gather more information, if necessary.
Original issue reported on code.google.com by Geoff.Mo...@gmail.com
on 11 Feb 2015 at 2:12
GoogleCodeExporter commented
As a followup, here is an example of a masterha_check_repl failure that occurs
on MariaDB 10.0. This check would be skipped on MySQL 5.6 if GTIDs are enabled,
and a failure would happen when failover happens:
Wed Feb 18 10:57:15 2015 - [info] Checking slave configurations..
Wed Feb 18 10:57:15 2015 - [info] read_only=1 is not set on slave
192.168.1.46(192.168.1.46:3306).
Wed Feb 18 10:57:15 2015 - [warning] relay_log_purge=0 is not set on slave
192.168.1.46(192.168.1.46:3306).
Wed Feb 18 10:57:15 2015 - [info] Checking replication filtering settings..
Wed Feb 18 10:57:15 2015 - [info] binlog_do_db= , binlog_ignore_db=
Wed Feb 18 10:57:15 2015 - [info] Replication filtering check ok.
Wed Feb 18 10:57:15 2015 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm,
ln393] 192.168.1.46(192.168.1.46:3306): User repl does not exist or does not
have REPLICATION SLAVE privilege! Other slaves can not start replication from
this host.
Wed Feb 18 10:57:16 2015 -
[error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln424] Error
happened on checking configurations. at
/usr/share/perl5/vendor_perl/MHA/ServerManager.pm line 1403
Wed Feb 18 10:57:16 2015 -
[error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln523] Error
happened on monitoring servers.
Original comment by Geoff.Mo...@gmail.com
on 18 Feb 2015 at 4:00
GoogleCodeExporter commented
With GTID enabled, MHA skips some configurations that aren't used on GTID based
failover (i.e. SSH connections). But MHA still checks necessary replication
configurations. In your case, MHA should check repl_user on GTID based
failover. Could you show whole error messages on GTID based failover?
Original comment by Yoshinor...@gmail.com
on 19 Feb 2015 at 6:57
GoogleCodeExporter commented
Here's an example.
Let's say I have two MySQL 5.6 hosts, described by this MHA configuration:
[gmontee@localhost ~]$ cat /etc/mha.cnf
[server default]
user=mhauser
password=password
ssh_user=gmontee
repl_user=repl
repl_password=password
[server1]
hostname=192.168.1.42
ip=192.168.1.42
[server2]
hostname=192.168.1.43
ip=192.168.1.43
192.168.1.42 is the current master.
According to masterha_check_repl, everything is OK:
[gmontee@localhost ~]$ masterha_check_repl --conf=/etc/mha.cnf
Thu Feb 19 14:46:35 2015 - [warning] Global configuration file
/etc/masterha_default.cnf not found. Skipping.
Thu Feb 19 14:46:35 2015 - [info] Reading application default configuration
from /etc/mha.cnf..
Thu Feb 19 14:46:35 2015 - [info] Reading server configuration from
/etc/mha.cnf..
Thu Feb 19 14:46:35 2015 - [info] MHA::MasterMonitor version 0.56.
Thu Feb 19 14:46:35 2015 - [info] GTID failover mode = 1
Thu Feb 19 14:46:35 2015 - [info] Dead Servers:
Thu Feb 19 14:46:35 2015 - [info] Alive Servers:
Thu Feb 19 14:46:35 2015 - [info] 192.168.1.42(192.168.1.42:3306)
Thu Feb 19 14:46:35 2015 - [info] 192.168.1.43(192.168.1.43:3306)
Thu Feb 19 14:46:35 2015 - [info] Alive Slaves:
Thu Feb 19 14:46:35 2015 - [info] 192.168.1.43(192.168.1.43:3306)
Version=5.6.23-log (oldest major version between slaves) log-bin:enabled
Thu Feb 19 14:46:35 2015 - [info] GTID ON
Thu Feb 19 14:46:35 2015 - [info] Replicating from
192.168.1.42(192.168.1.42:3306)
Thu Feb 19 14:46:35 2015 - [info] Current Alive Master:
192.168.1.42(192.168.1.42:3306)
Thu Feb 19 14:46:35 2015 - [info] Checking slave configurations..
Thu Feb 19 14:46:35 2015 - [info] read_only=1 is not set on slave
192.168.1.43(192.168.1.43:3306).
Thu Feb 19 14:46:35 2015 - [info] Checking replication filtering settings..
Thu Feb 19 14:46:35 2015 - [info] binlog_do_db= , binlog_ignore_db=
Thu Feb 19 14:46:35 2015 - [info] Replication filtering check ok.
Thu Feb 19 14:46:35 2015 - [info] GTID (with auto-pos) is supported. Skipping
all SSH and Node package checking.
Thu Feb 19 14:46:35 2015 - [info] Checking SSH publickey authentication
settings on the current master..
Thu Feb 19 14:46:36 2015 - [info] HealthCheck: SSH to 192.168.1.42 is reachable.
Thu Feb 19 14:46:36 2015 - [info]
192.168.1.42(192.168.1.42:3306) (current master)
+--192.168.1.43(192.168.1.43:3306)
Thu Feb 19 14:46:36 2015 - [info] Checking replication health on 192.168.1.43..
Thu Feb 19 14:46:36 2015 - [info] ok.
Thu Feb 19 14:46:36 2015 - [warning] master_ip_failover_script is not defined.
Thu Feb 19 14:46:36 2015 - [warning] shutdown_script is not defined.
Thu Feb 19 14:46:36 2015 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.
However, the current master doesn't actually have REPLICATION SLAVE privileges
on the other node. Executing on the current slave (192.168.1.43):
mysql> SHOW GRANTS FOR 'repl'@'192.168.1.42';
ERROR 1141 (42000): There is no such grant defined for user 'repl' on host
'192.168.1.42'
mysql> SELECT User, Host FROM mysql.user WHERE User='repl';
+------+--------------+
| User | Host |
+------+--------------+
| repl | 192.168.1.43 |
+------+--------------+
1 row in set (0.00 sec)
So we can try to failover manually:
[gmontee@localhost ~]$ masterha_master_switch --conf=/etc/mha.cnf
--master_state=alive --new_master_host=192.168.1.43 --orig_master_is_new_slave
Thu Feb 19 14:55:01 2015 - [info] MHA::MasterRotate version 0.56.
Thu Feb 19 14:55:01 2015 - [info] Starting online master switch..
Thu Feb 19 14:55:01 2015 - [info]
Thu Feb 19 14:55:01 2015 - [info] * Phase 1: Configuration Check Phase..
Thu Feb 19 14:55:01 2015 - [info]
Thu Feb 19 14:55:01 2015 - [warning] Global configuration file
/etc/masterha_default.cnf not found. Skipping.
Thu Feb 19 14:55:01 2015 - [info] Reading application default configuration
from /etc/mha.cnf..
Thu Feb 19 14:55:01 2015 - [info] Reading server configuration from
/etc/mha.cnf..
Thu Feb 19 14:55:01 2015 - [info] GTID failover mode = 1
Thu Feb 19 14:55:01 2015 - [info] Current Alive Master:
192.168.1.42(192.168.1.42:3306)
Thu Feb 19 14:55:01 2015 - [info] Alive Slaves:
Thu Feb 19 14:55:01 2015 - [info] 192.168.1.43(192.168.1.43:3306)
Version=5.6.23-log (oldest major version between slaves) log-bin:enabled
Thu Feb 19 14:55:01 2015 - [info] GTID ON
Thu Feb 19 14:55:01 2015 - [info] Replicating from
192.168.1.42(192.168.1.42:3306)
It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before
switching. Is it ok to execute on 192.168.1.42(192.168.1.42:3306)? (YES/no): YES
Thu Feb 19 14:55:03 2015 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES.
This may take long time..
Thu Feb 19 14:55:03 2015 - [info] ok.
Thu Feb 19 14:55:03 2015 - [info] Checking MHA is not monitoring or doing
failover..
Thu Feb 19 14:55:03 2015 - [info] Checking replication health on 192.168.1.43..
Thu Feb 19 14:55:03 2015 - [info] ok.
Thu Feb 19 14:55:03 2015 - [info] 192.168.1.43 can be new master.
Thu Feb 19 14:55:03 2015 - [info]
From:
192.168.1.42(192.168.1.42:3306) (current master)
+--192.168.1.43(192.168.1.43:3306)
To:
192.168.1.43(192.168.1.43:3306) (new master)
+--192.168.1.42(192.168.1.42:3306)
Starting master switch from 192.168.1.42(192.168.1.42:3306) to
192.168.1.43(192.168.1.43:3306)? (yes/NO): yes
Thu Feb 19 14:55:13 2015 - [info] Checking whether
192.168.1.43(192.168.1.43:3306) is ok for the new master..
Thu Feb 19 14:55:13 2015 - [info] ok.
Thu Feb 19 14:55:13 2015 - [info] ** Phase 1: Configuration Check Phase
completed.
Thu Feb 19 14:55:13 2015 - [info]
Thu Feb 19 14:55:13 2015 - [info] * Phase 2: Rejecting updates Phase..
Thu Feb 19 14:55:13 2015 - [info]
master_ip_online_change_script is not defined. If you do not disable writes on
the current master manually, applications keep writing on the current master.
Is it ok to proceed? (yes/NO): yes
Thu Feb 19 14:55:19 2015 - [info] Locking all tables on the orig master to
reject updates from everybody (including root):
Thu Feb 19 14:55:19 2015 - [info] Executing FLUSH TABLES WITH READ LOCK..
Thu Feb 19 14:55:19 2015 - [info] ok.
Thu Feb 19 14:55:19 2015 - [info] Orig master binlog:pos is
mysqld-bin.000005:191.
Thu Feb 19 14:55:19 2015 - [info] Waiting to execute all relay logs on
192.168.1.43(192.168.1.43:3306)..
Thu Feb 19 14:55:19 2015 - [info] master_pos_wait(mysqld-bin.000005:191)
completed on 192.168.1.43(192.168.1.43:3306). Executed 0 events.
Thu Feb 19 14:55:19 2015 - [info] done.
Thu Feb 19 14:55:19 2015 - [info] Getting new master's binlog name and
position..
Thu Feb 19 14:55:19 2015 - [info] mysqld-bin.000005:379
Thu Feb 19 14:55:19 2015 - [info] All other slaves should start replication
from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.1.43',
MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl',
MASTER_PASSWORD='xxx';
Thu Feb 19 14:55:19 2015 - [info]
Thu Feb 19 14:55:19 2015 - [info] * Switching slaves in parallel..
Thu Feb 19 14:55:19 2015 - [info]
Thu Feb 19 14:55:19 2015 - [info] Unlocking all tables on the orig master:
Thu Feb 19 14:55:19 2015 - [info] Executing UNLOCK TABLES..
Thu Feb 19 14:55:19 2015 - [info] ok.
Thu Feb 19 14:55:19 2015 - [info] Starting orig master as a new slave..
Thu Feb 19 14:55:19 2015 - [info] Resetting slave
192.168.1.42(192.168.1.42:3306) and starting replication from the new master
192.168.1.43(192.168.1.43:3306)..
Thu Feb 19 14:55:19 2015 - [info] Executed CHANGE MASTER.
It seems to hang when it gets to the CHANGE MASTER step. SHOW SLAVE STATUS on
the new slave (old master):
mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Connecting to master
Master_Host: 192.168.1.43
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File:
Read_Master_Log_Pos: 4
Relay_Log_File: mysqld-relay-bin.000001
Relay_Log_Pos: 4
Relay_Master_Log_File:
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 0
Relay_Log_Space: 151
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 1130
Last_IO_Error: error connecting to master 'repl@192.168.1.43:3306' - retry-time: 60 retries: 1
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 0
Master_UUID:
Master_Info_File: /var/lib/mysql/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp: 150219 14:58:20
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set: 9f0bcc7f-b2ee-11e4-96fa-08002772fdec:1-4
Auto_Position: 1
1 row in set (0.00 sec)
This status makes sense, since I already know the correct repl_user
('repl'@'192.168.1.42') doesn't exist on the new master.
However, if I go back to the way things were to start with (192.168.1.42 as
master, 192.168.1.43 as slave), and delete the 'repl'@'192.168.1.43' account
from the slave, then masterha_check_repl reports a failure:
[gmontee@localhost ~]$ masterha_check_repl --conf=/etc/mha.cnf
Thu Feb 19 15:02:28 2015 - [warning] Global configuration file
/etc/masterha_default.cnf not found. Skipping.
Thu Feb 19 15:02:28 2015 - [info] Reading application default configuration
from /etc/mha.cnf..
Thu Feb 19 15:02:28 2015 - [info] Reading server configuration from
/etc/mha.cnf..
Thu Feb 19 15:02:28 2015 - [info] MHA::MasterMonitor version 0.56.
Thu Feb 19 15:02:28 2015 - [info] GTID failover mode = 1
Thu Feb 19 15:02:28 2015 - [info] Dead Servers:
Thu Feb 19 15:02:28 2015 - [info] Alive Servers:
Thu Feb 19 15:02:28 2015 - [info] 192.168.1.42(192.168.1.42:3306)
Thu Feb 19 15:02:28 2015 - [info] 192.168.1.43(192.168.1.43:3306)
Thu Feb 19 15:02:28 2015 - [info] Alive Slaves:
Thu Feb 19 15:02:28 2015 - [info] 192.168.1.43(192.168.1.43:3306)
Version=5.6.23-log (oldest major version between slaves) log-bin:enabled
Thu Feb 19 15:02:28 2015 - [info] GTID ON
Thu Feb 19 15:02:28 2015 - [info] Replicating from
192.168.1.42(192.168.1.42:3306)
Thu Feb 19 15:02:28 2015 - [info] Current Alive Master:
192.168.1.42(192.168.1.42:3306)
Thu Feb 19 15:02:28 2015 - [info] Checking slave configurations..
Thu Feb 19 15:02:28 2015 - [info] read_only=1 is not set on slave
192.168.1.43(192.168.1.43:3306).
Thu Feb 19 15:02:28 2015 - [info] Checking replication filtering settings..
Thu Feb 19 15:02:28 2015 - [info] binlog_do_db= , binlog_ignore_db=
Thu Feb 19 15:02:28 2015 - [info] Replication filtering check ok.
Thu Feb 19 15:02:28 2015 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm,
ln393] 192.168.1.43(192.168.1.43:3306): User repl does not exist or does not
have REPLICATION SLAVE privilege! Other slaves can not start replication from
this host.
Thu Feb 19 15:02:28 2015 -
[error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln424] Error
happened on checking configurations. at
/usr/share/perl5/vendor_perl/MHA/ServerManager.pm line 1403
Thu Feb 19 15:02:28 2015 -
[error][/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm, ln523] Error
happened on monitoring servers.
Thu Feb 19 15:02:28 2015 - [info] Got exit code 1 (Not master dead).
MySQL Replication Health is NOT OK!
So I guess the root of the issue is that it doesn't look at User@Host
combinations. It just looks at the User field.
Original comment by Geoff.Mo...@gmail.com
on 19 Feb 2015 at 8:07