Agent cannot resume a slave from previous position
When a slave loses its connection to the master for some reason (e.g. a network issue) and Pacemaker takes it down, it does not resume from the replication coordinates it had before the shutdown when it comes back up. Instead, it picks up REPL_INFO from the cluster configuration, which points to an older position.
I had a database problem yesterday, and when the slave tried to come up again, there was no REPL_INFO in the configuration. I think this is because I upgraded the resource script and have not switched masters since; REPL_INFO seems to be updated only when the master is started or promoted. This triggered an error, of course, which turned out to be my fault.
But in the process, I found out why the script doesn't seem to resume replication. In set_master, this should happen if
"$new_master" = "$master_host"
Alas, on my system, $new_master (from REPL_INFO) and $master_host (from get_slave_info) are a node name and an FQDN respectively, so the comparison never matches. That may be because I'm running Debian. If you execute "hostname" on Red Hat, you get an FQDN, so maybe your cluster nodes are named after FQDNs too.
In any case, I committed a patch over here: https://github.com/FrankVanDamme/percona-pacemaker-agents
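For illustration only, and not taken from the linked patch: a minimal sketch of one way to make the comparison tolerant of the node name vs. FQDN difference described above. The helper names short_host and same_host are hypothetical, and the sample values are assumptions based on the host names in this thread. The sketch assumes both values are host names (it would mangle an IP address) and that short names are unique within the cluster.

```shell
#!/bin/sh
# Hypothetical helper: reduce a host name to its short, lower-case form
# by dropping everything from the first dot onward.
short_host() {
    printf '%s\n' "${1%%.*}" | tr '[:upper:]' '[:lower:]'
}

# Hypothetical helper: succeed when both names share the same short form.
same_host() {
    [ "$(short_host "$1")" = "$(short_host "$2")" ]
}

# Assumed sample values: a node name as stored in REPL_INFO and an FQDN
# as reported in Master_Host by SHOW SLAVE STATUS.
new_master="ha02"
master_host="ha02.localdomain"

if same_host "$new_master" "$master_host"; then
    echo "same master: keep existing coordinates"
else
    echo "different master: CHANGE MASTER TO is needed"
fi
```

Whether stripping the domain is acceptable depends on how the cluster nodes are named; resolving both names to an address would be a stricter alternative.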
Good point, I have not really looked deeper into this but your patch makes sense. Can you send a pull request so Yves can review?
If the master_host in "show slave status\G" is the same as the master defined in the Pacemaker REPL_INFO, replication should just resume from the previous coordinates. Something is not right in your setup; we need to figure out what exactly caused the master to be different. Any additional details on what happened?
Yves,
https://www.dropbox.com/s/f09xhs2viarjg9n/prm-issues-12-and-9.tgz
Same set of logs from issue #12, but you can see here on ha01 when it was dropped as slave.
140708 0:38:15 [Note] Slave I/O thread: connected to master 'revin@ha02.localdomain:3306',replication started in log 'mysql-bin.000003' at position 26213087
140708 0:38:15 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000003' at position 26213087, relay log './mysqld-relay-bin.000001' position: 4
140708 0:59:51 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
140708 0:59:51 [Note] Slave I/O thread killed while reading event
140708 0:59:51 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000003', position 31610712
140708 0:59:51 [Note] Error reading relay log event: slave SQL thread was killed
140708 1:00:16 [Note] /usr/sbin/mysqld: Normal shutdown
When it shut down, the I/O thread had read up to 'mysql-bin.000003', position 31610712; however, when it started back up, it resumed from an old position. One would expect that master.info might not have been in sync, but that is not the case: there was a CHANGE MASTER TO statement coming from the agent, shown below.
140708 1:00:23 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.35-33.0-log' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona Server (GPL), Release rel33.0, Revision 611
140708 1:00:26 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=mysqld-relay-bin' to avoid this problem.
140708 1:00:26 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='ha02.localdomain', master_port='3306', master_log_file='mysql-bin.000003', master_log_pos='26213087'.
140708 1:00:26 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000003' at position 26213087, relay log './mysqld-relay-bin.000001' position: 4
140708 1:00:26 [Note] Slave I/O thread: connected to master 'revin@ha02.localdomain:3306',replication started in log 'mysql-bin.000003' at position 26213087
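The rewind visible in these logs can be checked mechanically. The following is a rough sketch, not part of the agent: the helper names last_read_pos and last_change_pos are hypothetical, the two sample lines are copied from the excerpts above, and the comparison assumes both positions refer to the same binary log file (mysql-bin.000003 here).

```shell
#!/bin/sh
# Hypothetical helper: print the position from the last
# "Slave I/O thread exiting, read up to log ..., position N" line on stdin.
last_read_pos() {
    sed -n "s/.*Slave I\/O thread exiting, read up to log '[^']*', position \([0-9][0-9]*\).*/\1/p" | tail -n 1
}

# Hypothetical helper: print the new master_log_pos from the last
# 'CHANGE MASTER TO executed' note on stdin.
last_change_pos() {
    sed -n "s/.*New state .*master_log_pos='\([0-9][0-9]*\)'.*/\1/p" | tail -n 1
}

# Sample input: two lines taken from the log excerpts in this thread.
log=$(cat <<'EOF'
140708 0:59:51 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000003', position 31610712
140708 1:00:26 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='ha02.localdomain', master_port='3306', master_log_file='mysql-bin.000003', master_log_pos='26213087'.
EOF
)

read_pos=$(printf '%s\n' "$log" | last_read_pos)
change_pos=$(printf '%s\n' "$log" | last_change_pos)

# Assumes both positions belong to the same binlog file.
if [ "$change_pos" -lt "$read_pos" ]; then
    echo "CHANGE MASTER TO rewound replication by $((read_pos - change_pos)) bytes"
else
    echo "no rewind detected"
fi
```

On a real server the same helpers could be fed the MySQL error log instead of the inline sample; a complete check would also compare the binlog file names before comparing positions.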