Standby reattached to old master just after attaching to new master

Question

Closed this issue 5 years ago · 4 comments

OS: CentOS Linux release 7.6.1810 (Core)
PostgreSQL: postgresql11-11.5-1PGDG.rhel7.x86_64
Repmgrd setup: 1 primary (node1) + 2 standbys (node2,3)
Repmgrd version:

yum info repmgr11
Installed Packages
Name : repmgr11
Arch : x86_64
Version : 4.4.0
Release : 1.rhel7
Size : 1.0 M
Repo : installed
From repo : pgdg11

repmgr.conf on nodes[1,2,3] (10.99.169.[15,16,17])

node_id=[1,2,3]
node_name='node[1,2,3]'
conninfo='host=10.99.169.[15,16,17] port=5432 dbname=repmgr user=repmgr connect_timeout=2'
data_directory='/data/postgresql/11/data'
config_directory='/data/postgresql/11/data'
replication_user='repmgr'
replication_type=physical
use_replication_slots=no
failover=automatic
promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr/11/repmgr.conf --log-to-file'
follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr/11/repmgr.conf --log-to-file --upstream-node-id=%n'
service_start_command='sudo systemctl start postgresql-11'
service_stop_command='sudo systemctl stop postgresql-11'
service_restart_command='sudo systemctl restart postgresql-11'
service_reload_command='sudo systemctl reload postgresql-11'
log_file='/var/log/postgresql/repmgr.log'
log_level=NOTICE
reconnect_attempts=6
reconnect_interval=10
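
With the settings above, repmgrd retries an unreachable upstream `reconnect_attempts` times with `reconnect_interval` seconds between attempts, so roughly 6 × 10 = 60 seconds pass before failover begins (connection timeouts can add a little on top). A minimal sketch of that arithmetic, assuming simple `key=value` lines; the helper names `parse_repmgr_conf` and `approx_failover_delay` are hypothetical and not part of repmgr:

```python
def parse_repmgr_conf(text):
    """Parse key=value lines from repmgr.conf-style text into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so conninfo strings stay intact.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def approx_failover_delay(settings):
    """Approximate seconds repmgrd spends retrying before it starts failover."""
    return int(settings["reconnect_attempts"]) * int(settings["reconnect_interval"])


SAMPLE = """\
failover=automatic
reconnect_attempts=6
reconnect_interval=10
"""

print(approx_failover_delay(parse_repmgr_conf(SAMPLE)))  # 60
```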

Hello!

I have a very strange issue: when I stopped PostgreSQL on node1, node2 was promoted to be the new primary, but node3 failed to follow it cleanly. repmgrd on node3 first connected to node2 (the new primary), which is expected, and then immediately tried to connect back to node1 (the old primary), which is the weird part. Since node1 was down, the connection failed and repmgrd terminated itself.
After I restarted repmgrd on node3, it correctly recognized node2 as the current primary and has been working well.
From the PostgreSQL replication point of view, the switch on node3 went fine: node3 reconnected to node2 (the new primary) without any issues.

The logs from node[2,3]

Log file from repmgr on node2 (new primary): failover occurred and repmgrd chose node2 as the new primary; this part is OK:

[2019-09-25 22:05:34] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=10.99.169.15 port=5432 fallback_application_name=repmgr"
[2019-09-25 22:05:34] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2019-09-25 22:05:34] [WARNING] unable to reconnect to node 1 after 6 attempts
[2019-09-25 22:05:34] [NOTICE] promotion candidate is "node2" (ID: 2)
[2019-09-25 22:05:34] [NOTICE] this node is the winner, will now promote itself and inform other nodes
[2019-09-25 22:05:34] [NOTICE] redirecting logging output to "/var/log/postgresql/repmgr.log"
[2019-09-25 22:05:34] [WARNING] 1 sibling nodes found, but option "--siblings-follow" not specified
[2019-09-25 22:05:34] [DETAIL] these nodes will remain attached to the current primary:  node3 (node ID: 3)
[2019-09-25 22:05:34] [NOTICE] promoting standby to primary
[2019-09-25 22:05:34] [DETAIL] promoting server "node2" (ID: 2) using "/usr/pgsql-11/bin/pg_ctl  -w -D '/data/postgresql/11/data' promote"
[2019-09-25 22:05:34] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2019-09-25 22:05:34] [NOTICE] STANDBY PROMOTE successful
[2019-09-25 22:05:34] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2019-09-25 22:05:34] [NOTICE] notifying node "node3" (ID: 3) to follow node 2
INFO:  node 3 received notification to follow node 2
[2019-09-25 22:05:34] [NOTICE] monitoring cluster primary "node2" (ID: 2)
[2019-09-25 22:05:46] [NOTICE] new standby "node3" (ID: 3) has connected
[2019-09-25 22:10:47] [NOTICE] new standby "node1" (ID: 1) has connected
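
For reference, the timestamps in this log show node3 attaching to node2 about 12 seconds after the promotion, and node1 rejoining as a standby roughly five minutes later. A small sketch for pulling those delays out of a repmgrd log, assuming the line format shown above; the function name `seconds_until_standbys_connected` is hypothetical:

```python
import re
from datetime import datetime

# Matches lines such as: [2019-09-25 22:05:34] [NOTICE] message
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*)$")
CONNECTED_RE = re.compile(r'new standby "([^"]+)" \(ID: \d+\) has connected')


def seconds_until_standbys_connected(log_lines):
    """Map each standby name to the seconds between promotion and its connection."""
    promoted_at = None
    delays = {}
    for line in log_lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # lines without a timestamp, e.g. "INFO:  ..."
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(3)
        if "STANDBY PROMOTE successful" in message:
            promoted_at = stamp
            continue
        connected = CONNECTED_RE.search(message)
        if connected and promoted_at is not None:
            delays[connected.group(1)] = (stamp - promoted_at).total_seconds()
    return delays


NODE2_LOG = [
    '[2019-09-25 22:05:34] [NOTICE] STANDBY PROMOTE successful',
    'INFO:  node 3 received notification to follow node 2',
    '[2019-09-25 22:05:46] [NOTICE] new standby "node3" (ID: 3) has connected',
    '[2019-09-25 22:10:47] [NOTICE] new standby "node1" (ID: 1) has connected',
]

print(seconds_until_standbys_connected(NODE2_LOG))  # {'node3': 12.0, 'node1': 313.0}
```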

Log file from repmgr on node3 (standby): failover occurred, repmgrd on node3 was informed that the new primary is node2, connected to node2, and just after that tried to connect to node1 (old primary):

[2019-09-25 22:05:35] [WARNING] unable to reconnect to node 1 after 6 attempts
[2019-09-25 22:05:35] [WARNING] node "node2" (ID: 2) is not in recovery
[2019-09-25 22:05:35] [NOTICE] redirecting logging output to "/var/log/postgresql/repmgr.log"
[2019-09-25 22:05:35] [NOTICE] setting node 3's upstream to node 2
[2019-09-25 22:05:35] [NOTICE] restarting server using "sudo systemctl restart postgresql-11"
[2019-09-25 22:05:40] [NOTICE] STANDBY FOLLOW successful
[2019-09-25 22:05:40] [DETAIL] standby attached to upstream node "node2" (ID: 2)
INFO:  set_repmgrd_pid(): provided pidfile is /run/repmgr/repmgrd-11.pid
[2019-09-25 22:05:40] [NOTICE] node 3 now following new upstream node 2
[2019-09-25 22:05:41] [NOTICE] local node 3's upstream appears to have changed, restarting monitoring
[2019-09-25 22:05:41] [DETAIL] currently monitoring upstream 2; new upstream is 1
[2019-09-25 22:05:41] [ERROR] connection to database failed
[2019-09-25 22:05:41] [DETAIL] 
could not connect to server: Connection refused
        Is the server running on host "10.99.169.15" and accepting
        TCP/IP connections on port 5432?

[2019-09-25 22:05:41] [DETAIL] attempted to connect using:
  user=repmgr connect_timeout=2 dbname=repmgr host=10.99.169.15 port=5432 fallback_application_name=repmgr
[2019-09-25 22:05:41] [ERROR] unable connect to upstream node (ID: 1), terminating
[2019-09-25 22:05:41] [HINT] upstream node must be running before repmgrd can start
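
The key symptom is the pair of lines at 22:05:40 and 22:05:41: node 3 reports that it is following node 2, and one second later repmgrd claims the upstream has changed back to node 1. A rough sketch for spotting that pattern in a repmgrd log, assuming the message wording shown above; `find_upstream_flipbacks` is a hypothetical helper, not a repmgr tool:

```python
import re

FOLLOW_RE = re.compile(r"node (\d+) now following new upstream node (\d+)")
CHANGE_RE = re.compile(r"currently monitoring upstream (\d+); new upstream is (\d+)")


def find_upstream_flipbacks(log_lines):
    """Return (followed, reverted_to) pairs where repmgrd left the upstream it had just followed."""
    flipbacks = []
    followed = None
    for line in log_lines:
        follow = FOLLOW_RE.search(line)
        if follow:
            # Remember the upstream this node most recently attached to.
            followed = int(follow.group(2))
            continue
        change = CHANGE_RE.search(line)
        if change and followed is not None:
            monitoring, new_upstream = int(change.group(1)), int(change.group(2))
            if monitoring == followed and new_upstream != followed:
                flipbacks.append((followed, new_upstream))
    return flipbacks


NODE3_LOG = [
    "[2019-09-25 22:05:40] [NOTICE] STANDBY FOLLOW successful",
    "[2019-09-25 22:05:40] [NOTICE] node 3 now following new upstream node 2",
    "[2019-09-25 22:05:41] [NOTICE] local node 3's upstream appears to have changed, restarting monitoring",
    "[2019-09-25 22:05:41] [DETAIL] currently monitoring upstream 2; new upstream is 1",
]

print(find_upstream_flipbacks(NODE3_LOG))  # [(2, 1)]
```

An empty list means no such reversal was logged.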

Answer 1 · 2019-09-30T04:00:11.000Z

Which repmgr version is this?

Answer 2 · 2019-09-30T06:23:14.000Z

Hello @ibarwick!
repmgr version:

yum info repmgr11
Installed Packages
Name : repmgr11
Arch : x86_64
Version : 4.4.0
Release : 1.rhel7
Size : 1.0 M
Repo : installed
From repo : pgdg11

Answer 3 · 2019-10-22T16:13:29.000Z

I have tested repmgr 5.0 and it works correctly! Thank you for fixing this issue!

# yum info repmgr11
Installed Packages
Name        : repmgr11
Arch        : x86_64
Version     : 5.0.0
Release     : 1.rhel7
Size        : 1.1 M
Repo        : installed
From repo   : pgdg11

Answer 4 · 2019-10-23T00:44:10.000Z

Thanks for the confirmation, much appreciated!