EnterpriseDB/repmgr

2nd standby server fails to join the 1st standby server when Primary server dies

Opened this issue · 2 comments

I have three postgres9.6 containers in three separate hosts:

Server 1: Primary IP: 187
Server 2: Secondary IP: 186
Server 3: Secondary IP: 125

Everything is configured correctly in repmgr. When the primary server goes down, usually the 2nd server will promote itself to become Primary, the other server will try and follow the new primary but somehow it will fail, the Ip that it's showing is not the server IP (192.168.128.1) also I don't think it's a Docker issue since everything works except this part, is there anything I am doing wrong? here is the log when I try to manually follow the new primary and here is the repmgr.conf settings

Logs
postgres@pgs2:/$ repmgr -f /etc/postgresql/repmgr.conf standby follow --upstream-node-id=2 --dry-run -v -L DEBUG
NOTICE: using provided configuration file "/etc/postgresql/repmgr.conf"
DEBUG: do_standby_follow()
DEBUG: connecting to: "user=repmgr password=repmgr connect_timeout=2 dbname=repmgr host=x.x.x.125 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
SET synchronous_commit TO 'local'
INFO: connected to local node
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: get_node_record():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 3
DEBUG: get_node_record():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 2
DEBUG: connecting to: "user=repmgr password=repmgr connect_timeout=2 dbname=repmgr host=x.x.x.186 fallback_application_name=repmgr options=-csearch_path="
DEBUG: set_config():
SET synchronous_commit TO 'local'
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: follow target is primary node "s1" (ID: 2)
ERROR: connection to database failed
DETAIL:
FATAL: no pg_hba.conf entry for replication connection from host "192.168.128.1", user "repmgr", SSL off

Primary Server conf

node_id=1
node_name=p
conninfo='host=x.x.x.187 user=repmgr password=repmgr dbname=repmgr connect_timeout=2' #node_id IP
config_directory='/etc/postgresql'
data_directory='/var/lib/postgresql/data'
pg_bindir='/usr/lib/postgresql/9.6/bin'
failover=automatic
promote_command='/usr/bin/repmgr standby promote -f /etc/postgresql/repmgr.conf --log-to-file'
follow_command='/usr/bin/repmgr standby follow -f /etc/postgresql/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
monitor_interval_secs=5

Standby Server 1 conf

node_id=2
node_name=s1
conninfo='host=x.x.x.186 user=repmgr password=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/data/'
#config_directory='/etc/postgresql'
pg_bindir='/usr/lib/postgresql/9.6/bin'
#log_file='/var/log/repmgr.log'
failover=automatic
promote_command='/usr/bin/repmgr standby promote -f /etc/postgresql/repmgr.conf  --log-to-file'
follow_command='/usr/bin/repmgr standby follow -f /etc/postgresql/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
monitor_interval_secs=5

Standby Server 2 conf

node_id=3
node_name=s2
conninfo='host=x.x.x.125 user=repmgr password=repmgr dbname=repmgr connect_timeout=2'
#config_directory='/etc/postgresql/'
data_directory='/var/lib/postgresql/data'
pg_bindir='/usr/lib/postgresql/9.6/bin'
failover=automatic
promote_command='/usr/bin/repmgr standby promote -f /etc/postgresql/repmgr.conf --log-to-file'
follow_command='/usr/bin/repmgr standby follow -f /etc/postgresql/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
monitor_interval_secs=5

This is your problem:

FATAL: no pg_hba.conf entry for replication connection from host "192.168.128.1", user "repmgr", SSL off

You need to ensure the repmgr user can make a replication connection to the target node, e.g.:

host    replication  repmgr   192.168.128.1/32    trust     

Thanks, I've fixed this issue by making the network as host network, that way it was able to identify the IPs, it works great now, one minor issue I am having is whenever repmgr tries to restart the Postgres service, the container will die, I know I have to create another service so it doesn't terminate but do you have any other reliable solution?