EnterpriseDB/repmgr

I can not manually switchover a standby to primary online

Opened this issue · 10 comments

Hi,
At first, i built a 3 nodes repmgr cluster in env, dragon01 is the primary and the others is stadnby; then i start to swithover the standby database on dragon03 but failed, pls help to explain it and let me know how to solve this problem or it is a bug, thanks a lot.

1. the cluster env

postgresql 12.3
repmgr 5.1
dragon01 primary
dragon02 standby
dragon03 stadnby

2. the current cluster status

[postgres@dragon03 ~]$ repmgr -f repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+----------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------------
1 | dragon01 | primary | * running | | dc1 | 100 | 3 | host=dragon01 port=26432 user=repmgr dbname=repmgr connect_timeout=10
2 | dragon02 | standby | running | dragon01 | dc1 | 90 | 3 | host=dragon02 port=26432 user=repmgr dbname=repmgr connect_timeout=10
3 | dragon03 | standby | running | dragon01 | dc2 | 80 | 3 | host=dragon03 port=26432 user=repmgr dbname=repmgr connect_timeout=10

3. switchover but failed on running the remote_command, the debug message on below

[postgres@dragon03 ~]$ repmgr -f repmgr.conf standby switchover --siblings-follow --dry-run --verbose --log-level=DEBUG

DEBUG: remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01 /usr/local/pgsql123/bin/repmgr -f /home/postgres/repmgr.conf -L DEBUG node check --remote-node-id=3 --replication-connection
DEBUG: connecting to: "user=repmgr connect_timeout=10 dbname=repmgr host=dragon01 port=26432 fallback_application_name=repmgr"
ERROR: connection to database failed
DETAIL:
fe_sendauth: no password supplied

DEBUG: remote_command(): output returned was:
--connection=BAD

ERROR: demotion candidate is unable to make replication connection to promotion candidate

4. it is success when i run the command manual

[postgres@dragon03 ~]$ ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01
Last login: Sun Aug 16 16:14:52 2020 from dragon03
[postgres@dragon01 ~]$ /usr/local/pgsql123/bin/repmgr -f /home/postgres/repmgr.conf -L DEBUG node check --remote-node-id=3 --replication-connectio
DEBUG: connecting to: "user=repmgr connect_timeout=10 dbname=repmgr host=dragon01 port=26432 fallback_application_name=repmgr"
--connection=OK
[postgres@dragon01 ~]$

How are you storing the password?

I store the password in .pgpass file.

[postgres@dragon02 09:25:36 ~]# cat .pgpass
#hostname:port:database:username:password
dragon01:26432:repmgr:repmgr:repmgr
dragon01:26432:repmgr:postgres:postgres
dragon01:26432:replication:repmgr:repmgr
dragon02:26432:repmgr:repmgr:repmgr
dragon02:26432:repmgr:postgres:postgres
dragon02:26432:replication:repmgr:repmgr
dragon03:26432:repmgr:repmgr:repmgr
dragon03:26432:repmgr:postgres:postgres
dragon03:26432:replication:repmgr:repmgr

Is ~/.pgpass present and correct on all nodes?

From dragon03 can you execute: ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01 env and attach the output?

1. .pgpass on dragon01

[postgres@dragon01 10:00:42 ~]# ll .pgpass
-rw------- 1 postgres postgres 371 Aug 18 09:42 .pgpass

2. .pgpass on dragon02

[postgres@dragon02 10:00:51 ~]# ll .pgpass
-rw------- 1 postgres postgres 655 Aug 16 21:34 .pgpass

3. .pgpass and ssh output from dragon03

[postgres@dragon03 10:00:55 ~]# ll .pgpass
-rw------- 1 postgres postgres 295 Aug 16 21:34 .pgpass
[postgres@dragon03 10:00:57 ~]#
[postgres@dragon03 10:01:08 ~]# ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01 env
SHELL=/bin/bash
PWD=/home/postgres
LOGNAME=postgres
HOME=/home/postgres
LANG=en_US.UTF-8
SSH_CONNECTION=192.168.35.212 32260 192.168.35.210 22
USER=postgres
SHLVL=0
XDG_SESSION_ID=160
XDG_RUNTIME_DIR=/run/user/1000
SSH_CLIENT=192.168.35.212 32260 22
PATH=/usr/local/bin:/usr/bin
MAIL=/var/mail/postgres
_=/usr/bin/env

4. I mentioned that i can ssh to dragon01 from dragon03 and it success to run repmgr command to get the replication-connection

[postgres@dragon03 ~]$ ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01
Last login: Sun Aug 16 16:14:52 2020 from dragon03
[postgres@dragon01 ~]$ /usr/local/pgsql123/bin/repmgr -f /home/postgres/repmgr.conf -L DEBUG node check --remote-node-id=3 --replication-connectio
DEBUG: connecting to: "user=repmgr connect_timeout=10 dbname=repmgr host=dragon01 port=26432 fallback_application_name=repmgr"
--connection=OK

It's entirely possible that there are differences in the environment when executing a command via ssh, and when logging in directly and executing the same command.

Try (from dragon03):
ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01 "psql -d 'user=repmgr connect_timeout=10 dbname=repmgr host=dragon01 port=26432 fallback_application_name=repmgr' -c 'SELECT 1'"

  1. run the query provided by you is ok, but it still get BAD by running "/usr/local/pgsql123/bin/repmgr -f /home/postgres/repmgr.conf"

[postgres@dragon03 11:42:04 ~]# date
Tue Aug 18 11:42:07 CST 2020
[postgres@dragon03 11:42:07 ~]# ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01 "psql -d 'user=repmgr connect_timeout=10 dbname=repmgr host=dragon01 port=26432 fallback_application_name=repmgr' -c 'SELECT 1'"
?column?

    1

(1 row)

[postgres@dragon03 11:42:09 ~]#
[postgres@dragon03 11:42:51 ~]#ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01 /usr/local/pgsql123/bin/repmgr -f /home/postgres/repmgr.conf -L DEBUG node check --remote-node-id=3 --replication-connection
DEBUG: connecting to: "user=repmgr connect_timeout=10 dbname=repmgr host=dragon01 port=26432 fallback_application_name=repmgr"
--connection=BAD
ERROR: connection to database failed
DETAIL:
fe_sendauth: no password supplied

[postgres@dragon03 11:42:54 ~]#

Then please try (from dragon03):

ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01 "psql -d 'user=repmgr connect_timeout=10 replication=1 host=dragon03 port=26432 fallback_application_name=repmgr' -c 'IDENTIFY_SYSTEM'"

[postgres@dragon03 11:52:47 ~]#
<=repmgr connect_timeout=10 replication=1 host=dragon03 port=26432 fallback_application_name=repmgr' -c 'IDENTIFY_SYSTEM'"
systemid | timeline | xlogpos | dbname
---------------------+----------+------------+--------
6848010177502602935 | 5 | 0/2C804488 |
(1 row)

[postgres@dragon03 11:52:48 ~]#

OK, that narrows things down a bit, will see if the issue can be reproduced.

There is a method is to set .pgpass like below, i can manual switch a standby when set the .pgpass like this.
pgpass.txt