I can not manually switchover a standby to primary online

Question

I can not manually switchover a standby to primary online

Opened this issue 4 years ago · 10 comments

Hi,
At first, i built a 3 nodes repmgr cluster in env, dragon01 is the primary and the others is stadnby; then i start to swithover the standby database on dragon03 but failed, pls help to explain it and let me know how to solve this problem or it is a bug, thanks a lot.

1. the cluster env

postgresql 12.3
repmgr 5.1
dragon01 primary
dragon02 standby
dragon03 stadnby

2. the current cluster status

3. switchover but failed on running the remote_command, the debug message on below

[postgres@dragon03 ~]$ repmgr -f repmgr.conf standby switchover --siblings-follow --dry-run --verbose --log-level=DEBUG

DEBUG: remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01 /usr/local/pgsql123/bin/repmgr -f /home/postgres/repmgr.conf -L DEBUG node check --remote-node-id=3 --replication-connection
DEBUG: connecting to: "user=repmgr connect_timeout=10 dbname=repmgr host=dragon01 port=26432 fallback_application_name=repmgr"
ERROR: connection to database failed
DETAIL:
fe_sendauth: no password supplied

DEBUG: remote_command(): output returned was:
--connection=BAD

ERROR: demotion candidate is unable to make replication connection to promotion candidate

4. it is success when i run the command manual

[postgres@dragon03 ~]$ ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01
Last login: Sun Aug 16 16:14:52 2020 from dragon03
[postgres@dragon01 ~]$ /usr/local/pgsql123/bin/repmgr -f /home/postgres/repmgr.conf -L DEBUG node check --remote-node-id=3 --replication-connectio
DEBUG: connecting to: "user=repmgr connect_timeout=10 dbname=repmgr host=dragon01 port=26432 fallback_application_name=repmgr"
--connection=OK
[postgres@dragon01 ~]$

Answer 1 · 2020-08-18T01:21:17.000Z

How are you storing the password?

Answer 2 · 2020-08-18T01:25:04.000Z

I store the password in .pgpass file.

[postgres@dragon02 09:25:36 ~]# cat .pgpass
#hostname:port:database:username:password
dragon01:26432:repmgr:repmgr:repmgr
dragon01:26432:repmgr:postgres:postgres
dragon01:26432:replication:repmgr:repmgr
dragon02:26432:repmgr:repmgr:repmgr
dragon02:26432:repmgr:postgres:postgres
dragon02:26432:replication:repmgr:repmgr
dragon03:26432:repmgr:repmgr:repmgr
dragon03:26432:repmgr:postgres:postgres
dragon03:26432:replication:repmgr:repmgr

Answer 3 · 2020-08-18T01:53:57.000Z

Is ~/.pgpass present and correct on all nodes?

From dragon03 can you execute: ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01 env and attach the output?

Answer 4 · 2020-08-18T02:04:39.000Z

1. .pgpass on dragon01

[postgres@dragon01 10:00:42 ~]# ll .pgpass
-rw------- 1 postgres postgres 371 Aug 18 09:42 .pgpass

2. .pgpass on dragon02

[postgres@dragon02 10:00:51 ~]# ll .pgpass
-rw------- 1 postgres postgres 655 Aug 16 21:34 .pgpass

3. .pgpass and ssh output from dragon03

[postgres@dragon03 10:00:55 ~]# ll .pgpass
-rw------- 1 postgres postgres 295 Aug 16 21:34 .pgpass
[postgres@dragon03 10:00:57 ~]#
[postgres@dragon03 10:01:08 ~]# ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01 env
SHELL=/bin/bash
PWD=/home/postgres
LOGNAME=postgres
HOME=/home/postgres
LANG=en_US.UTF-8
SSH_CONNECTION=192.168.35.212 32260 192.168.35.210 22
USER=postgres
SHLVL=0
XDG_SESSION_ID=160
XDG_RUNTIME_DIR=/run/user/1000
SSH_CLIENT=192.168.35.212 32260 22
PATH=/usr/local/bin:/usr/bin
MAIL=/var/mail/postgres
_=/usr/bin/env

4. I mentioned that i can ssh to dragon01 from dragon03 and it success to run repmgr command to get the replication-connection

[postgres@dragon03 ~]$ ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01
Last login: Sun Aug 16 16:14:52 2020 from dragon03
[postgres@dragon01 ~]$ /usr/local/pgsql123/bin/repmgr -f /home/postgres/repmgr.conf -L DEBUG node check --remote-node-id=3 --replication-connectio
DEBUG: connecting to: "user=repmgr connect_timeout=10 dbname=repmgr host=dragon01 port=26432 fallback_application_name=repmgr"
--connection=OK

Answer 5 · 2020-08-18T03:37:30.000Z

It's entirely possible that there are differences in the environment when executing a command via ssh, and when logging in directly and executing the same command.

Try (from dragon03):
ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01 "psql -d 'user=repmgr connect_timeout=10 dbname=repmgr host=dragon01 port=26432 fallback_application_name=repmgr' -c 'SELECT 1'"

Answer 6 · 2020-08-18T03:47:34.000Z

run the query provided by you is ok, but it still get BAD by running "/usr/local/pgsql123/bin/repmgr -f /home/postgres/repmgr.conf"

[postgres@dragon03 11:42:04 ~]# date
Tue Aug 18 11:42:07 CST 2020
[postgres@dragon03 11:42:07 ~]# ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01 "psql -d 'user=repmgr connect_timeout=10 dbname=repmgr host=dragon01 port=26432 fallback_application_name=repmgr' -c 'SELECT 1'"
?column?

(1 row)

[postgres@dragon03 11:42:09 ~]#
[postgres@dragon03 11:42:51 ~]#ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01 /usr/local/pgsql123/bin/repmgr -f /home/postgres/repmgr.conf -L DEBUG node check --remote-node-id=3 --replication-connection
DEBUG: connecting to: "user=repmgr connect_timeout=10 dbname=repmgr host=dragon01 port=26432 fallback_application_name=repmgr"
--connection=BAD
ERROR: connection to database failed
DETAIL:
fe_sendauth: no password supplied

[postgres@dragon03 11:42:54 ~]#

Answer 7 · 2020-08-18T03:52:08.000Z

Then please try (from dragon03):

ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01 "psql -d 'user=repmgr connect_timeout=10 replication=1 host=dragon03 port=26432 fallback_application_name=repmgr' -c 'IDENTIFY_SYSTEM'"

Answer 8 · 2020-08-18T03:53:09.000Z

[postgres@dragon03 11:52:47 ~]#
<=repmgr connect_timeout=10 replication=1 host=dragon03 port=26432 fallback_application_name=repmgr' -c 'IDENTIFY_SYSTEM'"
systemid | timeline | xlogpos | dbname
---------------------+----------+------------+--------
6848010177502602935 | 5 | 0/2C804488 |
(1 row)

[postgres@dragon03 11:52:48 ~]#

Answer 9 · 2020-08-18T04:20:33.000Z

OK, that narrows things down a bit, will see if the issue can be reproduced.

Answer 10 · 2020-08-18T08:16:03.000Z

There is a method is to set .pgpass like below, i can manual switch a standby when set the .pgpass like this.
pgpass.txt

[postgres@dragon03 11:42:04 ~]# date Tue Aug 18 11:42:07 CST 2020 [postgres@dragon03 11:42:07 ~]# ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01 "psql -d 'user=repmgr connect_timeout=10 dbname=repmgr host=dragon01 port=26432 fallback_application_name=repmgr' -c 'SELECT 1'" ?column?

[postgres@dragon03 11:42:04 ~]# date
Tue Aug 18 11:42:07 CST 2020
[postgres@dragon03 11:42:07 ~]# ssh -o Batchmode=yes -q -o ConnectTimeout=10 dragon01 "psql -d 'user=repmgr connect_timeout=10 dbname=repmgr host=dragon01 port=26432 fallback_application_name=repmgr' -c 'SELECT 1'"
?column?