EnterpriseDB/repmgr

Is it possible that the old primary follow the new secondary?

Closed this issue · 5 comments

guif commented

Hi I need to change the roles between two nodes.

Circumstance:

2servers:

primary: psql01
standby: psql02

After psql01 stops, psql02 becomes the new primary. After a while, start psql01

On psql01:

[-bash-4.2$ repmgr -f /var/lib/pgsql/repmgr/repmgr.conf cluster show
 ID | Name         | Role    | Status               | Upstream | Location | Priority | Timeline | Connection string
----+--------------+---------+----------------------+----------+----------+----------+----------+---------------------------------------------
 1  | psql01       | primary | * running            |          | default  | 100      | 7        | host=psql01 user=repmgr dbname=repmgr
 2  | psql02       | standby | ! running as primary |          | default  | 100      | 8        | host=psql02 user=repmgr dbname=repmgr

WARNING: following issues were detected
  - node "postgresql02" (ID: 2) is registered as standby but running as primary](url)

On psql02:

repmgr -f /var/lib/pgsql/repmgr/repmgr.conf cluster show

	 ID | Name         | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
	----+--------------+---------+-----------+----------+----------+----------+----------+---------------------------------------------
	 1  | psql01       | primary | ! running |          | default  | 100      | 7        | host=psql01 user=repmgr dbname=repmgr
	 2  | psql02       | primary | * running |          | default  | 100      | 8        | host=psql02 user=repmgr dbname=repmgr

	WARNING: following issues were detected
	  - node "psql01" (ID: 1) is running but the repmgr node record is inactive

I try to put my old primary to standby with the following commands:

-bash-4.2$ repmgr -f /var/lib/pgsql/repmgr/repmgr.conf node service --action=stop --checkpoint
NOTICE: issuing CHECKPOINT
DETAIL: executing server command "sudo systemctl stop postgresql-11"
-bash-4.2$ repmgr -f /var/lib/pgsql/repmgr/repmgr.conf -d 'host=psql01 user=repmgr dbname=repmgr' node rejoin
ERROR: connection to database failed
DETAIL:
could not connect to server: Connection refused
        Is the server running on host "psql01" (172.16.11.171) and accepting
        TCP/IP connections on port 5432?

What's exaclty happen? If psql01 goes down how can I change the role to secondary?

After psql01 stops, psql02 becomes the new primary. After a while, start psql01

You now have a "split brain" situation, as both nodes were running as primary.

repmgr -f /var/lib/pgsql/repmgr/repmgr.conf -d 'host=psql01 user=repmgr dbname=repmgr' node rejoin

For repmgr node rejoin you need to provide the connection information of the current primary, i.e. psql02.

At this point it's entirely possible that psql01 has diverged from psql02 and you might not be able to rejoin it as a standby of psql02. You may be able to use pg_rewind to resolve the situation but that's no guarantee; see the documentation for "repmgr node rejoin".

I need to change the roles between two nodes.

Use the repmgr standby switchover command, which is designed for doing this; see also documentation section Performing a switchover with repmgr.

guif commented

in which node need to execute the rejoin command? If I execute on my previous master (psql01) I get the error as I mentioned "ERROR: connection to database failed"
Should I execute commands if both nodes are in primary?

You need to execute the command on the node which is to be rejoined as a standby (from your post I assume this is psql01). You need to provide the database connection information for the current primary (presumably psql02).

I'm trying to achieve the same thing and after running node rejoin on the old primary cluster show still produces running as primary even though it's registered as a standby. What's the correct way of fixing this?

EDIT: I think my situation occurred due to having conninfo host=host1 on both servers causing repmgr to lie about the actual status. If this was indeed the case it was due to user error.

If a node has the conninfo for another node in the cluster, repmgr isn't lying as such, more a case of "garbage in, garbage out". Unfortunately PostgreSQL itself doesn't provide a built-in robust way of identifying individual nodes which makes catching this kind of thing non-trivial.

Closing this issue, please feel free to open a new one if there are any further questions regarding the preceding post.