EnterpriseDB/repmgr

Question about --siblings-follow


We have a cluster with 1 primary, 2 replicas, and 1 witness node, running repmgr 5.2.
Previously we did a manual fail-over and it all went smoothly.
Recently we had an outage on the primary, but the fail-over only partially worked.
A new primary was promoted, but the other replica did not start following.

[2021-11-01 06:35:49] [WARNING] 2 sibling nodes found, but option "--siblings-follow" not specified
[2021-11-01 06:35:49] [DETAIL] these nodes will remain attached to the current primary:
  db-node2 (node ID: 2)
  PG-Node-Witness (node ID: 4, witness server)

So I thought that --siblings-follow should be added to the promote command.
This is the current promote command: `/usr/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file`

I read the documentation and it states:
Note:
If using repmgrd, when invoking repmgr standby promote (either directly via the promote_command, or in a script called via promote_command), --siblings-follow must not be included as a command line option for repmgr standby promote.

But there is no explanation why it must not be.
How can I prevent this from happening again?

The reason it says that is because repmgrd is supposed to migrate each replica to follow the new primary. The repmgrd process on each node manages its local Postgres instance, and in the event of a failover, it switches that node's upstream automatically. At least, that's what's supposed to happen.
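
As a rough sketch of how to confirm that a node is set up for this, the helper below prints the failover-related settings from a repmgr configuration file. The function name `failover_settings` is made up for illustration. The expectation that each standby has `failover=automatic` and a `follow_command` calling `repmgr standby follow` is based on the repmgr documentation, not on anything shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the active (uncommented) failover settings
# from the repmgr.conf file given as the first argument.
failover_settings() {
    grep -E '^[[:space:]]*(failover|promote_command|follow_command)[[:space:]]*=' "$1"
}

# Example usage on a standby (path taken from the promote command above):
#   failover_settings /etc/repmgr/12/repmgr.conf
# Per the repmgr documentation, a standby managed by repmgrd would typically show
# a follow_command along the lines of:
#   follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr/12/repmgr.conf --log-to-file --upstream-node-id=%n'
```

If repmgrd is not actually running on a sibling, its `follow_command` is never executed, which would match a sibling staying attached to the old primary.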

What did the logs say on the node that didn't follow properly?

I think I figured out the reason. When I check `repmgr service status`, some nodes report that the daemon is not running, though when I check those nodes, the service is running. How can I make sure that repmgr knows that these are running?

This is from the new primary (node1):

postgres@custom-product-prod-db01:/$ repmgr -f /etc/repmgr/12/repmgr.conf service status
 ID | Name            | Role    | Status    | Upstream | repmgrd     | PID  | Paused? | Upstream last seen   
----+-----------------+---------+-----------+----------+-------------+------+---------+-----------------------
 1  | db-node1        | primary | * running |          | running     | 8674 | no      | n/a                   
 2  | db-node2        | standby |   running | db-node1 | not running | n/a  | n/a     | n/a                   
 4  | PG-Node-Witness | witness | * running | db-node1 | not running | n/a  | n/a     | n/a                   
 5  | db-node4        | standby |   running | db-node1 | running     | 3918 | no      | 1 second(s) ago 

This is from node 2:

postgres@custom-product-prod-db02:/$ repmgr -f /etc/repmgr/12/repmgr.conf service status
 ID | Name            | Role    | Status    | Upstream | repmgrd     | PID  | Paused? | Upstream last seen   
----+-----------------+---------+-----------+----------+-------------+------+---------+-----------------------
 1  | db-node1        | primary | * running |          | running     | 8674 | no      | n/a                   
 2  | db-node2        | standby |   running | db-node1 | not running | n/a  | n/a     | n/a                   
 4  | PG-Node-Witness | witness | * running | db-node1 | not running | n/a  | n/a     | n/a                   
 5  | db-node4        | standby |   running | db-node1 | running     | 3918 | no      | 0 second(s) ago       

postgres@custom-product-prod-db02:/$ service repmgrd status
● repmgrd.service - LSB: Start/stop repmgrd
   Loaded: loaded (/etc/init.d/repmgrd; generated)
   Active: active (exited) since Sun 2021-10-24 21:47:44 WAT; 1 weeks 3 days ago
     Docs: man:systemd-sysv-generator(8)
    Tasks: 0 (limit: 4915)
   CGroup: /system.slice/repmgrd.service

stefan@custom-product-prod-db02:~$ sudo service repmgrd restart
stefan@custom-product-prod-db02:~$ sudo service repmgrd status
● repmgrd.service - LSB: Start/stop repmgrd
   Loaded: loaded (/etc/init.d/repmgrd; generated)
   Active: active (exited) since Thu 2021-11-04 07:55:14 WAT; 7s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 29943 ExecStop=/etc/init.d/repmgrd stop (code=exited, status=0/SUCCESS)
  Process: 29966 ExecStart=/etc/init.d/repmgrd start (code=exited, status=0/SUCCESS)

Nov 04 07:55:14 custom-product-prod-db02 systemd[1]: Starting LSB: Start/stop repmgrd...
Nov 04 07:55:14 custom-product-prod-db02 repmgrd[29966]:  * Starting PostgreSQL replication management and monitoring daemon repmgrd
Nov 04 07:55:14 custom-product-prod-db02 repmgrd[29966]:    ...done.
Nov 04 07:55:14 custom-product-prod-db02 systemd[1]: Started LSB: Start/stop repmgrd.

postgres@custom-product-prod-db02:/$ repmgr -f /etc/repmgr/12/repmgr.conf service status
 ID | Name            | Role    | Status    | Upstream | repmgrd     | PID  | Paused? | Upstream last seen   
----+-----------------+---------+-----------+----------+-------------+------+---------+-----------------------
 1  | db-node1        | primary | * running |          | running     | 8674 | no      | n/a                   
 2  | db-node2        | standby |   running | db-node1 | not running | n/a  | n/a     | n/a                   
 4  | PG-Node-Witness | witness | * running | db-node1 | not running | n/a  | n/a     | n/a                   
 5  | db-node4        | standby |   running | db-node1 | running     | 3918 | no      | 0 second(s) ago      

postgres@custom-product-prod-db02:/$ tail /var/log/postgresql/repmgr.log
[2021-11-04 07:55:14] [NOTICE] repmgrd (repmgrd 5.1.0) starting up
[2021-11-04 07:55:14] [INFO] connecting to database "host=172.23.2.42 port=5432 user=repmgr dbname=repmgr connect_timeout=2"
[2021-11-04 07:55:14] [ERROR] this "repmgr" version is older than the installed "repmgr" extension version
[2021-11-04 07:55:14] [DETAIL] "repmgr" version 5.1.0 is installed but extension is version 5.2
[2021-11-04 07:55:14] [HINT] update the repmgr binaries to match the installed extension version

$ sudo apt install repmgr --upgrade
The following NEW packages will be installed:
  repmgr
0 upgraded, 1 newly installed, 0 to remove and 108 not upgraded.
Need to get 5380 B of archives.
After this operation, 12.3 kB of additional disk space will be used.
Get:1 https://dl.2ndquadrant.com/default/release/apt bionic-2ndquadrant/main amd64 repmgr all 5.3.0-1.bionic+1 [5380 B]
Fetched 5380 B in 0s (46.1 kB/s) 
Selecting previously unselected package repmgr.
(Reading database ... 510273 files and directories currently installed.)
Preparing to unpack .../repmgr_5.3.0-1.bionic+1_all.deb ...
Unpacking repmgr (5.3.0-1.bionic+1) ...
Setting up repmgr (5.3.0-1.bionic+1) ...
stefan@custom-product-prod-db02:~$ sudo service repmgrd restart
stefan@custom-product-prod-db02:~$ tail /var/log/postgresql/repmgr.log
[2021-11-04 07:55:14] [NOTICE] repmgrd (repmgrd 5.1.0) starting up
[2021-11-04 07:55:14] [INFO] connecting to database "host=172.23.2.42 port=5432 user=repmgr dbname=repmgr connect_timeout=2"
[2021-11-04 07:55:14] [ERROR] this "repmgr" version is older than the installed "repmgr" extension version
[2021-11-04 07:55:14] [DETAIL] "repmgr" version 5.1.0 is installed but extension is version 5.2
[2021-11-04 07:55:14] [HINT] update the repmgr binaries to match the installed extension version

stefan@custom-product-prod-db02:~$ repmgr --version
repmgr 5.1.0
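
The `[ERROR]` in the log says the repmgr binary (5.1.0) is older than the installed extension (5.2), which is why repmgrd exits right after starting even though the init script reports success. A minimal sketch for comparing the two on any node follows. The helper names `major_minor` and `versions_match` are made up for illustration, and the commented usage assumes the repmgr metadata lives in a database named `repmgr`, as in the connection string in the log.

```shell
#!/bin/sh
# Reduce a version string such as "repmgr 5.1.0" or "5.2" to "major.minor".
# Prints nothing if no version number is found.
major_minor() {
    printf '%s\n' "$1" | sed -n 's/^[^0-9]*\([0-9][0-9]*\.[0-9][0-9]*\).*$/\1/p'
}

# Succeed only when both strings contain the same major.minor version.
versions_match() {
    [ -n "$(major_minor "$1")" ] && [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}

# Example usage as the postgres user (assumes database "repmgr"):
#   bin_ver=$(repmgr --version)
#   ext_ver=$(psql -d repmgr -Atc "SELECT extversion FROM pg_extension WHERE extname = 'repmgr'")
#   versions_match "$bin_ver" "$ext_ver" || echo "mismatch: $bin_ver vs extension $ext_ver"
```

With the values from this thread, node 2 (`repmgr 5.1.0` against extension `5.2`) would be flagged as a mismatch, while node 1 (`repmgr 5.2.1`) would not.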

From node 2:

postgresql-12-repmgr/bionic-2ndquadrant,now 5.1.0-1.bionic+1 amd64 [installed,upgradable to: 5.3.0-1.bionic+1]
repmgr/bionic-2ndquadrant,now 5.3.0-1.bionic+1 all [installed]
repmgr-common/bionic-2ndquadrant,now 5.1.0-1.bionic+1 all [installed,upgradable to: 5.3.0-1.bionic+1]
It looks like I'm not able to update repmgr.
This is from node1:

postgres@custom-product-prod-db01:/$ repmgr --version
repmgr 5.2.1

postgresql-12-repmgr/now 5.2.1-1.pgdg18.04+1 amd64 [installed,upgradable to: 5.3.0-1.bionic+1]
repmgr-common/now 5.2.1-1.pgdg18.04+1 all [installed,upgradable to: 5.3.0-1.bionic+1]
I checked /etc/apt/sources.list on both servers and they are identical.
Why does `sudo apt install repmgr --upgrade` not install correctly? I don't see any errors in the install log.
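
One possible reading of the package listing from node 2: the `repmgr` package that was installed is only 5380 B and architecture `all`, so it may be a metapackage, while the binaries appear to come from `postgresql-12-repmgr` and `repmgr-common`, both still at 5.1.0. That is an assumption drawn from the output above, not something confirmed in this thread. A small sketch for picking out the packages that apt still marks as upgradable (the helper name `upgradable_packages` is made up):

```shell
#!/bin/sh
# Hypothetical helper: read `apt list`-style lines on stdin and print the
# name of every package marked "upgradable to: ...".
upgradable_packages() {
    grep 'upgradable to:' | cut -d/ -f1
}

# Example usage:
#   apt list --installed 2>/dev/null | grep repmgr | upgradable_packages
# The packages it prints could then be upgraded explicitly, for example:
#   sudo apt-get install --only-upgrade postgresql-12-repmgr repmgr-common
```

Note that the upgrade candidate shown is 5.3.0 while the extension is 5.2, so upgrading only one node would likely just produce a different mismatch. The repmgr upgrade notes describe upgrading the packages on every node and then updating the extension.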

This is not a support forum for Ubuntu. Please contact Canonical support or your system administrator.

Sure, I figured it out.