EnterpriseDB/repmgr

repmgr showing wrong node status (repmgr14)

Opened this issue · 1 comments

When I switched over from node1 to node2, the switchover itself completed successfully, as shown below. The problem is that node1 is not attached to the new primary afterwards (its Upstream column is empty).

[postgres@node2 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------
 1  | node1 | standby |   running |          | default  | 100      | 4        | host=192.168.43.21 user=repmgr dbname=repmgr connect_timeout=2 port=5434
 2  | node2 | primary | * running |          | default  | 100      | 5        | host=192.168.43.22 user=repmgr dbname=repmgr connect_timeout=2 port=5434
 3  | node3 | standby |   running | node2    | default  | 100      | 5        | host=192.168.43.23 user=repmgr dbname=repmgr connect_timeout=2 port=5434

After checking the replication slot status, I found that the replication slot for node1 is inactive, as shown below:

postgres=# select * from pg_replication_slots ;
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
---------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------+-----------
 repmgr_slot_3 |        | physical  |        |          | f         | t      |      34402 |      |              | 3/52035A10  |                     | reserved   |               | f
 repmgr_slot_1 |        | physical  |        |          | f         | f      |            |      |              | 3/520000D0  |                     | reserved   |               | f

But when I queried the database directly, I found the node status below, which differs from the output above. In the backend, node1 is still shown as primary. This issue is new to me with repmgr14. Is it a bug or something else?

[postgres@node2 ~]$ psql -h 192.168.43.21 -d repmgr -U repmgr  -p 5434
psql (14.2)
Type "help" for help.

repmgr=# select * from nodes;
 node_id | upstream_node_id | active | node_name |  type   | location | priority |                                 conninfo                                 | repluser |   slot_name   |        config_file
---------+------------------+--------+-----------+---------+----------+----------+--------------------------------------------------------------------------+----------+---------------+----------------------------
       1 |                  | t      | node1     | primary | default  |      100 | host=192.168.43.21 user=repmgr dbname=repmgr connect_timeout=2 port=5434 | repmgr   | repmgr_slot_1 | /etc/repmgr/14/repmgr.conf
       2 |                1 | t      | node2     | standby | default  |      100 | host=192.168.43.22 user=repmgr dbname=repmgr connect_timeout=2 port=5434 | repmgr   | repmgr_slot_2 | /etc/repmgr/14/repmgr.conf
       3 |                1 | t      | node3     | standby | default  |      100 | host=192.168.43.23 user=repmgr dbname=repmgr connect_timeout=2 port=5434 | repmgr   | repmgr_slot_3 | /etc/repmgr/14/repmgr.conf
(3 rows)

To solve this, I restored from a previous backup and dropped the repmgr database. I then created the repmgr database again with the extension, and set the search path and repmgr role as described in the 2ndQuadrant documentation. Later I tried again with a failover on node1, but the result was the same: the backend node status (select * from nodes) does not match the frontend node status (repmgr node show).

Other necessary Information given:

OS: CentOS 8.5
DB: PostgreSQL 14.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4), 64-bit
repmgr:14
barman: 2.17 Barman by EnterpriseDB (www.enterprisedb.com)
Note: I recently upgraded the PostgreSQL database from 13 to 14.

If you need any other information, please let me know. I am looking for a solution from the experts as early as possible.
Thanks in advance @ibarwick

It looks like something went wrong during the switchover; from the available output it looks like node1 somehow started as a primary (which is why you are seeing the original repmgr node metadata when connecting to that node).

Unfortunately it's not possible from the above information to determine why that happened; the repmgr log output during the switchover process and node1's PostgreSQL log file may contain more clues.
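
As an illustration of the mismatch described above, here is a minimal Python sketch that compares what each node's own copy of the repmgr metadata says about roles. The helper name `self_declared_primaries` and the `views` structure are hypothetical and not part of repmgr. The role values are transcribed by hand from the outputs in this issue, and the sketch assumes that `repmgr cluster show` on node2 reflects node2's local metadata.

```python
# Hypothetical helper, not part of repmgr: find nodes whose own metadata
# copy lists them as primary.
def self_declared_primaries(views):
    """Return the sorted names of nodes that consider themselves primary.

    views maps a node name to that node's local view of the cluster,
    which is itself a mapping of node name -> role string.
    """
    return sorted(
        name for name, view in views.items() if view.get(name) == "primary"
    )


# Roles transcribed from the outputs pasted in this issue.
views = {
    # "select * from nodes" when connected to 192.168.43.21 (node1)
    "node1": {"node1": "primary", "node2": "standby", "node3": "standby"},
    # "repmgr cluster show" run on node2
    "node2": {"node1": "standby", "node2": "primary", "node3": "standby"},
}

primaries = self_declared_primaries(views)
if len(primaries) > 1:
    # More than one node believes it is the primary: the metadata has diverged.
    print("diverged metadata, nodes claiming primary:", ", ".join(primaries))
```

With the values from this issue, both node1 and node2 are reported, which matches the explanation that node1 came up as a primary with its pre-switchover metadata. The sketch only compares metadata; it does not say why node1 started that way, which is what the logs would need to show.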