EnterpriseDB/repmgr

Primary database is down but witness says reachable

nikhil-postgres opened this issue · 3 comments

Hi,

The primary database was down for 15 minutes, but no failover took place. Checking the logs, we see the following:

[2022-03-05 21:31:38] [INFO] checking state of sibling node "a" (ID: 3)
[2022-03-05 21:31:38] [DEBUG] connecting to: "user=pgrepmgr connect_timeout=3 dbname=pgrepmgr host=a port=5432 application_name=repmgrd sslmode=require fallback_application_name=repmgr options=-csearch_path="
[2022-03-05 21:31:39] [INFO] node "a" (ID: 3) reports its upstream is node 1, last seen 88 second(s) ago
[2022-03-05 21:31:39] [INFO] standby node "a" (ID: 3) last saw primary node 88 second(s) ago
[2022-03-05 21:31:39] [INFO] last receive LSN for sibling node "a" (ID: 3) is: 225/9800F858
[2022-03-05 21:31:39] [INFO] node "a" (ID: 3) has same LSN as current candidate "b" (ID: 2)

[2022-03-05 21:31:39] [INFO] checking state of sibling node "witness" (ID: 101)
[2022-03-05 21:31:39] [DEBUG] connecting to: "user=pgrepmgr connect_timeout=3 dbname=pgrepmgr host=witness port=5432 application_name=repmgrd sslmode=require fallback_application_name=repmgr options=-csearch_path="
[2022-03-05 21:31:39] [INFO] node "witness" (ID: 101) reports its upstream is node 1, last seen 0 second(s) ago
[2022-03-05 21:31:39] [NOTICE] witness node "witness" (ID: 101) last saw primary node 0 second(s) ago, considering primary still visible


[2022-03-05 21:31:39] [DEBUG] node 101 is witness, not querying state
[2022-03-05 21:31:39] [INFO] 1 nodes can see the primary
[2022-03-05 21:31:39] [DETAIL] following nodes can see the primary:
 - node "witness" (ID: 101): 0 second(s) ago

The primary was down, but the witness reports that it last saw the primary 0 seconds ago.
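
The decision visible in the log above can be modelled roughly as follows. This is only a simplified sketch of the behaviour the log lines suggest, not repmgr's actual code: the names `SiblingReport`, `nodes_seeing_primary`, `should_promote` and the value of `THRESHOLD_SECS` are all hypothetical. It shows why a single node reporting "0 second(s) ago" is enough to block promotion even when another standby last saw the primary 88 seconds ago.

```python
from dataclasses import dataclass


@dataclass
class SiblingReport:
    """What one sibling node reports about the primary (hypothetical model)."""
    name: str
    node_id: int
    last_seen_secs: int  # seconds since this node last saw the primary


def nodes_seeing_primary(reports, threshold_secs):
    """Return the reports whose last-seen age is below the assumed threshold."""
    return [r for r in reports if 0 <= r.last_seen_secs < threshold_secs]


def should_promote(reports, threshold_secs):
    """Promote only if no sibling claims to have seen the primary recently."""
    return not nodes_seeing_primary(reports, threshold_secs)


# Values taken from the log excerpt above.
reports = [
    SiblingReport("a", 3, 88),
    SiblingReport("witness", 101, 0),
]

# Assumption: the real cutoff depends on repmgrd configuration; 4 is illustrative.
THRESHOLD_SECS = 4

visible = nodes_seeing_primary(reports, THRESHOLD_SECS)
print([r.name for r in visible])                # nodes that "can see the primary"
print(should_promote(reports, THRESHOLD_SECS))  # whether failover would proceed
```

Under this model the witness's stale "0 seconds" value alone keeps the candidate from promoting itself, which matches the "1 nodes can see the primary" line in the log.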

Thanks

repmgr version: 5.2.1
PostgreSQL version: 11.7

The logs on the witness node also show that it was unable to connect to the primary:

[2022-03-05 21:30:57] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-05 21:30:57] [WARNING] unable to reconnect to node 1 after 6 attempts
[2022-03-05 21:31:36] [WARNING] new primary "b" (node ID: 2) is in recovery
[2022-03-05 21:31:39] [WARNING] unable to connect to "host=primary port=5432 sslmode=require dbname=pgrepmgr user=pgrepmgr connect_timeout=3 application_name=repmgrd"
[2022-03-05 21:31:39] [DETAIL]
timeout expired

[2022-03-05 21:32:33] [ERROR] unable to determine if server is in recovery
[2022-03-05 21:32:33] [DETAIL]
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

Hi @ibarwick, could you please help us understand why we see this behaviour?

Also, I see that the functions get_upstream_node_id and get_upstream_last_seen are used to check the details on the standbys and the witness. How do these functions get their data? Is it read from memory?
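
To compare what each node reports, the two functions could be called by hand on every standby and on the witness. The sketch below assumes they are installed in the `repmgr` schema and take no arguments, which should be checked against the installed extension; the helper name `upstream_status` is hypothetical, and `cursor` is any Python DB-API cursor connected to the node being inspected.

```python
def upstream_status(cursor):
    """Query one node for its recorded upstream ID and last-seen age.

    Assumes repmgr.get_upstream_node_id() and repmgr.get_upstream_last_seen()
    exist on the node and take no arguments.
    """
    cursor.execute(
        "SELECT repmgr.get_upstream_node_id(), repmgr.get_upstream_last_seen()"
    )
    upstream_id, last_seen_secs = cursor.fetchone()
    return {"upstream_node_id": upstream_id, "last_seen_secs": last_seen_secs}
```

Running this against the witness while the primary is known to be down would show whether the witness itself returns a last-seen value of 0, or whether the value changes on the way to the standby.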

Thanks,
Nikhil

The same issue is already open: #744