
repmgrd race condition during auto-failover leads to inaccurate repmgr data (repmgr 4.4)


We have hit what appears to be a race condition in repmgrd during auto-failover. The sequence of events was:

1. The primary failed.
2. Both standbys compared LSNs and determined that they were equal.
3. One standby promoted itself, and the promotion succeeded.
4. At approximately the same time, the other standby checked its LSN against its sibling's, found that the sibling was no longer in recovery mode, and concluded that it must therefore be a primary.
5. It tried to follow the sibling, but could not, because the new primary had forked a new timeline.
6. It then wrongly decided to promote itself. That attempt failed, and it eventually started following the new primary successfully.

Afterwards, however, either the repmgr state in the database was not updated, or the local repmgr status record on both the new primary and the surviving standby was not updated.
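To make the race window concrete, here is a minimal sketch of the sibling check as we understand it. This is Python with psycopg2 purely for illustration (repmgrd itself is written in C), and the function name and DSN constant are hypothetical, not repmgrd's actual code:

import psycopg2

# Hypothetical DSN for the sibling standby (redacted like the output below)
SIBLING_DSN = "host=XX.XX.XX.8 user=repmgr dbname=repmgr connect_timeout=2"

def sibling_is_primary():
    # The race window sits between the earlier LSN comparison and this
    # probe: if the sibling promotes itself inside that window,
    # pg_is_in_recovery() returns false and this node concludes that
    # its sibling is already a primary.
    conn = psycopg2.connect(SIBLING_DSN)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_is_in_recovery()")
            in_recovery = cur.fetchone()[0]
    finally:
        conn.close()
    return not in_recovery

# If the sibling has already forked a new timeline, the subsequent
# "follow" fails, and repmgr 4.4 then appears to fall back to
# promoting this node too -- the wrong decision in our case.

The point is that the LSN comparison and the recovery-state probe are not atomic, so both standbys can pass their respective checks at almost the same moment.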
I see that repmgr 5.0 contains a bug fix addressing issues #587 and #588, and judging by its wording it may also cover this problem. However, the issues reported in #587 and #588 are not exactly the same as the one we encountered.
After this problem occurred, the state of the new primary was:

Node "XXXXXX-55i2":
        PostgreSQL version: 10.9
        Total data size: 6856 MB
        Conninfo: host=XX.XX.XX.8 user=repmgr dbname=repmgr connect_timeout=2
        Role: primary
        WAL archiving: off
        Archive command: (none)
        Replication connections: 0 (of maximal 10)
        Replication slots: 0 physical (of maximal 10; 0 missing)
        Replication lag: n/a

 ID    | Name        | Role    | Status    | Upstream      | Location | Priority | Timeline | Connection string
-------+-------------+---------+-----------+---------------+----------+----------+----------+-------------------------------------------------------------
 11460 | XXXXXX-54i2 | primary | - failed  |               | default  | 100      | ?        | host=XX.XX.XX.7 user=repmgr dbname=repmgr connect_timeout=2
 17919 | XXXXXX-55i2 | primary | * running |               | default  | 100      | 2        | host=XX.XX.XX.8 user=repmgr dbname=repmgr connect_timeout=2
 22079 | XXXXXX-56i2 | standby |   running | ? XXXXXX-54i2 | default  | 100      | 1        | host=XX.XX.XX.9 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "XXXXXX-54i2" (ID: 11460)
  - unable to connect to node "XXXXXX-56i2" (ID: 22079)'s upstream node "XXXXXX-54i2" (ID: 11460)
  - unable to determine if node "XXXXXX-56i2" (ID: 22079) is attached to its upstream node "XXXXXX-54i2" (ID: 11460)

The state of the surviving standby was:

Node "XXXXXX-56i2":
        PostgreSQL version: 10.9
        Total data size: 6855 MB
        Conninfo: host=XX.XX.XX.9 user=repmgr dbname=repmgr connect_timeout=2
        Role: standby
        WAL archiving: off
        Archive command: (none)
        Replication connections: 0 (of maximal 10)
        Replication slots: 0 physical (of maximal 10; 0 missing)
        Upstream node: XXXXXX-54i2 (ID: 11460)
        Replication lag: 0 seconds
        Last received LSN: 29A/A0274718
        Last replayed LSN: 29A/A0274718

 ID    | Name        | Role    | Status               | Upstream      | Location | Priority | Timeline | Connection string
-------+-------------+---------+----------------------+---------------+----------+----------+----------+-------------------------------------------------------------
 11460 | XXXXXX-54i2 | primary | ? unreachable        |               | default  | 100      | ?        | host=XX.XX.XX.7 user=repmgr dbname=repmgr connect_timeout=2
 17919 | XXXXXX-55i2 | standby | ! running as primary |               | default  | 100      | 2        | host=XX.XX.XX.8 user=repmgr dbname=repmgr connect_timeout=2
 22079 | XXXXXX-56i2 | standby |   running            | ? XXXXXX-54i2 | default  | 100      | 1        | host=XX.XX.XX.9 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "XXXXXX-54i2" (ID: 11460)
  - node "XXXXXX-54i2" (ID: 11460) is registered as an active primary but is unreachable
  - node "XXXXXX-55i2" (ID: 17919) is registered as standby but running as primary
  - unable to connect to node "XXXXXX-56i2" (ID: 22079)'s upstream node "XXXXXX-54i2" (ID: 11460)
  - unable to determine if node "XXXXXX-56i2" (ID: 22079) is attached to its upstream node "XXXXXX-54i2" (ID: 11460)

Why was the repmgr state not updated to reflect that the surviving standby was following the new primary? Note that the repmgrd log on the surviving standby has two occurrences of the following error message:

could not fork new process for connection: Cannot allocate memory

We do not know why this error occurred, or which step in the failover process it corresponded to. If it somehow prevented repmgr from updating its state in the database and/or its status record on the new primary and on the surviving standby, how resilient is the step that makes those updates? Is it ever retried in order to resynchronize the metadata?
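For reference, the discrepancy between what repmgr has recorded and what PostgreSQL actually reports is easy to inspect. A minimal sketch (again Python with psycopg2 purely as an illustration, assuming the standard repmgr.nodes metadata table; DSNs redacted as above):

import psycopg2

# DSNs of the two surviving nodes, redacted as in the output above
NODES = {
    "XXXXXX-55i2": "host=XX.XX.XX.8 user=repmgr dbname=repmgr connect_timeout=2",
    "XXXXXX-56i2": "host=XX.XX.XX.9 user=repmgr dbname=repmgr connect_timeout=2",
}

for name, dsn in NODES.items():
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        # What repmgr has recorded: role and upstream of every node
        cur.execute(
            "SELECT node_id, node_name, type, upstream_node_id, active"
            " FROM repmgr.nodes ORDER BY node_id"
        )
        for row in cur.fetchall():
            print(name, "metadata:", row)
        # What PostgreSQL itself reports on this node
        cur.execute("SELECT pg_is_in_recovery()")
        print(name, "in recovery:", cur.fetchone()[0])
    conn.close()

On both surviving nodes the metadata still pointed at the failed node 11460, matching the cluster output above, even though the standby was by then streaming from the new primary. As far as we can tell, re-running "repmgr standby register --force" on the standby would rewrite its record, but we would expect repmgrd to repair this itself.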
See attached repmgrd logs.
repmgr-newprimary.log
repmgr-survivingstandby.log