EnterpriseDB/repmgr

Issue re-cloning a standby after a major PostgreSQL upgrade


Environment:

  • repmgr 5.3.2
  • PostgreSQL 10.20 and PostgreSQL 14.5

Issue:
I recently ran into a problem re-cloning a standby while performing a major PostgreSQL version upgrade (10 to 14). My environment has one primary and two standbys.

The primary was upgraded successfully and, as a result, the installation's "system identifier" changed (as reported by: SELECT system_identifier FROM pg_catalog.pg_control_system();).

Moving on to the first standby: after successfully upgrading PostgreSQL on it, I tried to re-clone it from the now-upgraded primary. Running repmgr standby clone --force failed with the following error:

ERROR: source node's system identifier does not match other nodes in the replication cluster
DETAIL: source node's system identifier is 7158597205441811171, replication cluster member "static-151-92"'s system identifier is 7065916388744459236

Note that "static-151-92" is the node name of my second not-yet-upgraded standby.

Looking at the repmgr logic that produces this error, I started reading the repmgr-action-standby.c#check_source_server() function and am wondering whether there's a problem with it. I noticed that it calls get_all_node_records(...), which orders its results by the node_id column. This means that the behavior from this point forward, where it iterates through each node, depends on the arbitrary assignment of each node's integer ID.

In my case, my nodes have these IDs:

ID    | Name          | Role
------+---------------+---------
610   | static-151-91 | standby
1796  | static-151-92 | standby
5015  | static-151-90 | primary

Since "static-151-91" is the node being upgraded, the connection check to it fails and the loop moves on to inspect "static-151-92". Because "static-151-92" has not yet been upgraded, its "system identifier" does not match that of the upgraded primary, so the if (source_system_identifier != test_system_identifier) condition evaluates to true and the clone fails.

This behavior is strange to me because if, by chance, my node IDs had been assigned such that they looked like this instead:

ID    | Name          | Role
------+---------------+---------
610   | static-151-91 | standby
1000  | static-151-90 | primary
1796  | static-151-92 | standby

Then "static-151-90" would have been inspected before "static-151-92", and the "system identifier" would obviously match in that case (since the same node would supply both source_system_identifier and test_system_identifier). I confirmed that this is indeed the behavior: with that ID ordering, my standby clone succeeds.
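
To make the order dependence concrete, here is a minimal Python model of the loop as I understand it. It is a sketch based only on the behavior described in this issue, not a translation of the repmgr source: the Node class and check_source() function are hypothetical names, an unreachable node is modeled as having no identifier, and I am assuming the loop stops at the first reachable node whose identifier can be compared.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

OLD_SYSID = 7065916388744459236  # identifier before the upgrade (from the error DETAIL)
NEW_SYSID = 7158597205441811171  # identifier of the upgraded primary


@dataclass
class Node:
    node_id: int
    name: str
    # None models a node whose connection check fails (hypothetical convention).
    system_identifier: Optional[int]


def check_source(nodes: List[Node], source_sysid: int) -> Tuple[bool, Optional[str]]:
    """Model of the check: visit nodes in node_id order, skip unreachable
    ones, and decide based on the first reachable node (assumed behavior)."""
    for node in sorted(nodes, key=lambda n: n.node_id):
        if node.system_identifier is None:
            continue  # connection failed, try the next node
        if node.system_identifier != source_sysid:
            return False, node.name  # mismatch: the clone is rejected
        return True, node.name  # match: the check succeeds
    return True, None  # no reachable node to compare against


# Node IDs as actually assigned in my cluster.
actual = [
    Node(610, "static-151-91", None),        # standby being re-cloned
    Node(1796, "static-151-92", OLD_SYSID),  # standby not yet upgraded
    Node(5015, "static-151-90", NEW_SYSID),  # upgraded primary
]

# The alternative assignment, where the primary sorts before static-151-92.
alternate = [
    Node(610, "static-151-91", None),
    Node(1000, "static-151-90", NEW_SYSID),
    Node(1796, "static-151-92", OLD_SYSID),
]

print(check_source(actual, NEW_SYSID))     # (False, 'static-151-92')
print(check_source(alternate, NEW_SYSID))  # (True, 'static-151-90')
```

The same three nodes in the same state give opposite results, and the only difference is where the primary's node_id falls in the sort order.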

I'm opening this issue because it's not clear to me what the right course of action is, and I'm hoping to discuss it. Thanks!