EnterpriseDB/repmgr

Segmentation fault during switchover (node rejoin)


Environment
CentOS release 5.11 (Final)
Linux lab145 2.6.18-419.el5xen #1 SMP Fri Feb 24 22:12:04 UTC 2017 i686 i686 i386 GNU/Linux
repmgr 5.1.0
PostgreSQL 9.5.3, PostgreSQL 9.5.7 and PostgreSQL 9.5.23
1 primary 1 standby

Issue

When executing repmgr standby switchover --always-promote on the replica, I get a segmentation fault during the node rejoin of the former primary (log below captured with --log-level DEBUG):

[postgres@lab146 ~]$ repmgr -f repmgr/repmgr.conf standby switchover --always-promote -v --log-level DEBUG
NOTICE: using provided configuration file "repmgr/repmgr.conf"
DEBUG: connecting to: "user=repmgr connect_timeout=60 dbname=repmgr host=srvdgtheartbeatb001 fallback_application_name=repmgr"
DEBUG: set_config():
SET synchronous_commit TO 'local'
DEBUG: get_node_record():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 2
NOTICE: executing switchover on node "srvdgtheartbeatb001" (ID: 2)
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: searching for primary node
DEBUG: get_primary_connection():
SELECT node_id, conninfo, CASE WHEN type = 'primary' THEN 1 ELSE 2 END AS type_priority FROM repmgr.nodes WHERE active IS TRUE AND type != 'witness' ORDER BY active DESC, type_priority, priority, node_id
INFO: checking if node 1 is primary
DEBUG: connecting to: "user=repmgr connect_timeout=60 dbname=repmgr host=srvdgtheartbeata001 fallback_application_name=repmgr"
DEBUG: set_config():
SET synchronous_commit TO 'local'
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: current primary node is 1
DEBUG: get_node_record():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 1
DEBUG: remote node name is "srvdgtheartbeata001"
DEBUG: test_ssh_connection(): executing ssh -o Batchmode=yes -o "StrictHostKeyChecking no" -o UserKnownHostsFile=/dev/null -o ConnectTimeout=60 -o ServerAliveInterval=2 srvdgtheartbeata001 /bin/true 2>/dev/null
INFO: SSH connection to host "srvdgtheartbeata001" succeeded
DEBUG: remote_command():
ssh -o Batchmode=yes -o "StrictHostKeyChecking no" -o UserKnownHostsFile=/dev/null -o ConnectTimeout=60 -o ServerAliveInterval=2 srvdgtheartbeata001 /usr/pgsql/bin/repmgr -f /home/postgres/repmgr/repmgr.conf -L DEBUG --version >/dev/null 2>&1 && echo "1" || echo "0"
DEBUG: remote_command(): output returned was:
1

DEBUG: remote_command():
ssh -o Batchmode=yes -o "StrictHostKeyChecking no" -o UserKnownHostsFile=/dev/null -o ConnectTimeout=60 -o ServerAliveInterval=2 srvdgtheartbeata001 /usr/pgsql/bin/repmgr -f /home/postgres/repmgr/repmgr.conf -L DEBUG --version 2>/dev/null
DEBUG: remote_command(): output returned was:
repmgr 5.1.0

DEBUG: "repmgr" version on "srvdgtheartbeata001" is 50100
DEBUG: remote_command():
ssh -o Batchmode=yes -o "StrictHostKeyChecking no" -o UserKnownHostsFile=/dev/null -o ConnectTimeout=60 -o ServerAliveInterval=2 srvdgtheartbeata001 test -f /home/postgres/repmgr/repmgr.conf && echo 1 || echo 0
Warning: Permanently added 'srvdgtheartbeata001,10.20.0.1' (RSA) to the list of known hosts.
DEBUG: remote_command(): output returned was:
1

DEBUG: remote_command():
ssh -o Batchmode=yes -o "StrictHostKeyChecking no" -o UserKnownHostsFile=/dev/null -o ConnectTimeout=60 -o ServerAliveInterval=2 srvdgtheartbeata001 /usr/pgsql/bin/repmgr -f /home/postgres/repmgr/repmgr.conf -L DEBUG node check --data-directory-config --optformat -LINFO 2>/dev/null
DEBUG: remote_command(): output returned was:
--configured-data-directory=OK

DEBUG: get_node_replication_stats():
SELECT pg_catalog.current_setting('max_wal_senders')::INT AS max_wal_senders, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_stat_replication) AS attached_wal_receivers, current_setting('max_replication_slots')::INT AS max_replication_slots, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE slot_type='physical') AS total_replication_slots, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS TRUE AND slot_type='physical') AS active_replication_slots, (SELECT pg_catalog.count(*) FROM pg_catalog.pg_replication_slots WHERE active IS FALSE AND slot_type='physical') AS inactive_replication_slots, pg_catalog.pg_is_in_recovery() AS in_recovery
DEBUG: get_active_sibling_node_records():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.upstream_node_id = 1 AND n.node_id != 2 AND n.active IS TRUE ORDER BY n.node_id
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
DEBUG: remote_command():
ssh -o Batchmode=yes -o "StrictHostKeyChecking no" -o UserKnownHostsFile=/dev/null -o ConnectTimeout=60 -o ServerAliveInterval=2 srvdgtheartbeata001 /usr/pgsql/bin/repmgr -f /home/postgres/repmgr/repmgr.conf -L DEBUG node check --remote-node-id=2 --replication-connection
Warning: Permanently added 'srvdgtheartbeata001,10.20.0.1' (RSA) to the list of known hosts.
DEBUG: connecting to: "user=repmgr connect_timeout=60 dbname=repmgr host=srvdgtheartbeata001 fallback_application_name=repmgr"
DEBUG: remote_command(): output returned was:
--connection=OK

DEBUG: guc_set():
SELECT true FROM pg_catalog.pg_settings WHERE name = 'archive_mode' AND setting != 'off'
DEBUG: remote_command():
ssh -o Batchmode=yes -o "StrictHostKeyChecking no" -o UserKnownHostsFile=/dev/null -o ConnectTimeout=60 -o ServerAliveInterval=2 srvdgtheartbeata001 /usr/pgsql/bin/repmgr -f /home/postgres/repmgr/repmgr.conf -L DEBUG node check --terse -LERROR --archive-ready --optformat
Warning: Permanently added 'srvdgtheartbeata001,10.20.0.1' (RSA) to the list of known hosts.
DEBUG: remote_command(): output returned was:
--status=OK --files=0

INFO: 0 pending archive files
DEBUG: get_replication_lag_seconds():
SELECT CASE WHEN (pg_catalog.pg_last_xlog_receive_location() = pg_catalog.pg_last_xlog_replay_location()) THEN 0 ELSE EXTRACT(epoch FROM (pg_catalog.clock_timestamp() - pg_catalog.pg_last_xact_replay_timestamp()))::INT END AS lag_seconds
DEBUG: lag is 0
INFO: replication lag on this standby is 0 seconds
DEBUG: minimum of 1 free slots (0 for siblings) required; 5 available
DEBUG: get_all_node_records():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n ORDER BY n.node_id
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
DEBUG: connecting to: "user=repmgr connect_timeout=60 dbname=repmgr host=srvdgtheartbeata001 fallback_application_name=repmgr"
DEBUG: set_config():
SET synchronous_commit TO 'local'
DEBUG: connecting to: "user=repmgr connect_timeout=60 dbname=repmgr host=srvdgtheartbeatb001 fallback_application_name=repmgr"
DEBUG: set_config():
SET synchronous_commit TO 'local'
NOTICE: local node "srvdgtheartbeatb001" (ID: 2) will be promoted to primary; current primary "srvdgtheartbeata001" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "srvdgtheartbeata001" (ID: 1)
DEBUG: remote_command():
ssh -o Batchmode=yes -o "StrictHostKeyChecking no" -o UserKnownHostsFile=/dev/null -o ConnectTimeout=60 -o ServerAliveInterval=2 srvdgtheartbeata001 /usr/pgsql/bin/repmgr -f /home/postgres/repmgr/repmgr.conf -L DEBUG node service --action=stop --checkpoint
Warning: Permanently added 'srvdgtheartbeata001,10.20.0.1' (RSA) to the list of known hosts.
DEBUG: connecting to: "user=repmgr connect_timeout=60 dbname=repmgr host=srvdgtheartbeata001 fallback_application_name=repmgr"
NOTICE: issuing CHECKPOINT on node "srvdgtheartbeata001" (ID: 1)
DETAIL: executing server command "/usr/pgsql/bin/pg_ctl -l /dev/null -D '/home2/postgres/data' -W -m fast stop"
DEBUG: remote_command(): no output returned
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_REJECT
DEBUG: sleeping 1 second until next check
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
DEBUG: ping status is: PQPING_NO_RESPONSE
DEBUG: remote_command():
ssh -o Batchmode=yes -o "StrictHostKeyChecking no" -o UserKnownHostsFile=/dev/null -o ConnectTimeout=60 -o ServerAliveInterval=2 srvdgtheartbeata001 /usr/pgsql/bin/repmgr -f /home/postgres/repmgr/repmgr.conf -L DEBUG node status --is-shutdown-cleanly
Warning: Permanently added 'srvdgtheartbeata001,10.20.0.1' (RSA) to the list of known hosts.
DEBUG: remote_command(): output returned was:
--state=SHUTDOWN --last-checkpoint-lsn=0/49000024

DEBUG: remote node status is: SHUTDOWN
NOTICE: current primary has been cleanly shut down at location 0/49000024
DEBUG: get_replication_info():
SELECT ts, in_recovery, last_wal_receive_lsn, last_wal_replay_lsn, last_xact_replay_timestamp, CASE WHEN (last_wal_receive_lsn = last_wal_replay_lsn) THEN 0::INT ELSE CASE WHEN last_xact_replay_timestamp IS NULL THEN 0::INT ELSE EXTRACT(epoch FROM (pg_catalog.clock_timestamp() - last_xact_replay_timestamp))::INT END END AS replication_lag_time, last_wal_receive_lsn >= last_wal_replay_lsn AS receiving_streamed_wal, wal_replay_paused, upstream_last_seen, upstream_node_id FROM ( SELECT CURRENT_TIMESTAMP AS ts, pg_catalog.pg_is_in_recovery() AS in_recovery, pg_catalog.pg_last_xact_replay_timestamp() AS last_xact_replay_timestamp, COALESCE(pg_catalog.pg_last_xlog_receive_location(), '0/0'::PG_LSN) AS last_wal_receive_lsn, COALESCE(pg_catalog.pg_last_xlog_replay_location(), '0/0'::PG_LSN) AS last_wal_replay_lsn, CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE THEN FALSE ELSE pg_catalog.pg_is_xlog_replay_paused() END AS wal_replay_paused, CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE THEN -1 ELSE repmgr.get_upstream_last_seen() END AS upstream_last_seen, CASE WHEN pg_catalog.pg_is_in_recovery() IS FALSE THEN -1 ELSE repmgr.get_upstream_node_id() END AS upstream_node_id ) q
DEBUG: local node last receive LSN is 0/4900008C, primary shutdown checkpoint LSN is 0/49000024
DEBUG: get_node_record():
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 2
NOTICE: promoting standby to primary
DETAIL: promoting server "srvdgtheartbeatb001" (ID: 2) using "/usr/pgsql/bin/pg_ctl -l /dev/null -w -D '/home2/postgres/data' promote"
server promoting
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
INFO: standby promoted to primary after 1 second(s)
DEBUG: setting node 2 as primary and marking existing primary as failed
DEBUG: begin_transaction()
DEBUG: commit_transaction()
NOTICE: STANDBY PROMOTE successful
DETAIL: server "srvdgtheartbeatb001" (ID: 2) was successfully promoted to primary
DEBUG: _create_event(): event is "standby_promote" for node 2
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: _create_event():
INSERT INTO repmgr.events ( node_id, event, successful, details ) VALUES ($1, $2, $3, $4) RETURNING event_timestamp
DEBUG: _create_event(): Event timestamp is "2020-09-25 16:38:44.865629-03"
DEBUG: executing:
/usr/pgsql/bin/repmgr -f /home/postgres/repmgr/repmgr.conf -L DEBUG --no-wait -d 'user=repmgr connect_timeout=60 dbname=repmgr host=srvdgtheartbeatb001' node rejoin
DEBUG: remote_command():
ssh -o Batchmode=yes -o "StrictHostKeyChecking no" -o UserKnownHostsFile=/dev/null -o ConnectTimeout=60 -o ServerAliveInterval=2 srvdgtheartbeata001 /usr/pgsql/bin/repmgr -f /home/postgres/repmgr/repmgr.conf -L DEBUG --no-wait -d 'user=repmgr connect_timeout=60 dbname=repmgr host=srvdgtheartbeatb001' node rejoin
Warning: Permanently added 'srvdgtheartbeata001,10.20.0.1' (RSA) to the list of known hosts.
DEBUG: connecting to: "user=repmgr connect_timeout=60 dbname=repmgr host=srvdgtheartbeatb001 fallback_application_name=repmgr"
ERROR: this node is not part of the rejoin target node's replication cluster
DETAIL: DEBUG: remote_command(): no output returned
DEBUG: is_server_available(): ping status for "host=srvdgtheartbeatb001 user=repmgr dbname=repmgr connect_timeout=60" is PQPING_OK
INFO: node "srvdgtheartbeata001" (ID: 1) is pingable
WARNING: node "srvdgtheartbeata001" not found in "pg_stat_replication"
INFO: waiting for node "srvdgtheartbeata001" (ID: 1) to connect to new primary; 1 of max 60 attempts (parameter "node_rejoin_timeout")
DETAIL: checking for record in node "srvdgtheartbeatb001"'s "pg_stat_replication" table where "application_name" is "srvdgtheartbeata001"
WARNING: node "srvdgtheartbeata001" not found in "pg_stat_replication"

.......................

DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: _create_event():
INSERT INTO repmgr.events ( node_id, event, successful, details ) VALUES ($1, $2, $3, $4) RETURNING event_timestamp
DEBUG: _create_event(): Event timestamp is "2020-09-25 16:39:45.402986-03"
ERROR: node "srvdgtheartbeatb001" (ID: 2) promoted to primary, but demote node "srvdgtheartbeata001" (ID: 1) did not connect to the new primary
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
DEBUG: connecting to: "user=repmgr connect_timeout=60 dbname=repmgr host=srvdgtheartbeata001 fallback_application_name=repmgr"
ERROR: connection to database failed
DETAIL:
could not connect to server: Connection refused
Is the server running on host "srvdgtheartbeata001" (10.20.0.1) and accepting
TCP/IP connections on port 5432?

................

DETAIL: attempted to connect using:
user=repmgr connect_timeout=60 dbname=repmgr host=srvdgtheartbeata001 fallback_application_name=repmgr
INFO: sleeping 1 second; 60 of 60 attempts ("standby_reconnect_timeout") to reconnect to demoted primary
WARNING: switchover did not fully complete
DETAIL: node "srvdgtheartbeatb001" (ID: 2) is now primary but node "srvdgtheartbeata001" (ID: 1) is not reachable
HINT: any inactive replication slots on the old primary will need to be dropped manually
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
NOTICE: STANDBY SWITCHOVER has completed with issues
HINT: see preceding log message(s) for details

Segmentation fault with a backtrace captured via gdb (running node rejoin directly):

(gdb) r -f /home/postgres/repmgr/repmgr.conf -L DEBUG --no-wait -d 'user=repmgr connect_timeout=60 dbname=repmgr host=srvdgtheartbeata001' node rejoin
Starting program: /home2/postgres/repmgr-REL5_1_STABLE/repmgr -f /home/postgres/repmgr/repmgr.conf -L DEBUG --no-wait -d 'user=repmgr connect_timeout=60 dbname=repmgr host=srvdgtheartbeata001' node rejoin
warning: .dynamic section for "/lib/libcrypt.so.1" is not at the expected address
warning: difference appears to be caused by prelink, adjusting expectations
warning: .dynamic section for "/lib/i686/nosegneg/libm.so.6" is not at the expected address
warning: difference appears to be caused by prelink, adjusting expectations
warning: .dynamic section for "/lib/i686/nosegneg/libpthread.so.0" is not at the expected address
warning: difference appears to be caused by prelink, adjusting expectations
[Thread debugging using libthread_db enabled]
DEBUG: connecting to: "user=repmgr connect_timeout=60 dbname=repmgr host=srvdgtheartbeata001 fallback_application_name=repmgr"
ERROR: this node is not part of the rejoin target node's replication cluster
DETAIL:
Program received signal SIGSEGV, Segmentation fault.
0x0037b67b in strlen () from /lib/i686/nosegneg/libc.so.6
(gdb) bt
#0 0x0037b67b in strlen () from /lib/i686/nosegneg/libc.so.6
#1 0x0034b5f8 in vfprintf () from /lib/i686/nosegneg/libc.so.6
#2 0x0034bf42 in buffered_vfprintf () from /lib/i686/nosegneg/libc.so.6
#3 0x00347601 in vfprintf () from /lib/i686/nosegneg/libc.so.6
#4 0x0807feb5 in stderr_log_with_level (level_name=0x80a509e "DETAIL", level=,
fmt=0x8090adc "this node's system identifier is %lu, %s target node's system identifier is %lu",
ap=0xbfff9df4 "I\311\302\030n\240k
\026M\t\b\377\377\377\177") at log.c:96
#5 0x0808008d in log_detail (fmt=0x8090adc "this node's system identifier is %lu, %s target node's system identifier is %lu") at log.c:124
#6 0x0804ae9b in check_node_can_attach (local_tli=21, local_xlogpos=1275068452, follow_target_conn=0x80c8da0, follow_target_node_record=0xbfffb3fc,
is_rejoin=1 '\001') at repmgr-client.c:4099
#7 0x080749d2 in do_node_rejoin () at repmgr-action-node.c:2521
#8 0x08051aea in main (argc=10, argv=0xbfffd014) at repmgr-client.c:1353

Thanks

Thanks for the report.

The failure is occurring when repmgr on the former primary (where node rejoin is being executed) attempts to read the pg_control file. From the logs and the code path, it looks like there may be some sort of issue with the pg_control file which repmgr then isn't catching.

On the former primary, could you execute:

pg_controldata -D /home2/postgres/data

and report the output?

Hi,

While debugging, I saw that the problem is in follow_target_identification.system_identifier:

}
(gdb)
check_node_can_attach (local_tli=21, local_xlogpos=1275068452, follow_target_conn=0x80c8da0, follow_target_node_record=0xbfff9f2c, is_rejoin=1 '\001')
at repmgr-client.c:4096
4096 if (follow_target_identification.system_identifier != local_system_identifier)

(gdb) p local_system_identifier
$32 = 6875765650833459529
(gdb) p follow_target_identification.system_identifier
$33 = 2147483647
(gdb) step
4098 log_error(_("this node is not part of the %s target node's replication cluster"), action);
(gdb) p follow_target_identification.system_identifier
$34 = 2147483647
(gdb) p local_system_identifier
$35 = 6875765650833459529

(gdb)

Former primary:

[postgres@lab146 ~]$ pg_controldata -D /home/postgres/data/
pg_control version number: 942
Catalog version number: 201510051
Database system identifier: 6875765650833459529
Database cluster state: shut down
pg_control last modified: Sex 25 Set 2020 17:47:04 BRT
Latest checkpoint location: 0/4C000024
Prior checkpoint location: 0/4B0034C0
Latest checkpoint's REDO location: 0/4C000024
Latest checkpoint's REDO WAL file: 00000015000000000000004C
Latest checkpoint's TimeLineID: 21
Latest checkpoint's PrevTimeLineID: 21
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/11971
Latest checkpoint's NextOID: 25306
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 1615
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Sex 25 Set 2020 17:47:04 BRT
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: hot_standby
wal_log_hints setting: on
max_connections setting: 150
max_worker_processes setting: 8
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 4
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 2000
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by reference
Data page checksum version: 0

New primary:

[postgres@lab145 ~]$ pg_controldata -D /home2/postgres/data/
pg_control version number: 942
Catalog version number: 201510051
Database system identifier: 6875765650833459529
Database cluster state: in production
pg_control last modified: Sáb 26 Set 2020 21:06:16 BRT
Latest checkpoint location: 0/4E0451C8
Prior checkpoint location: 0/4E0450F8
Latest checkpoint's REDO location: 0/4E045194
Latest checkpoint's REDO WAL file: 00000016000000000000004E
Latest checkpoint's TimeLineID: 22
Latest checkpoint's PrevTimeLineID: 22
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/11994
Latest checkpoint's NextOID: 25306
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 1615
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 11994
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Sáb 26 Set 2020 21:06:16 BRT
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: hot_standby
wal_log_hints setting: on
max_connections setting: 150
max_worker_processes setting: 8
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 4
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 2000
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by reference
Data page checksum version: 0

Thanks

Stepping through where follow_target_identification.system_identifier is set:

check_node_can_attach (local_tli=21, local_xlogpos=1275068452, follow_target_conn=0x80c8dc0, follow_target_node_record=0xbfffa57c, is_rejoin=1 '\001')
at repmgr-client.c:4072

4072 if (PQstatus(follow_target_repl_conn) != CONNECTION_OK)
(gdb)
4077 else if (runtime_options.dry_run == true)
(gdb)
4083 if (identify_system(follow_target_repl_conn, &follow_target_identification) == false)
(gdb)
identify_system (repl_conn=0x80c8240, identification=0xbfff8fc4) at dbutils.c:1636
1636 res = PQexec(repl_conn, "IDENTIFY_SYSTEM;");
(gdb)
1638 if (PQresultStatus(res) != PGRES_TUPLES_OK || !PQntuples(res))
(gdb)
1646 identification->system_identifier = atol(PQgetvalue(res, 0, 0));
(gdb)
atol (repl_conn=0x80c8240, identification=0xbfff8fc4) at dbutils.c:1646
1646 identification->system_identifier = atol(PQgetvalue(res, 0, 0));
(gdb) step
strtol (repl_conn=0x80c8240, identification=0xbfff8fc4) at /usr/include/stdlib.h:336
336 return __strtol_internal (__nptr, __endptr, __base, 0);
(gdb)
identify_system (repl_conn=0x80c8240, identification=0xbfff8fc4) at dbutils.c:1646
1646 identification->system_identifier = atol(PQgetvalue(res, 0, 0));
(gdb) p identification->system_identifier
$46 = 0
(gdb)
$47 = 0
(gdb) step
1647 identification->timeline = atoi(PQgetvalue(res, 0, 1));
(gdb) p identification->system_identifier
$48 = 2147483647

(gdb) p identification->timeline
$49 = 4294967295

Running the same IDENTIFY_SYSTEM command via psql returns the correct value:

[postgres@lab146 repmgr-REL5_1_STABLE]$ psql 'user=repmgr replication=database connect_timeout=60 dbname=repmgr host=srvdgtheartbeata001' -c "IDENTIFY_SYSTEM;"
      systemid       | timeline |  xlogpos   | dbname
---------------------+----------+------------+--------
 6875765650833459529 |       22 | 0/4E04CB2C | repmgr
(1 row)

I compiled repmgr 5.1.0 in 32-bit mode (gcc 4.1.2). I noticed that atol(PQgetvalue(res, 0, 0)) returns a 32-bit long, which cannot hold the full 64-bit system identifier.

In 32-bit mode, most likely long is 32 bits and long long is 64 bits. In 64-bit mode, both are probably 64 bits.
In 32-bit mode, the compiler (more precisely the <stdint.h> header) defines uint64_t as unsigned long long, because unsigned long isn't wide enough.
In 64-bit mode, it defines uint64_t as unsigned long.
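
For reference, a minimal standalone sketch (illustration only, not repmgr code; file name and compile command are just examples) showing the difference on a 32-bit build:

/* sysid_demo.c -- illustration only, not repmgr code.
 * 32-bit build: gcc -m32 sysid_demo.c -o sysid_demo */
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>

int main(void)
{
    /* the system identifier reported by IDENTIFY_SYSTEM in this thread */
    const char *systemid = "6875765650833459529";

    /* With a 32-bit long this value does not fit; glibc's atol() tops out
     * at LONG_MAX (2147483647) -- the exact value seen in gdb above. */
    long as_long = atol(systemid);

    /* atoll()/strtoull() parse into a 64-bit type regardless of the width of long. */
    uint64_t as_uint64 = strtoull(systemid, NULL, 10);

    printf("atol():     %ld\n", as_long);
    printf("strtoull(): %" PRIu64 "\n", as_uint64);

    return 0;
}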

To test, I recompiled using atoll(PQgetvalue(res, 0, 0)) instead.

With that change, node rejoin completed without a segmentation fault:

[postgres@lab146 repmgr-REL5_1_STABLE]$ ./repmgr -f /home/postgres/repmgr/repmgr.conf -L DEBUG --no-wait -d 'user=repmgr connect_timeout=60 dbname=repmgr host=srvdgtheartbeata001' node rejoin --force-rewind
DEBUG: connecting to: "user=repmgr connect_timeout=60 dbname=repmgr host=srvdgtheartbeata001 fallback_application_name=repmgr"
DEBUG: local tli: 21; local_xlogpos: 0/4D000024; follow_target_history->tli: 21; follow_target_history->end: 0/4C00008C
NOTICE: pg_rewind execution required for this node to attach to rejoin target node 1
DETAIL: rejoin target server's timeline 22 forked off current database system timeline 21 before current recovery point 0/4D000024
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/usr/pgsql/bin/pg_rewind -D '/home2/postgres/data' --source-server='host=srvdgtheartbeata001 user=repmgr dbname=repmgr connect_timeout=60'"
NOTICE: 0 files copied to /home2/postgres/data
INFO: creating replication slot as user "repmgr"
DEBUG: create_replication_slot_sql(): creating slot "repmgr_slot_2" on upstream
NOTICE: setting node 2's upstream to node 1
DEBUG: create_recovery_file(): creating "/home2/postgres/data/recovery.conf"...
DEBUG: recovery.conf line: standby_mode = 'on'

DEBUG: recovery.conf line: primary_conninfo = 'user=repmgr connect_timeout=60 host=srvdgtheartbeata001 application_name=srvdgtheartbeatb001'

DEBUG: recovery.conf line: recovery_target_timeline = 'latest'

DEBUG: recovery.conf line: primary_slot_name = 'repmgr_slot_2'

WARNING: unable to ping "host=srvdgtheartbeatb001 user=repmgr dbname=repmgr connect_timeout=60"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/usr/pgsql/bin/pg_ctl -l /dev/null -w -D '/home2/postgres/data' start"
NOTICE: NODE REJOIN successful

That would explain it, then. We have never envisaged repmgr being compiled in 32-bit mode, and it's not explicitly supported.
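
The segfault itself is consistent with the same width mismatch reaching the logging call: in the backtrace, each %lu in the log_detail() format string consumes only 4 bytes on a 32-bit build, while each 64-bit system identifier pushes 8 bytes onto the variadic argument list, so the intervening %s ends up treating the upper half of an identifier as a pointer and strlen() faults. A standalone sketch (illustration only, not repmgr's logging code) of formatting 64-bit values in a width-independent way:

/* uint64_fmt_demo.c -- illustration only, not repmgr's logging code. */
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t local_id  = UINT64_C(6875765650833459529);
    uint64_t target_id = UINT64_C(6875765650833459529);

    /* PRIu64 expands to the correct conversion specifier for uint64_t on the
     * current platform ("llu" on a 32-bit build, typically "lu" on 64-bit),
     * so the variadic arguments stay in step with the format string. */
    printf("this node's system identifier is %" PRIu64 ", "
           "target node's system identifier is %" PRIu64 "\n",
           local_id, target_id);

    return 0;
}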

Does this patch work for you?

diff --git a/dbutils.c b/dbutils.c
index a7f1de8..aab66e3 100644
--- a/dbutils.c
+++ b/dbutils.c
@@ -1674,7 +1674,12 @@ identify_system(PGconn *repl_conn, t_system_identification *identification)
                return false;
        }
 
+#if defined(__i386__) || defined(__i386)
+       identification->system_identifier = atoll(PQgetvalue(res, 0, 0));
+#else
        identification->system_identifier = atol(PQgetvalue(res, 0, 0));
+#endif
+
        identification->timeline = atoi(PQgetvalue(res, 0, 1));
        identification->xlogpos = parse_lsn(PQgetvalue(res, 0, 2));
 
@@ -1711,7 +1716,11 @@ system_identifier(PGconn *conn)
        }
        else
        {
+#if defined(__i386__) || defined(__i386)
+               system_identifier = atoll(PQgetvalue(res, 0, 0));
+#else
                system_identifier = atol(PQgetvalue(res, 0, 0));
+#endif
        }
 
        PQclear(res);
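
For comparison, a width-independent sketch (illustration only, not the patch above; the helper name is hypothetical): strtoull() yields a full 64-bit value whatever the width of long, which would avoid the architecture #ifdef altogether:

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>

/* Hypothetical helper, not repmgr code: strtoull() parses into a 64-bit
 * type on both 32-bit and 64-bit builds, so no __i386__ check is needed. */
static uint64_t parse_system_identifier(const char *value)
{
    return (uint64_t) strtoull(value, NULL, 10);
}

int main(void)
{
    printf("%" PRIu64 "\n", parse_system_identifier("6875765650833459529"));
    return 0;
}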