EnterpriseDB/repmgr

segfault occurs when "repmgr standby clone" is executed in a certain situation


After completion of the failover process, we tried to bring the former primary server back as a standby using the "repmgr standby clone" command. However, when we executed the standby clone command, a segfault occurred.

How to reproduce the issue:

Under the environment described below, run the following command:
repmgr -h node2 -U repmgr -d experdb -f /experdb/app/postgres/etc/repmgr.conf --upstream-node-id=2 standby clone

output:

[experdb@node1 ~]$ repmgr -h node2 -U repmgr -d experdb -f /experdb/app/postgres/etc/repmgr.conf --upstream-node-id=2 standby clone
NOTICE: destination directory "/data" provided
INFO: connecting to source node
DETAIL: connection string is: user=repmgr host=node2 dbname=experdb
DETAIL: current installation size is 30 MB
Segmentation fault

Environment

CentOS 7.6.1810
repmgr 5.1.0
PostgreSQL 11.7
1 primary, 1 standby (2 nodes)

Output of the "repmgr cluster show" command (after the failover):

[experdb@node2 etc]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------
 1  | node1 | primary | - failed  | ?        | default  | 100      |          | host=node1 user=repmgr dbname=experdb connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 4        | host=node2 user=repmgr dbname=experdb connect_timeout=2

WARNING: following issues were detected

  - unable to connect to node "node1" (ID: 1)

Contents of the "/etc/hosts" file:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1 node1
192.168.56.57 node2

P.S. Interestingly, when we remove the 3rd line (127.0.0.1 node1) from the "/etc/hosts" file, the repmgr command does not segfault.

If possible, could you run the standby clone command at DEBUG log level and maximum verbosity, e.g.:

repmgr -h node2 -U repmgr -d experdb -f /experdb/app/postgres/etc/repmgr.conf --upstream-node-id=2 standby clone -LDEBUG --verbose

[experdb@node1 data]$ repmgr -h node2 -U repmgr -d experdb -f /experdb/app/postgres/etc/repmgr.conf --upstream-node-id=2 standby clone -LDEBUG --verbose
NOTICE: using provided configuration file "/experdb/app/postgres/etc/repmgr.conf"
NOTICE: destination directory "/data" provided
DEBUG: check_source_server()
INFO: connecting to source node
DETAIL: connection string is: user=repmgr host=node2 dbname=experdb
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: get_cluster_size():
SELECT pg_catalog.pg_size_pretty(pg_catalog.sum(pg_catalog.pg_database_size(oid))::bigint)       FROM pg_catalog.pg_database
DETAIL: current installation size is 30 MB
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
DEBUG: get_all_node_records():
  SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached     FROM repmgr.nodes n ORDER BY n.node_id
DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
DEBUG: 2 node records returned by source node
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=experdb host=node1 fallback_application_name=repmgr"
ERROR: connection to database failed
DETAIL:
could not connect to server: Connection refused
        Is the server running on host "node1" (127.0.0.1) and accepting
        TCP/IP connections on port 5432?

DETAIL: attempted to connect using:
  user=repmgr connect_timeout=2 dbname=experdb host=node1 fallback_application_name=repmgr
DEBUG: connecting to: "user=repmgr connect_timeout=2 dbname=experdb host=node2 fallback_application_name=repmgr"
DEBUG: set_config():
  SET synchronous_commit TO 'local'
DEBUG: clear_node_info_list() - closing open connections
Segmentation fault
[experdb@node1 data]$

Thanks for the quick reply. The log above is what you asked for.

Thanks, have reproduced the issue.

As a workaround you can delete the record of the failed primary directly, i.e.

DELETE FROM repmgr.nodes WHERE node_id = 1;

The clone operation will then work. (Normally in this case we recommend using repmgr node rejoin to reactivate the former primary as a standby, though that will depend on having the required WAL available and possibly the prerequisites for executing pg_rewind).
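
For reference, a node rejoin invocation for this environment might look something like the sketch below (the config file path and connection string are taken from the report above; this is untested against this particular setup, so treat it as a sketch rather than a recipe):

repmgr node rejoin -f /experdb/app/postgres/etc/repmgr.conf -d 'host=node2 user=repmgr dbname=experdb' --force-rewind --verbose

The --force-rewind option tells repmgr to fall back to pg_rewind if the old primary's timeline has diverged; note that pg_rewind itself requires wal_log_hints = on or data checksums to have been enabled on the old primary beforehand.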

Yes, it works when there is no record for the node in the repmgr.nodes table.
I had already tried that on the day I encountered the bug.

The issue seems to be caused by cleanup operations on a database connection handle which had been finished (closed) but not set to NULL. This causes the intermittent segfaults you were seeing, and is not connected to the presence of a particular entry in /etc/hosts.
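
For illustration only, here is a minimal sketch in C of the bug class described above; it is not the actual repmgr code, and the node_record struct and close_node_connection() helper are invented for the example. The point is that a PGconn handle closed with PQfinish() also needs to be reset to NULL, otherwise a later list-wide cleanup ("closing open connections") can pass already-freed memory back to PQfinish():

#include <libpq-fe.h>

/* Hypothetical record holding a cached connection handle for a node. */
typedef struct
{
    PGconn *conn;
} node_record;

/* Close the cached connection, guarding against a repeat cleanup pass. */
static void
close_node_connection(node_record *node)
{
    if (node->conn != NULL)
    {
        PQfinish(node->conn);

        /*
         * Without this assignment the handle dangles, and a second
         * cleanup pass over the same record would hand freed memory
         * back to PQfinish(), which is what produces the intermittent
         * segfault.
         */
        node->conn = NULL;
    }
}

int
main(void)
{
    node_record node;

    node.conn = PQconnectdb("host=node2 user=repmgr dbname=experdb");

    close_node_connection(&node);   /* first cleanup, e.g. on error */
    close_node_connection(&node);   /* now a harmless no-op */

    return 0;
}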

A fix has been pushed; we can provide snapshot packages if you'd like to test.

sjmik commented

Pretty sure I just ran into this as well.
The DELETE workaround worked for me.
Was there a recent change that caused this, or is it just a coincidence that we both had this happen?

$ uname -a
Linux 78d33bde6a96 4.15.0-96-generic #97-Ubuntu SMP Wed Apr 1 03:25:46 UTC 2020 x86_64 GNU/Linux
$ repmgr --version
repmgr 5.1.0
$ postgres --version
postgres (PostgreSQL) 11.7 (Debian 11.7-2.pgdg90+1)
NOTICE: destination directory "/var/lib/postgresql/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=10.0.18.2 user=repmgr dbname=repmgr
DETAIL: current installation size is 37 MB
/usr/local/bin/docker-entrypoint.sh: line 248:
	54 Segmentation fault (core dumped)
	gosu postgres repmgr -h $PARTNER_ADDRESS -U repmgr -d repmgr standby clone --fast-checkpoint
Core was generated by `/usr/lib/postgresql/11/bin/repmgr -h 10.0.18.2 -U repmgr -d repmgr standby clon'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007ff65c736c00 in resetPQExpBuffer () from /usr/lib/x86_64-linux-gnu/libpq.so.5
(gdb) bt full
#0  0x00007ff65c736c00 in resetPQExpBuffer () from /usr/lib/x86_64-linux-gnu/libpq.so.5
No symbol table info available.
#1  0x00007ff65c736e18 in printfPQExpBuffer () from /usr/lib/x86_64-linux-gnu/libpq.so.5
No symbol table info available.
#2  0x00007ff65c737622 in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
No symbol table info available.
#3  0x00007ff65c72ed34 in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
No symbol table info available.
#4  0x00007ff65c7245fd in ?? () from /usr/lib/x86_64-linux-gnu/libpq.so.5
No symbol table info available.
#5  0x00007ff65c724626 in PQfinish () from /usr/lib/x86_64-linux-gnu/libpq.so.5
No symbol table info available.
#6  0x0000562ca08f5ffc in ?? ()
No symbol table info available.
#7  0x0000562ca08c8c20 in ?? ()
No symbol table info available.
#8  0x0000562ca08ccfad in ?? ()
No symbol table info available.
#9  0x0000562ca08beeee in ?? ()
No symbol table info available.
#10 0x00007ff65c17a2e1 in __libc_start_main (main=0x562ca08bc950, argc=10, argv=0x7ffe00a0a238, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe00a0a228) at ../csu/libc-start.c:291
        result = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {0, -3511388801293494682, 94749672078368, 140728908948016, 0, 0, -7142627491962825114, -7137937894434417050}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x7ffe00a0a290, 0x7ff65cb89170}, data = {
              prev = 0x0, cleanup = 0x0, canceltype = 10527376}}}
        not_first_call = <optimized out>
#11 0x0000562ca08bfc4a in ?? ()
No symbol table info available.

It looks like this has been around for a while, but because of the nature of the bug it's very intermittent, and the particular combination of activity (performing a standby clone with a pre-existing node record) is not one we'd foreseen, so there was no test coverage.