It is possible that the "reconnect_interval" parameter does not work in repmgr version 5.2.0

Question

It is possible that the "reconnect_interval" parameter does not work in repmgr version 5.2.0

Closed this issue 4 years ago · 8 comments

my_systems.txt
postgresql_zabproxynode1_conf.txt
postgresql_zabservtstnode1_conf.txt
postgresql_zabservtstpve_conf.txt
repmgr_zabproxynode1_conf.txt
repmgr_zabservtstnode1_conf.txt
repmgr_zabservtstpve_conf.txt
does_not_work_reconnect_interval_repmgr_log.txt
my_systems.txt

Good time of the day, everyone.
I will briefly describe the problem.
I have a cluster of two servers: Primary and Standby and Witness with PostgreSQL 12.
I use repmgr version 5.2.0 as a high availability system. Earlier I used repmgr version 5.1.0.
02.11.2020 I updated repmgr to version 5.2.0.
And I noticed the following strange behavior of the repmgr 5.2.0 in the scenario of promoting Standby to the Master.
Namely, in the repmgr.conf file, the "reconnect_interval" parameter is set to 10 seconds by default:
...
#reconnect_interval=10
...
But repmgr on Standby makes attempts to connect with the Master with a frequency of 1 second. I emulated the failure of the Master by the command: systemstl stop postgresql:
...
[2020-11-04 14:28:33] [DEBUG] replication lag in bytes is: 0
[2020-11-04 14:28:35] [DEBUG] connection check type is "ping"
[2020-11-04 14:28:35] [WARNING] unable to ping "host = zabservtstnode1 user = repmgr dbname = repmgr connect_timeout = 10"
[2020-11-04 14:28:35] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-11-04 14:28:35] [WARNING] unable to connect to upstream node "zabservtstnode1" (ID: 1)
[2020-11-04 14:28:35] [INFO] checking state of node "zabservtstnode1" (ID: 1), 1 of 6 attempts
[2020-11-04 14:28:35] [WARNING] unable to ping "user=repmgr connect_timeout=10 dbname=repmgr host=zabservtstnode1 fallback_application_name=repmgr"
[2020-11-04 14:28:35] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-11-04 14:28:35] [INFO] sleeping up to 1 seconds until next reconnection attempt
[2020-11-04 14:28:36] [INFO] checking state of node "zabservtstnode1" (ID: 1), 2 of 6 attempts
[2020-11-04 14:28:36] [WARNING] unable to ping "user=repmgr connect_timeout=10 dbname=repmgr host=zabservtstnode1 fallback_application_name=repmgr"
[2020-11-04 14:28:36] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
...
I tried setting other values for this parameter, but the repmgr behavior did not change. The polling rate is still 1 second.

In version 5.1.0, there was no such repmgr behavior, the interval of polling attempts exactly matched the "reconnect_interval" parameter.

Please tell me this is a bug in version 5.2.0. Or is this the correct behavior of repmgr and is it configurable by another parameter?

P.S. Here are the necessary files:

parameters of my systems
postgreSQL config files
repmgr configuration files
repmgr.log file with the described situation

stephan-hahn commented 4 years ago

Great!

Answer 1 · 2020-11-09T12:16:11.000Z

Yes, I have exactly the same problem since version 5.2.
Interestingly an additional witness server uses the correct parameter value.

Answer 2 · 2020-11-10T07:33:33.000Z

Hello everybody.
I also noticed that the "reconnect_interval" parameter does not work on the postgreSQL service stop event on the Master. If I disable the network interface eth on the Master (ifdown eth2), then the "reconnect_interval" parameter works correctly.
To test this, I set the parameters: reconnect_attempts = 3 and reconnect_interval = 8. Restarting repmgrd. And then disconnected the network interface on the Master. In the repmgr.log I saw the correct operation of "reconnect_interval".
I would like the developers to give at least some kind of feedback.

Here is an excerpt from repmgr.log.
ifdown_eth_on_Master_reconnect_interval_repmgr_log.txt

Answer 3 · 2020-11-10T07:44:07.000Z

Thanks for the reports, we will look into them ASAP but it may take a few days.

Answer 4 · 2020-11-10T10:30:57.000Z

Unfortunately I can't reproduce the usage of the correct parameter value on the standby by shutting down eth0 on the primary.

Answer 5 · 2020-12-02T00:30:36.000Z

Issue has been reproduced and a fix applied, which will be in the upcoming 5.2.1 release.

Thanks for the report!

Answer 6 · 2020-12-09T07:23:10.000Z

Good time of the day, Ibarwick.
After updating repmgr to version 5.2.1, the problem is resolved. For both types of failures: stop postgresql and ifdown eth2, the "reconnect_interval" parameter works correctly. Here is an excerpt from repmgr.log:

[2020-12-09 16:46:28] [DEBUG] replication lag in bytes is: 0
[2020-12-09 16:46:30] [DEBUG] connection check type is "ping"
[2020-12-09 16:46:30] [DEBUG] replication lag in bytes is: 0
[2020-12-09 16:46:32] [DEBUG] connection check type is "ping"
[2020-12-09 16:46:32] [WARNING] unable to ping "host = zabservtstnode1 user = repmgr dbname = repmgr connect_timeout = 10"
[2020-12-09 16:46:32] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-12-09 16:46:32] [WARNING] unable to connect to upstream node "zabservtstnode1" (ID: 1)
[2020-12-09 16:46:32] [INFO] checking state of node "zabservtstnode1" (ID: 1), 1 of 3 attempts
[2020-12-09 16:46:32] [WARNING] unable to ping "user=repmgr connect_timeout=10 dbname=repmgr host=zabservtstnode1 fallback_application_name=repmgr"
[2020-12-09 16:46:32] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-12-09 16:46:32] [INFO] sleeping up to 8 seconds until next reconnection attempt
[2020-12-09 16:46:40] [INFO] checking state of node "zabservtstnode1" (ID: 1), 2 of 3 attempts
[2020-12-09 16:46:50] [WARNING] unable to ping "user=repmgr connect_timeout=10 dbname=repmgr host=zabservtstnode1 fallback_application_name=repmgr"
[2020-12-09 16:46:50] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-12-09 16:46:50] [INFO] sleeping up to 8 seconds until next reconnection attempt
[2020-12-09 16:46:58] [INFO] checking state of node "zabservtstnode1" (ID: 1), 3 of 3 attempts
[2020-12-09 16:47:08] [WARNING] unable to ping "user=repmgr connect_timeout=10 dbname=repmgr host=zabservtstnode1 fallback_application_name=repmgr"
[2020-12-09 16:47:08] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-12-09 16:47:08] [WARNING] unable to reconnect to node "zabservtstnode1" (ID: 1) after 3 attempts
[2020-12-09 16:47:08] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2020-12-09 16:47:08] [INFO] sleeping 5 seconds
[2020-12-09 16:47:13] [NOTICE] killing WAL receiver with PID 16093
[2020-12-09 16:47:14] [INFO] WAL receiver with pid 16093 killed

Thank you and your team for solving this problem !!!

I'll close the problem now.

Answer 7 · 2020-12-09T08:12:46.000Z

Thanks for the confirmation, much appreciated!