EnterpriseDB/repmgr

It is possible that the "reconnect_interval" parameter does not work in repmgr version 5.2.0

Closed this issue · 8 comments

my_systems.txt
postgresql_zabproxynode1_conf.txt
postgresql_zabservtstnode1_conf.txt
postgresql_zabservtstpve_conf.txt
repmgr_zabproxynode1_conf.txt
repmgr_zabservtstnode1_conf.txt
repmgr_zabservtstpve_conf.txt
does_not_work_reconnect_interval_repmgr_log.txt
my_systems.txt

Good time of the day, everyone.
I will briefly describe the problem.
I have a cluster of two servers: Primary and Standby and Witness with PostgreSQL 12.
I use repmgr version 5.2.0 as a high availability system. Earlier I used repmgr version 5.1.0.
02.11.2020 I updated repmgr to version 5.2.0.
And I noticed the following strange behavior of the repmgr 5.2.0 in the scenario of promoting Standby to the Master.
Namely, in the repmgr.conf file, the "reconnect_interval" parameter is set to 10 seconds by default:
...
#reconnect_interval=10
...
But repmgr on Standby makes attempts to connect with the Master with a frequency of 1 second. I emulated the failure of the Master by the command: systemstl stop postgresql:
...
[2020-11-04 14:28:33] [DEBUG] replication lag in bytes is: 0
[2020-11-04 14:28:35] [DEBUG] connection check type is "ping"
[2020-11-04 14:28:35] [WARNING] unable to ping "host = zabservtstnode1 user = repmgr dbname = repmgr connect_timeout = 10"
[2020-11-04 14:28:35] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-11-04 14:28:35] [WARNING] unable to connect to upstream node "zabservtstnode1" (ID: 1)
[2020-11-04 14:28:35] [INFO] checking state of node "zabservtstnode1" (ID: 1), 1 of 6 attempts
[2020-11-04 14:28:35] [WARNING] unable to ping "user=repmgr connect_timeout=10 dbname=repmgr host=zabservtstnode1 fallback_application_name=repmgr"
[2020-11-04 14:28:35] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-11-04 14:28:35] [INFO] sleeping up to 1 seconds until next reconnection attempt
[2020-11-04 14:28:36] [INFO] checking state of node "zabservtstnode1" (ID: 1), 2 of 6 attempts
[2020-11-04 14:28:36] [WARNING] unable to ping "user=repmgr connect_timeout=10 dbname=repmgr host=zabservtstnode1 fallback_application_name=repmgr"
[2020-11-04 14:28:36] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
...
I tried setting other values for this parameter, but the repmgr behavior did not change. The polling rate is still 1 second.

In version 5.1.0, there was no such repmgr behavior, the interval of polling attempts exactly matched the "reconnect_interval" parameter.

Please tell me this is a bug in version 5.2.0. Or is this the correct behavior of repmgr and is it configurable by another parameter?

P.S. Here are the necessary files:

  • parameters of my systems
  • postgreSQL config files
  • repmgr configuration files
  • repmgr.log file with the described situation

Yes, I have exactly the same problem since version 5.2.
Interestingly an additional witness server uses the correct parameter value.

Hello everybody.
I also noticed that the "reconnect_interval" parameter does not work on the postgreSQL service stop event on the Master. If I disable the network interface eth on the Master (ifdown eth2), then the "reconnect_interval" parameter works correctly.
To test this, I set the parameters: reconnect_attempts = 3 and reconnect_interval = 8. Restarting repmgrd. And then disconnected the network interface on the Master. In the repmgr.log I saw the correct operation of "reconnect_interval".
I would like the developers to give at least some kind of feedback.

Here is an excerpt from repmgr.log.
ifdown_eth_on_Master_reconnect_interval_repmgr_log.txt

Thanks for the reports, we will look into them ASAP but it may take a few days.

Unfortunately I can't reproduce the usage of the correct parameter value on the standby by shutting down eth0 on the primary.

Issue has been reproduced and a fix applied, which will be in the upcoming 5.2.1 release.

Thanks for the report!

Great!

Good time of the day, Ibarwick.
After updating repmgr to version 5.2.1, the problem is resolved. For both types of failures: stop postgresql and ifdown eth2, the "reconnect_interval" parameter works correctly. Here is an excerpt from repmgr.log:

[2020-12-09 16:46:28] [DEBUG] replication lag in bytes is: 0
[2020-12-09 16:46:30] [DEBUG] connection check type is "ping"
[2020-12-09 16:46:30] [DEBUG] replication lag in bytes is: 0
[2020-12-09 16:46:32] [DEBUG] connection check type is "ping"
[2020-12-09 16:46:32] [WARNING] unable to ping "host = zabservtstnode1 user = repmgr dbname = repmgr connect_timeout = 10"
[2020-12-09 16:46:32] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-12-09 16:46:32] [WARNING] unable to connect to upstream node "zabservtstnode1" (ID: 1)
[2020-12-09 16:46:32] [INFO] checking state of node "zabservtstnode1" (ID: 1), 1 of 3 attempts
[2020-12-09 16:46:32] [WARNING] unable to ping "user=repmgr connect_timeout=10 dbname=repmgr host=zabservtstnode1 fallback_application_name=repmgr"
[2020-12-09 16:46:32] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-12-09 16:46:32] [INFO] sleeping up to 8 seconds until next reconnection attempt
[2020-12-09 16:46:40] [INFO] checking state of node "zabservtstnode1" (ID: 1), 2 of 3 attempts
[2020-12-09 16:46:50] [WARNING] unable to ping "user=repmgr connect_timeout=10 dbname=repmgr host=zabservtstnode1 fallback_application_name=repmgr"
[2020-12-09 16:46:50] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-12-09 16:46:50] [INFO] sleeping up to 8 seconds until next reconnection attempt
[2020-12-09 16:46:58] [INFO] checking state of node "zabservtstnode1" (ID: 1), 3 of 3 attempts
[2020-12-09 16:47:08] [WARNING] unable to ping "user=repmgr connect_timeout=10 dbname=repmgr host=zabservtstnode1 fallback_application_name=repmgr"
[2020-12-09 16:47:08] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-12-09 16:47:08] [WARNING] unable to reconnect to node "zabservtstnode1" (ID: 1) after 3 attempts
[2020-12-09 16:47:08] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2020-12-09 16:47:08] [INFO] sleeping 5 seconds
[2020-12-09 16:47:13] [NOTICE] killing WAL receiver with PID 16093
[2020-12-09 16:47:14] [INFO] WAL receiver with pid 16093 killed

Thank you and your team for solving this problem !!!

I'll close the problem now.

Thanks for the confirmation, much appreciated!