PostgresSQL does not restart after "repmgr standby follow"
Closed this issue · 2 comments
I installed a new PosgtreSQL 12 cluster with repmgr with one master and 2 standbies:
postgres@postgres1:~$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-----------+---------+-----------+-----------+----------+----------+----------+------------------------------------------------------------
1 | postgres1 | primary | * running | | default | 100 | 1 | host=postgres1 user=repmgr dbname=repmgr connect_timeout=2
2 | postgres2 | standby | running | postgres1 | default | 100 | 1 | host=postgres2 user=repmgr dbname=repmgr connect_timeout=2
3 | postgres3 | standby | running | postgres1 | default | 100 | 1 | host=postgres3 user=repmgr dbname=repmgr connect_timeout=2
After a failure on the postgres1
primary I can correctly promote postgres2
as a new primary.
When I execute repmgr standby follow
on the postgres3
standby I don't have any error, but the postgres
service on postgres3
does not start. If I start it manually it starts and it correctly follows the new master postgres2
.
How to reproduce:
Working repmgr cluster with one primary and two standbies:
postgres@postgres1:~$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-----------+---------+-----------+-----------+----------+----------+----------+------------------------------------------------------------
1 | postgres1 | primary | * running | | default | 100 | 1 | host=postgres1 user=repmgr dbname=repmgr connect_timeout=2
2 | postgres2 | standby | running | postgres1 | default | 100 | 1 | host=postgres2 user=repmgr dbname=repmgr connect_timeout=2
3 | postgres3 | standby | running | postgres1 | default | 100 | 1 | host=postgres3 user=repmgr dbname=repmgr connect_timeout=2
Stop PostgreSQL on the primary:
TEST root@postgres1 ~# systemctl stop postgresql
Now the primary is missing:
postgres@postgres2:~$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-----------+---------+---------------+-------------+----------+----------+----------+------------------------------------------------------------
1 | postgres1 | primary | ? unreachable | ? | default | 100 | | host=postgres1 user=repmgr dbname=repmgr connect_timeout=2
2 | postgres2 | standby | running | ? postgres1 | default | 100 | 1 | host=postgres2 user=repmgr dbname=repmgr connect_timeout=2
3 | postgres3 | standby | running | ? postgres1 | default | 100 | 1 | host=postgres3 user=repmgr dbname=repmgr connect_timeout=2
WARNING: following issues were detected
- unable to connect to node "postgres1" (ID: 1)
- node "postgres1" (ID: 1) is registered as an active primary but is unreachable
- unable to connect to node "postgres2" (ID: 2)'s upstream node "postgres1" (ID: 1)
- unable to determine if node "postgres2" (ID: 2) is attached to its upstream node "postgres1" (ID: 1)
- unable to connect to node "postgres3" (ID: 3)'s upstream node "postgres1" (ID: 1)
- unable to determine if node "postgres3" (ID: 3) is attached to its upstream node "postgres1" (ID: 1)
HINT: execute with --verbose option to see connection error messages
Promote postgres2 standby to new primary:
postgres@postgres2:~$ repmgr standby promote
WARNING: 1 sibling nodes found, but option "--siblings-follow" not specified
DETAIL: these nodes will remain attached to the current primary:
postgres3 (node ID: 3)
NOTICE: promoting standby to primary
DETAIL: promoting server "postgres2" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "postgres2" (ID: 2) was successfully promoted to primary
Now postgres2 is primary:
postgres@postgres2:~$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-----------+---------+-----------+-------------+----------+----------+----------+------------------------------------------------------------
1 | postgres1 | primary | - failed | ? | default | 100 | | host=postgres1 user=repmgr dbname=repmgr connect_timeout=2
2 | postgres2 | primary | * running | | default | 100 | 2 | host=postgres2 user=repmgr dbname=repmgr connect_timeout=2
3 | postgres3 | standby | running | ? postgres1 | default | 100 | 1 | host=postgres3 user=repmgr dbname=repmgr connect_timeout=2
WARNING: following issues were detected
- unable to connect to node "postgres1" (ID: 1)
- unable to connect to node "postgres3" (ID: 3)'s upstream node "postgres1" (ID: 1)
- unable to determine if node "postgres3" (ID: 3) is attached to its upstream node "postgres1" (ID: 1)
HINT: execute with --verbose option to see connection error messages
On postgres3 follow the new primary:
postgres@postgres3:~$ repmgr standby follow
NOTICE: attempting to find and follow current primary
INFO: local node 3 can attach to follow target node 2
DETAIL: local node's recovery point: 0/60000A0; follow target node's fork point: 0/60000A0
INFO: creating replication slot as user "repmgr"
NOTICE: setting node 3's upstream to node 2
NOTICE: stopping server using "/usr/lib/postgresql/12/bin/pg_ctl -D '/var/lib/postgresql/12/main' -w -m fast stop"
NOTICE: starting server using "/usr/lib/postgresql/12/bin/pg_ctl -w -D '/var/lib/postgresql/12/main' start"
WARNING: unable to connect to old upstream node 1 to remove replication slot
HINT: if reusing this node, you should manually remove any inactive replication slots
NOTICE: STANDBY FOLLOW successful
DETAIL: standby attached to upstream node "postgres2" (ID: 2)
But after this, PostgreSQL is no more running on postgres3:
postgres@postgres3:~$ repmgr cluster show
ERROR: connection to database failed
DETAIL:
could not connect to server: Connection refused
Is the server running on host "postgres3" (192.168.172.23) and accepting
TCP/IP connections on port 5432?
DETAIL: attempted to connect using:
user=repmgr connect_timeout=2 dbname=repmgr host=postgres3 fallback_application_name=repmgr
PostgreSQL logs on postgres3 during the follow procedure:
2020-08-28 20:40:37.339 UTC [15522] LOG: received fast shutdown request
2020-08-28 20:40:37.340 UTC [15522] LOG: aborting any active transactions
2020-08-28 20:40:42.346 UTC [15524] LOG: shutting down
2020-08-28 20:40:42.360 UTC [15522] LOG: database system is shut down
syslogs on postgres3 during the follow procedure:
Aug 28 20:40:42 postgres3 postgresql@12-main[15902]: Cluster is not running.
Aug 28 20:40:42 postgres3 systemd[1]: postgresql@12-main.service: Control process exited, code=exited, status=2/INVALIDARGUMENT
Aug 28 20:40:42 postgres3 systemd[1]: postgresql@12-main.service: Failed with result 'exit-code'.
If I manually start PostgreSQL on postgres3 it starts without any error and it's following the new master postgres2:
TEST root@postgres3 ~# systemctl start postgresql
TEST root@postgres3 ~# su - postgres
postgres@postgres3:~$ repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-----------+---------+-----------+-----------+----------+----------+----------+------------------------------------------------------------
1 | postgres1 | primary | - failed | ? | default | 100 | | host=postgres1 user=repmgr dbname=repmgr connect_timeout=2
2 | postgres2 | primary | * running | | default | 100 | 2 | host=postgres2 user=repmgr dbname=repmgr connect_timeout=2
3 | postgres3 | standby | running | postgres2 | default | 100 | 1 | host=postgres3 user=repmgr dbname=repmgr connect_timeout=2
WARNING: following issues were detected
- unable to connect to node "postgres1" (ID: 1)
HINT: execute with --verbose option to see connection error messages
Environment details:
- 3 Debian Buster nodes
- PostgreSQL 12.4
- repmgr 5.1.0
Edit:
I tried the same with Repmgr 4.4 and PostgreSQL 11.5, the problem does not happen (PostgreSQL is correctly restarted after follow). I then upgraded to Repmgr 5.1.0 (maintaining PostgreSQL 11.5) and the same problem occurs.
I solved the problem adding service commands to /etc/repmgr.conf
:
service_start_command='sudo systemctl start postgresql@12-main.service'
service_stop_command='sudo systemctl start postgresql@12-main.service'
service_restart_command='sudo systemctl restart postgresql@12-main.service'
service_reload_command='sudo systemctl reload postgresql@12-main.service'
Typo in your comment @mmartinello The second line has systemctl start instead of systemctl stop!