EnterpriseDB/repmgr

PostgresSQL does not restart after "repmgr standby follow"

Closed this issue · 2 comments

I installed a new PosgtreSQL 12 cluster with repmgr with one master and 2 standbies:

postgres@postgres1:~$ repmgr cluster show
 ID | Name      | Role    | Status    | Upstream  | Location | Priority | Timeline | Connection string                                         
----+-----------+---------+-----------+-----------+----------+----------+----------+------------------------------------------------------------
 1  | postgres1 | primary | * running |           | default  | 100      | 1        | host=postgres1 user=repmgr dbname=repmgr connect_timeout=2
 2  | postgres2 | standby |   running | postgres1 | default  | 100      | 1        | host=postgres2 user=repmgr dbname=repmgr connect_timeout=2
 3  | postgres3 | standby |   running | postgres1 | default  | 100      | 1        | host=postgres3 user=repmgr dbname=repmgr connect_timeout=2

After a failure on the postgres1 primary I can correctly promote postgres2 as a new primary.

When I execute repmgr standby follow on the postgres3 standby I don't have any error, but the postgres service on postgres3 does not start. If I start it manually it starts and it correctly follows the new master postgres2.

How to reproduce:

Working repmgr cluster with one primary and two standbies:

postgres@postgres1:~$ repmgr cluster show
 ID | Name      | Role    | Status    | Upstream  | Location | Priority | Timeline | Connection string                                         
----+-----------+---------+-----------+-----------+----------+----------+----------+------------------------------------------------------------
 1  | postgres1 | primary | * running |           | default  | 100      | 1        | host=postgres1 user=repmgr dbname=repmgr connect_timeout=2
 2  | postgres2 | standby |   running | postgres1 | default  | 100      | 1        | host=postgres2 user=repmgr dbname=repmgr connect_timeout=2
 3  | postgres3 | standby |   running | postgres1 | default  | 100      | 1        | host=postgres3 user=repmgr dbname=repmgr connect_timeout=2

Stop PostgreSQL on the primary:

TEST root@postgres1 ~# systemctl stop postgresql

Now the primary is missing:

postgres@postgres2:~$ repmgr cluster show
 ID | Name      | Role    | Status        | Upstream    | Location | Priority | Timeline | Connection string                                         
----+-----------+---------+---------------+-------------+----------+----------+----------+------------------------------------------------------------
 1  | postgres1 | primary | ? unreachable | ?           | default  | 100      |          | host=postgres1 user=repmgr dbname=repmgr connect_timeout=2
 2  | postgres2 | standby |   running     | ? postgres1 | default  | 100      | 1        | host=postgres2 user=repmgr dbname=repmgr connect_timeout=2
 3  | postgres3 | standby |   running     | ? postgres1 | default  | 100      | 1        | host=postgres3 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "postgres1" (ID: 1)
  - node "postgres1" (ID: 1) is registered as an active primary but is unreachable
  - unable to connect to node "postgres2" (ID: 2)'s upstream node "postgres1" (ID: 1)
  - unable to determine if node "postgres2" (ID: 2) is attached to its upstream node "postgres1" (ID: 1)
  - unable to connect to node "postgres3" (ID: 3)'s upstream node "postgres1" (ID: 1)
  - unable to determine if node "postgres3" (ID: 3) is attached to its upstream node "postgres1" (ID: 1)

HINT: execute with --verbose option to see connection error messages

Promote postgres2 standby to new primary:

postgres@postgres2:~$ repmgr standby promote
WARNING: 1 sibling nodes found, but option "--siblings-follow" not specified
DETAIL: these nodes will remain attached to the current primary:
  postgres3 (node ID: 3)
NOTICE: promoting standby to primary
DETAIL: promoting server "postgres2" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "postgres2" (ID: 2) was successfully promoted to primary

Now postgres2 is primary:

postgres@postgres2:~$ repmgr cluster show
 ID | Name      | Role    | Status    | Upstream    | Location | Priority | Timeline | Connection string                                         
----+-----------+---------+-----------+-------------+----------+----------+----------+------------------------------------------------------------
 1  | postgres1 | primary | - failed  | ?           | default  | 100      |          | host=postgres1 user=repmgr dbname=repmgr connect_timeout=2
 2  | postgres2 | primary | * running |             | default  | 100      | 2        | host=postgres2 user=repmgr dbname=repmgr connect_timeout=2
 3  | postgres3 | standby |   running | ? postgres1 | default  | 100      | 1        | host=postgres3 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "postgres1" (ID: 1)
  - unable to connect to node "postgres3" (ID: 3)'s upstream node "postgres1" (ID: 1)
  - unable to determine if node "postgres3" (ID: 3) is attached to its upstream node "postgres1" (ID: 1)

HINT: execute with --verbose option to see connection error messages

On postgres3 follow the new primary:

postgres@postgres3:~$ repmgr standby follow
NOTICE: attempting to find and follow current primary
INFO: local node 3 can attach to follow target node 2
DETAIL: local node's recovery point: 0/60000A0; follow target node's fork point: 0/60000A0
INFO: creating replication slot as user "repmgr"
NOTICE: setting node 3's upstream to node 2
NOTICE: stopping server using "/usr/lib/postgresql/12/bin/pg_ctl  -D '/var/lib/postgresql/12/main' -w -m fast stop"
NOTICE: starting server using "/usr/lib/postgresql/12/bin/pg_ctl  -w -D '/var/lib/postgresql/12/main' start"
WARNING: unable to connect to old upstream node 1 to remove replication slot
HINT: if reusing this node, you should manually remove any inactive replication slots
NOTICE: STANDBY FOLLOW successful
DETAIL: standby attached to upstream node "postgres2" (ID: 2)

But after this, PostgreSQL is no more running on postgres3:

postgres@postgres3:~$ repmgr cluster show
ERROR: connection to database failed
DETAIL: 
could not connect to server: Connection refused
	Is the server running on host "postgres3" (192.168.172.23) and accepting
	TCP/IP connections on port 5432?

DETAIL: attempted to connect using:
  user=repmgr connect_timeout=2 dbname=repmgr host=postgres3 fallback_application_name=repmgr

PostgreSQL logs on postgres3 during the follow procedure:

2020-08-28 20:40:37.339 UTC [15522] LOG:  received fast shutdown request
2020-08-28 20:40:37.340 UTC [15522] LOG:  aborting any active transactions
2020-08-28 20:40:42.346 UTC [15524] LOG:  shutting down
2020-08-28 20:40:42.360 UTC [15522] LOG:  database system is shut down

syslogs on postgres3 during the follow procedure:

Aug 28 20:40:42 postgres3 postgresql@12-main[15902]: Cluster is not running.
Aug 28 20:40:42 postgres3 systemd[1]: postgresql@12-main.service: Control process exited, code=exited, status=2/INVALIDARGUMENT
Aug 28 20:40:42 postgres3 systemd[1]: postgresql@12-main.service: Failed with result 'exit-code'.

If I manually start PostgreSQL on postgres3 it starts without any error and it's following the new master postgres2:

TEST root@postgres3 ~# systemctl start postgresql
TEST root@postgres3 ~# su - postgres
postgres@postgres3:~$ repmgr cluster show
 ID | Name      | Role    | Status    | Upstream  | Location | Priority | Timeline | Connection string                                         
----+-----------+---------+-----------+-----------+----------+----------+----------+------------------------------------------------------------
 1  | postgres1 | primary | - failed  | ?         | default  | 100      |          | host=postgres1 user=repmgr dbname=repmgr connect_timeout=2
 2  | postgres2 | primary | * running |           | default  | 100      | 2        | host=postgres2 user=repmgr dbname=repmgr connect_timeout=2
 3  | postgres3 | standby |   running | postgres2 | default  | 100      | 1        | host=postgres3 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "postgres1" (ID: 1)

HINT: execute with --verbose option to see connection error messages

Environment details:

  • 3 Debian Buster nodes
  • PostgreSQL 12.4
  • repmgr 5.1.0

Edit:

I tried the same with Repmgr 4.4 and PostgreSQL 11.5, the problem does not happen (PostgreSQL is correctly restarted after follow). I then upgraded to Repmgr 5.1.0 (maintaining PostgreSQL 11.5) and the same problem occurs.

I solved the problem adding service commands to /etc/repmgr.conf:

service_start_command='sudo systemctl start postgresql@12-main.service'
service_stop_command='sudo systemctl start postgresql@12-main.service'
service_restart_command='sudo systemctl restart postgresql@12-main.service'
service_reload_command='sudo systemctl reload postgresql@12-main.service'
lukos commented

Typo in your comment @mmartinello The second line has systemctl start instead of systemctl stop!