ClusterLabs/PAF

[PCS] PostgreSQL 9.6 node in blocked state

s-blottk opened this issue · 2 comments

Hello,

I set up PAF on my production cluster with three PostgreSQL 9.6 instances.

Initially, no node fencing was configured.

After a reboot of one node, db01 is now in a blocked state.

Node fencing is enabled now.

How can I get db01 back into a working state?

Cluster name: cluster_pgsql
Stack: corosync
Current DC: otn-ac-monp-db03.local (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Fri Apr  8 17:04:57 2022
Last change: Fri Apr  8 17:01:50 2022 by root via cibadmin on otn-ac-monp-db01.local

3 nodes configured
5 resource instances configured (1 BLOCKED from further action due to failure)

Online: [ otn-ac-monp-db01.local otn-ac-monp-db02.local otn-ac-monp-db03.local ]

Full list of resources:

 Master/Slave Set: pgsql-ha [pgsqld]
     pgsqld     (ocf::heartbeat:pgsqlms):       FAILED otn-ac-monp-db01.local (Monitoring, blocked)
     Masters: [ otn-ac-monp-db02.local ]
     Slaves: [ otn-ac-monp-db03.local ]
 pgsql-pri-ip   (ocf::heartbeat:IPaddr2):       Started otn-ac-monp-db02.local
 fence_device   (stonith:fence_vmware_soap):    Started otn-ac-monp-db03.local

Failed Resource Actions:
* pgsqld_stop_0 on otn-ac-monp-db01.local 'unknown error' (1): call=62, status=complete, exitreason='Unexpected state for instance "pgsqld" (returned 9)',
    last-rc-change='Fri Apr  8 17:01:51 2022', queued=0ms, exec=99ms

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Any ideas?

Hi,

PAF requires fencing to work correctly.

Pacemaker marks a resource as blocked when an error occurs and there is no action left that can make sure the resource is really stopped (here: the stop action failed and no fencing was set up).
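As a rough sketch of one common recovery path (double-check the exact syntax against your pcs version; the node and resource names are taken from your status output above): once you are sure the instance on db01 holds nothing you still need, either let the cluster fence the node, or verify by hand that PostgreSQL is fully stopped there, then clear the failed stop action so Pacemaker recomputes what to do:

# option 1: let the cluster fence db01 to guarantee it is down
pcs stonith fence otn-ac-monp-db01.local

# option 2: check by hand on db01 that no postgres processes are left
ps -u postgres -f

# in both cases, then clear the resource's failure history
pcs resource cleanup pgsqld

After the cleanup, Pacemaker should try to start pgsqld on db01 as a standby again.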

No matter what kind of maintenance you want to do, if the node is part of a cluster, you must consider the whole cluster. Imagine the cluster is a worker in your team. If you don't talk to it, it will do its job no matter what. So if you want to shut down/reboot your node, make sure to tell the cluster first.
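For example, here is a minimal sketch of a clean node reboot, assuming the pcs 0.9 syntax shipped on EL7 (newer pcs releases use "pcs node standby"/"pcs node unstandby" instead):

# move resources off db01 and stop the cluster stack on it
pcs cluster standby otn-ac-monp-db01.local
pcs cluster stop otn-ac-monp-db01.local

# ...reboot the node...

# bring it back and allow it to host resources again
pcs cluster start otn-ac-monp-db01.local
pcs cluster unstandby otn-ac-monp-db01.local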

Last but not least, if you don't set up fencing, you'll keep running into trouble and questions. Just set it up, or your cluster will never behave correctly. At the very least, deploy three nodes and a watchdog if you can't afford or set up a two-node cluster with a remote power-off fencing solution...
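As an illustration only (the exact arguments depend on your pcs and sbd versions and your distribution), diskless watchdog fencing with SBD on a three-node cluster looks roughly like this:

# on every node: install sbd and make sure a watchdog device exists
yum install -y sbd

# enable SBD in watchdog-only mode; this needs real quorum, hence three nodes
pcs stonith sbd enable --watchdog=/dev/watchdog

# tell Pacemaker how long a lost node needs to self-fence
pcs property set stonith-watchdog-timeout=10s

# SBD only takes effect after a full cluster restart
pcs cluster stop --all
pcs cluster start --all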