ClusterLabs/PAF

Unpromoted master two-node cluster

Tsunani opened this issue · 19 comments

Hi all. I configured PostgreSQL replication and verified it works, then I created a two-node cluster.

I created the resources with these commands:

pcs cluster cib cluster.xml
pcs -f cluster.xml resource create pgsqld ocf:heartbeat:pgsqlms            \
op start timeout=60s                                                          \
op stop timeout=60s                                                           \
op promote timeout=30s                                                        \
op demote timeout=120s                                                        \
op monitor interval=15s timeout=10s role="Master"                             \
op monitor interval=20s timeout=10s role="Slave"                              \
op notify timeout=60s                                                         

pcs -f cluster.xml resource create ip-virtual ocf:heartbeat:IPaddr2 \
ip=10.61.10.117 cidr_netmask=24 op monitor interval=10s

pcs -f cluster.xml resource promotable pgsqld notify=true

pcs -f cluster.xml constraint order promote pgsqld-clone then start ip-virtual symmetrical=false kind=Mandatory
pcs -f cluster.xml constraint order demote pgsqld-clone then stop ip-virtual symmetrical=false kind=Mandatory


pcs cluster cib-push scope=configuration cluster.xml

After running pcs resource cleanup, pcs status --full shows:

Cluster name: pg_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: t-dirrx-linux-db2 (2) (version 2.1.1-alt1-77db57872) - partition with quorum
  * Last updated: Wed Dec  1 13:11:03 2021
  * Last change:  Wed Dec  1 13:10:28 2021 by root via crm_attribute on t-dirrx-linux-db2
  * 2 nodes configured
  * 3 resource instances configured

Node List:
  * Online: [ t-dirrx-linux-db1 (1) t-dirrx-linux-db2 (2) ]

Full List of Resources:
  * Clone Set: pgsqld-clone [pgsqld] (promotable):
    * pgsqld    (ocf:heartbeat:pgsqlms):         Promoted t-dirrx-linux-db2
    * pgsqld    (ocf:heartbeat:pgsqlms):         Unpromoted t-dirrx-linux-db1
  * ip-virtual  (ocf:heartbeat:IPaddr2):         Started t-dirrx-linux-db1

Node Attributes:
  * Node: t-dirrx-linux-db1 (1):
    * master-pgsqld                     : -1000
  * Node: t-dirrx-linux-db2 (2):
    * master-pgsqld                     : 1001

Migration Summary:

Tickets:

PCSD Status:
  t-dirrx-linux-db1: Online
  t-dirrx-linux-db2: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

I don't know what to do =(

In the web UI I get: pgsqld-clone - Warning: Resource is promotable but has not been promoted on any node.

I don't understand your question.

According to pcs status, you have one promoted instance on t-dirrx-linux-db2:

pgsqld    (ocf:heartbeat:pgsqlms):         Promoted t-dirrx-linux-db2

However, since the score hasn't moved, it seems the standby is not streaming from the primary:

  * Node: t-dirrx-linux-db1 (1):
    * master-pgsqld                     : -1000

While streaming from the primary, a standby should have a positive score, up to a maximum of 1000.
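
For example (using the attribute name master-pgsqld shown in your status output), you can query this transient score on each node with crm_attribute:

crm_attribute --lifetime reboot --node t-dirrx-linux-db1 --name master-pgsqld --query
crm_attribute --lifetime reboot --node t-dirrx-linux-db2 --name master-pgsqld --query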

t-dirrx-linux-db2 - replica
t-dirrx-linux-db1 - master
What is the reason for the failure to promote the node?
Can I promote it manually?

pcs resource debug-start pgsqld

crm_resource: Error performing operation: Not installed
Operation force-start for pgsqld (ocf:heartbeat:pgsqlms) returned: 'not installed' (5)
ocf-exit-reason: You must set meta parameter master-max=1 for your master resource

crm_master -r pgsqld -N t-dirrx-linux-db1 -Q
Does this command do the promotion?

And I updated the resource meta attribute to master-max=1.
pcs resource debug-start pgsqld

Operation force-start for pgsqld (ocf:heartbeat:pgsqlms) returned: 'ok' (0)
/tmp:5432 - accepting connections
could not change directory to "/tmp/.private/root": Permission denied
could not change directory to "/tmp/.private/root": Permission denied
Dec 01 15:02:47 INFO: Instance "pgsqld" already started

Honestly, I don't understand what you are trying to fix. It seems to me you might be a bit lost. Make sure to read the available docs and to ask clear questions.

Your cluster HAS a primary node that has been promoted. The second node is expected to be a standby, not promoted. I see no evidence in your messages that you had a promotion failure!

Regarding your debug-start command, I don't understand why you are trying to start a resource that is already started...
Moreover, the cluster expects only one promoted instance by default, so there is no need to set master-max.
Finally, you are not supposed to use the debug-* commands. They do not work correctly with multi-state resources, hence the weird error message.
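
If you need to investigate why an instance is or is not promoted, a safer approach than debug-* is to look at the scheduler scores and the Pacemaker logs, for example (log access may differ on your distro):

crm_simulate --simulate --live-check --show-scores
journalctl -u pacemaker | grep -i pgsqlms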

Let me try to form a question. I destroyed the cluster and created a new one.
I checked the replica and it is good. Then I stopped and disabled postgresql.service on all servers.

pcs status --full
Cluster name: pg_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: t-dirrx-linux-db2 (2) (version 2.1.1-alt1-77db57872) - partition with quorum
  * Last updated: Thu Dec  2 10:51:00 2021
  * Last change:  Thu Dec  2 10:15:53 2021 by root via crm_attribute on t-dirrx-linux-db1
  * 2 nodes configured
  * 4 resource instances configured

Node List:
  * Online: [ t-dirrx-linux-db1 (1) t-dirrx-linux-db2 (2) ]

Full List of Resources:
  * fence       (stonith:fence_sbd):     Started t-dirrx-linux-db1
  * Clone Set: pgsqld-clone [pgsqld] (promotable):
    * pgsqld    (ocf:heartbeat:pgsqlms):         Unpromoted t-dirrx-linux-db2
    * pgsqld    (ocf:heartbeat:pgsqlms):         Promoted t-dirrx-linux-db1
  * ip-virtual  (ocf:heartbeat:IPaddr2):         Started t-dirrx-linux-db2

Node Attributes:
  * Node: t-dirrx-linux-db1 (1):
    * master-pgsqld                     : 1001
  * Node: t-dirrx-linux-db2 (2):
    * master-pgsqld                     : 1000

Migration Summary:

Tickets:

PCSD Status:
  t-dirrx-linux-db1: Online
  t-dirrx-linux-db2: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
  sbd: active/enabled

I think I did everything according to the documentation.
Can I ask for your help?
Why does pgsqld-clone get this warning?
[screenshot: web UI warning]

 Resource: ip-virtual (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=24 ip=110.14.132.117
  Operations: monitor interval=10s (ip-virtual-monitor-interval-10s)
              start interval=0s timeout=20s (ip-virtual-start-interval-0s)
              stop interval=0s timeout=20s (ip-virtual-stop-interval-0s)
 Clone: pgsqld-clone
  Meta Attrs: clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=true promotable=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
               monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
               promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
               reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
               start interval=0s timeout=60s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)

OK, looking at your screenshot of this web interface (what is it?), I understand now.

According to pcs status --full, almost everything is working correctly:

  • you have a promoted resource on t-dirrx-linux-db1
  • you have an unpromoted resource on t-dirrx-linux-db2
  • the standby on t-dirrx-linux-db2 has a promotion score of 1000.

The only thing here that is not working is the virtual IP address, located on t-dirrx-linux-db2. Maybe you forgot the colocation and ordering constraints?

I have absolutely no idea why your web UI is showing this warning message, but your pcs status is in full contradiction with it.
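
For reference, the virtual IP is usually pinned to the promoted instance with a colocation constraint along these lines (adjust the resource names to yours; recent pcs versions use the role keyword Promoted instead of master):

pcs constraint colocation add ip-virtual with master pgsqld-clone INFINITY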

The only thing here that is not working is the virtual IP address, located on t-dirrx-linux-db2. Maybe you forgot the colocation and ordering constraints?

Yes, I think you're right. As intended, it should work like "promote pgsqld-clone then start ip-virtual".
Now I stop --all and start only db1: ip-virtual is on db1.
If I start --all, then ip-virtual goes to db2.
How does that work...

Regarding your virtual IP, you will have to show me your current config.

However, regarding your original problem, either you have an unrelated problem between your web UI and your cluster, or the web UI might have a bug.

Regarding your virtual IP, you will have to show me your current config.

However, regarding your original problem, either you have an unrelated problem between your web UI and your cluster, or the web UI might have a bug.

Sorry for the delay.

pcs resource show --full

Warning: This command is deprecated and will be removed. Please use 'pcs resource config' instead.
 Resource: ip-virtual (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=24 ip=110.14.132.117
  Operations: monitor interval=10s (ip-virtual-monitor-interval-10s)
              start interval=0s timeout=20s (ip-virtual-start-interval-0s)
              stop interval=0s timeout=20s (ip-virtual-stop-interval-0s)
 Clone: pgsqld-clone
  Meta Attrs: notify=true promotable=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
               monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
               promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
               reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
               start interval=0s timeout=60s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)

pcs resource show --full

Not only the resources; you need to check the whole config, including the colocation/order constraints. You are probably missing some constraints.

pcs resource show --full

Not only the resources; you need to check the whole config, including the colocation/order constraints. You are probably missing some constraints.

You mean this?

pcs config
Cluster Name: pg_cluster
Corosync Nodes:
 t-dirrx-linux-db1 t-dirrx-linux-db2
Pacemaker Nodes:
 t-dirrx-linux-db1 t-dirrx-linux-db2

Resources:
 Resource: ip-virtual (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=24 ip=110.14.132.117
  Operations: monitor interval=10s (ip-virtual-monitor-interval-10s)
              start interval=0s timeout=20s (ip-virtual-start-interval-0s)
              stop interval=0s timeout=20s (ip-virtual-stop-interval-0s)
 Clone: pgsqld-clone
  Meta Attrs: notify=true promotable=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
               monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
               promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
               reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
               start interval=0s timeout=60s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)

Stonith Devices:
 Resource: fence_sbd (class=stonith type=fence_sbd)
  Attributes: devices=/dev/sdb
  Operations: monitor interval=60s (fence_sbd-monitor-interval-60s)
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote pgsqld-clone then start ip-virtual (kind:Mandatory) (non-symmetrical) (id:order-pgsqld-clone-ip-virtual-Mandatory)
  demote pgsqld-clone then stop ip-virtual (kind:Mandatory) (non-symmetrical) (id:order-pgsqld-clone-ip-virtual-Mandatory-1)
Colocation Constraints:
  ip-virtual with pgsqld-clone (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-ip-virtual-pgsqld-clone-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 migration-threshold=5
 resource-stickiness=100
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: pg_cluster
 dc-version: 2.1.1-alt1-77db57872
 have-watchdog: false
 no-quorum-policy: ignore
 stonith-enabled: false

Quorum:
  Options:

pacemakerd --features
Pacemaker 2.1.1-alt1 (Build: 77db57872)
 Supporting v3.10.2: agent-manpages corosync-ge-2 generated-manpages ipmiservicelogd monotonic nagios ncurses profile remote servicelog systemd

Today I installed the cluster on Ubuntu 18.

pcs config --full
Cluster Name: pg_cluster
Corosync Nodes:
 t-dirrx-linux-db1 t-dirrx-linux-db2
Pacemaker Nodes:
 t-dirrx-linux-db1 t-dirrx-linux-db2

Resources:
 Master: pgsql-ha
  Meta Attrs: notify=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/lib/postgresql/12/bin datadir=/var/lib/postgresql/12/main pgdata=/etc/postgresql/12/main pghost=/var/run/postgresql
   Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
               monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
               promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
               reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
               start interval=0s timeout=60s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
 Resource: ip-virtual (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=24 ip=110.14.132.117
  Operations: monitor interval=10s (ip-virtual-monitor-interval-10s)
              start interval=0s timeout=20s (ip-virtual-start-interval-0s)
              stop interval=0s timeout=20s (ip-virtual-stop-interval-0s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote pgsql-ha then start ip-virtual (kind:Mandatory) (non-symmetrical) (id:order-pgsql-ha-ip-virtual-Mandatory)
  demote pgsql-ha then stop ip-virtual (kind:Mandatory) (non-symmetrical) (id:order-pgsql-ha-ip-virtual-Mandatory-1)
Colocation Constraints:
  ip-virtual with pgsql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-ip-virtual-pgsql-ha-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 migration-threshold: 5
 resource-stickiness: 100
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: pg_cluster
 dc-version: 1.1.18-2b07d5c5a9
 have-watchdog: false
 last-lrm-refresh: 1639033695
 no-quorum-policy: ignore
 stonith-enabled: false
Node Attributes:
 t-dirrx-linux-db1: master-pgsqld=1001
 t-dirrx-linux-db2: master-pgsqld=1000

Quorum:
  Options:

And all is good.
But these versions are older; this setup also uses the master/slave syntax.

Also, on Ubuntu 20.04 this works fine, with the "latest" package versions.
I think my OS, Alt Linux, has a problem.

Hi,

I don't think the issue comes from the OS. I had a hard time following your actions and setups...

I don't use Alt Linux though. If you manage to pinpoint your problem, feel free to share what you discovered.

Regards,

Closed == fixed? How?

I'm getting the same error on Alt Linux.

Here is my debug info:

  1. /usr/lib/ocf/resource.d/heartbeat/pgsqlms start ; echo $? returns 5 (OCF_ERR_INSTALLED: the tools required by the resource are not installed on this machine)
  2. when executing pcs resource debug-start pgsqld --full, I get this result:

Operation force-start for pgsqld (ocf:heartbeat:pgsqlms) could not be executed (Timed Out: Process did not exit within specified timeout)
(log_xmllib_err) error: XML Error: Entity: line 1: parser error : Start tag expected, '<' not found
(log_xmllib_err) error: XML Error: /tmp:5432 - accepting connections
(log_xmllib_err) error: XML Error: ^
(string2xml) warning: Parsing failed (domain=1, level=3, code=4): Start tag expected, '<' not found
crm_resource: Error performing operation: Error occurred
/tmp:5432 - accepting connections

versions:

pacemaker-schemas-2.1.2-alt1.noarch
libpacemaker-2.1.2-alt1.x86_64
pacemaker-cli-2.1.2-alt1.x86_64
pacemaker-2.1.2-alt1.x86_64
resource-agents-paf-2.3.0-alt2.noarch


BTW, it looks like I fixed the could not change directory to "/tmp/.private/root": Permission denied error by setting export TMPDIR=PUBLIC_DIR

Closed == fixed? How?

Did I say it was fixed? So far, I am still not able to tell if it's a bug, or a setup problem.

I don't know why you are able to build your cluster with an older version of Pacemaker, and not with a newer one.

I noticed you did not set up fencing, though.

  1. /usr/lib/ocf/resource.d/heartbeat/pgsqlms start [...]
  2. when executing pcs resource debug-start pgsqld --full getting this result

As I wrote earlier here: #199 (comment)

you are not supposed to use the debug-* commands. They do not work correctly with multi-state resources, hence the weird error message.

About the following message:

BTW looks like i fixed [...]

As I told you, debug-* commands are NOT expected to work with a multi-state resource. This error comes straight from the fact that pcs calls pgsqlms without all the environment variables Pacemaker sets, including TMPDIR and many others that are meaningful to pgsqlms...
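
As a rough illustration only (variable names follow the OCF conventions; the values here are made up), a real agent call from Pacemaker carries an environment along these lines, and debug-start reproduces only part of it:

OCF_ROOT=/usr/lib/ocf
OCF_RESOURCE_INSTANCE=pgsqld
OCF_RESKEY_CRM_meta_interval=0
OCF_RESKEY_CRM_meta_notify=true
OCF_RESKEY_CRM_meta_clone_max=2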

As I told you, debug-* commands are NOT expected to work with a multi-state resource. [...]

Thanks a lot for the answer to my weird question :)
Today I updated pcs and pacemaker, and now everything works well. Node2 gets promoted when node1 dies.

The problem comes from Alt Linux p10 and the old repos on it.

The problem comes from Alt Linux p10 and the old repos on it.

As far as I know, PAF is compatible with at least Pacemaker 1.1.13 using a corosync 2.x stack.

Do you have more details about this? Have you been able to pinpoint the issue with the old repos?

The problem comes from Alt Linux p10 and the old repos on it.

As far as I know, PAF is compatible with at least Pacemaker 1.1.13 using a corosync 2.x stack.

Do you have more details about this? Have you been able to pinpoint the issue with the old repos?

The problem was in my cluster configuration, not in the old repos. I changed the 'no-quorum-policy' property to ignore and 'stonith-enabled' to false, and after that the problem was solved.

I changed the 'no-quorum-policy' property to ignore and 'stonith-enabled' to false, and after that the problem was solved.

This is not safe: PAF will not behave correctly without stonith and quorum set up and enabled.
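
A minimal sketch of restoring those safety settings, reusing the fence_sbd example posted earlier in this thread (adapt the agent and its parameters to your actual fencing hardware):

pcs property set stonith-enabled=true
pcs property unset no-quorum-policy   # back to the default (stop)
pcs stonith create fence fence_sbd devices=/dev/sdb op monitor interval=60s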