Unpromoted master two-node cluster
Tsunani opened this issue · 19 comments
Hi all. I configured PostgreSQL replication and it works; then I created a two-node cluster.
I created the resources with the following commands:
pcs cluster cib cluster.xml
pcs -f cluster.xml resource create pgsqld ocf:heartbeat:pgsqlms \
op start timeout=60s \
op stop timeout=60s \
op promote timeout=30s \
op demote timeout=120s \
op monitor interval=15s timeout=10s role="Master" \
op monitor interval=20s timeout=10s role="Slave" \
op notify timeout=60s
pcs -f cluster.xml resource create ip-virtual ocf:heartbeat:IPaddr2 \
ip=10.61.10.117 cidr_netmask=24 op monitor interval=10s
pcs -f cluster.xml resource promotable pgsqld notify=true
pcs -f cluster.xml constraint order promote pgsqld-clone then start ip-virtual symmetrical=false kind=Mandatory
pcs -f cluster.xml constraint order demote pgsqld-clone then stop ip-virtual symmetrical=false kind=Mandatory
pcs cluster cib-push scope=configuration cluster.xml
After running pcs resource cleanup and pcs status --full I get:
Cluster name: pg_cluster
Cluster Summary:
* Stack: corosync
* Current DC: t-dirrx-linux-db2 (2) (version 2.1.1-alt1-77db57872) - partition with quorum
* Last updated: Wed Dec 1 13:11:03 2021
* Last change: Wed Dec 1 13:10:28 2021 by root via crm_attribute on t-dirrx-linux-db2
* 2 nodes configured
* 3 resource instances configured
Node List:
* Online: [ t-dirrx-linux-db1 (1) t-dirrx-linux-db2 (2) ]
Full List of Resources:
* Clone Set: pgsqld-clone [pgsqld] (promotable):
* pgsqld (ocf:heartbeat:pgsqlms): Promoted t-dirrx-linux-db2
* pgsqld (ocf:heartbeat:pgsqlms): Unpromoted t-dirrx-linux-db1
* ip-virtual (ocf:heartbeat:IPaddr2): Started t-dirrx-linux-db1
Node Attributes:
* Node: t-dirrx-linux-db1 (1):
* master-pgsqld : -1000
* Node: t-dirrx-linux-db2 (2):
* master-pgsqld : 1001
Migration Summary:
Tickets:
PCSD Status:
t-dirrx-linux-db1: Online
t-dirrx-linux-db2: Online
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
I don't know what to do =(
In the web UI, pgsqld-clone shows this warning: "Resource is promotable but has not been promoted on any node".
I don't understand your question. According to pcs status, you have one promoted instance on t-dirrx-linux-db2:
pgsqld (ocf:heartbeat:pgsqlms): Promoted t-dirrx-linux-db2
However, since the score didn't move, it seems the standby is not streaming from the primary:
* Node: t-dirrx-linux-db1 (1):
* master-pgsqld : -1000
While streaming from the primary, the standbys should have a positive score with a maximum of 1000.
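For example, assuming a default local "postgres" superuser (adapt to your authentication setup), the following query on the primary lists the connected standbys and their streaming state:
sudo -iu postgres psql -x -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"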
t-dirrx-linux-db2 - replica
t-dirrx-linux-db1 - master
What is the reason for the failure to promote a node?
Can I manually promote?
pcs resource debug-start pgsqld
crm_resource: Error performing operation: Not installed
Operation force-start for pgsqld (ocf:heartbeat:pgsqlms) returned: 'not installed' (5)
ocf-exit-reason:You must set meta parameter master-max=1 for your master resource
crm_master -r pgsqld -N t-dirrx-linux-db1 -Q
Does this command promote?
I also updated the resource meta attribute to master-max=1.
pcs resource debug-start pgsqld
Operation force-start for pgsqld (ocf:heartbeat:pgsqlms) returned: 'ok' (0)
/tmp:5432 - accepting connections
could not change directory to "/tmp/.private/root": Permission denied
could not change directory to "/tmp/.private/root": Permission denied
Dec 01 15:02:47 INFO: Instance "pgsqld" already started
Honestly, I don't understand what you are trying to fix. It seems to me you might be a bit lost. Make sure to read available docs and to ask clear questions.
Your cluster HAS a primary node that has been promoted. The second node is expected to be a standby, not promoted. I see no evidence in your messages that you had a promotion failure!
Regarding your debug-start command, I don't understand why you are trying to start a resource that is already started...
Moreover, the cluster expects only one promoted resource by default; there is no need to set master-max.
Finally, you are not supposed to use the debug-* commands. They are not expected to work correctly with multi-state resources, hence the weird error message.
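A safer pattern, sketched here as a general idea rather than a prescription for this cluster, is to let Pacemaker drive the agent and simply observe the outcome:
pcs resource cleanup pgsqld    # clear recorded failures so the cluster re-probes the resource
crm_mon --one-shot --show-detail    # one-shot detailed view of roles and promotion scores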
Let me try to form a question. I destroyed the cluster and created a new one.
I checked the replica, it is good. I stopped and disabled postgresql.service on all servers.
pcs status --full
Cluster name: pg_cluster
Cluster Summary:
* Stack: corosync
* Current DC: t-dirrx-linux-db2 (2) (version 2.1.1-alt1-77db57872) - partition with quorum
* Last updated: Thu Dec 2 10:51:00 2021
* Last change: Thu Dec 2 10:15:53 2021 by root via crm_attribute on t-dirrx-linux-db1
* 2 nodes configured
* 4 resource instances configured
Node List:
* Online: [ t-dirrx-linux-db1 (1) t-dirrx-linux-db2 (2) ]
Full List of Resources:
* fence (stonith:fence_sbd): Started t-dirrx-linux-db1
* Clone Set: pgsqld-clone [pgsqld] (promotable):
* pgsqld (ocf:heartbeat:pgsqlms): Unpromoted t-dirrx-linux-db2
* pgsqld (ocf:heartbeat:pgsqlms): Promoted t-dirrx-linux-db1
* ip-virtual (ocf:heartbeat:IPaddr2): Started t-dirrx-linux-db2
Node Attributes:
* Node: t-dirrx-linux-db1 (1):
* master-pgsqld : 1001
* Node: t-dirrx-linux-db2 (2):
* master-pgsqld : 1000
Migration Summary:
Tickets:
PCSD Status:
t-dirrx-linux-db1: Online
t-dirrx-linux-db2: Online
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
sbd: active/enabled
I think I did everything according to the documentation.
Can I ask for your help?
Why does pgsqld-clone get this warning?
Resource: ip-virtual (class=ocf provider=heartbeat type=IPaddr2)
Attributes: cidr_netmask=24 ip=110.14.132.117
Operations: monitor interval=10s (ip-virtual-monitor-interval-10s)
start interval=0s timeout=20s (ip-virtual-start-interval-0s)
stop interval=0s timeout=20s (ip-virtual-stop-interval-0s)
Clone: pgsqld-clone
Meta Attrs: clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=true promotable=true
Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
start interval=0s timeout=60s (pgsqld-start-interval-0s)
stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
OK, looking at your screenshot of this web interface (what is it?), I understand now.
According to pcs status --full, almost everything is working correctly:
- you have a promoted resource on t-dirrx-linux-db1
- you have an unpromoted resource on t-dirrx-linux-db2
- the standby on t-dirrx-linux-db2 has a promotion score of 1000.
The only thing here that is not working is the virtual IP address located on t-dirrx-linux-db2. Maybe you forgot the location+order constraints?
I have absolutely no idea why your web UI is showing this warning message, but your pcs status is in full contradiction with it.
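For reference, PAF's setup instructions pair such order constraints with a colocation constraint that keeps the IP on the promoted instance; with the resource names used in this thread it would look something like this (recent pcs releases spell the role Promoted instead of master):
pcs constraint colocation add ip-virtual with master pgsqld-clone INFINITY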
The only thing here that is not working is the virtual IP address located on t-dirrx-linux-db2. Maybe you forgot the location+order constraints?
Yes, I think you're right. As intended, it should work like "promote pgsqld-clone then start ip-virtual".
Now I ran stop --all and started only db1; now ip-virtual is on db1.
If I start --all, then ip-virtual ends up on db2.
How does this work...
Regarding your virtual IP, you will have to show me your current config.
However, regarding your original problem, either you have an unrelated problem between your web UI and your cluster, or the web UI might have a bug.
Regarding your virtual IP, you will have to show me your current config.
However, regarding your original problem, either you have an unrelated problem between your web UI and your cluster, or the web UI might have a bug.
Sorry for the delay.
pcs resource show --full
Warning: This command is deprecated and will be removed. Please use 'pcs resource config' instead.
Resource: ip-virtual (class=ocf provider=heartbeat type=IPaddr2)
Attributes: cidr_netmask=24 ip=110.14.132.117
Operations: monitor interval=10s (ip-virtual-monitor-interval-10s)
start interval=0s timeout=20s (ip-virtual-start-interval-0s)
stop interval=0s timeout=20s (ip-virtual-stop-interval-0s)
Clone: pgsqld-clone
Meta Attrs: notify=true promotable=true
Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
start interval=0s timeout=60s (pgsqld-start-interval-0s)
stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
pcs resource show --full
Not only the resources; you need to check the whole config, with the location/order constraints. You are probably missing some constraints.
pcs resource show --full
Not only the resources; you need to check the whole config, with the location/order constraints. You are probably missing some constraints.
You mean this?
pcs config
Cluster Name: pg_cluster
Corosync Nodes:
t-dirrx-linux-db1 t-dirrx-linux-db2
Pacemaker Nodes:
t-dirrx-linux-db1 t-dirrx-linux-db2
Resources:
Resource: ip-virtual (class=ocf provider=heartbeat type=IPaddr2)
Attributes: cidr_netmask=24 ip=110.14.132.117
Operations: monitor interval=10s (ip-virtual-monitor-interval-10s)
start interval=0s timeout=20s (ip-virtual-start-interval-0s)
stop interval=0s timeout=20s (ip-virtual-stop-interval-0s)
Clone: pgsqld-clone
Meta Attrs: notify=true promotable=true
Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
start interval=0s timeout=60s (pgsqld-start-interval-0s)
stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
Stonith Devices:
Resource: fence_sbd (class=stonith type=fence_sbd)
Attributes: devices=/dev/sdb
Operations: monitor interval=60s (fence_sbd-monitor-interval-60s)
Fencing Levels:
Location Constraints:
Ordering Constraints:
promote pgsqld-clone then start ip-virtual (kind:Mandatory) (non-symmetrical) (id:order-pgsqld-clone-ip-virtual-Mandatory)
demote pgsqld-clone then stop ip-virtual (kind:Mandatory) (non-symmetrical) (id:order-pgsqld-clone-ip-virtual-Mandatory-1)
Colocation Constraints:
ip-virtual with pgsqld-clone (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-ip-virtual-pgsqld-clone-INFINITY)
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
migration-threshold=5
resource-stickiness=100
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: pg_cluster
dc-version: 2.1.1-alt1-77db57872
have-watchdog: false
no-quorum-policy: ignore
stonith-enabled: false
Quorum:
Options:
pacemakerd --features
Pacemaker 2.1.1-alt1 (Build: 77db57872)
Supporting v3.10.2: agent-manpages corosync-ge-2 generated-manpages ipmiservicelogd monotonic nagios ncurses profile remote servicelog systemd
Today I installed the cluster on Ubuntu 18.
pcs config --full
Cluster Name: pg_cluster
Corosync Nodes:
t-dirrx-linux-db1 t-dirrx-linux-db2
Pacemaker Nodes:
t-dirrx-linux-db1 t-dirrx-linux-db2
Resources:
Master: pgsql-ha
Meta Attrs: notify=true
Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
Attributes: bindir=/usr/lib/postgresql/12/bin datadir=/var/lib/postgresql/12/main pgdata=/etc/postgresql/12/main pghost=/var/run/postgresql
Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
start interval=0s timeout=60s (pgsqld-start-interval-0s)
stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
Resource: ip-virtual (class=ocf provider=heartbeat type=IPaddr2)
Attributes: cidr_netmask=24 ip=110.14.132.117
Operations: monitor interval=10s (ip-virtual-monitor-interval-10s)
start interval=0s timeout=20s (ip-virtual-start-interval-0s)
stop interval=0s timeout=20s (ip-virtual-stop-interval-0s)
Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
promote pgsql-ha then start ip-virtual (kind:Mandatory) (non-symmetrical) (id:order-pgsql-ha-ip-virtual-Mandatory)
demote pgsql-ha then stop ip-virtual (kind:Mandatory) (non-symmetrical) (id:order-pgsql-ha-ip-virtual-Mandatory-1)
Colocation Constraints:
ip-virtual with pgsql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-ip-virtual-pgsql-ha-INFINITY)
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
migration-threshold: 5
resource-stickiness: 100
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: pg_cluster
dc-version: 1.1.18-2b07d5c5a9
have-watchdog: false
last-lrm-refresh: 1639033695
no-quorum-policy: ignore
stonith-enabled: false
Node Attributes:
t-dirrx-linux-db1: master-pgsqld=1001
t-dirrx-linux-db2: master-pgsqld=1000
Quorum:
Options:
And all good.
But the versions are older; it also uses the master/slave syntax.
Also, on Ubuntu 20.04 this works fine.
The packages are the "latest" versions. I think my OS, Alt Linux, has a problem.
Hi,
I don't think the issue comes from the OS. I had a hard time following your actions and setups...
I don't use Alt Linux though. If you manage to pinpoint your problem, feel free to share what you discovered.
Regards,
Closed == fixed? How?
Getting the same error on Alt Linux.
Here is my debug info:
- /usr/lib/ocf/resource.d/heartbeat/pgsqlms start ; echo $? returns 5 (OCF_ERR_INSTALLED: the tools required by the resource are not installed on this machine)
- when executing pcs resource debug-start pgsqld --full I get this result:
Operation force-start for pgsqld (ocf:heartbeat:pgsqlms) could not be executed (Timed Out: Process did not exit within specified timeout)
(log_xmllib_err) error: XML Error: Entity: line 1: parser error : Start tag expected, '<' not found
(log_xmllib_err) error: XML Error: /tmp:5432 - accepting connections
(log_xmllib_err) error: XML Error: ^
(string2xml) warning: Parsing failed (domain=1, level=3, code=4): Start tag expected, '<' not found
crm_resource: Error performing operation: Error occurred
/tmp:5432 - accepting connections
versions:
pacemaker-schemas-2.1.2-alt1.noarch
libpacemaker-2.1.2-alt1.x86_64
pacemaker-cli-2.1.2-alt1.x86_64
pacemaker-2.1.2-alt1.x86_64
resource-agents-paf-2.3.0-alt2.noarch
BTW, it looks like I fixed could not change directory to "/tmp/.private/root": Permission denied
by setting export TMPDIR=PUBLIC_DIR
Closed == fixed? How?
Did I say it was fixed? So far, I am still not able to tell if it's a bug, or a setup problem.
I don't know why you are able to build your cluster with an older version of Pacemaker, and not with a newer one.
I noticed you did not set up fencing though.
- /usr/lib/ocf/resource.d/heartbeat/pgsqlms start [...]
- when executing pcs resource debug-start pgsqld --full getting this result
As I wrote earlier here: #199 (comment)
you are not supposed to use the debug-* commands. They are not expected to work correctly with multi-state resources, hence the weird error message.
About the following message:
BTW looks like i fixed [...]
As I told you, debug-* commands are NOT expected to work with a multi-state resource. This error comes straight from the fact that pcs calls pgsqlms without all the environment variables Pacemaker sets, including TMPDIR, and many others that are meaningful to pgsqlms...
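To illustrate, a normal invocation by Pacemaker carries environment like the following rough sketch (the parameter values are assumptions, not taken from this cluster):
# Pacemaker exports OCF_ROOT, OCF_RESOURCE_INSTANCE and one OCF_RESKEY_<name>
# variable per resource parameter before calling the agent; running the script
# by hand skips all of this.
OCF_ROOT=/usr/lib/ocf \
OCF_RESOURCE_INSTANCE=pgsqld \
OCF_RESKEY_pgdata=/var/lib/pgsql/data \
/usr/lib/ocf/resource.d/heartbeat/pgsqlms validate-all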
Thanks a lot for the answer to my weird question :)
Today I updated pcs and pacemaker and now everything works well. Node2 becomes promoted when node1 dies.
The problem comes from Alt Linux p10 and the old repos on it.
The problem comes from Alt Linux p10 and the old repos on it.
As far as I know, PAF is compatible with at least Pacemaker 1.1.13 using a corosync 2.x stack.
Do you have some more details about this? Have you been able to pinpoint the issue with old repos?
The problem comes from Alt Linux p10 and the old repos on it.
As far as I know, PAF is compatible with at least Pacemaker 1.1.13 using a corosync 2.x stack.
Do you have some more details about this? Have you been able to pinpoint the issue with old repos?
The problem was in my cluster configuration and not in the old repos. I changed the 'no-quorum-policy' property to ignore and 'stonith-enabled' to false, and after that the problem was solved.
I changed the 'no-quorum-policy' property to ignore and 'stonith-enabled' to false, and after that the problem was solved.
This is not safe, and PAF will not behave correctly without stonith and quorum set up and enabled.
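For reference, restoring the safe defaults would look something like this (general pcs property syntax, not a drop-in fix for this specific cluster):
pcs property set stonith-enabled=true
pcs property set no-quorum-policy=stop
A two-node cluster additionally needs a working fencing device and corosync's two_node/wait_for_all votequorum options to handle quorum sanely.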