cs_clone with drbd failed to promote/demote node
Opened this issue · 1 comment
rotulet commented
Affected Puppet, Ruby, OS and module versions/distributions
- Puppet: 6.3.0
- Ruby: ruby 2.5.3p105 (2018-10-18 revision 65156)
- Distribution: debian sid
- Module version:
├─┬ puppet-corosync (v6.0.1)
│ └── puppetlabs-stdlib (v4.25.1)
├─┬ puppet-drbd (v0.5.2)
│ └─┬ puppetlabs-concat (v5.3.0)
│   └── puppetlabs-translate (v1.2.0)
- pcs: 0.10.1.2
- pacemaker: 2.0.1-1
- corosync: 3.0.1-2
How to reproduce (e.g. Puppet code you use)
file { '/root/corosync_authkey':
  ensure => file,
  mode   => '0600',
  owner  => 'root',
  source => 'puppet:///modules/fscluster/authkey.pem',
}

class { 'corosync':
  package_corosync         => true,
  version_corosync         => '3.0*',
  enable_corosync_service  => true,
  package_pacemaker        => true,
  version_pacemaker        => '2.0*',
  enable_pacemaker_service => true,
  package_pcs              => true,
  version_pcs              => '0.10*',
  package_crmsh            => false,
  bind_address             => $::ipaddress,
  cluster_name             => 'nfs_cluster',
  check_standby            => true,
  test_corosync_config     => true,
  enable_secauth           => true,
  authkey_source           => 'file',
  authkey                  => '/root/corosync_authkey',
  set_votequorum           => true,
  quorum_members           => [ $fscluster::share_server_ips[0], $fscluster::share_server_ips[1] ],
  quorum_members_names     => [ $fscluster::share_server_hosts[0], $fscluster::share_server_hosts[1] ],
  require                  => File['/root/corosync_authkey'],
}

corosync::service { 'pacemaker':
  version => '2.0*',
}

cs_property { 'stonith-enabled':
  value => false,
}

cs_property { 'no-quorum-policy':
  value => 'ignore',
}

cs_rsc_defaults { 'resource-stickiness':
  value => 'INFINITY',
}

cs_primitive { 'DrbdVolume':
  primitive_class => 'ocf',
  provided_by     => 'linbit',
  primitive_type  => 'drbd',
  parameters      => { 'drbd_resource' => 'all' },
  metadata        => {
    'master-max'      => '1',
    'master-node-max' => '1',
    'clone-max'       => '2',
    'clone-node-max'  => '1',
    'promotable'      => true,
    'notify'          => true,
  },
  operations      => [
    { 'monitor' => { 'interval' => '10s', 'role' => 'Slave' } },
    { 'monitor' => { 'interval' => '09s', 'role' => 'Master' } },
    { 'demote'  => { 'interval' => '0s', 'timeout' => '90s' } },
    { 'notify'  => { 'interval' => '0s', 'timeout' => '90s' } },
    { 'promote' => { 'interval' => '0s', 'timeout' => '90s' } },
    { 'reload'  => { 'interval' => '0s', 'timeout' => '30s' } },
    { 'start'   => { 'interval' => '0s', 'timeout' => '240s' } },
    { 'stop'    => { 'interval' => '0s', 'timeout' => '100s' } },
  ],
}

cs_clone { 'DrbdVolume-clone':
  ensure         => present,
  primitive      => 'DrbdVolume',
  clone_max      => 2,
  clone_node_max => 1,
  notify_clones  => true,
  require        => Cs_primitive['DrbdVolume'],
}
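Note the asymmetry: the `promotable` and `notify` metadata end up on the primitive, while `pcs resource promotable` puts them on the clone, where the scheduler actually reads them. A possible workaround, assuming your module version exposes a `promotable` parameter on `cs_clone` (that parameter is an assumption here; check the cs_clone type documentation for the version you have installed):

```puppet
# Hypothetical sketch: carry the promotable/notify metadata on the clone
# itself instead of on the primitive. 'promotable' on cs_clone is an
# assumption -- verify your puppet-corosync version supports it.
cs_clone { 'DrbdVolume-clone':
  ensure         => present,
  primitive      => 'DrbdVolume',
  clone_max      => 2,
  clone_node_max => 1,
  promotable     => true,
  notify_clones  => true,
  require        => Cs_primitive['DrbdVolume'],
}
```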
What are you seeing
pcs status shows both nodes started (but neither a master nor a slave):
Full list of resources:
Clone Set: DrbdVolume-clone [DrbdVolume]
Started: [ carto-blade-0 carto-blade-1 ]
drbd seems happy:
r0_nfs role:Primary
disk:UpToDate
peer role:Secondary
replication:Established peer-disk:UpToDate
But I get these errors in the log:
pacemaker-controld[1020]: notice: State transition S_IDLE -> S_POLICY_ENGINE
pacemaker-schedulerd[1019]: notice: On loss of quorum: Ignore
pacemaker-schedulerd[1019]: error: Couldn't expand DrbdVolume-clone_promote_0 to DrbdVolume-clone_confirmed-post_notify_promoted_0 in DrbdVolume-clone
pacemaker-schedulerd[1019]: error: Couldn't expand DrbdVolume-clone_promote_0 to DrbdVolume-clone_confirmed-post_notify_promoted_0 in DrbdVolume-clone
...
...
pacemaker-schedulerd[1019]: notice: Calculated transition 82, saving inputs in /var/lib/pacemaker/pengine/pe-input-120.bz2
pacemaker-controld[1020]: warning: Transition 82 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=3, Source=/var/lib/pacemaker/pengine/pe-input-120.bz2): Terminated
pacemaker-controld[1020]: warning: Transition failed: terminated
What behaviour did you expect instead
If I do the same in bash (without puppet):
sudo pcs cluster --force setup nfs_cluster carto-blade-0 carto-blade-1
sudo pcs cluster start --all
sudo pcs property set stonith-enabled=false
sudo pcs property set no-quorum-policy=ignore
sudo pcs resource defaults resource-stickiness=100
sudo pcs resource create DrbdVolume ocf:linbit:drbd \
drbd_resource=all op monitor interval=10s role="Slave" op monitor interval=09s role="Master"
sudo pcs resource promotable DrbdVolume master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
It works as intended:
Full list of resources:
Clone Set: DrbdVolume-clone [DrbdVolume] (promotable)
Masters: [ carto-blade-1 ]
Slaves: [ carto-blade-0 ]
Log output:
pacemaker-controld[1020]: notice: State transition S_IDLE -> S_POLICY_ENGINE
pacemaker-schedulerd[1019]: notice: On loss of quorum: Ignore
pacemaker-schedulerd[1019]: notice: * Start DrbdVolume:0 ( carto-blade-0 )
pacemaker-schedulerd[1019]: notice: * Start DrbdVolume:1 ( carto-blade-1 )
pacemaker-schedulerd[1019]: notice: Calculated transition 112, saving inputs in /var/lib/pacemaker/pengine/pe-input-140.bz2
pacemaker-controld[1020]: notice: Initiating monitor operation DrbdVolume:0_monitor_0 locally on carto-blade-0
pacemaker-controld[1020]: notice: Initiating monitor operation DrbdVolume:1_monitor_0 on carto-blade-1
pacemaker-controld[1020]: notice: Result of probe operation for DrbdVolume on carto-blade-0: 7 (not running)
pacemaker-controld[1020]: notice: Initiating start operation DrbdVolume:0_start_0 locally on carto-blade-0
pacemaker-controld[1020]: notice: Initiating start operation DrbdVolume:1_start_0 on carto-blade-1
kernel: [98215.624649] drbd r0_nfs: Starting worker thread (from drbdsetup-84 [40273])
kernel: [98215.648877] block drbd0: disk( Diskless -> Attaching )
kernel: [98215.649123] drbd r0_nfs: Method to ensure write ordering: flush
kernel: [98215.649130] block drbd0: max BIO size = 1048576
kernel: [98215.649140] block drbd0: drbd_bm_resize called with capacity == 16776632
kernel: [98215.649217] block drbd0: resync bitmap: bits=2097079 words=32767 pages=64
kernel: [98215.649223] block drbd0: size = 8192 MB (8388316 KB)
kernel: [98215.650349] block drbd0: recounting of set bits took additional 0 jiffies
kernel: [98215.650355] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: [98215.650365] block drbd0: disk( Attaching -> UpToDate )
kernel: [98215.650375] block drbd0: attached to UUIDs E34C19A8901F523B:BA3A295CAB28B4BB:AEB4E0E13837A321:AEB3E0E13837A321
kernel: [98215.666086] drbd r0_nfs: conn( StandAlone -> Unconnected )
kernel: [98215.666116] drbd r0_nfs: Starting receiver thread (from drbd_w_r0_n [40276])
kernel: [98215.666157] drbd r0_nfs: receiver (re)started
kernel: [98215.666167] drbd r0_nfs: conn( Unconnected -> WFConnection )
pacemaker-controld[1020]: notice: Transition 112 aborted by status-1-master-DrbdVolume doing create master-DrbdVolume=1000: Transient attribute change
pacemaker-controld[1020]: notice: Result of start operation for DrbdVolume on carto-blade-0: 0 (ok)
pacemaker-controld[1020]: notice: Initiating notify operation DrbdVolume:0_post_notify_start_0 locally on carto-blade-0
pacemaker-controld[1020]: notice: Initiating notify operation DrbdVolume:1_post_notify_start_0 on carto-blade-1
pacemaker-controld[1020]: notice: Result of notify operation for DrbdVolume on carto-blade-0: 0 (ok)
pacemaker-controld[1020]: notice: Transition 112 (Complete=12, Pending=0, Fired=0, Skipped=2, Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-140.bz2): Stopped
pacemaker-schedulerd[1019]: notice: On loss of quorum: Ignore
pacemaker-schedulerd[1019]: notice: * Promote DrbdVolume:0 ( Slave -> Master carto-blade-1 )
pacemaker-schedulerd[1019]: notice: Calculated transition 113, saving inputs in /var/lib/pacemaker/pengine/pe-input-141.bz2
pacemaker-controld[1020]: notice: Initiating notify operation DrbdVolume_pre_notify_promote_0 on carto-blade-1
pacemaker-controld[1020]: notice: Initiating notify operation DrbdVolume_pre_notify_promote_0 locally on carto-blade-0
pacemaker-controld[1020]: notice: Result of notify operation for DrbdVolume on carto-blade-0: 0 (ok)
pacemaker-controld[1020]: notice: Initiating promote operation DrbdVolume_promote_0 on carto-blade-1
pacemaker-controld[1020]: notice: Initiating notify operation DrbdVolume_post_notify_promote_0 on carto-blade-1
pacemaker-controld[1020]: notice: Initiating notify operation DrbdVolume_post_notify_promote_0 locally on carto-blade-0
pacemaker-controld[1020]: notice: Result of notify operation for DrbdVolume on carto-blade-0: 0 (ok)
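The difference between the two setups should be visible directly in the CIB. A diagnostic sketch (assumes a running cluster and the `DrbdVolume-clone` resource as above): the working `pcs resource promotable` invocation produces a clone whose meta attributes carry the promotable/notify settings, while the Puppet-built clone may leave that metadata on the inner primitive, where the scheduler ignores it.

```shell
# Inspect how the clone was actually rendered.
# pcs 0.10 uses 'resource config'; on pcs 0.9 use 'pcs resource show' instead.
pcs resource config DrbdVolume-clone

# Raw CIB view: the promotable/notify meta attributes should sit on the
# <clone> element, not on the inner <primitive>.
cibadmin --query --scope resources

# Ask the scheduler itself to validate the configuration.
crm_verify -LV
```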
rotulet commented
Nearly the same happens with these versions:
- pcs: 0.9.155
- pacemaker: 1.1.16-1
pcs status gives:
Full list of resources:
Clone Set: DrbdVolume-clone [DrbdVolume]
Masters: [ carto-blade-0 ]
Started: [ carto-blade-1 ]
The slave node is not detected as Slave in the 'Clone Set', and if I shut down the master it does not promote the slave.
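One way to confirm on this older stack (a hedged sketch; commands assume a live cluster): on pacemaker 1.1 a promotable resource is a distinct `<master>` element in the CIB, so a DRBD primitive wrapped in a plain `<clone>` will start on both nodes but can never be promoted.

```shell
# As the pengine log suggests, let the scheduler report configuration errors.
crm_verify -L -V

# On pacemaker 1.1, a promotable resource appears as
# <master id="DrbdVolume-clone">; a plain <clone> cannot hold a Master role.
cibadmin --query --scope resources
```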
Here is the log if I stop the master:
crmd[30804]: notice: State transition S_IDLE -> S_POLICY_ENGINE
pengine[30803]: notice: On loss of CCM Quorum: Ignore
pengine[30803]: error: Resource start-up disabled since no STONITH resources have been defined
pengine[30803]: error: Either configure some or disable STONITH with the stonith-enabled option
pengine[30803]: error: NOTE: Clusters with shared data need STONITH to ensure data integrity
pengine[30803]: notice: Scheduling Node carto-blade-0 for shutdown
pengine[30803]: error: Couldn't expand DrbdVolume-clone_promote_0
pengine[30803]: notice: Calculated transition 4, saving inputs in /var/lib/pacemaker/pengine/pe-input-95.bz2
pengine[30803]: notice: Configuration ERRORs found during PE processing. Please run "crm_verify -L" to identify issues.
crmd[30804]: notice: Transition 4 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-95.bz2): Complete
crmd[30804]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
crmd[30804]: notice: do_shutdown of peer carto-blade-0 is complete
attrd[30802]: notice: Node carto-blade-0 state is now lost
attrd[30802]: notice: Removing all carto-blade-0 attributes for peer loss
attrd[30802]: notice: Purged 1 peers with id=1 and/or uname=carto-blade-0 from the membership cache
stonith-ng[30800]: notice: Node carto-blade-0 state is now lost
stonith-ng[30800]: notice: Purged 1 peers with id=1 and/or uname=carto-blade-0 from the membership cache
cib[30799]: notice: Node carto-blade-0 state is now lost
cib[30799]: notice: Purged 1 peers with id=1 and/or uname=carto-blade-0 from the membership cache
192.168.5.59: Stopping Cluster (pacemaker)...
corosync[30783]: [TOTEM ] A new membership (2:72) was formed. Members left: 1
corosync[30783]: [CPG ] downlist left_list: 1 received
corosync[30783]: [QUORUM] Members[1]: 2
corosync[30783]: [MAIN ] Completed service synchronization, ready to provide service.
crmd[30804]: notice: Node carto-blade-0 state is now lost
pacemakerd[30792]: notice: Node carto-blade-0 state is now lost
crmd[30804]: notice: do_shutdown of peer carto-blade-0 is complete
192.168.5.59: Stopping Cluster (corosync)...
corosync[30783]: [KNET ] link: host: 1 link: 0 is down
corosync[30783]: [KNET ] host: host: 1 has no active links