SUSE/SAPHanaSR

Add a check to identify the status of the primary node and set the "hana__roles" attribute during probe in SAPHana Resource agent

sairamgopal opened this issue · 6 comments

Issue:
On a fully operational cluster, when the cluster is put into maintenance mode and the Pacemaker/cluster service is restarted, then after the cluster is taken out of maintenance mode the DB on the primary is stopped and started again, which results in an outage for customers.

The issue can be recreated with the steps below:

1. Make sure the cluster is fully operational, with one Promoted and one Demoted node and HSR in sync.
2. Put the cluster into maintenance mode (`crm configure property maintenance-mode=true`).
3. Stop the cluster service on both nodes (`crm cluster stop`).
4. Start the cluster service on both nodes (`crm cluster start`).
5. Take the cluster out of maintenance mode (`crm configure property maintenance-mode=false`).
After step 5, the DB on the primary is restarted, or sometimes a failover is triggered.
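For convenience, the same procedure as a shell sequence (the `crm` commands are those from the steps above; the `crm_mon` call for step 1 is just one way to verify the cluster state and is not taken from the original report):

```bash
# 1. verify the cluster is fully operational: one Promoted and one Demoted
#    node, HANA system replication (HSR) in sync
crm_mon -1r

# 2. put the cluster into maintenance mode
crm configure property maintenance-mode=true

# 3. stop the cluster service on both nodes
crm cluster stop

# 4. start the cluster service on both nodes
crm cluster start

# 5. take the cluster out of maintenance mode
crm configure property maintenance-mode=false
```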

Reason:
This happens because, if you attempt to start cluster services on a node while the cluster or node is in maintenance mode, Pacemaker initiates a single one-shot monitor operation (a "probe") for every resource to evaluate which resources are currently running on that node. However, it takes no further action beyond determining the resources' status.

So after step 4 a probe is initiated for the SAPHana and SAPHanaTopology resources.
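For background: an OCF resource agent can tell such a probe apart from a regular recurring monitor because Pacemaker runs the probe as a `monitor` operation with interval 0, which is exactly what the `ocf_is_probe` helper from `ocf-shellfuncs` checks. A minimal sketch of that pattern (the names `ra_monitor`, `probe_checks` and `full_monitor_checks` are placeholders, not the actual agent code):

```bash
#!/bin/bash
# Illustrative sketch only, not the shipped SAPHana/SAPHanaTopology code.
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

ra_monitor() {
    if ocf_is_probe; then
        # one-shot probe: Pacemaker only wants to know whether the resource
        # is running on this node
        probe_checks
    else
        # regular recurring monitor with full health checks
        full_monitor_checks
    fi
}
```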

In SAPHanaTopology, when the call is identified as a probe in the clone monitor function, it only checks and sets the attribute for the HANA version; it does not perform any check of the current cluster state. Because of this, the "hana_roles" and "master-rsc_SAPHana_HDB42" attributes are not set on the cluster primary.

The SAPHana resource agent, in turn, tries to read the roles attribute (which has not been set by that time) and sets the score to 5 during the probe. Later, when the cluster is taken out of maintenance mode, the resource agent checks the roles attribute and its score; because those values are not as expected, the agent tries to repair the cluster and the DB stop/start happens.
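A simplified sketch of that fallback behaviour, assuming the attribute layout described above (the SID, the attribute name and the score handling are placeholders for illustration, not the literal SAPHana code):

```bash
# Illustrative sketch of the pattern described above.
sid=hdb   # placeholder SID

roles=$(crm_attribute -N "$HOSTNAME" -n "hana_${sid}_roles" -G -q -l reboot 2>/dev/null)
if [ -z "$roles" ]; then
    # the probe runs before SAPHanaTopology has populated the roles
    # attribute, so the agent can only assign a minimal promotion score
    crm_master -v 5 -l reboot
fi
```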

Resolution:
If we add a check that identifies the status of the primary node and sets the "hana__roles" attribute during the probe, then when the cluster is taken out of maintenance mode it will not try to stop and start the DB or trigger a failover, because it will see an operational primary node.
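To make the idea concrete, here is a hedged sketch of such a probe-time check, assuming `hdbnsutil -sr_state --sapcontrol=1` as one way to detect the primary; the SID, `<sid>adm` user, attribute name and example roles value are placeholders for illustration and not necessarily what the attached file does:

```bash
# Hedged sketch of the proposed probe-time check; not the attached patch.
sid=hdb                 # placeholder SID
sidadm=${sid}adm        # placeholder <sid>adm user

if ocf_is_probe; then
    # ask HANA itself which system replication role this node currently has
    sr_mode=$(su - "$sidadm" -c "hdbnsutil -sr_state --sapcontrol=1" 2>/dev/null \
              | awk -F= '$1 == "mode" { print $2 }')
    if [ "$sr_mode" = "primary" ]; then
        # publish the roles attribute so that, once maintenance mode is
        # lifted, the cluster already sees an operational primary and does
        # not stop/start the DB or trigger a failover
        crm_attribute -N "$HOSTNAME" -n "hana_${sid}_roles" \
            -v "4:P:master1:master:worker:master" -l reboot
    fi
fi
```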

I have already modified the code and tested multiple scenarios; cluster functionality is not disturbed and the issue described above is resolved. I don't think these changes to the SAPHana resource agent will cause additional issues because, during the probe, we set the attributes only if we identify the primary node. But I need your expertise to check and decide whether this approach can be used, or to suggest another alternative/fix for the issue described above.

Created this new issue as suggested by @PeterPitterling

I am not able to create a new branch, so I am uploading the modified and tested file here.

SAPHana.md

The modified/additional code is from line 2713 to line 2780, plus line 668 (already addressed in #122).

Again, you are mixing up different topics in the same issue. Your second modification, in line 668, is addressed in #122.

Regarding SAPHana.md: you need to fork the repository, create a new branch there, commit your changes, and open a pull request.
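For reference, a typical GitHub flow for that (the user name, branch name and the `ra/SAPHana` path are placeholders/assumptions, adjust as needed):

```bash
# fork SUSE/SAPHanaSR in the GitHub UI first, then:
git clone https://github.com/<your-user>/SAPHanaSR.git
cd SAPHanaSR
git checkout -b saphana-probe-sets-roles-attribute
# apply your changes to the SAPHana resource agent, then
git add ra/SAPHana
git commit -m "SAPHana: set roles attribute for the primary during probe"
git push -u origin saphana-probe-sets-roles-attribute
# finally open a pull request from that branch against SUSE/SAPHanaSR
```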

Hi @PeterPitterling,

Got it, I have modified the comment and will try to fork the repository.
Thank you.

As discussed in PR #128, the maintenance procedure also works quite well for the SLE12 SP4 code base, if you use the maintenance procedure for pacemaker 2.0. -> Closing the issue.

Closed as written before.