Install srHook
According to the HANA admin guide, loading the srHook does not require a HANA restart:
All scripts are loaded during the startup phase of the name server. Alternatively, to avoid the need for a restart, run the following command to reload the scripts immediately:
hdbnsutil -reloadHADRProviders
However, if no replication synchronization status change event happens after reloading the srHook, the srHook attribute will not be created. And if calling systemReplicationStatus.py times out for whatever reason, the sync status is set to SFAIL.
Now if the primary HANA crashes, the get_SRHOOK function will take the SFAIL status, as the srHook attribute was never created, and thus the RA prefers a local HANA restart instead of a failover. Do I understand that correctly?
The reason to open this issue is to confirm whether a HANA restart can be avoided when installing the srHook.
- We are restarting the SAP HANA database to also check whether the database still starts properly. Your command is also fine, but then you need to check the trace files carefully for load errors.
- If the resource starts, the RA sets SWAIT on the secondary side, because we do not know the last hook status, as we could not replay the last matching SR event. For this purpose we are taking the polling attribute sync_state, which is written during each monitor call. So sync_state fills the gap you have described.
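Checking the trace files for load errors after a reload, as suggested above, might look like this (a sketch; the SID `TST` and user `tstadm` are placeholders for illustration):

```shell
# As the <sid>adm user (here: tstadm) on the HANA primary:
su - tstadm

# Reload the HA/DR providers without restarting HANA:
hdbnsutil -reloadHADRProviders

# Check the nameserver trace for the provider load and for errors:
cdtrace   # HANA alias that changes into the instance trace directory
grep -h HADRProviderManager nameserver_*.trc | tail -n 5
grep -hi 'error' nameserver_*.trc | tail -n 20
```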
Hope that helps a bit.
Best Fabian
Thanks Fabian. One more thought: can we explicitly create the srHook attribute if the sync state is SOK after reloading the srHook?
crm_attribute -n hana_<sid>_site_srHook_<nodename> -v SOK -t crm_config -s SAPHanaSR
Is this a productive environment? Then I would be careful about bypassing your official support provider. For test systems I would give your command a try; however, it is never intended for this attribute to be set by an admin. It could also cause a harmful takeover if the SR were not in sync and the primary crashed after that command. Deep in the logs we could tell whether the admin or HANA set the attribute, but I would not like an official support case to have to dig that deep into the logs if something goes wrong.
@cherrylegler BTW: what does SAPHanaSR-showAttr [--sid ] say in your case? Could you quote it here, if there is no confidential info? You could change the node names and virtual host names, if you like.
Yes, it's a production environment, thus trying to avoid restart.
In your scenario, if I understand correctly, it sounds more like a race condition:
- HANA out of sync, srHook set to SFAIL
- I manually set to SOK
- HANA failover may have data loss.
My plan is (SLES 15 SP2 HANA cluster):
- Use systemReplicationStatus.py to confirm HANA is in sync
- Manually set hana_<sid>_site_srHook_<primarysite> to PRIM and hana_<sid>_site_srHook_<secondarysite> to SOK
- If HANA goes out of sync, the srHook will set SFAIL
- If HANA crashes, the last known replication status is SOK and a failover would happen.
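On the command line, the plan above might be sketched like this (the SID `TST`, site names `SITEA`/`SITEB`, and the `HDBSettings.sh` wrapper invocation are placeholders/assumptions; step 2 is the unofficial part discussed in this thread):

```shell
# 1. Confirm SR is in sync; systemReplicationStatus.py returns 15 when ACTIVE:
su - tstadm -c "HDBSettings.sh systemReplicationStatus.py"; echo "rc=$?"

# 2. Manually seed the srHook attributes (unofficial, at your own risk):
crm_attribute -n hana_tst_site_srHook_SITEA -v PRIM -t crm_config -s SAPHanaSR
crm_attribute -n hana_tst_site_srHook_SITEB -v SOK  -t crm_config -s SAPHanaSR

# 3./4. From now on, the loaded hook updates the attributes on SR events.
```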
I tested in my SLES 15 SP2 HANA HA cluster. After hdbnsutil -reloadHADRProviders, the nameserver trace shows HADRProviderManager.cpp(00075) : loading HA/DR Provider 'SAPHanaSR' from /hana/shared/myHooks, but SAPHanaSR-showAttr doesn't show the srHook attributes:
```
Global cib-time                 maintenance
--------------------------------------------
global Thu Feb 17 05:01:18 2022 false

Hosts clone_state lpa_tst_lpt node_state op_mode remoteHost roles score site srmode sync_state version vhost
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
testhana1 PROMOTED 1645074078 online logreplay testhana2 4:P:master1:master:worker:master 150 testhana1 syncmem PRIM 2.00.052.00.1599235305 testhana1
testhana2 DEMOTED 30 online logreplay testhana1 4:S:master1:master:worker:master 100 testhana2 syncmem SOK 2.00.052.00.1599235305 testhana2
```
This is a comment here at github without any warranty.
For production it is not valid to answer here in the GitHub project. This could be a trap if something goes wrong after you set the parameter manually. Or in other words: I should not and must not bypass official support.
About your procedure (steps 1 to 4):
I would test (test-cluster) the official maintenance procedure. Which would be
- first to set the msl-resource to maintenance,
- then to do the needed checks (step 1 in your case) and changes (step 2), then do the check again (step 1) to be sure we had no race
- then to refresh the msl-resource - check that all instances of the msl-resource are now in "slave" status in crm_mon output
- then ending the msl maintenance mode
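With crmsh, the procedure above might be sketched like this (the resource name `msl_SAPHana_TST_HDB00` is a placeholder for your multi-state SAPHana resource):

```shell
# 1. Put only the multi-state SAPHana resource into maintenance:
crm resource maintenance msl_SAPHana_TST_HDB00 on

# 2./3. ...do the checks and changes here, then repeat the checks...

# 4. Re-probe the resource; crm_mon should now show all instances as "Slave":
crm resource refresh msl_SAPHana_TST_HDB00
crm_mon -1r

# 5. End the maintenance mode:
crm resource maintenance msl_SAPHana_TST_HDB00 off
```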
But again this is a comment here at github without any warranty.
Thanks Fabian. I totally understand this is not the standard procedure; a HANA restart will always be the recommended way.
@cherrylegler Thanks for the feedback. And yes, the restart of HANA would trigger the SFAIL and then the SOK (after being in sync again). You might send your feedback if the above unofficial procedure (including the resource maintenance steps) did work for you.
I just tested in my SLES 15 SP3 cluster; the procedure worked and didn't cause any interruption. I refreshed the msl resource while the cluster was in maintenance mode, and there was no change in the resource status. Afterwards, the srHook is shown in SAPHanaSR-showAttr.
Did a couple more tests. In conclusion, there are three ways to enable the srHook:
- Restart HANA
- Trigger a failover/takeover
- Manually create srHook attributes (least preferable)
@cherrylegler Could we close this issue? Are there any open points (maybe I missed one)?
- While options 1 and 2 in your ordered list create downtime and option 3 is not preferred, you could also block the IP communication for the SAP HANA synchronization using a firewall command for a short while. This would break the SR for a short time. After removing the iptables rule again, the SOK status will be written by the srHook once the SR is in sync again.
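The short SR interruption via firewall might be sketched like this (run on the secondary; `192.168.1.10` stands in for the primary's replication address and the timing is an assumption; try this on a non-production cluster first):

```shell
# Drop incoming replication traffic from the primary for a short while:
iptables -I INPUT -s 192.168.1.10 -j DROP
sleep 60   # give the hook time to report the broken SR (SFAIL)

# Remove the rule again; once SR re-syncs, the hook writes SOK:
iptables -D INPUT -s 192.168.1.10 -j DROP
SAPHanaSR-showAttr
```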
Yes, thank you!