SUSE/SAPHanaSR

Install srHook

Closed this issue · 11 comments

According to the HANA administration guide, a HANA restart is not required to load the srHook.

All scripts are loaded during the start up phase of the name server, alternatively, to avoid the need for a restart, run the following command to reload the scripts immediately:

hdbnsutil -reloadHADRProviders
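One possible way to verify that the reload actually picked up the hook (run as <sid>adm) is to scan the nameserver trace for the load message and for load errors. This is only a sketch: `cdtrace` is the standard alias for the trace directory in the <sid>adm environment, and the grep pattern is an example, not an official check.

```shell
# Reload the HA/DR providers without restarting HANA (run as <sid>adm).
hdbnsutil -reloadHADRProviders

# Then check the nameserver trace for the load message and any load errors.
cdtrace
grep -E "loading HA/DR Provider|HADRProviderManager|error" nameserver_*.trc | tail -n 20
```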

However, if no replication synchronization status change event happens after reloading the srHook, the srHook attribute will not be created. And if the call to systemReplicationStatus.py times out for whatever reason, the sync status is set to SFAIL.

Now if the primary HANA crashes, the get_SRHOOK function will take the SFAIL status because the srHook attribute was never created, thus the RA prefers a local HANA restart instead of a failover. Do I understand it correctly?

The reason to open this issue is to confirm whether a HANA restart can be avoided when installing the srHook.

@cherrylegler

  1. We are restarting the SAP HANA database also to check whether the database still starts properly. Your command is also fine, but then you need to check the trace files carefully for load errors.
  2. If the resource starts, the RA sets SWAIT on the secondary side, because we do not know the last hook status, as we could not replay the last matching SR event. For this purpose we use the polling attribute sync_state, which is written during each monitor call. So sync_state fills the gap you have described.

Hope that helps a bit.
Best Fabian

Thanks Fabian. One more thought: can we explicitly create the srHook attribute if the sync state is SOK after reloading the srHook?

crm_attribute -n hana_<sid>_site_srHook_<nodename> -v SOK -t crm_config -s SAPHanaSR

Is this a productive environment? Then I would be careful about bypassing your official support provider. For test systems I would give your command a try; however, it is never intended that this attribute is set by an admin. It could also cause a harmful takeover if the SR were not in sync and the primary crashed after that command. Deep in the logs we could tell whether the admin or HANA set the attributes, but I would not like official support to have to dig that deep into the logs if something goes wrong.

@cherrylegler BTW: what does SAPHanaSR-showAttr [--sid ] say in your case? Could you quote it here, if there is no confidential info? You could change the node names and virtual host names, if you like.

Yes, it's a production environment, thus trying to avoid restart.

In your scenario, if I understand correctly, it sounds more like a race condition:

  1. HANA out of sync, srHook set to SFAIL
  2. I manually set to SOK
  3. A HANA failover may then cause data loss.

My plan is (SLES 15 SP2 HANA cluster):

  1. Use systemReplicationStatus.py to confirm HANA is in sync
  2. Manually set hana_<sid>_site_srHook_<primarysite> to PRIM and hana_<sid>_site_srHook_<secondarysite> to SOK
  3. If HANA goes out of sync, the srHook should set SFAIL
  4. If HANA crashes, the last known replication status is SOK and a failover would happen.

I tested in my SLES 15 SP2 HANA HA cluster. After hdbnsutil -reloadHADRProviders, nameserver trace shows HADRProviderManager.cpp(00075) : loading HA/DR Provider 'SAPHanaSR' from /hana/shared/myHooks

SAPHanaSR-showAttr doesn't show the srHook attributes.

Global cib-time                 maintenance 
--------------------------------------------
global Thu Feb 17 05:01:18 2022 false       

Hosts     clone_state lpa_tst_lpt node_state op_mode   remoteHost roles                            score site      srmode  sync_state version                vhost     
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
testhana1 PROMOTED    1645074078  online     logreplay testhana2  4:P:master1:master:worker:master 150   testhana1 syncmem PRIM       2.00.052.00.1599235305 testhana1 
testhana2 DEMOTED     30          online     logreplay testhana1  4:S:master1:master:worker:master 100   testhana2 syncmem SOK        2.00.052.00.1599235305 testhana2 
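As a side note, the sync_state column can be pulled out of such output mechanically. Below is a minimal awk sketch against the sample table above; in practice you would pipe `SAPHanaSR-showAttr` into it, and the field position (11) matches this particular column layout only, so it would need adjusting for other SAPHanaSR versions.

```shell
# Print host and sync_state from SAPHanaSR-showAttr-style output.
# The heredoc reproduces the table pasted above; field 11 is sync_state here.
awk 'f && NF >= 11 && $1 !~ /^-/ { print $1, $11 } /^Hosts/ { f = 1 }' <<'EOF'
Hosts     clone_state lpa_tst_lpt node_state op_mode   remoteHost roles                            score site      srmode  sync_state version                vhost
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
testhana1 PROMOTED    1645074078  online     logreplay testhana2  4:P:master1:master:worker:master 150   testhana1 syncmem PRIM       2.00.052.00.1599235305 testhana1
testhana2 DEMOTED     30          online     logreplay testhana1  4:S:master1:master:worker:master 100   testhana2 syncmem SOK        2.00.052.00.1599235305 testhana2
EOF
```

For the sample data this prints one line per host with its sync_state: `testhana1 PRIM` and `testhana2 SOK`.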

This is a comment here at GitHub without any warranty.
For production it is not valid to answer here in the GitHub project. This could be a trap if something goes wrong once you set the parameter manually. Or in other words: I should and must not bypass the official support.
About your procedure (1 to 4):
I would test the official maintenance procedure on a test cluster, which would be

  • first set the msl resource to maintenance,
  • then do the needed checks (step 1 in your case) and changes (step 2), then do the checks again (step 1) to be sure there was no race,
  • then refresh the msl resource and check that all instances of the msl resource are now in "slave" status in the crm_mon output,
  • then end the msl maintenance mode.
    But again, this is a comment here at GitHub without any warranty.
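The maintenance steps above could be sketched with crmsh roughly as follows. This is a sketch without any warranty: the resource name msl_SAPHana_TST_HDB00 is a hypothetical example (it depends on your SID and instance number), and the <primarysite>/<secondarysite> placeholders are taken from the attribute names earlier in the thread.

```shell
# Put the multi-state resource into maintenance (hypothetical name).
crm resource maintenance msl_SAPHana_TST_HDB00 on

# Step 1: as <sid>adm on the primary, confirm SR is in sync, e.g. with
#   HDBSettings.sh systemReplicationStatus.py

# Step 2: set the attributes, then repeat step 1 to rule out a race.
crm_attribute -n hana_tst_site_srHook_<primarysite>   -v PRIM -t crm_config -s SAPHanaSR
crm_attribute -n hana_tst_site_srHook_<secondarysite> -v SOK  -t crm_config -s SAPHanaSR

# Refresh the resource, check crm_mon shows all instances as "slave",
# then end the maintenance mode.
crm resource refresh msl_SAPHana_TST_HDB00
crm resource maintenance msl_SAPHana_TST_HDB00 off
```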

Thanks Fabian. I totally understand this is not the standard procedure; a HANA restart will always be the recommended way.

@cherrylegler Thanks for the feedback. And yes, a restart of HANA would trigger the SFAIL and then the SOK (after being in sync again). You might send your feedback on whether the above non-official procedure (including the resource maintenance steps) worked for you.

I just tested in my SLES 15 SP3 cluster; the procedure worked and did not cause any interruption. I refreshed the msl resource while the cluster was in maintenance mode, and there was no change in the resource status. Afterwards, the srHook attributes are shown in SAPHanaSR-showAttr.

I did a couple more tests. In conclusion, there are three ways to enable the srHook:

  1. Restart HANA
  2. Trigger a failover/takeover
  3. Manually create srHook attributes (least preferable)

@cherrylegler Could we close this issue? Are there any open points (maybe I missed one)?

  1. While in your ordered list 1 and 2 cause downtime and 3 is not preferred, you could also block the IP communication for the SAP HANA system replication with a firewall command for a short while. This would break the SR for a short time. After removing the iptables rule, the SOK status will be written by the srHook once the SR is in sync again.
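The firewall approach might look roughly like this, run as root on the primary. This is a sketch without any warranty: the port 40002 is only a placeholder, since the real system replication ports depend on the instance number, so verify the actual SR port(s) first (e.g. with `ss -tlpn`).

```shell
# Block outgoing system replication traffic for a short while.
# 40002 is a placeholder; determine the real SR port(s) of your instance first.
iptables -A OUTPUT -p tcp --dport 40002 -j DROP   # SR breaks, srHook writes SFAIL
sleep 60                                          # give the hook time to fire
iptables -D OUTPUT -p tcp --dport 40002 -j DROP   # remove the rule again
# once SR has caught up, the srHook writes SOK
```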

Yes, thank you!