fortinet/azure-templates

Link monitor Best Practice Azure

AndreasMWalter opened this issue · 2 comments

We recently had an issue where the heartbeat was lost between both FortiGates. The main cause was a VNet network update within the Microsoft backend, during which one machine had a total loss of connectivity.

This can happen from time to time, which is one reason why Microsoft only gives an SLA for zonal VM deployments or availability set deployments.

In our case the SD-WAN was broken and the tunnel needed to be reset, similar to the following issue:
#51

In general I would assume that, in order to prevent a split-brain condition between the machines, some sort of quorum is needed. However, link monitor can't be used for that purpose, since Microsoft has no single highly available service that answers ICMP.

However, a Storage Account could be used (similar to the Windows Failover Cluster Cloud Witness) to achieve quorum.
Is there a best practice for resolving a split-brain condition without having to deploy a third FortiGate?

(The setup is Active-Passive; a third FortiGate would incur extra license/compute costs and create an Active-Passive-Passive configuration only to achieve quorum.)
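As a side note on the ICMP limitation: FortiOS link-monitor is not restricted to ping; it can also probe over HTTP, so a highly available HTTP endpoint (for example the blob endpoint of a Storage Account, in the spirit of the witness idea above) could in principle be monitored instead. A minimal sketch, assuming a hypothetical storage endpoint and port1 as the outside interface; the monitor name, server, port and timers are placeholders, the # lines are annotations rather than CLI input, and this only gives a non-ICMP health probe, it does not by itself provide quorum:

config system link-monitor
    edit "azure-witness-probe"
        # hypothetical monitor name; probes an HTTP endpoint instead of ICMP
        set srcintf "port1"
        set protocol http
        set server "example.blob.core.windows.net"
        set port 80
        set interval 5
        set failtime 3
        set recoverytime 5
    next
end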

Hi,

Thank you for opening this issue. For a review of your HA issue specifically, it is best to have our support team look at it. In an Active-Passive setup, the FGCP protocol provides the failover protection (https://docs.fortinet.com/document/fortigate/7.2.4/administration-guide/489324/failover-protection).
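For context, a minimal sketch of what an FGCP Active-Passive block typically looks like in Azure, where the heartbeat has to run as unicast over a dedicated port; the group name, heartbeat interface and peer IP are placeholders, not taken from this deployment:

config system ha
    set group-name "AzureAP"
    set mode a-p
    set hbdev "port3" 100
    set session-pickup enable
    # broadcast heartbeats are not available in a VNet, so unicast is used
    set unicast-hb enable
    set unicast-hb-peerip 172.16.3.5
    set override disable
end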

The IPsec failover is most likely an issue with the deployment in the public cloud. You are deploying behind a load balancer, which performs NAT. The IPsec tunnels should be set up from the remote site to the FortiGate cluster in Azure. You can force this on the Azure-based FortiGate by adding the command 'set passive enable' in the IPsec phase1 configuration. This should prevent issues when a failover occurs.
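A minimal sketch of where that setting sits, assuming a hypothetical tunnel name, interface and remote gateway; only 'set passive enable' is the relevant line, the rest is placeholder context:

config vpn ipsec phase1-interface
    edit "to-branch"
        set interface "port1"
        set ike-version 2
        set remote-gw 203.0.113.10
        # respond only; the remote site initiates towards the Azure cluster
        set passive enable
        set psksecret <pre-shared-key>
    next
end

With passive enabled, the Azure-side FortiGate never initiates the tunnel, so the session is always built from the remote site towards the load-balanced public IP, which keeps the tunnel consistent across a failover.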

Regards,

Joeri

Thank you. As far as I understand, the IPsec tunnel issue was a routing bug fixed in the latest firmware updates.
The witness can be considered a recommendation to improve the service, as you cannot use ICMP to monitor the next hop in the Azure cloud.

Other than that, the issue can be considered resolved.