fortinet/fortigate-terraform-deploy

Fortigate user-data script unable to complete the configuration


I have deployed the FortiGate HA solution multiple times, and there seems to be a timing issue: the secondary NICs are attached to the FortiGate VM after a delay of 40-50 seconds, and by that time the cloud-init (user-data) script has already been triggered during the FortiGate VM's first reboot. Because some of the secondary NICs are not yet available (still being created) during this reboot, not all interfaces in the FortiGate VM are configured, and as a result the HA configuration also fails. It works sometimes and fails other times, probably because of OCI cloud API response delays, a quick VM reboot, etc. I tried to add a delay to the cloud-init script with "fnsysctl sleep 120", but the FortiGate CLI/shell does not recognize it and returns the following error:
FortiGate-A # fnsysctl sleep 120
can not find command sleep

Has anyone else encountered this issue and found a workaround?

The following is the timing for one of the runs:

2023-03-09T03:47:41.0033865Z oci_core_instance.vm-a[0]: Creation complete after 37s
2023-03-09T03:47:41.0075511Z oci_core_vnic_attachment.vnic_attach_untrust_a[0]: Creating...
2023-03-09T03:47:55.9809283Z oci_core_vnic_attachment.vnic_attach_untrust_a[0]: Creation complete after 15s
2023-03-09T03:47:55.9901000Z oci_core_vnic_attachment.vnic_attach_trust_a[0]: Creating...
2023-03-09T03:48:10.8680863Z oci_core_vnic_attachment.vnic_attach_trust_a[0]: Creation complete after 15s
2023-03-09T03:48:10.8729656Z oci_core_vnic_attachment.vnic_attach_hb_a[0]: Creating...
2023-03-09T03:48:25.9414560Z oci_core_vnic_attachment.vnic_attach_hb_a[0]: Creation complete after 15s
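To put a number on the gap, here is a minimal sketch that reads the log lines above and measures the time between the instance being created and the last VNIC attachment completing. The helper names (`parse_events`, `attach_window_seconds`) are made up for illustration, and the line format is assumed from this one run only.

```python
import re
from datetime import datetime

# Terraform apply log lines copied from the run above.
LOG = """\
2023-03-09T03:47:41.0033865Z oci_core_instance.vm-a[0]: Creation complete after 37s
2023-03-09T03:47:41.0075511Z oci_core_vnic_attachment.vnic_attach_untrust_a[0]: Creating...
2023-03-09T03:47:55.9809283Z oci_core_vnic_attachment.vnic_attach_untrust_a[0]: Creation complete after 15s
2023-03-09T03:47:55.9901000Z oci_core_vnic_attachment.vnic_attach_trust_a[0]: Creating...
2023-03-09T03:48:10.8680863Z oci_core_vnic_attachment.vnic_attach_trust_a[0]: Creation complete after 15s
2023-03-09T03:48:10.8729656Z oci_core_vnic_attachment.vnic_attach_hb_a[0]: Creating...
2023-03-09T03:48:25.9414560Z oci_core_vnic_attachment.vnic_attach_hb_a[0]: Creation complete after 15s
"""

# Assumed line shape: "<timestamp>Z <resource address>: <message>"
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.(\d+)Z (\S+): (.+)$")


def parse_events(log):
    """Return (timestamp, resource, message) tuples for every matching line."""
    events = []
    for line in log.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        stamp, frac, resource, message = match.groups()
        # strptime's %f accepts at most 6 digits, so trim the 7-digit fraction.
        ts = datetime.strptime(f"{stamp}.{frac[:6]}", "%Y-%m-%dT%H:%M:%S.%f")
        events.append((ts, resource, message))
    return events


def attach_window_seconds(events):
    """Seconds from instance creation to the last VNIC attachment finishing."""
    instance_up = next(
        ts for ts, res, msg in events
        if res.startswith("oci_core_instance.") and msg.startswith("Creation complete")
    )
    last_vnic = max(
        ts for ts, res, msg in events
        if res.startswith("oci_core_vnic_attachment.") and msg.startswith("Creation complete")
    )
    return (last_vnic - instance_up).total_seconds()


if __name__ == "__main__":
    print(f"{attach_window_seconds(parse_events(LOG)):.1f}s")
```

For this run the window comes out at roughly 45 seconds (three sequential attachments of about 15 seconds each), which matches the 40-50 second delay I described.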

Thanks

Hi mhca99,

Yes. We have noticed a similar issue, but only with OCI's Terraform module, because it can only attach a NIC while the instance is up. That can sometimes take longer than it takes for the bootstrap config to be executed, which causes problems when the bootstrap tries to configure the NIC.

Right now, we don't have any method to delay the boot, so there is no such thing as fnsysctl sleep 120.

You can try to do an exec factoryreset manually after the VM is up, as by then the NICs should already have been attached to the instance.

Cheers

mhca99 commented

Hi mobilesuitzero,

Thanks for checking it out. So far I have seen this issue only with the following PAYG FortiGate image, which I used for most of my testing; the BYOL FortiGate image works fine. So the issue is not just with OCI cloud but also with this particular image. Did you test this PAYG image with the Terraform code?

FortiGate-VM 7.2.4 PAYG (Not-working):
mp_listing_id = "ocid1.appcataloglisting.oc1..aaaaaaaa6d5wbjlrlihw7l33nvdso74lv2s66snabevr33awotpgjownggiq"
mp_listing_resource_id= "ocid1.image.oc1..aaaaaaaaodxeegjaovwo72se7jafarm24im6n2c6afjhznli3xxufuejt4mq"

FortiGate-VM 7.2.4 BYOL (working):
mp_listing_id = "ocid1.appcataloglisting.oc1..aaaaaaaam7ewzrjbltqiarxukuk72v2lqkdtpqtwxqpszqqvrm7likfnpt5q"
mp_listing_resource_id= "ocid1.image.oc1..aaaaaaaa5m67jbvb33hoxpefr7fhfhf7gaeie4xjg7p4heixg25osr5warcq"

FYI, the following are the cloudinit logs for the "Not-working" FortiGate image:

OCI-HA-Active $ config system interface
OCI-HA-Active (interface) $ edit port2
OCI-HA-Active (port2) $ set alias public
OCI-HA-Active (port2) $ set mode static
OCI-HA-Active (port2) $ set ip 10.1.1.4 255.255.255.0
OCI-HA-Active (port2) $ set allowaccess ping https ssh fgfm
OCI-HA-Active (port2) $ set mtu-override enable
OCI-HA-Active (port2) $ set mtu 9000
Please input interface of the physical device first.
MTU size not valid. Should be in the range of 68 - 1500.
node_check_object fail! for mtu 9000
value parse error before '9000'
Command fail. Return code -2
OCI-HA-Active (port2) $ next
Attribute 'vdom' MUST be set.
Command fail. Return code 1
OCI-HA-Active (interface) $ end
OCI-HA-Active $ config system interface
OCI-HA-Active (interface) $ edit port3
OCI-HA-Active (port3) $ set alias trust
OCI-HA-Active (port3) $ set mode static
OCI-HA-Active (port3) $ set ip 10.1.2.4 255.255.255.0
OCI-HA-Active (port3) $ set allowaccess ping https ssh fgfm
OCI-HA-Active (port3) $ set mtu-override enable
OCI-HA-Active (port3) $ set mtu 9000
Please input interface of the physical device first.
MTU size not valid. Should be in the range of 68 - 1500.
node_check_object fail! for mtu 9000
value parse error before '9000'
Command fail. Return code -2
OCI-HA-Active (port3) $ next
Attribute 'vdom' MUST be set.
Command fail. Return code 1
OCI-HA-Active (interface) $ end
OCI-HA-Active $ config system interface
OCI-HA-Active (interface) $ edit port4
OCI-HA-Active (port4) $ set alias hasync
OCI-HA-Active (port4) $ set mode static
OCI-HA-Active (port4) $ set ip 10.1.3.3 255.255.255.0
OCI-HA-Active (port4) $ set allowaccess ping https ssh fgfm
OCI-HA-Active (port4) $ set mtu-override enable
OCI-HA-Active (port4) $ set mtu 9000
Please input interface of the physical device first.
MTU size not valid. Should be in the range of 68 - 1500.
node_check_object fail! for mtu 9000
value parse error before '9000'
Command fail. Return code -2
OCI-HA-Active (port4) $ next
Attribute 'vdom' MUST be set.
Command fail. Return code 1
OCI-HA-Active (interface) $ end
OCI-HA-Active $ config sys ha
OCI-HA-Active (ha) $ set group-name OCI-HA
OCI-HA-Active (ha) $ set mode a-p
OCI-HA-Active (ha) $ set hbdev port4 100
'port4' is not a valid interface
node_check_object fail! for hbdev port4
value parse error before 'port4'
Command fail. Return code -651
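For anyone triaging a similar log, here is a minimal sketch that pairs each "Command fail. Return code N" line with the CLI command that preceded it. The function name `failed_commands` is hypothetical, and the prompt pattern is assumed from the excerpt above (hostname, optional context in parentheses, then `$`); it is not a documented FortiOS log format.

```python
import re

# Shortened excerpt of the cloudinit log above.
LOG = """\
OCI-HA-Active (port2) $ set mtu-override enable
OCI-HA-Active (port2) $ set mtu 9000
Please input interface of the physical device first.
MTU size not valid. Should be in the range of 68 - 1500.
node_check_object fail! for mtu 9000
value parse error before '9000'
Command fail. Return code -2
OCI-HA-Active (port2) $ next
Attribute 'vdom' MUST be set.
Command fail. Return code 1
OCI-HA-Active (interface) $ end
OCI-HA-Active $ config sys ha
OCI-HA-Active (ha) $ set group-name OCI-HA
OCI-HA-Active (ha) $ set mode a-p
OCI-HA-Active (ha) $ set hbdev port4 100
'port4' is not a valid interface
node_check_object fail! for hbdev port4
value parse error before 'port4'
Command fail. Return code -651
"""

# Assumed prompt shape: "<hostname> (<context>) $ <command>", context optional.
PROMPT = re.compile(r"^\S+(?: \((\S+)\))? \$ (.+)$")
FAIL = re.compile(r"^Command fail\. Return code (-?\d+)$")


def failed_commands(log):
    """Return (context, command, return_code) for every failed command."""
    failures = []
    context, command = None, None
    for line in log.splitlines():
        prompt = PROMPT.match(line)
        if prompt:
            # Remember the most recent command so a later failure can be tied to it.
            context, command = prompt.group(1), prompt.group(2)
            continue
        fail = FAIL.match(line)
        if fail and command is not None:
            failures.append((context, command, int(fail.group(1))))
    return failures


if __name__ == "__main__":
    for context, command, code in failed_commands(LOG):
        print(f"[{context}] {command!r} failed with return code {code}")
```

On the excerpt this flags `set mtu 9000` and `next` under port2 and `set hbdev port4 100` under ha, i.e. exactly the commands that depend on an interface that was not attached yet.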

Hi mhca99,

PAYG usually has this issue because, unlike BYOL, it does not reboot to import a license.
That's why BYOL works fine, whereas with PAYG you can see that the NICs have not yet been attached to the instance by the time it tries to run the bootstrap configuration.

Cheers

Side Note:

From the readme file:

--snippet---
After deployment, FortiGate-VM instances may not get the proper configurations during the initial bootstrap configuration. User may need to do a manual factoryreset on the units in order to get proper configurations. To do a factoryreset, user can login to the units via Console, and do exec factoryreset
--snippet---

mhca99 commented

Hi mobilesuitzero,

Thanks for the information. I believe the bootstrap configuration is applied on the first boot, where the attached license is also applied as per the user_data config, for both PAYG and BYOL images. I am just trying to understand your comment that PAYG does not reboot to import a license: in that case it should stay up (longer than BYOL) until all the NICs are attached to the VM before it reboots, and the bootstrap config would succeed. However, it looks like it reboots before the NICs are attached.
I have compared the bootstrap logs in both cases and listed them below for your reference, in case they help pinpoint something. The only difference I see is the additional line "Trying to install vmlicense ..." in the logs for the BYOL image, as follows:

BYOL image bootstrap logs:
vma # diag debug cloudinit sh

Checking metadata source opc
OPC has obtained user data
OPC user data decrypted
MIME parsed VM license
MIME parsed config script
OPC customdata processed successfully
Trying to install vmlicense ...
Run config script
FGVM04TM2XXXX $ config system global

PAYG image bootstrap logs:
vma # diag debug cloudinit sh

Checking metadata source opc
OPC has obtained user data
OPC user data decrypted
MIME parsed VM license
MIME parsed config script
OPC customdata processed successfully
Run config script
FortiGate-VM64-OPC $ config system global
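The comparison of the two logs can be reproduced with a minimal sketch like the one below. The helper name `only_in` is made up for illustration; CLI prompt lines (those containing ` $ `) are skipped because the hostname differs between the two images and would otherwise show up as a false difference.

```python
# Bootstrap logs copied from the two listings above (after the diag command line).
BYOL = """\
Checking metadata source opc
OPC has obtained user data
OPC user data decrypted
MIME parsed VM license
MIME parsed config script
OPC customdata processed successfully
Trying to install vmlicense ...
Run config script
FGVM04TM2XXXX $ config system global
"""

PAYG = """\
Checking metadata source opc
OPC has obtained user data
OPC user data decrypted
MIME parsed VM license
MIME parsed config script
OPC customdata processed successfully
Run config script
FortiGate-VM64-OPC $ config system global
"""


def only_in(first, second):
    """Return the non-prompt lines that appear in `first` but not in `second`."""
    second_lines = {
        line.strip() for line in second.splitlines() if " $ " not in line
    }
    return [
        line.strip()
        for line in first.splitlines()
        if line.strip() and " $ " not in line and line.strip() not in second_lines
    ]


if __name__ == "__main__":
    print("Only in BYOL:", only_in(BYOL, PAYG))
    print("Only in PAYG:", only_in(PAYG, BYOL))
```

The only line unique to the BYOL log is "Trying to install vmlicense ...", and nothing is unique to the PAYG log.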

Further, even though BYOL ends up completing the configuration, most of the time the GUI gets stuck forever at the message "FortiGate VM License License is being validated by FortiGuard" and I see a partitioned HA cluster. In that case "exec factoryreset" doesn't work and I have to manually remove and configure HA. Looks like that's another bug.

Thanks

Hi,

With BYOL, it will try to install the license, which usually triggers a reboot that takes a couple of minutes. By then, the NICs are usually already attached to the instances on the OCI side.

As for the GUI being stuck at validating the license, it means FortiGate is trying to validate the license with FortiGuard.
Please make sure the FortiGate can connect to FortiGuard; you can also try exec update-now via the CLI to trigger a license validation with FortiGuard.

Hope that helps.

Cheers

Closing this issue for now.