cdot65/pan-os-upgrade

Enhance HA Handling in Batch Upgrades for pan-os-upgrade Script


Summary

Requesting an enhancement to the pan-os-upgrade script to improve handling of High Availability (HA) configurations during batch upgrades. The goal is to automate the determination of which firewall in an HA pair should be upgraded first, ensuring a smoother and more reliable upgrade process.

Current Behavior

The current implementation of batch upgrades in the pan-os-upgrade script does not specifically account for HA configurations. The script assumes that the operator targets the passive firewall within an HA pair. However, this approach requires manual intervention and verification, which can be error-prone and inefficient in environments with multiple HA pairs.

Desired Behavior

The script should automatically identify firewalls in HA configurations and determine the passive unit within each HA pair. The upgrade process should then be initiated on the passive firewall first. Upon successful completion of the upgrade and HA synchronization, the script should then trigger a failover to upgrade the formerly active unit.

Implementation Considerations

- HA Pair Detection: Automatically identify firewalls that are part of an HA pair.
- HA Status Determination: Dynamically determine which firewall in the HA pair is passive and should be upgraded first.
- Sequential Upgrade Process: Automate the upgrade process, starting with the passive unit, followed by a controlled failover and upgrade of the other unit.
- Error Handling and Recovery: Implement robust error handling to manage potential issues during the upgrade process, especially in HA environments.
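
As a rough illustration of the detection and ordering considerations above, here is a minimal sketch in plain Python. It is not taken from the pan-os-upgrade code base: `FirewallInfo`, its fields, and `plan_upgrade_order` are hypothetical names, and the sketch assumes the HA state and peer serial of each firewall have already been collected from the devices.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class FirewallInfo:
    """Hypothetical inventory record; not part of pan-os-upgrade."""
    hostname: str
    serial: str
    ha_state: Optional[str]      # "active", "passive", or None if standalone
    peer_serial: Optional[str]   # serial of the HA peer, if any


def plan_upgrade_order(firewalls: List[FirewallInfo]) -> List[FirewallInfo]:
    """Order firewalls so each passive unit comes right before its active peer."""
    by_serial: Dict[str, FirewallInfo] = {fw.serial: fw for fw in firewalls}
    ordered: List[FirewallInfo] = []
    seen: Set[str] = set()

    for fw in firewalls:
        if fw.serial in seen:
            continue
        peer = by_serial.get(fw.peer_serial) if fw.peer_serial else None
        if peer is None:
            # Standalone, or the peer is not in this batch: treat it on its own.
            ordered.append(fw)
            seen.add(fw.serial)
            continue
        # sorted() is stable, so the passive member is moved to the front.
        pair = sorted([fw, peer], key=lambda m: 0 if m.ha_state == "passive" else 1)
        ordered.extend(pair)
        seen.update(member.serial for member in pair)
    return ordered
```

Keeping the two members of a pair adjacent in the plan makes it easier to enforce the passive-first rule before any failover is triggered.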

Use Cases

- Batch Upgrades in HA Environments: Efficiently handle batch upgrades where multiple firewalls are configured in HA pairs.
- Reduced Operator Overhead: Minimize manual intervention and reduce the risk of errors in HA upgrade processes.

Potential Benefits

- Automated HA Handling: Simplify the upgrade process in HA environments, making it more reliable and less prone to human error.
- Enhanced Reliability: Ensure high availability and minimal downtime during upgrades.
- Scalability: Improve the script's ability to handle complex environments with multiple HA configurations.

Request for Comments

Feedback is requested from the community on the proposed enhancement. This includes any concerns about HA handling, alternative approaches, and specific requirements from users managing HA configurations in PAN-OS environments.

@adambaumeister My initial pass at the logic for this is becoming a bit too heavy for my liking, and I'm worried that it'll cause more issues when we open up multi-threading. For instance, how can we prevent threads from being allocated to the active and passive firewalls of the same pair concurrently?
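
One possible answer to the concurrency question, sketched under assumptions: a small registry, guarded by a lock, in which a thread must claim an HA pair before touching either member. `PairClaims` and its methods are hypothetical names, and the sketch assumes each firewall's serial and its peer's serial are known up front.

```python
import threading
from typing import FrozenSet, Set


class PairClaims:
    """Hypothetical registry that lets only one thread work on an HA pair."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._claimed: Set[FrozenSet[str]] = set()

    def try_claim(self, serial: str, peer_serial: str) -> bool:
        """Return True if the caller now owns the pair, False if it is taken."""
        # A frozenset makes the key identical from either member's point of view.
        key = frozenset((serial, peer_serial))
        with self._lock:
            if key in self._claimed:
                return False
            self._claimed.add(key)
            return True

    def release(self, serial: str, peer_serial: str) -> None:
        """Give the pair back once the upgrade of one member has finished."""
        with self._lock:
            self._claimed.discard(frozenset((serial, peer_serial)))
```

A worker that fails to claim its pair could exit and leave the firewall for a later pass, which fits the deferral idea in the outline that follows.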

I'm tempted to scrap what I've built up in favor of suggesting something along these lines:

- if target firewall belongs to an HA pair:
  - if target firewall is the active within the pair:
    - if the active and passive firewalls are running the same version of PAN-OS:
      - log to console that the peer needs to be targeted for upgrade first
      - expect the multi-threading to handle the peer upgrade
      - append firewall object to a list that's globally available outside of the threading
      - safely exit the thread
    - elif the active is running an older version of PAN-OS than the passive:
      - if the peer firewall is running the target version:
        - if HA is sync'd:
          - suspend the active firewall from the HA state
          - begin upgrade process
          - reboot
          - wait for the firewall to come back online
          - if HA is sync'd:
            - safely exit the thread
          - else:
            - loop every sixty seconds until HA is sync'd
      - else:
        - log to console that the peer needs to be running the targeted release first
        - safely exit the thread
  - if target firewall is the passive within the pair:
    - if the passive firewall is running the target version:
      - safely exit the thread
    - else:
      - suspend the passive firewall from the HA state
      - begin upgrade process
      - reboot
      - wait for the firewall to come back online
      - if HA is sync'd:
        - safely exit the thread
      - else:
        - loop every sixty seconds until HA is sync'd

Once the threading tasks complete, we could iterate over the list of active firewalls that were skipped on the initial pass and run the same process again against this refined list.
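
The outline and the second pass could look roughly like the sketch below. This is not the script's actual implementation: `Target`, `decide`, and `run_pass` are hypothetical names, versions are assumed to be comparable tuples, and the HA sync checks, suspend, and reboot steps are left to the `upgrade` callback. Two branches are assumptions beyond the outline: any firewall already on the target version is skipped, and standalone firewalls are simply upgraded.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Version = Tuple[int, ...]


@dataclass
class Target:
    """Hypothetical view of one firewall; not part of pan-os-upgrade."""
    hostname: str
    ha_state: Optional[str]               # "active", "passive", or None
    version: Version                      # e.g. (10, 1, 0)
    peer_version: Optional[Version] = None


def decide(fw: Target, target_version: Version) -> str:
    """Map the outline's branches to 'upgrade', 'defer', or 'skip'."""
    if fw.version == target_version:
        return "skip"                     # assumption: nothing left to do
    if fw.ha_state != "active":
        return "upgrade"                  # passive (or standalone) goes first
    if fw.version == fw.peer_version:
        return "defer"                    # peer must be upgraded first
    if fw.peer_version is not None and fw.version < fw.peer_version:
        # Active is older: proceed only if the peer already runs the target.
        return "upgrade" if fw.peer_version == target_version else "skip"
    return "skip"                         # not covered by the outline


def run_pass(targets: List[Target], target_version: Version,
             upgrade: Callable[[Target], None]) -> List[Target]:
    """Run one threaded pass and return the active firewalls that were deferred."""
    deferred: List[Target] = []
    lock = threading.Lock()               # guards the shared deferred list

    def worker(fw: Target) -> None:
        action = decide(fw, target_version)
        if action == "upgrade":
            upgrade(fw)
        elif action == "defer":
            with lock:
                deferred.append(fw)

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(worker, targets))   # list() surfaces worker exceptions
    return deferred
```

Before calling `run_pass` again with the deferred list, the caller would need to refresh each firewall's HA state and peer version, since the peer upgraded during the first pass changes which branch applies.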