cdot65/pan-os-upgrade

Implement environmental status capture and failure reporting

Is your feature request related to a problem? Please describe.
When upgrading PAN-OS on firewalls with the pan-os-upgrade utility, it is important to check the environmental status of each device before and after the upgrade. Environmental status covers hardware metrics such as temperature, fan speed, and power supply health. Capturing it helps identify hardware issues or failures that could affect the upgrade or the device's stability afterward. Currently, the utility has no built-in mechanism to capture the environmental status or report on failures.

Describe the solution you'd like
Enhance the pan-os-upgrade utility to include the ability to capture the environmental status of the devices before and after the upgrade process and report on any failures or anomalies. The utility should:

  1. Use the PAN-OS SDK to execute the equivalent of the show system environmentals command on the firewall to retrieve the environmental status information.
  2. Parse the environmental status information returned by the SDK and extract relevant metrics, such as:
    • Temperature readings for critical components (e.g., CPU, power supplies, fan trays)
    • Fan speeds and operational status
    • Power supply status and health
    • Any other pertinent environmental metrics
  3. Store the captured environmental status information in a structured format (e.g., JSON or XML) along with metadata such as the device model, serial number, and timestamp.
  4. Proceed with the normal upgrade process.
  5. After the upgrade is completed and the firewall is back online, re-capture the environmental status information using the same SDK command.
  6. Compare the pre-upgrade and post-upgrade environmental status information to identify any changes, failures, or anomalies.
  7. Generate a report or display the comparison results to the user, highlighting any issues or potential problems detected.
  8. Implement threshold-based alerting or notifications for critical environmental metrics, such as high temperatures or failed power supplies.
  9. Provide recommendations or suggested actions to address any identified environmental failures or issues.
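
As a rough illustration of steps 1 to 3 and 6, here is a minimal sketch of parsing, storing, and comparing snapshots. It assumes the operational command returns XML with one element per category (thermal, fan, power-supply) containing `<entry>` records; the sample XML, its field names, the model and serial placeholders, and the helper names `parse_environmentals`, `build_snapshot`, and `compare_snapshots` are all hypothetical and would need to be checked against real device output. With pan-os-python, the XML would presumably come from something like `firewall.op("show system environmentals")` rather than a string.

```python
import json
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Illustrative sample only: the real layout may differ by platform and PAN-OS version.
SAMPLE_XML = """
<response status="success"><result>
  <thermal><Slot1>
    <entry><slot>1</slot><description>Temperature near CPU</description><alarm>False</alarm><DegreesC>41.5</DegreesC></entry>
  </Slot1></thermal>
  <fan><Slot1>
    <entry><slot>1</slot><description>Fan #1 RPM</description><alarm>False</alarm><RPMs>7200</RPMs></entry>
  </Slot1></fan>
  <power-supply><Slot1>
    <entry><slot>1</slot><description>Power Supply #1</description><alarm>False</alarm><Inserted>True</Inserted></entry>
  </Slot1></power-supply>
</result></response>
""".strip()


def parse_environmentals(root):
    """Flatten every <entry> under each category into {category: [dict, ...]}."""
    result = root.find("result") if root.tag == "response" else root
    if result is None:
        raise ValueError("no <result> element in response")
    snapshot = {}
    for category in result:
        snapshot[category.tag] = [
            {child.tag: (child.text or "").strip() for child in entry}
            for entry in category.iter("entry")
        ]
    return snapshot


def build_snapshot(root, model, serial):
    """Wrap the parsed data with device metadata and a UTC timestamp."""
    return {
        "model": model,
        "serial": serial,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "environmentals": parse_environmentals(root),
    }


def compare_snapshots(pre, post):
    """Report components that went missing or are alarming after the upgrade."""
    def index(snap):
        return {
            (cat, e.get("slot"), e.get("description")): e
            for cat, entries in snap["environmentals"].items()
            for e in entries
        }

    before, after = index(pre), index(post)
    findings = []
    for key, old in before.items():
        new = after.get(key)
        if new is None:
            findings.append({"component": key, "issue": "missing after upgrade"})
        elif new.get("alarm") == "True" and old.get("alarm") != "True":
            findings.append({"component": key, "issue": "new alarm"})
        elif new.get("alarm") == "True":
            findings.append({"component": key, "issue": "alarm persists"})
    return findings


# Simulate a pre-upgrade capture and a post-upgrade capture with a fan alarm.
pre = build_snapshot(ET.fromstring(SAMPLE_XML), model="PA-EXAMPLE", serial="0000000000")
post_xml = SAMPLE_XML.replace("<alarm>False</alarm><RPMs>", "<alarm>True</alarm><RPMs>")
post = build_snapshot(ET.fromstring(post_xml), model="PA-EXAMPLE", serial="0000000000")
findings = compare_snapshots(pre, post)
print(json.dumps(findings, indent=2))
```

Keying each entry by category, slot, and description keeps the comparison independent of entry order. The snapshots are plain dictionaries, so they can be written to disk with `json.dump` alongside the utility's other artifacts.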

Describe alternatives you've considered
An alternative approach could be to rely on external monitoring systems or SNMP traps to capture and monitor the environmental status of the devices. However, this would require additional setup and integration efforts and may not provide a seamless experience within the pan-os-upgrade utility itself.

Additional context
Here are a few additional points to consider:

  • Ensure that the utility handles the SDK authentication and communication securely, using appropriate authentication mechanisms and encryption.
  • Implement error handling and retry mechanisms to handle scenarios where the environmental status retrieval may fail due to network issues or API errors.
  • Provide options to customize the environmental status capture, such as specifying additional metrics to monitor or adjusting the threshold values for alerts.
  • Consider integrating with existing monitoring and alerting systems to centralize the environmental status information and align with the organization's monitoring practices.
  • Provide clear documentation and examples on how to use the environmental status capture and failure reporting feature, including any prerequisites or configuration steps.
  • Update the project's documentation to include information about this new feature, explaining its benefits and how it can assist in proactively identifying and addressing environmental issues during the upgrade process.
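
The retry and customizable-threshold points above could look roughly like the following sketch. It assumes the environmental data has already been parsed into a dictionary of lists keyed by category; the field names (`DegreesC`, `RPMs`, `alarm`), the default limits, and the helper names `with_retries` and `check_thresholds` are hypothetical, not part of the utility or the SDK.

```python
import time

# Hypothetical defaults; real limits depend on the platform and should be configurable.
DEFAULT_THRESHOLDS = {"max_temp_c": 70.0, "min_fan_rpm": 1000.0}


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the delay after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # a real implementation would catch narrower errors
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        f"environmental capture failed after {attempts} attempts"
    ) from last_error


def check_thresholds(environmentals, overrides=None):
    """Return alert strings for readings outside the (optionally overridden) limits."""
    limits = {**DEFAULT_THRESHOLDS, **(overrides or {})}
    alerts = []
    for entry in environmentals.get("thermal", []):
        temp = float(entry["DegreesC"])
        if temp > limits["max_temp_c"]:
            alerts.append(
                f"{entry['description']}: {temp} C exceeds {limits['max_temp_c']} C"
            )
    for entry in environmentals.get("fan", []):
        rpm = float(entry["RPMs"])
        if rpm < limits["min_fan_rpm"]:
            alerts.append(
                f"{entry['description']}: {rpm} RPM below {limits['min_fan_rpm']} RPM"
            )
    for entry in environmentals.get("power-supply", []):
        if entry.get("alarm") == "True":
            alerts.append(f"{entry['description']}: alarm raised")
    return alerts


# Simulated capture that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated timeout")
    return {
        "thermal": [{"description": "Temperature near CPU", "DegreesC": "82.0"}],
        "fan": [{"description": "Fan #1 RPM", "RPMs": "7200"}],
        "power-supply": [{"description": "Power Supply #1", "alarm": "False"}],
    }


env = with_retries(flaky_fetch, sleep=lambda seconds: None)  # skip real sleeping in the demo
alerts = check_thresholds(env)
relaxed = check_thresholds(env, overrides={"max_temp_c": 90.0})
print(alerts)
print(relaxed)
```

Passing the thresholds as overrides on top of defaults lets users adjust a single limit without restating the rest, which fits the customization option described above.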

With this feature, the pan-os-upgrade utility would capture and compare the environmental status of devices before and after the upgrade process. This makes it easier to spot hardware failures or issues that could affect the upgrade or the device's stability afterward, and to troubleshoot and remediate them proactively, improving the reliability of the upgrade workflow.