paregupt/ucs_traffic_monitor

Announcing UTM v0.6 release


Today is Kiara's 3rd Birthday. She is my daughter and the upgrade manager of UTM. v0.6 of UTM will be out soon with 20+ changes. You will be able to upgrade your UTM installation with Kiara's help (upgrade_utm.sh) - Thanks.

This issue describes the changes in detail and serves as the release notes and documentation.

In addition to using UTM for detailed troubleshooting and reactive investigation, v0.6 adds features for detecting issues proactively.

The landing page of UTM shows over-utilized links and busy servers (peak and average usage), which helps you detect issues quickly.
[Screenshot: UTM v0.6 overview]

UTM v0.6 redesigns congestion detection. In one click, you can find the most congested servers, ports, etc. In two clicks, you can find the exact culprit of congestion, the time of congestion, and the severity of congestion.

[Screenshot: UTM v0.6 congestion]

In one click, you can find errors: where (ports), when (time), and severity (how many errors).
[Screenshot: UTM v0.6 errors]

[Screenshot: UTM v0.6 link tabular view]

Details

UTM Collector changes

  • Pulls class MgmtEntity to get the FI leadership state (primary or subordinate)
  • Added location in BackplanePortStats

Front-end UI changes

  • FI-A and FI-B show their leadership states - Primary or Subordinate
  • Fixed the over-reporting of PAUSE frames in Locations dashboard
  • Added new use-case for top 10 congested servers
  • Edited the links to carry the current time range
  • Edited the links so they no longer open in a new tab. Use browser functionality (middle-click or right-click > open in new tab) to open a link in a new tab.
  • Improved calculation of total FC and Eth traffic on locations dashboard
  • Added location filter in the query of top-10 panels on Locations dashboard
  • Added UTM version in the Locations dashboard
  • Removed the horizontal bar charts that used the Multistat panel in the Locations dashboard. The bar graphs now use the native Grafana table gradient bars.
  • Because of the above change, the Locations dashboard bar graphs offer a compact design with more high-level visualization in less space.
  • Fixed the occasional display of 0 as the total number of uplink and server ports on Locations dashboard.
  • Added new bar charts with domain name, FI ID, and port name for Eth and FC errors on Locations dashboard. Also added the errors from Server ports.
  • Error counters now use sum() instead of mean()
  • Changes on Ingress Traffic Congestion:
    • Renamed the dashboard from Ingress Traffic Congestion to Congestion Monitoring
    • Deprecated the Chassis PAUSE frame monitoring dashboard and migrated its use-cases to the Congestion Monitoring dashboard
    • Added use-cases for top-10 congested ports, top-10 congested servers, and many more.
    • Updated the navigation on the other dashboards
  • The top-10 tabular views offer Avg and Peak utilization
  • In Domain traffic dashboard, under the row for tabular view of uplink and server ports, added avg, peak, errors, and port speed.
  • All graphs now use max instead of mean.
    • The mean flattens peaks when traffic fluctuates. A mean-based graph may look nice, but it can mislead by hiding high link utilization. The effect is not visible when traffic is constant, or when the selected time range is short enough that the Grafana interval is 1m, which is also the default UTM collector polling interval. As the time range grows, the Grafana interval grows too, and the mean flattens the peak within each group-by-time bucket. For example, with a 1m interval each bucket holds a single value, so max, mean, and last all equal that value. With a 2m interval and values 10 and 2, the mean is 6, which flattens the peak of 10. Max returns 10, which retains the peak and the severity of the utilization.
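
The max-versus-mean behavior described above can be reproduced with a few lines of Python. This is only an illustrative sketch, not the query UTM or Grafana actually runs: the `bucketize` helper and the sample values are hypothetical, and the samples are assumed to arrive once per 60-second polling interval.

```python
from statistics import mean


def bucketize(samples, interval_s, agg):
    """Group (timestamp_s, value) samples into fixed-width time buckets
    and reduce each bucket with the given aggregation function.

    Hypothetical helper that mimics a group-by-time query.
    """
    buckets = {}
    for ts, value in samples:
        start = ts - (ts % interval_s)  # start of the bucket holding ts
        buckets.setdefault(start, []).append(value)
    return {start: agg(values) for start, values in sorted(buckets.items())}


# Hypothetical link utilization (%), one sample per 60 s poll.
samples = [(0, 10), (60, 2)]

# 1m interval: one sample per bucket, so mean and max are identical.
one_min_mean = bucketize(samples, 60, mean)
one_min_max = bucketize(samples, 60, max)

# 2m interval: both samples share a bucket.
two_min_mean = bucketize(samples, 120, mean)  # the peak of 10 is flattened
two_min_max = bucketize(samples, 120, max)    # the peak of 10 is retained

# Counters such as errors are added up per bucket rather than averaged.
two_min_sum = bucketize(samples, 120, sum)

print(one_min_mean, one_min_max)
print(two_min_mean, two_min_max, two_min_sum)
```

With a 60-second bucket the two aggregations agree, while with a 120-second bucket the mean reports 6 and max reports 10, matching the example in the text. The `sum` call shows the same idea for error counters, where averaging per bucket would under-report the total.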

Special thanks to Ian Jones for making v0.6 release possible.
Paresh