Parent's inclusive time may be smaller than child's in spot caliper data
slabasan opened this issue · 0 comments
Not a bug in how Hatchet is reading the data, but users may be confused by some of the spot caliper data. Tracking this Caliper discussion here.
This case can happen if node N (and its subgraph) occurs on only a subset of ranks.
Caliper computes the metrics from the records it has, e.g. if some node N exists on 4 out of 8 ranks it computes the average (and min) for only those 4 records, whereas the result for the root would be based on all 8 ranks.
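To make the effect concrete, here is a minimal sketch of the averaging described above. The per-rank times, the node names, and the helper functions are all made up for illustration; this is not Caliper's or Hatchet's actual code, only the arithmetic as I understand it from the description.

```python
from statistics import mean

# Hypothetical per-rank inclusive times in seconds (not real Spot data).
# "root" has a record on all 8 ranks; "N" only ran on ranks 0-3.
# On every rank where N exists, root's inclusive time >= N's inclusive time.
records = {
    "root": {0: 100.0, 1: 100.0, 2: 100.0, 3: 100.0,
             4: 1.0, 5: 1.0, 6: 1.0, 7: 1.0},
    "N": {0: 90.0, 1: 90.0, 2: 90.0, 3: 90.0},
}
num_ranks = 8


def avg_over_present(times):
    """Average over only the ranks that have a record for this node."""
    return mean(times.values())


def avg_over_all(times, num_ranks):
    """Average over all ranks, counting a missing record as zero time."""
    return sum(times.values()) / num_ranks


avg_root = avg_over_present(records["root"])       # 404 / 8 = 50.5
avg_n = avg_over_present(records["N"])             # 360 / 4 = 90.0
avg_n_all = avg_over_all(records["N"], num_ranks)  # 360 / 8 = 45.0

print(f"root avg (8 records): {avg_root}")
print(f"N avg (4 records):    {avg_n}")
print(f"N avg (all 8 ranks):  {avg_n_all}")
```

With these made-up numbers, the child's average (90.0) exceeds the parent's (50.5) even though the parent's inclusive time is at least the child's on every individual rank. Dividing N's sum by all 8 ranks instead gives 45.0, which stays below the parent.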
One of the issues here is maintaining compatibility with existing Spot data. If we change the way Caliper computes the min/max/avg, it'll change the metric name and we won't be able to compare new with old data anymore, not just in Hatchet but also in the Spot web GUI.
The issue is that in the Average tree, F6 is 5x larger than its ancestor F1. I do not understand how that is possible mathematically, as the global sum of F1 should include the global sum of F6, and therefore ave_F1 >= ave_F6 (the division by num_procs should not change that).
Ave time (inc)
├─ 14.145 F1
│  └─ 14.140 F2
│     ├─ 0.445 F3
│     │  ├─ 0.359 F4
│     │  └─ 0.045 F5
│     └─ 70.790 F6
│        ├─ 38.856 F7
│        │  └─ 37.614 F8
│        │     ├─ 10.260 F9
│        │     └─ 12.561 F10
│        └─ 11.160 F11