toplev: Info_Bottlenecks reports negative Scaled_Slots on SKX
Opened this issue · 4 comments
- @aayasin: is this a TMA problem?
e.g. on SKL
./toplev --metrics -l3 -q ./workloads/GITGREP 2>&1 | grep Bottleneck
C0-T0 Info.Bottleneck Mispredictions Scaled_Slots -1.85 [ 1.0%]
C0-T0 Info.Bottleneck Irregular_Overhead Scaled_Slots -7.60 [ 1.0%]
...
Interestingly, it goes away with --single-thread, so it might be an SMT issue?
There are at least two problems with this test workload & recent toplev:
- The Bottlenecks View requires at least a level-4 tree.
- The run time of ~1 second is too short and runs into multiplexing issues.
- Trunk toplev has stopped listing nodes with zero counts, which perf-tools relies on; please revert that.
Here is a reproducer. The first line is the command to run inside the perf-tools folder, followed by its output on ICX.
The first run, with trunk pmu-tools and --no-multiplex, shows no negative bottlenecks. The actual toplev command is kept for reference.
./do.py --tune :forgive:0 :help:0 :msr:1 :sample:3 :size:1 :loops:3 :loop-ideal-ipc:1 -v0 profile -a './workloads/GITGREP pmu-tools1 no-mux' -pm 10 -v1 --pmu-tools ../pmu-tools --toplev-args ' --no-multiplex'
INFO: App: ./workloads/GITGREP pmu-tools1 no-mux .
topdown full tree + All Bottlenecks ..
../pmu-tools/toplev.py --no-desc -vl6 --nodes '+IPC,+Instructions,+UopPI,+Time,+SLOTS,+CLKS,+Mispredictions,+Big_Code,+Instruction_Fetch_BW,+Branching_Overhead,+DSB_Misses,+Cache_Memory_Bandwidth,+Cache_Memory_Latency,+Memory_Data_TLBs,+Memory_Synchronization,+Irregular_Overhead,+Other_Bottlenecks,+Base_Non_Br' -V GITGREP-pmu-tools1-no-mux.toplev-vl6-perf.csv --no-multiplex --tune 'DEDUP_NODE = "MEM_Parallel_Reads,Lock_Latency,Slots_Utilization,Power,L2_Bound,Big_Code,DSB_Misses,IC_Misses,Contested_Accesses,Data_Sharing,PMM_Bound,Memory_Operations,DRAM_Bound,Other_Light_Ops,Mispredictions,Cache_Memory_Bandwidth,Cache_Memory_Latency,Memory_Data_TLBs,Memory_Synchronization,Base_Non_Br,Instruction_Fetch_BW,Irregular_Overhead,Core_Bound_Likely,Branch_Misprediction_Cost,Other_Bottlenecks"' -- ./workloads/GITGREP pmu-tools1 no-mux 2>&1 | tee GITGREP-pmu-tools1-no-mux.toplev-vl6.log | egrep '<==|MUX|Info(\.Bot|.*Time)|warning.*zero' | sort
BE/Core Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_2 % Clocks 18.2 <==
Info.Botlnk.L2 DSB_Misses Scaled_Slots 2.38
Info.Bottleneck Base_Non_Br Scaled_Slots 32.35
Info.Bottleneck Big_Code Scaled_Slots 1.67
Info.Bottleneck Branching_Overhead Scaled_Slots 9.56
Info.Bottleneck Cache_Memory_Bandwidth Scaled_Slots 1.26
Info.Bottleneck Cache_Memory_Latency Scaled_Slots 1.55
Info.Bottleneck Instruction_Fetch_BW Scaled_Slots 9.60
Info.Bottleneck Irregular_Overhead Scaled_Slots 4.69
Info.Bottleneck Memory_Data_TLBs Scaled_Slots 1.42
Info.Bottleneck Memory_Synchronization Scaled_Slots 0.01
Info.Bottleneck Mispredictions Scaled_Slots 19.24
Info.Bottleneck Other_Bottlenecks Scaled_Slots 18.64
Info.System Time Seconds 1.77
MUX % 100.00
This is the failure by default, using pmu-tools at the 4.6 release point.
./do.py --tune :forgive:0 :help:0 :msr:1 :sample:3 :size:1 :loops:3 :loop-ideal-ipc:1 -v0 profile -a './workloads/GITGREP pmu-tools1 do-mux' -pm 10 -v1
INFO: App: ./workloads/GITGREP pmu-tools1 do-mux .
topdown full tree + All Bottlenecks ..
/usr/bin/python /home/admin1/ayasin/perf-tools/pmu-tools/toplev.py --no-desc -vl6 --nodes '+IPC,+Instructions,+UopPI,+Time,+SLOTS,+CLKS,+Mispredictions,+Big_Code,+Instruction_Fetch_BW,+Branching_Overhead,+DSB_Misses,+Cache_Memory_Bandwidth,+Cache_Memory_Latency,+Memory_Data_TLBs,+Memory_Synchronization,+Irregular_Overhead,+Other_Bottlenecks,+Base_Non_Br' -V GITGREP-pmu-tools1-do-mux.toplev-vl6-perf.csv --frequency --metric-group +Summary --tune 'DEDUP_NODE = "MEM_Parallel_Reads,Lock_Latency,Slots_Utilization,Power,L2_Bound,Big_Code,DSB_Misses,IC_Misses,Contested_Accesses,Data_Sharing,PMM_Bound,Memory_Operations,DRAM_Bound,Other_Light_Ops,Mispredictions,Cache_Memory_Bandwidth,Cache_Memory_Latency,Memory_Data_TLBs,Memory_Synchronization,Base_Non_Br,Instruction_Fetch_BW,Irregular_Overhead,Core_Bound_Likely,Branch_Misprediction_Cost,Other_Bottlenecks"' -- ./workloads/GITGREP pmu-tools1 do-mux 2>&1 | tee GITGREP-pmu-tools1-do-mux.toplev-vl6.log | egrep '<==|MUX|Info(\.Bot|.*Time)|warning.*zero' | sort
BE/Core Backend_Bound.Core_Bound % Slots 21.2 [30.0%]<==
Info.Botlnk.L2 DSB_Misses Scaled_Slots 0.58 [ 6.1%]
Info.Bottleneck Base_Non_Br Scaled_Slots -75.96 [ 7.5%]
Info.Bottleneck Big_Code Scaled_Slots 5.49 [85.8%]
Info.Bottleneck Branching_Overhead Scaled_Slots 114.85 [ 7.5%]
Info.Bottleneck Cache_Memory_Bandwidth Scaled_Slots 2.24 [ 7.5%]
Info.Bottleneck Cache_Memory_Latency Scaled_Slots 1.25 [12.0%]
Info.Bottleneck Instruction_Fetch_BW Scaled_Slots 9.59 [23.1%]
Info.Bottleneck Irregular_Overhead Scaled_Slots 8.49 [ 7.0%]
Info.Bottleneck Memory_Data_TLBs Scaled_Slots 0.42 [ 7.0%]
Info.Bottleneck Memory_Synchronization Scaled_Slots 0.02 [ 7.0%]
Info.Bottleneck Mispredictions Scaled_Slots 14.51 [85.8%]
Info.Bottleneck Other_Bottlenecks Scaled_Slots 19.11 [ 7.0%]
Info.System Time Seconds 1.77
MUX % 0.00
warning: 35 nodes had zero counts: ALU_Op_Utilization Clears_Resteers DSB DTLB_Load DTLB_Store Decoder0_Alone L1_Bound L3_Hit_Latency Load_Op_Utilization Local_DRAM MITE MITE_4wide Microcode_Sequencer Mispredicts_Resteers Mixing_Vectors Other_Mispredicts Other_Nukes Port_0 Port_1 Port_5 Port_6 Ports_Utilization Ports_Utilized_0 Ports_Utilized_1 Remote_Cache Remote_DRAM Serializing_Operation Slow_Pause Split_Loads Split_Stores Store_Latency Store_Op_Utilization Store_STLB_Miss Unknown_Branches X87_Use
ERROR: Too many metrics with zero counts; 35 unexpected (ALU_Op_Utilization Clears_Resteers DSB DTLB_Load DTLB_Store Decoder0_Alone L1_Bound L3_Hit_Latency Load_Op_Utilization Local_DRAM MITE MITE_4wide Microcode_Sequencer Mispredicts_Resteers Mixing_Vectors Other_Mispredicts Other_Nukes Port_0 Port_1 Port_5 Port_6 Ports_Utilization Ports_Utilized_0 Ports_Utilized_1 Remote_Cache Remote_DRAM Serializing_Operation Slow_Pause Split_Loads Split_Stores Store_Latency Store_Op_Utilization Store_STLB_Miss Unknown_Branches X87_Use). Run longer or use: --toplev-args ' --no-multiplex' !
!
ERROR: Command "./do.py --tune :forgive:0 :help:0 :msr:1 :sample:3 :size:1 :loops:3 :loop-ideal-ipc:1 -v0 profile -a './workloads/GITGREP pmu-tools1 do-mux' -pm 10 -v1" failed with '256' !
!
perf-tools flags the zero counts and suggests running longer or using --no-multiplex.
But even with multiplexing issues, shouldn't the formula guard against bad values? Such issues are not uncommon.
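To illustrate the kind of guard meant here, a minimal sketch (hypothetical helper, not toplev's actual code): clamp a Scaled_Slots percentage into the valid 0..100 range and flag the raw value as suspect, so multiplexing noise surfaces as a warning instead of a negative metric.

```python
def clamp_scaled_slots(value, lo=0.0, hi=100.0):
    """Return (clamped_value, was_bad) for a Scaled_Slots percentage.

    Hypothetical guard: out-of-range inputs (e.g. from multiplexing
    noise) are clamped to [lo, hi] and flagged so the caller can warn.
    """
    if value < lo or value > hi:
        return min(max(value, lo), hi), True
    return value, False

# Values taken from the do-mux log above:
print(clamp_scaled_slots(-75.96))   # Base_Non_Br -> (0.0, True)
print(clamp_scaled_slots(114.85))   # Branching_Overhead -> (100.0, True)
print(clamp_scaled_slots(19.24))    # Mispredictions -> (19.24, False)
```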
I have an open bug on detecting run times that are too short for multiplexing in toplev.
Also, I'm surprised that 1s is no longer enough to get through all the groups. The tree must have grown a lot.
1s is too short.
There are around two dozen event groups for the full tree with current toplev, so each group gets sampled less than 5% of the time.
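A back-of-envelope check of that multiplexing math (assumed numbers: ~24 event groups time-shared evenly over a ~1 s run):

```python
# With N groups multiplexed round-robin, each group is scheduled on the
# PMU roughly 1/N of the wall-clock time, regardless of run length.
groups = 24
share = 1.0 / groups
print(f"each group sampled ~{share:.1%} of the time")  # ~4.2%, under 5%
```

At ~4% coverage per group, a 1 s run leaves each group with only ~40 ms of counted time, which matches the large scaling brackets (e.g. [ 7.0%]) seen in the do-mux log.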