toplev: Info_Bottlenecks reports negative Scaled_Slots on SKX

Question

Opened this issue a year ago · 4 comments

@aayasin TMA problem?

e.g. on SKL

./toplev --metrics -l3 -q ./workloads/GITGREP 2>&1 | grep Bottleneck
C0-T0 Info.Bottleneck Mispredictions Scaled_Slots -1.85 [ 1.0%]
C0-T0 Info.Bottleneck Irregular_Overhead Scaled_Slots -7.60 [ 1.0%]
...

Interestingly, it goes away with --single-thread, so it might be an SMT issue?

Answer 1 · 2023-10-06T22:45:11.000Z

There are at least two problems with this test workload & recent toplev:

The Bottlenecks View requires at least a level 4 tree.
The run time of ~1 second is too short, which runs into multiplexing issues.
Trunk toplev has stopped listing the nodes with zero counts, which perf-tools relies on. Please revert that.

Here is a reproducer. The first line is the command to run inside the perf-tools folder, followed by its output on ICX.

The first run with trunk pmu-tools and --no-multiplex shows no negative bottlenecks. Actual toplev command kept for reference.

./do.py --tune :forgive:0 :help:0 :msr:1 :sample:3 :size:1 :loops:3 :loop-ideal-ipc:1 -v0 profile -a './workloads/GITGREP pmu-tools1 no-mux' -pm 10 -v1 --pmu-tools ../pmu-tools --toplev-args ' --no-multiplex'                                                                                                                                                    
INFO: App: ./workloads/GITGREP pmu-tools1 no-mux .                                                                                                                                
topdown full tree + All Bottlenecks ..                                                                                                                                            
../pmu-tools/toplev.py --no-desc -vl6 --nodes '+IPC,+Instructions,+UopPI,+Time,+SLOTS,+CLKS,+Mispredictions,+Big_Code,+Instruction_Fetch_BW,+Branching_Overhead,+DSB_Misses,+Cache_Memory_Bandwidth,+Cache_Memory_Latency,+Memory_Data_TLBs,+Memory_Synchronization,+Irregular_Overhead,+Other_Bottlenecks,+Base_Non_Br' -V GITGREP-pmu-tools1-no-mux.toplev-vl6-perf.csv --no-multiplex --tune 'DEDUP_NODE = "MEM_Parallel_Reads,Lock_Latency,Slots_Utilization,Power,L2_Bound,Big_Code,DSB_Misses,IC_Misses,Contested_Accesses,Data_Sharing,PMM_Bound,Memory_Operations,DRAM_Bound,Other_Light_Ops,Mispredictions,Cache_Memory_Bandwidth,Cache_Memory_Latency,Memory_Data_TLBs,Memory_Synchronization,Base_Non_Br,Instruction_Fetch_BW,Irregular_Overhead,Core_Bound_Likely,Branch_Misprediction_Cost,Other_Bottlenecks"' -- ./workloads/GITGREP pmu-tools1 no-mux 2>&1 | tee GITGREP-pmu-tools1-no-mux.toplev-vl6.log | egrep '<==|MUX|Info(\.Bot|.*Time)|warning.*zero' | sort                                                                                                                          
BE/Core          Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_2                                   % Clocks                           18.2   <==                      
Info.Botlnk.L2   DSB_Misses                                                                                      Scaled_Slots                      2.38                           
Info.Bottleneck  Base_Non_Br                                                                                     Scaled_Slots                     32.35                           
Info.Bottleneck  Big_Code                                                                                        Scaled_Slots                      1.67                           
Info.Bottleneck  Branching_Overhead                                                                              Scaled_Slots                      9.56                           
Info.Bottleneck  Cache_Memory_Bandwidth                                                                          Scaled_Slots                      1.26                           
Info.Bottleneck  Cache_Memory_Latency                                                                            Scaled_Slots                      1.55                           
Info.Bottleneck  Instruction_Fetch_BW                                                                            Scaled_Slots                      9.60                           
Info.Bottleneck  Irregular_Overhead                                                                              Scaled_Slots                      4.69                           
Info.Bottleneck  Memory_Data_TLBs                                                                                Scaled_Slots                      1.42                           
Info.Bottleneck  Memory_Synchronization                                                                          Scaled_Slots                      0.01                           
Info.Bottleneck  Mispredictions                                                                                  Scaled_Slots                     19.24                           
Info.Bottleneck  Other_Bottlenecks                                                                               Scaled_Slots                     18.64                           
Info.System      Time                                                                                            Seconds                           1.77                           
MUX                                                                                                            %                                 100.00

This is the failure seen by default using pmu-tools at the 4.6 release point.

./do.py --tune :forgive:0 :help:0 :msr:1 :sample:3 :size:1 :loops:3 :loop-ideal-ipc:1 -v0 profile -a './workloads/GITGREP pmu-tools1 do-mux' -pm 10 -v1
INFO: App: ./workloads/GITGREP pmu-tools1 do-mux .
topdown full tree + All Bottlenecks ..
/usr/bin/python /home/admin1/ayasin/perf-tools/pmu-tools/toplev.py --no-desc -vl6 --nodes '+IPC,+Instructions,+UopPI,+Time,+SLOTS,+CLKS,+Mispredictions,+Big_Code,+Instruction_Fetch_BW,+Branching_Overhead,+DSB_Misses,+Cache_Memory_Bandwidth,+Cache_Memory_Latency,+Memory_Data_TLBs,+Memory_Synchronization,+Irregular_Overhead,+Other_Bottlenecks,+Base_Non_Br' -V GITGREP-pmu-tools1-do-mux.toplev-vl6-perf.csv --frequency --metric-group +Summary --tune 'DEDUP_NODE = "MEM_Parallel_Reads,Lock_Latency,Slots_Utilization,Power,L2_Bound,Big_Code,DSB_Misses,IC_Misses,Contested_Accesses,Data_Sharing,PMM_Bound,Memory_Operations,DRAM_Bound,Other_Light_Ops,Mispredictions,Cache_Memory_Bandwidth,Cache_Memory_Latency,Memory_Data_TLBs,Memory_Synchronization,Base_Non_Br,Instruction_Fetch_BW,Irregular_Overhead,Core_Bound_Likely,Branch_Misprediction_Cost,Other_Bottlenecks"' -- ./workloads/GITGREP pmu-tools1 do-mux 2>&1 | tee GITGREP-pmu-tools1-do-mux.toplev-vl6.log | egrep '<==|MUX|Info(\.Bot|.*Time)|warning.*zero' | sort
BE/Core        Backend_Bound.Core_Bound                                                                      % Slots                           21.2    [30.0%]<==
Info.Botlnk.L2 DSB_Misses                                                                                      Scaled_Slots                     0.58   [ 6.1%]
Info.Bottleneck Base_Non_Br                                                                                    Scaled_Slots                   -75.96   [ 7.5%]
Info.Bottleneck Big_Code                                                                                       Scaled_Slots                     5.49   [85.8%]
Info.Bottleneck Branching_Overhead                                                                             Scaled_Slots                   114.85   [ 7.5%]
Info.Bottleneck Cache_Memory_Bandwidth                                                                         Scaled_Slots                     2.24   [ 7.5%]
Info.Bottleneck Cache_Memory_Latency                                                                           Scaled_Slots                     1.25   [12.0%]
Info.Bottleneck Instruction_Fetch_BW                                                                           Scaled_Slots                     9.59   [23.1%]
Info.Bottleneck Irregular_Overhead                                                                             Scaled_Slots                     8.49   [ 7.0%]
Info.Bottleneck Memory_Data_TLBs                                                                               Scaled_Slots                     0.42   [ 7.0%]
Info.Bottleneck Memory_Synchronization                                                                         Scaled_Slots                     0.02   [ 7.0%]
Info.Bottleneck Mispredictions                                                                                 Scaled_Slots                    14.51   [85.8%]
Info.Bottleneck Other_Bottlenecks                                                                              Scaled_Slots                    19.11   [ 7.0%]
Info.System    Time                                                                                            Seconds                          1.77
MUX                                                                                                          %                                  0.00
warning: 35 nodes had zero counts: ALU_Op_Utilization Clears_Resteers DSB DTLB_Load DTLB_Store Decoder0_Alone L1_Bound L3_Hit_Latency Load_Op_Utilization Local_DRAM MITE MITE_4wide Microcode_Sequencer Mispredicts_Resteers Mixing_Vectors Other_Mispredicts Other_Nukes Port_0 Port_1 Port_5 Port_6 Ports_Utilization Ports_Utilized_0 Ports_Utilized_1 Remote_Cache Remote_DRAM Serializing_Operation Slow_Pause Split_Loads Split_Stores Store_Latency Store_Op_Utilization Store_STLB_Miss Unknown_Branches X87_Use
ERROR: Too many metrics with zero counts; 35 unexpected (ALU_Op_Utilization Clears_Resteers DSB DTLB_Load DTLB_Store Decoder0_Alone L1_Bound L3_Hit_Latency Load_Op_Utilization Local_DRAM MITE MITE_4wide Microcode_Sequencer Mispredicts_Resteers Mixing_Vectors Other_Mispredicts Other_Nukes Port_0 Port_1 Port_5 Port_6 Ports_Utilization Ports_Utilized_0 Ports_Utilized_1 Remote_Cache Remote_DRAM Serializing_Operation Slow_Pause Split_Loads Split_Stores Store_Latency Store_Op_Utilization Store_STLB_Miss Unknown_Branches X87_Use). Run longer or use: --toplev-args ' --no-multiplex' !
 !
ERROR: Command "./do.py --tune :forgive:0 :help:0 :msr:1 :sample:3 :size:1 :loops:3 :loop-ideal-ipc:1 -v0 profile -a './workloads/GITGREP pmu-tools1 do-mux' -pm 10 -v1" failed with '256' !
 !

perf-tools flags the zero counts and suggests running longer or using --no-multiplex.
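The negative values in the failing run can also be screened for mechanically. The following is a minimal sketch, not part of toplev or perf-tools: the function name `negative_bottlenecks` and the regular expression are hypothetical, and it assumes the bracketed percentage on each row is the share of time that node's events were scheduled. It only looks at rows whose unit is Scaled_Slots, in the layout quoted above.

```python
import re

# Hypothetical pattern for rows such as:
#   Info.Bottleneck Base_Non_Br   Scaled_Slots   -75.96   [ 7.5%]
# The node name is the token right before the Scaled_Slots unit; the
# bracketed percentage is optional (it is absent in the --no-multiplex run).
ROW = re.compile(
    r"(?P<node>\S+)\s+Scaled_Slots\s+"
    r"(?P<value>-?\d+(?:\.\d+)?)"
    r"(?:\s+\[\s*(?P<pct>\d+(?:\.\d+)?)%\])?"
)

def negative_bottlenecks(lines):
    """Return (node, value, pct) for every Scaled_Slots row with a negative value."""
    hits = []
    for line in lines:
        m = ROW.search(line)
        if m is None:
            continue
        value = float(m.group("value"))
        if value < 0:
            pct = m.group("pct")
            hits.append((m.group("node"), value, float(pct) if pct else None))
    return hits

# Rows copied from the outputs quoted in this thread.
sample = [
    "Info.Bottleneck Base_Non_Br    Scaled_Slots   -75.96   [ 7.5%]",
    "Info.Bottleneck Big_Code       Scaled_Slots     5.49   [85.8%]",
    "C0-T0 Info.Bottleneck Mispredictions Scaled_Slots -1.85 [ 1.0%]",
]
for node, value, pct in negative_bottlenecks(sample):
    print(f"{node}: {value} (scheduled {pct}% of the time)")
```

In the rows quoted here, the negative values coincide with low bracketed percentages, which is consistent with the multiplexing explanation, though three rows are not proof.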

Answer 2 · 2023-10-08T15:17:48.000Z

But even with multiplex issues shouldn't the formula guard against bad values? These are not uncommon.
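One possible shape for such a guard, as a minimal sketch only: `guard_metric` is a hypothetical helper, not toplev's code, and the 0 to 100 range is an assumption based on the Bottlenecks values being reported as scaled percentages.

```python
import math

def guard_metric(value, lo=0.0, hi=100.0):
    """Clamp a metric to [lo, hi]; return None if it is missing or not finite.

    Hypothetical helper for illustration. The default bounds assume a
    metric expressed as a percentage, which is an assumption here.
    """
    if value is None or not math.isfinite(value):
        return None
    return max(lo, min(hi, value))

# Values taken from the multiplexed run quoted earlier in this thread.
print(guard_metric(-75.96))   # below the range, clamped to lo
print(guard_metric(114.85))   # above the range, clamped to hi
print(guard_metric(9.59))     # in range, returned unchanged
```

Clamping hides the symptom rather than the cause, so a guard like this would probably want to emit a warning as well, so that a too-short run is still visible to the user.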

I have an open bug on detecting run times that are too short for multiplexing in toplev.

Answer 3 · 2023-10-08T20:00:36.000Z

Also I'm surprised that 1s is not enough anymore to get through all the groups. It must have really grown a lot.

Answer 4 · 2023-10-09T23:35:52.000Z

1s is too short.

There are around a couple dozen groups for the full tree with current toplev, so each group gets sampled <5% of the time.
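The arithmetic behind that estimate can be sketched as follows. This assumes the groups share the counters in equal round-robin time slices, and it takes "a couple dozen" to mean 24; both are assumptions for illustration, and the real scheduling may differ. The 1.77 s run time comes from the Time row in the output quoted above.

```python
def per_group_share(num_groups):
    """Fraction of the run each group is counted for, assuming equal time slices."""
    return 1.0 / num_groups

def per_group_time(run_seconds, num_groups):
    """Seconds of counting each group gets over the whole run."""
    return run_seconds * per_group_share(num_groups)

groups = 24         # assumption: "a couple dozen"
run_seconds = 1.77  # Time row from the quoted GITGREP output

share = per_group_share(groups)
print(f"{share:.1%} of the run per group")
print(f"{per_group_time(run_seconds, groups) * 1000:.0f} ms per group")
```

Under these assumptions each group is counted for well under a tenth of a second in total, so short phases of the workload can be missed by some groups entirely, and the values extrapolated from them can be inconsistent with each other.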