andikleen/pmu-tools

toplev: Info_Bottlenecks reports negative Scaled_Slots on SKX


e.g. on SKL

./toplev --metrics -l3 -q ./workloads/GITGREP 2>&1 | grep Bottleneck
C0-T0 Info.Bottleneck Mispredictions Scaled_Slots -1.85 [ 1.0%]
C0-T0 Info.Bottleneck Irregular_Overhead Scaled_Slots -7.60 [ 1.0%]
...

Interestingly it goes away with --single-thread so it might be a SMT issue?

There are at least three problems with this test workload & recent toplev:

  1. The Bottlenecks View requires at least a level-4 tree.
  2. The run time of ~1 second is too short and runs into multiplexing issues.
  3. Trunk toplev stopped listing the nodes with zero counts, which perf-tools relies on. Please revert that.

Here is a reproducer. The first line is the command to run inside the perf-tools folder, followed by its output on ICX.

The first run with trunk pmu-tools and --no-multiplex shows no negative bottlenecks. Actual toplev command kept for reference.

./do.py --tune :forgive:0 :help:0 :msr:1 :sample:3 :size:1 :loops:3 :loop-ideal-ipc:1 -v0 profile -a './workloads/GITGREP pmu-tools1 no-mux' -pm 10 -v1 --pmu-tools ../pmu-tools --toplev-args ' --no-multiplex'                                                                                                                                                    
INFO: App: ./workloads/GITGREP pmu-tools1 no-mux .                                                                                                                                
topdown full tree + All Bottlenecks ..                                                                                                                                            
../pmu-tools/toplev.py --no-desc -vl6 --nodes '+IPC,+Instructions,+UopPI,+Time,+SLOTS,+CLKS,+Mispredictions,+Big_Code,+Instruction_Fetch_BW,+Branching_Overhead,+DSB_Misses,+Cache_Memory_Bandwidth,+Cache_Memory_Latency,+Memory_Data_TLBs,+Memory_Synchronization,+Irregular_Overhead,+Other_Bottlenecks,+Base_Non_Br' -V GITGREP-pmu-tools1-no-mux.toplev-vl6-perf.csv --no-multiplex --tune 'DEDUP_NODE = "MEM_Parallel_Reads,Lock_Latency,Slots_Utilization,Power,L2_Bound,Big_Code,DSB_Misses,IC_Misses,Contested_Accesses,Data_Sharing,PMM_Bound,Memory_Operations,DRAM_Bound,Other_Light_Ops,Mispredictions,Cache_Memory_Bandwidth,Cache_Memory_Latency,Memory_Data_TLBs,Memory_Synchronization,Base_Non_Br,Instruction_Fetch_BW,Irregular_Overhead,Core_Bound_Likely,Branch_Misprediction_Cost,Other_Bottlenecks"' -- ./workloads/GITGREP pmu-tools1 no-mux 2>&1 | tee GITGREP-pmu-tools1-no-mux.toplev-vl6.log | egrep '<==|MUX|Info(\.Bot|.*Time)|warning.*zero' | sort                                                                                                                          
BE/Core          Backend_Bound.Core_Bound.Ports_Utilization.Ports_Utilized_2                                   % Clocks                           18.2   <==                      
Info.Botlnk.L2   DSB_Misses                                                                                      Scaled_Slots                      2.38                           
Info.Bottleneck  Base_Non_Br                                                                                     Scaled_Slots                     32.35                           
Info.Bottleneck  Big_Code                                                                                        Scaled_Slots                      1.67                           
Info.Bottleneck  Branching_Overhead                                                                              Scaled_Slots                      9.56                           
Info.Bottleneck  Cache_Memory_Bandwidth                                                                          Scaled_Slots                      1.26                           
Info.Bottleneck  Cache_Memory_Latency                                                                            Scaled_Slots                      1.55                           
Info.Bottleneck  Instruction_Fetch_BW                                                                            Scaled_Slots                      9.60                           
Info.Bottleneck  Irregular_Overhead                                                                              Scaled_Slots                      4.69                           
Info.Bottleneck  Memory_Data_TLBs                                                                                Scaled_Slots                      1.42                           
Info.Bottleneck  Memory_Synchronization                                                                          Scaled_Slots                      0.01                           
Info.Bottleneck  Mispredictions                                                                                  Scaled_Slots                     19.24                           
Info.Bottleneck  Other_Bottlenecks                                                                               Scaled_Slots                     18.64                           
Info.System      Time                                                                                            Seconds                           1.77                           
MUX                                                                                                            %                                 100.00                           

This is the failure seen by default (multiplexing enabled), using pmu-tools at the 4.6 release point.

./do.py --tune :forgive:0 :help:0 :msr:1 :sample:3 :size:1 :loops:3 :loop-ideal-ipc:1 -v0 profile -a './workloads/GITGREP pmu-tools1 do-mux' -pm 10 -v1
INFO: App: ./workloads/GITGREP pmu-tools1 do-mux .
topdown full tree + All Bottlenecks ..
/usr/bin/python /home/admin1/ayasin/perf-tools/pmu-tools/toplev.py --no-desc -vl6 --nodes '+IPC,+Instructions,+UopPI,+Time,+SLOTS,+CLKS,+Mispredictions,+Big_Code,+Instruction_Fetch_BW,+Branching_Overhead,+DSB_Misses,+Cache_Memory_Bandwidth,+Cache_Memory_Latency,+Memory_Data_TLBs,+Memory_Synchronization,+Irregular_Overhead,+Other_Bottlenecks,+Base_Non_Br' -V GITGREP-pmu-tools1-do-mux.toplev-vl6-perf.csv --frequency --metric-group +Summary --tune 'DEDUP_NODE = "MEM_Parallel_Reads,Lock_Latency,Slots_Utilization,Power,L2_Bound,Big_Code,DSB_Misses,IC_Misses,Contested_Accesses,Data_Sharing,PMM_Bound,Memory_Operations,DRAM_Bound,Other_Light_Ops,Mispredictions,Cache_Memory_Bandwidth,Cache_Memory_Latency,Memory_Data_TLBs,Memory_Synchronization,Base_Non_Br,Instruction_Fetch_BW,Irregular_Overhead,Core_Bound_Likely,Branch_Misprediction_Cost,Other_Bottlenecks"' -- ./workloads/GITGREP pmu-tools1 do-mux 2>&1 | tee GITGREP-pmu-tools1-do-mux.toplev-vl6.log | egrep '<==|MUX|Info(\.Bot|.*Time)|warning.*zero' | sort
BE/Core        Backend_Bound.Core_Bound                                                                      % Slots                           21.2    [30.0%]<==
Info.Botlnk.L2 DSB_Misses                                                                                      Scaled_Slots                     0.58   [ 6.1%]
Info.Bottleneck Base_Non_Br                                                                                    Scaled_Slots                   -75.96   [ 7.5%]
Info.Bottleneck Big_Code                                                                                       Scaled_Slots                     5.49   [85.8%]
Info.Bottleneck Branching_Overhead                                                                             Scaled_Slots                   114.85   [ 7.5%]
Info.Bottleneck Cache_Memory_Bandwidth                                                                         Scaled_Slots                     2.24   [ 7.5%]
Info.Bottleneck Cache_Memory_Latency                                                                           Scaled_Slots                     1.25   [12.0%]
Info.Bottleneck Instruction_Fetch_BW                                                                           Scaled_Slots                     9.59   [23.1%]
Info.Bottleneck Irregular_Overhead                                                                             Scaled_Slots                     8.49   [ 7.0%]
Info.Bottleneck Memory_Data_TLBs                                                                               Scaled_Slots                     0.42   [ 7.0%]
Info.Bottleneck Memory_Synchronization                                                                         Scaled_Slots                     0.02   [ 7.0%]
Info.Bottleneck Mispredictions                                                                                 Scaled_Slots                    14.51   [85.8%]
Info.Bottleneck Other_Bottlenecks                                                                              Scaled_Slots                    19.11   [ 7.0%]
Info.System    Time                                                                                            Seconds                          1.77
MUX                                                                                                          %                                  0.00
warning: 35 nodes had zero counts: ALU_Op_Utilization Clears_Resteers DSB DTLB_Load DTLB_Store Decoder0_Alone L1_Bound L3_Hit_Latency Load_Op_Utilization Local_DRAM MITE MITE_4wide Microcode_Sequencer Mispredicts_Resteers Mixing_Vectors Other_Mispredicts Other_Nukes Port_0 Port_1 Port_5 Port_6 Ports_Utilization Ports_Utilized_0 Ports_Utilized_1 Remote_Cache Remote_DRAM Serializing_Operation Slow_Pause Split_Loads Split_Stores Store_Latency Store_Op_Utilization Store_STLB_Miss Unknown_Branches X87_Use
ERROR: Too many metrics with zero counts; 35 unexpected (ALU_Op_Utilization Clears_Resteers DSB DTLB_Load DTLB_Store Decoder0_Alone L1_Bound L3_Hit_Latency Load_Op_Utilization Local_DRAM MITE MITE_4wide Microcode_Sequencer Mispredicts_Resteers Mixing_Vectors Other_Mispredicts Other_Nukes Port_0 Port_1 Port_5 Port_6 Ports_Utilization Ports_Utilized_0 Ports_Utilized_1 Remote_Cache Remote_DRAM Serializing_Operation Slow_Pause Split_Loads Split_Stores Store_Latency Store_Op_Utilization Store_STLB_Miss Unknown_Branches X87_Use). Run longer or use: --toplev-args ' --no-multiplex' !
 !
ERROR: Command "./do.py --tune :forgive:0 :help:0 :msr:1 :sample:3 :size:1 :loops:3 :loop-ideal-ipc:1 -v0 profile -a './workloads/GITGREP pmu-tools1 do-mux' -pm 10 -v1" failed with '256' !
 !

perf-tools flags the zero counts and suggests running longer or using --no-multiplex.

But even with multiplexing issues, shouldn't the formula guard against bad values? These are not uncommon.
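To make the question concrete, here is a minimal sketch of what such a guard could look like. It is not toplev's actual code: the function name `guard_scaled_slots`, the 10% sampling threshold, and the assumption that a Scaled_Slots value should lie between 0 and 100 are all hypothetical. The sample values are taken from the multiplexed run above.

```python
def guard_scaled_slots(value, sample_pct, min_sample_pct=10.0):
    """Clamp a Scaled_Slots value and report whether it can be trusted.

    Assumptions (not taken from toplev): a valid value lies in [0, 100],
    and a metric sampled for less than min_sample_pct of the run is suspect.
    Returns (clamped_value, trusted).
    """
    in_range = 0.0 <= value <= 100.0
    well_sampled = sample_pct >= min_sample_pct
    clamped = min(max(value, 0.0), 100.0)
    return clamped, in_range and well_sampled


# (name, Scaled_Slots, sampling percentage) from the multiplexed run
samples = [
    ("Base_Non_Br", -75.96, 7.5),
    ("Branching_Overhead", 114.85, 7.5),
    ("Mispredictions", 14.51, 85.8),
]

for name, value, pct in samples:
    clamped, trusted = guard_scaled_slots(value, pct)
    status = "ok" if trusted else "suspect"
    print(f"{name}: raw={value} clamped={clamped} [{status}]")
```

Clamping alone would hide the problem, which is why the sketch also returns a trust flag; whether toplev should clamp, warn, or drop such values is the open question here.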

I have an open bug on detecting run times that are too short for multiplexing in toplev.

Also I'm surprised that 1s is not enough anymore to get through all the groups. It must have really grown a lot.

1s is too short.

There are around a couple dozen groups for the full tree with current toplev, so each group gets sampled <5% of the time.
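As a rough back-of-the-envelope check, the sketch below works out the share of the run each group would get. It assumes the groups are rotated evenly (plain round-robin), takes 24 as an illustrative stand-in for "a couple dozen", and uses the 1.77 s run time reported above; the helper name `per_group_share` is made up for this example.

```python
def per_group_share(num_groups, run_seconds):
    """Return (percent of run, seconds) each group is scheduled for,
    assuming the groups share the counters in equal round-robin slices."""
    share_pct = 100.0 / num_groups
    return share_pct, run_seconds * share_pct / 100.0


# 24 groups is an assumed value for "a couple dozen"; 1.77 s is the
# Info.System Time reported in the runs above.
share, secs = per_group_share(24, 1.77)
print(f"{share:.1f}% of the run, about {secs * 1000:.0f} ms per group")
```

Under these assumptions each group is live for only a few percent of the run, well under 100 ms, which is in the same range as the bracketed sampling percentages in the multiplexed output.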