auto-drilldown on modified STREAM is missing L2
Again, thanks for creating this tool.
I am running a modified STREAM benchmark. The benchmark has been modified so that the arrays are small enough that they do not have to be fetched from main memory. I am running on an Intel Ice Lake.
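For reference, this is roughly how the binary is built and run, inferred from the compile flags and thread count reported in the output below (the compiler and source file name are illustrative assumptions):

# Build the modified STREAM; gcc is an assumption, only the flags appear in the log.
gcc -O2 -Wall -Wpedantic -fopenmp stream.c -o stream.x.icelake
# Run with the 8 OpenMP threads reported below; the benchmark itself sweeps
# array sizes from 4 KiB to 32 KiB so the working set stays cache-resident.
OMP_NUM_THREADS=8 ./stream.x.icelake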
This is the output I get:
INFO: App: ./stream.x.icelake.
grep: setup-cpuid.log: No such file or directory
topdown auto-drilldown ..
Compile Flags: -O2 -Wall -Wpedantic
Element Size: 8 Bytes OpenMP Threads: 8 Reported BW: GByte/s
Iters Bytes copy scale add triad reduce
1000 4096 0.89 0.92 1.44 1.53 0.52
1000 8192 2.13 2.14 3.16 3.16 1.07
1000 16384 4.29 4.29 6.43 6.46 2.05
1000 32768 8.51 8.44 11.86 12.44 4.05
# 4.7-full on Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz [icx/icelake]
BE Backend_Bound % Slots 94.0 <==
Info.Thread IPC Metric 0.17
Info.System Time Seconds 0.16
Rerunning workload
Compile Flags: -O2 -Wall -Wpedantic
Element Size: 8 Bytes OpenMP Threads: 8 Reported BW: GByte/s
Iters Bytes copy scale add triad reduce
1000 4096 0.87 0.91 1.44 1.51 0.52
1000 8192 2.09 2.09 3.09 3.11 1.01
1000 16384 4.27 4.24 6.31 6.42 2.02
1000 32768 8.50 8.41 11.89 12.11 3.94
BE Backend_Bound % Slots 94.2 [33.1%]
BE/Mem Backend_Bound.Memory_Bound % Slots 50.1 [33.1%]<==
BE/Core Backend_Bound.Core_Bound % Slots 44.0 [33.1%]
Info.Thread IPC Metric 0.14 [33.1%]
Info.System Time Seconds 0.17
MUX % 33.07
Rerunning workload
Compile Flags: -O2 -Wall -Wpedantic
Element Size: 8 Bytes OpenMP Threads: 8 Reported BW: GByte/s
Iters Bytes copy scale add triad reduce
1000 4096 0.87 0.90 1.45 1.53 0.52
1000 8192 2.08 2.06 3.11 3.14 1.00
1000 16384 4.20 4.22 6.38 6.37 2.03
1000 32768 8.32 8.29 11.70 12.37 3.95
8 events not counted
BE Backend_Bound % Slots 94.4 [47.0%]<==
BE/Mem Backend_Bound.Memory_Bound % Slots 49.4 [47.0%]
BE/Core Backend_Bound.Core_Bound % Slots 45.0 [47.0%]
BE/Mem Backend_Bound.Memory_Bound.L1_Bound % Stalls 14.6 [25.1%]
BE/Mem Backend_Bound.Memory_Bound.L3_Bound % Stalls 22.3 [47.0%]
Info.Thread IPC Metric 0.19 [47.0%]
Info.System Time Seconds 0.17
warning: 2 nodes had zero counts: DRAM_Bound L2_Bound
description of nodes in TMA tree path to critical node
Backend_Bound
This category represents fraction of slots where no uops are
being delivered due to a lack of required resources for
accepting new uops in the Backend. Backend is the portion of
the processor core where the out-of-order scheduler
dispatches ready uops into their respective execution units;
and once completed these uops get retired according to
program order. For example; stalls due to data-cache misses
or stalls due to the divider unit being overloaded are both
categorized under Backend Bound. Backend Bound is further
divided into two main categories: Memory Bound and Core
Bound.
I can understand the lack of counts in DRAM_Bound, but why do I get L3_Bound and not L2_Bound?
Can you please re-profile using this command and upload the output:
do.py profile -a <invoke-STREAM-here> --tune :help:0 :levels:3 :tma-group:"'Mem'" -pm 8021 -v2
I am uploading as a text file attachment.
tma.txt
I see that all of the L._Bound nodes are zero, with this warning printed:
% cat tma.txt | cut -c10- | grep 'L._Bound' | sort | uniq -c
2 BE/Mem Backend_Bound.Memory_Bound.L1_Bound % Stalls 0.0
8 BE/Mem Backend_Bound.Memory_Bound.L1_Bound % Stalls 0.0 <
10 BE/Mem Backend_Bound.Memory_Bound.L2_Bound % Stalls 0.0 <
10 BE/Mem Backend_Bound.Memory_Bound.L3_Bound % Stalls 0.0 <
1 19 nodes had zero counts: Branch_Resteers DRAM_Bound DSB DSB_Switches Divider Heavy_Operations ICache_Misses ITLB_Misses L1_Bound L2_Bound L3_Bound MITE Memory_Operations Microcode_Sequencer Other_Mispredicts Other_Nukes Ports_Utilization Serializing_Operation Store_Bound
This probably happens because too many events were collected at once. Your original run was better, as it incrementally added the relevant metrics. You may append -v1 or -v2 to the original drilldown command to see all nodes, or just check the drilldown.log file you already have.
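For example, something like this should show which cache-level nodes each drilldown step resolved (a sketch, assuming drilldown.log keeps the same format as the paste above):

grep 'L._Bound' drilldown.log | sort | uniq -c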
With that said, looking at the MPKI it seems that L1 cache misses are satisfied by the L2. Note that since the L2 has a shorter latency (hence a lower chance of stalling execution), the heuristic may skid to L3_Bound.
Besides, there is a good chance the HW prefetchers are hiding some of the latency, since STREAM's access patterns are easy to predict.
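One way to sanity-check this outside the TMA tree is to count where retired loads actually hit (a sketch; the event names assume the standard Ice Lake perf event list):

# If mem_load_retired.l2_hit dominates l3_hit, the data is indeed served by L2
# even though the drilldown points at L3_Bound.
perf stat -e mem_load_retired.l1_hit,mem_load_retired.l2_hit,mem_load_retired.l3_hit ./stream.x.icelake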
Disabling SMT would increase the fidelity of the profile.
There are handy commands to disable both SMT and the prefetchers; see do.py -h
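If the do.py helpers are not an option, SMT can also be toggled through the standard Linux sysfs control (requires root):

# Turn SMT off for the profiling session, then restore it afterwards.
echo off | sudo tee /sys/devices/system/cpu/smt/control
echo on | sudo tee /sys/devices/system/cpu/smt/control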
I haven't seen a reply @naromero77; did you manage?
If so, I'd appreciate it if the issue could be closed.
Thank you, Ahmad.