auto-drilldown on modified STREAM is missing L2
Again, thanks for creating this tool.
I am running a modified STREAM benchmark. The benchmark has been modified so that the arrays are small enough that they do not have to be fetched from main memory. I am running on an Intel Ice Lake.
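For reference, this is roughly how the binary is built and run, inferred from the compile flags and thread count reported in the output below (the compiler and source file name are illustrative assumptions):

# Build the modified STREAM; gcc is an assumption, only the flags appear in the log.
gcc -O2 -Wall -Wpedantic -fopenmp stream.c -o stream.x.icelake
# Run with the 8 OpenMP threads reported below; the benchmark itself sweeps
# array sizes from 4 KiB to 32 KiB so the working set stays cache-resident.
OMP_NUM_THREADS=8 ./stream.x.icelake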
This is the output I get:
INFO: App: ./stream.x.icelake.
grep: setup-cpuid.log: No such file or directory
topdown auto-drilldown ..
Compile Flags: -O2 -Wall -Wpedantic
Element Size: 8 Bytes OpenMP Threads: 8 Reported BW: GByte/s
Iters Bytes copy scale add triad reduce
1000 4096 0.89 0.92 1.44 1.53 0.52
1000 8192 2.13 2.14 3.16 3.16 1.07
1000 16384 4.29 4.29 6.43 6.46 2.05
1000 32768 8.51 8.44 11.86 12.44 4.05
# 4.7-full on Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz [icx/icelake]
BE Backend_Bound % Slots 94.0 <==
Info.Thread IPC Metric 0.17
Info.System Time Seconds 0.16
Rerunning workload
Compile Flags: -O2 -Wall -Wpedantic
Element Size: 8 Bytes OpenMP Threads: 8 Reported BW: GByte/s
Iters Bytes copy scale add triad reduce
1000 4096 0.87 0.91 1.44 1.51 0.52
1000 8192 2.09 2.09 3.09 3.11 1.01
1000 16384 4.27 4.24 6.31 6.42 2.02
1000 32768 8.50 8.41 11.89 12.11 3.94
BE Backend_Bound % Slots 94.2 [33.1%]
BE/Mem Backend_Bound.Memory_Bound % Slots 50.1 [33.1%]<==
BE/Core Backend_Bound.Core_Bound % Slots 44.0 [33.1%]
Info.Thread IPC Metric 0.14 [33.1%]
Info.System Time Seconds 0.17
MUX % 33.07
Rerunning workload
Compile Flags: -O2 -Wall -Wpedantic
Element Size: 8 Bytes OpenMP Threads: 8 Reported BW: GByte/s
Iters Bytes copy scale add triad reduce
1000 4096 0.87 0.90 1.45 1.53 0.52
1000 8192 2.08 2.06 3.11 3.14 1.00
1000 16384 4.20 4.22 6.38 6.37 2.03
1000 32768 8.32 8.29 11.70 12.37 3.95
8 events not counted
BE Backend_Bound % Slots 94.4 [47.0%]<==
BE/Mem Backend_Bound.Memory_Bound % Slots 49.4 [47.0%]
BE/Core Backend_Bound.Core_Bound % Slots 45.0 [47.0%]
BE/Mem Backend_Bound.Memory_Bound.L1_Bound % Stalls 14.6 [25.1%]
BE/Mem Backend_Bound.Memory_Bound.L3_Bound % Stalls 22.3 [47.0%]
Info.Thread IPC Metric 0.19 [47.0%]
Info.System Time Seconds 0.17
warning: 2 nodes had zero counts: DRAM_Bound L2_Bound
description of nodes in TMA tree path to critical node
Backend_Bound
This category represents fraction of slots where no uops are
being delivered due to a lack of required resources for
accepting new uops in the Backend. Backend is the portion of
the processor core where the out-of-order scheduler
dispatches ready uops into their respective execution units;
and once completed these uops get retired according to
program order. For example; stalls due to data-cache misses
or stalls due to the divider unit being overloaded are both
categorized under Backend Bound. Backend Bound is further
divided into two main categories: Memory Bound and Core
Bound.
I can understand the lack of counts in DRAM_Bound, but why do I get L3_Bound and not L2_Bound?
Can you please re-profile using this command and upload the output:
do.py profile -a <invoke-STREAM-here> --tune :help:0 :levels:3 :tma-group:"'Mem'" -pm 8021 -v2
I am uploading as a text file attachment.
tma.txt
I see that all of the L._Bound nodes are zero, with this warning printed:
% cat tma.txt | cut -c10- | grep 'L._Bound' | sort | uniq -c
2 BE/Mem Backend_Bound.Memory_Bound.L1_Bound % Stalls 0.0
8 BE/Mem Backend_Bound.Memory_Bound.L1_Bound % Stalls 0.0 <
10 BE/Mem Backend_Bound.Memory_Bound.L2_Bound % Stalls 0.0 <
10 BE/Mem Backend_Bound.Memory_Bound.L3_Bound % Stalls 0.0 <
1 19 nodes had zero counts: Branch_Resteers DRAM_Bound DSB DSB_Switches Divider Heavy_Operations ICache_Misses ITLB_Misses L1_Bound L2_Bound L3_Bound MITE Memory_Operations Microcode_Sequencer Other_Mispredicts Other_Nukes Ports_Utilization Serializing_Operation Store_Bound
This probably happens because too many events were collected at once. Your original run was better, as it incrementally added the relevant metrics. You may append -v1 or -v2 to the original drilldown command to see all nodes, or just check the drilldown.log file you already have.
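For example, something like this should show which cache-level nodes each drilldown step resolved (a sketch, assuming drilldown.log keeps the same format as the paste above):

grep 'L._Bound' drilldown.log | sort | uniq -c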
With that said, looking at the MPKI it seems that L1 cache misses are satisfied by the L2. Note that since the L2 has a shorter latency (hence a lower chance of stalling execution), the heuristic may skid to L3_Bound.
Besides, there is a good chance the HW prefetchers are hiding some of the latency, since STREAM's access patterns are easy to predict.
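One way to sanity-check this outside the TMA tree is to count where retired loads actually hit (a sketch; the event names assume the standard Ice Lake perf event list):

# If mem_load_retired.l2_hit dominates l3_hit, the data is indeed served by L2
# even though the drilldown points at L3_Bound.
perf stat -e mem_load_retired.l1_hit,mem_load_retired.l2_hit,mem_load_retired.l3_hit ./stream.x.icelake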
Disabling SMT would increase the fidelity of the profile.
There are handy commands to disable both SMT and the prefetchers; see do.py -h
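If the do.py helpers are not an option, SMT can also be toggled through the standard Linux sysfs control (requires root):

# Turn SMT off for the profiling session, then restore it afterwards.
echo off | sudo tee /sys/devices/system/cpu/smt/control
echo on | sudo tee /sys/devices/system/cpu/smt/control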
I haven't seen a reply @naromero77; did you manage?
If so, I'd appreciate it if the issue could be closed.
Thank you, Ahmad.