Execution units and performance counters
lukego opened this issue · 1 comments
Each Haswell CPU core has eight special-purpose execution units (ports 0 to 7) that can each execute some part of an instruction in parallel: for example, calculating an address, loading an operand from memory, or performing arithmetic.
I realized today that pmu-tools offers some visibility into CPU performance counters that track how much work each execution unit is doing:
$ ocperf.py stat -e cycles,uops_executed_port.port_0,uops_executed_port.port_1,uops_executed_port.port_2,uops_executed_port.port_3,uops_executed_port.port_4,uops_executed_port.port_5,uops_executed_port.port_6,uops_executed_port.port_7 head -c 10000000 /dev/urandom > /dev/null
Performance counter stats for 'head -c 10000000 /dev/urandom':
2,065,534,404 cycles [44.69%]
705,149,766 uops_executed_port_port_0 [44.93%]
728,047,007 uops_executed_port_port_1 [44.94%]
405,801,626 uops_executed_port_port_2 [44.94%]
441,800,214 uops_executed_port_port_3 [44.50%]
289,902,540 uops_executed_port_port_4 [44.06%]
733,201,801 uops_executed_port_port_5 [44.05%]
786,927,002 uops_executed_port_port_6 [44.64%]
174,929,604 uops_executed_port_port_7 [44.44%]
0.908605822 seconds time elapsed
This seems rather nifty. I have recently needed more visibility into the CPU for debugging difficult performance problems, like collisions due to cache associativity.
I would love to get better at reading performance counters. Tips welcome! ("Ten CPU Performance Counters You Won't Believe You Ever Lived Without"?)
The output above makes sense. The workload is reading pseudo-random numbers from /dev/urandom, and the busy execution units are 0, 1, 5, and 6, which are exactly the ones that can perform integer arithmetic. That is gratifying :-).
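To put numbers on "busy", here is a small sketch that divides each port's uop count by the cycle count and picks out the busiest ports. The counts are copied from the perf output above; the function names (uops_per_cycle, busiest_ports) are made up for illustration and are not part of pmu-tools. Note that each counter was only live for about 44% of the run (the bracketed percentages), so perf has scaled the counts and the ratios are estimates.

```python
# Counts copied from the perf output above. The counters were multiplexed
# (each active roughly 44% of the time), so these are scaled estimates.
CYCLES = 2_065_534_404
UOPS_BY_PORT = {
    0: 705_149_766,
    1: 728_047_007,
    2: 405_801_626,
    3: 441_800_214,
    4: 289_902_540,
    5: 733_201_801,
    6: 786_927_002,
    7: 174_929_604,
}


def uops_per_cycle(uops_by_port, cycles):
    """Return the average number of uops executed per cycle on each port."""
    return {port: uops / cycles for port, uops in uops_by_port.items()}


def busiest_ports(uops_by_port, n):
    """Return the n ports with the highest uop counts, in port order."""
    ranked = sorted(uops_by_port, key=uops_by_port.get, reverse=True)
    return sorted(ranked[:n])


if __name__ == "__main__":
    rates = uops_per_cycle(UOPS_BY_PORT, CYCLES)
    for port in sorted(rates):
        print(f"port {port}: {rates[port]:.2f} uops/cycle")
    print("busiest ports:", busiest_ports(UOPS_BY_PORT, 4))
```

A port can execute at most one uop per cycle, so each ratio can be read as the approximate fraction of cycles in which that port was doing work.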