lukego/blog

Execution units and performance counters

lukego opened this issue · 1 comments

Each Haswell CPU core has eight special-purpose execution units, each of which can execute some part of an instruction in parallel with the others: for example, calculating an address, loading an operand from memory, or performing arithmetic.

I realized today that pmu-tools offers some visibility into CPU performance counters that track how much work each execution unit is doing:

$ ocperf.py stat -e cycles,uops_executed_port.port_0,uops_executed_port.port_1,uops_executed_port.port_2,uops_executed_port.port_3,uops_executed_port.port_4,uops_executed_port.port_5,uops_executed_port.port_6,uops_executed_port.port_7 head -c 10000000 /dev/urandom > /dev/null
 Performance counter stats for 'head -c 10000000 /dev/urandom':

     2,065,534,404      cycles                    [44.69%]
       705,149,766      uops_executed_port_port_0                                    [44.93%]
       728,047,007      uops_executed_port_port_1                                    [44.94%]
       405,801,626      uops_executed_port_port_2                                    [44.94%]
       441,800,214      uops_executed_port_port_3                                    [44.50%]
       289,902,540      uops_executed_port_port_4                                    [44.06%]
       733,201,801      uops_executed_port_port_5                                    [44.05%]
       786,927,002      uops_executed_port_port_6                                    [44.64%]
       174,929,604      uops_executed_port_port_7                                    [44.44%]

       0.908605822 seconds time elapsed
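The raw counts are easier to compare once they are divided by the cycle count. Here is a minimal sketch of that arithmetic. It is not part of pmu-tools: the `parse_counts` and `uops_per_cycle` helpers are hypothetical names, and the regular expression assumes the text layout shown above (a comma-grouped count followed by the event name).

```python
import re

# Lines copied verbatim from the perf stat output above.
SAMPLE = """\
     2,065,534,404      cycles                    [44.69%]
       705,149,766      uops_executed_port_port_0                                    [44.93%]
       728,047,007      uops_executed_port_port_1                                    [44.94%]
       405,801,626      uops_executed_port_port_2                                    [44.94%]
       441,800,214      uops_executed_port_port_3                                    [44.50%]
       289,902,540      uops_executed_port_port_4                                    [44.06%]
       733,201,801      uops_executed_port_port_5                                    [44.05%]
       786,927,002      uops_executed_port_port_6                                    [44.64%]
       174,929,604      uops_executed_port_port_7                                    [44.44%]
"""


def parse_counts(text):
    """Map event name -> count for lines shaped like '  1,234  event_name  [..%]'."""
    counts = {}
    for line in text.splitlines():
        # Assumed layout: optional indent, comma-grouped integer, whitespace, event name.
        m = re.match(r"\s*([\d,]+)\s+(\S+)", line)
        if m:
            counts[m.group(2)] = int(m.group(1).replace(",", ""))
    return counts


def uops_per_cycle(counts):
    """Average number of uops executed per cycle for each counted event."""
    cycles = counts["cycles"]
    return {name: n / cycles for name, n in counts.items() if name != "cycles"}


if __name__ == "__main__":
    for name, rate in uops_per_cycle(parse_counts(SAMPLE)).items():
        print(f"{name}: {rate:.2f} uops/cycle")
```

One caveat: the bracketed percentages indicate that the counters were multiplexed, so each event was only measured for roughly 44% of the run and the reported counts are scaled estimates rather than exact totals.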

This seems rather nifty. Lately I have needed more visibility into the CPU to debug difficult performance problems, such as collisions caused by cache associativity.

I would love to get better at auditing performance counters. Tips welcome! (Perhaps "Ten CPU Performance Counters You Won't Believe You Ever Lived Without"?)

The output above makes sense. The workload reads pseudo-random numbers from /dev/urandom, and the busiest execution units are 0, 1, 5, and 6, which are exactly the ones that can perform integer arithmetic. That is gratifying :-).
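To put a number on that observation, here is a rough sketch that groups the counts by what each port can do. The `PORT_GROUPS` table is my own summary of the Haswell port layout (integer ALUs on ports 0, 1, 5 and 6; load and address generation on 2 and 3; store data on 4; store address on 7), and it is deliberately simplified: several of these ports also handle other kinds of work. The `group_shares` helper is a hypothetical name.

```python
# Counts copied from the perf stat output above.
PORT_UOPS = {
    0: 705_149_766,
    1: 728_047_007,
    2: 405_801_626,
    3: 441_800_214,
    4: 289_902_540,
    5: 733_201_801,
    6: 786_927_002,
    7: 174_929_604,
}

# Assumption: simplified Haswell port roles, for illustration only.
PORT_GROUPS = {
    "integer ALU": (0, 1, 5, 6),
    "load / address generation": (2, 3),
    "store data": (4,),
    "store address": (7,),
}


def group_shares(port_uops, groups):
    """Fraction of all executed uops that landed on each group of ports."""
    total = sum(port_uops.values())
    return {
        label: sum(port_uops[p] for p in ports) / total
        for label, ports in groups.items()
    }


if __name__ == "__main__":
    for label, share in group_shares(PORT_UOPS, PORT_GROUPS).items():
        print(f"{label}: {share:.1%}")
```

With these counts, about 69% of the executed uops went to the four ports that can do integer arithmetic, which matches the intuition that this workload is dominated by integer work.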