wastes cores on hyperthreaded CPUs
As of 2016-04-30, this scheduler may assign jobs to both hyperthreads of a single core before assigning jobs to other fully idle physical cores, leaving those cores wasted.
Reproduced on a single Intel(R) Core(TM) i7-6700K CPU @ 4.70GHz, with 4 cores, 8 threads.
First, we discover which logical CPUs are provided by each physical core:
grep "core id" /proc/cpuinfo
core id : 0
core id : 1
core id : 2
core id : 3
core id : 0
core id : 1
core id : 2
core id : 3
Logical CPUs 0 and 4 are provided by physical core 0
Logical CPUs 1 and 5 are provided by physical core 1
Logical CPUs 2 and 6 are provided by physical core 2
Logical CPUs 3 and 7 are provided by physical core 3
You can also get this information from sysfs:
grep '' /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0,4
/sys/devices/system/cpu/cpu1/topology/thread_siblings_list:1,5
/sys/devices/system/cpu/cpu2/topology/thread_siblings_list:2,6
/sys/devices/system/cpu/cpu3/topology/thread_siblings_list:3,7
...
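The sibling lists can also be parsed programmatically. A minimal Python sketch, using the sibling lists from this machine as hard-coded sample input (on a live system they would be read from /sys/devices/system/cpu/cpu*/topology/thread_siblings_list):

```python
# Contents of each cpu*/topology/thread_siblings_list on this machine.
# Both siblings of a core report the same (sorted) pair, so a set deduplicates.
sibling_lists = ["0,4", "1,5", "2,6", "3,7", "0,4", "1,5", "2,6", "3,7"]

cores = sorted({tuple(sorted(int(c) for c in s.split(","))) for s in sibling_lists})
print(cores)  # [(0, 4), (1, 5), (2, 6), (3, 7)] - one tuple per physical core
```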
How to reproduce it:
Launch a CPU-bound job with as many threads (or processes) as physical cores on an otherwise idle system (install 'schedtool' if you don't have it). Batch scheduling gives longer runtime slices, which leaves time to inspect job placement.
schedtool -B -e openssl speed rsa4096 -multi 4
Ideally each openssl worker would run on its own physical core, so exactly one logical CPU from each sibling pair - (0 or 4), (1 or 5), (2 or 6), and (3 or 7) - should be loaded. What actually happens?
mpstat -P ALL 1
Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
Average: all 50.00 0.00 0.00 0.06 0.00 0.00 0.00 0.00 0.00 49.94
Average: 0 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: 1 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Average: 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Average: 4 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: 5 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Average: 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
The first core is over-scheduled with two jobs - CPU 0 is 100% and CPU 4 is 100%.
The second core is over-scheduled with two jobs - CPU 1 is 100% and CPU 5 is 100%.
The third core is wasted - logical CPUs 2 and 6 are both idle.
The fourth core is wasted - logical CPUs 3 and 7 are both idle.
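Classifying cores this way can be automated from mpstat-style per-CPU readings. A sketch using the %usr numbers from the run above; the sibling pairs and the 90% busy threshold are assumptions, not part of any tool's API:

```python
# Classify each physical core from per-logical-CPU %usr readings.
# Sibling pairs and utilization values are taken from the mpstat run above.
core_siblings = [(0, 4), (1, 5), (2, 6), (3, 7)]
usr = {0: 100.0, 1: 100.0, 2: 0.0, 3: 0.0, 4: 100.0, 5: 100.0, 6: 0.0, 7: 0.0}

BUSY = 90.0  # assumed threshold above which a logical CPU counts as loaded
for a, b in core_siblings:
    loaded = (usr[a] > BUSY) + (usr[b] > BUSY)
    state = {0: "wasted", 1: "ideal", 2: "over-scheduled"}[loaded]
    print(f"core ({a},{b}): {state}")
```

For the run above this reports the first two cores as over-scheduled and the last two as wasted.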
Try another round...
10:59:48 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:59:49 AM all 50.19 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 49.81
10:59:49 AM 0 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:59:49 AM 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
10:59:49 AM 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
10:59:49 AM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
10:59:49 AM 4 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:59:49 AM 5 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:59:49 AM 6 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
10:59:49 AM 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
The first core is over-scheduled with two jobs - CPU 0 is 100% and CPU 4 is 100%.
The second core is ideally scheduled with one job - CPU 1 is idle and CPU 5 is 100%.
The third core is ideally scheduled with one job - CPU 2 is idle and CPU 6 is 100%.
The fourth core is wasted - logical CPUs 3 and 7 are both idle.
When cores are wasted, this is the typical result:
OpenSSL 1.0.2g-fips 1 Mar 2016
sign verify sign/s verify/s
rsa 4096 bits 0.001132s 0.000019s 883.6 51948.1
What happens if we disable one logical CPU on each hyperthreaded core? I'll take CPUs 4-7 offline:
echo 0 > /sys/devices/system/cpu/cpu4/online
echo 0 > /sys/devices/system/cpu/cpu5/online
echo 0 > /sys/devices/system/cpu/cpu6/online
echo 0 > /sys/devices/system/cpu/cpu7/online
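The four echo commands above can be generalized from the sibling lists: keep the lowest-numbered sibling of each core online and take the rest offline. A Python sketch that only computes the list (the actual write to the online files requires root and is left as a comment):

```python
# For each physical core, keep the first sibling and offline the rest.
core_siblings = [(0, 4), (1, 5), (2, 6), (3, 7)]  # from thread_siblings_list

to_offline = [cpu for siblings in core_siblings for cpu in siblings[1:]]
print(to_offline)  # [4, 5, 6, 7]

# On a live system (as root), something like:
# for cpu in to_offline:
#     with open(f"/sys/devices/system/cpu/cpu{cpu}/online", "w") as f:
#         f.write("0\n")
```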
Relaunching the same job shows the loading is now ideal:
11:44:37 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
11:44:38 AM all 99.75 0.00 0.25 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:44:38 AM 0 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:44:38 AM 1 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:44:38 AM 2 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:44:38 AM 3 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:44:38 AM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:44:38 AM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:44:38 AM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:44:38 AM 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Resulting in roughly 68% more signs per second:
OpenSSL 1.0.2g-fips 1 Mar 2016
sign verify sign/s verify/s
rsa 4096 bits 0.000673s 0.000011s 1485.2 93650.8
This change improved our server scheduling performance; previously we saw harsh latency spikes due to over-scheduling.
Perhaps a way to resolve this is to schedule tasks onto distinct physical cores first, and only spill over onto a core's hyperthread sibling once every physical core already has a task.
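That placement policy can be sketched as a toy in user space: iterate over physical cores first, then siblings, assigning one task per slot. The topology is illustrative, and real pinning would go through something like sched_setaffinity or taskset rather than this list:

```python
# Toy placement: fill one logical CPU per physical core before
# touching any hyperthread sibling. Topology is illustrative.
core_siblings = [(0, 4), (1, 5), (2, 6), (3, 7)]

def place(n_tasks):
    # Round 0 uses the first sibling of each core, round 1 the second, etc.
    max_width = max(len(c) for c in core_siblings)
    slots = [core[i] for i in range(max_width)
             for core in core_siblings if i < len(core)]
    return slots[:n_tasks]

print(place(4))  # [0, 1, 2, 3] - one task per physical core
print(place(6))  # [0, 1, 2, 3, 4, 5] - siblings used only after every core has one
```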