When two threads are signalling at a rapid rate the performance is dominated by the context switch performance. With the introduction of C-States things can be slowed down a lot. The test will do thread ping-pong.
0.0 Start two threads
1.1 Thread 1 Wait for signal
2.1 Thread 1 Signal Thread 2
3.1 Thread 1 Goto 1.1
1.2 Thread 2 Wait for signal
2.2 Thread 2 Signal Thread 1
3.2 Thread 2 Goto 1.2
The tests were executed on Windows 11 (10.0.22621.2715) Intel I5 13600K.
OpenMPTest.exe -event
Event Test of 100000 signals on two threads. Elapsed time: 0.9 s
OpenMPTest.exe -event
Event Test of 100000 signals on two threads. Elapsed time: 0.8 s
OpenMPTest.exe -event
Event Test of 100000 signals on two threads. Elapsed time: 1.9 s
When the time is < 3s then only C1/C1E power states are enabled. If the time goes up to ca. 16s then the OS transitions to deep sleep states which need ca. 140us per Context Switch to wake up the CPU. This can happen even when the High Performance Power plan is enabled e.g. on Windows Server 2022.
You can switch between power plans without a reboot to quickly test the effects of the power settings.
This will create many small sized OpenMP (Intel OpenMP library) tasks which need a lot of synchronization by the OS if no CPU spinning is configured via the environment variable KMP_BLOCKTIME. Depending on the number of threads and Power Profile (C-States)
The tests were executed on Windows 11 (10.0.22621.2715) Intel I5 13600K.
OpenMPTest.exe 0
Setting KMP_BLOCKTIME to 0
OpenMP loop with 500000 outer and 10000 parallel inner loop. Elapsed time: 46.3 s
OpenMPTest.exe 1
Setting KMP_BLOCKTIME to 1
OpenMP loop with 500000 outer and 10000 parallel inner loop. Elapsed time: 4.5 s
OpenMPTest.exe 0
Setting KMP_BLOCKTIME to 0
OpenMP loop with 500000 outer and 10000 parallel inner loop. Elapsed time: 18.5 s
OpenMPTest.exe 1
Setting KMP_BLOCKTIME to 1
OpenMP loop with 500000 outer and 10000 parallel inner loop. Elapsed time: 1.3 s
OpenMPTest.exe 0
Setting KMP_BLOCKTIME to 0
OpenMP loop with 500000 outer and 10000 parallel inner loop. Elapsed time: 18.2 s
OpenMPTest.exe 1
Setting KMP_BLOCKTIME to 1
OpenMP loop with 500000 outer and 10000 parallel inner loop. Elapsed time: 1.3 s
The CPU pause instruction is defined as a number of cycles which a processor is put to sleep. This value has changed between CPU generations.
Pause latency from https://uops.info/table.html in cycles:
Haswell | Broadwell | Skylake-X | Kaby Lake | Coffee Lake | Cannon Lake | Cascade Lake | Ice Lake | Tiger Lake | Rocket Lake | Alder Lake-P | AMD Zen2 | AMD Zen4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
9.00 | 9.00 | 140.00 | 140.00 | 152.50 | 157.00 | 40.00 | 138.20 | 138.20 | 138.20 | 160.17 | 65.00 | 65.00 |
Since the numbers are defined in cycles the actual duration depends on the current clock speed of the CPU. Server CPU frequencies have been relatively constant but with the introduction of Priority Cores a similar concept of P/E cores has arrived at servers which makes the previously constant latency of the pause instruction more dynamic.
C:\Source\OpenMPTest_1.0.0.1\OpenMPTest.exe -pause
Pause Latency CPU[0]: 33.5 ns
Pause Latency CPU[1]: 34.0 ns
Pause Latency CPU[2]: 32.7 ns
Pause Latency CPU[3]: 32.9 ns
Pause Latency CPU[4]: 33.2 ns
Pause Latency CPU[5]: 33.0 ns
Pause Latency CPU[6]: 32.8 ns
Pause Latency CPU[7]: 34.2 ns
Pause Latency CPU[8]: 32.9 ns
Pause Latency CPU[9]: 33.2 ns
Pause Latency CPU[10]: 32.9 ns
Pause Latency CPU[11]: 32.9 ns
Pause Latency CPU[12]: 15.8 ns
Pause Latency CPU[13]: 15.7 ns
Pause Latency CPU[14]: 15.7 ns
Pause Latency CPU[15]: 15.8 ns
Pause Latency CPU[16]: 15.7 ns
Pause Latency CPU[17]: 16.0 ns
Pause Latency CPU[18]: 15.9 ns
Pause Latency CPU[19]: 16.0 ns
This gets even more complex because pause latency not only depends on CPU generation and frequency but also if code is running on a Power (P) or Efficiency (E) core.
C:\source>"C:\source\vc22\OpenMPTest_1.0.0.1\OpenMPTest.exe" -pause
Pause Latency CPU[0]: 54.6 ns
Pause Latency CPU[1]: 45.9 ns
Pause Latency CPU[2]: 54.5 ns
Pause Latency CPU[3]: 54.9 ns
Pause Latency CPU[4]: 52.9 ns
Pause Latency CPU[5]: 54.8 ns
Pause Latency CPU[6]: 54.7 ns
Pause Latency CPU[7]: 54.8 ns
Pause Latency CPU[8]: 54.8 ns
Pause Latency CPU[9]: 55.0 ns
Pause Latency CPU[10]: 54.6 ns
Pause Latency CPU[11]: 55.1 ns
Pause Latency CPU[12]: 54.5 ns
Pause Latency CPU[13]: 54.8 ns
Pause Latency CPU[14]: 54.6 ns
Pause Latency CPU[15]: 54.8 ns
Pause Latency CPU[16]: 14.4 ns
Pause Latency CPU[17]: 14.5 ns
Pause Latency CPU[18]: 14.5 ns
Pause Latency CPU[19]: 14.5 ns
Pause Latency CPU[20]: 14.5 ns
Pause Latency CPU[21]: 14.5 ns
Pause Latency CPU[22]: 14.6 ns
Pause Latency CPU[23]: 14.5 ns
C:\temp\OpenMPTest_1.0.0.1>OpenMPTest.exe -pause
Pause Latency CPU[0]: 10.0 ns
Pause Latency CPU[1]: 10.5 ns
Pause Latency CPU[2]: 10.5 ns
Pause Latency CPU[3]: 10.3 ns
Pause Latency CPU[4]: 10.3 ns
Pause Latency CPU[5]: 10.3 ns
Pause Latency CPU[6]: 10.3 ns
Pause Latency CPU[7]: 10.3 ns
Pause Latency CPU[8]: 10.3 ns
Pause Latency CPU[9]: 10.3 ns
Pause Latency CPU[10]: 10.3 ns
Pause Latency CPU[11]: 10.3 ns
Pause Latency CPU[12]: 10.0 ns
Pause Latency CPU[13]: 10.0 ns
Pause Latency CPU[14]: 10.2 ns
Pause Latency CPU[15]: 10.6 ns
Pause Latency CPU[16]: 10.1 ns
Pause Latency CPU[17]: 10.3 ns
Pause Latency CPU[18]: 10.0 ns
Pause Latency CPU[19]: 10.5 ns
Pause Latency CPU[20]: 10.2 ns
Pause Latency CPU[21]: 10.3 ns
Pause Latency CPU[22]: 10.3 ns
Pause Latency CPU[23]: 10.4 ns
Pause Latency CPU[24]: 10.0 ns
Pause Latency CPU[25]: 10.3 ns
Pause Latency CPU[26]: 10.2 ns
Pause Latency CPU[27]: 10.3 ns
Pause Latency CPU[28]: 10.0 ns
Pause Latency CPU[29]: 10.3 ns
Pause Latency CPU[30]: 10.0 ns
Pause Latency CPU[31]: 10.3 ns