YIELD is not how you delay on Arm
On Arm (AArch64), you are using the YIELD instruction to delay when spinning. This is not the purpose of the YIELD instruction. Its purpose is to notify the hardware that the current hardware thread should yield to any other hardware thread(s) on the same processor core. Without hardware multithreading (SMT), YIELD is a NOP (it is allocated from the HINT space, and unsupported hint opcodes behave as NOPs). NOPs are used for padding, are discarded as quickly as possible, and contribute no delay beyond occupying the fetch and decode stages.
Unfortunately, Arm does not have a direct (drop-in) replacement for the x86 PAUSE instruction. Other instruction sequences are possible, though. If you prefer an unspecified delay in the 30-40 cycle range, ISB can be used. Using the counter-timer (CNTVCT_EL0), more specific delay periods can be achieved, but the accuracy depends on the machine-specific update rate of the counter-timer. See this function: https://github.com/ARM-software/progress64/blob/master/src/arch/aarch64.h#L112-L131
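To make the two options above concrete, here is a minimal sketch of an ISB-based delay and a counter-based delay. It is not a copy of the linked progress64 function. The names `delay_isb`, `counter_read`, `counter_freq` and `delay_ticks` are made up for illustration, GCC/Clang inline assembly is assumed, and the non-AArch64 branches are only a wall-clock fallback so the code builds elsewhere.

```c
#include <stdint.h>
#include <time.h>

/* One ISB: an unspecified, microarchitecture-dependent delay on AArch64. */
static inline void delay_isb(void)
{
#if defined(__aarch64__)
    __asm__ volatile("isb" ::: "memory");
#else
    /* Fallback (assumption): compiler barrier only, no real delay. */
    __asm__ volatile("" ::: "memory");
#endif
}

/* Read the virtual counter-timer (CNTVCT_EL0). */
static inline uint64_t counter_read(void)
{
#if defined(__aarch64__)
    uint64_t ticks;
    /* The ISB keeps the counter read from being executed early. */
    __asm__ volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(ticks) : : "memory");
    return ticks;
#else
    /* Fallback (assumption): wall-clock nanoseconds stand in for ticks. */
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
#endif
}

/* Counter frequency in Hz (CNTFRQ_EL0); this is machine-specific. */
static inline uint64_t counter_freq(void)
{
#if defined(__aarch64__)
    uint64_t hz;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(hz));
    return hz;
#else
    return UINT64_C(1000000000); /* matches the nanosecond fallback above */
#endif
}

/* Spin until at least `ticks` counter ticks have elapsed. */
static inline void delay_ticks(uint64_t ticks)
{
    uint64_t start = counter_read();
    while (counter_read() - start < ticks) {
        delay_isb();
    }
}
```

The granularity of `delay_ticks` is bounded by how often the counter updates, so on a machine with a slow counter a request for a handful of ticks can wait much longer than intended in wall-clock terms.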
In my experience, backoff makes spin locks (e.g. a basic spin lock or a ticket lock) perform much better in both throughput and fairness. Whenever you experience memory contention, backing off should improve the situation. But how long you should back off for the best result probably depends on many factors (number of contending threads/cores, interconnect latency, CPU microarchitecture, etc.). No single answer is going to be optimal for all cases.
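As an illustration of what backing off can look like, here is a minimal sketch of a test-and-test-and-set spin lock with exponential backoff, using C11 atomics. The type and function names and the `BACKOFF_MAX` cap are hypothetical. The cap of 64 is an arbitrary placeholder and not a tuned value, for the reasons given above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical cap on the number of delay steps per failed attempt. */
#define BACKOFF_MAX 64u

/* One delay step: ISB on AArch64, PAUSE on x86, otherwise a compiler barrier. */
static inline void backoff_step(void)
{
#if defined(__aarch64__)
    __asm__ volatile("isb" ::: "memory");
#elif defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("pause" ::: "memory");
#else
    __asm__ volatile("" ::: "memory");
#endif
}

typedef struct {
    atomic_bool locked;
} bo_spinlock;

/* Single attempt; returns true if the lock was acquired. */
static inline bool bo_trylock(bo_spinlock *l)
{
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

static inline void bo_lock(bo_spinlock *l)
{
    unsigned backoff = 1;
    for (;;) {
        /* Read first, and only attempt the exchange when the lock looks free. */
        if (!atomic_load_explicit(&l->locked, memory_order_relaxed) &&
            bo_trylock(l)) {
            return;
        }
        /* Contended: wait `backoff` steps, then double the wait up to the cap. */
        for (unsigned i = 0; i < backoff; i++) {
            backoff_step();
        }
        if (backoff < BACKOFF_MAX) {
            backoff *= 2;
        }
    }
}

static inline void bo_unlock(bo_spinlock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

A lock is declared as `bo_spinlock l = { false };` and the critical section goes between `bo_lock(&l)` and `bo_unlock(&l)`. Whether doubling, a fixed delay, or a delay proportional to queue position (for ticket locks) works best is something to measure on the target machine.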
My blog post on this subject has been published: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
> On Arm (AArch64), you are using the YIELD instruction to delay when spinning. This is not the purpose of the YIELD instruction.
I am not using it to delay. I am using it for the purpose you mentioned: to yield execution resources to the other SMT thread, if there is one, and more generally to hint that we are in a busy spin loop. These are spin locks: they don't delay per se, they try to obtain the lock "as fast as possible". But since this is busy polling, we might as well hint to the processor what we are doing, so it doesn't run ahead and fill up the OoO window with many copies of the polling loop.
PAUSE on modern x86 has a similar purpose; it is just named differently. The idea is not to wait but to hint to the processor that the current loop is a busy wait. They may have slightly different semantics, and even within one architecture the behavior may vary across microarchitectures, but it seems well established that YIELD and PAUSE are each other's closest equivalents, and both are commonly used in the body of spin loops.
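For reference, the usage described here can be sketched roughly as follows. This is a generic illustration and not the code under discussion: `spin_hint` and `spin_until_set` are made-up names, and GCC/Clang inline assembly is assumed.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Spin-loop hint: YIELD on AArch64, PAUSE on x86, otherwise a compiler barrier. */
static inline void spin_hint(void)
{
#if defined(__aarch64__)
    __asm__ volatile("yield" ::: "memory");
#elif defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("pause" ::: "memory");
#else
    __asm__ volatile("" ::: "memory");
#endif
}

/* Busy-poll until *flag becomes true; returns how many times the hint was issued. */
static inline unsigned long spin_until_set(atomic_bool *flag)
{
    unsigned long spins = 0;
    while (!atomic_load_explicit(flag, memory_order_acquire)) {
        spin_hint();
        spins++;
    }
    return spins;
}
```

The loop polls as fast as the hardware allows. The hint is there to tell the core it is busy-waiting, not to slow the loop down by a known amount.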