MengRao/SPSC_Queue

Target Architecture (armv8 compatibility)?

FloatingUpstream opened this issue · 4 comments

Hi,

I am trying to figure out the portability and correctness of SPSCQueue.h on different architectures, especially armv8.

For this, I am mostly going by https://www.kernel.org/doc/Documentation/memory-barriers.txt
My understanding is that both pop and push obviously need to happen in the correct order and atomically.

I am familiar with
__asm__ volatile("" ::: "memory")
to enforce a compiler level barrier and
__asm__ volatile ("mfence" ::: "memory")
to actually enforce a cpu barrier at runtime.

This should boil down to _mm_mfence on x86, or a dmb instruction on armv8.

It's not clear to me how, e.g.
asm volatile("" : "=m"(data) : "m"(read_idx) :); // memory fence
enforces a barrier at runtime.

Is this because x86 has relatively strict memory ordering and the operations are atomic (are they?), or is a full memory fence simply not needed in this case?

Also: how portable is this to architectures with weak memory ordering, e.g. armv8?
Would std::atomic_thread_fence be needed as a replacement? I see you used that at the beginning, before replacing it with the asm instructions.

Many thanks for your efforts! :-)

I've replaced those asm statements with std::atomic to make it cross-platform, and performance should be the same. Thanks.

Many thanks for your response. But I was actually not sure whether the asm volatile instruction is enough, and was trying to understand if it is sufficient in this case and what the rationale behind it is. :)

Those asm statements only work on x86 platforms; you can think of them as a light-weight version of std::atomic_thread_fence(memory_order_relaxed), because x86 has strong memory ordering, so acquire/release is no different from relaxed.
The main overhead of the general compiler-level memory fence (i.e. atomic_thread_fence(memory_order_relaxed)) is that data cached in registers needs to be stored to memory (in case another thread needs to read it) and then reloaded from memory (in case another thread has written it); since the compiler can't tell which data may be accessed by other threads, it has to be maximally conservative.
By using those asm statements, we explicitly tell the compiler which data can be written or read by the other thread, so it can optimize a little (that's why I call it a light-weight memory fence).
That said, from benchmark results I didn't see a notable performance difference between the asm statements and std::atomic / memory fences, so I switched to std::atomic as it's more readable and cross-platform.

Thanks for the explanation :-)