Lokathor/safe_arch

Do we need to use `_mm_lddqu_si128`

Closed this issue · 4 comments

Compiler Explorer shows that there's some sort of difference between `_mm_loadu_si128` and `_mm_lddqu_si128`, but I can't tell what it is, so we should probably investigate.

And to be clear, the "problem" here is that both `_mm_loadu_si128` and `_mm_lddqu_si128`, when converted to this crate's "standard naming convention", would be something like `load_unaligned_m128i`.

Evrey commented

To quote Intel:

> This intrinsic [lddqu] may perform better than _mm_loadu_si128 when the data crosses a cache line boundary.

I have no fucking idea why Intel didn't just improve the SSE2 unaligned load, but there, now you have a second one that may or may not be better. At least it's not worse. So, put simply: expose just one `load_unaligned_m128i` function and conditionally make it call the SSE3 intrinsic when available, and the SSE2 one otherwise.
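A minimal sketch of that suggestion, assuming raw `core::arch` types rather than safe_arch's own wrappers (the function name and the `&[u8; 16]` signature are illustrative, not the crate's actual API):

```rust
use core::arch::x86_64::__m128i;

// x86_64 only; a real crate would also gate everything on `target_arch`.

/// Built with SSE3 available at compile time: use LDDQU.
#[cfg(target_feature = "sse3")]
pub fn load_unaligned_m128i(bytes: &[u8; 16]) -> __m128i {
    use core::arch::x86_64::_mm_lddqu_si128;
    // SAFETY: `bytes` is exactly 16 bytes, LDDQU accepts unaligned
    // addresses, and the `cfg` above guarantees SSE3 is present.
    unsafe { _mm_lddqu_si128(bytes.as_ptr().cast()) }
}

/// No SSE3: fall back to the SSE2 unaligned load (MOVDQU).
#[cfg(not(target_feature = "sse3"))]
pub fn load_unaligned_m128i(bytes: &[u8; 16]) -> __m128i {
    use core::arch::x86_64::_mm_loadu_si128;
    // SAFETY: `bytes` is exactly 16 bytes and MOVDQU accepts
    // unaligned addresses.
    unsafe { _mm_loadu_si128(bytes.as_ptr().cast()) }
}
```

Note this dispatches at compile time via `cfg`, so there's no runtime feature check; a `cpuid`-based runtime dispatch would be a different design.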

Evrey commented

Full quotes from Intel's optimisation manual:

> SSE3 provides an instruction LDDQU for loading from memory addresses that are not 16-byte aligned. LDDQU is a special 128-bit unaligned load designed to avoid cache line splits. If the address of the load is aligned on a 16-byte boundary, LDDQU loads the 16 bytes requested. If the address of the load is not aligned on a 16-byte boundary, LDDQU loads a 32-byte block starting at the 16-byte aligned address immediately below the address of the load request. It then provides the requested 16 bytes. If the address is aligned on a 16-byte boundary, the effective number of memory requests is implementation dependent (one, or more).
>
> LDDQU is designed for programming usage of loading data from memory without storing modified data back to the same address. Thus, the usage of LDDQU should be restricted to situations where no store-to-load forwarding is expected. For situations where store-to-load forwarding is expected, use regular store/load pairs (either aligned or unaligned based on the alignment of the data accessed).

So it's a little trickier than just merging the intrinsics. Really you'd have `load_unaligned_m128i_for_write_back` and `load_unaligned_m128i_read_only`.
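If one went that route, the split might look something like the following (the names come from the comment above; the raw-pointer signatures and safety contracts are hypothetical, kept close to the underlying intrinsics):

```rust
use core::arch::x86_64::{__m128i, _mm_lddqu_si128, _mm_loadu_si128};

/// For data in a store/load pattern on the same address: plain
/// MOVDQU, which cooperates with store-to-load forwarding.
///
/// SAFETY: `p` must point to 16 readable bytes; SSE2 must be available.
#[target_feature(enable = "sse2")]
pub unsafe fn load_unaligned_m128i_for_write_back(p: *const __m128i) -> __m128i {
    unsafe { _mm_loadu_si128(p) }
}

/// For read-only data with no recent store to the same address:
/// LDDQU, which avoids cache-line-split penalties.
///
/// SAFETY: `p` must point to 16 readable bytes; SSE3 must be available.
#[target_feature(enable = "sse3")]
pub unsafe fn load_unaligned_m128i_read_only(p: *const __m128i) -> __m128i {
    unsafe { _mm_lddqu_si128(p) }
}
```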

And there's another goodie:

> Loading 16 bytes of SIMD data efficiently requires data alignment on 16-byte boundaries. SSSE3 provides the PALIGNR instruction. It reduces overhead in situations that require software to process data elements from non-aligned addresses. The PALIGNR instruction is most valuable when loading or storing unaligned data where the address is shifted by a few bytes. You can replace a set of unaligned loads with aligned loads followed by PALIGNR instructions and simple register-to-register copies.
>
> Using PALIGNR to replace unaligned loads improves performance by eliminating cache line splits and other penalties. In routines like MEMCPY( ), PALIGNR can boost the performance of misaligned cases. Example 5-2 shows a situation that benefits by using PALIGNR.
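To make the PALIGNR trick concrete, here's a hedged sketch; the fixed 4-byte shift and the function name are made up for illustration. PALIGNR's shift amount must be a compile-time constant, which is why real memcpy-style code ends up with one variant per possible misalignment:

```rust
use core::arch::x86_64::{__m128i, _mm_alignr_epi8, _mm_load_si128};

/// Reads the 16 bytes starting 4 bytes past `aligned`, using two
/// aligned loads plus PALIGNR instead of one unaligned load.
///
/// SAFETY: `aligned` must be 16-byte aligned with at least 32
/// readable bytes, and SSSE3 must be available.
#[target_feature(enable = "ssse3")]
pub unsafe fn load_shifted_by_4(aligned: *const u8) -> __m128i {
    unsafe {
        let lo = _mm_load_si128(aligned.cast()); // bytes 0..16
        let hi = _mm_load_si128(aligned.add(16).cast()); // bytes 16..32
        // PALIGNR concatenates hi:lo and shifts right by 4 bytes,
        // yielding bytes 4..20 without a cache-line-splitting load.
        _mm_alignr_epi8::<4>(hi, lo)
    }
}
```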

Le Big Doc: just Ctrl+F for the instructions you are interested in. The text always uses CAPS, the code snippets always use lowercase, so it's easy to navigate.

Lokathor commented

I think I'll just not implement this one at all, to minimize confusion; if absolutely necessary it could be added later.

However, if you're writing your own `memcpy`, you sure don't need my help.