memcpy_sse: A C repository from level1wendell

Trying to make memcpy() go faster when compiled as a 32 bit binary.

Let's use sse3/128 bit copies.

32 bit
gcc -march=sse3 -O3 -m32 testmem_modified.c -o tm32

64 bit
gcc -march=sse3 -O3 -m64 testmem_modified.c -o tm64



usage:

./tm32 (or ./tm64) 32 

for a 32 meg memory to memory copy. if you want, replace memcpy_sse with
memcpy() which is a built-in function. When compiling with -m32 it is glacially
slow even for architectures that have things more advanced than e.g. the eax register.  

with default memcpy() on 32 bit compile:

./tm32 32
32 MB = 3.388237 ms
-Compare match (should be zero):  0 

with memcpy_sse() on 32 bit compile:

./tm32 32
32 MB = 0.759420 ms  (LOWER IS ISANELY WAY BETTER...) 
-Compare match (should be zero):  0 

./tm64 32 with standard memcpy() on 64 bit compile:

32 MB = 0.759102 ms (AS INTENDED)
-Compare match (should be zero):  0 

The test system is an AMD Threadripper 1950x Clocked at 4.1ghz with DDR4-3200 memory. 


UPDATE: It has been suggested to try some additional gcc params to try not inlining memcpy(). This does help on SOME systems. But not all, not on Fedora 27+Threadripper for example.

#Non-inlined memcpy is sometimes, but not always, garbage. On this TR system it is garbage... 
gcc -march=k8-sse3 -m32 -O3 -o tm32 -fno-builtin-inline -fno-inline testmem_modified.c -S

This also doesn't work for memcpy speedup, which forces gcc to not inline memcpy (call as memcpy_ptr) 
void *(*memcpy_ptr)(void *, const void *, size_t) = memcpy;
level1wendell/memcpy_sse