Trying to make memcpy() go faster when compiled as a 32 bit binary. Let's use SSE3/128 bit copies.

32 bit: gcc -march=sse3 -O3 -m32 testmem_modified.c -o tm32
64 bit: gcc -march=sse3 -O3 -m64 testmem_modified.c -o tm64

usage: ./tm32 (or ./tm64) 32
for a 32 meg memory to memory copy.

If you want, replace memcpy_sse with memcpy(), which is a built-in function. When compiled with -m32, the built-in memcpy() is glacially slow, even on CPUs that have far wider registers than the 32 bit eax.

With default memcpy() on 32 bit compile:
    ./tm32 32
    32 MB = 3.388237 ms
    -Compare match (should be zero): 0

With memcpy_sse() on 32 bit compile:
    ./tm32 32
    32 MB = 0.759420 ms (LOWER IS INSANELY WAY BETTER...)
    -Compare match (should be zero): 0

With standard memcpy() on 64 bit compile:
    ./tm64 32
    32 MB = 0.759102 ms (AS INTENDED)
    -Compare match (should be zero): 0

The test system is an AMD Threadripper 1950X clocked at 4.1 GHz with DDR4-3200 memory.

UPDATE: It has been suggested to try some additional gcc params so that memcpy() is not inlined. This does help on SOME systems, but not all; not on Fedora 27 + Threadripper, for example.

#Non-inlined memcpy is sometimes, but not always, garbage. On this TR system it is garbage...
gcc -march=k8-sse3 -m32 -O3 -o tm32 -fno-builtin-inline -fno-inline testmem_modified.c -S

Calling memcpy through a function pointer (call it as memcpy_ptr), which forces gcc to not inline memcpy, also doesn't speed it up:
void *(*memcpy_ptr)(void *, const void *, size_t) = memcpy;
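
The memcpy_sse() routine named above is not reproduced in this text, so here is only a minimal sketch of what a 128 bit copy loop could look like. The names memcpy_sse_sketch and copy_compare are hypothetical. The unaligned SSE2 load/store intrinsics are an assumption; the real routine may use aligned or non-temporal stores, prefetching, or loop unrolling instead.

```c
#include <emmintrin.h> /* SSE2: __m128i, _mm_loadu_si128, _mm_storeu_si128 */
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for memcpy_sse(): copies n bytes from src to dst,
   16 bytes (128 bits) per iteration. Regions must not overlap. */
void *memcpy_sse_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Main loop: one unaligned 128 bit load and one unaligned store. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }

    /* Tail: copy the remaining 0..15 bytes one at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}

/* Hypothetical helper: fills a buffer, copies it, and returns the memcmp()
   result, so 0 means the copy matched (like "Compare match" above).
   Returns -1 if allocation fails. */
int copy_compare(size_t n)
{
    unsigned char *src = malloc(n + 1);
    unsigned char *dst = malloc(n + 1);
    int result;

    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1;
    }
    for (size_t i = 0; i < n; i++)
        src[i] = (unsigned char)(i * 31u + 7u);

    memcpy_sse_sketch(dst, src, n);
    result = memcmp(dst, src, n);

    free(src);
    free(dst);
    return result;
}
```

For a 32 meg copy, call copy_compare((size_t)32 << 20) from main(). On a -m32 build, SSE2 has to be enabled (for example with -msse2) for the intrinsics to compile. This sketch says nothing about whether it reproduces the timings above.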