fancy-memset: An Assembly repository from moon-chilled

'fm' - 'fancy memset', a hopefully fast memset implementation
numbers are variable and depend heavily on sizes, but the sse2 version is:
20-150% faster than freebsd memset
300-500% faster than naive rep stos (e.g. musl, ~openbsd) on erms systems
5-100% faster than bionic memset[*], except for extremely small sizes (which do not really make for a fair comparison since the compiler will inline small constant memsets, and small nonconstant memsets are rare and not performance-critical)
20-110% faster than solaris memset for medium sizes; even or slightly faster at large sizes and slower for very small sizes
(don't know about glibc, idk how to force it to use sse2)

avx2 version is:

300-400% faster than a naive loop compiled with -O3 by gcc/icc/clang
10-370% faster than glibc memset
20-370% faster than naive rep stos on erms systems (however this is hard to measure, as rep stos can use non-temporal stores, so the performance characteristics are broadly different)
50-470% faster than freebsd memset, for sizes larger than 64 bytes (see previous note about small/constant sizes)
Much faster (up to 2x) than well-made (e.g. bionic, fm) sse2 implementations

Code size is tiny compared to solaris (20x) and glibc (how do you even begin to quantify this rat's-nest??), which has farther-reaching effects.  Similar size to freebsd/bionic (though still slightly smaller)

try out the benchmark suite on your own system: 'make'
if you have an amd cpu or an older intel cpu, try: 'make erms='
on e.g. freebsd you may need to add 'AS=/usr/local/bin/as'

* this one is actually quite interesting.  On intel cpus, performance is only 5-30% better for small-med sizes.  Once sizes get quite large, though, fm performs 2x better because it takes advantage of erms.  On amd cpus, erms are not available, so performance for very large sizes is comparable (though fm still wins).  But performance for small-med sizes is much better due to differing priorities.  So, 5-100% performance improvement is measurable on any cpu (except old intel chips, probably), but for different reasons and in different size classes.
moon-chilled/fancy-memset