buybackoff/1brc

New C++ version 3x faster on 10k key dataset

lehuyduc opened this issue · 6 comments

https://github.com/lehuyduc/1brc-simd

Hi, I've updated my code to optimize for the 10k keys dataset. On my PC it's ~3x faster (excluding munmap time) than the commit you tested. Default dataset performance is a bit slower.

Just run ./run_cpp.sh to compile and run it.

To test the effect of hyper-threading, you can run ./run_cpp.sh 12 12 (12 == total number of threads on your CPU). You will see interesting effects on the 10K dataset :D
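If you don't know the thread count offhand, something like the sketch below should work on Linux. It assumes GNU coreutils' nproc is installed, and that both arguments take the same count as in the 12 12 example above; the variable names are just for illustration.

```shell
# Count the logical CPUs (hardware threads) installed on this machine.
# Assumption: GNU coreutils `nproc` is available.
threads="$(nproc --all)"

# Build the command with the same count for both arguments,
# mirroring the "12 12" example; this only prints it.
cmd="./run_cpp.sh $threads $threads"
echo "$cmd"
```

Copy the printed line into the repo directory to run it.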

Thanks! Looking forward to your updated results.

@lehuyduc I assume you checked the output against the correct one? I'm too lazy to redo that every time.

Yes. I tested on 3 different measurements.txt files, and the outputs are correct. If I find any new errors, I'll fix them.

Could you test ./run_cpp.sh 12 12 too? I'd like to see how hyper-threading hurts performance when there are many branch misses or L3 cache misses.

./run_cpp.sh 12 12 and ./run_cpp.sh 12 6 are not much different: the results would match to the second decimal place even if the delta > sigma. I didn't look too deeply into that.

Huh, so I guess this is an AMD-specific problem. If I run with all virtual threads on the 2950X, it's much slower, by 30+%. Anyway, thanks for testing!

The blog update is now deploying.