Last refresh
pedrosakuma opened this issue ยท 45 comments
Hello Victor,
I know is very close to the official cut time.
I submitted a PR on 1brc-bench with my repository.
Thanks in advance.
I know is very close to the official cut time.
Actually, there is no official cut off time for me. That even sounds weird to me. I'm not affiliated with the Java competition. And they were unwelcoming enough to make me not even trying. What I liked about this challenge is that no one just blindly copied others' solutions and everyone kept their original designs to a big extent. For me the main constraint was that I understood the code and given lot's of free time I had a chance to come up with something similar. Or something obvious that I just forgot. These things I could reuse. But never I would reuse SIMD simultaneous 4-number parsing by @noahfalk and similar stuff.
So my cut off time is dictated by my own time constraints and availability of big machines for the final benchmarking. I'm going to a 3-days vacation tomorrow evening, but I will have time during the day tomorrow for extensive benchmarking. Feel free to update your solution, let's say before 14PM UTC tomorrow, if it's not just verbatim copy of someone's code. Same for @kpocza. Maybe even later. But I may lose access to the big machines after that vacation.
I'm still waiting when @RagnarGrootKoerkamp updates his fastest-Rust but not compliant code, the comparison would be very interesting anyway.
And @austindonisan has had some issues with 10K dataset. I had to disable it 2 commits ago (tweak hash sizes...
was failing on 4 cores). It looked OK as of the last commit yesterday for some combination of cores/threads, but since it was disabled I could not run it through all combinations.
Two more things. Since this issue is called "Last refresh", some final touches remain.
I include repos as submodules here: https://github.com/buybackoff/1brc-bench/tree/main/src
I'm going to run many benchmarks tomorrow on 2 AMD / 2 Intel machines and AWS m6/7g.metal Graviton. I have two issues.
-
License. @lehuyduc @austindonisan @dzaima @RagnarGrootKoerkamp @CameronAavik @kpocza @NimaAra @mtopolnik (Rust) do not have any license. It would be nice if every submodule in my repo had one of MIT/Apache/BSD. I may be reluctant to keep a proprietary submodule there, even if I understand that probably the intent was FOSS from the beginning.
-
ARM compatibility. I believe native and some .NET solutions are out (FYI @xoofx and @nietras it looks like you are fine on M1/2 Macs given the benchmarks Alexandre did on M1 Pro. Should I expect this to work on Gravitons? Or that was Rosetta emulation?). I use only Vector256 API and that should be accelerated on ARM as well. I'm not sure about Rust, but from a quick search it looks like it should work. Anyway if you do not use anything x64/x86 specific and could just use x-arch Vector or other API that would be nice. Since Java does not have easy access to SIMD and they had to reinvent the wheel with parallel int64 reads I expect most Java solutions will just work on ARM, that's a positive thing of a constrained environment. I would love to have more things to compare in ARM.
Added a license to my repo (forgot to do so before!).
I could probably get an aarch64 port made without too much trouble, though we'll have to see if I get around to it today, tomorrow, or ever.
But never I would reuse SIMD simultaneous 4-number parsing by @noahfalk and similar stuff.
In case you were worried on my behalf, I'm happy to have anyone reuse anything that appears useful from my entry :) Of course as the curator of these cross-language 1BRC results its totally your call on how distinctive you want each entry to be before you decide to include it.
I have added the MIT license now on master.
In case you were worried on my behalf, I'm happy to have anyone reuse anything that appears useful from my entry :) Of course as the curator of these cross-language 1BRC results its totally your call on how distinctive you want each entry to be before you decide to include it.
@noahfalk I was talking for myself. Frankly, I did not have time to grok your solution in every detail, especially the SIMD 4-in-1 number parsing. But I could have copied your main loop and adapt it to my other parts without groking its every detail. It would be of zero interest to me and would bring zero fun. Actually on many cores, e.g. 64+, we are sometimes becoming competitive. You are consuming memory so aggressively on each core, so probably (my ungrounded theory) too many cores do not help.
@noahfalk You use AVX directly a lot. Do you think it's feasible to just use Vector256? So that we may have a chance for reasonable perf on ARM?
You are consuming memory so aggressively on each core, so probably (my ungrounded theory) too many cores do not help
Yeah, I could easily believe that. I never tested or tuned it for operation above 8 threads so it makes sense to me that it doesn't do as well there.
Do you think it's feasible to just use Vector256? So that we may have a chance for reasonable perf on ARM?
Some of those places probably are replaceable with Vector256 without changing the generated code, but unfortunately not all of them. In particular:
- MultiplyAddAdjacent doesn't have a Vector256 equivalent I'm aware of
- There is a Vector256.Shuffle(), but it is dramatically slower. I think it was generating a large number of instructions that would correctly handle shuffling across the 128bit lane boundary whereas the Avx2.Shuffle() fast hardware intrinsic doesn't support shuffling across that boundary.
Probably the easiest option is to exclude my entry from any ARM benchmarks with a note that it used architecture specific intrinsics. If anyone wanted to try porting my approach to ARM I'm happy to have them do that but I don't plan to tackle it myself.
License added.
ARM support is a good question for me. If the tests fail, I can check it on my Raspberry Pi if it supports .NET 8.
There is an extension for VS, which can show geneeated assembly code at function level. You may use it to check what Avx2 is generated for certain Vector operations (and other parts as well).
I added the license as well.
I added a license, too. Some last minute improvements, too, so please pull the latest copy.
I'm pretty confident it's bugfree and should work on the 10K data set now. The 10k set is only taking 2.4x as long now which might make it competitive?
I don't expect it to work on ARM (well at least). My entire codebase is centered around 256bit vectors.
@NimaAra your code does not work in the bench setup, probably because the input file is a symlink there. See @kpocza fix for that: kpocza/1brc@9ccfe6a
@xoofx and @nietras it looks like you are fine on M1/2 Macs given the benchmarks Alexandre did on M1 Pro. Should I expect this to work on Gravitons? Or that was Rosetta emulation?
Should work fine on ARM. No simulation. Most cross-platform Vector256 (which is impl on ARM as 2 x Vector128). My impl has fallback when no Avx2 and only used for a specific hash optimization. On ARM likely not run as well as not optimized particularly for that.
Symlink support added to master albeit not tested.
Done! Try to digest 2540 files now :)
buybackoff/1brc-bench@41fb2bb
Some basic formatting (mean time): original dataset, 10K
@buybackoff with your added sudo
to all the numactl
s in run.sh
, I think the THREADS_1BRC=
and THREADS=
envvars don't make it to their respective programs? You'd need sudo -E
or similar.
Interesting, @dzaima Is that showing that for 10k my "Safe" code is out-performing other solutions? That can't be right or am I reading this wrong?
@NimaAra I re-ran your code and then updated the files but obviously something old remained there.
@buybackoff with your added sudo to all the numactls in run.sh, I think the THREADS_1BRC= and THREADS= envvars don't make it to their respective programs? You'd need sudo -E or similar.
Right, I was thinking why your results are the same. Shit, it's too late to re-run it now. But anyone could run this on AWS metal themselves.
@dzaima you are the only one affected. How to fix that? I may be able to re-run your and NimaAra results on non-AWS machines.
Don't remove it! Let me bask in this empty victory a bit longer...๐ญ
I'd be fine with just removing non-8-thread timings of my entry within the current results (as my solution defaults to THREADS_1BRC=8
).
@NimaAra Do not worry, I may start the benches but will not wait for the results today. No time, need to go ๐โท๏ธ
Actually, with the numactl-limited data, can keep all thread counts less than or equal to 8; so that's removing all these files, resulting in these staying
(edit: added bad file names in original dataset too)
@NimaAra actually you results just should be in the 10K 20 runs benches, I forgot to delete them from there. Other numbers are fine.
@dzaima would it be difficult to aggregate by min value? Or throw away top/bottom 5 for 20 runs?
Oh, the run counts for 1B
are 5 & 20, but 1B_10K
has 3 & 5; the directory names are just wrong then.
Thanks! Yes 20 runs are only for the default data. Otherwise too long to wait.
There is logic to divide default runs by 4 for 10K, but keep at least 3.
And here's, for tests with 20 runs, average from without the best 5 & worst 5 times.
I see different number of participants in the test cases. The maximum is 19 in the previous outputs, while the new tables show only 16-17 rows per test. Even the same machines have different number of participants, like 6c6t has 17, while the rest has only 16. Why not having 19 participants in all new outputs?
@kpocza some are very long to wait, especially on 20 runs or 10K. 5 runs (and 3 for 10K) is enough for a big picture, everyone is included there.
@kpocza I dropped my (dzaima
) entries on tests with over 8 threads from my latest tables (as discussed above, it was capped to 8 threads so has meaningless/misleading data above that). That should be the only change in participant counts between my initial tables and the newer ones though. (the rest of the differences are just in the data; the initial tables had a varying number of participants too, just not within one CPU)
Thanks for all the numbers!
However, your tests on multi-socket systems (which I think is all except for Intel Raptor Lake) aren't valid. The page cache is split across NUMA nodes, and how that data is loaded is being decided by how the 1st program chunked its data . Later programs then just have to deal with that, loading the file across the CPU interconnect instead of directly from RAM.
To fairly test this you need flush the page cache between each competitor's runs (including every time the core count changes).
Also I'm fairly certain my hyper threaded results are cheating. My coded blindly calls sched_setaffinity() which overrides numactl. I can fix that with some more command line parameters if anybody cares.
The other side of this is that my code was not designed to be run on consumer Intel chips. It doesn't verify the processor layout and assumes that processor numbers are contiguous and not interleaved. So on the Raptor Lake (i5-13500) test 6c/6t is actually 3c6t for my code. I don't care and it's certainly just because I was lazy and not your fault at all.
@austindonisan you are already the winner. But could you just help to make this repo truly reproducible? I'm not a Bash expert and do not want to be one, frankly. This repo is reasonable, i believe. I even wanted to exclude results when they go to another socket. But I'm listening...
@austindonisan you are already the winner. But could you just help to make this repo truly reproducible? I'm not a Bash expert and do not want to be one, frankly. This repo is reasonable, i believe. I even wanted to exclude results when they go to another socket. But I'm listening...
Multi-socket tests really do suck. I wasted so much time not understanding out why my code was slow when testing on a GCP instances.
For all of these you need to make sure the machine has at least 48GB RAM, so each socket has it's own 24GB.
Limiting tests to a single socket means you wouldn't have to flush to page cache. But you have to make sure that there aren't any badly behaved implementations (e.g. mine) that ignore numactl and will use the socket 2nd when you're just trying to force hyperthreading on the 1st core.
The easier fix is to just drop the caches and reload the file. If it's being run as root it will be a simple as sprinkling in a few "echo 3 > /proc/sys/vm/drop_caches". But then file then has to be read back in from disk, so if this machines have a HDDs that will be quite painful (~90 seconds).
I'll take a look at your test file and see how easy it is to modify.
I'm not too concerned with winning, it's just not very satisfying to see high, effectively arbitrary numbers for everybody at the high core counts that are dominated artificial memory limits (bandwidth is effectively cut 35% for one socket and 75% for the other). The Zen3 (7R1348) tests clearly had bad memory placement for everybody, while the Zen4 (9374F32) test seems to have been good for just my code. Or maybe there's something else going on with AVX-512, but it's impossible to know.
One last comment, it would be nice to include a test with a Zen2 CPU (the contest target). Your AMD tests are only on Zen3 and Zen4.
I went through your scripts and it's definitely not trivial to change. You have to rip out the tmpfs which was ok, and clearing the caches before each run above 1/2 max cpu is ok, but it becomes painfully slow refilling the cache each run. Also 1 warm up run isn't sufficient, but that needs to be tweaked in a different file.
Overall I'd say not worth the effort. I did go and fix my program to listen to numactl, but it's unfortunate I didn't do that earlier.
Again thanks again for putting this together! The 10k results were the most surprising so I'm glad you ran those. I 100% prioritized the standard dataset and was expecting my performance on the 10k set to be awful.
@austindonisan thank you very much for such deep insights here. From your comments I infer that results on a single socket are OK as is. If there's memory slowdown it's the same for everyone. Is that right?
Have added MIT License to my solution now
@austindonisan thank you very much for such deep insights here. From your comments I infer that results on a single socket are OK as is. If there's memory slowdown it's the same for everyone. Is that right?
The tests where everybody's threads were supposed to run on a single CPU might be fair enough. If the servers had at least 128GB of RAM and you ran the tests without any faffing (i.e. didn't load more than 1 copy each of 1B.txt and 1B_10K.txt).
The results are at best bi-modal because of your tmpfs usage. Normally the files would have been loaded into node 0's memory on the first run since the tests were scheduled to run on cores 0-7. But since use_input.sh is doing the the memory loading, without numactl, it's 50-50 which node "cp" will run on and therefore which node the memory will be loaded into.
Additionally, copying a file into tmpfs also loads it into the page cache, duplicating it's memory footprint. That means that the servers needed at least 128GB of RAM to guarantee no weird effects (128GB -> 64GB per node -> ~30GB of text files duplicated in page cache and tmpfs).
There is the problem of the actual binaries being loaded into memory on the wrong node, but I didn't test the size of that effect.
I ran the tests for my code on a 112-core (2 x 56-core) machine explicitly loading the measurements into Node #0, Node#1, and Split (where the file is loaded naturally). The first set of numbers is 1B.txt, the second is 1B_10K.txt.
C | T | Node 0 | Node 1 | Split | Node 0 | Node 1 | Split |
---|---|---|---|---|---|---|---|
8 | 8 | 617 | 869 | 1625 | 2009 | ||
8 | 16 | 488 | 770 | 1542 | 1920 | ||
16 | 16 | 323 | 552 | 821 | 1047 | ||
16 | 32 | 256 | 528 | 860 | 1032 | ||
32 | 32 | 168 | 330 | 431 | 559 | ||
32 | 64 | (missed) | 352 | 574 | 624 | ||
56 | 56 | 146 | 222 | 309 | 354 | ||
56 | 112 | 163 | 236 | 456 | 461 | ||
112 | 112 | 166 | 173 | 89 | 249 | 257 | 209 |
112 | 224 | 189 | 191 | 114 | 374 | 392 | 322 |
Some interesting things:
- Loading the measurements across nodes is obviously slower. I'm unsure if it's related to latency or bandwidth. Hyperthreading helps less in this case surprisingly.
- In the 112 core case the Node 0 vs Node 1 distinction is mostly gone since half the cores are always loading data from the wrong node. What's left might be the effect of which node the binary is on.
- The 32->56 core throughput boost is very small for the 1B.txt Node 0 case. It's likely hitting either memory bandwidth and/or power limits. The 10K through boost is higher, suggesting that's at least partially memory bandwidth.
- My 10k code does poorly at high core counts since the merging code isn't optimized (it's n^2 for large cities).
- I previously was getting 132ms and 75ms in the 56/112 core cases. Not sure what changed with my code but I didn't want rent the server for longer to debug it.
If the servers had at least 128GB of RAM
All multi-sockets machines had at least 192 GB (c5.metal which is the smallest).
But since use_input.sh is doing the the memory loading, without numactl, it's 50-50 which node "cp" will run on and therefore which node the memory will be loaded into.
On non-AWS machines the Bash itself was limited by taskset to the first 32 cores. That's why I had to add --all
to numactl
.
Thanks a lot everyone! I've learned so much from every repo. I hope to find time soon to finish the blog and hopefully highlight some most audacious ideas, like reading 8 rows in parallel per core ๐ The single-core parallelism was the biggest learning outcome for me, especially given Java guys managed to be eventually faster over the last two days just by using the instruction parallelism without SIMD.
Also thanks to @artsiomkorzun. I stopped at some point, but then there was no reason why Java without SIMD should be so much faster, so I got some insights from your code and found some glaring oversights in my code. Yet, even with 10K on Intel that was tight. Why it's so fast on AMD remains a mystery for me ๐
If we take every run on [AMD/Intel]/Threads_Mix/Dataset and make the fastest one as 100% (or 1 sec) and scale everything and then we average it by threads and then by CPU vendor => the resulting numbers should be quite fair. Default/10K separation should remain. This is what I want to get as the final result.
Good luck, that was fun!
Thanks for all the work you put in benchmarking and publishing everything @buybackoff! It was nice to have a place where even the non-Java entries felt like they belonged.