frevib/io_uring-echo-server

Wild results, cannot reproduce


I have tested your epoll and io_uring examples and I get 250k req/sec with your epoll example and only 220k with io_uring. I also get 250k with my own epoll implementation, so that confirms we are both making efficient use of epoll.

I'm running Clear Linux with Linux 5.7 - do you have any hints on how I can reproduce your results?

When I strace your example I get a lot of these:

io_uring_enter(4, 59, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 59
io_uring_enter(4, 61, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 61
io_uring_enter(4, 59, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 59
io_uring_enter(4, 61, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 61
io_uring_enter(4, 59, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 59
io_uring_enter(4, 61, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 61
io_uring_enter(4, 59, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 59
io_uring_enter(4, 61, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 61
io_uring_enter(4, 59, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 59
io_uring_enter(4, 61, 1, IORING_ENTER_GETEVENTS, NULL, 8) = 61

Isn't it supposed to poll without any syscalls, to maximize the performance advantage?
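(For context: the mode where the kernel polls the submission queue so that submissions need no io_uring_enter round-trips is IORING_SETUP_SQPOLL, which this echo server does not use. A minimal sketch of enabling it, assuming liburing is installed - the queue depth and idle time are illustrative, and this is not code from this repo. On the 5.x kernels discussed here, SQPOLL also has extra requirements such as elevated privileges and registered files.)

    #include <liburing.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_params params = {0};

        /* Ask the kernel to spawn an SQ polling thread; SQEs are then picked up
           from shared memory, and io_uring_enter() is only needed to wake the
           thread after it has gone idle. */
        params.flags = IORING_SETUP_SQPOLL;
        params.sq_thread_idle = 2000; /* ms before the SQ thread sleeps */

        int ret = io_uring_queue_init_params(256, &ring, &params);
        if (ret < 0) {
            fprintf(stderr, "queue_init: %s\n", strerror(-ret));
            return 1;
        }

        /* ... prepare SQEs, io_uring_submit(), wait for CQEs ... */

        io_uring_queue_exit(&ring);
        return 0;
    }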

I absolutely cannot reproduce your results. I've tested on a completely different machine, a Fedora Rawhide system with Linux 5.8 and default Spectre mitigations. This machine confirms my findings from above: io_uring (this example at least) is not faster than epoll.

I get consistently worse results with io_uring, 227k vs. 235k and the like. I have no idea how you would get the 99% performance increase with io_uring that your reddit post claims, or even 45%. I only get worse results, with or without Spectre mitigations. I really cannot reproduce the findings you claim.

Hi Alex,

The 99% increase was with an early version of io_uring and a buggy version of liburing. Besides, the performance test tool used had some bugs as well: #2 (comment)

So these early results were quite off.

If you use this latest echo server and have support for IORING_FEAT_FAST_POLL, you should in most cases get a performance increase: https://twitter.com/hielkedv/status/1234135064323280897?s=21

This was at the time using Ubuntu + Linux 5.6 + Jens’ IORING_FEAT_FAST_POLL branch. The current state of io_uring and liburing could have changed the performance. However, IORING_FEAT_FAST_POLL saves a syscall, so in theory it should be faster.
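(For anyone who wants to verify that their kernel actually reports the feature: liburing exposes it in the feature mask that the kernel fills in during queue setup. A minimal sketch, assuming liburing is installed - not code from this repo.)

    #include <liburing.h>
    #include <stdio.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_params params = {0};

        if (io_uring_queue_init_params(64, &ring, &params) < 0) {
            fprintf(stderr, "io_uring_queue_init_params failed\n");
            return 1;
        }

        /* The kernel fills params.features; FAST_POLL means recv/send on
           non-blocking sockets is armed with an internal poll instead of
           being punted to the async worker pool. */
        if (params.features & IORING_FEAT_FAST_POLL)
            printf("IORING_FEAT_FAST_POLL supported\n");
        else
            printf("IORING_FEAT_FAST_POLL not supported (kernel too old)\n");

        io_uring_queue_exit(&ring);
        return 0;
    }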

The test I ran was reporting support for IORING_FEAT_FAST_POLL, and yes, I agree that in theory it should be faster since there are far fewer syscalls. But in practice it's not. It even goes the wrong way: the more clients I add, the more the advantage shifts toward epoll, which is the opposite of what one would expect, since epoll needs more syscalls the more clients you have (epoll_wait, then (recv, send) * N).

But I don't care what the theory says. This looks similar to when everyone and their mother was telling me writev was so much faster because it was scatter/gather, yet in my testing it was slower than just copying up to 4 kB into an append buffer and writing that copy off with the old send syscall.

@axboe Do you have any benchmarks of your own in this regard?

So these early results were quite off.

If you use this latest echo server and have support for IORING_FEAT_FAST_POLL, you should in most cases get a peformance increase

I have run the latest version many times on many different machines and kernels and it does not perform better than epoll.

This echo server also uses IORING_OP_PROVIDE_BUFFERS, which causes a performance drop: https://twitter.com/hielkedv/status/1255492941960949760?s=21

You could try running it without automatic buffer selection, using your own buffer management implementation, or with no buffer management at all.
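(A hedged sketch of what "your own buffer management" means in practice: post the recv with a buffer the application owns per connection, instead of IOSQE_BUFFER_SELECT. The struct and function names here are hypothetical, not taken from this repo; SQE exhaustion handling is omitted for brevity.)

    #include <liburing.h>

    #define BUF_SIZE 2048

    struct conn {
        int  fd;
        char buf[BUF_SIZE];   /* owned by the application, reused per recv */
    };

    /* Queue a recv on a connection using a caller-owned buffer, instead of
       IOSQE_BUFFER_SELECT + IORING_OP_PROVIDE_BUFFERS. */
    static void queue_recv(struct io_uring *ring, struct conn *c)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_recv(sqe, c->fd, c->buf, sizeof(c->buf), 0);
        io_uring_sqe_set_data(sqe, c);   /* find the connection again on completion */
    }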

I don't have the time to tinker with your code right now - I was hoping to simply confirm or deny your findings using the code you provided (as with anything scientific). So far I cannot confirm your findings, and you seem to be hinting that it isn't really possible right now.

I would like to see an example that does in fact prove the efficiency of io_uring over epoll.

So what commit should I use? You didn't tag any release so I can only check out master.

My guess is you'd use this branch: https://github.com/frevib/io_uring-echo-server/tree/io-uring-feat-fast-poll, which is what he linked in the benchmark description.

This could very well be the branch; it’s indeed without buffer selection but with fast poll. I haven’t had the time to look for the exact commit hash.

Maybe try this one: https://github.com/frevib/io_uring-echo-server/tree/b003989ecb6343b5815999527310251601531acc

This commit is right before buffer selection was implemented.

Qix- commented

I'm with @alexhultman on this. I simply cannot reproduce anything close to the wild, game-changing claims others have made.

I allocated a 4 vCPU machine on a Google Cloud instance. Seeing as how the benchmarks were run on a VM with a macOS host, I would imagine this should not affect the benchmark so much that epoll is just as fast.

Note: I did test this locally on a Windows host running VMware Player with an Ubuntu guest, but I did not isolate the CPUs, so I didn't consider those results here. However, I got the exact same outcome - epoll and io_uring are neck and neck, all things considered.

I upgraded to mainline 5.8.12-050812-generic, set isolcpus=0,1 and told systemd to affine to 2 and 3. Rebuilt GRUB, rebooted, and ran the benchmarks using taskset -c 0|1 <cmd> <args...>.

After a few runs I got very mixed results, but they were all within the same range. Sometimes io_uring achieved more throughput, sometimes epoll. Never with a spread of more than about 2k requests a second, however.

I tried both the io-uring-feat-fast-poll branch and the direct commit in #8 (comment). Both segfault for me on startup.

Tests were run with a variety of parameters to the echo server benchmark tool. All of the spreads were roughly the same: +/- 2k req/sec, with epoll and io_uring trading the win over a 60 second period.

I'm beginning to think this is another case of "too good to be true". Just my take.


Further, the linked tweet above is quoting the same benchmark wiki here, claiming a 68% increase. Sorry, but that's misleading and simply wrong.

To understand the performance gains that the echo server is claiming, maybe some extra context is needed.

  • When io_uring first came out in kernel 5.1, I used a version of liburing and a testing tool that contained several bugs. This caused a massively overstated performance gain, and was fixed here: #2 (comment). After the fix, io_uring was only a few percent faster than epoll.

  • Then came IORING_FEAT_FAST_POLL, which made polling unnecessary and therefore avoids the syscalls to poll for completed IO. The maximum performance increase I got is indeed 68%, using 500 connections with a 512 byte message size. If you use 1 connection and 128 bytes, the performance increase is minimal. Using > 500 connections and message sizes > 1000 bytes also decreases the performance. As netty/netty#10622 (comment) also states, it seems io_uring prefers lots of simultaneous connections. The trade-off here is that you have to manage the buffers that you give to the sockets yourself.

  • Then came IORING_OP_PROVIDE_BUFFERS, which can manage the buffers for you. This, however, means that you have to re-register a "used" buffer each time one has been consumed (see the sketch after this list). That doubles the iterations in the event loop and caused a performance drop. Still, in many connection-count/message-size configurations io_uring is a bit faster.
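(To illustrate the extra event-loop work the last bullet describes, here is a hedged sketch of handing a consumed buffer back with another IORING_OP_PROVIDE_BUFFERS SQE after every recv. The buffer group id, pool size, and function names are made up for illustration and are not the code this repo uses; error handling is omitted.)

    #include <liburing.h>

    #define BUF_SIZE  2048
    #define NR_BUFS   256
    #define BGID      1        /* hypothetical buffer group id */

    static char bufs[NR_BUFS][BUF_SIZE];

    /* Called once at startup: register the whole pool with the kernel. */
    static void provide_all(struct io_uring *ring)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_provide_buffers(sqe, bufs, BUF_SIZE, NR_BUFS, BGID, 0);
    }

    /* Called after every recv completion: return the used buffer, so each
       echo round-trip needs roughly twice the SQEs of the manual scheme. */
    static void reprovide(struct io_uring *ring, struct io_uring_cqe *cqe)
    {
        int bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;  /* which buffer was picked */
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_provide_buffers(sqe, bufs[bid], BUF_SIZE, 1, BGID, bid);
    }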

Of course echo server is not a real use case, so have a look at some of the "real" implementations like Netty or NodeJS.

If you use 1 connection and 128 bytes the performance increase is minimal.

We know. It's obvious. That's why we tested with 1k connections, and epoll was faster (its lead grew the more clients we added - the opposite of what is claimed).

have a look at some of the "real" implementations like Netty or NodeJS.

Node.js is the most nondeterministic, unjust and imprecise benchmark of a kernel feature. You cannot possibly mean to use a highly bloated JavaScript environment with nondeterministic garbage collection and JIT compilation to reliably benchmark a kernel feature.

Then maybe have a look at Netty, if it suits you better as OP suggested? netty/netty#10622 (comment)

@alexhultman when I have some time left I will try and find the right commit that does not use IORING_OP_PROVIDE_BUFFERS. I’m quite out of the io_uring scene at the moment, so for now please take this software as-is. I’m also pretty sure there are more people who created examples without IORING_OP_PROVIDE_BUFFERS; it really performs better than epoll ☺️

I will try and find the right commit that does not use IORING_OP_PROVIDE_BUFFERS

Great, thanks!

Then maybe have a look at Netty, if it suits you better as OP suggested?

What are you even talking about? I am OP and I have no interest whatsoever in any Java or JavaScript wrapper. This is a kernel feature in C, not a wrapper in some nondeterministic, garbage-collected virtual machine.

Netty uses C via JNI, and it has an implementation for epoll and now io_uring, so it is a good reference for seeing the difference when io_uring is used by a library like this.

All applications written in the history of mankind use C on Linux; it is the main gateway to the kernel. By that logic you could argue Ruby on Rails makes a good Linux kernel benchmark because, under the hood, it too is C.

Of course, anyone with more than 2 brain cells would see that any such benchmark would be massively tainted by the fact that you are also driving this whole mountain of bloat, making the benchmark as a whole useless. Only a minimal C client directly using the syscalls / liburing makes sense here.

All of Java is built on JNI at some level (see the above logic, it has to be), and JNI has a demonstrated 4x FFI overhead because you are executing in a virtual environment, much like an operating system inside an operating system. So you essentially have a measuring stick that taints everything by "a shit ton".

It's like measuring the size of an atom using your thumb and a squinted eye.

Dude, you made your point. We understand that, and of course you are right. A lifetime ago I worked with embedded systems too. Now we could keep diving and say that all applications in the history of mankind compile to assembly. And we would not be much further along in the discussion...

I'm sorry to be stupid, close-minded or whatever you want to call it, but if systems / libraries / languages as popular as Netty and Node are taking serious interest in io_uring, maybe there is an actual reason. And if most of them report benefits, maybe there is also a reason.

Now, can you please stop the toxicity for more than a second and put the same amount of energy into making a PR that showcases what you mention? @frevib has put quite a bit of energy into making that repo, trying to be as clear as possible. If you think it's not, we'd all benefit from your improvements. And I'm serious, we would.

We are all educated folks here, can we behave as such and simply try to improve the platform?

1Jo1 commented

you should try IOSQE_ASYNC; the request will be executed directly in the worker pool. 124% better performance in Netty (using non-blocking sockets) :)
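(IOSQE_ASYNC is a per-SQE flag asking the kernel to hand the request straight to the async worker pool instead of trying it inline first. A minimal hedged sketch of setting it on a recv - the function name and parameters are illustrative, not code from this repo or from Netty.)

    #include <liburing.h>

    /* Force a recv SQE to be punted to io_uring's worker pool instead of
       being attempted inline / via fast poll. */
    static void queue_recv_async(struct io_uring *ring, int fd,
                                 void *buf, unsigned len)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_recv(sqe, fd, buf, len, 0);
        io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
    }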

but if systems / libraries / languages as popular as Netty and Node are taking serious interest in io_uring, maybe there is an actual reason. And if most of those report benefits, maybe there is also a reason.

The Node.js sphere is not driven by logic, it is driven by hype and nothing else. I know this intimately from experience. "io_uring" is hype right now. That's why they added it. Node.js will not be affected one single bit by it, because they have way, way, way more serious problems in other places in their stack. That is why their existing epoll path is executing at less than 10% of what epoll is capable of. So Node.js is the worst possible example of a reliable kernel benchmark.

I'm going to rerun the benchmarks when @frevib posts the commit.

The Node.js sphere is not driven by logic, it is driven by hype and nothing else. I know this intimately from experience. "io_uring" is hype right now. That's why they added it. Node.js will not be affected one single bit by it, because they have way, way, way more serious problems in other places in their stack. That is why their existing epoll path is executing at less than 10% of what epoll is capable of. So Node.js is the worst possible example of a reliable kernel benchmark.

Mostly agree. Also agree the benchmark should run as close from the kernel as possible.

@frevib Let me know how I can help to create something more reproducible. I know how busy you are at work lately.

Hi, all

I still can't reproduce the benchmarking results. Any update on this?

Thanks!

As I understand @frevib's responses, this project is dead/abandoned. You can look at other benchmarks:

  • MonoIO vs Tokio: the small slowdown on 1 core and the speedups on 4+ cores are attributed to io_uring vs epoll
  • PhotonOS: the io_uring backend gives a 25% speedup in random reads over the libaio backend

These results confirm the original observation that io_uring can offer significant speedups in some cases.

Benchmarks have the usual caveats: they cover highly specific cases that most likely do not match your intended use. So it is better to actually try switching backends in your application than to waste time resurrecting a dead project.

This project is indeed quite stale. But if there are any bugfixes or improvements I can merge them.

The benchmarking results should be reproducible if you use exactly the same specs: Linux kernel version, liburing version, benchmark tools, etc. There have been so many changes since then that the benchmark results are most likely different now.

What @Mathnerd314 states seems about right: “io_uring can offer significant speedups in some cases”. Some projects benefit, others not at all. Have a look at Netty, for instance; they see a nice perf increase.