EPOLLET in non-blocking sc_sock
svladykin opened this issue · 10 comments
Do I understand correctly that EPOLLET
is not used because Windows WSAPoll
does not support it?
Yes, one reason is that.
More importantly, I don't see any benefit of using it(for my use-cases). I guess that flag is just for very specific applications. Why do you want to use it? Do you have a different use-case?
According to the docs https://man7.org/linux/man-pages/man7/epoll.7.html
When used as an edge-triggered interface, for performance
reasons, it is possible to add the file descriptor inside the
epoll interface (EPOLL_CTL_ADD) once by specifying
(EPOLLIN|EPOLLOUT). This allows you to avoid continuously
switching between EPOLLIN and EPOLLOUT
calling epoll_ctl() with EPOLL_CTL_MOD.
I assume it can be beneficial to have less kernel calls (less context switches / less lock contention).
I can see that at least Nginx uses this flag: https://github.com/nginx/nginx/blob/master/src/event/modules/ngx_epoll_module.c#L705
I remember I experimented with with EPOLLET years ago and decided it wasn't making performance any better(for my use-case). Also, it makes your app a bit complex.
Personally, I've never seen epoll_ctl() in flamegraphs. Maybe this is because of the types of app I worked on. Maybe it isn't the case with HTTP servers. I don't know.
It is not necessary that you will see epoll_ctl
in flamegraphs, it makes more sense to look at the actual latency/throughput numbers.
What your use case is? I guess if it is about sending small messages across fast local network to a small number of receivers, then probably EPOLLET will not make any difference because pretty much always you must be able to send the full message right away without subscribing to EPOLLOUT at all. In case of http server it is often needed to stream large chunks of content over the slow internet to thousands of receivers and here potentially EPOLLET may make some difference (I assume otherwise this flag would be completely useless).
Yes, totally I agree. If you fill output buffer of the socket continously, it might be noticable. Just it’s never been the case for me. If you are sending small messages / you have big output buffers, if you have app level backpressure, if your app is already spending quite time somewhere else, then it won’t be noticable. Probably, these were my scenarios.
Ok, thanks for the inputs!
I'm going to play around this stuff, so far it seems like it is possible to add edge-triggered mode in a backward compatible way by adding a new flag to sc_sock_ev
. For Linux and FreeBSD/MacOSX the implementation must be straightforward (use EPOLLET
and EV_CLEAR
flags respectively) and for Windows we could just emulate the needed behaviour (not for performance but for compatibility with the other platforms).
Are you willing to review/accept such a contribution if I do it?
Sure, it’d be great!
I started looking closely into implementation and have some questions: first of all why do we expand the events array (sc_sock_poll->events
) to accommodate all the sockets? In my understanding at least epoll
and kqueue
assume that this array is just a small intermediate buffer for receiving the the events as a batch. Some articles even suggest to have this batch to be as small as one element in length: https://habr.com/en/post/600123/ . I don't think 1 element array would be sufficient for high load use case, but growing this array to thousands of sockets does not sound right as well. Why don't just use fixed size array (hard coded or configurable)?
We have to do it for WSAPoll
if I remember correctly. For epoll
or kqueue
, it is not necessary but I wanted have all possible events in one go. If your array is small, you'll have to call epoll_wait()
multiple times. We can use a fixed length array but I think its size should be >=1024
. I usually use event loop iteration for batching. For example, assume you receive some data from network and you're supposed to write it to the disk. I read all network events (events that are reported in a single epoll_wait()
call) first and then trigger a single disk write. So, if you read a few events in one iteration, obviously this is not going to work well. Making it configurable is not necessary I guess. Extra options are always confusing :) So, yeah I think we can use a hard coded smaller array if you want.