tezc/sc

EPOLLET in non-blocking sc_sock

svladykin opened this issue · 10 comments

Do I understand correctly that EPOLLET is not used because Windows WSAPoll does not support it?

tezc commented

Yes, one reason is that.

More importantly, I don't see any benefit of using it(for my use-cases). I guess that flag is just for very specific applications. Why do you want to use it? Do you have a different use-case?

According to the docs https://man7.org/linux/man-pages/man7/epoll.7.html

When used as an edge-triggered interface, for performance
reasons, it is possible to add the file descriptor inside the
epoll interface (EPOLL_CTL_ADD) once by specifying
(EPOLLIN|EPOLLOUT).  This allows you to avoid continuously
switching between EPOLLIN and EPOLLOUT 
calling epoll_ctl() with EPOLL_CTL_MOD.

I assume it can be beneficial to have less kernel calls (less context switches / less lock contention).

I can see that at least Nginx uses this flag: https://github.com/nginx/nginx/blob/master/src/event/modules/ngx_epoll_module.c#L705

tezc commented

I remember I experimented with with EPOLLET years ago and decided it wasn't making performance any better(for my use-case). Also, it makes your app a bit complex.

Personally, I've never seen epoll_ctl() in flamegraphs. Maybe this is because of the types of app I worked on. Maybe it isn't the case with HTTP servers. I don't know.

It is not necessary that you will see epoll_ctl in flamegraphs, it makes more sense to look at the actual latency/throughput numbers.

What your use case is? I guess if it is about sending small messages across fast local network to a small number of receivers, then probably EPOLLET will not make any difference because pretty much always you must be able to send the full message right away without subscribing to EPOLLOUT at all. In case of http server it is often needed to stream large chunks of content over the slow internet to thousands of receivers and here potentially EPOLLET may make some difference (I assume otherwise this flag would be completely useless).

tezc commented

Yes, totally I agree. If you fill output buffer of the socket continously, it might be noticable. Just it’s never been the case for me. If you are sending small messages / you have big output buffers, if you have app level backpressure, if your app is already spending quite time somewhere else, then it won’t be noticable. Probably, these were my scenarios.

Ok, thanks for the inputs!

I'm going to play around this stuff, so far it seems like it is possible to add edge-triggered mode in a backward compatible way by adding a new flag to sc_sock_ev. For Linux and FreeBSD/MacOSX the implementation must be straightforward (use EPOLLET and EV_CLEAR flags respectively) and for Windows we could just emulate the needed behaviour (not for performance but for compatibility with the other platforms).

Are you willing to review/accept such a contribution if I do it?

tezc commented

Sure, it’d be great!

I started looking closely into implementation and have some questions: first of all why do we expand the events array (sc_sock_poll->events) to accommodate all the sockets? In my understanding at least epoll and kqueue assume that this array is just a small intermediate buffer for receiving the the events as a batch. Some articles even suggest to have this batch to be as small as one element in length: https://habr.com/en/post/600123/ . I don't think 1 element array would be sufficient for high load use case, but growing this array to thousands of sockets does not sound right as well. Why don't just use fixed size array (hard coded or configurable)?

tezc commented

We have to do it for WSAPoll if I remember correctly. For epoll or kqueue, it is not necessary but I wanted have all possible events in one go. If your array is small, you'll have to call epoll_wait() multiple times. We can use a fixed length array but I think its size should be >=1024. I usually use event loop iteration for batching. For example, assume you receive some data from network and you're supposed to write it to the disk. I read all network events (events that are reported in a single epoll_wait() call) first and then trigger a single disk write. So, if you read a few events in one iteration, obviously this is not going to work well. Making it configurable is not necessary I guess. Extra options are always confusing :) So, yeah I think we can use a hard coded smaller array if you want.

Please review #97