Smithay/wayland-rs

E2BIG error when sending too fast on non-blocking fd leading to queue.flush() being unusable.

feschber opened this issue · 8 comments

https://github.com/Smithay/wayland-rs/blob/8581b9d298e3bfc872a598380a09eb386a9db408/wayland-backend/src/rs/socket.rs#L188C18-L188C18

I'm running into an issue where I get E2BIG errors when sending too fast using non-blocking IO.
Through debugging I figured out it must have to do with the above line.

I think what's happening is that the output buffer fills up because of the fast IO requests, so I run into WouldBlock errors here.

However, instead of returning WouldBlock, a second write is attempted and an E2BIG error is returned.

For my use case this leads to guard.last_error being set, and thus subsequent calls to queue.flush() fail.

This error apparently never gets cleared again, rendering queue.flush() unusable.
The result is that the sent events arrive at the compositor in large chunks (presumably the size of the output buffer) after encountering an E2BIG error once.

Do you have a reproducing example?

As far as I can tell E2BIG means something very different than EWOULDBLOCK, and probably should not be just ignored, so having this error raised is a hint of a deeper problem, and likely not just the buffer filling up.

This is pretty puzzling however because I can't find any documentation of when and why sendmsg would error with E2BIG, so this is probably going to be "fun" to figure out.

I think the issue is that EWOULDBLOCK results in a second call to attempt_write_message.
However, this should only be attempted when flush succeeds (which it does not in the case of EWOULDBLOCK).

I will try to write a minimal reproducible example.

And the E2BIG is not a result of sendmsg, it is directly set in the code I linked.

Oh right, sorry, I read that too quickly. My bad

Okay, I get what happens.

Once both the buffer of the Unix socket and the internal outgoing buffers of the backend are full, we cannot actually write the request anywhere, and hence we generate a fatal error, because there is not really anything better to do given that the API cannot handle "sending a message" blocking.

Note that libwayland-client has the same behavior, as can be seen here: https://gitlab.freedesktop.org/wayland/wayland/-/blob/main/src/wayland-client.c?ref_type=heads#L891-894 . In the same situation, wl_closure_send will return an error, and this is treated as a fatal error.
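For readers following along, the failure mode described above can be modelled with a small toy sketch. This is not the wayland-backend code: `ToySocket`, its fields, and the buffer sizes are hypothetical, and the Linux value 7 for E2BIG is hard-coded for illustration. It only shows how "flush once, ignore WouldBlock, retry" turns into a hard error once both buffers are full.

```rust
use std::io;

/// Linux errno value for E2BIG (assumption: illustration only).
const E2BIG: i32 = 7;

/// Toy stand-in for a non-blocking socket with a userspace outgoing buffer.
struct ToySocket {
    kernel_free: usize, // bytes the simulated kernel buffer still accepts
    out_buf: Vec<u8>,   // userspace outgoing buffer
    out_cap: usize,     // capacity of the userspace buffer
}

impl ToySocket {
    fn new(kernel_free: usize, out_cap: usize) -> Self {
        Self { kernel_free, out_buf: Vec::new(), out_cap }
    }

    /// Move as many buffered bytes as possible into the simulated kernel
    /// buffer; report WouldBlock if some bytes could not be written.
    fn flush(&mut self) -> io::Result<()> {
        let n = self.out_buf.len().min(self.kernel_free);
        self.out_buf.drain(..n);
        self.kernel_free -= n;
        if self.out_buf.is_empty() {
            Ok(())
        } else {
            Err(io::ErrorKind::WouldBlock.into())
        }
    }

    /// Queue one message. If it does not fit, flush once and retry.
    fn send(&mut self, msg: &[u8]) -> io::Result<()> {
        if self.out_buf.len() + msg.len() > self.out_cap {
            match self.flush() {
                Ok(()) => {}
                // A partial flush may still have freed room, so keep going.
                Err(e) if e.kind() == io::ErrorKind::WouldBlock => {}
                Err(e) => return Err(e),
            }
            if self.out_buf.len() + msg.len() > self.out_cap {
                // Neither buffer has room left: surface a hard error.
                return Err(io::Error::from_raw_os_error(E2BIG));
            }
        }
        self.out_buf.extend_from_slice(msg);
        Ok(())
    }
}
```

With a 40-byte simulated kernel buffer and a 40-byte userspace buffer, four 20-byte messages are accepted and the fifth fails with the E2BIG code, because the retry after the blocked flush still finds no room.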

The main question I have now would be: what are you actually trying to do that results in so much traffic over the socket? Having all the buffers fill up is not something I'd expect to happen unless the other side is unresponsive.

Yeah, that seems to be what's happening.

My use case is the following:

I'm writing a mouse sharing software (Lan Mouse), which uses the zwlr_virtual_pointer_v1 protocol to emulate mouse events.

Now if I reload sway (compositor) while receiving a lot of mouse events over the network, a whole bunch of mouse events (which are received via UDP) queue up and then get dispatched all at once (because the wayland socket is not writable while sway is being reloaded).
The same thing can happen in rare cases, when the network has some delay and suddenly sends a bunch of packets at once.
The result is the above described issue.

Now I would like to handle this case - when no space is left in the output buffer.
However it does not seem to be possible atm because of the described issue, which leaves the connection in a permanently broken state (last_error == E2BIG).

Okay, that seems like a pretty similar context to this libwayland issue (https://gitlab.freedesktop.org/wayland/wayland/-/merge_requests/188), which may eventually bring a libwayland solution to that.

Given wayland-backend is meant to be compatible with libwayland, I try to match its behavior on this kind of thing as closely as possible: I don't want to have downstream programs completely break after enabling the _system cargo features (which can happen behind your back through dependencies). So my default position is to see what libwayland does and follow its behavior.

Regarding your case, a workaround for now would be to monitor the behavior of flush() manually, and throttle your sending whenever it returns a WouldBlock (for example by monitoring the Wayland socket for write-readiness). The internal buffer is 4096 bytes, while a virtual_pointer.motion message would be 20 bytes, so for example you could make a manual flush() call every 100 requests sent and throttle yourself if the method returns WouldBlock until you can actually flush.
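A minimal sketch of that workaround, under stated assumptions: with a 4096-byte buffer and 20-byte messages, roughly 204 messages fit (4096 / 20 = 204.8), so flushing every 100 leaves a comfortable margin. The `Conn` trait, `Throttled`, and `FLUSH_EVERY` below are hypothetical names standing in for whatever wraps the real connection; they are not part of wayland-rs.

```rust
use std::io;

/// Hypothetical abstraction over the connection (not a wayland-rs type).
trait Conn {
    fn send_motion(&mut self, dx: f64, dy: f64) -> io::Result<()>;
    fn flush(&mut self) -> io::Result<()>;
}

/// Flush after this many requests, as suggested in the thread.
const FLUSH_EVERY: usize = 100;

struct Throttled<C: Conn> {
    conn: C,
    since_flush: usize, // requests sent since the last successful flush
}

impl<C: Conn> Throttled<C> {
    fn new(conn: C) -> Self {
        Self { conn, since_flush: 0 }
    }

    /// Send one motion request. Returns Ok(true) if the caller may keep
    /// sending, or Ok(false) if it should wait for write-readiness first.
    fn send_motion(&mut self, dx: f64, dy: f64) -> io::Result<bool> {
        self.conn.send_motion(dx, dy)?;
        self.since_flush += 1;
        if self.since_flush < FLUSH_EVERY {
            return Ok(true);
        }
        self.try_flush()
    }

    /// Attempt a flush. WouldBlock is reported as Ok(false), not an error.
    fn try_flush(&mut self) -> io::Result<bool> {
        match self.conn.flush() {
            Ok(()) => {
                self.since_flush = 0;
                Ok(true)
            }
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => Ok(false),
            Err(e) => Err(e),
        }
    }
}
```

When `send_motion` returns `Ok(false)`, the caller would stop sending, wait until the socket is reported writable (for example through its event loop), and call `try_flush` again before resuming.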

Okay, I understand the issue. What I will do is store the status of the last flush and, if it failed, try to flush again before dispatching a new event, so I can discard the event if that flush() fails again.
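One way this plan could look, as a rough sketch only: `DropOnBackpressure` and its `flush` and `send` closure parameters are hypothetical hooks standing in for the real connection calls, and flushing after every event is a simplification chosen for the example.

```rust
use std::io;

/// Remembers whether the previous flush completed, and counts dropped events.
struct DropOnBackpressure {
    last_flush_ok: bool,
    dropped: u64,
}

impl DropOnBackpressure {
    fn new() -> Self {
        Self { last_flush_ok: true, dropped: 0 }
    }

    /// Dispatch one event. Returns Ok(true) if it was sent and Ok(false) if
    /// it was discarded because the socket is still not writable.
    fn dispatch(
        &mut self,
        mut flush: impl FnMut() -> io::Result<()>,
        send: impl FnOnce() -> io::Result<()>,
    ) -> io::Result<bool> {
        if !self.last_flush_ok {
            // The previous flush blocked: retry it before queuing more data.
            self.last_flush_ok = Self::flush_ok(flush())?;
            if !self.last_flush_ok {
                self.dropped += 1;
                return Ok(false);
            }
        }
        send()?;
        self.last_flush_ok = Self::flush_ok(flush())?;
        Ok(true)
    }

    /// Map WouldBlock to Ok(false); pass other errors through unchanged.
    fn flush_ok(res: io::Result<()>) -> io::Result<bool> {
        match res {
            Ok(()) => Ok(true),
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => Ok(false),
            Err(e) => Err(e),
        }
    }
}
```

Dropping stale pointer motion is a reasonable trade-off for this use case because relative mouse events lose their value quickly; a client that cannot afford to lose requests would need to queue them instead.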

Thanks a lot for your help!