Smithay/wayland-rs

wayland-server: wl_callback.done events not flushing properly

amshafer opened this issue · 2 comments

I'm seeing some odd behavior where wayland-server does not appear to flush events properly. My compositor calls callback.done for a number of callbacks, then later calls flush_clients. For some reason the client sees the wl_display@1.delete_id() events but not the wl_callback.done() events.

Server wayland debug output:

[3958819.810] wl_subsurface@22.set_desync()
[3958819.822] wl_subsurface@20.set_desync()
[3958820.614]  -> wl_buffer@32.release()
[3958821.426]  -> wl_buffer@36.release()
[3958821.478]  -> wl_callback@28.done(1425592996)
[3958821.493]  -> wl_display@1.delete_id(28)
[3958821.506]  -> wl_callback@37.done(1425592996)
[3958821.510]  -> wl_display@1.delete_id(37)
[3958821.519]  -> wl_callback@33.done(1425592996)
[3958821.523]  -> wl_display@1.delete_id(33)

Client wayland debug output:

[3958819.532]  -> wl_subsurface@22.set_desync()
[3958819.534]  -> wl_subsurface@20.set_desync()
[3958825.358] wl_display@1.delete_id(29)
[3958825.364] wl_display@1.delete_id(30)
[3958825.367] wl_display@1.delete_id(23)
[3958825.369] wl_display@1.delete_id(26)
[3958825.371] wl_display@1.delete_id(28)
[3958825.373] wl_display@1.delete_id(37)
[3958825.375] wl_display@1.delete_id(33)

The above are the last few messages in the output with weston-subsurfaces, which hangs at this point. It looks like the compositor sends the done event for 28, 37, and 33, but the client only sees those ids getting deleted. No amount of calling flush_clients causes the done events to show up on the client side, which seems like a bug?

My compositor uses the system libwayland through the server_system+dlopen features, but I've disabled that and still see this issue. flush_clients and callback.done locations here.

From digging into it myself, I noticed that even in system lib mode, calling wayland-server's callback.done function doesn't actually trigger a call to the system libwayland's wl_callback_send_done. Is this intentional? It seems like if I requested to use the system library, it should be routing callback.done through the system library.

What's really odd is that this only seems to happen with wl_callback. I've never had any other issues with wayland-rs's event delivery before, and while I'm sure my implementation has its issues, it looks to me like I'm calling done and flush_clients correctly. Any suggestions?

Thanks for the great bindings!

[...] doesn't actually trigger a call to the system libwayland's wl_callback_send_done.

This is 100% intentional. This part of libwayland's API is not really the "real" API of the library, but rather scanner-generated code that is bundled in libwayland. wayland-rs uses its own scanner for code generation, and so directly plugs in the lower-level API of libwayland.
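To illustrate the layering described above, here is a minimal toy model in Rust. It is not libwayland or wayland-rs code: `ToyResource`, `post_event`, and `callback_send_done` are hypothetical names invented for this sketch. It only shows that a generated convenience wrapper and a direct call into the generic event-posting path can produce the same event, assuming the wrapper does nothing beyond forwarding the opcode and arguments.

```rust
/// Hypothetical stand-in for a server-side resource; it just records
/// every (opcode, args) pair that was posted to it.
struct ToyResource {
    id: u32,
    sent: Vec<(u32, Vec<u32>)>,
}

/// Models the generic, lower-level entry point: post an event by opcode.
fn post_event(res: &mut ToyResource, opcode: u32, args: &[u32]) {
    res.sent.push((opcode, args.to_vec()));
}

/// wl_callback has a single event, `done`, which has opcode 0.
const CALLBACK_DONE: u32 = 0;

/// Models a scanner-generated convenience wrapper (assumed to be a
/// thin forwarder to the generic entry point, nothing more).
fn callback_send_done(res: &mut ToyResource, callback_data: u32) {
    post_event(res, CALLBACK_DONE, &[callback_data]);
}

fn main() {
    let mut via_wrapper = ToyResource { id: 28, sent: Vec::new() };
    let mut via_generic = ToyResource { id: 28, sent: Vec::new() };

    // Path 1: go through the generated wrapper.
    callback_send_done(&mut via_wrapper, 1425592996);
    // Path 2: call the generic entry point directly, skipping the wrapper.
    post_event(&mut via_generic, CALLBACK_DONE, &[1425592996]);

    println!("wl_callback@{} via wrapper: {:?}", via_wrapper.id, via_wrapper.sent);
    println!("wl_callback@{} via generic: {:?}", via_generic.id, via_generic.sent);
}
```

Under that assumption, never seeing the wrapper called says nothing about whether the event was sent.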

Now, this whole thing seems like a potential red herring: if libwayland-server logs that it sends the events, then it does send them. I suspect this is a symptom of some other issue. If the client uses multiple event queues, it may process events out of order (and thus log them out of order). delete_id events are special though: while all other events are internally enqueued by libwayland-client, delete_id events are processed (and logged) as soon as they are read from the socket. So it is not rare to see delete_id events in the client log before events that were actually sent earlier by the server.
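The ordering effect described above can be reproduced with a small toy model. This is a sketch only, not libwayland-client code: `WireEvent` and `ToyClient` are hypothetical types, and `log` merely stands in for the client's debug output. It assumes, as stated above, that delete_id is handled at read time while other events wait in a queue until they are dispatched.

```rust
use std::collections::VecDeque;

/// An event as read off the socket, in the order the server sent it.
#[derive(Debug, Clone, PartialEq)]
enum WireEvent {
    /// wl_callback@id.done(data)
    CallbackDone { id: u32, data: u32 },
    /// wl_display@1.delete_id(id)
    DeleteId { id: u32 },
}

/// Toy client; `log` stands in for the client-side debug output.
#[derive(Default)]
struct ToyClient {
    queue: VecDeque<WireEvent>,
    log: Vec<String>,
}

impl ToyClient {
    /// Read events from the wire: delete_id is handled (and logged)
    /// immediately, everything else is only enqueued.
    fn read_events(&mut self, wire: &[WireEvent]) {
        for ev in wire {
            match ev {
                WireEvent::DeleteId { id } => {
                    self.log.push(format!("wl_display@1.delete_id({id})"));
                }
                other => self.queue.push_back(other.clone()),
            }
        }
    }

    /// Dispatch queued events; they are logged only at this point.
    fn dispatch_pending(&mut self) {
        while let Some(ev) = self.queue.pop_front() {
            if let WireEvent::CallbackDone { id, data } = ev {
                self.log.push(format!("wl_callback@{id}.done({data})"));
            }
        }
    }
}

fn main() {
    // Server order: done(28) first, then delete_id(28).
    let wire = [
        WireEvent::CallbackDone { id: 28, data: 1425592996 },
        WireEvent::DeleteId { id: 28 },
    ];
    let mut client = ToyClient::default();

    client.read_events(&wire);
    println!("after read:     {:?}", client.log);

    client.dispatch_pending();
    println!("after dispatch: {:?}", client.log);
}
```

If the thread that owns the queue never calls `dispatch_pending` (for example because it is blocked elsewhere), the log in this model ends with only the delete_id lines, which matches the shape of the client output in the report.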

You mentioned weston-subsurfaces, this program uses OpenGL to draw part of its content, and the GL drivers use their own internal event queue. So my guess would be that the app freezes at some point inside the GL code (maybe waiting for another resource to be released?). One way to confirm this would be to spin up a debugger and see where exactly the app ends up blocked.

Or similarly, does the issue also occur with non-GL apps?

wayland-rs uses its own scanner for code generation, and so directly plugs in the lower-level API of libwayland

Thanks, that makes sense.

You mentioned weston-subsurfaces, this program uses OpenGL to draw part of its content, and the GL drivers use their own internal event queue.

Good point, I think something related to this is going wrong. It turns out I can reproduce this with sway as well, so I'm closing this since it's not a wayland-rs issue.