NVIDIA/egl-wayland

Swapchain Creation fails on Wayland (mutter) when nvidia is not the primary interface

tim-rex opened this issue · 6 comments

In a similar vein to #94, I am also seeing vkCreateSwapchainKHR fail with VK_ERROR_INITIALIZATION_FAILED.

I also notice that setting EGL_LOG_LEVEL=debug causes the following to be logged when the failure occurs, which may provide a clue:

libEGL debug: EGL user error 0x3004 (EGL_BAD_ATTRIBUTE) in eglGetPlatformDisplay

FWIW, I'm not explicitly calling eglGetPlatformDisplay or any other EGL functionality in this application.

However, unlike issue #94, this is not remedied by setting nvidia_drm modeset=1 as suggested in this comment (it is already enabled).

vkCreateSwapchainKHR is being called with the following createInfo struct:

sType:VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR
pNext:0x0
flags:0
surface:0xfd5b260000000001
minImageCount:3
imageFormat:VK_FORMAT_B8G8R8A8_UNORM
imageColorSpace:VK_COLOR_SPACE_SRGB_NONLINEAR_KHR
imageExtent: 800x600
imageArrayLayers:1
imageUsage:16
imageSharingMode:VK_SHARING_MODE_EXCLUSIVE
queueFamilyIndexCount:0
pQueueFamilyIndices:0x0
preTransform:VK_SURFACE_TRANSFORM_IDENTITY_BIT_KHR
compositeAlpha:VK_COMPOSITE_ALPHA_OPAQUE_BIT_KHR
presentMode:VK_PRESENT_MODE_MAILBOX_KHR
clipped:1
oldSwapchain:0x0
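The imageUsage field above is printed as a raw integer. As a rough aid for reading it, here is a small Python sketch that expands such a mask into flag names. The table only covers the low eight VkImageUsageFlagBits values from the Vulkan headers, and the helper name decode_image_usage is made up for this illustration.

```python
# Subset of VkImageUsageFlagBits (values as defined in the Vulkan headers).
IMAGE_USAGE_BITS = {
    0x01: "VK_IMAGE_USAGE_TRANSFER_SRC_BIT",
    0x02: "VK_IMAGE_USAGE_TRANSFER_DST_BIT",
    0x04: "VK_IMAGE_USAGE_SAMPLED_BIT",
    0x08: "VK_IMAGE_USAGE_STORAGE_BIT",
    0x10: "VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT",
    0x20: "VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT",
    0x40: "VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT",
    0x80: "VK_IMAGE_USAGE_INPUT_ATTACHMENT_BIT",
}


def decode_image_usage(mask):
    """Return the names of the known usage bits set in mask (hypothetical helper)."""
    names = [name for bit, name in sorted(IMAGE_USAGE_BITS.items()) if mask & bit]
    # Report any bits this small table does not know about as a hex remainder.
    leftover = mask & ~sum(IMAGE_USAGE_BITS)
    if leftover:
        names.append(hex(leftover))
    return names


if __name__ == "__main__":
    print(decode_image_usage(16))
```

For the value in this report, 16 is 0x10, so the swapchain images are requested as colour attachments only.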

Importantly:
I'm running a dual-GPU system with nvidia + amdgpu under GNOME Wayland.
This only seems to occur when GNOME is using amdgpu as the primary interface. Swapchain creation works when nvidia is the primary interface, or when it is the only interface in use.


Fedora Linux 39 (Workstation Edition)
Linux 6.5.11-300.fc39.x86_64
GNOME Version 45.1
NVIDIA driver version 535.129.03

Output of eglinfo attached
eglinfo.txt


Some interesting observations, probably unrelated.
When this occurs, WAYLAND_DEBUG emits the following:

[ 922868.934] wl_callback@60.done(7540)
[ 922868.946]  -> wl_display@1.sync(new id wl_callback@60)
[ 922869.098] wl_display@1.delete_id(60)
[ 922869.102] wl_drm@24.device("/dev/dri/renderD128")
[ 922869.108] wl_drm@24.format(808669761)
[ 922869.111] wl_drm@24.format(808669784)
[ 922869.116] wl_drm@24.format(808665665)
[ 922869.120] wl_drm@24.format(808665688)
[ 922869.124] wl_drm@24.format(875713089)
[ 922869.128] wl_drm@24.format(875713112)
[ 922869.132] wl_drm@24.format(909199186)
[ 922869.136] wl_drm@24.format(961959257)
[ 922869.139] wl_drm@24.format(825316697)
[ 922869.142] wl_drm@24.format(842093913)
[ 922869.145] wl_drm@24.format(909202777)
[ 922869.148] wl_drm@24.format(875713881)
[ 922869.151] wl_drm@24.format(842094158)
[ 922869.154] wl_drm@24.format(909203022)
[ 922869.157] wl_drm@24.format(1448695129)
[ 922869.160] wl_drm@24.capabilities(1)
[ 922869.163] wl_callback@60.done(7540)
libEGL debug: EGL user error 0x3004 (EGL_BAD_ATTRIBUTE) in eglGetPlatformDisplay
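The integers in the wl_drm format events are DRM fourcc codes: four ASCII characters packed into a little-endian 32-bit value. A minimal Python sketch for turning them back into readable tags (fourcc_to_str is a name invented here, not part of any library):

```python
import struct


def fourcc_to_str(code):
    """Unpack a 32-bit DRM fourcc code into its four-character tag."""
    # The first character lives in the least significant byte.
    return struct.pack("<I", code).decode("ascii")


if __name__ == "__main__":
    # A few of the format values from the WAYLAND_DEBUG log above.
    for code in (808669761, 875713112, 909199186, 842094158, 1448695129):
        print(code, fourcc_to_str(code))
```

For example, 808669761 decodes to AR30 and 875713112 to XR24 (DRM_FORMAT_XRGB8888), so the list is just the set of buffer formats the compositor advertises on that device.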

In particular, the reference to /dev/dri/renderD128 is confusing, as that is my AMD device, even though I am using an NVIDIA logical device in my Vulkan initialisation.

/dev/dri/by-path/pci-0000:01:00.0-card -> ../card1
/dev/dri/by-path/pci-0000:01:00.0-render -> ../renderD129
/dev/dri/by-path/pci-0000:02:00.0-card -> ../card0
/dev/dri/by-path/pci-0000:02:00.0-render -> ../renderD128

/sys/class/drm/renderD128/device/driver -> ../../../../bus/pci/drivers/amdgpu
/sys/class/drm/renderD129/device/driver -> ../../../../bus/pci/drivers/nvidia
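The mapping above can be collected in one go by walking sysfs. This is only a sketch: it assumes the usual Linux layout where /sys/class/drm/renderD*/device/driver is a symlink into the driver directory, and render_node_drivers is a hypothetical helper name.

```python
import os


def render_node_drivers(sys_drm="/sys/class/drm"):
    """Map each /dev/dri/renderD* node to the kernel driver bound to it."""
    result = {}
    if not os.path.isdir(sys_drm):
        return result  # No DRM sysfs tree on this machine.
    for entry in sorted(os.listdir(sys_drm)):
        if not entry.startswith("renderD"):
            continue
        driver_link = os.path.join(sys_drm, entry, "device", "driver")
        if os.path.islink(driver_link):
            # The symlink target ends in the driver name, e.g. .../drivers/amdgpu
            result["/dev/dri/" + entry] = os.path.basename(os.readlink(driver_link))
    return result


if __name__ == "__main__":
    for node, driver in render_node_drivers().items():
        print(node, "->", driver)
```

On the system described here this would be expected to report renderD128 as amdgpu and renderD129 as nvidia, matching the symlinks listed above.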

For the sake of experimentation, setting the following does allow me to proceed further:

DRI_PRIME=pci-0000_01_00_0 __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia

With these set, swapchain creation succeeds, but vkQueuePresentKHR ultimately fails with:

[ 439850.098]  -> zwp_linux_dmabuf_v1@52.create_params(new id zwp_linux_buffer_params_v1@44)
[ 439850.107]  -> zwp_linux_buffer_params_v1@44.add(fd 42, 0, 0, 3200, 50331648, 5234708)
[ 439850.111]  -> zwp_linux_buffer_params_v1@44.create_immed(new id wl_buffer@40, 800, 600, 875713112, 0)
[ 439850.115]  -> zwp_linux_buffer_params_v1@44.destroy()
[ 439850.121]  -> wl_surface@16.attach(wl_buffer@40, 0, 0)
[ 439850.124]  -> wl_surface@16.damage(0, 0, 800, 600)
[ 439850.127]  -> wl_surface@16.commit()
[ 439850.130]  -> wl_display@1.sync(new id wl_callback@36)
[ 439850.621] wl_display@1.error(nil, 7, "failed to import supplied dmabufs: Arguments are inconsistent (for example, a valid context requires buffers not supplied by a ")
[destroyed object]: error 7: failed to import supplied dmabufs: Arguments are inconsistent (for example, a valid context requires buffers not supplied by a 
Wayland bailed!! errno=71 : Protocol error
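One detail in the trace above may be worth decoding. In the linux-dmabuf protocol, zwp_linux_buffer_params_v1.add takes (fd, plane_idx, offset, stride, modifier_hi, modifier_lo), and the top 8 bits of the combined 64-bit DRM format modifier identify the vendor. The sketch below recombines the two halves; the vendor table is a small subset of drm_fourcc.h and the function names are invented for this example.

```python
# Subset of the DRM_FORMAT_MOD_VENDOR_* values from drm_fourcc.h.
DRM_FORMAT_MOD_VENDORS = {
    0x00: "NONE",
    0x01: "INTEL",
    0x02: "AMD",
    0x03: "NVIDIA",
}


def combine_modifier(hi, lo):
    """Join the two 32-bit halves sent over the wire into one 64-bit modifier."""
    return (hi << 32) | lo


def modifier_vendor(modifier):
    """Return the vendor name encoded in bits 56..63 of a DRM format modifier."""
    return DRM_FORMAT_MOD_VENDORS.get(modifier >> 56, "unknown")


if __name__ == "__main__":
    # modifier_hi and modifier_lo from the add() request in the log above.
    modifier = combine_modifier(50331648, 5234708)
    print(hex(modifier), modifier_vendor(modifier))
```

The values in the log combine to 0x03000000004fe014, whose vendor byte is 0x03 (NVIDIA). That is at least consistent with the compositor, running on amdgpu, being unable to import a buffer laid out with an NVIDIA-specific modifier, though the log alone does not prove that this is the cause.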

The only reference I can find for eglGetPlatformDisplay returning EGL_BAD_ATTRIBUTE is in the EGL_EXT_explicit_device extension notes:

If EGL_EXT_platform_device is supported, passing EGL_DEVICE_EXT as an attribute to eglGetPlatformDisplay(EGL_PLATFORM_DEVICE_EXT) generates EGL_BAD_ATTRIBUTE.

EGL_BAD_ATTRIBUTE is a generic error for any case where the implementation doesn't recognize an attribute enum. From the EGL spec, section 3.1:

EGL_BAD_ATTRIBUTE

An unrecognized attribute or attribute value was passed in an attribute list. Any command taking an attribute parameter or attribute list may generate this error.

Unfortunately, that doesn't tell us what's calling eglGetPlatformDisplay or what the offending attribute is...

I can pull on that thread.
Here's the call stack when eglGetPlatformDisplay gets called:

#0  eglGetPlatformDisplay (platform=12760, native_display=0x6564090, attrib_list=0x7fff89b93200) at /usr/src/debug/libglvnd-1.7.0-1.fc39.x86_64/src/EGL/libegl.c:409
#1  0x00007fff22e01fe0 in ProducerInit () from /lib64/libnvidia-vulkan-producer.so
#2  0x00007fff32a19872 in ?? () from /lib64/libnvidia-glcore.so.535.129.03
#3  0x00007fff32a43bbf in ?? () from /lib64/libnvidia-glcore.so.535.129.03
#4  0x00007fff32a67bdd in ?? () from /lib64/libnvidia-glcore.so.535.129.03
#5  0x00007fff88453b20 in ?? () from /lib64/libGLX_nvidia.so.0
#6  0x00007fff885d0fb7 in terminator_CreateSwapchainKHR (device=0x7fff85c5e430, pCreateInfo=0x7fff85c3f050, pAllocator=0x0, pSwapchain=0xa34520 <vulkan+6352>) at /vulkan-sdk/1.3.268.0/source/Vulkan-Loader/loader/wsi.c:499
#7  0x00007fff24f92b9d in DispatchCreateSwapchainKHR (device=device@entry=0x7fff85c5e430, pCreateInfo=pCreateInfo@entry=0x7fff89b93880, pAllocator=pAllocator@entry=0x0, pSwapchain=pSwapchain@entry=0xa34520 <vulkan+6352>)
    at /vulkan-sdk/1.3.268.0/source/Vulkan-ValidationLayers/layers/vulkan/generated/vk_safe_struct.h:4590
#8  0x00007fff24e79ab3 in vulkan_layer_chassis::CreateSwapchainKHR (device=0x7fff85c5e430, pCreateInfo=0x7fff89b93880, pAllocator=0x0, pSwapchain=0xa34520 <vulkan+6352>)
    at /vulkan-sdk/1.3.268.0/source/Vulkan-ValidationLayers/layers/vulkan/generated/chassis.cpp:5714
#9  0x00000000005a7bac in VulkanWrapper::createSwapChain (this=0xa32c50 <vulkan>, swapchainSupport=..., surfaceFormat=..., surfaceFormat2=..., preferredPresentMode=VK_PRESENT_MODE_MAILBOX_KHR) at ./common/vulkan_helper.cpp:1644
#10 0x00000000005940c4 in VulkanWrapper::createSwapChain (this=0xa32c50 <vulkan>, swapchainSupport=..., surfaceFormat=..., surfaceFormat2=...) at ./common/vulkan_helper.cpp:1471
#11 0x000000000056e454 in VulkanWrapper::initVulkan (this=0xa32c50 <vulkan>, hWnd=0x6569e70) at ./common/vulkan_helper.cpp:5612
#12 0x00000000004f7838 in processRenderEvents () at ./core/render.cpp:155
#13 0x000000000050312b in update_loop () at ./core/main.cpp:917
#14 0x0000000000503ab5 in main_loop (argc=0, argv=0x0) at ./core/main.cpp:1106
#15 0x000000000040f22d in main_loop_bootstrap () at ./platform/Linux/linux_main.cpp:803
#16 0x00007ffff77e4897 in start_thread (arg=<optimized out>) at pthread_create.c:444
#17 0x00007ffff786b6bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

The attrib_list at the call site is:

(gdb) p/x *attrib_list@3
$43 = {0x3352, 0x1, 0x3038}

That would seem to map to:

EGL_TRACK_REFERENCES_KHR, EGL_TRUE, EGL_NONE
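For anyone wanting to repeat the decoding of the gdb dump, here is a minimal Python sketch. It only knows the handful of EGL enum values relevant to this thread (taken from the EGL headers), and decode_attribs is a hypothetical helper, not an EGL API.

```python
EGL_NONE = 0x3038
EGL_PLATFORM_WAYLAND_KHR = 0x31D8

# Only the attribute names discussed in this issue.
EGL_ATTRIB_NAMES = {
    0x3352: "EGL_TRACK_REFERENCES_KHR",
    0x322C: "EGL_DEVICE_EXT",
}


def decode_attribs(words):
    """Split an EGL_NONE-terminated attribute list into (name, value) pairs."""
    pairs = []
    i = 0
    while i < len(words) and words[i] != EGL_NONE:
        key = words[i]
        # A well-formed list always has a value after each key.
        value = words[i + 1] if i + 1 < len(words) else None
        pairs.append((EGL_ATTRIB_NAMES.get(key, hex(key)), value))
        i += 2
    return pairs


if __name__ == "__main__":
    # Values printed by gdb for attrib_list, and the platform argument from frame #0.
    print(decode_attribs([0x3352, 0x1, 0x3038]))
    print(12760 == EGL_PLATFORM_WAYLAND_KHR)
```

The platform argument in the backtrace, 12760, is 0x31D8, i.e. EGL_PLATFORM_WAYLAND_KHR, so the call is requesting a Wayland platform display with display reference tracking enabled.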

EGL_KHR_display_references indicates

An EGL_BAD_ATTRIBUTE error is generated if the requested value for EGL_TRACK_REFERENCES_KHR is not supported.

Stepping through from there, EGL_BAD_ATTRIBUTE is generated from the following frame

(gdb) p/x *attrib_list@3
$2 = {0x3352, 0x1, 0x3038}
(gdb) frame
#0  _eglGetWaylandDisplay (native_display=0x6564090, attrib_list=0x7fff89b93200) at ../src/egl/main/egldisplay.c:535
535	      _eglError(EGL_BAD_ATTRIBUTE, "eglGetPlatformDisplay");

with the trace

(gdb) bt
#0  _eglGetWaylandDisplay (native_display=0x6564090, attrib_list=0x7fff89b93200) at ../src/egl/main/egldisplay.c:535
#1  0x00007ffff7bf4fd5 in GetPlatformDisplayCommon (platform=12760, native_display=0x6564090, attrib_list=0x7fff89b93200, funcName=0x7ffff7bfb2da "eglGetPlatformDisplay")
    at /usr/src/debug/libglvnd-1.7.0-1.fc39.x86_64/src/EGL/libegl.c:324
#2  0x00007fff22e01fe0 in ProducerInit () from /lib64/libnvidia-vulkan-producer.so
#3  0x00007fff32a19872 in ?? () from /lib64/libnvidia-glcore.so.535.129.03
#4  0x00007fff32a43bbf in ?? () from /lib64/libnvidia-glcore.so.535.129.03
#5  0x00007fff32a67bdd in ?? () from /lib64/libnvidia-glcore.so.535.129.03
#6  0x00007fff88453b20 in ?? () from /lib64/libGLX_nvidia.so.0
#7  0x00007fff885d0fb7 in terminator_CreateSwapchainKHR (device=0x7fff85c5f000, pCreateInfo=0x7fff85c3fc50, pAllocator=0x0, pSwapchain=0xa34520 <vulkan+6352>) at /vulkan-sdk/1.3.268.0/source/Vulkan-Loader/loader/wsi.c:499
#8  0x00007fff24f92b9d in DispatchCreateSwapchainKHR (device=device@entry=0x7fff85c5f000, pCreateInfo=pCreateInfo@entry=0x7fff89b93880, pAllocator=pAllocator@entry=0x0, pSwapchain=pSwapchain@entry=0xa34520 <vulkan+6352>)
    at /vulkan-sdk/1.3.268.0/source/Vulkan-ValidationLayers/layers/vulkan/generated/vk_safe_struct.h:4590
#9  0x00007fff24e79ab3 in vulkan_layer_chassis::CreateSwapchainKHR (device=0x7fff85c5f000, pCreateInfo=0x7fff89b93880, pAllocator=0x0, pSwapchain=0xa34520 <vulkan+6352>)
    at /vulkan-sdk/1.3.268.0/source/Vulkan-ValidationLayers/layers/vulkan/generated/chassis.cpp:5714
#10 0x00000000005a7bac in VulkanWrapper::createSwapChain (this=0xa32c50 <vulkan>, swapchainSupport=..., surfaceFormat=..., surfaceFormat2=..., preferredPresentMode=VK_PRESENT_MODE_MAILBOX_KHR) at ./common/vulkan_helper.cpp:1644
#11 0x00000000005940c4 in VulkanWrapper::createSwapChain (this=0xa32c50 <vulkan>, swapchainSupport=..., surfaceFormat=..., surfaceFormat2=...) at ./common/vulkan_helper.cpp:1471
#12 0x000000000056e454 in VulkanWrapper::initVulkan (this=0xa32c50 <vulkan>, hWnd=0x6569e70) at ./common/vulkan_helper.cpp:5612
#13 0x00000000004f7838 in processRenderEvents () at ./core/render.cpp:155
#14 0x000000000050312b in update_loop () at ./core/main.cpp:917
#15 0x0000000000503ab5 in main_loop (argc=0, argv=0x0) at ./core/main.cpp:1106
#16 0x000000000040f22d in main_loop_bootstrap () at ./platform/Linux/linux_main.cpp:803
#17 0x00007ffff77e4897 in start_thread (arg=<optimized out>) at pthread_create.c:444
#18 0x00007ffff786b6bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Our Vulkan Wayland WSI has been almost entirely rewritten for the 545 release, so it might be worth checking whether updating fixes the issue.

I'm unable to validate on my Fedora 39 setup at this time, but I've tried to reproduce on a freshly installed Arch Linux setup on the same machine using the latest 545.29.06 drivers, and I'm running into different (earlier) issues.

Raised #96.

I'll update here when I can confirm with newer drivers on F39.

I'm unable to validate this further at this time, as I'm currently blocked on #96.