vsg-dev/VulkanSceneGraph

Corrupted image in some vsg examples on Windows 11 while fedora 38 looks good on the same hardware

Mikalai opened this issue · 22 comments

Describe the bug
Corrupted image in examples.
I made a video of the issue in vsgdynamicviews.exe: https://www.youtube.com/watch?v=i8O--1sKfcU
vsgviewer.exe works fine, though. The same issue also shows up in vsgcameras.exe. It may be a driver problem, because on Fedora these samples work fine. When I had an NVIDIA GeForce GTX 770 I don't recall such issues. I recently upgraded my hardware, switching to an AMD Radeon RX 6950 XT, and when I returned to VSG I noticed this behavior. I can run some tests to help find the root cause, but I don't know where to start.

To Reproduce
Steps to reproduce the behavior:
I've built the latest main branch, and some examples show a corrupted image.

Expected behavior
Behavior like in vsgviewer.exe

Screenshots
Recorded video
https://www.youtube.com/watch?v=i8O--1sKfcU

image

Desktop (please complete the following information):

  • Windows 11

I have just tried:

vsgwindows models/teapot.vsgt models/lz.vsgt -d 

Under Kubuntu 22.04 with an AMD 5700G, and it works correctly without reporting any Vulkan debug errors; the debug layer is enabled by the -d option on the command line. Could you try adding -d to your command line and see whether any errors are reported?

If there are no Vulkan debug layer errors reported and it works fine on other hardware and driver combinations, then it may well be a driver bug. At this point I think this is the most likely cause of the issue. Reverting to older drivers or trying newer ones would be a useful test.

vsgwindows models/teapot.vsgt models/lz.vsgt -d works fine
image

But vsgdynamicviews -d shows
image

But vsgdynamicloads works fine
image

vsgcameras is also corrupted
image

Maybe the issue is just with these samples?

If no errors are printed to the console when the Vulkan debug layer is enabled, then we have no indication of errors in the commands and data being passed to Vulkan. This doesn't guarantee that everything is correct with how Vulkan is being used, but it has proven to be a pretty good validation check.

At this point I think it's most likely a driver bug. My guess is that something is wrong with presentation of the swap chain image: the general shape of the image looks roughly right but is corrupted in an orderly way, as if writing to the framebuffer or presentation of the swap chain is messing up.

If there is a VSG issue causing these glitches, then it might be in the synchronization/timing of swapchain presentation, but I would expect that to cause problems on other hardware/drivers and to be picked up by the Vulkan debug layer. This is also a part of the VSG that has not changed recently, so it has been tested by lots of folks on lots of hardware/driver combinations. If there were a problem in this area, one would have expected the issue to have been raised already.

My recommendation is to look into bug reports w.r.t the hardware and Vulkan drivers on Windows. There isn't anything else I can suggest on the VSG side.

I had an outdated driver, but upgrading to the latest didn't help. I also upgraded the Vulkan SDK to the latest version and rebuilt everything, but no luck; the issue is still there. I don't know whether this helps, but I launched vsgdynamicviews in RenderDoc, and it showed that the first render pass completed fine, but as soon as the second render pass started the buffer became corrupted. There are no validation errors, though.
image
image
image
image

Could you try RenderDoc with the same settings on Fedora with this hardware to see if it picks up anything similar?

image

On Fedora it also shows undefined, and at first glance everything looks very similar except for the final result

image

I have built RenderDoc and will test things out. From the RenderDoc screenshots it looks like the assignment of the second RenderGraph/RenderPass/View is setting the contents of the framebuffer to undefined, even though the original RenderGraph/RenderPass/View has just run and set all of the framebuffer.

My best guess at this point is that there is a don't care in the RenderPass setup w.r.t. the previous values of the pixels, but the rendering covers just that viewport area, so some drivers are just fine while others treat the don't care as permission to invalidate the whole window. That would be the driver doing work it hasn't been asked to do, and extra work it doesn't need to do.

If this hunch is correct, then changing the default RenderPass configuration to load the previous values, and adding a clear to make sure the previous frame's results aren't seen, might solve it, but this adds extra work per render pass, which is crappy. The other approach would be to use a single RenderGraph/RenderPass with a clear at the beginning of each view. This would have a lower overhead as there are fewer pixels to clear, but it's still not as efficient as the current incarnation.
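
To make the hunch concrete, here is a toy model in plain C++. It makes no Vulkan or VSG calls: `LoadOp`, `DriverModel`, `Framebuffer` and the 8x4 two-view layout are all invented for illustration, and the "whole attachment" driver behaviour is only the guess described above, not something confirmed about any real driver.

```cpp
#include <cassert>
#include <vector>

// Stand-ins for the Vulkan attachment load ops; NOT the real VkAttachmentLoadOp enum.
enum class LoadOp { Load, Clear, DontCare };

// Hypothetical driver behaviours: how far a DontCare load op reaches.
enum class DriverModel { RenderAreaOnly, WholeAttachment };

struct Rect { int x, y, width, height; };

constexpr int kUndefined = -1; // marks pixels whose contents may be garbage

struct Framebuffer
{
    int width, height;
    std::vector<int> pixels;

    Framebuffer(int w, int h) : width(w), height(h), pixels(w * h, kUndefined) {}

    int& at(int x, int y)
    {
        assert(x >= 0 && x < width && y >= 0 && y < height);
        return pixels[y * width + x];
    }

    void fill(const Rect& r, int value)
    {
        for (int y = r.y; y < r.y + r.height; ++y)
            for (int x = r.x; x < r.x + r.width; ++x)
                at(x, y) = value;
    }
};

// Begin a render pass: apply the load op to the colour attachment.
void beginRenderPass(Framebuffer& fb, LoadOp op, const Rect& area, int clearValue, DriverModel driver)
{
    switch (op)
    {
    case LoadOp::Load:
        break; // previous contents are kept
    case LoadOp::Clear:
        fb.fill(area, clearValue);
        break;
    case LoadOp::DontCare:
    {
        // The guessed misbehaviour: the driver discards more than the render area.
        const Rect whole{0, 0, fb.width, fb.height};
        fb.fill(driver == DriverModel::WholeAttachment ? whole : area, kUndefined);
        break;
    }
    }
}

// Render two side-by-side views in two passes and count the pixels left undefined.
int undefinedAfterTwoViews(LoadOp secondPassLoadOp, DriverModel driver)
{
    Framebuffer fb(8, 4);
    const Rect left{0, 0, 4, 4};
    const Rect right{4, 0, 4, 4};

    beginRenderPass(fb, LoadOp::Clear, left, 0, driver);
    fb.fill(left, 1); // view 1 draws over its whole viewport

    beginRenderPass(fb, secondPassLoadOp, right, 0, driver);
    fb.fill(right, 2); // view 2 draws over its whole viewport

    int undefined = 0;
    for (int value : fb.pixels)
        if (value == kUndefined) ++undefined;
    return undefined;
}
```

In this model, `undefinedAfterTwoViews(LoadOp::DontCare, DriverModel::WholeAttachment)` leaves the 16 pixels of the first view undefined, while the same call with `DriverModel::RenderAreaOnly`, or with `LoadOp::Load` for the second pass, leaves none. That mirrors the two outcomes discussed here: the same commands look fine on one driver and corrupted on another, and a load-based second pass would sidestep the difference at the cost of extra work.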

@Mikalai, what is the best way to get RenderDoc to reproduce the results you have?

I downloaded it from https://renderdoc.org/, where there are builds for Linux and other systems. Then I just launched vsgdynamicviews and captured a frame. I can also attach frame captures from my Linux and Windows hosts. But when I tried to open the Linux capture on Windows it failed, so the reverse will probably fail too.

I have built the latest RenderDoc, stepped through, and captured something similar to yours. While I can see the UNDEFINED background on the second render pass in the Texture Viewer, the calls to Vulkan all look properly defined and the on-screen results for the frame are correct. It's just the Texture Viewer display, and that isn't the actual memory contents, as we see those when it runs on screen. So my RADV Renoir Vulkan driver doesn't have an issue; it's how RenderDoc presents what it thinks is a relevant representation.

My best guess given all this is that the driver is trying to make an inappropriate optimization and not storing the complete results of the first render pass in the framebuffer before it begins the next render pass. The Vulkan debug layer is happy with what the VSG is doing, most drivers are happy with what the VSG is doing, and I haven't come across undefined state that could introduce ambiguity.

Perhaps the Vulkan spec doesn't nail down what to do in this particular use case, and the AMD driver for your hardware takes an approach that breaks it, but it has never been flagged as a bug.

It finally uploaded my capture from Fedora, but I guess it's not needed any more.
myCapture.zip

I tried to ask a question on the AMD forum but couldn't: it rejects my post, asking me to fix something highlighted, but nothing is highlighted. So be it; one more reason to use neither Windows nor an AMD GPU :)

psi29a commented

Just to chime in here, as I'm currently helping on the vsgopenmw project and encountering weird issues: I validated first using vsgexamples as well. On Ubuntu Lunar (23.04) with an AMD 6900 XT, vsgdynamicviews seems to work.

libdrm-radeon1/lunar,now 2.4.114-1 amd64 [installed]
libdrm-radeon1-dbgsym/lunar,now 2.4.114-1 amd64 [installed]
radeontop/lunar,now 1.4-2 amd64 [installed]
xserver-xorg-video-amdgpu/lunar,now 23.0.0-1 amd64 [installed]
xserver-xorg-video-ati/lunar,now 1:19.1.0-3 amd64 [installed]
xserver-xorg-video-radeon/lunar,now 1:19.1.0-3 amd64 [installed]
xserver-xorg-video-radeon-dbgsym/lunar,now 1:19.1.0-3 amd64 [installed]
Linux Wintermute 6.2.0-26-generic Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 10 23:39:54 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

gnome-shell-screenshot-8qdomm

@Mikalai what versions are you running?

The dynamicviews example would do just fine with one render pass.
Do you mean one RenderGraph and multiple View, each with a clear before it? Or do you mean the current approach with multiple RenderGraph each with their own View but sharing the Window's RenderPass, like is done in the example right now?

I meant one RenderGraph. Having a RenderGraph / render pass per view isn't buying us much; there doesn't need to be any kind of barrier between the rendering commands for each view. Someone's got to do the clear; don't know if there's any difference between encoding it in the render pass definition vs using a clear command. There might be some advantage in setting the render area in the render pass.

With the current final layout of VK_IMAGE_LAYOUT_PRESENT_SRC_KHR, I don't think it is correct to use the window's render pass as a kind of default render pass. It would be good to define a "render atop" render pass that performs the clear operation, uses the proper color attachment layout etc.

It would be an interesting test to let each RenderGraph create its own RenderPass with settings that are appropriate to the stage in rendering and see how that works. However, this would require existing RenderGraphs to be adapted when a new one is created.

@psi29a I'm using the latest vsg master branch

Here is the vulkaninfo summary from Fedora and Windows


fedora vulkan summary

vulkaninfo --summary
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.243


Instance Extensions: count = 23
-------------------------------
VK_EXT_acquire_drm_display : extension revision 1
VK_EXT_acquire_xlib_display : extension revision 1
VK_EXT_debug_report : extension revision 10
VK_EXT_debug_utils : extension revision 2
VK_EXT_direct_mode_display : extension revision 1
VK_EXT_display_surface_counter : extension revision 1
VK_EXT_surface_maintenance1 : extension revision 1
VK_EXT_swapchain_colorspace : extension revision 4
VK_KHR_device_group_creation : extension revision 1
VK_KHR_display : extension revision 23
VK_KHR_external_fence_capabilities : extension revision 1
VK_KHR_external_memory_capabilities : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2 : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2 : extension revision 1
VK_KHR_portability_enumeration : extension revision 1
VK_KHR_surface : extension revision 25
VK_KHR_surface_protected_capabilities : extension revision 1
VK_KHR_wayland_surface : extension revision 6
VK_KHR_xcb_surface : extension revision 6
VK_KHR_xlib_surface : extension revision 6
VK_LUNARG_direct_driver_loading : extension revision 1

Instance Layers: count = 6
--------------------------
VK_LAYER_MESA_device_select Linux device selection layer 1.3.211 version 1
VK_LAYER_RENDERDOC_Capture Debugging capture layer for RenderDoc 1.3.131 version 28
VK_LAYER_VALVE_steam_fossilize_32 Steam Pipeline Caching Layer 1.3.207 version 1
VK_LAYER_VALVE_steam_fossilize_64 Steam Pipeline Caching Layer 1.3.207 version 1
VK_LAYER_VALVE_steam_overlay_32 Steam Overlay Layer 1.3.207 version 1
VK_LAYER_VALVE_steam_overlay_64 Steam Overlay Layer 1.3.207 version 1

Devices:
========
GPU0:
apiVersion = 1.3.246
driverVersion = 23.1.4
vendorID = 0x1002
deviceID = 0x73a5
deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
deviceName = AMD Radeon RX 6950 XT (RADV NAVI21)
driverID = DRIVER_ID_MESA_RADV
driverName = radv
driverInfo = Mesa 23.1.4
conformanceVersion = 1.3.0.0
deviceUUID = 00000000-0b00-0000-0000-000000000000
driverUUID = 414d442d-4d45-5341-2d44-525600000000
GPU1:
apiVersion = 1.3.246
driverVersion = 0.0.1
vendorID = 0x10005
deviceID = 0x0000
deviceType = PHYSICAL_DEVICE_TYPE_CPU
deviceName = llvmpipe (LLVM 16.0.6, 256 bits)
driverID = DRIVER_ID_MESA_LLVMPIPE
driverName = llvmpipe
driverInfo = Mesa 23.1.4 (LLVM 16.0.6)
conformanceVersion = 1.3.1.1
deviceUUID = 6d657361-3233-2e31-2e34-000000000000
driverUUID = 6c6c766d-7069-7065-5555-494400000000

 

windows:

vulkaninfo --summary
WARNING: [Loader Message] Code 0 : Layer name GalaxyOverlayVkLayer does not conform to naming standard (Policy #LLP_LAYER_3)
WARNING: [Loader Message] Code 0 : Layer name GalaxyOverlayVkLayer_VERBOSE does not conform to naming standard (Policy #LLP_LAYER_3)
WARNING: [Loader Message] Code 0 : Layer name GalaxyOverlayVkLayer_DEBUG does not conform to naming standard (Policy #LLP_LAYER_3)
ERROR: [Loader Message] Code 0 : loader_get_json: Failed to open JSON file D:\Epic Games\Launcher\Portal\Extras\Overlay\EOSOverlayVkLayer-Win32.json
ERROR: [Loader Message] Code 0 : loader_get_json: Failed to open JSON file D:\Epic Games\Launcher\Portal\Extras\Overlay\EOSOverlayVkLayer-Win64.json
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.243


Instance Extensions: count = 13
-------------------------------
VK_EXT_debug_report : extension revision 10
VK_EXT_debug_utils : extension revision 2
VK_EXT_swapchain_colorspace : extension revision 4
VK_KHR_device_group_creation : extension revision 1
VK_KHR_external_fence_capabilities : extension revision 1
VK_KHR_external_memory_capabilities : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2 : extension revision 1
VK_KHR_portability_enumeration : extension revision 1
VK_KHR_surface : extension revision 25
VK_KHR_win32_surface : extension revision 6
VK_LUNARG_direct_driver_loading : extension revision 1

Instance Layers: count = 17
---------------------------
GalaxyOverlayVkLayer Galaxy Overlay Vulkan Layer 1.1.73 version 1
GalaxyOverlayVkLayer_DEBUG Galaxy Overlay Vulkan Layer 1.1.73 version 1
GalaxyOverlayVkLayer_VERBOSE Galaxy Overlay Vulkan Layer 1.1.73 version 1
VK_LAYER_AMD_switchable_graphics AMD switchable graphics layer 1.3.250 version 1
VK_LAYER_KHRONOS_profiles Khronos Profiles layer 1.3.250 version 1
VK_LAYER_KHRONOS_shader_object Shader object layer 1.3.250 version 1
VK_LAYER_KHRONOS_synchronization2 Khronos Synchronization2 layer 1.3.250 version 1
VK_LAYER_KHRONOS_validation Khronos Validation Layer 1.3.250 version 1
VK_LAYER_LUNARG_api_dump LunarG API dump layer 1.3.250 version 2
VK_LAYER_LUNARG_gfxreconstruct GFXReconstruct Capture Layer Version 0.9.20 1.3.250 version 36884
VK_LAYER_LUNARG_monitor Execution Monitoring Layer 1.3.250 version 1
VK_LAYER_LUNARG_screenshot LunarG image capture layer 1.3.250 version 1
VK_LAYER_OBS_HOOK Open Broadcaster Software hook 1.3.216 version 1
VK_LAYER_RENDERDOC_Capture Debugging capture layer for RenderDoc 1.3.131 version 28
VK_LAYER_ROCKSTAR_GAMES_social_club Rockstar Games Social Club Layer 1.0.70 version 1
VK_LAYER_VALVE_steam_fossilize Steam Pipeline Caching Layer 1.3.207 version 1
VK_LAYER_VALVE_steam_overlay Steam Overlay Layer 1.3.207 version 1

Devices:
========
GPU0:
apiVersion = 1.3.250
driverVersion = 2.0.270
vendorID = 0x1002
deviceID = 0x73a5
deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
deviceName = AMD Radeon RX 6950 XT
driverID = DRIVER_ID_AMD_PROPRIETARY
driverName = AMD proprietary driver
driverInfo = 23.7.2 (AMD proprietary shader compiler)
conformanceVersion = 1.3.3.1
deviceUUID = 00000000-0b00-0000-0000-000000000000
driverUUID = 414d442d-5749-4e2d-4452-560000000000
Mikalaj

psi29a commented
psi29a@Wintermute:~$ vulkaninfo --summary
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.239


Instance Extensions: count = 21
-------------------------------
VK_EXT_acquire_drm_display             : extension revision 1
VK_EXT_acquire_xlib_display            : extension revision 1
VK_EXT_debug_report                    : extension revision 10
VK_EXT_debug_utils                     : extension revision 2
VK_EXT_direct_mode_display             : extension revision 1
VK_EXT_display_surface_counter         : extension revision 1
VK_EXT_swapchain_colorspace            : extension revision 4
VK_KHR_device_group_creation           : extension revision 1
VK_KHR_display                         : extension revision 23
VK_KHR_external_fence_capabilities     : extension revision 1
VK_KHR_external_memory_capabilities    : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2         : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2       : extension revision 1
VK_KHR_portability_enumeration         : extension revision 1
VK_KHR_surface                         : extension revision 25
VK_KHR_surface_protected_capabilities  : extension revision 1
VK_KHR_wayland_surface                 : extension revision 6
VK_KHR_xcb_surface                     : extension revision 6
VK_KHR_xlib_surface                    : extension revision 6

Instance Layers: count = 8
--------------------------
VK_LAYER_INTEL_nullhw             INTEL NULL HW                1.1.73   version 1
VK_LAYER_KHRONOS_validation       Khronos Validation Layer     1.3.239  version 1
VK_LAYER_MESA_device_select       Linux device selection layer 1.3.211  version 1
VK_LAYER_MESA_overlay             Mesa Overlay layer           1.3.211  version 1
VK_LAYER_VALVE_steam_fossilize_32 Steam Pipeline Caching Layer 1.3.207  version 1
VK_LAYER_VALVE_steam_fossilize_64 Steam Pipeline Caching Layer 1.3.207  version 1
VK_LAYER_VALVE_steam_overlay_32   Steam Overlay Layer          1.3.207  version 1
VK_LAYER_VALVE_steam_overlay_64   Steam Overlay Layer          1.3.207  version 1

Devices:
========
GPU0:
	apiVersion         = 1.3.238
	driverVersion      = 23.0.4
	vendorID           = 0x1002
	deviceID           = 0x73bf
	deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
	deviceName         = AMD Radeon RX 6900 XT (RADV NAVI21)
	driverID           = DRIVER_ID_MESA_RADV
	driverName         = radv
	driverInfo         = Mesa 23.0.4-0ubuntu1~23.04.1
	conformanceVersion = 1.3.0.0
	deviceUUID         = 00000000-0b00-0000-0000-000000000000
	driverUUID         = 414d442d-4d45-5341-2d44-525600000000
GPU1:
	apiVersion         = 1.3.238
	driverVersion      = 0.0.1
	vendorID           = 0x10005
	deviceID           = 0x0000
	deviceType         = PHYSICAL_DEVICE_TYPE_CPU
	deviceName         = llvmpipe (LLVM 15.0.7, 256 bits)
	driverID           = DRIVER_ID_MESA_LLVMPIPE
	driverName         = llvmpipe
	driverInfo         = Mesa 23.0.4-0ubuntu1~23.04.1 (LLVM 15.0.7)
	conformanceVersion = 1.3.1.1
	deviceUUID         = 6d657361-3233-2e30-2e34-2d3075627500
	driverUUID         = 6c6c766d-7069-7065-5555-494400000000

It turned out to be a driver issue. With

Devices:
========
GPU0:
        apiVersion         = 1.3.262
        driverVersion      = 2.0.283
        vendorID           = 0x1002
        deviceID           = 0x73a5
        deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
        deviceName         = AMD Radeon RX 6950 XT
        driverID           = DRIVER_ID_AMD_PROPRIETARY
        driverName         = AMD proprietary driver
        driverInfo         = 23.10.2 (AMD proprietary shader compiler)
        conformanceVersion = 1.3.3.1
        deviceUUID         = 00000000-0b00-0000-0000-000000000000
        driverUUID         = 414d442d-5749-4e2d-4452-560000000000

vsgdynamicviews works great

image
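
For anyone who wants to flag the affected driver programmatically, here is a minimal sketch. It assumes only the two data points from this thread: driverInfo 23.7.2 showed the corruption and 23.10.2 does not; which release in between actually contains the fix is not known. The helper names `parseVersion` and `isAtLeast` are made up for illustration, and the parsing only handles the AMD proprietary driverInfo format shown above (a Mesa string such as "Mesa 23.1.4" would make it throw).

```cpp
#include <sstream>
#include <string>
#include <vector>

// Parse the leading dotted version from a driverInfo string such as
// "23.10.2 (AMD proprietary shader compiler)" into {23, 10, 2}.
// std::stoi throws std::invalid_argument if a field is not numeric.
std::vector<int> parseVersion(const std::string& driverInfo)
{
    std::istringstream words(driverInfo);
    std::string version;
    words >> version; // first whitespace-delimited token, e.g. "23.10.2"

    std::vector<int> fields;
    std::istringstream parts(version);
    std::string field;
    while (std::getline(parts, field, '.'))
        fields.push_back(std::stoi(field));
    return fields;
}

// True if driverInfo reports a version at or above minimumVersion.
// std::vector compares lexicographically, field by field, as integers.
bool isAtLeast(const std::string& driverInfo, const std::string& minimumVersion)
{
    return parseVersion(driverInfo) >= parseVersion(minimumVersion);
}
```

A call such as `isAtLeast(driverInfo, "23.10.2")` would then separate the two drivers reported here. The fields are compared as integers rather than as text because a plain string comparison would rank "23.7.2" above "23.10.2".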

Great to hear a driver update fixed the issue.