gfx-rs/wgpu

Validation error & crash on wgpu Vulkan + Windows

Opened this issue · 3 comments

Description
When running my app (https://github.com/ArthurBrussee/brush), training proceeds steadily for a while, until the app crashes. The symptoms are hard to pin down, and the crash happens fairly randomly. Just before the crash, the Vulkan validation layer spits out a bunch of errors about semaphores. Most tellingly, some semaphore value seems to be u64::MAX, which Vulkan trips over.
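
For context, here is a minimal sketch of why a value like u64::MAX would trip the validation layer, assuming the value ends up as a timeline semaphore signal value (the description above does not confirm how the value is actually used). The Vulkan spec requires a timeline signal value to be greater than the semaphore's current value, and to differ from it by no more than maxTimelineSemaphoreValueDifference, whose guaranteed minimum is 2^31 - 1. The helper and enum names are hypothetical and are not part of wgpu or Vulkan.

```rust
/// Minimum value the Vulkan spec guarantees for
/// `maxTimelineSemaphoreValueDifference` (2^31 - 1). Real devices may report more.
const MIN_GUARANTEED_MAX_DIFFERENCE: u64 = (1 << 31) - 1;

/// Outcome of the (hypothetical) check below.
#[derive(Debug, PartialEq)]
enum SignalCheck {
    Ok,
    NotGreaterThanCurrent,
    DifferenceTooLarge,
}

/// Hypothetical helper: checks a timeline semaphore signal value against two
/// spec rules. It only compares against the current value; the spec also
/// constrains the difference to pending wait/signal values, which is omitted here.
fn check_signal_value(current: u64, signal: u64, max_difference: u64) -> SignalCheck {
    if signal <= current {
        return SignalCheck::NotGreaterThanCurrent;
    }
    // Safe subtraction: signal > current is guaranteed at this point.
    if signal - current > max_difference {
        return SignalCheck::DifferenceTooLarge;
    }
    SignalCheck::Ok
}
```

Under these assumptions, `check_signal_value(41, 42, MIN_GUARANTEED_MAX_DIFFERENCE)` passes, while signalling u64::MAX from a small current value fails the difference check on a device that only reports the guaranteed minimum.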

This causes a device loss (possibly?) after which wgpu crashes because of #6378, I think.

I have not been able to reproduce this on Metal, not sure about Vulkan + Linux.

Extra materials

Log with validation errors
log.txt

Platform
wgpu (reproduces on trunk, 23.0, and 23.1), Windows 11, Vulkan, 4070 on driver 566.36.

Another issue I reported for an early wgpu 23 version might or might not be related: #6279. If nothing else the bisection there also pointed to some locking behaviour.

It also looks similar to #6323: a compute-heavy workload, and I am getting validation errors of the form

VUID-vkResetCommandPool-commandPool-00040(ERROR / SPEC): msgNum: -1254218959 - Validation Error: [ VUID-vkResetCommandPool-commandPool-00040 ] Object 0: handle = 0x282c846d320, name = (wgpu internal) Pre Pass, type = VK_OBJECT_TYPE_COMMAND_BUFFER; Object 1: handle = 0x282c8100750, type = VK_OBJECT_TYPE_COMMAND_POOL; | MessageID = 0xb53e2331 | vkResetCommandPool():  (VkCommandBuffer 0x282c846d320[(wgpu internal) Pre Pass]) is in use.
The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state (https://vulkan.lunarg.com/doc/view/1.3.296.0/windows/1.3-extensions/vkspec.html#VUID-vkResetCommandPool-commandPool-00040)
    Objects: 2
        [0] 0x282c846d320, type: 6, name: (wgpu internal) Pre Pass
        [1] 0x282c8100750, type: 25, name: NULL
VUID-vkResetCommandPool-commandPool-00040(ERROR / SPEC): msgNum: -1254218959 - Validation Error: [ VUID-vkResetCommandPool-commandPool-00040 ] Object 0: handle = 0x282c8473760, type = VK_OBJECT_TYPE_COMMAND_BUFFER; Object 1: handle = 0x282c8100750, type = VK_OBJECT_TYPE_COMMAND_POOL; | MessageID = 0xb53e2331 | vkResetCommandPool():  (VkCommandBuffer 0x282c8473760[]) is in use.
The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state (https://vulkan.lunarg.com/doc/view/1.3.296.0/windows/1.3-extensions/vkspec.html#VUID-vkResetCommandPool-commandPool-00040)
    Objects: 2
        [0] 0x282c8473760, type: 6, name: NULL
        [1] 0x282c8100750, type: 25, name: NULL
VUID-vkResetCommandPool-commandPool-00040(ERROR / SPEC): msgNum: -1254218959 - Validation Error: [ VUID-vkResetCommandPool-commandPool-00040 ] Object 0: handle = 0x282c8471e50, name = (wgpu internal) Transit, type = VK_OBJECT_TYPE_COMMAND_BUFFER; Object 1: handle = 0x282c8100750, type = VK_OBJECT_TYPE_COMMAND_POOL; | MessageID = 0xb53e2331 | vkResetCommandPool():  (VkCommandBuffer 0x282c8471e50[(wgpu internal) Transit]) is in use.
The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state (https://vulkan.lunarg.com/doc/view/1.3.296.0/windows/1.3-extensions/vkspec.html#VUID-vkResetCommandPool-commandPool-00040)
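
When a log contains many such lines, it can help to tally them by VUID and by the debug name of the offending object, to see which command buffers keep showing up. Below is a minimal sketch using only the Rust standard library. It assumes lines shaped like the ones above ("Validation Error: [ VUID ] ... name = ..., type = ..."); the function names are hypothetical, and other validation-layer versions may format messages differently.

```rust
use std::collections::BTreeMap;

/// Extracts the VUID from a validation-layer line, i.e. the text between
/// "Validation Error: [ " and the following " ]".
fn extract_vuid(line: &str) -> Option<&str> {
    let rest = line.split_once("Validation Error: [ ")?.1;
    let (vuid, _) = rest.split_once(" ]")?;
    Some(vuid)
}

/// Extracts the debug name of the first named object on the line
/// (the text between "name = " and the following ", type = ").
fn extract_object_name(line: &str) -> Option<&str> {
    let rest = line.split_once("name = ")?.1;
    let (name, _) = rest.split_once(", type = ")?;
    Some(name)
}

/// Counts errors per (VUID, object name). Lines without a debug name are
/// grouped under "<unnamed>"; lines that are not validation errors are skipped.
fn tally_validation_errors(log: &str) -> BTreeMap<(String, String), usize> {
    let mut counts = BTreeMap::new();
    for line in log.lines() {
        if let Some(vuid) = extract_vuid(line) {
            let name = extract_object_name(line).unwrap_or("<unnamed>");
            *counts.entry((vuid.to_string(), name.to_string())).or_insert(0) += 1;
        }
    }
    counts
}
```

Run over the excerpt above, this would group the three errors under the same VUID with the names "(wgpu internal) Pre Pass", "(wgpu internal) Transit", and "<unnamed>".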

If a single submission runs longer than 60s, you might see that. If that's not the case, I'm not sure what the issue is off the top of my head.

It's definitely not going over 60s; the amount of GPU work is on the order of ~100ms, and putting a submit() after every submit() call still crashes.

I've tried downgrading to 22.1.0, but it still seems to crash. I've also tried adding

wgpu-hal = { version = "22.0.0", features = [
    "device_lost_panic",
    "internal_error_panic",
    "oom_panic",
] }

But the stack trace is still

thread 'tokio-runtime-worker' panicked at C:\Users\A-Bru\.cargo\registry\src\index.crates.io-6f17d22bba15001f\wgpu-22.1.0\src\backend\wgpu_core.rs:2314:30:
Error in Queue::submit: Validation Error

Caused by:
  Parent device is lost

With a stacktrace pointing to wherever the last submit was, or other similar traces.

If you have any tips on what to try or how to investigate this, they would be much appreciated!