Panic in GPU culler for bind group too large.
John-Nagle opened this issue · 11 comments
Internal panic in GPU culler when bind group is too large.
05:36:12 [ERROR] =========> Panic wgpu error: Validation Error
Caused by:
In Device::create_bind_group
note: label = `GpuCuller rend3_routine::pbr::material::PbrMaterial BG`
Buffer binding 4 range 2147483656 exceeds `max_*_buffer_binding_size` limit 2147483648
at file /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/backend/direct.rs, line 3111 in thread main.
Backtrace:
libcommon::common::commonutils::catch_panic::{{closure}}
at /home/john/projects/sl/SL-test-viewer/libcommon/src/common/commonutils.rs:215:25
wgpu::backend::direct::default_error_handler
at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/backend/direct.rs:3111:5
wgpu::backend::direct::ErrorSinkRaw::handle_error
at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/backend/direct.rs:3097:17
wgpu::backend::direct::Context::handle_error
at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/backend/direct.rs:333:9
<wgpu::backend::direct::Context as wgpu::context::Context>::device_create_bind_group
at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/backend/direct.rs:1107:13
<T as wgpu::context::DynContext>::device_create_bind_group
at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/context.rs:2308:13
wgpu::Device::create_bind_group
at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.18.0/src/lib.rs:2507:26
rend3_routine::culling::culler::GpuCuller::cull
at /home/john/.cargo/git/checkouts/rend3-e03f89403de3386a/9065f1e/rend3-routine/src/culling/culler.rs:613:26
rend3_routine::culling::culler::GpuCuller::add_culling_to_graph::{{closure}}
at /home/john/.cargo/git/checkouts/rend3-e03f89403de3386a/9065f1e/rend3-routine/src/culling/culler.rs:757:30
rend3::graph::graph::RenderGraph::execute
at /home/john/.cargo/git/checkouts/rend3-e03f89403de3386a/9065f1e/rend3/src/graph/graph.rs:501:17
Rend3 rev = "9065f1e".
Running out of GPU memory in mesh creation is now being properly reported to the application level, and the program continues to run. So that worked. Looks like there are other places where that limit can be hit.
Interesting to note this is only 8 bytes over the limit, I wonder if this is an off-by-a-smidge error.
I'm operating very close to the limit right now. I create meshes until I hit the bind group limit and get the mesh error. Then I put the failed request on hold. New requests continue to hit the limit, and they, too, get put on hold. There's a background task which manages levels of detail; it will take steps to reduce the memory pressure and redo the failed items, but that's only partly written and not working yet. Once it's all working, it will only hit the limit occasionally, and then it will back off.
So if something in the GPU culler needs some bind group space during rendering, it's likely to hit the limit.
There are two ways to go at this:

1) Bang into the limit, get an error return, and recover. This requires that all components be able to operate right up to the limit. That's the current implementation.
2) Provide info on how much of the resource is left, so the application can back off before hitting the limit.

The current choice is 1). I've figured out how to work with that, and it's going well.
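The hold-and-retry scheme in option 1) can be sketched in stdlib-only Rust. All names here are hypothetical, and a fixed byte budget stands in for the real allocation failure, which actually comes back from rend3/wgpu as an error:

```rust
// Sketch of option 1): create meshes until the limit, hold failures,
// let a background LOD pass shrink requests and retry them.
// `BUDGET` stands in for max_*_buffer_binding_size; all names are hypothetical.
const BUDGET: u64 = 2_147_483_648; // 2 GiB, the limit from the log above

struct MeshRequest { bytes: u64 }

struct Allocator { used: u64 }

impl Allocator {
    fn try_create(&mut self, req: &MeshRequest) -> Result<(), ()> {
        if self.used + req.bytes > BUDGET {
            Err(()) // in the real viewer this is the reported allocation error
        } else {
            self.used += req.bytes;
            Ok(())
        }
    }
}

fn main() {
    let mut alloc = Allocator { used: 0 };
    let mut on_hold: Vec<MeshRequest> = Vec::new();

    // Incoming requests: the second one overshoots the budget.
    for req in [MeshRequest { bytes: 2_000_000_000 }, MeshRequest { bytes: 500_000_000 }] {
        if alloc.try_create(&req).is_err() {
            on_hold.push(req); // put the failed request on hold
        }
    }

    // Background LOD pass: cut quality (here, halve the size) and retry.
    for mut req in on_hold.drain(..) {
        while alloc.try_create(&req).is_err() {
            req.bytes /= 2; // back off until it fits
        }
    }
    println!("used {} of {}", alloc.used, BUDGET); // prints "used 2125000000 of 2147483648"
}
```

The key property is the one described above: every component must tolerate operating right up against the limit, because the only signal is the failure itself.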
With 2), it's necessary to have reliable info about how much of the resource is left. This is apparently difficult. Fragmentation may be an issue. (Does bind group space get fragmented?) It's extremely difficult to get memory info out of wgpu and the levels below it, as I understand it. For Vulkan it's listed as a proposed enhancement. So, as I understand it, we're stuck with 1).
Somewhat related: at the 2147483648 limit, my own count of vertices is 37098544.
That's about 57.9 bytes per vertex. Reasonable?
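A quick sanity check of that ratio, using the two figures from the panic above:

```rust
fn main() {
    // Figures from the first panic in this issue.
    let limit = 2_147_483_648.0_f64; // max_*_buffer_binding_size (2 GiB)
    let vertices = 37_098_544.0_f64; // vertex count when the limit was hit
    println!("{:.1} bytes per vertex", limit / vertices); // prints "57.9 bytes per vertex"
}
```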
I'm getting this too. It happens randomly, and I don't believe I am ever near the bind group limit.
So this problem is caused by the result index buffer getting too large - if the total indices in the scene are greater than 2^27, you'll hit this problem. This is one pretty major disadvantage of the culling system as it stands, and I'm currently scheming on how to remove this limit. I can raise it to 2^28 pretty easily as there's currently an off-by-8-bytes situation. But I'm generally concerned about the limitations the culling system has, and the minimal performance benefits, so I may remove it in favor of other culling techniques.
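The numbers in both panics fit that explanation. Assuming (my inference from the 2^27 figure, not stated explicitly in the thread) a 16-byte slot per output index plus one 8-byte header, the arithmetic lands exactly 8 bytes over a power-of-two limit in both logs:

```rust
fn main() {
    // Both failing binding ranges from the logs sit 8 bytes above a power of two:
    let range_1: u64 = 2_147_483_656; // exceeds limit 2_147_483_648 (2^31)
    let range_2: u64 = 134_217_736;   // exceeds limit   134_217_728 (2^27)
    assert_eq!(range_1, (1u64 << 31) + 8);
    assert_eq!(range_2, (1u64 << 27) + 8);

    // With a 16-byte entry per output index plus an 8-byte header
    // (an assumption that fits both logs), 2^27 indices need
    // 2^27 * 16 + 8 = 2^31 + 8 bytes -- just past the 2 GiB binding limit.
    let indices: u64 = 1 << 27;
    println!("2^27 indices -> {} bytes", indices * 16 + 8); // prints "2^27 indices -> 2147483656 bytes"
}
```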
Sounds good. I've been able to rework things such that hitting the limit is now recoverable. It now tells the level of detail system to cut back on quality. But a higher ceiling would be nice.
I just built a version of Sharpview where this is a hard error that fails at startup every time, even on simple scenes. In addition, the rendered images have random triangles all over the place. These have been rare intermittent problems for months, but now I have a solid repro.
04:14:15 [ERROR] =========> Panic wgpu error: Validation Error
Caused by:
In Device::create_bind_group
note: label = GpuCuller rend3_routine::pbr::material::PbrMaterial BG
Buffer binding 4 range 134217736 exceeds `max_*_buffer_binding_size` limit 134217728
at file /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.19.1/src/backend/wgpu_core.rs, line 3009 in thread main.
Backtrace:
libcommon::common::commonutils::catch_panic::{{closure}}
at /home/john/projects/sl/SL-test-viewer/libcommon/src/common/commonutils.rs:215:25
wgpu::backend::wgpu_core::default_error_handler
at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.19.1/src/backend/wgpu_core.rs:3009:5
wgpu::backend::wgpu_core::ErrorSinkRaw::handle_error
at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.19.1/src/backend/wgpu_core.rs:2995:17
wgpu::backend::wgpu_core::ContextWgpuCore::handle_error
at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.19.1/src/backend/wgpu_core.rs:262:9
<wgpu::backend::wgpu_core::ContextWgpuCore as wgpu::context::Context>::device_create_bind_group
at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.19.1/src/backend/wgpu_core.rs:1043:13
<T as wgpu::context::DynContext>::device_create_bind_group
at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.19.1/src/context.rs:2236:13
wgpu::Device::create_bind_group
at /home/john/.cargo/registry/src/index.crates.io-6f17d22bba15001f/wgpu-0.19.1/src/lib.rs:2430:26
rend3_routine::culling::culler::GpuCuller::cull
at /home/john/.cargo/git/checkouts/rend3-issue570-7a55d7cece9b9b17/bafdc3b/rend3-routine/src/culling/culler.rs:614:26
rend3_routine::culling::culler::GpuCuller::add_culling_to_graph::{{closure}}
at /home/john/.cargo/git/checkouts/rend3-issue570-7a55d7cece9b9b17/bafdc3b/rend3-routine/src/culling/culler.rs:765:30
rend3::graph::graph::RenderGraph::execute
at /home/john/.cargo/git/checkouts/rend3-issue570-7a55d7cece9b9b17/bafdc3b/rend3/src/graph/graph.rs:503:17
This started failing after I changed some visibility of modules in mod.rs files. Didn't even change any code. So it may depend on memory layout. My own code is 100% safe Rust, so short of a compiler error, that shouldn't matter.
Saved the bad executable, did cargo clean, and rebuilt. Rebuilt version still fails in the same way. So it wasn't a transient bad compile.
This is a relatively simple test scene and is nowhere near the bind limit. I've tried logging into different places in Second Life and OSGrid, and all fail the same way.
Fails in both debug and release mode in the same way. Just slower in debug.
Closed by #593