kvark/blade

Sporadic GPU crashes on load

kvark opened this issue · 1 comments

kvark commented

Seeing them occasionally as VK_ERROR_DEVICE_LOST coming out of vkQueueSubmit.
Symptoms:

  • only happens on Sponza, which is a giant scene.
  • once it starts happening, it keeps happening. Once it works, it keeps working.
  • seems to be more likely when on battery?
  • seems to be more likely in debug mode?
  • often falls on the first or second frame rendered with UI.
  • when I needed to present this at the Rust Gamedev meetup, I disabled most of the UI rendering, and it seemed to improve things.
  • Markers from #38 always point to BLAS construction. There is only one BLAS, and it's giant.
  • All primitive/instance/geometry counts are well within limits. All device address alignments are good, too.

kvark commented

I think it's just a TDR (timeout detection and recovery) in the driver.

Story

I clear the asset cache and then try to load the big scene. The model is processed, and then served. This is where the BLAS is constructed. It schedules a bunch of transfers (for big meshes as well as textures that are loaded on the side), followed by the BLAS construction.

This is a giant BLAS, and constructing it on the GPU takes significant time. However, all the CPU threads are busy doing texture compression for the assets that haven't been cached yet, so the AMD power management can't allocate enough power for the GPU operations. On top of this, we are running on an integrated APU, which means the memory bandwidth is shared between the CPU and GPU operations. It's easy to starve the GPU while heavy-loading assets on many threads.

This is also affected by whether or not we run on battery, and by what other kind of rendering is requested (UI may need some texture updates as well, and there are other apps like WezTerm consuming GPU). The result is that the job takes too long and is considered to be hanging. The driver kills the job, and I get DEVICE_LOST. All of the textures in progress are dropped, meaning they will be converted again on the next run, repeating the cycle.

Workarounds

  1. There might be a way to configure TDR? Probably locally only, which isn't going to help other users.
  2. Mark texture loading to be dependent on the model being served. This would mean there are fewer (or no) things running during BLAS construction.
  3. Detect if the system has an integrated GPU and limit the number of worker threads more, e.g. 1/2 instead of 2/3 of the logical cores.
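
A minimal sketch of workaround 3, not the actual blade code. The names worker_thread_count and logical_cores are hypothetical, and the integrated_gpu flag is assumed to come from the backend (in Vulkan, for example, a physical device whose deviceType is VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU). Only the 1/2 and 2/3 ratios come from the list above.

```rust
use std::thread;

/// Hypothetical helper: picks how many asset worker threads to spawn.
/// Uses 1/2 of the logical cores when the GPU is integrated (shared
/// memory bandwidth and power budget) and 2/3 otherwise.
/// Always returns at least one thread.
pub fn worker_thread_count(logical_cores: usize, integrated_gpu: bool) -> usize {
    let count = if integrated_gpu {
        logical_cores / 2
    } else {
        logical_cores * 2 / 3
    };
    count.max(1)
}

/// Queries the logical core count from the standard library,
/// falling back to 1 if the platform cannot report it.
pub fn logical_cores() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}
```

Calling worker_thread_count(logical_cores(), true) on a 16-thread APU would give 8 workers instead of 10, leaving more CPU time and memory bandwidth for the GPU during BLAS construction. Whether that is enough to avoid the timeout is untested here.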