TeamWisp/WispRenderer

VRAM usage is very high

Nielsbishere opened this issue · 1 comments

Is your request related to a problem? Please describe.
My 970M has only 2 GiB of VRAM, so it can't (properly) run emibl at 1080p, only at 720p, although it can run viknell at 1080p. It only runs at all because I have 8 GiB of shared RAM, so memory gets swapped in and out all the time. As a result, 1k runs at 10-15 fps, and 4k is probably far worse.

Describe the solution you'd like
Much of this is taken from my old branch https://github.com/TeamWisp/Procedural-Ray-Tracing/tree/feature_optimization.

  • Frame buffers:
    • G-Buffer
      • Use RG8 instead of RG16f for roughness and metallic (saves 2 Bpp, i.e. bytes per pixel)
      • Use normal compression so normals fit in RG16f instead of RGBA16f (saves 4 Bpp)
    • Remove the unused convolution and cubemap buffers (saves 24 Bpp)
    • Total: 30 Bpp saved (30 of 76.75 Bpp)
  • Textures:
    • Load HDR textures using RGBA16f instead of RGBA32f
    • Internally bake it to dds or load it with compression (always)
  • Models:
    • uv & pos can use ushort instead; 10 Bpv, i.e. bytes per vertex (saving 10 Bpv)
    • normal, tangent, bitangent can use unorm quat instead; 4 Bpv (saving 32 Bpv)
    • color can use uint instead; 4 Bpv (saving 8 Bpv)
    • Total: 18 Bpv instead of 68 Bpv
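
To make the ushort idea for uv & pos concrete, here is a minimal sketch of 16-bit quantization. It is an illustration rather than the code from the branch: the function names are hypothetical, positions are assumed to be normalized against a per-mesh bounding box, and uvs are assumed to lie in [0, 1] (tiling uvs outside that range would need their own min/max).

```python
import struct

U16_MAX = 0xFFFF  # largest value a ushort can hold


def quantize_unorm16(value, lo, hi):
    """Map a float in [lo, hi] to an integer code in [0, 65535]."""
    if hi <= lo:
        return 0  # degenerate range: every value shares one code
    t = (value - lo) / (hi - lo)
    t = min(max(t, 0.0), 1.0)  # clamp so out-of-range input cannot overflow
    return int(t * U16_MAX + 0.5)  # round to the nearest code


def dequantize_unorm16(code, lo, hi):
    """Inverse of quantize_unorm16, accurate to half a quantization step."""
    return lo + (code / U16_MAX) * (hi - lo)


def pack_pos_uv(pos, uv, bounds_min, bounds_max):
    """Pack position (3 x ushort) and uv (2 x ushort) into 10 bytes."""
    codes = [quantize_unorm16(pos[i], bounds_min[i], bounds_max[i]) for i in range(3)]
    codes += [quantize_unorm16(c, 0.0, 1.0) for c in uv]  # assumption: uv in [0, 1]
    return struct.pack("<5H", *codes)


def unpack_pos_uv(blob, bounds_min, bounds_max):
    """Recover the approximate float position and uv from a 10-byte record."""
    codes = struct.unpack("<5H", blob)
    pos = tuple(dequantize_unorm16(codes[i], bounds_min[i], bounds_max[i]) for i in range(3))
    uv = tuple(dequantize_unorm16(c, 0.0, 1.0) for c in codes[3:])
    return pos, uv
```

Each packed record is 10 bytes, matching the 10 Bpv figure above. The cost is precision: the worst-case position error is half a step, i.e. the bounding box extent divided by 131070 per axis, which is why the bounds have to be known before compressing.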

Describe alternatives you've considered
N/A; this is an optimization.

Additional context
Note: the following was measured after ad0a90b, which means we already use a lot less VRAM (roughly 40 Bpp less).
Constant usage:

  • 128 - 256 MiB are taken by models (64 per pool)
    • Vertex size is 56 bytes, so each pool holds approx. 1.2 million vertices (keep in mind that fragmentation gets worse when vertices aren't compressed, since fewer vertices fit in a pool).
      • Use vertex compression; this can reduce it to approx. 24 Bpv instead of 56, giving us 2.3x the vertices per pool (might be important for big scenes)
      • Compression might have to run in compute shaders, since there is A LOT of work to do: one compute shader should find the max values and a second should use those to compress the vertices. Doing this on the CPU with 1 thread or 8 (multithreaded) makes a lot less sense.
  • Textures take up A LOT in Emibl
    • One image already takes up 128 MiB (reason: DirectXTexHDR reads it as RGBA32f, while RGBA16f should be used)
    • Use texture compression (internally bake it to dds or process it)
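
The constant-usage numbers above can be checked with a few lines of arithmetic. This is only a sketch: the helper names are hypothetical, and the 4096x2048 image size is an assumption picked because it reproduces the 128 MiB figure (the actual image dimensions aren't stated here).

```python
MIB = 1024 * 1024


def vertices_per_pool(pool_bytes, bytes_per_vertex):
    """How many whole vertices fit in one model pool."""
    return pool_bytes // bytes_per_vertex


def texture_mib(width, height, bytes_per_texel):
    """Size of a single uncompressed mip level in MiB."""
    return width * height * bytes_per_texel / MIB


POOL_BYTES = 64 * MIB  # "64 per pool" from the list above

uncompressed = vertices_per_pool(POOL_BYTES, 56)  # current 56-byte vertex
compressed = vertices_per_pool(POOL_BYTES, 24)    # approx. 24 Bpv after compression

# Hypothetical 4096x2048 HDR image: RGBA32f is 16 bytes per texel, RGBA16f is 8.
rgba32f_mib = texture_mib(4096, 2048, 16)
rgba16f_mib = texture_mib(4096, 2048, 8)

if __name__ == "__main__":
    print(uncompressed, compressed, compressed / uncompressed)
    print(rgba32f_mib, rgba16f_mib)
```

Under these assumptions a 64 MiB pool holds about 1.2 million 56-byte vertices or about 2.8 million 24-byte vertices, and loading the image as RGBA16f halves it from 128 MiB to 64 MiB before any block compression.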

Variable usage:

  • Frame buffers
    • Individual
      • Deferred main: 20 Bpp (4 depth, 6 normal, 6 color, 2 roughness, 2 metallic)
      • Convolution & Cubemap take up: 24 Bpp (12 each; unused)
      • Deferred composition: 8 Bpp (color)
      • Post processing: 4 - 8 Bpp (color; depending on if you render to HDR backbuffer)
      • DoF CoC: 4 Bpp (circle of confusion)
      • DoF down scale: 4 Bpp (16 Bpp: 8 near color, 8 far color; but half res)
      • DoF dilate near: .25 Bpp (4 Bpp, but quarter res)
      • DoF dilate flatten: .25 Bpp (4 Bpp, but quarter res)
      • DoF dilate flatten 2: .25 Bpp (4 Bpp, but quarter res)
      • Bokeh: 4 Bpp (16 Bpp, but half res)
      • Bokeh filter: 4 Bpp (16 Bpp, but half res)
      • DoF post filter: 4 - 8 Bpp (color; depending on if you render to HDR backbuffer)
    • Per task
      • Deferred main: 20 Bpp
      • Deferred composition: 8 Bpp
      • Post processing: 4 - 8 Bpp
      • DoF: 12.75 - 16.75 Bpp
      • Bokeh: 8 Bpp
      • Unused: 24 Bpp
    • Total:
      • Non HDR: 76.75 Bpp
      • HDR: 84.75 Bpp
    • Meaning that (non HDR):
      • 1280x720 = 67 MiB
      • 1920x1080 = 152 MiB
      • 3840x2160 = 607 MiB
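
For reference, the totals above can be reproduced from the individual render targets. The sketch below uses hypothetical names, and it treats "half res" as half the width and half the height (a quarter of the pixels) and "quarter res" as a sixteenth of the pixels, which is the reading that matches the per-target numbers in the list.

```python
MIB = 1024 * 1024

# (name, bytes per pixel at the target's own resolution, downscale factor per axis)
TARGETS = [
    ("deferred main", 20, 1),
    ("convolution (unused)", 12, 1),
    ("cubemap (unused)", 12, 1),
    ("deferred composition", 8, 1),
    ("post processing", 4, 1),
    ("DoF CoC", 4, 1),
    ("DoF down scale", 16, 2),
    ("DoF dilate near", 4, 4),
    ("DoF dilate flatten", 4, 4),
    ("DoF dilate flatten 2", 4, 4),
    ("bokeh", 16, 2),
    ("bokeh filter", 16, 2),
    ("DoF post filter", 4, 1),
]

# Targets that grow from 4 to 8 Bpp when rendering to an HDR backbuffer.
HDR_TARGETS = {"post processing", "DoF post filter"}


def total_bpp(hdr=False):
    """Sum of all targets, expressed in bytes per full-resolution pixel."""
    total = 0.0
    for name, bpp, scale in TARGETS:
        if hdr and name in HDR_TARGETS:
            bpp = 8
        total += bpp / (scale * scale)  # downscaled targets cover fewer pixels
    return total


def frame_buffer_mib(width, height, hdr=False):
    """Frame buffer memory in MiB for a given output resolution."""
    return width * height * total_bpp(hdr) / MIB


if __name__ == "__main__":
    for w, h in [(1280, 720), (1920, 1080), (3840, 2160)]:
        print(f"{w}x{h}: {frame_buffer_mib(w, h):.0f} MiB")
```

With the entries as listed this gives 76.75 Bpp (84.75 Bpp with HDR), which rounds to 67, 152 and 607 MiB for the three resolutions. Removing entries from `TARGETS` is a quick way to estimate what a given change would save.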