city96/ComfyUI-GGUF

No performance difference between original BF16 and Q4_0 quantized GGUF models


The original 23.8 GB flux1-dev model runs at around the same speed as the 6.8 GB Q4_0 quant, which should fit completely into my 12 GB of VRAM.

This is my workflow:
workflow.json

My GPU is an RX 6700 XT and I'm using ROCm on Ubuntu, so far without any problems.

I hope someone can help me or at least explain why this is happening.
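For reference, a minimal sketch to put numbers on "around the same speed" (here `run_generation` is a hypothetical stand-in for whatever triggers one full generation, e.g. a function that posts the workflow to the ComfyUI API and waits for the result):

```python
import time
import statistics

def benchmark(run_generation, n=3):
    # run_generation is a hypothetical placeholder: any zero-argument
    # callable that performs one complete image generation.
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        run_generation()
        samples.append(time.perf_counter() - start)
    mean = statistics.mean(samples)
    spread = statistics.stdev(samples) if n > 1 else 0.0
    print(f"{mean:.1f} s ± {spread:.1f} s per image over {n} runs")
```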

"should fit completely into my 12 GB of vram."

Well, is it actually fitting into your VRAM? Have you checked with Task Manager while generating?
A Q4_0 GGUF model is still over 11 GB. To fit it all into VRAM, you will need to force the T5 text encoder to run on the CPU, and avoid having lots of browser tabs etc. open while generating.
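One way to check from Python rather than eyeballing Task Manager, as a sketch (on ROCm builds of PyTorch the `torch.cuda` namespace maps to HIP, so this should also work on the 6700 XT):

```python
import torch

def report_vram(tag=""):
    # torch.cuda works on ROCm builds of PyTorch as well (it maps to HIP).
    if not torch.cuda.is_available():
        print("No GPU visible to PyTorch")
        return
    free, total = torch.cuda.mem_get_info()    # device-wide, in bytes
    allocated = torch.cuda.memory_allocated()  # tensors currently held
    reserved = torch.cuda.memory_reserved()    # caching allocator pool
    print(f"[{tag}] allocated {allocated / 2**30:.2f} GiB, "
          f"reserved {reserved / 2**30:.2f} GiB, "
          f"free {free / 2**30:.2f} / {total / 2**30:.2f} GiB")
```

If the model plus activations does not actually fit, ComfyUI falls back to partially offloading it, which would produce exactly the kind of identical speeds described here.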

Q4_0 is 6.8 GB, as I stated in my post. The text encoder is already running on the CPU via force clip device.

I have the same performance issues even when the model fits entirely in VRAM. The Task Manager graph shows that the CUDA cores are not fully utilized during generation.
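A sketch for logging utilization over time instead of eyeballing the graph (this assumes an NVIDIA GPU with the `nvidia-ml-py` package installed; on AMD/ROCm, `rocm-smi` reports similar numbers):

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

# Sample once per second while a generation runs in another process.
for _ in range(60):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {util.gpu:3d}% | VRAM {mem.used / 2**30:.2f} / "
          f"{mem.total / 2**30:.2f} GiB")
    time.sleep(1)

pynvml.nvmlShutdown()
```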