ggerganov/llama.cpp

Description of "-t N" option for server is inaccurate

tigran123 opened this issue · 1 comment

The documentation for the server says that the -t N option is not used if model layers are offloaded to the GPU. However, when only some layers are offloaded to the GPU, I still see the CPU load grow to 1200% with the -t 12 option during inference, while the GPU load is very small and comes in short bursts of up to 10% or so. By contrast, if the model is small enough that ALL layers can be offloaded to the GPU, the CPU load does not exceed 100% and the GPU load reaches 100%.
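For illustration, the two situations roughly correspond to server invocations along these lines (the model file names and -ngl layer counts here are made up for the example; only -t 12 matches my setup):

```sh
# Partial offload: only some of the model's layers fit on the GPU (numbers illustrative).
# The remaining layers run on the CPU, so -t 12 drives CPU load up to ~1200% during inference.
./server -m big-model.gguf -ngl 20 -t 12

# Full offload: -ngl set high enough to cover every layer of a small model.
# Inference runs almost entirely on the GPU, and CPU load stays around 100% despite -t 12.
./server -m small-model.gguf -ngl 99 -t 12
```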

So, my guess is that the documentation is supposed to say "the -t N option is not used when ALL layers are offloaded to the GPU", right?

You're right, the documentation is wrong, see #7362.