Description of "-t N" option for server is inaccurate
tigran123 opened this issue · 1 comments
tigran123 commented
The documentation for the server says that the -t N option is not used if model layers are offloaded to the GPU. However, when only some layers are offloaded to the GPU, I still see the CPU load grow to 1200% with -t 12 during inference, while the GPU load stays very small, occurring in short bursts of up to 10% or so. But if the model is small enough that ALL layers can be offloaded to the GPU, the CPU load does not exceed 100% and the GPU load reaches 100%.
So my guess is that the documentation should say "-t N option not used when ALL layers are offloaded to GPU", right?
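For illustration, a sketch of the two scenarios described above, using the llama.cpp server's -ngl (--n-gpu-layers) and -t flags; the binary name, model path, and layer counts are assumptions for the example:

```shell
# Full offload: all layers on the GPU (a large -ngl value caps at the
# model's layer count). Here -t has little effect on generation,
# and CPU load stays around 100% while GPU load can reach 100%.
./llama-server -m model.gguf -ngl 99 -t 12

# Partial offload: only 20 layers on the GPU. The remaining layers
# run on the CPU, so -t 12 can drive CPU load up to ~1200%
# during inference, with only short bursts of GPU activity.
./llama-server -m model.gguf -ngl 20 -t 12
```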
JohannesGaessler commented
You're right, the documentation is wrong, see #7362.