[BUG] After fasttensors was removed, the memory usage seemed abnormal.
Closed this issue · 9 comments
OS
Windows
GPU Library
CUDA 12.x
Python version
3.10
Describe the bug
After fasttensors was removed, the memory usage seemed abnormal.
exllamav2 should run inference entirely on the GPU, but even though the model is fully loaded into VRAM, the process still occupies nearly 30GB of RAM, which seems strange.
By the way, https://huggingface.co/anthracite-org/magnum-v2.5-12b-kto-exl2/discussions/1 can't be stopped at all in the latest version of TabbyAPI. This is the official exl2 conversion, yet it can't stop its output normally. Is this also related to the latest changes?
Reproduction steps
Download the 6.0bpw branch of https://huggingface.co/anthracite-org/magnum-v2.5-12b-kto-exl2 and try running it.
Expected behavior
RAM should not be used while loading, or at least it should be released after the model is loaded, since the CPU is not used for inference, right?
Logs
No response
Additional context
No response
Acknowledgements
- I have looked for similar issues before submitting this one.
- I have read the disclaimer, and this issue is related to a code bug. If I have a question, I will use the Discord server.
- I understand that the developers have lives and my issue will be answered when possible.
- I understand the developers of this program are human, and I will ask my questions politely.
Are you on the latest version, and can you check if the dependencies updated when you updated Tabby? Running pip show exllamav2 should give:
Name: exllamav2
Version: 0.2.3
...
As for Magnum, what chat template are you using?
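If the pip output is ambiguous (e.g. multiple environments), a quick sanity check from Python is sketched below; it assumes the installed exllamav2 build exposes __version__, which recent releases should:

```python
# Rough sanity check of what is actually being imported at runtime.
# Assumes exllamav2 exposes __version__ (recent builds do).
import exllamav2
import torch

print("exllamav2 version:", exllamav2.__version__)
print("torch version:", torch.__version__)
print("exllamav2 loaded from:", exllamav2.__file__)
```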
Name: exllamav2
Version: 0.2.3+cu121.torch2.4.0
Summary:
Home-page: https://github.com/turboderp/exllamav2
Author: turboderp
Author-email:
License: MIT
Location: c:\users\34503\appdata\local\programs\python\python310\lib\site-packages
Requires: fastparquet, ninja, numpy, pandas, pygments, regex, rich, safetensors, sentencepiece, torch, websockets
Required-by:
I use the default template provided by the model.
I also tried specifying chatml explicitly, but it still doesn't stop.
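A possible workaround sketch (the base URL, API key, and the <|im_end|> stop string are my assumptions for illustration, not confirmed TabbyAPI defaults) would be to pass an explicit stop string through the OpenAI-compatible endpoint:

```python
# Sketch: force an explicit stop string via TabbyAPI's OpenAI-compatible API.
# The base_url, api_key, model name, and <|im_end|> stop token are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="magnum-v2.5-12b-kto-exl2",
    messages=[{"role": "user", "content": "Hello!"}],
    stop=["<|im_end|>"],  # ChatML end-of-turn marker, assumed for this model
    max_tokens=200,
)
print(response.choices[0].message.content)
```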
Can you confirm that the model works as intended with ExLlama 0.2.2?
TabbyAPI requires the latest version of ExLlama, so how can I roll back?
Addendum: this is with Torch 2.4.1.
The issue of not stopping has been resolved with the help of the original author at https://huggingface.co/anthracite-org/magnum-v2.5-12b-kto-exl2/discussions/1.
Thank you very much for your help.
But I am still confused about the large amount of RAM required for inference.
The system RAM leak is definitely not supposed to be happening. I'll need to investigate some more. If you can run the previous version with the same settings maybe that can help confirm that it's definitely happening because of the update, even though the update is specifically meant to prevent it.
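If you want to isolate it from Tabby, a rough diagnostic like the sketch below (my assumptions: psutil is installed and the model path is adjusted) loads the model directly with ExLlama and prints resident memory before and after, to see whether the ~30GB of RAM survives the load:

```python
# Diagnostic sketch: measure process RSS before and after loading the model
# directly with exllamav2, bypassing Tabby. Adjust the model path below.
import gc

import psutil
from exllamav2 import ExLlamaV2, ExLlamaV2Config

def rss_gb() -> float:
    """Resident set size of this process in GiB."""
    return psutil.Process().memory_info().rss / 1024**3

print(f"RSS before load: {rss_gb():.2f} GiB")

config = ExLlamaV2Config("/path/to/magnum-v2.5-12b-kto-exl2")  # adjust path
model = ExLlamaV2(config)
model.load()  # single-GPU load; use load_autosplit for multi-GPU setups

gc.collect()
print(f"RSS after load: {rss_gb():.2f} GiB")
```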
You can roll back Tabby to any earlier version with git. Go here to find the commit you want, grab the hash, and run e.g. git checkout 56ce82ef77443140657d531ebb32d51fcdb72624 (the commit right before the ExLlama dependency was last updated). This should then run with the ExLlama 0.2.2 wheel. Revert with git checkout main.