[BUG] After fasttensors was removed, the memory usage seemed abnormal.
Closed this issue · 9 comments
OS
Windows
GPU Library
CUDA 12.x
Python version
3.10
Describe the bug
After fasttensors was removed, the memory usage seemed abnormal.
exllamav2 should run inference entirely on the GPU, but even though the model is fully loaded into VRAM, the process still occupies nearly 30GB of RAM, which seems strange.
By the way, https://huggingface.co/anthracite-org/magnum-v2.5-12b-kto-exl2/discussions/1 can't be stopped at all in the latest version of TabbyAPI. This is the official exl2 conversion, yet it can't stop its output normally. Is this also related to the latest changes?
Reproduction steps
Download the 6.0bpw branch of https://huggingface.co/anthracite-org/magnum-v2.5-12b-kto-exl2 and try running it.
Expected behavior
RAM should not be used while loading, or at least it should be released after the model is loaded, since the CPU is not used for inference, right?
Logs
No response
Additional context
No response
Acknowledgements
- I have looked for similar issues before submitting this one.
- I have read the disclaimer, and this issue is related to a code bug. If I have a question, I will use the Discord server.
- I understand that the developers have lives and my issue will be answered when possible.
- I understand the developers of this program are human, and I will ask my questions politely.
Are you on the latest version, and can you check if the dependencies updated when you updated Tabby? Running pip show exllamav2 should give:
Name: exllamav2
Version: 0.2.3
...
As for Magnum, what chat template are you using?
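If the pip output is ambiguous (e.g. multiple environments), a quick sanity check from Python is sketched below; it assumes the installed exllamav2 build exposes __version__, which recent releases should:

```python
# Rough sanity check of what is actually being imported at runtime.
# Assumes exllamav2 exposes __version__ (recent builds do).
import exllamav2
import torch

print("exllamav2 version:", exllamav2.__version__)
print("torch version:", torch.__version__)
print("exllamav2 loaded from:", exllamav2.__file__)
```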
Name: exllamav2
Version: 0.2.3+cu121.torch2.4.0
Summary:
Home-page: https://github.com/turboderp/exllamav2
Author: turboderp
Author-email:
License: MIT
Location: c:\users\34503\appdata\local\programs\python\python310\lib\site-packages
Requires: fastparquet, ninja, numpy, pandas, pygments, regex, rich, safetensors, sentencepiece, torch, websockets
Required-by:
I use the default template provided by the model.
I also tried specifying chatml explicitly, but it still doesn't stop.
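A possible workaround sketch (the base URL, API key, and the <|im_end|> stop string are my assumptions for illustration, not confirmed TabbyAPI defaults) would be to pass an explicit stop string through the OpenAI-compatible endpoint:

```python
# Sketch: force an explicit stop string via TabbyAPI's OpenAI-compatible API.
# The base_url, api_key, model name, and <|im_end|> stop token are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="magnum-v2.5-12b-kto-exl2",
    messages=[{"role": "user", "content": "Hello!"}],
    stop=["<|im_end|>"],  # ChatML end-of-turn marker, assumed for this model
    max_tokens=200,
)
print(response.choices[0].message.content)
```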
Can you confirm that the model works as intended with ExLlama 0.2.2?
TabbyAPI requires the latest version of ExLlama, so how can I roll back?
Addendum: this is with Torch 2.4.1.
The issue of not stopping has been resolved with the help of the original author at https://huggingface.co/anthracite-org/magnum-v2.5-12b-kto-exl2/discussions/1.
Thank you very much for your help.
But I am still confused about the large amount of RAM required for inference.
The system RAM leak is definitely not supposed to be happening. I'll need to investigate some more. If you can run the previous version with the same settings maybe that can help confirm that it's definitely happening because of the update, even though the update is specifically meant to prevent it.
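If you want to isolate it from Tabby, a rough diagnostic like the sketch below (my assumptions: psutil is installed and the model path is adjusted) loads the model directly with ExLlama and prints resident memory before and after, to see whether the ~30GB of RAM survives the load:

```python
# Diagnostic sketch: measure process RSS before and after loading the model
# directly with exllamav2, bypassing Tabby. Adjust the model path below.
import gc

import psutil
from exllamav2 import ExLlamaV2, ExLlamaV2Config

def rss_gb() -> float:
    """Resident set size of this process in GiB."""
    return psutil.Process().memory_info().rss / 1024**3

print(f"RSS before load: {rss_gb():.2f} GiB")

config = ExLlamaV2Config("/path/to/magnum-v2.5-12b-kto-exl2")  # adjust path
model = ExLlamaV2(config)
model.load()  # single-GPU load; use load_autosplit for multi-GPU setups

gc.collect()
print(f"RSS after load: {rss_gb():.2f} GiB")
```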
You can roll back Tabby to any earlier version with git. Go here to find the commit you want, grab the hash, and run e.g. git checkout 56ce82ef77443140657d531ebb32d51fcdb72624 (the commit right before the ExLlama dependency was last updated). This should then run with the ExLlama 0.2.2 wheel. Revert with git checkout main.