huggingface/local-gemma

LocalGemma2ForCausalLM.from_pretrained("google/gemma-2-27b-it", preset="memory_extreme") can't load model.

webbigdata-jp opened this issue · 1 comment

Hi, thank you for this interesting project.

This may be the same case as issues/24.
I can't run local-gemma from my Python code.

Script

from local_gemma import LocalGemma2ForCausalLM
from transformers import AutoTokenizer

# This call raises the ValueError shown below
model = LocalGemma2ForCausalLM.from_pretrained("google/gemma-2-27b-it", preset="memory_extreme")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs = encodeds.to(model.device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded_text = tokenizer.batch_decode(generated_ids)

Error message

$ python3 check.py
Traceback (most recent call last):
  File "/dataset/localgemma/check.py", line 4, in <module>
    model = LocalGemma2ForCausalLM.from_pretrained(
  File "/dataset/localgemma/gemma-venv/lib/python3.10/site-packages/local_gemma/modeling_local_gemma_2.py", line 153, in from_pretrained
    model = super().from_pretrained(
  File "/dataset/localgemma/gemma-venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3787, in from_pretrained
    hf_quantizer.validate_environment(device_map=device_map)
  File "/dataset/localgemma/gemma-venv/lib/python3.10/site-packages/transformers/quantizers/quantizer_bnb_4bit.py", line 86, in validate_environment
    raise ValueError(
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
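The error message itself hints at a possible route around this: allow the modules kept in fp32 to live on the CPU while the quantized weights stay on the GPU. Below is a minimal sketch of that suggestion using plain transformers, bypassing the local_gemma preset entirely. `llm_int8_enable_fp32_cpu_offload` is the current name of the flag in BitsAndBytesConfig (it also applies to 4-bit loading despite the name), but I haven't verified that this actually resolves the device map the preset produces:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Sketch of the workaround hinted at in the ValueError: quantize to 4-bit but
# explicitly allow the modules that stay in fp32 to be placed on the CPU.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    quantization_config=quant_config,
    device_map="auto",  # lets accelerate split layers across GPU and CPU
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")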

pip

$ pip list
Package                  Version
------------------------ ----------
accelerate               0.32.1
attrs                    23.2.0
bitsandbytes             0.43.1
certifi                  2024.7.4
charset-normalizer       3.3.2
filelock                 3.15.4
fsspec                   2024.6.1
huggingface-hub          0.23.4
idna                     3.7
Jinja2                   3.1.4
jsonlines                4.0.0
local_gemma              0.1.0
MarkupSafe               2.1.5
mpmath                   1.3.0
networkx                 3.3
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.20.5
nvidia-nvjitlink-cu12    12.5.82
nvidia-nvtx-cu12         12.1.105
packaging                24.1
pip                      22.0.2
psutil                   6.0.0
PyYAML                   6.0.1
regex                    2024.5.15
requests                 2.32.3
safetensors              0.4.3
setuptools               59.6.0
sympy                    1.12.1
tokenizers               0.19.1
torch                    2.3.1
tqdm                     4.66.4
transformers             4.42.3
triton                   2.3.1
typing_extensions        4.12.2
urllib3                  2.2.2

nvidia-smi

Fri Jul  5 15:53:05 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     Off | 00000000:01:00.0 Off |                  N/A |
|  0%   46C    P8               9W / 165W |    183MiB / 16380MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      6459      G   /usr/lib/xorg/Xorg                          162MiB |
|    0   N/A  N/A      6576      G   /usr/bin/gnome-shell                         13MiB |
+---------------------------------------------------------------------------------------+
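For context, a back-of-the-envelope estimate of the 4-bit weight footprint against the 16 GB card (the parameter count is approximate, and this ignores the KV cache and activations):

params = 27.2e9        # rough parameter count of gemma-2-27b
bytes_per_param = 0.5  # 4-bit quantized weights
print(f"~{params * bytes_per_param / 2**30:.1f} GiB of weights")  # ≈ 12.7 GiB

So the quantized weights alone nearly fill the card, which is presumably why this preset tries to offload part of the model.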

Hi @webbigdata-jp, thanks for reporting. We indeed have an issue with offloading when the model is quantized. I'll try to fix this soon.
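In the meantime, one thing that might work (untested on my side) is the memory preset, which, if I remember the preset table correctly, also quantizes to 4-bit but keeps everything on the GPU, so the offload check above never fires, provided the quantized weights actually fit in your 16 GB:

from local_gemma import LocalGemma2ForCausalLM

# Possible stopgap, not a confirmed fix: 4-bit quantization without CPU offload.
model = LocalGemma2ForCausalLM.from_pretrained(
    "google/gemma-2-27b-it", preset="memory"
)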