LocalGemma2ForCausalLM.from_pretrained("google/gemma-2-27b-it", preset="memory_extreme") cannot load the model
webbigdata-jp opened this issue · 1 comment
webbigdata-jp commented
Hi, thank you for this interesting project.
This may be the same case as issues/24.
I can't run local-gemma from my Python code.
Script
from local_gemma import LocalGemma2ForCausalLM
from transformers import AutoTokenizer

# Loading with the memory_extreme preset raises the ValueError shown below.
model = LocalGemma2ForCausalLM.from_pretrained("google/gemma-2-27b-it", preset="memory_extreme")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

# Build the prompt from the chat template and generate a reply.
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs = encodeds.to(model.device)
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded_text = tokenizer.batch_decode(generated_ids)
Error message
$ python3 check.py
Traceback (most recent call last):
File "/dataset/localgemma/check.py", line 4, in <module>
model = LocalGemma2ForCausalLM.from_pretrained(
File "/dataset/localgemma/gemma-venv/lib/python3.10/site-packages/local_gemma/modeling_local_gemma_2.py", line 153, in from_pretrained
model = super().from_pretrained(
File "/dataset/localgemma/gemma-venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3787, in from_pretrained
hf_quantizer.validate_environment(device_map=device_map)
File "/dataset/localgemma/gemma-venv/lib/python3.10/site-packages/transformers/quantizers/quantizer_bnb_4bit.py", line 86, in validate_environment
raise ValueError(
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
pip
$ pip list
Package Version
------------------------ ----------
accelerate 0.32.1
attrs 23.2.0
bitsandbytes 0.43.1
certifi 2024.7.4
charset-normalizer 3.3.2
filelock 3.15.4
fsspec 2024.6.1
huggingface-hub 0.23.4
idna 3.7
Jinja2 3.1.4
jsonlines 4.0.0
local_gemma 0.1.0
MarkupSafe 2.1.5
mpmath 1.3.0
networkx 3.3
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.5.82
nvidia-nvtx-cu12 12.1.105
packaging 24.1
pip 22.0.2
psutil 6.0.0
PyYAML 6.0.1
regex 2024.5.15
requests 2.32.3
safetensors 0.4.3
setuptools 59.6.0
sympy 1.12.1
tokenizers 0.19.1
torch 2.3.1
tqdm 4.66.4
transformers 4.42.3
triton 2.3.1
typing_extensions 4.12.2
urllib3 2.2.2
nvidia-smi
Fri Jul 5 15:53:05 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01 Driver Version: 535.113.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti Off | 00000000:01:00.0 Off | N/A |
| 0% 46C P8 9W / 165W | 183MiB / 16380MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 6459 G /usr/lib/xorg/Xorg 162MiB |
| 0 N/A N/A 6576 G /usr/bin/gnome-shell 13MiB |
+---------------------------------------------------------------------------------------+
SunMarc commented
Hi @webbigdata-jp, thanks for reporting. We indeed have an issue with offloading when the model is quantized. I'll try to fix this soon.
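In the meantime, a possible interim workaround (assuming the "memory" preset quantizes to 4-bit without CPU offload, which I have not verified here, and that the quantized weights fit in the 16 GB GPU) could be:

# Assumed interim workaround: use a preset that does not offload to the CPU,
# so the quantized-offloading code path that raises the ValueError is avoided.
model = LocalGemma2ForCausalLM.from_pretrained("google/gemma-2-27b-it", preset="memory")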