intel-analytics/ipex-llm

all-in-one benchmark with Baichuan2-13B OOM

Closed this issue · 2 comments

The all-in-one benchmark hits OOM for Baichuan2-13B with 1024 tokens in / 1024 tokens out, using the following configuration:

accelerate 0.30.0
bigdl-core-xe-21 2.5.0b20240508
bigdl-core-xe-esimd-21 2.5.0b20240508
elastic-transport 7.16.0
intel-extension-for-pytorch 2.1.30+xpu
ipex-llm 2.1.0b20240508
pytorch_revgrad 0.2.0
s3transfer 0.10.0
torch 2.1.0.post2+cxx11.abi
torchaudio 2.1.0.post2+cxx11.abi
torchvision 0.16.0.post2+cxx11.abi
transformer-smaller-training-vocab 0.3.3
transformers 4.37.2
transformers-stream-generator 0.0.4
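
For reference, the failing case roughly corresponds to this standalone sketch (the model path, INT4 low-bit setting, and prompt are assumptions; the actual benchmark drives this through run.py / benchmark_util.py):

```python
# Minimal sketch of the 1024-in / 1024-out case on an Intel GPU (xpu).
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "path/to/Baichuan2-13B-Chat"  # hypothetical local path

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,        # assumed low-bit setting; the benchmark config decides this
    trust_remote_code=True,
)
model = model.to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Build a ~1024-token prompt and truncate to exactly 1024 tokens.
input_ids = tokenizer("hello " * 1024, return_tensors="pt").input_ids[:, :1024].to("xpu")

with torch.inference_mode():
    output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=1024)
```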

Traceback (most recent call last):
File "/home/a770/miniconda3/envs/llm-test/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/home/a770/miniconda3/envs/llm-test/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/opt/WD/091-GFX-Benchmark/BigDL/python/llm/dev/benchmark/all-in-one/run.py", line 53, in run_model_in_thread
output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=out_len,
File "/home/a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/WD/091-GFX-Benchmark/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 1563, in generate
return self.greedy_search(
File "/opt/WD/091-GFX-Benchmark/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 2385, in greedy_search
outputs = self(
File "/opt/WD/091-GFX-Benchmark/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 533, in call
return self.model(*args, **kwargs)
File "/home/a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/a770/.cache/huggingface/modules/transformers_modules/Baichuan2-13B-Chat/modeling_baichuan.py", line 693, in forward
outputs = self.model(
File "/home/a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/a770/.cache/huggingface/modules/transformers_modules/Baichuan2-13B-Chat/modeling_baichuan.py", line 425, in forward
attention_mask = inverted_mask + alibi_mask.unsqueeze(0)
RuntimeError: Allocation is out of device memory on current platform.
Token indices sequence length is longer than the specified maximum sequence length for this model (9533 > 4096). Running this sequence through the model will result in indexing errors
Exception in thread Thread-27:
Traceback (most recent call last):
File "/home/a770/miniconda3/envs/llm-test/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/home/a770/miniconda3/envs/llm-test/lib/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/opt/WD/091-GFX-Benchmark/BigDL/python/llm/dev/benchmark/all-in-one/run.py", line 53, in run_model_in_thread
output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=out_len,
File "/home/a770/miniconda3/envs/llm-test/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/WD/091-GFX-Benchmark/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 1344, in generate model_kwargs["attention_mask"] = self._prepare_attention_mask_for_generation(
File "/opt/WD/091-GFX-Benchmark/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 647, in _prepare_attention_mask_for_generation
return inputs.ne(pad_token_id).long()
RuntimeError: Allocation is out of device memory on current platform.

Please enable low-memory mode and try again. You can run export IPEX_LLM_LOW_MEM=1 before launching the scripts to enable it.
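
If you prefer to set this from Python rather than the shell, the same flag can be exported in the script itself (a sketch, assuming the variable is read when ipex-llm loads/optimizes the model, so it must be set before that point):

```python
import os

# Assumption: IPEX_LLM_LOW_MEM is checked at model load/optimization time,
# so set it before importing and loading the model.
os.environ["IPEX_LLM_LOW_MEM"] = "1"

from ipex_llm.transformers import AutoModelForCausalLM  # import after setting the flag
```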

Currently a code change is needed to convert the alibi mask to fp16. We have synced with the user offline, so closing this issue for now.
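
For context, the allocation fails on the line attention_mask = inverted_mask + alibi_mask.unsqueeze(0) in modeling_baichuan.py. The kind of change referred to above would cast the alibi mask to fp16 before that add, roughly like this (a sketch only, not the actual patch):

```python
import torch

def combine_masks_fp16(inverted_mask: torch.Tensor, alibi_mask: torch.Tensor) -> torch.Tensor:
    # Sketch: keep both masks in fp16 so the broadcasted sum does not
    # materialize a large fp32 (heads, seq_len, seq_len) tensor on the GPU.
    return inverted_mask.to(torch.float16) + alibi_mask.to(torch.float16).unsqueeze(0)
```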