Issue running meta-llama/Llama-2-13b-chat-hf
muhammad-asn opened this issue · 32 comments
System Info
- OS Version: 22.04.3 LTS
- Model being used: meta-llama/Llama-2-13b-chat-hf (local model)
$ ls -la deploy-gaudi/data/meta-llama_Llama-2-13b-chat-hf
total 25424104
drwxrwxr-x 2 smci smci 4096 Oct 23 13:28 .
drwxrwxr-x 3 smci smci 4096 Feb 16 16:42 ..
-rw-rw-r-- 1 smci smci 7020 Oct 23 13:28 LICENSE.txt
-rw-rw-r-- 1 smci smci 10409 Oct 23 13:28 README.md
-rw-rw-r-- 1 smci smci 4766 Oct 23 13:28 USE_POLICY.md
-rw-rw-r-- 1 smci smci 587 Oct 23 13:28 config.json
-rw-rw-r-- 1 smci smci 188 Oct 23 13:28 generation_config.json
-rw-rw-r-- 1 smci smci 815 Oct 23 13:28 huggingface-metadata.txt
-rw-rw-r-- 1 smci smci 9948693272 Oct 23 13:29 model-00001-of-00003.safetensors
-rw-rw-r-- 1 smci smci 9904129368 Oct 23 13:29 model-00002-of-00003.safetensors
-rw-rw-r-- 1 smci smci 6178962272 Oct 23 13:28 model-00003-of-00003.safetensors
-rw-rw-r-- 1 smci smci 33444 Oct 23 13:28 model.safetensors.index.json
-rw-rw-r-- 1 smci smci 33444 Oct 23 13:28 pytorch_model.bin.index.json
-rw-rw-r-- 1 smci smci 414 Oct 23 13:28 special_tokens_map.json
-rw-rw-r-- 1 smci smci 1842767 Oct 23 13:28 tokenizer.json
-rw-rw-r-- 1 smci smci 499723 Oct 23 13:28 tokenizer.model
-rw-rw-r-- 1 smci smci 1618 Oct 23 13:28 tokenizer_config.json
- HL-SMI
$ sudo hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version: hl-1.14.0-fw-48.0.1.0 |
| Driver Version: 1.14.0-9e8ecf8 |
|-------------------------------+----------------------+----------------------+
| AIP Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | AIP-Util Compute M. |
|===============================+======================+======================|
| 0 HL-225 N/A | 0000:b3:00.0 N/A | 0 |
| N/A 25C N/A 106W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 1 HL-225 N/A | 0000:b4:00.0 N/A | 0 |
| N/A 29C N/A 104W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 2 HL-225 N/A | 0000:19:00.0 N/A | 0 |
| N/A 24C N/A 103W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 3 HL-225 N/A | 0000:cc:00.0 N/A | 0 |
| N/A 28C N/A 87W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 4 HL-225 N/A | 0000:1a:00.0 N/A | 0 |
| N/A 30C N/A 93W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 5 HL-225 N/A | 0000:43:00.0 N/A | 0 |
| N/A 30C N/A 93W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 6 HL-225 N/A | 0000:cd:00.0 N/A | 0 |
| N/A 27C N/A 116W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 7 HL-225 N/A | 0000:44:00.0 N/A | 0 |
| N/A 25C N/A 94W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes: AIP Memory |
| AIP PID Type Process name Usage |
|=============================================================================|
| 0 N/A N/A N/A N/A |
| 1 N/A N/A N/A N/A |
| 2 N/A N/A N/A N/A |
| 3 N/A N/A N/A N/A |
| 4 N/A N/A N/A N/A |
| 5 N/A N/A N/A N/A |
| 6 N/A N/A N/A N/A |
| 7 N/A N/A N/A N/A |
+=============================================================================+
- CPU: Intel Xeon Platinum 8380 (160) @ 3.400GHz
- GPU: 03:00.0 ASPEED Technology, Inc. ASPEED Graphics Family
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
- Clone the repository https://github.com/huggingface/tgi-gaudi and check out the habana-dev branch.
- Run the docker command:
$ model=/data/meta-llama_Llama-2-13b-chat-hf
$ volume=$PWD/data
$ docker run -p 8080:80 -v $volume:/data --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi_gaudi --model-id $model
- After a while, the error log shows:
synStatus=20 [Device already acquired] Device acquire failed.
2024-02-19T04:48:59.198129Z INFO text_generation_launcher: Args { model_id: "/data/meta-llama_Llama-2-13b-chat-hf", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "6a3b385d5f37", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2024-02-19T04:48:59.198315Z INFO download: text_generation_launcher: Starting download process.
2024-02-19T04:49:00.906024Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-02-19T04:49:01.202998Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-02-19T04:49:01.203458Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-02-19T04:49:05.090203Z INFO text_generation_launcher: CLI SHARDED = False DTYPE = bfloat16
2024-02-19T04:49:11.216418Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:49:21.226050Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:49:31.236352Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:49:41.246660Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:49:51.257064Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:50:01.265897Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:50:11.275136Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:50:21.284766Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:50:31.295121Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:50:41.305467Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:50:46.211569Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:252: UserWarning: Device capability of hccl unspecified, assuming `cpu` and `cuda`. Please specify it via the `devices` argument of `register_backend`.
warnings.warn(
[WARNING|utils.py:185] 2024-02-19 04:49:03,996 >> optimum-habana v1.10.0 has been validated for SynapseAI v1.14.0 but habana-frameworks v1.13.0.463 was found, this could lead to undefined behavior!
Loading checkpoint shards: 100%|██████████| 3/3 [00:01<00:00, 2.11it/s]
Traceback (most recent call last):
File "/usr/local/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 120, in serve
server.serve(model_id, revision, dtype, uds_path, sharded)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 216, in serve
asyncio.run(serve_inner(model_id, revision, dtype, sharded))
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 177, in serve_inner
model = get_model(model_id, revision=revision, dtype=data_type)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/__init__.py", line 33, in get_model
return CausalLM(model_id, revision, dtype)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 589, in __init__
model = model.eval().to(device)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2179, in to
return super().to(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 173, in wrapped_to
result = self.original_to(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1163, in to
return self._apply(convert)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1161, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 53, in __torch_function__
return super().__torch_function__(func, types, new_args, kwargs)
RuntimeError: synStatus=20 [Device already acquired] Device acquire failed.
rank=0
2024-02-19T04:50:46.290320Z ERROR text_generation_launcher: Shard 0 failed to start
2024-02-19T04:50:46.290347Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
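The stderr output above also contains a warning that optimum-habana v1.10.0 was validated for SynapseAI v1.14.0 while habana-frameworks v1.13.0.463 was found. As a minimal sketch, a log line like that could be checked automatically. The regular expression and the function names below are hypothetical helpers written for this illustration, not part of optimum-habana, and the thread does not establish whether the mismatch is related to the failure.

```python
import re

# Pattern modelled on the warning text in the log above; the exact wording
# may differ between optimum-habana releases (assumption).
WARNING_RE = re.compile(
    r"validated for SynapseAI v(?P<validated>\d+\.\d+(?:\.\d+)*)"
    r" but habana-frameworks v(?P<found>\d+\.\d+(?:\.\d+)*) was found"
)


def major_minor(version: str) -> tuple[int, int]:
    """Return the (major, minor) part of a dotted version string."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def find_version_mismatch(log_text: str):
    """Return (validated, found) if the log reports differing major.minor versions, else None."""
    match = WARNING_RE.search(log_text)
    if match is None:
        return None
    validated, found = match.group("validated"), match.group("found")
    if major_minor(validated) == major_minor(found):
        return None
    return validated, found


line = (
    "[WARNING|utils.py:185] 2024-02-19 04:49:03,996 >> optimum-habana v1.10.0 "
    "has been validated for SynapseAI v1.14.0 but habana-frameworks "
    "v1.13.0.463 was found, this could lead to undefined behavior!"
)
print(find_version_mismatch(line))
```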
Expected behavior
The model should run properly and without issue
I just merged #56, can we close this issue @muhammad-asn ?
Yup you can close this issue
Sorry to reply on this closed thread. The original issue is solved when I run on 1 HPU card only. When I try 8 HPUs (--num-shard 8), a new error arises:
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/tgi_service.py", line 29, in <module>
main(args)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/tgi_service.py", line 16, in main
server.serve(
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 213, in serve
asyncio.run(serve_inner(model_id, revision, dtype, sharded))
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 177, in serve_inner
model = get_model(model_id, revision=revision, dtype=data_type)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/__init__.py", line 33, in get_model
return CausalLM(model_id, revision, dtype)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 526, in __init__
model = deepspeed.init_inference(model, **ds_inference_kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 168, in __init__
self._apply_injection_policy(config, client_module)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
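The traceback ends in an UnboundLocalError, which Python raises when a local name is read before any assignment to it has run. One common way to get there is a variable that is only assigned inside a loop body, combined with an empty input, which would be consistent with the "Loading 0 checkpoint shards" line. The sketch below only illustrates that mechanism. It is not DeepSpeed's code, and both function names are hypothetical.

```python
def apply_to_shards(shard_paths):
    """Hypothetical stand-in: the local is only bound inside the loop body."""
    for path in shard_paths:
        replaced = f"module patched from {path}"
    # With an empty list the loop body never runs, so `replaced` is unbound here.
    return replaced


def apply_to_shards_checked(shard_paths):
    """Same logic, but failing early with an explicit message."""
    if not shard_paths:
        raise FileNotFoundError("no checkpoint shards were found to load")
    replaced = None
    for path in shard_paths:
        replaced = f"module patched from {path}"
    return replaced


try:
    apply_to_shards([])
except UnboundLocalError as exc:
    print(type(exc).__name__, "-", exc)
```

The exact wording of the exception message depends on the Python version, so the example prints it rather than assuming it.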
Can you share the command line you used to launch your server instance please?
I didn't manage to reproduce this issue with:
docker run \
-p 8080:80 \
-v /scratch-1/:/data \
--runtime=habana \
-e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
-e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
-e HUGGING_FACE_HUB_TOKEN=my_token \
--cap-add=sys_nice \
--ipc=host tgi_gaudi \
--model-id meta-llama/Llama-2-70b-hf \
--sharded true \
--num-shard 8
"Loading 0 checkpoint shards" looks weird; it makes me think that it was not able to find the checkpoint shards.
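One way to test that hypothesis is to compare what the checkpoint index files reference with what is actually on disk. The sketch below assumes the usual Hugging Face sharded-checkpoint layout, in which each *.index.json file holds a "weight_map" object mapping parameter names to shard file names. The function name missing_shards is hypothetical. The directory listing at the top of the issue contains two index files but only .safetensors shards, so a check like this would show whether either index points at files that are absent. Whether that is related to the error is not established in the thread.

```python
import json
import tempfile
from pathlib import Path


def missing_shards(model_dir: str) -> dict[str, list[str]]:
    """Map each *.index.json in model_dir to the shard files it references that are absent."""
    root = Path(model_dir)
    report = {}
    for index_path in sorted(root.glob("*.index.json")):
        with index_path.open() as handle:
            weight_map = json.load(handle).get("weight_map", {})
        referenced = sorted(set(weight_map.values()))
        report[index_path.name] = [
            name for name in referenced if not (root / name).is_file()
        ]
    return report


# Small self-contained demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    index = {
        "weight_map": {
            "a.weight": "model-00001-of-00002.safetensors",
            "b.weight": "model-00002-of-00002.safetensors",
        }
    }
    (root / "model.safetensors.index.json").write_text(json.dumps(index))
    (root / "model-00001-of-00002.safetensors").write_bytes(b"")
    print(missing_shards(tmp))
```

For the setup in this thread, the function would be pointed at the model directory mounted as /data/meta-llama_Llama-2-13b-chat-hf inside the container.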
Sorry for the late reply, I will check it first.
I use docker compose to run the inference:
services:
  tgi_gaudi:
    image: tgi_gaudi
    container_name: llm
    runtime: habana
    environment:
      - HABANA_VISIBLE_DEVICES=all
      - OMPI_MCA_btl_vader_single_copy_mechanism=none
      - ENABLE_HPU_GRAPH=False
      - LOG_LEVEL=debug,text_generation_router=debug
      - PT_HPU_ENABLE_LAZY_COLLECTIVES=true
    command: >
      --model-id /data/meta-llama_Llama-2-13b-chat-hf
      --max-total-tokens 8192
      --max-input-length 4096
      --num-shard 8
      --max-top-n-tokens 1
      --max-best-of 1
      --disable-custom-kernels
      --trust-remote-code
      --max-stop-sequences 1
      --validation-workers 1
      --max-batch-total-tokens 8192
      --max-batch-prefill-tokens 4096
      --waiting-served-ratio 0
      --max-waiting-tokens 4096
      --sharded true
    cap_add:
      - sys_nice
    ipc: host
    shm_size: '1gb'
    restart: always
    ports:
      - "8080:80"
    volumes:
      - ./data:/data
networks:
  default:
    name: habana
    external: true
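The compose file sets four token-limit flags that depend on one another. As a rough sanity check, the sketch below tests three relations between them. These relations are assumptions about what the launcher expects, not rules quoted from its documentation, and check_token_limits is a hypothetical helper.

```python
def check_token_limits(
    max_input_length: int,
    max_total_tokens: int,
    max_batch_prefill_tokens: int,
    max_batch_total_tokens: int,
) -> list[str]:
    """Return human-readable problems; an empty list means the values are consistent."""
    problems = []
    # Assumed rule: a prompt must leave room for at least one generated token.
    if max_input_length >= max_total_tokens:
        problems.append("max_input_length must be smaller than max_total_tokens")
    # Assumed rule: one prefill batch must be able to hold the longest prompt.
    if max_batch_prefill_tokens < max_input_length:
        problems.append("max_batch_prefill_tokens must be >= max_input_length")
    # Assumed rule: the batch budget must cover one request of maximum length.
    if max_batch_total_tokens < max_total_tokens:
        problems.append("max_batch_total_tokens must be >= max_total_tokens")
    return problems


# Values taken from the compose file above.
print(check_token_limits(4096, 8192, 4096, 8192))
```

With the values from the compose file, none of the three checks fires, so under these assumptions the flags are at least mutually consistent.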
$ sudo hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version: hl-1.14.0-fw-48.0.1.0 |
| Driver Version: 1.14.0-9e8ecf8 |
|-------------------------------+----------------------+----------------------+
| AIP Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | AIP-Util Compute M. |
|===============================+======================+======================|
| 0 HL-225 N/A | 0000:b3:00.0 N/A | 0 |
| N/A 25C N/A 106W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 1 HL-225 N/A | 0000:b4:00.0 N/A | 0 |
| N/A 28C N/A 111W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 2 HL-225 N/A | 0000:19:00.0 N/A | 0 |
| N/A 23C N/A 103W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 3 HL-225 N/A | 0000:cc:00.0 N/A | 0 |
| N/A 27C N/A 91W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 4 HL-225 N/A | 0000:1a:00.0 N/A | 0 |
| N/A 29C N/A 101W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 5 HL-225 N/A | 0000:43:00.0 N/A | 0 |
| N/A 29C N/A 101W / 600W | 31373MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 6 HL-225 N/A | 0000:cd:00.0 N/A | 0 |
| N/A 26C N/A 117W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 7 HL-225 N/A | 0000:44:00.0 N/A | 0 |
| N/A 24C N/A 89W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes: AIP Memory |
| AIP PID Type Process name Usage |
|=============================================================================|
| 0 N/A N/A N/A N/A |
| 1 N/A N/A N/A N/A |
| 2 N/A N/A N/A N/A |
| 3 N/A N/A N/A N/A |
| 4 N/A N/A N/A N/A |
| 5 3472794 C text-generation 30605MiB |
| 6 N/A N/A N/A N/A |
| 7 N/A N/A N/A N/A |
+=============================================================================+
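The process table above shows that only device 5 is held by a process. Checking this before launching a server helps with errors such as "Device already acquired", which suggests that another process is holding the device. The sketch below parses the text of an hl-smi report to list busy devices. The row layout is copied from the output in this thread and may differ in other hl-smi versions, and busy_devices is a hypothetical helper name.

```python
import re

# Matches process rows such as "| 5 3472794 C text-generation 30605MiB".
# Rows for idle devices contain "N/A" instead of a numeric PID and do not match.
PROCESS_ROW = re.compile(
    r"^\|\s*(?P<aip>\d+)\s+(?P<pid>\d+)\s+(?P<type>\S+)\s+(?P<name>\S+)\s+(?P<mem>\d+)MiB"
)


def busy_devices(hl_smi_output: str) -> dict[int, dict]:
    """Return {device index: process info} for every device that has a process attached."""
    busy = {}
    for line in hl_smi_output.splitlines():
        match = PROCESS_ROW.match(line.strip())
        if match:
            busy[int(match["aip"])] = {
                "pid": int(match["pid"]),
                "name": match["name"],
                "memory_mib": int(match["mem"]),
            }
    return busy


sample = """\
| 5 HL-225 N/A | 0000:43:00.0 N/A | 0 |
| N/A 29C N/A 101W / 600W | 31373MiB / 98304MiB | 0% N/A |
| 4 N/A N/A N/A N/A |
| 5 3472794 C text-generation 30605MiB
| 6 N/A N/A N/A N/A |
"""
print(busy_devices(sample))
```

In practice the text would come from running hl-smi, for example through subprocess.run with capture_output=True.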
Also, when I try a longer text, TGI seems to have an issue as well (1 shard only).
JSON with ~1024 words (data.json):
{
"inputs": "Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then she continued her way. On her way she met a copy. The copy warned the Little Blind Text, that where it came from it would have been rewritten a thousand times and everything that was left from its origin would be the word and and the Little Blind Text should turn around and return to its own, safe country. But nothing the copy said could convince her and so it didn’t take long until a few insidious Copy Writers ambushed her, made her drunk with Longe and Parole and dragged her into their agency, where they abused her for their projects again and again. And if she hasn’t been rewritten, then they are still using her. Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. 
A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then she continued her way. On her way she met a copy. The copy warned the Little Blind Text, that where it came from it would have been rewritten a thousand times and everything that was left from its origin would be the word and and the Little Blind Text should turn around and return to its own, safe country. But nothing the copy said could convince her and so it didn’t take long until a few insidious Copy Writers ambushed her, made her drunk with Longe and Parole and dragged her into their agency, where they abused her for their projects again and again. And if she hasn’t been rewritten, then they are still using her. Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. 
Even the all-powerful Pointing has no control about the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then she continued her way. On her way she met a copy. The copy warned the Little Blind Text, that where it came from it would have been rewritten a thousand times and everything that was left from its origin would be the word and and the Little Blind Text should turn around and return to its own, safe country. But nothing the copy said could convince her and so it didn’t take long until a few insidious Copy Writers ambushed her, made her drunk with Longe and Parole and dragged her into their agency, where they abused her for their projects again and again. And if she hasn’t been rewritten, then they are still using her.Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about the blind. Summarize it",
"parameters": {
"max_new_tokens": 4096,
"best_of": 1,
"repetition_penalty": 1.17,
"return_full_text": false,
"temperature": 0.01,
"top_p": 0.14,
"top_k": 49,
"truncate": 4096,
"typical_p": 0.99,
"watermark": false,
"decoder_input_details": false
}
}
The command:
curl -d @data.json -H "Content-Type: application/json" "http://127.0.0.1:8080/generate"
{"error":"Request failed during generation: Server error: Graph compile failed. synStatus=synStatus 26 [Generice failure]. ","error_type":"generation"}
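For scripted reproduction, the same request can be sent from Python with the standard library. This is a minimal sketch of the curl call above. The endpoint and header come from the command in this thread, while the function names, the default timeout, and the shortened payload are assumptions made for illustration.

```python
import json
import urllib.error
import urllib.request


def build_generate_request(
    payload: dict, base_url: str = "http://127.0.0.1:8080"
) -> urllib.request.Request:
    """Build a POST request for the /generate endpoint with a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 600.0):
    """Send the request and return (HTTP status, decoded JSON body)."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, json.loads(response.read())
    except urllib.error.HTTPError as err:
        # Error responses like the one above also carry a JSON body.
        return err.code, json.loads(err.read())


# Shortened stand-in for data.json; the real file holds the long prompt.
payload = {
    "inputs": "Far far away, behind the word mountains. Summarize it",
    "parameters": {"max_new_tokens": 16, "truncate": 4096},
}
request = build_generate_request(payload)
print(request.get_method(), request.full_url)
```

Calling send(request) requires a running server, so the example stops at building the request.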
The log:
e.rs:213: send frame=Ping { ack: true, payload: [0, 0, 0, 0, 0, 0, 0, 173] }
2024-02-21T05:54:46.683643Z DEBUG text_generation_launcher: MAX_TOTAL_TOKENS = 0
2024-02-21T05:55:25.318039Z DEBUG text_generation_launcher: Method Prefill encountered an error.
2024-02-21T05:55:25.318074Z DEBUG text_generation_launcher: Traceback (most recent call last):
2024-02-21T05:55:25.318081Z DEBUG text_generation_launcher: File "/usr/local/bin/text-generation-server", line 8, in <module>
2024-02-21T05:55:25.318087Z DEBUG text_generation_launcher: sys.exit(app())
2024-02-21T05:55:25.318093Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
2024-02-21T05:55:25.318099Z DEBUG text_generation_launcher: return get_command(self)(*args, **kwargs)
2024-02-21T05:55:25.318104Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
2024-02-21T05:55:25.318109Z DEBUG text_generation_launcher: return self.main(*args, **kwargs)
2024-02-21T05:55:25.318114Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
2024-02-21T05:55:25.318119Z DEBUG text_generation_launcher: return _main(
2024-02-21T05:55:25.318125Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
2024-02-21T05:55:25.318130Z DEBUG text_generation_launcher: rv = self.invoke(ctx)
2024-02-21T05:55:25.318135Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
2024-02-21T05:55:25.318140Z DEBUG text_generation_launcher: return _process_result(sub_ctx.command.invoke(sub_ctx))
2024-02-21T05:55:25.318146Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
2024-02-21T05:55:25.318151Z DEBUG text_generation_launcher: return ctx.invoke(self.callback, **ctx.params)
2024-02-21T05:55:25.318156Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
2024-02-21T05:55:25.318161Z DEBUG text_generation_launcher: return __callback(*args, **kwargs)
2024-02-21T05:55:25.318166Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
2024-02-21T05:55:25.318171Z DEBUG text_generation_launcher: return callback(**use_params) # type: ignore
2024-02-21T05:55:25.318177Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 120, in serve
2024-02-21T05:55:25.318182Z DEBUG text_generation_launcher: server.serve(model_id, revision, dtype, uds_path, sharded)
2024-02-21T05:55:25.318188Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 213, in serve
2024-02-21T05:55:25.318193Z DEBUG text_generation_launcher: asyncio.run(serve_inner(model_id, revision, dtype, sharded))
2024-02-21T05:55:25.318198Z DEBUG text_generation_launcher: File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
2024-02-21T05:55:25.318203Z DEBUG text_generation_launcher: return loop.run_until_complete(main)
2024-02-21T05:55:25.318209Z DEBUG text_generation_launcher: File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
2024-02-21T05:55:25.318214Z DEBUG text_generation_launcher: self.run_forever()
2024-02-21T05:55:25.318220Z DEBUG text_generation_launcher: File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
2024-02-21T05:55:25.318225Z DEBUG text_generation_launcher: self._run_once()
2024-02-21T05:55:25.318230Z DEBUG text_generation_launcher: File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
2024-02-21T05:55:25.318235Z DEBUG text_generation_launcher: handle._run()
2024-02-21T05:55:25.318241Z DEBUG text_generation_launcher: File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
2024-02-21T05:55:25.318247Z DEBUG text_generation_launcher: self._context.run(self._callback, *self._args)
2024-02-21T05:55:25.318252Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
2024-02-21T05:55:25.318258Z DEBUG text_generation_launcher: return await self.intercept(
2024-02-21T05:55:25.318264Z DEBUG text_generation_launcher: > File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 23, in intercept
2024-02-21T05:55:25.318270Z DEBUG text_generation_launcher: return await response
2024-02-21T05:55:25.318276Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
2024-02-21T05:55:25.318282Z DEBUG text_generation_launcher: raise error
2024-02-21T05:55:25.318288Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
2024-02-21T05:55:25.318294Z DEBUG text_generation_launcher: return await behavior(request_or_iterator, context)
2024-02-21T05:55:25.318300Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 87, in Prefill
2024-02-21T05:55:25.318307Z DEBUG text_generation_launcher: generations, next_batch = self.model.generate_token(batch)
2024-02-21T05:55:25.318312Z DEBUG text_generation_launcher: File "/usr/lib/python3.10/contextlib.py", line 79, in inner
2024-02-21T05:55:25.318318Z DEBUG text_generation_launcher: return func(*args, **kwds)
2024-02-21T05:55:25.318323Z DEBUG text_generation_launcher: File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 704, in generate_token
2024-02-21T05:55:25.318330Z DEBUG text_generation_launcher: batch.input_ids[:, :token_idx], logits.squeeze(-2)
2024-02-21T05:55:25.318335Z DEBUG text_generation_launcher: RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generice failure].
2024-02-21T05:55:25.318358Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_read.rs:360: received frame=Headers { stream_id: StreamId(429), flags: (0x5: END_HEADERS | END_STREAM) }
2024-02-21T05:55:25.318424Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_read.rs:360: received frame=WindowUpdate { stream_id: StreamId(0), size_increment: 5848 }
2024-02-21T05:55:25.318524Z ERROR batch{batch_size=1}:prefill:prefill{id=83 size=1}:prefill{id=83 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: Graph compile failed. synStatus=synStatus 26 [Generice failure].
2024-02-21T05:55:25.318642Z DEBUG batch{batch_size=1}:prefill:clear_cache{batch_id=Some(83)}:clear_cache{batch_id=Some(83)}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-02-21T05:55:25.318799Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(431), flags: (0x4: END_HEADERS) }
2024-02-21T05:55:25.318838Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(431) }
2024-02-21T05:55:25.318855Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(431), flags: (0x1: END_STREAM) }
2024-02-21T05:55:25.319267Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_read.rs:360: received frame=Ping { ack: false, payload: [0, 0, 0, 0, 0, 0, 0, 174] }
2024-02-21T05:55:25.319307Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [0, 0, 0, 0, 0, 0, 0, 174] }
2024-02-21T05:55:25.319664Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_read.rs:360: received frame=Headers { stream_id: StreamId(431), flags: (0x4: END_HEADERS) }
2024-02-21T05:55:25.319709Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_read.rs:360: received frame=Data { stream_id: StreamId(431) }
2024-02-21T05:55:25.319734Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_read.rs:360: received frame=Headers { stream_id: StreamId(431), flags: (0x5: END_HEADERS | END_STREAM) }
2024-02-21T05:55:25.319750Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_read.rs:360: received frame=WindowUpdate { stream_id: StreamId(0), size_increment: 7 }
2024-02-21T05:55:25.319858Z ERROR generate{parameters=GenerateParameters { best_of: Some(1), temperature: Some(0.01), repetition_penalty: Some(1.17), top_k: Some(49), top_p: Some(0.14), typical_p: Some(0.99), do_sample: false, max_new_tokens: Some(4096), return_full_text: Some(false), stop: [], truncate: Some(4096), watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None }}:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:601: Request failed during generation: Server error: Graph compile failed. synStatus=synStatus 26 [Generice failure].
2024-02-21T05:55:25.320127Z DEBUG hyper::proto::h1::io: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.28/src/proto/h1/io.rs:318: flushed 396 bytes
2024-02-21T05:55:25.510810Z DEBUG hyper::proto::h1::conn: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.28/src/proto/h1/conn.rs:283: read eof
It works on my side with 8 shards running:
docker run -p 8080:80 -v /scratch-1/:/data --runtime=habana -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=my_token --cap-add=sys_nice --ipc=host tgi_gaudi --model-id meta-llama/Llama-2-13b-chat-hf --sharded true --num-shard 8 --max-total-tokens 8192 --max-input-length 4096 --max-top-n-tokens 1 --max-best-of 1 --disable-custom-kernels --max-stop-sequences 1 --validation-workers 1 --max-batch-total-tokens 8192 --max-batch-prefill-tokens 4096 --waiting-served-ratio 0 --max-waiting-tokens 4096
With 1 shard, I can reproduce the error but I think it's just an out-of-memory error and sharding is needed to make it work with these dimensions. It does work on 1 shard with smaller inputs.
Sorry @regisss
With 1 shard, I can reproduce the error but I think it's just an out-of-memory error and sharding is needed to make it work with these dimensions. It does work on 1 shard with smaller inputs.
Is this the error from #54 (comment)?
In the hl-smi output shown, each card has ~98304MiB of memory, so this doesn't seem to be an out-of-memory problem?
5 3541452 C text-generation 24834MiB
And regarding this:
It works on my side with 8 shards running:
Which GitHub branch did you use? On my side I still get this issue (using v1.2-release):
replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Is this the error from #54 (comment)?
Yes, this one.
In the hl-smi output shown, each card has ~98304MiB of memory, so this doesn't seem to be an out-of-memory problem?
Yes, but the model weights already account for ~26GB, and you also have to store a key-value cache for up to 8192 tokens. This is very big. However, sharding will help a lot here, as it divides the memory footprint of the model by the number of shards.
Which GitHub branch did you use? On my side I still get this issue (using v1.2-release)
I use v1.2-release. Are you sure your branch and Docker image are up to date?
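To put rough numbers on the memory argument above, here is a back-of-the-envelope sketch. It assumes the published Llama-2-13B dimensions (40 layers, hidden size 5120), about 13 billion parameters, and 2 bytes per value (bf16); none of these figures are stated in this thread. It counts only weights and KV cache, so activations and graph-compilation workspace, which can be substantial for a 4096-token prefill, are not included. The even split across shards is an idealization.

```python
# Rough memory estimate for Llama-2-13B: weights and KV cache only.
# All constants are assumptions taken from the public model config,
# not values reported in this thread.
NUM_LAYERS = 40         # num_hidden_layers
HIDDEN_SIZE = 5120      # hidden_size
NUM_PARAMS = 13.0e9     # approximate parameter count
BYTES_PER_VALUE = 2     # bf16 / fp16
GIB = 1024 ** 3


def weight_bytes(num_shards=1):
    """Bytes of model weights held by each shard (idealized even split)."""
    return NUM_PARAMS * BYTES_PER_VALUE / num_shards


def kv_cache_bytes(total_tokens, num_shards=1):
    """Bytes of KV cache per shard for `total_tokens` cached tokens."""
    # One key vector and one value vector of HIDDEN_SIZE per layer per token.
    per_token = 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE
    return per_token * total_tokens / num_shards


if __name__ == "__main__":
    for shards in (1, 8):
        w = weight_bytes(shards) / GIB
        kv = kv_cache_bytes(8192, shards) / GIB
        print(f"{shards} shard(s): weights ~{w:.1f} GiB, KV cache ~{kv:.2f} GiB per device")
```

The estimate is a lower bound on per-device usage, so it cannot by itself confirm or rule out an out-of-memory condition; it only shows how both terms shrink as the shard count grows.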
My server spec
$ neofetch
`:+ssssssssssssssssss+:` ---------------------
-+ssssssssssssssssssyyssss+- OS: Ubuntu 22.04.3 LTS x86_64
.ossssssssssssssssssdMMMNysssso. Host: Super Server 0123456789
/ssssssssssshdmmNNmmyNMMMMhssssss/ Kernel: 6.5.0-15-generic
+ssssssssshmydMMMMMMMNddddyssssssss+ Uptime: 15 days, 52 mins
/sssssssshNMMMyhhyyyyhmNMMMNhssssssss/ Packages: 1834 (dpkg), 11 (snap)
.ssssssssdMMMNhsssssssssshNMMMdssssssss. Shell: bash 5.1.16
+sssshhhyNMMNyssssssssssssyNMMMysssssss+ Resolution: 1024x768
ossyNMMMNyMMhsssssssssssssshmmmhssssssso Terminal: /dev/pts/0
ossyNMMMNyMMhsssssssssssssshmmmhssssssso CPU: Intel Xeon Platinum 8380 (160) @ 3.400GHz
+sssshhhyNMMNyssssssssssssyNMMMysssssss+ GPU: 03:00.0 ASPEED Technology, Inc. ASPEED Graphics Family
.ssssssssdMMMNhsssssssssshNMMMdssssssss. Memory: 45394MiB / 1031678MiB
/sssssssshNMMMyhhyyyyhdNMMMNhssssssss/
+sssssssssdmydMMMMMMMMddddyssssssss+
/ssssssssssshdmNNNNmyNMMMMhssssss/
.ossssssssssssssssssdMMMNysssso.
-+sssssssssssssssssyyyssss+-
`:+ssssssssssssssssss+:`
.-/+oossssoo+/-.
Here's the full error log from when I try to run with 8 shards:
error-tgi-gaudi.txt
I still get the issue using the latest v1.2-release branch; not sure whether it's my hardware or the library. I'm still analyzing the DeepSpeed library source code.
Weird 🤔
Can you set trust_remote_code to False please? I don't think that will solve it, but it may interfere with the modeling code.
Also, can you show me the output of pip show deepspeed?
root@107aca520a2c:/usr/src# pip show deepspeed
Name: deepspeed
Version: 0.12.4+hpu.synapse.v1.14.0
Summary: DeepSpeed library
Home-page: http://deepspeed.ai
Author: DeepSpeed Team
Author-email: deepspeed-info@microsoft.com
License: Apache Software License 2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: hjson, ninja, numpy, packaging, psutil, py-cpuinfo, pydantic, pynvml, torch, tqdm
Required-by: text-generation-server
Still the same error with trust_remote_code disabled, @regisss
Hmm, that looks all right. Can you share the output of pip freeze please?
Here is the pip freeze output
root@abf201df1755:/usr/src# pip freeze
absl-py==2.1.0
accelerate==0.27.2
aiohttp==3.8.5
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.1.0
av==9.2.0
backoff==2.2.1
cachetools==5.3.2
certifi==2023.7.22
cffi==1.15.1
cfgv==3.4.0
charset-normalizer==3.2.0
click==8.1.7
cmake==3.28.1
coloredlogs==15.0.1
datasets==2.14.4
deepspeed @ git+https://github.com/HabanaAI/DeepSpeed.git@fad45b24c7c9070251711a0d7d6f1b82805072ad
Deprecated==1.2.14
diffusers==0.20.1
dill==0.3.7
distlib==0.3.8
exceptiongroup==1.2.0
expecttest==0.2.1
filelock==3.12.3
frozenlist==1.4.0
fsspec==2023.6.0
google-auth==2.26.2
google-auth-oauthlib==0.4.6
googleapis-common-protos==1.60.0
grpc-interceptor==0.15.3
grpcio==1.57.0
grpcio-reflection==1.48.2
grpcio-status==1.48.2
grpcio-tools==1.51.1
habana-media-loader==1.14.0.493
habana-pyhlml==1.14.0.493
habana-torch-dataloader @ file:///tmp/tmp.Y8DXnLRS3C/habana_torch_dataloader-1.14.0.493-cp310-cp310-linux_x86_64.whl#sha256=d57c0e52bf97b9a38a261986ed34d1ed59986fccd0d48d8ca15712221855640e
habana-torch-plugin @ file:///tmp/tmp.Y8DXnLRS3C/habana_torch_plugin-1.14.0.493-cp310-cp310-linux_x86_64.whl#sha256=a342bf1183f7813d2ddfb893370bf10a06bd4490ac435d6c1e262a096f1986a4
habana_gpu_migration @ file:///tmp/tmp.Y8DXnLRS3C/habana_gpu_migration-1.14.0.493-cp310-cp310-linux_x86_64.whl#sha256=dd08b8a0b53571b9f9019cddf1ba38d53312f200c2cd7de66b7185ea5c6cccc2
habana_quantization_toolkit @ file:///tmp/tmp.Y8DXnLRS3C/habana_quantization_toolkit-1.14.0.493-py3-none-any.whl#sha256=32bf985b89ca80889442ce2961f2ec831f1352fdbff34bc0089bcb48f47f8809
hf_transfer==0.1.3
hjson==3.1.0
huggingface-hub==0.16.4
humanfriendly==10.0
identify==2.5.33
idna==3.4
importlib-metadata==6.8.0
iniconfig==2.0.0
intel-openmp==2023.2.3
Jinja2==3.1.2
lightning==2.1.2
lightning-habana==1.3.0
lightning-utilities==0.10.1
loguru==0.6.0
Markdown==3.5.2
MarkupSafe==2.1.3
mkl==2023.1.0
mkl-include==2023.1.0
mpi4py==3.1.4
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
mypy-protobuf==3.4.0
networkx==3.1
ninja==1.11.1.1
nodeenv==1.8.0
numpy==1.25.2
oauthlib==3.2.2
opentelemetry-api==1.15.0
opentelemetry-exporter-otlp==1.15.0
opentelemetry-exporter-otlp-proto-grpc==1.15.0
opentelemetry-exporter-otlp-proto-http==1.15.0
opentelemetry-instrumentation==0.36b0
opentelemetry-instrumentation-grpc==0.36b0
opentelemetry-proto==1.15.0
opentelemetry-sdk==1.15.0
opentelemetry-semantic-conventions==0.36b0
optimum==1.13.2
optimum-habana==1.10.0
packaging==23.1
pandas==2.0.3
pathspec==0.12.1
peft==0.4.0
perfetto==0.7.0
Pillow==10.0.0
Pillow-SIMD==7.0.0.post3
platformdirs==4.1.0
pluggy==1.3.0
pre-commit==3.3.3
protobuf==3.20.3
psutil==5.9.5
py-cpuinfo==9.0.0
pyarrow==13.0.0
pyasn1==0.5.1
pyasn1-modules==0.3.0
pybind11==2.10.4
pycparser==2.21
pydantic==1.10.13
pynvml==8.0.4
pytest==7.4.4
python-dateutil==2.8.2
pytorch-lightning==2.1.3
pytz==2023.3
PyYAML==6.0.1
regex==2023.8.8
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
safetensors==0.3.2
sentencepiece==0.1.99
six==1.16.0
sympy==1.12
tbb==2021.11.0
tdqm==0.0.1
tensorboard==2.11.2
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
text-generation-server @ file:///usr/src/server
tokenizers==0.14.1
tomli==2.0.1
torch @ file:///tmp/tmp.Y8DXnLRS3C/torch-2.1.1a0%2Bgitb51c9f6-cp310-cp310-linux_x86_64.whl#sha256=1abf98885ccc265886480bdc3e26f3b7eebf19d0e7913eb75e2ad980b7d70089
torch_tb_profiler @ file:///tmp/tmp.Y8DXnLRS3C/torch_tb_profiler-0.4.0-py3-none-any.whl#sha256=0d3af22de662e6641215b5e7cd2b3472d4ef2c4fa90a6b5ae43fcca72301db7d
torchaudio @ file:///tmp/tmp.Y8DXnLRS3C/torchaudio-2.1.0%2B6ea1133-cp310-cp310-linux_x86_64.whl#sha256=d32495f49785a114acdeb2299c9006015b9d7b0f2c4c5ba81908dc35ae09d237
torchdata @ file:///tmp/tmp.Y8DXnLRS3C/torchdata-0.7.0%2Bc5f2204-py3-none-any.whl#sha256=a675577c0018ca609e5e21e0c6bc712e6aa3d1e119d9ffd2ec1a09194f8dae4e
torchmetrics==1.3.0.post0
torchtext @ file:///tmp/tmp.Y8DXnLRS3C/torchtext-0.16.0a0%2B4e255c9-cp310-cp310-linux_x86_64.whl#sha256=4a373211b2f80e632aed4143f2789d7102516a61c69eaab9890814543022b192
torchvision @ file:///tmp/tmp.Y8DXnLRS3C/torchvision-0.16.0%2Bfbb4cc5-cp310-cp310-linux_x86_64.whl#sha256=273904fb11dacebc32e66e3a03a9d206fe61d3d45bad914f8e0eaf439f8f43fc
tqdm==4.66.1
transformers==4.34.1
typer==0.6.1
types-protobuf==4.24.0.20240129
typing_extensions==4.7.1
tzdata==2023.3
urllib3==2.0.4
virtualenv==20.25.0
Werkzeug==3.0.1
wrapt==1.15.0
xxhash==3.3.0
yamllint==1.33.0
yarl==1.9.2
zipp==3.16.2
Will check it as soon as possible, many thanks @regisss
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/tgi_service.py", line 29, in <module>
main(args)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/tgi_service.py", line 16, in main
server.serve(
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 213, in serve
asyncio.run(serve_inner(model_id, revision, dtype, sharded))
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 177, in serve_inner
model = get_model(model_id, revision=revision, dtype=data_type)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/__init__.py", line 33, in get_model
return CausalLM(model_id, revision, dtype)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 556, in __init__
model = deepspeed.init_inference(model, **ds_inference_kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 168, in __init__
self._apply_injection_policy(config, client_module)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s] rank=0
2024-02-23T11:07:51.845305Z ERROR text_generation_launcher: Shard 0 failed to start
2024-02-23T11:07:51.845337Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
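The traceback above ends in an `UnboundLocalError` right after the log line `Loading 0 checkpoint shards`. One plausible reading, not confirmed in this thread, is that a variable is assigned only inside a loop over checkpoint files and the loop never runs because no files were found. The sketch below reproduces that general Python pattern; `replace_layers` and `replace_layers_checked` are hypothetical stand-ins, not DeepSpeed's actual code.

```python
def replace_layers(checkpoint_files):
    """Hypothetical stand-in: assigns `replaced` only inside the loop."""
    for path in checkpoint_files:
        replaced = f"module loaded from {path}"
    # With an empty list the loop body never ran, so `replaced` is unbound here.
    return replaced


def replace_layers_checked(checkpoint_files):
    """Same logic, but fails with an explicit message when nothing was loaded."""
    replaced = None
    for path in checkpoint_files:
        replaced = f"module loaded from {path}"
    if replaced is None:
        raise FileNotFoundError("no checkpoint shards found to load")
    return replaced


if __name__ == "__main__":
    try:
        replace_layers([])
    except UnboundLocalError as exc:
        print(type(exc).__name__, "-", exc)
```

If this reading is right, the thing to investigate is why the loader found zero checkpoint files in the model directory, rather than the `set_lm_head` call itself.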
I still get the same error; here is the pip freeze output:
$ docker exec -it llm bash -c "pip freeze"
absl-py==2.1.0
accelerate==0.27.2
aiohttp==3.9.0
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.1.0
av==9.2.0
backoff==2.2.1
cachetools==5.3.2
certifi==2023.11.17
cffi==1.15.1
cfgv==3.4.0
charset-normalizer==3.3.2
click==8.1.7
cmake==3.28.1
coloredlogs==15.0.1
datasets==2.14.7
deepspeed @ git+https://github.com/HabanaAI/DeepSpeed.git@fad45b24c7c9070251711a0d7d6f1b82805072ad
Deprecated==1.2.14
diffusers==0.26.3
dill==0.3.7
distlib==0.3.8
exceptiongroup==1.2.0
expecttest==0.2.1
filelock==3.13.1
frozenlist==1.4.0
fsspec==2023.10.0
google-auth==2.26.2
google-auth-oauthlib==0.4.6
googleapis-common-protos==1.61.0
grpc-interceptor==0.15.4
grpcio==1.59.3
grpcio-reflection==1.48.2
grpcio-status==1.48.2
grpcio-tools==1.51.1
habana-media-loader==1.14.0.493
habana-pyhlml==1.14.0.493
habana-torch-dataloader @ file:///tmp/tmp.Y8DXnLRS3C/habana_torch_dataloader-1.14.0.493-cp310-cp310-linux_x86_64.whl#sha256=d57c0e52bf97b9a38a261986ed34d1ed59986fccd0d48d8ca15712221855640e
habana-torch-plugin @ file:///tmp/tmp.Y8DXnLRS3C/habana_torch_plugin-1.14.0.493-cp310-cp310-linux_x86_64.whl#sha256=a342bf1183f7813d2ddfb893370bf10a06bd4490ac435d6c1e262a096f1986a4
habana_gpu_migration @ file:///tmp/tmp.Y8DXnLRS3C/habana_gpu_migration-1.14.0.493-cp310-cp310-linux_x86_64.whl#sha256=dd08b8a0b53571b9f9019cddf1ba38d53312f200c2cd7de66b7185ea5c6cccc2
habana_quantization_toolkit @ file:///tmp/tmp.Y8DXnLRS3C/habana_quantization_toolkit-1.14.0.493-py3-none-any.whl#sha256=32bf985b89ca80889442ce2961f2ec831f1352fdbff34bc0089bcb48f47f8809
hf_transfer==0.1.4
hjson==3.1.0
huggingface-hub==0.20.3
humanfriendly==10.0
identify==2.5.33
idna==3.4
importlib-metadata==7.0.1
iniconfig==2.0.0
intel-openmp==2023.2.3
Jinja2==3.1.2
lightning==2.1.2
lightning-habana==1.3.0
lightning-utilities==0.10.1
loguru==0.6.0
Markdown==3.5.2
MarkupSafe==2.1.3
mkl==2023.1.0
mkl-include==2023.1.0
mpi4py==3.1.4
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
mypy-protobuf==3.4.0
networkx==3.2.1
ninja==1.11.1.1
nodeenv==1.8.0
numpy==1.26.2
oauthlib==3.2.2
opentelemetry-api==1.15.0
opentelemetry-exporter-otlp==1.15.0
opentelemetry-exporter-otlp-proto-grpc==1.15.0
opentelemetry-exporter-otlp-proto-http==1.15.0
opentelemetry-instrumentation==0.36b0
opentelemetry-instrumentation-grpc==0.36b0
opentelemetry-proto==1.15.0
opentelemetry-sdk==1.15.0
opentelemetry-semantic-conventions==0.36b0
optimum==1.17.1
optimum-habana==1.10.4
packaging==23.2
pandas==2.1.3
pathspec==0.12.1
peft==0.4.0
perfetto==0.7.0
Pillow==10.1.0
Pillow-SIMD==7.0.0.post3
platformdirs==4.1.0
pluggy==1.3.0
pre-commit==3.3.3
protobuf==3.20.3
psutil==5.9.6
py-cpuinfo==9.0.0
pyarrow==14.0.1
pyarrow-hotfix==0.6
pyasn1==0.5.1
pyasn1-modules==0.3.0
pybind11==2.10.4
pycparser==2.21
pydantic==1.10.13
pynvml==8.0.4
pytest==7.4.4
python-dateutil==2.8.2
pytorch-lightning==2.1.3
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.10.3
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
safetensors==0.4.2
sentencepiece==0.1.99
six==1.16.0
sympy==1.12
tbb==2021.11.0
tdqm==0.0.1
tensorboard==2.11.2
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
text-generation-server @ file:///usr/src/server
tokenizers==0.15.2
tomli==2.0.1
torch @ file:///tmp/tmp.Y8DXnLRS3C/torch-2.1.1a0%2Bgitb51c9f6-cp310-cp310-linux_x86_64.whl#sha256=1abf98885ccc265886480bdc3e26f3b7eebf19d0e7913eb75e2ad980b7d70089
torch_tb_profiler @ file:///tmp/tmp.Y8DXnLRS3C/torch_tb_profiler-0.4.0-py3-none-any.whl#sha256=0d3af22de662e6641215b5e7cd2b3472d4ef2c4fa90a6b5ae43fcca72301db7d
torchaudio @ file:///tmp/tmp.Y8DXnLRS3C/torchaudio-2.1.0%2B6ea1133-cp310-cp310-linux_x86_64.whl#sha256=d32495f49785a114acdeb2299c9006015b9d7b0f2c4c5ba81908dc35ae09d237
torchdata @ file:///tmp/tmp.Y8DXnLRS3C/torchdata-0.7.0%2Bc5f2204-py3-none-any.whl#sha256=a675577c0018ca609e5e21e0c6bc712e6aa3d1e119d9ffd2ec1a09194f8dae4e
torchmetrics==1.3.0.post0
torchtext @ file:///tmp/tmp.Y8DXnLRS3C/torchtext-0.16.0a0%2B4e255c9-cp310-cp310-linux_x86_64.whl#sha256=4a373211b2f80e632aed4143f2789d7102516a61c69eaab9890814543022b192
torchvision @ file:///tmp/tmp.Y8DXnLRS3C/torchvision-0.16.0%2Bfbb4cc5-cp310-cp310-linux_x86_64.whl#sha256=273904fb11dacebc32e66e3a03a9d206fe61d3d45bad914f8e0eaf439f8f43fc
tqdm==4.66.1
transformers==4.37.2
typer==0.6.1
types-protobuf==4.24.0.20240129
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.1.0
virtualenv==20.25.0
Werkzeug==3.0.1
wrapt==1.16.0
xxhash==3.4.1
yamllint==1.33.0
yarl==1.9.3
zipp==3.17.0
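Comparing the two `pip freeze` listings by eye is error-prone. A small sketch like the following can report which pins changed between two dumps. The helper names `parse_freeze` and `diff_freeze` are made up for illustration, and the sample input is a short excerpt of the two listings in this thread.

```python
def parse_freeze(text):
    """Map package name -> pinned version (or source URL) from `pip freeze` text."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if " @ " in line:
            # Direct references such as "torch @ file:///..." or "pkg @ git+https://..."
            name, source = line.split(" @ ", 1)
            pins[name.lower()] = source
        elif "==" in line:
            name, version = line.split("==", 1)
            pins[name.lower()] = version
    return pins


def diff_freeze(old_text, new_text):
    """Return {name: (old, new)} for every package whose pin differs."""
    old, new = parse_freeze(old_text), parse_freeze(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }


if __name__ == "__main__":
    first = "optimum-habana==1.10.0\npeft==0.4.0\ntransformers==4.34.1"
    second = "optimum-habana==1.10.4\npeft==0.4.0\npyarrow-hotfix==0.6\ntransformers==4.37.2"
    for name, (old, new) in diff_freeze(first, second).items():
        print(f"{name}: {old} -> {new}")
```

A package present in only one listing shows up with `None` on the other side, which makes newly added dependencies easy to spot.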
Can you try the text-generation example and run the following command in the same environment please?
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
--model_name_or_path model_name \
--batch_size 1 \
--use_hpu_graphs \
--use_kv_cache \
--max_new_tokens 100
I ran it inside the container:
root@3e44951622db:/optimum-habana/examples/text-generation# git branch -v
* (HEAD detached at v1.10.4) 1dfbc02 Release: v1.10.4
main 89cdd6f Add seed in sft example, make sft result reproducable (#735)
root@3e44951622db:/optimum-habana/examples/text-generation# env | grep HABANA
HABANA_LOGS=/var/log/habana_logs/
HABANA_PLUGINS_LIB_PATH=/opt/habanalabs/habana_plugins
HABANA_VISIBLE_DEVICES=all
HABANA_SCAL_BIN_PATH=/opt/habanalabs/engines_fw
root@3e44951622db:/optimum-habana/examples/text-generation# python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /data/meta-llama_Llama-2-13b-chat-hf --batch_size 1 --use_hpu_graphs --use_kv_cache --max_new_tokens 100
DistributedRunner run(): command = deepspeed --num_nodes 1 --num_gpus 8 --no_local_rank --master_port 29500 run_generation.py --model_name_or_path /data/meta-llama_Llama-2-13b-chat-hf --batch_size 1 --use_hpu_graphs --use_kv_cache --max_new_tokens 100
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
[2024-02-23 11:28:04,339] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[2024-02-23 11:28:05,460] [WARNING] [runner.py:206:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-02-23 11:28:05,518] [INFO] [runner.py:585:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --no_local_rank --enable_each_rank_log=None run_generation.py --model_name_or_path /data/meta-llama_Llama-2-13b-chat-hf --batch_size 1 --use_hpu_graphs --use_kv_cache --max_new_tokens 100
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
[2024-02-23 11:28:07,276] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[2024-02-23 11:28:08,398] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-02-23 11:28:08,398] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-02-23 11:28:08,398] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-02-23 11:28:08,398] [INFO] [launch.py:164:main] dist_world_size=8
[2024-02-23 11:28:08,398] [INFO] [launch.py:166:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
[2024-02-23 11:28:12,234] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
[2024-02-23 11:28:12,546] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
[2024-02-23 11:28:12,830] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[2024-02-23 11:28:13,149] [WARNING] [comm.py:163:init_deepspeed_backend] HCCL backend in DeepSpeed not yet implemented
[2024-02-23 11:28:13,150] [INFO] [comm.py:637:init_distributed] cdb=None
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
[2024-02-23 11:28:13,242] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[2024-02-23 11:28:13,466] [WARNING] [comm.py:163:init_deepspeed_backend] HCCL backend in DeepSpeed not yet implemented
[2024-02-23 11:28:13,467] [INFO] [comm.py:637:init_distributed] cdb=None
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
[2024-02-23 11:28:13,727] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[2024-02-23 11:28:13,742] [WARNING] [comm.py:163:init_deepspeed_backend] HCCL backend in DeepSpeed not yet implemented
[2024-02-23 11:28:13,742] [INFO] [comm.py:637:init_distributed] cdb=None
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
[2024-02-23 11:28:14,117] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[2024-02-23 11:28:14,166] [WARNING] [comm.py:163:init_deepspeed_backend] HCCL backend in DeepSpeed not yet implemented
[2024-02-23 11:28:14,166] [INFO] [comm.py:637:init_distributed] cdb=None
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
[2024-02-23 11:28:14,207] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
warnings.warn(
[2024-02-23 11:28:14,218] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
02/23/2024 11:28:14 - INFO - __main__ - DeepSpeed is enabled.
[2024-02-23 11:28:14,694] [WARNING] [comm.py:163:init_deepspeed_backend] HCCL backend in DeepSpeed not yet implemented
[2024-02-23 11:28:14,694] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-02-23 11:28:14,694] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend hccl
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
File "/optimum-habana/examples/text-generation/run_generation.py", line 562, in <module>
main()
File "/optimum-habana/examples/text-generation/run_generation.py", line 257, in main
model, tokenizer, generation_config = initialize_model(args, logger)
File "/optimum-habana/examples/text-generation/utils.py", line 374, in initialize_model
else setup_distributed_model(args, model_dtype, model_kwargs, logger)
File "/optimum-habana/examples/text-generation/utils.py", line 238, in setup_distributed_model
model = deepspeed.init_inference(model, **ds_inference_kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 154, in __init__
self._apply_injection_policy(config, client_module)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
File "/optimum-habana/examples/text-generation/run_generation.py", line 562, in <module>
main()
File "/optimum-habana/examples/text-generation/run_generation.py", line 257, in main
model, tokenizer, generation_config = initialize_model(args, logger)
File "/optimum-habana/examples/text-generation/utils.py", line 374, in initialize_model
else setup_distributed_model(args, model_dtype, model_kwargs, logger)
File "/optimum-habana/examples/text-generation/utils.py", line 238, in setup_distributed_model
model = deepspeed.init_inference(model, **ds_inference_kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 154, in __init__
self._apply_injection_policy(config, client_module)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
File "/optimum-habana/examples/text-generation/run_generation.py", line 562, in <module>
main()
File "/optimum-habana/examples/text-generation/run_generation.py", line 257, in main
model, tokenizer, generation_config = initialize_model(args, logger)
File "/optimum-habana/examples/text-generation/utils.py", line 374, in initialize_model
else setup_distributed_model(args, model_dtype, model_kwargs, logger)
File "/optimum-habana/examples/text-generation/utils.py", line 238, in setup_distributed_model
model = deepspeed.init_inference(model, **ds_inference_kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 154, in __init__
self._apply_injection_policy(config, client_module)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
[2024-02-23 11:28:15,399] [WARNING] [comm.py:163:init_deepspeed_backend] HCCL backend in DeepSpeed not yet implemented
[2024-02-23 11:28:15,399] [INFO] [comm.py:637:init_distributed] cdb=None
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
[2024-02-23 11:28:15,543] [WARNING] [comm.py:163:init_deepspeed_backend] HCCL backend in DeepSpeed not yet implemented
[2024-02-23 11:28:15,543] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-02-23 11:28:15,547] [WARNING] [comm.py:163:init_deepspeed_backend] HCCL backend in DeepSpeed not yet implemented
[2024-02-23 11:28:15,547] [INFO] [comm.py:637:init_distributed] cdb=None
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
File "/optimum-habana/examples/text-generation/run_generation.py", line 562, in <module>
main()
File "/optimum-habana/examples/text-generation/run_generation.py", line 257, in main
model, tokenizer, generation_config = initialize_model(args, logger)
File "/optimum-habana/examples/text-generation/utils.py", line 374, in initialize_model
else setup_distributed_model(args, model_dtype, model_kwargs, logger)
File "/optimum-habana/examples/text-generation/utils.py", line 238, in setup_distributed_model
model = deepspeed.init_inference(model, **ds_inference_kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 154, in __init__
self._apply_injection_policy(config, client_module)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
[2024-02-23 11:28:19,656] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.4+hpu.synapse.v1.14.0, git-hash=fad45b2, git-branch=1.14.0
[2024-02-23 11:28:19,657] [INFO] [logging.py:96:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
File "/optimum-habana/examples/text-generation/run_generation.py", line 562, in <module>
main()
File "/optimum-habana/examples/text-generation/run_generation.py", line 257, in main
model, tokenizer, generation_config = initialize_model(args, logger)
File "/optimum-habana/examples/text-generation/utils.py", line 374, in initialize_model
else setup_distributed_model(args, model_dtype, model_kwargs, logger)
File "/optimum-habana/examples/text-generation/utils.py", line 238, in setup_distributed_model
model = deepspeed.init_inference(model, **ds_inference_kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 154, in __init__
self._apply_injection_policy(config, client_module)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
File "/optimum-habana/examples/text-generation/run_generation.py", line 562, in <module>
main()
File "/optimum-habana/examples/text-generation/run_generation.py", line 257, in main
model, tokenizer, generation_config = initialize_model(args, logger)
File "/optimum-habana/examples/text-generation/utils.py", line 374, in initialize_model
else setup_distributed_model(args, model_dtype, model_kwargs, logger)
File "/optimum-habana/examples/text-generation/utils.py", line 238, in setup_distributed_model
model = deepspeed.init_inference(model, **ds_inference_kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 154, in __init__
self._apply_injection_policy(config, client_module)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
File "/optimum-habana/examples/text-generation/run_generation.py", line 562, in <module>
main()
File "/optimum-habana/examples/text-generation/run_generation.py", line 257, in main
model, tokenizer, generation_config = initialize_model(args, logger)
File "/optimum-habana/examples/text-generation/utils.py", line 374, in initialize_model
else setup_distributed_model(args, model_dtype, model_kwargs, logger)
File "/optimum-habana/examples/text-generation/utils.py", line 238, in setup_distributed_model
model = deepspeed.init_inference(model, **ds_inference_kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 154, in __init__
self._apply_injection_policy(config, client_module)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
File "/optimum-habana/examples/text-generation/run_generation.py", line 562, in <module>
main()
File "/optimum-habana/examples/text-generation/run_generation.py", line 257, in main
model, tokenizer, generation_config = initialize_model(args, logger)
File "/optimum-habana/examples/text-generation/utils.py", line 374, in initialize_model
else setup_distributed_model(args, model_dtype, model_kwargs, logger)
File "/optimum-habana/examples/text-generation/utils.py", line 238, in setup_distributed_model
model = deepspeed.init_inference(model, **ds_inference_kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 154, in __init__
self._apply_injection_policy(config, client_module)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
[2024-02-23 11:28:21,418] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 5103
[2024-02-23 11:28:21,418] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 5104
[2024-02-23 11:28:21,423] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 5105
[2024-02-23 11:28:21,424] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 5106
[2024-02-23 11:28:21,426] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 5107
[2024-02-23 11:28:21,427] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 5108
[2024-02-23 11:28:21,428] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 5109
[2024-02-23 11:28:21,429] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 5110
[2024-02-23 11:28:21,430] [ERROR] [launch.py:322:sigkill_handler] ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', '/data/meta-llama_Llama-2-13b-chat-hf', '--batch_size', '1', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100'] exits with return code = 1
[ERROR|distributed_runner.py:222] 2024-02-23 11:28:22,075 >> deepspeed --num_nodes 1 --num_gpus 8 --no_local_rank --master_port 29500 run_generation.py --model_name_or_path /data/meta-llama_Llama-2-13b-chat-hf --batch_size 1 --use_hpu_graphs --use_kv_cache --max_new_tokens 100 exited with status = 1
root@3e44951622db:/optimum-habana/examples/text-generation#
I think it simply doesn't find any checkpoint. Can you show me the content of your model folder? Or at least just tell me if there is any checkpoint inside this folder?
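To illustrate why finding no checkpoint can surface as an `UnboundLocalError`, here is a minimal sketch. It assumes a result variable that is only assigned inside the loop over shards. `replace_layers` is a hypothetical stand-in, not DeepSpeed's actual code.

```python
def replace_layers(checkpoint_shards):
    """Hypothetical stand-in (NOT DeepSpeed's code): the result is only
    assigned inside the loop over checkpoint shards."""
    for shard in checkpoint_shards:
        replaced = f"module built from {shard}"
    # With zero shards the loop body never runs, so `replaced` is unbound here.
    return replaced


try:
    replace_layers([])  # mirrors "Loading 0 checkpoint shards" in the log
except UnboundLocalError as err:
    print(type(err).__name__)
```

With at least one shard in the list, the same function returns normally.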
$ ls -lah /data/meta-llama_Llama-2-13b-chat-hf
total 25G
drwxrwxr-x 2 smci smci 4.0K 十 23 13:28 .
drwxrwxr-x 5 smci smci 4.0K 二 24 11:45 ..
-rw-rw-r-- 1 smci smci 6.9K 十 23 13:28 LICENSE.txt
-rw-rw-r-- 1 smci smci 11K 十 23 13:28 README.md
-rw-rw-r-- 1 smci smci 4.7K 十 23 13:28 USE_POLICY.md
-rw-rw-r-- 1 smci smci 587 十 23 13:28 config.json
-rw-rw-r-- 1 smci smci 188 十 23 13:28 generation_config.json
-rw-rw-r-- 1 smci smci 815 十 23 13:28 huggingface-metadata.txt
-rw-rw-r-- 1 smci smci 9.3G 十 23 13:29 model-00001-of-00003.safetensors
-rw-rw-r-- 1 smci smci 9.3G 十 23 13:29 model-00002-of-00003.safetensors
-rw-rw-r-- 1 smci smci 5.8G 十 23 13:28 model-00003-of-00003.safetensors
-rw-rw-r-- 1 smci smci 33K 十 23 13:28 model.safetensors.index.json
-rw-rw-r-- 1 smci smci 33K 十 23 13:28 pytorch_model.bin.index.json
-rw-rw-r-- 1 smci smci 414 十 23 13:28 special_tokens_map.json
-rw-rw-r-- 1 smci smci 1.8M 十 23 13:28 tokenizer.json
-rw-rw-r-- 1 smci smci 489K 十 23 13:28 tokenizer.model
-rw-rw-r-- 1 smci smci 1.6K 十 23 13:28 tokenizer_config.json
Here is the content of the model folder.
Okay, I understand better what is going on now.
Your folder only has checkpoints in the safetensors format, not in the pickle format (i.e. *.bin). DeepSpeed has only recently gained support for safetensors checkpoints, so loading them should become possible once the next version of Habana's SDK (v1.15) is released.
For now, you can only use the *.bin checkpoints if you need DeepSpeed. You can download them here.
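A quick way to check which weight formats a local model folder actually contains is sketched below. `checkpoint_formats` is a hypothetical helper written for this thread, not part of Transformers or DeepSpeed. Note that an index file such as pytorch_model.bin.index.json does not count as a *.bin checkpoint.

```python
from pathlib import Path


def checkpoint_formats(model_dir):
    """Return the weight formats that have actual shard files in model_dir."""
    folder = Path(model_dir)
    formats = set()
    if any(folder.glob("*.safetensors")):
        formats.add("safetensors")
    # Index files like pytorch_model.bin.index.json end in .json, so this
    # pattern only matches real *.bin weight files.
    if any(folder.glob("*.bin")):
        formats.add("bin")
    return formats
```

For the folder listed above, `checkpoint_formats("/data/meta-llama_Llama-2-13b-chat-hf")` would contain only `"safetensors"`, which matches the "Loading 0 checkpoint shards" message.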
@regisss I see, it works. I thought the Transformers library could handle safetensors models as well, cmiiw.
So could you show me the part of the code that says it only runs with .bin files? I can't find it in load_checkpoint.py in https://github.dev/HabanaAI/DeepSpeed/tree/1.14.0
> @regisss I see, it works. I thought the Transformers library could handle safetensors models as well, cmiiw.
Transformers can handle safetensors checkpoints. However, for big models that cannot fit on a single device, we use DeepSpeed, and in that case DeepSpeed takes care of loading the model.
> So could you show me the part of the code that says it only runs with .bin files? I can't find it in load_checkpoint.py in https://github.dev/HabanaAI/DeepSpeed/tree/1.14.0
You can check the description of this PR and the following messages for more information about this issue.
@regisss Btw, I just benchmarked Habana Gaudi 2 against the Nvidia A100, and when I run inference (text generation) with TGI Gaudi, the Gaudi 2 is still slower than the A100.
Could you confirm this on your side?
This blog post shows Habana outperforming the Nvidia A100:
https://huggingface.co/blog/habana-gaudi-2-benchmark
The json file:
{
"inputs": "Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then she continued her way. On her way she met a copy. The copy warned the Little Blind Text, that where it came from it would have been rewritten a thousand times and everything that was left from its origin would be the word and and the Little Blind Text should turn around and return to its own, safe country. But nothing the copy said could convince her and so it didn’t take long until a few insidious Copy Writers ambushed her, made her drunk with Longe and Parole and dragged her into their agency, where they abused her for their projects again and again. And if she hasn’t been rewritten, then they are still using her. Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. 
A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then she continued her way. On her way she met a copy. The copy warned the Little Blind Text, that where it came from it would have been rewritten a thousand times and everything that was left from its origin would be the word and and the Little Blind Text should turn around and return to its own, safe country. But nothing the copy said could convince her and so it didn’t take long until a few insidious Copy Writers ambushed her, made her drunk with Longe and Parole and dragged her into their agency, where they abused her for their projects again and again. And if she hasn’t been rewritten, then they are still using her. Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. 
Even the all-powerful Pointing has no control about the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then she continued her way. On her way she met a copy. The copy warned the Little Blind Text, that where it came from it would have been rewritten a thousand times and everything that was left from its origin would be the word and and the Little Blind Text should turn around and return to its own, safe country. But nothing the copy said could convince her and so it didn’t take long until a few insidious Copy Writers ambushed her, made her drunk with Longe and Parole and dragged her into their agency, where they abused her for their projects again and again. And if she hasn’t been rewritten, then they are still using her.Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about the blind. Summarize it",
"parameters": {
"max_new_tokens": 4096,
"best_of": 1,
"repetition_penalty": 1.17,
"return_full_text": false,
"temperature": 0.01,
"top_p": 0.14,
"top_k": 49,
"truncate": 4096,
"typical_p": 0.99,
"watermark": false,
"decoder_input_details": false
}
}
Run using hey with 5 concurrent workers and 10 requests in total:
$ hey -t 1000 -m POST -D test.json -H "Content-Type: application/json" -c 5 -n 10 "http://127.0.0.1:8080/generate"
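For anyone without hey installed, the same measurement can be approximated in Python. This is a rough sketch under stated assumptions, not a reimplementation of hey: `average_latency` and `post_generate` are hypothetical helpers, and the endpoint and test.json are taken from the command above.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def average_latency(send_request, total_requests=10, concurrency=5):
    """Call send_request() total_requests times using `concurrency` worker
    threads and return the mean wall-clock time per call, in seconds."""

    def timed_call(_):
        start = time.perf_counter()
        send_request()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))
    return sum(latencies) / len(latencies)


def post_generate(url="http://127.0.0.1:8080/generate", body_path="test.json"):
    """POST the JSON file to the endpoint and wait for the full response."""
    with open(body_path, "rb") as f:
        body = f.read()
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=1000) as response:
        response.read()
```

Calling `average_latency(post_generate)` against a running server sends 10 requests, 5 at a time, and returns the mean time per request in seconds.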
The benchmark result:
GPU / HPU Model | Multi-GPU / HPU | CPU | Memory (RAM) | Average response time |
---|---|---|---|---|
Nvidia A100 80GB | No (1x80GB) | 24 | 220 GB | ~ 4.45 seconds |
Nvidia A100 40GB | No (1x40GB) | 12 | 85 GB | ~ 5.98 seconds |
Nvidia A100 40GB | Yes (2x40GB) | 24 | 170 GB | ~ 4.21 seconds |
Habana Gaudi 2 (HPU) | No (1x100GB) | 160 | 1 TB | ~ 12.81 seconds |
Habana Gaudi 2 (HPU) | Yes (8x100GB) | 160 | 1 TB | ~ 12.57 seconds |
@muhammad-asn how did you set up TGI for HPU?
Please note that several variables / arguments need to be set to get optimal performance.
Based on your config, I would suggest adding:
- as env variables: MAX_TOTAL_TOKENS=6144 PREFILL_BATCH_BUCKET_SIZE=8 BATCH_BUCKET_SIZE=16
- as arguments: --max-batch-prefill-tokens 16384 --max-batch-total-tokens 98304 --max-input-length 2048 --max-total-tokens 6144
Please also note that warmup is not currently enabled on the v1.2-release branch. This means the first iterations will be much slower due to graph recompilations. However, I will add warmup later this week.
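The suggested numbers appear to be related by simple arithmetic. The sketch below makes that explicit. The relationships are inferred from the values in this thread only and are an assumption, not something taken from the TGI documentation. `derive_limits` is a hypothetical helper.

```python
def derive_limits(max_input_length, max_new_tokens,
                  prefill_batch_bucket_size, batch_bucket_size):
    """Derive the token limits from per-request sizes and bucket sizes.

    Assumed relationships (inferred from the suggested values):
      max_total_tokens         = max_input_length + max_new_tokens
      max_batch_prefill_tokens = prefill_batch_bucket_size * max_input_length
      max_batch_total_tokens   = batch_bucket_size * max_total_tokens
    """
    max_total_tokens = max_input_length + max_new_tokens
    return {
        "max_total_tokens": max_total_tokens,
        "max_batch_prefill_tokens": prefill_batch_bucket_size * max_input_length,
        "max_batch_total_tokens": batch_bucket_size * max_total_tokens,
    }


# Input length and bucket sizes from the suggestion above; 4096 is the
# max_new_tokens value from the request JSON earlier in the thread.
limits = derive_limits(
    max_input_length=2048,
    max_new_tokens=4096,
    prefill_batch_bucket_size=8,
    batch_bucket_size=16,
)
print(limits)
```

Under these assumptions the helper reproduces 6144, 16384, and 98304, so changing the input length or bucket sizes would call for adjusting the other limits together.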
@kdamaszk ack I will try it. Thank you