huggingface/tgi-gaudi

Issue running meta-llama/Llama-2-13b-chat-hf

muhammad-asn opened this issue · 32 comments

System Info

  • OS Version: 22.04.3 LTS
  • Model being used: meta-llama/Llama-2-13b-chat-hf (local model)
    $ ls -la deploy-gaudi/data/meta-llama_Llama-2-13b-chat-hf
    total 25424104
    drwxrwxr-x 2 smci smci       4096 Oct 23 13:28 .
    drwxrwxr-x 3 smci smci       4096 Feb 16 16:42 ..
    -rw-rw-r-- 1 smci smci       7020 Oct 23 13:28 LICENSE.txt
    -rw-rw-r-- 1 smci smci      10409 Oct 23 13:28 README.md
    -rw-rw-r-- 1 smci smci       4766 Oct 23 13:28 USE_POLICY.md
    -rw-rw-r-- 1 smci smci        587 Oct 23 13:28 config.json
    -rw-rw-r-- 1 smci smci        188 Oct 23 13:28 generation_config.json
    -rw-rw-r-- 1 smci smci        815 Oct 23 13:28 huggingface-metadata.txt
    -rw-rw-r-- 1 smci smci 9948693272 Oct 23 13:29 model-00001-of-00003.safetensors
    -rw-rw-r-- 1 smci smci 9904129368 Oct 23 13:29 model-00002-of-00003.safetensors
    -rw-rw-r-- 1 smci smci 6178962272 Oct 23 13:28 model-00003-of-00003.safetensors
    -rw-rw-r-- 1 smci smci      33444 Oct 23 13:28 model.safetensors.index.json
    -rw-rw-r-- 1 smci smci      33444 Oct 23 13:28 pytorch_model.bin.index.json
    -rw-rw-r-- 1 smci smci        414 Oct 23 13:28 special_tokens_map.json
    -rw-rw-r-- 1 smci smci    1842767 Oct 23 13:28 tokenizer.json
    -rw-rw-r-- 1 smci smci     499723 Oct 23 13:28 tokenizer.model
    -rw-rw-r-- 1 smci smci       1618 Oct 23 13:28 tokenizer_config.json
    
  1. HL-SMI

    $ sudo hl-smi
    +-----------------------------------------------------------------------------+
    | HL-SMI Version:                              hl-1.14.0-fw-48.0.1.0          |
    | Driver Version:                                     1.14.0-9e8ecf8          |
    |-------------------------------+----------------------+----------------------+
    | AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
    |===============================+======================+======================|
    |   0  HL-225              N/A  | 0000:b3:00.0     N/A |                   0  |
    | N/A   25C   N/A   106W / 600W |    768MiB / 98304MiB |     0%           N/A |
    |-------------------------------+----------------------+----------------------+
    |   1  HL-225              N/A  | 0000:b4:00.0     N/A |                   0  |
    | N/A   29C   N/A   104W / 600W |    768MiB / 98304MiB |     0%           N/A |
    |-------------------------------+----------------------+----------------------+
    |   2  HL-225              N/A  | 0000:19:00.0     N/A |                   0  |
    | N/A   24C   N/A   103W / 600W |    768MiB / 98304MiB |     0%           N/A |
    |-------------------------------+----------------------+----------------------+
    |   3  HL-225              N/A  | 0000:cc:00.0     N/A |                   0  |
    | N/A   28C   N/A    87W / 600W |    768MiB / 98304MiB |     0%           N/A |
    |-------------------------------+----------------------+----------------------+
    |   4  HL-225              N/A  | 0000:1a:00.0     N/A |                   0  |
    | N/A   30C   N/A    93W / 600W |    768MiB / 98304MiB |     0%           N/A |
    |-------------------------------+----------------------+----------------------+
    |   5  HL-225              N/A  | 0000:43:00.0     N/A |                   0  |
    | N/A   30C   N/A    93W / 600W |    768MiB / 98304MiB |     0%           N/A |
    |-------------------------------+----------------------+----------------------+
    |   6  HL-225              N/A  | 0000:cd:00.0     N/A |                   0  |
    | N/A   27C   N/A   116W / 600W |    768MiB / 98304MiB |     0%           N/A |
    |-------------------------------+----------------------+----------------------+
    |   7  HL-225              N/A  | 0000:44:00.0     N/A |                   0  |
    | N/A   25C   N/A    94W / 600W |    768MiB / 98304MiB |     0%           N/A |
    |-------------------------------+----------------------+----------------------+
    | Compute Processes:                                               AIP Memory |
    |  AIP       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |   0        N/A   N/A    N/A                                      N/A        |
    |   1        N/A   N/A    N/A                                      N/A        |
    |   2        N/A   N/A    N/A                                      N/A        |
    |   3        N/A   N/A    N/A                                      N/A        |
    |   4        N/A   N/A    N/A                                      N/A        |
    |   5        N/A   N/A    N/A                                      N/A        |
    |   6        N/A   N/A    N/A                                      N/A        |
    |   7        N/A   N/A    N/A                                      N/A        |
    +=============================================================================+
    
  2. CPU: Intel Xeon Platinum 8380 (160) @ 3.400GHz

  3. GPU: 03:00.0 ASPEED Technology, Inc. ASPEED Graphics Family

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Clone the repository https://github.com/huggingface/tgi-gaudi and check out the habana-dev branch

  2. Run the docker command

    $ model=/data/meta-llama_Llama-2-13b-chat-hf
    $ volume=$PWD/data
    $ docker run -p 8080:80 -v $volume:/data --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host tgi_gaudi --model-id $model
  3. After a while, the error log shows: synStatus=20 [Device already acquired] Device acquire failed.

2024-02-19T04:48:59.198129Z  INFO text_generation_launcher: Args { model_id: "/data/meta-llama_Llama-2-13b-chat-hf", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "6a3b385d5f37", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2024-02-19T04:48:59.198315Z  INFO download: text_generation_launcher: Starting download process.
2024-02-19T04:49:00.906024Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-02-19T04:49:01.202998Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-02-19T04:49:01.203458Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-02-19T04:49:05.090203Z  INFO text_generation_launcher: CLI SHARDED = False DTYPE = bfloat16

2024-02-19T04:49:11.216418Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:49:21.226050Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:49:31.236352Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:49:41.246660Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:49:51.257064Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:50:01.265897Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:50:11.275136Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:50:21.284766Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:50:31.295121Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:50:41.305467Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-02-19T04:50:46.211569Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:252: UserWarning: Device capability of hccl unspecified, assuming `cpu` and `cuda`. Please specify it via the `devices` argument of `register_backend`.
  warnings.warn(
[WARNING|utils.py:185] 2024-02-19 04:49:03,996 >> optimum-habana v1.10.0 has been validated for SynapseAI v1.14.0 but habana-frameworks v1.13.0.463 was found, this could lead to undefined behavior!
Loading checkpoint shards: 100%|██████████| 3/3 [00:01<00:00,  2.11it/s]
Traceback (most recent call last):

  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 120, in serve
    server.serve(model_id, revision, dtype, uds_path, sharded)

  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 216, in serve
    asyncio.run(serve_inner(model_id, revision, dtype, sharded))

  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)

  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()

  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 177, in serve_inner
    model = get_model(model_id, revision=revision, dtype=data_type)

  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/__init__.py", line 33, in get_model
    return CausalLM(model_id, revision, dtype)

  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 589, in __init__
    model = model.eval().to(device)

  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2179, in to
    return super().to(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 173, in wrapped_to
    result = self.original_to(*args, **kwargs)

  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1163, in to
    return self._apply(convert)

  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)

  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)

  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)

  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1161, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/weight_sharing.py", line 53, in __torch_function__
    return super().__torch_function__(func, types, new_args, kwargs)

RuntimeError: synStatus=20 [Device already acquired] Device acquire failed.
 rank=0
2024-02-19T04:50:46.290320Z ERROR text_generation_launcher: Shard 0 failed to start
2024-02-19T04:50:46.290347Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

Expected behavior

The model should run properly and without issue

The issue was solved by PR #56.

Thank you 👍👍 @regisss

TL;DR solution:

  • Check the Rust cargo-chef version to make sure the dependency is compatible
  • Make sure the SynapseAI version is the same as the one in the PyTorch Docker image
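
For the second point, the shard log above already hints at the mismatch: optimum-habana v1.10.0 reports that it was validated for SynapseAI v1.14.0 while habana-frameworks v1.13.0.463 is installed. A minimal sketch of that comparison, assuming it is enough to compare the major.minor parts (the helper names are made up for illustration and are not part of any Habana tooling):

```python
import re


def major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as 'v1.13.0.463'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def releases_match(validated_for: str, found: str) -> bool:
    """True when both versions belong to the same major.minor release."""
    return major_minor(validated_for) == major_minor(found)


# Values copied from the warning in the shard log above.
print(releases_match("v1.14.0", "v1.13.0.463"))  # False
```

If this prints False for your image, the container and the host driver are probably on different SynapseAI releases.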

I just merged #56, can we close this issue @muhammad-asn ?

Yup you can close this issue

Sorry for replying on this thread again. The issue is solved when I run on a single HPU card, but when I try 8 HPUs (--num-shard 8), a new error arises:

Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/tgi_service.py", line 29, in <module>
    main(args)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/tgi_service.py", line 16, in main
    server.serve(
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 213, in serve
    asyncio.run(serve_inner(model_id, revision, dtype, sharded))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 177, in serve_inner
    model = get_model(model_id, revision=revision, dtype=data_type)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/__init__.py", line 33, in get_model
    return CausalLM(model_id, revision, dtype)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 526, in __init__
    model = deepspeed.init_inference(model, **ds_inference_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 168, in __init__
    self._apply_injection_policy(config, client_module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
    replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment

Can you share the command line you used to launch your server instance please?
I didn't manage to reproduce this issue with:

docker run \
  -p 8080:80 \
  -v /scratch-1/:/data \
  --runtime=habana \
  -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e HUGGING_FACE_HUB_TOKEN=my_token \
  --cap-add=sys_nice \
  --ipc=host tgi_gaudi \
  --model-id meta-llama/Llama-2-70b-hf \
  --sharded true \
  --num-shard 8

"Loading 0 checkpoint shards" looks weird; it makes me think that it was not able to find the checkpoint shards.
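
One way to test that hypothesis is to compare the shard files named in model.safetensors.index.json with what is actually present in the model directory. A rough sketch, assuming the usual Hugging Face index layout with a "weight_map" entry; missing_shards is a hypothetical helper, not part of TGI:

```python
import json
from pathlib import Path


def missing_shards(model_dir: str) -> list[str]:
    """List shard files named in the safetensors index but absent on disk."""
    root = Path(model_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    # "weight_map" maps each tensor name to the shard file that stores it.
    shard_names = sorted(set(index["weight_map"].values()))
    return [name for name in shard_names if not (root / name).is_file()]
```

Calling missing_shards("/data/meta-llama_Llama-2-13b-chat-hf") inside the container should return an empty list if every shard is visible through the mounted volume.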

Sorry for the late reply, I will check it first.

I use Docker Compose to run the inference:

services:
  tgi_gaudi:
    image: tgi_gaudi
    container_name: llm
    runtime: habana
    environment:
      - HABANA_VISIBLE_DEVICES=all
      - OMPI_MCA_btl_vader_single_copy_mechanism=none
      - ENABLE_HPU_GRAPH=False
      - LOG_LEVEL=debug,text_generation_router=debug
      - PT_HPU_ENABLE_LAZY_COLLECTIVES=true
    command: >
       --model-id /data/meta-llama_Llama-2-13b-chat-hf
       --max-total-tokens 8192
       --max-input-length 4096
       --num-shard 8
       --max-top-n-tokens 1
       --max-best-of 1
       --disable-custom-kernels
       --trust-remote-code
       --max-stop-sequences 1
       --validation-workers 1
       --max-batch-total-tokens 8192
       --max-batch-prefill-tokens 4096
       --waiting-served-ratio 0
       --max-waiting-tokens 4096
       --sharded true
    cap_add:
      - sys_nice
    ipc: host
    shm_size: '1gb'
    restart: always
    ports:
      - "8080:80"
    volumes:
      - ./data:/data

networks:
  default:
    name: habana
    external: true

$ sudo hl-smi

+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.14.0-fw-48.0.1.0          |
| Driver Version:                                     1.14.0-9e8ecf8          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-225              N/A  | 0000:b3:00.0     N/A |                   0  |
| N/A   25C   N/A   106W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   1  HL-225              N/A  | 0000:b4:00.0     N/A |                   0  |
| N/A   28C   N/A   111W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   2  HL-225              N/A  | 0000:19:00.0     N/A |                   0  |
| N/A   23C   N/A   103W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   3  HL-225              N/A  | 0000:cc:00.0     N/A |                   0  |
| N/A   27C   N/A    91W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   4  HL-225              N/A  | 0000:1a:00.0     N/A |                   0  |
| N/A   29C   N/A   101W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   5  HL-225              N/A  | 0000:43:00.0     N/A |                   0  |
| N/A   29C   N/A   101W / 600W |  31373MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   6  HL-225              N/A  | 0000:cd:00.0     N/A |                   0  |
| N/A   26C   N/A   117W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
|   7  HL-225              N/A  | 0000:44:00.0     N/A |                   0  |
| N/A   24C   N/A    89W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0        N/A   N/A    N/A                                      N/A        |
|   1        N/A   N/A    N/A                                      N/A        |
|   2        N/A   N/A    N/A                                      N/A        |
|   3        N/A   N/A    N/A                                      N/A        |
|   4        N/A   N/A    N/A                                      N/A        |
|   5       3472794     C   text-generation                         30605MiB
|   6        N/A   N/A    N/A                                      N/A        |
|   7        N/A   N/A    N/A                                      N/A        |
+=============================================================================+
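
In the output above, AIP 5 is the only device with a compute process attached (PID 3472794, text-generation). A small sketch for pulling that information out of hl-smi text, assuming the "Compute Processes" table keeps the column order shown here; busy_devices is a hypothetical helper:

```python
def busy_devices(hl_smi_output: str) -> dict[int, int]:
    """Map AIP index -> PID for devices that list a compute process."""
    busy = {}
    in_process_table = False
    for line in hl_smi_output.splitlines():
        if "Compute Processes" in line:
            in_process_table = True
            continue
        if not in_process_table:
            continue
        fields = line.strip().strip("|").split()
        # Process rows look like: <AIP> <PID> <Type> <Process name> <Usage>.
        # Idle devices show N/A in the PID column and are skipped.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            busy[int(fields[0])] = int(fields[1])
    return busy


# A few rows copied from the hl-smi output above.
sample = """\
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   4        N/A   N/A    N/A                                      N/A        |
|   5       3472794     C   text-generation                         30605MiB
|   6        N/A   N/A    N/A                                      N/A        |
"""
print(busy_devices(sample))  # {5: 3472794}
```

An empty result before launching a sharded run would mean no device is reported as held by another process.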

@regisss

Also, when I try a longer text, TGI seems to have an issue as well (1 shard only).

JSON with ~1024 words (data.json):

{
    "inputs": "Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then she continued her way. On her way she met a copy. The copy warned the Little Blind Text, that where it came from it would have been rewritten a thousand times and everything that was left from its origin would be the word and and the Little Blind Text should turn around and return to its own, safe country. But nothing the copy said could convince her and so it didn’t take long until a few insidious Copy Writers ambushed her, made her drunk with Longe and Parole and dragged her into their agency, where they abused her for their projects again and again. And if she hasn’t been rewritten, then they are still using her. Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. 
A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then she continued her way. On her way she met a copy. The copy warned the Little Blind Text, that where it came from it would have been rewritten a thousand times and everything that was left from its origin would be the word and and the Little Blind Text should turn around and return to its own, safe country. But nothing the copy said could convince her and so it didn’t take long until a few insidious Copy Writers ambushed her, made her drunk with Longe and Parole and dragged her into their agency, where they abused her for their projects again and again. And if she hasn’t been rewritten, then they are still using her. Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. 
Even the all-powerful Pointing has no control about the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then she continued her way. On her way she met a copy. The copy warned the Little Blind Text, that where it came from it would have been rewritten a thousand times and everything that was left from its origin would be the word and and the Little Blind Text should turn around and return to its own, safe country. But nothing the copy said could convince her and so it didn’t take long until a few insidious Copy Writers ambushed her, made her drunk with Longe and Parole and dragged her into their agency, where they abused her for their projects again and again. And if she hasn’t been rewritten, then they are still using her.Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about the blind. Summarize it",
    "parameters": {
        "max_new_tokens": 4096,
        "best_of": 1,
        "repetition_penalty": 1.17,
        "return_full_text": false,
        "temperature": 0.01,
        "top_p": 0.14,
        "top_k": 49,
        "truncate": 4096,
        "typical_p": 0.99,
        "watermark": false,
        "decoder_input_details": false
    }
}

The command:

curl -d @data.json -H "Content-Type: application/json" "http://127.0.0.1:8080/generate"
{"error":"Request failed during generation: Server error: Graph compile failed. synStatus=synStatus 26 [Generice failure]. ","error_type":"generation"}
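
For anyone who wants to reproduce this from a script instead of curl, here is a rough Python equivalent using only the standard library. It is a sketch: build_generate_request and parse_error are hypothetical helper names, and the short payload is a stand-in for data.json rather than the real request.

```python
import json
import urllib.request


def build_generate_request(
    payload: dict, base_url: str = "http://127.0.0.1:8080"
) -> urllib.request.Request:
    """Build the POST request that the curl command sends to /generate."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_error(response_body: str) -> tuple[str, str]:
    """Return (error_type, error) from an error response body."""
    data = json.loads(response_body)
    return data.get("error_type", ""), data.get("error", "")


# Shortened stand-in for data.json; the real file carries the long prompt.
request = build_generate_request(
    {"inputs": "Summarize it", "parameters": {"max_new_tokens": 64}}
)
print(request.get_method(), request.full_url)

# Error body copied from the curl output above.
failure = '{"error":"Request failed during generation: Server error: Graph compile failed. synStatus=synStatus 26 [Generice failure]. ","error_type":"generation"}'
print(parse_error(failure)[0])  # generation
```

To actually send it, pass the request to urllib.request.urlopen; an error status raises urllib.error.HTTPError, and its read() method returns the JSON body that parse_error expects.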

The log:

e.rs:213: send frame=Ping { ack: true, payload: [0, 0, 0, 0, 0, 0, 0, 173] }
2024-02-21T05:54:46.683643Z DEBUG text_generation_launcher: MAX_TOTAL_TOKENS = 0
2024-02-21T05:55:25.318039Z DEBUG text_generation_launcher: Method Prefill encountered an error.
2024-02-21T05:55:25.318074Z DEBUG text_generation_launcher: Traceback (most recent call last):
2024-02-21T05:55:25.318081Z DEBUG text_generation_launcher:   File "/usr/local/bin/text-generation-server", line 8, in <module>
2024-02-21T05:55:25.318087Z DEBUG text_generation_launcher:     sys.exit(app())
2024-02-21T05:55:25.318093Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
2024-02-21T05:55:25.318099Z DEBUG text_generation_launcher:     return get_command(self)(*args, **kwargs)
2024-02-21T05:55:25.318104Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
2024-02-21T05:55:25.318109Z DEBUG text_generation_launcher:     return self.main(*args, **kwargs)
2024-02-21T05:55:25.318114Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
2024-02-21T05:55:25.318119Z DEBUG text_generation_launcher:     return _main(
2024-02-21T05:55:25.318125Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
2024-02-21T05:55:25.318130Z DEBUG text_generation_launcher:     rv = self.invoke(ctx)
2024-02-21T05:55:25.318135Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
2024-02-21T05:55:25.318140Z DEBUG text_generation_launcher:     return _process_result(sub_ctx.command.invoke(sub_ctx))
2024-02-21T05:55:25.318146Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
2024-02-21T05:55:25.318151Z DEBUG text_generation_launcher:     return ctx.invoke(self.callback, **ctx.params)
2024-02-21T05:55:25.318156Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
2024-02-21T05:55:25.318161Z DEBUG text_generation_launcher:     return __callback(*args, **kwargs)
2024-02-21T05:55:25.318166Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
2024-02-21T05:55:25.318171Z DEBUG text_generation_launcher:     return callback(**use_params)  # type: ignore
2024-02-21T05:55:25.318177Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 120, in serve
2024-02-21T05:55:25.318182Z DEBUG text_generation_launcher:     server.serve(model_id, revision, dtype, uds_path, sharded)
2024-02-21T05:55:25.318188Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 213, in serve
2024-02-21T05:55:25.318193Z DEBUG text_generation_launcher:     asyncio.run(serve_inner(model_id, revision, dtype, sharded))
2024-02-21T05:55:25.318198Z DEBUG text_generation_launcher:   File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
2024-02-21T05:55:25.318203Z DEBUG text_generation_launcher:     return loop.run_until_complete(main)
2024-02-21T05:55:25.318209Z DEBUG text_generation_launcher:   File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
2024-02-21T05:55:25.318214Z DEBUG text_generation_launcher:     self.run_forever()
2024-02-21T05:55:25.318220Z DEBUG text_generation_launcher:   File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
2024-02-21T05:55:25.318225Z DEBUG text_generation_launcher:     self._run_once()
2024-02-21T05:55:25.318230Z DEBUG text_generation_launcher:   File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
2024-02-21T05:55:25.318235Z DEBUG text_generation_launcher:     handle._run()
2024-02-21T05:55:25.318241Z DEBUG text_generation_launcher:   File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
2024-02-21T05:55:25.318247Z DEBUG text_generation_launcher:     self._context.run(self._callback, *self._args)
2024-02-21T05:55:25.318252Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
2024-02-21T05:55:25.318258Z DEBUG text_generation_launcher:     return await self.intercept(
2024-02-21T05:55:25.318264Z DEBUG text_generation_launcher: > File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 23, in intercept
2024-02-21T05:55:25.318270Z DEBUG text_generation_launcher:     return await response
2024-02-21T05:55:25.318276Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
2024-02-21T05:55:25.318282Z DEBUG text_generation_launcher:     raise error
2024-02-21T05:55:25.318288Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
2024-02-21T05:55:25.318294Z DEBUG text_generation_launcher:     return await behavior(request_or_iterator, context)
2024-02-21T05:55:25.318300Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 87, in Prefill
2024-02-21T05:55:25.318307Z DEBUG text_generation_launcher:     generations, next_batch = self.model.generate_token(batch)
2024-02-21T05:55:25.318312Z DEBUG text_generation_launcher:   File "/usr/lib/python3.10/contextlib.py", line 79, in inner
2024-02-21T05:55:25.318318Z DEBUG text_generation_launcher:     return func(*args, **kwds)
2024-02-21T05:55:25.318323Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 704, in generate_token
2024-02-21T05:55:25.318330Z DEBUG text_generation_launcher:     batch.input_ids[:, :token_idx], logits.squeeze(-2)
2024-02-21T05:55:25.318335Z DEBUG text_generation_launcher: RuntimeError: Graph compile failed. synStatus=synStatus 26 [Generice failure].
2024-02-21T05:55:25.318358Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_read.rs:360: received frame=Headers { stream_id: StreamId(429), flags: (0x5: END_HEADERS | END_STREAM) }
2024-02-21T05:55:25.318424Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_read.rs:360: received frame=WindowUpdate { stream_id: StreamId(0), size_increment: 5848 }
2024-02-21T05:55:25.318524Z ERROR batch{batch_size=1}:prefill:prefill{id=83 size=1}:prefill{id=83 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: Graph compile failed. synStatus=synStatus 26 [Generice failure].
2024-02-21T05:55:25.318642Z DEBUG batch{batch_size=1}:prefill:clear_cache{batch_id=Some(83)}:clear_cache{batch_id=Some(83)}: tower::buffer::worker: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-0.4.13/src/buffer/worker.rs:197: service.ready=true processing request
2024-02-21T05:55:25.318799Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_write.rs:213: send frame=Headers { stream_id: StreamId(431), flags: (0x4: END_HEADERS) }
2024-02-21T05:55:25.318838Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(431) }
2024-02-21T05:55:25.318855Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_write.rs:213: send frame=Data { stream_id: StreamId(431), flags: (0x1: END_STREAM) }
2024-02-21T05:55:25.319267Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_read.rs:360: received frame=Ping { ack: false, payload: [0, 0, 0, 0, 0, 0, 0, 174] }
2024-02-21T05:55:25.319307Z DEBUG Connection{peer=Client}: h2::codec::framed_write: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_write.rs:213: send frame=Ping { ack: true, payload: [0, 0, 0, 0, 0, 0, 0, 174] }
2024-02-21T05:55:25.319664Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_read.rs:360: received frame=Headers { stream_id: StreamId(431), flags: (0x4: END_HEADERS) }
2024-02-21T05:55:25.319709Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_read.rs:360: received frame=Data { stream_id: StreamId(431) }
2024-02-21T05:55:25.319734Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_read.rs:360: received frame=Headers { stream_id: StreamId(431), flags: (0x5: END_HEADERS | END_STREAM) }
2024-02-21T05:55:25.319750Z DEBUG Connection{peer=Client}: h2::codec::framed_read: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/h2-0.3.24/src/codec/framed_read.rs:360: received frame=WindowUpdate { stream_id: StreamId(0), size_increment: 7 }
2024-02-21T05:55:25.319858Z ERROR generate{parameters=GenerateParameters { best_of: Some(1), temperature: Some(0.01), repetition_penalty: Some(1.17), top_k: Some(49), top_p: Some(0.14), typical_p: Some(0.99), do_sample: false, max_new_tokens: Some(4096), return_full_text: Some(false), stop: [], truncate: Some(4096), watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None }}:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:601: Request failed during generation: Server error: Graph compile failed. synStatus=synStatus 26 [Generice failure].
2024-02-21T05:55:25.320127Z DEBUG hyper::proto::h1::io: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.28/src/proto/h1/io.rs:318: flushed 396 bytes
2024-02-21T05:55:25.510810Z DEBUG hyper::proto::h1::conn: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.28/src/proto/h1/conn.rs:283: read eof

It works on my side with 8 shards running:

docker run \
  -p 8080:80 \
  -v /scratch-1/:/data \
  --runtime=habana \
  -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e HUGGING_FACE_HUB_TOKEN=my_token \
  --cap-add=sys_nice \
  --ipc=host tgi_gaudi \
  --model-id meta-llama/Llama-2-13b-chat-hf \
  --sharded true \
  --num-shard 8 \
  --max-total-tokens 8192 \
  --max-input-length 4096 \
  --max-top-n-tokens 1 \
  --max-best-of 1 \
  --disable-custom-kernels \
  --max-stop-sequences 1 \
  --validation-workers 1 \
  --max-batch-total-tokens 8192 \
  --max-batch-prefill-tokens 4096 \
  --waiting-served-ratio 0 \
  --max-waiting-tokens 4096

With 1 shard, I can reproduce the error but I think it's just an out-of-memory error and sharding is needed to make it work with these dimensions. It does work on 1 shard with smaller inputs.
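To make "these dimensions" concrete, here is a minimal sketch of the token-budget check implied by the flags above. It assumes the router rejects a request when the prompt exceeds `--max-input-length` or when prompt plus `max_new_tokens` exceeds `--max-total-tokens`; the class and function names are hypothetical and are not part of tgi-gaudi.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerLimits:
    """Hypothetical container for the two launcher limits used in this thread."""
    max_input_length: int   # value passed to --max-input-length
    max_total_tokens: int   # value passed to --max-total-tokens


def check_request(limits: ServerLimits, input_tokens: int, max_new_tokens: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means the request fits."""
    problems = []
    if input_tokens > limits.max_input_length:
        problems.append(
            f"prompt has {input_tokens} tokens, limit is {limits.max_input_length}"
        )
    total = input_tokens + max_new_tokens
    if total > limits.max_total_tokens:
        problems.append(
            f"prompt + max_new_tokens is {total}, limit is {limits.max_total_tokens}"
        )
    return problems


# Limits from the docker command above, request shape from the failing log
# (truncate=4096, max_new_tokens=4096).
limits = ServerLimits(max_input_length=4096, max_total_tokens=8192)
print(check_request(limits, input_tokens=4096, max_new_tokens=4096))
```

The failing request sits exactly at the configured ceiling of 8192 tokens, so it is accepted by such a check and the device then has to hold the largest sequence the server allows.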

Sorry @regisss

> With 1 shard, I can reproduce the error but I think it's just an out-of-memory error and sharding is needed to make it work with these dimensions. It does work on 1 shard with smaller inputs.

Is this the error discussed in #54 (comment)?

According to the hl-smi output, each card has ~98304 MiB of memory, so this doesn't look like an out-of-memory problem?

5       3541452     C   text-generation                         24834MiB

And regarding this:

> It works on my side with 8 shards running:

Which GitHub branch did you use? On my side I still get this error (using v1.2-release):

replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment

> Is this the error discussed in #54 (comment)?

Yes, this one.

> According to the hl-smi output, each card has ~98304 MiB of memory, so this doesn't look like an out-of-memory problem?

Yes, but the model already accounts for ~26GB, and you have to store a key-value cache of size 8192. This is very big. However, sharding will help a lot here as it divides the memory footprint of the model by the number of shards.
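As a rough back-of-the-envelope check of these numbers, the sketch below estimates the weight and key-value cache footprint. The layer count (40) and hidden size (5120) come from the published Llama-2-13B configuration, and 2 bytes per value assumes bf16/fp16; none of these are stated in this thread, so treat them as assumptions. Dividing the cache by the number of shards is also an assumption (tensor parallelism splitting attention heads). Activations and graph-compilation workspace are not counted, so real usage will be higher.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Assumed architecture numbers; not taken from this thread."""
    n_params: float
    n_layers: int
    hidden_size: int
    bytes_per_value: int = 2  # bf16 / fp16


# Assumed Llama-2-13B shape: ~13e9 parameters, 40 layers, hidden size 5120.
LLAMA2_13B = ModelShape(n_params=13.0e9, n_layers=40, hidden_size=5120)


def weights_gb(shape: ModelShape, num_shards: int = 1) -> float:
    """Per-device weight memory in GB (1e9 bytes)."""
    return shape.n_params * shape.bytes_per_value / num_shards / 1e9


def kv_cache_gb(shape: ModelShape, total_tokens: int, num_shards: int = 1) -> float:
    """Per-device key-value cache in GB for `total_tokens` cached positions."""
    # One key and one value vector of `hidden_size` entries per layer per token.
    bytes_per_token = 2 * shape.n_layers * shape.hidden_size * shape.bytes_per_value
    return bytes_per_token * total_tokens / num_shards / 1e9


for shards in (1, 8):
    print(
        f"{shards} shard(s): weights ~{weights_gb(LLAMA2_13B, shards):.2f} GB, "
        f"KV cache for 8192 tokens ~{kv_cache_gb(LLAMA2_13B, 8192, shards):.2f} GB"
    )
```

Under these assumptions the weights come to about 26 GB on one card and about 3.25 GB per card with 8 shards, which matches the ~26GB figure quoted above.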

> Which GitHub branch did you use? On my side I still get this error (using v1.2-release)

I use v1.2-release. Are you sure your branch and Docker image are up to date?

My server spec

$ neofetch
        `:+ssssssssssssssssss+:`           ---------------------
      -+ssssssssssssssssssyyssss+-         OS: Ubuntu 22.04.3 LTS x86_64
    .ossssssssssssssssssdMMMNysssso.       Host: Super Server 0123456789
   /ssssssssssshdmmNNmmyNMMMMhssssss/      Kernel: 6.5.0-15-generic
  +ssssssssshmydMMMMMMMNddddyssssssss+     Uptime: 15 days, 52 mins
 /sssssssshNMMMyhhyyyyhmNMMMNhssssssss/    Packages: 1834 (dpkg), 11 (snap)
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Shell: bash 5.1.16
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   Resolution: 1024x768
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   Terminal: /dev/pts/0
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   CPU: Intel Xeon Platinum 8380 (160) @ 3.400GHz
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   GPU: 03:00.0 ASPEED Technology, Inc. ASPEED Graphics Family
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Memory: 45394MiB / 1031678MiB
 /sssssssshNMMMyhhyyyyhdNMMMNhssssssss/
  +sssssssssdmydMMMMMMMMddddyssssssss+
   /ssssssssssshdmNNNNmyNMMMMhssssss/
    .ossssssssssssssssssdMMMNysssso.
      -+sssssssssssssssssyyyssss+-
        `:+ssssssssssssssssss+:`
            .-/+oossssoo+/-.

Here's the full error log from running with 8 shards:
error-tgi-gaudi.txt

I still get the issue on the latest v1.2-release branch. I'm not sure whether it comes from my hardware or from the library. I'm still going through the DeepSpeed source code.

Weird 🤔
Can you set trust_remote_code to False please? I don't think that will solve it but it may interfere with the modeling code.
Also, can you show me the output of pip show deepspeed?

root@107aca520a2c:/usr/src# pip show deepspeed
Name: deepspeed
Version: 0.12.4+hpu.synapse.v1.14.0
Summary: DeepSpeed library
Home-page: http://deepspeed.ai
Author: DeepSpeed Team
Author-email: deepspeed-info@microsoft.com
License: Apache Software License 2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: hjson, ninja, numpy, packaging, psutil, py-cpuinfo, pydantic, pynvml, torch, tqdm
Required-by: text-generation-server

Still the same error with trust_remote_code disabled, @regisss.

Hmm that looks all right, can you share the output of pip freeze please?

Here is the pip freeze output

root@abf201df1755:/usr/src# pip freeze
absl-py==2.1.0
accelerate==0.27.2
aiohttp==3.8.5
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.1.0
av==9.2.0
backoff==2.2.1
cachetools==5.3.2
certifi==2023.7.22
cffi==1.15.1
cfgv==3.4.0
charset-normalizer==3.2.0
click==8.1.7
cmake==3.28.1
coloredlogs==15.0.1
datasets==2.14.4
deepspeed @ git+https://github.com/HabanaAI/DeepSpeed.git@fad45b24c7c9070251711a0d7d6f1b82805072ad
Deprecated==1.2.14
diffusers==0.20.1
dill==0.3.7
distlib==0.3.8
exceptiongroup==1.2.0
expecttest==0.2.1
filelock==3.12.3
frozenlist==1.4.0
fsspec==2023.6.0
google-auth==2.26.2
google-auth-oauthlib==0.4.6
googleapis-common-protos==1.60.0
grpc-interceptor==0.15.3
grpcio==1.57.0
grpcio-reflection==1.48.2
grpcio-status==1.48.2
grpcio-tools==1.51.1
habana-media-loader==1.14.0.493
habana-pyhlml==1.14.0.493
habana-torch-dataloader @ file:///tmp/tmp.Y8DXnLRS3C/habana_torch_dataloader-1.14.0.493-cp310-cp310-linux_x86_64.whl#sha256=d57c0e52bf97b9a38a261986ed34d1ed59986fccd0d48d8ca15712221855640e
habana-torch-plugin @ file:///tmp/tmp.Y8DXnLRS3C/habana_torch_plugin-1.14.0.493-cp310-cp310-linux_x86_64.whl#sha256=a342bf1183f7813d2ddfb893370bf10a06bd4490ac435d6c1e262a096f1986a4
habana_gpu_migration @ file:///tmp/tmp.Y8DXnLRS3C/habana_gpu_migration-1.14.0.493-cp310-cp310-linux_x86_64.whl#sha256=dd08b8a0b53571b9f9019cddf1ba38d53312f200c2cd7de66b7185ea5c6cccc2
habana_quantization_toolkit @ file:///tmp/tmp.Y8DXnLRS3C/habana_quantization_toolkit-1.14.0.493-py3-none-any.whl#sha256=32bf985b89ca80889442ce2961f2ec831f1352fdbff34bc0089bcb48f47f8809
hf_transfer==0.1.3
hjson==3.1.0
huggingface-hub==0.16.4
humanfriendly==10.0
identify==2.5.33
idna==3.4
importlib-metadata==6.8.0
iniconfig==2.0.0
intel-openmp==2023.2.3
Jinja2==3.1.2
lightning==2.1.2
lightning-habana==1.3.0
lightning-utilities==0.10.1
loguru==0.6.0
Markdown==3.5.2
MarkupSafe==2.1.3
mkl==2023.1.0
mkl-include==2023.1.0
mpi4py==3.1.4
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
mypy-protobuf==3.4.0
networkx==3.1
ninja==1.11.1.1
nodeenv==1.8.0
numpy==1.25.2
oauthlib==3.2.2
opentelemetry-api==1.15.0
opentelemetry-exporter-otlp==1.15.0
opentelemetry-exporter-otlp-proto-grpc==1.15.0
opentelemetry-exporter-otlp-proto-http==1.15.0
opentelemetry-instrumentation==0.36b0
opentelemetry-instrumentation-grpc==0.36b0
opentelemetry-proto==1.15.0
opentelemetry-sdk==1.15.0
opentelemetry-semantic-conventions==0.36b0
optimum==1.13.2
optimum-habana==1.10.0
packaging==23.1
pandas==2.0.3
pathspec==0.12.1
peft==0.4.0
perfetto==0.7.0
Pillow==10.0.0
Pillow-SIMD==7.0.0.post3
platformdirs==4.1.0
pluggy==1.3.0
pre-commit==3.3.3
protobuf==3.20.3
psutil==5.9.5
py-cpuinfo==9.0.0
pyarrow==13.0.0
pyasn1==0.5.1
pyasn1-modules==0.3.0
pybind11==2.10.4
pycparser==2.21
pydantic==1.10.13
pynvml==8.0.4
pytest==7.4.4
python-dateutil==2.8.2
pytorch-lightning==2.1.3
pytz==2023.3
PyYAML==6.0.1
regex==2023.8.8
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
safetensors==0.3.2
sentencepiece==0.1.99
six==1.16.0
sympy==1.12
tbb==2021.11.0
tdqm==0.0.1
tensorboard==2.11.2
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
text-generation-server @ file:///usr/src/server
tokenizers==0.14.1
tomli==2.0.1
torch @ file:///tmp/tmp.Y8DXnLRS3C/torch-2.1.1a0%2Bgitb51c9f6-cp310-cp310-linux_x86_64.whl#sha256=1abf98885ccc265886480bdc3e26f3b7eebf19d0e7913eb75e2ad980b7d70089
torch_tb_profiler @ file:///tmp/tmp.Y8DXnLRS3C/torch_tb_profiler-0.4.0-py3-none-any.whl#sha256=0d3af22de662e6641215b5e7cd2b3472d4ef2c4fa90a6b5ae43fcca72301db7d
torchaudio @ file:///tmp/tmp.Y8DXnLRS3C/torchaudio-2.1.0%2B6ea1133-cp310-cp310-linux_x86_64.whl#sha256=d32495f49785a114acdeb2299c9006015b9d7b0f2c4c5ba81908dc35ae09d237
torchdata @ file:///tmp/tmp.Y8DXnLRS3C/torchdata-0.7.0%2Bc5f2204-py3-none-any.whl#sha256=a675577c0018ca609e5e21e0c6bc712e6aa3d1e119d9ffd2ec1a09194f8dae4e
torchmetrics==1.3.0.post0
torchtext @ file:///tmp/tmp.Y8DXnLRS3C/torchtext-0.16.0a0%2B4e255c9-cp310-cp310-linux_x86_64.whl#sha256=4a373211b2f80e632aed4143f2789d7102516a61c69eaab9890814543022b192
torchvision @ file:///tmp/tmp.Y8DXnLRS3C/torchvision-0.16.0%2Bfbb4cc5-cp310-cp310-linux_x86_64.whl#sha256=273904fb11dacebc32e66e3a03a9d206fe61d3d45bad914f8e0eaf439f8f43fc
tqdm==4.66.1
transformers==4.34.1
typer==0.6.1
types-protobuf==4.24.0.20240129
typing_extensions==4.7.1
tzdata==2023.3
urllib3==2.0.4
virtualenv==20.25.0
Werkzeug==3.0.1
wrapt==1.15.0
xxhash==3.3.0
yamllint==1.33.0
yarl==1.9.2
zipp==3.16.2

I just updated dependencies in a new PR, could you try it please? #69

Will check it as soon as possible, many thanks @regisss

Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/tgi_service.py", line 29, in <module>
    main(args)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/tgi_service.py", line 16, in main
    server.serve(
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 213, in serve
    asyncio.run(serve_inner(model_id, revision, dtype, sharded))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 177, in serve_inner
    model = get_model(model_id, revision=revision, dtype=data_type)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/__init__.py", line 33, in get_model
    return CausalLM(model_id, revision, dtype)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/models/causal_lm.py", line 556, in __init__
    model = deepspeed.init_inference(model, **ds_inference_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 168, in __init__
    self._apply_injection_policy(config, client_module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
    replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s] rank=0
2024-02-23T11:07:51.845305Z ERROR text_generation_launcher: Shard 0 failed to start
2024-02-23T11:07:51.845337Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
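One plausible reading of this traceback, given the "Loading 0 checkpoint shards" line right before it, is that a loop over checkpoint files ran zero times and therefore never assigned `replaced_module`. The sketch below reproduces that Python behaviour in isolation. The function names are hypothetical stand-ins and this is not DeepSpeed's actual code.

```python
def replace_layers(checkpoint_files):
    """Hypothetical stand-in: assigns a local only inside the loop body."""
    for path in checkpoint_files:
        replaced_module = f"module loaded from {path}"
    # With an empty list the loop body never runs, so the name below
    # was never bound and Python raises UnboundLocalError.
    return replaced_module


def replace_layers_checked(checkpoint_files):
    """Same logic, but fails early with a message that names the real problem."""
    if not checkpoint_files:
        raise FileNotFoundError("no checkpoint shards were found to load")
    for path in checkpoint_files:
        replaced_module = f"module loaded from {path}"
    return replaced_module


try:
    replace_layers([])
except UnboundLocalError as err:
    print(f"UnboundLocalError: {err}")
```

If that reading is right, the thing to check is why no checkpoint files were discovered for the model directory, rather than the assignment itself.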

I still get the same error. Here is the pip freeze output:

$ docker exec -it llm bash -c "pip freeze"
absl-py==2.1.0
accelerate==0.27.2
aiohttp==3.9.0
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.1.0
av==9.2.0
backoff==2.2.1
cachetools==5.3.2
certifi==2023.11.17
cffi==1.15.1
cfgv==3.4.0
charset-normalizer==3.3.2
click==8.1.7
cmake==3.28.1
coloredlogs==15.0.1
datasets==2.14.7
deepspeed @ git+https://github.com/HabanaAI/DeepSpeed.git@fad45b24c7c9070251711a0d7d6f1b82805072ad
Deprecated==1.2.14
diffusers==0.26.3
dill==0.3.7
distlib==0.3.8
exceptiongroup==1.2.0
expecttest==0.2.1
filelock==3.13.1
frozenlist==1.4.0
fsspec==2023.10.0
google-auth==2.26.2
google-auth-oauthlib==0.4.6
googleapis-common-protos==1.61.0
grpc-interceptor==0.15.4
grpcio==1.59.3
grpcio-reflection==1.48.2
grpcio-status==1.48.2
grpcio-tools==1.51.1
habana-media-loader==1.14.0.493
habana-pyhlml==1.14.0.493
habana-torch-dataloader @ file:///tmp/tmp.Y8DXnLRS3C/habana_torch_dataloader-1.14.0.493-cp310-cp310-linux_x86_64.whl#sha256=d57c0e52bf97b9a38a261986ed34d1ed59986fccd0d48d8ca15712221855640e
habana-torch-plugin @ file:///tmp/tmp.Y8DXnLRS3C/habana_torch_plugin-1.14.0.493-cp310-cp310-linux_x86_64.whl#sha256=a342bf1183f7813d2ddfb893370bf10a06bd4490ac435d6c1e262a096f1986a4
habana_gpu_migration @ file:///tmp/tmp.Y8DXnLRS3C/habana_gpu_migration-1.14.0.493-cp310-cp310-linux_x86_64.whl#sha256=dd08b8a0b53571b9f9019cddf1ba38d53312f200c2cd7de66b7185ea5c6cccc2
habana_quantization_toolkit @ file:///tmp/tmp.Y8DXnLRS3C/habana_quantization_toolkit-1.14.0.493-py3-none-any.whl#sha256=32bf985b89ca80889442ce2961f2ec831f1352fdbff34bc0089bcb48f47f8809
hf_transfer==0.1.4
hjson==3.1.0
huggingface-hub==0.20.3
humanfriendly==10.0
identify==2.5.33
idna==3.4
importlib-metadata==7.0.1
iniconfig==2.0.0
intel-openmp==2023.2.3
Jinja2==3.1.2
lightning==2.1.2
lightning-habana==1.3.0
lightning-utilities==0.10.1
loguru==0.6.0
Markdown==3.5.2
MarkupSafe==2.1.3
mkl==2023.1.0
mkl-include==2023.1.0
mpi4py==3.1.4
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
mypy-protobuf==3.4.0
networkx==3.2.1
ninja==1.11.1.1
nodeenv==1.8.0
numpy==1.26.2
oauthlib==3.2.2
opentelemetry-api==1.15.0
opentelemetry-exporter-otlp==1.15.0
opentelemetry-exporter-otlp-proto-grpc==1.15.0
opentelemetry-exporter-otlp-proto-http==1.15.0
opentelemetry-instrumentation==0.36b0
opentelemetry-instrumentation-grpc==0.36b0
opentelemetry-proto==1.15.0
opentelemetry-sdk==1.15.0
opentelemetry-semantic-conventions==0.36b0
optimum==1.17.1
optimum-habana==1.10.4
packaging==23.2
pandas==2.1.3
pathspec==0.12.1
peft==0.4.0
perfetto==0.7.0
Pillow==10.1.0
Pillow-SIMD==7.0.0.post3
platformdirs==4.1.0
pluggy==1.3.0
pre-commit==3.3.3
protobuf==3.20.3
psutil==5.9.6
py-cpuinfo==9.0.0
pyarrow==14.0.1
pyarrow-hotfix==0.6
pyasn1==0.5.1
pyasn1-modules==0.3.0
pybind11==2.10.4
pycparser==2.21
pydantic==1.10.13
pynvml==8.0.4
pytest==7.4.4
python-dateutil==2.8.2
pytorch-lightning==2.1.3
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.10.3
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
safetensors==0.4.2
sentencepiece==0.1.99
six==1.16.0
sympy==1.12
tbb==2021.11.0
tdqm==0.0.1
tensorboard==2.11.2
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
text-generation-server @ file:///usr/src/server
tokenizers==0.15.2
tomli==2.0.1
torch @ file:///tmp/tmp.Y8DXnLRS3C/torch-2.1.1a0%2Bgitb51c9f6-cp310-cp310-linux_x86_64.whl#sha256=1abf98885ccc265886480bdc3e26f3b7eebf19d0e7913eb75e2ad980b7d70089
torch_tb_profiler @ file:///tmp/tmp.Y8DXnLRS3C/torch_tb_profiler-0.4.0-py3-none-any.whl#sha256=0d3af22de662e6641215b5e7cd2b3472d4ef2c4fa90a6b5ae43fcca72301db7d
torchaudio @ file:///tmp/tmp.Y8DXnLRS3C/torchaudio-2.1.0%2B6ea1133-cp310-cp310-linux_x86_64.whl#sha256=d32495f49785a114acdeb2299c9006015b9d7b0f2c4c5ba81908dc35ae09d237
torchdata @ file:///tmp/tmp.Y8DXnLRS3C/torchdata-0.7.0%2Bc5f2204-py3-none-any.whl#sha256=a675577c0018ca609e5e21e0c6bc712e6aa3d1e119d9ffd2ec1a09194f8dae4e
torchmetrics==1.3.0.post0
torchtext @ file:///tmp/tmp.Y8DXnLRS3C/torchtext-0.16.0a0%2B4e255c9-cp310-cp310-linux_x86_64.whl#sha256=4a373211b2f80e632aed4143f2789d7102516a61c69eaab9890814543022b192
torchvision @ file:///tmp/tmp.Y8DXnLRS3C/torchvision-0.16.0%2Bfbb4cc5-cp310-cp310-linux_x86_64.whl#sha256=273904fb11dacebc32e66e3a03a9d206fe61d3d45bad914f8e0eaf439f8f43fc
tqdm==4.66.1
transformers==4.37.2
typer==0.6.1
types-protobuf==4.24.0.20240129
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.1.0
virtualenv==20.25.0
Werkzeug==3.0.1
wrapt==1.16.0
xxhash==3.4.1
yamllint==1.33.0
yarl==1.9.3
zipp==3.17.0
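Comparing the two `pip freeze` listings by eye is error-prone, so here is a small sketch that diffs them. The helper names are hypothetical, and the sample input is a handful of lines copied from the two listings in this thread, not the full output.

```python
def parse_freeze(text):
    """Map package name (lower-cased) to its pinned version or direct reference."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if " @ " in line:            # e.g. "deepspeed @ git+https://..."
            name, version = line.split(" @ ", 1)
        elif "==" in line:           # e.g. "transformers==4.37.2"
            name, version = line.split("==", 1)
        else:
            name, version = line, ""
        pins[name.strip().lower()] = version.strip()
    return pins


def diff_freeze(before, after):
    """Return {name: (old, new)} for packages that were added, removed or changed."""
    old, new = parse_freeze(before), parse_freeze(after)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }


before = """\
deepspeed @ git+https://github.com/HabanaAI/DeepSpeed.git@fad45b24c7c9070251711a0d7d6f1b82805072ad
optimum-habana==1.10.0
transformers==4.34.1
"""
after = """\
deepspeed @ git+https://github.com/HabanaAI/DeepSpeed.git@fad45b24c7c9070251711a0d7d6f1b82805072ad
optimum-habana==1.10.4
pyarrow-hotfix==0.6
transformers==4.37.2
"""

for name, (old, new) in diff_freeze(before, after).items():
    print(f"{name}: {old} -> {new}")
```

On these sample lines the DeepSpeed pin is identical in both environments, while `optimum-habana` and `transformers` moved and `pyarrow-hotfix` was added.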

Can you try the text-generation example and run the following command in the same environment please?

python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
--model_name_or_path model_name \
--batch_size 1 \
--use_hpu_graphs \
--use_kv_cache \
--max_new_tokens 100

I ran it inside the container:

root@3e44951622db:/optimum-habana/examples/text-generation# git branch -v
* (HEAD detached at v1.10.4) 1dfbc02 Release: v1.10.4
  main                       89cdd6f Add seed in sft example, make sft result reproducable (#735)

root@3e44951622db:/optimum-habana/examples/text-generation# env | grep HABANA
HABANA_LOGS=/var/log/habana_logs/
HABANA_PLUGINS_LIB_PATH=/opt/habanalabs/habana_plugins
HABANA_VISIBLE_DEVICES=all
HABANA_SCAL_BIN_PATH=/opt/habanalabs/engines_fw

root@3e44951622db:/optimum-habana/examples/text-generation# python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /data/meta-llama_Llama-2-13b-chat-hf --batch_size 1 --use_hpu_graphs --use_kv_cache --max_new_tokens 100
DistributedRunner run(): command = deepspeed --num_nodes 1 --num_gpus 8 --no_local_rank --master_port 29500 run_generation.py --model_name_or_path /data/meta-llama_Llama-2-13b-chat-hf --batch_size 1 --use_hpu_graphs --use_kv_cache --max_new_tokens 100
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
  warnings.warn(
[2024-02-23 11:28:04,339] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[2024-02-23 11:28:05,460] [WARNING] [runner.py:206:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-02-23 11:28:05,518] [INFO] [runner.py:585:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --no_local_rank --enable_each_rank_log=None run_generation.py --model_name_or_path /data/meta-llama_Llama-2-13b-chat-hf --batch_size 1 --use_hpu_graphs --use_kv_cache --max_new_tokens 100
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
  warnings.warn(
[2024-02-23 11:28:07,276] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[2024-02-23 11:28:08,398] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-02-23 11:28:08,398] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-02-23 11:28:08,398] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-02-23 11:28:08,398] [INFO] [launch.py:164:main] dist_world_size=8
[2024-02-23 11:28:08,398] [INFO] [launch.py:166:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
  warnings.warn(
[2024-02-23 11:28:12,234] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
  warnings.warn(
[2024-02-23 11:28:12,546] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
  warnings.warn(
[2024-02-23 11:28:12,830] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[2024-02-23 11:28:13,149] [WARNING] [comm.py:163:init_deepspeed_backend] HCCL backend in DeepSpeed not yet implemented
[2024-02-23 11:28:13,150] [INFO] [comm.py:637:init_distributed] cdb=None
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
  warnings.warn(
[2024-02-23 11:28:13,242] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[2024-02-23 11:28:13,466] [WARNING] [comm.py:163:init_deepspeed_backend] HCCL backend in DeepSpeed not yet implemented
[2024-02-23 11:28:13,467] [INFO] [comm.py:637:init_distributed] cdb=None
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
  warnings.warn(
[2024-02-23 11:28:13,727] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[2024-02-23 11:28:13,742] [WARNING] [comm.py:163:init_deepspeed_backend] HCCL backend in DeepSpeed not yet implemented
[2024-02-23 11:28:13,742] [INFO] [comm.py:637:init_distributed] cdb=None
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
  warnings.warn(
[2024-02-23 11:28:14,117] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
[2024-02-23 11:28:14,166] [WARNING] [comm.py:163:init_deepspeed_backend] HCCL backend in DeepSpeed not yet implemented
[2024-02-23 11:28:14,166] [INFO] [comm.py:637:init_distributed] cdb=None
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
  warnings.warn(
[2024-02-23 11:28:14,207] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/__init__.py:158: UserWarning: torch.hpu.setDeterministic is deprecated and will be removed in next release. Please use torch.use_deterministic_algorithms instead.
  warnings.warn(
[2024-02-23 11:28:14,218] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
02/23/2024 11:28:14 - INFO - __main__ - DeepSpeed is enabled.
[2024-02-23 11:28:14,694] [WARNING] [comm.py:163:init_deepspeed_backend] HCCL backend in DeepSpeed not yet implemented
[2024-02-23 11:28:14,694] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-02-23 11:28:14,694] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend hccl
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/optimum-habana/examples/text-generation/run_generation.py", line 562, in <module>
    main()
  File "/optimum-habana/examples/text-generation/run_generation.py", line 257, in main
    model, tokenizer, generation_config = initialize_model(args, logger)
  File "/optimum-habana/examples/text-generation/utils.py", line 374, in initialize_model
    else setup_distributed_model(args, model_dtype, model_kwargs, logger)
  File "/optimum-habana/examples/text-generation/utils.py", line 238, in setup_distributed_model
    model = deepspeed.init_inference(model, **ds_inference_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 154, in __init__
    self._apply_injection_policy(config, client_module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
    replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/optimum-habana/examples/text-generation/run_generation.py", line 562, in <module>
    main()
  File "/optimum-habana/examples/text-generation/run_generation.py", line 257, in main
    model, tokenizer, generation_config = initialize_model(args, logger)
  File "/optimum-habana/examples/text-generation/utils.py", line 374, in initialize_model
    else setup_distributed_model(args, model_dtype, model_kwargs, logger)
  File "/optimum-habana/examples/text-generation/utils.py", line 238, in setup_distributed_model
    model = deepspeed.init_inference(model, **ds_inference_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 154, in __init__
    self._apply_injection_policy(config, client_module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
    replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/optimum-habana/examples/text-generation/run_generation.py", line 562, in <module>
    main()
  File "/optimum-habana/examples/text-generation/run_generation.py", line 257, in main
    model, tokenizer, generation_config = initialize_model(args, logger)
  File "/optimum-habana/examples/text-generation/utils.py", line 374, in initialize_model
    else setup_distributed_model(args, model_dtype, model_kwargs, logger)
  File "/optimum-habana/examples/text-generation/utils.py", line 238, in setup_distributed_model
    model = deepspeed.init_inference(model, **ds_inference_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 154, in __init__
    self._apply_injection_policy(config, client_module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
    replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
[2024-02-23 11:28:15,399] [WARNING] [comm.py:163:init_deepspeed_backend] HCCL backend in DeepSpeed not yet implemented
[2024-02-23 11:28:15,399] [INFO] [comm.py:637:init_distributed] cdb=None
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
[2024-02-23 11:28:15,543] [WARNING] [comm.py:163:init_deepspeed_backend] HCCL backend in DeepSpeed not yet implemented
[2024-02-23 11:28:15,543] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-02-23 11:28:15,547] [WARNING] [comm.py:163:init_deepspeed_backend] HCCL backend in DeepSpeed not yet implemented
[2024-02-23 11:28:15,547] [INFO] [comm.py:637:init_distributed] cdb=None
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/optimum-habana/examples/text-generation/run_generation.py", line 562, in <module>
    main()
  File "/optimum-habana/examples/text-generation/run_generation.py", line 257, in main
    model, tokenizer, generation_config = initialize_model(args, logger)
  File "/optimum-habana/examples/text-generation/utils.py", line 374, in initialize_model
    else setup_distributed_model(args, model_dtype, model_kwargs, logger)
  File "/optimum-habana/examples/text-generation/utils.py", line 238, in setup_distributed_model
    model = deepspeed.init_inference(model, **ds_inference_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 154, in __init__
    self._apply_injection_policy(config, client_module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
    replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
[2024-02-23 11:28:19,656] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.4+hpu.synapse.v1.14.0, git-hash=fad45b2, git-branch=1.14.0
[2024-02-23 11:28:19,657] [INFO] [logging.py:96:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/optimum-habana/examples/text-generation/run_generation.py", line 562, in <module>
    main()
  File "/optimum-habana/examples/text-generation/run_generation.py", line 257, in main
    model, tokenizer, generation_config = initialize_model(args, logger)
  File "/optimum-habana/examples/text-generation/utils.py", line 374, in initialize_model
    else setup_distributed_model(args, model_dtype, model_kwargs, logger)
  File "/optimum-habana/examples/text-generation/utils.py", line 238, in setup_distributed_model
    model = deepspeed.init_inference(model, **ds_inference_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 154, in __init__
    self._apply_injection_policy(config, client_module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
    replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/optimum-habana/examples/text-generation/run_generation.py", line 562, in <module>
    main()
  File "/optimum-habana/examples/text-generation/run_generation.py", line 257, in main
    model, tokenizer, generation_config = initialize_model(args, logger)
  File "/optimum-habana/examples/text-generation/utils.py", line 374, in initialize_model
    else setup_distributed_model(args, model_dtype, model_kwargs, logger)
  File "/optimum-habana/examples/text-generation/utils.py", line 238, in setup_distributed_model
    model = deepspeed.init_inference(model, **ds_inference_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 154, in __init__
    self._apply_injection_policy(config, client_module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
    replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/optimum-habana/examples/text-generation/run_generation.py", line 562, in <module>
    main()
  File "/optimum-habana/examples/text-generation/run_generation.py", line 257, in main
    model, tokenizer, generation_config = initialize_model(args, logger)
  File "/optimum-habana/examples/text-generation/utils.py", line 374, in initialize_model
    else setup_distributed_model(args, model_dtype, model_kwargs, logger)
  File "/optimum-habana/examples/text-generation/utils.py", line 238, in setup_distributed_model
    model = deepspeed.init_inference(model, **ds_inference_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 154, in __init__
    self._apply_injection_policy(config, client_module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
    replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Loading 0 checkpoint shards: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/optimum-habana/examples/text-generation/run_generation.py", line 562, in <module>
    main()
  File "/optimum-habana/examples/text-generation/run_generation.py", line 257, in main
    model, tokenizer, generation_config = initialize_model(args, logger)
  File "/optimum-habana/examples/text-generation/utils.py", line 374, in initialize_model
    else setup_distributed_model(args, model_dtype, model_kwargs, logger)
  File "/optimum-habana/examples/text-generation/utils.py", line 238, in setup_distributed_model
    model = deepspeed.init_inference(model, **ds_inference_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/__init__.py", line 346, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 154, in __init__
    self._apply_injection_policy(config, client_module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/inference/engine.py", line 417, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/module_inject/replace_module.py", line 340, in replace_transformer_layer
    replaced_module = set_lm_head(replaced_module)
UnboundLocalError: local variable 'replaced_module' referenced before assignment
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
[2024-02-23 11:28:21,418] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 5103
[2024-02-23 11:28:21,418] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 5104
[2024-02-23 11:28:21,423] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 5105
[2024-02-23 11:28:21,424] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 5106
[2024-02-23 11:28:21,426] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 5107
[2024-02-23 11:28:21,427] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 5108
[2024-02-23 11:28:21,428] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 5109
[2024-02-23 11:28:21,429] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 5110
[2024-02-23 11:28:21,430] [ERROR] [launch.py:322:sigkill_handler] ['/usr/bin/python3', '-u', 'run_generation.py', '--model_name_or_path', '/data/meta-llama_Llama-2-13b-chat-hf', '--batch_size', '1', '--use_hpu_graphs', '--use_kv_cache', '--max_new_tokens', '100'] exits with return code = 1
[ERROR|distributed_runner.py:222] 2024-02-23 11:28:22,075 >> deepspeed --num_nodes 1 --num_gpus 8 --no_local_rank --master_port 29500 run_generation.py --model_name_or_path /data/meta-llama_Llama-2-13b-chat-hf --batch_size 1 --use_hpu_graphs --use_kv_cache --max_new_tokens 100  exited with status = 1

root@3e44951622db:/optimum-habana/examples/text-generation#

I think it simply doesn't find any checkpoint. Can you show me the content of your model folder? Or at least just tell me if there is any checkpoint inside this folder?

$ ls -lah /data/meta-llama_Llama-2-13b-chat-hf
total 25G
drwxrwxr-x 2 smci smci 4.0K  十  23 13:28 .
drwxrwxr-x 5 smci smci 4.0K  二  24 11:45 ..
-rw-rw-r-- 1 smci smci 6.9K  十  23 13:28 LICENSE.txt
-rw-rw-r-- 1 smci smci  11K  十  23 13:28 README.md
-rw-rw-r-- 1 smci smci 4.7K  十  23 13:28 USE_POLICY.md
-rw-rw-r-- 1 smci smci  587  十  23 13:28 config.json
-rw-rw-r-- 1 smci smci  188  十  23 13:28 generation_config.json
-rw-rw-r-- 1 smci smci  815  十  23 13:28 huggingface-metadata.txt
-rw-rw-r-- 1 smci smci 9.3G  十  23 13:29 model-00001-of-00003.safetensors
-rw-rw-r-- 1 smci smci 9.3G  十  23 13:29 model-00002-of-00003.safetensors
-rw-rw-r-- 1 smci smci 5.8G  十  23 13:28 model-00003-of-00003.safetensors
-rw-rw-r-- 1 smci smci  33K  十  23 13:28 model.safetensors.index.json
-rw-rw-r-- 1 smci smci  33K  十  23 13:28 pytorch_model.bin.index.json
-rw-rw-r-- 1 smci smci  414  十  23 13:28 special_tokens_map.json
-rw-rw-r-- 1 smci smci 1.8M  十  23 13:28 tokenizer.json
-rw-rw-r-- 1 smci smci 489K  十  23 13:28 tokenizer.model
-rw-rw-r-- 1 smci smci 1.6K  十  23 13:28 tokenizer_config.json

Here is the content in the model folder

Okay, I understand better what is going on.
Your folder has checkpoints in the safetensors format only and not in the pickle format (i.e. *.bin). DeepSpeed has only recently gained support for safetensors checkpoints, and loading them should work once the next version of Habana's SDK (v1.15) is released.
For now, if you need DeepSpeed, you can only use the *.bin checkpoints. You can download them here.
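
A quick way to confirm which weight formats a local model folder actually contains is sketched below. This is only an illustration using the standard library; the helper names `checkpoint_formats` and `has_bin_weights` are hypothetical and not part of DeepSpeed, Transformers or optimum-habana. Note that `pytorch_model.bin.index.json` is an index file, not a weight shard, so it is deliberately not counted as a `.bin` checkpoint.

```python
from pathlib import Path


def checkpoint_formats(model_dir):
    """Return the set of weight-file formats found directly inside model_dir."""
    folder = Path(model_dir)
    formats = set()
    # Shards such as model-00001-of-00003.safetensors
    if any(folder.glob("*.safetensors")):
        formats.add("safetensors")
    # Shards such as pytorch_model-00001-of-00003.bin; the pattern does not
    # match pytorch_model.bin.index.json because that name ends in .json
    if any(folder.glob("*.bin")):
        formats.add("bin")
    return formats


def has_bin_weights(model_dir):
    """True if at least one *.bin weight file is present in model_dir."""
    return "bin" in checkpoint_formats(model_dir)
```

For the folder listed above, `checkpoint_formats("/data/meta-llama_Llama-2-13b-chat-hf")` would be expected to contain only `"safetensors"`, which matches the diagnosis in this comment.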

@regisss I see, it works. I thought the Transformers library could handle safetensors models as well. cmiiw

So could you show me the part of the code that says it only loads .bin files? I can't find it in load_checkpoint.py at https://github.dev/HabanaAI/DeepSpeed/tree/1.14.0

@regisss I see, it works. I thought the Transformers library could handle safetensors models as well. cmiiw

Transformers can handle safetensors checkpoints. However, for big models that cannot fit on a single device, we use DeepSpeed, and in that case DeepSpeed takes care of loading the model.

So could you show me the part of the code that says it only loads .bin files? I can't find it in load_checkpoint.py at https://github.dev/HabanaAI/DeepSpeed/tree/1.14.0

You can check the description of this PR and the following messages for more information about this issue.

@regisss By the way, I just benchmarked Habana Gaudi 2 against the Nvidia A100, and when I run inference (text generation) with TGI Gaudi, the Gaudi 2 is still slower than the A100.

Could you confirm it from your side?

According to this blog post, Habana Gaudi 2 outperforms the Nvidia A100:
https://huggingface.co/blog/habana-gaudi-2-benchmark

The json file:

{
    "inputs": "Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then she continued her way. On her way she met a copy. The copy warned the Little Blind Text, that where it came from it would have been rewritten a thousand times and everything that was left from its origin would be the word and and the Little Blind Text should turn around and return to its own, safe country. But nothing the copy said could convince her and so it didn’t take long until a few insidious Copy Writers ambushed her, made her drunk with Longe and Parole and dragged her into their agency, where they abused her for their projects again and again. And if she hasn’t been rewritten, then they are still using her. Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. 
A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then she continued her way. On her way she met a copy. The copy warned the Little Blind Text, that where it came from it would have been rewritten a thousand times and everything that was left from its origin would be the word and and the Little Blind Text should turn around and return to its own, safe country. But nothing the copy said could convince her and so it didn’t take long until a few insidious Copy Writers ambushed her, made her drunk with Longe and Parole and dragged her into their agency, where they abused her for their projects again and again. And if she hasn’t been rewritten, then they are still using her. Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. 
Even the all-powerful Pointing has no control about the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then she continued her way. On her way she met a copy. The copy warned the Little Blind Text, that where it came from it would have been rewritten a thousand times and everything that was left from its origin would be the word and and the Little Blind Text should turn around and return to its own, safe country. But nothing the copy said could convince her and so it didn’t take long until a few insidious Copy Writers ambushed her, made her drunk with Longe and Parole and dragged her into their agency, where they abused her for their projects again and again. And if she hasn’t been rewritten, then they are still using her.Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about the blind. Summarize it",
    "parameters": {
        "max_new_tokens": 4096,
        "best_of": 1,
        "repetition_penalty": 1.17,
        "return_full_text": false,
        "temperature": 0.01,
        "top_p": 0.14,
        "top_k": 49,
        "truncate": 4096,
        "typical_p": 0.99,
        "watermark": false,
        "decoder_input_details": false
    }
}

Run using hey with 5 concurrent workers and 10 requests in total

$  hey -t 1000 -m POST -D test.json -H "Content-Type: application/json" -c 5  -n 10 "http://127.0.0.1:8080/generate"
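
For readers without hey installed, the same load pattern (`-n 10` requests in total, `-c 5` concurrent workers, a 1000 second timeout) can be approximated with the Python standard library. This is a rough sketch rather than a replacement for hey: the function names `post_json` and `run_benchmark` are hypothetical, and only the average latency is computed.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def post_json(url, payload, timeout=1000):
    """Send one POST request with a JSON body and return the elapsed seconds."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()  # wait for the full generated response
    return time.perf_counter() - start


def run_benchmark(send, total_requests=10, concurrency=5):
    """Call send() total_requests times using `concurrency` worker threads.

    `send` must return the latency of one request in seconds.
    Returns (average_latency, list_of_latencies).
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(send) for _ in range(total_requests)]
        latencies = [future.result() for future in futures]
    return sum(latencies) / len(latencies), latencies


# Example usage against the endpoint from the command above:
#   with open("test.json") as f:
#       payload = json.load(f)
#   url = "http://127.0.0.1:8080/generate"
#   average, _ = run_benchmark(lambda: post_json(url, payload))
#   print(f"average latency: {average:.2f} s")
```

Passing the request function in as `send` keeps the concurrency logic separate from the HTTP call, so the same harness can be pointed at either server.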

The benchmark result:

| GPU Model | Multi-GPU / HPU | CPU | Memory (RAM) | Average |
|---|---|---|---|---|
| Nvidia A100 80GB | No (1x80GB) | 24 | 220 GB | ~ 4.45 seconds |
| Nvidia A100 40GB | No (1x40GB) | 12 | 85 GB | ~ 5.98 seconds |
| Nvidia A100 40GB | Yes (2x40GB) | 24 | 170 GB | ~ 4.21 seconds |
| Habana Gaudi 2 (HPU) | No (1x100GB) | 160 | 1 TB | ~ 12.81 seconds |
| Habana Gaudi 2 (HPU) | Yes (8x100GB) | 160 | 1 TB | ~ 12.57 seconds |
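
To put the averages side by side, the small sketch below divides each one by the single A100 80GB result. The numbers are copied from the results above; the shortened row labels and the `slowdown` helper are just illustrative names.

```python
# Average latencies in seconds, copied from the benchmark results above.
average_seconds = {
    "A100 80GB (1x)": 4.45,
    "A100 40GB (1x)": 5.98,
    "A100 40GB (2x)": 4.21,
    "Gaudi 2 (1x)": 12.81,
    "Gaudi 2 (8x)": 12.57,
}


def slowdown(results, baseline):
    """Return each latency divided by the baseline entry's latency."""
    return {name: seconds / results[baseline] for name, seconds in results.items()}


ratios = slowdown(average_seconds, "A100 80GB (1x)")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the A100 80GB average")
```

With these figures, both Gaudi 2 runs come out at roughly 2.8 to 2.9 times the A100 80GB average, and going from 1 to 8 HPUs changes the average by only about 0.24 seconds.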

@muhammad-asn how did you set up TGI for HPU?
Please note that there are several variables / arguments that need to be set to get optimal performance.

Based on your config, I would suggest adding:

  • as env variables:

    • MAX_TOTAL_TOKENS=6144 PREFILL_BATCH_BUCKET_SIZE=8 BATCH_BUCKET_SIZE=16
  • as arguments:

    • --max-batch-prefill-tokens 16384 --max-batch-total-tokens 98304 --max-input-length 2048 --max-total-tokens 6144
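
The suggested settings can be collected in one place as in the sketch below. The values are copied from the suggestion above; everything else is an assumption made for illustration: the launcher binary name `text-generation-launcher` with `--model-id`, the helper names, and the reading that the token budgets divided by the per-request limits give the bucket sizes (the numbers are consistent with that reading, but the thread does not state it).

```python
import os

# Environment variables from the suggestion above.
env_settings = {
    "MAX_TOTAL_TOKENS": "6144",
    "PREFILL_BATCH_BUCKET_SIZE": "8",
    "BATCH_BUCKET_SIZE": "16",
}

# Launcher arguments from the suggestion above.
launcher_args = {
    "--max-batch-prefill-tokens": 16384,
    "--max-batch-total-tokens": 98304,
    "--max-input-length": 2048,
    "--max-total-tokens": 6144,
}


def build_command(model_id, args):
    """Build an argv list; the launcher name here is an assumption."""
    command = ["text-generation-launcher", "--model-id", model_id]
    for flag, value in args.items():
        command += [flag, str(value)]
    return command


def implied_batch_sizes(args):
    """Return (prefill, decode) batch sizes implied by the token budgets.

    Assumption: budget divided by per-request limit gives the batch size.
    """
    prefill = args["--max-batch-prefill-tokens"] // args["--max-input-length"]
    decode = args["--max-batch-total-tokens"] // args["--max-total-tokens"]
    return prefill, decode


# Environment to pass to the server process, e.g. subprocess.run(cmd, env=full_env)
full_env = dict(os.environ, **env_settings)
command = build_command("/data/meta-llama_Llama-2-13b-chat-hf", launcher_args)
```

Under that assumption, 16384 / 2048 = 8 and 98304 / 6144 = 16, which line up with `PREFILL_BATCH_BUCKET_SIZE=8` and `BATCH_BUCKET_SIZE=16`.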

Please also note that warmup is not currently enabled on the v1.2-release branch. This means the first iterations will be much slower due to graph recompilations. However, I will add warmup later this week.

@kdamaszk ack I will try it. Thank you