huggingface/tgi-gaudi

Clarification on past_key_values type for Starcoder

vidyasiv opened this issue · 3 comments

System Info

Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: N/A
Docker label: N/A
nvidia-smi:
N/A

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

docker run -p 8080:80 -v $VOLUME:/data --runtime=habana \
        -e HABANA_VISIBLE_DEVICES=all -e HUGGING_FACE_HUB_TOKEN=<> \
        -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host \
        -e LOG_LEVEL=debug,text_generation_launcher=debug \
        -it --rm --entrypoint /bin/bash \
        ghcr.io/huggingface/tgi-gaudi:1.2.1 --model-id bigcode/starcoder

And the error is:

2024-04-02T21:35:26.131011Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py", line 209, in gaudi_gpt_bigcode_model_forward
2024-04-02T21:35:26.131016Z DEBUG text_generation_launcher:     past_length = past_key_values[0].size(-2)
2024-04-02T21:35:26.131019Z DEBUG text_generation_launcher: AttributeError: 'tuple' object has no attribute 'size'

In optimum-habana, gaudi_gpt_bigcode_model_forward() expects past_key_values to be a list of tensors, and the output.past_key_values received from the first forward pass here: https://github.com/huggingface/tgi-gaudi/blob/habana-main/server/text_generation_server/models/causal_lm.py#L782 is indeed a list of tensors. Before the second pass, however, the attach_kv_cache() function (code: https://github.com/huggingface/tgi-gaudi/blob/habana-main/server/text_generation_server/models/causal_lm.py#L253) turns it into a list of tuples, which is incompatible with optimum-habana.
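For illustration, here is a minimal standalone sketch of the mismatch (illustrative only, not actual tgi-gaudi code; the fused [batch_size, seq_length, 2 * head_dim] layout reflects GPT-BigCode's multi-query attention):

import torch

# A list of plain tensors, as gaudi_gpt_bigcode_model_forward() expects:
past_key_values = [torch.zeros(1, 10, 2 * 64)]
print(past_key_values[0].size(-2))  # 10 -- past_length is read this way

# After attach_kv_cache() each entry becomes a tuple, so the same read fails:
past_key_values = [(torch.zeros(1, 10, 2 * 64),)]
past_key_values[0].size(-2)  # AttributeError: 'tuple' object has no attribute 'size'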

I can help fix this, but I need clarification on which data type/behavior to honor (a sketch of both options follows this list):

  • Fix in optimum-habana to handle a tuple, OR
  • Fix in tgi-gaudi to convert back to a tensor
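To make the two options concrete, here is a hedged sketch of each fix; the helper names are hypothetical, and the exact reassembly in option 2 depends on how attach_kv_cache() restructured the cache:

# Option 1: optimum-habana tolerates a tuple-wrapped entry (hypothetical helper).
def past_length_from_entry(entry):
    if isinstance(entry, tuple):
        entry = entry[0]  # unwrap down to the underlying tensor
    return entry.size(-2)

# Option 2: tgi-gaudi converts tuples back to plain tensors before the
# next forward pass (hypothetical helper; assumes the tuple merely wraps
# the original fused tensor rather than splitting it).
def normalize_cache(past_key_values):
    return [
        layer[0] if isinstance(layer, tuple) else layer
        for layer in past_key_values
    ]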

Expected behavior

Server starts up, runs model warmup successfully, and waits for requests.

@vidyasiv thank you for raising this issue. I think that tgi-gaudi should support different types of KV cache.
However, it is not only the data type that differs here, but also the overall tensor shape and content, am I right? At the moment tgi-gaudi assumes the KV cache has the same layout as Llama's: a list of tuples containing tensors of shape [batch_size, num_heads, seq_length, head_dim].
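To make the shape difference concrete, a rough sketch of the two layouts (shapes as in the transformers implementations; treat this as illustrative):

import torch

batch_size, num_heads, seq_length, head_dim = 2, 32, 10, 64

# Llama-style cache: one (key, value) tuple per layer, each tensor of
# shape [batch_size, num_heads, seq_length, head_dim].
llama_layer = (
    torch.zeros(batch_size, num_heads, seq_length, head_dim),
    torch.zeros(batch_size, num_heads, seq_length, head_dim),
)

# GPT-BigCode (Starcoder) uses multi-query attention: key and value are
# fused into a single tensor per layer, [batch_size, seq_length, 2 * head_dim].
starcoder_layer = torch.zeros(batch_size, seq_length, 2 * head_dim)

print(llama_layer[0].size(-2), starcoder_layer.size(-2))  # both read seq_length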

Thanks for explaining! I am still confused by the KV-cache-related code living in two places, i.e. both tgi-gaudi and optimum-habana. Is there a difference in what is implemented in each repository?

What exactly is not clear? tgi-gaudi only aligns the data in the KV cache (shift-left operations, inserting new requests, etc.). However, because of those operations we cannot use the reuse-cache flow, since the KV cache has to be accessible from outside the model. Communication between tgi-gaudi and optimum-habana happens mostly through the forward() function and its input/output arguments (one of them being the KV cache).
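As a rough illustration of that contract (a simplified sketch, not the actual tgi-gaudi generation loop):

def generate_loop(model, step_inputs):
    # The server owns the KV cache between calls and hands it back into
    # the model through forward()'s past_key_values argument.
    past_key_values = None
    for input_ids in step_inputs:
        output = model(
            input_ids=input_ids,
            past_key_values=past_key_values,
            use_cache=True,
        )
        # tgi-gaudi may realign the cache here (shift-left, inserting new
        # requests, ...), which is why the cache must live outside the
        # model instead of being reused internally.
        past_key_values = output.past_key_values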