huggingface/tgi-gaudi

Misleading documentation

12010486 opened this issue · 4 comments

Hi everyone,

I can see there has been a recent effort to add more documentation on TGI, and I appreciate it. However, there are some sections that are misleading, for example:
docs/source/conceptual/quantization.md
It describes quantization with GPTQ and quantization with bitsandbytes, but to the best of my knowledge, neither works on Gaudi2 (we tested bitsandbytes, and CUDA calls are hardcoded in it).
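To illustrate the failure mode being reported: code that unconditionally targets a `"cuda"` device fails on any non-CUDA accelerator such as Gaudi. This is a hypothetical sketch of that pattern, not bitsandbytes' actual code; the function and variable names are mine.

```python
# Hypothetical sketch: a quantization path that hardcodes "cuda"
# breaks on hosts whose runtime only exposes other devices.
AVAILABLE_DEVICES = {"cpu", "hpu"}  # e.g. a Gaudi2 host: no "cuda"

def quantize(weights, device="cuda"):
    # The device is hardcoded rather than queried from the runtime,
    # so this raises on a Gaudi host even though "hpu" would work.
    if device not in AVAILABLE_DEVICES:
        raise RuntimeError(f"device '{device}' not available on this host")
    # Stand-in for real int8 quantization of the weights.
    return [round(w) for w in weights]
```

Calling `quantize([...])` with the default device raises on this simulated Gaudi host, while passing `device="hpu"` succeeds.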

My ask would be to prune the bits that are not relevant for Gaudi.

I can also contribute if you find it relevant. We interact with customers, so we might bring a different perspective.

Maybe we should make it clearer in the README that not all features of TGI are supported on Gaudi, and that the documentation for this fork is the README itself.

I came here to chime in that the documentation is wrong.

This example crashes during warmup:

```shell
docker run -p 8080:80 \
  --runtime=habana \
  -v $volume:/data \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e HF_HUB_ENABLE_HF_TRANSFER=1 \
  -e HUGGING_FACE_HUB_TOKEN=$hf_token \
  -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
  -e PREFILL_BATCH_BUCKET_SIZE=1 \
  -e BATCH_BUCKET_SIZE=256 \
  -e PAD_SEQUENCE_TO_MULTIPLE_OF=128 \
  --cap-add=sys_nice \
  --ipc=host \
  ghcr.io/huggingface/tgi-gaudi:2.0.1 \
  --model-id $model \
  --max-batch-prefill-tokens 8242 \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --max-batch-size 256 \
  --max-concurrent-requests 400 \
  --sharded true \
  --num-shard 8
```
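For context on the `PAD_SEQUENCE_TO_MULTIPLE_OF` variable in the command above: on Gaudi, sequence lengths are padded up to a bucket boundary so compiled HPU graphs can be reused across requests. A minimal sketch of that rounding, assuming simple round-up-to-multiple behavior (the helper name is mine, not from TGI):

```python
def pad_to_multiple(seq_len: int, multiple: int = 128) -> int:
    # Round a sequence length up to the next bucket boundary,
    # e.g. a 1000-token prompt lands in the 1024 bucket.
    return ((seq_len + multiple - 1) // multiple) * multiple

print(pad_to_multiple(1000))  # 1024
print(pad_to_multiple(4096))  # 4096 (already on a boundary)
```

This is only meant to show why prompt lengths in this setup behave as if quantized to multiples of 128, not how the server implements it.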

@endomorphosis Can you please point me at where you find this example in the documentation? I can't find it.