v2.0.0-release: 8 extra tokens appended to the input tokens trigger huggingface_hub.errors.ValidationError: Input validation error: `inputs` tokens + `max_new_tokens` must be <= 32512. Given: 32008 `inputs` tokens and 512 `max_new_tokens`; no such issue in v1.2.2-release
IT-Forrest opened this issue · 1 comment
System Info
Compared with the v1.2.2-release of tgi-gaudi, sending the same query to the v2.0.0-release tgi server hits the input_token_length + output_token_length assertion.
Specifically, with input_token_length=32000 and output_token_length=512, I hit the following assertion:
huggingface_hub.errors.ValidationError: Input validation error: inputs tokens + max_new_tokens must be <= 32512. Given: 32008 inputs tokens and 512 max_new_tokens
According to the log, the input token length is 32008 instead of 32000, so we suspect that 8 extra tokens are appended to the input tokens (a local check is sketched below).
Two additional observations:
- in v1.2.2-release tgi-gaudi, with the same configuration (input_token_length=32000, output_token_length=512), the query is handled and no assertion is raised
- without tgi-gaudi, with input_token_length=32000 and output_token_length=512, the examples/text-generation/run_generation.py script in the optimum-habana repo does not hit this kind of issue
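One way to check whether the extra tokens could come from the prompt template rather than the raw context is to tokenize both locally. This is only a sketch, assuming the same tokenizer path and the <s>[INST] ... [/INST] wrapping used in the query script in the Reproduction section below; it does not show what the server itself appends.
# Sketch: compare token counts of the raw context vs. the templated prompt.
# Assumes the same local tokenizer and prompt wrapping as the query script
# below; replace context_str with the actual 32000-token context.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/software/tgi_gaudi/USER/mistral-7b/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24/"
)

context_str = "..."  # the long context built in the query script below
raw_len = len(tokenizer(context_str)["input_ids"])
templated_len = len(tokenizer("<s>[INST] %s [/INST]" % context_str)["input_ids"])
print(f"raw: {raw_len}, templated: {templated_len}, extra: {templated_len - raw_len}")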
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
Steps to reproduce the error
- git clone tgi-gaudi
git clone https://github.com/huggingface/tgi-gaudi.git
cd tgi-gaudi && git checkout v2.0.0-release
- modify the Dockerfile
$ git diff Dockerfile
diff --git a/Dockerfile b/Dockerfile
index c7c6576..7a45934 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -60,7 +60,9 @@ RUN cd server && \
pip install -r requirements.txt && \
bash ./dill-0.3.8-patch.sh && \
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.15.0 && \
- pip install . --no-cache-dir
+ pip install . --no-cache-dir && \
+ pip install git+https://github.com/huggingface/optimum-habana.git
+
# Install benchmarker
COPY --from=builder /usr/src/target/release/text-generation-benchmark /usr/local/bin/text-generation-benchmark
- compile tgi-gaudi
docker build -t tgi_gaudi_image:v2.0.0-release .
- prepare the optimum-habana repo
cd /home/USER
git clone https://github.com/huggingface/optimum-habana.git
- prepare the model in a local dir
E.g. to download mistral-7b to this dir
/software/tgi_gaudi/USER/mistral-7b/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24/
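If the checkpoint is not already present locally, one way to fetch it is with huggingface_hub (a sketch; the repo_id below is an assumption, so replace it and local_dir with whatever matches the snapshot path above):
# Sketch: download a Mistral-7B checkpoint into a local directory that the
# container can mount. repo_id is an assumption; use the repository that the
# snapshot hash above actually belongs to, and adjust local_dir to your setup.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",
    local_dir="/software/tgi_gaudi/USER/mistral-7b",
)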
- launch the tgi-server
Save the following commands into a script launch_mistral-7b_tgi.sh, make it executable with chmod 777 launch_mistral-7b_tgi.sh, and then launch it with ./launch_mistral-7b_tgi.sh:
bucket_size=3000
input_len=32000
total_len=32512
prefill_len=32000
mbs_total_len=130048
command="docker run -p 8850:80 --name='jwang_tgi931_v2.0.0' -v /sys/kernel/debug:/sys/kernel/debug \
-v /software/tgi_gaudi/USER:/root/ckpt \
-v /home/USER/test/optimum-habana/examples/text-generation:/root/text-generation \
-e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
-e HABANA_VISIBLE_DEVICES=all \
-e PREFILL_BATCH_BUCKET_SIZE=1 \
-e BATCH_BUCKET_SIZE=4 \
-e PAD_SEQUENCE_TO_MULTIPLE_OF=$bucket_size \
-e QUANT_CONFIG=/root/text-generation/hqt_output_mis7b_1x/tmp_maxabs_quant.json \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
-e ENABLE_HPU_GRAPH=true \
-e LIMIT_HPU_GRAPH=true \
--cap-add=sys_nice --ipc=host tgi_gaudi_image:v2.0.0-release \
--model-id /root/ckpt/mistral-7b/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24/ \
--max-input-length $input_len --max-total-tokens $total_len \
--max-batch-prefill-tokens $prefill_len --max-batch-total-tokens $mbs_total_len"
echo "$command"
eval "$command"
# working version docker image name: test_new_tgi
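Once the server is up, it may help to confirm it answers a small request before running the long-context test. This is a minimal check using requests, assuming the 8850:80 port mapping from the launch command above:
# Sketch: sanity-check the TGI /generate endpoint with a tiny prompt before
# sending the 32000-token query. Assumes the server is reachable on port 8850.
import requests

resp = requests.post(
    "http://localhost:8850/generate",
    json={"inputs": "Hello", "parameters": {"max_new_tokens": 8}},
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.json())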
- send a query to the tgi-server
Install the necessary dependencies, e.g. pip install nltk && pip install transformers.
Save the following script as send_query_to_tgi_server.py, then launch it with python send_query_to_tgi_server.py:
import requests
import json
from tqdm import tqdm
from huggingface_hub import InferenceClient
import time
import nltk
nltk.download("gutenberg")
nltk.download("punkt")
from nltk.corpus import gutenberg
from transformers import AutoTokenizer
import random

random.seed(42)

headers = {'Content-Type': 'application/json', 'Accept': 'text/event-stream'}

# mistral 8850
base_endpoint = "http://localhost:8850"
generate_endpoint = "%s/generate_stream" % (base_endpoint)
client = InferenceClient(base_endpoint)

# mistral
tokenizer = AutoTokenizer.from_pretrained("/software/tgi_gaudi/USER/mistral-7b/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24/")

list_sents = gutenberg.sents()
long_sents = [" ".join(sent) for sent in list_sents if len(sent) > 50]
num_sents_list = list(range(380, 400, 20))  # 32k test, 1 card
random_seed = random.randint(1, 256)

for in_len, out_len in [(142133, 512)]:  # 142133->32000->assert in 2.0.0, 142088->31991->21.09, 142090->31992->0.01
    cur_sent_len = 386
    context_str = " ".join(long_sents[0:cur_sent_len])
    context_str = context_str[0:in_len]
    token_count = len(tokenizer(context_str)['input_ids'])
    print(f"Number of input Tokens: {token_count}, output token {out_len}")
    data = {"inputs": "<s>[INST] %s [/INST]" % (context_str),
            "parameters": {"max_new_tokens": out_len, "do_sample": True, "seed": random_seed,
                           "repetition_penalty": 1.2, "temperature": 0.95, "top_k": 5}}
    n_tokens = 0
    is_first = True
    start_ts = time.time()
    for token in client.text_generation(data['inputs'], **data['parameters'], stream=True):
        print(token, end='', flush=True)
        if is_first:
            first_time = time.time() - start_ts
            print(f'Num Gen Tokens: {token_count} 1st token latency {first_time*1000} ms')
            is_first = False
        n_tokens += 1
    print('\n')
    end_ts = time.time()
    delta_time = end_ts - start_ts
    print('Num Gen Tokens: %d, DeltaTime: %f, Throughput: %f (tok/sec)' % (n_tokens, delta_time, n_tokens * 1.0 / delta_time))
    print('\n-------------------------------------------------------------------------------------\n')
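As a possible workaround while the discrepancy is investigated, the context could be truncated by token count instead of character count, leaving headroom for the template tokens. This is only a sketch reusing tokenizer and context_str from the script above; the 8-token margin is an assumption derived from the reported 32008 vs. 32000 counts, not a confirmed value.
# Sketch of a workaround: truncate the context by tokens, not characters,
# so the templated prompt stays within --max-input-length. The margin of 8
# tokens is an assumption based on the 32008-vs-32000 discrepancy above.
max_input_tokens = 32000
template_margin = 8
ids = tokenizer(context_str)["input_ids"]
ids = ids[: max_input_tokens - template_margin]
context_str = tokenizer.decode(ids, skip_special_tokens=True)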
Expected behavior
Expect the request to pass when given input_token_length=32000 and output_token_length=512, as in the following log from a working setup:
[nltk_data] Downloading package gutenberg to
[nltk_data] /weka/home/USER/nltk_data...
[nltk_data] Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /weka/home/USER/nltk_data...
[nltk_data] Package punkt is already up-to-date!
Number of input Tokens: 32000, output token 512
ofNum Gen Tokens: 32000 1st token latency 10729.066610336304 ms
to—, not that- very to in and very and
Question not to not a in not and to not very a very to-- that not to not in to not not-- not in very-- not to not not to in in very not a a very, very, not not a to the very, ver very not a not to very to, in not to in in a very not not-- not not not to and a not not in to a very to to in a not to that not a not a tos in to that to to and to a to not in ito to very not, in in to not a very to a a to very a to not a not a not not not to and very to not not a to the that in in a to and very to to in-- the in not and in in in that int a to to and to a not to in a in a not that in to to, a not to very, in not, not in not not not not and in in to to in to to a very in in not to in to in not a very to not a very, to to very a to in to a to not not in to and in the very not and not not thes the very not-- tos not to not and in tos to a to not to and a very a not and very not not not to a and in------ to in in a not to-- not not and not not in not in a in and to and very not to not to to not not a not, not in that very and not not---- to not-- to in in not not to no-- in very---- a very-- to in for---- and to a to to-- not-- not to not a very to not to in a in a not-- the in-- not not to in not a not in-- to a very not not the not not a to
Num Gen Tokens: 512, DeltaTime: 23.849237, Throughput: 21.468192 (tok/sec)
-------------------------------------------------------------------------------------
I filed an internal ticket and am therefore temporarily closing this issue.