v2.0.0-release: 8 extra tokens appended to the input tokens trigger huggingface_hub.errors.ValidationError: Input validation error: `inputs` tokens + `max_new_tokens` must be <= 32512. Given: 32008 `inputs` tokens and 512 `max_new_tokens`; no such issue in v1.2.2-release
IT-Forrest opened this issue · 1 comment
System Info
Compared with the v1.2.2-release of tgi-gaudi, sending the same query to the v2.0.0-release tgi server hits the input_token_length + output_token_length assertion.
Specifically, with input_token_length=32000 and output_token_length=512, I hit the following assertion:
huggingface_hub.errors.ValidationError: Input validation error: inputs tokens + max_new_tokens must be <= 32512. Given: 32008 inputs tokens and 512 max_new_tokens
According to the log, the input token length is 32008 instead of 32000, so we suspect that 8 extra tokens are appended to the input tokens (a local check is sketched below).
Two additional observations:
- in v1.2.2-release tgi-gaudi, with the same configuration (input_token_length=32000, output_token_length=512), the query is handled and no assertion is raised
- without tgi-gaudi, with input_token_length=32000 and output_token_length=512, the examples/text-generation/run_generation.py script in the optimum-habana repo does not hit this kind of issue
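One way to check whether the extra tokens could come from the prompt template rather than the raw context is to tokenize both locally. This is only a sketch, assuming the same tokenizer path and the <s>[INST] ... [/INST] wrapping used in the query script in the Reproduction section below; it does not show what the server itself appends.
# Sketch: compare token counts of the raw context vs. the templated prompt.
# Assumes the same local tokenizer and prompt wrapping as the query script
# below; replace context_str with the actual 32000-token context.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/software/tgi_gaudi/USER/mistral-7b/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24/"
)

context_str = "..."  # the long context built in the query script below
raw_len = len(tokenizer(context_str)["input_ids"])
templated_len = len(tokenizer("<s>[INST] %s [/INST]" % context_str)["input_ids"])
print(f"raw: {raw_len}, templated: {templated_len}, extra: {templated_len - raw_len}")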
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
Steps to reproduce the error
- git clone tgi-gaudi
git clone https://github.com/huggingface/tgi-gaudi.git
cd tgi-gaudi && git checkout v2.0.0-release
- modify the Dockerfile
$ git diff Dockerfile
diff --git a/Dockerfile b/Dockerfile
index c7c6576..7a45934 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -60,7 +60,9 @@ RUN cd server && \
pip install -r requirements.txt && \
bash ./dill-0.3.8-patch.sh && \
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.15.0 && \
- pip install . --no-cache-dir
+ pip install . --no-cache-dir && \
+ pip install git+https://github.com/huggingface/optimum-habana.git
+
# Install benchmarker
COPY --from=builder /usr/src/target/release/text-generation-benchmark /usr/local/bin/text-generation-benchmark
- compile tgi-gaudi
docker build -t tgi_gaudi_image:v2.0.0-release .
- prepare the optimum-habana repo
cd /home/USER
git clone https://github.com/huggingface/optimum-habana.git
- prepare the model in a local dir
E.g. to download mistral-7b to this dir
/software/tgi_gaudi/USER/mistral-7b/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24/
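If the checkpoint is not already present locally, one way to fetch it is with huggingface_hub (a sketch; the repo_id below is an assumption, so replace it and local_dir with whatever matches the snapshot path above):
# Sketch: download a Mistral-7B checkpoint into a local directory that the
# container can mount. repo_id is an assumption; use the repository that the
# snapshot hash above actually belongs to, and adjust local_dir to your setup.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",
    local_dir="/software/tgi_gaudi/USER/mistral-7b",
)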
- launch the tgi-server
Save the following commands into a script launch_mistral-7b_tgi.sh, make it executable with chmod 777 launch_mistral-7b_tgi.sh, and then launch it with ./launch_mistral-7b_tgi.sh:
bucket_size=3000
input_len=32000
total_len=32512
prefill_len=32000
mbs_total_len=130048
command="docker run -p 8850:80 --name='jwang_tgi931_v2.0.0' -v /sys/kernel/debug:/sys/kernel/debug \
-v /software/tgi_gaudi/USER:/root/ckpt \
-v /home/USER/test/optimum-habana/examples/text-generation:/root/text-generation \
-e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
-e HABANA_VISIBLE_DEVICES=all \
-e PREFILL_BATCH_BUCKET_SIZE=1 \
-e BATCH_BUCKET_SIZE=4 \
-e PAD_SEQUENCE_TO_MULTIPLE_OF=$bucket_size \
-e QUANT_CONFIG=/root/text-generation/hqt_output_mis7b_1x/tmp_maxabs_quant.json \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
-e ENABLE_HPU_GRAPH=true \
-e LIMIT_HPU_GRAPH=true \
--cap-add=sys_nice --ipc=host tgi_gaudi_image:v2.0.0-release \
--model-id /root/ckpt/mistral-7b/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24/ \
--max-input-length $input_len --max-total-tokens $total_len \
--max-batch-prefill-tokens $prefill_len --max-batch-total-tokens $mbs_total_len"
echo "$command"
eval "$command"
# working version docker image name: test_new_tgi
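Once the server is up, it may help to confirm it answers a small request before running the long-context test. This is a minimal check using requests, assuming the 8850:80 port mapping from the launch command above:
# Sketch: sanity-check the TGI /generate endpoint with a tiny prompt before
# sending the 32000-token query. Assumes the server is reachable on port 8850.
import requests

resp = requests.post(
    "http://localhost:8850/generate",
    json={"inputs": "Hello", "parameters": {"max_new_tokens": 8}},
    headers={"Content-Type": "application/json"},
)
print(resp.status_code, resp.json())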
- send a query to the tgi-server
Install the necessary dependencies, e.g. pip install nltk && pip install transformers.
Save the following script as send_query_to_tgi_server.py, then launch it with python send_query_to_tgi_server.py:
import requests
import json
from tqdm import tqdm
from huggingface_hub import InferenceClient
import time
import nltk
nltk.download("gutenberg")
nltk.download("punkt")
from nltk.corpus import gutenberg
from transformers import AutoTokenizer
import random

random.seed(42)

headers = {'Content-Type': 'application/json', 'Accept': 'text/event-stream'}

# mistral 8850
base_endpoint = "http://localhost:8850"
generate_endpoint = "%s/generate_stream" % (base_endpoint)
client = InferenceClient(base_endpoint)

# mistral
tokenizer = AutoTokenizer.from_pretrained("/software/tgi_gaudi/USER/mistral-7b/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24/")

list_sents = gutenberg.sents()
long_sents = [" ".join(sent) for sent in list_sents if len(sent) > 50]
num_sents_list = list(range(380, 400, 20))  # 32k test, 1 card
random_seed = random.randint(1, 256)

for in_len, out_len in [(142133, 512)]:  # 142133->32000->assert in 2.0.0, 142088->31991->21.09, 142090->31992->0.01
    cur_sent_len = 386
    context_str = " ".join(long_sents[0:cur_sent_len])
    context_str = context_str[0:in_len]
    token_count = len(tokenizer(context_str)['input_ids'])
    print(f"Number of input Tokens: {token_count}, output token {out_len}")
    data = {"inputs": "<s>[INST] %s [/INST]" % (context_str),
            "parameters": {"max_new_tokens": out_len, "do_sample": True, "seed": random_seed,
                           "repetition_penalty": 1.2, "temperature": 0.95, "top_k": 5}}
    n_tokens = 0
    is_first = True
    start_ts = time.time()
    for token in client.text_generation(data['inputs'], **data['parameters'], stream=True):
        print(token, end='', flush=True)
        if is_first:
            first_time = time.time() - start_ts
            print(f'Num Gen Tokens: {token_count} 1st token latency {first_time*1000} ms')
            is_first = False
        n_tokens += 1
    print('\n')
    end_ts = time.time()
    delta_time = end_ts - start_ts
    print('Num Gen Tokens: %d, DeltaTime: %f, Throughput: %f (tok/sec)' % (n_tokens, delta_time, n_tokens * 1.0 / delta_time))
    print('\n-------------------------------------------------------------------------------------\n')
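As a possible workaround while the discrepancy is investigated, the context could be truncated by token count instead of character count, leaving headroom for the template tokens. This is only a sketch reusing tokenizer and context_str from the script above; the 8-token margin is an assumption derived from the reported 32008 vs. 32000 counts, not a confirmed value.
# Sketch of a workaround: truncate the context by tokens, not characters,
# so the templated prompt stays within --max-input-length. The margin of 8
# tokens is an assumption based on the 32008-vs-32000 discrepancy above.
max_input_tokens = 32000
template_margin = 8
ids = tokenizer(context_str)["input_ids"]
ids = ids[: max_input_tokens - template_margin]
context_str = tokenizer.decode(ids, skip_special_tokens=True)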
Expected behavior
Expect the request to pass when given input_token_length=32000 and output_token_length=512, as in the following log from a working setup:
[nltk_data] Downloading package gutenberg to
[nltk_data] /weka/home/USER/nltk_data...
[nltk_data] Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /weka/home/USER/nltk_data...
[nltk_data] Package punkt is already up-to-date!
Number of input Tokens: 32000, output token 512
ofNum Gen Tokens: 32000 1st token latency 10729.066610336304 ms
to—, not that- very to in and very and
Question not to not a in not and to not very a very to-- that not to not in to not not-- not in very-- not to not not to in in very not a a very, very, not not a to the very, ver very not a not to very to, in not to in in a very not not-- not not not to and a not not in to a very to to in a not to that not a not a tos in to that to to and to a to not in ito to very not, in in to not a very to a a to very a to not a not a not not not to and very to not not a to the that in in a to and very to to in-- the in not and in in in that int a to to and to a not to in a in a not that in to to, a not to very, in not, not in not not not not and in in to to in to to a very in in not to in to in not a very to not a very, to to very a to in to a to not not in to and in the very not and not not thes the very not-- tos not to not and in tos to a to not to and a very a not and very not not not to a and in------ to in in a not to-- not not and not not in not in a in and to and very not to not to to not not a not, not in that very and not not---- to not-- to in in not not to no-- in very---- a very-- to in for---- and to a to to-- not-- not to not a very to not to in a in a not-- the in-- not not to in not a not in-- to a very not not the not not a to
Num Gen Tokens: 512, DeltaTime: 23.849237, Throughput: 21.468192 (tok/sec)
-------------------------------------------------------------------------------------
I filed an internal ticket and am therefore temporarily closing this issue.