NVIDIA/TensorRT-LLM

System hang when setting penalty

Linzecong opened this issue · 1 comment

Here is my build command.

python build.py --model_dir Yi-34B-Chat --dtype float16 --remove_input_padding  --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --world_size 2 --tp_size 2 --enable_context_fmha  --use_inflight_batching  --paged_kv_cache  --load_by_shard  --use_weight_only  --weight_only_precision int4 --output_dir /app/triton_model/tensorrt_llm/1

The model is Yi-34B with int4 weight-only quantization.

The server gets stuck after running for a while. This only happens when a penalty is set, and only under high concurrency.

Here is my test code.

import random
import traceback

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype


triton_client = grpcclient.InferenceServerClient(url="localhost:14568")

def prepare_tensor(name, input):
    t = grpcclient.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t

def get_output_e2e(raw_text, **kwargs):
    global triton_client
    model_name = "ensemble"

    input0 = [[raw_text]]
    input0_data = np.array(input0).astype(object)
    output0_len = np.ones_like(input0).astype(np.int32) * kwargs["max_tokens"]
    bad_words_list = np.array([[""]], dtype=object)
    stop_words_list = np.array([kwargs["stop"]], dtype=object)
    
    top_k_data = np.array([[kwargs["top_k"]]], dtype=np.int32)
    top_p_data = np.array([[kwargs["top_p"]]], dtype=np.float32)
    temperature_data = np.array([[kwargs["temperature"]]], dtype=np.float32)
    repetition_penalty_data = np.array([[kwargs["repetition_penalty"]]], dtype=np.float32)
    presence_penalty_data = np.array([[kwargs["presence_penalty"]]], dtype=np.float32)
    random_seed_data = np.array([[kwargs["random_seed"]]], dtype=np.uint64)
    
    streaming = [[False]]
    streaming_data = np.array(streaming, dtype=bool)

    inputs = [
        prepare_tensor("text_input", input0_data),
        prepare_tensor("max_tokens", output0_len),
        prepare_tensor("bad_words", bad_words_list),
        prepare_tensor("stop_words", stop_words_list),
        prepare_tensor("stream", streaming_data),

        prepare_tensor("top_k", top_k_data),
        prepare_tensor("top_p", top_p_data),
        prepare_tensor("temperature", temperature_data),
        prepare_tensor("random_seed", random_seed_data),
    ]
    
    if kwargs["presence_penalty"] != 0:
        inputs.append(prepare_tensor("presence_penalty", presence_penalty_data))
    if kwargs["repetition_penalty"] != 1:
        inputs.append(prepare_tensor("repetition_penalty", repetition_penalty_data))

    # Retry up to three times; if the server stops responding, reconnect.
    retry = 0
    while retry < 3:
        try:
            result = triton_client.infer(model_name, inputs)
            return result.as_numpy('text_output')[0].decode()
        except Exception:
            retry += 1
            print("==============retry: ", retry, "==============")
            traceback.print_exc()
            if not triton_client.is_server_ready():
                triton_client.close()
                triton_client = grpcclient.InferenceServerClient(url="localhost:14568")
    return None

if __name__ == "__main__":
    import threading
    def _run(i):
        print(i, get_output_e2e(
            raw_text="The quick brown fox jumps over the lazy dog",
            max_tokens=100,
            top_k=50,
            top_p=0.95,
            temperature=0.8,
            repetition_penalty=1.5,  # hangs when this is not 1
            presence_penalty=0,      # hangs when this is not 0
            stop=["\n"],
            random_seed=random.randint(0, 1000000000),
        ))
    
    threads = []
    for i in range(100):
        threads.append(threading.Thread(target=_run, args=(i,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("all done")

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2      |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:56:00.0 Off |                  Off |
| 30%   41C    P2              69W / 450W |  22526MiB / 24564MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:57:00.0 Off |                  Off |
| 30%   39C    P2              85W / 450W |  22524MiB / 24564MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

All requests get stuck.

I also tested the AWQ quantization method and an A30 GPU; both hang as well. Sometimes only one GPU sits at 100% utilization while the other is idle. Both 0.6.1 and 0.7.1 have this problem.

That is all the information I have. I am not sure whether it is caused by tensor parallelism. Are there any debugging methods I can try?
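
From the client side, one thing I can do while the load is running is poll the server to check whether it is still live and whether the completed-request count is still increasing, or whether everything has stalled. A rough watchdog sketch using tritonclient (the statistics fields I read are an assumption on my part; the model name "ensemble" matches the script above):

import time
import tritonclient.grpc as grpcclient

def watchdog(url="localhost:14568", model="ensemble", interval=10):
    # Poll liveness and per-model statistics; if the success count stops
    # increasing while clients still have requests outstanding, the backend
    # is probably stuck rather than just slow.
    client = grpcclient.InferenceServerClient(url=url)
    last_count = None
    while True:
        live = client.is_server_live()
        stats = client.get_inference_statistics(model_name=model)
        count = stats.model_stats[0].inference_stats.success.count
        print(f"live={live} success_count={count}")
        if live and count == last_count:
            print("no progress since the last poll, backend may be hung")
        last_count = count
        time.sleep(interval)

if __name__ == "__main__":
    watchdog()

On the server side, I guess attaching gdb to the hung tritonserver ranks and dumping all thread backtraces would show where they block, but I would appreciate pointers to an intended debugging workflow.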

possible solutions #149