NVIDIA/TensorRT-LLM

[Usage]: Greedy Search Yields Different Results

kzhou92 opened this issue · 3 comments

System Info

Machine: AWS p5en
Image: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc3

How would you like to use TensorRT-LLM

I'm running greedy search with GPT-OSS 20B but get different inference results across runs for the same prompt (the 120B model has the same issue). Here is my code:

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import CudaGraphConfig, MoeConfig

enable_attention_dp = False
max_batch_size = 4
cuda_graph_config = CudaGraphConfig(
    max_batch_size=max_batch_size,
    enable_padding=True,
)
# moe_config=MoeConfig(backend="CUTLASS")
moe_config=MoeConfig(backend="TRTLLM")


llm = LLM(
    model="path-to-model/gpt/gpt-oss-20b-bf16",
    enable_attention_dp=enable_attention_dp,
    max_batch_size=max_batch_size,
    cuda_graph_config=cuda_graph_config,
    moe_config=moe_config,
    max_seq_len=32768,
    max_num_tokens=32768,
    tensor_parallel_size=2,
    moe_expert_parallel_size=2,
    enable_mixed_sampler=True,
)

prompts = ["This is a 500-word story:",]
sampling_params = SamplingParams(
  temperature=0.0, 
  top_k=1, 
  max_tokens=256,
)

for output in llm.generate(prompts,  sampling_params):
    print(output.outputs[0].text)

When I set temperature=10000.0 and top_k=1000, the output is gibberish, which indicates that temperature and top_k take effect.
However, with temperature=0.0 and top_k=1, the output still differs from run to run for the same input prompt.
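
A minimal way to check this (assuming the `llm`, `prompts`, and `sampling_params` objects from the snippet above) is to run the same greedy request several times and compare the decoded text:

# Run the identical greedy request several times and collect the distinct outputs.
# Assumes the `llm`, `prompts`, and `sampling_params` objects defined above.
texts = set()
for _ in range(5):
    outputs = llm.generate(prompts, sampling_params)
    texts.add(outputs[0].outputs[0].text)

# With fully deterministic kernels this should print 1.
print(f"distinct outputs across 5 runs: {len(texts)}")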

I can see a couple of closed issues asking similar questions, but none of them explain why this happens. Could anyone explain it?


TensorRT-LLM by default uses non-deterministic kernels for better performance. Could you try setting the environment variable FORCE_DETERMINISTIC=1? Please be aware that deterministic mode may result in worse performance, and may not be implemented yet for some operations.
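
If it helps, a rough sketch of setting it from Python (the variable presumably needs to be visible before the engine is created, so set it at the very top of the script, or export it in the shell that launches it):

import os

# Set before tensorrt_llm builds the engine so deterministic kernel
# selection can take effect (or export it in the launching shell instead).
os.environ["FORCE_DETERMINISTIC"] = "1"

from tensorrt_llm import LLM

llm = LLM(model="path-to-model/gpt/gpt-oss-20b-bf16")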

Thanks for the answer. Now I understand where the "randomness" comes from.
However, FORCE_DETERMINISTIC=1 doesn't work for me. I'd really appreciate any other suggestions.

@hlu1 Could you confirm whether deterministic output is supported for the GPT-OSS models?