[Usage]: Greedy Search Yields Different Results
kzhou92 opened this issue · 3 comments
System Info
Machine: aws p5en
Image: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc3
How would you like to use TensorRT-LLM
I'm running greedy search with GPT-OSS 20B but get different inference results for the same prompt (the 120B model has the same issue). Here is my code:
```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import CudaGraphConfig, MoeConfig

enable_attention_dp = False
max_batch_size = 4

cuda_graph_config = CudaGraphConfig(
    max_batch_size=max_batch_size,
    enable_padding=True,
)

# moe_config = MoeConfig(backend="CUTLASS")
moe_config = MoeConfig(backend="TRTLLM")

llm = LLM(
    model="path-to-model/gpt/gpt-oss-20b-bf16",
    enable_attention_dp=enable_attention_dp,
    max_batch_size=max_batch_size,
    cuda_graph_config=cuda_graph_config,
    moe_config=moe_config,
    max_seq_len=32768,
    max_num_tokens=32768,
    tensor_parallel_size=2,
    moe_expert_parallel_size=2,
    enable_mixed_sampler=True,
)

prompts = ["This is a 500-word story:"]

# Greedy decoding: temperature 0 with top_k=1 should always pick the argmax token.
sampling_params = SamplingParams(
    temperature=0.0,
    top_k=1,
    max_tokens=256,
)

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```
When I set temperature=10000.0 and top_k=1000, the output is garbled, which indicates that temperature and top_k do take effect.
However, when I set temperature=0.0 and top_k=1, the output is not identical across runs for the same input prompt.
I can see a couple of closed issues asking a similar question, but there is no explanation of why this happens. Could anyone explain it?
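For reference, this is the kind of check I'm using to confirm the behavior: repeat the same greedy request several times against the `llm` built above and compare the decoded texts (a minimal sketch; the repeat count of 5 is arbitrary):

```python
# Sketch: re-run the same greedy request and check whether the outputs match.
# Assumes the `llm` and `sampling_params` objects defined above.
texts = []
for _ in range(5):
    outputs = llm.generate(["This is a 500-word story:"], sampling_params)
    texts.append(outputs[0].outputs[0].text)

# With fully deterministic greedy decoding, all repeats should be identical.
print("all identical:", len(set(texts)) == 1)
```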
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.
TensorRT LLM by default uses non-deterministic kernels for better performance. Could you try setting the environment variable FORCE_DETERMINISTIC=1? Please be aware that deterministic mode may result in worse performance, and may not be implemented yet for some operations.
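A minimal sketch of one way to set the flag, assuming it needs to be visible to TensorRT-LLM before the library initializes (exporting `FORCE_DETERMINISTIC=1` in the shell before launching the script should be equivalent):

```python
# Sketch: set the flag before importing tensorrt_llm so the runtime sees it.
import os
os.environ["FORCE_DETERMINISTIC"] = "1"

from tensorrt_llm import LLM, SamplingParams  # import after setting the flag
```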
Thanks for the answer. Now I understand where the "randomness" comes from.
However, FORCE_DETERMINISTIC=1 doesn't work. I'd really appreciate it if you have other solutions.
@hlu1 Could you answer if there's support for deterministic output for GPT-OSS models?