[Usage]: How to run inference efficiently on a single node with 8x H20 GPUs
System Info
System Information:
- OS: Ubuntu 24.04.2 LTS
- Python version: 3.10
- CUDA version: cuda_12.9.r12.9/compiler.35813241_0
- GPU model(s): NVIDIA H20, 8 GPUs, single node
- Driver version: 550.54.14
- TensorRT-LLM version: 0.21.0
nvidia-smi output:
Fri Sep 26 15:56:15 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H20 On | 00000000:0F:00.0 Off | 0 |
| N/A 46C P0 127W / 500W | 496MiB / 97871MiB | 28% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H20 On | 00000000:34:00.0 Off | 0 |
| N/A 38C P0 121W / 500W | 499MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H20 On | 00000000:48:00.0 Off | 0 |
| N/A 48C P0 125W / 500W | 57849MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H20 On | 00000000:5A:00.0 Off | 0 |
| N/A 39C P0 121W / 500W | 499MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H20 On | 00000000:87:00.0 Off | 0 |
| N/A 47C P0 128W / 500W | 499MiB / 97871MiB | 66% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H20 On | 00000000:AE:00.0 Off | 0 |
| N/A 39C P0 121W / 500W | 499MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H20 On | 00000000:C2:00.0 Off | 0 |
| N/A 46C P0 116W / 500W | 499MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H20 On | 00000000:D7:00.0 Off | 0 |
| N/A 38C P0 121W / 500W | 499MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
How would you like to use TensorRT-LLM
I want to run inference of https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507. I don't know how to integrate it with TensorRT-LLM or optimize it for my use case.
Specific questions:
- Model: Qwen3-235B-A22B-Instruct-2507.
- Use case (e.g., chatbot, batch inference, real-time serving): batch inference
- Expected throughput/latency requirements: better throughput than vLLM (right now TensorRT-LLM is much slower than vLLM for me)
- Multi-GPU setup needed: yes, 1 machine with 8 GPUs
vLLM:
vllm serve $checkpoint --served-model-name $model \
    --data-parallel-size 1 \
    --tensor-parallel-size 8 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.9 \
    --host 0.0.0.0 \
    --port 4300 \
    --trust-remote-code \
    --pipeline-parallel-size 1 \
    --seed 0 \
    --enable-prefix-caching \
    --enable-expert-parallel
TensorRT-LLM:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
model=Qwen3-235B-A22B-Instruct-2507
checkpoint=/nlp_group/decapoda-research/Qwen3-235B-A22B-Instruct-2507
tokenizer=$checkpoint
trtllm-serve $checkpoint \
    --host localhost \
    --port 4300 \
    --backend pytorch \
    --max_batch_size 128 \
    --max_seq_len 32768 \
    --tp_size 8 \
    --ep_size 8 \
    --pp_size 1 \
    --kv_cache_free_gpu_memory_fraction 0.8
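Since my workload is offline batch inference, I'm also unsure whether trtllm-serve is the right path or whether I should drive the model directly through the Python LLM API. The sketch below is what I have in mind; I'm not certain the keyword argument names (especially moe_expert_parallel_size, and the SamplingParams fields) are exactly right for 0.21, so please treat them as assumptions and correct me if the docs say otherwise:

# Rough sketch of offline batch inference with the TensorRT-LLM LLM API
# (PyTorch backend). Kwarg names below are assumptions; verify against the
# documentation for the installed TensorRT-LLM version.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="/nlp_group/decapoda-research/Qwen3-235B-A22B-Instruct-2507",
    tensor_parallel_size=8,       # TP=8, same as the trtllm-serve command above
    moe_expert_parallel_size=8,   # EP=8 for the MoE layers (assumed kwarg name)
)

prompts = [f"Summarize document {i}" for i in range(256)]
sampling = SamplingParams(max_tokens=512, temperature=0.0)

# Submit the whole batch at once and let the in-flight batcher schedule it,
# rather than issuing requests one by one.
for out in llm.generate(prompts, sampling):
    print(out.outputs[0].text)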
I send requests with the OpenAI client, fanned out through concurrent.futures.ProcessPoolExecutor. (If that's not the optimal choice, I'm happy to switch to another approach.)
I'm not sure how to optimize the inference with TensorRT-LLM.
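For reference, the client side is roughly the sketch below (simplified here to a ThreadPoolExecutor since the requests are I/O-bound; the endpoint and port match the trtllm-serve command above, and the model name is assumed, it should be whatever id the server actually registers for the checkpoint):

# Fan out chat-completion requests against the OpenAI-compatible endpoint
# started by trtllm-serve / vllm serve on port 4300.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4300/v1", api_key="dummy")

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="Qwen3-235B-A22B-Instruct-2507",  # adjust to the model id the server registers
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.0,
    )
    return resp.choices[0].message.content

prompts = [f"Summarize document {i}" for i in range(256)]

# Requests are I/O-bound, so a thread pool keeps enough of them in flight
# for the server's in-flight batching without process-pool pickling overhead.
with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(ask, prompts))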
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.