[Usage]: How to run inference efficiently on a single node with 8x H20 GPUs
System Info
System Information:
- OS: Ubuntu 24.04.2 LTS
- Python version: 3.10
- CUDA version: cuda_12.9.r12.9/compiler.35813241_0
- GPU model(s): NVIDIA H20, 8 GPUs, single node
- Driver version: 550.54.14
- TensorRT-LLM version: 0.21.0
nvidia-smi output:
Fri Sep 26 15:56:15 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H20 On | 00000000:0F:00.0 Off | 0 |
| N/A 46C P0 127W / 500W | 496MiB / 97871MiB | 28% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H20 On | 00000000:34:00.0 Off | 0 |
| N/A 38C P0 121W / 500W | 499MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H20 On | 00000000:48:00.0 Off | 0 |
| N/A 48C P0 125W / 500W | 57849MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H20 On | 00000000:5A:00.0 Off | 0 |
| N/A 39C P0 121W / 500W | 499MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H20 On | 00000000:87:00.0 Off | 0 |
| N/A 47C P0 128W / 500W | 499MiB / 97871MiB | 66% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H20 On | 00000000:AE:00.0 Off | 0 |
| N/A 39C P0 121W / 500W | 499MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H20 On | 00000000:C2:00.0 Off | 0 |
| N/A 46C P0 116W / 500W | 499MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H20 On | 00000000:D7:00.0 Off | 0 |
| N/A 38C P0 121W / 500W | 499MiB / 97871MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
How would you like to use TensorRT-LLM
I want to run inference of https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507. I don't know how to integrate it with TensorRT-LLM or optimize it for my use case.
Specific questions:
- Model: Qwen3-235B-A22B-Instruct-2507.
- Use case (e.g., chatbot, batch inference, real-time serving): batch inference
- Expected throughput/latency requirements: better throughput than vLLM (right now TensorRT-LLM is much slower than vLLM for me)
- Multi-GPU setup needed: yes, 1 machine with 8 GPUs
vLLM:
vllm serve $checkpoint --served-model-name $model \
    --data-parallel-size 1 \
    --tensor-parallel-size 8 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.9 \
    --host 0.0.0.0 \
    --port 4300 \
    --trust-remote-code \
    --pipeline-parallel-size 1 \
    --seed 0 \
    --enable-prefix-caching \
    --enable-expert-parallel
TensorRT-LLM:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
model=Qwen3-235B-A22B-Instruct-2507
checkpoint=/nlp_group/decapoda-research/Qwen3-235B-A22B-Instruct-2507
tokenizer=$checkpoint
trtllm-serve $checkpoint \
    --host localhost \
    --port 4300 \
    --backend pytorch \
    --max_batch_size 128 \
    --max_seq_len 32768 \
    --tp_size 8 \
    --ep_size 8 \
    --pp_size 1 \
    --kv_cache_free_gpu_memory_fraction 0.8
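Since my workload is offline batch inference, I'm also unsure whether trtllm-serve is the right path or whether I should drive the model directly through the Python LLM API. The sketch below is what I have in mind; I'm not certain the keyword argument names (especially moe_expert_parallel_size, and the SamplingParams fields) are exactly right for 0.21, so please treat them as assumptions and correct me if the docs say otherwise:

# Rough sketch of offline batch inference with the TensorRT-LLM LLM API
# (PyTorch backend). Kwarg names below are assumptions; verify against the
# documentation for the installed TensorRT-LLM version.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="/nlp_group/decapoda-research/Qwen3-235B-A22B-Instruct-2507",
    tensor_parallel_size=8,       # TP=8, same as the trtllm-serve command above
    moe_expert_parallel_size=8,   # EP=8 for the MoE layers (assumed kwarg name)
)

prompts = [f"Summarize document {i}" for i in range(256)]
sampling = SamplingParams(max_tokens=512, temperature=0.0)

# Submit the whole batch at once and let the in-flight batcher schedule it,
# rather than issuing requests one by one.
for out in llm.generate(prompts, sampling):
    print(out.outputs[0].text)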
I send requests with the OpenAI client, fanned out through concurrent.futures.ProcessPoolExecutor. (If that's not the optimal choice, I'm happy to switch to another approach.)
I'm not sure how to optimize the inference with TensorRT-LLM.
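For reference, the client side is roughly the sketch below (simplified here to a ThreadPoolExecutor since the requests are I/O-bound; the endpoint and port match the trtllm-serve command above, and the model name is assumed, it should be whatever id the server actually registers for the checkpoint):

# Fan out chat-completion requests against the OpenAI-compatible endpoint
# started by trtllm-serve / vllm serve on port 4300.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4300/v1", api_key="dummy")

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="Qwen3-235B-A22B-Instruct-2507",  # adjust to the model id the server registers
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
        temperature=0.0,
    )
    return resp.choices[0].message.content

prompts = [f"Summarize document {i}" for i in range(256)]

# Requests are I/O-bound, so a thread pool keeps enough of them in flight
# for the server's in-flight batching without process-pool pickling overhead.
with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(ask, prompts))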
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.