NVIDIA/TensorRT-LLM

[Usage]: How to set the values of tp_size and pp_size when there are 1 server and 1 Jetson?

Closed this issue · 4 comments

System Info

System Information:

  • Driver version: 12.8
  • TensorRT-LLM version: 0.21.0

TensorRT-LLM v0.21.0

  1. There are two devices for 70B-LLM inference: one server and one Jetson. Most layers of the model would be deployed on the server and the remaining layers on the Jetson. How should I set pp_size and tp_size when running convert_checkpoint.py? I think pp_size should be 2 for the two devices, and tp_size should be 8 for the server since it has 8 GPUs. But with pp_size=2 and tp_size=8 I get 16 checkpoint files, 8 for the server and 8 for the Jetson, which is wrong for the Jetson (see the sketch after this list).
  2. Once I have the proper number of checkpoint files, how should I start inference across the two devices?
    What are the steps to do that?
    Thanks.
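
For context on question 1, here is a minimal sketch (plain Python, no TensorRT-LLM imports) of the arithmetic behind the 16 checkpoint files: the world size is tp_size × pp_size, the tensor-parallel degree is the same for every pipeline stage, and each rank gets one checkpoint shard. The stage/rank layout below illustrates the usual convention and is not taken from the library source.

```python
# Illustration only: why tp_size=8, pp_size=2 yields 16 checkpoint shards.
# Assumes the common convention that ranks are grouped by pipeline stage;
# the exact rank ordering inside TensorRT-LLM may differ.

tp_size = 8   # tensor-parallel degree (uniform across ALL pipeline stages)
pp_size = 2   # pipeline-parallel degree (number of pipeline stages)

world_size = tp_size * pp_size          # total ranks = total checkpoint shards
print(f"checkpoint shards produced: {world_size}")  # -> 16

# Each pipeline stage needs tp_size GPUs, so a 1-GPU Jetson cannot host a
# stage when tp_size=8; the parallel layout cannot be uneven per stage.
for pp_rank in range(pp_size):
    ranks = [pp_rank * tp_size + tp_rank for tp_rank in range(tp_size)]
    print(f"pipeline stage {pp_rank}: needs {tp_size} GPUs, ranks {ranks}")
```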

@mamba824824 , Apologies for the very delayed response.
Unfortunately, your proposed heterogeneous setup (server with 8 GPUs + Jetson with 1 GPU) is not supported by TensorRT-LLM.
While TensorRT-LLM does have limited experimental support for Jetson AGX Orin (introduced in November 2024 via the v0.12.0-jetson branch), this support is designed for standalone Jetson deployments only, not for distributed multi-node inference mixing datacenter GPUs with edge devices.
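The supported path for a 70B model in this situation is to keep inference entirely on the 8-GPU server with uniform tensor parallelism. Below is a minimal, hedged sketch using the tensorrt_llm LLM API; the model id and sampling settings are placeholders, and argument names should be checked against the docs for your installed 0.21.0 release.

```python
# Hedged sketch: run the model on the 8-GPU server only (no Jetson involved),
# using the TensorRT-LLM LLM API. Model path and sampling values are
# placeholders; verify argument names against your installed version's docs.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder HF id or local path
    tensor_parallel_size=8,                     # all 8 server GPUs, uniform TP
    # pipeline_parallel_size defaults to 1; PP only helps across homogeneous nodes
)

outputs = llm.generate(
    ["What is pipeline parallelism?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```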

@nv-guomingz , please correct me if I'm wrong 😄

Is it possible to perform heterogeneous inference with uneven pipelines across multiple Jetson nodes? Specifically, I have two Jetson AGX Orins and one Jetson Orin Nano. Can I perform inference with uneven pipelines on them?

Also, are there any tutorials or guides available? Thank you.

Issue has not received an update in over 14 days. Adding stale label.

This issue was closed because it had been 14 days without activity since it was marked as stale.