NVIDIA/TensorRT-LLM

[Usage]: How to set the values of tp_size and pp_size when there are 1 server and 1 Jetson?

Closed this issue · 4 comments

System Info

System Information:

  • Driver version: 12.8
  • TensorRT-LLM version: 0.21.0

TensorRT-LLM v0.21.0

  1. There are two devices for 70B-LLM inference: one server and one Jetson. Most layers of the model would be deployed on the server and the remaining layers on the Jetson. How should I set pp_size and tp_size when running convert_checkpoint.py? I think pp_size should be 2 for the two devices, and tp_size should be 8 for the server since it has 8 GPUs. But with pp_size=2 and tp_size=8 I get 16 checkpoint files, 8 for the server and 8 for the Jetson, which is wrong for the Jetson (see the sketch after this list).
  2. Once I have the proper number of checkpoint files, how should I start inference across the two devices?
    What are the steps to do that?
    Thanks.
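
For context on question 1, here is a minimal sketch (plain Python, no TensorRT-LLM imports) of the arithmetic behind the 16 checkpoint files: the world size is tp_size × pp_size, the tensor-parallel degree is the same for every pipeline stage, and each rank gets one checkpoint shard. The stage/rank layout below illustrates the usual convention and is not taken from the library source.

```python
# Illustration only: why tp_size=8, pp_size=2 yields 16 checkpoint shards.
# Assumes the common convention that ranks are grouped by pipeline stage;
# the exact rank ordering inside TensorRT-LLM may differ.

tp_size = 8   # tensor-parallel degree (uniform across ALL pipeline stages)
pp_size = 2   # pipeline-parallel degree (number of pipeline stages)

world_size = tp_size * pp_size          # total ranks = total checkpoint shards
print(f"checkpoint shards produced: {world_size}")  # -> 16

# Each pipeline stage needs tp_size GPUs, so a 1-GPU Jetson cannot host a
# stage when tp_size=8; the parallel layout cannot be uneven per stage.
for pp_rank in range(pp_size):
    ranks = [pp_rank * tp_size + tp_rank for tp_rank in range(tp_size)]
    print(f"pipeline stage {pp_rank}: needs {tp_size} GPUs, ranks {ranks}")
```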

@mamba824824 , Apologies for the very delayed response.
Unfortunately, your proposed heterogeneous setup (server with 8 GPUs + Jetson with 1 GPU) is not supported by TensorRT-LLM.
While TensorRT-LLM does have limited experimental support for Jetson AGX Orin (introduced in November 2024 via the v0.12.0-jetson branch), this support is designed for standalone Jetson deployments only, not for distributed multi-node inference mixing datacenter GPUs with edge devices.
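The supported path for a 70B model in this situation is to keep inference entirely on the 8-GPU server with uniform tensor parallelism. Below is a minimal, hedged sketch using the tensorrt_llm LLM API; the model id and sampling settings are placeholders, and argument names should be checked against the docs for your installed 0.21.0 release.

```python
# Hedged sketch: run the model on the 8-GPU server only (no Jetson involved),
# using the TensorRT-LLM LLM API. Model path and sampling values are
# placeholders; verify argument names against your installed version's docs.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder HF id or local path
    tensor_parallel_size=8,                     # all 8 server GPUs, uniform TP
    # pipeline_parallel_size defaults to 1; PP only helps across homogeneous nodes
)

outputs = llm.generate(
    ["What is pipeline parallelism?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```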

@nv-guomingz , please correct me if I'm wrong 😄

Is it possible to perform heterogeneous inference with uneven pipelines across multiple Jetson nodes? Specifically, I have two Jetson AGX Orins and one Jetson Orin Nano. Can I perform inference with uneven pipelines on them?

Also, are there any tutorials or guides available? Thank you.

Issue has not received an update in over 14 days. Adding stale label.

This issue was closed because it had been 14 days without activity since it was marked as stale.