Improved fastervit_any_res_0 has larger TensorRT latency than the original version
QingXIA233 opened this issue · 9 comments
Hello, I used your effective fastervit_any_res_0 with input resolution (576, 960) as the backbone of my occupancy prediction model last month. I exported the model to ONNX and then to TensorRT, and everything worked well. Then I noticed that you improved the TensorRT throughput of the models, so I switched to the new fastervit_any_res_0 (576, 960) from your updated script https://github.com/NVlabs/FasterViT/blob/main/fastervit/models/faster_vit_any_res.py. However, the TensorRT latency increased instead of decreasing as I expected.
input shape: (3, 3, 576, 960)
GPU: NVIDIA GeForce RTX 3090 Ti
Original faster_vit_any_res_0 :
Improved faster_vit_any_res_0 :
As shown above, the throughput decreases from 90 to 64 qps, and the mean latency increases from 13.28 ms to 17.92 ms. I wonder why this result is the opposite of what you wrote in the News. Please help. Thanks.
Hi @QingXIA233, thanks for bringing this to our attention. We would appreciate it if you could answer the following questions so we can reproduce this issue:
- commands to export the ONNX (would be great if they can be uploaded somewhere for further debugging)
- What TensorRT version is being used in the perf evaluation run?
- The trtexec command used to generate the summary in the snapshot
Thanks again
Also adding @longlee0622 for vis.
Hello, for the three questions above:
- I put the scripts for exporting the ONNX and the corresponding ONNX models here: https://drive.google.com/drive/folders/1hfFm4OWWQteftOWjxMJ2m8zVZxZT7BsZ?usp=sharing
- The TensorRT version I used is 8.2.3.0
- The trtexec cmd is:
trtexec --onnx=debug_fastervit.onnx --saveEngine=debug_fastervit.trt --fp16 --workspace=10240 --verbose
FYI, regarding the model names: FasterViT denotes the original version and FasterViT_Better is the improved one. Thank you so much for the help.
I also measured the GPU latency for both models (still on the 3090 Ti), which is more precise:
the original model:
Hi @QingXIA233,
Thanks for all the details. TensorRT 8.2 was released ~2 years ago. We have added many critical perf enhancements in recent versions. Is it feasible for you to try the latest public build 8.6.1?
The model scripts you shared contain several other model changes besides the commit to improve TensorRT performance, so it is not an apples-to-apples performance comparison.
My general suggestion on performance tuning would be to upgrade TensorRT to the latest version, export the ONNX with a more recent opset (e.g. 13 or 17), and do the necessary ONNX model pruning to simplify it further (you can refer to https://github.com/NVlabs/FasterViT/blob/main/onnx_convert.py as an example, or prune it further with tools like onnx-graphsurgeon and onnx-simplifier).
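For illustration, a minimal export-and-simplify sketch (assuming the backbone is built via the fastervit package's create_model entry point; the constructor arguments and file names here are placeholders, so adapt them to however you build the model from your modified script):

import torch
import onnx
from onnxsim import simplify
from fastervit import create_model  # assumes the fastervit pip package

# Build the any-res backbone at the target resolution (args are an assumption; adjust as needed).
model = create_model('faster_vit_0_any_res', resolution=[576, 960]).eval()
dummy = torch.randn(3, 3, 576, 960)  # matches the input shape reported above

# Export with a more recent opset.
torch.onnx.export(model, dummy, "fastervit_any_res_0.onnx",
                  opset_version=17,
                  input_names=["input"], output_names=["output"])

# Simplify the exported graph before building the TensorRT engine.
model_sim, ok = simplify(onnx.load("fastervit_any_res_0.onnx"))
assert ok, "onnx-simplifier could not validate the simplified graph"
onnx.save(model_sim, "fastervit_any_res_0_sim.onnx")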
I tried your debug_fastervit.onnx on a 3070 (less powerful than your 3090 Ti) and observed much better performance than what you reported. My run command and trtexec log look like:
[07/18/2023-04:46:39] [I] Throughput: 116.383 qps
[07/18/2023-04:46:39] [I] Latency: min = 13.9324 ms, max = 14.4149 ms, mean = 13.9911 ms, median = 13.9911 ms, percentile(90%) = 14.0156 ms, percentile(95%) = 14.0217 ms, percentile(99%) = 14.0457 ms
[07/18/2023-04:46:39] [I] Enqueue Time: min = 0.0892334 ms, max = 0.243408 ms, mean = 0.176315 ms, median = 0.177284 ms, percentile(90%) = 0.186401 ms, percentile(95%) = 0.189697 ms, percentile(99%) = 0.203033 ms
[07/18/2023-04:46:39] [I] H2D Latency: min = 1.62805 ms, max = 1.66052 ms, mean = 1.63837 ms, median = 1.63692 ms, percentile(90%) = 1.64575 ms, percentile(95%) = 1.64746 ms, percentile(99%) = 1.6521 ms
[07/18/2023-04:46:39] [I] GPU Compute Time: min = 8.50946 ms, max = 8.98662 ms, mean = 8.56596 ms, median = 8.5647 ms, percentile(90%) = 8.58813 ms, percentile(95%) = 8.59131 ms, percentile(99%) = 8.61792 ms
[07/18/2023-04:46:39] [I] D2H Latency: min = 3.7771 ms, max = 3.797 ms, mean = 3.78674 ms, median = 3.78467 ms, percentile(90%) = 3.79419 ms, percentile(95%) = 3.79517 ms, percentile(99%) = 3.79614 ms
[07/18/2023-04:46:39] [I] Total Host Walltime: 3.0245 s
[07/18/2023-04:46:39] [I] Total GPU Compute Time: 3.01522 s
[07/18/2023-04:46:39] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/18/2023-04:46:39] [V]
[07/18/2023-04:46:39] [V] === Explanations of the performance metrics ===
[07/18/2023-04:46:39] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[07/18/2023-04:46:39] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[07/18/2023-04:46:39] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[07/18/2023-04:46:39] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[07/18/2023-04:46:39] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[07/18/2023-04:46:39] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[07/18/2023-04:46:39] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[07/18/2023-04:46:39] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[07/18/2023-04:46:39] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # TensorRT-8.6.1.6/bin/trtexec --onnx=debug_fastervit.onnx --saveEngine=debug_fastervit.trt --fp16 --workspace=10240 --verbose --separateProfileRun --useCudaGraph
Hope it helps.
Hello @longlee0622, thank you for the efforts. However, I still have some follow-up questions about what you mentioned:
- About the changes I made in the scripts: for both the original and the improved version, I copied the scripts from this repo, and the changes I made to the two scripts are the same:
  - I changed torch.nn.functional.pad to concatenating a padding tensor, because the TensorRT version I use doesn't support this operation (see the sketch at the end of this comment).
  - I didn't use the classification head, because I only use FasterViT as my feature extraction module.
  I kept all other settings the same except the model structure and config (window size). I wonder why, in this case, the results are not as expected; I really didn't make many changes to the original scripts.
- About the TensorRT version: sorry, I can't upgrade it right now. We work as a group and everything is already settled, so any change to the settings requires my colleagues' unanimous agreement. So, if one is using an older version of TensorRT, is the perf improvement no longer valid?
- I see the result of debug_fastervit.onnx, which is much better than what I got. But could you also show the result of debug_fastervit_better.onnx, please? I wonder how much better it is than the original version.
Thank you again for the help.
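For reference, the padding change I mentioned above looks roughly like the following (a sketch only; the assumed (B, C, H, W) layout and the helper name are mine, not from the repo):

import torch

def pad_hw_with_concat(x: torch.Tensor, pad_h: int, pad_w: int) -> torch.Tensor:
    # Equivalent of torch.nn.functional.pad(x, (0, pad_w, 0, pad_h)) for an
    # assumed (B, C, H, W) tensor, built from explicit zero concatenation so
    # the exported graph avoids the Pad op that my TensorRT version rejects.
    if pad_w > 0:
        zeros_w = torch.zeros(x.shape[0], x.shape[1], x.shape[2], pad_w,
                              dtype=x.dtype, device=x.device)
        x = torch.cat([x, zeros_w], dim=3)  # pad the width on the right
    if pad_h > 0:
        zeros_h = torch.zeros(x.shape[0], x.shape[1], pad_h, x.shape[3],
                              dtype=x.dtype, device=x.device)
        x = torch.cat([x, zeros_h], dim=2)  # pad the height at the bottom
    return x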
IIRC, we are mixing up two separate differences in this discussion: 1) the difference between the two models, and 2) the TensorRT version difference.
Re 1:
The relevant change for TensorRT performance improvement is f5361f9. The models you provided contain more differences beyond that, so it is hard to tell whether the "better" model is functionally equivalent and more computationally efficient. That's why I said it is not an apples-to-apples comparison.
Re 2:
It is sad to hear that you have to stay with a much older TensorRT release. By the time TRT 8.2 was released, vision transformers were not a major target workload for performance optimization. We have updated the readme to make it clear that the latest TensorRT is recommended for performance tuning.
Re 3:
I did test the "better" ONNX, and it is indeed slightly slower than the baseline model. Because of 1), I didn't investigate the performance further.
I echo @longlee0622's comments on using the latest TRT release. The team is actively making progress with each release, so it would be sub-optimal not to take advantage of it.
Hello @longlee0622, thanks for the enlightening tips. I used onnx-graphsurgeon and onnxsim to simplify the improved FasterViT, and it achieved great results:
The throughput actually increases to 167 qps! However, for the original FasterViT (debug_fastervit.onnx), I performed the same operations, but it failed:
I don't know why this happened, but I'll let it go for now, because the performance of the improved FasterViT is now totally acceptable. Later, I will start using TensorRT 8.6 and try to persuade my colleagues to do the same. Thank you for your help and good advice. @longlee0622 @ahatamiz
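For reference, the simplification pass I ran was roughly the following (input file name from the shared Drive folder; the output names are placeholders):

import onnx
import onnx_graphsurgeon as gs
from onnxsim import simplify

# Clean up and topologically sort the graph with onnx-graphsurgeon.
graph = gs.import_onnx(onnx.load("debug_fastervit_better.onnx"))
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "debug_fastervit_better_gs.onnx")

# Then run onnx-simplifier on the cleaned graph before building the TRT engine.
model_sim, ok = simplify(onnx.load("debug_fastervit_better_gs.onnx"))
assert ok, "onnx-simplifier could not validate the simplified graph"
onnx.save(model_sim, "debug_fastervit_better_sim.onnx")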
Thanks @QingXIA233. I'll close this issue for now, but feel free to reopen it or raise a new issue to continue the discussion.