Detection duplicates with fp16 on Jetson Nano (TensorRT v8.2.1.8)
IoannisKaragiannis opened this issue · 3 comments
Hey there Linamo1214,
First of all, great job with the trt repo. I have one question, though. Here is how I proceeded with the conversion.
On my laptop, running Ubuntu 22.04 without any NVIDIA GPU, I created a virtual environment with Python 3.10, installed all the essential packages for the yolov7 repo, and performed the .pt to .onnx conversion like this:
(yolov7)$ python3.10 export.py --weights my_models/yolov7-tiny.pt --grid --simplify --topk-all 200 --iou-thres 0.5 --conf-thres 0.4 --img-size 416 416
I deliberately did not set the --end2end flag here, so that I could apply it later, directly in the trt conversion.
Then I moved to my Jetson Nano. In my own small project I confirmed that the yolov7-tiny-416.onnx model from the above conversion works fine, with an average inference time of 99.5 ms. I then downloaded your repo on the Jetson Nano, created a dedicated virtual environment with Python 3.6 (to be compatible with tensorrt, which was also built against Python 3.6), and symbolically linked the natively built TensorRT like this:
(trt)$ ln -s /usr/lib/python3.6/dist-packages/tensorrt/ my_venvs/trt/lib/python3.6/site-packages/tensorrt
and then I proceeded with the .onnx to .trt conversion like this:
(trt)$ python3.6 export.py -o my_models/yolov7-tiny-416.onnx -e my_models/yolov7-tiny-416-fp16.trt -w 2 --iou_thres 0.5 --conf_thres 0.4 --end2end -p fp16 --max_det 200
The reason I set the maximum workspace size to 2GB was because I was getting the following error:
Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output
and the reason I decided to use the -w flag in the first place is that I was getting the following error:
File "export.py", line 308, in <module>
main(args)
File "export.py", line 266, in main
builder = EngineBuilder(args.verbose, args.workspace)
File "export.py", line 109, in __init__
self.config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace * (2 ** 30))
AttributeError: 'tensorrt.tensorrt.IBuilderConfig' object has no attribute 'set_memory_pool_limit'
So, basically, to overcome this, I had to make the following change in your export.py. I guess this was needed because of the old TensorRT version on the Jetson Nano.
# original
self.builder = trt.Builder(self.trt_logger)
self.config = self.builder.create_builder_config()
self.config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace * (2 ** 30))
# self.config.max_workspace_size = workspace * (2 ** 30) # Deprecation
# updated
self.builder = trt.Builder(self.trt_logger)
self.config = self.builder.create_builder_config()
# self.config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace * (2 ** 30))
self.config.max_workspace_size = workspace * (2 ** 30)  # deprecated API, but the only one available on TensorRT v8.2.1.8
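If it helps anyone else hitting the same AttributeError, the change can also be made version-agnostic so the same export.py runs both on the laptop and on the Nano. This is just a sketch of the workspace handling, assuming trt, self.builder and workspace are the same objects as in export.py:
self.builder = trt.Builder(self.trt_logger)
self.config = self.builder.create_builder_config()
if hasattr(self.config, "set_memory_pool_limit"):
    # newer TensorRT exposes the memory-pool API
    self.config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace * (2 ** 30))
else:
    # older TensorRT (like v8.2.1.8 on the Nano) only has the deprecated attribute
    self.config.max_workspace_size = workspace * (2 ** 30)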
Then, based on your trt.py, I load the trt model in my application on the Jetson Nano. It loads successfully and the inference time drops from 99.5 ms to 61 ms, but I ran into two issues:
- the confidence scores are negative. As mentioned in a related thread, I bypassed this by adding 1 to them.
- I get duplicate detections, as if no NMS is applied. This is where I'm hitting a wall and need your help. I thought the --end2end flag would take care of applying the NMS, but it doesn't. Is this again because of the old TensorRT v8.2.1.8 implementation? Should I perhaps skip the --end2end flag entirely and let your inference function inside trt.py do the post-processing instead? What do you recommend?
Thanks in advance for your response! cheers
Actually, I observed something peculiar. I tried these two different combinations:
(yolov7)$ python3.10 export.py --weights my_models/yolov7-tiny.pt --grid --simplify --topk-all 200 --iou-thres 0.1 --conf-thres 0.4 --img-size 416 416
(trt)$ python3.6 export.py -o my_models/yolov7-tiny-416.onnx -e my_models/yolov7-tiny-416-fp16.trt -w 2 --iou_thres 0.1 --conf_thres 0.4 --end2end -p fp16 --max_det 200
and
(yolov7)$ python3.10 export.py --weights my_models/yolov7-tiny.pt --grid --simplify --topk-all 200 --iou-thres 0.7 --conf-thres 0.4 --img-size 416 416
(trt)$ python3.6 export.py -o my_models/yolov7-tiny-416.onnx -e my_models/yolov7-tiny-416-fp16.trt -w 2 --iou_thres 0.7 --conf_thres 0.4 --end2end -p fp16 --max_det 200
expecting that the first combination, with the small iou_thres, would result in a more permissive model that allows multiple detections of the same object, while the second combination would be more conservative and only let the most dominant detection survive. To my surprise, the two approaches showed absolutely no difference, as if the iou_thres flag does not affect the conversion at all.
Any idea why this is happening? Has anyone experienced something similar before?
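If I understand the yolov7 export script correctly, the --iou-thres given in the .pt to .onnx step is only consumed when --end2end is set there, so in my workflow it should have no effect at all; only the onnx-to-trt step should bake the threshold into the NMS plugin. A small sketch to double-check that the ONNX file itself carries no NMS node (the path is just the one from my commands above):
import onnx
model = onnx.load("my_models/yolov7-tiny-416.onnx")
nms_nodes = [n for n in model.graph.node if "NMS" in n.op_type.upper()]
print("NMS nodes in the graph:", [n.op_type for n in nms_nodes])
for node in nms_nodes:
    for attr in node.attribute:
        # prints e.g. iou_threshold / score_threshold if an NMS plugin was baked in at this stage
        print(node.op_type, attr.name, onnx.helper.get_attribute_value(attr))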
Ok, last update. I tried skipping the --end2end flag in both conversions (pt to onnx, and onnx to trt) and set the flag to False when I call inference, like this:
classIds, confidences, bboxs = self.inference(img, ratio, end2end=False)
Then everything works smoothly, since the post-processing takes care of the NMS, but the inference time increases; not dramatically, but it matters on a platform like the Nano. It's still faster than its onnx counterpart, but I think this approach is sub-optimal. Is there something special about fp16 that forces me down this path?
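For context, the extra cost in this mode is essentially a host-side NMS pass over the raw predictions. Roughly something like the sketch below (not your exact post-processing, just the idea; it assumes boxes as an (N, 4) array of [x, y, w, h] and scores as an (N,) array), and this CPU-side work is what the Nano pays for compared to the in-engine NMS:
import numpy as np
import cv2
def host_side_nms(boxes, scores, conf_thres=0.4, iou_thres=0.5):
    # suppress overlapping candidates on the CPU, as the non-end2end path has to do
    idxs = cv2.dnn.NMSBoxes(boxes.tolist(), scores.tolist(), conf_thres, iou_thres)
    idxs = np.array(idxs).flatten() if len(idxs) else np.array([], dtype=int)
    return boxes[idxs], scores[idxs]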
Thanks in advance for your support and I hope this issue will help someone in the future. Cheers
Looking forward to your reply!
@IoannisKaragiannis May I ask how you installed cuda-python on the Jetson Nano?