INT8 cannot be used (runMode=2)
Closed this issue · 11 comments
Hi, was INT8 quantization simply not implemented? Looking at void Trt::BuildEngine() in Trt.cpp, the runMode=2 case is never handled, so it seems the engine is built as FP32 by default.
The INT8 mode is enabled via a separate API, since it needs extra calibration data.
As for openpose and yolov3, you need to prepare your calibration data and modify some code.
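For context, a TensorRT INT8 calibrator also caches its calibration table so later engine builds can skip the slow calibration pass. Below is a minimal sketch of just that cache read/write half; the function and file names are illustrative, and a real calibrator (e.g. one implementing nvinfer1::IInt8EntropyCalibrator2) must additionally implement getBatch() to feed preprocessed calibration images:

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Load a previously written calibration table, if any, so TensorRT
// can reuse it instead of re-running calibration.
std::vector<char> readCalibrationCache(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

// Persist the table TensorRT produced after the calibration pass.
void writeCalibrationCache(const std::string& path,
                           const void* data, size_t length) {
    std::ofstream out(path, std::ios::binary);
    out.write(static_cast<const char*>(data), length);
}
```

This is only the persistence plumbing; the quality of the resulting INT8 engine is determined by the calibration images you feed through getBatch().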
Thanks! I ran your program successfully on my PC, but I hit a problem when running it on a Jetson AGX Xavier.
Environment:
CUDA 10.0
cudnn 7.6.3
TensorRT 6.0.1
The following error appears when running the program:
...
Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
Detected 1 inputs and 3 output network tensors.
[2020-12-11 22:49:08.172] [info] serialize engine to ../model/sample.engine
[2020-12-11 22:49:08.172] [info] save engine to ../model/sample.engine...
[2020-12-11 22:49:18.960] [info] create execute context and malloc device memory...
[2020-12-11 22:49:18.961] [info] init engine...
[2020-12-11 22:49:18.988] [info] malloc device memory
nbBingdings: 2
[2020-12-11 22:49:18.988] [info] input:
[2020-12-11 22:49:18.988] [info] binding bindIndex: 0, name: image, size in byte: 602112
[2020-12-11 22:49:18.988] [info] binding dims with 3 dimemsion
3 x 224 x 224
[2020-12-11 22:49:19.533] [info] output:
[2020-12-11 22:49:19.533] [info] binding bindIndex: 1, name: net_output, size in byte: 244608
[2020-12-11 22:49:19.533] [info] binding dims with 3 dimemsion
78 x 28 x 28
=====>malloc extra memory for openpose...
heatmap Dims3
heatmap size: 1 78 28 28
allocate heatmap host and divice memory done
resize map size: 1 78 112 112
kernel size: 1 78 112 112
allocate kernel host and device memory done
peaks size: 1 25 128 3
allocate peaks host and device memory done
=====> malloc extra memory done
CUDA error 48 at /media/nvidia/3365-3435/xzy/tensorrt-zoo/tiny-tensorrt/plugin/PReLUPlugin/PReLUPlugin.cu:188
Building the engine seems to go fine, but the error occurs at inference time.
I searched for this error but couldn't find a fix; have you run into this before?
Could something be wrong with my system environment? Do I need to re-flash the device or reinstall CUDA?
Could you paste your cmake log and full running log?
https://news.ycombinator.com/item?id=18389589 take a look at this and make sure you run cmake with the correct device SM version (CUDA error 48 is cudaErrorNoKernelImageForDevice, i.e. no kernel binary was compiled for your GPU's architecture). Please check the cmake log; it will give you a hint to choose the appropriate extra config.
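For reference, the AGX Xavier GPU is compute capability 7.2, so the build must include sm_72 in its architecture list. The variable name `GPU_ARCHS` below is an assumption for illustration; use whichever arch option the project's cmake log actually reports:

```shell
# AGX Xavier = compute capability 7.2 (sm_72).
# NOTE: -DGPU_ARCHS is a hypothetical option name; check the
# project's CMakeLists.txt / cmake log for the real variable.
cmake .. -DGPU_ARCHS=72
make -j"$(nproc)"
```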
Thank you very much!!!
When I set sm_version=72, it works!
Have you tried FP16?
When I run OpenPose on a single image, inference takes 14.1 ms in FP32 and 13.4 ms in FP16. The speedup seems marginal; is that normal?
Yes, it depends on your device. Some devices don't have native FP16 support.
https://www.jianshu.com/p/a5b057688097
Thank you very much for the project and for the patient answers. Over the last two days I tried INT8 quantization: with a 224 input, INT8 inference is 2.8x faster than FP32, which is a significant improvement.
Thanks for trying tiny-tensorrt :)