eembc/mlmark

TensorRT ResNet50 Segfaults with Tesla T4

petertorelli opened this issue · 4 comments

User reports that MLMark abruptly segfaults when running the TensorRT target on an x86 system with a Tesla T4, with no other warning messages given. See below.

-INFO- --------------------------------------------------------------------------------
-INFO- Welcome to the EEMBC MLMark(tm) Benchmark!
-INFO- --------------------------------------------------------------------------------
-INFO- MLMark Version       : 1.0.0
-INFO- Python Version       : 3.7
-INFO- CPU Name             : GenuineIntel Intel(R) Xeon(R) Platinum 8176 CPU @ 2.10GHz
-INFO- Total Memory (MiB)   : 127571
-INFO- # of Logical CPUs    : 112
-INFO- Instruction Set      : x86_64
-INFO- OS Platform          : Linux-4.4.0-131-generic-x86_64-with-debian-stretch-sid
-INFO- --------------------------------------------------------------------------------
-INFO- Models in this release:
-INFO-     resnet50       : ResNet-50 v1.0 [ILSVRC2012]
-INFO-     mobilenet      : MobileNet v1.0 [ILSVRC2012]
-INFO-     ssdmobilenet   : SSD-MobileNet v1.0 [COCO2017]
-INFO- --------------------------------------------------------------------------------
-INFO- Parsing config file config/trt-gpu-resnet50-fp32-throughput.json
-INFO- Task: Target 'tensorrt', Workload 'resnet50'
-INFO-     batch                : 1
-INFO-     concurrency          : 1
-INFO-     hardware             : gpu
-INFO-     iterations           : 1024
-INFO-     mode                 : throughput
-INFO-     precision            : fp32
failed to parse uff model
Entered in engine building part
Segmentation fault (core dumped)

Recommend using TF1.13.1, TRT5.1.2, CUDA10.0, and version 410 of the driver, although issues are still being reported.

Deferred until TRT6 target is released in 1.0.x.

Appears related to these lines of code in the Net.py files for each model which import the library:

		resnetnet_lib=os.path.join(TRT_DIR,"cpp_environment","libs","libclass_resnet50.so")
		self.lib = cdll.LoadLibrary(resnetnet_lib)
		self.obj = self.lib.return_object()

Adding this line (prior to the self.lib.return_object() call):

		self.lib.return_object.restype = ctypes.c_ulonglong

Fixes the problem on the target system. The function returns a pointer, but ctypes assumes a C int return type unless restype is set, so the 64-bit pointer was being truncated. However, casting to ulonglong might introduce compatibility errors on other platforms; need to investigate a pointer type that matches the OS/arch instead.
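
As a rough illustration of the truncation described above, here is a minimal sketch. It assumes a 64-bit platform with a 32-bit C int; the example address and the helper names (including `load_handle`) are hypothetical and are not taken from the MLMark sources. It also shows `ctypes.c_void_p`, which is sized to the platform's pointer width, as one candidate for the arch-matching pointer type mentioned above.

```python
import ctypes

def as_default_restype(address):
    """Value kept when a return value is read as C int (the ctypes default restype)."""
    return ctypes.c_int(address).value

def as_pointer_restype(address):
    """Value kept when the return value is read as a pointer-sized void*."""
    return ctypes.c_void_p(address).value

# Hypothetical heap address above 4 GiB, typical on 64-bit Linux.
example_address = 0x7F3A12345678

truncated = as_default_restype(example_address)  # upper 32 bits are lost
preserved = as_pointer_restype(example_address)  # intact on a 64-bit platform

def load_handle(lib_path, factory_name="return_object"):
    """Hypothetical helper: load a shared library and fetch an opaque handle.

    Assumes the library exports a zero-argument function returning a pointer.
    """
    lib = ctypes.CDLL(lib_path)
    factory = getattr(lib, factory_name)
    factory.restype = ctypes.c_void_p  # must be set before the call
    # Re-wrap the returned integer so it stays pointer-sized if passed back to C.
    return lib, ctypes.c_void_p(factory())
```

Dereferencing the truncated value on the C++ side would be consistent with the segfault seen in the log, although that has only been confirmed on the reporter's system.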

New branch trt-restype in progress.

The latest two merges (#7 and #8) solve T4-related problems on non-Jetpack OSes.