problem when BATCH_SIZE > 1
Closed this issue · 1 comment
I am trying to make it work with batch size > 1.
The version I am working with is YOLOv5 3.1, with 23 classes at 640x640.
The device is a Jetson Nano.
Tensorrtx tests
Since 3.1 has changed a little, I took the current version of tensorrtx, without hswish.
- tensorrtx was built with 23 classes
- when tested with batch size 1, tensorrtx works fine
- I rebuilt tensorrtx with batch size 8 and regenerated the engine file with max batch size = 8
- tested with tensorrtx inference (`yolov5 -d ../samples`) - the results are OK (... 23 classes, batch size 8, 640x640)

So on the Jetson Nano with JETSON_CUDA=10.2.89 it works fine.
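For reference, the rebuild steps above look roughly like this, assuming the standard tensorrtx yolov5 workflow of that time (`BATCH_SIZE` is a macro in `yolov5.cpp`; paths and flags may differ in your checkout):

```shell
# Set the batch size before building (macro in yolov5.cpp):
#   #define BATCH_SIZE 8
cd tensorrtx/yolov5
mkdir -p build && cd build
cmake .. && make

sudo ./yolov5 -s             # serialize: regenerate the .engine with max batch size 8
sudo ./yolov5 -d ../samples  # deserialize and run inference on the sample images
```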
DeepStream 5.0 nvdsinfer_custom_impl_Yolo
on Jetson Nano with JETSON_CUDA=10.2.89
I configured yolov5s for DeepStream 5.0.
- when tested with the batch size 1 engine, it works (but occasionally the "boxes explode") - it looks like memory is not cleared between cycles (but that is just my assumption)
- when tested with the max batch size = 8 engine, even with batch size set to 1, the "boxes explode" very often. Tracking has no effect on this behavior - it was turned off in the test.
- with the batch size = 1 engine there is also strange behavior with 2 parallel streams in DeepStream: the first stream is correct, the second one shows the "exploded boxes" behavior. So I think that problem is related to this one too.
I was able to fix it.
The problem is in tensorrtx's yololayer.cu (the current version as of 16.11.2020, which works with the YOLOv5 3.1 version).
The important thing is to keep the stream context and do the memset asynchronously, line 222:

```cpp
CUDA_CHECK(cudaMemsetAsync(output + idx*outputElem, 0, sizeof(float), stream));
```

and line 233 (again, the stream context must be kept):

```cpp
CalDetection<<< (yolo.width*yolo.height*batchSize + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream>>> (inputs[i], output, numElem, mYoloV5NetWidth, mYoloV5NetHeight, mMaxOutObject, yolo.width, yolo.height, (float *)mAnchor[i], mClassCount, outputElem);
```

The result works like a charm now (batch size 2 in this example).
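For context, here is a sketch of how the two fixed lines sit inside the plugin's enqueue path. Identifiers (`mYoloKernel`, `mMaxOutObject`, `CalDetection`, etc.) follow the tensorrtx `yololayer.cu` of late 2020; treat this as an illustration of the pattern, not the exact file:

```cuda
// Sketch of YoloLayerPlugin::forwardGpu after the fix (simplified).
void YoloLayerPlugin::forwardGpu(const float* const* inputs, float* output,
                                 cudaStream_t stream, int batchSize) {
    int outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float);

    // Zero the detection counter of every batch item ON THE SAME STREAM that
    // the kernels below run on. A plain cudaMemset here is issued on the
    // default stream and can race with in-flight work on the inference
    // stream, which is what produced the "exploded boxes".
    for (int idx = 0; idx < batchSize; ++idx) {
        CUDA_CHECK(cudaMemsetAsync(output + idx * outputElem, 0,
                                   sizeof(float), stream));
    }

    int numElem = 0;
    for (unsigned int i = 0; i < mYoloKernel.size(); ++i) {
        const auto& yolo = mYoloKernel[i];
        numElem = yolo.width * yolo.height * batchSize;
        // Launch on the caller's stream as well, so the memset -> kernel
        // ordering is guaranteed within that stream.
        CalDetection<<<(numElem + mThreadCount - 1) / mThreadCount,
                       mThreadCount, 0, stream>>>(
            inputs[i], output, numElem, mYoloV5NetWidth, mYoloV5NetHeight,
            mMaxOutObject, yolo.width, yolo.height, (float*)mAnchor[i],
            mClassCount, outputElem);
    }
}
```

With both operations queued on the `stream` TensorRT passes into `enqueue`, DeepStream can run several streams/batches concurrently without one context clobbering another's output buffer.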