problem when BATCH_SIZE > 1
Closed this issue · 1 comment
I am trying to make it work with batch size > 1.
The version I am working with is YOLOv5 3.1, with 23 classes at 640x640.
The device is a Jetson Nano.
Tensorrtx tests
Since 3.1 has changed a little, I took the current version of tensorrtx, without hswish.
- tensorrtx was built with 23 classes
- when tested with batch size 1, tensorrtx works fine
- I rebuilt tensorrtx with batch size 8 and regenerated the engine file with max batch size = 8
- tested with tensorrtx inference (`yolov5 -d ../samples`) - the results are OK (... 23 classes, batch size 8, 640x640)

So on the Jetson Nano with JETSON_CUDA=10.2.89 it works fine.
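For reference, the rebuild steps above look roughly like this, assuming the standard tensorrtx yolov5 workflow of that time (`BATCH_SIZE` is a macro in `yolov5.cpp`; paths and flags may differ in your checkout):

```shell
# Set the batch size before building (macro in yolov5.cpp):
#   #define BATCH_SIZE 8
cd tensorrtx/yolov5
mkdir -p build && cd build
cmake .. && make

sudo ./yolov5 -s             # serialize: regenerate the .engine with max batch size 8
sudo ./yolov5 -d ../samples  # deserialize and run inference on the sample images
```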
DeepStream 5.0 nvdsinfer_custom_impl_Yolo
on Jetson Nano with JETSON_CUDA=10.2.89
I configured yolov5s for DeepStream 5.0.
- when tested with the batch size 1 engine, it works (but occasionally the "boxes explode") - it looks like memory is not cleared between cycles (but that is just my assumption)
- when tested with the max batch size = 8 engine, even with batch size set to 1, the "boxes explode" very often. Tracking has no effect on this behavior - it was turned off in the test.
- with the batch size = 1 engine there is also strange behavior with 2 parallel streams in DeepStream: the first stream is correct, the second one shows the "exploded boxes" behavior. So I think that problem is related to this one too.
I was able to fix it.
The problem is in tensorrtx's yololayer.cu (the current version as of 16.11.2020, which works with the YOLOv5 3.1 version).
The important thing is to keep the stream context and do the memset asynchronously, line 222:

```cpp
CUDA_CHECK(cudaMemsetAsync(output + idx*outputElem, 0, sizeof(float), stream));
```

and line 233 (again, the stream context must be kept):

```cpp
CalDetection<<< (yolo.width*yolo.height*batchSize + mThreadCount - 1) / mThreadCount, mThreadCount, 0, stream>>> (inputs[i], output, numElem, mYoloV5NetWidth, mYoloV5NetHeight, mMaxOutObject, yolo.width, yolo.height, (float *)mAnchor[i], mClassCount, outputElem);
```

The result works like a charm now (batch size 2 in this example).
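For context, here is a sketch of how the two fixed lines sit inside the plugin's enqueue path. Identifiers (`mYoloKernel`, `mMaxOutObject`, `CalDetection`, etc.) follow the tensorrtx `yololayer.cu` of late 2020; treat this as an illustration of the pattern, not the exact file:

```cuda
// Sketch of YoloLayerPlugin::forwardGpu after the fix (simplified).
void YoloLayerPlugin::forwardGpu(const float* const* inputs, float* output,
                                 cudaStream_t stream, int batchSize) {
    int outputElem = 1 + mMaxOutObject * sizeof(Detection) / sizeof(float);

    // Zero the detection counter of every batch item ON THE SAME STREAM that
    // the kernels below run on. A plain cudaMemset here is issued on the
    // default stream and can race with in-flight work on the inference
    // stream, which is what produced the "exploded boxes".
    for (int idx = 0; idx < batchSize; ++idx) {
        CUDA_CHECK(cudaMemsetAsync(output + idx * outputElem, 0,
                                   sizeof(float), stream));
    }

    int numElem = 0;
    for (unsigned int i = 0; i < mYoloKernel.size(); ++i) {
        const auto& yolo = mYoloKernel[i];
        numElem = yolo.width * yolo.height * batchSize;
        // Launch on the caller's stream as well, so the memset -> kernel
        // ordering is guaranteed within that stream.
        CalDetection<<<(numElem + mThreadCount - 1) / mThreadCount,
                       mThreadCount, 0, stream>>>(
            inputs[i], output, numElem, mYoloV5NetWidth, mYoloV5NetHeight,
            mMaxOutObject, yolo.width, yolo.height, (float*)mAnchor[i],
            mClassCount, outputElem);
    }
}
```

With both operations queued on the `stream` TensorRT passes into `enqueue`, DeepStream can run several streams/batches concurrently without one context clobbering another's output buffer.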