grimoire/mmdetection-to-tensorrt

C++ inference error due to opt_shape_param

vedrusss opened this issue · 9 comments

Hi @grimoire,

Thanks for your cool work! Could you please take a look at a problem I've faced using a converted model in C++?

To be able to run inference in C++ I have to specify the same opt_shape_param for min, max and opt, like below (during model conversion):

opt_shape_param=[
        [
            [1,3,800,1344], 
            [1,3,800,1344],
            [1,3,800,1344]
        ]
      ]

Otherwise engine->getBindingDimensions() in C++ returns a completely wrong set of input tensor dimensions when I try to use the serialized engine file.

But then I noticed that this approach works incorrectly in C++: it never produces detections (the returned num_detections is always zero, and the other output tensors are all zeros). I checked how it works in Python and found that it raises the following error:

[TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::1046, condition: profileMinDims.d[i] <= dimensions.d[i]

I guessed that meant I have to specify different min/max/opt sets in opt_shape_param. I tried the following, and the converted .pth model worked well in Python:

opt_shape_param=[
        [
            [1,3,320,320], 
            [1,3,800,1344],
            [1,3,1344,1344]
        ]
      ]

But the serialized .engine file in that case didn't work in C++. It looks like the .engine contains the wrong input tensor dims (as I wrote at the beginning).

Is there any solution for dynamic dims in C++? Or a way to use non-dynamic dims?

I'm working inside a Docker container built from the Dockerfile provided in the project.

Thanks!

Hi
engine->getBindingDimensions() might have problems with dynamic shapes. Please use context->getBindingDimensions() instead, and use enqueueV2 to run the inference.
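Something like this, as a minimal sketch (it assumes engine, context, buffers and a CUDA stream already exist; the names are only illustrative):

    // With dynamic shapes, query dimensions from the execution context,
    // not from the engine.
    for (int i = 0; i < engine->getNbBindings(); ++i) {
        nvinfer1::Dims dims = context->getBindingDimensions(i);
        // ... compute buffer sizes from dims ...
    }
    // enqueueV2 is the execute call for explicit-batch / dynamic-shape engines.
    context->enqueueV2(buffers.data(), stream, nullptr);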

Hi @grimoire, thanks for your quick response!
It still doesn't work for me.

Note, I was incorrect in my first message. When I use the same min/max/opt values in opt_shape_param, it doesn't work in Python inference (it raises an error), but works incorrectly in C++: it doesn't throw an error, but returns 0 detections (for any image).

I did as you wrote: replaced engine->getBindingDimensions() with context->getBindingDimensions() and used enqueueV2() instead of enqueue() for inference. With this, the model converted with different min/max/opt values (that model works fine in Python) fails in C++ with the following error:

Allocated 12 bytes in GPU for input
Parameter check failed at: engine.cpp::resolveSlots::1228, condition: allInputDimensionsSpecified(routine)
Segmentation fault (core dumped)

For clarity, I'm providing the main part of the code below:

    //  Load model data via stringstream to get model size
    std::stringstream gieModelStream;
    gieModelStream.seekg(0, gieModelStream.beg);
    std::ifstream model_file( model_path );
    //  read the model data to the stringstream
    gieModelStream << model_file.rdbuf();
    model_file.close();
    //  find model size
    gieModelStream.seekg(0, std::ios::end);
    const int modelSize = gieModelStream.tellg();
    gieModelStream.seekg(0, std::ios::beg);
    //  load the model to the buffer
    void* modelData = malloc(modelSize);
    if( !modelData ) {
        std::cout << "failed to allocate " << modelSize << " bytes to deserialize model" << std::endl;
        return 1;
    }
    gieModelStream.read((char*)modelData, modelSize);
    //  create TRT engine in cuda device from model data
    initLibAmirstanInferPlugins();
    TRTUniquePtr< nvinfer1::IRuntime >    runtime{nvinfer1::createInferRuntime(gLogger)};
    TRTUniquePtr< nvinfer1::ICudaEngine > engine{runtime->deserializeCudaEngine(modelData, modelSize, nullptr)};
    free(modelData);

    TRTUniquePtr< nvinfer1::IExecutionContext > context{engine->createExecutionContext()};

    // get input/output sizes to know how much memory to allocate
    std::vector< nvinfer1::Dims > input_dims; // we expect only one input
    std::vector< nvinfer1::Dims > output_dims; // and one output
    std::vector< void* > buffers(engine->getNbBindings()); // buffers for input and output data
    for (size_t i = 0; i < engine->getNbBindings(); ++i) {
        auto binding_size = getSizeByDim(context->getBindingDimensions(i)) * batch_size * sizeof(precision_t);
        cudaMalloc(&buffers[i], binding_size);
        if (engine->bindingIsInput(i)) {
            input_dims.emplace_back(context->getBindingDimensions(i));
            std::cout << "Allocated " << binding_size << " bytes in GPU for input" << std::endl;
        }
        else {
            output_dims.emplace_back(context->getBindingDimensions(i));
            //output_dims.emplace_back(engine->getBindingDimensions(i));
            std::cout << "Allocated " << binding_size << " bytes in GPU for output" << std::endl;
        }
    }
    if (input_dims.empty() || output_dims.empty()) {
        std::cerr << "Expect at least one input and one output for network" << std::endl;
        return -1;
    }

    float total = 0;
    unsigned miss = 5;
    for (unsigned i=0; i<filepaths.size(); i++) {
        const auto t_start = std::chrono::high_resolution_clock::now();
        // preprocess input data
        if (!processInput(filepaths[i], (precision_t*)buffers[0], input_dims[0])) {
            std::cerr << "Error while pre-processing image and moving it to GPU device" << std::endl;
            return -1;
        }
        // run inference
        context->enqueueV2(buffers.data(), 0, nullptr);
        //context->enqueue(batch_size, buffers.data(), 0, nullptr);
        // extract results
        std::vector< std::vector<precision_t> > cpu_output;       
        if (!getOutputs((precision_t *) buffers[1], output_dims, batch_size, cpu_output)) {
            std::cerr << "Error while extracting results" << std::endl;
            return -1;
        }
        const auto t_end = std::chrono::high_resolution_clock::now();
        const float ms = std::chrono::duration<float, std::milli>(t_end - t_start).count();
        std::cout << "image prepare + inference took: " << ms << " ms" << std::endl;
        if (i >= miss) total += ms;
        printOutput(cpu_output);
    }
    // release cuda memory
    for (void* buf : buffers) cudaFree(buf);

Hi
The Python error log is caused by a shape mismatch; read this.
Have you set the input dimensions with context->setBindingDimensions()? As the shape is dynamic, you have to provide the shape of the tensor you feed to the engine; TensorRT needs this information to pre-allocate the workspace and output memory. The input shape you pass to setBindingDimensions() should be within the range between the min and max shapes of opt_shape_param.
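Something like this, as a rough sketch (it assumes binding index 0 is the image input; the shape must lie within the min/max range of opt_shape_param):

    // Tell TensorRT the actual input shape before sizing buffers and
    // before calling enqueueV2().
    context->setBindingDimensions(0, nvinfer1::Dims4(1, 3, 800, 1344));
    if (!context->allInputDimensionsSpecified()) {
        std::cerr << "not all input dimensions were specified" << std::endl;
        return -1;
    }
    // Only after this do context->getBindingDimensions() calls return
    // concrete shapes you can use to size the output buffers.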

Ahh, I thought the input shape was stored inside the engine and didn't have to be set in the C++ inference app.

Now I've done it. I just added context->setBindingDimensions(0, nvinfer1::Dims4(1,3,544,960)); right after

TRTUniquePtr< nvinfer1::ICudaEngine > engine{runtime->deserializeCudaEngine(modelData, modelSize, nullptr)};
TRTUniquePtr< nvinfer1::IExecutionContext > context{engine->createExecutionContext()};

Now it runs without failing. According to the logs the input shape is set correctly (it can be retrieved with context->getBindingDimensions()) and an adequate amount of memory is allocated.

But I'm still getting zero detections (using Python and the same converted model, I get two detections on the same image).

BTW, I've tried setting the 3 input shapes corresponding to the min, max and opt shapes used during model conversion. Result: zero.

How could I debug this, @grimoire?

Please check the preprocessing code. It should generate the same tensor as the Python preprocessing.
You can also convert only part of the model by returning an intermediate result in forward() of two_stage.py or one_stage.py, and check whether that result matches Python.

Finally I got it working. There were several errors both in the input data preparation and in the extraction and interpretation of data from the output layers. After fixing them I obtained detection results close to those from the Python inference_detector.
And I did that in three precision modes: FP32, FP16 and INT8.
Thank you, @grimoire, your answers really helped me.
If you want, I could submit my C++ inference sample as a PR to your project.

Congratulations!
PR is welcome. It is a good idea to add a C++ example.

Hi @grimoire!
I noticed that the inference time of your detector is at least 3 times lower than what I get with my C++ sample, using the same converted model (DCNv2). The difference is this: with your detector I used a model converted with min/max/opt input shapes specified (320x320 / 800x1344 and 544x960 respectively), and I guess your detector sets the input layer shape according to the test image shape, which gives better performance.
In the C++ sample I just fixed the input layer shape once (544x960) and used it for all images (which have completely different shapes). This is probably not the optimal way to run inference.
Could you point me to how you do the image preparation in your detector? I mean, which size from the provided min/max/opt range do you choose, and how do you set it as the network input layer shape?
Thanks

I just reuse the mmdetection preprocessing pipeline: resize keeping the aspect ratio + normalize + pad to a multiple of 32. Read this for details.
If you want to optimize the preprocessing, try NPP; all of the preprocessing can be performed on the GPU.
Pinned memory and CUDA streams would also help.
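Something like this, as a CPU-only OpenCV sketch (the mean/std values below are the common mmdetection defaults and the function is only an illustration; take the real values and target size from your model config):

    #include <opencv2/opencv.hpp>
    #include <algorithm>
    #include <vector>

    // Resize keeping the aspect ratio, convert to RGB, normalize and pad
    // height/width to a multiple of 32. padH/padW are what you pass to
    // context->setBindingDimensions(0, nvinfer1::Dims4(1, 3, padH, padW)).
    std::vector<float> preprocess(const cv::Mat& bgr, int maxH, int maxW,
                                  int& padH, int& padW) {
        // Resize so the image fits into maxH x maxW without changing the ratio.
        const double scale = std::min(static_cast<double>(maxH) / bgr.rows,
                                      static_cast<double>(maxW) / bgr.cols);
        cv::Mat resized;
        cv::resize(bgr, resized, cv::Size(), scale, scale);

        // BGR -> RGB, float.
        cv::Mat rgb;
        cv::cvtColor(resized, rgb, cv::COLOR_BGR2RGB);
        rgb.convertTo(rgb, CV_32FC3);

        // Pad height/width up to the next multiple of 32.
        padH = (resized.rows + 31) / 32 * 32;
        padW = (resized.cols + 31) / 32 * 32;

        // Normalize with mean/std and write into a CHW buffer;
        // the padded border stays zero.
        const float mean[3] = {123.675f, 116.28f, 103.53f};
        const float stdv[3] = {58.395f, 57.12f, 57.375f};
        std::vector<float> chw(3 * padH * padW, 0.f);
        for (int c = 0; c < 3; ++c)
            for (int y = 0; y < rgb.rows; ++y)
                for (int x = 0; x < rgb.cols; ++x) {
                    const float v = rgb.at<cv::Vec3f>(y, x)[c];
                    chw[c * padH * padW + y * padW + x] = (v - mean[c]) / stdv[c];
                }
        return chw;
    }

Since padH/padW change from image to image, setBindingDimensions() has to be called again with the new shape before each enqueueV2().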