hkchengrex/CascadePSP

How to Convert code to C++

0yueyunfei0 opened this issue · 12 comments

Hi, I really like your code; it works particularly well on our model. We now need to deploy the model to TensorRT in C++. A PyTorch model like PSPNet can be converted quickly, but how can a method like process_high_res_im be written with the TensorRT (C++) API? Thanks a lot!

Hi,

I don't have a lot of experience in TensorRT so my suggestion would probably be sub-optimal...
process_high_res_im is essentially just cropping/stitching -- I think these can be done in C++ (maybe OpenCV's GPU functions are sufficient) or with custom CUDA kernels. That way, only the PSPNet part needs to be converted and optimized by TensorRT, while the rest stays in vanilla C++/CUDA.
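
If it helps, here is a minimal C++ sketch of that crop-and-stitch idea using OpenCV ROIs. The patch size, stride, overlap averaging, and the runRefiner stub are illustrative assumptions, not CascadePSP's actual process_high_res_im logic:

```cpp
#include <algorithm>
#include <opencv2/opencv.hpp>

// Stub for the TensorRT-optimized PSPNet refiner: takes a CV_32FC3 image
// patch and a CV_32FC1 coarse-mask patch, returns a refined CV_32FC1 mask
// of the same size. Replace the body with real engine inference.
static cv::Mat runRefiner(const cv::Mat& imgPatch, const cv::Mat& segPatch) {
    (void)imgPatch;
    return segPatch.clone();  // placeholder
}

// Sliding-window crop/stitch: refine overlapping patches and average them
// back into a full-resolution mask.
cv::Mat refineHighRes(const cv::Mat& image, const cv::Mat& coarseSeg,
                      int patch = 224, int stride = 112) {
    cv::Mat acc   = cv::Mat::zeros(image.size(), CV_32FC1);  // summed predictions
    cv::Mat count = cv::Mat::zeros(image.size(), CV_32FC1);  // patches per pixel

    for (int y = 0; y < image.rows; y += stride) {
        for (int x = 0; x < image.cols; x += stride) {
            // Clamp the window to the image so source and destination ROIs
            // always have identical sizes.
            int sy = std::min(y, std::max(image.rows - patch, 0));
            int sx = std::min(x, std::max(image.cols - patch, 0));
            int h  = std::min(patch, image.rows - sy);
            int w  = std::min(patch, image.cols - sx);
            cv::Rect roi(sx, sy, w, h);

            cv::Mat refined = runRefiner(image(roi), coarseSeg(roi));
            cv::Mat accRoi  = acc(roi);    // header sharing acc's data
            cv::Mat cntRoi  = count(roi);
            accRoi += refined;             // accumulate in place
            cntRoi += cv::Scalar(1);
        }
    }
    cv::Mat out;
    cv::divide(acc, count, out);  // average overlapping predictions
    return out;
}
```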

Thank you for your reply, I will try these suggestions and report back to you with progress

I've serialized the model through PyTorch -> ONNX -> TensorRT (C++), rewritten most of the cropping/stitching with OpenCV, and rewritten AdaptiveAvgPool2d to be compatible with the ONNX operators.
After two days of attempts, I have run into two tricky problems.
The first is the dynamically changing number of inputs in RefinementModule's forward, because the inter_s8=None and inter_s4=None defaults make those inputs optional. Although I solved dynamic shapes in TensorRT, I don't know how to handle a dynamic number of inputs.
The second is in combined_224[:, :, start_y:end_y, start_x:end_x] += grid_pred_224[:, :, pred_sy:pred_ey, pred_sx:pred_ex].
While debugging this line I found that pred_sy:pred_ey is larger than start_y:end_y, which means copying a larger array into a smaller one.
How can these two problems be solved in C++?
Many thanks again; I'm sorry to bother you, but this code is really important to us.

  1. If you are following the global/local procedure, then there are only two types of forward passes -- one with (img, seg) used in the global step, and the other with (img, seg, inter_s8) used in the local step. Maybe you can split our forward function into two versions, such that each one of them has a fixed number of inputs (see the sketch after this list)? If that incurs extra memory cost, note that model.feats and model.psp are called (maybe multiple times) in all kinds of forwards -- so maybe they can be shared.

  2. That should not happen. The two array sizes should always be the same, or else even the Python code would not work.
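
Concretely, that split leaves the C++ side with two engines that each have a fixed input list, plus a small dispatch that mirrors the Python inter_s8=None default. A minimal sketch, where Engine and its infer method are placeholders for whatever TensorRT wrapper you use, not CascadePSP's actual API:

```cpp
#include <vector>
#include <opencv2/opencv.hpp>

// Placeholder for a TensorRT engine wrapper: one cv::Mat blob per input
// binding in, the refined mask out. Replace the stub with real inference.
struct Engine {
    cv::Mat infer(const std::vector<cv::Mat>& inputs) {
        return inputs.back().clone();  // stub
    }
};

// Mirrors forward(img, seg, inter_s8=None): with no intermediate prediction,
// run the 2-input global engine; otherwise run the 3-input local engine.
// Each engine keeps a fixed number of inputs.
cv::Mat refine(Engine& globalEngine, Engine& localEngine,
               const cv::Mat& img, const cv::Mat& seg,
               const cv::Mat* interS8 = nullptr) {
    if (interS8 == nullptr)
        return globalEngine.infer({img, seg});
    return localEngine.infer({img, seg, *interS8});
}
```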

[screenshot]
Almost Done!
After splitting into two models, some confusing trial and error with NCHW vs. BGR between OpenCV and TensorRT, and a day of debugging with TensorRT, we finally have initial success. Thank you very much!
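
On the NCHW/BGR point: OpenCV stores images as interleaved HWC in BGR order, while the ONNX/TensorRT input expects a planar NCHW float blob. cv::dnn::blobFromImage does the conversion in one call; swapRB=true below assumes the network expects RGB channel order (typical for PyTorch-trained models), which should be checked against your export:

```cpp
#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>

// Convert an 8-bit BGR HWC cv::Mat into a 1x3xHxW CV_32F blob.
cv::Mat toNCHW(const cv::Mat& bgrImage) {
    return cv::dnn::blobFromImage(
        bgrImage,
        1.0 / 255.0,        // scale 8-bit values to [0,1]; mean/std come later
        bgrImage.size(),    // keep the original resolution
        cv::Scalar(),       // no mean subtraction here
        /*swapRB=*/true,    // BGR -> RGB
        /*crop=*/false,
        CV_32F);
}
```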

Finally, I found that the problem is that reading float pixels from an OpenCV Mat in C++ should be done with mat.at(i, j) instead of mat.data[i*width+j]; the TensorRT sample was misleading me.
The strange thing is that my previous TRT project, with the same wrong input method, still produced correct output. Maybe the input ranges of the two models are different: CascadePSP needs image values in roughly [-2,2] and the mask in [-1,1].
Anyway, I am getting good results with the image value range [0,1] and mask [-1,1].
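
For reference, a minimal sketch of the typed access, assuming a CV_32FC3 image: mat.data is a raw uchar*, so mat.data[i*width+j] ignores the float element size, the channel count, and the row stride, while the typed accessors account for all three.

```cpp
#include <opencv2/opencv.hpp>

// Copy a CV_32FC3 cv::Mat into a planar (CHW) float buffer.
void copyToCHW(const cv::Mat& img, float* dst) {
    CV_Assert(img.type() == CV_32FC3);
    const int H = img.rows, W = img.cols;
    for (int c = 0; c < 3; ++c) {
        for (int y = 0; y < H; ++y) {
            const cv::Vec3f* row = img.ptr<cv::Vec3f>(y);  // stride-aware row pointer
            for (int x = 0; x < W; ++x) {
                // Equivalent element access: img.at<cv::Vec3f>(y, x)[c]
                dst[c * H * W + y * W + x] = row[x][c];
            }
        }
    }
}
```
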
[screenshot of the first Global Step output]

I wouldn't call this blurry result good...
You can compare with our Python implementation to check if your implementation is correct.

Sorry, this is only the first Global Step output. I will finish the Local Step as soon as possible and let you know the final result.

Ah, you don't have to say sorry -- I am not your boss. Chill.
It looks blurry even for just the global step.

[screenshot of the improved Global Step result]
I'm now pretty sure that the value range produced by the preprocessing of the image and mask can drastically affect the performance of the network. I mentioned above that I got a very blurry result when the image value range was [0,1]; I just need to normalize it with mean=0.45 and std=0.225 (which is what self.im_transform does), and then I get a very nice result in the first Global Step.
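
For completeness, a small C++ sketch of that normalization, using the mean=0.45, std=0.225 values and the [-1,1] mask mapping mentioned above:

```cpp
#include <opencv2/opencv.hpp>

// Normalize a [0,1] CV_32FC3 image to roughly [-2,2]: x -> (x - 0.45) / 0.225,
// matching the per-channel values used by self.im_transform on the Python side.
cv::Mat normalizeImage(const cv::Mat& img01) {
    cv::Mat out;
    img01.convertTo(out, CV_32FC3, 1.0 / 0.225, -0.45 / 0.225);
    return out;
}

// Map a [0,1] CV_32FC1 mask to [-1,1]: x -> 2x - 1.
cv::Mat normalizeMask(const cv::Mat& mask01) {
    cv::Mat out;
    mask01.convertTo(out, CV_32FC1, 2.0, -1.0);
    return out;
}
```
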
I should finish all the modules soon, and I'm excited to see how CascadePSP performs in TRT!

After four days of working on it almost all day, I think I have managed to convert CascadePSP, the great Refiner, to TensorRT + OpenCV, all written in C++.
Thanks for your advice and your great code. Thanks again!
I may open-source my TensorRT C++ implementation in a few days, as soon as our competition is over.
[screenshot]

Hi, would you mind if I ask about the time cost after your conversion from the Python version to the TensorRT C++ version?