kssteven418/I-BERT

Missing deployment part on TensorRT


❓ Questions and Help

You refer in the paper and on the Hugging Face Hub to a TensorRT deployment, but I can't find the code.
Do you plan to share it as well?

As far as I know, the NVIDIA repo only has examples for their own models (all BERT-based), so it's hard to try this on our own without an example.

We did not open-source our code for TensorRT deployment. We are planning to deploy our model using TVM, which I think is a more suitable framework for an open-source project, but I cannot promise an exact date.
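For readers wondering what that path would look like, here is a minimal sketch of the generic TVM flow for compiling an ONNX export and running it on a GPU; this is not I-BERT's actual deployment code, and the file name `ibert.onnx` and the input names/shapes are assumptions about the export:

```python
# A minimal sketch, not I-BERT's deployment code: the generic TVM flow for
# compiling an ONNX export and running it on a CUDA GPU.
import numpy as np
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

onnx_model = onnx.load("ibert.onnx")  # hypothetical export artifact

# Fix static shapes for the encoder inputs (batch 1, sequence length 128);
# the input names depend on how the model was exported.
shape_dict = {
    "input_ids": (1, 128),
    "attention_mask": (1, 128),
}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Compile for CUDA at the highest optimization level.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=tvm.target.cuda(), params=params)

dev = tvm.cuda(0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input_ids", np.zeros((1, 128), dtype="int64"))
module.set_input("attention_mask", np.ones((1, 128), dtype="int64"))
module.run()
logits = module.get_output(0).numpy()
```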

Thank you @kssteven418 for your answer.

I'm not sure about the open-source argument: most of us already rely on non-open-source components (cuDNN, NVIDIA drivers, etc.), so using an NVIDIA GPU already implies some closed-source parts. Besides, having both a TVM script and a TensorRT script would be interesting for benchmarking; strangely, there are very few independent measurements of large language models on TensorRT / TVM / ONNX Runtime (see the benchmark sketch at the end of this comment).

Finally, while looking for resources on using TensorRT with Hugging Face models, I watched this virtual conference: https://events.nvidia.com/meettheexperts5?ncid=so-twit-46587-vt04. At the very end, their engineer Mr. Boudier explained that they were not going to share their optimizations for running models on TensorRT; those are reserved for Hugging Face cloud clients (at least they are transparent about their intent). So any indication on that part would help the community make progress on the inference side.

That said, I understand your view. TVM is a great project, well run, with a community-first approach, so it makes sense to push a project that is not well known enough (IMO) in the NLP community (compared to ORT, for instance).

Anyway, if possible, I would really appreciate any guidance on running your model performantly on a GPU :-) (even if no code is provided)
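To illustrate the kind of independent measurement I have in mind, here is a minimal latency-benchmark sketch using ONNX Runtime's CUDA execution provider; the file `model.onnx` and the input names are assumptions, and a TensorRT or TVM run would follow the same warm-up/measure pattern with its own runtime:

```python
# A minimal sketch of an independent latency measurement with ONNX Runtime's
# CUDA execution provider; "model.onnx" and the input names are assumptions.
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
inputs = {
    "input_ids": np.zeros((1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
}

# Warm up so lazy initialization doesn't skew the numbers.
for _ in range(10):
    sess.run(None, inputs)

n_runs = 100
start = time.perf_counter()
for _ in range(n_runs):
    sess.run(None, inputs)
print(f"mean latency: {(time.perf_counter() - start) / n_runs * 1000:.2f} ms")
```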

@kssteven418 I've also been trying to export the model to ONNX (from PyTorch) for deployment on TRT. It seems to need a custom operator for the SymmetricQuantFunction, and possibly for the other layers too. Are you able to share your custom operators?
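For context, here is a minimal sketch of the pattern I've been attempting (not the authors' code): giving the autograd Function a `symbolic` staticmethod so that `torch.onnx.export` emits a node in a custom ONNX domain instead of failing, which a TensorRT plugin (or ORT custom op) would then have to implement. The quantization math below is the standard symmetric scheme; the exact signature of I-BERT's `SymmetricQuantFunction` may differ:

```python
# A minimal sketch, not the authors' code: a custom autograd Function with a
# `symbolic` staticmethod, so torch.onnx.export emits a custom-domain op
# instead of failing on the Python op. A matching TensorRT plugin must then
# implement the op at runtime.
import torch


class SymmetricQuantFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, k, scale):
        # Symmetric k-bit quantization: scale, round, clamp to signed range.
        n = 2 ** (k - 1) - 1
        return torch.clamp(torch.round(x / scale), -n - 1, n)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: gradients pass through unchanged.
        return grad_output, None, None

    @staticmethod
    def symbolic(g, x, k, scale):
        # Used during torch.onnx.export; "custom_domain" and the op name are
        # placeholders, and the bit width is attached as an int attribute.
        return g.op("custom_domain::SymmetricQuant", x, scale, bits_i=k)
```

Exporting then needs something like `torch.onnx.export(..., custom_opsets={"custom_domain": 1})`, and every such op has to be matched by a plugin on the TRT side, which is exactly the part I'd rather not reinvent from scratch.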

I also agree with @pommedeterresautee's point on benchmarking the differences, so it'd be fantastic if you were able to share the deployment code or the custom ONNX ops for TRT.

Thanks!