wangkuiyi/gotorch

Can gotorch load PyTorch models?

Opened this issue · 11 comments

Hi, this is an awesome project! I have one question: can gotorch load a PyTorch model? I want to use gotorch only for prediction.

+1

I see that the code loads models saved with the gob package. So we need a separation layer that allows loading pre-trained models in different formats. From the mnist.go example file I see that gotorch provides all the necessary low-level functionality, like Tensor, but unfortunately it does not provide a layer for reading Python model files. I think that to implement this we need to know the Python data format, so that we can load the pth model files and cast their contents to the appropriate PyTorch objects, like Tensor.

I'm mostly interested in the inference part, which would allow developing an as-a-Service (aaS) offering using pre-trained PyTorch/gotorch models. I already developed a similar one for TensorFlow (see tfaas), and I'm interested in integrating PyTorch and gotorch models via gotorch if this package provides a layer to load PyTorch models.
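For reference, here is a minimal sketch of what saving and loading named weights with Go's standard `encoding/gob` package might look like. The `Tensor` struct, the `SaveState`/`LoadState` helpers, and the parameter name `fc1.weight` are hypothetical stand-ins, not gotorch's actual types or API. Note that a pth file written by `torch.save` is pickle-based, so a decoder like this cannot read it; a separate reader or a conversion step would still be needed.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Tensor is a hypothetical stand-in for a real tensor type:
// a shape plus the flattened values.
type Tensor struct {
	Shape []int64
	Data  []float32
}

// SaveState encodes a map of named tensors into gob bytes.
func SaveState(state map[string]Tensor) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(state); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// LoadState decodes gob bytes back into a map of named tensors.
func LoadState(b []byte) (map[string]Tensor, error) {
	state := map[string]Tensor{}
	if err := gob.NewDecoder(bytes.NewReader(b)).Decode(&state); err != nil {
		return nil, err
	}
	return state, nil
}

func main() {
	in := map[string]Tensor{
		"fc1.weight": {Shape: []int64{2, 2}, Data: []float32{1, 2, 3, 4}},
	}
	b, err := SaveState(in)
	if err != nil {
		panic(err)
	}
	out, err := LoadState(b)
	if err != nil {
		panic(err)
	}
	fmt.Println(out["fc1.weight"].Shape, out["fc1.weight"].Data)
}
```

The point of the sketch is that the on-disk format is tied to Go's own encoder, which is why a format-agnostic loading layer would have to sit above it.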

@vkuznet Have you found a workaround for loading pth models in Go?

@modanesh, not directly, but you may convert your model from one format to another through ONNX.

@vkuznet Thanks. Can I use ONNX to convert a pth model to the gob format? Or do you mean to the ONNX format?

@modanesh, I have only used ONNX to convert PyTorch to the TensorFlow format, so you should try it out and see whether it can convert to the gob one. In any case, feel free to open a ticket with ONNX for that, and you may open a ticket here about using the native ONNX format too.

@vkuznet I did open a ticket in the ONNX repo and am waiting for feedback.
Before trying to load a pth file in Go, I tried to load an ONNX model using onnx-go with Gorgonia as the backend. However, my model contains an operator, PReLU, that isn't defined in onnx-go, so I couldn't actually run it. I opened this issue and am still waiting for a reply (which is OK given that these repos are not very active).

I am glad you are interested in gotorch. Here are some thoughts for your reference.

I began the project as an experiment. I stopped maintaining it because cgo has terrible performance, and the Go team made it clear that cgo will remain a feature but they won't try to make it faster. I learned these things when I worked on goplus.

Probably a more important reason is that I couldn't find an easy way to make sure that Go's GC releases GPU memory in time. This is more deadly for training than for inference, but I'm afraid that the GC might also slow down online inference irregularly (http://big-elephants.com/2018-09/unexpected-gc-pauses/).
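To make the GC concern concrete: Go's collector paces itself on the Go heap, so a small Go struct that wraps a large off-heap allocation gives it little reason to run soon. A common workaround is to release the memory explicitly and keep a finalizer only as a safety net. The sketch below simulates this with a byte counter; `DeviceBuffer`, `NewDeviceBuffer`, and `Allocated` are hypothetical names, and no real GPU API is called.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// allocated simulates the bytes currently held on a device.
// A real binding would call into the C/CUDA allocator instead.
var allocated int64

// DeviceBuffer is a hypothetical handle to off-heap (for example GPU) memory.
type DeviceBuffer struct {
	size   int64
	closed int32
}

// NewDeviceBuffer "allocates" size bytes and registers a finalizer
// that frees them if the caller forgets to call Close.
func NewDeviceBuffer(size int64) *DeviceBuffer {
	atomic.AddInt64(&allocated, size)
	b := &DeviceBuffer{size: size}
	runtime.SetFinalizer(b, (*DeviceBuffer).release)
	return b
}

// release frees the simulated memory at most once.
func (b *DeviceBuffer) release() {
	if atomic.CompareAndSwapInt32(&b.closed, 0, 1) {
		atomic.AddInt64(&allocated, -b.size)
	}
}

// Close frees the memory immediately and drops the finalizer.
// It is safe to call more than once.
func (b *DeviceBuffer) Close() {
	b.release()
	runtime.SetFinalizer(b, nil)
}

// Allocated reports the simulated bytes currently held.
func Allocated() int64 { return atomic.LoadInt64(&allocated) }

func main() {
	b := NewDeviceBuffer(1 << 20)
	fmt.Println("held:", Allocated())
	b.Close()
	fmt.Println("held after Close:", Allocated())
}
```

With this pattern the caller, not the GC, decides when memory goes away, at the cost of having to call `Close` (typically via `defer`) on every intermediate tensor.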

Back to your question about loading models. I think we will need a higher layer than gotorch, something like Google's Pax on top of JAX, which defines models, modules, parameters, buffers, etc. on top of the low-level concept of a tensor. It would be easier to implement parameter loading in that higher layer.
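As a rough illustration of what such a higher layer could look like, the sketch below defines a module tree with named parameters and loads values into it by dotted name. Everything here (`Tensor`, `Module`, `NamedParameters`, `LoadState`) is a hypothetical design for discussion, not an existing gotorch API.

```go
package main

import "fmt"

// Tensor is a hypothetical minimal tensor holding only flat data.
type Tensor struct{ Data []float32 }

// Module is a hypothetical container of parameters and named sub-modules.
type Module struct {
	Params   map[string]*Tensor
	Children map[string]*Module
}

// NamedParameters flattens the module tree into out, using dotted
// names such as "fc1.weight".
func (m *Module) NamedParameters(prefix string, out map[string]*Tensor) {
	for name, p := range m.Params {
		out[prefix+name] = p
	}
	for name, c := range m.Children {
		c.NamedParameters(prefix+name+".", out)
	}
}

// LoadState copies values into the parameters by name. It fails if a
// parameter is missing from state or its element count does not match.
func (m *Module) LoadState(state map[string][]float32) error {
	params := map[string]*Tensor{}
	m.NamedParameters("", params)
	for name, p := range params {
		src, ok := state[name]
		if !ok {
			return fmt.Errorf("missing parameter %q", name)
		}
		if len(src) != len(p.Data) {
			return fmt.Errorf("size mismatch for %q: got %d, want %d",
				name, len(src), len(p.Data))
		}
		copy(p.Data, src)
	}
	return nil
}

func main() {
	fc := &Module{Params: map[string]*Tensor{"weight": {Data: make([]float32, 2)}}}
	net := &Module{Children: map[string]*Module{"fc1": fc}}
	err := net.LoadState(map[string][]float32{"fc1.weight": {0.5, 1.5}})
	if err != nil {
		panic(err)
	}
	fmt.Println(fc.Params["weight"].Data)
}
```

Once parameters have stable names like this, the file format becomes a separate concern: any reader that can produce a name-to-values map could feed `LoadState`.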

@wangkuiyi Thanks for the explanation, very much appreciated.

Asking for advice (deviating a bit from the issue's topic): what's a good approach to using a deep model in production? Initially, I wanted to avoid running it in Python since it's inherently slow. So I came across Go and wanted to run my model in Go, but as you suggested, that doesn't look like a good way to go. The last remaining option would be TensorRT, AFAIK. What do you think?

I haven't yet seen a single approach to serving that works for all cases. In my role with PyTorch Distributed and TorchRec, I saw the following solutions in use. Let us cross-validate our observations.

  1. TFX serving: If the model is trained with TensorFlow and fits on a single GPU, we could export a SavedModel and use TFX serving (https://www.tensorflow.org/tfx/guide/serving). After exporting the SavedModel, we could use TensorRT to fuse operators and generate an optimized SavedModel.

  2. TorchServe: We could try TorchServe if the model was trained with PyTorch and would fit on a single GPU. But there are a lot of complaints about TorchServe in the user feedback.

  3. FasterTransformer + Triton: We might want to use more than one GPU on the same host for Transformer models, even if they fit on a single GPU. To do this, we need to rewrite the model in C++ by calling FasterTransformer, which implements tensor parallelism. Then, we could plug the C++ code into NVIDIA's Triton server. Triton can create multiple processes or threads, each of which controls one of the GPUs on a host and works together with the others.

  4. Foundation Models: In recent years, foundation models like BERT have been used as the basis for most ranking models. The foundation models can be served by FasterTransformer and Triton. The ranking models can be taken care of by TF Serving.

  5. Distributed Embedding Tables: Large embedding tables are used in a lot of recommendation and contextual ad models. I saw people building parameter servers for these tables in C++.

  6. TorchRec (https://github.com/pytorch/torchrec) is another way to serve models with large embedding tables. As far as I know, this is the only Python-based way to serve.

Thanks for the thorough explanation.

To provide some context, I currently have a server implemented in Go that sends WebRTC streams. I want to capture this stream and perform some CV processing on it. I'm able to do so with PyTorch, but that's a bit slow, resulting in ~10 FPS (I want it to be real-time). To improve processing speed, I have considered a few options:
1- Do all the processing in Go (no Python at all), given that it's a faster language than Python. For that, there are two approaches. The first is to convert the pth model to ONNX and load it in Go, but that comes with some complexities such as missing operators (oramasearch/onnx-go#203). The second is to load the pth model directly in Go (hence this issue).
2- Use TensorRT and reimplement all the Python processing in C++.

As my model is based on ResNet, I do not think that FasterTransformer and Distributed Embedding Tables would be very helpful. And given that the model fits on a single 1080 Ti GPU, I believe TorchServe could be helpful. I'm going to give it a try and see how it goes (thanks for your pointers). If it does not improve processing speed, I will consider TensorRT as my next option.
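For anyone trying the same route, here is a minimal sketch of how a Go server could hand a frame to TorchServe over its HTTP inference API (`POST /predictions/{model_name}`). The base URL, the model name `resnet`, the content type, and the 2-second timeout are assumptions for illustration; the response format depends on the handler the model was packaged with, so the sketch returns the raw body.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// client uses an assumed 2-second timeout so a stalled request
// does not block the stream pipeline indefinitely.
var client = &http.Client{Timeout: 2 * time.Second}

// PredictURL builds the inference endpoint for a registered model.
func PredictURL(baseURL, model string) string {
	return baseURL + "/predictions/" + url.PathEscape(model)
}

// Predict posts one encoded frame (for example JPEG bytes) and returns
// the raw response body. Decoding is left to the caller because the
// response format is defined by the model's handler.
func Predict(baseURL, model string, frame []byte) ([]byte, error) {
	endpoint := PredictURL(baseURL, model)
	resp, err := client.Post(endpoint, "application/octet-stream", bytes.NewReader(frame))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("inference failed: %s: %s", resp.Status, body)
	}
	return body, nil
}

func main() {
	// Assumed values: a local server on port 8080 and a model
	// registered under the name "resnet".
	out, err := Predict("http://localhost:8080", "resnet", []byte("encoded frame bytes"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

This keeps the WebRTC side in Go and moves only the model call out of process; whether the extra HTTP hop fits a real-time budget would need to be measured.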