snuspl/nimble

Questions about the compatible version of torchvision


jp7c5 commented

Hello. Thanks for sharing this project.

I was able to install Nimble by following the installation guide; the resulting torch version is "1.4.0a0+61ec0ca".
To use torch together with torchvision, I installed torchvision with the following command (the CUDA 10.2 build):

    pip install torchvision==0.5.0 -f https://download.pytorch.org/whl/cu102/torch_stable.html

Since this reinstalls a different version of PyTorch, I removed that PyTorch and rebuilt Nimble.
I'm not sure whether this procedure is correct, but in any case I can now import both torch==1.4.0a0+61ec0ca and torchvision==0.5.0.
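
In case it helps anyone else, this is how I checked which builds are actually picked up (the expected values are just the ones from my setup):

    import torch
    import torchvision

    print(torch.__version__)        # expected: 1.4.0a0+61ec0ca
    print(torchvision.__version__)  # expected: 0.5.0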

However, I'm now hitting an error that seems to be related to torchvision. For example,

    import torch
    torch.ops.torchvision.nms

raises a runtime error:

    RuntimeError: No such operator torchvision::nms

Since the example code in the README uses torchvision, could you let me know how to install a torchvision build that is compatible with Nimble?

When we build PyTorch from source, we should also build torchvision from source, for exactly the reason you mentioned: pip-installing torchvision reinstalls a different version of PyTorch.

You should:

  1. clone the torchvision repo
  2. check out the v0.5.0 tag (torchvision v0.5.0 is the latest version compatible with PyTorch v1.4.1)
  3. run python setup.py install
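
Concretely, that amounts to the following (run inside the environment where Nimble's PyTorch is installed):

    git clone https://github.com/pytorch/vision.git
    cd vision
    git checkout v0.5.0
    python setup.py install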

Note that running torchvision's NMS operation with Nimble will be a problem.
Nimble is built for optimized GPU task scheduling, so a PyTorch module passed to Nimble must perform all of its computation on the GPU.
torchvision's NMS implementation does not satisfy this constraint, as it performs part of its logic on the CPU.
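
To make the constraint concrete, here is a toy module (just an illustration, not Nimble's actual check) showing the kind of CPU round-trip that breaks the GPU-only assumption:

    import torch
    import torch.nn as nn

    class MixedDevice(nn.Module):
        # The .cpu() round-trip below forces a device synchronization and
        # runs the selection logic on the CPU -- the same pattern that makes
        # torchvision's NMS unsuitable for Nimble.
        def forward(self, x):
            keep = (x.sum(dim=1) > 0).cpu().nonzero().squeeze(1)
            return x[keep.to(x.device)]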

You can try these two options.

  1. Carve out the "GPU-only", "static" part(s) of your PyTorch module, apply Nimble to those parts separately, and wire the resulting Nimble modules together with the rest of your PyTorch module (see the sketch after this list).
  2. Adopt a GPU-only NMS implementation. TensorRT's batchedNMS and NMS plugins could be a good choice.
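
For option 1, a minimal sketch of the wiring, assuming the torch.cuda.Nimble wrapper and prepare() call shown in the README; MyDetectorBackbone is a hypothetical stand-in for the GPU-only, static part of your detector:

    import torch
    from torchvision.ops import nms

    # Hypothetical GPU-only, "static" part of a detector. It must take and
    # return plain CUDA tensors -- here, boxes and scores.
    backbone = MyDetectorBackbone().cuda().eval()

    dummy_input = torch.randn(1, 3, 800, 800).cuda()

    # Apply Nimble only to the GPU-only part.
    nimble_backbone = torch.cuda.Nimble(backbone)
    nimble_backbone.prepare(dummy_input, training=False)

    def detect(image):
        boxes, scores = nimble_backbone(image)        # scheduled by Nimble
        keep = nms(boxes, scores, iou_threshold=0.5)  # runs outside Nimble
        return boxes[keep], scores[keep]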
jp7c5 commented

Thanks for the quick reply.

By following your suggestion, I built torchvision from source and, to my surprise, the error related to nms no longer shows up.
But I'm still hitting the following error:

    AttributeError: module 'torch.distributed' has no attribute 'init_process_group'

I saw #1, so is this expected given the current status?

Without the distributed setting, the default training code runs smoothly.
While applying Nimble to this single-GPU setup, I noticed that the model to be wrapped by Nimble must follow a strict input and output format (mostly consisting of torch Tensors).
I don't know if this is a must, but if not, relaxing this condition would make Nimble easier to use :)
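
For example, a model whose forward returns a dict currently seems to need a thin adapter so that Nimble only sees tensors; this is just a hypothetical sketch of what I mean:

    import torch.nn as nn

    class TensorOnly(nn.Module):
        # Adapts a model whose forward returns {"boxes": ..., "scores": ...}
        # into one that returns a plain tuple of tensors, which fits the
        # input/output format Nimble expects.
        def __init__(self, model):
            super().__init__()
            self.model = model

        def forward(self, x):
            out = self.model(x)
            return out["boxes"], out["scores"]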