onnx/onnx-tensorflow

Wrong dimension from pytorch ResNet architecture

Closed this issue · 13 comments

After exporting the model from pytorch through onnx, I tried to load it using onnx-tensorflow. However, this error occurred:

ValueError: Dimensions must be equal, but are 8192 and 2048 for 'MatMul_1' (op: 'MatMul') with input shapes: [1,8192], [2048,2].

This doesn't happen in onnx-caffe2.
Below is how I loaded it:

import onnx
import onnx_tf.backend as backend
model = onnx.load("model.proto")
rep = backend.prepare(model)

The same code works fine in onnx-caffe2

import onnx
import onnx_caffe2.backend as backend
model = onnx.load("model.proto")
rep = backend.prepare(model)

Am I missing something? Or are some of the operators not supported yet?
You can see the resnet code from here https://github.com/pytorch/vision/blob/v0.1.9/torchvision/models/resnet.py

onnx version 1.0.0
onnx-tf version 1.0.0

UPDATE
tried it with onnx==0.2 as well, still no luck

We don't have torch installed anywhere we can easily access. Any chance you can post that model.pb somewhere? If not, try sending it to me through the email address in my profile.

On a separate note, resnet50 has been tested to work. The model is here: https://github.com/onnx/models.

I think this is an issue with subtle differences in how the sizes of the output operations are handled in TF. In your case, the result just before the linear layer is a 2x2 feature map instead of a 1x1 feature map.
I didn't look closely, but @akurniawan can you check the input size just before the last average_pool layer? It should be 7x7, and if it is indeed 7x7, then the problem might be in the implementation that onnx-tensorflow has for average_pooling. If it's not 7x7, then the problem might be in the max_pooling layer I believe.

@tjingrant yes, I can give you the dummy proto one. Please find the file at this link.

@fmassa after layer4 in resnet, I found the input size as torch.Size([1, 2048, 7, 7])

@akurniawan thanks! So it looks like the problem is indeed with the average pooling layer. @tjingrant I believe the ONNX serialized model of resnet contains a global average pooling layer, which would explain why it passed while the one exported by pytorch (which is an AveragePool) doesn't.

@fmassa, @tjingrant I can confirm that by doing some hacking (setting the global_pool value to True at this line) I can now create my model in tf.

After doing some debugging, I found that the global_pool value is set to False even when spatial_dim and kernel_shape are equal, in this case 7. Sorry if this is a newbie question, but shouldn't global pooling be able to take input whose dimensions are less than or equal to the kernel size?

@akurniawan the exported model is most likely wrong.

To answer your question, when spatial_dim == kernel_shape and pads are all zero, it doesn't matter if you use global avg pool or just simple avg pool, the result should be the same.
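This equivalence is easy to check numerically. Here is a minimal numpy sketch (not the onnx-tf implementation; `average_pool_2d` is a hypothetical helper written just for illustration) showing that when kernel_shape equals the spatial dims, pads are zero, and stride is 1, average pooling collapses to the global mean:

```python
import numpy as np

def average_pool_2d(x, kernel, stride=1):
    """Naive 2-D average pooling over an (H, W) array, no padding."""
    kh, kw = kernel
    h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride+kh, j*stride:j*stride+kw].mean()
    return out

# 7x7 feature map, like the one just before resnet's linear layer
feat = np.random.rand(7, 7)
pooled = average_pool_2d(feat, (7, 7))     # kernel_shape == spatial dims
assert pooled.shape == (1, 1)              # a single output window
assert np.allclose(pooled[0, 0], feat.mean())  # same as global avg pool
```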

Now look at the model you have, specifically the pooling node:

input: "1600"
output: "1601"
op_type: "AveragePool"
attribute {
  name: "kernel_shape"
  ints: 7
  ints: 7
  type: INTS
}
attribute {
  name: "pads"
  ints: 0
  ints: 0
  type: INTS
}
attribute {
  name: "strides"
  ints: 1
  ints: 1
  type: INTS
}

You have 2 spatial dims yet only 2 pad values. What you need is [0, 0, 0, 0] as the pads attribute instead of [0, 0]; because of this, our padding algorithm does not recognize it as "VALID" padding in TF. This node does not conform to the spec here.
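The spec's convention can be captured in a tiny check (plain Python; `pads_conform` is a hypothetical helper, not part of onnx-tf): a pooling op with N spatial axes needs 2*N pad values, laid out as [x1_begin, x2_begin, ..., x1_end, x2_end].

```python
# Hypothetical helper illustrating the ONNX spec convention: a pooling op
# with N spatial axes needs 2*N pad values, so a 2-D AveragePool needs
# 4 entries in "pads", not 2.
def pads_conform(pads, num_spatial_dims):
    return len(pads) == 2 * num_spatial_dims

assert not pads_conform([0, 0], 2)    # the node above: does not conform
assert pads_conform([0, 0, 0, 0], 2)  # the form the spec requires
```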

If you get this model from torch exporter, it's probably a good idea to talk to them about it.

Ah, good, this has been fixed recently in pytorch/pytorch#4004
So that should be fixed if you reinstall pytorch from source.

ah thanks for the explanation @tjingrant !
btw @fmassa, do I need to retrain my net for the fix? I ask since I also see that changes have been made in the _thnn backend

You can manually modify the padding of the average pooling layer so that it is a 4-element list. If you have the model saved in pytorch format, you can just reload it and re-export to ONNX.
Let me know if you need help with the ONNX graph modification
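As a sketch of that manual modification (using plain Python dicts to stand in for the ONNX protobuf nodes; a real edit would go through onnx's protobuf API, and `fix_avgpool_pads` is a hypothetical name):

```python
# Sketch of the manual fix: expand 2-element pads to the 4-element form the
# ONNX spec requires. The dicts mimic the node dump shown earlier; they are
# a stand-in for the real protobuf objects.
def fix_avgpool_pads(nodes):
    for node in nodes:
        if node["op_type"] != "AveragePool":
            continue
        pads = node["attributes"].get("pads", [])
        if len(pads) == 2:
            # duplicate the begin pads as end pads (assumes symmetric padding)
            node["attributes"]["pads"] = pads * 2  # [0, 0] -> [0, 0, 0, 0]
    return nodes

graph = [{"op_type": "AveragePool",
          "attributes": {"kernel_shape": [7, 7],
                         "pads": [0, 0],
                         "strides": [1, 1]}}]
fixed = fix_avgpool_pads(graph)
assert fixed[0]["attributes"]["pads"] == [0, 0, 0, 0]
```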

Sorry for not being clear. What I meant is that I have the model saved in pytorch format, and if I can just re-export it then that's fantastic. My question, however, is more about the performance of the model: since the change was also made to the pytorch backend, my model may have been trained with the wrong size (because I trained it with pytorch v0.3, which doesn't include the fix).

The pytorch saved model is fine; the bug was in the ONNX conversion, so you only need to export it again and it should work.

Ah nice! I will try to export the model today and will let you guys know. Thanks!

Just want to confirm that I can now export the model to tf! Thanks for the help, guys! Now we can close this issue.