bug in bc.py
zhangweifeng1218 opened this issue · 5 comments
line 39 in bc.py:
self.h_net = weight_norm(nn.Linear(h_dim, h_out), dim=None)
Shouldn't this be
self.h_net = weight_norm(nn.Linear(h_dim*self.k, h_out), dim=None)
instead?
Yes, you're right. Can you send me a pull request for it?
Note that it does not make a difference if the number of glimpses is fewer than 32, though.
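For anyone hitting this, a standalone sketch of the corrected layer (the sizes below are placeholders, not the repo's actual hyperparameters):

```python
import torch
from torch import nn
from torch.nn.utils import weight_norm

h_dim, k, h_out = 1024, 3, 8  # placeholder sizes, not the repo's defaults

# before: nn.Linear(h_dim, h_out) expects h_dim inputs
# after:  the input width accounts for the k factor, matching the
#         h_dim * k features that are fed into h_net
h_net = weight_norm(nn.Linear(h_dim * k, h_out), dim=None)

x = torch.randn(32, h_dim * k)
print(h_net(x).shape)  # torch.Size([32, 8])
```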
Thanks for your reply.
I downloaded your code and the required data, then ran 'python3 main.py --use_both True --use_vg True' on my machine, which has 4 Tesla V100 GPUs and PyTorch 0.4.0 installed.
But I got the following runtime error:
Traceback (most recent call last):
File "main.py", line 99, in
train(model, train_loader, eval_loader, args.epochs, args.output, optim, epoch)
File "/home1/yul/zwf/ban-vqa-master/train.py", line 72, in train
pred, att = model(v, b, q, a)
File "/home1/yul/.conda/envs/py3.5/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home1/yul/.conda/envs/py3.5/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 113, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/home1/yul/.conda/envs/py3.5/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 118, in replicate
return replicate(module, device_ids)
File "/home1/yul/.conda/envs/py3.5/lib/python3.5/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
param_copies = Broadcast.apply(devices, *params)
RuntimeError: slice() cannot be applied to a 0-dim tensor.
It seems that something goes wrong when torch copies the model onto the 4 GPUs. But there is no such error when I train other networks across multiple GPUs with nn.DataParallel. It is really confusing and I have not found the reason yet....
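Reducing it as far as I could, even a single weight-normed linear layer fails the same way under nn.DataParallel. A standalone repro sketch (my own, assuming PyTorch 0.4.0 and at least two visible GPUs):

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

# single weight-normed layer, dim=None just like bc.py
layer = weight_norm(nn.Linear(16, 4), dim=None).cuda()
model = nn.DataParallel(layer)  # replicates the module onto all visible GPUs

x = torch.randn(8, 16).cuda()
out = model(x)  # on 0.4.0 this raises: slice() cannot be applied to a 0-dim tensor
print(out.shape)
```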
@zhangweifeng1218 Unfortunately, our code is only tested on PyTorch 0.3.1, as the README describes. I recommend checking the migration procedure or related issues. Does the error persist when you run the code on 0.3.1? Also, I used 4 Titan Xps when I trained the model.
Thanks, I have found the reason. The implementation of weight_norm in PyTorch 0.4.0 is a little different. When dim is set to None, weight_norm in 0.4.0 produces a 0-dim weight_g, which cannot be broadcast to multiple GPUs. Your code works well in PyTorch 0.3.1, whose weight_norm produces a 1-dim weight_g when dim is None.
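A quick way to see the difference is to inspect the parameters that weight_norm registers (sketch):

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

layer = weight_norm(nn.Linear(16, 4), dim=None)

# weight_norm registers weight_g (magnitude) and weight_v (direction).
# With dim=None, 0.4.0 stores weight_g as a 0-dim scalar tensor, while
# 0.3.1 kept it 1-dim; Broadcast.apply in DataParallel's replicate()
# then fails on the 0-dim parameter with the slice() error above.
print(layer.weight_g.dim())   # 0 on PyTorch 0.4.0, 1 on 0.3.1
print(layer.weight_v.shape)
```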
@zhangweifeng1218 Good, thanks for the info.