YuxianMeng/Matrix-Capsules-pytorch

Overfitting with r>1

Opened this issue · 9 comments

Hello. I'm trying to overfit to a toy batch with r=2. With batch size > 1 I am unable to overfit when r > 1, although everything works with r=1. In particular, the network outputs the same result for every image in the batch.
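For concreteness, the overfit test I'm running looks roughly like this (only a minimal sketch; `model` and `capsule_loss` stand in for this repo's network and loss, and the toy batch is just placeholder data):

    import torch

    # placeholder toy batch: a few fixed "images" with distinct labels
    toy_images = torch.randn(8, 1, 28, 28)
    toy_labels = torch.arange(8) % 10

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for step in range(500):
        optimizer.zero_grad()
        out = model(toy_images)               # (8, 10) class activations
        loss = capsule_loss(out, toy_labels)
        loss.backward()
        optimizer.step()

    # with r=1 the rows of `out` differ per image and the loss drops towards 0;
    # with r=2 every row of `out` ends up identical
    print(out)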

Did you try r=2 on the MNIST data? I think this issue may be caused by your toy batch. For example, if all the inputs in the toy batch are identical while the targets are not, neither capsules nor a human could learn anything from it. I'd be glad to learn more details about this toy batch and the training process on MNIST, and to fix this bug (if it exists).

@menorashid @shzygmyx I also tried training on MNIST with r > 1 (e.g., r=2, 3), but every attempt failed; the loss did not converge. I also tried different learning rates, and none of them worked. The reason may be an error that only shows up when multiple EM iterations are run.

@JianboGuo
Hi, I've run my code again. It seems to work quite well.
python train.py -batch_size=8 -lr=2e-2 -num_epochs=5 -r=2 -print_freq=5
returns something like

batch:5, loss:0.3607, acc:0/8
batch:10, loss:0.3606, acc:4/8
batch:15, loss:0.3608, acc:1/8
batch:20, loss:0.3595, acc:1/8
batch:25, loss:0.3545, acc:1/8
batch:30, loss:0.3494, acc:1/8
batch:35, loss:0.2936, acc:2/8
batch:40, loss:0.2784, acc:4/8
batch:45, loss:0.1938, acc:6/8
batch:50, loss:0.2515, acc:3/8
batch:55, loss:0.1760, acc:6/8
batch:60, loss:0.1039, acc:7/8
batch:65, loss:0.1357, acc:7/8
batch:70, loss:0.1372, acc:6/8
batch:75, loss:0.1030, acc:6/8
batch:80, loss:0.0486, acc:7/8
batch:85, loss:0.0496, acc:7/8
batch:90, loss:0.0379, acc:8/8

So it definitely converges. Would you mind providing more details after running the command above?
Also, please make sure that you are using the latest version of my code.

@shzygmyx Thanks for your answer. What I actually get is:
[guojianbo@localhost Matrix-Capsules-pytorch]$ python train.py -batch_size=64 -lr=2e-2 -num_epochs=5 -r=1 -print_freq=5
activating cuda
Epoch 0
batch:5, loss:0.3525, acc:11/64
batch:10, loss:0.3546, acc:4/64
batch:15, loss:0.3207, acc:5/64
batch:20, loss:0.2899, acc:11/64
batch:25, loss:0.1925, acc:28/64
batch:30, loss:0.1050, acc:51/64
batch:35, loss:0.0987, acc:49/64
batch:40, loss:0.0776, acc:52/64
batch:45, loss:0.0526, acc:55/64
batch:50, loss:0.0267, acc:60/64
batch:55, loss:0.0242, acc:59/64
batch:60, loss:0.0218, acc:61/64
batch:65, loss:0.0341, acc:56/64
batch:70, loss:0.0279, acc:60/64
batch:75, loss:0.0452, acc:57/64
batch:80, loss:0.0307, acc:59/64
batch:85, loss:0.0162, acc:62/64
batch:90, loss:0.0466, acc:57/64
batch:95, loss:0.0135, acc:61/64
batch:100, loss:0.0130, acc:61/64

and

[guojianbo@localhost Matrix-Capsules-pytorch]$ python train.py -batch_size=64 -lr=2e-2 -num_epochs=5 -r=2 -print_freq=5
activating cuda
Epoch 0
batch:5, loss:0.3639, acc:5/64
batch:10, loss:0.3677, acc:9/64
batch:15, loss:0.3716, acc:6/64
batch:20, loss:0.3755, acc:2/64
batch:25, loss:0.3795, acc:5/64
batch:30, loss:0.3834, acc:2/64
batch:35, loss:0.3874, acc:7/64
batch:40, loss:0.3914, acc:6/64
batch:45, loss:0.3954, acc:4/64
batch:50, loss:0.3994, acc:12/64
batch:55, loss:0.4035, acc:6/64
batch:60, loss:0.4076, acc:2/64
batch:65, loss:0.4116, acc:8/64
batch:70, loss:0.4158, acc:8/64
batch:75, loss:0.4199, acc:5/64
batch:80, loss:0.4240, acc:9/64
batch:85, loss:0.4283, acc:6/64
batch:90, loss:0.4324, acc:9/64
batch:95, loss:0.4367, acc:5/64
batch:100, loss:0.4409, acc:3/64
batch:105, loss:0.4452, acc:2/64
batch:110, loss:0.4495, acc:5/64
batch:115, loss:0.4538, acc:5/64
batch:120, loss:0.4581, acc:11/64
batch:125, loss:0.4624, acc:9/64
batch:130, loss:0.4669, acc:4/64
batch:135, loss:0.4713, acc:3/64
batch:140, loss:0.4756, acc:6/64
batch:145, loss:0.4802, acc:5/64
batch:150, loss:0.4845, acc:7/64
batch:155, loss:0.4889, acc:9/64
batch:160, loss:0.4933, acc:9/64
batch:165, loss:0.4979, acc:6/64
batch:170, loss:0.5026, acc:7/64
batch:175, loss:0.5071, acc:6/64
batch:180, loss:0.5115, acc:7/64
batch:185, loss:0.5162, acc:9/64
batch:190, loss:0.5209, acc:5/64
batch:195, loss:0.5254, acc:8/64
batch:200, loss:0.5301, acc:8/64

I think the reason is the following:
since we run multiple dynamic-routing iterations, i.e., we keep updating R, a, and x, but only the final a and x are actually used, backpropagation should (I think) depend only on that final a and x. In your code, however, gradients are backpropagated through every EM iteration, which contaminates the gradient with respect to the nn.Parameter self.W.
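To make the hypothesis concrete, what I have in mind is something like the following (only a sketch; `m_step`, `e_step`, and the shape names are hypothetical placeholders, not this repository's actual functions): detach the routing coefficients R between iterations so that only the final a, x contribute to the gradient of self.W.

    # sketch: run r EM iterations, but cut the graph at R between iterations
    R = torch.full((batch_size, n_in, n_out), 1.0 / n_out)    # uniform init (placeholder shapes)
    for it in range(self.r):
        a_out, mu, sigma_sq = m_step(R, a_in, votes)          # M-step from R and the votes
        if it < self.r - 1:
            # E-step; .detach() blocks backprop through earlier iterations, so the
            # gradient w.r.t. self.W flows only through the final M-step
            R = e_step(a_out, mu, sigma_sq, votes).detach()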

@shzygmyx Actually, I run the latest version under Python 2.7, and I corrected some "division ops" to make sure the results match Python 3. My colleague also tried it with Python 3; it failed as well.
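For anyone else running Python 2.7, the usual way to get Python 3 division behaviour is the future import at the top of each module:

    from __future__ import division

    # under Python 2, 1/2 == 0 (integer division); with the future import,
    # 1/2 == 0.5 as in Python 3, and // stays available for floor division
    print(1 / 2)    # 0.5
    print(1 // 2)   # 0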

@JianboGuo
Thanks for your feedback. It turns out I had run the wrong code myself; I'm very sorry about that. I've reproduced the problem and am trying to fix it. If you have any suggestions, please let me know. By the way, a pull request is also welcome.

@shzygmyx Thanks for your reply. I think the problem lies in the procedure (the E-step) that updates the routing coefficients R. I am also trying to fix the bug.
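For reference, the E-step I'm referring to is the standard one from the EM-routing paper: each R_ij is the responsibility of output capsule j for input capsule i, i.e. a_j * N(vote_ij | mu_j, sigma_j^2) normalized over j. A sketch with assumed tensor shapes (not this repository's exact code):

    import math
    import torch

    def e_step(votes, a_out, mu, sigma_sq, eps=1e-8):
        # votes    : (B, I, J, H)  votes from I input capsules to J output capsules
        # a_out    : (B, J)        activations of the output capsules
        # mu       : (B, 1, J, H)  Gaussian means of the output capsules
        # sigma_sq : (B, 1, J, H)  Gaussian variances of the output capsules
        # returns R: (B, I, J)     routing coefficients, normalized over J
        log_p = -0.5 * (math.log(2 * math.pi)
                        + torch.log(sigma_sq + eps)
                        + (votes - mu) ** 2 / (sigma_sq + eps)).sum(dim=-1)
        # R_ij is proportional to a_j * p_j(i); softmax normalizes over output capsules
        return torch.softmax(torch.log(a_out + eps).unsqueeze(1) + log_p, dim=-1)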

@JianboGuo Hi, I find that this convergence problem may be caused by the schedule of lambda_ and m (mostly lambda_) in train.py. The previous schedule increases lambda_ and m by 2e-1 every epoch. Changing this to 2e-2 helps the capsule network converge in the r=2 and r=3 cases. Also, please decrease the maximum of lambda_ and m if the loss suddenly increases after several batches. Changing lines 84-87 to

                if lambda_ < 1.2e-3:
                    lambda_ += 2e-2/steps
                if m < 0.2:
                    m += 2e-2/steps

works for me, at least for the first few hundred batches. Also, note that this schedule is far from the best one; a rough sketch of where it sits in the training loop is below. Good luck, and I look forward to your findings on better schedules!
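Roughly, the schedule sits in train.py's loops like this (only an outline with placeholder names, not the exact code; I'm assuming lambda_ is the inverse temperature passed into EM routing, m is the spread-loss margin, and steps is the number of batches per epoch):

    # lambda_ and m are nudged up once per batch, so the total increase per epoch
    # is about 2e-2, and both are capped so the loss does not blow up
    lambda_, m = 1e-4, 0.1            # placeholder starting values
    for epoch in range(num_epochs):
        for batch, (data, target) in enumerate(train_loader, start=1):
            if lambda_ < 1.2e-3:      # cap on lambda_ (routing inverse temperature)
                lambda_ += 2e-2 / steps
            if m < 0.2:               # cap on the spread-loss margin m
                m += 2e-2 / steps
            optimizer.zero_grad()
            output = model(data, lambda_)            # lambda_ is fed into routing
            loss = spread_loss(output, target, m)    # margin m used by the loss
            loss.backward()
            optimizer.step()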

@shzygmyx, I tried

if lambda_ < 1.2e-3:
    lambda_ += 2e-2/steps
if m < 0.2:
    m += 2e-2/steps

but I get Epoch1 Test acc:0.1135, which is roughly chance level for 10-class MNIST, so it seems it still does not converge.